Cuz

The aim of this project is to develop a system whereby network measurements from a variety of sources can be used to detect and report on events occurring on the network in a timely and useful fashion. The project can be broken down into four major components:

Measurement: The development and evaluation of software to collect the network measurements. Some software used will be pre-existing, e.g. SmokePing, but most of the collection will use our own software, such as AMP, libprotoident and maji. This component is mostly complete.

Collection: The collection, storage and conversion to a standardised format of the network measurements. Measurements will come from multiple locations within or around the network, so we will need a system for receiving measurements from monitor hosts. Raw measurement values will need to be stored in a form that supports querying, particularly for later presentation. Finally, each measurement technology is likely to use a different output format, so measurements will need to be converted to a standard format that is suitable for the next component.
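
As an illustration, a standardised record might carry just enough metadata to identify the series plus the raw values; the field names below are purely illustrative, not the project's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Measurement:
    """Hypothetical standardised record shared by all collectors."""
    collection: str   # which measurement technology, e.g. "amp-icmp", "smokeping"
    source: str       # monitor host that made the measurement
    target: str       # what was measured
    timestamp: int    # unix seconds
    values: dict      # metric name -> raw value, e.g. {"rtt_ms": 12.4}
```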

Eventing: Analysis of the measurements to determine whether network events have occurred. Because we are using multiple measurement sources, this component will need to aggregate events that are detected by multiple sources into a single event. This component also covers alerting, i.e. deciding how serious an event is and alerting network operators appropriately.

Presentation: Allowing network operators to inspect the measurements being reported for their network and see the context of the events that they are being alerted on. The general plan here is for web-based zoomable graphs with a flexible querying system.

25 Jan 2016

Ran some full experiments with the stats and rtstats workloads (using ring: to capture) to make sure the numbers match up with what Richard was seeing in his earlier tests. So far, we're getting the expected behaviour: adding more threads makes stats perform worse (due to threading overhead outweighing any performance gain), but helps with rtstats. Wrote the section in the paper that describes our evaluation methodology, so now I just need to fill it in with some results!

Wrote my talk on NNTSC for AIMS. It's a 10-minute talk that is meant to provoke some discussion, so it is pretty light on implementation details. At least I hope people will come away from the talk knowing that there are some battles in this space that we've already fought so they won't repeat our mistakes.

Helped Andy get started with NNTSC so he can try implementing some InfluxDB support for storing data. The idea so far is to keep postgres around for doing the things it does well (streams, traceroute data) and use Influx for the rest.
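
A minimal sketch of how such a split might route writes, using the influxdb Python client; insert_into_postgres() and the stream attributes are hypothetical stand-ins, not NNTSC code.

```python
# client is assumed to be an influxdb.InfluxDBClient; postgres keeps the
# relational data (streams, traceroute paths), Influx takes the time series.
def store_measurement(stream, measurement, pg_conn, influx_client):
    if stream.collection == "amp-traceroute":
        # Path data is relational, so it stays in postgres.
        insert_into_postgres(pg_conn, stream, measurement)
    else:
        influx_client.write_points([{
            "measurement": stream.collection,
            "tags": {"stream": str(stream.id)},
            "time": measurement["timestamp"],
            "fields": measurement["values"],
        }], time_precision="s")
```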

11 Jan 2016

Added a new graph type to the AMP website for showing loss as a percentage over time. This graph is now shown when clicking on a cell in the loss matrix and can also be accessed through the graph browser. Fixed a matrix complaint: clicking on a cell in an IPv4-only matrix would take you to a graph showing lines for both IPv4 and IPv6, so you would never get the smokeping-style colouring via the matrix.
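
The aggregation behind such a graph reduces, roughly, to converting per-bin loss counts into percentages; the tuple layout below is illustrative, not the AMP schema.

```python
def loss_percentage(bins):
    """Convert (timestamp, packets_sent, packets_lost) bins into a
    percentage-loss time series; empty bins yield None (a gap in the graph)."""
    series = []
    for ts, sent, lost in bins:
        pct = 100.0 * lost / sent if sent else None
        series.append((ts, pct))
    return series
```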

Started messing around with Ostinato scripting on the 10g dev boxes and using DPDK to generate packets at 10G rates. Had a few issues initially because I was using an old version of the DPDK-enabled Ostinato that Richard had lying around; updating to Dan's most recent version seemed to fix that.

Spent a bit of time looking at the data collected during the CSC and how it might be used as ground truth for developing some security event detection techniques.

07 Dec 2015

Started writing some content for the parallel libtrace paper. Managed to churn out an introduction, a background and a little bit of the implementation section.

Fixed a couple of bugs in netevmon prior to the deployment: crashing when trying to reconnect to a restarted NNTSC and some confusing event descriptions for changepoint events.

Finished setting up a mobile app test environment for JP. I've configured my old iPhone to act as an extra client for 2-way communication apps (messaging etc.). So far the environment has already been helpful, as we've managed to identify one of the major outstanding patterns as being used by the Taobao mobile shopping app.

30 Nov 2015

Finished up the demo for the STRATUS forum and helped Harris put together both a video and a live website.

Spent a bit of time trying to fix some unintuitive traceroute events that we were seeing on lamp. The problem arose when a normally unresponsive hop responded to traceroute, which inserted an extra AS transition into our "path".

Rebuilt DPDK and Ostinato on 10g-dev2 after Richard upgraded it to Jessie so that I can resume my parallel libtrace development and testing once he's done with his experiments.

Installed and tested a variety of Android emulators to try to set up an environment where JP and I can more easily capture mobile app traffic. Bluestacks on my iMac turned out to be the most useful, as the others I tried either lacked the Google Play Store (so finding and installing the "official" apps would be hard) or needed more computing power than I had available.

23 Nov 2015

Played around with getting netevmon to produce some useful events from the Ceilometer data and updated amp-web to be able to show those events on the dashboard. Some of our existing detection algorithms (Plateau, BinSegChangepoint, Changepoint) worked surprisingly well so we should have something useful to demo at the STRATUS forum on Friday.

Helped Brendon get netevmon up and running on lamp. There were a few issues unfortunately, mostly due to permission problems and R being terrible, but we managed to get things running eventually. Spent a bit of time fixing some redundant event groups that we observed in the lamp data, a side-effect of the fact that a group of traceroute events can be combined with both latency increase and decrease events. We also worked together to track down some bad IP traceroute paths that were being inserted into the database -- new amplets were not including a 'None' entry for non-responsive hops, which NNTSC was expecting, so an 11-hop path with 6 non-responsive hops was being recorded as a contiguous 5-hop path. Updated NNTSC to recognise a missing address field as a non-responsive hop.
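
A minimal sketch of the fix, assuming each reported hop is a dict that may simply omit its address (the field name is hypothetical):

```python
def normalise_path(reported_hops):
    """Map each reported hop to its address, treating a missing address
    field as a non-responsive hop (None) so hop positions are preserved."""
    return [hop.get("address") for hop in reported_hops]

# A path with non-responsive hops keeps its full length instead of
# collapsing into a shorter contiguous path:
path = normalise_path([{"address": "10.0.0.1"}, {}, {}, {"address": "10.0.0.4"}])
assert path == ["10.0.0.1", None, None, "10.0.0.4"]
```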

Gave JP a crash course in libprotoident development so he can get started on his summer project.

09 Nov 2015

Tested and fixed my vanilla PF_RING libtrace code. I've been able to get performance comparable to the pfcount tool included with the PF_RING release, so I'm fairly happy with that. Started working on adding support for the ZC version of the PF_RING driver, which uses an entirely different API.

Helped Harris get his head around how NNTSC works so that he could add support for the Ceilometer data. Set myself up with an OpenStack VM so that I can start working on the web graphs to display the data now that it is in an NNTSC database. Also spent a bit of time writing up an explanation of how netevmon works so that Harris can start looking into running our detectors against the Ceilometer data.

Worked with Brendon on Friday to get NNTSC and netevmon installed and running on the lamp machine.

24 Aug 2015

Continued working on wdcap4. The overall structure is in place and I'm now adding and testing features one at a time. So far, I've got snapping, direction tagging, VLAN stripping and BPF filtering all working. Checksum validation is working for IP and TCP; just need to test it for other protocols.
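
The IP/TCP validation boils down to the standard RFC 1071 ones-complement checksum; a minimal sketch of the technique (not the WDCap code itself):

```python
import struct

def internet_checksum(data: bytes) -> int:
    """RFC 1071 checksum: ones-complement sum of 16-bit big-endian words."""
    if len(data) % 2:
        data += b"\x00"                      # pad odd-length input
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:                       # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def ip_header_valid(header: bytes) -> bool:
    # Summing a received header with the checksum field still in place
    # yields 0 when the header is intact.
    return internet_checksum(header) == 0
```

TCP (and UDP) validation feeds a pseudo-header plus the payload into the same sum, which is why each protocol needs its own testing.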

Still adding and updating protocols in libprotoident. The biggest win this week was being able to identify Shuijing (Crystal): a protocol for operating a CDN using P2P.

Helped Brendon roll out the latest develop code for ampsave, NNTSC, ampy and amp-web to skeptic. This brings skeptic in line with what is running on prophet and will allow us to upgrade the public amplets without their measurement data being rejected.

17 Aug 2015

Noticed a bug in my Plateau parameter evaluation which meant that Time Series Variability changes were being included in the set of Plateau events. Removing those made my results a lot saner. The best set of parameters now gives an 83% precision rating and the average delay is now below 5 minutes. Started on a similar analysis for the next detector -- the Changepoint detector.

Continued updating libprotoident. I've managed to capture a few days of traffic from the University now, so that is introducing some new patterns that weren't present in my previous dataset. Added new rules for MongoDB, DOTA2, Line and BMDP.

Still having problems with long duration captures being interrupted, either by the DAG dropping packets or by the RT protocol FIFO filling up. This prompted me to start working on WDCap4: the parallel libtrace edition. It's a complete rewrite from scratch, so I am taking the time to carefully consider every feature that currently exists in WDCap and decide whether we actually need it or whether we can do it better.

10 Aug 2015

Made a video demonstrating BSOD with the current University capture point. The final cut can be seen at https://www.youtube.com/watch?v=kJlDY0XvbA4

Alistair King got in touch and requested that libwandio be separated from libtrace so that he can release projects that use libwandio without having libtrace as a dependency as well. With his help, this was pretty straightforward so now libwandio has a separate download page on the WAND website.

Continued my investigation into optimal Plateau detector parameters. Used my web-app to classify ~230 new events in a morning (fewer than 5 of which qualified as significant) and merged those results back into my original ground truth. Re-ran the analysis comparing the results for each parameter configuration against the updated ground truth. I've now got an "optimal" set of parameters, although even the optimal parameters only achieve 55% precision and 60% recall.

Poked around at some more unknown flows while waiting for the Plateau analysis to run. Managed to identify some new BitTorrent and eMule clients and also added two new protocols: BDMP and Trion games.

03 Aug 2015

Continued digging into the unknown traffic in the day-long Waikato trace I captured last week. Diminishing returns are starting to really kick in now, but I've still managed to add another 9 new protocols (including SPDY) and improved the rules for a further 8.

Worked on a series of scripts to process the results of running the Plateau detector under a variety of possible configurations (e.g. history and trigger buffer sizes, sensitivity thresholds, etc.). The aim is to find the optimal set of parameters based on the ground truth we already have. Of course, some parameter combinations will produce events that we have never seen before, so I've also had to write code to find those events and generate suitable graphs, letting me use my web-app to quickly classify them by hand.
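
In outline, the processing is a grid search over parameter combinations scored against the classified ground truth; the parameter names, the run_plateau callable and the "significant" label below are all hypothetical.

```python
from itertools import product

def sweep_plateau(run_plateau, ground_truth):
    """Score every parameter combination against the ground truth.
    Events not in the ground truth still need manual classification,
    so they are returned alongside the precision estimate."""
    results = {}
    for history, trigger, thresh in product([30, 60, 120],
                                            [5, 10, 20],
                                            [2.0, 3.0, 5.0]):
        events = run_plateau(history=history, trigger=trigger,
                             threshold=thresh)
        unseen = [e for e in events if e not in ground_truth]
        tp = sum(1 for e in events
                 if ground_truth.get(e) == "significant")
        precision = tp / len(events) if events else 0.0
        results[(history, trigger, thresh)] = (precision, unseen)
    return results
```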

Spent a fair bit of time helping Yindong with his experiments.

20 Jul 2015

More work on the dashboard this week:
* added the ability to remove "common" events from the recent event list and made the graphs collapsible.
* added a table that shows the most frequently occurring events in the past day, e.g. "increased latency from A to B (ipv4)" (see the sketch after this list).
* polished up some of the styling on the dashboard and moved the dashboard-specific CSS (of which there is now quite a lot) into its own separate file.
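
Counting the frequent events is essentially a group-and-count over a short history window; a minimal sketch, assuming each event carries a description string (the key name is hypothetical):

```python
from collections import Counter

def frequent_events(events, top=10):
    """Rank the event descriptions seen in the past day, e.g.
    "increased latency from A to B (ipv4)"."""
    counts = Counter(ev["description"] for ev in events)
    return counts.most_common(top)
```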

Started thinking about how to include loss-related events in the event groups, as these are ignored at the moment.

The new capture point came online on Wednesday, so the rest of my week was spent playing with the packet captures. This involved:
* learning to operate EndaceVision.
* installing wdcap on the vDAG VM.
* adding the ability to anonymise only the local network in wdcap.
* performing a short test capture.
* getting BSOD working again, which required the application of a little "in-flow" packet sampling to run smoothly.
* running libprotoident against the test capture to see what new rules I can add.

13 Jul 2015

Continued testing and tweaking the event grouping in netevmon. My main problem was the creation of seemingly duplicate groups in cases where further (usually out-of-order) detections were observed for events that were members of an already expired group. Eventually I tracked the problem down to the fact that the event was being deleted when the group expired, so a later detection re-created both the event and its parent group.

Started looking into methods for determining whether an event is "common" or "rare", so that we can allow users to filter events that occur regularly from the dashboard. This meant I had to change our method of populating the dashboard -- previously, we just grabbed the appropriate events in the pyramid view and passed them into the dashboard template, but now we need to be able to dynamically update the dashboard depending on whether the common events are being filtered or not.

Added some nice little icons to each event group to show what type of events are present within the group without having to click on the group. The current icons show latency increases, latency decreases and path changes.

06 Jul 2015

Another short week -- this time on leave Monday and Tuesday.

Started integrating traceroute events into the event grouping part of netevmon. Changed the focus of the path change detection away from which ASN appears at each hop; instead, we look for new transitions between ASNs, as this means we don't trigger events when it takes 4 hops to get through REANNZ instead of the usual 3.
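
A minimal sketch of the transition-based comparison (the helper names are mine, not netevmon's):

```python
def as_transitions(as_path):
    """Collapse consecutive repeats (and skip unresponsive hops), then
    return the set of adjacent-ASN transitions along the path."""
    collapsed = []
    for asn in as_path:
        if asn is None:
            continue                      # unresponsive hop: no ASN known
        if not collapsed or collapsed[-1] != asn:
            collapsed.append(asn)
    return set(zip(collapsed, collapsed[1:]))

def path_changed(old_path, new_path):
    # Four hops through REANNZ collapse to the same transition as three,
    # so only genuinely new ASN transitions trigger an event.
    return bool(as_transitions(new_path) - as_transitions(old_path))
```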

Developed a system for assigning traceroute events to ASN groups. PathChange events are matched with the ASNs for both the old and new path transition, e.g. a change from REANNZ->AARNET to REANNZ->Vocus will be assigned to the ["REANNZ", "AARNET", "Vocus"] groups. A TargetUnreachable event is matched with the ASNs that are now missing from the traceroute as well as the last observed ASN. A TargetReachable event uses the same "identify common path elements" approach that latency events use (for want of a better system right now).

Fixed some more event detection and grouping bugs as I've found them. One fix was to make sure we at least create a group for the test target if the AS path for the stream does not reach the destination.

Spent some time on Friday proof-reading the BTM report.

01 Jul 2015

Short week as I was on leave on Thursday and Friday.

Continued tweaking the event groups produced by netevmon. My main focus has been on ensuring that the start time for a group lines up with the start time of the earliest event in the group. When this doesn't happen, it suggests that there is an incongruity in the logic for updating events and groups based on a newly observed detection. The problem now happens only rarely -- good, in that I am making progress, but also bad, because it takes much longer for a bad group to occur, so testing and debugging is much slower.

Spent a bit of time rewriting Yindong's python trace analysis using C++ and libflowmanager. My program was able to run much faster and use a lot less memory, which should mean that wraith won't be hosed for months while Yindong waits for his analysis to run.

Added a new API function to libtrace to strip VLAN and MPLS headers from packets. This makes the packets easier to analyse with BPF filters as you don't need to construct complicated filters to deal with the possible presence of VLAN tags that you don't care about.
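
Conceptually, the stripping rewrites the frame so the inner headers sit where a simple filter expects them. Here is an illustrative Python version for 802.1Q/802.1ad tags only (the real libtrace function is C and also handles MPLS):

```python
import struct

VLAN_ETHERTYPES = (0x8100, 0x88A8)        # 802.1Q and 802.1ad TPIDs

def strip_vlan(frame: bytes) -> bytes:
    """Remove any VLAN tags that follow the Ethernet addresses, so a
    plain BPF filter like "tcp port 80" matches without vlan clauses."""
    ethertype, = struct.unpack("!H", frame[12:14])
    while ethertype in VLAN_ETHERTYPES:
        frame = frame[:12] + frame[16:]   # drop the 4-byte tag
        ethertype, = struct.unpack("!H", frame[12:14])
    return frame
```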

Installed libtrace on the Endace probe and managed to get it happily processing packets from a virtual DAG without too much difficulty.

22 Jun 2015

Continued fine-tuning the event groupings produced by netevmon. Major changes I made include:
* When displaying groups on the dashboard, merge any event groups that have exactly the same members (sketched after this list).
* Avoid including events in a group if they don't have a reasonable D-S score, even if there are other similar significant events happening at the same time. This gets rid of a lot of pointless (and probably unrelated) events from each group and also ensures groups expire promptly. This change has introduced a few flow-on effects: the insignificant events still need to be members of the group (in case they do eventually become significant) but shouldn't affect any of the group statistics -- particularly the group start time.
* Allow events that occur within one expiry period before the first event in a group to be included in that group -- threaded netevmon doesn't export events in a strict chronological order any more, so we need to be careful not to throw away out-of-order events.
* Have a fallback strategy if there is no AS path information available for a given source/destination pair (e.g. there is no traceroute test scheduled or the traceroute test has failed for some reason). Instead, we will create 2 groups: one for the source and one for the target.
* Polished up the styling of the dashboard and event group list and fixed a few UI issues that Brendon had suggested after looking at it on Friday.
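
A minimal sketch of the display-time merge, assuming each group is a dict with a member-ID set and a start time (the key names are hypothetical):

```python
def merge_identical_groups(groups):
    """Collapse groups whose member sets are exactly equal, keeping the
    earliest start time for the merged group."""
    merged = {}
    for group in groups:
        key = frozenset(group["members"])
        if key in merged:
            merged[key]["start"] = min(merged[key]["start"], group["start"])
        else:
            merged[key] = dict(group)
    return list(merged.values())
```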

15 Jun 2015

Brad managed to track down a newer video card for quarterpounder, so now BSOD is up and running again.

Added Meena's lpicollector to our github so now I can finally deprecate the lpi_live tool that comes with libprotoident. Spent a bit of time updating some documentation and reworking the example client scripts so that everything is a bit easier to use. Also fixed a couple of memory bugs that I may have introduced last time I worked on the collector.

Continued working with the new event groups. Found a problem where I was incorrectly preferring shorter AS path segments over longer ones when determining whether I could remove a group for being redundant. Having fixed that, many event groups now cover several ASNs so I've redesigned the event list on the dashboard to be better at displaying multiple AS names.

08 Jun 2015

My NNTSC live queue continued to keep up satisfactorily, so I've turned my attention back to testing AS-based event grouping in netevmon. Updated the dashboard to use AS names rather than numbers to describe event groups. Replaced the "top sources" and "top targets" graph with a "top networks" graph.

Spent Thursday hosting one of the candidates applying for a position with STRATUS.

Added BSOD to our github on Friday. Tried to get the client running on the big TV, but ran into some issues with our video card no longer being supported by fglrx. Attempting to get the client to build and run on the Mac was not much more successful, since Xcode seems to have lost track of some of our dynamic libraries.

02 Jun 2015

Fixed my remaining issues with threaded anomaly_ts. Had a few problems where a call to sscanf was interfering with some strtok_r calls I was making, but once I replaced the sscanf with some manual string parsing everything worked again.

Continued looking into my NNTSC live queue delays. Narrowed the problem down to there being a time delay between publishing a message to the live rabbit queue and the message actually appearing in the queue (thanks to the firehose feature in rabbitmq!). After doing a fair bit of reading and experimenting, I theorised that the cause was the live queue being 'durable'. Even though the published messages themselves are not marked as persistent, publishing to a durable queue seems to require touching disk which can be slow on a resource-constrained machine like prophet. Removed the durable flag from the live queue and managed to run successfully over the long weekend without ever falling behind.
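
The change amounts to dropping the durable flag when the queue is declared; a minimal sketch with the pika client (the queue name is hypothetical):

```python
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = conn.channel()

# durable=False keeps the live queue memory-only; declaring it durable
# meant each publish touched disk, which was slow enough on prophet to
# delay delivery even though the messages themselves were not persistent.
channel.queue_declare(queue="nntsc_live", durable=False)
channel.basic_publish(exchange="", routing_key="nntsc_live",
                      body=b"live measurement")
```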

Migrated all netevmon configuration to use a single YAML config file for all three components. Previously, each component supported a series of getopt command line arguments which was a bit unwieldy.
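
A minimal sketch of loading such a shared config; the file path, section and option names are hypothetical (anomaly_ts aside, which is named above):

```python
import yaml

# netevmon.yaml might look like:
#   anomaly_ts: {nntsc_host: prophet, threads: 4}
#   eventing:   {group_expiry: 1800}
#   alerting:   {smtp_server: localhost}
with open("/etc/netevmon/netevmon.yaml") as f:
    config = yaml.safe_load(f)

anomaly_opts = config.get("anomaly_ts", {})
```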

25 May 2015

Continued refactoring the matrix javascript code in amp-web to be less of an embarrassment. This took quite a bit longer than anticipated because a) javascript and b) I was trying to ensure that switching between different matrix types would result in sensible meshes, metrics and splits being chosen based on past user choices. Eventually got to the stage where I'm pretty happy with the new code, so now we just need to find a time to deploy some of the changes on BTM.

Started testing my new parallel anomaly_ts code. The main hiccup was that embedded R is not thread-safe, so I've had to wrap any calls out to R with a mutex. This creates a bit of a bottleneck in the parallel system so we may need to revisit writing our own implementation of the complex math that I've been fobbing off to R. After fixing that, latency time series seem to work fairly well in parallel but AS traceroute series definitely do not so I'll be looking into that some more next week.
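
The workaround looks something like the following, serialising every call into R behind one lock; run_r_detector is a hypothetical stand-in for the actual call into embedded R.

```python
import threading

r_lock = threading.Lock()

def run_r_detector(series):
    """Hypothetical stand-in for the call into (non-thread-safe) embedded R."""
    raise NotImplementedError

def detect(series):
    # Only one thread may be inside R at a time, so this lock is the
    # bottleneck mentioned above.
    with r_lock:
        return run_r_detector(series)
```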

18 May 2015

Spent a week working on the amp-web matrix. First task was to add HTTP and Throughput test matrices so we could make the BTM website available to the various participants. This was a bit trickier than anticipated, as a lot of the matrix code was written with just the ICMP test in mind, so there were many hard-coded references to IPv4/IPv6 splits that were not appropriate for either test.

Updated the AMP mesh database to list which tests were appropriate for each mesh. This enabled us to limit the mesh selection dropdowns to only contain meshes appropriate for the currently selected matrix, as there was little overlap between the targets for the latency, HTTP and throughput tests.

Once that was all done, I started going back over all of the matrix code to make it much more maintainable and extendable. Collection-specific code was moved into the existing collection modules that already handled other aspects of amp-web, rather than the previous approach of hideous if-else blocks all through the matrix and tooltip code. Finished fixing all the python code in amp-web and started on the javascript on Friday afternoon.