

The aim of this project is to develop a system whereby network measurements from a variety of sources can be used to detect and report on events occurring on the network in a timely and useful fashion. The project can be broken down into four major components:

Measurement: The development and evaluation of software to collect the network measurements. Some software used will be pre-existing, e.g. SmokePing, but most of the collection will use our own software, such as AMP, libprotoident and maji. This component is mostly complete.

Collection: The collection, storage and conversion to a standardised format of the network measurements. Measurements will come from multiple locations within or around the network, so we will need a system for receiving measurements from monitor hosts. Raw measurement values will need to be stored in a way that supports querying, particularly for later presentation. Finally, each measurement technology is likely to use a different output format, so measurements will need to be converted to a standard format that is suitable for the next component.

Eventing: Analysis of the measurements to determine whether network events have occurred. Because we are using multiple measurement sources, this component will need to aggregate events that are detected by multiple sources into a single event. This component also covers alerting, i.e. deciding how serious an event is and alerting network operators appropriately.

Presentation: Allowing network operators to inspect the measurements being reported for their network and see the context of the events that they are being alerted on. The general plan here is for web-based zoomable graphs with a flexible querying system.

11 May 2015

Continued keeping an eye on BTM until Brendon got back on Thursday. Briefed Brendon on all the problems we had noticed and what we thought was required to fix them.

Finished up my event detection webapp. Started experimenting with automating the running of event detectors with a range of different parameter options. The first detector I'm looking at (Plateau) has about 15,000 different parameter combinations that I would like to try, so I'm going to have to be pretty smart about recognising events as being the same across different runs.
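A rough sketch of the sweep-and-match idea: enumerate every parameter combination, then collapse detections from different runs into one event when they hit the same stream at roughly the same time. The parameter names, ranges and the 300-second tolerance are all illustrative assumptions, not the real Plateau settings.

```python
import itertools

# Hypothetical parameter grid for a Plateau-style detector; the real
# parameter names and value ranges are assumptions for illustration.
PARAM_GRID = {
    "trigger_count": [3, 5, 8, 10],
    "threshold": [0.25, 0.5, 1.0, 2.0],
    "history_size": [20, 50, 100],
}

def all_combinations(grid):
    """Yield every parameter combination as a dict."""
    keys = sorted(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

def same_event(a, b, tolerance=300):
    """Treat two detections as the same underlying event if they are on
    the same stream and start within `tolerance` seconds of each other."""
    return a["stream"] == b["stream"] and abs(a["start"] - b["start"]) <= tolerance

def merge_runs(detections, tolerance=300):
    """Collapse detections from many runs into unique events, counting
    how many runs found each one."""
    merged = []
    for event in detections:
        for known in merged:
            if same_event(event, known, tolerance):
                known["hits"] += 1
                break
        else:
            merged.append({"stream": event["stream"],
                           "start": event["start"], "hits": 1})
    return merged
```

The time-tolerance matching is the crude part: detectors with different parameters can report noticeably different start times for the same underlying event, so the tolerance has to be generous without merging genuinely distinct events.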

Started adding worker threads to anomaly_ts so that we can be more parallel. Each stream will be hashed to a consistent worker thread so that measurements will always be evaluated in order, but I still have to consider the impact of the resulting events not being in strict chronological order across all streams.
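The stream-to-worker hashing could be sketched like this (a simplified stand-in for anomaly_ts, which is not written in Python; the queue-per-worker layout and crc32 hash are assumptions for illustration):

```python
import queue
import threading
import zlib

NUM_WORKERS = 4

def worker_for_stream(stream_id, num_workers=NUM_WORKERS):
    # A stable hash keeps every measurement for a stream on the same
    # worker, so per-stream ordering is preserved even though events
    # emitted by different workers may interleave out of strict
    # chronological order globally.
    return zlib.crc32(str(stream_id).encode()) % num_workers

def run_workers(measurements, process):
    """Fan measurements out to per-worker queues and drain them."""
    queues = [queue.Queue() for _ in range(NUM_WORKERS)]

    def worker(q):
        while True:
            item = q.get()
            if item is None:     # sentinel: shut down
                break
            process(item)

    threads = [threading.Thread(target=worker, args=(q,)) for q in queues]
    for t in threads:
        t.start()
    for m in measurements:
        queues[worker_for_stream(m["stream"])].put(m)
    for q in queues:
        q.put(None)
    for t in threads:
        t.join()
```

The important property is that `worker_for_stream` is deterministic, so a stream can never have two of its measurements evaluated concurrently on different threads.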

04 May 2015

Continued keeping an eye on the BTM monitors. Changed several connections to use the ISP's DNS server rather than relying on the modem to provide DNS, which seems to have resolved many of our DNS issues.

Spent a bit of time digging into the problem of intermittent latency results for Akamai sites. It appears that our latency tests are interfering with one another, as moving one of the previously failing tests to a new offset away from the others fixed the problem for that test.

Continued working on my Event Detection webapp. Added two new modes: one where the user does the tutorial and then rates 20 pre-chosen events, and one where the user rates the same events without doing the tutorial. This will hopefully give us some feedback on how useful the tutorial is and whether the time required to complete the tutorial is worth it. Also added proper user tracking, with the generation of a unique code at the end of the 'survey' that the user can enter into Mechanical Turk to indicate they have completed the task.

28 Apr 2015

Spent much of my week keeping an eye on BTM and dealing with new connections as they came online. Had a couple of false starts with the Wellington machine, as the management interface was up but was not allowing any inbound connections. This was finally sorted on Thursday night (turning the modem off and on again did the trick), so much of Friday was spent figuring out which Wellington connections were working and which were not.

A few of the BTM connections have a lot of difficulty running AMP tests to a few of the scheduled targets: AMP fails to resolve DNS properly for these targets but using dig or ping gets the right results. Did some packet captures to see what was going on: it looks like the answer record appears in the wrong section of the response and I guess libunbound doesn't deal with that too well. The problem seems to affect only connections using a specific brand of modem, so I am imagining there is some bug in the DNS cache software on the modem.

Continued tracing my NNTSC live export problem. It soon became apparent that NNTSC itself was not the problem: instead, the client was not reading data from NNTSC, causing the receive window to fill up and preventing NNTSC from sending new data. A bit of profiling suggested that the HMM detector in netevmon was potentially the problem. After disabling that detector, I was able to keep things running over the long weekend without any problems.

Fixed a libwandio bug in the LZO writer. It turns out that the "compression" can sometimes result in a larger block than the original uncompressed data, especially when doing full payload capture. In that case, you are supposed to write out the original block instead, but we were mistakenly writing the compressed block.
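The correct fallback logic looks roughly like this. This is a sketch only: zlib stands in for LZO (which isn't in the Python standard library), and the length-plus-flag framing is invented for illustration, not libwandio's actual on-disk format.

```python
import struct
import zlib

def write_block(out, data):
    """Compress a block, but fall back to storing the raw bytes when
    "compression" actually expands the data (common with random or
    already-compressed payloads). The original bug was taking the
    compressed branch unconditionally."""
    compressed = zlib.compress(data)
    if len(compressed) >= len(data):
        # Store uncompressed: 4-byte length, then a "compressed?" flag.
        out.write(struct.pack("!I?", len(data), False))
        out.write(data)
    else:
        out.write(struct.pack("!I?", len(compressed), True))
        out.write(compressed)

def read_block(inp):
    """Inverse of write_block, honouring the compressed flag."""
    length, was_compressed = struct.unpack("!I?", inp.read(5))
    payload = inp.read(length)
    return zlib.decompress(payload) if was_compressed else payload
```

With the buggy version, a reader that trusts the flag would try to "decompress" an oversized compressed block it was never told about, or (depending on the framing) read the wrong number of bytes entirely.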

20 Apr 2015

Much of my week was taken up with matters relating to the Wynyard meeting on Wednesday. The meeting itself went reasonably well, and I definitely got the impression there was some interest in what we do and how we do it.

Continued marking the libtrace assignment for 513. Just a handful more to go.

Started getting familiar with the new AMP deployment, so I am better able to keep an eye on it while Brendon and Brad are away. Had a few connections come online on Friday which required a little attention, but overall I think it is still running smoothly.

17 Apr 2015

Short week due to the Easter break.

Prepared an extended version of my latency event detection talk to give to Wynyard Group next week. It'll be nice to not be under so much time pressure when giving the talk this time around :)

Started marking the 513 libtrace assignment.

The live exporting bug in NNTSC remains unsolved. I've narrowed it down to the internal client queue not being read from for a decent chunk of time, but am not yet sure what the client thread is doing instead of reading from the queue.

08 Apr 2015

Continued hunting for the bug in the NNTSC live exporter with mixed success. I've narrowed it down to definitely being the per-client queue that is the problem and it doesn't appear to be due to any obvious slowness inserting into the queue. Unfortunately, the problem seems to only occur once or twice a day so it takes a day before any changes or additional debugging take effect.

Went back to working on the Mechanical Turk app for event detection. Finally finished a tutorial that shows most of the basic event types and how to classify them properly. Got Brendon and Brad to run through the tutorial and tweaked it according to their feedback. The biggest problem is the length of the tutorial -- it takes a decent chunk of our survey time to just run through the tutorial, so I'm working on ways to speed it up a bit (as well as event classification in general). These include adding hot-keys for significance rating and using an imagemap to make the "start time" graph clickable.

Spent a decent chunk of my week trying to track down an obscure libtrace bug that affected a couple of 513 students, which would cause the threaded I/O to segfault whenever reading from the larger trace file. Replicating the bug proved quite difficult as I didn't have much info about the systems they were working with. After going through a few VMs, I eventually figured out that the bug was specific to 32-bit little-endian architectures: due to some lazy #includes, the size of an off_t was either 4 or 8 bytes in different parts of the libwandio source code, which resulted in some very badly sized reads. The bug was found and fixed a bit too late for those affected students, unfortunately.

30 Mar 2015

Continued developing code to group events by common AS path segments. Managed to add an "update tree" function to the suffix tree implementation I was using and then changed it to use ASNs rather than characters to reduce the number of comparisons required. Also developed code to query NNTSC for an AS path based on the source, destination and address family for a latency event, so all of the pieces are now in place.
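A suffix tree makes finding shared path segments efficient, but the grouping criterion itself can be illustrated with a brute-force version: treat each AS path as a sequence of ASNs and look for the longest contiguous run that appears in every path. This sketch is my own simplification, not the suffix tree implementation described above.

```python
def contiguous_segments(path, min_len=2):
    """All contiguous sub-sequences of an AS path of at least min_len ASNs."""
    return {tuple(path[i:i + n])
            for n in range(min_len, len(path) + 1)
            for i in range(len(path) - n + 1)}

def longest_common_segment(paths, min_len=2):
    """Longest contiguous ASN sequence shared by every path, or None.
    A suffix tree answers this far more efficiently; this brute-force
    version only illustrates the grouping criterion."""
    common = set.intersection(*(contiguous_segments(p, min_len)
                                for p in paths))
    return max(common, key=len) if common else None
```

Latency events whose paths all contain the same segment can then be grouped under that segment, on the theory that the shared stretch of the path is the likely location of the problem.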

In testing, I found a problem where live NNTSC exporting would occasionally fall several minutes behind the data that was being inserted into the database. Because this would only happen occasionally (and usually overnight), debugging this problem has taken a very long time. Found a potential cause in an unhandled EWOULDBLOCK on the client socket, so I've fixed that and am waiting to see if that has resolved the problem.
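The failure mode is the classic one for non-blocking sockets: when the kernel's send buffer fills, send() fails with EWOULDBLOCK/EAGAIN, and any code that treats that as "sent" silently drops or stalls data. A minimal sketch of the correct handling (not NNTSC's actual exporter code):

```python
def send_all(sock, data):
    """Send as much of `data` as the kernel will accept on a
    non-blocking socket, returning whatever could not be sent so the
    caller can retry once the peer drains its receive window.
    Silently swallowing EWOULDBLOCK here is how data goes missing."""
    buf = memoryview(data)
    while buf:
        try:
            sent = sock.send(buf)
            buf = buf[sent:]
        except BlockingIOError:
            # errno EAGAIN / EWOULDBLOCK: the receive window is full.
            break
    return bytes(buf)
```

The caller has to buffer the returned leftover and retry when the socket becomes writable again (e.g. via select/poll), which is exactly the situation a slow or stalled client creates.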

Did some basic testing of libtrace 4 for Richard, mainly trying to build it on the various OSes that we currently support. This has created a whole bunch of extra work for him due to the various ways in which pthreads are implemented on different systems. Wrote my first parallel libtrace program on Friday -- there was a bit of a learning curve but I got it working in the end.

23 Mar 2015

Back after a week on holiday. Spent a decent chunk of time catching up on emails, mostly from students having trouble with the 513 libtrace assignment.

Continued tweaking and testing the new eventing code. Discovered an issue where the "live" exporter was operating several hours behind the time data was arriving. Looks like there is a bottleneck with one of the internal queues when a client subscribes to a large number of streams, but still investigating this one.

prophet started to run out of disk space again, so I had to stop our test data collection, purge some old data and wait for the database to finish vacuuming to regain some disk space. Discovering that we had a couple of GBs of rabbit logs wasn't ideal either.

While fixing the prophet problem, did some reading and experimenting with suffix trees created from AS paths, with the aim of identifying common path segments that could be used to group latency events. There doesn't appear to be a Python suffix tree module that does exactly what I want, but I'm hoping I can tweak one of the existing ones. The main thing I'm missing is the ability to update an existing suffix tree after concatenating a new string, rather than having to create a whole new tree from scratch.

16 Mar 2015

Added graphs for the HTTP test to amp-web, which helped reveal a couple of problems with the HTTP test that Brendon duly fixed.

Updated amp-web to support the new eventing database schema. Managed to get eventing up and running successfully, with significant event groups appearing on the dashboard and events also marked on the graphs themselves.

Gave my annual libtrace lectures, which seemed to go fairly well. Already I've got students thinking about and working on the assignment so they are at least smart enough to start early :)

Continued to develop a tutorial for my event classification app. Found plenty of good examples of different types of events so now it is just a matter of writing all the explanatory text for each example.

02 Mar 2015

Continued working on my Django app for crowd-sourcing event classifications from the general public. The core app is functional but I found that it was very difficult to explain all the intricacies of my personal classification approach using textual instructions alone. As a result, I'm working on adding a tutorial component so that users can be trained using a series of practical exercises where they encounter most of the various types of events we typically see.

Updated eventing to be able to handle more than just latency events and hopefully group events across different collections, i.e. group traceroute path changes with changes in latency. Started testing eventing with live data, so hopefully our event dashboard shouldn't be too far away from being up and running again.

Ryan has asked me to help out with teaching libtrace in 513 again this semester. Spent a day or so working on a new assignment and updating my slides. Looking forward to seeing how the students go with this year's assignment, especially the MSS analysis task.

23 Feb 2015

Finished refactoring the new eventing code to be a bit more efficient. Seems to work reasonably well in testing, so will be looking to try and deploy netevmon on prophet again in the near future.

Started looking into the possibility of farming out event classification in our ground truth dataset to a service like the Mechanical Turk. My long term goal is to experiment with the various parameters that each event detector can take to ensure that we are using the best possible settings, but this is likely to end up generating a lot of additional events that will be tedious and time-consuming for me to evaluate on my own. Instead, I'm going to try and present random strangers with a time series graph highlighting an event and a set of classification guidelines and see if we can get useful results this way.

For a start, I'm going to do some experiments using events in my ground truth dataset that I've already classified. My initial aim is to see if, with suitable guidelines, people will be able to classify events in a manner reasonably consistent with myself.

So far I've written a script to generate the time series graphs for each event found for a stream and started working on a little Django app that will display the graphs, ask for a classification and store the result in an SQLite database.
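The storage side of that app is simple enough to sketch with plain sqlite3 (the real app uses Django's ORM; the table layout and column names here are my own invented stand-ins):

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS classifications (
    event_id     INTEGER NOT NULL,
    user_code    TEXT NOT NULL,
    significance INTEGER NOT NULL,   -- e.g. 0 (noise) .. 5 (major)
    UNIQUE (event_id, user_code)     -- one rating per user per event
);
"""

def store_classification(conn, event_id, user_code, significance):
    """Record a user's rating for an event, replacing any earlier
    rating by the same user so re-submissions don't duplicate rows."""
    conn.execute(
        "INSERT OR REPLACE INTO classifications VALUES (?, ?, ?)",
        (event_id, user_code, significance))
    conn.commit()
```

The UNIQUE constraint plus INSERT OR REPLACE keeps the dataset at one rating per (event, user) pair, which makes later agreement analysis straightforward.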

16 Feb 2015

Tested and released a new libtrace version (3.0.22). The plan is that this will be the last release of libtrace3 (barring urgent bug fixes) and Richard and I can now focus on turning parallel libtrace into the first libtrace4 release.

Continued working on adding detector configuration to netevmon. Everything seems to be in place and working now, so will look to extend the configuration to cover other aspects of netevmon (e.g. eventing or anomalyfeed) in the near future.

Spent a bit of time on Friday refactoring some of Meena's eventing code to hopefully perform a bit better and avoid attempts to insert duplicate events into the database. The code currently catches the duplicate exception and continues on, but it would be much nicer if we weren't relying on the database to tell us that the event already exists.

09 Feb 2015

Generated some fresh DS probabilities based on the results of the new DistDiff detector. Turns out it isn't quite as good as I was originally hoping (credibility is around 56%) but we'll see how the additional detector pans out for us in the long run.

Started adding a proper configuration system to netevmon, so that we can easily enable and tweak the individual detectors via a config file (as opposed to the parameters all being hard-coded). I'm using libyaml to parse the config file itself, using Brendon's AMP schedule parsing code as an example.

Spent a day looking into DUCK support for libtrace, since the newest major DAG library release had changed both the duckinf structure and the name and value of the ioctl to read it. When trying to test my changes, I found that we had broken DUCK in both libtrace and wdcap so managed to get all that working sensibly again. Whether this was worthwhile is a bit debatable, seeing as we don't really capture DUCK information anymore and nobody else had noticed this stuff was broken for months :)

27 Jan 2015

Continued working on the new detector for netevmon. It is no longer a KS test, strictly speaking, but performs a similar function. Experimented with using Earth Mover's Distance as an alternative, but this tended to be badly affected by outliers in the distribution. Managed to come up with a couple of tweaks that improved the performance of the detector overall.

The first was to examine the distribution of the interquartile values only, i.e. discard the bottom 2 and top 2 values from the original distribution, to minimise the impact of outliers in general. Another change I made was to require the total sum of the values in each sample to differ by a non-trivial amount, which would prevent the detector from alerting when the distance between the two distributions is very small.
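The two tweaks could be sketched roughly as follows. The trim count of 2 matches the description above; the sum-ratio threshold is an illustrative placeholder, not the detector's real value.

```python
def trimmed(sample, k=2):
    """Drop the k smallest and k largest values to blunt outliers --
    for an 8-value sample this leaves the interquartile values."""
    s = sorted(sample)
    return s[k:len(s) - k]

def differs_significantly(old, new, min_sum_ratio=0.1):
    """Require the trimmed samples' totals to differ by a non-trivial
    fraction before an event can fire, so a tiny distance between two
    near-identical distributions never alerts. The 0.1 threshold is
    an assumption for illustration."""
    a, b = trimmed(old), trimmed(new)
    sum_a, sum_b = sum(a), sum(b)
    if sum_a == 0:
        return sum_b > 0
    return abs(sum_a - sum_b) / sum_a >= min_sum_ratio
```

The trim deals with one-off outliers (a single 400ms spike in an otherwise flat 20ms series), while the sum guard suppresses alerts when the two distributions are different in shape but essentially identical in magnitude.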

Ran the new detector against the ground truth dataset to determine how well it performs. Results are not too bad so far -- looks like it will reach similar levels of reliability to the BinSeg detector which is one of the better detectors we have.

19 Jan 2015

Wrote some slides on our latency event detection work for presentation at NZNOG. Had to shrink my original presentation a bit after realising I was sharing our timeslot with 2 other talks, so hopefully we'll all fit.

Experimented with using the Kolmogorov-Smirnov test as a detector for netevmon. I'm currently comparing the distributions of the latencies observed in the last 30 minutes with those observed 30 minutes prior to that. Initial results are somewhat promising, although my current method for evaluating distance between two distributions does not account for the difference between the values -- it just adds or subtracts a fixed amount to the distance depending on which value is larger. This means that a change from 40 to 42ms is just as likely to trigger an event as a change from 40 to 340ms.
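The fixed-amount distance described above can be sketched like this (my own minimal reconstruction of the idea, not the netevmon code), which also makes the stated weakness easy to demonstrate:

```python
def step_distance(old, new):
    """Compare two latency samples position-by-position after sorting,
    adding or subtracting a fixed step depending on which side is
    larger. Because the step ignores how different the values are,
    40ms -> 42ms scores exactly the same as 40ms -> 340ms."""
    distance = 0
    for a, b in zip(sorted(old), sorted(new)):
        if b > a:
            distance += 1
        elif b < a:
            distance -= 1
    return distance / max(len(old), 1)
```

A magnitude-aware distance would weight each comparison by something like the relative difference between the values, so that large shifts dominate small jitter.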

19 Dec 2014

Short final week for the year, as I had to take a couple of days of leave.

Finished fixing the highlighting of segments on the AS traceroute graph. I ended up going with a borderless approach, as it was very difficult to get the border drawing right in a number of cases. Instead, the highlighted segment becomes slightly brighter, which has much the same effect.

Added AS names to both the AS traceroute and monitor map graphs. These come from querying the Team Cymru whois server via its netcat interface and are heavily cached, so we shouldn't have to make too many queries. The monitor map has also been updated to use the same colour to draw nodes that belong to the same AS.
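A sketch of the cached lookup. The bulk whois interface at whois.cymru.com port 43 is real, but the exact response parsing here is an assumption from memory; the cache wrapper takes an injectable fetch function so it can be exercised without touching the network.

```python
import socket

_as_name_cache = {}

def _cymru_fetch(asn):
    """One-off lookup against the Team Cymru whois server. The
    response layout assumed here (last pipe-separated field of the
    final line holds the AS name) should be verified against the
    service's documentation."""
    with socket.create_connection(("whois.cymru.com", 43), timeout=10) as s:
        s.sendall(("AS%d\r\n" % asn).encode())
        response = b""
        while True:
            chunk = s.recv(4096)
            if not chunk:
                break
            response += chunk
    line = response.decode(errors="replace").splitlines()[-1]
    return line.split("|")[-1].strip()

def as_name(asn, fetch=_cymru_fetch):
    """Cached lookup so repeated graph draws don't hammer the server."""
    if asn not in _as_name_cache:
        _as_name_cache[asn] = fetch(asn)
    return _as_name_cache[asn]
```

Since the graphs redraw the same handful of ASes constantly, the cache means each ASN costs at most one query per process lifetime.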

Migrated the last of the old libtrace trac wiki pages over to our GitHub wiki.

18 Dec 2014

Finished updating the AMP latency event ground truth to include our new detectors. Generated some fresh probabilities for use in Meena's DS code. Also generated some probabilities based on the magnitude of the change in latency for an event so that we are more likely to recognise a large change as significant even if only one or two detectors fire for the event.

Updated the tooltips on the amp-web graphs to show the timestamp and value for the portion of the graph that the mouse is hovering over.

Started looking into fixing the bad border drawing on the AS path graphs, which would result in borders being drawn between segments that should be combined.

08 Dec 2014

Deployed the new and improved NNTSC on skeptic. Had a few little glitches, but overall went fairly smoothly. Most importantly, the new NNTSC can process result messages faster than they are coming in -- although it'll be interesting to see if this continues once we upgrade the amplets and push out bigger schedules to them.

Continued plugging away at updating the ground truth event dataset. Tweaked the SeriesMode detector to be able to trigger faster, although faster is still pretty slow (30-45 min detection delay). Also fixed a bug in the BinSeg detector that was causing it to incorrectly report the time when an event was detected.

Spent an afternoon going over my rejected PAM paper to see if we could fix it in time to submit to TMA. Unfortunately, we probably needed to do a lot of work to show the parameters we chose for the detectors were optimal so this will have to wait until next year.

01 Dec 2014

Added code to NNTSC to be able to receive and parse measurements via the collectd network protocol. This will allow us to start adding support for specific collectd metrics based on the requirements of our industry partners, particularly data that is collected using SNMP.
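As I understand it, the collectd network protocol is a sequence of parts, each with a big-endian uint16 type and uint16 length (the length covering the 4-byte header), with string payloads NUL-terminated. A minimal walker for that framing might look like this; the part type codes are from memory and should be checked against the collectd source.

```python
import struct

# Part type codes as I understand the collectd binary protocol --
# verify against the collectd source before relying on these.
PART_HOST = 0x0000
PART_PLUGIN = 0x0002

def parse_parts(packet):
    """Split a collectd network packet into (type, payload) tuples."""
    offset = 0
    parts = []
    while offset + 4 <= len(packet):
        ptype, plen = struct.unpack_from("!HH", packet, offset)
        if plen < 4 or offset + plen > len(packet):
            raise ValueError("malformed part at offset %d" % offset)
        parts.append((ptype, packet[offset + 4:offset + plen]))
        offset += plen
    return parts

def decode_string(payload):
    """String parts are NUL-terminated."""
    return payload.rstrip(b"\x00").decode()
```

From there, mapping the (host, plugin, type) tuples onto NNTSC streams is bookkeeping: each distinct combination becomes a stream, and value parts become measurements on it.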

Spent a couple of days updating the latency event ground truth to include the two new detectors. Managed to get about half-way through the streams in the data set in that time, as I had made some minor modifications to other detectors that meant their detection results had also changed.

Much of Friday was spent investigating the Changepoint detector in more detail, as it had started giving a few new false positives. Still not sure whether this is a problem with our implementation or the underlying algorithm itself so this is going to require a bit more investigation, unfortunately.

24 Nov 2014

Developed a new method for calculating the magnitude of a latency event in netevmon, as the existing methods were naive at best and did not properly account for the fact that absolute change is important when the latency is very low but relative change is more important otherwise. For example, going from 1ms to 2ms is much less significant than going from 100ms to 200ms. Similarly, going from 1ms to 21ms is much more significant than going from 40ms to 60ms.

The new method was derived by choosing a number of latency values and subjectively deciding the point at which an increase in latency should be treated as significant. Plotting a graph of these points gave me a function that I could use to determine a 'significance' base line. When new events are detected, I can use the distance from the base line as an input into my magnitude calculation -- being above the line increases the magnitude, being below the line decreases it.
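The base-line idea can be sketched as interpolation between hand-chosen anchor points. The anchor values and the scale factor below are invented placeholders standing in for the subjectively chosen points described above, not the real curve.

```python
import bisect

# Hypothetical (old_latency_ms, smallest_significant_new_latency_ms)
# anchor points -- illustrative stand-ins for the subjectively
# chosen values, not the actual netevmon curve.
BASELINE = [(1, 20), (10, 35), (50, 90), (100, 180), (300, 450)]

def baseline_at(old_ms):
    """Linearly interpolate the significance base line between anchors,
    clamping outside the anchored range."""
    xs = [x for x, _ in BASELINE]
    i = bisect.bisect_right(xs, old_ms)
    if i == 0:
        return BASELINE[0][1]
    if i == len(BASELINE):
        return BASELINE[-1][1]
    (x0, y0), (x1, y1) = BASELINE[i - 1], BASELINE[i]
    return y0 + (y1 - y0) * (old_ms - x0) / (x1 - x0)

def magnitude(old_ms, new_ms, scale=30.0):
    """Distance above the base line raises the magnitude, distance
    below lowers it; `scale` is an arbitrary normalisation."""
    return (new_ms - baseline_at(old_ms)) / scale
```

Because the anchors rise slowly at low latencies and steeply at high ones, the same curve naturally captures both behaviours: small absolute jumps matter when the baseline latency is tiny, while only large relative jumps matter once it is big.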

Also developed a method for finding the magnitude of a change in T-Entropy. This is less reliable than the method for latency change, but will provide us with a value we can use for events that are only detected by the T-Entropy detectors.