Cuz

The aim of this project is to develop a system whereby network measurements from a variety of sources can be used to detect and report on events occurring on the network in a timely and useful fashion. The project can be broken down into four major components:

Measurement: The development and evaluation of software to collect the network measurements. Some software used will be pre-existing, e.g. SmokePing, but most of the collection will use our own software, such as AMP, libprotoident and maji. This component is mostly complete.

Collection: The collection and storage of the network measurements, and their conversion to a standardised format. Measurements will come from multiple locations within or around the network, so we will need a system for receiving measurements from monitor hosts. Raw measurement values will need to be stored in a way that supports querying, particularly for later presentation. Finally, each measurement technology is likely to use a different output format, so measurements will need to be converted to a standard format that is suitable for the next component (a rough sketch of such a record follows the component list).

Eventing: Analysis of the measurements to determine whether network events have occurred. Because we are using multiple measurement sources, this component will need to aggregate events that are detected by multiple sources into a single event. This component also covers alerting, i.e. deciding how serious an event is and alerting network operators appropriately.

Presentation: Allowing network operators to inspect the measurements being reported for their network and see the context of the events that they are being alerted on. The general plan here is for web-based zoomable graphs with a flexible querying system.
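To make the standard-format idea from the Collection component concrete, here is a rough sketch of what a normalised measurement record might look like. The field names and the input line are purely illustrative; they are not the actual NNTSC schema or SmokePing output format.

```python
# Illustrative only: hypothetical input format and field names.
def normalise_smokeping(raw_line):
    parts = raw_line.split()
    return {
        "stream_id": int(parts[0]),   # which time series this belongs to
        "timestamp": int(parts[1]),   # seconds since the epoch
        "values": {                   # metric-specific measurements
            "median_rtt": float(parts[2]),
            "loss": int(parts[3]),
        },
    }

print(normalise_smokeping("17 1372982400 22.4 0"))
```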

05 Jul 2013

Added support for the Libprotoident byte counters that we have been collecting from the red cable network to netevmon, ampy and amp-web. Now we can visualise the different protocols being used on the network and receive event alerts whenever someone does something out of the ordinary.

Replaced the dropdown list code in amp-web with a much nicer object-oriented approach. This should make it a lot easier to add dropdown lists for future NNTSC collections.

Managed to get our Munin graphs showing data in Mbps. This was trickier than anticipated, as Munin sneakily divides the byte counts it gets from SNMP by its polling interval, but this isn't very prominently documented. It took a little while for Cathy, Brad and me to figure out why our numbers didn't match those being reported by the original Munin graphs.

Chased down and fixed a libtrace bug where converting a trace from any ERF format (including legacy) to PCAP would result in horrendously broken timestamps on Mac OS X. It turned out that the __BYTE_ORDER macro doesn't exist on BSD systems and so we were erroneously treating the timestamps as big endian regardless of what byte order the machine actually had.
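The fix itself lives in libtrace's C code, but the idea is easy to show in a few lines of Python: an ERF timestamp is a 64-bit little-endian fixed-point value, so it should be decoded with an explicit byte order rather than whatever the host happens to use.

```python
import struct

def parse_erf_timestamp(ts_bytes):
    # ERF timestamps: upper 32 bits are seconds, lower 32 bits are a
    # binary fraction of a second. The "<" forces little-endian decoding
    # no matter what byte order the host machine has.
    (ts,) = struct.unpack("<Q", ts_bytes)
    return (ts >> 32) + (ts & 0xFFFFFFFF) / 2.0 ** 32
```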

Migrated wdcap and the LPI collector to use the new libwandevent3.

Changed the NNTSC exporter to create a separate thread for each client rather than trying to deal with them all asynchronously. This alleviates the problem where a single client could request a large amount of history and prevent anyone else from connecting to the exporter until that request was served. Also made NNTSC and netevmon behave more robustly when a data source disappears -- rather than halting, they will now periodically try to reconnect so I don't have to restart everything from scratch when I want to apply changes to one component.
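The real exporter does considerably more, but the shape of both changes looks something like this sketch (fetch_history is an invented stand-in for the actual query code):

```python
import socket
import threading
import time

def serve_client(conn):
    # Each client gets its own thread, so one large history request
    # no longer blocks everyone else from connecting.
    try:
        while True:
            request = conn.recv(4096)
            if not request:
                break
            for chunk in fetch_history(request):  # hypothetical helper
                conn.sendall(chunk)
    finally:
        conn.close()

def accept_loop(listener):
    while True:
        conn, _ = listener.accept()
        threading.Thread(target=serve_client, args=(conn,)).start()

def connect_to_source(host, port, delay=30):
    # Rather than halting when a data source disappears, keep retrying
    # periodically until it comes back.
    while True:
        try:
            return socket.create_connection((host, port))
        except OSError:
            time.sleep(delay)
```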

Finally, my paper comparing the accuracy of various open-source traffic classifiers was accepted for WNM 2013. There are a few minor nits to tidy up, but it shouldn't require too much work to get it camera-ready.

24 Jun 2013

Added manpages to netevmon to get it ready for Debian packaging. During this process, fixed a few little oversights in the netevmon script and the existing documentation.

Re-wrote much of the NNTSC API in ampy. The main goal was to move code that was duplicated across the modules for individual NNTSC collections into a more general NNTSC API. In the process I also changed the API to use a single "NNTSC Connection" instance rather than creating and destroying one for every AJAX request. The main benefit is that we no longer have to ask the database about collections and streams every time we make a request -- instead we fetch them once and store that info for subsequent use. This will hopefully make the graph interface feel a bit more responsive.
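In outline, the caching behaves like the sketch below. The class and method names are illustrative rather than ampy's real API, and the database wrapper is hypothetical:

```python
class NNTSCConnection:
    def __init__(self, db):
        self.db = db              # hypothetical database wrapper
        self._collections = None
        self._streams = {}

    def collections(self):
        # Ask the database once, then serve every later request from
        # the cache instead of re-querying per AJAX request.
        if self._collections is None:
            self._collections = self.db.query("SELECT * FROM collections")
        return self._collections

    def streams(self, collection):
        if collection not in self._streams:
            self._streams[collection] = self.db.query(
                "SELECT * FROM streams WHERE collection = %s", (collection,))
        return self._streams[collection]
```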

Updated amp-web to use the new NNTSC API in ampy. I also spent a bit of time on Friday testing the web graphs in various browsers and fixing a few of the more obvious problems. Unsurprisingly, IE 10 was the biggest source of grief.

Added a new time series type to anomaly_ts -- JitterVariance. This time series tracks the standard deviation of the latencies reported by the individual smokeping pings. Using this, I've added a new event type designed to detect when the standard deviation has moved away from being near zero, i.e. the pings have started reporting variable latency. This helps us pick up on situations where the median stays roughly the same but the variance clearly indicates a problem. It also serves as a good early indicator of upcoming Plateau or Mode events on the median latency.
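The core of the idea fits in a few lines. This is a simplified sketch rather than the anomaly_ts implementation, and the threshold value is made up:

```python
import statistics

NEAR_ZERO_MS = 0.5  # illustrative threshold, not the tuned value

def jitter_variance_event(pings):
    # pings: latencies from the individual smokeping probes in one
    # measurement; None marks a lost ping.
    usable = [p for p in pings if p is not None]
    if len(usable) < 2:
        return False
    # The median can stay flat while the spread blows out, so look at
    # the standard deviation moving away from near zero.
    return statistics.stdev(usable) > NEAR_ZERO_MS
```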

17 Jun 2013

Finished preparing NNTSC for packaging. Wrote an init script for the NNTSC collector and ensured that all of the subprocesses are cleaned up when the main collector process is killed. Wrote some manpages, updated the other documentation and added some licensing to NNTSC before handing it off to Brendon for packaging.

Also moved towards packaging netevmon. Again, lots of messing around with daemonisation and ensuring that the monitor can be started and stopped nicely without anyone having to manually hunt down processes.

Spent the rest of my time working on the interaction between amp-web and History.js. Only one entry is placed in the history for each visited graph now and selecting a graph from the history will actually show you the right graph. Navigating to a graph via the history will also now update the dropdown lists to match the currently viewed graph. When using click and drag to explore a graph, clicking once on the graph will return to the previous zoom level (this was already present, but only worked for exploring the detailed graph, not the summary one).

10 Jun 2013

Spent most of my week working on making the various components of NNTSC and netevmon backgroundable so that they are a lot easier to run long-term. This was pretty straightforward for the C++ programs but the python scripts have been a bit trickier, especially in terms of getting the logging going to the right place.
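For reference, the classic double-fork pattern covers the backgrounding part, with logging pointed at a file instead of the terminal. This is a generic sketch, not the exact NNTSC or netevmon code:

```python
import logging
import logging.handlers
import os
import sys

def daemonise(logfile):
    if os.fork() > 0:
        sys.exit(0)       # first parent exits
    os.setsid()           # new session, detach from the controlling tty
    if os.fork() > 0:
        sys.exit(0)       # second fork: the daemon can never regain a tty
    sys.stdout.flush()
    sys.stderr.flush()
    # Send log output somewhere sensible now that stderr is useless.
    handler = logging.handlers.WatchedFileHandler(logfile)
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logging.getLogger().addHandler(handler)
```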

Also fixed a few of the outstanding issues with amp-web. In particular, I fixed the problems we were having with the X-axis of the summary graph being garbled and ensured that the summary graph will always show a sensible time period based on the region shown in the detailed view. These changes also meant I could remove the summary timestamps from the page URL, which cleans that up quite a bit.

04 Jun 2013

Finished fixing the URLs in amp-web so that they are ordered sensibly and can support NNTSC streams that are defined using more than just "source" and "target". I also changed the ordering of the timestamps in the URL so that we can specify a start and end time for the detailed graph only (sensible defaults for the summary graph are meant to be chosen in this case). This is really handy when creating URLs that link to graphs showing events.

Started looking into what needed to be done to prepare NNTSC and netevmon for packaging and a possible distribution for our friends at Lightwire. Spent a decent chunk of time writing a README that should describe exactly how to get a NNTSC instance up and running.

NNTSC and netevmon both have Trac instances now, and I've added a series of tickets to each with the aim of getting a release ready for Lightwire by the end of the month.

27 May 2013

Finished adding simple time series graphs for our switch interface byte count data. Got Brendon's event rendering working with these new graphs too, so we can now see and explore the events detected using the Plunge and ArimaShewhart detectors. They seem to be working reasonably well so far.

The next task I started on was fixing the URLs for the amp-web graphs -- the current setup is graph/<source>/<target>/<metric>, which is not sustainable going forward. Firstly, the metric needs to come first so that we can handle time series that are defined by more than just a source and target, e.g. a direction or an application protocol. Next, instead of explicitly listing the source, target or whatever else describes the time series data, we want to use the unique stream id from within NNTSC. This also avoids the problem of our URLs being really long or containing spaces. Unfortunately, much of the original code was written with only source and target in mind, so there's a lot to change to be able to support LPI data, for example.

Developed a new version of libwandevent. There are two main changes. Firstly, the allocation and management of event structures is now handled internally by libwandevent -- no more filling in event structures and passing them off to libwandevent. The main reason for this is to minimise the chance of bugs where the programmer inadvertently overwrites an existing event, much like the BSOD bug I had last week. However, it does break the existing API, so there may be a slightly messy transition period. Secondly, I've added support for epoll, which will now be used instead of select where available. Switched the BSOD server over to use the new libwandevent and it seems to work pretty well.
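libwandevent is C, but the epoll-with-select-fallback design choice is easy to illustrate in Python, which exposes both mechanisms through the select module. A minimal sketch:

```python
import select

class Poller:
    def __init__(self):
        # epoll scales better with many descriptors but only exists on
        # Linux, so fall back to select elsewhere.
        self.epoll = select.epoll() if hasattr(select, "epoll") else None
        self.fds = set()

    def add(self, fd):
        self.fds.add(fd)
        if self.epoll is not None:
            self.epoll.register(fd, select.EPOLLIN)

    def wait(self, timeout):
        if self.epoll is not None:
            return [fd for fd, _ in self.epoll.poll(timeout)]
        readable, _, _ = select.select(self.fds, [], [], timeout)
        return readable
```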

20 May 2013

Spent much of my week working on getting BSOD ready to be wheeled out at Open Day once again. During this process, I managed to find and fix a couple of bugs in the server that were now causing nasty crashes. I also tracked down a bug in the client where the UI elements aren't redrawn properly if the window is resized. Normally this hasn't been a big problem, but newer versions of Gnome like to try and silently resize full-screen apps and this meant that our UI was disappearing off the bottom of the screen. As an interim fix, I've disabled resizing in BSOD client but we really should be trying to handle resize events properly.

Received a bug report for libtrace about the compression detection occasionally giving a false positive for uncompressed ERF traces. This is because the ERF header has no identifying 'magic' at the start, so every now and again the first few bytes (where the timestamp is stored) end up matching the bytes we use to identify a gzip header. I've strengthened the gzip check to use an extra byte so the chance of this happening now is 1 in 16 million. I've also added a special URI format called rawerf: so users can force libtrace to treat traces as uncompressed ERF.
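The strengthened check amounts to matching one more byte of the gzip header. A minimal version of the test looks like this (the function name is mine, not libtrace's):

```python
def looks_gzipped(header):
    # gzip magic (0x1f 0x8b) plus the deflate method byte (0x08).
    # Matching three bytes instead of two means random leading bytes,
    # e.g. an ERF timestamp, collide with odds of about 1 in 2^24.
    return header[:3] == b"\x1f\x8b\x08"
```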

Started working on trying to get amp-web to plot graphs of interface byte counts. I've managed to draw a line on the graph, but much of the graph styling is still using the smokeping style. I'm now looking at rewriting the javascript for the graph styling to be a bit more generic and configurable, rather than having one (mostly copied) javascript file for each of our metrics.

Friday was mostly consumed with looking after our displays at Open Day. BSOD continued to impress quite a few people and we were reasonably busy most of the day, so it seemed a worthwhile exercise.

13 May 2013

Spent a little time reviewing my old YouTube paper in preparation for discussing it in 513.

Tracked down and fixed a few outstanding bugs in my new and improved anomaly_ts. The main problem was with my algorithm for keeping a running update of the median: a rather obscure bug when inserting a new value that fell between the two values I was averaging to calculate the median was causing all sorts of problems.
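A sorted-window approach sidesteps most of that index bookkeeping. This Python sketch shows the general technique; anomaly_ts itself is not Python and differs in the details:

```python
import bisect

class RunningMedian:
    def __init__(self):
        self.values = []   # kept sorted at all times

    def add(self, value):
        # bisect handles the awkward case directly: a new value landing
        # between the two middle elements of an even-sized window.
        bisect.insort(self.values, value)

    def median(self):
        n = len(self.values)
        if n == 0:
            return None
        if n % 2:
            return self.values[n // 2]
        return (self.values[n // 2 - 1] + self.values[n // 2]) / 2.0
```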

Added an API to ampy for querying the event database. This will hopefully allow us to add little event markers on our time series graphs. Also integrated my code for querying data for Munin time series into ampy.

Churned out a revised version of my L7 Filter paper for the IEEE Workshop on Network Measurements. I have repositioned the paper as an evaluation of open-source payload-based traffic classifiers rather than a critique of L7 Filter. I also spent a fair chunk of time replacing my nice pass-fail system for representing results with the exact accuracy numbers, because apparently reviewers found the former confusing.

Tried to continue my work in tidying up and releasing various trace sets, but ran into some problems with my rsyncs being flooded out over the faculty network. This was quite a nuisance so we need to be more careful in future about how we move traces around (despite it not really being our fault!).

06 May 2013

Managed to get a decent little algorithm going for quickly detecting a change between a noisy and constant time series. Seems to work fairly well with the examples I have so far.

Decided to completely re-factor the existing anomaly_ts code, as it was getting a little unkempt, which will be a problem if we hope to have students working on it. For instance, there were several implementations of a buffer containing the recent history for a time series spread across the various detector modules. Also, most of the detectors we had implemented were not being used and were creating a lot of confusion, and our main source file had a lot of branching based on the metric being used by a time series, e.g. latency, bytes or users.

It took the whole week, but I managed to produce a fresh implementation that was clean, tidy and did not have extraneous code. All of the old detectors were placed in an archive directory in case we need them later. Each time series metric is now implemented as a separate class, so there is a lot less branching in the main source. There is also now a single HistoryBuffer implementation that can be used by any detector, including future detectors.
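The shared buffer is essentially just a bounded window of recent values. A minimal sketch, with invented names:

```python
from collections import deque

class HistoryBuffer:
    # One buffer implementation shared by every detector, instead of
    # each detector module rolling its own copy.
    def __init__(self, size):
        self.values = deque(maxlen=size)

    def push(self, value):
        self.values.append(value)

    def recent(self):
        return list(self.values)
```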

Released the ISP DSL I traces on WITS -- we are now sharing (anonymised) residential DSL traces for the first time, which will no doubt prove to be very popular.

29 Apr 2013

Finished up the 513 marking (eventually!) and released the marks to the students.

Released a new version of libtrace -- 3.0.17.

Started working on releasing some new public trace sets. Waikato 8 is now available on WITS and the DSL traffic from our 2009 ISP traces will hopefully soon follow. In the process, I found a couple of little glitches in traceanon that I was able to fix before the libtrace release.

Decided that our anomaly detection code does not handle time series that switch from constant to noisy and back again particularly well. A classic example is latency to Google: during working hours it is noisy, but at other times it is constant. We detect the switch, but only after a long time. I would like to detect this change sooner and report it as an event (although not necessarily alert on it). I've started looking into an alternative method of detecting the change in time series style based on a pair of sliding windows: one covering the last hour, the other covering the 12 hours before that. It is working better, but is currently a bit too sensitive to the effect of an individual outlier.
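A stripped-down version of the two-window comparison, with a made-up threshold (and it inherits the same outlier sensitivity mentioned above):

```python
import statistics

def style_changed(last_hour, previous_12h, ratio=4.0):
    # Compare variability in the short window against the long one.
    # A large ratio in either direction suggests a constant<->noisy
    # switch in the time series.
    if len(last_hour) < 2 or len(previous_12h) < 2:
        return False
    recent = statistics.stdev(last_hour)
    baseline = statistics.stdev(previous_12h)
    if min(recent, baseline) == 0:
        return max(recent, baseline) > 0
    return recent / baseline > ratio or baseline / recent > ratio
```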

22 Apr 2013

Fixed the bugs in the anomaly_ts / eventing chain that I introduced last week. We're back reporting events again on the web dashboard.

Wrote ampy modules for retrieving smokeping and munin data from NNTSC so that Brendon could plot graphs of those time series. Doing this showed up some (more) problems in the graphing which Brendon eventually tracked down to being related to how aggregation was being performed within the NNTSC database.

Spent a large chunk of my week marking the 513 libtrace assignment. It is a much bigger class than previous years (over 30 students) so it was pretty time consuming to mark. In general, it was pleasing to see most students had gotten the basics of passive measurement worked out and hopefully they got some valuable experience from it. My biggest disappointment was how many students didn't read the instructions carefully -- especially those who missed the requirement to write original programs rather than blindly copying huge chunks of the example code.

15 Apr 2013

Another short week, due to being away on Tuesday and Wednesday.

Started writing up a decent description of the design and implementation of NNTSC, which would hopefully make for a decent blog post. It also means that the entire thing is stored somewhere other than in my head...

Revisited the eventing side of our anomaly detection process. Had a long but eventually productive discussion with Brendon about what information needs to be stored in the events database to be able to support the visualisation side. We decided that, given the NNTSC query mechanism, events should have information about the collection and stream that they belong to so that we can easily filter them based on those parameters. We used to use "source" and "destination" for this, but streams are defined using more than just a source and destination now.

Updated anomalyfeed, anomaly_ts and eventing to support the new info that needs to be exported all the way to the eventing program. In the process, I moved eventing into the anomaly_ts source tree (because they shared some common header files) and wrangled automake into building them properly as separate tools. Got to the stage where everything was building happily, but not running so good :(

08 Apr 2013

Very short week this week, but managed to get a few little things sorted.

Added a new dataparser to NNTSC for reading the RRDs used by Munin, a program that Brad is using to monitor the switches in charge of our red cables. The data in these RRDs is a lot noisier than smokeping data, so it will be interesting to see how our anomaly detection goes with it. Also finally got the AMP data actually being exported to our anomaly detector - the glue program that converts NNTSC data into something anomaly_ts can read wasn't parsing AMP records properly.
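Reading a Munin RRD boils down to an rrdtool fetch. Something along these lines, assuming the python rrdtool bindings; the actual NNTSC dataparser does more work around this:

```python
import rrdtool  # python-rrdtool bindings

def read_rrd(path, start, end):
    (first, last, step), names, rows = rrdtool.fetch(
        path, "AVERAGE", "--start", str(start), "--end", str(end))
    # Munin has already divided SNMP byte counters by its polling
    # interval, so these values are rates, not raw counters.
    ts = first
    for row in rows:
        ts += step
        yield ts, dict(zip(names, row))
```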

Spent a bit of time working on adding some new rules to libprotoident to identify previously unknown traffic in some traces sent to me by one of our users.

Spent Friday afternoon talking with Brian Trammell about some mutual interests, in particular passive measurement of TCP congestion window state and large-scale measurement data collection, storage and access. In terms of the latter, it looks like many of the design decisions we have reached with NNTSC are very similar to those he had reached with mPlane (albeit mPlane is a fair bit more ambitious than what we are doing), which I think was pretty reassuring for both sides. Hopefully we will be able to collaborate more in this space, e.g. developing translation code to make our data collection compatible with mPlane.

03 Apr 2013

Exporting from NNTSC is now back to a functional state and the whole event detection chain is back online. Added table and view descriptions for the more complicated AMP tests: traceroute, http2 and udpstream are now all present. Hopefully we can get the new AMP collecting and reporting data for these tests soon so we can test whether it actually works!

Had some user-sourced libtrace patches come in, so spent a bit of time integrating these into the source tree and testing the results. One simply cleans up the libpacketdump install directory so that it doesn't create as many useless or unused files (e.g. static libraries and versioned library symlinks). The other adds support for the OpenBSD loopback DLT, which is a real nuisance because OpenBSD isn't entirely consistent with other OSes as to the values of some DLTs.

Helped Nathan with some TCP issues that Lightwire were seeing on a link. Was nice to have an excuse to bust out tcptrace again...

Looks like my L7 Filter paper is going to be rejected. Started thinking about ways in which it can be reworked to be more palatable, maybe present it as a comparative evaluation of open-source traffic classifiers instead.

11 Mar 2013

Added a data parser module to NNTSC to process the tunnel user count data that we got from Lightwire. Managed to get the data going all the way through to the event detection program which spat out a ton of events. Spent a bit of time combing through them manually to see whether the reported events were actually worth reporting -- in a lot of cases they weren't, so I've refined the old Plateau and Mode algorithms a bit to hopefully resolve the issues. I also employed the Plunge detector on all time series types, rather than just libprotoident data, and this was useful in reporting the most interesting behaviours in the tunnel user data (i.e. all the users disappearing).

Added a couple of new features to the libtrace API. The first was the ability to ask libtrace to give you the source or destination IP address as a string. This is quite handy because normally processing IP addresses in libtrace involves messing around with sockaddrs which is not particularly n00b-friendly. The second API feature was the ability to ask libtrace to calculate the checksum at either layer 3 or 4 based on the current packet contents. This was already done (poorly) inside the tracereplay tool, but is now part of the libtrace API. This is quite useful for checksum validation or if you've modified the packet somehow (e.g. modified the IP addresses) and want to recalculate the checksum to match.
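The checksum feature is just the standard Internet checksum recomputed over the current packet contents. The arithmetic looks like this (Python for illustration; libtrace's API is C):

```python
import struct

def internet_checksum(data):
    # Sum 16-bit words with end-around carry, then take the one's
    # complement. This is the same arithmetic used for IP, TCP and
    # UDP checksums.
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF
```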

Also spent a decent bit of time reading over chapters from Meenakshee's report and offering plenty of constructive criticism.

18 Feb 2013

The development of NNTSC took another dramatic turn this week. After conferring with Brendon, we realised that the current design of the data storage tables was not going to support the level of querying and analysis that he wanted for AMP data. This spurred me to quickly write up a prototype for a new NNTSC from scratch that allowed each different data collection method to specify exactly how the data table should look. This means that instead of having one unified data table with the inflexible schema of (stream id, timestamp, data value), we now have an AMP ICMP test data table that is (stream id, timestamp, pkt size, rtt, loss, error code, error type) and a Smokeping data table that is (stream_id, timestamp, uptime, loss, median, ping1, ... ping20).
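Sketching the idea: each collection module declares its own columns, and the common (stream id, timestamp) prefix is added when the table is created. The column lists below are paraphrased from the schemas above, not copied from NNTSC:

```python
COMMON = ["stream_id integer", "timestamp integer"]

COLLECTIONS = {
    "amp_icmp": ["pkt_size integer", "rtt integer", "loss smallint",
                 "error_code smallint", "error_type smallint"],
    "smokeping": ["uptime double precision", "loss smallint",
                  "median double precision"]
                 + ["ping%d double precision" % i for i in range(1, 21)],
}

def create_table_sql(name):
    # Each data collection method specifies exactly how its table looks,
    # rather than squeezing everything into one unified data table.
    return "CREATE TABLE data_%s (%s)" % (
        name, ", ".join(COMMON + COLLECTIONS[name]))
```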

We've also done away with the central queue and simply given each data parser its own connection to our database. This fixes a problem I was having where trying to read data from a file too fast was causing the queue to fill up and run the machine out of RAM.

Smokeping data collection is now working with the new NNTSC, so I now need to write the data parsing modules for each of the other input sources we used to support as well as re-do all the nice installation script stuff I had done for the previous version of NNTSC.

11 Feb 2013

Made some significant modifications to the structure of NNTSC so that it can be packaged and installed nicely. It is no longer dependent on scripts or config files being in specific locations, and it handles configuration errors robustly rather than crashing into a python exception. Still a few bugs and tidy-ups to do, particularly relating to processes hanging around even after killing the main collector.

Managed to get some tunnel user counts from Scott at Lightwire to run through the event detection code. Added a new module to NNTSC for parsing the data, but have not quite got the data into the database for processing yet.

Spent a decent chunk of time helping Meenakshee write and practice her talk for Thursday. Once the talk was done, we got back into the swing of development by fixing some obvious problems with the current collector.

04 Feb 2013

Made a few modifications to Brendon's detectors which make them perform better across a variety of AMP time series. In particular, the Plateau detector no longer uses a fixed percentage of the trigger buffer mean as its event threshold; instead it uses several standard deviations from the history buffer. Also fixed some problems where, once in an event, we would treat all subsequent measurements that resembled those that triggered the event as anomalous. This is a problem when the "event" is actually the time series moving to a new normality: our algorithm just kept us in the event state the whole time!
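The new threshold logic reduces to a comparison like the following sketch, where k is whatever number of standard deviations gets chosen (the value here is made up):

```python
import statistics

def plateau_event(history, trigger, k=3.0):
    # Event threshold is several standard deviations of the history
    # buffer, not a fixed percentage of the trigger buffer mean.
    if len(history) < 2 or not trigger:
        return False
    limit = k * statistics.stdev(history)
    return abs(statistics.mean(trigger) - statistics.mean(history)) > limit
```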

Once I was happy with that, got the eventing code up and running against the events reported by the anomaly detection stage. Had to make a couple of modifications to the protocol used to communicate between the two to get it working properly (there were some hard-coded entries in Brendon's database that needed a more automated way of being inserted). Tried to get the graphing / visualisation stuff going after that, but there are quite a few issues there so that may have to wait a bit.

Started looking into packaging and documenting the usage of all the tools in the chain that we've now got working. First up was Nathan's code, which is proving a bit tricky so far because a) it's python so no autotools and b) his code is rather reliant on other scripts being in certain locations relative to the script being run.

Added another protocol to libprotoident: League of Legends.

29 Jan 2013

Spent a day messing around with the event detection software, mainly seeing how Brendon's detectors work with the existing AMP data. The new "is it constant" calculation seems to be working reasonably well, but there are still a lot of issues with some of the detectors. Need to spend a bit of uninterrupted time with it to really see how it all works.

Had a quick look at the latest ISP traces with libprotoident to see if there are any obvious missing protocols I can add to the library. Added one new protocol (Minecraft) and tweaked a few existing protocols.

Spent the rest of the week at NZNOG, catching up on the state of the Internets. Most of the talks were pretty interesting and it was good to meet up with a few familiar faces.

21 Jan 2013

Decided to replace the PACE comparison in my L7 Filter paper with Tstat, a somewhat well-known open-source program that does traffic classification (along with a whole lot of other statistic collection). Tstat's results were disappointing - I was hoping they would be a lot better so that the ineptitude of L7 Filter would be more obvious, but I guess this does make libprotoident look even better.

Fixed a major bug in the lpicollector that was causing us to insert duplicate entries in our IP and User maps. Memory usage is way down now and our active IP counts are much more in line with expectations. Also added a special PUSH message to the protocol so that any clients will know when the collector is done sending messages for the current reporting period.

Spent a fair chunk of time refining Nathan's code to a) just work as intended, b) be more efficient and c) be more user-friendly and deployable. I've got it reading data properly from LPI, RRDs and AMP, and exporting data in an appropriate format for our event detection code to be able to read.

Started toying with using the event detection code on our various inputs. Have run into some problems with the math used to determine whether a time series is relatively constant or not - this decides which of our detectors should be run against the data.

Got the bad news that the libprotoident paper was rejected by TMA over the weekend. A bit disappointed with the reviews - felt like they were too busy trying to find flaws with the 4-byte approach rather than recognising the results I presented that showed it to be more accurate, faster and less memory-intensive than existing OSS DPI classifiers. Regardless, it is back to the drawing board on this one - looks like it might be the libtrace paper all over again.