Shane Alcock's Blog




Spent most of my week working on making the various components of NNTSC and netevmon backgroundable so that they are a lot easier to run long-term. This was pretty straightforward for the C++ programs but the Python scripts have been a bit trickier, especially in terms of getting the logging to go to the right place.

Also fixed a few of the outstanding issues with amp-web. In particular, I fixed the problems we were having with the X-axis of the summary graph being garbled and ensured that the summary graph will always show a sensible time period based on the region shown in the detailed view. These changes also meant I could remove the summary timestamps from the page URL, which cleans that up quite a bit.




Finished fixing the URLs in amp-web so that they are ordered sensibly and can support NNTSC streams that are defined using more than just "source" and "target". I also changed the ordering of the timestamps in the URL so that we can specify a start and end time for the detailed graph only (sensible defaults for the summary graph are meant to be chosen in this case). This is really handy when creating URLs that link to graphs showing events.

Started looking into what needed to be done to prepare NNTSC and netevmon for packaging and a possible distribution for our friends at Lightwire. Spent a decent chunk of time writing a README that should describe exactly how to get a NNTSC instance up and running.

NNTSC and netevmon both have tracs now and I've added a series of tickets to each with the aim of getting a release ready for Lightwire by the end of the month.




Finished adding simple time series graphs for our switch interface byte count data. Got Brendon's event rendering working with these new graphs too, so we can now see and explore the events detected using the Plunge and ArimaShewhart detectors. They seem to be working reasonably well so far.

The next task I started on was fixing the URLs for the amp-web graphs -- the current setup is graph/// which is not sustainable going forward. Firstly, the metric needs to come first so that we can handle time series that are defined by more than just a source and target, e.g. a direction or an application protocol. Next, instead of explicitly listing the source, target or whatever else describes the time series data, we want to use the unique stream id from within NNTSC. This also avoids the problem of our URLs being really long or containing spaces. Unfortunately, much of the original code was written with only source and target in mind so there's a lot to change to be able to support LPI data, for example.

Developed a new version of libwandevent. There are two main changes in the new version. Firstly, the allocation and management of event structures is all handled internally by libwandevent -- no more filling in event structures and passing them off to libwandevent. The main reason for this is to try and minimise the chance of bugs where the programmer inadvertently overwrites an existing event, much like the BSOD bug I had last week. However, it does break the existing API so there may be a slightly messy transition period. Secondly, I've added support for epoll so that will now be used instead of select, if available. Switched BSOD server over to use the new libwandevent and it seems to work pretty well.
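To make the two changes a little more concrete, here is a minimal Python sketch of the same ideas. It is not libwandevent's API: the class and method names are hypothetical. The loop owns all event state and only hands back an opaque handle, and it uses epoll when the platform offers it, falling back to select otherwise.

```python
import select
import selectors
import socket


class EventLoop:
    """Hypothetical event loop that owns its event records internally."""

    def __init__(self):
        # Prefer epoll where the platform provides it, else fall back to select.
        if hasattr(select, "epoll"):
            self._selector = selectors.EpollSelector()
        else:
            self._selector = selectors.SelectSelector()

    @property
    def backend(self):
        """Name of the selector class actually in use."""
        return type(self._selector).__name__

    def add_read_event(self, sock, callback):
        # The caller never builds an event structure; it only receives a
        # handle (a SelectorKey) that it can later pass to remove_event().
        return self._selector.register(sock, selectors.EVENT_READ, callback)

    def remove_event(self, handle):
        self._selector.unregister(handle.fileobj)

    def run_once(self, timeout=1.0):
        """Dispatch callbacks for ready sockets; return how many fired."""
        fired = 0
        for key, _ in self._selector.select(timeout):
            key.data(key.fileobj)
            fired += 1
        return fired
```

Because the caller only ever holds a handle, there is no caller-owned structure to overwrite by accident, which is the class of bug the redesign is trying to rule out.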




Spent much of my week working on getting BSOD ready to be wheeled out at Open Day once again. During this process, I managed to find and fix a couple of bugs in the server that were now causing nasty crashes. I also tracked down a bug in the client where the UI elements aren't redrawn properly if the window is resized. Normally this hasn't been a big problem, but newer versions of GNOME like to try and silently resize full-screen apps and this meant that our UI was disappearing off the bottom of the screen. As an interim fix, I've disabled resizing in the BSOD client but we really should be trying to handle resize events properly.

Received a bug report for libtrace about the compression detection occasionally giving a false positive for uncompressed ERF traces. This is because the ERF header has no identifying 'magic' at the start, so every now and again the first few bytes (where the timestamp is stored) end up matching the bytes we use to identify a gzip header. I've strengthened the gzip check to use an extra byte so the chance of this happening now is 1 in 16 million. I've also added a special URI format called rawerf: so users can force libtrace to treat traces as uncompressed ERF.
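As an illustration of the stronger check, here is a minimal Python sketch (libtrace itself is written in C, and this is not its actual code). It assumes the extra byte is the gzip compression-method byte, which is 0x08 for deflate; the function names are hypothetical. Three fixed bytes give 256^3 = 16,777,216 combinations, which is where a figure of about 1 in 16 million would come from if the leading bytes were uniformly random.

```python
GZIP_MAGIC = b"\x1f\x8b"  # two-byte gzip identifier (RFC 1952)
GZIP_DEFLATE = 0x08       # compression method byte; deflate is the usual value


def looks_like_gzip(header: bytes) -> bool:
    """Return True if the first three bytes match a gzip header."""
    return (
        len(header) >= 3
        and header[:2] == GZIP_MAGIC
        and header[2] == GZIP_DEFLATE
    )


def false_positive_odds(num_bytes: int) -> int:
    """Odds (1 in N) that num_bytes uniformly random bytes match a fixed pattern."""
    return 256 ** num_bytes
```

With only the two magic bytes the same calculation gives 1 in 65,536, so the extra byte makes a false positive 256 times less likely under the uniform-bytes assumption. ERF timestamps are not truly random, so the real rate will differ somewhat.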

Started working on trying to get amp-web to plot graphs of interface byte counts. I've managed to draw a line on the graph, but much of the graph styling is still using the smokeping style. I'm now looking at rewriting the JavaScript for the graph styling to be a bit more generic and configurable, rather than having one (mostly copied) JavaScript file for each of our metrics.

Friday was mostly consumed with looking after our displays at Open Day. BSOD continued to impress quite a few people and we were reasonably busy most of the day, so it seemed a worthwhile exercise.




Spent a little time reviewing my old YouTube paper in preparation for discussing it in 513.

Tracked down and fixed a few outstanding bugs in my new and improved anomaly_ts. The main problem was with my algorithm for keeping a running update of the median -- I had a rather obscure bug when inserting a new value that was between the two values I was averaging to calculate the median that was causing all sorts of problems.
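The post does not say how anomaly_ts keeps its running median, so the following is only a sketch of one common technique (two heaps), written in Python with hypothetical names. It shows the awkward case mentioned above: with an even number of values the median is the average of two middle values, and a new value that falls between them has to become the new median.

```python
import heapq


class RunningMedian:
    """Running median over all values seen so far, using two heaps."""

    def __init__(self):
        self._low = []   # max-heap (values stored negated): the smaller half
        self._high = []  # min-heap: the larger half

    def add(self, value):
        # Place the value in the half it belongs to.
        if self._low and value > -self._low[0]:
            heapq.heappush(self._high, value)
        else:
            heapq.heappush(self._low, -value)
        # Rebalance so that low holds either the same number of values
        # as high, or exactly one more.
        if len(self._low) > len(self._high) + 1:
            heapq.heappush(self._high, -heapq.heappop(self._low))
        elif len(self._high) > len(self._low):
            heapq.heappush(self._low, -heapq.heappop(self._high))

    def median(self):
        if not self._low:
            raise ValueError("no values have been added")
        if len(self._low) > len(self._high):
            return -self._low[0]
        return (-self._low[0] + self._high[0]) / 2
```

For example, after adding 10 and 20 the median is 15.0; adding 12, which lies between the two averaged values, makes the median 12. This sketch never removes old values, so a detector working over a sliding window would need extra bookkeeping.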

Added an API to ampy for querying the event database. This will hopefully allow us to add little event markers on our time series graphs. Also integrated my code for querying data for Munin time series into ampy.

Churned out a revised version of my L7 filter paper for the IEEE Workshop on Network Measurements. I have repositioned the paper as an evaluation of open-source payload-based traffic classifiers rather than a critique of L7 filter. I also spent a fair chunk of time replacing my nice pass-fail system for representing results with the exact accuracy numbers because apparently reviewers found the former confusing.

Tried to continue my work in tidying up and releasing various trace sets, but ran into some problems with my rsyncs being flooded out over the faculty network. This was quite a nuisance so we need to be more careful in future about how we move traces around (despite it not really being our fault!).




Managed to get a decent little algorithm going for quickly detecting a change between a noisy and constant time series. Seems to work fairly well with the examples I have so far.

Decided to completely refactor the existing anomaly_ts code as it was getting a little unkempt, especially if we hope to have students working on it. For instance, there were several implementations of a buffer containing the recent history for a time series spread across the various detector modules. Also, most of the detectors that we had implemented were not being used and were creating a lot of confusion. On top of that, our main source file had a lot of branching based on the metric being used by a time series, e.g. latency, bytes, users.

It took the whole week, but I managed to produce a fresh implementation that was clean, tidy and did not have extraneous code. All of the old detectors were placed in an archive directory in case we need them later. Each time series metric is now implemented as a separate class, so there is a lot less branching in the main source. There is also now a single HistoryBuffer implementation that can be used by any detector, including future detectors.
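The interface of the real HistoryBuffer is not described here, so the following is only a Python sketch of the general idea, with hypothetical method names: one fixed-capacity buffer of recent measurements that any detector can hold and query, instead of each detector module carrying its own copy of the logic.

```python
from collections import deque


class HistoryBuffer:
    """Fixed-capacity buffer of the most recent values for one time series."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        # A deque with maxlen silently drops the oldest entry when full.
        self._items = deque(maxlen=capacity)

    def add(self, timestamp, value):
        self._items.append((timestamp, value))

    def is_full(self):
        return len(self._items) == self._items.maxlen

    def values(self):
        """Return the stored values, oldest first."""
        return [value for _, value in self._items]

    def mean(self):
        vals = self.values()
        if not vals:
            raise ValueError("buffer is empty")
        return sum(vals) / len(vals)
```

A detector would then take a HistoryBuffer as a member or argument and only ask it questions such as values() or mean(), which keeps the history handling in one place.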

Released the ISP DSL I traces on WITS -- we are now sharing (anonymised) residential DSL traces for the first time, which will no doubt prove to be very popular.




Finished up the 513 marking (eventually!) and released the marks to the students.

Released a new version of libtrace -- 3.0.17.

Started working on releasing some new public trace sets. Waikato 8 is now available on WITS and the DSL traffic from our 2009 ISP traces will hopefully soon follow. In the process, I found a couple of little glitches in traceanon that I was able to fix before the libtrace release.

Decided that our anomaly detection code does not handle time series that switch from constant to noisy and back again particularly well. A classic example is latency to Google: during working hours it is noisy, but it is constant at other times. We detect the switch, but only after a long time. I would like to detect this change sooner and report it as an event (although not necessarily alert on it). I've started looking into an alternative method of detecting the change in time series style based on a pair of sliding windows: one for the last hour, one for the 12 hours before that. It is working better, but is currently a bit too sensitive to the effect of an individual outlier.
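A rough Python sketch of the paired-window idea follows. Only the window lengths (one hour and the preceding 12 hours) come from the description above; the class name, the use of standard deviation as the measure of noisiness and the threshold value are all assumptions made for illustration, not what anomaly_ts actually does.

```python
import statistics
from collections import deque

HOUR = 3600  # seconds


class WindowPair:
    """Compare the last hour of a series with the 12 hours before it."""

    def __init__(self, recent_span=HOUR, history_span=12 * HOUR,
                 noisy_threshold=1.0):
        self.recent_span = recent_span
        self.history_span = history_span
        self.noisy_threshold = noisy_threshold  # assumed cut-off for "noisy"
        self._recent = deque()   # (timestamp, value) within recent_span
        self._history = deque()  # (timestamp, value) within the span before that

    def add(self, timestamp, value):
        self._recent.append((timestamp, value))
        # Entries that age out of the recent window move into the history window.
        while self._recent and self._recent[0][0] <= timestamp - self.recent_span:
            self._history.append(self._recent.popleft())
        # Entries that age out of the history window are dropped.
        oldest = timestamp - self.recent_span - self.history_span
        while self._history and self._history[0][0] <= oldest:
            self._history.popleft()

    def _is_noisy(self, window):
        values = [value for _, value in window]
        if len(values) < 2:
            return None  # not enough data to judge
        return statistics.pstdev(values) > self.noisy_threshold

    def style_changed(self):
        """True if one window looks noisy and the other looks constant."""
        recent = self._is_noisy(self._recent)
        history = self._is_noisy(self._history)
        if recent is None or history is None:
            return False
        return recent != history
```

Standard deviation is itself pulled around by a single extreme value, so a sketch like this shares the outlier sensitivity mentioned above.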




Libtrace 3.0.17 has finally been released.

This release adds some new convenience functions to the libtrace API and fixes a number of bugs, many of which have been reported by users.

The major changes in this release are:
* Added API functions for getting the IP address from a packet as a string.
* Added API functions for calculating packet checksums at the IP and transport layers.
* Fixed major bug where the event API was not working with int: inputs.
* Fixed broken checksum calculations in tracereplay.
* Fixed bug where IP headers embedded inside ICMP messages were not being anonymised by traceanon.
* Added API support for working with ICMPv6 headers.

The full list of changes in this release can be found in the libtrace ChangeLog.

You can download the new version of libtrace from the libtrace website.




Fixed the bugs in the anomaly_ts / eventing chain that I introduced last week. We're back reporting events again on the web dashboard.

Wrote ampy modules for retrieving smokeping and munin data from NNTSC so that Brendon could plot graphs of those time series. Doing this showed up some (more) problems in the graphing which Brendon eventually tracked down to being related to how aggregation was being performed within the NNTSC database.

Spent a large chunk of my week marking the 513 libtrace assignment. It is a much bigger class than previous years (over 30 students) so it was pretty time consuming to mark. In general, it was pleasing to see most students had gotten the basics of passive measurement worked out and hopefully they got some valuable experience from it. My biggest disappointment was how many students didn't read the instructions carefully -- especially those who missed the requirement to write original programs rather than blindly copying huge chunks of the example code.




Another short week, due to being away on Tuesday and Wednesday.

Started writing up a decent description of the design and implementation of NNTSC, which would hopefully make for a decent blog post. It also means that the entire thing is stored somewhere other than in my head...

Revisited the eventing side of our anomaly detection process. Had a long but eventually productive discussion with Brendon about what information needs to be stored in the events database to be able to support the visualisation side. We decided that, given the NNTSC query mechanism, events should have information about the collection and stream that they belong to so that we can easily filter them based on those parameters. We used to use "source" and "destination" for this, but streams are defined using more than just a source and destination now.

Updated anomalyfeed, anomaly_ts and eventing to support the new info that needs to be exported all the way to the eventing program. In the process, I moved eventing into the anomaly_ts source tree (because they shared some common header files) and wrangled automake into building them properly as separate tools. Got to the stage where everything was building happily, but not running so good :(