Shane Alcock's Blog
Developed another detector for netevmon based on the Binary Segmentation algorithm for detecting changepoints. The detector appears to work very well and outperforms most of our existing detectors in terms of detection latency, i.e. the time between an event beginning and the event being reported.
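For readers unfamiliar with Binary Segmentation, the idea can be sketched roughly as follows. This is a minimal illustration rather than the netevmon detector: it assumes changes in the mean, a squared-error cost, and a hypothetical `penalty` threshold and `min_size` segment length that do not come from the original work.

```python
def _best_split(xs, lo, hi, min_size):
    """Find the split of xs[lo:hi] that most reduces the squared error.

    Returns (gain, index); index is -1 if no valid split exists.
    """
    n = hi - lo
    total = sum(xs[lo:hi])
    total_sq = sum(x * x for x in xs[lo:hi])
    full_cost = total_sq - total * total / n

    best_gain, best_idx = 0.0, -1
    left_sum = left_sq = 0.0
    for i in range(lo, hi - 1):
        left_sum += xs[i]
        left_sq += xs[i] * xs[i]
        k = i - lo + 1                      # size of the left segment
        if k < min_size or n - k < min_size:
            continue
        right_sum = total - left_sum
        right_sq = total_sq - left_sq
        cost = (left_sq - left_sum ** 2 / k) + (right_sq - right_sum ** 2 / (n - k))
        gain = full_cost - cost
        if gain > best_gain:
            best_gain, best_idx = gain, i + 1
    return best_gain, best_idx


def binary_segmentation(xs, penalty, min_size=3):
    """Recursively split the series wherever the gain exceeds the penalty."""
    changepoints = []
    pending = [(0, len(xs))]
    while pending:
        lo, hi = pending.pop()
        if hi - lo < 2 * min_size:
            continue
        gain, idx = _best_split(xs, lo, hi, min_size)
        if idx != -1 and gain > penalty:
            changepoints.append(idx)
            pending.append((lo, idx))       # keep searching both halves
            pending.append((idx, hi))
    return sorted(changepoints)
```

Calling `binary_segmentation(latencies, penalty=100.0)` returns the indices at which a new segment starts. The penalty controls how large a shift has to be before it is reported, so it would need tuning against real latency data.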
Finally was able to migrate prophet's database over to our new faster schema and upgrade NNTSC accordingly. Aside from a couple of minor glitches that were easily fixed, the upgrade went pretty well and our database now performs somewhat better than before, although I'm not convinced it will be fast enough for the full production AMP mesh.
Experimented with a few other event detection approaches for our latency time series, but unfortunately these didn't really go anywhere useful.
Finished translating the mode detection over to C++ and managed to get it producing the same results as my original python prototype. Started running it against all of our AMP latency streams which was mostly successful but it looks like there are one or two very rare edge cases that can cause it to fall over entirely. Unfortunately, the problems are difficult to replicate, especially as the failures can occur at a point where I have no idea which time series I'm looking at, so debugging looks like it might be painful.
Wrote a new detector that uses the modes reported by my new code to identify mode changes or the appearance of new modes. It would possibly be more effective if the mode detection was performed more often (currently I look for new modes every 10 minutes), but I'm concerned about the performance impact of doing it more frequently.
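The comparison at the heart of such a detector might look something like the sketch below. It is only an illustration under assumptions: modes are treated as plain latency values, and the `tolerance` parameter (how far a mode can drift and still count as the same mode) is hypothetical rather than taken from netevmon.

```python
def compare_modes(previous, current, tolerance):
    """Return (appeared, vanished) modes between two detection runs.

    A mode counts as unchanged if some mode in the other run lies
    within `tolerance` of it.
    """
    appeared = [m for m in current
                if all(abs(m - p) > tolerance for p in previous)]
    vanished = [p for p in previous
                if all(abs(p - m) > tolerance for m in current)]
    return appeared, vanished
```

Run once per mode detection interval, a non-empty `appeared` or `vanished` list would be the trigger for raising an event.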
Started investigating other potential anomaly detection methods. Had a look at Twitter's recent breakout detection R module, but it didn't perform very well with our latency data. Found another changepoint module in R which appears to work much better, so will start looking at developing our own version of this algorithm.
Continued the painful process of migrating my python prototype for mode detection over to C++ for inclusion in netevmon. Managed to get the embedded R portion working correctly, which should be the trickiest part.
Spent a bit of time with our new libtrace testbed, getting the DAG 7.5G2s configured and capturing correctly. Ran into some problems getting the card to steer packets captured on each interface into separate stream buffers, as the firmware we are currently running doesn't appear to support steering.
Managed to get my python prototype doing a reasonable job of finding modes in a selection of time series from the current prophet database. Added a new system for determining the 'width' of a detected mode -- wide modes cover a large range of values in the probability density function and are therefore more likely to indicate a noisy data series. Width is calculated using both the relative standard deviation and the quartile coefficient of dispersion.
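The two width measures are standard statistics, so they can be sketched with Python's `statistics` module. How they are combined is an assumption here: the `is_wide_mode` helper and its two limits are hypothetical and only show one plausible way of using both values together.

```python
import statistics


def relative_std_dev(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / abs(statistics.fmean(values))


def quartile_coeff_dispersion(values):
    """(Q3 - Q1) / (Q3 + Q1), a scale-free measure of spread."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return (q3 - q1) / (q3 + q1)


def is_wide_mode(values, rsd_limit=0.1, qcd_limit=0.1):
    """Hypothetical rule: a mode is wide if both measures exceed their limit."""
    return (relative_std_dev(values) > rsd_limit
            and quartile_coeff_dispersion(values) > qcd_limit)
```

Here `values` would be the measurements attributed to a single mode. The quartile-based measure is less sensitive to the occasional latency spike than the standard deviation, which is one reason to look at both.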
Started converting the python prototype into C++ code so it can be incorporated into netevmon.
Spent the remainder of my week reading over Richard and Craig's Honours reports and making plenty of little suggestions as to how to improve the language and make sure the important points come across clearly to the reader.
Modified the amp-web matrix to add a dropdown selector for the type of latency to show on the latency matrix (TCP, ICMP or DNS). Removed the tabs for absolute and relative DNS latency, as this is now incorporated into the generic latency tabs.
My heuristics for identifying multimodal series were not quite as effective as I had hoped, so I spent the remainder of my week investigating methods used by real statisticians to find modes in a sample set. The approach I have taken involves estimating the probability density from the observed measurements using a kernel function. This results in a smoothed line graph where the peaks represent likely modes.
By examining the differences between consecutive values on the line graph, I find the local maxima and minima in the density function. The maxima are, of course, the modes themselves while the minima are required for the following step. I then use Fisher and Marron's method to eliminate or merge "minor" modes in my set of maxima. This seems to work reasonably well in the limited test cases I have provided so far, although much of the math is too complicated for me to implement entirely within netevmon. Instead, it looks like we will be calling out to R to generate the density function, but it seems likely that R will be able to do this much faster than any naive implementation I write anyway.
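The density estimation and extrema search can be sketched as below. This is a naive pure-Python stand-in for what R's density estimation would provide, assuming a Gaussian kernel and a hand-picked `bandwidth`; the merging of minor modes using Fisher and Marron's method is not attempted here.

```python
import math


def gaussian_kde(samples, grid, bandwidth):
    """Estimate the probability density at each grid point."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - s) / bandwidth) ** 2) for s in samples)
        for g in grid
    ]


def find_extrema(grid, density):
    """Locate local maxima (modes) and minima from consecutive differences."""
    maxima, minima = [], []
    for i in range(1, len(density) - 1):
        before = density[i] - density[i - 1]
        after = density[i + 1] - density[i]
        if before > 0 and after <= 0:       # rising, then falling or flat
            maxima.append(grid[i])
        elif before < 0 and after >= 0:     # falling, then rising or flat
            minima.append(grid[i])
    return maxima, minima
```

For a series that flips between roughly 20ms and 60ms, evaluating the density on a 1ms grid gives two maxima near those values and a single minimum between them. The grid should not extend far beyond the data, since the density eventually underflows to zero in the tails and the flat region would be misread as a minimum.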
Finished and submitted my PAM paper, after incorporating some feedback from Richard.
Fixed a minor libwandio bug where it was not giving any indication that a gzipped file was truncated early and content was missing.
Managed to get a new version of the amplet code from Brendon installed on my test amplet. Set up a full schedule of tests and found a few bugs that I reported back to the developer. By the end of the week, we were getting closer to having a full set of tests working properly -- just one or two outstanding bugs in the traceroute test.
Got netevmon running again on the test NNTSC. Noticed that we are getting a lot of false positives for the changepoint and mode detectors for test targets that are hosted on Akamai. This is because the series is fluctuating between two latency values and the detectors get confused as to which of the values is "normal" -- whenever it switches between them, we get an erroneous event. Added a new time series type to combat this: multimodal, where the series has 2 or 3 clear modes that it is always switching between. Multimodal series will not run the changepoint or mode detectors, but I hope to add a special multimode detector that alerts if a new and different mode appears (or an old mode disappears).
Spent last week on leave, getting my balance down :)
Finished developing and testing stream / collection selection in netevmon.
Added support for the HTTP test back into NNTSC. We only store basic statistics from the test, i.e. number of objects, bytes, servers and the time taken to fetch everything, as opposed to the previous schema which tried to store detailed information about each individual fetched object. Managed to get my own amplet VM to do some testing and have been happily running HTTP tests for most of the week.
Replaced the pika code in NNTSC to use asynchronous connections rather than blocking connections. This should make our rabbit queue publishing and consuming code a bit more robust, especially if a TCP connection breaks down, and it also appears to have made our backlog processing much faster.
Spent a decent chunk of time chasing down a bug in the AMP HTTP test that would cause it to segfault if you tested to certain sites. After delving deep into the flex code that parses the HTML on the fetched pages looking for other objects to fetch, we eventually found that the buffer being provided to store the URL of the found object was not big enough to fit all the URLs we were seeing.
Released a new version of libtrace on Tuesday that contains the most recent batch of bug fixes. Started moving the libtrace wiki from trac to github; only the tool pages are left to migrate.
Updated netevmon to support the new family-based streams in NNTSC. Since this new approach results in one time series per stream (as opposed to multiple streams having to be aggregated into each time series), this greatly simplified the anomalyfeed script. Added event detection for changes in AS paths which operates in much the same way as the old IP path event detection.
Started adding the ability to specify a subset of streams / collections for event detection in netevmon, rather than automatically running against all streams. The streams / collections of interest are provided via a config file and a SIGHUP will cause the file to be re-read and any necessary changes made. This also meant I had to add unsubscribe support to the NNTSC exporter, so that it would stop sending live updates for streams that had been removed from the config file.
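The reload logic boils down to re-reading the file and working out which streams to subscribe to and which to unsubscribe from. The sketch below shows that pattern only; the one-stream-per-line config format, the stream names and every function and class name in it are hypothetical and are not netevmon's or NNTSC's actual interfaces.

```python
import signal


def parse_streams(text):
    """Parse a hypothetical config format: one stream per line, '#' comments."""
    streams = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            streams.add(line)
    return streams


def diff_streams(current, wanted):
    """Return (to_subscribe, to_unsubscribe) as sets."""
    return wanted - current, current - wanted


class StreamSelection:
    def __init__(self, path):
        self.path = path
        self.active = set()
        self.reload_requested = True        # load the file on the first poll

    def request_reload(self, signum=None, frame=None):
        # Signal handlers should do very little, so just set a flag.
        self.reload_requested = True

    def poll(self):
        """Call from the main loop; re-reads the file if a reload is pending."""
        if not self.reload_requested:
            return set(), set()
        self.reload_requested = False
        with open(self.path) as f:
            wanted = parse_streams(f.read())
        added, removed = diff_streams(self.active, wanted)
        self.active = wanted
        return added, removed


def install_sighup_handler(selection):
    """Arrange for SIGHUP to trigger a reload (Unix only)."""
    signal.signal(signal.SIGHUP, selection.request_reload)
```

The main loop would call `poll()` regularly, send a subscribe request for each stream in the first returned set and an unsubscribe request for each stream in the second.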
Libtrace 3.0.21 has been released today.
This release fixes many bugs that have been reported by our users, including:
* trace_interrupt() now works properly for int, bpf, dag and ring formats.
* fixed double-counting of accepted packets when using the event API.
* fixed incorrect filtered packet counts for bpf format.
* fixed crash when performing very large reads with libwandio.
* fixed inconsistent behaviour if a bad filter string is used with int and dag formats.
* fixed potential infinite loop when combining filters, the event API and the pcapint format.
* fixed incorrect wire lengths when using SNAPLEN config option to truncate packets captured using the int format.
The full list of changes in this release can be found in the libtrace ChangeLog.
You can download the new version of libtrace from the libtrace website.