Shane Alcock's Blog
Finished updating the AMP latency event ground truth to include our new detectors. Generated some fresh probabilities for use in Meena's DS code. Also generated some probabilities based on the magnitude of the change in latency for an event so that we are more likely to recognise a large change as significant even if only one or two detectors fire for the event.
Updated the tooltips on the amp-web graphs to show the timestamp and value for the portion of the graph that the mouse is hovering over.
Started looking into fixing the bad border drawing on the AS path graphs, which would result in borders being drawn between segments that should be combined.
Deployed the new and improved NNTSC on skeptic. Had a few little glitches, but overall went fairly smoothly. Most importantly, the new NNTSC can process result messages faster than they are coming in -- although it'll be interesting to see if this continues once we upgrade the amplets and push out bigger schedules to them.
Continued plugging away at updating the ground truth event dataset. Tweaked the SeriesMode detector to be able to trigger faster, although faster is still pretty slow (30-45 min detection delay). Also fixed a bug in the BinSeg detector that was causing it to incorrectly report the time when an event was detected.
Spent an afternoon going over my rejected PAM paper to see if we could fix it in time to submit to TMA. Unfortunately, we would probably need to do a lot of work to show that the parameters we chose for the detectors were optimal, so this will have to wait until next year.
Added code to NNTSC to be able to receive and parse measurements via the collectd network protocol. This will allow us to start adding support for specific collectd metrics based on the requirements of our industry partners, particularly data that is collected using SNMP.
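As a rough illustration of what parsing the collectd network protocol involves (this is not the NNTSC code), the sketch below splits a packet into its parts and decodes the string-valued ones. It relies on my understanding of the collectd binary protocol: each part starts with a 2-byte type and a 2-byte length in network byte order, the length includes the 4-byte header, and string parts are null-terminated. The function names split_parts and string_fields are hypothetical, and numeric and value parts are ignored here.

```python
import struct

# Type codes for string-valued parts in the collectd binary protocol.
STRING_PARTS = {
    0x0000: "host",
    0x0002: "plugin",
    0x0003: "plugin_instance",
    0x0004: "type",
    0x0005: "type_instance",
}


def split_parts(packet):
    """Yield (part_type, payload) for each part in a collectd packet."""
    offset = 0
    while offset + 4 <= len(packet):
        # Header: 16-bit type, 16-bit length (header included), big-endian.
        part_type, length = struct.unpack_from("!HH", packet, offset)
        if length < 4 or offset + length > len(packet):
            raise ValueError("malformed part at offset %d" % offset)
        yield part_type, packet[offset + 4:offset + length]
        offset += length


def string_fields(packet):
    """Return a dict of the string-valued parts found in the packet."""
    fields = {}
    for part_type, payload in split_parts(packet):
        name = STRING_PARTS.get(part_type)
        if name is not None:
            # Strip the trailing null terminator before decoding.
            fields[name] = payload.rstrip(b"\0").decode("utf-8", "replace")
    return fields
```

A real receiver would also need to handle the time, interval and values parts, which carry the measurements themselves.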
Spent a couple of days updating the latency event ground truth to include the two new detectors. Managed to get about half-way through the streams in the data set in that time, as I had made some minor modifications to other detectors that meant their detection results had also changed.
Much of Friday was spent investigating the Changepoint detector in more detail, as it had started giving a few new false positives. Still not sure whether this is a problem with our implementation or the underlying algorithm itself so this is going to require a bit more investigation, unfortunately.
Developed a new method for calculating the magnitude of a latency event in netevmon, as the existing methods were naive at best and did not properly account for the fact that absolute change is important when the latency is very low but relative change is more important otherwise. For example, going from 1ms to 2ms is much less significant than going from 100ms to 200ms. Similarly, going from 1ms to 21ms is much more significant than going from 40ms to 60ms.
The new method was derived by choosing a number of latency values and subjectively deciding the point at which an increase in latency should be treated as significant. Plotting a graph of these points gave me a function that I could use to determine a 'significance' base line. When new events are detected, I can use the distance from the base line as an input into my magnitude calculation -- being above the line increases the magnitude, being below the line decreases it.
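The base line idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not the netevmon implementation: the points in BASELINE_POINTS are made-up values rather than the ones I actually chose, the base line is approximated by linear interpolation between them, and the function names are hypothetical.

```python
from bisect import bisect_right

# Hypothetical base line points: (latency before the event in ms,
# increase in ms judged to be "significant" at that latency).
BASELINE_POINTS = [
    (1.0, 15.0),
    (10.0, 18.0),
    (40.0, 25.0),
    (100.0, 50.0),
    (300.0, 120.0),
]


def baseline_increase(before_ms):
    """Interpolate the significant increase for a given starting latency."""
    xs = [p[0] for p in BASELINE_POINTS]
    ys = [p[1] for p in BASELINE_POINTS]
    # Outside the chosen points, hold the nearest value (an assumption).
    if before_ms <= xs[0]:
        return ys[0]
    if before_ms >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, before_ms)
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (before_ms - x0) / (x1 - x0)


def distance_from_baseline(before_ms, after_ms):
    """Positive when the change is above the base line, negative below."""
    return abs(after_ms - before_ms) - baseline_increase(before_ms)
```

With these made-up points, a change from 1ms to 21ms lands above the line while a change from 40ms to 60ms lands below it, which matches the intuition described earlier.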
Also developed a method for finding the magnitude of a change in T-Entropy. This is less reliable than the method for latency change, but will provide us with a value we can use for events that are only detected by the T-Entropy detectors.
Developed another detector for netevmon based on the Binary Segmentation algorithm for detecting changepoints. The detector appears to work very well and outperforms most of our existing detectors in terms of detection latency, i.e. the time between an event beginning and the event being reported.
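For readers unfamiliar with Binary Segmentation, here is a minimal sketch of the general technique applied to shifts in the mean. It is not the netevmon detector: the squared-error cost, the penalty and min_size parameters, and the function names are all assumptions made for illustration, and this offline version works on a complete series rather than a live stream.

```python
def sse(values):
    """Sum of squared deviations from the mean of the segment."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(values, min_size):
    """Return (gain, index) for the split that most reduces the cost."""
    total = sse(values)
    best_gain, best_idx = 0.0, None
    for i in range(min_size, len(values) - min_size + 1):
        gain = total - sse(values[:i]) - sse(values[i:])
        if gain > best_gain:
            best_gain, best_idx = gain, i
    return best_gain, best_idx


def binary_segmentation(values, penalty, min_size=3, offset=0):
    """Recursively split the series wherever the cost drops by > penalty."""
    gain, idx = best_split(values, min_size)
    if idx is None or gain <= penalty:
        return []
    left = binary_segmentation(values[:idx], penalty, min_size, offset)
    right = binary_segmentation(values[idx:], penalty, min_size, offset + idx)
    return left + [offset + idx] + right
```

Recomputing the cost for every candidate split makes this quadratic in the segment length, so a production detector would want running sums instead.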
Finally was able to migrate prophet's database over to our new faster schema and upgrade NNTSC accordingly. Aside from a couple of minor glitches that were easily fixed, the upgrade went pretty well and our database performs somewhat better than before, although I'm not convinced it will be fast enough for the full production AMP mesh.
Experimented with a few other event detection approaches for our latency time series, but unfortunately these didn't really go anywhere useful.
Finished translating the mode detection over to C++ and managed to get it producing the same results as my original python prototype. Started running it against all of our AMP latency streams, which was mostly successful, but it looks like there are one or two very rare edge cases that can cause it to fall over entirely. Unfortunately, the problems are difficult to replicate, especially as the failures can occur at a point where I have no idea which time series I'm looking at, so debugging looks like it might be painful.
Wrote a new detector that uses the modes reported by my new code to identify mode changes or the appearance of new modes. It would possibly be more effective if the mode detection was performed more often (currently I look for new modes every 10 minutes), but I'm concerned about the performance impact of doing it more frequently.
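The core comparison in such a detector might look something like the sketch below. This is an assumption about the general shape rather than the actual detector: the tolerance parameter, the matching rule and the name compare_modes are all hypothetical.

```python
def compare_modes(previous, current, tolerance):
    """Return (appeared, vanished) modes between two detection runs.

    A mode counts as unchanged if some mode in the other set lies within
    `tolerance` of it (hypothetical matching rule).
    """
    appeared = [m for m in current
                if all(abs(m - p) > tolerance for p in previous)]
    vanished = [p for p in previous
                if all(abs(p - m) > tolerance for m in current)]
    return appeared, vanished
```

For example, with a tolerance of 2ms, going from modes at 20ms and 60ms to modes at 20.5ms and 95ms would report 95ms as new and 60ms as gone.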
Started investigating other potential anomaly detection methods. Had a look at Twitter's recent breakout detection R module, but it didn't perform very well with our latency data. Found another changepoint module in R which appears to work much better, so will start looking at developing our own version of this algorithm.
Continued the painful process of migrating my python prototype for mode detection over to C++ for inclusion in netevmon. Managed to get the embedded R portion working correctly, which should be the trickiest part.
Spent a bit of time with our new libtrace testbed, getting the DAG 7.5G2s configured and capturing correctly. Ran into some problems getting the card to steer packets captured on each interface into separate stream buffers, as the firmware we are currently running doesn't appear to support steering.
Managed to get my python prototype doing a reasonable job of finding modes in a selection of time series from the current prophet database. Added a new system for determining the 'width' of a detected mode -- wide modes cover a large range of values in the probability density function and are therefore more likely to indicate a noisy data series. Width is calculated using both the relative standard deviation and the quartile coefficient of dispersion.
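The two dispersion measures are standard and can be computed as in the sketch below. How they are combined into a single width is not shown here, so the code just returns each measure separately. The function names are hypothetical and statistics.quantiles requires Python 3.8 or later.

```python
import statistics


def relative_std_dev(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)


def quartile_dispersion(values):
    """Quartile coefficient of dispersion: (Q3 - Q1) / (Q3 + Q1)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return (q3 - q1) / (q3 + q1)
```

Both measures are scale-free, so a mode around 200ms can be compared against one around 20ms without the larger values automatically looking wider.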
Started converting the python prototype into C++ code so it can be incorporated into netevmon.
Spent the remainder of my week reading over Richard and Craig's Honours reports and making plenty of little suggestions as to how to improve the language and make sure the important points come across clearly to the reader.
Modified the amp-web matrix to add a dropdown selector for the type of latency to show on the latency matrix (TCP, ICMP or DNS). Removed the tabs for absolute and relative DNS latency, as this is now incorporated into the generic latency tabs.
My heuristics for identifying multimodal series were not quite as effective as I had hoped, so I spent the remainder of my week investigating methods used by real statisticians to find modes in a sample set. The approach I have taken involves estimating the probability density from the observed measurements using a kernel function. This results in a smoothed line graph where the peaks represent likely modes.
By examining the differences between consecutive values on the line graph, I find the local maxima and minima in the density function. The maxima are, of course, the modes themselves while the minima are required for the following step. I then use Fisher and Marron's method to eliminate or merge "minor" modes in my set of maxima. This seems to work reasonably well in the limited test cases I have provided so far, although much of the math is too complicated for me to implement entirely within netevmon. Instead, it looks like we will be calling out to R to generate the density function, but it seems likely that R will be able to do this much faster than any naive implementation I write anyway.
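The density estimation and extrema search can be sketched as follows. This is a simplified pure-Python stand-in, not the prototype itself: it uses a Gaussian kernel with a fixed, hand-picked bandwidth instead of calling out to R, the function names are hypothetical, and the Fisher and Marron step for merging minor modes is not attempted.

```python
import math


def gaussian_kde(samples, grid, bandwidth):
    """Estimate the probability density at each grid point."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                   for s in samples)
        for x in grid
    ]


def find_extrema(density):
    """Return (maxima, minima) as lists of indices into the density."""
    maxima, minima = [], []
    for i in range(1, len(density) - 1):
        before = density[i] - density[i - 1]
        after = density[i + 1] - density[i]
        if before > 0 and after <= 0:
            maxima.append(i)   # density stops rising: a candidate mode
        elif before < 0 and after >= 0:
            minima.append(i)   # density stops falling: a valley
    return maxima, minima
```

For a series that flips between roughly 20ms and 60ms, evaluating the density on a 1ms grid and passing it to find_extrema should give two maxima near those values with a single minimum between them.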
Finished and submitted my PAM paper, after incorporating some feedback from Richard.
Fixed a minor libwandio bug where it was not giving any indication that a gzipped file was truncated early and content was missing.
Managed to get a new version of the amplet code from Brendon installed on my test amplet. Set up a full schedule of tests and found a few bugs that I reported back to the developer. By the end of the week, we were getting closer to having a full set of tests working properly -- just one or two outstanding bugs in the traceroute test.
Got netevmon running again on the test NNTSC. Noticed that we are getting a lot of false positives for the changepoint and mode detectors for test targets that are hosted on Akamai. This is because the series is fluctuating between two latency values and the detectors get confused as to which of the values is "normal" -- whenever it switches between them, we get an erroneous event. Added a new time series type to combat this: multimodal, where the series has 2 or 3 clear modes that it is always switching between. Multimodal series will not run the changepoint or mode detectors, but I hope to add a special multimode detector that alerts if a new and different mode appears (or an old mode disappears).