Shane Alcock's Blog
Generated some fresh DS probabilities based on the results of the new DistDiff detector. Turns out it isn't quite as good as I was originally hoping (credibility is around 56%) but we'll see how the additional detector pans out for us in the long run.
Started adding a proper configuration system to netevmon, so that we can easily enable and tweak the individual detectors via a config file (as opposed to the parameters all being hard-coded). I'm using libyaml to parse the config file itself, using Brendon's AMP schedule parsing code as an example.
Spent a day looking into DUCK support for libtrace, since the newest major DAG library release had changed both the duckinf structure and the name and value of the ioctl used to read it. When trying to test my changes, I found that we had broken DUCK in both libtrace and wdcap, so I got all of that working sensibly again. Whether this was worthwhile is a bit debatable, seeing as we don't really capture DUCK information anymore and nobody else had noticed this stuff was broken for months :)
Spent Tuesday combining all of our slides into a single presentation and therefore fixing all the weird LibreOffice glitches that resulted (broken diagrams, funky background colours, etc.). Tweaked a few slides to be less wordy.
The talk itself on Friday went OK, if a little over-time.
Continued working on the new detector for netevmon. It is no longer a KS test, strictly speaking, but performs a similar function. Experimented with using Earth Mover's Distance as an alternative, but this tended to be badly affected by outliers in the distribution. Managed to come up with a couple of tweaks that improved the performance of the detector overall.
The first was to examine the distribution of the interquartile values only, i.e. discard the bottom 2 and top 2 values from the original distribution, to minimise the impact of outliers in general. Another change I made was to require the total sum of the values in each sample to differ by a non-trivial amount, which would prevent the detector from alerting when the distance between the two distributions is very small.
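The two tweaks described above can be sketched roughly as follows. This is only an illustration, not the netevmon code: the function names, the 10% relative threshold and the choice to compare the sums of the trimmed samples are all assumptions made for the example.

```python
def trim_extremes(values, k=2):
    """Sort the sample and drop the k smallest and k largest values."""
    ordered = sorted(values)
    if len(ordered) <= 2 * k:
        return []  # not enough data left to say anything useful
    return ordered[k:len(ordered) - k]


def sums_differ(old, new, min_rel_diff=0.1):
    """True if the totals differ by more than min_rel_diff of the larger total.

    min_rel_diff is a hypothetical threshold chosen for illustration.
    """
    total_old, total_new = sum(old), sum(new)
    largest = max(abs(total_old), abs(total_new))
    if largest == 0:
        return False
    return abs(total_new - total_old) / largest > min_rel_diff


def worth_comparing(old, new, k=2, min_rel_diff=0.1):
    """Apply both tweaks before running any distribution distance test."""
    old_trimmed = trim_extremes(old, k)
    new_trimmed = trim_extremes(new, k)
    if not old_trimmed or not new_trimmed:
        return False
    return sums_differ(old_trimmed, new_trimmed, min_rel_diff)
```

With this gate in place, a single 400ms spike in an otherwise steady 40ms sample is trimmed away before the comparison, while a sustained shift of the whole sample still passes through to the distance calculation.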
Ran the new detector against the ground truth dataset to determine how well it performs. Results are not too bad so far -- looks like it will reach similar levels of reliability to the BinSeg detector which is one of the better detectors we have.
Wrote some slides on our latency event detection work for presentation at NZNOG. Had to shrink my original presentation a bit after realising I was sharing our timeslot with 2 other talks, so hopefully we'll all fit.
Experimented with using the Kolmogorov-Smirnov test as a detector for netevmon. I'm currently comparing the distributions of the latencies observed in the last 30 minutes with those observed 30 minutes prior to that. Initial results are somewhat promising, although my current method for evaluating distance between two distributions does not account for the difference between the values -- it just adds or subtracts a fixed amount to the distance depending on which value is larger. This means that a change from 40 to 42ms is just as likely to trigger an event as a change from 40 to 340ms.
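To illustrate the limitation, here is a minimal sketch of the textbook two-sample KS statistic (the largest gap between the two empirical CDFs). It is not the netevmon implementation, and the sample values are made up, but it shares the property described above: only the ordering of the values matters, not how far apart they are.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # The gap between two step functions can only change at observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical latency samples (ms), for illustration only.
baseline = [40.0, 40.5, 39.8, 40.2, 40.1]
small_shift = [v + 2 for v in baseline]    # roughly 40ms -> 42ms
large_shift = [v + 300 for v in baseline]  # roughly 40ms -> 340ms

print(ks_statistic(baseline, small_shift))
print(ks_statistic(baseline, large_shift))
```

Both calls print the same statistic, because in each case every new value is larger than every old value. The size of the jump never enters the calculation.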
Short final week for the year, as I had to take a couple of days of leave.
Finished fixing the highlighting of segments on the AS traceroute graph. I ended up going with a borderless approach, as it was very difficult to get the border drawing right in a number of cases. Instead, the highlighted segment becomes slightly brighter, which has much the same effect.
Added AS names to both the AS traceroute and monitor map graphs. These come from querying the Team Cymru whois server via its netcat interface and are heavily cached, so we shouldn't have to make too many queries. The monitor map has also been updated to use the same colour to draw nodes that belong to the same AS.
Migrated the last of the old libtrace trac wiki pages over to our GitHub wiki.
Finished updating the AMP latency event ground truth to include our new detectors. Generated some fresh probabilities for use in Meena's DS code. Also generated some probabilities based on the magnitude of the change in latency for an event so that we are more likely to recognise a large change as significant even if only one or two detectors fire for the event.
Updated the tooltips on the amp-web graphs to show the timestamp and value for the portion of the graph that the mouse is hovering over.
Started looking into fixing the bad border drawing on the AS path graphs, which would result in borders being drawn between segments that should be combined.
Deployed the new and improved NNTSC on skeptic. Had a few little glitches, but overall went fairly smoothly. Most importantly, the new NNTSC can process result messages faster than they are coming in -- although it'll be interesting to see if this continues once we upgrade the amplets and push out bigger schedules to them.
Continued plugging away at updating the ground truth event dataset. Tweaked the SeriesMode detector to be able to trigger faster, although faster is still pretty slow (30-45 min detection delay). Also fixed a bug in the BinSeg detector that was causing it to incorrectly report the time when an event was detected.
Spent an afternoon going over my rejected PAM paper to see if we could fix it in time to submit to TMA. Unfortunately, we would probably need to do a lot of work to show that the parameters we chose for the detectors were optimal, so this will have to wait until next year.
Added code to NNTSC to be able to receive and parse measurements via the collectd network protocol. This will allow us to start adding support for specific collectd metrics based on the requirements of our industry partners, particularly data that is collected using SNMP.
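For a rough idea of what parsing involves, here is a minimal sketch of decoding a collectd binary protocol packet. It is not the NNTSC code. It relies on the documented part layout (a 2-byte type and a 2-byte length that includes the 4-byte header, both big-endian) and only handles the string parts and the values part; time, interval, notification, signature and encryption parts are skipped. The function names are made up for the example.

```python
import struct

# Part type codes from the collectd binary protocol.
STRING_PARTS = {
    0x0000: "host",
    0x0002: "plugin",
    0x0003: "plugin_instance",
    0x0004: "type",
    0x0005: "type_instance",
}
PART_VALUES = 0x0006

# Data source types: counter, gauge, derive, absolute.
# Gauges are little-endian doubles; the rest are big-endian 64-bit integers.
DS_FORMATS = {0: "!Q", 1: "<d", 2: "!q", 3: "!Q"}


def iter_parts(packet):
    """Yield (part_type, payload) for each part in a packet."""
    offset = 0
    while offset + 4 <= len(packet):
        part_type, part_len = struct.unpack_from("!HH", packet, offset)
        if part_len < 4 or offset + part_len > len(packet):
            raise ValueError("malformed part at offset %d" % offset)
        yield part_type, packet[offset + 4:offset + part_len]
        offset += part_len


def decode_values(payload):
    """Decode a values part: count, then one type byte and 8 data bytes each."""
    (count,) = struct.unpack_from("!H", payload, 0)
    if len(payload) != 2 + 9 * count:
        raise ValueError("values part has unexpected length")
    values = []
    for i, ds_type in enumerate(payload[2:2 + count]):
        if ds_type not in DS_FORMATS:
            raise ValueError("unknown data source type %d" % ds_type)
        fmt = DS_FORMATS[ds_type]
        values.append(struct.unpack_from(fmt, payload, 2 + count + 8 * i)[0])
    return values


def parse_packet(packet):
    """Return a list of (identifier fields, values) pairs found in the packet."""
    context = {}
    results = []
    for part_type, payload in iter_parts(packet):
        if part_type in STRING_PARTS:
            text = payload.rstrip(b"\0").decode("utf-8", "replace")
            context[STRING_PARTS[part_type]] = text
        elif part_type == PART_VALUES:
            results.append((dict(context), decode_values(payload)))
    return results
```

The identifier fields are carried forward from part to part, so each values part is paired with a copy of whatever host, plugin and type strings preceded it in the packet.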
Spent a couple of days updating the latency event ground truth to include the two new detectors. Managed to get about half-way through the streams in the data set in that time, as I had made some minor modifications to other detectors that meant their detection results had also changed.
Much of Friday was spent investigating the Changepoint detector in more detail, as it had started giving a few new false positives. Still not sure whether this is a problem with our implementation or the underlying algorithm itself so this is going to require a bit more investigation, unfortunately.
Developed a new method for calculating the magnitude of a latency event in netevmon, as the existing methods were naive at best and did not properly account for the fact that absolute change is important when the latency is very low but relative change is more important otherwise. For example, going from 1ms to 2ms is much less significant than going from 100ms to 200ms. Similarly, going from 1ms to 21ms is much more significant than going from 40ms to 60ms.
The new method was derived by choosing a number of latency values and subjectively deciding the point at which an increase in latency should be treated as significant. Plotting a graph of these points gave me a function that I could use to determine a 'significance' base line. When new events are detected, I can use the distance from the base line as an input into my magnitude calculation -- being above the line increases the magnitude, being below the line decreases it.
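As a rough illustration of the idea, the sketch below interpolates between a handful of hand-picked points to get a base line and then compares an observed change against it. The points, the function names and the use of a simple ratio are all invented for the example. They are not the values or the formula used in netevmon, but they do reproduce the ordering of the examples given above.

```python
from bisect import bisect_right

# Hypothetical (latency in ms, increase in ms considered significant) pairs.
BASELINE_POINTS = [
    (1.0, 10.0),
    (10.0, 12.0),
    (40.0, 25.0),
    (100.0, 50.0),
    (300.0, 100.0),
]


def significant_increase(latency_ms, points=BASELINE_POINTS):
    """Piecewise-linear interpolation of the base line at latency_ms."""
    xs = [p[0] for p in points]
    if latency_ms <= xs[0]:
        return points[0][1]
    if latency_ms >= xs[-1]:
        return points[-1][1]
    i = bisect_right(xs, latency_ms)
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    frac = (latency_ms - x0) / (x1 - x0)
    return y0 + frac * (y1 - y0)


def relative_significance(before_ms, after_ms, points=BASELINE_POINTS):
    """Ratio of the observed change to the base line at the old latency.

    Above 1.0 means the change sits above the line, below 1.0 means under it.
    """
    change = abs(after_ms - before_ms)
    return change / significant_increase(before_ms, points)


for before, after in [(1, 2), (100, 200), (1, 21), (40, 60)]:
    print(before, after, relative_significance(before, after))
```

With these made-up points, 1ms to 2ms scores well below 1.0 and 100ms to 200ms scores above it, while 1ms to 21ms scores higher than 40ms to 60ms.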
Also developed a method for finding the magnitude of a change in T-Entropy. This is less reliable than the method for latency change, but will provide us with a value we can use for events that are only detected by the T-Entropy detectors.
Developed another detector for netevmon based on the Binary Segmentation algorithm for detecting changepoints. The detector appears to work very well and outperforms most of our existing detectors in terms of detection latency, i.e. the time between an event beginning and the event being reported.
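For anyone unfamiliar with the algorithm, here is a minimal offline sketch of binary segmentation. It is not the netevmon detector, which has to work on a live time series. The cost function (squared error around the segment mean), the penalty value and the minimum segment size are assumptions picked for the example.

```python
def _sse(values):
    """Sum of squared differences from the mean of the segment."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(values, min_size=3):
    """Find the split index that reduces the total squared error the most."""
    total = _sse(values)
    best_gain, best_idx = 0.0, None
    for i in range(min_size, len(values) - min_size + 1):
        gain = total - _sse(values[:i]) - _sse(values[i:])
        if gain > best_gain:
            best_gain, best_idx = gain, i
    return best_idx, best_gain


def binary_segmentation(values, penalty, min_size=3, offset=0):
    """Recursively split the series while the best split beats the penalty.

    Returns the sorted indices at which a new segment starts.
    """
    idx, gain = best_split(values, min_size)
    if idx is None or gain <= penalty:
        return []
    left = binary_segmentation(values[:idx], penalty, min_size, offset)
    right = binary_segmentation(values[idx:], penalty, min_size, offset + idx)
    return left + [offset + idx] + right


# Made-up latency series: a 40ms baseline with a step up to 80ms in the middle.
series = [40.0] * 10 + [80.0] * 10 + [40.0] * 10
print(binary_segmentation(series, penalty=100.0))
```

The penalty controls how large an improvement a split must deliver before it is accepted, so it trades sensitivity against false positives on noisy series.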
Was finally able to migrate prophet's database over to our new, faster schema and upgrade NNTSC accordingly. Aside from a couple of minor glitches that were easily fixed, the upgrade went pretty well and our database performs somewhat better than before, although I'm not convinced it will be fast enough for the full production AMP mesh.
Experimented with a few other event detection approaches for our latency time series, but unfortunately these didn't really go anywhere useful.