Shane Alcock's Blog
Spent a lot of time chasing down deadlock behaviour in netevmon when it first starts up. The problem ultimately turned out to be that anomalyfeed was requesting a large amount of stream data from NNTSC, which left both ends stuck trying to complete a blocking send to the other. Reduced the likelihood of this occurring in the future by forcing anomalyfeed to wait for all streams for a collection to arrive before asking for any more streams, but the proper solution is going to be moving to non-blocking transmits.
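To illustrate what a non-blocking transmit looks like, here is a minimal Python sketch. It is not the netevmon or NNTSC code; the helper name send_some is made up for the example. The idea is that a sender which cannot push all of its data keeps the remainder and returns, so it stays free to read whatever the other end is trying to send it.

```python
import socket


def send_some(sock, pending):
    """Send as much of `pending` as the socket will accept right now.

    Returns the bytes that still have to be sent. This never blocks as
    long as the socket has been put into non-blocking mode.
    (Hypothetical helper, for illustration only.)
    """
    try:
        sent = sock.send(pending)
    except BlockingIOError:
        # The kernel buffer is full: keep everything and retry later.
        return pending
    return pending[sent:]


# Example: a connected pair of sockets where nobody is reading yet.
left, right = socket.socketpair()
left.setblocking(False)

backlog = send_some(left, b"x" * (8 * 1024 * 1024))
# A blocking send would now be stuck; here we simply hold on to `backlog`
# and can go and service incoming data before retrying.

left.close()
right.close()
```

In a real event loop the retry would normally be driven by select or poll reporting the socket as writable again.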
Also replaced some pipes within the NNTSC exporting code with Queues, as full pipes were also causing problems. These problems were much worse, as one of the full pipes would stop NNTSC from processing new data and inserting it into the database.
Fixed a segfault in anomaly_ts due to reading off the end of a buffer. The problem was that we were using strchr to look for a newline character but never checking if the character we found was within sensible bounds.
Spent last week at NZNOG where we managed to give a reasonably successful presentation of everything we've done up until now. Managed to generate a bit of interest from operators, so we must be doing something right.
Replaced the event descriptions produced by netevmon with something a bit more human-readable. This was somewhat annoying to achieve, as it required passing a lot of extra parameters into each detector, e.g. the units that the time series is measured in, the metric itself, and the scale factor for the raw data (e.g. converting bytes per period into Mbps).
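As a rough example of the scale factor mentioned above, the following Python sketch converts a byte count for one measurement period into Mbps and formats it for a description. The function names and the period_seconds parameter are assumptions made for the example, not netevmon's actual interface.

```python
def bytes_per_period_to_mbps(byte_count, period_seconds):
    """Convert bytes seen in one measurement period to megabits per second."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    bits = byte_count * 8
    return bits / period_seconds / 1_000_000


def describe_value(value, units):
    """Format a scaled value with its units for an event description."""
    return f"{value:.2f} {units}"


# 75,000,000 bytes in a 60 second period is 10 Mbps.
rate = bytes_per_period_to_mbps(75_000_000, 60)
print(describe_value(rate, "Mbps"))
```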
Made the amp-web graphs appear more responsive by displaying components as soon as their ajax call completes, rather than waiting for all the ajax to complete before rendering anything. In practical terms, this means the detail graph appears much sooner rather than having to wait for the query for 30 days of summary data to finish. I've also split the summary data query into multiple queries so the summary graph will now appear in increments, almost acting like a progress bar.
Tried to get netevmon deployed on skeptic, without much success so far. It seems that we can run it against a particular collection, but as soon as we try to include all of the collections, the whole thing grinds to a halt and eventually prevents NNTSC from processing new data. Hopefully, we can find the cause of the problem early next week.
Fixed a bunch of other minor bugs / errors across Cuz in between times, as we try to get closer to something we can show off at NZNOG.
Got back into the swing of things by spending the week fixing a multitude of UI problems and general bugs in Cuz, with the aim of getting closer to something we feel comfortable demonstrating at NZNOG.
The main improvements are:
* Finally added a "graph browser" page which lets the user choose a collection to explore.
* Event groups are shown on graphs rather than individual events. This greatly reduces clutter when big events occur.
* Fixed various inconsistencies between the line colour shown on the legend and the line colour actually being drawn on the graph.
* Stopped creating tabs that go to empty graphs.
* Fixed a bug where the rainbow summary graph would only show the first couple of hops rather than the entire path.
* Added basic tooltips to the legend which show more detail about the group being moused over, e.g. what exactly is represented by each line colour.
* Better handling of database exceptions in Cuz so that Brendon's buggy AMP test results don't crash NNTSC :)
Updated the event tooltips to better describe the group that the event belongs to, as it was previously difficult to tell which line the event corresponded to when multiple lines were drawn on the graph.
Brad's rainbow graph is now used whenever an AMP traceroute event is clicked on in the dashboard. Fixed a couple of bugs with the rainbow graph: the main one being that it was rendering the heavily aggregated summary data in the detail graph instead of the detailed data.
Replaced the old hop count event detection for traceroute data with a detector that reports when a hop in the path has changed.
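A hop-change check of this kind can be sketched in a few lines of Python. This is only an illustration of the idea, not the detector in netevmon: it assumes a path is a plain list of hop addresses and the function name is invented.

```python
def changed_hops(previous_path, current_path):
    """Return (hop_number, old_address, new_address) for each differing hop.

    Hop numbers start at 1. A hop that exists in only one of the two
    paths is reported with None for the missing side.
    """
    changes = []
    longest = max(len(previous_path), len(current_path))
    for index in range(longest):
        old = previous_path[index] if index < len(previous_path) else None
        new = current_path[index] if index < len(current_path) else None
        if old != new:
            changes.append((index + 1, old, new))
    return changes


before = ["10.0.0.1", "10.0.1.1", "192.0.2.1"]
after = ["10.0.0.1", "10.0.2.1", "192.0.2.1"]
print(changed_hops(before, after))
```

Unlike a hop count comparison, this also catches a path that changes route but keeps the same length.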
Fixed a tricky little bug in NNTSC where large aggregate data queries were being broken up into time periods that did not align with the requested binsize, so a bin would straddle two queries. This would produce two results for the same bin and was causing the summary graph to stop several hours short of the right hand edge.
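The fix amounts to making sure every internal query boundary falls on a bin edge. A minimal Python sketch of that idea follows. It assumes bins are aligned to multiples of binsize counted from timestamp zero, and the function and parameter names are hypothetical rather than taken from NNTSC.

```python
def split_aligned(start, end, binsize, bins_per_query):
    """Split [start, end) into sub-queries whose internal boundaries all
    fall on multiples of binsize, so no bin straddles two queries."""
    if binsize <= 0 or bins_per_query <= 0:
        raise ValueError("binsize and bins_per_query must be positive")
    chunk_length = binsize * bins_per_query
    # Edge of the bin that contains `start`.
    first_edge = (start // binsize) * binsize
    boundary = first_edge + chunk_length
    cursor = start
    chunks = []
    while cursor < end:
        chunk_end = min(boundary, end)
        chunks.append((cursor, chunk_end))
        cursor = chunk_end
        boundary += chunk_length
    return chunks


# With 300 second bins and one bin per query:
print(split_aligned(130, 1000, 300, 1))
```

Only the very first start and the very last end are allowed to be unaligned, since they come straight from the original request.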
Started working on making the tabs allowing access to "similar" graphs operational again. Have got this working for LPI, which is the most complicated case, so it shouldn't be too hard to get tabs going for everything else again before the end of the year.
Spent most of the week adding view support to all of the existing collections within ampy. Much of the work was modifying the code to be more generic rather than the AMP-specific original implementation Brendon wrote as a proof of concept.
Added a new api to amp-web called eventview that will generate a suitable view for a given event, e.g. an AMP ICMP event will produce a view showing a single line for the address family where the event was detected.
Updated the legend generation code for views to work for all collections as well. Added a short label for each line so it will be possible to display a pop-up which will distinguish between the different colours for the same line group.
Finished the re-implementation of anomalyfeed to support grouping of streams into a single time series. Now our AMP ICMP tests are considered as one time series despite being spread across multiple addresses (and therefore multiple streams).
Brendon changed the way that we store AMP traceroute test results to improve the query performance, so this required a further update to anomalyfeed to be able to parse the new row format.
Updated NNTSC to always use labels rather than stream ids when querying the database. Eventually, all incoming queries will use labels, but ampy still uses stream ids for many collections, so for now we have to support both methods. Any queries that are using stream ids are converted to labels by the NNTSC client API.
Updated Brendon's view / stream group management code in ampy to not be so AMP-specific. The collection-specific code has now moved into the parser code for each collection so it should be much easier to implement views for the remaining collections now.
Spent the first part of the week fixing various bugs and less than ideal behaviours in netevmon and nntsc. Some examples include:
* Preventing an event from being triggered when an amp-traceroute stream reactivates after a long idle time
* Fixed a crash bug in anomalyfeed due to an incorrect field name being used
* Fixed a problem in NNTSC where the HTTP dataparser would fall over if a path contained a ' character.
* Added a rounding threshold to the Mode detector so that it can be used with AMP ICMP streams, as these measure in usec rather than msec. Now we can round to the nearest msec.
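To show why the rounding threshold in the last item matters, here is a small Python sketch. It is not the Mode detector itself: the function name and sample values are made up. At microsecond resolution nearly every sample is unique, so no value repeats often enough to be a useful mode until the samples are rounded.

```python
from collections import Counter


def rounded_mode(samples_usec, rounding=1000):
    """Round each sample to the nearest multiple of `rounding` (1000 usec
    is 1 msec) and return (most common value, how often it occurred)."""
    if not samples_usec:
        raise ValueError("need at least one sample")
    rounded = [int(round(sample / rounding)) * rounding for sample in samples_usec]
    value, count = Counter(rounded).most_common(1)[0]
    return value, count


latencies = [20120, 19980, 20310, 35010, 19870]
print(rounded_mode(latencies))             # rounded to the nearest msec
print(rounded_mode(latencies, rounding=1))  # raw usec: every value is unique
```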
Brendon finally merged his view changes back into the development branches of our software. This caused a number of problems with netevmon, which had been overlooked when the changes were originally tested. Managed to patch up all the problems in a rather hurried session on Tuesday afternoon and got everything back up and running.
Restarted netevmon with the TEntropy detectors running. They seem to be performing very well so far and are a useful addition.
Started working on adding the ability to group streams into a single time series within anomalyfeed. The main reason for this is to be able to cope better with the variety of addresses that AMP ICMP typically tests to. It makes more sense to consider all of these streams as a single aggregated stream rather than trying to run the event detectors against each stream individually, especially considering that many addresses are only tested intermittently. Grouping them should ensure there is a result at every measurement interval. So far I've got this working for AMP ICMP, AMP traceroute and AMP DNS and will need to reimplement the other collections using the new system.
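The grouping idea can be sketched roughly as follows in Python. How anomalyfeed actually combines the streams is not described here, so taking the mean of whatever results exist at each timestamp is purely an assumption for the example, as are the function name and the tuple layout.

```python
from collections import defaultdict
from statistics import mean


def group_streams(measurements):
    """Merge per-stream results into one time series.

    `measurements` is an iterable of (timestamp, stream_id, value), where
    value is None if that stream had no result. Returns a sorted list of
    (timestamp, mean of the available values).
    """
    by_time = defaultdict(list)
    for timestamp, _stream_id, value in measurements:
        if value is not None:
            by_time[timestamp].append(value)
    return [(timestamp, mean(by_time[timestamp])) for timestamp in sorted(by_time)]


# Two addresses for the same target, each tested only some of the time.
results = [
    (0, "addr-a", 10), (0, "addr-b", 20),
    (60, "addr-a", None), (60, "addr-b", 22),
    (120, "addr-a", 12), (120, "addr-b", None),
]
print(group_streams(results))
```

Each stream on its own has gaps, but the grouped series has a value at every interval.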
Spent a fair chunk of time reading up on belief theory and Dempster-Shafer so that I could give Meena some pointers on what she will need to be able to apply them to our event data. Managed to come up with some rough ideas that seem to work, but not sure if the theory is being applied 100% correctly.
Spent some time tweaking the new TEntropy-based detectors in netevmon to reduce the number of false positives and insignificant events that they were reporting. Mostly this involved tuning the various thresholds used by the Plateau detector that is run over the TEntropy values rather than the TEntropy methodology itself.
As I was doing this, I started putting together a gigantic spreadsheet of the events observed, their significance, which detectors were picking them up, and the delay between the event starting and the detector reporting it. This is useful for two main reasons:
* As I adjust and tweak the existing detectors I can easily compare the events I used to detect with what I am detecting now (and what I think I should be getting).
* We will need to calculate the probability that a given detector is right for the next major phase of Meena's project. This spreadsheet will form the basis for estimating these probabilities.
Added support to NNTSC for collecting and storing AMP HTTP test results. Seems to work reasonably well (after fixing a bug or two in the test itself!) but it'll be interesting to see how query performance pans out once the table starts to get large, given our travails with the traceroute data.
Managed to write libprotoident rules for a couple of new applications, WeChat and Funshion. Released a new version of libprotoident (2.0.7).
Added support for the AMP DNS test to NNTSC, netevmon and amp-web. Wrote a new detector that looks for changes in response codes, e.g. the DNS response going from NOERROR to REFUSED or some other error state. This should also be useful for the HTTP test in the future.
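The core of a response code detector can be sketched like this in Python. It is a simplified illustration, not the netevmon detector: the function name and the (timestamp, code) input format are assumptions.

```python
def response_code_changes(observations):
    """Report each point where the response code differs from the
    previous observation.

    `observations` is an iterable of (timestamp, code) in time order.
    Returns a list of (timestamp, old_code, new_code).
    """
    events = []
    previous = None
    for timestamp, code in observations:
        if previous is not None and code != previous:
            events.append((timestamp, previous, code))
        previous = code
    return events


dns_results = [
    (0, "NOERROR"), (60, "NOERROR"), (120, "REFUSED"),
    (180, "REFUSED"), (240, "NOERROR"),
]
print(response_code_changes(dns_results))
```

Because it only compares codes for equality, the same logic would work unchanged with HTTP status codes.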
Fixed a bug in the ChangepointDetector where it wasn't dealing well with streams that featured large values (i.e. >100,000). Also spent a bit more time tweaking the Plateau detector, mainly dealing with problems that show up when either the mean or the standard deviation is very small.
This release adds support for 14 new protocols including League of Legends, WhatsApp, Funshion, Minecraft, Kik and Viber. A new category for Caching has also been added.
A further 13 protocols have had their rules refined and improved including Steam, BitTorrent UDP, RDP, RTMP and Pando.
This release also fixes the bug where flows were erroneously being classified as No Payload, despite payload being present.
The full list of changes can be found in the libprotoident ChangeLog.