Shane Alcock's Blog
Spent most of the week on leave, so not much got done this week.
In the time I was here, I fixed a number of bugs with the auto-scaling summary graph that occurred when there was no data to plot in the detail view.
I implemented yet another new algorithm to determine whether a time series is constant or noisy, as the previous one was pretty awful at recognising when a time series had moved from constant to noisy. The new one handles that transition better, but still appears to have problems for some of our streams -- it now tends to flick between constant and noisy a little too frequently -- so it's back to the drawing board somewhat on that one.
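One way to damp that flicking is to use two thresholds rather than one, so the state only changes when the series moves decisively. This is just a sketch of that idea using the coefficient of variation as the noisiness measure -- the thresholds, the measure, and the function names are all assumptions, not the actual netevmon algorithm:

```python
import statistics

def classify(window, prev_state, enter_noisy=0.25, exit_noisy=0.10):
    """Classify a window of measurements as 'constant' or 'noisy'.

    Uses the coefficient of variation (stddev / mean) with two
    thresholds (hysteresis): a series must exceed the higher threshold
    to become noisy and drop below the lower one to become constant
    again, so it doesn't flick back and forth near a single cutoff.
    """
    mean = statistics.fmean(window)
    if mean == 0:
        return "constant"
    cov = statistics.stdev(window) / mean
    if prev_state == "constant":
        return "noisy" if cov > enter_noisy else "constant"
    return "constant" if cov < exit_noisy else "noisy"
```

With a single threshold, a series hovering around the cutoff would change state on nearly every window; with the gap between the two thresholds, it stays in whatever state it was last in.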
Added config options to amp-web for specifying the location of the netevmon and amp meta-data databases. Previously we had assumed these were on the local machine, which proved troublesome when Brad tried to get Cuz running on warlock.
Capped the maximum range of the summary graph to prevent users from zooming out into empty space.
Fixed some byte-ordering bugs in libpacketdump's RadioTap and 802.11 header parsing on big endian architectures.
Added a smarter method of generating tick labels on the X axis to amp-web. Previously, if you were zoomed in far enough, the labels simply showed a time with no indication as to which day you were looking at. Now, we show the date as well as the time.
Reworked how zoom behaviour works with the summary graph. The zoom-level is now determined dynamically based on the selected range, e.g. selecting more than 75% of the current summary range will cause it to zoom out to the next level. Selecting a small area will cause it to zoom back in.
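The zoom-level decision can be sketched as follows. The 75% zoom-out threshold comes from the description above; the candidate summary ranges and the zoom-in threshold are illustrative assumptions, not the actual amp-web configuration:

```python
# Candidate summary-graph ranges, smallest to largest, in seconds
# (hour, day, week, 30 days). Illustrative values only.
ZOOM_LEVELS = [3600, 86400, 604800, 2592000]

def next_summary_range(selected, current,
                       zoom_out_frac=0.75, zoom_in_frac=0.25):
    """Choose the summary range after the user selects `selected`
    seconds out of the `current` summary range.

    Selecting more than 75% of the current range zooms out to the next
    level up; selecting a small area zooms in to the smallest level
    that still contains the selection; anything in between keeps the
    current level.
    """
    if selected > zoom_out_frac * current:
        bigger = [z for z in ZOOM_LEVELS if z > current]
        return bigger[0] if bigger else current
    if selected < zoom_in_frac * current:
        fits = [z for z in ZOOM_LEVELS if z >= selected]
        return fits[0] if fits else ZOOM_LEVELS[-1]
    return current
```

So selecting 80% of a one-day summary jumps the summary out to a week, while selecting an hour of it zooms the summary in to the hour level.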
To support arbitrary changes to the summary graph range without having to re-fetch and re-draw both graphs, I decided to rewrite our graph management scripts to operate on an instance of a class rather than just being a function that gets called whenever we want to render the graphs. The class has methods that update just the summary graph or just the detail graph, so we only end up changing the graph that we need to. Also, the class can be subclassed to support different graph styles easily, e.g. our Smokeping style. While I was rewriting, I used jQuery.when to make all of the AJAX requests for graph data simultaneously rather than sequentially as we were previously.
Implemented a new data caching scheme within ampy to try and limit the number of queries that are made to the NNTSC database. Previously, data was cached based on the start and end time given in the original query, which meant that we would only get a cache hit if the exact same query was made. Instead, caching is now done based on time "blocks", where each block includes 12 individual datapoints, so we can more easily re-use the results from old queries that overlap with the current one.
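The key to the block scheme is aligning cache keys to fixed block boundaries, so that two overlapping queries map onto the same keys. A minimal sketch of the idea, using a plain dict in place of the real cache and a callable stand-in for the NNTSC query (all names here are hypothetical):

```python
POINTS_PER_BLOCK = 12

def block_keys(start, end, binsize):
    """Map a query range onto aligned cache blocks.

    Each block covers POINTS_PER_BLOCK datapoints at the given binsize,
    and block start times are aligned to multiples of the block length,
    so overlapping queries always produce the same keys.
    """
    blocklen = binsize * POINTS_PER_BLOCK
    first = (start // blocklen) * blocklen
    return list(range(first, end + 1, blocklen))

def fetch(stream, start, end, binsize, cache, query_db):
    """Serve a query from per-block cache entries, asking the database
    only for the blocks that are missing from the cache."""
    result = []
    for b in block_keys(start, end, binsize):
        key = (stream, binsize, b)
        if key not in cache:
            cache[key] = query_db(stream, b,
                                  b + binsize * POINTS_PER_BLOCK)
        result.extend(cache[key])
    return result
```

A second query that shifts the time range slightly will then reuse every block it shares with the first query and only fetch the blocks at the new end of the range.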
Re-worked the JitterVariance detector in netevmon, as it had been producing some unimpressive results of late. Instead of looking at the standard deviation of the individual measurements, I now look at the standard deviation as a percentage of the mean latency. Also started running a Plateau detector against these values, which has been surprisingly effective at picking up on increases in "smoke" quickly.
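The new measure is essentially the coefficient of variation expressed as a percentage. A minimal sketch, assuming the detector is fed a window of latency measurements (the windowing and function name are illustrative):

```python
import statistics

def jitter_metric(latencies):
    """Standard deviation of a window of latency measurements,
    expressed as a percentage of the mean latency.

    Normalising by the mean means a 2ms spread counts for much more
    on a 10ms path than on a 200ms one; the resulting series is what
    a Plateau detector would then be run against.
    """
    mean = statistics.fmean(latencies)
    if mean == 0:
        return 0.0
    return 100.0 * statistics.stdev(latencies) / mean
```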
Fixed the issue in amp-web where the y-axis on the detail graph was autoscaling to the largest value in the summary graph. Also tweaked some of the behaviour of the selection area in the summary graph: single-clicking is now a null operation (i.e. it won't reset the detail graph to show the full summary graph) and you can now click and drag on the shaded area to move the selection (previously, you could only use the tiny handle for this).
Tidied up the _get_data function in the core of ampy, as it was getting messy and disorganised. ampy parsers must now implement a request_data function that forms and makes the request to NNTSC for data; in exchange, the clunky get_aggregate_columns, get_group_columns and get_aggregate_functions functions have all gone away.
Spent another couple of days moving code around in amp-web to make it tidier and easier to work with. Hopefully, Brendon will still be able to find things inside the codebase...
Added support for the amp-traceroute collection to amp-web. The graph is just a placeholder at the moment (a line graph of hop counts) until we get around to implementing the more useful stacked hop count graph using envision.
Re-enabled the tabs on the right-hand side of the graphs that allowed switching between related graphs, albeit without the preview graphs that used to be on them. The original tabs were very AMP-specific and hard-coded to appear on every graph. Now, the tabs are generated dynamically by an AJAX request that asks ampy for a list of "related" streams to the one currently being displayed. For example, an LPI byte count stream would have tabs showing flow, packet and user counts for the same source and application protocol whereas AMP streams will have tabs showing latency and traceroute for the same source-destination pair.
To avoid page reloads when using the tabs to switch between collections, I changed the dropdowns to be generated dynamically via an AJAX request rather than being placed and populated by the Python code that runs when the page is loaded.
Added support for the AMP ICMP collection to ampy and amp-web, so we are now able to plot graphs of the test data Brendon has been collecting.
Spent a decent chunk of an afternoon working through the DPDK build system with Richard S., trying to make the DPDK libraries build as position-independent code so that we can link libtrace against them nicely.
Reworked a large amount of code in amp-web to move the collection-specific code out of the core source files and into separate little modules for each collection. This means that the core code should be much easier to follow and work on. Adding support for new collections should also be simpler and require less inside knowledge of how the whole system works.
Table partitioning is now up and running inside of NNTSC. Migrated all the existing data over to partitioned tables.
Enabled per-user tracking in the LPI collector and updated Cuz to deal with multiple users sensibly. Changed the LPI collector to not export counters that have a value of zero -- the client now detects which protocols were missing counters and inserts zeroes accordingly. Also changed NNTSC to only create LPI streams when the time series has a non-zero value occur, which avoids the problem of creating hundreds of streams per user which are entirely zero because the user never uses that protocol.
Added the ability to query NNTSC for a list of streams that have been added since a given stream was created. This is needed to allow ampy to keep up to date with streams that have been added since the connection to NNTSC was first made. This is not an ideal solution, as it adds an extra database query to many ampy operations, but I'm hoping to come up with something better soon.
Revisited and thoroughly documented the ShewhartS-based event detection code in netevmon. In the process, I made a couple of tweaks that should reduce the number of 'unimportant' events that we have been getting.
Somewhat disrupted week this week, due to illness.
Replaced the template-per-collection for the graph pages with a single template that uses TAL to automatically add the right dropdowns to the page for the collection being shown on that page. Added callback code to allow proper switching between LPI metrics when browsing the graphs -- it isn't perfect but it wasn't worth putting too much effort into it when we're probably going to completely change the graph selection method at some point.
Added code to ampy to query data from the AMP ICMP test. Also added an API function that returns details about all of the streams associated with a collection -- this will be used to populate the matrix with just one request rather than having to make a request for every stream.
Worked on getting NNTSC to use table partitioning so that we can avoid having to select from massive unwieldy data tables. It seems to be working well with my test database, but the big challenge is to migrate the existing 'production' database over to a partitioned setup.
Made a number of minor changes to my paper on open-source traffic classifiers in response to reviewer comments.
Modified the NNTSC exporter to inform clients of the frequency of the datapoints it returns in response to a historical data request. This allows ampy to detect missing data and insert None values appropriately, which creates a break in the time series graphs rather than a straight line drawn between the points on either side of the gap. Calculating the frequency was a little harder than anticipated: not every stream records a measurement frequency, the frequency may change (e.g. if someone modifies the AMP test schedule), and the returned values may be binned anyway, at which point the original frequency is no longer suitable for deciding whether a measurement is missing.
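Once the frequency is known, the gap-filling step itself is simple: compare consecutive timestamps and insert a None wherever the gap is clearly larger than one measurement interval. A sketch of that idea -- the slack factor and function name are assumptions, not what ampy actually uses:

```python
def fill_gaps(points, frequency, slack=1.5):
    """Insert a None between consecutive datapoints whose timestamps
    are further apart than the expected frequency (with some slack for
    measurement jitter), so the plotting code draws a break in the
    line instead of joining the points across the gap.

    `points` is a list of (timestamp, value) tuples sorted by time.
    """
    filled = []
    prev_ts = None
    for ts, val in points:
        if prev_ts is not None and ts - prev_ts > slack * frequency:
            filled.append((prev_ts + frequency, None))
        filled.append((ts, val))
        prev_ts = ts
    return filled
```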
Added support for the remaining LPI metrics to NNTSC, ampy and amp-web. We are now drawing graphs for packet counts, flow counts (both new and peak concurrent) and users (both active and observed), in addition to the original byte counts. Not detecting any events on these yet, as these metrics are very different to what we have at the moment so a bit of thought will have to go into which detectors we should use for each metric.
Added support for the Libprotoident byte counters that we have been collecting from the red cable network to netevmon, ampy and amp-web. Now we can visualise the different protocols being used on the network and receive event alerts whenever someone does something out of the ordinary.
Replaced the dropdown list code in amp-web with a much nicer object-oriented approach. This should make it a lot easier to add dropdown lists for future NNTSC collections.
Managed to get our Munin graphs showing data in Mbps. This was trickier than anticipated, as Munin sneakily divides the byte counts it gets from SNMP by its polling interval, but this isn't very prominently documented. It took a little while for myself, Cathy and Brad to figure out why our numbers didn't match those being reported by the original Munin graphs.
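In other words, because of that hidden division the stored values are already bytes per second, so the only conversion left is bytes to megabits. A small worked example, assuming the stored value is a rate as described above:

```python
def munin_to_mbps(stored_value):
    """Convert a value stored by Munin to Mbps.

    Munin divides the raw SNMP byte counters by its polling interval,
    so the stored value is already bytes per second; only the
    bytes -> megabits conversion remains (x8 bits, /1,000,000).
    """
    return stored_value * 8 / 1_000_000

# e.g. a stored rate of 1,250,000 bytes/sec is 10 Mbps
```

Dividing by the polling interval a second time (the natural mistake if you assume the stored values are raw byte counts) makes the graphs wrong by a factor of the interval, which is why our numbers initially disagreed with Munin's own graphs.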
Chased down and fixed a libtrace bug where converting a trace from any ERF format (including legacy) to PCAP would result in horrendously broken timestamps on Mac OS X. It turned out that the __BYTE_ORDER macro doesn't exist on BSD systems and so we were erroneously treating the timestamps as big endian regardless of what byte order the machine actually had.
Migrated wdcap and the LPI collector to use the new libwandevent3.
Changed the NNTSC exporter to create a separate thread for each client rather than trying to deal with them all asynchronously. This alleviates the problem where a single client could request a large amount of history and prevent anyone else from connecting to the exporter until that request was served. Also made NNTSC and netevmon behave more robustly when a data source disappears -- rather than halting, they will now periodically try to reconnect so I don't have to restart everything from scratch when I want to apply changes to one component.
Finally, my paper comparing the accuracy of various open-source traffic classifiers was accepted for WNM 2013. There are a few minor nits to tidy up, but it shouldn't require too much work to get it camera-ready.