Shane Alcock's Blog
Table partitioning is now up and running inside of NNTSC. Migrated all the existing data over to partitioned tables.
Enabled per-user tracking in the LPI collector and updated Cuz to deal with multiple users sensibly. Changed the LPI collector to not export counters that have a value of zero -- the client now detects which protocols were missing counters and inserts zeroes accordingly. Also changed NNTSC to only create an LPI stream once a non-zero value occurs in its time series, which avoids the problem of creating hundreds of streams per user that are entirely zero because the user never uses that protocol.
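As a rough illustration of the client-side zero filling described above, here is a minimal sketch. The function name, the protocol list and the dictionary layout are all hypothetical and are not taken from the LPI collector or its client code.

```python
def fill_missing_counters(known_protocols, reported):
    """Return a counter for every known protocol.

    Protocols that the collector left out of its export (because the
    counter was zero) are given an explicit value of 0.
    """
    return {proto: reported.get(proto, 0) for proto in known_protocols}


# Hypothetical example: the collector only exported the non-zero counters.
protocols = ["HTTP", "DNS", "BitTorrent"]
received = {"HTTP": 1500, "DNS": 64}
complete = fill_missing_counters(protocols, received)
```

Here `complete` contains an entry for BitTorrent with a value of zero even though the collector never sent one.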
Added ability to query NNTSC for a list of streams that had been added since a given stream was created. This is needed to allow ampy to keep up to date with streams that have been added since the connection to NNTSC was first made. This is not an ideal solution as it adds an extra database query to many ampy operations, but I'm hoping to come up with something better soon.
Revisited and thoroughly documented the ShewhartS-based event detection code in netevmon. In the process, I made a couple of tweaks that should reduce the number of 'unimportant' events that we have been getting.
Somewhat disrupted week this week, due to illness.
Replaced the template-per-collection for the graph pages with a single template that uses TAL to automatically add the right dropdowns to the page for the collection being shown on that page. Added callback code to allow proper switching between LPI metrics when browsing the graphs -- it isn't perfect but it wasn't worth putting too much effort into it when we're probably going to completely change the graph selection method at some point.
Added code to ampy to query data from the AMP ICMP test. Also added an API function that returns details about all of the streams associated with a collection -- this will be used to populate the matrix with just one request rather than having to make a request for every stream.
Worked on getting NNTSC to use table partitioning so that we can avoid having to select from massive unwieldy data tables. Seems to be working well with my test database but the big challenge is to migrate the existing 'production' database over to a partitioned setup.
Made a number of minor changes to my paper on open-source traffic classifiers in response to reviewer comments.
Modified the NNTSC exporter to inform clients of the frequency of the datapoints it was returning in response to a historical data request. This allows ampy to detect missing data and insert None values appropriately, which will create a break in the time series graphs rather than drawing a straight line between the points either side of the area covered by the missing data. Calculating the frequency was a little harder than anticipated, as not every stream records a measurement frequency (and that frequency may change, e.g. if someone modifies the amp test schedule) and the returned values may be binned anyway, at which point the original frequency is not suitable for determining whether a measurement is missing.
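The gap handling can be sketched roughly as follows, assuming the data arrives as a sorted list of (timestamp, value) pairs and that the frequency reported by the exporter is in seconds. The function name and data layout are assumptions for illustration, not the actual ampy code.

```python
def insert_gaps(points, frequency):
    """Insert a (timestamp, None) marker wherever measurements are missing.

    points    -- list of (timestamp, value) tuples sorted by timestamp
    frequency -- expected number of seconds between consecutive datapoints
    """
    result = []
    prev_ts = None
    for ts, value in points:
        # A gap larger than the expected frequency means at least one
        # measurement is missing, so add a None to break the plotted line.
        if prev_ts is not None and ts - prev_ts > frequency:
            result.append((prev_ts + frequency, None))
        result.append((ts, value))
        prev_ts = ts
    return result


series = insert_gaps([(0, 1.2), (60, 1.3), (240, 1.1)], 60)
```

A single None per gap is enough for a plotting library to break the line, so the sketch does not add one marker per missing measurement.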
Added support for the remaining LPI metrics to NNTSC, ampy and amp-web. We are now drawing graphs for packet counts, flow counts (both new and peak concurrent) and users (both active and observed), in addition to the original byte counts. Not detecting any events on these yet, as these metrics are very different to what we have at the moment so a bit of thought will have to go into which detectors we should use for each metric.
Added support for the Libprotoident byte counters that we have been collecting from the red cable network to netevmon, ampy and amp-web. Now we can visualise the different protocols being used on the network and receive event alerts whenever someone does something out of the ordinary.
Replaced the dropdown list code in amp-web with a much nicer object-oriented approach. This should make it a lot easier to add dropdown lists for future NNTSC collections.
Managed to get our Munin graphs showing data in Mbps. This was trickier than anticipated, as Munin sneakily divides the byte counts it gets from SNMP by its polling interval but this isn't very prominently documented. It took a little while for Cathy, Brad and me to figure out why our numbers didn't match those being reported by the original Munin graphs.
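The mismatch is easy to reproduce with a little arithmetic. The sketch below assumes a 300 second polling interval and uses made-up numbers; the function names are hypothetical.

```python
def counter_delta_to_mbps(byte_delta, interval_seconds):
    """Convert a raw byte count over one polling interval to megabits per second."""
    return byte_delta * 8 / interval_seconds / 1_000_000


def per_second_rate_to_mbps(bytes_per_second):
    """Convert a value that has already been divided by the polling interval."""
    return bytes_per_second * 8 / 1_000_000


# 37,500,000 bytes transferred during an assumed 300 second interval.
from_raw_delta = counter_delta_to_mbps(37_500_000, 300)
# The same traffic once it has been turned into a per-second rate.
from_rate = per_second_rate_to_mbps(37_500_000 / 300)
```

Both calls give the same answer. Applying the first conversion to a value that has already been divided by the interval would understate the rate by a factor of 300.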
Chased down and fixed a libtrace bug where converting a trace from any ERF format (including legacy) to PCAP would result in horrendously broken timestamps on Mac OS X. It turned out that the __BYTE_ORDER macro doesn't exist on BSD systems and so we were erroneously treating the timestamps as big endian regardless of what byte order the machine actually had.
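To show why the wrong byte order is so destructive, here is a small Python sketch (the real fix lives in libtrace's C code and is not reproduced here). ERF timestamps are 64-bit little-endian fixed-point values with whole seconds in the upper 32 bits and fractional seconds in the lower 32 bits. The helper name and the example timestamp are made up.

```python
import struct


def erf_timestamp_to_seconds(raw8):
    """Decode an 8-byte ERF timestamp into seconds since the epoch."""
    (ts,) = struct.unpack("<Q", raw8)        # ERF timestamps are little endian
    seconds = ts >> 32                       # upper 32 bits: whole seconds
    fraction = (ts & 0xFFFFFFFF) / 2 ** 32   # lower 32 bits: fractional seconds
    return seconds + fraction


# Build a timestamp of 1370000000.5 seconds as it would appear in the trace.
raw = struct.pack("<Q", (1370000000 << 32) | (1 << 31))
correct = erf_timestamp_to_seconds(raw)

# Decoding the same bytes as big endian scrambles the seconds field.
(wrong_ts,) = struct.unpack(">Q", raw)
wrong_seconds = wrong_ts >> 32
```

The big-endian interpretation produces a seconds value that bears no relation to the original.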
Migrated wdcap and the LPI collector to use the new libwandevent3.
Changed the NNTSC exporter to create a separate thread for each client rather than trying to deal with them all asynchronously. This alleviates the problem where a single client could request a large amount of history and prevent anyone else from connecting to the exporter until that request was served. Also made NNTSC and netevmon behave more robustly when a data source disappears -- rather than halting, they will now periodically try to reconnect so I don't have to restart everything from scratch when I want to apply changes to one component.
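The thread-per-client idea looks roughly like this. It is only a sketch: the handler, the message format and the use of a socket pair in place of a real listening socket are all assumptions made to keep the example self-contained, and none of it is the actual NNTSC exporter code.

```python
import socket
import threading


def handle_client(conn):
    """Serve one client; a slow request here only blocks this thread."""
    with conn:
        request = conn.recv(1024)
        conn.sendall(b"history for " + request)


def start_client_thread(conn):
    """Hand a newly connected client off to its own thread."""
    worker = threading.Thread(target=handle_client, args=(conn,), daemon=True)
    worker.start()
    return worker


# In a real server, conn would come from accept() inside a loop, which can
# return to accepting new clients immediately. A socket pair stands in here.
server_side, client_side = socket.socketpair()
worker = start_client_thread(server_side)
client_side.sendall(b"stream 1")
reply = client_side.recv(1024)
worker.join()
client_side.close()
```

Because each client has its own thread, a large history request delays only the client that made it.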
Finally, my paper on comparing the accuracy of various open-source traffic classifiers was accepted for WNM 2013. There are a few minor nits to possibly tidy up but it shouldn't require too much work to get camera-ready.
Had a week of catching up on a few jobs I had put off in favour of getting NNTSC, netevmon and amp2 ready for the Lightwire release.
Re-worked BSOD server to use a separate thread for communicating with clients, so that the packets can be sent to clients immediately rather than waiting for a break in the input stream. Unfortunately, this hasn't stopped the bursty appearance of packets on the client like I had hoped, so this requires further investigation. I suspect the flow management inside BSOD server isn't as optimal as it could be and I may end up replacing it with libflowmanager.
With that in mind, I've modified libflowmanager to support multiple flow expiry 'plugins', as opposed to having a single defined expiry policy that all libflowmanager programs had to use. This will allow us to replicate BSOD's old expiry policy (flows expire after 20 seconds of inactivity) if we want to, although I would probably see how it goes with the classic libflowmanager policy first.
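The plugin idea amounts to letting the flow table delegate the expiry decision to an interchangeable policy object. libflowmanager itself is a C++ library, so the Python sketch below only illustrates the shape of the design; every class and method name is hypothetical.

```python
class FixedTimeoutExpiry:
    """Expire a flow after a fixed period of inactivity (BSOD-style policy)."""

    def __init__(self, timeout):
        self.timeout = timeout

    def expiry_time(self, now):
        return now + self.timeout


class FlowTable:
    """Tracks flows and asks the configured policy when each should expire."""

    def __init__(self, policy):
        self.policy = policy
        self.expiry = {}

    def update(self, flow_id, now):
        # Every packet seen for a flow pushes its expiry time back.
        self.expiry[flow_id] = self.policy.expiry_time(now)

    def expire(self, now):
        # Remove and return all flows whose expiry time has passed.
        expired = sorted(f for f, t in self.expiry.items() if t <= now)
        for flow_id in expired:
            del self.expiry[flow_id]
        return expired


table = FlowTable(FixedTimeoutExpiry(20))
table.update("flow-a", now=0)
table.update("flow-b", now=10)
gone = table.expire(now=25)
```

Swapping in a different policy only requires another class with an `expiry_time` method, so programs are no longer tied to one expiry policy.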
Received some bug reports for libtrace from Matt Brown as a result of Mayhem being run against the entirety of Debian. Perry had more or less patched them right away so I worked on releasing a new version of libtrace incorporating those fixes. The new release went out on Friday and also includes the rawerf fix from several weeks back. Had a few issues with both Fedora and FreeBSD that slowed down the testing process, so the release process took a bit longer than anticipated.
Libtrace 3.0.18 has been released.
This release fixes several bugs that have been reported in 3.0.17. In particular, this release fixes several crash bugs in the libtrace tools that were reported by the Mayhem team at Carnegie Mellon University. It also addresses a rare bug where the compression auto-detection could trigger a false positive on uncompressed ERF traces by including a new format URI (rawerf:) that can be used to force libtrace to treat the traces as uncompressed. We have also tightened up the compression auto-detection somewhat to reduce the likelihood of the bug occurring.
It is highly recommended that you explicitly use the rawerf: format if you are working with large numbers of uncompressed ERF traces.
The full list of changes in this release can be found in the libtrace ChangeLog.
You can download the new version of libtrace from the libtrace website.
Added manpages to netevmon to get it ready for Debian packaging. During this process, fixed a few little oversights in the netevmon script and the existing documentation.
Re-wrote much of the NNTSC API in ampy. The main goal was to reduce the amount of duplicated code in modules for individual NNTSC collections that was better suited to a more general NNTSC API. In the process I also changed the API to only use a single "NNTSC Connection" instance rather than creating and destroying one for every AJAX request. The main benefit of this is that we don't have to ask the database about collections and streams every time we make a request now -- instead we get them once and store that info for subsequent use. This will hopefully make the graph interface feel a bit more responsive.
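The caching behaviour can be sketched as below. The class name and the fetch callback are hypothetical stand-ins for the real ampy connection object and its database query.

```python
class CachedCatalogue:
    """Fetch the collection list once and reuse it for later requests."""

    def __init__(self, fetch_collections):
        self._fetch = fetch_collections
        self._collections = None

    def collections(self):
        # Only ask the database the first time; later calls use the stored copy.
        if self._collections is None:
            self._collections = self._fetch()
        return self._collections


queries = []


def fake_fetch():
    """Stand-in for a database query; records each time it is called."""
    queries.append(1)
    return ["amp-icmp", "lpi-bytes"]


catalogue = CachedCatalogue(fake_fetch)
first = catalogue.collections()
second = catalogue.collections()
```

After both calls the fake query has run only once, which is the saving a long-lived connection instance is meant to provide.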
Updated amp-web to use the new NNTSC API in ampy. I also spent a bit of time on Friday testing the web graphs on various browsers and fixing a few of the more obvious problems. Unsurprisingly, IE 10 was the biggest source of grief.
Added a new time series type to anomaly_ts -- JitterVariance. This time series tracks the standard deviation of the latencies reported by the individual smokeping pings. Using this, I've added a new event type designed to detect when the standard deviation has moved away from being near zero, e.g. the pings have started reporting variable latency. This helps us pick up on situations where the median stays roughly the same but the variance clearly indicates some issues. It also serves as a good early indicator of upcoming Plateau or Mode events on the median latency.
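A simplified version of the idea is sketched below. The threshold, the function names and the sample latencies are all assumptions; the actual detector in netevmon is more involved than a single comparison.

```python
import statistics


def jitter_stddev(pings):
    """Standard deviation of the latencies from one set of pings."""
    return statistics.pstdev(pings)


def jitter_event(history, current, near_zero=1.0):
    """Report an event when a previously quiet series becomes variable.

    history   -- recent standard deviations for the stream
    current   -- the newest standard deviation
    near_zero -- hypothetical cut-off for what counts as 'near zero'
    """
    was_quiet = all(s <= near_zero for s in history)
    return was_quiet and current > near_zero


quiet = [jitter_stddev([20.0, 20.1, 19.9, 20.0]) for _ in range(3)]
noisy = jitter_stddev([20.0, 35.0, 18.0, 50.0])
triggered = jitter_event(quiet, noisy)
```

The median latency of the noisy sample is not dramatically different from the quiet one, yet the standard deviation jumps, which is the kind of situation this time series is meant to catch.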
Finished preparing NNTSC for packaging. Wrote an init script for the NNTSC collector and ensured that all of the subprocesses are cleaned up when the main collector process is killed. Wrote some manpages, updated the other documentation and added some licensing to NNTSC before handing it off to Brendon for packaging.
Also moved towards packaging netevmon. Again, lots of messing around with daemonisation and ensuring that the monitor can be started and stopped nicely without anyone having to manually hunt down processes.
Spent the rest of my time working on the interaction between amp-web and History.js. Only one entry is placed in the history for each visited graph now and selecting a graph from the history will actually show you the right graph. Navigating to a graph via the history will also now update the dropdown lists to match the currently viewed graph. When using click and drag to explore a graph, clicking once on the graph will return to the previous zoom level (this was already present, but only worked for exploring the detailed graph, not the summary one).
Spent most of my week working on making the various components of NNTSC and netevmon backgroundable so that they are a lot easier to run long-term. This was pretty straightforward for the C++ programs but the python scripts have been a bit trickier, especially in terms of getting the logging going to the right place.
Also fixed a few of the outstanding issues with amp-web. In particular, I fixed the problems we were having with the X-axis of the summary graph being garbled and ensured that the summary graph will always show a sensible time period based on the region shown in the detailed view. These changes also meant I could remove the summary timestamps from the page URL, which cleans that up quite a bit.
Finished fixing the URLs in amp-web so that they are ordered sensibly and can support NNTSC streams that are defined using more than just "source" and "target". I also changed the ordering of the timestamps in the URL so that we can specify a start and end time for the detailed graph only (sensible defaults for the summary graph are meant to be chosen in this case). This is really handy when creating URLs that link to graphs showing events.
Started looking into what needed to be done to prepare NNTSC and netevmon for packaging and a possible distribution for our friends at Lightwire. Spent a decent chunk of time writing a README that should describe exactly how to get a NNTSC instance up and running.
NNTSC and netevmon both have tracs now and I've added a series of tickets to each with the aim of getting a release ready for Lightwire by the end of the month.