Shane Alcock's Blog

12 Aug 2013

Spent another couple of days moving code around in amp-web to make it tidier and easier to work with. Hopefully, Brendon will still be able to find things inside the codebase...

Added support for the amp-traceroute collection to amp-web. The graph is just a placeholder at the moment (a line graph of hop counts) until we get around to implementing the more useful stacked hop count graph using envision.

Re-enabled the tabs on the right-hand side of the graphs that allow switching between related graphs, albeit without the preview graphs that used to be on them. The original tabs were very AMP-specific and hard-coded to appear on every graph. Now, the tabs are generated dynamically by an AJAX request that asks ampy for a list of "related" streams to the one currently being displayed. For example, an LPI byte count stream would have tabs showing flow, packet and user counts for the same source and application protocol, whereas AMP streams would have tabs showing latency and traceroute for the same source-destination pair.
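
As a rough illustration of the idea, the server side of this might look something like the sketch below (the names here, including get_related_streams(), are placeholders rather than the real ampy API):

    import json

    def related_tabs(ampy_conn, collection, stream_id):
        """Build the tab descriptions for a graph page.

        ampy_conn.get_related_streams() stands in for whatever ampy call
        returns the streams that share a source/target (or source and
        protocol) with the stream currently being displayed.
        """
        related = ampy_conn.get_related_streams(collection, stream_id)

        tabs = []
        for stream in related:
            tabs.append({
                "collection": stream["collection"],  # e.g. "amp-traceroute"
                "stream_id": stream["stream_id"],
                "title": stream["title"],            # label shown on the tab
            })
        # The browser requests this via AJAX and renders the tabs from the
        # returned JSON.
        return json.dumps(tabs)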

To avoid page reloads when using the tabs to switch between collections, I changed the dropdowns to be generated dynamically via an AJAX request rather than being placed and populated by the Python code that runs when the page is loaded.

05 Aug 2013

Added support for the AMP ICMP collection to ampy and amp-web, so we are now able to plot graphs of the test data Brendon has been collecting.

Spent a decent chunk of an afternoon working through the DPDK build system with Richard S., trying to make the DPDK libraries build as position-independent code so that we can link libtrace against them nicely.

Reworked a large amount of code in amp-web to move the collection-specific code out of the core source files and into separate little modules for each collection. This means that the core code should be much easier to follow and work on. Adding support for new collections should also be simpler and require less inside knowledge of how the whole system works.
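
The rough shape of the split is a registry that maps each collection to its own module, so the core code can stay generic. A minimal sketch, with made-up module names that are not amp-web's real layout:

    import importlib

    # Hypothetical registry mapping a collection name to the module that
    # knows how to build its dropdowns, graph configuration and so on.
    COLLECTION_MODULES = {
        "amp-icmp": "ampweb.collections.ampicmp",
        "amp-traceroute": "ampweb.collections.amptraceroute",
        "lpi-bytes": "ampweb.collections.lpibytes",
    }

    def get_collection_module(collection):
        """Load the module for a collection, so the core code never needs
        any collection-specific knowledge itself."""
        return importlib.import_module(COLLECTION_MODULES[collection])

    # e.g. get_collection_module("amp-icmp").get_dropdowns(...)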

29 Jul 2013

Table partitioning is now up and running inside of NNTSC. Migrated all the existing data over to partitioned tables.
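
At the time, PostgreSQL partitioning meant inheritance plus CHECK constraints, with new child tables created as data arrives. A minimal sketch of that approach, using psycopg2 and an illustrative schema rather than NNTSC's actual tables:

    import psycopg2

    def create_partition(conn, parent, start_ts, end_ts):
        """Create a child table covering [start_ts, end_ts) that inherits
        from the parent data table. The CHECK constraint lets the query
        planner skip partitions that cannot contain the requested range.
        Identifiers are formatted directly only because they are generated
        internally, never from user input."""
        name = "%s_%d" % (parent, start_ts)
        cur = conn.cursor()
        cur.execute("""
            CREATE TABLE IF NOT EXISTS %s (
                CHECK (timestamp >= %d AND timestamp < %d)
            ) INHERITS (%s)
        """ % (name, start_ts, end_ts, parent))
        cur.close()
        conn.commit()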

Enabled per-user tracking in the LPI collector and updated Cuz to deal with multiple users sensibly. Changed the LPI collector to not export counters that have a value of zero -- the client now detects which protocols were missing counters and inserts zeroes accordingly. Also changed NNTSC to only create LPI streams when the time series has a non-zero value occur, which avoids the problem of creating hundreds of streams per user which are entirely zero because the user never uses that protocol.
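
The zero-filling on the client side is conceptually very simple -- any protocol we know about but did not receive a counter for in an interval must have been zero. Something along these lines (illustrative names, not the real client code):

    def fill_missing_counters(exported, known_protocols):
        """exported is a dict of protocol name -> count for one measurement
        interval; known_protocols is the full set of protocols the collector
        can report. Anything missing was suppressed because it was zero."""
        counters = dict(exported)
        for proto in known_protocols:
            if proto not in counters:
                counters[proto] = 0
        return counters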

Added the ability to query NNTSC for a list of streams that have been added since a given stream was created. This is needed to allow ampy to keep up to date with streams that have been added since the connection to NNTSC was first made. This is not an ideal solution, as it adds an extra database query to many ampy operations, but I'm hoping to come up with something better soon.

Revisited and thoroughly documented the Shewhart S-based event detection code in netevmon. In the process, I made a couple of tweaks that should reduce the number of 'unimportant' events that we have been getting.
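
For readers unfamiliar with the technique, the general flavour of a Shewhart-style control chart is sketched below -- flag any measurement that lands too many standard deviations from the running mean. This is only a generic illustration, not netevmon's actual detector:

    import math

    class ShewhartDetector(object):
        def __init__(self, threshold=3.0):
            self.threshold = threshold
            self.n = 0
            self.mean = 0.0
            self.m2 = 0.0   # running sum of squared differences (Welford)

        def update(self, value):
            """Return True if this measurement looks like an event."""
            event = False
            if self.n > 1:
                stddev = math.sqrt(self.m2 / (self.n - 1))
                if stddev > 0 and abs(value - self.mean) > self.threshold * stddev:
                    event = True

            # Fold the new measurement into the running statistics.
            self.n += 1
            delta = value - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (value - self.mean)
            return event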

22 Jul 2013

Somewhat disrupted week this week, due to illness.

Replaced the template-per-collection for the graph pages with a single template that uses TAL to automatically add the right dropdowns to the page for the collection being shown on that page. Added callback code to allow proper switching between LPI metrics when browsing the graphs -- it isn't perfect but it wasn't worth putting too much effort into it when we're probably going to completely change the graph selection method at some point.

Added code to ampy to query data from the AMP ICMP test. Also added an API function that returns details about all of the streams associated with a collection -- this will be used to populate the matrix with just one request rather than having to make a request for every stream.
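
The single-request idea is just to pull every stream's details for a collection in one call and let the caller group them for the matrix cells. Roughly (get_streams() and the key names are placeholders, not the real ampy API):

    def build_matrix(ampy_conn, collection):
        """Fetch details for every stream in the collection at once and
        group them by source/destination pair for the matrix cells."""
        matrix = {}
        for s in ampy_conn.get_streams(collection):
            matrix.setdefault((s["source"], s["destination"]), []).append(s)
        return matrix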

Worked on getting NNTSC to use table partitioning so that we can avoid having to select from massive unwieldy data tables. It seems to be working well with my test database, but the big challenge is to migrate the existing 'production' database over to a partitioned setup.

15 Jul 2013

Made a number of minor changes to my paper on open-source traffic classifiers in response to reviewer comments.

Modified the NNTSC exporter to inform clients of the frequency of the datapoints it was returning in response to a historical data request. This allows ampy to detect missing data and insert None values appropriately, which will create a break in the time series graphs rather than drawing a straight line between the points either side of the area covered by the missing data. Calculating the frequency was a little harder than anticipated, as not every stream records a measurement frequency (and that frequency may change, e.g. if someone modifies the amp test schedule) and the returned values may be binned anyway, at which point the original frequency is not suitable for determining whether a measurement is missing.
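
The gap detection itself is straightforward once the frequency is known: if the time between consecutive datapoints is well over the expected frequency, break the line. A simplified illustration of the idea, not ampy's actual code:

    def insert_gaps(datapoints, frequency):
        """datapoints are sorted by timestamp; wherever a measurement looks
        to be missing, insert a None so the graph shows a break rather than
        a straight line across the gap."""
        result = []
        previous = None
        for point in datapoints:
            if previous is not None and point["timestamp"] - previous > frequency * 1.5:
                result.append(None)   # break the line here
            result.append(point)
            previous = point["timestamp"]
        return result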

Added support for the remaining LPI metrics to NNTSC, ampy and amp-web. We are now drawing graphs for packet counts, flow counts (both new and peak concurrent) and users (both active and observed), in addition to the original byte counts. Not detecting any events on these yet, as these metrics are very different to what we have at the moment so a bit of thought will have to go into which detectors we should use for each metric.

05 Jul 2013

Added support for the Libprotoident byte counters that we have been collecting from the red cable network to netevmon, ampy and amp-web. Now we can visualise the different protocols being used on the network and receive event alerts whenever someone does something out of the ordinary.

Replaced the dropdown list code in amp-web with a much nicer object-oriented approach. This should make it a lot easier to add dropdown lists for future NNTSC collections.

Managed to get our Munin graphs showing data in Mbps. This was trickier than anticipated, as Munin sneakily divides the byte counts it gets from SNMP by its polling interval, but this isn't very prominently documented. It took a little while for myself, Cathy and Brad to figure out why our numbers didn't match those being reported by the original Munin graphs.
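
Once you realise the stored values are already bytes per second, the conversion itself is just arithmetic:

    # Munin's stored values are already bytes per second (the raw SNMP
    # counters get divided by the polling interval), so converting to Mbps
    # is just a matter of bytes/sec -> megabits/sec:
    def bytes_per_sec_to_mbps(byte_rate):
        return byte_rate * 8 / 1000000.0

    # e.g. an interface averaging 2,500,000 bytes/sec should be drawn as 20 Mbps
    print(bytes_per_sec_to_mbps(2500000))   # -> 20.0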

Chased down and fixed a libtrace bug where converting a trace from any ERF format (including legacy) to PCAP would result in horrendously broken timestamps on Mac OS X. It turned out that the __BYTE_ORDER macro doesn't exist on BSD systems and so we were erroneously treating the timestamps as big endian regardless of what byte order the machine actually had.
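
To show why the fix matters (in Python rather than libtrace's C, purely for illustration): ERF timestamps are 64-bit little-endian fixed-point values, and reading one with the wrong byte order produces garbage.

    import struct

    # Upper 32 bits are seconds, lower 32 bits are a binary fraction of a second.
    def erf_timestamp(raw8, big_endian=False):
        fmt = ">Q" if big_endian else "<Q"
        (ts,) = struct.unpack(fmt, raw8)
        return (ts >> 32) + float(ts & 0xFFFFFFFF) / 2 ** 32

    raw = struct.pack("<Q", (1374105600 << 32) | (1 << 31))  # 1374105600.5
    print(erf_timestamp(raw))         # correct: 1374105600.5
    print(erf_timestamp(raw, True))   # nonsense, as on Mac OS X before the fix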

Migrated wdcap and the LPI collector to use the new libwandevent3.

Changed the NNTSC exporter to create a separate thread for each client rather than trying to deal with them all asynchronously. This alleviates the problem where a single client could request a large amount of history and prevent anyone else from connecting to the exporter until that request was served. Also made NNTSC and netevmon behave more robustly when a data source disappears -- rather than halting, they will now periodically try to reconnect so I don't have to restart everything from scratch when I want to apply changes to one component.
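
The thread-per-client model is the standard one; a bare-bones sketch of its shape is below (the port number and handler are placeholders, not NNTSC's real exporter):

    import socket
    import threading

    def handle_requests(conn):
        # Placeholder for the real export logic: read queries and stream
        # back historical data. Here we just echo until the client leaves.
        while True:
            data = conn.recv(4096)
            if not data:
                break
            conn.sendall(data)

    def serve_client(conn):
        """One thread per client, so a huge history request from one client
        no longer blocks others from connecting."""
        try:
            handle_requests(conn)
        finally:
            conn.close()

    def run_exporter(host="0.0.0.0", port=12345):
        listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        listener.bind((host, port))
        listener.listen(5)
        while True:
            conn, _ = listener.accept()
            t = threading.Thread(target=serve_client, args=(conn,))
            t.daemon = True
            t.start()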

Finally, my paper comparing the accuracy of various open-source traffic classifiers was accepted for WNM 2013. There are a few minor nits to tidy up, but it shouldn't require too much work to get it camera-ready.

28 Jun 2013

Had a week of catching up on a few jobs I had put off in lieu of getting NNTSC, netevmon and amp2 ready for the Lightwire release.

Re-worked the BSOD server to use a separate thread for communicating with clients, so that packets can be sent to clients immediately rather than waiting for a break in the input stream. Unfortunately, this hasn't stopped the bursty appearance of packets on the client like I had hoped, so this requires further investigation. I suspect the flow management inside the BSOD server isn't as efficient as it could be, and I may end up replacing it with libflowmanager.

With that in mind, I've modified libflowmanager to support multiple flow expiry 'plugins', as opposed to having a single defined expiry policy that all libflowmanager programs had to use. This will allow us to replicate BSOD's old expiry policy (flows expire after 20 seconds of inactivity) if we want to, although I would probably see how it goes with the classic libflowmanager policy first.
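
Expressed in Python for brevity (the real libflowmanager is C++ and its plugin interface will differ), an inactivity-based expiry plugin is conceptually just this:

    class InactivityExpiry(object):
        """Expire a flow after `timeout` seconds without a packet; with
        timeout=20 this mimics BSOD's old policy."""

        def __init__(self, timeout=20):
            self.timeout = timeout
            self.last_seen = {}

        def update(self, flow_id, ts):
            self.last_seen[flow_id] = ts

        def expired(self, now):
            """Return the flows that have been idle for too long."""
            return [f for f, seen in self.last_seen.items()
                    if now - seen >= self.timeout]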

Received some bug reports for libtrace from Matt Brown as a result of Mayhem being run against the entirety of Debian. Perry had more or less patched them right away, so I worked on releasing a new version of libtrace incorporating those fixes. The new release went out on Friday and also includes the rawerf fix from several weeks back. A few issues with both Fedora and FreeBSD slowed down testing, so the release took a bit longer than anticipated.

28 Jun 2013

Libtrace 3.0.18 has been released.

This release fixes several bugs that have been reported in 3.0.17. In particular, it fixes several crash bugs in the libtrace tools that were reported by the Mayhem team at Carnegie Mellon University. It also addresses a rare bug where the compression auto-detection could trigger a false positive on uncompressed ERF traces: a new format URI (rawerf:) has been added that can be used to force libtrace to treat such traces as uncompressed. We have also tightened up the compression auto-detection somewhat to reduce the likelihood of the bug occurring.

It is highly recommended that you explicitly use the rawerf: format if you are working with large numbers of uncompressed ERF traces.

The full list of changes in this release can be found in the libtrace ChangeLog.

You can download the new version of libtrace from the libtrace website.

24 Jun 2013

Added manpages to netevmon to get it ready for Debian packaging. During this process, fixed a few little oversights in the netevmon script and the existing documentation.

Re-wrote much of the NNTSC API in ampy. The main goal was to reduce the amount of duplicated code in modules for individual NNTSC collections that was better suited to a more general NNTSC API. In the process I also changed the API to only use a single "NNTSC Connection" instance rather than creating and destroying one for every AJAX request. The main benefit of this is that we don't have to ask the database about collections and streams every time we make a request now -- instead we get them once and store that info for subsequent use. This will hopefully make the graph interface feel a bit more responsive.
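
The caching itself is nothing fancy -- hold one connection object for the lifetime of the process and remember the collection and stream listings after the first fetch. A sketch with made-up names, not ampy's real classes:

    class NNTSCConnection(object):
        def __init__(self, host, port):
            self.host = host
            self.port = port
            self._collections = None
            self._streams = {}

        def collections(self):
            if self._collections is None:
                self._collections = self._request_collections()
            return self._collections

        def streams(self, collection):
            if collection not in self._streams:
                self._streams[collection] = self._request_streams(collection)
            return self._streams[collection]

        def _request_collections(self):
            return []   # placeholder: would query NNTSC over the socket

        def _request_streams(self, collection):
            return []   # placeholder: would query NNTSC for this collection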

Updated amp-web to use the new NNTSC API in ampy. I also spent a bit of time on Friday testing the web graphs on various browsers and fixing a few of the more obvious problems. Unsurprisingly, IE 10 was the biggest source of grief.

Added a new time series type to anomaly_ts -- JitterVariance. This time series tracks the standard deviation of the latencies reported by the individual smokeping pings. Using this, I've added a new event type designed to detect when the standard deviation has moved away from being near zero, e.g. the pings have started reporting variable latency. This helps us pick up on situations where the median stays roughly the same but the variance clearly indicates some issues. It also serves as a good early indicator of upcoming Plateau or Mode events on the median latency.
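
The underlying measurement is just the standard deviation of the individual pings in each smokeping sample, with an event raised once it moves well away from zero. A rough sketch (the threshold here is made up for illustration, not netevmon's):

    import math

    def jitter_stddev(pings):
        """Standard deviation of the individual ping latencies in one
        smokeping measurement; None entries are lost pings."""
        pings = [p for p in pings if p is not None]
        if len(pings) < 2:
            return 0.0
        mean = sum(pings) / float(len(pings))
        var = sum((p - mean) ** 2 for p in pings) / (len(pings) - 1)
        return math.sqrt(var)

    def jitter_event(pings, threshold=1.0):
        """Flag when the pings have started reporting noticeably variable
        latency, even if the median has barely moved."""
        return jitter_stddev(pings) > threshold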

17 Jun 2013

Finished preparing NNTSC for packaging. Wrote an init script for the NNTSC collector and ensured that all of the subprocesses are cleaned up when the main collector process is killed. Wrote some manpages, updated the other documentation and added some licensing to NNTSC before handing it off to Brendon for packaging.

Also moved towards packaging netevmon. Again, lots of messing around with daemonisation and ensuring that the monitor can be started and stopped nicely without anyone having to manually hunt down processes.

Spent the rest of my time working on the interaction between amp-web and History.js. Only one entry is placed in the history for each visited graph now and selecting a graph from the history will actually show you the right graph. Navigating to a graph via the history will also now update the dropdown lists to match the currently viewed graph. When using click and drag to explore a graph, clicking once on the graph will return to the previous zoom level (this was already present, but only worked for exploring the detailed graph, not the summary one).