Shane Alcock's Blog
Another short week -- this time on leave Monday and Tuesday.
Started integrating traceroute events into the event grouping part of netevmon. Changed the focus of the path change detection away from which ASN appears at each hop; instead, we look for new transitions between ASNs, which means we won't trigger events when it takes 4 hops to get through REANNZ instead of the usual 3.
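The transition-based approach can be sketched in a few lines of Python. This is an illustration of the idea rather than netevmon's actual implementation: collapse the AS path into a set of (from, to) transitions, so extra hops inside one network change nothing.

```python
def asn_transitions(path):
    """Collapse an AS path into the set of transitions between ASNs.

    Consecutive hops inside the same ASN contribute nothing, so a path
    growing or shrinking within one network does not look like a change.
    """
    transitions = set()
    for prev, curr in zip(path, path[1:]):
        if prev != curr:
            transitions.add((prev, curr))
    return transitions

old = ["REANNZ", "REANNZ", "REANNZ", "AARNET"]
new = ["REANNZ", "REANNZ", "REANNZ", "REANNZ", "AARNET"]  # one extra REANNZ hop

# Identical transition sets, so no path-change event would fire here.
assert asn_transitions(old) == asn_transitions(new) == {("REANNZ", "AARNET")}
```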
Developed a system for assigning traceroute events to ASN groups. PathChange events are matched with the ASNs for both the old and new path transition, e.g. a change from REANNZ->AARNET to REANNZ->Vocus will be assigned to the ["REANNZ", "AARNET", "Vocus"] groups. A TargetUnreachable event will be matched with the ASNs that are now missing from the traceroute as well as the last observed ASN. A TargetReachable event uses the same "identify common path elements" approach that latency events use (for want of a better system right now).
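The assignment rules above can be sketched as a small dispatch function. The field names here are illustrative, not netevmon's actual schema:

```python
def groups_for_event(event):
    """Map a traceroute event to the set of ASN groups it belongs to.

    Hypothetical event layout, just to show the rules described above.
    """
    if event["type"] == "PathChange":
        # Both ASNs from the old transition, plus both from the new one.
        return set(event["old_transition"]) | set(event["new_transition"])
    if event["type"] == "TargetUnreachable":
        # The ASNs that vanished from the path, plus the last one still seen.
        return set(event["missing_asns"]) | {event["last_seen_asn"]}
    return set()

change = {"type": "PathChange",
          "old_transition": ("REANNZ", "AARNET"),
          "new_transition": ("REANNZ", "Vocus")}
assert groups_for_event(change) == {"REANNZ", "AARNET", "Vocus"}
```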
Fixed some more event detection and grouping bugs as I've found them. One fix was to make sure we at least create a group for the test target if the AS path for the stream does not reach the destination.
Spent some time on Friday proof-reading the BTM report.
Short week as I was on leave on Thursday and Friday.
Continued tweaking the event groups produced by netevmon. My main focus has been on ensuring that the start time for a group lines up with the start time of the earliest event in the group. When this doesn't happen, it suggests that there is an incongruity in the logic for updating events and groups based on a new observed detection. The problem now happens only rarely, which is good in that I'm making progress, but also bad in that it takes much longer for a bad group to occur, so testing and debugging is much slower.
Spent a bit of time rewriting Yindong's python trace analysis using C++ and libflowmanager. My program was able to run much faster and use a lot less memory, which should mean that wraith won't be hosed for months while Yindong waits for his analysis to run.
Added a new API function to libtrace to strip VLAN and MPLS headers from packets. This makes the packets easier to analyse with BPF filters as you don't need to construct complicated filters to deal with the possible presence of VLAN tags that you don't care about.
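The idea behind the stripping is simple enough to sketch in Python (the real API is C, of course, and this is just the VLAN half of the concept): if the ethertype field holds an 802.1Q tag, splice the 4-byte tag out so an ordinary "ip" filter matches regardless of tagging.

```python
ETHERTYPE_VLAN = 0x8100  # 802.1Q tag

def strip_vlan(frame):
    """Remove any 802.1Q VLAN tags from a raw Ethernet frame.

    A tagged frame looks like: dst(6) src(6) 0x8100(2) TCI(2) ethertype(2)...
    We copy the addresses and drop the 4 tag bytes, repeating in case the
    frame is double-tagged (QinQ), until a non-VLAN ethertype is exposed.
    """
    while len(frame) >= 18 and int.from_bytes(frame[12:14], "big") == ETHERTYPE_VLAN:
        frame = frame[:12] + frame[16:]
    return frame
```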
Installed libtrace on the Endace probe and managed to get it happily processing packets from a virtual DAG without too much difficulty.
Continued fine-tuning the event groupings produced by netevmon. Major changes I made include:
* When displaying groups on the dashboard, merge any event groups that have the exact same members.
* Avoid including events in a group if they don't have a reasonable D-S score, even if there are other similar significant events happening at the same time. This gets rid of a lot of pointless (and probably unrelated) events from each group and also ensures groups expire promptly. This change has introduced a few flow-on effects: the insignificant events still need to be members of the group (in case they do eventually become significant) but shouldn't affect any of the group statistics -- particularly the group start time.
* Allow events that occur within one expiry period before the first event in a group to be included in that group -- threaded netevmon doesn't export events in a strict chronological order any more, so we need to be careful not to throw away out-of-order events.
* Have a fallback strategy if there is no AS path information available for a given source/dest pair (e.g. there is no traceroute test scheduled or the traceroute test has failed for some reason). Instead, we will create two groups: one for the source and one for the target.
* Polished up the styling of the dashboard and event group list and fixed a few UI issues that Brendon had suggested after looking at it on Friday.
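The first of those changes, merging groups with identical membership for display, can be sketched like this (a simplification of what the dashboard actually does; keying on a frozenset of event ids and keeping the earliest start time):

```python
def merge_identical_groups(groups):
    """Merge event groups whose member sets are exactly equal.

    Display-time only: the underlying groups are untouched. Group layout
    here is hypothetical -- a dict with an "events" list and a "start" time.
    """
    merged = {}
    for group in groups:
        key = frozenset(event["id"] for event in group["events"])
        if key in merged:
            # Duplicates collapse to one row with the earliest start time.
            merged[key]["start"] = min(merged[key]["start"], group["start"])
        else:
            merged[key] = dict(group)
    return list(merged.values())
```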
Brad managed to track down a newer video card for quarterpounder, so now BSOD is up and running again.
Added Meena's lpicollector to our github so now I can finally deprecate the lpi_live tool that comes with libprotoident. Spent a bit of time updating some documentation and reworking the example client scripts so that everything is a bit easier to use. Also fixed a couple of memory bugs that I may have introduced last time I worked on the collector.
Continued working with the new event groups. Found a problem where I was incorrectly preferring shorter AS path segments over longer ones when determining whether I could remove a group for being redundant. Having fixed that, many event groups now cover several ASNs so I've redesigned the event list on the dashboard to be better at displaying multiple AS names.
The source code for both BSOD and Meenakshee Mungro's reliable libprotoident collector has been added to the WAND github page. Developers can freely clone these projects and make their own modifications or additions to the source code, while keeping up with any changes that we make between releases.
This is the first time we have released the libprotoident collector under the GPLv3 license. This project is a replacement for the lpi_live tool included with libprotoident, which should now be considered deprecated.
We're also more than happy to consider pull requests for code that adds useful features to either project.
WAND on GitHub
My NNTSC live queue continued to keep up satisfactorily, so I've turned my attention back to testing AS-based event grouping in netevmon. Updated the dashboard to use AS names rather than numbers to describe event groups. Replaced the "top sources" and "top targets" graph with a "top networks" graph.
Spent Thursday hosting one of the candidates applying for a position with STRATUS.
Added BSOD to our github on Friday. Tried to get the client running on the big TV, but ran into some issues with our video card no longer being supported by fglrx. Attempting to get the client to build and run on the Mac was not much more successful, since Xcode seems to have lost track of some of our dynamic libraries.
Fixed my remaining issues with threaded anomaly_ts. Had a few problems where a call to sscanf was interfering with some strtok_r calls I was making, but once I replaced the sscanf with some manual string parsing everything worked again.
Continued looking into my NNTSC live queue delays. Narrowed the problem down to there being a time delay between publishing a message to the live rabbit queue and the message actually appearing in the queue (thanks to the firehose feature in rabbitmq!). After doing a fair bit of reading and experimenting, I theorised that the cause was the live queue being 'durable'. Even though the published messages themselves are not marked as persistent, publishing to a durable queue seems to require touching disk which can be slow on a resource-constrained machine like prophet. Removed the durable flag from the live queue and managed to run successfully over the long weekend without ever falling behind.
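For reference, the fix amounts to two flags at the publishing side. This is a sketch using the Python pika client with illustrative queue/host names, not NNTSC's actual code:

```python
import pika

# Illustrative names only -- not NNTSC's real queue or host.
conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
chan = conn.channel()

# durable=False: the broker no longer does disk bookkeeping for the queue
# itself, which was the apparent source of the publish delay on prophet.
chan.queue_declare(queue="nntsc_live", durable=False)

# delivery_mode=1 marks the message as transient (non-persistent), which
# the live messages already were.
chan.basic_publish(exchange="", routing_key="nntsc_live", body=b"measurement",
                   properties=pika.BasicProperties(delivery_mode=1))
```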
Migrated all netevmon configuration to use a single YAML config file for all three components. Previously, each component supported a series of getopt command line arguments which was a bit unwieldy.
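The single config file looks something along these lines -- a hypothetical layout with illustrative keys, not the actual schema:

```yaml
# One file shared by all three netevmon components.
nntsc:
  host: example-host
  port: 61234

anomaly_ts:
  workers: 4
  detectors:
    - plateau
    - changepoint

eventing:
  group_expiry: 1800   # seconds
```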
Started testing my new parallel anomaly_ts code. The main hiccup was that embedded R is not thread-safe, so I've had to wrap any calls out to R with a mutex. This creates a bit of a bottleneck in the parallel system so we may need to revisit writing our own implementation of the complex math that I've been fobbing off to R. After fixing that, latency time series seem to work fairly well in parallel but AS traceroute series definitely do not so I'll be looking into that some more next week.
Spent a week working on the amp-web matrix. First task was to add HTTP and Throughput test matrices so we could make the BTM website available to the various participants. This was a bit trickier than anticipated as a lot of the matrix code was written with just the ICMP test in mind so there were a lot of hard-coded references to IPv4/IPv6 splits that were not appropriate for either test.
Updated the amp mesh database to list which tests were appropriate for each mesh. This enabled us to limit the mesh selection dropdowns to only contain meshes that were appropriate for the currently selected matrix, as there was little overlap between the targets for the latency, HTTP and throughput tests.
Continued keeping an eye on BTM until Brendon got back on Thursday. Briefed Brendon on all the problems we had noticed and what we thought was required to fix them.
Finished up my event detection webapp. Started experimenting with automating the running of event detectors across a range of different parameter options. The first detector I'm looking at (Plateau) has about 15,000 different parameter combinations that I would like to try, so I'm going to have to be pretty smart about recognising events as being the same across different runs.
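The shape of the sweep, and one plausible "same event" test, look like this. The parameter names and the time-slack rule are hypothetical, just to show the approach:

```python
import itertools

# Hypothetical parameter ranges -- the real Plateau sweep is much larger
# (~15,000 combinations).
grid = {
    "sensitivity": [1.5, 2.0, 2.5],
    "history_size": [20, 50, 100],
    "trigger_count": [3, 5],
}
combos = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]

def same_event(a, b, slack=300):
    """Treat detections from different runs as the same underlying event
    if they are on the same stream and start within `slack` seconds."""
    return a["stream"] == b["stream"] and abs(a["start"] - b["start"]) <= slack
```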
Started adding worker threads to anomaly_ts so that we can be more parallel. Each stream will be hashed to a consistent worker thread so that measurements will always be evaluated in order, but I still have to consider the impact of the resulting events not being in strict chronological order across all streams.
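The hashing scheme amounts to this (Python sketch of the C++ design; CRC32 here is just one convenient stable hash):

```python
import queue
import zlib

NWORKERS = 4
work_queues = [queue.Queue() for _ in range(NWORKERS)]

def worker_for(stream_id):
    """Hash a stream id to a fixed worker so all measurements for one
    stream are handled by the same thread, preserving per-stream order."""
    return zlib.crc32(str(stream_id).encode()) % NWORKERS

def dispatch(stream_id, measurement):
    work_queues[worker_for(stream_id)].put((stream_id, measurement))
```

Per-stream ordering is preserved, but nothing orders events *across* streams, which is exactly the chronology problem mentioned above.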