
Shane Alcock's Blog




Continued digging into the unknown traffic in the day-long Waikato trace I captured last week. Diminishing returns are starting to really kick in now, but I've still managed to add another 9 new protocols (including SPDY) and improved the rules for a further 8.

Worked on a series of scripts to process the results of running the Plateau detector under a variety of configurations (e.g. history and trigger buffer sizes, sensitivity thresholds, etc.). The aim is to find the optimal set of parameters based on the ground truth we already have. Of course, some parameter combinations will produce events that we have never seen before, so I've also had to write code to find these events and generate suitable graphs so that I can classify them quickly by hand using my web-app.
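As a rough illustration of this kind of parameter sweep (not the actual scripts), the sketch below tries every combination in a parameter grid, scores each against the existing ground truth, and lists the events that have no label yet. The `run_detector` callable, the parameter names, the label format (event id mapped to True/False) and the choice of F1 as the score are all assumptions made for the example.

```python
from itertools import product


def f1_score(detected, labels):
    """F1 over detected events that already have a ground-truth label."""
    true_pos = sum(1 for e in detected if labels.get(e) is True)
    false_pos = sum(1 for e in detected if labels.get(e) is False)
    real_total = sum(1 for is_real in labels.values() if is_real)
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / real_total
    return 2 * precision * recall / (precision + recall)


def sweep(run_detector, labels, grid):
    """Run the detector for every parameter combination, best score first.

    run_detector is a hypothetical callable that takes the parameters as
    keyword arguments and returns a set of event ids.
    """
    names = sorted(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        detected = run_detector(**params)
        results.append({
            "params": params,
            "score": f1_score(detected, labels),
            # Events never seen before: these still need manual classification.
            "unclassified": sorted(e for e in detected if e not in labels),
        })
    results.sort(key=lambda r: r["score"], reverse=True)
    return results
```

The `unclassified` list for each configuration is the set of events that would be fed to the classification web-app before the score for that configuration can be trusted.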

Spent a fair bit of time helping Yindong with his experiments.




Continued working on adding new rules to libprotoident based on unknown flows seen with the new Waikato capture. Since getting access to fresh traffic, I've added 12 new protocols and improved the rules for another 13 existing ones.

Some of the more notable protocols that I've added are QUIC, SPDY, WeChat, Git and Speedtest. Also added a rule for the AMP throughput test, as this is one of the biggest contributors of "Unknown" traffic.

Captured a full weekday of traffic to use as a basis for working out how regularly we can take permanent captures and what sort of duration we can reasonably expect to capture for. A single day is around 116 GB (snapped and compressed). To put this in context, ~100 days of similar capture from 2007 was 491 GB -- a little over 4 days' worth of traffic now.
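The comparison can be checked with a few lines of arithmetic. The figures come straight from the paragraph above; treating the 2007 archive as exactly 100 days is an approximation, so the growth factor is only indicative.

```python
DAY_NOW_GB = 116          # one weekday, snapped and compressed
ARCHIVE_2007_GB = 491     # total size of the 2007 capture
ARCHIVE_2007_DAYS = 100   # "~100 days", taken as exactly 100 here

# How many of today's days would fit in the whole 2007 archive.
equivalent_days = ARCHIVE_2007_GB / DAY_NOW_GB

# Average daily volume in 2007 and the growth factor since then.
per_day_2007_gb = ARCHIVE_2007_GB / ARCHIVE_2007_DAYS
growth_factor = DAY_NOW_GB / per_day_2007_gb

print(f"{equivalent_days:.2f} days, {growth_factor:.1f}x daily volume")
```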




More work on the dashboard this week:
* added the ability to remove "common" events from the recent event list and made the graphs collapsible.
* added a table that shows the most frequently occurring events in the past day, e.g. "increased latency from A to B (ipv4)".
* polished up some of the styling on the dashboard and moved the dashboard-specific CSS (of which there is now quite a lot) into its own separate file.
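
A table of the most frequent events could be built along the following lines. This is only a sketch: the `(timestamp, description)` tuple format and the function name are invented for the example and are not the dashboard's actual data model.

```python
from collections import Counter


def most_frequent_events(events, now, window=86400, top=10):
    """Count event descriptions seen in the last `window` seconds.

    events is an iterable of (unix_timestamp, description) tuples
    (a hypothetical format chosen for this example).
    """
    recent = Counter(
        description
        for timestamp, description in events
        if now - window <= timestamp <= now
    )
    return recent.most_common(top)
```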

Started thinking about how to include loss-related events in the event groups, as these are ignored at the moment.

The new capture point came online on Wednesday, so the rest of my week was spent playing with the packet captures. This involved:
* learning to operate EndaceVision.
* installing wdcap on the vDAG VM.
* adding the ability to anonymise only the local network in wdcap.
* performing a short test capture.
* getting BSOD working again, which required the application of a little "in-flow" packet sampling to run smoothly.
* running libprotoident against the test capture to see what new rules I can add.




Continued testing and tweaking the event grouping in netevmon. My main problem was the creation of seemingly duplicate groups in cases where further (usually out-of-order) detections were being observed for events that were members of an already expired group. Eventually I tracked the problem down to the fact that the event was being deleted when the group expired so the later detection was resulting in the event and its parent group being re-created.

Started looking into methods for determining whether an event is "common" or "rare", so that we can allow users to filter events that occur regularly from the dashboard. This meant I had to change our method of populating the dashboard -- previously, we just grabbed the appropriate events in the pyramid view and passed them into the dashboard template, but now we need to be able to dynamically update the dashboard depending on whether the common events are being filtered or not.

Added some nice little icons to each event group to show what type of events are present within the group without having to click on the group. The current icons show latency increases, latency decreases and path changes.




Another short week -- this time on leave Monday and Tuesday.

Started integrating traceroute events into the event grouping part of netevmon. Changed the focus of the path change detection away from which ASN appears at each hop; instead, we look for new transitions between ASNs as this will mean we don't trigger events when it takes 4 hops to get through REANNZ instead of the usual 3.

Developed a system for assigning traceroute events to ASN groups. PathChange events are matched with the ASNs for both the old and new path transition, e.g. a change from REANNZ->AARNET to REANNZ->Vocus will be assigned to the ["REANNZ", "AARNET", "Vocus"] groups. A TargetUnreachable event will be matched with the ASNs that are now missing from the traceroute as well as the last observed ASN. A TargetReachable event uses the same "identify common path elements" approach that latency events use (for want of a better system right now).
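The transition-based approach can be sketched as follows. This is a simplified illustration rather than the netevmon implementation: paths are assumed to be plain lists of AS names (one per hop), and the function names are made up for the example.

```python
def asn_transitions(hops):
    """Set of (from_asn, to_asn) pairs where consecutive hops change ASN."""
    return {(a, b) for a, b in zip(hops, hops[1:]) if a != b}


def path_change_groups(old_hops, new_hops):
    """ASN groups for a path change, or [] if no new transition appeared.

    Only transitions matter, so spending an extra hop inside the same
    ASN does not count as a change.
    """
    old = asn_transitions(old_hops)
    new = asn_transitions(new_hops)
    appeared = new - old
    if not appeared:
        return []
    disappeared = old - new
    groups = []
    # Collect the ASNs from the old transitions first, then the new ones.
    for pair in sorted(disappeared) + sorted(appeared):
        for asn in pair:
            if asn not in groups:
                groups.append(asn)
    return groups
```

With these definitions, a path that takes 4 hops through REANNZ instead of 3 yields no groups, while swapping AARNET for Vocus after REANNZ yields `["REANNZ", "AARNET", "Vocus"]`.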

Fixed some more event detection and grouping bugs as I've found them. One fix was to make sure we at least create a group for the test target if the AS path for the stream does not reach the destination.

Spent some time on Friday proof-reading the BTM report.




Short week as I was on leave on Thursday and Friday.

Continued tweaking the event groups produced by netevmon. My main focus has been on ensuring that the start time for a group lines up with the start time of the earliest event in the group. When this doesn't happen, it suggests that there is an incongruity in the logic for updating events and groups based on a new observed detection. Now the problem happens rarely -- which is good from the perspective that I am making progress but it is also bad because it takes a lot longer for a bad group to occur so testing and debugging is much slower.

Spent a bit of time rewriting Yindong's python trace analysis using C++ and libflowmanager. My program was able to run much faster and use a lot less memory, which should mean that wraith won't be hosed for months while Yindong waits for his analysis to run.

Added a new API function to libtrace to strip VLAN and MPLS headers from packets. This makes the packets easier to analyse with BPF filters as you don't need to construct complicated filters to deal with the possible presence of VLAN tags that you don't care about.
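To show what stripping involves at the byte level, here is a small Python sketch that operates on a raw Ethernet frame. It is not the libtrace function (which is written in C) and makes several assumptions: the frame is complete, VLAN tags use ethertype 0x8100 or 0x88A8, and the payload under an MPLS stack is IPv4 or IPv6, guessed from the IP version nibble.

```python
import struct

ETH_VLAN = 0x8100
ETH_QINQ = 0x88A8
ETH_MPLS = 0x8847
ETH_IPV4 = 0x0800
ETH_IPV6 = 0x86DD


def strip_encapsulation(frame: bytes) -> bytes:
    """Return the frame with VLAN tags and MPLS labels removed."""
    macs = frame[:12]                      # destination + source MAC
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    offset = 14                            # first byte after the ethertype

    # Each VLAN tag is 2 bytes of TCI followed by the inner ethertype.
    while ethertype in (ETH_VLAN, ETH_QINQ):
        (ethertype,) = struct.unpack_from("!H", frame, offset + 2)
        offset += 4

    if ethertype == ETH_MPLS:
        # Each label stack entry is 4 bytes; the lowest bit of the third
        # byte marks the bottom of the stack.
        while True:
            bottom_of_stack = frame[offset + 2] & 0x01
            offset += 4
            if bottom_of_stack:
                break
        version = frame[offset] >> 4
        if version == 4:
            ethertype = ETH_IPV4
        elif version == 6:
            ethertype = ETH_IPV6
        else:
            return frame                   # unknown payload: leave untouched

    return macs + struct.pack("!H", ethertype) + frame[offset:]
```

After stripping, a simple filter such as `ip` or `tcp port 80` sees the IP header at the usual offset, which is the convenience described above.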

Installed libtrace on the Endace probe and managed to get it happily processing packets from a virtual DAG without too much difficulty.




Continued fine-tuning the event groupings produced by netevmon. Major changes I made include:
* When displaying groups on the dashboard, merge any event groups that have the exact same members.
* Avoid including events in a group if they don't have a reasonable D-S score, even if there are other similar significant events happening at the same time. This gets rid of a lot of pointless (and probably unrelated) events from each group and also ensures groups expire promptly. This change has introduced a few flow-on effects: the insignificant events still need to be members of the group (in case they do eventually become significant) but shouldn't affect any of the group statistics -- particularly the group start time.
* Allow events that occur within one expiry period before the first event in a group to be included in that group -- threaded netevmon doesn't export events in a strict chronological order any more, so we need to be careful not to throw away out-of-order events.
* Have a fallback strategy if there is no AS path information available for a given source, dest pair (e.g. there is no traceroute test scheduled or the traceroute test has failed for some reason). Instead, we will create 2 groups: one for the source and one for the target.
* Polished up the styling of the dashboard and event group list and fixed a few UI issues that Brendon had suggested after looking at it on Friday.
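
Three of the rules above (merging groups with identical members, ignoring insignificant events in the group start time, and admitting events up to one expiry period before the group start) can be sketched as below. The dictionary layouts, field names and function names are hypothetical and chosen only for this example; netevmon's real data structures are not shown in this post.

```python
def merge_identical_groups(groups):
    """Merge groups whose member events are exactly the same.

    groups is a list of {"name": str, "events": iterable of event ids}.
    """
    merged = {}
    for group in groups:
        key = frozenset(group["events"])
        merged.setdefault(key, []).append(group["name"])
    return [
        {"names": names, "events": sorted(key)}
        for key, names in merged.items()
    ]


def group_start(events):
    """Earliest timestamp among significant events, or None if there are none.

    Insignificant events stay in the group but do not affect its statistics.
    """
    times = [e["ts"] for e in events if e["significant"]]
    return min(times) if times else None


def belongs_to_group(event_ts, start_ts, last_ts, expiry):
    """Accept events up to one expiry period either side of the group.

    The lower bound is what keeps out-of-order events from being dropped.
    """
    return start_ts - expiry <= event_ts <= last_ts + expiry
```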




Brad managed to track down a newer video card for quarterpounder, so now BSOD is up and running again.

Added Meena's lpicollector to our github so now I can finally deprecate the lpi_live tool that comes with libprotoident. Spent a bit of time updating some documentation and reworking the example client scripts so that everything is a bit easier to use. Also fixed a couple of memory bugs that I may have introduced last time I worked on the collector.

Continued working with the new event groups. Found a problem where I was incorrectly preferring shorter AS path segments over longer ones when determining whether I could remove a group for being redundant. Having fixed that, many event groups now cover several ASNs so I've redesigned the event list on the dashboard to be better at displaying multiple AS names.




The source code for both BSOD and Meenakshee Mungro's reliable libprotoident collector has been added to the WAND github page. Developers can freely clone these projects and make their own modifications or additions to the source code, while keeping up with any changes that we make between releases.

This is the first time we have released the libprotoident collector under the GPLv3 license. This project is a replacement for the lpi_live tool included with libprotoident, which should now be considered deprecated.

We're also more than happy to consider pull requests for code that adds useful features to either project.

WAND on GitHub




My NNTSC live queue continued to keep up satisfactorily, so I've turned my attention back to testing AS-based event grouping in netevmon. Updated the dashboard to use AS names rather than numbers to describe event groups. Replaced the "top sources" and "top targets" graph with a "top networks" graph.

Spent Thursday hosting one of the candidates applying for a position with STRATUS.

Added BSOD to our github on Friday. Tried to get the client running on the big TV, but ran into some issues with our video card no longer being supported by fglrx. Attempting to get the client to build and run on the Mac was not much more successful, since Xcode seems to have lost track of some of our dynamic libraries.