Shane Alcock's Blog
Spent last week on leave, getting my balance down :)
Finished developing and testing stream / collection selection in netevmon.
Added support for the HTTP test back into NNTSC. We only store basic statistics from the test, i.e. number of objects, bytes, servers and the time taken to fetch everything, as opposed to the previous schema which tried to store detailed information about each individual fetched object. Managed to get my own amplet VM to do some testing and have been happily running HTTP tests for most of the week.
Changed the pika code in NNTSC to use asynchronous connections rather than blocking connections. This should make our rabbit queue publishing and consuming code a bit more robust, especially if a TCP connection breaks down, and it also appears to have made our backlog processing much faster.
Spent a decent chunk of time chasing down a bug in the AMP HTTP test that would cause it to segfault if you tested to certain sites. After delving deep into the flex code that parses the HTML on the fetched pages looking for other objects to fetch, we eventually found that the buffer being provided to store the URL of the found object was not big enough to fit all the URLs we were seeing.
Released a new version of libtrace on Tuesday that contains the most recent batch of bug fixes. Started moving the libtrace wiki from trac to github; only the tool pages are left to migrate.
Updated netevmon to support the new family-based streams in NNTSC. Since this new approach results in one time series per stream (as opposed to multiple streams having to be aggregated into each time series), this greatly simplified the anomalyfeed script. Added event detection for changes in AS paths which operates in much the same way as the old IP path event detection.
Started adding the ability to specify a subset of streams / collections for event detection in netevmon, rather than automatically running against all streams. The streams / collections of interest are provided via a config file and a SIGHUP will cause the file to be re-read and any necessary changes made. This also meant I had to add unsubscribe support to the NNTSC exporter, so that it would stop sending live updates for streams that had been removed from the config file.
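The reload-on-SIGHUP behaviour described above can be sketched roughly as follows. This is only an illustration, not the netevmon code: the `StreamSelection` class, the one-name-per-line config format and the helper names are all assumptions made for the example.

```python
import signal


class StreamSelection:
    """Tracks which streams / collections event detection is subscribed to."""

    def __init__(self, path):
        self.path = path
        self.active = set()
        self.reload_requested = False

    def request_reload(self, signum=None, frame=None):
        # Keep the signal handler tiny: just flag that a reload is wanted.
        self.reload_requested = True

    def read_config(self):
        # Hypothetical format: one stream or collection name per line,
        # with blank lines and '#' comments ignored.
        wanted = set()
        with open(self.path) as f:
            for line in f:
                line = line.split("#", 1)[0].strip()
                if line:
                    wanted.add(line)
        return wanted

    def apply_reload(self):
        """Re-read the file and return (to_subscribe, to_unsubscribe)."""
        wanted = self.read_config()
        to_subscribe = wanted - self.active
        to_unsubscribe = self.active - wanted
        self.active = wanted
        self.reload_requested = False
        return to_subscribe, to_unsubscribe


def install_sighup_handler(selection):
    # SIGHUP does not exist on every platform, hence the guard.
    if hasattr(signal, "SIGHUP"):
        signal.signal(signal.SIGHUP, selection.request_reload)
```

The main loop would check `reload_requested` between messages, call `apply_reload()`, and send a subscribe or unsubscribe request for each name in the two returned sets.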
Libtrace 3.0.21 has been released today.
This release fixes many bugs that have been reported by our users, including:
* trace_interrupt() now works properly for int, bpf, dag and ring formats.
* fixed double-counting of accepted packets when using the event API.
* fixed incorrect filtered packet counts for bpf format.
* fixed crash when performing very large reads with libwandio.
* fixed inconsistent behaviour if a bad filter string is used with int and dag formats.
* fixed potential infinite loop when combining filters, the event API and the pcapint format.
* fixed incorrect wire lengths when using SNAPLEN config option to truncate packets captured using the int format.
The full list of changes in this release can be found in the libtrace ChangeLog.
You can download the new version of libtrace from the libtrace website.
Finished up a draft of the PAM paper, eventually managing to squeeze it into the 12 page limit.
Spent a bit of time learning about DPDK while investigating a build bug reported by someone trying to use libtrace's DPDK support. Turns out we were a little way behind current DPDK releases, but Richard S has managed to bring us more up-to-date over the past few days. Spent my Friday afternoon fixing up the last outstanding known issue in libtrace (trace_interrupt not working for most live formats) in preparation for a release in the next week or two.
Spent most of my week writing up a paper for PAM on the event detectors we've implemented in netevmon.
Wrote and tested a script to ease the transition from the current per-address stream format to a per-family stream format. We've already accepted that we're not going to try and migrate any existing collected data for the affected collections, so it is mostly a case of making sure we drop all the right tables (and don't drop any wrong ones).
Spent Wednesday at the student Honours conference. Our students did fairly well and were much improved on their practice talks.
Wrote a script to query prophet's database to extract the Smokeping time series used to generate the event ground truth data used in Meena's masters project, with an eye towards releasing the time series and the associated events that we have identified as a dataset for the anomaly detection community to use to validate and compare new techniques.
Went over all of the events that we had found and updated them to match the current output of our event detection software, which had changed quite a bit since we originally collected the events. There were also quite a few errors and inconsistencies in the significance ratings for the events, so I ended up spending most of my week working on this. Many of the changes were made to events that I had originally classified, so I can't blame the students entirely :)
Spent a decent chunk of Wednesday listening to our students give their Honours practice talks. The good thing is that they all appear to have done some useful work so far, but there's a bit of work to do in terms of making that work accessible to a general CS audience.
Brendon deployed the new amp-traceroute test on a VM early in the week, so I was finally able to test the new amp-traceroute database schema. After a few minor glitches, we were able to get both AS paths and IP paths going into and coming out of a NNTSC database.
Updated the existing traceroute graphs to use the new data formats. Hop count and rainbow graphs are both now based on AS paths, which we will be measuring much more frequently than IP paths. In particular, using AS paths should make our rainbow graphs a bit more useful rather than looking like a bad patchwork quilt.
Merged Brad Christensen's traceroute map graph into my current amp-web branch and updated it to work with the IP path data that we are now collecting. The map graph now "works" but there are a lot of improvements to make in the future. Sizing nodes and edges based on how often each hop was hit is the main goal, but we also need to figure out what to display on the summary graph.
Added support for the new amp-tcpping test to ampy and amp-web.
Started on yet another major database schema change. This time, we're getting rid of address-based streams for amp collections and instead having one stream per address family per target. For example, instead of having an amp-icmp stream for every google address we observed, we'll just have two: one for ipv4 and one for ipv6.
This will hopefully result in some performance improvements. Firstly, we'll be doing a maximum of 2 inserts per test/source/dest combination, rather than anywhere up to 20 for some targets. We'll also have far fewer streams to search and process when starting up a NNTSC client. Finally, we should save a lot of time when querying for data, as almost all of our use cases were taking the old stream data and aggregating it based on address family anyway. Now our data is effectively pre-aggregated, so we will also have far fewer joins and unions across multiple tables.
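To make the "at most 2 inserts per test/source/dest combination" point concrete, here is a minimal sketch of collapsing per-address results into one row per address family. It is not the NNTSC schema: the tuple layout, the function names and the choice of summary fields (mean RTT, result count, loss count) are assumptions for illustration only.

```python
import ipaddress
from collections import defaultdict


def family_of(address):
    """Return 'ipv4' or 'ipv6' for a textual IP address."""
    return "ipv4" if ipaddress.ip_address(address).version == 4 else "ipv6"


def group_by_family(results):
    """results: iterable of (source, target, address, rtt_ms or None)."""
    grouped = defaultdict(list)
    for source, target, address, rtt in results:
        # The stream key no longer contains the address, only its family.
        grouped[(source, target, family_of(address))].append(rtt)
    return grouped


def summarise(grouped):
    """Build one row per (source, target, family) stream."""
    rows = {}
    for key, rtts in grouped.items():
        answered = [r for r in rtts if r is not None]
        mean = sum(answered) / len(answered) if answered else None
        rows[key] = {
            "mean_rtt": mean,
            "results": len(rtts),
            "lost": len(rtts) - len(answered),
        }
    return rows
```

However many addresses a target resolves to, `summarise(group_by_family(results))` produces at most two rows for a given source and target, which is where the saving in inserts and in query-time aggregation would come from.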
By the end of the week, my test NNTSC was successfully collecting and storing data using this new schema. I also had ampy fetching data for amp-icmp and amp-tcpping, with amp-traceroute most of the way towards working. The main complexity with amp-traceroute is that we should be deploying Brendon's AS path traceroute next week, so I'm changing the rainbow graph to fetch AS path data and adding a method to query the IP path data that will support the monitor map graph that was implemented last summer.
Spent a day working on libtrace following some bug reports from Mike Schiffman at Farsight Security. Fixed some tricky bugs that popped up when using BPF filters with the event API.
Finally deployed the update-less version of NNTSC on skeptic. Unfortunately this initially made the performance even worse, as we were trying to keep the last timestamp cache up to date after every message. Changed it so that NNTSC only writes to the cache once every 5 minutes of real time, which seems to have solved the problem. In fact, we are now finally starting to (slowly) catch up on the message queue on skeptic.
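The throttling idea is simple enough to sketch. The following is an illustration rather than the actual NNTSC change: the class name, the `write_fn` callback and the injectable clock are all hypothetical, and the real cache backend is not shown.

```python
import time


class ThrottledTimestampCache:
    """Remember the latest timestamp per stream, but only write the
    cache out once per interval instead of after every message."""

    def __init__(self, write_fn, interval=300.0, clock=time.monotonic):
        self.write_fn = write_fn      # hypothetical: persists a dict
        self.interval = interval      # seconds between writes (5 minutes)
        self.clock = clock
        self.pending = {}
        self.last_flush = clock()

    def update(self, stream_id, timestamp):
        # Cheap in-memory update on every message.
        self.pending[stream_id] = timestamp
        if self.clock() - self.last_flush >= self.interval:
            self.flush()

    def flush(self):
        # One expensive write covering everything seen since the last one.
        if self.pending:
            self.write_fn(dict(self.pending))
            self.pending.clear()
        self.last_flush = self.clock()
```

The trade-off is that a crash can lose up to one interval of cache updates, so anything relying on the cache has to tolerate it being slightly stale. Calling `flush()` on a clean shutdown keeps that window small.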
Made a few minor tidyups to the TCPPing test. The main change was to pad IPv4 SYNs with 20 bytes of TCP NOOP options to ensure IPv4 and IPv6 tests to the same target will have the same packet size. Otherwise this could get confusing for users when they choose a packet size on the graph modal and find that they can't see IPv6 (or IPv4) results.
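The 20 byte figure is just the difference between the two IP header sizes, which the arithmetic below spells out. This is a sketch of the sizing only, not the test's packet construction code; it assumes no IP options or IPv6 extension headers, ignores any extra payload the user asks for, and the function names are made up for the example.

```python
IPV4_HEADER = 20    # bytes, IPv4 header without IP options
IPV6_HEADER = 40    # bytes, fixed IPv6 header without extension headers
TCP_HEADER = 20     # bytes, TCP header without options
TCP_NOP = b"\x01"   # TCP option kind 1 (No-Operation), a single byte


def syn_padding(family):
    """NOP options needed so both families send the same size SYN."""
    if family == "ipv4":
        return TCP_NOP * (IPV6_HEADER - IPV4_HEADER)
    return b""


def syn_size(family):
    """Total size in bytes of the SYN, from the IP header onwards."""
    ip_header = IPV4_HEADER if family == "ipv4" else IPV6_HEADER
    return ip_header + TCP_HEADER + len(syn_padding(family))
```

Both `syn_size("ipv4")` and `syn_size("ipv6")` come out at 60 bytes. The 20 bytes of padding is a multiple of 4 and fits within the 40 byte limit on TCP options, so the padded header is still valid.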
Now that we have three AMP tests that measure latency, we decided that it would be best if all of the latency tests could be viewed on the same graph, rather than there being a separate graph for each of DNS, ICMP and TCPPing. This required a fair amount of re-architecting of ampy to support views that span multiple collections -- we now have an 'amp-latency' view that can contain groups from any of the 'amp-dns', 'amp-icmp' and 'amp-tcpping' collections.
Added support for the amp-latency view to the website. The most time-consuming changes were re-designing the modal dialog for choosing which test results to add to an amp-latency graph, as now it needed to support all three latency collections (which all have quite different test options) on the same dialog. It gets quite complicated when you consider that we won't necessarily run all three tests to every target, e.g. no point in running a DNS test to www.wand.net.nz as it isn't a DNS server, so the dialog must ensure that all valid selections and no invalid selections are presented to the user. As a result, there's a lot of hiding and showing of modal components required based on what option the user has just changed.
Managed to get amp-latency views working on the website for the existing amp-icmp and amp-dns collections; adding amp-tcpping as well should be a straightforward task.