Shane Alcock's Blog
Spent most of my week writing up a paper for PAM on the event detectors we've implemented in netevmon.
Wrote and tested a script to ease the transition from the current per-address stream format to a per-family stream format. We've already accepted that we're not going to try and migrate any existing collected data for the affected collections, so it is mostly a case of making sure we drop all the right tables (and don't drop any wrong ones).
Spent Wednesday at the student Honours conference. Our students did fairly well and were much improved on their practice talks.
Wrote a script to query prophet's database to extract the Smokeping time series used to generate the event ground truth data used in Meena's masters project, with an eye towards releasing the time series and the associated events that we have identified as a dataset for the anomaly detection community to use to validate and compare new techniques.
Went over all of the events that we had found and updated them to match the current output of our event detection software, which had changed quite a bit since we originally collected the events. There were also quite a few errors and inconsistencies in the significance ratings for the events, so I ended up spending most of my week working on this. Many of the changes were made to events that I had originally classified, so I can't blame the students entirely :)
Spent a decent chunk of Wednesday listening to our students give their Honours practice talks. The good thing is that they all appear to have done some useful work so far, but there's a bit of work to do in terms of making that work accessible to a general CS audience.
Brendon deployed the new amp-traceroute test on a VM early in the week, so I was finally able to test the new amp-traceroute database schema. After a few minor glitches, we were able to get both AS paths and IP paths going into and coming out of a NNTSC database.
Updated the existing traceroute graphs to use the new data formats. Hop count and rainbow graphs are both now based on AS paths, which we will be measuring much more frequently than IP paths. In particular, using AS paths should make our rainbow graphs a bit more useful rather than looking like a bad patchwork quilt.
Merged Brad Christensen's traceroute map graph into my current amp-web branch and updated it to work with the IP path data that we are now collecting. The map graph now "works" but there are a lot of improvements to make in the future. Sizing nodes and edges based on the frequency that the hop was hit is the main goal, but we also need to figure out what to display on the summary graph.
Added support for the new amp-tcpping test to ampy and amp-web.
Started on yet another major database schema change. This time, we're getting rid of address-based streams for amp collections and instead having one stream per address family per target. For example, instead of having an amp-icmp stream for every google address we observed, we'll just have two: one for ipv4 and one for ipv6.
This will hopefully result in some performance improvements. Firstly, we'll be doing a maximum of 2 inserts per test/source/dest combination, rather than anywhere up to 20 for some targets. We'll also have a lot less streams to search and process when starting up a NNTSC client. Finally, we should save a lot of time when querying for data, as almost all of our use cases were taking the old stream data and aggregating it based on address family anyway. Now our data is effectively pre-aggregated -- we also will have a lot less joins and unions across multiple tables.
By the end of the week, my test NNTSC was successfully collecting and storing data using this new schema. I also had ampy fetching data for amp-icmp and amp-tcpping, with amp-traceroute most of the way towards working. The main complexity with amp-traceroute is that we should be deploying Brendon's AS path traceroute next week, so I'm changing the rainbow graph to fetch AS path data and adding a method to query the IP path data that will support the monitor map graph that was implemented last summer.
Spent a day working on libtrace following some bug reports from Mike Schiffman at Farsight Security. Fixed some tricky bugs that popped up when using BPF filters with the event API.
Deployed the update-less version of NNTSC on skeptic finally. Unfortunately this initially made the performance even worse, as we were trying to keep the last timestamp cache up to date after every message. Changed it so that NNTSC only writes to the cache once every 5 mins of realtime, which seems to have solved the problem. In fact, we are now finally starting to (slowly) catch up on the message queue on skeptic.
Made a few minor tidyups to the TCPPing test. The main change was to pad IPv4 SYNs with 20 bytes of TCP NOOP options to ensure IPv4 and IPv6 tests to the same target will have the same packet size. Otherwise this could get confusing for users when they choose a packet size on the graph modal and find that they can't see IPv6 (or IPv4) results.
Now that we have three AMP tests that measure latency, we decided that it would be best if all of the latency tests could be viewed on the same graph, rather than there being a separate graph for each of DNS, ICMP and TCPPing. This required a fair amount of re-architecting of ampy to support views that span multiple collections -- we now have an 'amp-latency' view that can contain groups from any of the 'amp-dns', 'amp-icmp' and 'amp-tcpping' collections.
Added support for the amp-latency view to the website. The most time-consuming changes were re-designing the modal dialog for choosing which test results to add to an amp-latency graph, as now it needed to support all three latency collections (which all have quite different test options) on the same dialog. It gets quite complicated when you consider that we won't necessarily run all three tests to every target, e.g. no point in running a DNS test to www.wand.net.nz as it isn't a DNS server, so the dialog must ensure that all valid selections and no invalid selections are presented to the user. As a result, there's a lot of hiding and showing of modal components required based on what option the user has just changed.
Managed to get amp-latency views working on the website for the existing amp-icmp and amp-dns collections, but it should be a straightforward task to add amp-tcpping as well.
Wrote a script to update an existing NNTSC database to add the necessary tables and columns for storing AS path data. Tested it on my existing test database and will roll it out to prophet once we're collecting AS path data and are sure that our database schema covers everything we want to store.
Added a TCP ping test to AMP. This turned out to be a lot more complicated than I had first anticipated, but I'm reasonably confident that we've got something working now. The test works by sending a TCP SYN to a predefined port on the target and measures how long it takes to get a TCP response (either a SYN ACK or a RST). We can also get an ICMP response, so we need to listen for that and report a failed result in that case. The complications arise in that the operating system typically handles the TCP handshake, so we have to pull a number of tricks to be able to send and receive TCP SYN and SYN ACK packets inside our test code.
Sending a SYN is easy enough using a raw socket, although we have to make sure we bind the source port using a separate socket to prevent the OS from allowing other applications to use it which would screw with our responses. Getting the response is a lot harder -- we have to work out which interface our SYN is going to use and attach a pcap live capture to that interface (filtering on traffic for our known source and dest ports + icmp). We find the interface by creating a UDP socket to our target and seeing which source address it binds to, then check the list of addresses returned by getifaddrs() to find a match. The match will tell us the name of interface that the address belongs to.
Any packets received on our pcap capture are checked to see if they match any of the SYNs that we had sent out. This is done by parsing the packet headers -- I felt dirty writing non-libtrace packet parsing code -- and looking to see if the ACK matched the sequence number of the packet we had sent (in the case of TCP) or if the embedded TCP header matched our original SYN (in the case of ICMP).
The test still has a few annoying limitations due to the nature of firewalls on the Internet these days. I had originally intended to allow the test to vary the packet size by adding payload to the SYN, which is technically legal TCP behaviour, but in testing I found that SYNs with extra payload will often be dropped and we'll get no response. Transparent proxies on the monitor side are also problematic, in that they will pre-emptively respond to SYNs on port 80 and therefore mess with our latency measurement, e.g. the Fortigate here at Waikato does this, which initially made me think I had a bug in my timestamping since I was getting sub-1 ms results for targets I knew were hundreds of ms away.
Deployed the TCP ping test on our Centos VM successfully and was able to collect some test data in a NNTSC database. Also updated netevmon to be able to process TCP ping latency measurements.
Spent most of the week tidying up the NNTSC codebase. Replaced the dataparsers with proper OO classes, which removed a lot of repeated code and should make it easier to both write and maintain dataparsers. Also replaced most of the error reporting with exceptions rather than returning error codes everywhere.
Added support for storing AS paths alongside IP paths for amp-traceroute in NNTSC. The new improved traceroute test will eventually be able to report the AS for each detected hop, which is often more interesting and useful than the specific IP address. For instance, event detection can look for changes in the AS path rather than alerting when a new IP address is observed -- which can happen a lot when your path takes you through Google or Amazon EC2.
We can also colour our rainbow graph based on the AS rather than using a new colour for each address, which should hopefully reduce the patchwork-quilt effect while still being useful to look at.
Released libtrace 3.0.20 on Monday.
Got as much of our development environment up and running again after the power outage over the weekend. There are still a few non-critical things that needed some assistance from Brad that I wasn't able to get going on Monday, but they can wait until next week when we're both here.
On leave from Tuesday for the rest of the week.
Libtrace 3.0.20 has been released today.
This release fixes several bugs that have been reported by users, adds support for LZMA compression to libwandio and adds an API function for getting the fragment offset for an IP packet.
The bugs fixed in this release are:
* Fixed broken snaplen option for ring: input.
* Fixed trace_get_source_port and trace_get_destination_port returning bogus port numbers when given a fragmented packet.
* Fixed timestamp byte ordering on big endian architectures.
* Removed assert failure if a bad compression level or method is provided when configuring an output trace. A libtrace error is raised instead.
* Fixed broken compiler feature checking in configure script. Compiler features are also detected for compilers other than gcc, e.g. clang.
* Fixed potential segfaults in OSPF libpacketdump parser if the packet is truncated midway through the OSPF header.
The OSPF bug fix unfortunately resulted in the 'len' field in the libtrace_ospf_t structure being renamed to 'ospf_len' -- if you are using libtrace to process OSPF packets, please make sure you update your code accordingly.
The full list of changes in this release can be found in the libtrace ChangeLog.
You can download the new version of libtrace from the libtrace website.
Carrying on from last week, storing a cache entry per stream turned out to be a bad idea. Some matrix meshes consist of 100s of streams so we spend a lot of time looking up cache entries. As a result, I rewrote the caching code to store one dictionary per collection, mapping stream ids to tuples containing the timestamps. This gets looked up once per query, so only one cache operation is required to generate a matrix.
Updating the cache when we have to query for missing values is a bit annoying, as we cannot simply update the dictionary and put it back in the cache once the query is complete as the data inserting process may have updated other cache entries with new 'most recent data' timestamps while we were fulfilling our query. Instead, we have to re-fetch the dictionary, update the one stream we're changing and then immediately store the dictionary again.
Updated ampy to no longer keep track of active streams and removed support for ACTIVE_STREAMS queries from the NNTSC protocol.
Merged Perry's lzma support into libwandio. Started working towards a new libtrace release -- managed to build and pass all tests on our various development boxes so should be able to push out a release next week.
Spent a day reading over Meenakshee's thesis. Suggested a series of mostly minor edits and changes but overall it is looking pretty good.