Shane Alcock's Blog
Finished fixing various parts of Cuz so that it should be able to survive postgres restarts.
Started working on NNTSC version 3, i.e. implementing the database design changes that Brad has been testing. Fortunately, the way we've coded NNTSC meant that this was not overly difficult. Both insertion and querying now appear to work with the new code and we've even fixed a problem we would have had under the old system where a stream could only belong to a single label. Now we query each label in turn, so a stream can belong to as many labels as required to satisfy the query.
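A minimal sketch of the per-label querying idea, assuming labels map to lists of stream ids and that some callable does the actual database fetch. The names `query_by_label` and `fetch` are hypothetical and are not taken from the NNTSC code.

```python
def query_by_label(labels, fetch):
    """Run one query per label so a stream may appear under several labels.

    labels: dict mapping a label name to a list of stream ids (assumed shape).
    fetch:  callable taking a list of stream ids and returning rows
            (a stand-in for the real database query).
    """
    results = {}
    for label, stream_ids in labels.items():
        # Each label gets its own query, so nothing forces a stream to be
        # owned by exactly one label.
        results[label] = fetch(stream_ids)
    return results
```

The cost is one query per label rather than one per request, which is the trade-off for dropping the one-label-per-stream restriction.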
Also updated the AMP ICMP and DNS table structure to be more efficient in terms of space required per row.
Spent the latter part of the week working on verifying the events that netevmon has produced for amp.wand.net.nz. Found and fixed some netevmon issues in the process, in particular a bug where we were not subscribing to all the streams for a particular ICMP test, so we were missing chunks of data depending on which addresses were being tested. Overall, the event detection isn't too bad -- we pick up a lot of insignificant events but usually only one detector fires for each one so Meena's fusion techniques should be able to determine that they aren't worth bothering anyone about. The few major events we do get are generally reported by most of the detectors.
Gave a couple of lectures on libtrace for 513.
Short week this week due to being on holiday until Wednesday.
Spent a fair bit of time discussing potential database design improvements with Brendon and Brad. Based on Brad's experiments, it looks like we might be able to make sufficient improvements to the way we use postgres rather than having to move to a non-relational database right away. The main change will be to move to a system where each stream has its own data table rather than storing all streams in the same (partitioned) table. Originally, we were going to use partitions to create the per-streams table, but postgres doesn't like having lots of partitions so we will also need to update our querying and aggregation approach to cope with having to search through multiple tables.
Went through most of our code that touches the database and made sure it sensibly deals with postgres being restarted. It turns out there are a lot of places in both nntsc and ampy where this can cause problems. By the end of the week though, I seemed to have a system that would generally survive a postgres restart but I'm still not 100% sure about some of the more rarely-hit code paths.
Prepared some slides for the 513 libtrace lectures that I am going to be giving next week. Also made a few tweaks to the assignment after some feedback from Ryan.
Finally fixed all my problems with the history-live overlap when subscribing to NNTSC streams. The solution was to register the subscription as soon as possible (i.e. before doing any querying) and collect all subsequent live data for those streams in a buffer. Once the history queries were complete, the HISTORY_DONE message was extended to include the timestamp of the last row in the returned data so I could skip those rows when pushing the saved live data to the client.
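The buffering approach described above can be sketched roughly as follows. This is an illustration only: the `Subscription` class, the `timestamp` row key and the `send` callback are assumptions, not the real NNTSC exporter interfaces.

```python
class Subscription:
    """Holds back live rows for one stream until its history query is done."""

    def __init__(self):
        self.history_done = False
        self.buffer = []

    def on_live(self, row, send):
        # Registered before any querying starts, so no live row is missed.
        if self.history_done:
            send(row)
        else:
            self.buffer.append(row)

    def on_history_done(self, last_history_ts, send):
        # last_history_ts is the timestamp of the last row the history
        # query returned; anything at or before it was already delivered.
        self.history_done = True
        for row in self.buffer:
            if row["timestamp"] > last_history_ts:
                send(row)
        self.buffer = []
```

Comparing against the last history timestamp is what removes the duplicates; registering early is what removes the gaps.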
Added a QUERY_CANCELLED message to the NNTSC protocol so that clients can know that their database query timed out. Previously we have had problems where a timed-out query simply returned an empty dataset, which was impossible to distinguish from a genuine lack of data. Updated ampy to deal with the new message -- in particular, the results for cancelled queries are no longer cached. Previously they were, with fairly obvious bad consequences.
Also updated the matrix to handle query timeouts better, which also fixes the day_data bug that was causing crashes while loading the matrix. If a timeout occurs, a useful error message is now shown to the user.
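To show why the caching matters, here is a small sketch of a cache that refuses to store cancelled results. The `QueryCache` class, the `(status, rows)` return shape and the status strings are hypothetical; ampy's real caching layer is not shown in this post.

```python
QUERY_CANCELLED = "QUERY_CANCELLED"  # stand-in for the protocol message


class QueryCache:
    """Caches query results, but never the result of a cancelled query."""

    def __init__(self, run_query):
        # run_query(key) is assumed to return a (status, rows) tuple.
        self.run_query = run_query
        self.cache = {}

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        status, rows = self.run_query(key)
        if status == QUERY_CANCELLED:
            # None means "timed out", which is distinct from [] meaning
            # "no data". Nothing is cached, so a retry can still succeed.
            return None
        self.cache[key] = rows
        return rows
```

Had the empty result of a timed-out query been cached, every later request for the same key would keep returning "no data" even after the database recovered.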
Released a new version of libtrace on Friday which fixes the FreeBSD 10 build bug, as well as a few other minor problems.
Libtrace 3.0.19 has been released.
The main purpose of this release is to fix a problem that prevented the libtrace 3.0.18 release from building on FreeBSD 10. A number of other minor bugs were also fixed, such as some libpacketdump decoding errors on big-endian CPUs and a bug in the ring: format that led to set_capture_length changing the wire length instead of the capture length.
This release also incorporates a patch from Martin Bligh that adds support for reading pcap traces with nanosecond timestamp resolution via the pcapfile: URI.
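For context, nanosecond-resolution pcap files are distinguished from ordinary ones by the magic number at the start of the file (0xa1b23c4d instead of 0xa1b2c3d4). The sketch below checks for it in Python; it is not libtrace's implementation, and the function name is made up for illustration.

```python
import struct

PCAP_MAGIC_USEC = 0xA1B2C3D4  # classic pcap, microsecond timestamps
PCAP_MAGIC_NSEC = 0xA1B23C4D  # pcap variant with nanosecond timestamps


def pcap_timestamp_resolution(header):
    """Return 'usec', 'nsec' or None given the first bytes of a pcap file."""
    if len(header) < 4:
        return None
    # The file is written in the byte order of the capturing host, so try
    # both little-endian and big-endian interpretations of the magic.
    for fmt in ("<I", ">I"):
        (magic,) = struct.unpack(fmt, header[:4])
        if magic == PCAP_MAGIC_USEC:
            return "usec"
        if magic == PCAP_MAGIC_NSEC:
            return "nsec"
    return None
```

A reader that knows the resolution can then treat the second timestamp field of each packet record as either microseconds or nanoseconds.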
The full list of changes in this release can be found in the libtrace ChangeLog.
You can download the new version of libtrace from the libtrace website.
Continued redevelopment of the NNTSC exporting code to be more robust and reliable. Replaced the live data pipes used by the dataparsers to push live data to the exporter with a RabbitMQ queue, which seems to be working well.
Modified the way that subscribing to streams works to try to solve a problem where data that arrived while historical data was being queried was not being pushed out to interested clients. Now, we store any live data that arrives for streams that are still being queried and push that out as soon as we get an indication from the query thread that the query has finished.
Unfortunately, we can still miss historical measurements if they haven't been committed to the database at the time when the final query begins. This often crops up if netevmon is resubscribing after NNTSC has been restarted, resulting in us missing out on the last historical measurement before the subscribe message arrives. Still looking for an elegant way to solve this one.
Added a version check message to the NNTSC protocol. This message is sent by the server as soon as a client connects and the client API has been updated to require its internal version to match the one received from the server. If not, the client stops and prints a message telling the user to update their client API. This should be helpful to the students who were previously getting weird broken behaviour with no apparent explanation whenever I made an API change to the production NNTSC on prophet.
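A rough sketch of the client-side check, assuming the version arrives as a simple string. The constant, the exception class and the function name are all hypothetical, not the actual NNTSC client API.

```python
CLIENT_PROTOCOL_VERSION = "1.0"  # hypothetical value baked into the client


class VersionMismatch(Exception):
    """Raised when the server and client API versions disagree."""


def check_server_version(server_version,
                         client_version=CLIENT_PROTOCOL_VERSION):
    # Called with the contents of the first message the server sends
    # after the client connects.
    if server_version != client_version:
        raise VersionMismatch(
            "Server is using protocol version %s but this client API is "
            "version %s; please update your client API."
            % (server_version, client_version))
```

Failing loudly at connect time replaces the "weird broken behaviour" with a message that says what to do about it.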
Chased down a build issue with libtrace on FreeBSD 10. Turns out we had made the dist tarball with an old version of libtool which was stupidly written to never build shared libraries if the OS matched FreeBSD 1* (because FreeBSD 1.X didn't support shared libraries). Easy enough to fix: I just have to remember to make the libtrace distribution on something other than Debian Squeeze. Will start working on a new libtrace release in the near future so I don't keep getting emails from FreeBSD users.
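The pattern problem is easy to reproduce with shell-style globbing, which Python exposes through `fnmatch`. The host OS strings and the stricter pattern below are illustrative assumptions; they are not copied from any particular libtool version.

```python
from fnmatch import fnmatchcase

OLD_PATTERN = "freebsd1*"      # meant to catch FreeBSD 1.x only
STRICT_PATTERN = "freebsd1.*"  # illustrative pattern requiring the dot


def shared_libs_disabled(host_os, pattern):
    """Mimic a shell 'case' branch that turns off shared library builds."""
    return fnmatchcase(host_os, pattern)


# The old glob also swallows FreeBSD 10 and later, since "10.0" starts
# with "1" just like "1.1" does.
print(shared_libs_disabled("freebsd10.0", OLD_PATTERN))     # True
print(shared_libs_disabled("freebsd10.0", STRICT_PATTERN))  # False
```

Any version string beginning with "1" hits the old branch, so the bug only surfaced once FreeBSD reached double digits.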
Started going through all the NNTSC exporting code and replacing any instances of blocking sends with non-blocking alternatives. This should ultimately make both NNTSC and netevmon more stable when processing large amounts of historical data. It is also proving a good opportunity to tidy up some of this code, which had gotten a little ropey with all the hacking done on it leading up to NZNOG.
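As a sketch of what a non-blocking alternative can look like, the class below queues outgoing bytes and sends only as much as the socket will accept without waiting. `NonBlockingSender` is a hypothetical name and this is not the NNTSC exporter code; it just uses the standard `socket` module.

```python
import socket


class NonBlockingSender:
    """Queues outgoing bytes and never blocks waiting for the peer."""

    def __init__(self, sock):
        sock.setblocking(False)
        self.sock = sock
        self.pending = b""

    def queue(self, data):
        self.pending += data

    def flush(self):
        """Send what the socket accepts now; return the bytes still queued."""
        while self.pending:
            try:
                sent = self.sock.send(self.pending)
            except BlockingIOError:
                # Socket buffer is full: give up for now rather than
                # stalling, and try again on the next call.
                break
            self.pending = self.pending[sent:]
        return len(self.pending)
```

Because `flush` always returns, the caller is free to go and read from the same connection in between calls, which is what breaks the situation where both ends are stuck in a send to each other. In a real event loop `flush` would be called whenever `select` or similar reports the socket as writable.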
Spent a decent chunk of my week catching up on various support requests. Had two separate people email about issues with BSOD on Friday.
Wrote a draft version of this year's libtrace assignment for 513. I've changed it quite a bit from last year's, based on what the students managed to achieve last year. The assignment itself should require a bit more work this time around, but should be easily doable in just C rather than requiring the additional learning curve of the STL. It should also be much harder to just rip off the examples :)
Read through the full report on a study into traffic classifier accuracy that evaluated libprotoident along with a bunch of other classifiers ( http://vbn.aau.dk/files/179043085/TBU_Extended_dpi_report.pdf ). Pleased to see that libprotoident did extremely well in the cases where it would be expected to do well, i.e. non-web applications.
Spent a lot of time chasing down deadlock behaviour in netevmon when it first starts up. The problem ultimately turned out to be that anomalyfeed was requesting a large amount of stream data from NNTSC, which was causing both ends to get stuck trying to complete a blocking send to the other. Reduced the likelihood of this occurring in the future by forcing anomalyfeed to wait for all streams for a collection to arrive before asking for any more streams, but the proper solution is going to be moving to non-blocking transmits.
Also replaced some pipes within the NNTSC exporting code with Queues, as full pipes were also causing problems. These problems were much worse, as one of the full pipes would stop NNTSC from processing new data and inserting it into the database.
Fixed a segfault in anomaly_ts due to reading off the end of a buffer. The problem was that we were using strchr to look for a newline character but never checking if the character we found was within sensible bounds.
Spent last week at NZNOG where we managed to give a reasonably successful presentation of everything we've done up until now. Managed to generate a bit of interest from operators, so we must be doing something right.
Replaced the event descriptions produced by netevmon with something a bit more human-readable. This was somewhat annoying to achieve, as it required passing a lot of extra parameters into each detector, e.g. the units that the time series is measured in, the metric itself, the scale factor for the raw data (e.g. bytes per period into Mbps).
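As a worked example of the scale factor, converting a byte count per measurement period into Mbps means multiplying by 8 and dividing by the period length and by one million. The function names and the description wording below are hypothetical; netevmon's detectors are not written in Python.

```python
def bytes_to_mbps(byte_count, period_seconds):
    """Convert bytes seen in one period into megabits per second."""
    bits = byte_count * 8
    return bits / period_seconds / 1e6


def describe_rate_change(metric, old_bytes, new_bytes, period_seconds):
    """Build a readable description; the wording here is made up."""
    old = bytes_to_mbps(old_bytes, period_seconds)
    new = bytes_to_mbps(new_bytes, period_seconds)
    return "%s changed from %.1f Mbps to %.1f Mbps" % (metric, old, new)
```

For instance, 75,000,000 bytes in a 60 second period is 600,000,000 bits, or 10 Mbps.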
Made the amp-web graphs appear more responsive by displaying components as soon as their ajax call completes, rather than waiting for all the ajax to complete before rendering anything. In practical terms, this means the detail graph appears much sooner rather than having to wait for the query for 30 days of summary data to finish. I've also split the summary data query into multiple queries so the summary graph will now appear in increments, almost acting like a progress bar.
Tried to get netevmon deployed on skeptic, without much success so far. It seems that we can run it against a particular collection but as soon as you try to include all of the collections, the whole thing grinds to a halt and eventually prevents NNTSC from processing new data. Hopefully, we can find the cause of the problem early next week.
Fixed a bunch of other minor bugs / errors across Cuz in between times, as we try to get closer to something we can show off at NZNOG.
Got back into the swing of things by spending the week fixing a multitude of UI problems and general bugs in Cuz, with the aim of getting closer to something we feel comfortable demonstrating at NZNOG.
The main improvements are:
* Finally added a "graph browser" page which lets the user choose a collection to explore.
* Event groups are shown on graphs rather than individual events. This greatly reduces clutter when big events occur.
* Fixed various inconsistencies between the line colour shown on the legend and the line colour actually being drawn on the graph.
* Stopped creating tabs that go to empty graphs.
* Fixed a bug where the rainbow summary graph would only show the first couple of hops rather than the entire path.
* Added basic tooltips to the legend which show more detail about the group being moused over, e.g. what exactly is represented by each line colour.
* Better handling of database exceptions in Cuz so that Brendon's buggy AMP test results don't crash NNTSC :)
Updated the event tooltips to better describe the group that the event belongs to, as it was previously difficult to tell which line the event corresponded to when multiple lines were drawn on the graph.
Brad's rainbow graph is now used whenever an AMP traceroute event is clicked on in the dashboard. Fixed a couple of bugs with the rainbow graph, the main one being that it was rendering the heavily aggregated summary data in the detail graph instead of the detailed data.
Replaced the old hop count event detection for traceroute data with a detector that reports when a hop in the path has changed.
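One simple reading of "a hop in the path has changed" is a position-by-position comparison of consecutive paths. The sketch below takes that reading; the real netevmon detector may well be more sophisticated, and `changed_hops` is a hypothetical name.

```python
def changed_hops(old_path, new_path):
    """Return (hop_number, old_addr, new_addr) for each hop that differs.

    Paths are lists of hop addresses; a hop missing from the shorter
    path is reported as None.
    """
    changes = []
    for i in range(max(len(old_path), len(new_path))):
        old = old_path[i] if i < len(old_path) else None
        new = new_path[i] if i < len(new_path) else None
        if old != new:
            changes.append((i + 1, old, new))
    return changes
```

Unlike a pure hop count check, this notices a route change that swaps one router for another while leaving the path length the same.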
Fixed a tricky little bug in NNTSC where large aggregate data queries were being broken up into time periods that did not align with the requested binsize, so a bin would straddle two queries. This would produce two results for the same bin and was causing the summary graph to stop several hours short of the right hand edge.
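The fix amounts to making sure every internal boundary between sub-queries falls on a bin edge. A minimal sketch, assuming bins are aligned to multiples of the binsize and that time is in integer seconds; `aligned_chunks` and its parameters are hypothetical names.

```python
def aligned_chunks(start, end, binsize, bins_per_chunk):
    """Split [start, end) into sub-queries that never cut a bin in half."""
    chunks = []
    chunk_start = start
    while chunk_start < end:
        # Tentative boundary, then rounded down to a multiple of binsize
        # so that no bin straddles two sub-queries.
        boundary = chunk_start + bins_per_chunk * binsize
        boundary -= boundary % binsize
        chunk_end = min(boundary, end)
        chunks.append((chunk_start, chunk_end))
        chunk_start = chunk_end
    return chunks
```

Only the very first start and the very last end can be unaligned, and those are the edges the caller asked for, so each bin is produced by exactly one sub-query.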
Started working on making the tabs allowing access to "similar" graphs operational again. Have got this working for LPI, which is the most complicated case, so it shouldn't be too hard to get tabs going for everything else again before the end of the year.