Shane Alcock's Blog
Continued marching towards being able to migrate our prophet database to the updated NNTSC database schema. Discovered a number of cases where AMP tests were reporting failed results but values were still being inserted into the database for fields that should be invalid due to the test failing.
Updated the RRD-Smokeping schema to store the individual ping results as a single column using an array. This caused some problems with our approach for calculating the "smoke" that we show on the graphs, but Brendon the SQL-master was able to come up with some custom aggregation functions that should fix this problem.
Finished looking at the events on amp.wand.net.nz. Also managed to come up with a solution to the single-large-spike problem I had last week. It's not perfect (mainly in that it only works if the spike is exactly one measurement; a two-measurement spike will still have the same problem), but it gets rid of a few annoying insignificant events.
Modified the traceroute pathchange detector to try and reduce the number of events we've been getting for certain targets, most notably NetFlix. The main change is that we now only consider a hop to be "new" if it doesn't match the subnet of any existing hops for that TTL. It's all very naive: a /24 is considered a subnet for IPv4, a /48 for IPv6, but it results in a big improvement. Eventually, I expect us to plug a BGP feed into the detector and look for changes in the AS path rather than the IP path, but this should tide us over until then.
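A minimal sketch of the subnet comparison described above, using Python's standard ipaddress module. The function names and the list of known hops are hypothetical and are not taken from the detector itself; only the /24 and /48 prefix lengths come from the description.

```python
import ipaddress

# Prefix lengths as described above: /24 for IPv4, /48 for IPv6.
PREFIX_LEN = {4: 24, 6: 48}


def subnet_of(addr):
    """Return the naive 'subnet' (a /24 or /48 network) containing addr."""
    ip = ipaddress.ip_address(addr)
    # strict=False masks off the host bits instead of raising an error.
    return ipaddress.ip_network(f"{ip}/{PREFIX_LEN[ip.version]}", strict=False)


def is_new_hop(addr, known_hops):
    """True if addr shares a subnet with none of the hops seen at this TTL."""
    candidate = subnet_of(addr)
    return all(subnet_of(known) != candidate for known in known_hops)


if __name__ == "__main__":
    seen_at_ttl = ["192.0.2.200", "2001:db8:1:ffff::9"]
    print(is_new_hop("192.0.2.17", seen_at_ttl))   # same /24 as a known hop
    print(is_new_hop("198.51.100.1", seen_at_ttl)) # different /24
```

With this check, a load-balanced hop that flips between addresses inside the same /24 no longer counts as a path change.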
Worked with Brad to set up a passive monitor to help ITS diagnose some problems they are having on their network related to broadcast and multicast traffic. Just waiting on ITS to let us know when the problems are occurring so we can narrow down our search for strange behaviour to just traces covering the time periods of interest.
Finished updating NNTSC to deal with traceroute data. The new QueryBuilder code should make query construction a bit less convoluted within the NNTSC dbselect module. Everything seems to work OK in basic testing, so it's now just a matter of migrating over one of our production setups and seeing what breaks.
Continued working through the events on amp.wand.net.nz, looking at events for streams that fall in the 25-100ms and the 300+ms ranges. Results still look very promising overall. Tried to fix another common source of insignificant events (namely a single very large spike that moves our mean so much that subsequent "normal" measurements are treated as slightly abnormal due to their distance from the new mean) but without any tangible success.
Moved libtrace and libprotoident from svn to git and put the repositories up on GitHub. This should make the projects more accessible, particularly to the increasing number of people who want to add support for various formats and protocols. It should also make life easier for me when it comes to pushing out bug fixes to people having specific problems and merging in code contributed by our users.
The source code for both our libtrace and libprotoident libraries is now available on GitHub. Developers can freely clone these projects and make their own modifications or additions to the source code, while keeping up with any changes that we make between releases.
We're also more than happy to consider pull requests for code that adds useful features or support for new protocols / trace formats to our libraries.
Look out for more of our open-source projects to make their way onto GitHub soon!
Spent about half of my week continuing to validate netevmon events on amp.wand.net.nz. After noticing that the TEntropy detectors were tending to alert on pairs of single measurement spikes that were 2-3 minutes apart, I modified the detectors to require a minimum number of measurements contributing entropy (4) before triggering an alert (provided the time series was in a "constant" state). This seems to have removed many insignificant events without affecting the detection of major events, except for some cases where the TEntropy detectors might trigger a little later than they had previously.
Started implementing the new traceroute table schema within NNTSC. Because there are two tables involved (one for paths and one for test results), it is a bit more complicated than the ICMP and DNS tables. Having to cast a list of IP addresses into an SQL array whenever we want to insert into the path table just makes matters worse. At this stage, I've got inserts working sensibly and am now working on making sure we can query the data. As part of this, I am trying to streamline how we construct our queries so that it's easier for us to keep track of all the query components and parameters and keep them in the correct order.
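To illustrate the idea of keeping query components and their parameters together, here is a rough sketch. It is not the actual NNTSC code: the class layout, method names, table name and column names are all made up for the example, and it assumes %s-style placeholders.

```python
class QueryBuilder:
    """Hypothetical sketch: each clause is stored alongside its own parameters."""

    def __init__(self):
        self.clauses = {}

    def add_clause(self, name, sql, params=()):
        # Storing the parameters with the clause means they cannot drift
        # out of step with the placeholders they belong to.
        self.clauses[name] = (sql, list(params))

    def build(self, order):
        """Join the named clauses in the given order; return (sql, params)."""
        sql_parts = []
        params = []
        for name in order:
            sql, clause_params = self.clauses[name]
            sql_parts.append(sql)
            params.extend(clause_params)
        return " ".join(sql_parts), tuple(params)


if __name__ == "__main__":
    qb = QueryBuilder()
    # Clauses can be registered in any order...
    qb.add_clause("where", "WHERE stream_id = %s AND timestamp >= %s", (5, 1000))
    qb.add_clause("select", "SELECT timestamp, rtt FROM example_data")
    qb.add_clause("order", "ORDER BY timestamp")
    # ...because the final ordering is decided only when building.
    sql, params = qb.build(["select", "where", "order"])
    print(sql)
    print(params)
```

The point of the design is that the parameter tuple is assembled in exactly the same order as the SQL text, no matter when each clause was added.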
Finished fixing various parts of Cuz so that it should be able to survive postgres restarts.
Started working on NNTSC version 3, i.e. implementing the database design changes that Brad has been testing. Fortunately, the way we've coded NNTSC meant that this was not overly difficult. Both insertion and querying now appear to work with the new code and we've even fixed a problem we would have had under the old system where a stream could only belong to a single label. Now we query each label in turn, so a stream can belong to as many labels as required to satisfy the query.
Also updated the AMP ICMP and DNS table structure to be more efficient in terms of space required per row.
Spent the latter part of the week working on verifying the events that netevmon has produced for amp.wand.net.nz. Found and fixed some netevmon issues in the process, in particular a bug where we were not subscribing to all the streams for a particular ICMP test, so we were missing chunks of data depending on which addresses were being tested. Overall, the event detection isn't too bad -- we pick up a lot of insignificant events but usually only one detector fires for each one so Meena's fusion techniques should be able to determine that they aren't worth bothering anyone about. The few major events we do get are generally reported by most of the detectors.
Gave a couple of lectures on libtrace for 513.
Short week this week due to being on holiday until Wednesday.
Spent a fair bit of time discussing potential database design improvements with Brendon and Brad. Based on Brad's experiments, it looks like we might be able to make sufficient improvements to the way we use postgres rather than having to move to a non-relational database right away. The main change will be to move to a system where each stream has its own data table rather than storing all streams in the same (partitioned) table. Originally, we were going to use partitions to create the per-stream tables, but postgres doesn't like having lots of partitions so we will also need to update our querying and aggregation approach to cope with having to search through multiple tables.
Went through most of our code that touches the database and made sure it sensibly deals with postgres being restarted. It turns out there are a lot of places in both nntsc and ampy where this can cause problems. By the end of the week though, I seemed to have a system that would generally survive a postgres restart but I'm still not 100% sure about some of the more rarely-hit code paths.
Prepared some slides for the 513 libtrace lectures that I am going to be giving next week. Also made a few tweaks to the assignment after some feedback from Ryan.
Finally fixed all my problems with the history-live overlap when subscribing to NNTSC streams. The solution was to register the subscription as soon as possible (i.e. before doing any querying) and collect all subsequent live data for those streams in a buffer. Once the history queries were complete, the HISTORY_DONE message was extended to include the timestamp of the last row in the returned data so I could skip those rows when pushing the saved live data to the client.
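The buffering approach can be sketched roughly as follows for a single stream. The class and method names are invented for this example and the "sent" list simply stands in for the client connection; only the ordering of steps and the timestamp carried by HISTORY_DONE come from the description above.

```python
class Subscription:
    """Hypothetical single-stream subscription that avoids history/live overlap."""

    def __init__(self):
        # Registered before any history query starts, so no live data is lost.
        self.history_done = False
        self.buffer = []   # live rows held back while history is being queried
        self.sent = []     # stand-in for rows pushed to the client

    def on_live(self, timestamp, value):
        if self.history_done:
            self.sent.append((timestamp, value))
        else:
            self.buffer.append((timestamp, value))

    def on_history(self, rows):
        self.sent.extend(rows)

    def on_history_done(self, last_timestamp):
        # Skip buffered rows that the history query already returned.
        for timestamp, value in self.buffer:
            if timestamp > last_timestamp:
                self.sent.append((timestamp, value))
        self.buffer = []
        self.history_done = True


if __name__ == "__main__":
    sub = Subscription()
    sub.on_live(100, "a")                     # arrives during the history query
    sub.on_live(110, "b")
    sub.on_history([(90, "z"), (100, "a")])   # history also contains row 100
    sub.on_history_done(last_timestamp=100)
    sub.on_live(120, "c")
    print(sub.sent)
```

Row 100 is delivered once (from history) and row 110 is not lost, which are the two failure modes the overlap used to cause.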
Added a QUERY_CANCELLED message to the NNTSC protocol so that clients can know that their database query timed out. Previously we have had problems where a timed-out query simply returned an empty dataset, which was impossible to distinguish from a genuine lack of data. Updated ampy to deal with the new message -- in particular, the results for cancelled queries are no longer cached. Previously they were cached, with fairly obvious bad consequences.
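As a rough illustration of why the distinction matters for caching, consider the sketch below. It is not ampy's real cache: the status constants, the ResultCache class and the run_query callback are all hypothetical stand-ins.

```python
QUERY_OK = "ok"
QUERY_CANCELLED = "cancelled"   # stands in for the new protocol message


class ResultCache:
    """Hypothetical cache that refuses to store results of cancelled queries."""

    def __init__(self, run_query):
        # run_query(key) must return a (status, rows) pair.
        self.run_query = run_query
        self.cache = {}

    def fetch(self, key):
        if key in self.cache:
            return QUERY_OK, self.cache[key]
        status, rows = self.run_query(key)
        if status == QUERY_OK:
            # Only genuine results are cached; an empty list from a
            # timed-out query must not mask data on the next request.
            self.cache[key] = rows
        return status, rows


if __name__ == "__main__":
    attempts = []

    def flaky_backend(key):
        # First call times out, later calls succeed.
        attempts.append(key)
        if len(attempts) == 1:
            return QUERY_CANCELLED, []
        return QUERY_OK, [1, 2, 3]

    cache = ResultCache(flaky_backend)
    print(cache.fetch("stream-5"))  # cancelled, nothing cached
    print(cache.fetch("stream-5"))  # retried against the backend
```

Without the explicit status, the empty list from the first attempt would have been cached and served as if the stream genuinely had no data.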
Also updated the matrix to handle query timeouts better, which also fixes the day_data bug we had which was causing crashes while loading the matrix. If a timeout occurs, a useful error message is also shown to the user.
Released a new version of libtrace on Friday which fixes the FreeBSD 10 build bug, as well as a few other minor problems.
Libtrace 3.0.19 has been released.
The main purpose of this release is to fix a problem that prevented the libtrace 3.0.18 release from building on FreeBSD 10. A number of other minor bugs were also fixed, such as some libpacketdump decoding errors on big-endian CPUs and a bug in the ring: format that led to set_capture_length changing the wire length instead of the capture length.
This release also incorporates a patch from Martin Bligh that adds support for reading pcap traces that support nanosecond timestamp resolution via the pcapfile: URI.
The full list of changes in this release can be found in the libtrace ChangeLog.
You can download the new version of libtrace from the libtrace website.
Continued redevelopment of the NNTSC exporting code to be more robust and reliable. Replaced the live data pipes used by the dataparsers to push live data to the exporter with a RabbitMQ queue, which seems to be working well.
Modified the way that subscribing to streams worked to try and solve a problem we were having where data that arrived while historical data was being queried was not being pushed out to interested clients. Now, we store any live data that arrives for streams that are still being queried and push that out as soon as we get an indication from the query thread that the query has finished.
Unfortunately, we can still miss historical measurements if they haven't been committed to the database at the time when the final query begins. This often crops up if netevmon is resubscribing after NNTSC has been restarted, resulting in us missing out on the last historical measurement before the subscribe message arrives. Still looking for an elegant way to solve this one.
Added a version check message to the NNTSC protocol. This message is sent by the server as soon as a client connects and the client API has been updated to require its internal version to match the one received from the server. If not, the client stops and prints a message telling the user to update their client API. This should be helpful to the students who were previously getting weird broken behaviour with no apparent explanation whenever I made an API change to the production NNTSC on prophet.
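The client-side check amounts to something like the following sketch. The version string, exception name and function name are invented for the example; how the real protocol encodes the version message is not shown here.

```python
CLIENT_VERSION = "1.0"  # hypothetical value baked into the client API


class VersionMismatch(Exception):
    """Raised when the server speaks a different protocol version."""


def check_server_version(server_version, client_version=CLIENT_VERSION):
    """Compare the version announced by the server with the client's own."""
    if server_version != client_version:
        raise VersionMismatch(
            f"server is running version {server_version} but this client API "
            f"is version {client_version}; please update your client API"
        )


if __name__ == "__main__":
    # The server's version message is the first thing received on connect.
    try:
        check_server_version("1.1")
    except VersionMismatch as err:
        print(f"Stopping: {err}")
```

Failing immediately with a clear message is the whole point: a mismatched client otherwise carries on and misparses later messages.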
Chased down a build issue with libtrace on FreeBSD 10. Turns out we had made the dist tarball with an old version of libtool which was stupidly written to never build shared libraries if the OS matched FreeBSD 1* (because FreeBSD 1.X didn't support shared libraries). Easy enough to fix, I just have to remember to make the libtrace distribution on something other than Debian Squeeze. Will start working on a new libtrace release in the near future so I don't keep getting emails from FreeBSD users.
Started going through all the NNTSC exporting code and replacing any instances of blocking sends with non-blocking alternatives. This should ultimately make both NNTSC and netevmon more stable when processing large amounts of historical data. It is also proving a good opportunity to tidy up some of this code, which had gotten a little ropey with all the hacking done on it leading up to NZNOG.
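For reference, the general shape of a non-blocking send looks something like this sketch. It uses only the standard socket module; the NonBlockingSender class is hypothetical and much simpler than what the exporter needs.

```python
import socket


class NonBlockingSender:
    """Hypothetical wrapper that queues data instead of blocking on send."""

    def __init__(self, sock):
        sock.setblocking(False)
        self.sock = sock
        self.pending = b""

    def send(self, data):
        """Queue data and try to send it; True if everything went out."""
        self.pending += data
        return self.flush()

    def flush(self):
        """Send as much queued data as the socket will take right now."""
        while self.pending:
            try:
                sent = self.sock.send(self.pending)
            except BlockingIOError:
                # Socket buffer is full: keep the remainder for later
                # rather than stalling the whole exporter.
                return False
            self.pending = self.pending[sent:]
        return True


if __name__ == "__main__":
    ours, theirs = socket.socketpair()
    sender = NonBlockingSender(ours)
    print(sender.send(b"hello"))
    print(theirs.recv(5))
    ours.close()
    theirs.close()
```

In practice flush() would be called again once select or poll reports the socket as writable, so a slow client only delays its own data.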
Spent a decent chunk of my week catching up on various support requests. Had two separate people email about issues with BSOD on Friday.
Wrote a draft version of this year's libtrace assignment for 513. I've changed it quite a bit from last year's, based on what the students managed to achieve last year. The assignment itself should require a bit more work this time around, but should be easily doable in just C rather than requiring the additional learning curve of the STL. It should also be much harder to just rip off the examples :)
Read through the full report on a study into traffic classifier accuracy that evaluated libprotoident along with a bunch of other classifiers (http://vbn.aau.dk/files/179043085/TBU_Extended_dpi_report.pdf). Pleased to see that libprotoident did extremely well in the cases where it would be expected to do well, i.e. non-web applications.