
Shane Alcock's Blog




Fixed problems we were having with netevmon causing NNTSC to fill up its queues and therefore use huge amounts of memory. There were two components to this fix: the most effective change was to modify netevmon to only ask for one stream at a time (previously we asked for them all at once, as this was the most efficient way to query the old database schema). The other change was to compress the pickled query result before exporting it, which reduced the queue footprint and also meant we could send the data faster, so the queue would drain more quickly.
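The compression change is straightforward to sketch. This is a minimal illustration using the standard pickle and zlib modules, not NNTSC's actual export code; the function names are mine:

```python
import pickle
import zlib

def pack_result(rows):
    """Pickle a query result and compress it before queueing it for
    export (illustrative sketch; names are not NNTSC's actual API)."""
    return zlib.compress(pickle.dumps(rows, pickle.HIGHEST_PROTOCOL))

def unpack_result(blob):
    """Reverse the operation on the receiving side."""
    return pickle.loads(zlib.decompress(blob))

# Measurement data is highly repetitive, so it compresses well.
rows = [{"stream": 1, "ts": 1365000000 + i, "rtt": 42.0} for i in range(1000)]
packed = pack_result(rows)
assert unpack_result(packed) == rows
assert len(packed) < len(pickle.dumps(rows, pickle.HIGHEST_PROTOCOL))
```

Because the compressed payload is a fraction of the raw pickle for this kind of data, both the queue footprint and the time spent on the wire shrink accordingly.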

Fixed a bug in ampy that was preventing events from showing up on the graphs or the dashboard. We now have a fully functioning netevmon running on prophet again.

Spent a couple of days going over the AMP event ground truth I generated a few weeks back after Meena reported that there were a number of events being reported now that didn't have ground truth. This was due to the changes and improvements I had made to netevmon while working on the ground truth -- as a result, some events disappeared but there were also a few new ones that took their place. Noticed a few bugs in Meena's new eventing script while I was doing this where it was reporting incorrect stream properties, so I tracked those down for her while I was at it.

Wrote a NNTSC dataparser for the new AMP throughput test. Found a few bugs in the test itself for Brendon to solve, but both the test and the dataparser seem to be working in the most basic cases.

Had a play with Nevil's python-libtrace code and reported a few bugs and missing features back to him. Looking forward to those being fixed as it is pretty nifty otherwise.




Updated the AMP dataparser in NNTSC to process more messages in a single batch before committing. This should improve speed when working through a large message backlog, as well as save on some I/O time during normal operation. This change required some modification to the way we handle disconnects and other errors, as we now have to re-insert all the previously uncommitted messages so we can't just disconnect and retry the current message.
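The batching logic described above can be sketched as follows. This is a simplified model with hypothetical names (the real dataparser is more involved); the key point is that messages are only acknowledged once the batch containing them has been committed, so the whole uncommitted batch can be replayed after an error:

```python
class BatchInserter:
    """Commit messages in batches rather than one at a time
    (illustrative sketch, not NNTSC's actual dataparser)."""
    def __init__(self, db, batchsize=50):
        self.db = db
        self.batchsize = batchsize
        self.pending = []     # inserted but not yet committed

    def process(self, msg):
        self.pending.append(msg)
        self.db.insert(msg)
        if len(self.pending) >= self.batchsize:
            self.db.commit()
            self.pending = []   # only now is it safe to ack these

    def recover(self):
        """After a disconnect, re-insert everything that was inserted
        but never committed, not just the current message."""
        replay, self.pending = self.pending, []
        for msg in replay:
            self.process(msg)

class FakeDB:
    """Stand-in for the real database connection, for demonstration."""
    def __init__(self):
        self.committed = []
        self.uncommitted = []
    def insert(self, msg):
        self.uncommitted.append(msg)
    def commit(self):
        self.committed.extend(self.uncommitted)
        self.uncommitted = []

db = FakeDB()
b = BatchInserter(db, batchsize=3)
for i in range(7):
    b.process(i)
assert db.committed == [0, 1, 2, 3, 4, 5]   # two full batches committed
assert b.pending == [6]                     # last message awaits its batch
```

The trade-off is exactly the one mentioned above: fewer commits means less I/O, but error handling has to track everything since the last commit.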

Tried to bring our database cursor management in line with suggested best practice, i.e. closing cursors whenever we're done with them.
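The pattern in question looks like this. I'm using sqlite3 here purely so the sketch runs without a database server; psycopg2 cursors support the same `contextlib.closing` idiom:

```python
import sqlite3
from contextlib import closing

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE streams (id INTEGER, name TEXT)")
conn.execute("INSERT INTO streams VALUES (1, 'icmp')")

# closing() guarantees the cursor is released as soon as we are done
# with it, even if the query raises an exception.
with closing(conn.cursor()) as cur:
    cur.execute("SELECT name FROM streams WHERE id = ?", (1,))
    row = cur.fetchone()
assert row == ("icmp",)
```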

Improved exporting performance by limiting frequency calculations to the first 200 rows and using a RealDictCursor rather than a DictCursor to fetch query results. The RealDictCursor means we don't need to convert results into dictionaries ourselves -- they are already in the right format so we can avoid touching most rows by simply chucking them straight into our result.
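The frequency-limiting part of that change can be sketched in isolation. Here the rows are plain dicts, which is also the shape a psycopg2 RealDictCursor hands back directly; the function name and 200-row cap are as described above, but the implementation is my own illustration:

```python
from collections import Counter

def estimate_frequency(rows, maxrows=200):
    """Estimate the measurement interval of a time series from the
    gaps between the first `maxrows` timestamps only, rather than
    scanning the entire query result (illustrative sketch)."""
    sample = rows[:maxrows]
    gaps = Counter(b["timestamp"] - a["timestamp"]
                   for a, b in zip(sample, sample[1:]))
    # The most common gap is our best guess at the frequency.
    return gaps.most_common(1)[0][0]

rows = [{"timestamp": 1000 + 60 * i} for i in range(500)]
rows[10]["timestamp"] += 5   # one irregular measurement doesn't matter
assert estimate_frequency(rows) == 60
```

Capping the scan means the cost of the estimate no longer grows with the size of the result being exported.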

Spent some time helping Meena write a script to batch-process her event data. This should allow us to easily repeat her event grouping and significance calculations using various parameters without requiring manual intervention. Found a few bugs along the way which have now been fixed.

Was planning to work the short week between Easter and Anzac day but fell ill with a cold instead.




Finished purging the last of the SQLAlchemy code from NNTSC. Once that was working, I was able to create a new class hierarchy for our database code to reduce the amount of duplicate code and ensure that we handle error cases consistently across all query types.

Split insertion operations across two different transactions: one for stream-related operations and one for measurement results. This allows us to commit new streams and data tables without having to commit any data results, which is an important step towards better synchronisation between the database and the messages in the Rabbit queue.

Spent a lot of time tracking down and fixing various error cases that were not being caught and handled within NNTSC. A lot of this work was focused on ensuring that no data was lost or duplicated after recovering from an error or a database restart, especially given our attempts to move towards committing less often.

Migrated the prophet development database over to the new NNTSC schema on Thursday. Generally things went pretty smoothly and we are now turning our attention to migrating skeptic and the live website as soon as possible.




Updated NNTSC to include the new 'smoke' and 'smokearray' aggregation functions. Replaced all calls to get_percentile_data in ampy with calls to get_aggregate_data using the new aggregation functions. Fixed a few glitches in amp-web resulting from changes to field names due to the switch-over.

Marked the 513 libtrace assignments. Overall, the quality of submissions was very good, with many students demonstrating a high level of understanding rather than just blindly copying from examples.

Modified NNTSC to handle a rare situation where we can try to insert a stream that already exists -- this can happen if two data-inserting NNTSCs are running on the same host. Now we detect the duplicate and return the stream id of the existing stream so NNTSC can update its own stream map to include the missing stream.
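The detect-and-recover pattern looks roughly like this. Again I'm using sqlite3 so the sketch is runnable standalone; the real code runs psycopg2 against postgres, but the shape (catch the uniqueness violation, then look up the existing row's id) is the same:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE streams (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source TEXT, dest TEXT,
    UNIQUE (source, dest))""")

def insert_stream(conn, source, dest):
    """Insert a stream, but if another data-inserting NNTSC has
    already created it, return the existing stream id so the caller
    can update its own stream map (illustrative sketch)."""
    try:
        cur = conn.execute(
            "INSERT INTO streams (source, dest) VALUES (?, ?)",
            (source, dest))
        return cur.lastrowid
    except sqlite3.IntegrityError:
        cur = conn.execute(
            "SELECT id FROM streams WHERE source = ? AND dest = ?",
            (source, dest))
        return cur.fetchone()[0]

first = insert_stream(conn, "amp-a", "amp-b")
second = insert_stream(conn, "amp-a", "amp-b")   # duplicate from the other NNTSC
assert first == second
```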

Discovered that our new table-heavy database schema was using a lot of memory due to SQLAlchemy trying to maintain a dictionary mapping all of the table names to table objects. This prompted me to finally rip out the last vestiges of SQLAlchemy from NNTSC. This involved replacing all of our table creation and data insertion code with psycopg2 and explicit SQL commands constructed programmatically. Unfortunately, this will delay our database migration by at least another week, but it will also end up simplifying our database code somewhat.
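"Constructed programmatically" means something like the following: build the CREATE TABLE statement from a column description rather than from ORM table objects. This is an illustrative sketch, not NNTSC's actual code, and it assumes identifiers come from trusted config rather than user input:

```python
import sqlite3

def build_create_table(name, columns):
    """Build a CREATE TABLE statement from (name, type) pairs,
    roughly how table creation can work once an ORM is removed
    (illustrative; identifiers must come from trusted config)."""
    coldefs = ", ".join("%s %s" % (cname, ctype) for cname, ctype in columns)
    return "CREATE TABLE %s (%s)" % (name, coldefs)

sql = build_create_table("data_amp_icmp",
        [("stream_id", "INTEGER"), ("timestamp", "INTEGER"),
         ("rtt", "REAL")])
assert sql == ("CREATE TABLE data_amp_icmp "
               "(stream_id INTEGER, timestamp INTEGER, rtt REAL)")

# The generated statement is valid SQL (demonstrated here with sqlite).
conn = sqlite3.connect(":memory:")
conn.execute(sql)
```

The win over the ORM approach is that nothing needs to hold every table object in memory: the schema description is just data.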




Continued marching towards being able to migrate our prophet database to the updated NNTSC database schema. Discovered a number of cases where AMP tests were reporting failed results but values were still being inserted into the database for fields that should be invalid due to the test failing.

Updated the RRD-Smokeping schema to store the individual ping results as a single column using an array. This caused some problems with our approach for calculating the "smoke" that we show on the graphs, but Brendon the SQL-master was able to come up with some custom aggregation functions that should fix this problem.
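To make the "smoke" problem concrete, here is a Python sketch of the kind of computation those aggregation functions need to perform over the array column: collapse the individual pings for an interval into a median plus the sorted spread that gets drawn as smoke. This is my illustration of the idea, not Brendon's actual SQL; losses are represented as None:

```python
def smoke(pings):
    """Collapse an array of individual ping results into the median
    and the sorted spread used to draw smokeping-style 'smoke'
    (illustrative sketch; losses are stored as None)."""
    valid = sorted(p for p in pings if p is not None)
    if not valid:
        return None, []
    mid = len(valid) // 2
    if len(valid) % 2:
        median = valid[mid]
    else:
        median = (valid[mid - 1] + valid[mid]) / 2.0
    return median, valid

median, spread = smoke([12.5, None, 11.5, 13.5, 12.0])
assert median == 12.25
assert spread == [11.5, 12.0, 12.5, 13.5]
```

Doing this inside postgres (via custom aggregates over the array column) avoids shipping every individual ping back to the client just to draw one graph row.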

Finished looking at the events. Also managed to come up with a solution to the single-large-spike problem I had last week. It's not perfect (mainly in that it only works if the spike is exactly one measurement; a two-measurement spike will still have the same problem), but it gets rid of a few annoying insignificant events.

Modified the traceroute pathchange detector to try and reduce the number of events we've been getting for certain targets, most notably NetFlix. The main change is that we now only consider a hop to be "new" if it doesn't match the subnet of any existing hops for that TTL. It's all very naive: a /24 is considered a subnet for IPv4, a /48 for IPv6, but it results in a big improvement. Eventually, I expect us to plug a BGP feed into the detector and look for changes in the AS path rather than the IP path, but this should tide us over until then.
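The naive subnet check described above is easy to express with the standard ipaddress module. This is a sketch of the heuristic, not the detector's actual code:

```python
import ipaddress

def same_subnet(hop_a, hop_b):
    """Naive subnet comparison: a /24 for IPv4 and a /48 for IPv6,
    as described above (illustrative sketch)."""
    a = ipaddress.ip_address(hop_a)
    b = ipaddress.ip_address(hop_b)
    if a.version != b.version:
        return False
    prefix = 24 if a.version == 4 else 48
    net = ipaddress.ip_network("%s/%d" % (hop_a, prefix), strict=False)
    return b in net

def is_new_hop(hop, existing_hops):
    """A hop only counts as 'new' for a TTL if it matches the subnet
    of none of the hops previously seen at that TTL."""
    return not any(same_subnet(hop, e) for e in existing_hops)

assert same_subnet("203.0.113.5", "203.0.113.200")       # same /24
assert not same_subnet("203.0.113.5", "203.0.114.5")     # different /24
assert same_subnet("2001:db8:1::1", "2001:db8:1:0:ffff::1")  # same /48
assert is_new_hop("198.51.100.1", ["203.0.113.5"])
```

Load-balanced targets like NetFlix tend to rotate addresses within the same block at a given hop, which is why even this crude grouping suppresses so many spurious path-change events.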

Worked with Brad to set up a passive monitor to help ITS diagnose some problems they are having on their network related to broadcast and multicast traffic. Just waiting on ITS to let us know when the problems are occurring so we can narrow down our search for strange behaviour to just traces covering the time periods of interest.




Finished updating NNTSC to deal with traceroute data. The new QueryBuilder code should make query construction a bit less convoluted within the NNTSC dbselect module. Everything seems to work OK in basic testing, so it's now just a matter of migrating over one of our production setups and seeing what breaks.

Continued working through the events, looking at events for streams that fall in the 25-100ms and the 300+ms ranges. Results still look very promising overall. Tried to fix another common source of insignificant events (namely a single very large spike that moves our mean so much that subsequent "normal" measurements are treated as slightly abnormal due to their distance from the new mean) but without any tangible success.

Moved libtrace and libprotoident from svn to git and put the repositories up on github. This should make the projects more accessible, particularly to the increasing number of people who want to add support for various formats and protocols. It should also make life easier for me when it comes to pushing out bug fixes to people having specific problems and merging in code contributed by our users.




The source code for both our libtrace and libprotoident libraries is now available on GitHub. Developers can freely clone these projects and make their own modifications or additions to the source code, while keeping up with any changes that we make between releases.

We're also more than happy to consider pull requests for code that adds useful features or support for new protocols / trace formats to our libraries.

Look out for more of our open-source projects to make their way onto GitHub soon!

Libtrace on GitHub
Libprotoident on GitHub




Spent about half of my week continuing to validate netevmon events. After noticing that the TEntropy detectors were tending to alert on pairs of single-measurement spikes that were 2-3 minutes apart, I modified the detectors to require a minimum number of measurements contributing entropy (4) before triggering an alert (provided the time series was in a "constant" state). This seems to have removed many insignificant events without affecting the detection of major events, except for some cases where the TEntropy detectors might trigger a little later than they had previously.
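The gating rule itself is simple. A minimal sketch, with hypothetical names, of the condition the detectors now apply before firing:

```python
def should_alert(contributors, series_state, minimum=4):
    """Only let a TEntropy-style detector fire if enough measurements
    contributed entropy, but apply the rule only while the series is
    in a 'constant' state (illustrative sketch of the rule above)."""
    if series_state == "constant" and len(contributors) < minimum:
        return False
    return True

# Two isolated spikes a few minutes apart no longer trigger...
assert not should_alert([301.2, 298.7], "constant")
# ...but a sustained change with enough contributing measurements does.
assert should_alert([301.2, 298.7, 305.0, 299.9], "constant")
```

The slight delay mentioned above is the cost of this rule: the detector must wait until the fourth contributing measurement arrives before it can fire.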

Started implementing the new traceroute table schema within NNTSC. Because there are two tables involved (one for paths and one for test results), it is a bit more complicated than the ICMP and DNS tables. Having to cast a list of IP addresses into an SQL array whenever we want to insert into the path table just makes matters worse. At this stage, I've got inserts working sensibly and am now working on making sure we can query the data. As part of this, I am trying to streamline how we construct our queries so that it's easier for us to keep track of all the query components and parameters and keep them in the correct order.
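For the array cast, the path has to end up as a postgres array literal (psycopg2 can also adapt Python lists to arrays directly). A small sketch of rendering the literal by hand, with illustrative names:

```python
def to_pg_array(addresses):
    """Render a Python list of IP addresses as a postgres array
    literal, as needed when inserting a traceroute path into the
    path table (illustrative sketch; psycopg2 can also adapt lists
    to arrays automatically)."""
    return "{%s}" % ",".join('"%s"' % a for a in addresses)

path = ["10.0.0.1", "192.0.2.1", "198.51.100.7"]
assert to_pg_array(path) == '{"10.0.0.1","192.0.2.1","198.51.100.7"}'
```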




Finished fixing various parts of Cuz so that it should be able to survive postgres restarts.

Started working on NNTSC version 3, i.e. implementing the database design changes that Brad has been testing. Fortunately, the way we've coded NNTSC meant that this was not overly difficult. Both insertion and querying now appear to work with the new code and we've even fixed a problem we would have had under the old system where a stream could only belong to a single label. Now we query each label in turn, so a stream can belong to as many labels as required to satisfy the query.

Also updated the AMP ICMP and DNS table structure to be more efficient in terms of space required per row.

Spent the latter part of the week working on verifying the events that netevmon has produced. Found and fixed some netevmon issues in the process, in particular a bug where we were not subscribing to all the streams for a particular ICMP test, so we were missing chunks of data depending on what addresses were being tested to. Overall, the event detection isn't too bad -- we pick up a lot of insignificant events but usually only one detector fires for each one, so Meena's fusion techniques should be able to determine that they aren't worth bothering anyone about. The few major events we do get are generally reported by most of the detectors.

Gave a couple of lectures on libtrace for 513.




Short week this week due to being on holiday until Wednesday.

Spent a fair bit of time discussing potential database design improvements with Brendon and Brad. Based on Brad's experiments, it looks like we might be able to make sufficient improvements to the way we use postgres rather than having to move to a non-relational database right away. The main change will be to move to a system where each stream has its own data table rather than storing all streams in the same (partitioned) table. Originally, we were going to use partitions to create the per-stream tables, but postgres doesn't like having lots of partitions, so we will also need to update our querying and aggregation approach to cope with having to search through multiple tables.
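One way to query across many per-stream tables is to stitch them together with UNION ALL. A minimal sketch of the idea, assuming a hypothetical data_<id> naming scheme for the per-stream tables:

```python
def build_union_query(stream_ids, start, end):
    """Fetch data for several streams when each stream lives in its
    own table, combining the per-table selects with UNION ALL
    (illustrative sketch of the multi-table approach; stream ids are
    trusted integers, not user input)."""
    parts = ["SELECT * FROM data_%d WHERE timestamp >= %d AND timestamp < %d"
             % (sid, start, end) for sid in stream_ids]
    return " UNION ALL ".join(parts)

q = build_union_query([3, 7], 1000, 2000)
assert q == ("SELECT * FROM data_3 WHERE timestamp >= 1000 AND timestamp < 2000"
             " UNION ALL "
             "SELECT * FROM data_7 WHERE timestamp >= 1000 AND timestamp < 2000")
```

This is the querying change hinted at above: instead of one select against a partitioned table, the planner sees an explicit select per stream, which stays manageable because each query only touches the streams it actually needs.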

Went through most of our code that touches the database and made sure it sensibly deals with postgres being restarted. It turns out there are a lot of places in both NNTSC and ampy where this can cause problems. By the end of the week, though, I seemed to have a system that would generally survive a postgres restart, but I'm still not 100% sure about some of the more rarely-hit code paths.

Prepared some slides for the 513 libtrace lectures that I am going to give next week. Also made a few tweaks to the assignment after some feedback from Ryan.