Shane Alcock's Blog
Continued reading over Stephen's thesis.
Further refined my event dashboard improvements. Added an algorithm that should recognise redundant event groups based on ASNs that the groups have in common with other groups that occur at the same time. This allows us to get rid of a large number of the vague UoW-REANNZ-AARNet, REANNZ-AARNet and UoW-REANNZ groups that were cluttering up the dashboard on prophet. Found and fixed a few bugs with the self-updating dashboard that were causing event groups to disappear or appear in the wrong order.
Added a working summary graph to the traceroute path map view, with the added benefit of making the selector appear and actually work for this graph.
Continued to battle with InfluxDB's memory usage on prophet. Experimented with tuning a variety of configuration options to try and avoid some of the surges that we occasionally see. Since these surges usually eventually result in the OOM killer being invoked, we need to be able to better control the memory usage before we can consider rolling InfluxDB into production.
Spent most of my week looking into methods for reducing some of the redundant event groups that appear on the amp-web dashboard. Came up with an algorithm for detecting smaller groups that are already covered by one large group, as well as one for detecting when a large group should be removed in favour of the smaller sub-groups.
Implemented my techniques on prophet, but the range of event groups that I get are a bit limited to be sure that everything is working correctly. Next week I may look into grabbing a copy of skeptic's event database to see how well things work on a more diverse set of event groups.
Spent some time reading over Stephen's revised thesis.
Back into it after a couple of weeks spent moving house.
Worked with Brendon to get nntsc, ampy and amp-web upgraded on skeptic. Also got netevmon running on skeptic so we now have event detection running on the public AMP mesh.
While I was away, InfluxDB ran out of memory and died on prophet. Trying to catch up on the backlog of data kept causing InfluxDB to use ridiculous amounts of memory so I had to spend a decent chunk of my week chasing the cause down. At this point, my biggest wish is that someone will add sensible memory management to InfluxDB.
Did a bit of preliminary writing of a possible paper on NNTSC. Organised some of my thoughts on network measurement ecosystems and turned them into a blog post.
We've been doing a lot of collaborative work with our ISP partners lately and one thing that has become increasingly apparent to me is the disconnect between what ISPs expect from measurement / monitoring software and what researchers typically have the time and energy to implement.
More specifically, researchers are very good at developing new or improved measurement techniques but they are not so great at developing the necessary infrastructure around the measurements to make it easy for ISPs to deploy and use the new techniques in a production environment. As a result, the ISPs tend to fall back on tried and true monitoring software (e.g. Smokeping) even though our conversations with operators suggest that they would prefer more than just the simple metrics and graphs that such tools provide.
Finished adding concurrent postgres-influx support to NNTSC, so now we should be able upgrade existing deployments to use influx without having to worry about migrating the existing data from one database to another.
Added an event feedback system to amp-web so that users can click on events and tell us whether the event was useful or not and provide some reasons why that was the case. Hopefully I can use this data to make some tweaks to netevmon and improve the quality of our event detection.
Started reading Stephen's thesis.
Developed a new 'stacked jitter' graph to amp-web for showing the range of packet delay variation seen by the amp-udpstream test. Also added UDPStream data as an option for the latency and loss matrices.
Started working on a transition scheme that will allow an influx-based NNTSC to fetch old data from a postgresql database if required. The idea is that this will save us having to deal with migrating the postgres data over to influx when we upgrade our existing deployments to use influx, while still making the old data queryable.
Made some progress on the InfluxDB memory issues we were having when catching up on old data. Now we are a lot less likely to drive the machine into swap, at the cost of taking a bit longer for backfilled data to be aggregated. Part of the problem was caused by my fix last week for the change in behaviour for the first() and last() aggregation functions in Influx 0.11 -- I've put in a new hacky fix but I'm basically waiting for Influx 0.13 which will hopefully provide us a way to get the old behaviour back.
Found another weird bug in Influx where if we query for certain streams, then sometimes a result row will get split into two "half-rows". This was messing with our querying code in NNTSC which assumes that the database will return only complete rows, so I've had to add extra code to deal with this possibility.
More influx issues: we aren't allowed to perform aggregation on the timestamp column in an Influx table, which was breaking our loss calculation for DNS -- we were using count(timestamp) to determine how many DNS requests we had sent as this was the only non-NULLable column in the DNS data table. Instead, I've had to add an extra "requests" column to the DNS data table so that we have an explicit count available in our aggregated data.
Lots of little fixes on the website. The changes to modals to bootstrap 3.3 are continuing to have a number of interesting flow-on effects, such as the "add new series" modal no longer working after the first time it is used. Added an AS path tab to latency and loss graphs that are only showing a single series, as we've often seen some interesting change and wondering whether the path has changed at the same time. Also fixed an issue where the last datapoint was often not visible on the graphs.
Finally, submitted my unexpected traffic paper to IMC on Thursday. Fingers crossed.
Started adding support for the new AMP UDPStream test to NNTSC, ampy and amp-web. Test results are now successfully inserted into the database and we can plot simple latency and loss graphs for the UDP streams. Next major tasks are to produce a new graph type that can be used to represent the jitter observed in the stream and to get some event detection working.
Spent much of my week chasing Influx issues. The first was that a change in how the last() function worked in 0.11 was messing with our enforced rollup approach -- the timestamp returned with the last row was no longer the timestamp of the last datapoint in the table; it was now the timestamp of the start of the period covered by the 'where' clause in your query. However, we had been using last() to figure out when we had last inserted an aggregated datapoint into the rollup tables, so this no longer worked.
The other issue I've been chasing (with mixed success) is memory usage when backfilling old data after NNTSC has been down for a little while. I believe this is mostly related to Influx caching our enforced rollup query results, which will be a lot of data if we're trying to catch up on the AMP queue. The end result on prophet is a machine that spends a lot of time swapping when you restart NNTSC with a bit of a backlog. I need to find a way to stop Influx from caching those query results or at least to flush them a lot sooner.
Finished up the first release version of the event filtering for amp-web and rolled it out to lamp on Thursday morning. Most of this week's work was polishing up some of the rough edges and making sure the UI behaves in a reasonable fashion -- Brad was very helpful playing the role of an average user and finding bad behaviour.
Post-release, tracked down and fixed the issue that was causing netevmon to not run the loss detector. Added support for loss events to eventing and the dashboard.
Released a new version of libprotoident, which includes all of my recent additions from the unexpected traffic study.
Marked the last libtrace assignment and pushed out the marks to the students.
After what seems like forever, I've finally managed to put together a new libprotoident release that includes all of the new protocol rules I've developed over the past couple of years. This release adds support for around 70 new protocols, including QUIC, SPDY, Cisco SSL VPN, Weibo and Line. A further 28 protocols have had their rules refined and improved, including BitTorrent, QQ, WeChat, Xunlei and DNS.
The lpi_live tool has been removed in this release, as this has been decommissioned in favour of the lpicollector tool.
Also, please note that libflowmanager 2.0.4 is required to build the libprotoident tools. Older versions of libflowmanager will fail the configure check.
The full list of changes can be found in the libprotoident ChangeLog.