Shane Alcock's Blog
Made some progress on the InfluxDB memory issues we were having when catching up on old data. Now we are a lot less likely to drive the machine into swap, at the cost of taking a bit longer for backfilled data to be aggregated. Part of the problem was caused by my fix last week for the change in behaviour for the first() and last() aggregation functions in Influx 0.11 -- I've put in a new hacky fix but I'm basically waiting for Influx 0.13 which will hopefully provide us a way to get the old behaviour back.
Found another weird bug in Influx: when we query for certain streams, a result row will sometimes get split into two "half-rows". This was messing with our querying code in NNTSC, which assumes that the database will return only complete rows, so I've had to add extra code to deal with this possibility.
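To give a rough idea of what dealing with half-rows involves, here is a minimal sketch rather than the actual NNTSC code. It assumes rows arrive as dicts with a "time" key, and that a split row shows up as two dicts sharing a timestamp, each holding None for the columns that landed in the other half. The function name and row layout are hypothetical.

```python
def merge_half_rows(rows):
    """Combine rows that share a timestamp into a single complete row.

    Assumption: each row is a dict with a "time" key, and a split row
    appears as several dicts with the same "time", each holding None
    for the columns that ended up in the other half.
    """
    merged = {}  # dicts preserve insertion order, so row order is kept
    for row in rows:
        target = merged.setdefault(row["time"], {})
        for column, value in row.items():
            # Only fill gaps; never overwrite a real value with None.
            if target.get(column) is None:
                target[column] = value
    return list(merged.values())
```

Rows that were never split pass through unchanged, so the merge can be applied to every query result without checking first.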
More Influx issues: we aren't allowed to perform aggregation on the timestamp column in an Influx table, which was breaking our loss calculation for DNS -- we were using count(timestamp) to determine how many DNS requests we had sent, as this was the only non-NULLable column in the DNS data table. Instead, I've had to add an extra "requests" column to the DNS data table so that we have an explicit count available in our aggregated data.
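With an explicit count in the data, the loss calculation no longer depends on counting timestamps. A minimal sketch of the idea, assuming each aggregated row is a dict carrying the new "requests" column alongside a "responses" count (the latter name is my own placeholder, not necessarily what the table uses):

```python
def dns_loss_rate(rows):
    """Return the fraction of DNS requests that received no response.

    Assumption: each aggregated row is a dict with a "requests" count
    and a "responses" count (hypothetical column name); either may be
    None when a bin had no data.
    """
    requests = sum(row.get("requests") or 0 for row in rows)
    responses = sum(row.get("responses") or 0 for row in rows)
    if requests == 0:
        return None  # nothing was sent, so loss is undefined
    return (requests - responses) / requests
```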
Lots of little fixes on the website. The changes to modals in Bootstrap 3.3 are continuing to have a number of interesting flow-on effects, such as the "add new series" modal no longer working after the first time it is used. Added an AS path tab to latency and loss graphs that are only showing a single series, as we've often seen an interesting change and wondered whether the path changed at the same time. Also fixed an issue where the last datapoint was often not visible on the graphs.
Finally, submitted my unexpected traffic paper to IMC on Thursday. Fingers crossed.
Started adding support for the new AMP UDPStream test to NNTSC, ampy and amp-web. Test results are now successfully inserted into the database and we can plot simple latency and loss graphs for the UDP streams. Next major tasks are to produce a new graph type that can be used to represent the jitter observed in the stream and to get some event detection working.
Spent much of my week chasing Influx issues. The first was that a change in how the last() function worked in 0.11 was messing with our enforced rollup approach -- the timestamp returned with the last row was no longer the timestamp of the last datapoint in the table; it was now the timestamp of the start of the period covered by the 'where' clause in your query. However, we had been using last() to figure out when we had last inserted an aggregated datapoint into the rollup tables, so this no longer worked.
The other issue I've been chasing (with mixed success) is memory usage when backfilling old data after NNTSC has been down for a little while. I believe this is mostly related to Influx caching our enforced rollup query results, which will be a lot of data if we're trying to catch up on the AMP queue. The end result on prophet is a machine that spends a lot of time swapping when you restart NNTSC with a bit of a backlog. I need to find a way to stop Influx from caching those query results or at least to flush them a lot sooner.
Finished up the first release version of the event filtering for amp-web and rolled it out to lamp on Thursday morning. Most of this week's work was polishing up some of the rough edges and making sure the UI behaves in a reasonable fashion -- Brad was very helpful playing the role of an average user and finding bad behaviour.
Post-release, tracked down and fixed the issue that was causing netevmon to not run the loss detector. Added support for loss events to eventing and the dashboard.
Released a new version of libprotoident, which includes all of my recent additions from the unexpected traffic study.
Marked the last libtrace assignment and pushed out the marks to the students.
After what seems like forever, I've finally managed to put together a new libprotoident release that includes all of the new protocol rules I've developed over the past couple of years. This release adds support for around 70 new protocols, including QUIC, SPDY, Cisco SSL VPN, Weibo and Line. A further 28 protocols have had their rules refined and improved, including BitTorrent, QQ, WeChat, Xunlei and DNS.
The lpi_live tool has been removed in this release, as this has been decommissioned in favour of the lpicollector tool.
Also, please note that libflowmanager 2.0.4 is required to build the libprotoident tools. Older versions of libflowmanager will fail the configure check.
The full list of changes can be found in the libprotoident ChangeLog.
Only worked three days this week -- on leave for the rest.
Continued developing the event filtering mechanism for the amp-web dashboard. Managed to make all of the filtering options work properly, including AS-based filtering and filtering based on the number of affected endpoints.
Changed event loading to happen in batches, so if the selected time range covers a lot of events we will only load 20 at a time. A new batch is loaded each time the user scrolls to the bottom of the event list. This means that we can now replicate the old infinite scrolling event list behaviour on the dashboard, so I've removed the former page.
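The server side of the batching can be sketched roughly as follows. This is an illustration only: the function name and the offset-based interface are assumptions, and the real amp-web code may page through events differently.

```python
BATCH_SIZE = 20  # the batch size mentioned above


def event_batch(events, offset, batch_size=BATCH_SIZE):
    """Return one batch of events plus the offset for the next request.

    The next offset is None once the list is exhausted, which tells the
    client to stop asking when the user scrolls to the bottom.
    """
    batch = events[offset:offset + batch_size]
    next_offset = offset + len(batch)
    if next_offset >= len(events):
        next_offset = None
    return batch, next_offset
```

The client would start at offset 0 and request the returned offset each time the user reaches the bottom of the list.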
Added automatic fetching of new events to the dashboard, so the event list is now self-updating rather than requiring a refresh of the whole page to see any new events.
Continued working on the event filtering mechanism for amp-web. Added support for an ASN->AS name mapping database, which will be used to manage the list of ASes that can be filtered on and to label our traceroute graphs (instead of querying whois.cymru.org, which can fail from time to time).
Changes to event filters are now posted back to the amp-web server and saved for the next time the user loads the event dashboard.
Started working on actually filtering the events based on the user's selections. I've got filtering working for time period, maximum event groups, event types, sources and targets. One interesting side effect of filtering is that the removal of certain events from event groups can create situations where we have duplicate event groups (because the events that made those groups distinct are no longer on the dashboard). Removing events can also change the start time of an event group and therefore event groups no longer appear in chronological order. As a result, I've had to re-work the event processing to correct for these issues.
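The re-worked processing boils down to three steps: filter the events inside each group, drop groups that have become identical, and re-sort by the new start times. A minimal sketch under my own assumptions about the data layout (a group is a list of event dicts, each with a unique "id" and a "ts" start time), not the actual amp-web code:

```python
def filter_groups(groups, keep_event):
    """Apply a filter to event groups, then dedupe and re-sort them.

    Assumptions: a group is a list of event dicts, each event has a
    unique "id" and a "ts" start time, and keep_event is a predicate
    built from the user's filter selections.
    """
    seen = set()
    result = []
    for group in groups:
        kept = [ev for ev in group if keep_event(ev)]
        if not kept:
            continue  # every event in this group was filtered out
        key = frozenset(ev["id"] for ev in kept)
        if key in seen:
            continue  # identical to a group we already have
        seen.add(key)
        result.append(kept)
    # A group's start time is that of its earliest surviving event.
    result.sort(key=lambda g: min(ev["ts"] for ev in g))
    return result
```

For example, filtering out loss events can turn a group that started with a loss event into a duplicate of a later group and move its start time, which is why the dedupe and sort both have to happen after filtering.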
Marked the 513 libtrace assignments. Some students performed very well and I was glad to see that the investigative task proved to be very doable.
Started working on adding the ability to filter events and event groups on the amp-web dashboard. Most of my effort so far has been in producing a mock-up of the interface, which I showed to Nathan and Chris on Thursday afternoon. On Friday, started replacing some hard-coded filtering settings with a dynamic template that uses user preferences stored in a database.
Fixed a few little netevmon issues that cropped up when trying to restart netevmon on prophet prior to starting work on the dashboard filtering, mostly in relation to ensuring that the 'purge event database' option works sensibly.
Started writing up a short paper on the unexpected traffic analysis I've been doing for the past few weeks. Made decent progress -- I've got a mostly complete draft, just missing a conclusion and an abstract.
Spent a decent chunk of Thursday dealing with the fallout from upgrading influxdb to 0.11 on prophet. This broke most of our existing rollup tables, as the data type that we were now inserting (int) was no longer compatible with the data type that we apparently used to insert (float). Compounding matters was influxdb's lack of visibility into what data types are associated with any given column. Ended up trashing and re-creating the database (somewhat by accident) which fixed the problem, but not an ideal solution if we ever roll this out in production.
513 assignment was due at 5pm on Friday, so dealt with a few final queries from students. 20 submissions in the end, so a bit of marking to do next week.
Continued making progress with my unidentified mice flows in libprotoident. Added a whole pile of new rules, mostly for various Chinese apps again. Have probably done enough now that I can draw a line under this and start writing the paper itself; there are a few obvious patterns that I would like to identify but this has consumed a lot of time already.
Answered a handful of questions from 513 students -- mostly intelligent ones, so I'm reasonably confident about how the class is going overall. Due date is this coming Friday, so we'll know for sure soon enough.
Helped finish off the funding proposal in the first half of the week.
Continued working with libprotoident. This week I gave up on the elephant flows and started looking at the mice flows. Found some interesting stuff; the highlight being a huge number of flows on TCP port 80 that seem to be associated with the Baidu web browser. The behaviour of these flows is particularly odd: connect to server, send a FIN with seqno N, retransmit FIN a few times, send a non-FIN packet with 1 byte of payload (0x00) and seqno N-1 (incredibly invalid TCP behaviour!), server sends a RST. End result is > 150,000 flows over a week on port 80 with a single outgoing byte of payload.
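The odd sequence is easy to express as a check over the outgoing packets of a flow. The sketch below is purely illustrative and is not the libprotoident rule: it assumes each packet is a dict with "seq", "fin" and "payload" fields (hypothetical names) and looks for a one-byte 0x00 payload sent after a FIN with a sequence number one below the FIN's.

```python
def has_post_fin_byte(packets):
    """Detect a 1-byte 0x00 payload at seqno N-1 sent after a FIN at N.

    Assumption: packets holds the outgoing packets of one TCP flow in
    order, each a dict with "seq" (int), "fin" (bool) and "payload"
    (bytes). These field names are hypothetical.
    """
    fin_seq = None
    for pkt in packets:
        if pkt["fin"]:
            if fin_seq is None:
                fin_seq = pkt["seq"]  # remember the first FIN only
        elif fin_seq is not None:
            # Sequence numbers are 32-bit, so N-1 wraps around at zero.
            expected = (fin_seq - 1) % 2**32
            if pkt["payload"] == b"\x00" and pkt["seq"] == expected:
                return True
    return False
```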
Added some filters on the Endace probe to see if we can find hosts generating this traffic on campus, as the Baidu browser is pretty well-known for having a tendency to leak all sorts of private data back to its masters. Found multiple staff PCs that appear to be generating this sort of traffic, so Brad and I will try to prepare a report for ITS next week.
Met with Nathan at Lightwire on Thursday afternoon re: AMP and netevmon. Came away with plenty of ideas and suggestions for improvements we can make, and hopefully we also helped Nathan understand parts of our system better. The good news is that netevmon seems to mostly be picking up valid events, but even so the number and frequency of these events can be overwhelming, so we need better control over what events are shown to the user.