Cuz

The aim of this project is to develop a system whereby network measurements from a variety of sources can be used to detect and report on events occurring on the network in a timely and useful fashion. The project can be broken down into four major components:

Measurement: The development and evaluation of software to collect the network measurements. Some software used will be pre-existing, e.g. SmokePing, but most of the collection will use our own software, such as AMP, libprotoident and maji. This component is mostly complete.

Collection: The collection, storage and conversion to a standardised format of the network measurements. Measurements will come from multiple locations within or around the network, so we will need a system for receiving measurements from monitor hosts. Raw measurement values will need to be stored in a way that allows them to be queried, particularly for later presentation. Finally, each measurement technology is likely to use a different output format, so its results will need to be converted to a standard format that is suitable for the next component.

Eventing: Analysis of the measurements to determine whether network events have occurred. Because we are using multiple measurement sources, this component will need to aggregate events that are detected by multiple sources into a single event. This component also covers alerting, i.e. deciding how serious an event is and alerting network operators appropriately.

Presentation: Allowing network operators to inspect the measurements being reported for their network and see the context of the events that they are being alerted on. The general plan here is for web-based zoomable graphs with a flexible querying system.

25 Jul 2016

Short week after taking leave on Monday and Tuesday.

Spent most of my remaining week looking at some new captures I took using the upgraded Probe. The main aim was to see whether there were any new protocols that libprotoident should be able to identify. Managed to find a handful of new protocols (Facebook Zero, Forticlient SSL VPN and Discord) and also made some improvements to the rules for existing protocols (including the AMP throughput test!).

Most of my time was actually spent unsuccessfully hunting down what appears to be a new Chinese P2P protocol, which is a shame because it was contributing a very large amount of unknown traffic in my sample dataset.

Using BSOD on the live traffic feed also allowed me to spot a student who was doing vast quantities of torrenting on the campus network (which Brad reported to ITS) and our WITS FTP server being hammered with tons of download attempts from China. Fair to say, we've gotten some good mileage out of the upgraded Probe already.

Fixed a couple of outstanding bugs in amp-web. Should be ready to push some new packages out to skeptic and lamp early next week now.

20 Jul 2016

Ported my event group pruning code from amp-web to a separate daemon that runs as part of netevmon. Rather than tweaking the event groups prior to displaying them on the dashboard, the daemon periodically fetches the most recent event groups from the database and checks for any redundancies that can be pruned. If any are found, the database itself is updated in place.

The benefits of this approach over the amp-web approach are that we can save on space in the event database and we don't need to do the full redundancy processing every time someone loads the dashboard. The one downside is that any merges are effectively permanent so I have to be very careful about testing my redundancy checks before rolling them out live.

Found and fixed some more InfluxDB memory problems when using the matrix. Most of the problems related to us using the last() function, which for some reason can result in InfluxDB loading the whole table into memory. I've managed to rewrite the queries that used last() so that they don't require anywhere near as much memory (or processing time), so tooltips, in particular, should be a lot faster to process and less likely to push the server into swap.

Got the waikato capture point back up and running after its disks were replaced on Thursday. Used it to demo BSOD to various visitors who were here for the CSC.

11 Jul 2016

Continued reading over Stephen's thesis.

Further refined my event dashboard improvements. Added an algorithm that should recognise redundant event groups based on ASNs that the groups have in common with other groups that occur at the same time. This allows us to get rid of a large number of the vague UoW-REANNZ-AARNet, REANNZ-AARNet and UoW-REANNZ groups that were cluttering up the dashboard on prophet. Found and fixed a few bugs with the self-updating dashboard that were causing event groups to disappear or appear in the wrong order.

Added a working summary graph to the traceroute path map view, with the added benefit of making the selector appear and actually work for this graph.

Continued to battle with InfluxDB's memory usage on prophet. Experimented with tuning a variety of configuration options to try and avoid some of the surges that we occasionally see. Since these surges usually eventually result in the OOM killer being invoked, we need to be able to better control the memory usage before we can consider rolling InfluxDB into production.

01 Jul 2016

Spent most of my week looking into methods for reducing some of the redundant event groups that appear on the amp-web dashboard. Came up with an algorithm for detecting smaller groups that are already covered by one large group, as well as one for detecting when a large group should be removed in favour of the smaller sub-groups.
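The first of those two checks (small groups already covered by one large group) can be sketched roughly as a subset test. This is only an illustration, not the code running on prophet: it assumes each event group can be reduced to a set of event ids, and the function name and group ids are made up for the example.

```python
def find_covered_groups(groups):
    """Return the ids of groups whose events all appear in a larger group.

    groups: dict mapping a (hypothetical) group id to a set of event ids.
    """
    covered = set()
    # Visit the largest groups first so that they act as the covering groups.
    by_size = sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)
    for i, (big_id, big_events) in enumerate(by_size):
        if big_id in covered:
            continue
        for small_id, small_events in by_size[i + 1:]:
            # A smaller group is redundant if it adds no events of its own.
            if small_id not in covered and small_events <= big_events:
                covered.add(small_id)
    return covered


groups = {
    "A": {1, 2, 3, 4},
    "B": {2, 3},      # entirely inside A, so redundant
    "C": {1, 5},      # event 5 is unique, so C is kept
}
redundant = find_covered_groups(groups)
```

The reverse case (dropping a large group in favour of its sub-groups) needs a different rule and is not covered by this sketch.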

Implemented my techniques on prophet, but the range of event groups that I get there is a bit too limited to be sure that everything is working correctly. Next week I may look into grabbing a copy of skeptic's event database to see how well things work on a more diverse set of event groups.

Spent some time reading over Stephen's revised thesis.

27 Jun 2016

Back into it after a couple of weeks spent moving house.

Worked with Brendon to get nntsc, ampy and amp-web upgraded on skeptic. Also got netevmon running on skeptic so we now have event detection running on the public AMP mesh.

While I was away, InfluxDB ran out of memory and died on prophet. Trying to catch up on the backlog of data kept causing InfluxDB to use ridiculous amounts of memory so I had to spend a decent chunk of my week chasing the cause down. At this point, my biggest wish is that someone will add sensible memory management to InfluxDB.

Did a bit of preliminary writing of a possible paper on NNTSC. Organised some of my thoughts on network measurement ecosystems and turned them into a blog post.

21 Jun 2016

We've been doing a lot of collaborative work with our ISP partners lately and one thing that has become increasingly apparent to me is the disconnect between what ISPs expect from measurement / monitoring software and what researchers typically have the time and energy to implement.

More specifically, researchers are very good at developing new or improved measurement techniques but they are not so great at developing the necessary infrastructure around the measurements to make it easy for ISPs to deploy and use the new techniques in a production environment. As a result, the ISPs tend to fall back on tried and true monitoring software (e.g. Smokeping) even though our conversations with operators suggest that they would prefer more than just the simple metrics and graphs that such tools provide.

30 May 2016

Finished adding concurrent postgres-influx support to NNTSC, so now we should be able to upgrade existing deployments to use influx without having to worry about migrating the existing data from one database to another.

Added an event feedback system to amp-web so that users can click on events and tell us whether the event was useful or not and provide some reasons why that was the case. Hopefully I can use this data to make some tweaks to netevmon and improve the quality of our event detection.

Started reading Stephen's thesis.

23 May 2016

Developed a new 'stacked jitter' graph for amp-web that shows the range of packet delay variation seen by the amp-udpstream test. Also added UDPStream data as an option for the latency and loss matrices.

Started working on a transition scheme that will allow an influx-based NNTSC to fetch old data from a postgresql database if required. The idea is that this will save us having to deal with migrating the postgres data over to influx when we upgrade our existing deployments to use influx, while still making the old data queryable.
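One way to picture the transition scheme is as splitting each query's time range at the moment the deployment switched to influx. The sketch below is an assumption about how that could look rather than what NNTSC actually does; `split_query` and the `cutover` timestamp are hypothetical names.

```python
def split_query(start, end, cutover):
    """Split a [start, end) query into a postgres part and an influx part.

    cutover is the (assumed) timestamp at which the deployment switched
    to influx. Either part is None if the query does not touch it.
    """
    # Anything before the cutover only exists in the old postgres tables.
    postgres_range = (start, min(end, cutover)) if start < cutover else None
    # Anything from the cutover onwards lives in influx.
    influx_range = (max(start, cutover), end) if end > cutover else None
    return postgres_range, influx_range


spanning = split_query(0, 100, 50)
recent_only = split_query(60, 100, 50)
old_only = split_query(0, 40, 50)
```

Queries that only cover recent data never touch postgres at all, so the old database is only consulted "if required".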

16 May 2016

Made some progress on the InfluxDB memory issues we were having when catching up on old data. Now we are a lot less likely to drive the machine into swap, at the cost of taking a bit longer for backfilled data to be aggregated. Part of the problem was caused by my fix last week for the change in behaviour for the first() and last() aggregation functions in Influx 0.11 -- I've put in a new hacky fix but I'm basically waiting for Influx 0.13 which will hopefully provide us a way to get the old behaviour back.

Found another weird bug in Influx where if we query for certain streams, then sometimes a result row will get split into two "half-rows". This was messing with our querying code in NNTSC which assumes that the database will return only complete rows, so I've had to add extra code to deal with this possibility.
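For illustration, the extra handling could look something like the sketch below. It assumes that the two half-rows share a timestamp and that each carries None for the columns held by the other half; the real shape of the split rows may differ, and `merge_half_rows` is a made-up name rather than the NNTSC function.

```python
def merge_half_rows(rows):
    """Recombine rows that share a timestamp into single complete rows.

    rows: list of dicts, each with a "time" key. Assumes a split row
    shows up as two dicts with the same time and complementary None gaps.
    """
    merged = {}
    for row in rows:
        combined = merged.setdefault(row["time"], {})
        for column, value in row.items():
            # Never let a None from one half overwrite real data from the other.
            if value is not None or column not in combined:
                combined[column] = value
    return [merged[t] for t in sorted(merged)]


rows = [
    {"time": 10, "rtt": 5.0, "loss": None},   # first half of the row at t=10
    {"time": 10, "rtt": None, "loss": 0},     # second half of the row at t=10
    {"time": 20, "rtt": 6.0, "loss": 1},      # an ordinary complete row
]
complete = merge_half_rows(rows)
```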

More influx issues: we aren't allowed to perform aggregation on the timestamp column in an Influx table, which was breaking our loss calculation for DNS -- we were using count(timestamp) to determine how many DNS requests we had sent as this was the only non-NULLable column in the DNS data table. Instead, I've had to add an extra "requests" column to the DNS data table so that we have an explicit count available in our aggregated data.
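With an explicit "requests" column, the loss calculation no longer depends on counting timestamps. A minimal sketch, assuming each row also has an "rtt" column that is None when no response arrived (that column name and the function name are assumptions for the example):

```python
def dns_loss_fraction(rows):
    """Fraction of DNS requests in rows that went unanswered.

    rows: list of dicts with a "requests" count and an "rtt" value,
    where rtt is None if no response was received (assumed layout).
    """
    requests = sum(row["requests"] for row in rows)
    if requests == 0:
        return None  # nothing was sent, so loss is undefined
    responses = sum(1 for row in rows if row["rtt"] is not None)
    return (requests - responses) / requests


rows = [
    {"requests": 1, "rtt": 12.5},
    {"requests": 1, "rtt": None},
    {"requests": 1, "rtt": 14.0},
    {"requests": 1, "rtt": None},
]
loss = dns_loss_fraction(rows)
```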

Lots of little fixes on the website. The changes to modals in bootstrap 3.3 are continuing to have a number of interesting flow-on effects, such as the "add new series" modal no longer working after the first time it is used. Added an AS path tab to latency and loss graphs that are only showing a single series, as we've often seen an interesting change and wondered whether the path changed at the same time. Also fixed an issue where the last datapoint was often not visible on the graphs.

Finally, submitted my unexpected traffic paper to IMC on Thursday. Fingers crossed.

09 May 2016

Started adding support for the new AMP UDPStream test to NNTSC, ampy and amp-web. Test results are now successfully inserted into the database and we can plot simple latency and loss graphs for the UDP streams. Next major tasks are to produce a new graph type that can be used to represent the jitter observed in the stream and to get some event detection working.

Spent much of my week chasing Influx issues. The first was that a change in how the last() function worked in 0.11 was messing with our enforced rollup approach -- the timestamp returned with the last row was no longer the timestamp of the last datapoint in the table; it was now the timestamp of the start of the period covered by the 'where' clause in your query. However, we had been using last() to figure out when we had last inserted an aggregated datapoint into the rollup tables, so this no longer worked.

The other issue I've been chasing (with mixed success) is memory usage when backfilling old data after NNTSC has been down for a little while. I believe this is mostly related to Influx caching our enforced rollup query results, which will be a lot of data if we're trying to catch up on the AMP queue. The end result on prophet is a machine that spends a lot of time swapping when you restart NNTSC with a bit of a backlog. I need to find a way to stop Influx from caching those query results or at least to flush them a lot sooner.

02 May 2016

Finished up the first release version of the event filtering for amp-web and rolled it out to lamp on Thursday morning. Most of this week's work was polishing up some of the rough edges and making sure the UI behaves in a reasonable fashion -- Brad was very helpful playing the role of an average user and finding bad behaviour.

Post-release, tracked down and fixed the issue that was causing netevmon to not run the loss detector. Added support for loss events to eventing and the dashboard.

Released a new version of libprotoident, which includes all of my recent additions from the unexpected traffic study.

Marked the last libtrace assignment and pushed out the marks to the students.

26 Apr 2016

Only worked three days this week -- on leave for the rest.

Continued developing the event filtering mechanism for the amp-web dashboard. Managed to make all of the filtering options work properly, including AS-based filtering and filtering based on the number of affected endpoints.

Changed event loading to happen in batches, so if the selected time range covers a lot of events we will only load 20 at a time. A new batch is loaded each time the user scrolls to the bottom of the event list. This means that we can now replicate the old infinite scrolling event list behaviour on the dashboard, so I've removed the former page.
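The batching itself is simple enough to sketch. The version below works on an in-memory list purely for illustration; in amp-web the events come from the database, and `load_batch` is a hypothetical helper rather than the real function.

```python
BATCH_SIZE = 20  # batch size mentioned above


def load_batch(events, start, batch_size=BATCH_SIZE):
    """Return one batch of events plus the state needed to fetch the next.

    events: the full, ordered list of events in the selected time range.
    start: index of the first event that has not been sent yet.
    """
    batch = events[start:start + batch_size]
    next_start = start + len(batch)
    has_more = next_start < len(events)
    return batch, next_start, has_more


events = list(range(45))                    # stand-in for 45 events
first, next_start, more = load_batch(events, 0)
last, end_start, more_after_last = load_batch(events, 40)
```

Each time the user scrolls to the bottom of the list, the page would request the batch starting at `next_start`, and stop asking once `has_more` is false.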

Added automatic fetching of new events to the dashboard, so the event list is now self-updating rather than requiring a refresh of the whole page to see any new events.

19 Apr 2016

Continued working on the event filtering mechanism for amp-web. Added support for an ASN->AS name mapping database, which will be used to manage the list of ASes that can be filtered on and to label our traceroute graphs (instead of querying whois.cymru.org, which can fail from time to time).

Changes to event filters are now posted back to the amp-web server and saved for the next time the user loads the event dashboard.

Started working on actually filtering the events based on the user's selections. I've got filtering working for time period, maximum event groups, event types, sources and targets. One interesting side effect of filtering is that the removal of certain events from event groups can create situations where we have duplicate event groups (because the events that made those groups distinct are no longer on the dashboard). Removing events can also change the start time of an event group and therefore event groups no longer appear in chronological order. As a result, I've had to re-work the event processing to correct for these issues.
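A rough sketch of that re-worked processing is below. It is not the amp-web code: it assumes each group is a list of event dicts with hypothetical "id", "ts" and "type" keys, and it treats two groups as duplicates when they are left with exactly the same events after filtering.

```python
def filter_groups(groups, keep_event):
    """Filter events out of groups, then drop duplicates and re-sort.

    groups: list of event groups, each a list of event dicts.
    keep_event: predicate that returns True for events the user wants.
    """
    seen = set()
    result = []
    for group in groups:
        events = [ev for ev in group if keep_event(ev)]
        if not events:
            continue  # every event in the group was filtered out
        key = frozenset(ev["id"] for ev in events)
        if key in seen:
            continue  # filtering made this group identical to an earlier one
        seen.add(key)
        result.append(events)
    # Start times may have changed, so restore chronological order.
    result.sort(key=lambda evs: min(ev["ts"] for ev in evs))
    return result


e1 = {"id": 1, "ts": 100, "type": "loss"}
e2 = {"id": 2, "ts": 120, "type": "latency"}
e3 = {"id": 3, "ts": 110, "type": "latency"}
e4 = {"id": 4, "ts": 130, "type": "loss"}
groups = [[e1, e2], [e3], [e2, e4]]   # in order of original start time

filtered = filter_groups(groups, lambda ev: ev["type"] != "loss")
filtered_ids = [[ev["id"] for ev in group] for group in filtered]
```

Hiding loss events leaves the first and third groups with only event 2, so one of them is dropped, and the surviving copy now starts later than the group containing event 3.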

11 Apr 2016

Marked the 513 libtrace assignments. Some students performed very well and I was glad to see that the investigative task proved to be very doable.

Started working on adding the ability to filter events and event groups on the amp-web dashboard. Most of my effort so far has been in producing a mock-up of the interface, which I showed to Nathan and Chris on Thursday afternoon. On Friday, started replacing some hard-coded filtering settings with a dynamic template that uses user preferences stored in a database.

Fixed a few little netevmon issues that cropped up when trying to restart netevmon on prophet prior to starting work on the dashboard filtering, mostly in relation to ensuring that the 'purge event database' option works sensibly.

04 Apr 2016

Started writing up a short paper on the unexpected traffic analysis I've been doing for the past few weeks. Made decent progress -- I've got a mostly complete draft, just missing a conclusion and an abstract.

Spent a decent chunk of Thursday dealing with the fallout from upgrading influxdb to 0.11 on prophet. This broke most of our existing rollup tables, as the data type that we were now inserting (int) was no longer compatible with the data type that we apparently used to insert (float). Compounding matters was influxdb's lack of visibility into what data types are associated with any given column. Ended up trashing and re-creating the database (somewhat by accident) which fixed the problem, but not an ideal solution if we ever roll this out in production.

513 assignment was due at 5pm on Friday, so dealt with a few final queries from students. 20 submissions in the end, so a bit of marking to do next week.

21 Mar 2016

Helped finish off the funding proposal in the first half of the week.

Continued working with libprotoident. This week I gave up on the elephant flows and started looking at the mice flows. Found some interesting stuff; the highlight being a huge number of flows on TCP port 80 that seem to be associated with the Baidu web browser. The behaviour of these flows is particularly odd: connect to server, send a FIN with seqno N, retransmit FIN a few times, send a non-FIN packet with 1 byte of payload (0x00) and seqno N-1 (incredibly invalid TCP behaviour!), server sends a RST. End result is > 150,000 flows over a week on port 80 with a single outgoing byte of payload.

Added some filters on the Endace probe to see if we can find people generating this traffic on campus, as the Baidu browser is pretty well-known for having a tendency to leak all sorts of private data back to its masters. Found multiple staff PCs that appear to be generating this sort of traffic, so Brad and I will try to prepare a report for ITS next week.

Met with Nathan at Lightwire on Thursday afternoon re: AMP and netevmon. Came away with plenty of ideas and suggestions for improvements we can make and hopefully we also helped Nathan understand parts of our system better as well. The good news is that netevmon seems to mostly be picking up valid events, but even so the number and frequency of these events can be overwhelming so we need better control over what events are shown to the user.

07 Mar 2016

Continued working away at the Unknown traffic from my libprotoident port study. Added new protocols for Telegram Messenger and Kuguo, as well as improved DNS (especially TCP DNS) and NTP matching. I still have a bit more Unknown traffic to identify before I'd be comfortable putting the results in a paper, but we're getting closer.

Gave my 513 lectures this week. Looking forward to seeing how the class get on with my assignment.

Met with Ryan Jones who is doing an Honours project that will use netevmon to try and find events in the CSC data. Gave him access to the code and a few hints to start out, but I imagine I'll have to dedicate some more time to this over the course of the year.

01 Mar 2016

My fixes to Andy's InfluxDB code seem to be resulting in consistent and correct bins being stored in the rollup tables. Threw netevmon at the development system to see if it can cope, which it seems to be doing OK. There's still a bit of a concern around long-term memory usage, but I'll see how that pans out over the next couple of weeks.

Spent the rest of my week concentrating on finishing up JP's summer study on unexpected traffic on typically open ports. Managed to improve a few existing rules to recognise more traffic, as well as add new rules for QQ video chat and what appears to be a C&C covert channel for some Chinese malware using UDP port 53. Started framing up a paper for IMC based on this study.

Did some final prep work for the libtrace lectures and assignment for 513.

22 Feb 2016

Arrived back in NZ on Monday, back at work on Tuesday. Brought Brendon and Richard N. up to speed on the things I learned at AIMS and the potential collaboration opportunities I discussed with people there. Spent a bit of time writing emails to chase up on some of these opportunities.

Deployed Andy's InfluxDB code on prophet. Spent much of the rest of the week playing around with the continuous query system to try and fix some outstanding issues caused by Influx's design decision to never automatically backfill the aggregated series when older / lagged data is received (e.g. when restarting NNTSC after an outage or AMP results arriving 40 seconds later than their timestamp due to timeouts). This was a bit trickier than you would think because there's no obvious way to find out when the last automatic continuous query ran (they don't happen exactly on the bin boundary) so I have to guess based on the current time, the time the bin should have ended and the timestamp of the current result.
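The guess boils down to comparing the three times mentioned above. The sketch below is a simplified illustration rather than the NNTSC logic: `cq_slack` is a made-up allowance for the continuous query running some time after the bin boundary, and all times are plain seconds.

```python
def probably_missed_by_cq(result_ts, now, bin_size, cq_slack):
    """Guess whether the automatic continuous query already ran for the
    bin containing result_ts, meaning this result needs a manual backfill.

    cq_slack is an assumed delay between a bin ending and its continuous
    query actually running; Influx does not expose the real value.
    """
    # End of the aggregation bin that this result falls into.
    bin_end = (result_ts // bin_size + 1) * bin_size
    # If we are well past the end of the bin, assume the query has run
    # and will not pick up this late result on its own.
    return now >= bin_end + cq_slack


on_time = probably_missed_by_cq(result_ts=1000, now=1100, bin_size=300, cq_slack=30)
just_after = probably_missed_by_cq(result_ts=1000, now=1210, bin_size=300, cq_slack=30)
lagged = probably_missed_by_cq(result_ts=1000, now=1300, bin_size=300, cq_slack=30)
```

Picking the slack is the awkward part: too small and a bin gets aggregated twice, too large and a late result is never rolled up.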

16 Feb 2016

Spent my week in San Diego attending the BGP hackathon and the AIMS workshop.

The hackathon went really well. I was so intimidating that nobody wanted to join my team, but I still managed to add a lot of useful filtering capabilities to CAIDA's BGPStream software. Will try to write a more detailed blog post on what I did at some point, but it was enough to win myself a prize for being one of the top teams.

The AIMS workshop was also very valuable, as there was definitely some interest in what we have been doing with both AMP and NNTSC. In particular, it seems that AMP might have some value for some big ISPs outside of New Zealand. Looking forward to seeing what comes from the discussions I had with various workshop attendees.