The aim of this project is to develop a system whereby network measurements from a variety of sources can be used to detect and report on events occurring on the network in a timely and useful fashion. The project can be broken down into four major components:
Measurement: The development and evaluation of software to collect the network measurements. Some software used will be pre-existing, e.g. SmokePing, but most of the collection will use our own software, such as AMP, libprotoident and maji. This component is mostly complete.
Collection: The collection, storage and conversion to a standardised format of the network measurements. Measurements will come from multiple locations within or around the network, so we will need a system for receiving measurements from monitor hosts. Raw measurement values will need to be stored in a form that supports querying, particularly for later presentation. Finally, each measurement technology is likely to use a different output format, so measurements will need to be converted to a standard format suitable for the next component.
Eventing: Analysis of the measurements to determine whether network events have occurred. Because we are using multiple measurement sources, this component will need to aggregate events that are detected by multiple sources into a single event. This component also covers alerting, i.e. deciding how serious an event is and alerting network operators appropriately.
Presentation: Allowing network operators to inspect the measurements being reported for their network and see the context of the events that they are being alerted on. The general plan here is for web-based zoomable graphs with a flexible querying system.
Another solid week of state machine improvements. I've been comparing the machines derived by my algorithm against the machines I can derive manually from the raw data. This has revealed quite a few failures on the part of my algorithm; most of the problems fell into one of two categories: 1) creating loops in situations where we probably shouldn't have, or 2) failures in the variant recognition code (both failing to recognise genuine variants and being too keen to decide that two sequences were variants).
In the process of fixing these problems, I also discovered a bug in my original pattern extraction code that was causing it to halt too early, i.e. as soon as it had extracted a pattern of at least 4 tokens rather than the intended 20, which explains why many of the patterns I was working with were fragments of a whole sequence. Fixing that has greatly improved the quality of the machines I have been deriving, as well as revealing some patterns that I was previously missing altogether.
Also spent a day tidying up some of the ampy and amp-web code prior to Brendon releasing them on GitHub. Made the old rrd-smokeping collection work again and removed all of the old LPI and munin collections, which we are not interested in maintaining right now.
Another short week of refinement on the FSM generation code. Fixed a major bug in my pattern-mining code that was causing it to return overlapping substrings as the most common repeated substring. Also spent a lot of time refining the code that determines whether one sequence is a variant of another; now, a short sequence that is entirely encompassed by a much longer sequence is considered a good match, regardless of how many tokens in the long sequence are left unmatched.
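The containment rule described above can be sketched roughly like this (a minimal illustration, not the actual code; `is_contained` and its token-list arguments are hypothetical):

```python
def is_contained(short, long):
    """True if the token sequence ``short`` occurs contiguously inside
    ``long``. Tokens of ``long`` outside the matched window are
    deliberately ignored: a short pattern wholly inside a longer one
    counts as a good variant match no matter how much of the long
    sequence is left unmatched."""
    n, m = len(short), len(long)
    if n == 0 or n > m:
        return False
    return any(long[i:i + n] == short for i in range(m - n + 1))
```

The key design choice is that the score only penalises mismatches within the short sequence itself, never the surplus tokens of the long one.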
Put together a poster describing the FSM work, as CROW are interested in displaying it at the CultivateIT event next week. Even if they don't use it there, it'll probably be handy to have available at some point.
Helped Brendon test out some code polishing that he has done to NNTSC before putting it up on GitHub. Went through and removed some outdated code in the repo (specifically the LPI modules) and updated the docs to not refer to our non-working modules so hopefully nobody will try to use them.
Another disrupted week, this time because a malfunctioning vehicle forced me to work from home for much of it.
Returned to polishing and improving my state machine generation code, mostly to deal with some minor inaccuracies when creating loops or converging branches. The machines are starting to look reasonably right, although I still need a good method for working out the best candidates to be start states.
Fixed some AMP matrix issues that cropped up when we rolled the latest code out to one of our deployments. The two main problems were that a) the throughput matrix hadn't been updated to the new API and b) the relative matrix metrics were inconsistent. As part of the process of fixing these, I also found that we've been calculating relative latency incorrectly for quite a while so I've fixed that as well.
NZNOG week. Spent the first two days finalising everything for the AMP tutorial.
The tutorial itself went ahead on Wednesday afternoon and seemed to be fairly successful. No major technical glitches and the participants seemed to get something out of it.
Gave my talk on my latest libprotoident study on Thursday. Happy with the reception I got and had quite a few interesting conversations afterwards as a result.
Helped Brendon get the NZNOG AMP tutorial in a presentable state. Built some VM images that our attendees will be able to use as an AMP server and made sure that the steps provided in our tutorial will result in a functional server. As a result, we've fixed a few little bugs in ampy and amp-web that showed up in situations where you don't already have a lot of pre-existing streams or meshes.
Made sure that our VMs and instructions will work with VMware, VirtualBox and QEMU, as well as on Ubuntu, Windows 10 and macOS.
Installed Ubuntu 16.04 on all of our UP boards, so they are all ready to go next week.
Tidied up and documented the FSM extraction code, so that I'll be able to remember how it works when I start working on it again in earnest next year.
Finished the matrix layout / selection changes and merged them back into develop. Hopefully we will get a chance to roll these out early next year once Brendon builds some new packages.
I had to run a test capture for a few days last week to make sure that some changes Richard had made to libtrace had not broken DAG and RT inputs. Ran the resulting traces through libprotoident to see whether any new protocols were worth investigating. Managed to make a few improvements to the rules for existing protocols to catch some cases we were missing, but otherwise nothing particularly exciting cropped up.
In Wellington for STRATUS forum on Monday. Had a few interesting chats -- definitely a lot of people out there interested in anomaly detection in a variety of contexts.
Continued refining my FSM generation code. Managed to get rid of most of the obviously incorrect transitions in my test cases now. There's still some work to do, namely tidying up orphaned states that are left over once the code realises they are redundant, and choosing better start states. My main focus before the end of the year, though, will be tidying up the code and making sure it is sufficiently documented that I'll be able to pick it up again in the new year.
Fixed a bunch of small problems with amp-web and NNTSC that we've known about for a while. Started working on replacing the matrix selection tabs with dropdowns and combining related "tabs" into a single matrix type, e.g. http duration and http page size are combined into a single "http" matrix with the ability to change the metric using a dropdown.
Double report due to being away at IMC.
Gave a practice run of my IMC talk. Still needed to cut some content and streamline it a bit, so spent more time working on that.
Documented the process for adding a new metric (i.e. a new graph/matrix for an existing collection) to the AMP website. Worked through an example by adding HTTP page size to my development website. Identified a number of issues with some of the terminology in the amp-web code that need to be fixed in the long run, but this will probably require a decent bit of code re-architecting.
Attended IMC in Santa Monica. Talk seemed to go over pretty well and it was great to catch up with people I had met at AIMS earlier this year, as well as some familiar faces from previous IMCs. Took the opportunity to have a brief holiday in L.A. afterwards.
Spent a couple of days reading over Richard S's paper and providing feedback.
Continued keeping an eye on the influx-nntsc test deployment. Pretty happy with it so Brendon and I will start working on packaging everything and rolling it out to skeptic next week.
Started working on an outline for my IMC talk.
Got some initial results back to Harris and Alan for their experiment using my suffix tree code. Had to rewrite a previously recursive algorithm to be iterative to work with some of the larger syscall logs, since Python is hopeless at recursion.
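The rewrite essentially replaces the interpreter's call stack with an explicit one. A toy sketch of the idea (the node layout here is hypothetical, not the real suffix tree structure):

```python
def count_leaves_recursive(node):
    # Recursive version: on a deep suffix tree this will hit Python's
    # default recursion limit (usually around 1000 frames) and raise
    # RecursionError.
    if not node["children"]:
        return 1
    return sum(count_leaves_recursive(c) for c in node["children"])

def count_leaves_iterative(node):
    # Same traversal using an explicit stack, so depth is bounded only
    # by available memory rather than the interpreter's call stack.
    leaves = 0
    stack = [node]
    while stack:
        cur = stack.pop()
        if cur["children"]:
            stack.extend(cur["children"])
        else:
            leaves += 1
    return leaves
```

Both functions compute the same result on shallow trees, but only the iterative one survives a chain thousands of nodes deep.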
Migrated the iterative version back into my automatic FSM construction code, which I resumed looking at on Friday. Still finding plenty of cases where variant patterns are not being combined into the original FSM correctly, so this has mostly involved a lot of debugging. The code has started to sprawl a bit, so had to take some time to refactor it into a manageable state.
Added a new collection to NNTSC for storing traceroute path lengths. This allows me to store the path lengths in Influx (for fast matrix aggregation), while keeping the full traceroute data in postgres. Updated ampy and amp-web to use the new collection, so we now have better matrix performance for all data types. NNTSC memory usage still seems to be fairly stable, which is good news.
Made a few final tweaks to my NNTSC paper before submitting on Friday.
Started looking into how I can use the common sequences extracted by my suffix tree code to recognise syscall patterns that can be turned into FSMs. The interesting challenge is identifying and combining variants of the same syscall pattern -- this is still a work in progress but early signs are promising (at least for the one example I've got so far!).
Finished the draft of my NNTSC paper. Got some initial feedback from Brendon which I've been able to incorporate into the paper.
Still not entirely happy with Influx-NNTSC and netevmon running on the same machine, as the combined memory usage will push skeptic's current hardware to its limit. Experimented with running netevmon on a separate VM just to make sure that a remote event database does actually work, so we at least have the option of moving netevmon onto its own dedicated machine.
Finished my implementation of the imprecise pattern mining algorithm. Started working on a more homegrown algorithm for detecting repeated sequences of syscalls within a larger trace, based on existing techniques for using a suffix tree to find repeated substrings within strings.
Returned to my half-written NNTSC paper with an eye towards submitting it to PAM in a few weeks. Paper is now around 75% finished, including a couple of nice diagrams showing the NNTSC architecture and the database schema. Space is starting to get a bit tight, so I'll have to revisit some of my earlier writing and cleanse it of unnecessary waffle.
With the help of an explanation from Harris, I've been able to decipher the temporal property mining algorithms. Managed to implement the simple version this week, which seems to be doing the right thing, and started working on a more complicated variant that allows for some imperfections in the source data (e.g. 9 times out of 10 a close follows an open, but every now and again someone forgets to call close before opening something else).
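As a rough illustration of the idea (my own toy version of an "a is answered by b" check, not the algorithm from the papers): an exact miner would demand a confidence of 1.0, while the imprecise variant accepts a property once its confidence clears a threshold such as 0.9.

```python
def response_confidence(trace, a, b):
    """Fraction of occurrences of event ``a`` that are followed by a
    ``b`` before the next ``a`` appears."""
    opened = answered = 0
    pending = False
    for ev in trace:
        if ev == a:
            opened += 1       # count every a, answered or not
            pending = True
        elif ev == b and pending:
            answered += 1
            pending = False
    return answered / opened if opened else 0.0
```

On a trace like `open, close, open, close, open, open, close`, one open in four goes unanswered, so the confidence is 0.75 and a 0.9 threshold would reject the (open, close) property.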
Kept tinkering with my mock skeptic install. I was a little concerned about the memory usage of anomaly_ts, so I went back over some previous work on the relative accuracy of each detector under a variety of parameter settings, trying to find good settings for each detector that use a minimal amount of stored history.
Spent a bit of time reading over some papers on mining temporal properties from sequences of function calls. The algorithms that these people are using are a bit tricky to decipher -- the explanation is a bit terse and I don't really have the background in the area to fill in the gaps -- so hopefully Harris will be able to get further than I did.
Continued building FSMs for common syscall patterns. Started working with the user study data, which is not at all well covered by my existing FSMs. This appears to be mostly because of various Gnome / X processes and widgets that are continuously polling and receiving events. The syscalls generated by these processes drown out everything else, so it is hard to find the actions that the users actually performed during the study.
Arranged travel and accommodation for my upcoming trip to IMC.
Finished up the libtrace4 and wandio releases and pushed them out.
Installed a mock version of skeptic on an openstack VM to test how InfluxDB copes with the full public AMP dataset. In general, InfluxDB seems to be coping OK when inserting / browsing data but the memory requirements of anomaly_ts are a bit larger than I would like so that's an avenue to chase up in the near future.
Continued implementing syscall FSMs manually to find out about other cases we need to consider when trying to automate the process. Added the ability to express a state as another FSM so we can build more complex machines from the smaller ones. Documented the code and put it into bitbucket so other people can start working with it.
Also started trying to use the FSMs on another dataset that Alan had collected. Turns out this dataset had a bunch of new syscalls that my previous parser hadn't seen before so it required a bit of updating.
Released new versions of libprotoident and libflowmanager with the new LGPL licensing. Also re-licensed and tested potential libtrace and wandio releases but haven't quite got to the stage where I want to push out the releases just yet.
Continued messing around with deriving FSMs from common system call patterns and turning them into runnable code. I've got 8 FSMs drawn up and have implemented 5 of them. Developed a bit of backend for applying my FSMs to the log data so that I can implement new FSMs with the least amount of coding possible (e.g. common actions like checking fd consistency and making sure parameters match expected values are all done within a parent FSM class, and the child classes just list the relevant data to compare against). Hopefully this will help move towards automated generation of the FSM code.
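The parent/child split might look something like this (a hypothetical sketch; the class and field names are invented, not the real code):

```python
class SyscallFSM:
    """Parent class: generic bookkeeping shared by every machine --
    advancing through the expected sequence, matching parameters and
    checking that the file descriptor stays consistent."""
    expected = []   # children override: list of (syscall, expected args)

    def __init__(self):
        self.state = 0
        self.fd = None

    def feed(self, syscall, args):
        """Consume one syscall event; return True once the full
        expected sequence has been matched."""
        name, want = self.expected[self.state]
        fd_ok = self.fd is None or args.get("fd", self.fd) == self.fd
        if syscall != name or not fd_ok or \
                any(args.get(k) != v for k, v in want.items()):
            self.state, self.fd = 0, None   # mismatch: reset
            return False
        self.fd = args.get("fd", self.fd)
        self.state += 1
        if self.state == len(self.expected):
            self.state, self.fd = 0, None   # accept and reset
            return True
        return False

class ReadConfigFSM(SyscallFSM):
    """Child class: just lists the data to compare against."""
    expected = [("openat", {}), ("read", {}), ("close", {})]
```

A new machine then only needs a few lines declaring its sequence, which is exactly the shape that automated generation could target.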
Had a few meetings where we discussed the FSM approach (and RA3 in general) with a few of the industry partners and they seem reasonably pleased with what we are trying to achieve so that's reassuring.
Helped Brendon try to debug some issues with data not appearing on graphs on the recently updated deployment. As a result of this, we've realised we need to re-think how we are storing and presenting traceroute data so that we can avoid these problems in the future.
Started looking at the most common patterns in my example sysdig logs. It's pretty obvious that we can easily recognise some low-level actions based on the sequence of system calls and produce models that can be used to identify them. For example, loading a .so shared library will generally result in the same sequence of system calls (with some minor variations) and therefore that can be expressed as a finite state machine.
Developed FSMs for four low level actions: loading a .so library, loading a python module, receiving a typed character via ssh and reading a modprobe config file. Implemented the SSH action as code so I can now find and replace those sequences in the logs with a single SSHCharInput action.
Helped Brendon install NNTSC, ampy and amp-web packages on one of our existing deployments on Thursday. We ended up with a problem where NNTSC would not return query data to the website and it took a lot of time (and debugging) to find the source of the problem: incongruous versions of psycopg2 in pip vs the Debian package.
Started prepping a libprotoident release. libprotoident is moving to an LGPL license so I've had to replace the blurb at the top of every source file. Been working through the usual pre-release testing and ChangeLog updating.
Spent Wednesday at the Honours conference. I thought all of our students presented well and gave good accounts of their work so far.
My IMC paper on unexpected traffic on well-known ports was accepted, which is great news. Spent Monday going over reviewer feedback and thinking about what revisions I need to make for the camera-ready version.
Continued working on integrating STRATUS with NNTSC. Spent way too much time trying to figure out why my data was not being inserted into the Influx database -- turns out the timestamp for the test data I was using was too old for the default retention policies so it was being automatically discarded. Fudged the test data times to be more recent and it finally worked.
Added file operations metric support to ampy and amp-web so we can now look at simple graphs of open frequency data. Found some scalability issues with our modal dialogs in cases where the number of possible options for a dropdown is very high, so I've gone back and added pagination support to all modal dropdowns so they only load 30 or so options at a time. This had some interesting flow-on effects, especially for the latency modal dialog which had a lot of custom code for populating the tabs for the different latency metrics. I think I've ironed out all of the extra wrinkles now.
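Server-side, the pagination itself boils down to something like this (illustrative only; the function name and response shape are made up, not the actual ampy/amp-web interface):

```python
def paginate_options(options, page, page_size=30):
    """Return one page of dropdown options plus a flag indicating
    whether more pages remain, so a modal dialog only ever renders
    ``page_size`` entries at a time."""
    start = page * page_size
    return {
        "options": options[start:start + page_size],
        "more": start + page_size < len(options),
    }
```

The frontend then requests the next page only when the user scrolls or searches, rather than loading thousands of options up front.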
Spent a little more time with the July traces to track down some more unknown protocols. Added a rule for the Netcore vulnerability scan (which happens a lot!) and updated rules for a lot of (mostly game-related) protocols.
Started working on integrating some of the STRATUS metrics into NNTSC so that we can explore using time-series based event detection to highlight potentially interesting file interactions. Going forward, I'm going to be splitting my time 50:50 between STRATUS development and WAND research work -- existing research might progress a bit slower as a result.
Continued poking at unknown flows in the July trace data. Added protocols for Final Fantasy XIV and Facebook Messenger. Noticed that the vDAG pipe on the probe that services wdcap is still dropping packets, so our captures are sometimes missing packets. Moving IP encryption off onto wraith seems to have helped with this, but is not an ideal solution.
Short week after taking leave on Monday and Tuesday.
Spent most of my remaining week looking at some new captures I took using the upgraded Probe. The main aim was to see whether there were any new protocols that libprotoident should be able to identify. Managed to find a handful of new protocols: Facebook Zero, Forticlient SSL VPN and Discord, and made some improvements to the rules for existing protocols (including the AMP throughput test!).
Most of my time was actually spent unsuccessfully hunting down what appears to be a new Chinese P2P protocol, which is a shame because it was contributing a very large amount of unknown traffic in my sample dataset.
Using BSOD on the live traffic feed also allowed me to spot a student doing vast quantities of torrenting on the campus network (which Brad reported to ITS) and our WITS FTP server being hammered with tons of download attempts from China. It's fair to say we've already gotten some good mileage out of the upgraded Probe.
Fixed a couple of outstanding bugs in amp-web. Should be ready to push some new packages out to skeptic and lamp early next week now.
Ported my event group pruning code from amp-web to a separate daemon that runs as part of netevmon. Rather than tweaking the event groups prior to displaying them on the dashboard, the daemon periodically fetches the most recent event groups from the database and checks for any redundancies that can be pruned. If any are found, the database itself is updated in place.
The benefits of this approach over the amp-web approach are that we can save on space in the event database and we don't need to do the full redundancy processing every time someone loads the dashboard. The one downside is that any merges are effectively permanent so I have to be very careful about testing my redundancy checks before rolling them out live.
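As a sketch of what a redundancy check might look like (my own illustration; the real criteria for deciding two groups are redundant live in netevmon, and the field names here are invented):

```python
def prune_groups(groups):
    """Merge event groups from the same source whose time ranges
    overlap, keeping the union of their member events. The real
    daemon would then write the merged groups back to the database
    in place of the originals."""
    merged = []
    for g in sorted(groups, key=lambda g: g["start"]):
        prev = merged[-1] if merged else None
        if prev and g["source"] == prev["source"] and g["start"] <= prev["end"]:
            prev["end"] = max(prev["end"], g["end"])       # extend window
            prev["events"] |= set(g["events"])             # union events
        else:
            merged.append(dict(g, events=set(g["events"])))
    return merged
```

Because the merges are written back permanently, a conservative check like this is preferable to an aggressive one: anything the daemon merges incorrectly cannot be unmerged later.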
Found and fixed some more InfluxDB memory problems when using the matrix. Most of the problems related to our use of the last() function, which for some reason can result in InfluxDB loading the whole table into memory. I've managed to rewrite the queries that used last() so that they don't require anywhere near as much memory (or processing time), so tooltips, in particular, should be a lot faster to process and less likely to push the server into swap.
Got the waikato capture point back up and running after its disks were replaced on Thursday. Used it to demo BSOD to various visitors who were here for the CSC.