Cuz

The aim of this project is to develop a system whereby network measurements from a variety of sources can be used to detect and report on events occurring on the network in a timely and useful fashion. The project can be broken down into four major components:

Measurement: The development and evaluation of software to collect the network measurements. Some software used will be pre-existing, e.g. SmokePing, but most of the collection will use our own software, such as AMP, libprotoident and maji. This component is mostly complete.

Collection: The collection, storage and conversion to a standardised format of the network measurements. Measurements will come from multiple locations within or around the network so we will need a system for receiving measurements from monitor hosts. Raw measurement values will need to be stored and allow for querying, particularly for later presentation. Finally, each measurement technology is likely to use a different output format so will need to be converted to a standard format that is suitable for the next component.

Eventing: Analysis of the measurements to determine whether network events have occurred. Because we are using multiple measurement sources, this component will need to aggregate events that are detected by multiple sources into a single event. This component also covers alerting, i.e. deciding how serious an event is and alerting network operators appropriately.

Presentation: Allowing network operators to inspect the measurements being reported for their network and see the context of the events that they are being alerted on. The general plan here is for web-based zoomable graphs with a flexible querying system.
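To make the pipeline concrete, here is a minimal Python sketch of how a measurement might flow from a collector through the standardised format to the eventing stage. The record fields, function names and threshold are all invented for illustration, not the project's actual schema:

```python
from dataclasses import dataclass

# Hypothetical standardised measurement record - field names are
# illustrative only.
@dataclass
class Measurement:
    source: str      # which collector produced this (e.g. "smokeping", "amp")
    metric: str      # what was measured (e.g. "latency_ms")
    timestamp: int   # unix seconds
    value: float

def normalise_smokeping(raw):
    """Convert one collector's native output into the standard record."""
    return Measurement("smokeping", "latency_ms", raw["time"],
                       raw["median"] * 1000.0)

def detect_events(series, threshold):
    """Toy eventing stage: flag measurements above a fixed threshold."""
    return [m for m in series if m.value > threshold]

series = [normalise_smokeping({"time": t, "median": v})
          for t, v in [(0, 0.020), (60, 0.021), (120, 0.250)]]
events = detect_events(series, threshold=100.0)  # the 250 ms spike is flagged
```

The point of the standard format is that the eventing stage only ever sees `Measurement` records, regardless of which collector produced them.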

15 Oct 2012

Continued working with Nathan to get smokeping data successfully into the
event detection system. I generated some random data to fill the
historical buffers and then continued to run it over live data, which
generated a small number of plausible looking events. I'm now looking into
the scalability and resource usage of this as it seems a little higher
than it should be. Also polished the dashboard graphs slightly, changing
them to use more sensible axes and higher-resolution data.
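Priming a detector with synthetic history before switching to live data can be sketched like this (the buffer size, baseline and jitter are invented for illustration):

```python
import random
from collections import deque

HISTORY_LEN = 288  # e.g. a day of 5-minute measurements (illustrative)

def seed_history(baseline, jitter, n=HISTORY_LEN, seed=42):
    """Fill the historical buffer with plausible random values so the
    detector has context before the first live measurement arrives."""
    rng = random.Random(seed)
    return deque((baseline + rng.uniform(-jitter, jitter) for _ in range(n)),
                 maxlen=n)

history = seed_history(baseline=25.0, jitter=2.0)
# Live measurements are appended normally; seed values age out of the window.
history.append(24.3)
```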

Spent some time with Richard, Tony and Shane thinking about the future
direction of AMP. We've got some good ideas and have a whiteboard full of
initial planning for the work that needs to be done.

Read draft introductions to a number of 520 reports and gave some
hopefully useful feedback. Everyone seems to be on the right track so far,
looking forward to reading more.

15 Oct 2012

Short week this week - took leave on Thursday and Friday.

Released a new version of libtrace (3.0.15) on Monday. Mostly just a few little bug and build fixes, but it had been a while since the last release. Also submitted a patch for the FreeBSD libtrace port which had been broken for a very long time.

Did a bit more refinement on my Plunge and ArimaShewhart event detectors. They're at a stage now where the number of false positives is close to none. False negatives are a bit harder to identify, of course. The next sensible step is probably to think about testing against real-time data and manually validate the events as they roll in.

Spent a day looking at the latest LPI data from a live analysis I have running on our ISP monitor. Managed to get some up-to-date stats on application usage for last September but haven't had a chance to look over it in detail yet.

I did note a bit of an increase in the amount of unknown UDP traffic, so chased up a few of the more common patterns. Have added 3 new protocols to libprotoident as a result: ZeroAccess (a trojan), VXWorks Exploit and Apple's Facetime / iMessage setup protocol.

08 Oct 2012

Added a new anomaly detector to our network event monitor: the Plunge Detector. The basic aim is to detect situations where an otherwise active time series plunges to a very low (or zero) value. Sounds simple, but it's kinda tricky to do in a generic fashion. The general algorithm is to track the median and minimum observed values over the past N measurements and then raise an alarm when the current value is significantly below both the median and the minimum observed value.
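The plunge check might be sketched like this in Python; the window size and the "significantly below" ratio are illustrative guesses, not the real tuning:

```python
from collections import deque
from statistics import median

class PlungeDetector:
    """Sketch of the plunge check: keep the last N values, alarm when the
    current value sits well below both the window's median and its minimum.
    N and the ratio are made-up parameters for illustration."""

    def __init__(self, window=50, ratio=0.2):
        self.history = deque(maxlen=window)
        self.ratio = ratio  # "significantly below" = under ratio * reference

    def update(self, value):
        plunged = False
        if len(self.history) == self.history.maxlen:
            med = median(self.history)
            low = min(self.history)
            plunged = value < self.ratio * med and value < self.ratio * low
        self.history.append(value)
        return plunged
```

Because an already-quiet series has a small median and minimum, the ratio test stays silent there; it only fires when a genuinely active series drops away.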

Spent much of the week testing both the new Plunge detector and the Shewhart detector against the various LPI time series in my test data set. Lots of refinement going on with both detectors, but starting to get pretty happy with the results.

Started working towards a new libtrace release - mostly just a few little bug fixes and tidyups. Part of the release process is to test it on a FreeBSD machine, but the old emulation image doesn't work with the new emulation network. Set up a FreeBSD 9 machine so that Brendon could make a new image, which was a lot more painful than it should have been. Managed to get libtrace tested and passed the machine over to Brendon for imaging - I expect a decent rant in his weekly report about that step of the process too :)

02 Oct 2012

Tried to make the generated alerts more efficient and more effective by
very slightly delaying the actual alerting - doing so means that the alert
can contain any other events that arrive immediately after the triggering
event. It also now sends me emails for certain event thresholds, but I
broke the live import of AMP data so need to fix that before I can get
more than the emails generated by my test data.
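The hold-the-alert idea can be sketched as follows (the window length and event shape are invented for illustration):

```python
# Events arriving shortly after a trigger are folded into the same alert,
# so operators get one email instead of a burst of them.
HOLD_SECONDS = 30  # illustrative hold-down window

def batch_alerts(events):
    """events: list of (timestamp, description) sorted by time.
    Returns a list of alerts, each covering a trigger plus any events
    arriving within HOLD_SECONDS of it."""
    alerts = []
    current = None
    for ts, desc in events:
        if current and ts - current["trigger"] <= HOLD_SECONDS:
            current["events"].append(desc)
        else:
            current = {"trigger": ts, "events": [desc]}
            alerts.append(current)
    return alerts

alerts = batch_alerts([(0, "loss spike"), (10, "latency jump"),
                       (300, "path change")])
# two alerts: one covering the first two events, one for the path change
```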

Started trying to make the information presented in the default web
interface a bit more concise and relevant to what is going on right now.
Trying to use a few graphs to give an initial overview of the recent data
while keeping the ability to go look at everything in detail as you can
now.

The AMP deployment on the NLNOG RING was mentioned during a talk at RIPE
about the RING along with screenshots and links back to WAND. The slides
look pretty good and I think it went well.

01 Oct 2012

Continued making tweaks and changes to the Shewhart anomaly detector in response to erroneous events produced when running it against the full set of protocols supported by libprotoident. It now tends to only pick up major or sudden changes in the time series, which is great when dealing with protocols that aren't very common but may not be the best for more popular protocols.
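For reference, the textbook Shewhart control-chart rule looks like the following. This is the standard formulation, not necessarily the exact variant used here, and raising `k` is one simple way to get the "only major or sudden changes" behaviour:

```python
from statistics import mean, stdev

def shewhart_flags(series, k=3.0, warmup=10):
    """Standard Shewhart rule: flag a point when it falls more than k
    standard deviations from the mean of the values seen so far. The
    warmup length and k are illustrative parameters."""
    flags = []
    for i, x in enumerate(series):
        if i < warmup:
            flags.append(False)  # not enough history to judge yet
            continue
        mu = mean(series[:i])
        sigma = stdev(series[:i])
        flags.append(sigma > 0 and abs(x - mu) > k * sigma)
    return flags
```

For rare protocols this works well; for popular, noisier protocols the control limits widen with the noise, which is the trade-off described above.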

Finished my teaching load for 301 - final lecture was given on Monday and marked the last C programming assignment throughout the week. Definitely enjoyed the opportunity to do something a little different and hopefully it was valuable to the students too. It would be great if we could find a way to keep using some of the material I prepared in future courses.

25 Sep 2012

Alerts due to events can now be triggered on individual events as well as
combined event groups. There are checks in place to try to prevent too
many alerts being generated at any one time or by the same events. The
next step may be to actually generate emails to myself for new events to
test that the thresholds are set appropriately and aren't too annoying.

Finished implementing a fix to help minimise the number of event groups by
rearranging them when possible to get better groups.

Had to rewrite some of the event database queries to be more efficient now
that we have many more historical events being added. The database now
does more of the heavy lifting (as it should) rather than doing it in the
web code.

24 Sep 2012

Spent a fair bit of time on teaching tasks this week. My threaded I/O assignment for 301 was proving quite testing for students - especially getting their heads around concurrent programming and the resulting issues. Also brushed up on my kernel module programming for the final lecture that I'll be giving on Monday.

Worked on improving the Shewhart anomaly detection when run against bursty time series. This basically came down to tweaking a few of the ratios and multipliers that I use to counteract some buggy and/or unintended behaviour. Replaced the event severity calculation with one that is much more internally consistent than before - bigger spikes now have a larger severity than smaller ones, as opposed to the previous approach where a bigger spike could have a much lower severity if preceded by slightly noisier measurements.

Reviewed a paper on YouTube modelling for Transactions on Emerging Telecommunications Technologies.

17 Sep 2012

Fixed the bug in the tput test that would sometimes cause it to refuse
connections for a minute when it was meant to be re-establishing a new
test connection. It was erroneously waiting for more data when there was
no more to follow, so wouldn't continue until select timed out. Updated
the NZ mesh with all the fixes from the last couple of weeks.

Worked on the backend for the event detection web interface to use a more
flexible and secure database abstraction. Made a few small changes to the
web interface to try to hide information that wasn't always needed, but
still make it available if required.

Investigated all the historical event groupings and found a few rare cases
where it wasn't doing the right thing due to the order in which events
arrived or due to some missing common attributes. Came up with an approach
to sometimes rebuild groups as needed to minimise their number and get
better matches.

17 Sep 2012

LPI events are now working inside Brendon's webpages - it's still a bit rough around the edges but good enough for a working prototype.

Played around with using PHPTAL to provide templating for our pages. It provides some nice features like automatic escaping of HTML entities and separation of the page logic from the layout. At the moment, just the LPI event display page is templated but I will hopefully extend this to other parts of the presentation layer.

Started on some more comprehensive testing of the system by throwing the entirety of the Waikato 6 traceset at it - 249 protocols * 8 metrics * several months of data. This immediately started to reveal some problems in the anomaly detection phase, such as R really not liking having to guess an ARIMA model for a time series containing entirely zeroes and stopping the entire process as a result. I also found that my anomaly detection doesn't perform particularly well when the traffic level is mostly at zero with regular bursts at a consistent quantity - each burst is being treated as an event when really that appears to be normal behaviour.
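One way to guard against the all-zero case, sketched in Python (the function name is mine, not from the real code), is to refuse to hand a degenerate series to the model-fitting step rather than letting a failed fit kill the whole process:

```python
def fittable(series):
    """A series that is entirely zero (or entirely constant) gives ARIMA
    order-guessing nothing to work with, so skip it up front instead of
    letting the fit blow up and stop the pipeline."""
    return len(series) > 1 and len(set(series)) > 1

assert not fittable([0.0] * 100)   # all zeroes: skip, don't fit
assert fittable([0.0, 0.0, 5.0])   # some variation: safe to try
```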

Submitted the final camera-ready version of my IMC paper - already the publishers have come back with some pedantic typesetting crap :)

10 Sep 2012

Fixed a bug in the AMP web API that would give incorrect traceroute data
for the last measurement bin in certain situations and was causing issues
with the path change event detection algorithms. Also, after running an
AMP client with the threading fixes for a week on some of the machines
most often affected by the bug I'm pretty confident that it's fixed.

Fixed the group membership checks using common path information to
properly group events based on all items having a shared attribute. I'm
quite happy with the contents of the new groups, they make good sense and
can help show underlying problems in intermediate networks that aren't
immediately obvious from looking at just sources/destinations.
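The "all members share an attribute" rule might be sketched like this; the event shape and the greedy strategy are illustrative only:

```python
# An event carries a set of path attributes (say, ASNs on its traceroute
# path) and may only join a group if some attribute is common to every
# existing member of that group.
def group_events(events):
    """events: list of (event_id, attribute_set). Greedily build groups
    whose members all share at least one attribute; each group records
    the common attributes, which also explain why the group exists."""
    groups = []  # each: {"members": [...], "shared": set_of_common_attrs}
    for event_id, attrs in events:
        for g in groups:
            common = g["shared"] & attrs
            if common:
                g["members"].append(event_id)
                g["shared"] = common  # intersection can only shrink
                break
        else:
            groups.append({"members": [event_id], "shared": set(attrs)})
    return groups
```

A nice side effect of tracking the shrinking intersection is that the shared attributes (e.g. a common intermediate AS) double as the explanation for the group.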

Started putting together a database schema and web interface for a very
simple alerting system using the event groups detected.

10 Sep 2012

Kinda short week this week - had Monday off and the 520 conference was on Tuesday, so not a lot of work done those days. Our students generally performed very well on Tuesday, so congrats to you all - very disappointed that we didn't manage to score a prize because I think many of you deserved one.

Prepared my last 301 assignment - a pthreads-based task. It's trickier than it looks, so it will be interesting to see how the students go. Also spent a bit of time trying to get ahead again with my lecture slides.

Continued pulling together the various components of our event detection system. Most of my effort this week involved playing around with the php Brendon wrote for his AMP events so that it would also support LPI events. Not quite there yet, but getting close to a working solution.

05 Sep 2012

Found what looks to be a threading bug in AMP measured that has been
troubling me for a while. Test threads check that the nametable is up to
date before running but it was possible for them to deadlock on accessing
the file. I've had a fixed version of the code running for a couple of
days on some of the machines that were most often affected and have yet to
see the problem again. Hopefully that's fixed!

Fixed a couple of small bugs in the AMP matrix tooltips that were
triggering events on child elements without the appropriate attributes, so
they weren't displaying information.

Wrote a program to insert common path information into the event database
to use for grouping events. Testing so far with this data shows that
fewer, larger groups of events are being created. Some of the membership
is a little bit questionable, so am now in the process of having it
describe the reasoning behind creating each of the groups.

27 Aug 2012

Ran some more tests on the IPv6 packet filtering in the AMP ICMP test and
it does indeed appear that the errors are due to packets arriving between
the socket being opened and the filter being applied. That makes most of
the warnings much less worrying, and I've lowered the priority on those
that I can confirm aren't an issue. While investigating this I also found
a situation where various test resources weren't being freed in the
traceroute test if they involved IPv6 addresses. Fixed that as well.

Finished updating the protocol between the different parts of the event
detection process to use the new protocol design. Also changed it from
using local unix sockets to run across the network, as our data sources
will likely be on different machines to the eventing system. Socket input
for the time series data is also now supported rather than only using
stdin.

Updated the sample web scripts that display event information to work with
the new database schema to confirm that everything is still working as it
should.

Pushed out the AMP matrix changes to the NLNOG RING. Also investigated
colouring cells based on current performance vs historical performance
rather than raw latency values, which was a request they had.

27 Aug 2012

Managed to get the ArimaShewhart detector fully integrated into the anomaly detection system and producing "correct" results. Now started turning my attention to using Nathan's software to provide suitable input and store measurements in a database that can be queried by the presentation / graphing side of the project.

The latest 301 assignment was due on Friday, so spent a fair bit of time helping out students who were having a few pointer difficulties.

Finished a draft revised version of my IMC paper - turns out I hadn't gone over the page limit by as much as I had feared so it was relatively easy to get the paper down to a suitable length.

Fixed a bug in libtrace relating to the use of Linux native on loopback interfaces that was reported by Asad. Might be time to think about a new release soon.

20 Aug 2012

Marked the first 301 assignment. Generally, the students did really well - hopefully because of my teaching skills. Managed to run out of pre-prepared lectures, so spent a bit of time working on next week's lecture.

Started working on the camera-ready version of my IMC paper. Added quite a bit of content to address the review comments - now I just need to edit it all down to fit under the page limit.

Finished writing the C++ version of my Arima-Shewhart anomaly detector. Tracked down and fixed a few bugs in the Arima forecasting portion of the detector - now the forecasts match those produced by the original python scripts.

13 Aug 2012

Sat down with Shane and went over what we need to do to get our event
detection programs integrated. The protocol used between data fetching,
detection and eventing needs to be updated slightly as there is more
information needing to be shared and magic numbers from testing to be
changed into real data. Started work on updating the protocol to match
what is required and updating the database schema to match.

Integrated the path change detection into the main detection code and
updated the main code path to deal properly with the slight differences
between a wider variety of data - traceroute, latency, byte counts, etc.

Did a bit of maintenance on the NZ AMP mesh and KAREN weathermap as well -
updated some addresses and got IPv6 addresses for the Citylink and
Netspace AMPlets which is neat.

13 Aug 2012

Sat down with Brendon and worked out what communication is necessary between the various components of our event detection system. We also decided what we need to start doing to try and bring it together into a working prototype.

My main task is to take all my existing python prototype code and turn it into a C++ detector class to fit into the system we inherited from Andreas. This involved a bit of pain. Firstly, matrix math is a lot easier in python (with magic lists) than it is in C++ so I had to be quite careful and check that at each stage, my converted math was producing the same results as the python prototype. Secondly, interfacing with R via its C "library" is something of a trial and error process, in particular figuring out which elements in the result vector correspond to the values described in the R documentation.

However, I've managed to get past that now and am in the process of finishing up converting the Shewhart code (the stuff that actually picks out events).

Wrote the second assignment for 301, while responding to a few questions regarding assignment 1 which was due on Friday.

06 Aug 2012

With a bit of tweaking, my smoother modelling process is now producing results just as good as, if not better than, what I was getting with the old wavelet-based system. There are still quite a few false positives, which is annoying, but these are almost all situations where there is a traffic spike that I judge to be too small to qualify as a genuine event.

At this point, we need to stop playing with anomaly detection and start thinking about combining everything into a rough but functional final product.

Spent some time helping Meenakshee get set up and helped out with her 591 proposal. Also worked out a revision plan for the IMC paper and sent it off to our shepherd.

31 Jul 2012

Started looking at using topology data to generate more datapoints to help
group events on. Hopefully should be able to group events between sites
that share common paths (at this stage I'm planning on starting with the
AS path) as well as those that share sources and targets. As part of this
added an event detector to alert on major path changes between sites and
realised that there appears to be a bug in the AMP code to determine
common paths. Spent some time trying to track it down and it looks to be
due to counting the sample time period incorrectly, which I'm now trying
to fix.

Figured out the cause of the AMP data interface module crashing on newer
php/apache. An incorrectly sized variable was being used in the C portion
to receive data from the php portion and along the way it was clobbering
something it shouldn't have. I'm sure the compiler warned about this last
time, but not in this case.

30 Jul 2012

My IMC paper on the effect of the Copyright Amendment Act was accepted! However, it looks like I have a fair bit of work to do on it, mainly softening the conclusions. The reviewers felt the results suggested, but did not prove, that the CAA was the cause of the observed behaviour, which I feel is a fair response.

It was a case of one step forwards, two steps back with the event detection this week. I had added a new dataset to my testing, only to run into an old problem where a sharp change in the time series would cause the ARIMA modelling to perform undesirably. A large residual would enter the prediction calculations, which would cause the next prediction to be way off, which would cause a new large residual to enter the calculations, etc. etc.

Instead, I adjusted the ARIMA modelling to only use a small proportion of large residuals when updating the model. The proportion was calculated using a logarithmic algorithm, so that very large residuals would use a much smaller proportion. This resulted in a much better model that responded to change in the time series in a slower and smoother manner.
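The damping idea might look something like this; the exact proportion formula isn't stated above, so this logarithmic decay is only one plausible reading:

```python
import math

def damped_residual(residual, threshold):
    """Feed only a fraction of a large residual back into the model update.
    The fraction shrinks logarithmically with the residual's size, so a
    huge outlier nudges the next prediction instead of wrecking it. The
    threshold and decay shape are illustrative guesses, not the real
    parameters."""
    mag = abs(residual)
    if mag <= threshold:
        return residual  # ordinary residuals are used in full
    fraction = 1.0 / (1.0 + math.log(mag / threshold))
    return residual * fraction
```

With this damping the model drifts toward a level shift over several steps instead of chasing it in one jump, which is what produces a run of moderate residuals rather than a single huge one.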

Previously, the response was very rapid and we detected events by looking for a single large residual (because the model would adapt so quickly, we usually only got one shot at seeing the change). Now, we tend to get several large (but much smaller than before) residuals as the predictive model slowly catches up with the change in traffic level produced by the event. Unfortunately, this meant that all of the event detection rules I had developed over the past month were useless, but I've been able to quickly adapt to the new approach and am getting results that aren't too different from what I was getting before I made this change.

One benefit of this change that I'm still investigating is that the smoother modelling may mean that we can drop the wavelet transform step. This was used to smooth the original data to remove random noise but had the downside of requiring over 20 measurements ahead to produce the smoothed value for a single point. In practical terms, this meant I couldn't report an event until 20 or more minutes after it had happened (assuming minutely measurements). If this works, I can report events much closer to the time that they happen.