
Shane Alcock's Blog




Attempted to implement an ARIMA self-updating forecaster in Python, using the formula that R claims to be using for its own modelling. This worked reasonably well, but was nowhere near as accurate as the forecasts that R had been producing. Spent a fair bit of time looking into what math I should be using for ARIMA forecasting without much success.

Eventually settled on just calling R itself for a while until I'm satisfied that ARIMA is suitable for what we are trying to do. This is a lot slower than I'd like as the model has to be re-applied every time we see a new measurement. This produces the forecasts that I was expecting and now the main problem is to work out how we should be setting the threshold for determining whether an event has occurred or not. Initially, I tried to use the variance of the residuals in the initial model as a starting point but there was no obvious relationship between that and the forecast errors for genuine events.
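The thresholding problem can be sketched in miniature: compare each new measurement against a one-step-ahead forecast and flag an event when the absolute forecast error exceeds a threshold. Everything here is illustrative — the naive moving-mean forecaster stands in for the R-fitted ARIMA model, and the threshold value is made up.

```python
from collections import deque

def detect_events(measurements, forecast_fn, threshold):
    """Flag an event whenever a measurement deviates from its
    one-step-ahead forecast by more than the given threshold.

    forecast_fn takes the history seen so far and returns a
    prediction for the next value (in the real system this would
    be the R model, re-applied after every measurement)."""
    history = deque()
    events = []
    for t, value in enumerate(measurements):
        if history:
            predicted = forecast_fn(list(history))
            if abs(value - predicted) > threshold:
                events.append((t, value, predicted))
        history.append(value)
    return events

# Stand-in forecaster: predict the mean of the last few values.
def naive_forecast(history, window=5):
    recent = history[-window:]
    return sum(recent) / len(recent)

series = [10, 11, 10, 10, 11, 10, 30, 10, 11, 10]
print(detect_events(series, naive_forecast, threshold=5))
# only the spike at index 6 is flagged
```

The open question from above still stands in this sketch: nothing in the residual variance of the training fit tells you that 5 is the right threshold.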

Also found that we're going to have to limit the number of wavelet transforms we use for smoothing our original data, as each additional transform will increase the gap between the event occurring and the event being detected.

In between all that, marked 513 libtrace assignments. Was mildly amused by the range of mistakes and errors that the students made -- maybe libtrace programming isn't as easy as we thought!




Managed to master the art of wavelet transforms - the problems I was having were due to mismatching the scale and wavelet values when inverting the transformation. After a lot of debugging, I was able to ensure that I could reliably transform my data and then invert it back to the same original values for any given number of nested transformations. Once that was working, I was able to get sensible results when denoising my time series.
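The post doesn't say which wavelet was used, so here is a minimal sketch with the simplest case, the Haar wavelet. It shows the property that was causing trouble: nested transforms only invert exactly when each level's approximation (scale) coefficients are recombined with the matching level's detail (wavelet) coefficients, in reverse order.

```python
def haar_forward(data):
    """One Haar level: pairwise averages (scale) and differences (detail)."""
    approx = [(data[i] + data[i + 1]) / 2 for i in range(0, len(data), 2)]
    detail = [(data[i] - data[i + 1]) / 2 for i in range(0, len(data), 2)]
    return approx, detail

def haar_inverse(approx, detail):
    """Invert one level: a + d and a - d recover the original pair."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([a + d, a - d])
    return out

def wavedec(data, levels):
    """Nested transform: keep re-transforming the approximation."""
    details = []
    approx = list(data)
    for _ in range(levels):
        approx, d = haar_forward(approx)
        details.append(d)
    return approx, details

def waverec(approx, details):
    """Invert by pairing each detail level with the matching scale,
    innermost level first -- mixing these up is what breaks inversion."""
    for d in reversed(details):
        approx = haar_inverse(approx, d)
    return approx

def denoise(data, levels, cutoff):
    """Zero out small detail coefficients, then reconstruct."""
    approx, details = wavedec(data, levels)
    details = [[d if abs(d) > cutoff else 0.0 for d in lvl] for lvl in details]
    return waverec(approx, details)

x = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
approx, details = wavedec(x, 2)
print(waverec(approx, details))  # recovers x exactly
```

(Input length is assumed to be divisible by 2^levels; real wavelet code has to handle the padding.)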

Now that I had a denoised time series, I turned back to looking at forecasting techniques. Holt-Winters still wasn't a good fit for the denoised data, so I started learning about ARIMA models. Unfortunately, the test data I have doesn't really fit the basic ARIMA models, which made it difficult to get the right fit. Anyway, I now have a decent understanding of how ARIMA works in general, but need to come up with a way to use the ARIMA model in an on-line, self-updating context.
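One way to picture the on-line, self-updating use of ARIMA — purely a sketch, not the approach the blog settled on — is the smallest interesting model, ARIMA(1,1,0): difference the series once, fit an AR(1) coefficient to the differences by least squares, and refit from scratch after every new observation.

```python
def fit_ar1(diffs):
    """Least-squares estimate of phi in d[t] = phi * d[t-1] + e[t]."""
    num = sum(diffs[t - 1] * diffs[t] for t in range(1, len(diffs)))
    den = sum(d * d for d in diffs[:-1])
    return num / den if den else 0.0

def online_arima110(series):
    """One-step-ahead ARIMA(1,1,0) forecasts, refitting the AR
    coefficient on the full history after each observation --
    simple, but this refit-everything loop is exactly why calling
    a full modelling package per measurement gets slow."""
    forecasts = []
    for t in range(2, len(series)):
        history = series[:t]
        diffs = [history[i] - history[i - 1] for i in range(1, len(history))]
        phi = fit_ar1(diffs)
        # forecast next difference, then undo the differencing
        forecasts.append(history[-1] + phi * diffs[-1])
    return forecasts

print(online_arima110([1, 2, 3, 4, 5]))  # [2.0, 4.0, 5.0]
```

A proper self-updating scheme would update the coefficient incrementally rather than refitting over the whole history, which is the part that still needed working out.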

Released libprotoident 2.0.5 on Friday, mainly as something to do so I could have a break from mathematics for a bit :)




Libprotoident 2.0.5 has been released today.

This release adds support for 19 new protocols, including Omegle, Apple Push Notifications and DCC. It improves the rules that are used for matching a further 17 protocols.

This release also adds a new tool, lpi_arff, which produces protocol usage stats in a format that can be used by the WEKA machine learning software.

The full list of changes can be found in the libprotoident ChangeLog.

Download libprotoident 2.0.5 here!




Finished up the proof-of-concept CAPWAP parser. ITS seemed pretty happy with the results so far, so I will probably be asked to develop a production version at some stage.

Turned my thoughts back to anomaly detection in noisy time series data. Measuring the autocorrelation of errors suggested that Holt-Winters forecasting alone was unlikely to be useful for our purposes in the long run. Started learning about using wavelets to denoise the data so that forecasting techniques might work better. I'm part of the way there -- I can apply a couple of wavelet transformations and get smoother data but I seem to start adding noise if I go any further than that.
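The autocorrelation check mentioned above is straightforward to sketch: if the forecast errors are strongly correlated with themselves at a short lag, the forecaster is leaving predictable structure behind. The function below is a generic sample autocorrelation, not the blog's actual diagnostic code.

```python
def autocorr(xs, lag):
    """Sample autocorrelation of xs at the given lag (between -1 and 1).
    Values far from zero mean the residuals are not white noise."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[t] - mean) * (xs[t + lag] - mean) for t in range(n - lag))
    return cov / var if var else 0.0

# Residuals that flip sign every step are maximally anti-correlated
# at lag 1 -- a clear sign the forecaster is systematically wrong.
residuals = [1.0, -1.0] * 10
print(autocorr(residuals, 1))
```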

Was interviewed by Radio NZ on the topic of my research into Internet usage following the CAA act.

Had a good chat with Sam Russell from REANNZ on Tuesday when he and Steve Cotter came to visit the group.




Wrote an additive Holt-Winters forecaster for use with the decomposed time series data. This one attempts to set the initial seasonal components correctly rather than just ignoring the seasonal behaviour in the training data. It's still not great with my test data but might still have uses with non-decomposed time series.
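A minimal sketch of the idea, assuming the standard additive Holt-Winters update equations: the initial level is the mean of the first season, the initial trend is the average per-step change between the first two seasons, and the initial seasonal components are the first season's deviations from its mean. The smoothing parameters are placeholders, not the values the blog's forecaster used.

```python
def holt_winters_additive(series, season_len, alpha=0.3, beta=0.05, gamma=0.1):
    """One-step-ahead additive Holt-Winters forecasts, with the
    initial seasonal components estimated from the training data
    rather than started at zero."""
    m = season_len
    level = sum(series[:m]) / m
    trend = (sum(series[m:2 * m]) - sum(series[:m])) / (m * m)
    seasonal = [x - level for x in series[:m]]

    forecasts = []
    for t in range(m, len(series)):
        s = seasonal[t % m]
        forecasts.append(level + trend + s)
        value = series[t]
        last_level = level
        level = alpha * (value - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        seasonal[t % m] = gamma * (value - level) + (1 - gamma) * s
    return forecasts

# A perfectly periodic, trend-free series is forecast exactly,
# because the seasonal components start out correct.
print(holt_winters_additive([1, 5, 1, 5, 1, 5, 1, 5], season_len=2))
```

Had the seasonal components been initialised to zero instead, the first season's worth of forecasts here would be badly wrong while the gamma updates slowly caught up — which is the behaviour this version was written to avoid.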

Started writing a proof-of-concept program to analyse CAPWAP traffic and track the amount of traffic observed on a wireless network at the AP, SSID and individual user levels. This is part of a possible project for ITS to allow them to keep historical statistics of the AP usage around campus.




Our paper on libtrace entitled "Libtrace: A Packet Capture and Analysis Library" has been officially published in this month's edition of ACM Computer Communication Review.

It has been a bit of a battle over the years to find a venue that was willing to publish a paper on libtrace, as the direct scientific contribution of libtrace itself is subtle. It was also difficult to articulate exactly how libtrace is so much easier and more pleasant to work with compared to other trace analysis libraries. Often the improvements present in libtrace were dismissed out of hand as being nice but not necessary.

For example, capture format agnosticism was dismissed by some reviewers as mostly pointless because they never needed to work with a trace format other than pcap. The performance enhancements were similarly discredited because it was just easier to "buy a faster CPU" or because you could just use a separate zcat process to decompress the trace instead (hence the explicit discussion of the difference between using a separate process + pipe versus the threaded approach employed by libtrace).

As a result, we often had to go back to the drawing board and think more carefully about how to "sell" each of the enhancements in libtrace and clearly explain the reasoning behind each design decision. Eventually we managed to find the right combination of venue and tone that allowed us to finally get a submission accepted.
Hopefully this will lead to more network researchers learning about libtrace and adopting it for use in their own research and analysis tasks.

A copy of the paper can be downloaded from here.




Another rather fragmented week. Continued helping out where I could with the funding proposals, particularly finding references and tidying up some of the wording. Now we just have to wait and see if we actually get any of the funding we're asking for.

Taught 513 this week - we covered the recently published libtrace paper. I think I did a reasonable job of selling the students on libtrace. Wrote a possible libtrace programming assignment for the class which will be set if Richard gives it the go-ahead.

Prepared a 1.0.3 release for libtcpcsm. I've sent the release candidate off to a user who has been using the library quite a bit for testing prior to an actual release.

Started preparing for a new libprotoident release as well.

On the time series front, decomposing the time series seems to produce a trend line that can highlight genuine events in the data but there are still some caveats. In particular, none of our existing detectors work that well with the resulting data and it isn't clear that we can do the decomposition reliably when running live.
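The trend line mentioned above comes from decomposition; the classical approach (what R's decompose() does for the trend part) is just a centred moving average over one seasonal period. A minimal sketch, with edge handling and window choice purely illustrative — it also makes the live-operation caveat visible, since the centred window needs future values:

```python
def trend_line(series, period):
    """Centred moving average of length `period`, the classical
    decomposition trend estimate. Returns None at the edges where
    a full window is unavailable -- which is precisely the problem
    when running live: the trend at time t needs values after t."""
    half = period // 2
    trend = [None] * len(series)
    for t in range(half, len(series) - half):
        if period % 2:
            window = series[t - half:t + half + 1]
            trend[t] = sum(window) / period
        else:
            # even period: average two offset windows (a "2 x m" MA)
            w1 = series[t - half:t + half]
            w2 = series[t - half + 1:t + half + 1]
            trend[t] = (sum(w1) + sum(w2)) / (2 * period)
    return trend

print(trend_line([1, 2, 3, 4, 5], period=3))  # [None, 2.0, 3.0, 4.0, None]
```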




Due to the impending deadline for MSI funding proposals, last week was quite a mixed bag of tasks.

Developed another event detector that tries to detect obvious spikes in a relatively constant time series. The likelihood that a spike will be treated as an event is inversely correlated with the amount of noise in the time series, i.e. a spike in noisy data won't register as an event but a smaller spike following a long period of consistency would. Also started looking at decomposing time series with R again.
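One way to get that inverse relationship between noise and event likelihood — a sketch of the idea, not the detector's actual code — is to scale the deviation of a new value by a robust estimate of recent variability, such as the median absolute deviation (MAD). The same-sized jump then scores high after a long quiet stretch and low in already-noisy data.

```python
import statistics

def spike_score(history, value, floor=0.5):
    """Score a new value against recent history: deviation from the
    median, scaled by the median absolute deviation (MAD). The floor
    stops a near-zero MAD from making every tiny wobble look huge."""
    med = statistics.median(history)
    mad = statistics.median(abs(x - med) for x in history)
    return abs(value - med) / max(mad, floor)

quiet = [10, 10, 11, 10, 10, 11, 10, 10]
noisy = [2, 18, 7, 25, 1, 15, 9, 22]
print(spike_score(quiet, 20))  # large score: clear event
print(spike_score(noisy, 20))  # small score: within normal variation
```

The MAD is used rather than the standard deviation so that a previous spike in the history window doesn't inflate the noise estimate and mask the next one.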

Wrote a lab exercise for 312 on configuring a DNS server. Spent a couple of hours in R block during the designated 312 lab time to help out students, although they were mostly working on previous labs (or wasting time looking at meme pictures).

Went over the methodology sections of both MSI proposals with Jamie and Brendon. Rewrote the methodologies to better suit the requirements, i.e. more emphasis on the research tasks that we will be carrying out.




Short week - on holiday until Thursday.

Caught up with various support requests once I got back. Had a long chat with Andreas about time series and how we might be able to get better results when analysing the data produced by AMP and libprotoident.

Concluded that we need to start by making sure we can deal with the more obvious cases properly - in particular, time series where the reported value is mostly constant, which we commonly get from AMP. The detectors we have at the moment are based on standard deviation, which doesn't work well when the stddev approaches zero. Developed a detector that works much better in those cases and also started adding code that will use an appropriate detector depending on the type of time series we have observed.
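The selection logic can be sketched roughly like this — thresholds and the fallback rule are illustrative, not the actual detector: when the observed stddev is near zero a z-score test divides by almost nothing and fires on every tiny wobble, so a near-constant series gets a plain absolute-deviation test instead.

```python
import statistics

def choose_detector(history, stddev_floor=0.1, abs_tol=1.0, z_thresh=3.0):
    """Pick a detection rule based on how variable the history is.

    Near-constant series: the z-score blows up as stddev approaches
    zero, so fall back to an absolute-deviation test. Otherwise use
    the usual standard-deviation-based test."""
    mean = statistics.mean(history)
    stddev = statistics.pstdev(history)
    if stddev < stddev_floor:
        return lambda v: abs(v - mean) > abs_tol
    return lambda v: abs(v - mean) / stddev > z_thresh

# Mostly-constant series, as commonly reported by AMP latency tests:
is_event = choose_detector([5.0] * 20)
print(is_event(5.2), is_event(8.0))  # False True
```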




Started looking at Andreas' code in more detail by throwing a few different time series at it and seeing what anomalies it detects. Was not entirely happy with the results and spent a fair bit of time delving much deeper into the code than I would have liked to try and figure out what was going on.

This also involved spending a bit of time with R and its time series decomposition functions to see if that would shed any light on what we should be finding in the time series data.

Spent Thursday and Friday at the cricket.