
Shane Alcock's Blog




LPI events are now working inside Brendon's webpages - it's still a bit rough around the edges, but good enough for a working prototype.

Played around with using PHPTAL to provide templating for our pages. It provides some nice features, like automatic escaping of HTML entities and separation of page logic from layout. At the moment only the LPI event display page is templated, but I hope to extend this to other parts of the presentation layer.

Started on some more comprehensive testing of the system by throwing the entirety of the Waikato 6 traceset at it - 249 protocols * 8 metrics * several months of data. This immediately started to reveal some problems in the anomaly detection phase, such as R really not liking having to guess an ARIMA model for a time series containing nothing but zeroes and stopping the entire process as a result. I also found that my anomaly detection doesn't perform particularly well when the traffic level is mostly at zero with regular bursts of a consistent size - each burst is treated as an event when really that appears to be normal behaviour.
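One cheap way to avoid the all-zero failure mode is to check for a degenerate series before handing it to R at all. This is just a sketch of that guard, not the code the system actually uses:

```python
def has_variation(series):
    # An ARIMA fit on a constant series (e.g. all zeroes) can abort the
    # whole run, so refuse to model anything with zero variation.
    return len(set(series)) > 1

def fit_or_skip(series):
    # Hypothetical wrapper: return None for degenerate series so the
    # caller can treat them as "no model, no events" and keep going.
    if not has_variation(series):
        return None
    # ... hand the series to the real ARIMA fitting code here ...
    return "model"  # placeholder for an actual fitted model
```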

Submitted the final camera-ready version of my IMC paper - already the publishers have come back with some pedantic typesetting crap :)




Kinda short week this week - had Monday off and the 520 conference was on Tuesday, so not a lot of work done those days. Our students generally performed very well on Tuesday, so congrats to you all - very disappointed that we didn't manage to score a prize because I think many of you deserved one.

Prepared my last 301 assignment - a pthreads-based task. It's trickier than it looks, so it will be interesting to see how the students go. Also spent a bit of time trying to get ahead again with my lecture slides.

Continued pulling together the various components of our event detection system. Most of my effort this week went into adapting the PHP Brendon wrote for his AMP events so that it also supports LPI events. Not quite there yet, but getting close to a working solution.




Spent a large portion of the week on student-related tasks rather than my own research. Did manage to spend a bit of time on Friday working with Nathan's database code to develop a script that would produce graphs for detected events.

Listened to our honours students give their practice talks throughout the week. Was very impressed with how most of the projects were coming together at this stage.

Marked my most recent 301 assignment. Lessons learned:
* Students like to assume that any example input you give them will cover all edge cases.
* It doesn't matter how many times you suggest that -Wall is a good idea, people will still ignore you and miss easy-to-fix bugs as a result.
* Including 80s pop culture references in your assignment specification makes it really easy to Google for students who have put their assignment up on random programming forums asking for help.




Managed to get the ArimaShewhart detector fully integrated into the anomaly detection system and producing "correct" results. Now started turning my attention to using Nathan's software to provide suitable input and store measurements in a database that can be queried by the presentation / graphing side of the project.

The latest 301 assignment was due on Friday, so spent a fair bit of time helping out students who were having a few pointer difficulties.

Finished a draft revised version of my IMC paper - turns out I hadn't gone over the page limit by as much as I had feared, so it was relatively easy to get the paper down to a suitable length.

Fixed a bug in libtrace relating to the use of Linux native on loopback interfaces that was reported by Asad. Might be time to think about a new release soon.




Just a note for future generations -- the correct file to edit to change the system default application for a given MIME type is:


This took a surprisingly long time to figure out - mainly because of the existence of other similar files such as /usr/share/applications/defaults.list and /etc/gnome/defaults.list.

Also, you can check the default application for a given MIME type with the following command: xdg-mime query default <mimetype>

As an example, the MIME type for PDF is "application/pdf", so the full command would be: xdg-mime query default application/pdf




Marked the first 301 assignment. Generally, the students did really well - hopefully because of my teaching skills. Managed to run out of pre-prepared lectures, so spent a bit of time working on next week's lecture.

Started working on the camera-ready version of my IMC paper. Added quite a bit of content to address the review comments - now I just need to edit it all down to fit under the page limit.

Finished writing the C++ version of my Arima-Shewhart anomaly detector. Tracked down and fixed a few bugs in the Arima forecasting portion of the detector - now the forecasts match those produced by the original python scripts.
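The Shewhart half of the detector ultimately boils down to a threshold test on the forecast residuals. A minimal sketch of that idea (k=3 is the classic Shewhart default, not necessarily the threshold the detector actually uses):

```python
def shewhart_alert(residual, sd, k=3.0):
    # Flag a measurement whose forecast residual lies more than k
    # standard deviations from zero. k=3 is the textbook Shewhart
    # control-chart default; the real detector's threshold and extra
    # rules are not stated in the post.
    return abs(residual) > k * sd
```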




Sat down with Brendon and worked out what communication is necessary between the various components of our event detection system. We also decided what we need to start doing to try and bring it together into a working prototype.

My main task is to take all my existing python prototype code and turn it into a C++ detector class to fit into the system we inherited from Andreas. This involved a bit of pain. Firstly, matrix math is a lot easier in python (with magic lists) than it is in C++ so I had to be quite careful and check that at each stage, my converted math was producing the same results as the python prototype. Secondly, interfacing with R via its C "library" is something of a trial and error process, in particular figuring out which elements in the result vector correspond to the values described in the R documentation.

However, I've managed to get past that now and am in the process of finishing up converting the Shewhart code (the stuff that actually picks out events).

Wrote the second assignment for 301, while responding to a few questions regarding assignment 1 which was due on Friday.




With a bit of tweaking, my smoother modelling process is now producing results just as good as, if not better than, what I was getting with the old wavelet-based system. There are still quite a few false positives, which is annoying, but these are almost all situations where there is a traffic spike but I judge it to be too small to qualify as a genuine event.

At this point, we need to stop playing with anomaly detection and start thinking about combining everything into a rough but functional final product.

Spent some time helping Meenakshee get set up and helped out with her 591 proposal. Also worked out a revision plan for the IMC paper and sent it off to our shepherd.




My IMC paper on the effect of the Copyright Amendment Act was accepted! However, it looks like I have a fair bit of work to do on it, mainly softening the conclusions. The reviewers felt the results suggested, but did not prove, that the CAA was the cause of the observed behaviour, which I feel is a fair response.

It was a case of one step forwards, two steps back with the event detection this week. I had added a new dataset to my testing, only to run into an old problem where a sharp change in the time series would cause the ARIMA modelling to perform undesirably. A large residual would enter the prediction calculations, which would cause the next prediction to be way off, which would cause a new large residual to enter the calculations, etc. etc.

To avoid this, I adjusted the ARIMA modelling to feed only a small proportion of each large residual back into the model update. The proportion was calculated logarithmically, so very large residuals contribute an even smaller fraction. This resulted in a much better model that responds to changes in the time series in a slower, smoother manner.
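A sketch of that damping idea. The exact formula isn't given in the post, so the shape below (pass small residuals through untouched, shrink large ones by a logarithmic fraction) is an assumption, as are the parameter values:

```python
import math

def damp_residual(residual, sd, threshold=3.0):
    """Return the (possibly shrunken) residual to feed into the model update.

    Residuals within `threshold` standard deviations pass through
    unchanged; larger ones are scaled down by a fraction that shrinks
    logarithmically with their size, so a huge spike only nudges the
    model slightly. `threshold` and the log base are hypothetical.
    """
    ratio = abs(residual) / sd
    if ratio <= threshold:
        return residual
    fraction = 1.0 / (1.0 + math.log(ratio / threshold, 2))
    return residual * fraction
```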

Previously, the response was very rapid and we detected events by looking for a single large residual (because the model adapted so quickly, we usually only got one shot at seeing the change). Now we tend to get several large (but much smaller than before) residuals as the predictive model slowly catches up with the change in traffic level produced by the event. Unfortunately, this made all of the event detection rules I had developed over the past month useless, but I've been able to adapt quickly to the new approach and am getting results that aren't too different from what I was getting before this change.

One benefit of this change that I'm still investigating is that the smoother modelling may mean that we can drop the wavelet transform step. This was used to smooth the original data to remove random noise but had the downside of requiring over 20 measurements ahead to produce the smoothed value for a single point. In practical terms, this meant I couldn't report an event until 20 or more minutes after it had happened (assuming minutely measurements). If this works, I can report events much closer to the time that they happen.




Added a couple more time series to my event detection testing data, adding a whole new set of misclassifications to try and resolve. On Friday I noticed a large number of false positives were cases where the sample standard deviation only just crossed the standard deviation threshold. However, this relationship was logarithmic rather than linear which is why I didn't notice it earlier.

Adding a new rule (pseudocode):

if (sd / log2(sd_limit)) < SD_LOG_LIMIT then discard the candidate event

has drastically reduced the number of false positives in my dataset. I've lost a few genuine events as well, so it is not perfect, and I need to determine what the correct value for SD_LOG_LIMIT really is.
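As runnable Python, the rule above might look like the following. The value of SD_LOG_LIMIT here is a placeholder, since the right value is still an open question:

```python
import math

SD_LOG_LIMIT = 2.0  # hypothetical; the correct value is yet to be determined

def probably_false_positive(sd, sd_limit):
    # Discard candidate events whose sample standard deviation has only
    # just crossed the threshold. The comparison is against the log of
    # the limit because the observed relationship was logarithmic
    # rather than linear.
    return sd / math.log(sd_limit, 2) < SD_LOG_LIMIT
```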

My last test run produced 258 correctly identified events, 42 false negatives and 48 false positives (out of a total of 975,375 measurements).
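For context, those figures work out to the following detection rates (simple arithmetic on the numbers above, nothing more):

```python
# Figures from the last test run: 258 true positives, 42 false
# negatives, 48 false positives.
tp, fn, fp = 258, 42, 48

recall = tp / (tp + fn)     # fraction of real events detected: 0.86
precision = tp / (tp + fp)  # fraction of reported events that are real: ~0.84
```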

Started teaching 301 - first lecture went fairly well with students being attentive and asking questions, which is good.