Spent the week working on the T-Entropy detector. Added several metrics so that the most suitable combination can be found by trial and error. With the right combination, the samples can be transformed into a time series of average entropy values, which will then be used to detect anomalies. Worked on the character conversion and started implementing a buffer to store the characters as they are added.
Some of the warts data files from the latest scamper run on PlanetLab are being downloaded.
The scamper warts file analysis methods have been further developed to analyse the successors of load balancing interfaces, in order to find separate diamonds that are based on the same divergence point. In addition, counts have been made of destinations found, of traces that hit the maximum packet limit, and of both together. In particular, the vantage point with a higher probing maximum setting has been analysed.
The Internet Simulator runs have not completed. It may be necessary to check that the limit on the number of traces carried out has worked correctly.
Initial compilation of the slides for the student conference has been carried out. In particular, data from the new CAIDA and PlanetLab results has been collected.
Spent a lot of time working on performance improvements to the web
interface, in particular the matrix and how it communicates to the
backend. It was previously creating a new connection to NNTSC to fetch
every cell it displayed, which was quite slow. The NNTSC backend already
supported querying data for multiple streams at once, but this needed to
be exposed to the rest of the code along the datapath. It can now query
data for any number of streams from a single collection.
Fixed the way that a single bin of the most recent data was being
queried to not use the regular binning code - modulo maths on the
timestamp could cause two half full bins to be created, of which we
would use the first. This should mean that the data on the matrix is now
slightly more accurate and quicker to respond to changes.
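A minimal sketch of the binning problem described above, assuming the regular binning code aligns bin boundaries to multiples of the bin size using the timestamp modulo the bin size. The function names and the numbers are hypothetical and only illustrate why an unaligned "most recent" query ends up split across two partial bins.

```python
def regular_bins(start, end, binsize):
    """Split [start, end) on boundaries aligned to multiples of binsize.

    Assumption: this mimics a binner that uses timestamp % binsize to
    decide which bin a measurement belongs to.
    """
    bins = []
    t = start - (start % binsize)  # aligned boundary at or before start
    while t < end:
        bins.append((max(t, start), min(t + binsize, end)))
        t += binsize
    return bins


def single_recent_bin(now, binsize):
    """Treat the most recent binsize seconds as one bin, ignoring alignment."""
    return [(now - binsize, now)]


binsize = 60
now = 1000  # hypothetical timestamp that is not a multiple of binsize

print(regular_bins(now - binsize, now, binsize))  # two partial bins
print(single_recent_bin(now, binsize))            # one full bin
```

With these numbers the aligned binner splits the last 60 seconds at the 960 boundary, so neither bin covers the whole period.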
Installed the new amplet code on the REANNZ perfsonar machines and used
the install to document the process, with a few notes for more work that
the postinst scripts should do to make it easier. Started to work on a
simple method to fetch test schedules from within measured itself, to
help manage schedules on machines that are otherwise outside of our
Updated ampy to cache stream information as well as data measurements. I had noticed that multiple requests for the same stream information were being generated when loading a graph, which seemed a little wasteful. Now we cache the details of what streams are available for each collection and the description of each stream (source, dest, metric etc.). The one downside is that newly-added streams won't be obvious until the cached stream list for the collection has expired.
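The caching trade-off can be sketched as a small time-to-live cache. This is not the ampy code; the class name, the `fetch` callback and the collection name are all hypothetical, and the clock is injectable only so the expiry behaviour is easy to demonstrate.

```python
import time


class StreamCache:
    """Cache the stream list for each collection for ttl seconds."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._entries = {}  # collection -> (expiry time, stream list)

    def get(self, collection, fetch):
        """Return cached streams, calling fetch(collection) only on a miss."""
        now = self.clock()
        entry = self._entries.get(collection)
        if entry is not None and now < entry[0]:
            return entry[1]
        streams = fetch(collection)
        self._entries[collection] = (now + self.ttl, streams)
        return streams


# Hypothetical backend and fake clock to show the staleness window.
backend = {"example-collection": ["stream-1"]}
fake_time = [0.0]
cache = StreamCache(ttl=10, clock=lambda: fake_time[0])


def fetch(collection):
    return list(backend[collection])


print(cache.get("example-collection", fetch))
backend["example-collection"].append("stream-2")  # newly-added stream
fake_time[0] = 5.0
print(cache.get("example-collection", fetch))     # still the cached list
fake_time[0] = 10.0
print(cache.get("example-collection", fetch))     # refreshed after expiry
```

The middle call shows the downside mentioned above: the new stream only becomes visible once the cached list has expired.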
Added support in NNTSC for table partitioning of traceroute data. This was much more complicated than anticipated for several reasons:
* the trigger function that inserts the data must return NULL to avoid a duplicate insertion into the parent table as well as the partitioned table.
* our traceroute test table had a "test id" column that was defined as a primary key based on an auto-incremented sequence, which meant sqlalchemy would try to return the newly inserted row by default.
* we needed the value of the test id for subsequent inserts into other tables relating to the traceroute test.
* sqlalchemy had no error-handling for the case where an insert operation that was meant to return a row returned null, resulting in a crash with little to no useful error message.
Once I'd figured all this out, I implemented a (somewhat hackish) solution: disable the implicit return, so we could keep our trigger function returning NULL without crashing sqlalchemy. Then, following our insert operation, immediately perform a SELECT to find the row we just inserted and grab the test id from that.
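The insert-then-SELECT pattern looks roughly like the sketch below. It uses the standard-library sqlite3 module as a stand-in for PostgreSQL and sqlalchemy, so it does not reproduce the trigger function or the implicit return; it only shows how the test id can be recovered with a follow-up query. The table and column names are hypothetical, and it assumes that the stream id and timestamp together are enough to find the row that was just inserted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE traceroute_test ("
    " test_id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " stream_id INTEGER NOT NULL,"
    " timestamp INTEGER NOT NULL)"
)
conn.execute(
    "CREATE TABLE traceroute_hop ("
    " test_id INTEGER NOT NULL,"
    " hop INTEGER NOT NULL,"
    " address TEXT)"
)


def insert_test(conn, stream_id, timestamp):
    """Insert a test row without relying on the insert returning it,
    then look the row up again to obtain its test id."""
    conn.execute(
        "INSERT INTO traceroute_test (stream_id, timestamp) VALUES (?, ?)",
        (stream_id, timestamp),
    )
    row = conn.execute(
        "SELECT test_id FROM traceroute_test"
        " WHERE stream_id = ? AND timestamp = ?"
        " ORDER BY test_id DESC LIMIT 1",
        (stream_id, timestamp),
    ).fetchone()
    return row[0]


# The recovered id is then used for inserts into the related tables.
test_id = insert_test(conn, stream_id=7, timestamp=1000)
conn.execute(
    "INSERT INTO traceroute_hop (test_id, hop, address) VALUES (?, ?, ?)",
    (test_id, 1, "192.0.2.1"),
)
```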
There was also the problem of the traceroute path table, which I also wanted to partition but which did not have a timestamp column. The partitioning code I had written was only designed to partition based on timestamp, so I had to re-engineer it to support any numeric column (although it defaults to using timestamp).
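As a rough illustration of partitioning on an arbitrary numeric column, the sketch below works out which partition a value falls into and the range constraint that partition would carry. The function, the naming scheme, the table names and the partition widths are all hypothetical and are not taken from the NNTSC code.

```python
def partition_for(table, value, width, column="timestamp"):
    """Return (partition name, CHECK constraint) for a value.

    Partitions cover fixed-width ranges of the chosen numeric column:
    [lower, lower + width).
    """
    lower = (value // width) * width
    upper = lower + width
    name = f"{table}_{lower}"
    check = f"CHECK ({column} >= {lower} AND {column} < {upper})"
    return name, check


# Default: partition on timestamp, here in week-long ranges.
print(partition_for("traceroute_test", 1000000, 7 * 24 * 60 * 60))

# A table without a timestamp column can partition on another numeric column.
print(partition_for("traceroute_paths", 25001, 10000, column="path_id"))
```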
Finally, I had to then go and manually move all of the existing traceroute data into suitable partitions.
I also spent some time fixing up the Constant to Noisy algorithm in netevmon. Mostly this just involved refining some of the thresholds for the change detection, but I also now avoid moving from Constant to Noisy unless the most recent N measurements have all demonstrated a reasonable amount of noise, i.e. the differences between consecutive measurements are significant relative to the mean.
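The noise check could look something like the sketch below. It is not the netevmon implementation: the function name, the threshold value and the choice to compare each difference against the mean of the same recent window are all assumptions made for illustration.

```python
from statistics import mean


def recent_window_is_noisy(values, n, threshold):
    """Return True only if each of the last n measurements differs from
    its predecessor by at least threshold times the recent mean."""
    if len(values) < n + 1:
        return False  # not enough history to judge
    recent = values[-(n + 1):]  # n measurements plus one predecessor
    avg = mean(recent)
    if avg == 0:
        return False  # relative difference is undefined
    return all(
        abs(b - a) / abs(avg) >= threshold
        for a, b in zip(recent, recent[1:])
    )


# A single spike is not enough to be treated as noisy.
print(recent_window_is_noisy([10, 10, 10, 10, 20], n=3, threshold=0.2))
# Sustained variation over the last three measurements is.
print(recent_window_is_noisy([10, 10, 10, 10, 14, 9, 15, 8], n=3, threshold=0.2))
```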
One last thing: added timer events to the python version of libwandevent. Used this to ensure that anomalyfeed would request historical information at a sensible rate when first starting up, rather than asking for it all at once and completely hosing NNTSC with data requests.
I've spent this week writing my report, which from the current progress is going to be huge. There are still a few things I need to finish in order to get more results for the report, like the neon implementation but I hope to work on those next week.
Continued work on the change-point detection algorithm.
Set up a copy of the event database so that I can test and tune the algorithm against it.
Worked on keeping the old probabilities for a certain amount of time so that multiple data points outside of normal would be needed to trigger an event. This seems to improve the sensitivity, but adds a delay to event detection.
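One simple way to get a similar effect is sketched below: remember the last k probabilities and only trigger when all of them are below a threshold. This is a simplified stand-in and not necessarily the mechanism used in the detector; the class name, k and the threshold are hypothetical.

```python
from collections import deque


class PersistentOutlierTrigger:
    """Trigger only after k consecutive low-probability data points."""

    def __init__(self, k, threshold):
        self.threshold = threshold
        self.recent = deque(maxlen=k)  # the k most recent probabilities

    def update(self, probability):
        self.recent.append(probability)
        return (
            len(self.recent) == self.recent.maxlen
            and all(p < self.threshold for p in self.recent)
        )


trigger = PersistentOutlierTrigger(k=3, threshold=0.05)
probabilities = [0.6, 0.01, 0.5, 0.01, 0.02, 0.01]
print([trigger.update(p) for p in probabilities])
```

The isolated outlier at the second data point is ignored, and the later run of outliers only triggers on its third data point, which is the detection delay mentioned above.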
It seems sometimes the probability mass is lost, which I still need to look into.
I looked into changing the underlying probability distribution from normal to Student's t (because this is recommended when few data points are known), but I couldn't find a library that implemented a non-standardised version.
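For reference, the density of a non-standardised (location-scale) Student's t distribution is short enough to write by hand. The sketch below uses only the standard library; the function and parameter names are my own, and it is an illustration rather than the code used in the detector.

```python
import math


def student_t_pdf(x, df, loc=0.0, scale=1.0):
    """Density of a Student's t distribution with df degrees of freedom,
    shifted by loc and stretched by scale (scale > 0)."""
    z = (x - loc) / scale
    # Work in log space so large df values do not overflow the gamma function.
    log_norm = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
        - math.log(scale)
    )
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(z * z / df))


# Hypothetical latency-like example: centre 20, scale 2, 4 degrees of freedom.
print(student_t_pdf(26.0, df=4, loc=20.0, scale=2.0))
```

If SciPy is available, `scipy.stats.t` also accepts `loc` and `scale` arguments, which may be an alternative worth checking.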
Worked on a Python script that accesses the bookings database and runs the appropriate OpenStack commands when bookings start and end, such as changing user roles and suspending the instances linked to the user.
Started writing the final report this week.
Read a paper about T-Entropy and started implementing a detector based on it. The detector uses sliding windows, calculates some statistics for each window and assigns an appropriate character/"class" to the window. The characters are concatenated into a string, which is in turn used to obtain the average T-Entropy for a sliding window. However, NNTSC/Netevmon was down until Wednesday, so I didn't get to test it after that.
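The first half of that pipeline (windows to characters to a string) can be sketched as below. The statistic used here (coefficient of variation), the thresholds and the three-letter alphabet are placeholders rather than the detector's actual choices, and the T-Entropy calculation itself is not shown.

```python
from statistics import mean, pstdev


def classify(window, low=0.05, high=0.2):
    """Map one window of measurements to a character based on how
    variable it is relative to its mean (hypothetical classes)."""
    avg = mean(window)
    if avg == 0:
        return "a"
    variation = pstdev(window) / abs(avg)
    if variation < low:
        return "a"  # steady
    if variation < high:
        return "b"  # somewhat variable
    return "c"      # highly variable


def symbolise(values, size, step=1):
    """Slide a window over the measurements and concatenate the
    character assigned to each window into a string."""
    return "".join(
        classify(values[i:i + size])
        for i in range(0, len(values) - size + 1, step)
    )


measurements = [10, 10, 10, 10, 11, 9, 11, 9, 10, 30, 5, 40]
print(symbolise(measurements, size=4, step=4))
```

The resulting string is what an entropy measure would then be computed over, again using a sliding window.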
The rest of the week was spent taking care of GA duties, marking a ridiculous amount of assignments, updating Moodle grades, yadda yadda. Didn't manage to get any work done on the project, unfortunately.
Next week, I plan on working on the T-Entropy detector some more, especially adding new statistics and trying to figure out a combination of stats that "work".
I've been writing the proposal and thinking about how things will scale.
To poll both ends of a path I can add an extra flow on each switch specifying the ingress switch, so that when a packet leaves the system it is counted with the other packets that entered the fabric at that switch. This will require tables and stacked MPLS labels (3 layers of MPLS), though it could probably be made to work with two.
This way I can poll both ends of a path, but in between I am aggregating paths, since the alternative means that the number of flows needed to create paths on each switch is quadratic in the number of switches in the fabric. This is going to be a complication for accurately locating problems just by polling counters.