This week I completed work on my HMMDetector for libanomalyts.
I also assisted Shane with adding output parameters for raised events, e.g. each Detector will know which metric, unit and scale factor to use when outputting an event string. These parameters are set when a Detector is added to a DetectorList.
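A minimal sketch of how this could look; the class and attribute names here are illustrative, not the actual libanomalyts API:

```python
class Detector:
    """Base detector; output parameters are attached at registration time."""

    def __init__(self):
        self.metric = None
        self.unit = None
        self.scale = 1.0

    def format_event(self, value, timestamp):
        # Use the parameters set by the DetectorList to build the event string.
        return "%s: %s = %.2f %s" % (timestamp, self.metric,
                                     value * self.scale, self.unit)


class DetectorList:
    def __init__(self):
        self.detectors = []

    def add(self, detector, metric, unit, scale=1.0):
        # Output parameters are fixed when the detector is added, so every
        # event it raises is reported consistently.
        detector.metric = metric
        detector.unit = unit
        detector.scale = scale
        self.detectors.append(detector)
```
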
I have also begun investigating incorporating my Hidden Markov Model code into libprotoident, with the aim of identifying BitTorrent flows that are making use of protocol encryption.
I started looking more deeply at the approach to failure recovery that I had been planning to implement, and for a couple of reasons it may not be appropriate for my purposes. Firstly, it is very computationally expensive, which wouldn't be the worst thing in the world, but secondly it requires a lot of configuration. Since that is exactly what I am trying to avoid, I may need to use a much simpler system. On the upside, that will hopefully mean far less computation.
I also continued with testing. Polling at a 1 second frequency works fine, but any higher frequency does not, as packet counts are only updated once per second. I also tried to introduce packet loss due to congestion within a virtual network, but the packet loss only seems to occur as packets leave the network. My impression is that this is because multiple OVS bridges on a single machine still function as a single bridge, so the congestion I am generating within the switch isn't actually causing it to drop packets until it tries to output from the switch to one of the hosts.
A program was written to calculate a new statistic for measuring load balancer populations: the proportion of discovered interfaces that are load balancers. This was coded, run and the results incorporated into the paper. A plan was also developed for completing the introduction.
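The statistic itself is a simple ratio; a sketch of the calculation, where the (address, is_load_balancer) record layout is assumed for illustration:

```python
def lb_proportion(interfaces):
    """Proportion of discovered interfaces that belong to load balancers.

    `interfaces` is an iterable of (address, is_load_balancer) pairs; this
    field layout is an assumption for the sake of the example.
    """
    total = 0
    balancers = 0
    for _addr, is_lb in interfaces:
        total += 1
        if is_lb:
            balancers += 1
    # Avoid dividing by zero if no interfaces were discovered at all.
    return balancers / float(total) if total else 0.0
```
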
Further updates were made to the paper, along with background reading on the use of the path-based load balancer count statistic.
Found and fixed a bug in the AMP DNS test where uninitialised data could
be reported if a server did not respond. Luckily this occurred once over
the break and gave us good logs as to what the problem was. Rabbit
easily dealt with the backlog of messages while this one blocked the
queue, which is very reassuring.
One of the current monitors appeared to stop reporting data for the ICMP
test, so I spent some time investigating that. The tests still run but
packets aren't always sent. Nothing in the logs gives any indication of
the problem, so will need to dig further.
Spent most of the week trying to improve the performance of database
queries by better limiting the query to only the data that is necessary.
Lots of streams for CDN targets are only active for a short time as
addresses change and we hit different instances, so we don't need to
check all of these for data. We now maintain a list of when streams were
active in order to limit the data that is queried.
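The idea can be sketched as keeping a (first_seen, last_seen) window per stream and only querying streams whose window overlaps the requested range; the names and data layout here are illustrative, not the actual NNTSC schema:

```python
active = {}  # stream id -> (first_seen, last_seen) timestamps


def record_measurement(stream, ts):
    # Extend the stream's active window as new data arrives.
    first, last = active.get(stream, (ts, ts))
    active[stream] = (min(first, ts), max(last, ts))


def streams_to_query(start, end):
    # Only query streams whose active window overlaps [start, end];
    # short-lived CDN streams outside the range are skipped entirely.
    return [s for s, (first, last) in active.items()
            if first <= end and last >= start]
```
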
Working on getting tracertstats working with the new parallelised libtrace. I've implemented two versions: one which runs as fast as possible from a source such as a file, and will block on printing a result until the next packet is read; the second is intended for use with tracetime (live) traces such as int: or ring:, and will time out and print a result without having to receive an extra packet in the next time frame.
After a discussion with Shane, we decided that trace_event doesn't make sense anymore in the new parallel framework and won't be supported.
As such, I moved tracetime playback of traces (such as from a file) from a timeout event returned by trace_event into the mapper code, where it is now handled automatically.
Started the week by doing a summary of the Smokeping data that Shane and I collected last year. This included grouping the streams based on average means (i.e. < 5, < 30, < 100, > 100) and summing up the number of FPs and significant/insignificant/unclassified events for each whole stream, and also on a per-detector basis. Using these numbers, I was able to calculate accurate probability values for each detector. This also made it easy to see exactly where we needed more data, e.g. only having 5 Mode events across all the streams with an average mean of < 5.
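The grouping and per-detector tallying might be sketched like this; the event tuple layout and bin boundaries are assumptions based on the groups described above, not the actual script:

```python
def mean_bin(avg):
    # Bucket streams by their average mean, matching the groups used
    # in the summary (< 5, < 30, < 100, and everything above).
    for bound in (5, 30, 100):
        if avg < bound:
            return "< %d" % bound
    return ">= 100"


def detector_probabilities(events):
    """Estimate P(significant | detector fired) within each mean bin.

    `events` is a list of (detector, avg_mean, significant) tuples;
    this layout is assumed for illustration.
    """
    counts = {}
    for detector, avg, significant in events:
        key = (detector, mean_bin(avg))
        fired, sig = counts.get(key, (0, 0))
        counts[key] = (fired + 1, sig + (1 if significant else 0))
    # One probability per (detector, mean bin) pair.
    return {key: sig / float(fired) for key, (fired, sig) in counts.items()}
```
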
Then, I modified my eventing Python script to use different probabilities based on the detector that fired and the average mean of the stream at that time. These probability values will still need to be updated later, since the sample size is too small for some of the detectors. However, this is tricky: some detectors (especially Mode) only fire occasionally, when the mode of the time series has changed considerably, so getting a big enough sample is difficult.
Spent some time looking over Bayes' Theorem, which I plan on using as a basis for comparing different fusion methods.
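The update at the heart of this is the standard Bayes' Theorem; a worked sketch with made-up numbers, treating each detector firing as evidence about whether an event is significant (the independence assumption between detectors is mine, for illustration):

```python
def bayes_update(prior, p_fire_given_sig, p_fire_given_insig):
    # P(sig | fired) = P(fired | sig) * P(sig) /
    #   (P(fired | sig) * P(sig) + P(fired | insig) * P(insig))
    numerator = p_fire_given_sig * prior
    denominator = numerator + p_fire_given_insig * (1.0 - prior)
    return numerator / denominator


# Fuse two detectors by chaining updates, assuming they fire independently.
belief = 0.2                               # prior: event is significant
belief = bayes_update(belief, 0.9, 0.1)    # a reliable detector fired
belief = bayes_update(belief, 0.8, 0.3)    # a noisier detector also fired
```

Each detector firing pulls the belief up by the ratio of its true-positive to false-positive rates, which is exactly where the per-detector probabilities estimated from the Smokeping summary would plug in.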
This week I wanted to dedicate some time to cleaning up the fragmented amp-web interface to improve the consistency of CSS and markup across the site, and to remove unnecessary JS libraries, in the process determining which libraries best suit our purposes in cases where features overlap.
As a first step I included Bootstrap's CSS globally and rewrote the rest of the global stylesheet around it, restructuring hacky CSS that relied on (inline) markup. I would prefer to include only the Normalize reset and Bootstrap's class-based CSS rather than the base CSS that styles other elements, which I might investigate at some point, but for now everything works fairly well. Including Bootstrap globally broke the Matrix, whose CSS definitions overlapped with Bootstrap's (so I fixed this temporarily by renaming the affected classes).
So after breaking the Matrix, I spent a lot of time cleaning it up (albeit mostly because I wasn't aware it existed in the first place). The Matrix used jQuery UI to display tooltip-style popups, which I replaced with a similar feature (popovers) in Bootstrap. This took a bit of time and more rewriting than I'd expected, as the JS for instantiating each is very different (particularly as the popovers are intended to appear on click rather than on mouseover), but it worked eventually and I managed to streamline some of the Matrix code in the process. I also replaced the Matrix's jQuery UI tabs with custom ones, which allowed me to remove jQuery UI and its CSS, the JS library cssSandpaper (which had been used for backwards compatibility that wasn't really relevant), and its dependency libraries cssQuery, EventHelpers, sylvester and textShadow.
I added a CSS hack to fix graphical glitches that were sometimes produced when rendering rotated text (the matrix headers). It seemed to only occur on Voodoo, but as well as preventing the flickering issue, the fix also looks to have improved legibility on all platforms.
Finally I spent some time integrating the traceroute map with the latest changes and updated it to use real data. It was interesting to see what a difference this made to the summary view, whose highly aggregated data is no longer useful for representing unique paths or for being able to see where paths change. Will have to look at how to best address this next.
Got back into the swing of things by spending the week fixing a multitude of UI problems and general bugs in Cuz, with the aim of getting closer to something we feel comfortable demonstrating at NZNOG.
The main improvements are:
* Finally added a "graph browser" page which lets the user choose a collection to explore.
* Event groups are shown on graphs rather than individual events. This greatly reduces clutter when big events occur.
* Fixed various inconsistencies between the line colour shown on the legend and the line colour actually being drawn on the graph.
* Stopped creating tabs that go to empty graphs.
* Fixed a bug where the rainbow summary graph would only show the first couple of hops rather than the entire path.
* Added basic tooltips to the legend which show more detail about the group being moused over, e.g. what exactly is represented by each line colour.
* Better handling of database exceptions in Cuz so that Brendon's buggy AMP test results don't crash NNTSC :)
So over the break I was trying to fix OVS, but after finally talking to the guy who wrote the OVS MPLS branches this week, I am now giving up on that.
So instead I have the polling working with vlan tags and unique flows for each pair of nodes. It is currently just printing out values for packets sent and received, but it is counting them correctly and not losing any packets.
So then I started reading a few papers on passive monitoring techniques, focusing on how they were tested. They've actually been fairly interesting; a couple use techniques very similar to mine.
This week I integrated the PlateauLevelDetector into my Hidden Markov Model Detector.
The HMMDetector now subclasses the PlateauLevelDetector, passing it "magnitude" values (the quotient of the base log-probability and the current log-probability) that have been sufficiently smoothed out. This means that existing plateau detection code can be used to determine whether the probabilities generated by the HMM indicate an event state.
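A sketch of the magnitude calculation and smoothing as described above; the moving-average window and the class names are illustrative assumptions, not the actual libanomalyts code:

```python
from collections import deque


def magnitude(base_logprob, current_logprob):
    # Quotient of the base log-probability and the current log-probability;
    # values near 1.0 mean the HMM sees the stream as normal.
    return base_logprob / current_logprob


class SmoothedMagnitude:
    """Smooth magnitudes with a moving average before handing them on,
    e.g. to existing plateau detection code."""

    def __init__(self, window=5):
        self.history = deque(maxlen=window)

    def update(self, base_logprob, current_logprob):
        self.history.append(magnitude(base_logprob, current_logprob))
        return sum(self.history) / len(self.history)
```
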
I have tested this detector with its current parameters and with the updated testing data provided by Shane. These tests show that the detector works well for some streams, but not so well for others. Future work could look at optimising the detector parameters on a per-stream basis.