This week I spent some time familiarising myself with the operation of libtrace and the DAG cards. My current goal is getting parallel support for libtrace working with a DAG card, which has involved looking into bidirectional capture (keeping both directions of each TCP stream in the same receive stream).
The DAG 7.5G2 makes use of a DSM module to split data into one of two receive streams. It turns out the filtering algorithm works well for all data that is not VLAN tagged. Richard S and I spent some time creating a bitwise filter that can take a VLAN tagged frame and split it into the two streams while still preserving bidirectional flows. A Python script was written to create arbitrary configuration files for the DSM module to facilitate this split.
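To illustrate the idea behind the split (this is only a software sketch, not the DSM configuration format or the actual bitwise filter we built), the snippet below skips an optional 802.1Q tag to find the IPv4 addresses and then picks a stream from a value that is symmetric in source and destination. The function names and the choice of XORing the low-order address bytes are assumptions made for the example.

```python
import struct

ETHERTYPE_VLAN = 0x8100   # 802.1Q tag
ETHERTYPE_IPV4 = 0x0800


def ipv4_offset(frame: bytes):
    """Return the offset of the IPv4 header, skipping one VLAN tag if present.

    Returns None for frames that are too short or do not carry IPv4.
    """
    if len(frame) < 14:
        return None
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    offset = 14
    if ethertype == ETHERTYPE_VLAN:
        if len(frame) < 18:
            return None
        # The real ethertype sits after the 4-byte VLAN tag.
        (ethertype,) = struct.unpack_from("!H", frame, 16)
        offset = 18
    if ethertype != ETHERTYPE_IPV4:
        return None
    return offset


def pick_stream(frame: bytes) -> int:
    """Choose receive stream 0 or 1 so both directions of a flow agree."""
    offset = ipv4_offset(frame)
    if offset is None or len(frame) < offset + 20:
        return 0  # assumption: anything we cannot parse goes to stream 0
    src = frame[offset + 12:offset + 16]
    dst = frame[offset + 16:offset + 20]
    # XOR is commutative, so swapping src and dst gives the same bit.
    return (src[3] ^ dst[3]) & 1
```

Because the decision only depends on the XOR of the two addresses, a request and its reply always land in the same stream regardless of whether the frame is tagged.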
Some further research showed the DSM is only available in the 7.5G2 and that all other cards have a more effective filtering algorithm. Rather than continue working on getting the configuration generation added to libtrace, it was decided that the 7.5G2 would not support bidirectional flows (in cases where packets are VLAN tagged) and other DAG cards will.
Developed another detector for netevmon based on the Binary Segmentation algorithm for detecting changepoints. The detector appears to work very well and outperforms most of our existing detectors in terms of detection latency, i.e. the time between an event beginning and the event being reported.
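For readers unfamiliar with Binary Segmentation, a minimal sketch of the general technique follows. It is not the netevmon detector: the squared-error cost, the `penalty` threshold and the `min_size` parameter are all assumptions chosen to keep the example short. The series is split at the point that most reduces the cost, and each half is then searched recursively until no split improves the cost by more than the penalty.

```python
def sse(values):
    """Sum of squared deviations from the mean (cost of one segment)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(values, min_size):
    """Return (index, gain) for the single best split, or (None, 0.0)."""
    whole = sse(values)
    best_idx, best_cost = None, whole
    for i in range(min_size, len(values) - min_size + 1):
        cost = sse(values[:i]) + sse(values[i:])
        if cost < best_cost:
            best_idx, best_cost = i, cost
    return best_idx, whole - best_cost


def binary_segmentation(values, penalty, min_size=2, offset=0):
    """Recursively split the series, returning sorted changepoint indices."""
    idx, gain = best_split(values, min_size)
    if idx is None or gain <= penalty:
        return []
    left = binary_segmentation(values[:idx], penalty, min_size, offset)
    right = binary_segmentation(values[idx:], penalty, min_size, offset + idx)
    return left + [offset + idx] + right
```

This naive version recomputes the cost for every candidate split, so it is quadratic in the segment length; a real-time detector would need cumulative sums or a bounded window.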
Finally was able to migrate prophet's database over to our new faster schema and upgrade NNTSC accordingly. Aside from a couple of minor glitches that were easily fixed, the upgrade went pretty well and our database performs somewhat better than before, although I'm not convinced it will be fast enough for the full production AMP mesh.
Experimented with a few other event detection approaches for our latency time series, but unfortunately these didn't really go anywhere useful.
I have been investigating detecting elephant flows in networks in real time.
This week I have been focusing on processing trace data in WEKA in order to find any underlying predictors for elephant flows.
Currently I have two ARFF files generated from the output of the lpicollector program which summarises the flows in a trace file and reports on their statistics.
These ARFF files are particularly large (over 1,000,000 instances) and are too resource intensive for WEKA to process.
I have used the full data sets to form two clusters. These clusters tend to naturally contain the mice and elephant flows, saving the effort of manually classifying them.
In order to make predictions over these large datasets I have been resampling them into smaller samples (a few thousand instances each) and experimenting with different classifiers and attribute sets.
This has not yielded any surprising results thus far.
The best predictors of flow size have been the destination IP address, the destination port number, and the size of the payload in the first packet sent to the server.
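One way to draw the smaller samples mentioned above without loading a whole ARFF file into memory is reservoir sampling. The sketch below is an illustration of that technique rather than the resampling done inside WEKA; the function name `sample_arff` and its parameters are hypothetical, and it assumes a dense ARFF file with one instance per line after the `@data` marker.

```python
import random


def sample_arff(lines, k, seed=None):
    """Return (header_lines, k uniformly sampled data lines) in one pass."""
    rng = random.Random(seed)
    header, reservoir = [], []
    in_data = False
    seen = 0
    for line in lines:
        stripped = line.strip()
        if not in_data:
            # Everything up to and including @data is copied verbatim.
            header.append(line)
            if stripped.lower() == "@data":
                in_data = True
            continue
        if not stripped or stripped.startswith("%"):
            continue  # skip blank lines and comments in the data section
        seen += 1
        if len(reservoir) < k:
            reservoir.append(line)
        else:
            # Keep the new instance with probability k / seen.
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = line
    return header, reservoir
```

Passing an open file object as `lines` and writing `header + reservoir` back out should give a file small enough for WEKA to load.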
Next week I will begin a literature review into methods for detecting elephant flows and contrast them against the findings of the WEKA experiments.
Built new amplet packages for Centos and Debian to deploy the newest
version in the test mesh. Found a few problems running the tcpping test
on machines with multiple interfaces, which were fixed and the packages
rebuilt. Also updated the schedules on the test amplets to match what we
are currently using on the main mesh more closely, so that testing better
reflects a proper deployment scenario.
Added some more sanity checking to the way result messages are unpacked
by the server after (what appears to be a rather old, outdated version
of) the amplet client reported less data than it claimed to have
available, breaking the collector.
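The kind of check involved looks roughly like the sketch below. The wire format here is hypothetical (a 4-byte record count followed by fixed-size records) and is not the real amplet result format; the point is only that the declared size is compared against the bytes actually received before anything is unpacked.

```python
import struct

# Hypothetical layout: big-endian record count, then (id, value) pairs.
HEADER = struct.Struct("!I")
RECORD = struct.Struct("!II")


class MalformedMessage(ValueError):
    """Raised when a message claims more data than it contains."""


def unpack_results(payload: bytes):
    """Unpack all records, refusing messages whose length does not add up."""
    if len(payload) < HEADER.size:
        raise MalformedMessage("payload shorter than header")
    (count,) = HEADER.unpack_from(payload, 0)
    needed = HEADER.size + count * RECORD.size
    if len(payload) < needed:
        raise MalformedMessage(
            f"message claims {count} records ({needed} bytes) "
            f"but only {len(payload)} bytes were received"
        )
    return [
        RECORD.unpack_from(payload, HEADER.size + i * RECORD.size)
        for i in range(count)
    ]
```

Raising a dedicated exception lets the collector log and drop the one bad message instead of falling over.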
Spent some time looking into how puppet does initial certificate/key
distribution to its clients so that we might do something similar. We
need a sensible way to get certificates onto each amplet that doesn't
require a lot of manual generation and copying of files.
The event based simulator ran successfully using 20 traces per autonomous system, rather than one. It took quite a bit longer to run with this change. I have now set up a script to run the rest of the simulations at this setting, and it has been running for four days. Only one more memory map build should be necessary: the Doubletree map generated in the test was good, so only the Traceroute one remains. The previous simulations took two minutes once built, so these might take something like 20 times longer or a bit more.
I have started to work out the structure of the thesis and to write the introduction and background. The references automatically cite the authors and year in the text, which I find much more cumbersome than numbered references. Adding so much extra text breaks up the flow of the document for the reader, so I hope that I can change the style.
I think that the student PhD conference went quite well, but I will have to wait and see what feedback I get.
Finished translating the mode detection over to C++ and managed to get it producing the same results as my original Python prototype. Started running it against all of our AMP latency streams, which was mostly successful, but it looks like there are one or two very rare edge cases that can cause it to fall over entirely. Unfortunately, the problems are difficult to replicate, especially as the failures can occur at a point where I have no idea which time series I'm looking at, so debugging looks like it might be painful.
Wrote a new detector that uses the modes reported by my new code to identify mode changes or the appearance of new modes. It would possibly be more effective if the mode detection was performed more often (currently I look for new modes every 10 minutes), but I'm concerned about the performance impact of doing it more frequently.
Started investigating other potential anomaly detection methods. Had a look at Twitter's recent breakout detection R module, but it didn't perform very well with our latency data. Found another changepoint module in R which appears to work much better, so will start looking at developing our own version of this algorithm.
This week, I focused on using the magnitude of a change as another indication of severity because the results of using AMP-specific probabilities were not as favourable as we expected (this can be explained by the fact that DS does not work on events where only a single detector has fired). Spent several hours looking for relevant literature, but the papers I looked at were either not especially helpful or completely unrelated. I took a break from reading papers and moved on to graphing the relationship between magnitude and severity. I used the Plateau detector's absolute and relative latency changes and the TEntropy-Stddev detector's TEntropyChange as metrics and plotted them against the severity score of the matched ground truth group. Found that TEntropyChange was useless for this purpose, but the absolute latency change showed promise: there was an easily identifiable threshold for identifying significant events, although there were also some undesirable outliers (insignificant events with a high magnitude of change).
The non event based simulator runs of Doubletree completed and the results were converted into graphs. These were added to the PhD conference slides. These results are sensible and suggest that Doubletree may be useful in the many sources to few destinations case.
While this simulator was running it wasn't clear if further efficiency was required, so I borrowed the hash processing code from IS0 and ran it in a test program in preparation to replace the arrays used for stop set information in my current analysis. I may not need to do this now as the simulator runs finished on time.
The event based simulator IS0 runs that used 19000 ASes with one trace per AS also had their results graphed and added to the PhD conference slides. An initial investigation into whether IS0 can process 180000 traces using the same resources is also being carried out, with an initial run underway.
A new run of the black hole detector was initiated and the latest results were processed.
Continued investigating why traceroute tests were sometimes lingering
when the main amplet2 process was terminated. Eventually discovered that
I wasn't closing some file descriptors after forking, so that the test
children were able to connect to a listening local unix socket that
should have been closed. Despite listening, no running process was
actually expecting this connection, so it stalled waiting for it to be
Also tidied up more of the ASN socket querying code to better detect if
it had closed, and to actually report the error back so that it could be
dealt with in a smarter way, helping prevent the test hanging around in
a bad state.
Had a quick look at the HTTP test after seeing a few unusual results and
found that some software does a poor job of following the standards
(surprise!). Updated the header parser to be slightly smarter and deal
with some different combinations of capital letters, whitespace and
Spent some time working with Brad to get an example amplet machine
running that he can use to work through the upgrade process, bringing
them up to date with Debian.
Continued the painful process of migrating my Python prototype for mode detection over to C++ for inclusion in netevmon. Managed to get the embedded R portion working correctly, which should be the trickiest part.
Spent a bit of time with our new libtrace testbed, getting the DAG 7.5G2s configured and capturing correctly. Ran into some problems getting the card to steer packets captured on each interface into separate stream buffers, as the firmware we are currently running doesn't appear to support steering.