This week I tried to help Joe track down the OVS stats bug, but didn't make any real progress with that.
So then I started planning a hello-based system for fault detection and how best to fit it into the RouteFlow architecture. I can push a lot of it pretty close to rfproxy, which hopefully means that if a switch can implement its own fault detection (BFD, for instance), RFServer won't need to change much.
This also means that the switches can respond to problems slightly more quickly. It has caused me to rethink the architecture I was using for the stats-poller-based fault detection, and I think moving more of that into rfproxy might be a good idea as well.
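The core of a hello-based detector is just tracking when each port was last heard from and flagging any port that has been silent past a dead interval. Below is a minimal sketch of that idea; the class name, interval values and port names are all illustrative, not taken from the RouteFlow code base.

```python
import time

HELLO_INTERVAL = 1.0                  # seconds between hello probes (assumed)
DEAD_INTERVAL = 3 * HELLO_INTERVAL    # silence threshold before declaring a fault

class HelloMonitor:
    """Track last-heard timestamps per port and report silent ports."""

    def __init__(self):
        self.last_seen = {}           # port -> timestamp of last hello

    def hello_received(self, port, now=None):
        self.last_seen[port] = time.time() if now is None else now

    def failed_ports(self, now=None):
        now = time.time() if now is None else now
        return [p for p, t in self.last_seen.items()
                if now - t > DEAD_INTERVAL]

monitor = HelloMonitor()
monitor.hello_received("eth1", now=100.0)
monitor.hello_received("eth2", now=102.5)
# At t=103.5, eth1 has been silent for 3.5s (> 3s dead interval); eth2 has not.
print(monitor.failed_ports(now=103.5))
```

Running this close to rfproxy means the failure decision happens near the switch; rfproxy would only need to notify RFServer once a port lands in the failed list.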
Dean Pemberton has written a different RouteFlow LSP plug-in, which in some ways is a bit nicer than mine. It allows ports to be arbitrarily assigned to rfvms and creates paths based on egress ports rather than datapaths. In his system the source and destination MAC addresses are set once as the packet enters the system, after which forwarding is done on labels alone, so I have looked a bit at using MAC addresses as LSP labels. It's nice because it is such a massive address space. It does feel like something that could go horribly wrong though.
I'm currently merging this with my multi-table stuff to use that as the basis for the fault detection.
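To give a feel for the MAC-as-label idea mentioned above, here is a sketch of one possible encoding: pack the label into the low 40 bits of a locally-administered unicast MAC address, so the labels can never collide with globally-assigned OUIs. The layout is my own illustration, not the scheme used by either plug-in.

```python
def label_to_mac(label):
    """Pack a label (< 2**40) into a locally-administered unicast MAC.

    The first octet is fixed at 0x02 (locally-administered bit set,
    multicast bit clear); the remaining 40 bits carry the label.
    """
    assert 0 <= label < 2 ** 40
    value = (0x02 << 40) | label
    return ":".join("%02x" % ((value >> s) & 0xff)
                    for s in range(40, -1, -8))

def mac_to_label(mac):
    """Recover the label from a MAC produced by label_to_mac()."""
    value = int(mac.replace(":", ""), 16)
    return value & ((1 << 40) - 1)

print(label_to_mac(1))       # smallest label
print(mac_to_label(label_to_mac(123456)))
```

Even with one octet reserved, this still leaves 2^40 labels, which is why the address space feels so roomy compared to the 20-bit labels of MPLS.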
I am revisiting a strategy to find black holes in load balancers. The original setup used MDA traceroute, which maps load balancers, followed by six cycles of Paris traceroute. The idea was to look for shortened Paris traces and process these further to see whether the stop point was in a load balancer. A final round of MDA has now been added after the Paris cycles to check for route changes, and the coding for this change is underway. Currently more route changes are being found than there should be, so I am testing the detection algorithm on a trace-by-trace basis.
A mathematical model is also being developed for Megatree. This is an analogue of Doubletree that records details of load balancer diamonds and avoids repeatedly mapping them, especially the big complex ones that occur in smaller numbers. I have carried out regression analysis on the UDP CAIDA data to predict savings and packet numbers, applying transformations to make the residuals normally distributed and the relationships linear. These formulae were then built into a program that takes factor levels as input and produces customised results. I have generated results for the local and global scenarios, i.e. within a vantage point and sharing load balancer data between vantage points. Next it will be necessary to derive relationships that make simulation of inter-vantage-point traffic costs possible.
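As a small illustration of the kind of transformation described above: a log-log transform often linearises a power-law relationship and evens out residual variance, after which an ordinary least-squares fit applies. The data and function names here are made up for the example, not drawn from the CAIDA analysis.

```python
import math

def fit_loglog(xs, ys):
    """Least-squares fit of log(y) = a + b*log(x); returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx = sum(lx) / n
    my = sum(ly) / n
    b = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
         / sum((u - mx) ** 2 for u in lx))
    a = my - b * mx
    return a, b

def predict(a, b, x):
    """Back-transform a prediction from log space to the original scale."""
    return math.exp(a + b * math.log(x))

# Synthetic data following an exact power law y = 2 * x**1.5 is recovered
# exactly by the fit, up to floating point error.
a, b = fit_loglog([10, 100, 1000],
                  [2 * 10 ** 1.5, 2 * 100 ** 1.5, 2 * 1000 ** 1.5])
print(b)
```

A fit like this, evaluated at the chosen factor levels, is essentially what the customised-results program does with the real regression formulae.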
Finally fixed all my problems with the history-live overlap when subscribing to NNTSC streams. The solution was to register the subscription as early as possible (i.e. before doing any querying) and collect all subsequent live data for those streams in a buffer. I also extended the HISTORY_DONE message to include the timestamp of the last row in the returned data, so that once the history queries were complete I could skip those rows when pushing the saved live data to the client.
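The subscribe-then-buffer pattern can be sketched as follows; the class and method names are illustrative rather than the real NNTSC exporter code.

```python
class StreamSubscription:
    """Buffer live rows until the history query finishes, then de-duplicate."""

    def __init__(self):
        self.history_done = False
        self.buffered = []      # live rows that arrived mid-query
        self.delivered = []     # rows actually pushed to the client

    def on_live_row(self, ts, row):
        if not self.history_done:
            self.buffered.append((ts, row))   # hold until history completes
        else:
            self.delivered.append((ts, row))

    def on_history_done(self, last_ts):
        # HISTORY_DONE carries the timestamp of the last historical row, so
        # buffered live rows at or before it are duplicates and are dropped.
        self.history_done = True
        for ts, row in self.buffered:
            if ts > last_ts:
                self.delivered.append((ts, row))
        self.buffered = []

sub = StreamSubscription()
sub.on_live_row(100, "a")    # arrives while the history query is running
sub.on_live_row(101, "b")
sub.on_history_done(100)     # history covered everything up to ts=100
sub.on_live_row(102, "c")
print(sub.delivered)         # the ts=100 row is skipped as already in history
```

Registering the subscription before querying is what guarantees no live row can slip through the gap between the query result and the first live push.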
Added a QUERY_CANCELLED message to the NNTSC protocol so that clients can know that their database query timed out. Previously we had problems where a timed-out query simply returned an empty dataset, which was impossible to distinguish from a genuine lack of data. Updated ampy to deal with the new message -- in particular, the results of cancelled queries are no longer cached as they were before, which had fairly obvious bad consequences.
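The caching distinction is the important part: an empty result set is a valid, cacheable answer, while a cancelled query must not be cached or every later lookup would silently come back empty too. A minimal sketch of the client-side handling (message constants and the cache shape are placeholders, not the real ampy internals):

```python
QUERY_DATA = 1        # hypothetical message type: query completed normally
QUERY_CANCELLED = 2   # hypothetical message type: query timed out server-side

cache = {}

def handle_query_response(key, msgtype, rows):
    if msgtype == QUERY_CANCELLED:
        # Timed out: report failure and, crucially, do NOT cache the empty
        # result, or subsequent lookups would also appear to have no data.
        return None
    cache[key] = rows   # a genuine result, even an empty one, is cacheable
    return rows

print(handle_query_response("stream-42", QUERY_DATA, []))
print(handle_query_response("stream-43", QUERY_CANCELLED, []))
```

Returning a distinct sentinel for cancellation also lets the caller surface a useful error instead of rendering an empty graph.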
Also updated the matrix to handle query timeouts better, which fixes the day_data bug that was causing crashes while loading the matrix. If a timeout occurs, a useful error message is now shown to the user.
Released a new version of libtrace on Friday which fixes the FreeBSD 10 build bug, as well as a few other minor problems.
Added SSL support to the amplet client for querying a remote server to fetch schedule files. This should let clients we don't directly control stay up to date with test schedules, but it needs more thought about how often the fetch should run and how it should interact with the main schedule process.
Added a control server to the amplet client that will accept connections from other clients that need specific test servers to be run (e.g. throughput tests), and run them. Currently it accepts the ID of the test to be run and returns the port number that the new server is listening on, so that the test knows where to connect. Wrapped all this up in SSL as well, validating both the certificate and the hostname/commonName, but not yet checking certificate revocation.
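In Python terms, the validation described above amounts to requiring a CA-verified certificate and enabling hostname checking on the TLS context. This is a sketch under the assumption of a PEM CA bundle; the function name and paths are placeholders, and (matching the text) no revocation checking is done.

```python
import ssl

def make_verified_context(cafile=None):
    """Build a TLS client context that verifies the peer cert and hostname."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.verify_mode = ssl.CERT_REQUIRED   # server must present a valid cert
    ctx.check_hostname = True             # and its hostname/commonName must match
    if cafile is not None:
        ctx.load_verify_locations(cafile)
    return ctx

# A client would then wrap its socket before speaking the control protocol:
#   sock = socket.create_connection((host, port))
#   tls = make_verified_context("/path/to/ca.pem").wrap_socket(
#       sock, server_hostname=host)
```

With `check_hostname` enabled, `wrap_socket()` raises during the handshake if the certificate does not match the server name, so a misdirected connection fails before any test ID or port number is exchanged.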
Libtrace 3.0.19 has been released.
The main purpose of this release is to fix a problem that prevented the libtrace 3.0.18 release from building on FreeBSD 10. A number of other minor bugs were also fixed, such as some libpacketdump decoding errors on big-endian CPUs and a bug in the ring: format that led to set_capture_length changing the wire length instead of the capture length.
This release also incorporates a patch from Martin Bligh that adds support for reading pcap traces with nanosecond timestamp resolution via the pcapfile: URI.
The full list of changes in this release can be found in the libtrace ChangeLog.
You can download the new version of libtrace from the libtrace website.
Spent the past two weeks collecting more samples of event groups and updating the data in the spreadsheet, so I'll have a better idea of which groups have an insufficient sample size. Andrew had already finalised and entered the data for his HMMDetector for the old streams, so I made sure to include HMM events in the newer streams I analysed (afrinic, lacnic, trademe and apnic).
I also realised I had mislabelled some events detected by the Changepoint detector whenever a loss of measurements occurred, so I spent some time double-checking the events and the graphs and updating the appropriate severity values. We decided to exclude them from the detector probability values, since they are a different type of event (similar to LossDetector and Noisy-to-Constant/Constant-to-Noisy updates).
I'll collect more samples (if needed!) and update the values used by the different detectors and fusion methods, and finally move on to validating the output produced by the fusion methods next.
Slow final week, mostly spent fixing lots of minor issues here and there. I also added tooltips to the smokeping graph and did some work on improving the usefulness of the information in legend tooltips. I have some extra clean-up to do as I still have a few open branches, so I'll address those over the next week or so.
Wrote a report to hand off to the faculty, created a slideshow and wrote some notes for the presentation on Monday (which went very well).
The warts analysis was modified to provide data to the Megatree mathematical model. Megatree involves local and distributed approaches to avoiding mapping the same load balancer more than once, and is based on Doubletree. In particular, subsets of the 70,000-destination set were created to provide model data for varying numbers of destinations. Regression analysis was carried out to establish the model structure.
The fast mapping analysis has been updated to include a full MDA trace at the beginning and end of the data collection cycle, to check for route and load balancer changes. Our fast mapping protocol uses six runs of Paris traceroute between the MDA runs. There is still some more debugging and design to carry out.
Spent some time working on things to help keep the amplet code clean and
tidy. Added stricter compilation options and fixed up some cases where
these triggered warnings. Started working on unit tests for amplet based
on the built-in automake target "check". Wrote very simple unit tests
for the icmp and traceroute tests as well as the nametable management.
While writing the nametable unit tests I found and fixed a bug that
would limit the nametable to only a single item.
Briefly had a look at different database options available to us that
might perform better with our data than postgres. There are still
further optimisations we can make to how we store our data in postgres,
but it will be interesting to see how they compare to something like
Cassandra, HBase or Riak.
Continued redevelopment of the NNTSC exporting code to be more robust and reliable. Replaced the live data pipes used by the dataparsers to push live data to the exporter with a RabbitMQ queue, which seems to be working well.
Modified the way that subscribing to streams works to try to solve a problem we were having where data that arrived while historical data was being queried was not being pushed out to interested clients. Now we store any live data that arrives for streams that are still being queried and push it out as soon as the query thread indicates that the query has finished.
Unfortunately, we can still miss historical measurements if they haven't been committed to the database at the time when the final query begins. This often crops up if netevmon is resubscribing after NNTSC has been restarted, resulting in us missing out on the last historical measurement before the subscribe message arrives. Still looking for an elegant way to solve this one.
Added a version check message to the NNTSC protocol. The server sends this message as soon as a client connects, and the client API has been updated to require its internal version to match the one received from the server. If they don't match, the client stops and prints a message telling the user to update their client API. This should be helpful to the students who were previously getting weird broken behaviour, with no apparent explanation, whenever I made an API change to the production NNTSC on prophet.
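The handshake itself is simple: compare the server's announced version against a constant compiled into the client API and bail out loudly on a mismatch. A sketch, with the constant name and version strings invented for illustration:

```python
NNTSC_CLIENTAPI_VERSION = "1.2"   # hypothetical version baked into the client API

def check_api_version(server_version):
    """Return True if the server's protocol version matches ours.

    On a mismatch, print an actionable message rather than letting the
    client limp on and fail in confusing ways later.
    """
    if server_version != NNTSC_CLIENTAPI_VERSION:
        print("NNTSC client API version %s does not match server version %s; "
              "please update your client API."
              % (NNTSC_CLIENTAPI_VERSION, server_version))
        return False
    return True

print(check_api_version("1.2"))
```

Because the server sends the version message first, the mismatch is caught before any subscriptions or queries are attempted.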
Chased down a build issue with libtrace on FreeBSD 10. Turns out we had made the dist tarball with an old version of libtool which was stupidly written to never build shared libraries if the OS matched FreeBSD 1* (because FreeBSD 1.X didn't support shared libraries). Easy enough to fix, I just have to remember to make the libtrace distribution on something other than Debian Squeeze. Will start working on a new libtrace release in the near future so I don't keep getting emails from FreeBSD users.