Finished purging the last of the SQLAlchemy code from NNTSC. Once that was working, I was able to create a new class hierarchy for our database code to reduce the amount of duplicate code and ensure that we handle error cases consistently across all query types.
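The shape of that hierarchy might look something like this (a minimal sketch; class and method names are hypothetical, not NNTSC's actual ones): shared error handling lives once in a base class, and each query type only supplies its own SQL.

```python
# Hypothetical sketch of a query class hierarchy: one consistent
# error path in the base class, subclasses only build their query.

class DatabaseError(Exception):
    """Raised when a query fails in a way the caller must handle."""

class DBQuery:
    def __init__(self, cursor):
        self.cursor = cursor

    def build_query(self):
        """Return (sql, params) -- implemented by each query type."""
        raise NotImplementedError

    def run(self):
        query, params = self.build_query()
        try:
            self.cursor.execute(query, params)
        except Exception as exc:
            # every query type funnels errors through the same case
            raise DatabaseError(str(exc))
        return self.cursor.fetchall()

class StreamQuery(DBQuery):
    def __init__(self, cursor, collection):
        super().__init__(cursor)
        self.collection = collection

    def build_query(self):
        return ("SELECT * FROM streams WHERE collection = %s",
                (self.collection,))
```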
Split insertion operations across two different transactions: one for stream-related operations and one for measurement results. This allows us to commit new streams and data tables without having to commit any data results, which is an important step towards better synchronisation between the database and the messages in the Rabbit queue.
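The split can be sketched like this (shown with sqlite3 so it runs standalone; NNTSC itself talks to PostgreSQL via psycopg2, and the function names are illustrative): stream operations commit immediately, while measurement rows are only committed in batches.

```python
# Sketch of the two-transaction split: a new stream is committed
# straight away, so it is never held hostage to an uncommitted
# batch of measurement results.
import sqlite3

def insert_stream(streamconn, name):
    cur = streamconn.cursor()
    cur.execute("INSERT INTO streams (name) VALUES (?)", (name,))
    streamconn.commit()          # stream transaction commits now
    return cur.lastrowid

def queue_result(dataconn, stream_id, ts, value):
    dataconn.execute(
        "INSERT INTO data (stream_id, ts, value) VALUES (?, ?, ?)",
        (stream_id, ts, value))
    # deliberately no commit here -- results commit in batches

def commit_results(dataconn):
    dataconn.commit()
```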
Spent a lot of time tracking down and fixing various error cases that were not being caught and handled within NNTSC. A lot of this work was focused on ensuring that no data was lost or duplicated after recovering from an error or a database restart, especially given our attempts to move towards committing less often.
Migrated the prophet development database over to the new NNTSC schema on Thursday. Things generally went pretty smoothly, and we are now turning our attention to migrating skeptic and the live website as soon as possible.
Tidied up some arbitrarily sized buffers in the icmp test to be the
actual size required for the data. Accidentally made them too small, so
fixed that and then wrote some more unit tests to cover the
sending/receiving of data and buffer management. Also updated the icmp
test to short-circuit the loss wait timeout once all data has been
accounted for; previously it always waited a minimum of
200ms, even if all responses had been received.
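The short-circuit amounts to the following loop (a sketch in Python with hypothetical names; the real test is C code): stop waiting as soon as every outstanding probe is accounted for, rather than always sleeping the full loss timeout.

```python
# Sketch of short-circuiting a loss wait timeout: return early once
# all outstanding probes have been answered.
import time

LOSS_TIMEOUT = 0.2    # the 200ms loss wait mentioned above
POLL_INTERVAL = 0.01

def wait_for_responses(outstanding, poll_responses):
    """Wait until all probes are answered or the loss timeout expires.

    outstanding    -- set of sequence numbers still awaiting a reply
    poll_responses -- callable returning sequence numbers that have
                      arrived since the last call
    """
    deadline = time.monotonic() + LOSS_TIMEOUT
    while outstanding and time.monotonic() < deadline:
        for seq in poll_responses():
            outstanding.discard(seq)
        if outstanding:
            time.sleep(POLL_INTERVAL)
    return len(outstanding)   # number of probes counted as lost
```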
Spent some time examining query logs from the newly migrated test
database on prophet to see where slowdowns were now occurring. Found and
fixed a simple case where we were over-querying for data, and have a few
ideas for other places to look for more improvements.
Investigated how it might be possible to set DNS servers per process in
order to run multiple amplet clients on the same linux host without
putting them in individual containers. It isn't made obvious in libc how
to do this, but it seems to be possible by modifying some internal
resolver structures. If I set these right, then getaddrinfo() etc will
all work as normal except using the specified name server rather than
whatever is in /etc/resolv.conf. The alternative here seems to be
replacing the name resolution functions with another library or custom code.
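For the custom-code alternative, the core of a hand-rolled resolver is just constructing an RFC 1035 query and sending it over UDP to an explicitly chosen server, bypassing /etc/resolv.conf entirely. A minimal sketch of the query-construction half (illustrative only; amplet itself is C):

```python
# Build a DNS A-record query by hand (RFC 1035 wire format) so it can
# be sent to a chosen server with a plain UDP socket. Only query
# construction is shown; parsing the response is the other half.
import struct

def build_dns_query(name, qid=0x1234, qtype=1, qclass=1):
    # Header: ID, flags (RD bit set), QDCOUNT=1, other counts zero
    header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    # QNAME: length-prefixed labels, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, qclass)

# The query would then be sent with, e.g.:
#   sock.sendto(build_dns_query("example.com"), ("192.0.2.53", 53))
```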
Thanks to Richard I now have an STM32W RF Control Kit, which I had a chance to play around with a little bit this weekend. Spent some time looking through its documentation and eventually found Windows drivers for communicating with each component (the USB dongle and the application board) through a virtual COM interface. The boards run a simple "chat" application by default so you can see the RF communication between them by typing into one COM terminal and watching it appear at the other end. I tested flashing another couple of sample applications, in particular one that is mentioned in the documentation that contains a number of commands for testing functionality. (The LED commands didn't seem to actually control the LEDs, but otherwise it seemed to function as described in the docs so I assume I'm still on the right track...) All in all an interesting intro and next week I'll start looking into what it's going to take to get Contiki on to the boards.
I spent a lot of my time this week getting my project environment set up and familiarising myself with Ryu and Open vSwitch. I've started with a very basic topology to work through, and so far I have flows being successfully learnt between a series of KVM hosts and multiple connected virtual switches. Once I'm comfortable enough with the environment, my goal is to work towards implementing a basic virtual network which will allow DHCP leases to be issued to hosts from an out-of-band DHCP server through the Ryu controller. This step should represent the first milestone of my project as I work towards distributing some of the existing functionality of a BRAS out to multiple controllers and switches.
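The flow-learning behaviour at the heart of that basic topology can be sketched independently of Ryu (names here are illustrative): per switch (dpid), remember which port each source MAC was seen on, and flood only while the destination is still unknown.

```python
# Minimal MAC-learning table, the logic a learning-switch controller
# app builds its flow rules from.

class MacLearningTable:
    def __init__(self):
        self.table = {}   # (dpid, mac) -> port

    def learn(self, dpid, src_mac, in_port):
        """Record which port a source MAC was seen arriving on."""
        self.table[(dpid, src_mac)] = in_port

    def lookup(self, dpid, dst_mac):
        """Return the known output port, or None to flood."""
        return self.table.get((dpid, dst_mac))
```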
Built new CentOS and Debian amplet packages for testing and deployed to
a test machine to check that both old and new versions of the transfer
format could be saved. After a bit of tweaking to the save functions
this looks to work fine.
Tested the full data path from capture to display, which included fixing
the way aggregation of data streams is performed for matrix tooltips.
Everything works well together, except the magic new aggregation
function fails in the case where entire bins are NULL. Will have to
spend some time next week making this work properly.
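The failure mode is easy to illustrate (this is illustrative code, not the actual aggregation function): an aggregate over binned data has to cope with a bin that contains only NULLs, returning a NULL result for that bin rather than choking on an empty sequence.

```python
# Averaging binned values while tolerating bins that are entirely
# NULL (represented as None): such bins yield None, not an error.

def aggregate_bins(bins):
    results = []
    for bin_values in bins:
        present = [v for v in bin_values if v is not None]
        if not present:
            results.append(None)   # entire bin was NULL
        else:
            results.append(sum(present) / len(present))
    return results
```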
Wrote some more unit tests for the amplet client testing address
binding, sending data and scheduling tests. While doing so, found what
appears to be a bug in scheduling tests with period end times that were
shorter than hour/day/week.
Updated NNTSC to include the new 'smoke' and 'smokearray' aggregation functions. Replaced all calls to get_percentile_data in ampy with calls to get_aggregate_data using the new aggregation functions. Fixed a few glitches in amp-web resulting from changes to field names due to the switch-over.
Marked the 513 libtrace assignments. Overall, the quality of submissions was very good, with many students demonstrating a high level of understanding rather than just blindly copying from examples.
Modified NNTSC to handle a rare situation where we can try to insert a stream that already exists -- this can happen if two data-inserting NNTSCs are running on the same host. Now we detect the duplicate and return the stream id of the existing stream so NNTSC can update its own stream map to include the missing stream.
Discovered that our new table-heavy database schema was using a lot of memory due to SQLAlchemy trying to maintain a dictionary mapping all of the table names to table objects. This prompted me to finally rip out the last vestiges of SQLAlchemy from NNTSC. This involved replacing all of our table creation and data insertion code with psycopg2 and explicit SQL commands constructed programmatically. Unfortunately, this will delay our database migration by at least another week, but it will also end up simplifying our database code somewhat.
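Constructing the SQL programmatically boils down to quoting table and column names as identifiers and leaving the values as placeholders for the driver (psycopg2 in NNTSC's case) to bind safely. A sketch, with illustrative function names:

```python
# Build an INSERT statement for an arbitrary table: identifiers are
# quoted by us, values stay as %s placeholders for the driver.

def quote_identifier(name):
    return '"' + name.replace('"', '""') + '"'

def build_insert(table, columns):
    cols = ", ".join(quote_identifier(c) for c in columns)
    placeholders = ", ".join(["%s"] * len(columns))
    return "INSERT INTO %s (%s) VALUES (%s)" % (
        quote_identifier(table), cols, placeholders)
```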
My brain pretty much exploded from all the graph theory this week, so I put that aside to work on the actual testing approach.
The issue I have had with the redundant-paths work is that I still haven't been able to prove that my algorithm will always halt. I've tried a couple of approaches, but as easy as it is to look at a picture of a network and see almost instantly what needs to be done, turning that into an algorithm is really hard.
But the testing has been much more productive. The first thing to do is determine exactly how quickly I can run the different approaches, i.e. how quickly I can poll the flow counters and how quickly I can send hello packets. This also means testing different switch implementations to see whether that makes a difference. Then I can compare how quickly the different approaches notice loss of different kinds, test how much latency is required to cause false positives, and see how easily I can differentiate latency from loss. Finally I can do some tests on how well this scales.
So I have started playing around with the ovs bfd implementation to have a baseline to work from. I won't be able to beat the ovs bfd implementation in terms of speed, but that isn't configurable automatically.
Joe tells me the ovs bfd implementation is being turned into a library, but that probably isn't done yet. Which is a pity, because I really don't want to implement this from scratch.
Once the eventing script had been tested properly, I moved on to including the option of producing a CSV file out of the results.
Afterwards, I wrote a Python script to read in the manually classified events (ground truth) and the results from the eventing script. The entries are then compared and matched to produce a list containing the following info: the timestamp the event started and its severity score (from the ground truth); the fusion method probabilities from the eventing results, which include DS, Bayes, averaging and a newer method that counts the number of detectors that fired and allocates an appropriate severity score; and finally the timestamp at which each fusion method decided that the event group was significant. This is useful for determining which fusion method performs best, e.g. which is fastest at detecting significant events, or which produces the smallest number of false positives (where the ground truth says an event is not significant but the fusion method detects significance).
The script performs better than I expected after testing (e.g. 46 event groups from the ground truth and 42 matched event group/probability results from the eventing script when tested with Google's stream). The remaining unmatched events will need to be manually sorted out, so hopefully the script will perform as well on AMP data.
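The matching step can be sketched as follows (hypothetical field names and tolerance; the real script also carries the fusion probabilities through): pair each ground-truth event with the first detected event group whose start timestamp falls within a tolerance window, leaving the rest for manual inspection.

```python
# Match ground-truth events against detected event groups by start
# timestamp, within a tolerance window; each detection is consumed
# by at most one ground-truth event.

def match_events(ground_truth, detected, tolerance=60):
    """ground_truth, detected: lists of (timestamp, payload) tuples."""
    matched, unmatched = [], []
    remaining = list(detected)
    for ts, info in ground_truth:
        hit = next((d for d in remaining if abs(d[0] - ts) <= tolerance),
                   None)
        if hit is not None:
            remaining.remove(hit)
            matched.append((ts, info, hit))
        else:
            unmatched.append((ts, info))
    return matched, unmatched
```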
Since it was decided to search for black holes in load balancers, I have been developing the first driver of a pair; it will add an entry to a targets list file when a short Paris traceroute trace is found, signalling a possible black hole. The Paris trace is compared to an initial MDA trace to determine whether it is shorter. MDA traceroute is the mode that maps load balancers as well as the linear parts of a trace from source to destination in the Internet, while Paris traceroute follows the same path through a per-flow load balancer with each subsequent packet sent. There has been some debugging of the driver, and subsequent reruns, to get it up to scratch.
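The comparison the driver makes is essentially the following (a trivial sketch; traces are represented here as lists of hop addresses, which is an assumption of mine): a Paris trace shorter than the MDA trace to the same destination is flagged as a possible black hole and the destination is queued for the targets list.

```python
# Flag a destination as a possible black hole when its Paris
# traceroute is shorter than the earlier MDA traceroute.

def flag_possible_blackhole(mda_trace, paris_trace, targets, destination):
    if len(paris_trace) < len(mda_trace):
        targets.append(destination)   # candidate for the targets list
        return True
    return False
```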
A new run on the Internet simulator has been initiated using team rather than gang probing. The difference is that team probing probes to different destinations from each vantage point, whereas gang probing does all the destinations from each vantage point. The CAIDA traceroute data is collected with team probing, so if the simulator is set to gang probing then extra links must be created in the memory model of the subset of the Internet under study. This addition of links takes a long time to do, and may not be practical in our case. While getting the new settings right, the simulator exited early a couple of times with an error message: once there was a missing data file, and once there was an inaccessible address for the controller.
It is desirable to now look at the big picture in regard to what the Internet simulator can be used for. Two possibilities are to analyse controller cost for doubletree and to analyse controller cost and savings for Megatree which is similar but analyses load balancers. Megatree would not start in the middle of the trace like doubletree, and would require an initial single packet per hop traceroute to determine if any load balancers which have been seen before have occurred again. Then the full MDA traceroute would be carried out. Savings would obviously have to cover this initial look-see as well as controller traffic. Doubletree is designed to avoid repeatedly discovering the same sections of topology by recording trace end points and nodes connected to these. Megatree records divergence and convergence points of load balancers.
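The stop-set idea behind doubletree can be sketched in a simplified form (illustrative code only; real doubletree also probes backwards from a mid-point, which is not shown): record interfaces already discovered on previous traces, and halt a new trace as soon as it reaches one, since the remainder of that path has probably been mapped already.

```python
# Simplified forward stop rule from doubletree: stop probing a path
# once it hits an interface already in the stop set, then fold the
# newly probed hops into the stop set for later traces.

def probe_forward(path, stop_set):
    """path: hop addresses towards a destination, nearest first."""
    probed = []
    for hop in path:
        probed.append(hop)
        if hop in stop_set:
            break             # already-known topology: stop early
    stop_set.update(probed)
    return probed
```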
Finally starting to get underway with some coding again now that the paper work is out of the way.
I created a forked copy of libtrace on GitHub to keep my work separate; it is available at https://github.com/rsanger/libtrace/. I worked through my existing code from the summer to ensure that any recent patches were applied properly, and did some tidying up. I'm hoping that making this public will also force me to keep things a bit tidier.
This coming week I will be focusing on writing some tests for the data structures and the parallel routines in general, as well as a test to measure performance.