Thanks to Richard I now have an STM32W RF Control Kit, which I had a chance to play around with a little bit this weekend. Spent some time looking through its documentation and eventually found Windows drivers for communicating with each component (the USB dongle and the application board) through a virtual COM interface. The boards run a simple "chat" application by default so you can see the RF communication between them by typing into one COM terminal and watching it appear at the other end. I tested flashing another couple of sample applications, in particular one that is mentioned in the documentation that contains a number of commands for testing functionality. (The LED commands didn't seem to actually control the LEDs, but otherwise it seemed to function as described in the docs so I assume I'm still on the right track...) All in all an interesting intro and next week I'll start looking into what it's going to take to get Contiki on to the boards.
I spent a lot of my time this week getting my project environment set up and familiarising myself with Ryu and Open vSwitch. I've started with a very basic topology to work through, and so far flows are being learnt successfully between a series of KVM hosts and multiple connected virtual switches. Once I'm comfortable enough with the environment, my goal is to work towards implementing a basic virtual network which will allow DHCP leases to be issued to hosts from an out-of-band DHCP server through the Ryu controller. This step should represent the first milestone of my project as I work towards distributing some of the existing functionality of a BRAS out to multiple controllers and switches.
Built new CentOS and Debian amplet packages for testing and deployed to
a test machine to check that both old and new versions of the transfer
format could be saved. After a bit of tweaking to the save functions
this looks to work fine.
Tested the full data path from capture to display, which included fixing
the way aggregation of data streams is performed for matrix tooltips.
Everything works well together, except the magic new aggregation
function fails in the case where entire bins are NULL. Will have to
spend some time next week making this work properly.
Wrote some more unit tests for the amplet client testing address
binding, sending data and scheduling tests. While doing so, found what
appears to be a bug in scheduling tests with period end times that were
shorter than hour/day/week.
Updated NNTSC to include the new 'smoke' and 'smokearray' aggregation functions. Replaced all calls to get_percentile_data in ampy with calls to get_aggregate_data using the new aggregation functions. Fixed a few glitches in amp-web resulting from changes to field names due to the switch-over.
Marked the 513 libtrace assignments. Overall, the quality of submissions was very good, with many students demonstrating a high level of understanding rather than just blindly copying from examples.
Modified NNTSC to handle a rare situation where we can try to insert a stream that already exists -- this can happen if two data-inserting NNTSCs are running on the same host. Now we detect the duplicate and return the stream id of the existing stream so NNTSC can update its own stream map to include the missing stream.
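The duplicate handling described above can be sketched roughly as follows. This is only an illustration, not the NNTSC code: it uses Python's built-in sqlite3 as a stand-in for the real database, and the `streams` table, its columns and the `insert_stream` helper are all hypothetical names.

```python
import sqlite3

def insert_stream(conn, name):
    """Insert a stream; if it already exists, return the existing stream id.

    Returns (stream_id, created). The table layout is hypothetical.
    """
    try:
        with conn:  # commits on success, rolls back on error
            cur = conn.execute(
                "INSERT INTO streams (name) VALUES (?)", (name,))
            return cur.lastrowid, True
    except sqlite3.IntegrityError:
        # Another inserter got there first: look up the existing stream
        # so the caller can add it to its own stream map.
        row = conn.execute(
            "SELECT id FROM streams WHERE name = ?", (name,)).fetchone()
        return row[0], False

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE streams (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")

first_id, first_created = insert_stream(conn, "example-stream")
second_id, second_created = insert_stream(conn, "example-stream")
```

The second call hits the uniqueness constraint and falls back to a lookup, so both calls end up with the same stream id.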
Discovered that our new table-heavy database schema was using a lot of memory due to SQLAlchemy trying to maintain a dictionary mapping all of the table names to table objects. This prompted me to finally rip out the last vestiges of SQLAlchemy from NNTSC. This involved replacing all of our table creation and data insertion code with psycopg2 and explicit SQL commands constructed programmatically. Unfortunately, this will delay our database migration by at least another week but it will also end up simplifying our database code somewhat.
My brain pretty much exploded from all the graph theory this week, so I put that aside to work on the actual testing approach.
The issue I have had with the redundant paths thing is that I still haven't been able to prove that my algorithm will actually always halt. I've tried a couple of approaches around this, but, as easy as it is to look at a picture of a network and pretty much instantly see exactly what needs to be done, turning that into an algorithm is really hard.
But the testing has been much more productive. The first thing to do is to determine exactly how quickly I can run the different approaches, i.e. how quickly I can poll the flow counters and how quickly I can send hello packets. This also means testing different switch implementations to see if that makes a difference. Then I can test how quickly the different approaches notice different kinds of loss, how much latency it takes to cause false positives, and how easily I can differentiate latency from loss. After that I can test how well this scales.
So I have started playing around with the OVS BFD implementation to have a baseline to work from. I won't be able to beat the OVS BFD implementation in terms of speed, but that isn't configurable automatically.
Joe tells me the OVS BFD implementation is being librarified but that probably isn't done. Which is a pity, because I really don't want to implement this from scratch.
Once the eventing script had been tested properly, I moved on to including the option of producing a CSV file out of the results.
Afterwards, I wrote a Python script to read in the manually classified events (ground truth) and the results from the eventing script. The entries are then compared and matched to produce a list containing the following info: the timestamp at which the event started and its severity score (from the ground truth), the fusion method probabilities (from the eventing results, which include DS, Bayes, averaging and a newer method which counts the number of detectors that fired and allocates an appropriate severity score), and finally the timestamp at which each fusion method detected that the event group is significant. This is useful for determining which fusion method performs best (e.g. fastest at detecting significant events, fewest false positives (FPs), where the ground truth says not significant but the fusion method detects significance, etc.).
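A minimal sketch of what such a matching step might look like is below. The matching rule here (closest start timestamp within a tolerance) is an assumption for illustration, as are the `ts` field name, the `match_events` function and the 300 second default; the real script may match entries differently.

```python
def match_events(ground_truth, detected, tolerance=300):
    """Pair each ground-truth event with the closest unused detected
    event group whose timestamp is within `tolerance` seconds.

    Both inputs are lists of dicts with a 'ts' key (hypothetical layout).
    Returns (matches, unmatched_ground_truth).
    """
    matches, unmatched = [], []
    used = set()
    for gt in ground_truth:
        best = None  # (gap, index) of the closest candidate so far
        for i, det in enumerate(detected):
            if i in used:
                continue
            gap = abs(det["ts"] - gt["ts"])
            if gap <= tolerance and (best is None or gap < best[0]):
                best = (gap, i)
        if best is None:
            unmatched.append(gt)  # left for manual sorting
        else:
            used.add(best[1])
            matches.append((gt, detected[best[1]]))
    return matches, unmatched

truth = [{"ts": 1000, "severity": 4}, {"ts": 5000, "severity": 2}]
results = [{"ts": 1060, "ds": 0.9}, {"ts": 9000, "ds": 0.4}]
matched, leftover = match_events(truth, results)
```

In this toy input the first ground-truth event pairs with the result 60 seconds later, and the second is left unmatched.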
The script performs better than I expected after testing (e.g. 46 event groups from the ground truth and 42 matched event group/probability results from the eventing script when tested with Google's stream). The remaining unmatched events will need to be manually sorted out, so hopefully the script will perform as well on AMP data.
Since it was decided to search for black holes in load balancers I have been developing the first driver of a pair which will add an entry to a targets list file if a short Paris traceroute trace is found, signalling a possible black hole. The Paris trace is compared to an initial MDA trace to determine if it is shorter. MDA traceroute is the mode that maps load balancers as well as linear parts of a trace from source to destination in the Internet. Paris traceroute follows the same path through a per flow load balancer with each subsequent packet sent. There has been some debugging to do with the driver and then subsequent reruns, to get it up to scratch.
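The comparison the driver makes could be sketched along these lines. This is not the driver itself: the trace representation (one list entry per TTL), the function names and the file-like targets list are assumptions made for the example.

```python
import io

def is_short_trace(paris_hops, mda_hops):
    """True if the Paris trace ended at a lower TTL than the MDA trace.

    Each trace is assumed to be a list with one entry per TTL; an MDA
    entry may hold several interfaces where a load balancer was seen.
    """
    return len(paris_hops) < len(mda_hops)

def check_destination(destination, paris_hops, mda_hops, targets):
    """Append the destination to the targets list if the Paris trace is
    short, signalling a possible black hole. `targets` is any writable
    file-like object (hypothetical interface)."""
    if is_short_trace(paris_hops, mda_hops):
        targets.write(destination + "\n")
        return True
    return False

# In-memory stand-in for the targets list file.
demo_targets = io.StringIO()
mda = [{"10.0.0.1"}, {"10.0.1.1", "10.0.1.2"}, {"10.0.2.1"}, {"192.0.2.9"}]
flagged = check_destination(
    "192.0.2.9", ["10.0.0.1", "10.0.1.2"], mda, demo_targets)
not_flagged = check_destination(
    "192.0.2.10",
    ["10.0.0.1", "10.0.1.1", "10.0.2.1", "192.0.2.10"], mda, demo_targets)
```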
A new run on the Internet simulator has been initiated using team rather than gang probing. The difference is that team probing probes to different destinations from each vantage point whereas gang probing does all the destinations from each vantage point. The CAIDA traceroute data is collected with team probing, so if the simulator is set to gang probing then extra links must be created in the memory model of the subset of the Internet under study. This addition of links takes a long time to do, and may not be practical in our case. In getting the new settings right the simulator has exited early a couple of times with an error message. Once there was a missing data file and once there was an inaccessible address for the controller.
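To make the difference between the two modes concrete, here is a small sketch. The round-robin split used for team probing is an assumption for illustration; the source only says that each vantage point probes different destinations.

```python
def gang_assignments(vantage_points, destinations):
    """Gang probing: every vantage point probes every destination."""
    return {vp: list(destinations) for vp in vantage_points}

def team_assignments(vantage_points, destinations):
    """Team probing: the destinations are divided among the vantage
    points (round-robin here, which is an assumed policy)."""
    teams = {vp: [] for vp in vantage_points}
    for i, dest in enumerate(destinations):
        teams[vantage_points[i % len(vantage_points)]].append(dest)
    return teams

vps = ["vp-a", "vp-b"]
dests = ["d1", "d2", "d3", "d4"]
gang = gang_assignments(vps, dests)
team = team_assignments(vps, dests)
```

With two vantage points and four destinations, gang probing sends eight traces while team probing sends four, which is why data collected with team probing lacks some of the links a gang-probing simulation expects.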
It is desirable to now look at the big picture in regard to what the Internet simulator can be used for. Two possibilities are to analyse controller cost for doubletree and to analyse controller cost and savings for Megatree which is similar but analyses load balancers. Megatree would not start in the middle of the trace like doubletree, and would require an initial single packet per hop traceroute to determine if any load balancers which have been seen before have occurred again. Then the full MDA traceroute would be carried out. Savings would obviously have to cover this initial look-see as well as controller traffic. Doubletree is designed to avoid repeatedly discovering the same sections of topology by recording trace end points and nodes connected to these. Megatree records divergence and convergence points of load balancers.
Finally starting to get underway with some coding again now that the paper work is out of the way.
I created a forked copy of Libtrace on GitHub to keep my work separate; it is available at https://github.com/rsanger/libtrace/. I worked through my existing code from summer to ensure that any recent patches are applied properly, and did some tidying up. I'm hoping making this public will also force me to keep things a bit tidier.
This coming week I will be focusing on writing some tests for the data structures and the parallel routines in general, as well as a test to measure performance.
Updated the throughput test to report data in a manner more consistent
with the other tests, including sending an ampname for the test target.
Added some simple unit tests to the throughput test to check connection
establishment/hello/configuration messages between server and client.
Updated the control socket to properly listen on specific
interfaces/addresses for both IPv4 and IPv6, rather than listening on
all or one single address.
Added long options to all of the standalone tests that didn't have them,
to be consistent with the manpages and other tests.
Fixed the parsing of reported test data to properly null fields that are
undefined because we received no response. This lets the database
insertion code properly record them without a value rather than storing
zero. Also added code to deal with both the old and new protocol
versions so that we can keep data that was reported during the
Wrote a pair of sql aggregate functions to operate on our new data
formats and perform percentile calculations across arrays of values and
single values. These should hopefully be able to replace some of the
more confusing query code with a simple call to the appropriate
Continued marching towards being able to migrate our prophet database to the updated NNTSC database schema. Discovered a number of cases where AMP tests were reporting failed results but values were still being inserted into the database for fields that should be invalid due to the test failing.
Updated the RRD-Smokeping schema to store the individual ping results as a single column using an array. This caused some problems with our approach for calculating the "smoke" that we show on the graphs, but Brendon the SQL-master was able to come up with some custom aggregation functions that should fix this problem.
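As a rough illustration of the calculation those aggregation functions have to perform, the sketch below computes "smoke" percentiles across rows that each hold an array of ping results. It is Python rather than SQL, the nearest-rank percentile method and the default fractions are assumptions, and it is not the actual aggregate code.

```python
import math

def percentile(sorted_values, fraction):
    """Nearest-rank percentile of a non-empty sorted list."""
    rank = max(1, math.ceil(fraction * len(sorted_values)))
    return sorted_values[rank - 1]

def smoke(rows, fractions=(0.05, 0.25, 0.5, 0.75, 0.95)):
    """Flatten the per-row arrays in a bin, drop missing results and
    return the requested percentiles, or None if the bin has no data."""
    values = sorted(
        v for row in rows for v in (row or []) if v is not None)
    if not values:
        return None  # an entirely NULL bin has no smoke to draw
    return [percentile(values, f) for f in fractions]

bin_rows = [[10, 20, None], [30, 40], None]
quartiles = smoke(bin_rows, fractions=(0.25, 0.5, 1.0))
empty = smoke([[None, None], None])
```

Handling the all-missing bin explicitly matters because a bin where every ping was lost still has to produce a row for the graph.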
Finished looking at the events on amp.wand.net.nz. Also managed to come up with a solution to the single-large-spike problem I had last week. It's not perfect (mainly in that it only works if the spike is exactly one measurement; a two-measurement spike will still have the same problem), but it gets rid of a few annoying insignificant events.
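One simple way to pick out a spike that lasts exactly one measurement is sketched below. This is an assumed approach for illustration only (comparing each point against both of its neighbours), not necessarily what the detector does, and the threshold is arbitrary.

```python
def isolated_spikes(series, threshold):
    """Return the indexes of measurements that exceed both of their
    immediate neighbours by more than `threshold`."""
    spikes = []
    for i in range(1, len(series) - 1):
        prev, cur, nxt = series[i - 1], series[i], series[i + 1]
        if cur - prev > threshold and cur - nxt > threshold:
            spikes.append(i)
    return spikes

single = isolated_spikes([10, 11, 50, 10, 11], threshold=20)
double = isolated_spikes([10, 50, 50, 10], threshold=20)
```

The second call finds nothing, which mirrors the limitation mentioned above: once the spike spans two measurements, neither point stands out from both of its neighbours.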
Modified the traceroute pathchange detector to try and reduce the number of events we've been getting for certain targets, most notably NetFlix. The main change is that we now only consider a hop to be "new" if it doesn't match the subnet of any existing hops for that TTL. It's all very naive: a /24 is considered a subnet for IPv4, a /48 for IPv6, but it results in a big improvement. Eventually, I expect us to plug a BGP feed into the detector and look for changes in the AS path rather than the IP path, but this should tide us over until then.
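The subnet check can be sketched with Python's ipaddress module as below. The /24 and /48 prefix lengths come from the description above; the function names and the way known hops are passed in are hypothetical.

```python
import ipaddress

def subnet_of(address):
    """Naive subnet for a hop: a /24 for IPv4, a /48 for IPv6."""
    ip = ipaddress.ip_address(address)
    prefix = 24 if ip.version == 4 else 48
    # strict=False lets us pass a host address and get its network.
    return ipaddress.ip_network(f"{ip}/{prefix}", strict=False)

def is_new_hop(address, known_hops):
    """A hop is 'new' only if no existing hop for this TTL shares
    its subnet."""
    candidate = subnet_of(address)
    return all(subnet_of(hop) != candidate for hop in known_hops)

same_v4 = is_new_hop("192.0.2.200", ["192.0.2.1"])
other_v4 = is_new_hop("198.51.100.1", ["192.0.2.1"])
same_v6 = is_new_hop("2001:db8:0:1::1", ["2001:db8::1"])
other_v6 = is_new_hop("2001:db8:1::1", ["2001:db8::1"])
```

Here `known_hops` stands for the addresses already seen at a given TTL, so a target that rotates between addresses in the same /24 no longer counts as a path change.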
Worked with Brad to set up a passive monitor to help ITS diagnose some problems they are having on their network related to broadcast and multicast traffic. Just waiting on ITS to let us know when the problems are occurring so we can narrow down our search for strange behaviour to just traces covering the time periods of interest.