Following the authoring of my last weekly report, I successfully discovered and resolved the packet truncation issue I was having. It appears that there is a (possibly reintroduced) bug in the Open vSwitch implementation where the OFPCML_NO_BUFFER option (which instructs the switch not to create a buffer for an incoming packet) was being ignored where the incoming packets came in on a flow set with a priority of 0. Changing the priority to a different number allowed us to successfully read the full contents of a DHCP packet, options and all - no truncation. I was able to discover the problem by using scapy to throw some fixed-length packets to my controller and observe how the controller interpreted them as they were intercepted by different flows. A simple fix but exasperating to find.
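The fixed-length probe idea can be sketched without any packet library. This is only an illustration, assuming a payload whose bytes encode their own offsets; the report used scapy to send the packets, and the helper names `make_probe` and `truncated_at` are hypothetical.

```python
def make_probe(length: int) -> bytes:
    """Build a payload of exactly `length` bytes whose content encodes its own offsets."""
    # Each byte is its offset modulo 256, so a cut-off point is easy to spot.
    return bytes(i % 256 for i in range(length))


def truncated_at(sent: bytes, received: bytes):
    """Return the offset where `received` stops matching `sent`, or None if intact."""
    if received == sent:
        return None
    # Count the leading bytes that still match what was sent.
    matching = 0
    for a, b in zip(sent, received):
        if a != b:
            break
        matching += 1
    return matching


probe = make_probe(300)
print(truncated_at(probe, probe))        # None: payload arrived intact
print(truncated_at(probe, probe[:123]))  # 123: payload was cut after 123 bytes
```

Comparing what the controller sees against a known payload like this makes it clear whether a given flow delivers the whole packet or only a buffered prefix.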
I've spent the rest of this week assembling my presentation for my in-class honours practice talk, which I gave today. It went ok, but I ran over time and found I had too many slides with too much content. Then again, I found it difficult to find a happy medium in accommodating my audience's understanding of some of the concepts I was to speak about. Hat-tip to Brad and other members of the WAND group for their constructive feedback about my draft presentation.
Other than that, I have spent the rest of my project time this week working on my interim project report. I haven't found it too much of a struggle to grasp LaTeX, and now I'm finding myself wondering how I managed report writing before now without it. I've spent a lot more time than I intended writing longer introduction and background chapters, with the intention of borrowing some of it for my final report where applicable.
I gave my practice presentation to the 520 class last Wednesday, and it went well.
I finished updating tracertstats to use the tick packets and removed the previous system of holding temporary results. This has also removed a lot of duplicated code which is nice.
I've started on the mid-term report, which is due this Friday. I'm hoping to be able to reuse much of the introductory content in the final report.
I've been flat out with other assignments for the past little while and haven't managed to get any honours work done, but I'm about to spend this week writing my interim report. As with my proposal I plan for it to consist mostly of background research, since my experiences with the STM32W RFCKIT have thus far yielded more questions than results.
Spent Mon-Wed on Jury service.
Continued fixing problems with gcc-isms in libtrace. Added proper checks for each of the various gcc optimisations that we use in libtrace, e.g. 'pure', 'deprecated', 'unused'. Tested the changes on a variety of systems and they seem to be working as expected.
Started testing the new ampy/amp-web on prophet. Found plenty of little bugs that needed fixing, but it now seems to be capable of drawing sensible graphs for most of the collections. Just a couple more to test, along with the matrix.
Replaced the libc resolver with libunbound. Wrote a few wrapper
functions around the library calls to give me data in a linked list of
addrinfo structs in a similar way to getaddrinfo() so that I don't need
to modify the code around tests too much. The older approach with each
test managing the resolver didn't allow caching to work (there was no
way for them to share context/cache), so I moved that all into the main
process. Tests now connect to the main process across a unix socket and
ask for the addresses for their targets.
Using asynchronous calls to the resolver has massively cut the time
taken pre-test, and the caching has cut the number of queries that we
actually have to make. We shouldn't be hammering the DNS servers any more.
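The effect of moving the cache into one shared place can be sketched as follows. This is a minimal illustration rather than the actual implementation: `CachingResolver` and `system_lookup` are hypothetical names, the lookup function is injected so the demo needs no network, and the sketch ignores record TTLs, asynchronous queries and the unix socket transport.

```python
import socket
from typing import Callable, Dict, List


class CachingResolver:
    """Hypothetical stand-in for the single resolver held by the main process."""

    def __init__(self, lookup: Callable[[str], List[str]]):
        self._lookup = lookup              # performs the real query
        self._cache: Dict[str, List[str]] = {}
        self.upstream_queries = 0          # how often we actually queried upstream

    def resolve(self, name: str) -> List[str]:
        # Only go upstream the first time a name is requested.
        if name not in self._cache:
            self.upstream_queries += 1
            self._cache[name] = self._lookup(name)
        return list(self._cache[name])


def system_lookup(name: str) -> List[str]:
    """One possible lookup function, using the system resolver (not called below)."""
    infos = socket.getaddrinfo(name, None)
    return sorted({info[4][0] for info in infos})


# Demo with a fake zone so the example runs offline.
fake_zone = {"a.example": ["192.0.2.1"], "b.example": ["192.0.2.2"]}
resolver = CachingResolver(lambda name: fake_zone[name])
for target in ["a.example", "b.example", "a.example", "a.example"]:
    resolver.resolve(target)
print(resolver.upstream_queries)  # 2: repeated targets are served from the cache
```

With one resolver per test there would have been four upstream queries here; sharing the cache reduces that to one per distinct name.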
Spent a lot of time testing this new approach and trying to track down
one last infrequently occurring memory leak.
Finished most of the ampy reimplementation. Implemented all of the remaining collections and documented everything that I hadn't done the previous week, including the external API. Added caching for stream->view and view->groups mappings and added extra methods for querying aspects of the amp meta-data that I had forgotten about, e.g. site information and a list of available meshes.
Started re-working amp-web to use the new ampy API, tidying up a lot of the python side of amp-web as I went. In particular, I've removed a lot of web API functions that we don't use anymore and also broken the matrix handling code down into more manageable functions. Next job is to actually install and test the new ampy and amp-web.
Spent a decent chunk of time chasing down a libtrace bug on Mac OS X, which was proving difficult to replicate. Unfortunately, it turned out that I had already fixed the bug in libtrace 3.0.19 but the reporter didn't realise they were using 3.0.18 instead. Also received a patch to the libtrace build system to try and better support compilers other than gcc (e.g. clang), which prompted me to take a closer look at some of the gcc-isms in our build process. In the process, I found that our attempt to check if -fvisibility is available was not working at all. Once I had replaced the configure check with something that works, the whole libtrace build broke because some function symbols were no longer being exported. Managed to get it all back working again late on Friday afternoon, but I'll need to make sure the new checks work properly on other systems, particularly FreeBSD 10 which only has clang by default.
Further development has been carried out on the warts analysis based MDA (load balancer topology) simulator. Local data is looking reasonable now and a start has been made on coding the global or distributed data processing part. The program is in the process of being debugged. Furthermore it seems like some analysis of the many to few scenario would be a good idea for this work as it would tie in with the emphasis on many vantage points. Factors included so far include various numbers of stages where controller info is made use of, and a window of 500 traces. The frequency of sending control data is also varied.
The Doubletree and Traceroute simulator runs have been carried out for the data from one day of Caida data collection. I am now in the process of producing graphs from it. Factors included many versus few sources, Doubletree versus Traceroute and varying numbers of stages where control data is sent. It seems like a good idea to also reduce the number of vantage points in several stages, and repeat the simulator runs.
I've been working on implementing empty tick messages. These can be produced every X seconds to assist programs that want to report results in tracetime (wall time).
I had two main choices for producing these messages: either start a new thread to send them, or use timer_create() to enter a signal handler. I've opted for a separate thread because it is not obvious which thread would be best to run the signal handler on. A separate thread will find the least busy core and cannot block other threads out of shared resources by holding a lock etc.
This will also make it easier to customise the timer's behaviour later, such as accounting for skew in the format.
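The dedicated-thread approach can be illustrated with a short Python sketch. It only shows the general pattern; the real code lives in the C library, and the `TickThread` class, the queue and the message shape are assumptions made for this example.

```python
import queue
import threading
import time


class TickThread(threading.Thread):
    """Hypothetical ticker: puts ("tick", timestamp) on a queue every `interval` seconds."""

    def __init__(self, out: queue.Queue, interval: float):
        super().__init__(daemon=True)
        self._out = out
        self._interval = interval
        self._stop_event = threading.Event()

    def run(self):
        # Event.wait() acts as an interruptible sleep: it returns True once stop() is called.
        while not self._stop_event.wait(self._interval):
            self._out.put(("tick", time.time()))

    def stop(self):
        self._stop_event.set()


messages = queue.Queue()
ticker = TickThread(messages, interval=0.01)
ticker.start()
first = [messages.get(timeout=1) for _ in range(3)]  # consume three ticks
ticker.stop()
ticker.join()
print([kind for kind, _ in first])  # ['tick', 'tick', 'tick']
```

Because the ticker owns its own loop, changing how the interval is computed (for example to correct for skew) only touches `run()` and never interrupts the worker threads.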
I've also created/updated slides for my honours practice talk.
It's been a busy week with other assignments, so this is not fully integrated with tracertstats yet and needs a configuration option added, however it is functioning.
I've spent a lot of my time lately working through various bugs with my controller implementation and getting a real feel for the project environment. I've recently pulled all my kvm test hosts across to virt-manager and set them up with serial access, which allows me easy cli access to each of my hosts through the virsh console, which unsurprisingly is proving to be much quicker and easier than accessing them through X11 (hat-tip to Brad for that). Using virt-manager should hopefully allow me to scale management of my hosts as my virtual topologies change and grow much more efficiently.
I have successfully managed to create dynamic flows between a DHCP server and a client host requesting an IP address, using an anonymous unicast-like transmission to discover characteristics about how the DHCP server is connected to our virtual switch, i.e. its 'physical' port. Once this dynamic flow is created, server and client are freely permitted to exchange packets, with the intention of creating a new dynamic flow to some form of WAN for the client when the DHCP lease negotiation process is complete.
I have been caught up for a few days now fighting with my Ryu controller and OpenVSwitch instance over packet buffering issues. I have been attempting to cast incoming packets seen by the controller into the supplied Ryu packet API to extract their payloads and other interesting information, but for a reason that presently escapes me, incoming packets are being unintentionally buffered and therefore any off-the-wire data extraction the controller attempts is being truncated. If for instance we attempt to do this with DHCP packets, exceptions are thrown by the API when we try to unpack the packet payload struct, and some analysis reveals that the payloads are being truncated after about 123 bytes. Given that much of the interesting information in a DHCP packet is located at the tail of the payload in the options field, this presents a problem.
I have a few suggestions from other members about how to proceed past my existing problem. Next up, I'm going to try generating some sample packet data using scapy and passing it through a series of default OpenVSwitch flows (ignoring my controller implementation) to see both how the data behaves and whether the buffering occurs in the same way with the same outcome. If that fails to uncover anything, I'll try testing some similar implementations on older versions of OpenVSwitch to see if this problem is perhaps localised to a specific version of any of my environmental components. Other possibilities include investigating DHCP server APIs to see if I can bypass off-the-wire data extraction, and bypassing my DHCP component altogether to focus on the next parts of my project - such as VLAN awareness and AAA database interfacing.
Test results have been coming in, and are pretty much going exactly as expected.
Packet colouring can find loss within 2 seconds, as you would expect. BFD takes a long time to notice low rates of packet loss, but will detect at least as low as 2% eventually. Basically it has to send in the order of 1/(n^3) packets to detect a packet loss rate of n. That all has brought me to the conclusion that packet colouring combined with the in-built link down detection in openflow is by far the most efficient way of finding loss. Link loss is detected immediately, and everything else is detected in 2 seconds.
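The 1/(n^3) figure can be reproduced with a little arithmetic. The sketch below assumes that BFD declares a failure after three consecutive missed packets and that losses are independent; neither assumption is stated above, and the function names are made up for the example.

```python
def order_of_magnitude_packets(loss_rate: float, misses_needed: int = 3) -> float:
    """Rough estimate: one in (1 / loss_rate^k) windows of k packets is lost entirely."""
    return 1.0 / loss_rate ** misses_needed


def expected_packets(loss_rate: float, misses_needed: int = 3) -> float:
    """Expected packets sent until `misses_needed` consecutive ones are lost.

    Standard result for a run of k successes in independent trials with
    probability p: (1 - p^k) / ((1 - p) * p^k).
    """
    p_run = loss_rate ** misses_needed
    return (1.0 - p_run) / ((1.0 - loss_rate) * p_run)


for rate in (0.02, 0.05, 0.10):
    rough = order_of_magnitude_packets(rate)
    exact = expected_packets(rate)
    print(f"{rate:.0%} loss: roughly {rough:,.0f} packets (expected {exact:,.0f})")
```

At 2% loss the rough estimate is about 125,000 packets, which is why BFD takes so long to notice low loss rates under these assumptions.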
The counters on the pronto are still pretty dodgy. Using packet counters with multiple bridges on the one device seems a bad idea. My packets are being handled by one bridge but are being counted towards flows on the other.
I've started building path calculators. I have one which tries to exploit openflow fast failovers, so when a link goes down, it only modifies the path at the point where the link down is detected. This could potentially require the smallest number of flows, but unless your network fits certain topological constraints it probably won't. I'm currently implementing the 2 flows per switch version, which may involve very long paths in certain circumstances (basically longer paths end up being prioritised over shorter paths in a lot of circumstances), and I will implement the patented version too, just for a comparison (unless someone tells me not to; I really have no understanding of patent law).