Added a basic test for the entire parallel libtrace system, based upon the existing test-format code. This tests most of the basic functions, including pausing and restarting a trace, however it is limited to the file based formats only.
I set up a BSD VM at home and spent some time tidying up the code to ensure it still compiled. Merged in the latest upstream fixes, and spent a while finding a build bug where visibility for some 'exported' functions was hidden.
Fixed the delay in name resolution which was causing one amplet to
time out tests when first starting. It was caused by the default
behaviour of acting as a recursive resolver when no resolvers
were specified. It now properly uses the nameservers listed in
/etc/resolv.conf if there is no overriding configuration given.
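The fallback behaviour might be sketched roughly as below. This is a stdlib-only illustration of reading nameserver entries out of /etc/resolv.conf; the function name and structure are my own and not the actual amplet code, which presumably hands the discovered servers to its resolver library.

```c
#include <stdio.h>
#include <string.h>

/* Illustrative sketch only: if no resolvers are configured, read the
 * "nameserver" entries from /etc/resolv.conf rather than recursing from
 * the roots. The function name is hypothetical. */
static int load_nameservers(const char *path, char servers[][64], int max) {
    FILE *fp = fopen(path, "r");
    char line[256];
    int count = 0;

    if (fp == NULL) {
        return 0;
    }

    while (count < max && fgets(line, sizeof(line), fp) != NULL) {
        char addr[64];
        /* keep only "nameserver <address>" lines, skipping comments */
        if (line[0] != '#' && line[0] != ';' &&
                sscanf(line, "nameserver %63s", addr) == 1) {
            strncpy(servers[count], addr, 63);
            servers[count][63] = '\0';
            count++;
        }
    }

    fclose(fp);
    return count;
}
```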
Implemented the change in the receive code to use timestamps direct from
the test sockets. All tests that use the AMP library functions to
receive packets will be able to pass in a pointer to a timeval and have
it filled with the receive time of the packet. I haven't merged this yet
as I plan to spend some more time testing it under load and comparing it
to the previous approach.
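The caller-facing shape of that change might look something like the sketch below. The function name and signature are assumptions for illustration, and gettimeofday() stands in for the real timestamp source, which comes from the socket itself in the new code.

```c
#include <sys/socket.h>
#include <sys/time.h>
#include <stddef.h>

/* Hypothetical shape of the receive helper: tests pass in a pointer to
 * a timeval and have it filled with the receive time of the packet.
 * gettimeofday() here is only a stand-in for the socket timestamp. */
static ssize_t receive_packet(int sock, void *buf, size_t len,
        struct timeval *tv) {
    ssize_t bytes = recv(sock, buf, len, 0);

    if (bytes >= 0 && tv != NULL) {
        /* fall back to the wall clock if no socket timestamp exists */
        gettimeofday(tv, NULL);
    }

    return bytes;
}
```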
Updated the schedule/nametable to allow selecting specific address
families of test targets even if the name/address pair was manually
specified in the nametable rather than resolved using DNS. This should
all behave consistently in the schedule file now regardless of type.
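One piece of making manually specified targets behave like resolved ones is classifying each nametable address by family; a minimal sketch of that classification, with a function name of my own invention, might be:

```c
#include <arpa/inet.h>

/* Illustrative only: classify a manually specified nametable address as
 * IPv4 or IPv6 so the schedule can filter on address family the same
 * way it does for DNS-resolved names. Not the actual amplet code. */
static int address_family(const char *addr) {
    struct in_addr v4;
    struct in6_addr v6;

    if (inet_pton(AF_INET, addr, &v4) == 1) {
        return AF_INET;
    }
    if (inet_pton(AF_INET6, addr, &v6) == 1) {
        return AF_INET6;
    }
    return AF_UNSPEC;
}
```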
Spent some time investigating a bug in the code that renames the test
processes to more useful names than the parent's. rsyslog starts
printing incorrect process names when logging, and this can lead to
crashes. Renaming works fine when run in the foreground with logging
directly to the terminal, and the correct process names are shown.
Short week last week, after being sick on Monday and recovering from a spot of minor surgery on Thursday and Friday.
Finished adding support for the amp-throughput test to amp-web, so we can now browse and view graphs for amp-throughput data. Once again, some fiddling with the modal code was required to ensure the modal dialog for the new collection supported all the required selection options.
Lines for basic time series graphs (i.e. non-smokeping style graphs) will now highlight and show tooltips if moused over on the detail graph, just like the smoke graphs do. This was an annoying inconsistency that had been ignored because all of the existing amp collections before now used the smoke line style. Also fixed the hit detection code for the highlighting so that it would work if a vertical line segment was moused over -- previously we were only matching on the horizontal segments.
I wrote my interim report a couple of weeks ago (since the last time I wrote a blog entry) and this week I have been beginning to get back into the swing of things. I spent some time looking into 6LoWPAN/RPL gateway router solutions (to bridge the LoWPAN with Ethernet) and found that a common solution (6LBR) seems to be to run the router on a more powerful device such as a Raspberry Pi or BeagleBoard, using the radio of a USB-connected mote via a serial tunnel. An alternative to this would be to connect a radio directly to the board, but radios available are limited and generally not mass produced. To this end I went back to trying to get a serial tunnel working so that I could communicate between the host PC and a (remote) mote. I read into the code for the tunnel a little further and managed to work out that packets are being received and recognised between the host PC and tunnel mote, but not forwarded further to the remote mote. This is confusing since I had expected the problem to lie with the tunnel software (which I had such trouble initially compiling) and based on what I've read, it sounds like the motes should automatically establish a LoWPAN between themselves and communication should be no problem.
Following the plan I laid out in my interim report, I was aiming this week to nail down the hardware I would need for the coming weeks. Since Richard N has procured another STM32w RFCKit for me already and Richard S has a Raspberry Pi he doesn't need any more, I should be good to go.
A single simulation rerun was completed on the Doubletree Internet simulator. The automated graphs were updated to include this data point. The graphing was also modified to draw bars rather than lines and points, and monochrome dashed boxes were set up in the png file output. Looking at the data I noted that the numbers of control packets reported need to be checked. Once this is done I plan to run the simulator with reduced data to create another factor variable.
Automated graphing was also set up for the Megatree Internet simulator. The monochrome png file settings were also applied to this program. Once again a number of categorical data variables were applied to the x axis. These included timing (the number of consecutive traces that contributed to the local and global data sets), the number of sources included in the simulations, and similarly the number of destinations included.
The black hole detector drivers were changed over to ICMP probing, and a run was set going on Yoyo. This was a step towards running on PlanetLab. The original data that made me think that I was finding black holes in load balancers was reviewed and a typical case was analysed. This was where Paris Traceroute ran one hop shorter than the two MDA Traceroute runs. The MDA traces showed that one arm of a terminal load balancer connected to the destination whereas the other had no further links found. A trace collected using Paris Traceroute goes down only one arm of a given load balancer, because the flow ID is kept constant. It is interesting that in this case the Paris trace stops short in the other arm without reaching the destination, to which there should be a valid path. Note that the convergence point of the load balancer did not appear to be reached.
Deployed a new version of the amplet client to most of the monitors.
Found some new issues with name resolution taking too long and timing
out tests on one particular machine. Fixed the SIGPIPE caused by this,
but have yet to diagnose the root cause. No other machine exhibits this.
Kept looking into the problem with packet timing when the machine is
under heavy load, and after looking more closely at the iputils ping
source managed to find a solution. Using recvmsg() grants access to
ancillary data which if configured correctly can include timestamps. My
initial failed testing of this didn't properly set the message
structures - doing it properly gives packet timestamps that are much
more stable under load.
Updated the test schedule fetching to be more flexible and easier to
deal with. Client SSL certs are no longer required to identify the
particular host (but can still be used if desired).
I investigated the implementation of the BPF filters in regards to thread safety. It turns out that what I originally suspected is true: the compilation is not thread safe, but running the filter is. The possible problem depended on where the scratch memory for the filter was stored; it appears that when running the filter this is allocated on the stack, both within the pcap code and in all of the kernel implementations. I suspect that the LLVM JIT does the same, but since it is disabled by default I'm not too concerned for now. The BPF filter compilation, on the other hand, uses lots of globals in libpcap and is not thread safe; this is the case I had already locked against previously.
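The locking pattern for the unsafe compile path can be sketched as below. To keep the sketch free of a libpcap dependency, compile_filter_unlocked() is a placeholder standing in for the real pcap_compile() call; only the serialisation around it is the point.

```c
#include <pthread.h>

/* Sketch: serialise all filter compilations behind one mutex, since
 * compilation touches globals in libpcap, while running an
 * already-compiled filter needs no lock. */
static pthread_mutex_t bpf_compile_lock = PTHREAD_MUTEX_INITIALIZER;
static int compiled_count = 0;

/* placeholder for the real pcap_compile() call */
static int compile_filter_unlocked(const char *expr) {
    (void)expr;
    compiled_count++;
    return 0;
}

static int compile_filter(const char *expr) {
    int ret;

    pthread_mutex_lock(&bpf_compile_lock);
    ret = compile_filter_unlocked(expr);
    pthread_mutex_unlock(&bpf_compile_lock);

    return ret;
}
```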
Hooked up configuration options for the tick packets, allowing them to be turned on and off and sent at a specified interval with a resolution of 1 msec.
Reworked how aggregation binsizes are calculated for the graphs. There is now a fixed set of aggregation levels that can be chosen, based on the time period being shown on the graph. This means that we should hit cached data a lot more often rather than choosing a new binsize every few zoom levels. Increased the minimum binsize to 300 seconds for all non-amp graphs and 60 seconds for amp graphs. This will help avoid problems where the binsize was smaller than the measurement frequency, resulting in empty bins that we had to recognise were not gaps in the data.
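The binsize selection might be sketched along these lines. The ladder of levels and the target bin count here are illustrative assumptions, not the exact values amp-web uses; the point is that the binsize is snapped to a fixed set rather than derived fresh at every zoom level, so cache hits become much more likely.

```c
/* Illustrative sketch: pick the smallest binsize from a fixed ladder
 * that respects the per-collection minimum and keeps the number of
 * bins on screen manageable. Values are assumptions for illustration. */
static int choose_binsize(long duration_secs, int minimum) {
    static const int levels[] = { 60, 300, 900, 3600, 14400, 86400 };
    const int nlevels = sizeof(levels) / sizeof(levels[0]);
    int i;

    for (i = 0; i < nlevels; i++) {
        if (levels[i] < minimum) {
            continue;   /* e.g. 300s minimum for non-amp graphs */
        }
        /* accept the first level that yields a sane bin count */
        if (duration_secs / levels[i] <= 300) {
            return levels[i];
        }
    }
    return levels[nlevels - 1];
}
```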
Added new matrices for DNS data, one showing relative latency and the other showing absolute latency. These act much like the existing latency matrices, except we have to be a lot smarter about which streams we use for colouring the matrix cell. If there are any non-recursive tests, we will use the streams for those tests as these are presumably cases where we are querying an authoritative server. Otherwise, we assume we are testing a public DNS server and use the results from querying for 'google.com', as this is a name that is most likely to be cached. This will require us to always schedule a 'google.com' test for any non-authoritative servers that we test, but that's probably not a bad idea anyway.
Wrote a script to more easily update the amp-meta databases to add new targets and update mesh memberships. Used this script to completely replace the meshes on prophet to better reflect the test schedules that we are running on hosts that report to prophet.
Merged the new ampy/amp-web into the develop branch, so hopefully Brad and I will be able to push out these changes to the main website soon.
Started working on adding support for the throughput test to ampy. Hopefully all the changes I have made over the past few weeks will make this a lot easier.
The scamper warts based MDA load balancer simulator of Megatree was updated to calculate how many control packets are required, using a packet size of 576 bytes where each pair of IPv4 addresses takes up 8 bytes. Megatree does an initial traceroute and determines if there are any load balancers that have been seen before, and if so avoids remapping them. After the first attempt at improved packet counting, it was necessary to adjust the counts to take partly full packets into account, counting each as one whole packet.
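The packet accounting above amounts to a ceiling division, sketched below. The 576-byte packet size and 8 bytes per IPv4 address pair follow the text; the function name is my own.

```c
/* Worked sketch of the control packet accounting: pack 8-byte IPv4
 * address pairs into 576-byte packets, counting a partly full final
 * packet as one whole packet (ceiling division). */
static unsigned int control_packets(unsigned int address_pairs) {
    const unsigned int packet_size = 576;
    const unsigned int pair_size = 8;
    const unsigned int pairs_per_packet = packet_size / pair_size; /* 72 */

    if (address_pairs == 0) {
        return 0;
    }
    return (address_pairs + pairs_per_packet - 1) / pairs_per_packet;
}
```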
A method for simulating Doubletree with the MDA data set used above has been developed. It was necessary to write routines to simulate conventional Traceroute and Doubletree. The calculations of control packets were along the same lines as above, but different enough to require reworking the same code as a starting point. The first run of this code is underway after a small amount of debugging and some code checking.
I am also trying to decide if there is some way of combining Doubletree and Megatree for analysis of MDA data, however at first glance there is no easy way to do this. Doubletree could conceivably be applied at the initial Traceroute stage and at the later load balancer detection stage using MDA. This would be in addition to avoiding the remapping of load balancers.
The black hole detector is looking less promising than it initially did, as I remove more and more exceptions to the discovery of a black hole. The most common remaining scenario is that of Paris traces that are one hop shorter than the MDA traces, where the final hop of the Paris trace is not the MDA destination. I am in the process of looking at some warts dump data of these cases. In addition I have also started looking for cases where load balancers have dropped a successor due to a black hole in that arm of the load balancer. This is what might be expected if the network functions correctly. It is not immediately clear how long it would take a network to adapt to the loss of a node in a load balancer arm, and it may be that we would only very rarely detect a black hole in a load balancer.
Spent some time investigating the best way to rename running processes,
so that amplet test processes can have more descriptive names. On Linux
it appears that the best way is simply to clobber argv (moving the
environment etc. out of the way to make more space), as it can use
longer names and the change affects most of the places process names
are observed. The prctl() function is about the only other option on
Linux, and that is limited to 16 characters and only changes the output
of top. The test processes now name themselves after the ampname they
belong to and the test they perform.
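The two approaches can be sketched together as below. This is Linux-specific and simplified: a real implementation would relocate the environment first to reclaim space for long names, as described above, whereas this sketch only overwrites the space argv[0] already occupies.

```c
#include <sys/prctl.h>
#include <string.h>

/* Sketch of the two renaming options on Linux. prctl(PR_SET_NAME)
 * truncates to 16 bytes (including the terminator) and mostly affects
 * /proc/<pid>/comm (what top shows); rewriting argv[0] in place is what
 * changes the name seen in most other tools. */
static void rename_process(char *argv[], const char *name) {
    size_t space = strlen(argv[0]);

    /* option 1: 15 usable characters, affects e.g. /proc/self/comm */
    prctl(PR_SET_NAME, name, 0, 0, 0);

    /* option 2: clobber argv[0] in place; strncpy() zero-pads any
     * leftover space, and the original terminator stays in place */
    strncpy(argv[0], name, space);
    argv[0][space] = '\0';
}
```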
Performed a test install on both a puppet managed machine and an older
amplet machine. This was complicated by needing to upgrade rabbitmq on
the older machine without proper packages being available. Put together
some scripts that should mostly automate the upgrade process for later
installs. Watching these two test installs I found and fixed a race
condition that was triggering an assert because the number of
outstanding DNS requests was being incorrectly modified.
Moved the amplet2 repository from svn to git, which will make branching
etc a lot nicer/easier.