Brendon Jones's blog
Finished removing the test specific options from the main test
management protocol. Each test now deals with its own options (if any)
that are embedded in the top level protocol buffer message.
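As a rough sketch of what this embedding looks like (all message and field names here are made up for illustration, not the actual schema): instead of protobuf extensions, each test contributes an optional submessage to the top level message, and only the relevant one is ever set:

```protobuf
// Illustrative proto2-style sketch only -- names are hypothetical.
// Rather than extensions (which protobuf-c doesn't handle well), each
// test's options live as an optional submessage of the top level message.
message Control {
    optional uint32 test_id = 1;

    // test specific options, at most one of which is set per message
    optional ThroughputOptions throughput_options = 100;
    optional UdpstreamOptions udpstream_options = 101;
}

message UdpstreamOptions {
    optional uint32 packet_count = 1;
    optional uint32 packet_size = 2;
    optional uint32 packet_spacing = 3;
}

message ThroughputOptions {
    optional uint64 duration = 1;
    optional uint32 write_size = 2;
}
```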
Refactored the main loops of the throughput and udpstream tests to be a
lot more readable and to make it obvious what is going on for each of
the message types.
Lots of small fixes for things in the udpstream test - making sure
packet contents (timestamps etc) are sensible on both 32 and 64 bit
architectures, and that medians/percentiles are correctly calculated
with small numbers of packet delay variance samples.
Started work on moving the main control socket (used to start the test
servers, and soon, to run on-demand tests) to using protocol buffers.
Started to remove the test specific options that had crept into the
generic control message definitions. Unfortunately the protobuf-c
library doesn't appear to do extensions properly yet, so I've had to
work around that to embed test specific options inside the top level
message.
Throughput test now uses protocol buffers for all of the messages
involved in arranging and reporting the test. This is all achieved
through the same functions used by the udpstream test, which should now
be generic enough that other tests requiring custom servers can be written.
Wrote a first pass at the save function for the udpstream test, which
may need some modification once database schemas for storing the results
are finalised.
Updated the UDP stream test to report on the different periods of lost
and received packets during the test, to show how any lost packets were
distributed (bursty, random, etc).
Updated the messaging to now include result packets to retrieve results
from the remote endpoint. These are combined with the local result set
so that results are reported for both test directions.
Lots of small fixes to make sure that the right things are happening -
packet sizes are correctly calculated, sockets are closed appropriately,
memory is tidied up.
Finished moving all the control connections to protocol buffers,
including sending a full set of test options to the server when the test
starts.
The udpstream test now calculates packet delay variation between
consecutive packets and reports summary statistics across the whole
stream. The test schedule (directions, sizes, etc) can now be controlled
properly on the client, which will inform the server of what it needs to
do and when.
Short week as I took two days off to move house.
Continued working on porting the udpstream test to the new amplet
client. Spent most of the time trying to combine the control protocol
from the throughput test and udpstream test into something nicely
generic that can be used with future test servers, instructing them on
when and how to send test flows. Also started to convert the control
channel to use protocol buffers so that interoperability is much easier.
Discovered that most of the extra latency and jitter I was observing is
being introduced by stateful firewalling along the path (and on the host
itself as well). I'm possibly also seeing some overhead from routing
lookups for the first packet that uses a route, but at that point the
time scales are getting quite small.
Started working on porting the udpstream test to the new amplet client.
The way it will work shares quite a few ideas with the existing
throughput test, so I'm also taking the time to move lots of code from
the throughput test into a library that both (and future) tests can use
to coordinate sending data and results.
Updated amplet client init scripts to return proper LSB error codes when
starting without configuration, so systemd no longer falsely believes
the client was started ok. Updated the key permissions that the puppet
configuration enforces to match those of the amplet client packages.
Built new packages and deployed them for testing.
Spent some time investigating start-stop-daemon and killing process
groups. There doesn't appear to be any nice way to make this happen
without writing our own code to stop everything, which is starting to
look worthwhile to make sure everything is tidied up properly when using
start-stop-daemon.
Wrote functions to format raw AS traceroute and path length data for
download from the graph pages. Still need to do full IP path traceroutes.
Found some interesting results when comparing the amplet ICMP test with
a few other data sources. Something is introducing delay and jitter in
one that isn't present in the others. Spent some time looking at source
code and traces to try to figure out what is going on (unsuccessfully so
far, will continue on Monday).
Fixed the permissions for directories and files created for keys/certs
to make sure that rabbitmq can access them. Also added exponential
backoff when trying to fetch signed certificates - hopefully a machine
that is being actively installed will query soon enough to quickly get a
new certificate, but unattended installs won't hammer the server.
Investigated some reported issues about init scripts not performing
correctly, but I'm not sure I can find a fault. Also looked into two clients
that are not testing to the full list of targets - they just appear to
be ignored and there is no obvious reason why.
Worked with Brad to update two more amplets to Wheezy, and spent some
time trying to determine why we partially lost access to one of the few
remaining un-updated machines.
Spent some time putting together a test environment similar to how some
of the Lightwire monitors are configured, with ppp interfaces inside of
network namespaces. This allowed me to start tracking down issues with
the tcpping test that they were seeing. Firstly, the differences between
capturing on ethernet and Linux SLL/cooked interfaces weren't being
taken into account and header offsets were incorrectly calculated.
Secondly, I spent a lot of time trying to determine why the test was not
capturing the first response packet on a ppp interface - after a lot of
digging it turns out there is a bug in libpcap to do with bpf filters on
a cooked interface that was breaking it. The bug has been fixed, but
needs a backported package to get the new library version in Debian.
Tested building and running the amplet client and all the supporting
libraries on a Raspberry Pi. I've run standalone tests (it has a newer
kernel which I thought might help debug my ppp problems) and the results
look to be sensible. Will hopefully get a chance to test general
performance while reporting results next week.
Lots more small fixes to tidy up the AMP scheduling web interface.
Updated more dropdown menus to work with the changes that Brad made to
the API, and properly set valid default meshes when using the matrix,
making sure that only meshes that are tested to are added. Put in links
to the raw YAML
schedules for sites (possibly useful for debugging) and a link to an
example configuration script that will set up a client from scratch
(installing packages, configuration, etc).
Spent a morning at Lightwire doing a demo of the AMP web interface,
talking about the different data that can be collected and the ways it
can be useful. Tried to install a test client to show how that works,
but unfortunately ran into some issues with the test environment that
prevented name resolution from happening. Tracked it down to the way
that getifaddrs() describes ppp interfaces being unexpectedly different
from the ethernet interfaces we had tested on so far. Found and fixed a
heap of other smaller issues that came out of the meeting, mostly to do
with permissions and documentation.