Brendon Jones's blog
Short week again with the Easter break. Finished the setup of the second
measurement machine and the central collector/graphing machine. Got the
measurement machine shipped, which should hopefully be physically
installed in a few days.
Added some temporary throughput tests to the main schedule to test
performance across different times of day to each of the sites. The
proper schedule will need to be slightly tweaked to include the extra
throughput targets. So far the test data shows the targets to be
Went to Auckland with Brad to install a new measurement machine. It's
now up and running and performing a subset of tests, but generally looks
to be doing the right thing. Will be watching the data and adding more
tests over the next little while.
Performed some testing on the first throughput targets that we have
available to make sure that they will be fast enough. Two out of the
three so far look good, but the third is probably not well connected
enough to push the amount of data we are expecting.
Started to install a second measurement machine based on the first, as
well as documenting parts of the process that will need to be performed
by remote hands.
Spent most of the week getting the last few things ready to go ahead of
the test deployment next week, including diagrams of how various
components fit together.
Fixed a copy and paste error when comparing schedule items that meant
that there was a small chance of two tests being considered the same
even if they had different end times. Also updated some error handling
branches to properly free some resources that had been forgotten about,
and had a general tidy up based on feedback from a slightly newer gcc.
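As a rough illustration of how a copy and paste slip like this can hide in a comparison function, here is a minimal sketch. It is not the amplet client code: the `ScheduleItem` fields and both function names are made up for the example, and the real bug may have looked quite different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScheduleItem:
    # Hypothetical fields, for illustration only.
    test: str
    start: int     # seconds into the schedule period
    end: int       # seconds into the schedule period
    interval: int  # seconds between runs


def same_item_buggy(a, b):
    # The start comparison was pasted twice, so end is never checked.
    return (a.test == b.test
            and a.start == b.start
            and a.start == b.start
            and a.interval == b.interval)


def same_item(a, b):
    # Every field that distinguishes two schedule items is compared.
    return (a.test == b.test
            and a.start == b.start
            and a.end == b.end
            and a.interval == b.interval)


a = ScheduleItem("icmp", 0, 3600, 60)
b = ScheduleItem("icmp", 0, 7200, 60)
```

Here `a` and `b` differ only in their end times, so the buggy version treats them as the same item while the corrected one does not.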
Generated all the certificates for the test connections, and while doing
so found and fixed a bug that could prevent multiple certificates from
being generated when listed on the command line.
Built new packages with the recent minor updates and deployed them on
one of my test amplets to verify. While watching the results I found
some interesting behaviour with tests to www.amazon.com, where the
object counts are fluctuating between two values. A quick investigation
suggests that it's not caused by any changes to the test, but I haven't
discovered what is actually going on.
Made the HTTP test more consistent with the other tests when reporting
test failure due to name resolution or similar (rather than the target
failing to respond). Also added the option to suppress parsing the
initial object as HTML and to just fetch it.
Found and fixed a problem with long interface names being used inside my
network namespaces. Linux appears to allow longer interface names than
dhclient can deal with, so I've had to shorten some of my more
descriptive interface names.
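A minimal sketch of one way to shorten names while keeping them unique. The `shorten_ifname` helper is hypothetical, and the maximum length is left as a parameter because the exact limit that dhclient copes with isn't recorded here; the value 10 in the usage below is only a placeholder.

```python
def shorten_ifname(name, max_len, taken=()):
    """Cut name down to max_len characters, avoiding names already taken."""
    if len(name) <= max_len and name not in taken:
        return name
    candidate = name[:max_len]
    n = 0
    # On a clash, replace the tail with a counter until the name is free.
    while candidate in taken:
        suffix = str(n)
        candidate = name[:max_len - len(suffix)] + suffix
        n += 1
    return candidate


first = shorten_ifname("measurement-uplink0", 10)
second = shorten_ifname("measurement-uplink1", 10, taken={first})
```

Both long names would otherwise collapse to the same prefix, so the second one gets a numeric suffix instead.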
Spent some time measuring the quantity of test data to get an estimate
of how much database storage will be required for the new test clients.
Also looked at how much throughput test data is likely to be used
(multiple TB of data a month), and possible locations that might be
suitable to test to without hitting rate limits or interfering with
Continued to check various parts of the process chain to make sure that
they perform robustly when bits go away, networking is lost, machine
Installed and configured nntsc and postgres on the new test machine in
order to keep a local copy of all the data that will be collected. This
also has the ability to duplicate the incoming data and send it off to
another rabbitmq server for backup/visualisation.
Made some minor changes to the amplet client for the new test machine
that required building new packages, which are now available in a custom
repository to keep them separate from the others. Installed the amplet
client on the new test machine and configured it to run multiple
clients, testing the network namespace code.
Continued to test some of the infrastructure code around
starting/stopping amplet clients, creating network namespaces etc. Found
and worked around a few small problems where tools were not properly
namespace aware, or would not create files belonging to a namespace (but
would happily use them if they already existed).
Spent some time checking over the results of the new schedule to make
sure all the targets were reporting sensible data. Found that for some
of the HTTP test targets it was reporting 302 codes, despite my having chosen
the target URLs specifically to avoid this. Turns out that I was
stripping the scheme portion and just passing the hostname to libcurl,
so URLs that should have been HTTPS were treated as HTTP. Fixed it so it
properly uses the full URL. Had to update some ports used in the tcpping
test too, in order to get a useful response.
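The scheme problem can be sketched in a few lines. This is an illustration in Python rather than the C code in the HTTP test, and both function names are invented for the example.

```python
from urllib.parse import urlsplit


def target_buggy(url):
    # Keeps only the hostname, so the scheme is lost and the HTTP
    # library falls back to plain HTTP when given the bare name.
    return urlsplit(url).hostname


def target_fixed(url):
    # Pass the full URL through unchanged, after checking that it
    # really does carry a scheme we expect.
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError("expected an http or https URL: %r" % (url,))
    return url
```

With the buggy version, `https://www.example.com/` is reduced to `www.example.com`, which is then fetched over HTTP and is likely to be answered with a redirect back to the HTTPS site.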
Also updated the HTTP test to return more sensible data if the
connection times out. Rather than storing the duration of the timeout
and flatlining the graphs, it now returns an empty duration which is
Wrote a custom init script and other config to deal with starting the
amplet client in a network namespace for the new test deployment.
Updated the new test schedule slightly and set it running to collect
initial data. I'll look over that and make sure that all the targets are
responding as they should, and that the HTTP tests are fetching useful
pages (not redirects etc).
Got some useful looking stats from the embedded html5 youtube player
running with slimer.js inside a virtual frame buffer. Ideally it would
be fully headless, but it shows we should be able to get the information
we need from the player - initial buffering delay, other time spent
buffering, playing, etc.
Started a test install of the amplet components onto a machine using
network namespaces. Figured out what I need to do to make sure
everything can talk to each other in the right way. Also found and fixed
a few issues with the new PKI changes when used in a new environment
from scratch - some setup steps were relying on things that hadn't
happened yet and needed to be reordered.
Fixed some minor edge cases that I found while repeatedly killing and
restarting various parts of the amplet client. Some strings describing
tests were invalidated when tests got reloaded, so are now stored. Some
sockets could be leaked if the control socket failed to start.
Tidied up the freeing of timers (tests, watchdogs, schedule updates) to
properly remove everything on exit. Also cleaned up the build process
and removed some extraneous flags, fixed some warnings and enabled
silent rules in automake.
Put together a new test schedule that covers more web targets (mostly
top ranked Alexa sites), as well as performing latency measurements to
some of the major gaming infrastructure targets. Took a while to find
some of these as they appear to be mostly undocumented.
Spent some time looking at measuring streaming video from youtube - the
embedded html5 player exposes some information that looks like it could
be used to make measurements. Currently trying to get it working with
the Zombie.js headless browser, but am stalled just before the video is
meant to start playing.
Spent some time tracking down scheduling problems, particularly around the
edges of schedule periods. Added the ability to dump the current
schedule on demand and while using that figured out why some tests were
not being rescheduled. For some reason two queries were being made to
fetch timestamps and sometimes they fell either side of the boundary,
which broke the maths and generated a rescheduling time that
libwandevent would reject.
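A minimal sketch of how two clock reads either side of a period boundary can produce a bad delay, and how reading the clock once avoids it. This is an assumption about the general shape of the problem, not the actual scheduling code; the function names, the 60 second period and the fake clock are all invented for the example.

```python
def delay_two_reads(clock, period, offset):
    # Buggy shape: the clock is read twice, and the second read may
    # land in the next period.
    first = clock()
    period_start = first - (first % period)
    run_at = period_start + offset
    if run_at <= first:
        run_at += period
    return run_at - clock()


def delay_one_read(now, period, offset):
    # Read the clock once and do all the maths against that value.
    period_start = now - (now % period)
    run_at = period_start + offset
    if run_at <= now:
        run_at += period
    return run_at - now


# Fake clock: the first read is just before the boundary at 60, the
# second read is just after it.
fake_clock = iter([59, 61]).__next__
bad_delay = delay_two_reads(fake_clock, 60, 0)
good_delay = delay_one_read(59, 60, 0)
```

In this example the two-read version comes out with a negative delay, which is the sort of value an event library has good reason to refuse.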
Also tidied up some other small issues that could cause hangs or crashes
- fixed a possible infinite loop in the tcpping test and removed an
assert in the watchdog code that was being hit in an uncommon but
actually legitimate case (if a test ended at roughly the same time as
the watchdog tried to kill it).
Built new packages for the amplet client and pushed them out to the
Updated the certificate packages to properly generate certificates to be
used with apache, and updated the configuration to enable SSL and
actually use them. Built new packages with the changes.
Updated the certificate signing script to also create and set up
rabbitmq users for new amplet clients. Also made the script require the
user to be more explicit when revoking certificates in the case of
ambiguity, and improved checking of user input for host/cert names etc.
Rewrote the logic in the client around requesting/fetching certificates,
to make sure that the timeouts and wait periods apply to the whole
process, not just to the last step of fetching a certificate.
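One common way to make a timeout cover a whole multi-step process is to fix a single deadline up front and hand each step only the time that is left. The sketch below assumes that approach; it is not the client's actual logic, and `run_with_deadline` and its step callables are hypothetical.

```python
import time


def run_with_deadline(steps, timeout, clock=time.monotonic):
    """Run each step in order under one shared timeout.

    Each step is a callable that takes the seconds still remaining
    and returns its result.
    """
    deadline = clock() + timeout
    results = []
    for step in steps:
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError("out of time before " + step.__name__)
        # The step is expected to respect the remaining time itself.
        results.append(step(remaining))
    return results
```

A slow request step then eats into the time available for the fetch step, rather than each step getting a fresh timeout of its own.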
Started work on building Debian packages for the certificate signing
scripts. Spent some time making sure that all the required files were
included, installed at the correct location and with the correct
permissions. Created web configuration scripts to allow the web side of
things to run almost out of the box.