Brendon Jones's blog
Updated the HTTP test to not include time spent fetching objects that
eventually timed out, as all that was doing was recording the curl
timeout duration. Instead, we need to report the number of failed
objects and the last time an object was successfully/unsuccessfully
fetched, and possibly update the timeouts to match those commonly used
by web browsers.
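The reporting change can be sketched roughly like this; the `ObjectResult` fields and the `summarise` helper are hypothetical names for illustration, not the actual test code (the real test is C):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ObjectResult:
    url: str
    duration_ms: Optional[float]  # None if the fetch timed out
    timestamp: int                # when the fetch finished (unix time)

def summarise(results):
    """Aggregate per-object results, excluding timed-out fetches from
    the total duration so we don't just record the curl timeout, and
    reporting failure counts and last success/failure times instead."""
    ok = [r for r in results if r.duration_ms is not None]
    failed = [r for r in results if r.duration_ms is None]
    return {
        "duration_ms": sum(r.duration_ms for r in ok),
        "failed_objects": len(failed),
        "last_success": max((r.timestamp for r in ok), default=None),
        "last_failure": max((r.timestamp for r in failed), default=None),
    }
```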
Switched the meaning of "in" and "out" for throughput tests, as
somewhere along the way this got switched. This involved updating
existing data in the database as well as the code that saves the data.
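The fix to existing rows amounts to swapping the two direction fields; a minimal sketch with hypothetical field names (`bytes_in`/`bytes_out` are illustrative, not the real schema):

```python
def swap_in_out(row):
    """Return a copy of a stored throughput result with the 'in' and
    'out' fields swapped, correcting data that was saved with the
    directions inverted."""
    fixed = dict(row)
    fixed["bytes_in"], fixed["bytes_out"] = row["bytes_out"], row["bytes_in"]
    return fixed
```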
Added a bit more information to log messages to help identify the
specific amplet client that was responsible, as it was becoming
confusing in situations with multiple clients running on the same machine.
Started adding an interface to download raw data from the graph pages.
Partway through, it was taking longer than expected, so I took a slight
detour and wrote a standalone tool to dump the data from NNTSC.
Thursday was my first day back after my break, so spent some time
catching up on things that had happened while I was away.
Shane and Brad had found some unusual data being reported, so I looked
into that and updated schedules to help solve some of the problems. Also
exposed some more tuning knobs so that we can change inter-packet delay
when sending probes (we were sending too fast in some cases) and merged
in some fixes that Shane had written.
Built some new Debian packages with these changes and pushed them out,
which appears to have immediately improved the quality of the data we
are collecting.
Configured the third throughput test target and updated the test
schedule to properly include all three throughput test targets. Went
through all the results to make sure that all are reporting - found and
fixed a couple where incorrect HTTP targets had been set and redirects
were happening. Double checked that some unusual throughput results were
correct (they appear to be).
Spent some time investigating some connections that appeared to be up,
but wouldn't forward my data. The modem seems to think everything is
fine and there is nothing obviously wrong at my end, so I've asked for
it to be checked from the other end.
Short week again with the Easter break. Finished the setup of the second
measurement machine and the central collector/graphing machine. Got the
measurement machine shipped, which should hopefully be physically
installed in a few days.
Added some temporary throughput tests to the main schedule to test
performance across different times of day to each of the sites. The
proper schedule will need to be slightly tweaked to include the extra
throughput targets. So far the test data shows the targets to be
performing as expected.
Went to Auckland with Brad to install a new measurement machine. It's
now up and running and performing a subset of tests, but generally looks
to be doing the right thing. Will be watching the data and adding more
tests over the next little while.
Performed some testing on the first throughput targets that we have
available to make sure that they will be fast enough. Two out of the
three so far look good, but the third is probably not well connected
enough to push the amount of data we are expecting.
Started to install a second measurement machine based on the first, as
well as documenting parts of the process that will need to be performed
by remote hands.
Spent most of the week getting the last few things ready to go ahead of
the test deployment next week, including diagrams of how various
components fit together.
Fixed a copy and paste error when comparing schedule items that meant
that there was a small chance of two tests being considered the same
even if they had different end times. Also updated some error handling
branches to properly free some resources that had been forgotten about,
and had a general tidy up based on feedback from a slightly newer gcc.
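The copy and paste bug was of this general shape; the field names below are hypothetical stand-ins for the real schedule structure:

```python
def same_schedule_item(a, b):
    """Compare two schedule items field by field. The original bug
    compared an item's end time against its own end time (a copy and
    paste error), so two items that differed only in end time could
    still be considered the same."""
    return (a["test"] == b["test"]
            and a["start"] == b["start"]
            and a["end"] == b["end"]      # was a["end"] == a["end"]
            and a["frequency"] == b["frequency"])
```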
Generated all the certificates for the test connections, and while doing
so found and fixed a bug that could prevent multiple certificates from
being generated when listed on the command line.
Built new packages with the recent minor updates and deployed them on
one of my test amplets to verify. While watching the results I found
some interesting behaviour with tests to www.amazon.com, where the
object counts are fluctuating between two values. A quick investigation
suggests that it's not caused by any changes to the test, but I haven't
discovered what is actually going on.
Made the HTTP test more consistent with the other tests when reporting
test failure due to name resolution or similar (rather than the target
failing to respond). Also added the option to suppress parsing the
initial object as HTML and to just fetch it.
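A rough Python sketch of the parse/no-parse option; the real test uses C and libcurl, so the tag handling below is purely illustrative:

```python
from html.parser import HTMLParser

class ObjectCollector(HTMLParser):
    """Collect URLs of objects referenced by a page (stylesheets,
    images, scripts), roughly what the HTTP test does when it parses
    the initial object as HTML."""
    def __init__(self):
        super().__init__()
        self.objects = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and "src" in attrs:
            self.objects.append(attrs["src"])
        elif tag == "link" and attrs.get("rel") == "stylesheet" \
                and "href" in attrs:
            self.objects.append(attrs["href"])

def objects_to_fetch(body, parse_html=True):
    """With parse_html=False the initial object is just fetched as-is
    and no further objects are queued."""
    if not parse_html:
        return []
    collector = ObjectCollector()
    collector.feed(body)
    return collector.objects
```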
Found and fixed a problem with long interface names being used inside my
network namespaces. Linux appears to allow longer interface names than
dhclient can deal with, so I've had to shorten some of my more
descriptive interface names.
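Linux interface names are limited to IFNAMSIZ (16 bytes including the trailing NUL), but some userspace tools have trouble well before that. A sketch of the kind of truncation used here, where the lower tool-specific limit is an assumption:

```python
IFNAMSIZ = 16  # kernel limit, including the terminating NUL

def shorten_ifname(name, limit=IFNAMSIZ - 1):
    """Truncate a descriptive interface name to fit within a given
    limit. dhclient appeared to choke on names the kernel itself
    accepted, so a lower limit can be passed when generating names
    that dhclient will see."""
    return name[:limit]
```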
Spent some time measuring the quantity of test data to get an estimate
of how much database storage will be required for the new test clients.
Also looked at how much throughput test data is likely to be used
(multiple TB of data a month), and possible locations that might be
suitable to test to without hitting rate limits or interfering with
other traffic.
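A back-of-the-envelope version of that traffic estimate; all the numbers below are hypothetical examples, not the real schedule:

```python
def monthly_throughput_bytes(rate_mbps, duration_s, tests_per_day,
                             targets, days=30):
    """Rough estimate of the traffic a schedule of throughput tests
    generates per month: per-test bytes, scaled up by frequency,
    target count and days."""
    bytes_per_test = rate_mbps * 1e6 / 8 * duration_s
    return bytes_per_test * tests_per_day * targets * days

# e.g. a 1 Gbit/s test for 10 s moves 1.25 GB; hourly tests to three
# targets for 30 days comes to about 2.7 TB a month.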
Continued to check various parts of the processing chain to make sure
that they perform robustly when bits go away, networking is lost, or
machines restart.
Installed and configured nntsc and postgres on the new test machine in
order to keep a local copy of all the data that will be collected. This
also has the ability to duplicate the incoming data and send it off to
another rabbitmq server for backup/visualisation.
Made some minor changes to the amplet client for the new test machine
that required building new packages, which are now available in a custom
repository to keep them separate from the others. Installed the amplet
client on the new test machine and configured it to run multiple
clients, testing the network namespace code.
Continued to test some of the infrastructure code around
starting/stopping amplet clients, creating network namespaces etc. Found
and worked around a few small problems where tools were not properly
namespace aware, or would not create files belonging to a namespace (but
would happily use them if they already existed).
Spent some time checking over the results of the new schedule to make
sure all the targets were reporting sensible data. Found that for some
of the HTTP test targets it was reporting 302 codes, despite choosing
the target URLs specifically to avoid this. Turns out that I was
stripping the scheme portion and just passing the hostname to libcurl,
so URLs that should have been HTTPS were treated as HTTP. Fixed it so it
properly uses the full URL. Had to update some ports used in the tcpping
test too, in order to get a useful response.
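The scheme-stripping bug can be illustrated like this (the real test hands the string to libcurl in C; `urlparse` here is just for demonstration):

```python
from urllib.parse import urlparse

def target_hostname_only(url):
    """The broken behaviour: strip the scheme and pass only the
    hostname, which curl then treats as plain HTTP by default."""
    return urlparse(url).netloc

def target_full_url(url):
    """The fix: pass the full URL through so HTTPS stays HTTPS."""
    return url
```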
Also updated the HTTP test to return more sensible data if the
connection times out: rather than storing the duration of the timeout
and flatlining the graphs, it now returns an empty duration which is
treated as missing data.
Wrote a custom init script and other config to deal with starting the
amplet client in a network namespace for the new test deployment.
Updated the new test schedule slightly and set it running to collect
initial data. I'll look over that and make sure that all the targets are
responding as they should, and that the HTTP tests are fetching useful
pages (not redirects etc).
Got some useful looking stats from the embedded HTML5 YouTube player
running with slimer.js inside a virtual frame buffer. Ideally it would
be fully headless, but it shows we should be able to get the information
we need from the player - initial buffering delay, other time spent
buffering, playing, etc.
Started a test install of the amplet components onto a machine using
network namespaces. Figured out what I need to do to make sure
everything can talk to each other in the right way. Also found and fixed
a few issues with the new PKI changes when used in a new environment
from scratch - some setup steps were relying on things that hadn't
happened yet and needed to be reordered.