Brendon Jones's blog
Installed new amplet packages onto prophet to test reporting using
protocol buffers in a live environment. Found and fixed a couple of
minor issues with the way I was trying to access some protocol buffer
fields, and missing python dependencies. The very first message was
invalid in some way, but after a day spent investigating it and trying
(failing) to narrow down the cause I now have all our test clients
successfully running their full schedule. It hasn't happened again and
I'm hoping that it was just some data left in a queue somewhere or
Started work on some simple unit tests for the HTTP test, which has been
without them until now.
Spent Wednesday at the student honours conference. Was good to see all
our students had worked on their presentations since the practice run;
there were quite a few interesting, high-quality talks.
Removed the requirement for AMP tests to always have a destination - if
a test has a sensible way to determine the target then it can now do so
without having it explicitly set. Updated the DNS test to name local DNS
servers more consistently when testing to them (the default, with no
Relocated some variables so that they are no longer global as part of a
general code tidy up. Also updated many getopt and usage strings to
reflect recent changes.
Fetched and processed some of the longer term measurement data that we
have been collecting in preparation for some statistical analysis. Had a
quick peek at the data and it is consistent at the top end of
performance, but it degrades quite differently for what should be very
Spent a couple of afternoons watching student practice talks ahead of
the honours conference.
Wrote a simple program to try to capture packet traces of long running
HTTP tests and started it running. Captured a few successfully from the
local test amplets, but it doesn't deal well with the short intervals
between tests on the real amplet I want to test. Put a couple of traces
into a HAR viewer which showed most of the delay in that case being
connection delay to a number of related servers. Doesn't work with HTTPS
and doesn't quite show the detail I want, so may need to write some more
Fixed the Debian packaging to include all the python scripts in the
server package, which took a lot longer than it should have (no one ever
mentions the --single-version-externally-managed option to setuptools).
Made a few other minor packaging updates to init scripts, man pages,
sample configs etc. Brought the CentOS packaging scripts up to date as well.
Worked with Shane to get skeptic updated to a more recent version of the
server-side amp code. It should now be able to process data from more
recent amplet clients (we are getting close to being able to upgrade the
current NZ mesh).
Wrote some more unit tests to check that AMP tests were correctly
reporting data using protocol buffers, and that the data coming out
matched what was put in.
Updated the build system to properly reflect the new requirements for
protocol buffers and Debian packaging dependencies.
Did some initial testing with individual tests to make sure that NNTSC
would accept the data, and fixed a couple of issues that I found (mostly
signed vs unsigned mismatches). Ran a proper client with a full test
schedule and checked the results against existing data to make sure that
everything was working as expected.
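
To illustrate the signed vs unsigned class of mismatch (this is not the actual AMP or NNTSC code), the sketch below writes a value as a signed 32-bit integer and decodes the same bytes as unsigned. The use of -1 as a "no result" marker is an assumption for the example.

```python
import struct

# Hypothetical example: the sender writes -1 as a signed 32-bit integer in
# network byte order, but the receiver decodes the same bytes as unsigned.
packed = struct.pack("!i", -1)

as_signed = struct.unpack("!i", packed)[0]
as_unsigned = struct.unpack("!I", packed)[0]

print(as_signed)    # -1
print(as_unsigned)  # 4294967295
```

The bytes on the wire are identical in both cases; only the declared type on each side decides whether the value reads as -1 or as a very large number.
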
Converted the HTTP test to report data using protocol buffers.
Wrote a simple unit test for the DNS test to check that the data coming
out was the same as the data going in, and did some testing with NNTSC
to make sure that the data was in the appropriate format to be inserted
into the database. Found and fixed a few errors where things weren't
being set appropriately.
Spent some time looking into the slow HTTP test data I have been
collecting. Around 60% of the objects that were slow to fetch were on
new connections (usually the first one where we fetch the initial HTML)
and the delay was normally between sending the request and receiving any
response bytes. However, there are enough delays in different places and
for different reasons that there is no obvious single cause; more
investigation is required.
Converted the DNS, TCPPing, traceroute and throughput tests to report
data using protocol buffers, and updated the scripts used by NNTSC to
extract/parse the report messages. Updated the build system to
automatically build all the appropriate files from the .proto definition
Wrote some unit tests to make sure that the data being put into the
protocol buffers was the same as the data coming out and that optional
fields were appropriately present/absent.
Started collecting some more data on slow HTTP tests, dumping full
result data to try to see if there are any patterns around what objects
are slow to fetch, and which part of the transfer process is slow.
Built and tested new amplet client packages for the wheezy portion of
the New Zealand mesh, including the ASN lookup fixes from last week.
Deployed them on the local test amplet to run.
Chased up a few issues that had come to light recently, including using
the correct credentials to sign the cert used by apache when serving
client keys, the HTTP test incorrectly reporting "-1" data rather than
None/missing, and a few minor compiler warnings.
Started to move the amplet test reporting away from handcrafted
structures to Google protocol buffers. This will take care of a lot of
the boring bits around encoding variable length data and makes it a lot
easier to report only the data required to describe a test result
(rather than including unused fields). So far I have updated the ICMP
test to use protocol buffers and it has been a pleasantly easy experience.
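
A rough sketch of why this helps, using a hand-written subset of the protocol buffers wire format (varint fields only) rather than the real library or the actual AMP message definitions. The field numbers and their meanings are made up for illustration. Integers take only as many bytes as they need, and unset fields are simply left out of the message.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data, pos):
    """Decode a varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_message(fields):
    """Encode {field_number: int or None}; None fields are omitted."""
    out = bytearray()
    for number, value in sorted(fields.items()):
        if value is None:
            continue  # unset optional field takes no space on the wire
        out += encode_varint((number << 3) | 0)  # wire type 0 = varint
        out += encode_varint(value)
    return bytes(out)


def decode_message(data):
    """Decode a message made up only of varint fields."""
    fields = {}
    pos = 0
    while pos < len(data):
        key, pos = decode_varint(data, pos)
        number, wire_type = key >> 3, key & 0x07
        if wire_type != 0:
            raise ValueError("this sketch only supports varint fields")
        fields[number], pos = decode_varint(data, pos)
    return fields


# Hypothetical ICMP result: 1 = rtt (microseconds), 2 = ttl, 3 = error code.
encoded = encode_message({1: 300, 2: 64, 3: None})
print(encoded.hex())            # 08ac021040
print(decode_message(encoded))  # {1: 300, 2: 64}
```

The whole result fits in five bytes, and the receiver can tell that field 3 was never set because it is absent rather than filled with a placeholder value.
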
Rewrote some of the code around ASN lookups for traceroute tests to make it more robust in the case of the server being unreachable. Failure to connect is now detected much more quickly (using non-blocking sockets) and if anything goes wrong with a remote lookup then the thread will stop trying and respond only with cached data. The flow of control is now also a lot simpler, which means sockets always get read from in a timely manner and the code is a lot clearer.
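
The general approach can be sketched as below. This is a simplified Python illustration under stated assumptions, not the amplet client code: the `AsnResolver` class, its fields and the one second timeout are all hypothetical, the server is given as an IP address so no blocking name resolution happens, the actual query exchange is omitted, and it assumes POSIX behaviour where a failed non-blocking connect makes the socket writable.

```python
import errno
import select
import socket


def connect_with_timeout(host, port, timeout=1.0):
    """Return a connected socket, or None if the connect fails or times out."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    err = sock.connect_ex((host, port))
    if err not in (0, errno.EINPROGRESS, errno.EWOULDBLOCK):
        sock.close()
        return None  # immediate failure, e.g. connection refused
    # The socket becomes writable once the connect attempt has finished,
    # whether it succeeded or not; SO_ERROR tells us which.
    _, writable, _ = select.select([], [sock], [], timeout)
    if not writable or sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR) != 0:
        sock.close()
        return None
    return sock


class AsnResolver:
    """Hypothetical resolver that falls back to cached data on failure."""

    def __init__(self, server, cache=None, timeout=1.0):
        self.server = server            # (ip, port) of the lookup server
        self.cache = dict(cache or {})  # address -> ASN
        self.timeout = timeout
        self.remote_ok = True

    def lookup(self, address):
        if address in self.cache:
            return self.cache[address]
        if self.remote_ok:
            sock = connect_with_timeout(*self.server, timeout=self.timeout)
            if sock is None:
                # Stop trying the server; answer from the cache from now on.
                self.remote_ok = False
            else:
                sock.close()  # real query and response handling omitted
        return None
```

Once `remote_ok` is cleared, later lookups never touch the network, so an unreachable server costs at most one timeout rather than one per lookup.
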
Spent longer than I would have liked trying to diagnose the problems around ASN lookups due to not realising rsyslog was throttling all the extra debug output I had added.
Had a bit of a deeper look into some of the long duration HTTP tests to see if there might be any patterns of interest. Trademe stands out as the site that has the most slow (>10 second) page fetches, but the majority of them come from the same connection, with other connections being perfectly fine. Most of the other targets are pretty consistent with each other.
Spent some time investigating unusual data to make sure it wasn't
occurring in the amplet tests. Monitoring of management connections
found one that was sharing a physical link with a test connection. Some
HTTP tests were having unusually long run times, which appear to be
caused by the server infrastructure and not our own DNS lookups or
Started testing a new version of the amplet client for deployment on the
NZ mesh. Ran into an issue with our large schedule files where a count
variable was too small and overflowing. Results were collected fine, but
most of them were being thrown away when reported. Split all report
messages into smaller chunks as a short term solution that doesn't
require updating the server side code (still aim to move to something
smarter like protocol buffers).
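
A minimal sketch of the chunking workaround, assuming a hypothetical one-byte count field (the real field size and message layout are not described here). It shows what a silently wrapped count would look like and how splitting the results keeps every message within the limit.

```python
import struct

MAX_COUNT = 255  # assumption: a hypothetical one-byte count field


def chunk_results(results, max_per_message=MAX_COUNT):
    """Yield successive slices that each fit within the count field."""
    for start in range(0, len(results), max_per_message):
        yield results[start:start + max_per_message]


def build_header(count):
    """Pack the count into one unsigned byte; struct raises if it won't fit."""
    return struct.pack("!B", count)


results = list(range(600))

# What a silently overflowing one-byte counter would report for 600 results.
print(len(results) & 0xFF)  # 88

messages = [(build_header(len(chunk)), chunk) for chunk in chunk_results(results)]
print([len(chunk) for _, chunk in messages])  # [255, 255, 90]
```

With the wrapped count, a receiver would read 88 results and discard the rest, which matches the symptom of data being collected but thrown away when reported.
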
Made no useful progress on getting Chromium to fetch/modify headers
without crashing. There are newer versions I need to try, but they
require more recent versions of libraries than I have.
Kept working with Chromium to try to get complete information on object
fetch timings. It looks like I should be able to get full timing
information for every object if I can set the Timing-Allow-Origin
header. Currently stymied by the library crashing in its memory freelist
implementation when I try to modify HTTP response headers.
Had a closer look into the behaviour of wget to try to confirm some test
methodology used by some other data sources I'm looking at. Turns out
that wget actually measures only the amount of time spent reading from
the socket and ignores everything else, reporting a very misleading time
Investigated further into some MTU issues we were seeing to confirm the
behaviour we were reporting. Something in the path only has a 1400 byte
MTU but doesn't always send Packet Too Big messages, which is causing
lots of connection failures.
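
Some quick arithmetic on why this breaks connections, assuming the sender has a typical 1500 byte Ethernet MTU and that IP and TCP headers carry no options (both are assumptions, not measurements from our path):

```python
IPV4_HEADER = 20  # bytes, no options
IPV6_HEADER = 40  # bytes, fixed header only
TCP_HEADER = 20   # bytes, no options


def max_tcp_payload(mtu, ip_header):
    """Largest TCP payload that fits in one packet of the given MTU."""
    return mtu - ip_header - TCP_HEADER


for name, ip_header in (("IPv4", IPV4_HEADER), ("IPv6", IPV6_HEADER)):
    sender = max_tcp_payload(1500, ip_header)  # assumed sender MTU
    path = max_tcp_payload(1400, ip_header)    # MTU of the narrow hop
    print(name, sender, path, sender - path)
# IPv4 1460 1360 100
# IPv6 1440 1340 100
```

Small packets such as the handshake fit through the 1400 byte hop, but full-sized segments are 100 bytes too large. Without a Packet Too Big message the sender never learns to shrink them, so the connection stalls once real data starts to flow.
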
Spent some time proofreading reports.