Brendon Jones's blog
Continued investigating why traceroute tests were sometimes lingering
when the main amplet2 process was terminated. Eventually discovered that
I wasn't closing some file descriptors after forking, so that the test
children were able to connect to a listening local unix socket that
should have been closed. Despite listening, no running process was
actually expecting this connection, so it stalled waiting for it to be
Also tidied up more of the ASN socket querying code to better detect if
it had closed, and to actually report the error back so that it could be
dealt with in a smarter way, helping prevent the test hanging around in
a bad state.
Had a quick look at the HTTP test after seeing a few unusual results and
found that some software does a poor job of following the standards
(surprise!). Updated the header parser to be slightly smarter and deal
with some different combinations of capital letters, whitespace and
Spent some time working with Brad to get an example amplet machine
running that he can use to work through the upgrade process, bringing
them up to date with Debian.
Spent some time building new amplet2 Debian packages to make sure that
the build process was up to date with any new dependencies added with
the recent changes. Had to deal with a few packages in Debian Lenny
being well out of date and missing features (though an upgrade is on the
Installed new packages on a test amplet, and configured the schedule
using the web interface. In doing so, found a few test options that
weren't properly hooked up and were setting the wrong values, and that
sites were including themselves in their test schedules.
Accidentally left some firewall rules in place while testing and found
some broken behaviour when parts of tests failed. Watchdog timers
weren't being removed if the test exited badly, which was leading to
extraneous messages reporting tests being killed (when they had already
stopped). Broken connections to the ASN server could also trigger a
SIGPIPE when querying the local cache, which weren't being properly
Spent the latter part of the week reading student honours reports.
Spent some more time checking up on the traceroute test, after merging
all the stopset/ASN changes. Found and fixed a case where ICMP error
codes weren't being properly recorded. Also found and fixed what appears
to be the main cause of the test running too long - some targets will
decrement the TTL before responding with a port unreachable message,
which throws the path length estimate off by one and can cause the same
TTL to be probed multiple times.
Added the ability to signal tests that their time is running out, giving
them an opportunity to report any partial results they have collected
and to gracefully exit before they get killed. This is configurable per
test type, depending on whether or not it is possible to get useful
information without the test entirely completing.
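A minimal sketch of the "time is running out" idea, using Python purely for illustration. The choice of SIGUSR1 as the warning signal, the `PartialResultTest` class and the `wants_warning` flag are all assumptions for the example, not the actual amplet2 mechanism.

```python
import signal


class PartialResultTest:
    """Hypothetical test that probes targets one at a time."""

    def __init__(self, targets, wants_warning=True):
        self.targets = list(targets)
        self.results = []
        self.out_of_time = False
        # Per-test-type setting (assumed name): only tests that can report
        # useful partial results ask to be warned before being killed.
        self.wants_warning = wants_warning

    def _on_warning(self, signum, frame):
        # Only set a flag here; the main loop decides when to stop.
        self.out_of_time = True

    def run(self, probe):
        if self.wants_warning:
            # SIGUSR1 is an assumed choice of warning signal.
            signal.signal(signal.SIGUSR1, self._on_warning)
        for target in self.targets:
            if self.out_of_time:
                break  # stop early and keep whatever has been collected
            self.results.append((target, probe(target)))
        return self.results
```

The handler does as little as possible and the loop checks the flag between targets, so a result is never reported half written.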
Updated the schedule interface to display a bit more information about test
timings, and tidied up some documentation about the new format. Fixed
the raw interface to properly check if-modified-since headers from
amplets requesting new configs, so only new configs are sent.
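A minimal sketch of the if-modified-since check, assuming the server keeps a timezone-aware timestamp of when the config last changed. The function name `config_is_newer` is hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def config_is_newer(last_modified, header_value):
    """Return True if the full config should be sent to the client.

    last_modified: timezone-aware datetime of the last config change.
    header_value:  raw If-Modified-Since header value, or None if absent.
    """
    if not header_value:
        return True  # no header, always send the config
    try:
        since = parsedate_to_datetime(header_value)
    except (TypeError, ValueError):
        return True  # unparseable header, fall back to sending the config
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates only have one-second resolution, so drop microseconds
    # before comparing.
    return last_modified.replace(microsecond=0) > since
```

When this returns False the server can answer with 304 and an empty body.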
Finished moving all the standalone traceroute ASN fetching from DNS to
the TCP bulk interface. Decided to reuse the trie data structure to make
an actual unique set of addresses to query (rather than the previous
simple system that just looked at nearby ones), minimising the data
needing to be sent/received. Fixed a few bugs in the buffer management
that meant new ASN data was possibly clobbering the last unprocessed
portion from a previous read. Merged all these changes and they should
now be running on a test amplet deployment.
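The buffer bug described above is the classic partial-read problem: a read from a TCP socket can end in the middle of a line, and that tail has to be kept and joined with the next read rather than overwritten. A minimal sketch, assuming newline-terminated ASCII responses; the `LineBuffer` class is hypothetical and the sample data in the usage note is made up.

```python
class LineBuffer:
    """Accumulate socket reads and hand back only complete lines."""

    def __init__(self):
        self._pending = b""  # unprocessed tail from earlier reads

    def feed(self, data):
        # Append to the saved tail instead of replacing it, so a line
        # split across two reads is reassembled.
        self._pending += data
        *lines, self._pending = self._pending.split(b"\n")
        return [line.decode("ascii").strip() for line in lines]
```

For example, feeding `b"64496 | 192.0.2.1 | EXA"` returns no lines, and a following `b"MPLE\n"` returns the single reassembled line.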
Fixed up some bugs in the new schedule parsing code that didn't work
properly when the test type was not specified. Most other settings were
optional and had sensible default values, but it wasn't expected that
the most important option would be missing from (usually generated)
files. Schedule items without a test type are now properly ignored. Also
merged all these changes which are now running on a test deployment.
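A minimal sketch of that parsing rule, assuming schedule items have already been loaded into dictionaries. The key names and default values here are hypothetical and are not the real schedule format.

```python
# Hypothetical defaults for the optional settings.
DEFAULTS = {"frequency": 60, "start": 0, "end": 86400, "args": ""}


def load_schedule_items(raw_items):
    """Apply defaults to each item, skipping any without a test type."""
    items = []
    for raw in raw_items:
        if not raw.get("test"):
            # The test type has no sensible default, so ignore the item.
            continue
        item = dict(DEFAULTS)
        item.update(raw)
        items.append(item)
    return items
```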
Added parameters for the throughput and HTTP tests to the scheduling web
interface. Slightly modified the throughput test options to make it much
easier to schedule the sorts of tests that it is commonly used for. Also
updated the HTTP test to follow 3XX redirects and to record that they
happened (with timings, sizes etc for both the redirect and the followup
Turned a lot of the scheduling web interface code into templates that
can be reused between creating and updating tests. They were similar
enough that most of it can be reused, with only a few minor changes
specific to each view.
Fixed up some small bugs in the ASN query code to make sure that all
addresses in the path are fetched (paths shorter than the initial TTL
weren't querying for the ASN of the final hop). The cache will now be
cleared regularly during operation and will also tidy up properly after
itself on program end. Started work on replacing the ASN fetching using
DNS with the TCP bulk whois for the standalone traceroute tests too.
Spent some time applying patches and building old bash from source to
update the old amplets against the new bash vulnerability. These
machines are really due for a software refresh!
Spent some time setting up a properly scheduled throughput test between
machines in the real world. While doing so, found out a few things about
certificate management that may not have been fully thought through yet.
The certificates used for connections to the control socket (for
starting the remote end of the test) are currently only configured to be
clients; they can't act as a server without an extra setting being
enabled. Also, the server currently tries to validate client hostnames,
which relies on reverse DNS and won't be effective in most real world cases.
Added caching to the ASN lookups that use the bulk TCP interface, using
a data structure that looks similar to a radix trie. Looks to work well
and fast. May also try to use this in the temporary test processes too,
to store addresses and ASNs while they get applied to a particular set
of traceroute data (it would more easily remove duplicates from the
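To illustrate the kind of cache described in this entry, here is a minimal sketch of a prefix-keyed lookup structure. It is a plain uncompressed binary trie rather than a true radix trie, and the `PrefixTrie` class and its methods are hypothetical, not the amplet2 implementation.

```python
import ipaddress


class PrefixTrie:
    """Map IP prefixes to ASNs with longest-prefix-match lookups."""

    def __init__(self):
        # Separate roots so IPv4 and IPv6 bit strings never collide.
        self.roots = {4: {}, 6: {}}

    @staticmethod
    def _bits(value, width, count):
        # Yield the top `count` bits of a `width`-bit integer.
        for i in range(count):
            yield (value >> (width - 1 - i)) & 1

    def insert(self, prefix, asn):
        net = ipaddress.ip_network(prefix)
        node = self.roots[net.version]
        for bit in self._bits(int(net.network_address),
                              net.max_prefixlen, net.prefixlen):
            node = node.setdefault(bit, {})
        node["asn"] = asn

    def lookup(self, address):
        addr = ipaddress.ip_address(address)
        node = self.roots[addr.version]
        best = node.get("asn")
        for bit in self._bits(int(addr), addr.max_prefixlen,
                              addr.max_prefixlen):
            node = node.get(bit)
            if node is None:
                break
            best = node.get("asn", best)  # remember the deepest match
        return best
```

A lookup that returns `None` is a cache miss, meaning the address still needs to be sent to the bulk interface.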
Found and fixed a few bugs in the traceroute test now that there is a
small test deployment testing to real locations. IPv4 paths shorter than
the initial probe TTL of 6 were exhibiting inconsistent behaviour with
the value of the TTL in the packet embedded in the response. I now use
the response packet itself to calculate path length. Also fixed a bug
where multiple ASNs being returned in a single result were being parsed
Started working on fetching ASNs in a single bulk TCP connection rather
than using the DNS infrastructure, as requested by Team Cymru (whose
data we are using). Fetching all appears to work fine but there is
currently no caching, so it will generate queries for all addresses every time.
Updated the scheduling interface to allow scheduling/viewing tests for
meshes as well. All mesh tests are treated pretty much the same as those
for individual sites, and are merged when generating yaml configs for
sites. Tidied up some of the schedule display to hide headings and
sections that are empty or not currently relevant.
Continued to work on the interface for scheduling tests. As well as
adding new tests to a site, you can now modify an existing test. Full
details on a specific test can be viewed in a modal window very similar
to that used to create the test, and options/scheduling can be modified
there. Extra destinations can be added and existing destinations can be
removed, and the test itself can be completely deleted.
Added backend support to deal with all the above - including
adding/deleting test destinations, deleting tests, modifying test
arguments, modifying test schedules.
it was very similar and didn't warrant being entirely separate, but had
diverged enough to be annoying. The templates for the modals will need a
similar job done on them.
Built new Debian and CentOS packages for the updated libwandevent code,
and used those to build new amplet2 packages for CentOS. Debian packages
still need a bit more work to build in my new environment. Deployed a
couple of the new packages to further test some of the new traceroute
reporting for Shane.
Hooked up the rest of the test arguments in the form to schedule a new
test, so they are all now properly added to the database when the form
Filtered the YAML output to only include meshes that are used in the
schedule to reduce file size. Added code to track the time that
schedules were last updated, so that I can return a 304 Not Modified to
clients that request the YAML when there have been no changes.
Spent Wednesday watching student honours presentations. Well done to our
students who presented.
Fixed the way I build the data for the YAML output so that the emitter
can better tell which parts should be used as aliases/anchors (which
makes groups of test destinations a lot tidier looking).
Added more dynamic content to the schedule pages using data from the
actual metadata/schedule tables rather than hard coding it to test
layout/behaviour. Sources and destinations are all fetched from the
database, and current test schedules are displayed.
Added API functions to insert tests into a schedule, and hooked it up to
the data coming from the schedule modal form. Most of the data for
creating a new test is now understood and inserted into the schedule table.