Brendon Jones's blog
Started working on implementing a nice way to generate/sign/distribute
certificates for amplets to use, similar to the way that puppet does it.
Clients that are missing certificates can send a signing request to the
central server (which will probably be signed after manual verification)
and then wait for the certificate to be signed before proceeding. So far
I have the client fetching the server cert, generating keys if required,
generating the signing request and then sending it. The server currently
offers its certificate, and waits for connections (but does nothing with
them yet).
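The client-side flow above can be sketched as a small state machine. This is a hedged illustration only: every function passed in is a hypothetical stand-in for the real network and X.509 operations.

```python
# Sketch of the client certificate bootstrap flow: fetch the server's
# cert, generate keys if needed, send a signing request, then wait for
# an operator to sign it.  All callables here are illustrative stubs.

def bootstrap(have_cert, generate_keys, fetch_server_cert, send_csr,
              csr_signed):
    """Run the certificate bootstrap; returns once we hold a signed cert."""
    if have_cert():
        return "already-certified"
    generate_keys()           # only needed if no keypair exists yet
    fetch_server_cert()       # trust anchor for validating later replies
    send_csr()                # server queues this for manual approval
    while not csr_signed():   # real client would sleep between polls
        pass
    return "certified"
```

The polling loop is the "wait for the certificate to be signed before proceeding" step; a real client would back off between polls rather than spin.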
Spent some time trying to chase down a case where duplicate results are
being reported for some tests. It appears that tests are sometimes being
run slightly early (according to the host clock, which drifts) and so
the next scheduled run is set to occur almost immediately, resulting in
two tests run a fraction of a second apart. Even though actual
scheduling in libwandevent is based on the monotonic clock, the AMP
scheduling uses epoch based time so both clocks are required. I've
implemented a test fix and some more debugging, and will check how it
goes after the weekend.
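The fix can be illustrated with a small scheduling helper: advance from the previously *scheduled* slot rather than from when the test actually ran, so a run that fires slightly early (because the wall clock drifts) doesn't schedule the next run almost immediately. This is a sketch of the idea, not the actual libwandevent/AMP code.

```python
def next_run_delay(now_epoch, period, last_scheduled):
    """Return (delay, next_scheduled) for a periodic test.

    If the host clock runs fast and a test fires just before its slot,
    naively scheduling 'period' after the actual run time produces an
    almost-zero delay and a duplicate run a fraction of a second later.
    Advancing from the last scheduled slot avoids that, and the while
    loop skips ahead if we are genuinely late instead.
    """
    next_scheduled = last_scheduled + period
    while next_scheduled <= now_epoch:   # genuinely late: skip missed slots
        next_scheduled += period
    return next_scheduled - now_epoch, next_scheduled
```

For example, a test in the 100-second slot of a 10-second period that actually ran at t=99 still waits until t=110 for its next run instead of re-running at t=100.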
Built new amplet packages for Centos and Debian to deploy the newest
version in the test mesh. Found a few problems running the tcpping test
on machines with multiple interfaces, which was fixed and the packages
rebuilt. Also updated the schedules on the test amplets to more closely
match what we are currently using on the main mesh, to better reflect a
proper deployment scenario.
Added some more sanity checking to the way result messages are unpacked
by the server after (what appears to be a rather old, outdated version
of) the amplet client reported less data than it claimed to have
available, breaking the collector.
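The sort of sanity check described might look like this sketch: refuse to unpack a result message whose claimed item count exceeds the data actually present. The function name and fixed-size item layout are illustrative assumptions, not the real message format.

```python
def unpack_results(claimed, payload, item_size):
    """Unpack 'claimed' fixed-size items, validating the length first.

    A misbehaving (or outdated) client may claim more results than it
    actually sent; checking up front keeps a short message from walking
    off the end of the buffer and breaking the collector.
    """
    if claimed * item_size > len(payload):
        raise ValueError("result message shorter than its claimed length")
    return [payload[i * item_size:(i + 1) * item_size]
            for i in range(claimed)]
```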
Spent some time looking into how puppet does initial certificate/key
distribution to its clients so that we might do something similar. We
need a sensible way to get certificates onto each amplet that doesn't
require a lot of manual generation and copying of files.
Continued investigating why traceroute tests were sometimes lingering
when the main amplet2 process was terminated. Eventually discovered that
I wasn't closing some file descriptors after forking, so that the test
children were able to connect to a listening local unix socket that
should have been closed. Despite listening, no running process was
actually expecting the connection, so the test child stalled waiting for
it to be accepted.
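The fix boils down to the child closing the descriptors it inherited straight after forking. A sketch in Python (the real code is C; `close_inherited` and `run_test_child` are hypothetical names):

```python
import os

def close_inherited(fds):
    """Close inherited descriptors, ignoring any already closed."""
    for fd in fds:
        try:
            os.close(fd)
        except OSError:
            pass

def run_test_child(test_func, parent_fds):
    """Fork a test process, dropping the parent's sockets in the child.

    Without close_inherited(), the parent's listening unix socket stays
    open in every test child, and connections to it can stall forever.
    """
    pid = os.fork()
    if pid == 0:                      # child
        close_inherited(parent_fds)   # drop the listening unix socket etc.
        test_func()
        os._exit(0)
    return pid                        # parent keeps its descriptors
```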
Also tidied up more of the ASN socket querying code to better detect if
it had closed, and to actually report the error back so that it could be
dealt with in a smarter way, helping prevent the test hanging around in
a bad state.
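Detecting that the ASN server hung up relies on a standard property of stream sockets: a zero-length read means the peer closed the connection. A minimal sketch of reporting that back to the caller (the helper name is illustrative):

```python
import socket

def read_reply(sock, length=4096):
    """Read one chunk from the ASN server; raise if it has hung up.

    recv() returning zero bytes on a stream socket means the peer
    closed cleanly, so surface that as an error the caller can handle
    instead of silently returning empty data.
    """
    data = sock.recv(length)
    if not data:
        raise ConnectionError("ASN server closed the connection")
    return data
```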
Had a quick look at the HTTP test after seeing a few unusual results and
found that some software does a poor job of following the standards
(surprise!). Updated the header parser to be slightly smarter and deal
with some different combinations of capital letters and whitespace.
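A sketch of the more forgiving parsing: HTTP header names are case-insensitive, and some servers pad values with odd whitespace, so normalise both before comparing. This shows the general technique, not the actual amplet parser:

```python
def parse_header(line):
    """Split a 'Name: value' header line tolerantly.

    Returns (lowercased name, stripped value) so that
    'Content-Length', 'CONTENT-LENGTH ' and 'content-length'
    all compare equal regardless of stray whitespace.
    """
    name, _, value = line.partition(":")
    return name.strip().lower(), value.strip()
```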
Spent some time working with Brad to get an example amplet machine
running that he can use to work through the upgrade process, bringing
them up to date with Debian.
Spent some time building new amplet2 Debian packages to make sure that
the build process was up to date with any new dependencies added with
the recent changes. Had to deal with a few packages in Debian Lenny
being well out of date and missing features (though an upgrade is on the
way).
Installed new packages on a test amplet, and configured the schedule
using the web interface. In doing so, found a few test options that
weren't properly hooked up and were setting the wrong values, and that
sites were including themselves in their test schedules.
Accidentally left some firewall rules in place while testing and found
some broken behaviour when parts of tests failed. Watchdog timers
weren't being removed if the test exited badly, which was leading to
extraneous messages reporting tests being killed (when they had already
stopped). Broken connections to the ASN server could also trigger a
SIGPIPE when querying the local cache, which wasn't being properly
handled.
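The SIGPIPE problem can be sketched like this: instead of letting a write to a dead connection kill the process, catch the broken-pipe error and report failure to the caller. (In the real C code the equivalent fix is ignoring SIGPIPE or passing MSG_NOSIGNAL to send(); the helper name here is illustrative.)

```python
import socket

def safe_send(sock, data):
    """Send on a possibly-dead socket; return False instead of crashing.

    Python ignores SIGPIPE by default, so a write to a closed peer
    surfaces as BrokenPipeError rather than killing the process; we
    translate that into a return value the caller can act on.
    """
    try:
        sock.sendall(data)
        return True
    except (BrokenPipeError, ConnectionResetError):
        return False
```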
Spent the latter part of the week reading student honours reports.
Spent some more time checking up on the traceroute test, after merging
all the stopset/ASN changes. Found and fixed a case where ICMP error
codes weren't being properly recorded. Also found and fixed what appears
to be the main cause of the test running too long - some targets will
decrement the TTL before responding with a port unreachable message,
which throws the path length estimate off by one and can cause the same
TTL to be probed multiple times.
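The off-by-one can be illustrated with a toy estimator. The exact arithmetic here is an assumption about how the quoted TTL feeds the path length estimate, not the test's real algorithm:

```python
def estimate_hops(probe_ttl, embedded_ttl, from_destination):
    """Estimate path length from the TTL quoted in an ICMP reply.

    A probe sent with probe_ttl whose quoted copy has embedded_ttl
    remaining travelled probe_ttl - embedded_ttl hops.  Some targets
    decrement the TTL once more before quoting the probe in their
    port-unreachable reply, overstating the distance by one and causing
    the same TTL to be probed repeatedly; for replies known to come
    from the destination, clamp a quoted TTL of 0 up to 1.
    """
    if from_destination and embedded_ttl == 0:
        embedded_ttl = 1
    return probe_ttl - embedded_ttl
```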
Added the ability to signal tests that their time is running out, giving
them an opportunity to report any partial results they have collected
and to gracefully exit before they get killed. This is configurable per
test type, depending on whether or not it is possible to get useful
information without the test entirely completing.
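The mechanism can be sketched with a signal handler that sets a flag the test's main loop checks. The choice of SIGUSR1 and the function names are assumptions for illustration:

```python
import signal

REPORT_AND_EXIT = False

def _times_up(signum, frame):
    """Parent says our time is nearly up: wrap up instead of being killed."""
    global REPORT_AND_EXIT
    REPORT_AND_EXIT = True

signal.signal(signal.SIGUSR1, _times_up)

def run_probes(targets, results):
    """Probe targets, stopping early (with partial results) if signalled."""
    for target in targets:
        if REPORT_AND_EXIT:
            break                 # report what we have so far
        results.append(target)    # stand-in for actually running a probe
    return results
```

Whether partial results are worth reporting depends on the test type, which is why the blog describes this as configurable per test.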
Updated the schedule interface to display a bit more information about test
timings, and tidied up some documentation about the new format. Fixed
the raw interface to properly check if-modified-since headers from
amplets requesting new configs, so only new configs are sent.
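The if-modified-since check amounts to comparing the header against the config's modification time. A sketch (the 304/200 semantics are standard HTTP; the function itself is hypothetical):

```python
from email.utils import parsedate_to_datetime

def config_response(config_mtime, if_modified_since):
    """Return (status, body_needed) for an amplet config fetch.

    Answer 304 Not Modified when the client's If-Modified-Since header
    is at least as new as the config, so unchanged configs aren't
    re-sent; an absent or unparseable header always gets the full body.
    """
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None
        if since is not None and config_mtime <= since:
            return 304, False    # client already has this config
    return 200, True
```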
Finished moving all the standalone traceroute ASN fetching from DNS to
the TCP bulk interface. Decided to reuse the trie data structure to make
an actual unique set of addresses to query (rather than the previous
simple system that just looked at nearby ones), minimising the data
needing to be sent/received. Fixed a few bugs in the buffer management
that meant new ASN data was possibly clobbering the last unprocessed
portion from a previous read. Merged all these changes and they should
now be running on a test amplet deployment.
Fixed up some bugs in the new schedule parsing code that didn't work
properly when the test type was not specified. Most other settings were
optional and had sensible default values, but it wasn't expected that
the most important option would be missing from (usually generated)
files. Schedule items without a test type are now properly ignored. Also
merged all these changes which are now running on a test deployment.
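The parsing behaviour can be sketched as a filter: items missing the one mandatory field are dropped, while other fields fall back to defaults. The option names and default value below are made up for illustration:

```python
def parse_schedule(items):
    """Keep only schedule items that name a test type.

    Most settings are optional with sensible defaults, but an item
    with no test type (as sometimes appears in generated files) can't
    be run at all, so it is ignored rather than crashing the parser.
    """
    valid = []
    for item in items:
        if not item.get("test"):
            continue                           # no test type: skip it
        item.setdefault("frequency", 60)       # hypothetical default
        valid.append(item)
    return valid
```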
Added parameters for the throughput and HTTP tests to the scheduling web
interface. Slightly modified the throughput test options to make it much
easier to schedule the sorts of tests that it is commonly used for. Also
updated the HTTP test to follow 3XX redirects and to record that they
happened (with timings, sizes etc. for both the redirect and the
followup request).
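Recording the redirects amounts to keeping a record per hop rather than silently replacing the URL. A sketch with a pluggable fetch function (all names illustrative):

```python
import time

def fetch_with_redirects(fetch, url, max_hops=5):
    """Follow 3XX redirects, recording every hop.

    fetch(url) is assumed to return (status, location, size); each hop
    is kept with its status, size and timing so the result shows both
    the redirect and the follow-up request.  max_hops guards against
    redirect loops.
    """
    chain = []
    for _ in range(max_hops):
        start = time.monotonic()
        status, location, size = fetch(url)
        chain.append({"url": url, "status": status, "size": size,
                      "duration": time.monotonic() - start})
        if status // 100 == 3 and location:
            url = location        # follow, but keep the record above
        else:
            break
    return chain
```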
Turned a lot of the scheduling web interface code into templates that
can be reused between creating and updating tests. They were similar
enough that most of it can be reused, with only a few minor changes
specific to each view.
Fixed up some small bugs in the ASN query code to make sure that all
addresses in the path are fetched (paths shorter than the initial TTL
weren't querying for the ASN of the final hop). The cache will now be
cleared regularly during operation and will also tidy up properly after
itself on program end. Started work on replacing the ASN fetching using
DNS with the TCP bulk whois for the standalone traceroute tests too.
Spent some time applying patches and building old bash from source to
update the old amplets against the new bash vulnerability. These
machines are really due for a software refresh!
Spent some time setting up a properly scheduled throughput test between
machines in the real world. While doing so, found out a few things about
certificate management that may not have been fully thought through yet.
The certificates used for connections to the control socket (for
starting the remote end of the test) are currently only configured to be
clients, they can't act as a server without an extra setting being
enabled. Also, the server currently tries to validate client hostnames,
which relies on reverse DNS and won't be effective in most real world cases.
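The "extra setting" here corresponds to the certificate's extended key usage. A sketch of an OpenSSL extension section (the section name is illustrative) that would let one amplet certificate be presented as both TLS client and server:

```
[ amplet_cert ]
# allow the same certificate to act as both client and server, so the
# control connection for a test can be started from either end
extendedKeyUsage = clientAuth, serverAuth
basicConstraints = CA:FALSE
keyUsage = digitalSignature, keyEncipherment
```

The hostname-validation concern is separate: checking the peer against a name in the certificate itself avoids depending on reverse DNS being correct.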
Added caching to the ASN lookups that use the bulk TCP interface, using
a data structure that looks similar to a radix trie. Looks to work well
and fast. May also try to use this in the temporary test processes too,
to store addresses and ASNs while they get applied to a particular set
of traceroute data (it would more easily remove duplicates from the
addresses being looked up).
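The cache idea can be sketched with a toy bit-level trie. The real structure is closer to a radix trie with compressed paths, but the prefix-sharing idea is the same:

```python
import socket
import struct

def _bits(addr):
    """Yield the 32 bits of a dotted-quad IPv4 address, high bit first."""
    value = struct.unpack("!I", socket.inet_aton(addr))[0]
    for shift in range(31, -1, -1):
        yield (value >> shift) & 1

class AsnCache:
    """Map IPv4 addresses to ASNs; addresses sharing a prefix share nodes."""

    def __init__(self):
        self.root = {}

    def insert(self, addr, asn):
        node = self.root
        for bit in _bits(addr):
            node = node.setdefault(bit, {})
        node["asn"] = asn

    def lookup(self, addr):
        node = self.root
        for bit in _bits(addr):
            if bit not in node:
                return None       # cache miss: go ask the whois server
            node = node[bit]
        return node.get("asn")
```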
Found and fixed a few bugs in the traceroute test now that there is a
small test deployment testing to real locations. IPv4 paths shorter than
the initial probe TTL of 6 were exhibiting inconsistent behaviour with
the value of the TTL in the packet embedded in the response. I now use
the response packet itself to calculate path length. Also fixed a bug
where multiple ASNs returned in a single result were being parsed
incorrectly.
Started working on fetching ASNs in a single bulk TCP connection rather
than using the DNS infrastructure, as requested by Team Cymru (whose
data we are using). Fetching them all appears to work fine, but there is
currently no caching, so it will generate queries for every address
every time.
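The bulk interface wraps all the addresses in one framed request over a single TCP connection instead of one DNS query per address. A sketch of building that request with deduplication (the begin/end framing follows Team Cymru's documented bulk mode, but the details here are from memory and should be checked):

```python
def build_bulk_query(addresses):
    """Build one bulk whois request, deduplicating while keeping order.

    Rather than issuing a lookup per address, all unique addresses are
    sent in a single begin/end-framed request, minimising the data
    sent to and received from the server.
    """
    unique = list(dict.fromkeys(addresses))   # order-preserving dedup
    return "begin\n" + "\n".join(unique) + "\nend\n", unique
```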
Updated the scheduling interface to allow scheduling/viewing tests for
meshes as well. All mesh tests are treated pretty much the same as those
for individual sites, and are merged when generating yaml configs for
sites. Tidied up some of the schedule display to hide headings and
sections that are empty or not currently relevant.
Continued to work on the interface for scheduling tests. As well as
adding new tests to a site, you can now modify an existing test. Full
details on a specific test can be viewed in a modal window very similar
to that used to create the test, and options/scheduling can be modified
there. Extra destinations can be added and existing destinations can be
removed, and the test itself can be completely deleted.
Added backend support to deal with all the above - including
adding/deleting test destinations, deleting tests, modifying test
arguments, modifying test schedules.
Also merged some of the shared view code: it was very similar and didn't
warrant being entirely separate, but had
diverged enough to be annoying. The templates for the modals will need a
similar job done on them.