Brendon Jones's blog
Started work on a program to help manage signing certificate requests
from amplet clients, similar to the puppet CA. Got most of the required
behaviour implemented - listing outstanding requests, signing them, and
revoking signed certificates. Still need to think about how the amplet
clients deal with revocation (how best to do OCSP or similar) and how to
reissue certificates that have expired.
Kept working through improving the SSL code to exchange keys and operate
the control socket on the amplet client. It now validates the
Diffie-Hellman parameters before using them (aborting if they are not up
to scratch), and I disabled compression to avoid another known attack
vector. Validated that the options I was setting were set, and the
protocols/ciphers I enabled were in fact the only ones being used. Spent
some time refactoring the code to be cleaner and easier to follow and
also added more logging to help make it obvious what was going on at
each step where something could fail.
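
A minimal sketch of the "set the options, then check they took effect" idea. It uses Python's ssl module as a stand-in for the OpenSSL C API that the amplet client would use. The cipher string and the function names are assumptions for illustration, not the values used in AMP, and Diffie-Hellman parameter validation is not covered because the ssl module does not expose it.

```python
import ssl


def make_server_context(ciphers="ECDHE+AESGCM:DHE+AESGCM"):
    """Build a server context limited to TLS >= 1.2 and the given ciphers."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.options |= ssl.OP_NO_COMPRESSION  # disable TLS-level compression
    ctx.set_ciphers(ciphers)  # assumed cipher string, not taken from AMP
    return ctx


def pre_tls13_cipher_names(ctx):
    """Names of the enabled suites that set_ciphers() controls.

    OpenSSL configures TLS 1.3 suites separately, so they are skipped here.
    """
    return [c["name"] for c in ctx.get_ciphers() if c["protocol"] != "TLSv1.3"]


def verify_context(ctx):
    """Read the settings back rather than trusting that the setters worked."""
    problems = []
    if not ctx.options & ssl.OP_NO_COMPRESSION:
        problems.append("compression is still enabled")
    if ctx.minimum_version != ssl.TLSVersion.TLSv1_2:
        problems.append("minimum protocol version is not TLS 1.2")
    for name in pre_tls13_cipher_names(ctx):
        if "DHE" not in name:  # matches both DHE-* and ECDHE-* suite names
            problems.append("unexpected cipher enabled: " + name)
    return problems
```

An empty list from `verify_context(make_server_context())` means the context ended up configured as requested; anything else is worth logging before accepting connections.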
Found a test that was failing when building packages for 32-bit
Debian and spent some time trying to fix it. Some constants I was using
to test edge cases in scheduling were too large and were overflowing
variables, giving incorrect results. Forcing them all to be the expected
size fixed that and the tests all pass.
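
To illustrate the kind of overflow involved, the sketch below pushes a large time constant through fixed-width integers using ctypes. The constant (microseconds in a week) is a hypothetical example and not necessarily one of the values used in the failing test.

```python
import ctypes

# Hypothetical edge-case constant: microseconds in one week.
US_PER_WEEK = 7 * 24 * 60 * 60 * 1_000_000


def as_int32(value):
    """Store value in a signed 32-bit integer (silently truncates)."""
    return ctypes.c_int32(value).value


def as_int64(value):
    """Store value in a signed 64-bit integer."""
    return ctypes.c_int64(value).value


if __name__ == "__main__":
    print("intended:", US_PER_WEEK)
    print("32-bit:  ", as_int32(US_PER_WEEK))
    print("64-bit:  ", as_int64(US_PER_WEEK))
```

The value survives intact in 64 bits but wraps to an unrelated number in 32 bits, which is why forcing the constants and variables to an explicit width makes the test behave the same on both architectures.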
Spent some time trying to figure out why my server and client could not
find any ciphers in common that they could use, despite having identical
lists in exactly the same order. Turns out that if you want to use DHE
or ECDHE ciphers there is extra setup required, but very little
documentation pointing this out (the documentation about how to do it is
fine, but nothing makes you aware that you have to do this).
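
As a rough illustration of that extra setup, here is what it looks like with Python's ssl module, which wraps OpenSSL. In the OpenSSL C API the corresponding calls are along the lines of SSL_CTX_set_tmp_dh and SSL_CTX_set_tmp_ecdh; which calls the amplet code uses is not stated here, and the curve name and parameter file path are assumptions.

```python
import ssl


def enable_ephemeral_key_exchange(ctx, dh_params_path=None, curve="prime256v1"):
    """Give a server context what it needs to negotiate DHE/ECDHE ciphers.

    dh_params_path is a PEM file of Diffie-Hellman parameters (hypothetical
    path supplied by the caller); without it, DHE suites may never be chosen
    even though they appear in the cipher list.
    """
    ctx.set_ecdh_curve(curve)  # curve used for ECDHE key exchange
    if dh_params_path is not None:
        ctx.load_dh_params(dh_params_path)  # parameters used for DHE
    return ctx


if __name__ == "__main__":
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    context.set_ciphers("ECDHE+AESGCM:DHE+AESGCM")  # assumed cipher string
    enable_ephemeral_key_exchange(context)
```

Newer OpenSSL releases select an ECDHE curve automatically, so the symptom described above is most likely to appear with older library versions.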
Added code to tests so that they only resolve address families that are
present on available test interfaces. This used to be set with a flag to
getaddrinfo, but we've since moved to using libunbound instead. The old
flag would also consider all interfaces on the host, while the new code
is aware that a test can be bound to a particular interface and only
checks for address families there.
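
A small sketch of the per-interface check, assuming the caller already has the list of addresses configured on the interface the test is bound to (how that list is obtained is left out). Skipping loopback and link-local addresses is an assumption, and the function names are illustrative.

```python
import ipaddress
import socket


def usable_families(interface_addresses):
    """Return the address families present on one interface."""
    families = set()
    for text in interface_addresses:
        addr = ipaddress.ip_address(text)
        if addr.is_loopback or addr.is_link_local:
            continue  # assumption: these do not count as usable for tests
        families.add(socket.AF_INET if addr.version == 4 else socket.AF_INET6)
    return families


def record_types(families):
    """Map address families to the DNS record types worth querying."""
    types = []
    if socket.AF_INET in families:
        types.append("A")
    if socket.AF_INET6 in families:
        types.append("AAAA")
    return types
```

For example, an interface with only `192.0.2.10` and a link-local IPv6 address would resolve A records only, even if another interface on the host has global IPv6.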
Merged in the scheduling fixes I had been testing, after they had been
running for some time without rescheduling problems. Tidied up a bunch
of compilation warnings, logging, documentation etc too, in preparation
for a new amplet release. Built some new packages and deployed them on a
test amplet to run over the weekend.
Got certificate request verification working, so now signed certificates
will only be sent to those clients that can prove their identity and
their right to retrieve that certificate. Requests are fingerprinted and
saved on the server to await signing.
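
One plausible way to fingerprint a request is to hash its encoded bytes, sketched below. The choice of SHA-256 and the colon-separated output format are assumptions; the hash actually used by the server is not stated.

```python
import hashlib


def fingerprint(request_bytes):
    """Return a colon-separated SHA-256 fingerprint of an encoded request."""
    digest = hashlib.sha256(request_bytes).hexdigest().upper()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))
```

Comparing the fingerprint shown on the server with one computed on the client gives the person signing a short value to verify by hand.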
Had a quick audit of the key/certificate management code with Brad, and
found a few places where security needs to be tightened. In particular,
careful attention needs to be paid to making sure that insecure ciphers
can't be used. Because we have full control over all the client and
server software, we can limit all communication to using only TLS >= 1.2
with a small set of strong ciphers. Spent some time investigating
exactly which ciphers we want to use, and how to enable them.
Also spent a lot of time reading the OpenSSL wiki and various
patches/changelogs to determine when certain fixes were made. The Debian
packages are fairly up to date, but are still missing some well known
changes that are needed to improve security, so I implemented them
within the AMP environment.
Spent most of the week working on the key distribution for the amplet
clients. The server will now save certificate signing requests as well
as offer signed certificates (if present) to clients that can prove that
the certificate is theirs. Everything looks to be in order here
(certificate requests, signatures, etc all look correct) except that I
can't get the final step to verify from within code, despite it working
fine using command line tools.
Also spent some more time looking into the problem of tests being run
slightly early and then being rescheduled almost immediately. The fix I
tested over the weekend didn't work, so had to try a few more things.
Each test info block now also contains the wall-clock time that the test
was intended to run, and tests that are triggered too soon can be safely
rescheduled at the correct time. I've built a new client to test over
the week, and so far it looks promising.
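
The rescheduling logic can be sketched roughly as follows. The structure and field names are hypothetical and only the idea comes from the description above: keep the intended wall-clock time with the test, and if the timer fires before that time, wait out the difference instead of running.

```python
from dataclasses import dataclass


@dataclass
class ScheduledTest:
    name: str
    period: float    # seconds between runs
    intended: float  # wall-clock (epoch) time of the next intended run


def on_timer_fired(test, now):
    """Decide what to do when the timer for a test fires.

    Returns ("reschedule", delay) if the timer fired early, otherwise
    ("run", delay) where delay is the time until the following run.
    """
    if now < test.intended:
        # Fired too soon: do not run, wait out the remaining time.
        return ("reschedule", test.intended - now)
    # On time or late: run now and aim for the next slot.
    test.intended += test.period
    return ("run", test.intended - now)
```

Because an early trigger never advances `intended`, it cannot produce a second run a fraction of a second after the first.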
Started working on implementing a nice way to generate/sign/distribute
certificates for amplets to use, similar to the way that puppet does it.
Clients that are missing certificates can send a signing request to the
central server (which will probably be signed after manual verification)
and then wait for the certificate to be signed before proceeding. So far
I have the client fetching the server cert, generating keys if required,
generating the signing request and then sending it. The server currently
offers its certificate, and waits for connections (but does nothing with
Spent some time trying to chase down a case where duplicate results are
being reported for some tests. It appears that tests are sometimes being
run slightly early (according to the host clock, which drifts) and so
the next scheduled run is set to occur almost immediately, resulting in
two tests run a fraction of a second apart. Even though actual
scheduling in libwandevent is based on the monotonic clock, the AMP
scheduling uses epoch based time so both clocks are required. I've
implemented a test fix and some more debugging, and will check how it
goes after the weekend.
Built new amplet packages for CentOS and Debian to deploy the newest
version in the test mesh. Found a few problems running the tcpping test
on machines with multiple interfaces, which were fixed and the packages
rebuilt. Also updated the schedules on the test amplets to more closely
match what we are currently using on the main mesh, to better approximate
a proper deployment scenario.
Added some more sanity checking to the way result messages are unpacked
by the server after (what appears to be a rather old, outdated version
of) the amplet client reported less data than it claimed to have
available, breaking the collector.
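
The kind of check involved might look like the sketch below, which assumes a hypothetical message layout of a 4-byte big-endian length followed by the payload. The real AMP result format is different; only the principle of comparing the claimed length with what actually arrived is taken from the text.

```python
import struct

# Hypothetical header: one unsigned 32-bit length in network byte order.
HEADER = struct.Struct("!I")


def unpack_result(message):
    """Return the payload, refusing messages shorter than they claim to be."""
    if len(message) < HEADER.size:
        raise ValueError("message too short to contain a header")
    (claimed,) = HEADER.unpack_from(message)
    payload = message[HEADER.size:]
    if len(payload) < claimed:
        raise ValueError(
            "message claims %d bytes but only %d present" % (claimed, len(payload))
        )
    return payload[:claimed]
```

Raising an error lets the collector drop the one bad message and carry on, rather than reading past the end of the data.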
Spent some time looking into how puppet does initial certificate/key
distribution to its clients so that we might do something similar. We
need a sensible way to get certificates onto each amplet that doesn't
require a lot of manual generation and copying of files.
Continued investigating why traceroute tests were sometimes lingering
when the main amplet2 process was terminated. Eventually discovered that
I wasn't closing some file descriptors after forking, so that the test
children were able to connect to a listening local unix socket that
should have been closed. Despite listening, no running process was
actually expecting this connection, so it stalled waiting for it to be
Also tidied up more of the ASN socket querying code to better detect if
it had closed, and to actually report the error back so that it could be
dealt with in a smarter way, helping prevent the test hanging around in
a bad state.
Had a quick look at the HTTP test after seeing a few unusual results and
found that some software does a poor job of following the standards
(surprise!). Updated the header parser to be slightly smarter and deal
with some different combinations of capital letters, whitespace and
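
A sketch of a more forgiving header parser, assuming the goal is simply to treat header names case-insensitively and ignore stray whitespace around the name and value. The function name is illustrative and this is not the parser used by the HTTP test.

```python
def parse_header_line(line):
    """Split one raw header line into (lowercased name, stripped value).

    Returns None if the line has no colon or an empty name.
    """
    name, sep, value = line.partition(":")
    name = name.strip().lower()
    if not sep or not name:
        return None
    return name, value.strip()
```

With this, `Content-Length: 42`, `content-length:42` and `CONTENT-LENGTH :  42` all yield the same pair.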
Spent some time working with Brad to get an example amplet machine
running that he can use to work through the upgrade process, bringing
them up to date with Debian.
Spent some time building new amplet2 Debian packages to make sure that
the build process was up to date with any new dependencies added with
the recent changes. Had to deal with a few packages in Debian Lenny
being well out of date and missing features (though an upgrade is on the
Installed new packages on a test amplet, and configured the schedule
using the web interface. In doing so, found a few test options that
weren't properly hooked up and were setting the wrong values, and that
sites were including themselves in their test schedules.
Accidentally left some firewall rules in place while testing and found
some broken behaviour when parts of tests failed. Watchdog timers
weren't being removed if the test exited badly, which was leading to
extraneous messages reporting tests being killed (when they had already
stopped). Broken connections to the ASN server could also trigger a
SIGPIPE when querying the local cache, which weren't being properly
Spent the latter part of the week reading student honours reports.
Spent some more time checking up on the traceroute test, after merging
all the stopset/ASN changes. Found and fixed a case where ICMP error
codes weren't being properly recorded. Also found and fixed what appears
to be the main cause of the test running too long - some targets will
decrement the TTL before responding with a port unreachable message,
which throws the path length estimate off by one and can cause the same
TTL to be probed multiple times.
Added the ability to signal tests that their time is running out, giving
them an opportunity to report any partial results they have collected
and to gracefully exit before they get killed. This is configurable per
test type, depending on whether or not it is possible to get useful
information without the test entirely completing.
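
A minimal sketch of how a test might react to such a warning, assuming it is delivered as a Unix signal. The use of SIGUSR1 and the function names are assumptions; the signal the amplet client actually sends is not stated.

```python
import signal


def run_test(targets, probe, warn_signal=signal.SIGUSR1):
    """Probe each target, stopping early if warned that time is running out.

    Returns (results, complete); complete is False for partial results.
    Must be called from the main thread, as it installs a signal handler.
    """
    out_of_time = False

    def on_warning(signum, frame):
        nonlocal out_of_time
        out_of_time = True  # just set a flag; the loop below reacts to it

    previous = signal.signal(warn_signal, on_warning)
    results = []
    try:
        for target in targets:
            if out_of_time:
                break  # stop probing and report what we have so far
            results.append(probe(target))
    finally:
        signal.signal(warn_signal, previous)  # restore the old handler
    return results, not out_of_time
```

A test type that cannot produce anything useful from a partial run would simply not install the handler and be killed by the watchdog as before.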
Updated the schedule interface to display a bit more information about test
timings, and tidied up some documentation about the new format. Fixed
the raw interface to properly check if-modified-since headers from
amplets requesting new configs, so only new configs are sent.
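
The header check could be sketched like this, assuming the server compares the If-Modified-Since value against the modification time of the config. The function name and the choice to fall back to sending the config on an unparseable header are assumptions.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def config_status(if_modified_since, config_mtime):
    """Return 304 if the config is unchanged since the given date, else 200.

    if_modified_since is the raw header value (or None); config_mtime is
    the config's modification time in epoch seconds.
    """
    # HTTP dates have one-second resolution, so drop any fraction.
    modified = datetime.fromtimestamp(int(config_mtime), tz=timezone.utc)
    if not if_modified_since:
        return 200
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable header: send the config to be safe
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    return 304 if modified <= since else 200
```

An amplet that already holds the current config gets a 304 and keeps its existing schedule.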