Brendon Jones's blog
Started planning the best way to approach changing the traceroute test
to be faster and more network friendly. Making it more event driven and
sending packets when we know they have left the network should help
speed up the test, rather than probing in waves and having to wait at
each TTL for all responses. Before changing the test in this way it made
sense to move from the deprecated libwandevent2 to libwandevent3, which
I did. I've also made the first few changes in the traceroute test to
use an event based approach.
Read up a bit on doubletree and had a look into how some other
traceroute implementations dealt with it. Will hopefully be able to
apply some of the ideas around stop sets to the updated traceroute test
too. Tidied up a bit more low hanging fruit in the amplet packaging and
Spent some time proofreading reading student theses to provide feedback.
Fixed a crash when changing the name of test processes, where getopt was
being unhappy after having argv changed underneath it, despite being
given a different array to operate on after forking. Logging has also
been made more sensible, with all amp processes using a fixed prefix
rather than using the full process name.
Spent some time comparing results of the new timestamping mechanism
against iputils ping. Timestamps are looking much more stable now in all
situations. There was a consistent small offset between the amp and ping
values, which appears to mostly be due to one timestamping packets
immediately before sending them and the other immediately after.
Changing amp to record timestamps at the same time as ping removes this
offset. Testing between a pair of hosts directly connected at gigabit
gives very similar results for both approaches, with identical quartiles
and only 0.2 microseconds difference in mean.
Tidied up packaging scripts for Debian and Centos, removing some default
configuration files that were being installed but are no longer needed.
Updated Centos init scripts to be more similar to the new Debian ones
that allow multiple clients to be run.
Fixed the delay in name resolution which was causing one amplet to
timeout tests when first starting. It was caused by the default
behaviour being to perform as a recursive resolver when no resolvers
were specified. It now properly uses the nameservers listed in
/etc/resolv.conf if there is no overriding configuration given.
Implemented the change in the receive code to use timestamps direct from
the test sockets. All tests that use the AMP library functions to
receive packets will be able to pass in a pointer to a timeval and have
it filled with the receive time of the packet. I haven't merged this yet
as I plan to spend some more time testing it under load and comparing it
to the previous approach.
Updated the schedule/nametable to allow selecting specific address
families of test targets even if the name/address pair was manually
specified in the nametable rather than resolved using DNS. This should
all behave consistently in the schedule file now regardless of type.
Spent some time investigating a bug in the code to rename the test
processes to more useful names than the parent. rsyslog starts printing
incorrect process names when logging and can lead to crashes. Renaming
works fine when run in the foreground with logging directly to the
terminal, and the correct process names are shown.
Deployed a new version of the amplet client to most of the monitors.
Found some new issues with name resolution taking too long and timing
out tests on one particular machine. Fixed the SIGPIPE caused by this,
but have yet to diagnose the root cause. No other machine exhibits this.
Kept looking into the problem with packet timing when the machine is
under heavy load, and after looking more closely at the iputils ping
source managed to find a solution. Using recvmsg() grants access to
ancillary data which if configured correctly can include timestamps. My
initial failed testing of this didn't properly set the message
structures - doing it properly gives packet timestamps that are much
more stable under load.
Updated the test schedule fetching to be more flexible and easier to
deal with. Client SSL certs are no longer required to identify the
particular host (but can still be used if desired).
Spent some time investigating the best way to rename running processes,
so that amplet test processes can have more descriptive names. On linux
it appears that the best way is simply to clobber argv (moving
environment etc out of the way to make more space), as it can use longer
names and the change affects the most places process names are observed.
The prctl() function is about the only other option in linux, and that
is limited to 16 characters and only changes the output of top. The test
processes now name themselves after the ampname they belong to and the
test they perform.
Performed a test install on both a puppet managed machine and an older
amplet machine. This was complicated by needing to upgrade rabbitmq on
the older machines, but without proper packages being used. Put together
some scripts that should mostly automate the upgrade process for later
installs. Watching these two test installs I found and fixed a race
condition that was triggering an assert because the number of
outstanding DNS requests was being incorrectly modified.
Moved the amplet2 repository from svn to git, which will make branching
etc a lot nicer/easier.
Spent some time working through my use of libunbound to get a better
understanding of exactly what it was doing at each point, and fixed the
memory leak I was experiencing. All of my worker threads can see
responses to any query (or none at all), so knowing when all their names
are resolved and the test can continue is important. They can update
each others results lists, so proper locking is also needed.
Updated Debian packaging files in preparation of making a new release on
all our current monitors. Tried a few iterations as the upgrade path
from the old version needed a bit of work, especially in conjunction
with puppet managed configuration spaces. This will go out after the
data migration next week.
Replaced the libc resolver with libunbound. Wrote a few wrapper
functions around the library calls to give me data in a linked list of
addrinfo structs in a similar way to getaddrinfo() so that it don't need
to modify the code around tests too much. The older approach with each
test managing the resolver didn't allow caching to work (there was no
way for them to share context/cache), so I moved that all into the main
process. Tests now connect to the main process across a unix socket and
ask for the addresses for their targets.
Using asynchronous calls to the resolver has massively cut the time
taken pre-test, and the caching has cut the number of queries that we
actually have to make. We shouldn't be hammering the DNS servers any more.
Spent a lot of time testing this new approach and trying to track down
one last infrequently occurring memory leak.
Successfully built Debian packages of the new amplet client and
installed them on a new machine with multiple network interfaces. Spent
some time making sure that all the configuration files ended up in the
right place, and that the init script performed as expected.
Spent a lot of time looking into how well the DNS lookups behaved with
multiple clients running at once, and that they respected interface
bindings when they were set. In general, everything co-existed nicely
and worked but some possible failure modes could bring the whole thing
down. If DNS sockets were reopened due to a query failure then they
would reset to normal behaviour. Started to investigate other approaches
to name resolution - it looks like using libunbound will be the way
forward from here as it also gives us asynchronous queries (synchronous
lookups were becoming time consuming) and caching.
Fixed up the local rabbitmq configuration to properly generate vhosts,
users and shovels that will actually work and allow data to be passed
around. The configuration file needed some tweaking to make this work,
and may need to be reworked in the near future to make it more clear how
the collector needs to be configured.
Created a Debian init script to deal with starting multiple instances of
amplet clients on a single machine, each using different configuration
files. They be started/stopped individually or as a whole. Updated some
of the other Debian packaging scripts to deal with the new config
directory layout. Started testing them to make sure that everything
required is present and still ends up in the right places.
Added options to create a pidfile in the amplet2 client so that it works
better with the new init scripts, and targeting individual instances of
Spent some time looking into test results and performance after seeing
some results from a student testing on a loaded system. Latency
measurements degrade as load increases, but those made by ping remain
quite stable. Briefly tried testing the difference between using
gettimeofday() and SO_TIMESTAMP but found no obvious differences under load.
Updated some configuration in amp-web to allow fully specifying how to
connect to the amp/views/event databases.
Set up some throughput tests to collect data for Shane to test inserting
the data. While doing so I found and fixed some small issues with
schedule parsing (test parameters that included the schedule delimiter
were being truncated) and test establishment (EADDRINUSE wasn't being
picked up in some situations).
Started adding configuration support for running multiple amplet clients
on a single machine. Some schedule configuration can be shared globally
between all clients, but they also need to be able to specify schedules
that belong only to a single client. Nametables, keys, etc also need to
be set up so that each client knows where they are.
Started writing code to configure rabbitmq on a client and isolate our
data from anything else that might already be on that broker (e.g.
another amplet client). Each amplet client should now operate within a
private vhost and no longer require permissions on the default one.