Brendon Jones's blog
Deployed a new version of the amplet client to most of the monitors.
Found some new issues with name resolution taking too long and timing
out tests on one particular machine. Fixed the SIGPIPE caused by this,
but have yet to diagnose the root cause. No other machine exhibits this.
Kept looking into the problem with packet timing when the machine is
under heavy load, and after looking more closely at the iputils ping
source managed to find a solution. Using recvmsg() grants access to
ancillary data which, if configured correctly, can include timestamps.
My initial failed testing of this didn't set up the message structures
properly; doing it properly gives packet timestamps that are much more
stable under load.
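A minimal sketch of the recvmsg() approach, written in Python rather than the C used by the client. It assumes a 64-bit Linux host where struct timeval is two native longs, and falls back to the Linux value 29 for SO_TIMESTAMP if Python does not export the constant. The function names are made up for illustration.

```python
import socket
import struct
import time

# Python may not export SO_TIMESTAMP; 29 is the value in Linux's
# asm-generic/socket.h (an assumption that holds on x86 and ARM).
SO_TIMESTAMP = getattr(socket, "SO_TIMESTAMP", 29)
# Assumed layout of struct timeval on 64-bit Linux: two native longs.
TIMEVAL = struct.Struct("@ll")


def recv_with_timestamp(sock, bufsize=2048):
    """Receive one datagram plus the kernel receive time (or None)."""
    data, ancdata, _flags, addr = sock.recvmsg(
        bufsize, socket.CMSG_SPACE(TIMEVAL.size))
    for level, ctype, cdata in ancdata:
        # SCM_TIMESTAMP shares its numeric value with SO_TIMESTAMP.
        if (level == socket.SOL_SOCKET and ctype == SO_TIMESTAMP
                and len(cdata) >= TIMEVAL.size):
            sec, usec = TIMEVAL.unpack(cdata[:TIMEVAL.size])
            return data, addr, sec + usec / 1e6
    return data, addr, None


def loopback_demo():
    """Send one UDP datagram to ourselves; return (data, kernel, user) times."""
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Ask the kernel to attach a receive timestamp to each datagram.
        rx.setsockopt(socket.SOL_SOCKET, SO_TIMESTAMP, 1)
        rx.bind(("127.0.0.1", 0))
        rx.settimeout(2.0)
        tx.sendto(b"probe", rx.getsockname())
        data, _addr, kernel_ts = recv_with_timestamp(rx)
        user_ts = time.time()  # what gettimeofday() after recv would give
        return data, kernel_ts, user_ts
    finally:
        rx.close()
        tx.close()
```

The kernel timestamp is taken when the packet arrives, so it should not include the time the process spends waiting to be scheduled, which is presumably where the instability under load came from.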
Updated the test schedule fetching to be more flexible and easier to
deal with. Client SSL certs are no longer required to identify the
particular host (but can still be used if desired).
Spent some time investigating the best way to rename running processes,
so that amplet test processes can have more descriptive names. On Linux
it appears that the best way is simply to clobber argv (moving the
environment etc. out of the way to make more space), as it allows longer
names and the change shows up in the most places where process names are
observed. The prctl() function is about the only other option on Linux,
and that is limited to 16 bytes (including the terminating null) and only
changes the output of top. The test
processes now name themselves after the ampname they belong to and the
test they perform.
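For comparison, the prctl() alternative can be sketched from Python with ctypes. This is not the argv clobbering that the client uses, only an illustration of the 16-byte limit. It assumes Linux, where PR_SET_NAME is 15 in linux/prctl.h; the "ampname:test" naming format and the function names are hypothetical.

```python
import ctypes

PR_SET_NAME = 15   # from <linux/prctl.h>
MAX_COMM = 15      # 16 bytes including the terminating null


def set_process_name(ampname, test):
    """Set the short process name via prctl(); returns the name used."""
    # Hypothetical naming scheme; anything past 15 bytes is cut off.
    name = f"{ampname}:{test}".encode("ascii", "replace")[:MAX_COMM]
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.prctl(PR_SET_NAME, ctypes.c_char_p(name), 0, 0, 0) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_SET_NAME) failed")
    return name.decode("ascii")


def current_comm():
    """Read back the short name the kernel holds for this process."""
    with open("/proc/self/comm") as f:
        return f.read().rstrip("\n")
```

Calling set_process_name("amp-example", "icmp") ends up with the name cut to 15 characters, which shows why the longer argv-based names are more useful.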
Performed a test install on both a puppet managed machine and an older
amplet machine. This was complicated by needing to upgrade rabbitmq on
the older machines, but without proper packages being used. Put together
some scripts that should mostly automate the upgrade process for later
installs. Watching these two test installs I found and fixed a race
condition that was triggering an assert because the number of
outstanding DNS requests was being incorrectly modified.
Moved the amplet2 repository from svn to git, which will make branching
etc a lot nicer/easier.
Spent some time working through my use of libunbound to get a better
understanding of exactly what it was doing at each point, and fixed the
memory leak I was experiencing. All of my worker threads can see
responses to any query (or none at all), so knowing when all their names
are resolved and the test can continue is important. They can update
each other's results lists, so proper locking is also needed.
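The bookkeeping described here (an outstanding request count plus shared results lists, both protected by a lock) could look roughly like the following. This is a Python sketch with invented names, not the libunbound-based C code; the addresses are documentation placeholders.

```python
import threading


class PendingLookups:
    """Track outstanding name lookups shared between worker threads."""

    def __init__(self):
        self._cond = threading.Condition()  # wraps a lock
        self._outstanding = 0
        self._results = {}

    def add(self, name):
        """Register a query before it is sent."""
        with self._cond:
            self._outstanding += 1
            self._results.setdefault(name, [])

    def complete(self, name, addresses):
        """Record a response; any thread may call this for any name."""
        with self._cond:
            self._results[name].extend(addresses)
            self._outstanding -= 1
            assert self._outstanding >= 0
            if self._outstanding == 0:
                self._cond.notify_all()

    def wait_all(self, timeout=None):
        """Block until every registered query has been answered."""
        with self._cond:
            done = self._cond.wait_for(
                lambda: self._outstanding == 0, timeout)
            return dict(self._results) if done else None


def demo(count=8):
    pending = PendingLookups()
    names = [f"host{i}.example" for i in range(count)]
    for name in names:
        pending.add(name)
    workers = [
        threading.Thread(target=pending.complete,
                         args=(name, [f"192.0.2.{i}"]))
        for i, name in enumerate(names)
    ]
    for worker in workers:
        worker.start()
    results = pending.wait_all(timeout=5.0)
    for worker in workers:
        worker.join()
    return results
```

Because the counter is only ever touched while the lock is held, two threads finishing at once cannot both read the same value and leave the count wrong.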
Updated Debian packaging files in preparation for making a new release on
all our current monitors. Tried a few iterations as the upgrade path
from the old version needed a bit of work, especially in conjunction
with puppet managed configuration spaces. This will go out after the
data migration next week.
Replaced the libc resolver with libunbound. Wrote a few wrapper
functions around the library calls to give me data in a linked list of
addrinfo structs in a similar way to getaddrinfo() so that I don't need
to modify the code around tests too much. The older approach with each
test managing the resolver didn't allow caching to work (there was no
way for them to share context/cache), so I moved that all into the main
process. Tests now connect to the main process across a unix socket and
ask for the addresses for their targets.
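The shape of this arrangement (one process owning the resolver and its cache, tests asking it over a unix socket) can be sketched as below. Everything here is an assumption for illustration: the newline-based protocol, the names, and the fixed FAKE_ZONE table standing in for libunbound. The real client passes addrinfo structures rather than text.

```python
import socket
import threading

# Hypothetical stand-in for the real resolver: documentation addresses only.
FAKE_ZONE = {
    "a.example": ["192.0.2.1", "2001:db8::1"],
    "b.example": ["192.0.2.2"],
}


class ResolverService:
    """Lives in the main process so all tests share one cache."""

    def __init__(self):
        self.cache = {}
        self.upstream_queries = 0

    def lookup(self, name):
        if name not in self.cache:
            self.upstream_queries += 1          # cache miss: ask upstream
            self.cache[name] = FAKE_ZONE.get(name, [])
        return self.cache[name]

    def serve(self, conn):
        """Answer newline-terminated names until the client disconnects."""
        with conn, conn.makefile("r", encoding="ascii") as rfile, \
                conn.makefile("w", encoding="ascii") as wfile:
            for line in rfile:
                wfile.write(" ".join(self.lookup(line.strip())) + "\n")
                wfile.flush()


def resolve(sock, name):
    """Test side: ask the main process for the addresses of one name."""
    sock.sendall(name.encode("ascii") + b"\n")
    reply = b""
    while not reply.endswith(b"\n"):
        chunk = sock.recv(4096)
        if not chunk:
            break
        reply += chunk
    return reply.decode("ascii").split()


def demo():
    service = ResolverService()
    server_end, client_end = socket.socketpair(
        socket.AF_UNIX, socket.SOCK_STREAM)
    worker = threading.Thread(target=service.serve, args=(server_end,))
    worker.start()
    with client_end:
        answers = [resolve(client_end, name)
                   for name in ("a.example", "b.example", "a.example")]
    worker.join()
    return answers, service.upstream_queries
```

In the demo the third request repeats the first name, so only two lookups reach the stand-in resolver.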
Using asynchronous calls to the resolver has massively cut the time
taken pre-test, and the caching has cut the number of queries that we
actually have to make. We shouldn't be hammering the DNS servers any more.
Spent a lot of time testing this new approach and trying to track down
one last infrequently occurring memory leak.
Successfully built Debian packages of the new amplet client and
installed them on a new machine with multiple network interfaces. Spent
some time making sure that all the configuration files ended up in the
right place, and that the init script performed as expected.
Spent a lot of time looking into how well the DNS lookups behaved with
multiple clients running at once, and checking that they respected
interface bindings when they were set. In general, everything co-existed
nicely and worked, but some possible failure modes could bring the whole
thing down. If DNS sockets were reopened due to a query failure then they
would reset to normal behaviour. Started to investigate other approaches
to name resolution - it looks like using libunbound will be the way
forward from here as it also gives us asynchronous queries (synchronous
lookups were becoming time consuming) and caching.
Fixed up the local rabbitmq configuration to properly generate vhosts,
users and shovels that will actually work and allow data to be passed
around. The configuration file needed some tweaking to make this work,
and may need to be reworked in the near future to make it more clear how
the collector needs to be configured.
Created a Debian init script to deal with starting multiple instances of
amplet clients on a single machine, each using different configuration
files. They can be started/stopped individually or as a whole. Updated some
of the other Debian packaging scripts to deal with the new config
directory layout. Started testing them to make sure that everything
required is present and still ends up in the right places.
Added options to create a pidfile in the amplet2 client so that it works
better with the new init scripts, and targeting individual instances of
Spent some time looking into test results and performance after seeing
some results from a student testing on a loaded system. Latency
measurements degrade as load increases, but those made by ping remain
quite stable. Briefly tried testing the difference between using
gettimeofday() and SO_TIMESTAMP but found no obvious differences under load.
Updated some configuration in amp-web to allow fully specifying how to
connect to the amp/views/event databases.
Set up some throughput tests to collect data for Shane to test inserting
the data. While doing so I found and fixed some small issues with
schedule parsing (test parameters that included the schedule delimiter
were being truncated) and test establishment (EADDRINUSE wasn't being
picked up in some situations).
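Both fixes belong to fairly common classes of bug, sketched below in Python. The "test,frequency,params" line format is invented for the example and is not the real amplet schedule format; the function names are also hypothetical.

```python
import errno
import socket


def parse_schedule_line(line, delimiter=","):
    """Split 'test,frequency,params' where params may contain the delimiter.

    Limiting the number of splits keeps everything after the second
    delimiter together instead of truncating the parameters.
    """
    test, frequency, params = line.rstrip("\n").split(delimiter, 2)
    return {"test": test, "frequency": int(frequency), "params": params}


def try_bind(port, address="127.0.0.1"):
    """Return a bound UDP socket, or None if the port is already in use."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.bind((address, port))
    except OSError as err:
        sock.close()
        if err.errno == errno.EADDRINUSE:
            return None   # caller can pick another port and retry
        raise
    return sock
```

A caller of try_bind() would typically move on to another port when it gets None back, rather than treating the failure as fatal.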
Started adding configuration support for running multiple amplet clients
on a single machine. Some schedule configuration can be shared globally
between all clients, but they also need to be able to specify schedules
that belong only to a single client. Nametables, keys, etc also need to
be set up so that each client knows where they are.
Started writing code to configure rabbitmq on a client and isolate our
data from anything else that might already be on that broker (e.g.
another amplet client). Each amplet client should now operate within a
private vhost and no longer require permissions on the default one.
Spent some time tidying up the code to adjust nameservers for AMP at
runtime, and adding in configuration options to allow them to be set.
While doing this I realised that name resolution wasn't necessarily going
to respect the interface/address bindings set up for the tests, so
looked into ways I could make this happen. The best/easiest way so far
seems to be to create my own sockets for the resolver to use and then
bind them how I like. This appears to work with my testing so far, but
is possibly getting a bit too specific to the internals of the libc
library I'm using.
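Creating a socket and binding it before handing it to something else is straightforward on its own. The Python sketch below shows only that part and does not show how the socket is given to the libc resolver, which is the fragile bit. The function name is hypothetical, and binding to a device with SO_BINDTODEVICE is Linux-only and may need elevated privileges.

```python
import socket


def make_bound_udp_socket(source_address, device=None):
    """Create a UDP socket bound to a source address and optional device."""
    family = socket.AF_INET6 if ":" in source_address else socket.AF_INET
    sock = socket.socket(family, socket.SOCK_DGRAM)
    try:
        if device is not None:
            # Linux only; may require elevated privileges.
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE,
                            device.encode() + b"\0")
        # Port 0 lets the kernel choose an ephemeral source port.
        sock.bind((source_address, 0))
    except OSError:
        sock.close()
        raise
    return sock
```

Queries sent from such a socket leave with the chosen source address, which is what the tests themselves already do for their own traffic.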
Also wrote some unit tests around the ICMP test response packet
processing to help make sure that malformed or incorrect packets are
correctly dealt with.
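The kind of checks such tests exercise can be sketched in Python as follows. This is an assumption about what "malformed or incorrect" covers (short packets, wrong type or code, mismatched id or sequence, bad checksum), and it looks only at the ICMP portion, ignoring the IP header that a raw socket would also deliver. All names are made up.

```python
import struct

ICMP_ECHO_REPLY = 0
ICMP_HEADER = struct.Struct("!BBHHH")  # type, code, checksum, id, sequence


def checksum(data):
    """Internet checksum (RFC 1071) over data."""
    if len(data) % 2:
        data += b"\0"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def parse_echo_reply(packet, ident, seq):
    """Return the payload of a valid echo reply, or None if it is rejected."""
    if len(packet) < ICMP_HEADER.size:
        return None                       # too short to hold a header
    ptype, code, _csum, pid, pseq = ICMP_HEADER.unpack_from(packet)
    if ptype != ICMP_ECHO_REPLY or code != 0:
        return None                       # not an echo reply
    if pid != ident or pseq != seq:
        return None                       # answer to someone else's probe
    if checksum(packet) != 0:
        return None                       # corrupted in transit
    return packet[ICMP_HEADER.size:]


def build_echo_reply(ident, seq, payload=b""):
    """Test helper: build a well-formed reply to feed the parser."""
    header = ICMP_HEADER.pack(ICMP_ECHO_REPLY, 0, 0, ident, seq)
    csum = checksum(header + payload)
    return ICMP_HEADER.pack(ICMP_ECHO_REPLY, 0, csum, ident, seq) + payload
```

Unit tests then feed the parser one good packet and a set of deliberately broken variants, checking that only the good one is accepted.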
Tidied up some arbitrarily sized buffers in the ICMP test to be the
actual size required for the data. Accidentally made them too small, so
fixed that and then wrote some more unit tests to cover the
sending/receiving of data and buffer management. Also updated the ICMP
test to be able to short circuit the loss wait timeout once all data has
been accounted for. Previously it was always waiting a minimum of
200ms, even if all responses had been received.
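The early exit can be sketched as a receive loop that stops as soon as nothing is outstanding. This Python version is an illustration under assumptions: each datagram is taken to carry just the id of the probe it answers, and the names are invented.

```python
import select
import socket
import time

LOSS_TIMEOUT = 0.2  # seconds to wait before declaring a probe lost


def collect_responses(sock, expected, loss_timeout=LOSS_TIMEOUT):
    """Wait for one datagram per expected id; stop early once all arrive."""
    outstanding = set(expected)
    deadline = time.monotonic() + loss_timeout
    while outstanding:                    # short circuit: nothing left to wait for
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        readable, _, _ = select.select([sock], [], [], remaining)
        if not readable:
            break                         # timed out with probes still missing
        outstanding.discard(sock.recv(64))
    return outstanding                    # ids never answered, counted as lost


def demo(send_ids, expected):
    """Feed responses through a socketpair; return (lost ids, seconds waited)."""
    rx, tx = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    with rx, tx:
        for ident in send_ids:
            tx.send(ident)
        start = time.monotonic()
        lost = collect_responses(rx, expected)
        return lost, time.monotonic() - start
```

When every response is already waiting, the loop returns almost immediately; the full timeout is only paid when something is actually missing.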
Spent some time examining query logs from the newly migrated test
database on prophet to see where slowdowns were now occurring. Found and
fixed a simple case where we were over-querying for data, and have a few
ideas for other places to look for more improvements.
Investigated how it might be possible to set DNS servers per process in
order to run multiple amplet clients on the same Linux host without
putting them in individual containers. It isn't made obvious in libc how
to do this, but it seems to be possible by modifying some internal
resolver structures. If I set these right, then getaddrinfo() etc will
all work as normal except using the specified name server rather than
whatever is in /etc/resolv.conf. The alternative here seems to be
replacing the name resolution functions with another library or custom code.
Built new CentOS and Debian amplet packages for testing and deployed to
a test machine to check that both old and new versions of the transfer
format could be saved. After a bit of tweaking to the save functions
this looks to work fine.
Tested the full data path from capture to display, which included fixing
the way aggregation of data streams is performed for matrix tooltips.
Everything works well together, except the magic new aggregation
function fails in the case where entire bins are NULL. Will have to
spend some time next week making this work properly.
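The NULL case is easy to get wrong in any aggregation that divides by the number of values present. A minimal Python sketch of one way to handle it, assuming the aggregate is a per-bin mean across streams (the real function may compute something else) and using a hypothetical function name:

```python
def aggregate_bins(streams):
    """Average several streams bin by bin, ignoring missing (None) values.

    A bin that is None in every stream stays None instead of causing a
    division by zero.
    """
    result = []
    for values in zip(*streams):
        present = [v for v in values if v is not None]
        result.append(sum(present) / len(present) if present else None)
    return result
```

Here a bin with data in only some streams is averaged over those streams, and a bin with no data at all is passed through as missing so the display can show a gap.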
Wrote some more unit tests for the amplet client testing address
binding, sending data and scheduling tests. While doing so, found what
appears to be a bug in scheduling tests with period end times that were
shorter than hour/day/week.