Brendon Jones's blog
Tidied up the traceroute stopset code to add addresses in a more
consistent manner, regardless of whether an address in the stopset was
found or the TTL hit 1. This also allowed me to more easily check
that parts of a completed path don't already exist in the stopset (they
might have been added since they were last checked for) to prevent
duplicate entries.
Added the ability to look up and report AS numbers for all addresses seen
in the paths (using the Team Cymru data). This currently works for the
standalone test (which doesn't have access to the built-in DNS cache)
but requires some slight modification to run as part of amplet itself.
Added local stop sets to the traceroute test to record paths near to the
source and prevent them from being reprobed for every single
destination. Due to the highly parallel nature of the test this
initially had only a very minimal impact on the number of probes
required. At a suggestion from Shane I began probing destinations using
a smaller, fixed-size window rather than all at once in order to
populate the stopset early on in the test. With the current destination
list, this reduced the number of probes required by about 30% without
any real impact on the duration of the test.
Spent some time confirming that the results were the same as the
original test produced, and that they matched the results of other
traceroute programs. Found some slightly different behaviours where I
was treating certain ICMP error codes incorrectly, which I fixed.
Started to look at doing optional AS lookups for addresses on the path.
The easiest solution appears to be to have the test itself look them up
before returning results. Using something like the Team Cymru IP to AS
mappings (which are available over DNS) is simple and would make good
use of caching to minimise the number of queries.
Finished updating the traceroute test to use libwandevent3 to schedule
packets and track timeouts. The aim was to make each action
self-contained and easily understandable, to aid in adding the extra
complexity of stop sets and AS lookups later on. Modified the probing
algorithm to start partway through the path and probe forward, then
probe backwards from the initial point - we can probe forward into paths
that we likely haven't seen before, and then stop probing on the reverse
when we see familiar addresses.
Spent some more time reading student theses to provide feedback.
Started planning the best way to approach changing the traceroute test
to be faster and more network friendly. Making it more event driven and
sending packets when we know they have left the network should help
speed up the test, rather than probing in waves and having to wait at
each TTL for all responses. Before changing the test in this way it made
sense to move from the deprecated libwandevent2 to libwandevent3, which
I did. I've also made the first few changes in the traceroute test to
use an event based approach.
Read up a bit on doubletree and had a look into how some other
traceroute implementations dealt with it. Will hopefully be able to
apply some of the ideas around stop sets to the updated traceroute test
too. Tidied up a bit more low-hanging fruit in the amplet packaging.
Spent some time proofreading student theses to provide feedback.
Fixed a crash when changing the name of test processes, where getopt was
being unhappy after having argv changed underneath it, despite being
given a different array to operate on after forking. Logging has also
been made more sensible, with all amp processes using a fixed prefix
rather than using the full process name.
Spent some time comparing results of the new timestamping mechanism
against iputils ping. Timestamps are looking much more stable now in all
situations. There was a consistent small offset between the amp and ping
values, which appears to mostly be due to one timestamping packets
immediately before sending them and the other immediately after.
Changing amp to record timestamps at the same time as ping removes this
offset. Testing between a pair of hosts directly connected at gigabit
gives very similar results for both approaches, with identical quartiles
and only 0.2 microseconds difference in mean.
Tidied up packaging scripts for Debian and CentOS, removing some default
configuration files that were being installed but are no longer needed.
Updated CentOS init scripts to be more similar to the new Debian ones
that allow multiple clients to be run.
Fixed the delay in name resolution which was causing one amplet to
time out tests when first starting. It was caused by the default
behaviour being to perform as a recursive resolver when no resolvers
were specified. It now properly uses the nameservers listed in
/etc/resolv.conf if there is no overriding configuration given.
Implemented the change in the receive code to use timestamps direct from
the test sockets. All tests that use the AMP library functions to
receive packets will be able to pass in a pointer to a timeval and have
it filled with the receive time of the packet. I haven't merged this yet
as I plan to spend some more time testing it under load and comparing it
to the previous approach.
Updated the schedule/nametable to allow selecting specific address
families of test targets even if the name/address pair was manually
specified in the nametable rather than resolved using DNS. This should
all behave consistently in the schedule file now regardless of type.
Spent some time investigating a bug in the code to rename the test
processes to more useful names than the parent. When logging through
rsyslog, incorrect process names get printed and it can lead to crashes. Renaming
works fine when run in the foreground with logging directly to the
terminal, and the correct process names are shown.
Deployed a new version of the amplet client to most of the monitors.
Found some new issues with name resolution taking too long and timing
out tests on one particular machine. Fixed the SIGPIPE caused by this,
but have yet to diagnose the root cause. No other machine exhibits this.
Kept looking into the problem with packet timing when the machine is
under heavy load, and after looking more closely at the iputils ping
source managed to find a solution. Using recvmsg() grants access to
ancillary data which, if configured correctly, can include timestamps. My
initial failed testing of this didn't properly set the message
structures - doing it properly gives packet timestamps that are much
more stable under load.
Updated the test schedule fetching to be more flexible and easier to
deal with. Client SSL certs are no longer required to identify the
particular host (but can still be used if desired).
Spent some time investigating the best way to rename running processes,
so that amplet test processes can have more descriptive names. On Linux
it appears that the best way is simply to clobber argv (moving
environment etc out of the way to make more space), as it can use longer
names and the change affects the most places process names are observed.
The prctl() function is about the only other option on Linux, and that
is limited to 16 characters and only changes the output of top. The test
processes now name themselves after the ampname they belong to and the
test they perform.
Performed a test install on both a puppet managed machine and an older
amplet machine. This was complicated by needing to upgrade rabbitmq on
the older machines, but without proper packages being used. Put together
some scripts that should mostly automate the upgrade process for later
installs. Watching these two test installs I found and fixed a race
condition that was triggering an assert because the number of
outstanding DNS requests was being incorrectly modified.
Moved the amplet2 repository from svn to git, which will make branching
etc a lot nicer/easier.
Spent some time working through my use of libunbound to get a better
understanding of exactly what it was doing at each point, and fixed the
memory leak I was experiencing. All of my worker threads can see
responses to any query (or none at all), so knowing when all their names
are resolved and the test can continue is important. They can update
each other's result lists, so proper locking is also needed.
Updated Debian packaging files in preparation of making a new release on
all our current monitors. Tried a few iterations as the upgrade path
from the old version needed a bit of work, especially in conjunction
with puppet managed configuration spaces. This will go out after the
data migration next week.
Replaced the libc resolver with libunbound. Wrote a few wrapper
functions around the library calls to give me data in a linked list of
addrinfo structs, in a similar way to getaddrinfo(), so that I didn't need
to modify the code around tests too much. The older approach with each
test managing the resolver didn't allow caching to work (there was no
way for them to share context/cache), so I moved that all into the main
process. Tests now connect to the main process across a unix socket and
ask for the addresses for their targets.
Using asynchronous calls to the resolver has massively cut the time
taken pre-test, and the caching has cut the number of queries that we
actually have to make. We shouldn't be hammering the DNS servers any more.
Spent a lot of time testing this new approach and trying to track down
one last infrequently occurring memory leak.