Brendon Jones's blog
Changed some of the options in the traceroute test to better match what
Shane is expecting to see when saving the data, and to better specify
what data is present.
Built a sample yaml schedule file and updated the schedule parsing code
to generate test events from the new format. The old approach was very
deterministic, with every field present and in a fixed order, while the
new approach can have fields appearing in any order and makes better use
of default values.
Started exploring how I might construct an interface to easily schedule
tests, and to visualise the tests within a schedule.
Fixed the AS lookups in the traceroute test to ignore RFC1918 addresses.
Wouldn't have been too much of a problem, but the NXDOMAIN responses
were only being cached for 5 minutes and ended up generating too many
extra queries. Also tidied up some checks for stopset membership to use
the TTL to better match, or prevent inserting pointless addresses.
Merged in the timing changes I made the other week into the main branch
so that they can be used.
Had a good meeting with Shane and Brad about where we need to go with
AMP in the next few months. After this, started looking at better ways
to represent and generate test schedule files so they are easier to
understand and edit. Spent some time looking into YAML and how an
example schedule might look, and experimented with the libyaml parser to
see how the data looks.
Tidied up the traceroute stopset code to add addresses in a more
consistent manner regardless of whether an address in the stopset was
found or if the TTL hit 1. This also allowed me to more easily check
that parts of a completed path don't already exist in the stopset (they
might have been added since they were last checked for) to prevent
Added the ability to lookup and report AS numbers for all addresses seen
in the paths (using the Team Cymru data). This currently works for the
standalone test (which doesn't have access to the built-in DNS cache)
but requires some slight modification to run as part of amplet itself.
Added local stop sets to the traceroute test to record paths near to the
source and prevent them from being reprobed for every single
destination. Due to the highly parallel nature of the test this
initially had only a very minimal impact on the number of probes
required. At a suggestion from Shane I began probing destinations using
a smaller, fixed sized window rather than all at once in order to
populate the stopset early on in the test. With the current destination
list, this reduced the number of probes required by about 30% without
any real impact on the duration of the test.
Spent some time confirming that the results were the same as the
original test produced, and that they matched the results of other
traceroute programs. Found some slightly different behaviours where I
was treating certain ICMP error codes incorrectly, which I fixed.
Started to look at doing optional AS lookups for addresses on the path.
Appears the easiest solution is to have the test itself look them up
before returning results. Using something like the Team Cymru IP to AS
mappings (which are available over DNS) is simple and would make good
use of caching to minimise the number of queries.
Finished updating the traceroute test to use libwandevent3 to schedule
packets and track timeouts. The aim was to make each action
self-contained and easily understandable, to aid in adding the extra
complexity of stop sets and AS lookups later on. Modified the probing
algorithm to start partway through the path and probe forward, then
probe backwards from the initial point - we can probe forward into paths
that we likely haven't seen before, and then stop probing on the reverse
when we see familiar addresses.
Spent some more time reading student theses to provide feedback.
Started planning the best way to approach changing the traceroute test
to be faster and more network friendly. Making it more event driven and
sending packets when we know they have left the network should help
speed up the test, rather than probing in waves and having to wait at
each TTL for all responses. Before changing the test in this way it made
sense to move from the deprecated libwandevent2 to libwandevent3, which
I did. I've also made the first few changes in the traceroute test to
use an event based approach.
Read up a bit on doubletree and had a look into how some other
traceroute implementations dealt with it. Will hopefully be able to
apply some of the ideas around stop sets to the updated traceroute test
too. Tidied up a bit more low hanging fruit in the amplet packaging and
Spent some time proofreading reading student theses to provide feedback.
Fixed a crash when changing the name of test processes, where getopt was
being unhappy after having argv changed underneath it, despite being
given a different array to operate on after forking. Logging has also
been made more sensible, with all amp processes using a fixed prefix
rather than using the full process name.
Spent some time comparing results of the new timestamping mechanism
against iputils ping. Timestamps are looking much more stable now in all
situations. There was a consistent small offset between the amp and ping
values, which appears to mostly be due to one timestamping packets
immediately before sending them and the other immediately after.
Changing amp to record timestamps at the same time as ping removes this
offset. Testing between a pair of hosts directly connected at gigabit
gives very similar results for both approaches, with identical quartiles
and only 0.2 microseconds difference in mean.
Tidied up packaging scripts for Debian and Centos, removing some default
configuration files that were being installed but are no longer needed.
Updated Centos init scripts to be more similar to the new Debian ones
that allow multiple clients to be run.
Fixed the delay in name resolution which was causing one amplet to
timeout tests when first starting. It was caused by the default
behaviour being to perform as a recursive resolver when no resolvers
were specified. It now properly uses the nameservers listed in
/etc/resolv.conf if there is no overriding configuration given.
Implemented the change in the receive code to use timestamps direct from
the test sockets. All tests that use the AMP library functions to
receive packets will be able to pass in a pointer to a timeval and have
it filled with the receive time of the packet. I haven't merged this yet
as I plan to spend some more time testing it under load and comparing it
to the previous approach.
Updated the schedule/nametable to allow selecting specific address
families of test targets even if the name/address pair was manually
specified in the nametable rather than resolved using DNS. This should
all behave consistently in the schedule file now regardless of type.
Spent some time investigating a bug in the code to rename the test
processes to more useful names than the parent. rsyslog starts printing
incorrect process names when logging and can lead to crashes. Renaming
works fine when run in the foreground with logging directly to the
terminal, and the correct process names are shown.
Deployed a new version of the amplet client to most of the monitors.
Found some new issues with name resolution taking too long and timing
out tests on one particular machine. Fixed the SIGPIPE caused by this,
but have yet to diagnose the root cause. No other machine exhibits this.
Kept looking into the problem with packet timing when the machine is
under heavy load, and after looking more closely at the iputils ping
source managed to find a solution. Using recvmsg() grants access to
ancillary data which if configured correctly can include timestamps. My
initial failed testing of this didn't properly set the message
structures - doing it properly gives packet timestamps that are much
more stable under load.
Updated the test schedule fetching to be more flexible and easier to
deal with. Client SSL certs are no longer required to identify the
particular host (but can still be used if desired).
Spent some time investigating the best way to rename running processes,
so that amplet test processes can have more descriptive names. On linux
it appears that the best way is simply to clobber argv (moving
environment etc out of the way to make more space), as it can use longer
names and the change affects the most places process names are observed.
The prctl() function is about the only other option in linux, and that
is limited to 16 characters and only changes the output of top. The test
processes now name themselves after the ampname they belong to and the
test they perform.
Performed a test install on both a puppet managed machine and an older
amplet machine. This was complicated by needing to upgrade rabbitmq on
the older machines, but without proper packages being used. Put together
some scripts that should mostly automate the upgrade process for later
installs. Watching these two test installs I found and fixed a race
condition that was triggering an assert because the number of
outstanding DNS requests was being incorrectly modified.
Moved the amplet2 repository from svn to git, which will make branching
etc a lot nicer/easier.