Brendon Jones's blog
Added SSL support to the amplet client for querying a remote server to
fetch schedule files. This should give us the ability to have clients we
don't really control stay up to date with test schedules, but needs a
bit more thought put into how often it should run and how it should
interact with the main schedule process.
Added a control server to the amplet client that accepts connections
from other clients needing a specific test server started (e.g. for
throughput tests), and starts it for them. Currently it accepts the id
of the test and returns the port number the new server is listening on,
so the connecting test knows where to connect. Wrapped all this up in
SSL as well, validating both the certificate and the
hostname/commonname, though certificate revocation is not yet checked.
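As a rough illustration of the verification settings involved (the amplet client itself is C; Python's ssl module is used here just to sketch the idea, and the function name is made up):

```python
import ssl

def make_control_context(ca_file=None):
    """Build a client-side TLS context like the control channel uses:
    require a valid certificate chain and check that the certificate
    matches the hostname/commonname we dialled."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    if ca_file:
        ctx.load_verify_locations(cafile=ca_file)
    ctx.verify_mode = ssl.CERT_REQUIRED   # certificate chain must validate
    ctx.check_hostname = True             # commonname/SAN must match the peer
    # Revocation checking (CRL/OCSP) is deliberately absent, matching the
    # current state of the code described above.
    return ctx
```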
Spent some time working on things to help keep the amplet code clean and
tidy. Added stricter compilation options and fixed up some cases where
these triggered warnings. Started working on unit tests for amplet based
on the built-in automake target "check". Wrote very simple unit tests
for the icmp and traceroute tests as well as the nametable management.
While writing the nametable unit tests I found and fixed a bug that
would limit the nametable to only a single item.
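The real nametable is C code exercised via the automake "check" target, but the shape of the bug and the test that catches it can be sketched in Python (class, hostnames and addresses here are all invented for illustration):

```python
class Nametable:
    """Toy stand-in for the amplet nametable."""
    def __init__(self):
        self.entries = {}

    def add(self, name, address):
        # The buggy version effectively replaced the whole table on every
        # add (throwing away earlier entries); the fix chains entries on.
        self.entries[name] = address

    def lookup(self, name):
        return self.entries.get(name)

def test_nametable_holds_multiple_entries():
    table = Nametable()
    table.add("monitor-a", "10.0.0.1")
    table.add("monitor-b", "10.0.0.2")
    # With the single-item bug, the first entry would have been lost.
    assert table.lookup("monitor-a") == "10.0.0.1"
    assert table.lookup("monitor-b") == "10.0.0.2"
```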
Briefly had a look at different database options available to us that
might perform better with our data than postgres. There are still
further optimisations we can make to how we store our data in postgres,
but it will be interesting to see how they compare to something like
cassandra, HBase or riak.
Tidied up the reporting done on the icmp, traceroute and dns tests in
AMP to use variable length strings for names, as well as properly
packing and byteswapping the reporting structures. The average report
message size should now be much smaller than it was. Also updated the
nntsc plugins for amp data to deal with the new format.
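The packing and byteswapping can be illustrated with Python's struct module. The field layout below is a hypothetical example, not the actual AMP report format: the point is packing in network byte order ("!") with a length-prefixed variable string instead of a fixed-size name buffer.

```python
import struct

# Hypothetical report header: timestamp, rtt (usec), name length.
HEADER = "!IIH"

def pack_report(timestamp, rtt, name):
    encoded = name.encode("ascii")
    return struct.pack(HEADER, timestamp, rtt, len(encoded)) + encoded

def unpack_report(data):
    size = struct.calcsize(HEADER)
    timestamp, rtt, namelen = struct.unpack(HEADER, data[:size])
    name = data[size:size + namelen].decode("ascii")
    return timestamp, rtt, name
```

A 15-character name costs 17 bytes here (2-byte length plus the string) rather than, say, a fixed 128-byte buffer, which is where the average message size saving comes from.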
Tweaked the parser for the http test to better ignore strings that look
like they are generated on the fly within the <script> block.
Was at NZNOG in Nelson for all of this week. Enjoyed the SDN workshop,
which has made a lot of those concepts more concrete and real for me
which is helpful. Presented part of our talk about AMP, and it sounds
like there are a few people who are keen to be involved in running it
(as new hosts, or even within their network as part of their own
Spent most of the week tidying up things in preparation for the AMP
website to be demoed at NZNOG next week. Fixed up some graph colours,
labels and descriptions that were inconsistent across views so that they
all match. Tried to squeeze a bit more performance out of our current
database setup to make our queries faster, and wrote a quick script to
keep refreshing the matrix data into memcache.
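The refresh script amounts to a loop like the following sketch, with a dict standing in for the memcache client so it runs standalone; fetch_matrix() and the cache key are assumptions, not the real code:

```python
import time

def fetch_matrix():
    # Placeholder for the (expensive) database query that builds the
    # matrix data.
    return {"src1": {"dst1": 12.3}}

def refresh_matrix(cache, fetch=fetch_matrix):
    # Recompute the matrix and push it into the cache before it expires,
    # so page loads never have to wait on the database.
    cache["amp-matrix"] = fetch()

def refresh_loop(cache, interval=60, iterations=1):
    for i in range(iterations):
        refresh_matrix(cache)
        if i < iterations - 1:
            time.sleep(interval)
```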
Added some error notifications for when ajax requests failed so that the
user has some feedback rather than waiting forever with no indication
that something might have gone wrong.
Fixed up a bug with the area selection in the summary graph that would
shrink the selection if it was at the edge of the graph.
Continued investigating a problem where ICMP test data was
intermittently failing to be reported. It appears to be due to fairly
aggressive timeouts stopping tests before they finish - resolving the
(quite large, and increasing) list of destinations was taking longer
than expected which was not leaving enough time to perform the
measurements. I've increased the timeouts to more reasonable values and
have a few ideas to exclude resolving time from the allowable test duration.
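One way to keep resolving time out of the measurement budget is to start the measurement deadline only after resolution completes, sketched below (function names and the budget value are assumptions, not the actual test code):

```python
import time

TEST_BUDGET = 30.0   # seconds allowed for actual measurement (assumed value)

def run_test(resolve, measure, budget=TEST_BUDGET):
    # Resolve the destination list first, timing it separately, so a slow
    # (and growing) list doesn't eat into the time left for measurement.
    start = time.monotonic()
    targets = resolve()
    resolve_time = time.monotonic() - start

    # The deadline is set from *after* resolution finished, rather than
    # from when the test was started.
    deadline = time.monotonic() + budget
    results = measure(targets, deadline)
    return resolve_time, results
```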
Fixed a couple of bugs in the tracking of active streams that meant
cached data was being recached with no new data being fetched. It's now
deployed and has shrunk the size of our queries. Also looked into adding
a query timeout to prevent long running queries from hosing the machine.
Spent a lot of time looking at explain/analyze output from postgresql,
trying to shave some time off fetching data for our graphs. Made a few
incremental improvements with a new index and some reordering of
queries, but I'm still looking for the magic bullet. We have a lot of
data and it takes a long time to read it!
Found and fixed a bug in the AMP DNS test where uninitialised data could
be reported if a server did not respond. Luckily this occurred once over
the break and gave us good logs as to what the problem was. Rabbit
easily dealt with the backlog of messages while this one blocked the
queue, which is very reassuring.
One of the current monitors appeared to stop reporting data for the ICMP
test, so I spent some time investigating that. The tests still run but
packets aren't always sent. Nothing in the logs gives any indication of
the problem, so will need to dig further.
Spent most of the week trying to improve the performance of database
queries by better limiting the query to only the data that is necessary.
Lots of streams for CDN targets are only active for a short time as
addresses change and we hit different instances, so we don't need to
check all of these for data. We now maintain a list of when streams were
active in order to limit the data that is queried.
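The idea is roughly the following: record the first and last time each stream was seen, then only query streams whose active window overlaps the requested range. This is an illustrative sketch, not the nntsc code:

```python
def record_activity(activity, stream_id, timestamp):
    # Widen this stream's (first_seen, last_seen) window to cover the
    # new measurement.
    first, last = activity.get(stream_id, (timestamp, timestamp))
    activity[stream_id] = (min(first, timestamp), max(last, timestamp))

def active_streams(activity, start, end):
    # A stream overlaps [start, end] unless it went quiet before the
    # range began or first appeared after it finished. Short-lived CDN
    # streams outside the range are skipped entirely.
    return [sid for sid, (first, last) in activity.items()
            if last >= start and first <= end]
```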
Most of this week was spent getting the new AMP code ready to deploy
onto the existing mesh alongside the current tests. I built new AMP
packages for Debian Lenny (an upgrade of the amplets is on the todo
list!) as well as various dependencies that we didn't have older
versions of. Put together a directory with new, standalone erlang and
rabbitmq and wrote some install scripts to do the fiddly bits of getting
everything installed into the right place. Pushed this out to all the
machines on Thursday and they have been successfully reporting data since.
Having more active monitors testing to more locations showed up a few
bits of code that were previously good enough, but were too inefficient
and didn't scale well. Had to make some slight changes to the SQL around
inserting new data, making better use of transactions to keep it
running faster than realtime.
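The transaction change can be illustrated with sqlite3 from the Python standard library (the real code targets postgres, and the table layout here is made up): committing per row forces a flush for every measurement, while batching rows into one transaction amortises that cost.

```python
import sqlite3

def insert_batch(conn, rows):
    # One transaction for the whole batch instead of one per row; the
    # context manager commits on success and rolls back on error.
    with conn:
        conn.executemany(
            "INSERT INTO data (stream_id, timestamp, value) VALUES (?, ?, ?)",
            rows)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data (stream_id INTEGER, timestamp INTEGER, value REAL)")
insert_batch(conn, [(1, 100, 12.5), (1, 200, 13.1), (2, 100, 7.9)])
```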
Added configuration modals for dns, smokeping, munin and some lpi data,
so that multiple data series (of the same type) can easily be viewed at
once and compared. Refactored the initial modal implementation used by
icmp and traceroute data to be much cleaner and to make it easier to
integrate the new data types.
Updated the legend labels describing the currently displayed data to use
the more detailed information Shane has made available. Included in this
is a line number that is now used to fix the colouring order, making
sure that the line colour matches what the label describes.
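Pinning colours with a line number amounts to something like the sketch below: the number supplied with each series indexes a fixed palette, so a series always gets the colour its legend entry describes regardless of the order the data arrives in. The palette values are placeholders, not the site's actual colours:

```python
PALETTE = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd"]

def line_colour(line_number):
    # Same line number in, same colour out, on every view.
    return PALETTE[line_number % len(PALETTE)]
```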
Spent some time reworking small details on the newest AMP Debian
packages to install and run properly when installed by puppet on the new
Decided that we needed to simplify the database schema for storing
traceroute data, so spent some time working on that. The new schema
works better with the existing aggregation functions and is faster to
query. Moved all the existing data to the new schema.
Merged in the rainbow traceroute graphs that Brad created and got them
using data from the new traceroute data. Moved the default view of
combined traceroutes to use smokeping rather than a basic line graph to
better show what is happening with multiple addresses.
General tidyup of code that had got a bit crufty, removed some sections
that were duplicated or no longer required. Started work on moving the
DNS test to use views.