Brendon Jones's blog
Was at NZNOG in Nelson for all of this week. Enjoyed the SDN workshop,
which made a lot of those concepts more concrete and real for me.
Presented part of our talk about AMP, and it sounds like there are a few
people who are keen to be involved in running it (as new hosts, or even
within their network as part of their own mesh).
Spent most of the week tidying up things in preparation for the AMP
website to be demoed at NZNOG next week. Fixed up some graph colours,
labels and descriptions that were inconsistent across views so that they
all match. Tried to squeeze a bit more performance out of our current
database setup to make our queries faster, and wrote a quick script to
keep refreshing the matrix data into memcache.
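That refresh script boils down to recomputing the matrix data ahead of the cache expiry so page loads never have to wait on the database. A minimal sketch of the idea, using a plain dict in place of a real memcache client and a hypothetical fetch_matrix_data() standing in for the actual query:

```python
import time

CACHE_TTL = 60          # seconds the matrix data stays fresh
REFRESH_INTERVAL = 30   # refresh well before the TTL expires

cache = {}  # stand-in for a memcache client (assumption for illustration)

def fetch_matrix_data():
    # Hypothetical placeholder for the expensive database query that
    # builds the matrix; the real query lives in the AMP website code.
    return {"src1-dst1": {"latency": 12.3, "loss": 0.0}}

def refresh_matrix_cache():
    # Recompute the matrix and store it with an expiry timestamp, so
    # page loads read the cache instead of hitting the database.
    cache["matrix"] = (fetch_matrix_data(), time.time() + CACHE_TTL)

# The real script would call refresh_matrix_cache() in a loop with
# time.sleep(REFRESH_INTERVAL), or from cron.
```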
Added error notifications for failed ajax requests, so that the user
gets some feedback rather than waiting forever with no indication that
something has gone wrong.
Fixed up a bug with the area selection in the summary graph that would
shrink the selection if it was at the edge of the graph.
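The fix amounts to shifting the selection back inside the graph bounds while preserving its width, rather than letting the edge clip (and so shrink) it. An illustrative sketch of that behaviour (the real code is in the website's JavaScript; names here are hypothetical):

```python
def clamp_selection(start, end, graph_min, graph_max):
    # Keep the selection the same width when it hits the edge of the
    # graph: shift it back inside the bounds instead of shrinking it.
    width = end - start
    if width > graph_max - graph_min:
        return graph_min, graph_max  # selection wider than the whole graph
    if start < graph_min:
        return graph_min, graph_min + width
    if end > graph_max:
        return graph_max - width, graph_max
    return start, end
```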
Continued investigating a problem where ICMP test data was
intermittently failing to be reported. It appears to be due to fairly
aggressive timeouts stopping tests before they finish - resolving the
(quite large, and increasing) list of destinations was taking longer
than expected which was not leaving enough time to perform the
measurements. I've increased the timeouts to more reasonable values and
have a few ideas to exclude resolving time from the allowable test duration.
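One way to exclude resolving time is simply to start the measurement deadline only once resolution has finished. A sketch of that idea (probe() is a hypothetical stand-in for actually sending an ICMP echo; the real test is C code in AMP):

```python
import socket
import time

TEST_DURATION = 10.0  # seconds allowed for the measurements themselves

def probe(addr):
    # Stand-in for sending an ICMP echo and timing the response.
    return {"rtt_ms": None}

def run_icmp_test(destinations):
    # Resolve all destinations first; however long that takes, the
    # measurement deadline only starts counting once resolution is done.
    addresses = []
    for name in destinations:
        try:
            addresses.append(socket.getaddrinfo(name, None)[0][4][0])
        except socket.gaierror:
            continue  # skip unresolvable targets rather than aborting

    deadline = time.monotonic() + TEST_DURATION  # clock starts after resolving
    results = {}
    for addr in addresses:
        if time.monotonic() >= deadline:
            break  # out of time: report what we have rather than nothing
        results[addr] = probe(addr)
    return results
```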
Fixed a couple of bugs in the tracking of active streams that meant
cached data was being recached with no new data being fetched. It's now
deployed and has shrunk the size of our queries. Also looked into adding
a query timeout to prevent long running queries from hosing the machine.
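With PostgreSQL the usual lever for this is `statement_timeout`; the same idea can be illustrated with the standard-library sqlite3 module, whose `Connection.interrupt()` can be fired from a timer thread to abort a runaway query (a sketch of the pattern, not the production setup):

```python
import sqlite3
import threading

def query_with_timeout(conn, sql, timeout):
    # Interrupt the query if it runs longer than `timeout` seconds, so a
    # runaway query can't hog the machine. (With PostgreSQL the same
    # effect comes from SET statement_timeout; this is the sqlite3
    # equivalent using Connection.interrupt().)
    timer = threading.Timer(timeout, conn.interrupt)
    timer.start()
    try:
        return conn.execute(sql).fetchall()
    finally:
        timer.cancel()
```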
Spent a lot of time looking at explain/analyze output from postgresql,
trying to shave some time off fetching data for our graphs. Made a few
incremental improvements with a new index and some reordering of
queries, but I'm still looking for the magic bullet. We have a lot of
data and it takes a long time to read it!
Found and fixed a bug in the AMP DNS test where uninitialised data could
be reported if a server did not respond. Luckily this occurred once over
the break and gave us good logs as to what the problem was. Rabbit
easily dealt with the backlog of messages while this one blocked the
queue, which is very reassuring.
One of the current monitors appeared to stop reporting data for the ICMP
test, so I spent some time investigating that. The tests still run but
packets aren't always sent. Nothing in the logs gives any indication of
the problem, so will need to dig further.
Spent most of the week trying to improve the performance of database
queries by better limiting the query to only the data that is necessary.
Lots of streams for CDN targets are only active for a short time as
addresses change and we hit different instances, so we don't need to
check all of these for data. We now maintain a list of when streams were
active in order to limit the data that is queried.
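The bookkeeping for this is small: record a first-seen/last-seen window per stream as data arrives, then only query streams whose window overlaps the requested time range. A minimal sketch of that structure (names illustrative):

```python
# stream id -> (first_seen, last_seen) timestamps, updated as data arrives
stream_activity = {}

def record_data(stream_id, timestamp):
    # Track when each stream was active so queries can skip streams that
    # had no data in the window of interest.
    first, last = stream_activity.get(stream_id, (timestamp, timestamp))
    stream_activity[stream_id] = (min(first, timestamp), max(last, timestamp))

def streams_active_between(start, end):
    # Only these streams need to be checked for data in [start, end].
    return [sid for sid, (first, last) in stream_activity.items()
            if first <= end and last >= start]
```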
Most of this week was spent getting the new AMP code ready to deploy
onto the existing mesh alongside the current tests. I built new AMP
packages for Debian Lenny (an upgrade of the amplets is on the todo
list!) as well as various dependencies that we didn't have older
versions of. Put together a directory with new, standalone erlang and
rabbitmq and wrote some install scripts to do the fiddly bits of getting
everything installed into the right place. Pushed this out to all the
machines on Thursday and they have been successfully reporting data since.
Having more active monitors testing to more locations showed up a few
bits of code that were previously good enough, but were too inefficient
and didn't scale well. Had to make some slight changes to the SQL around
inserting new data to make better use of transactions etc to keep it
running faster than realtime.
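The core of that change is wrapping each batch of inserts in a single transaction instead of committing per row. A sketch of the idea using the standard-library sqlite3 module for illustration (the production database is PostgreSQL, and the table name here is hypothetical):

```python
import sqlite3

def insert_measurements(conn, rows):
    # One transaction for the whole batch: this avoids a commit (and
    # fsync) per row, which is what keeps insertion running faster than
    # realtime as the number of monitors grows.
    with conn:
        conn.executemany(
            "INSERT INTO data (stream_id, timestamp, value) VALUES (?, ?, ?)",
            rows)
```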
Added configuration modals for dns, smokeping, munin and some lpi data,
so that multiple data series (of the same type) can easily be viewed at
once and compared. Refactored the initial modal implementation used by
icmp and traceroute data to be much cleaner and easier to integrate the
new data types.
Updated the legend labels describing the currently displayed data to use
the more detailed information Shane has made available. Included in this
is a line number that is now used to fix the colouring order, making
sure that the line colour matches what the label describes.
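Keying the colour off the line number means the mapping is deterministic: line i always gets the same palette entry, so the colour drawn on the graph matches the colour the legend describes. A sketch (palette values are illustrative):

```python
# Fixed palette; line i always gets PALETTE[i % len(PALETTE)], so the
# colour on the graph matches what the legend label describes.
PALETTE = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd"]

def line_colour(line_number):
    return PALETTE[line_number % len(PALETTE)]
```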
Spent some time reworking small details on the newest AMP Debian
packages so that they install and run properly when deployed by puppet
on the new machines.
Decided that we needed to simplify the database schema for storing
traceroute data, so spent some time working on that. The new schema
works better with the existing aggregation functions and is faster to
query. Moved all the existing data to the new schema.
Merged in the rainbow traceroute graphs that Brad created and got them
using data from the new traceroute schema. Moved the default view of
combined traceroutes to use smokeping rather than a basic line graph to
better show what is happening with multiple addresses.
General tidyup of code that had got a bit crufty, removed some sections
that were duplicated or no longer required. Started work on moving the
DNS test to use views.
Finished up the code to turn a single stream id into a view, for use
with events where we want to see the anomalous data. Merged all the view
changes back into the main branch, which highlighted a few broken cases
with things I hadn't considered (netevmon). Worked with Shane to get
those all sorted and working fairly quickly again.
Put together a nice query that will aggregate traceroute data to the
most common path within a binning period. Added a function to fetch this
in NNTSC which works fine for periodic data, but ran into some
difficulties extending this to a single, most recent block of data. It
shouldn't be difficult to get this working - hopefully a fresh look at
it on Monday will get it sorted.
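The periodic part of that aggregation is straightforward: bucket each measurement into its bin and take the most frequent path per bin. A Python sketch of what the SQL is doing (input format is assumed for illustration):

```python
from collections import Counter, defaultdict

def most_common_paths(measurements, bin_size):
    # measurements: iterable of (timestamp, path), where path is a tuple
    # of hop addresses. Returns {bin_start: most common path in that bin},
    # mirroring the aggregation the query performs.
    bins = defaultdict(Counter)
    for timestamp, path in measurements:
        bin_start = timestamp - (timestamp % bin_size)
        bins[bin_start][path] += 1
    return {start: counter.most_common(1)[0][0]
            for start, counter in bins.items()}
```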
First week with our new summer students, so spent some time working with
them and getting them all set up.
Updated the views database to be slightly more complex, making it much
easier to add or remove line groups to/from a view. Also wrote the
supporting code that actually enables users to do this - the label
describing each line group can now be clicked on to remove it from the
graph. The matrix will now generate any views that it needs when the
page is loaded so this no longer needs to be done by hand.
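Conceptually a view is just an ordered collection of line groups, and clicking a label maps to a small set operation on it. A very rough sketch of that (the real state lives in the views database, not in memory):

```python
def remove_group(view_groups, group_id):
    # Clicking a legend label removes that line group from the view.
    return [g for g in view_groups if g != group_id]

def add_group(view_groups, group_id):
    # Adding is idempotent: a group appears in a view at most once.
    return view_groups if group_id in view_groups else view_groups + [group_id]
```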
Started to add a streams-to-view interface so that events can be plotted
easily. Events are based on single streams rather than groups of
streams, so need to be viewable individually.
Spent a bit of time tidying up the new AMP packages to be more
consistent with how files and directories are named. Logging, config,
etc should all use the same name rather than 2-3 different ones. Fixed a
couple of small bugs in tests/reporting that Shane found while adding
them to NNTSC.
Lots of small fixes to things that use the new view interface. Fixed a
few more caching problems where the list of stale streams to fetch was
being ignored and instead all streams were being fetched. Updated the
tooltips on the matrix to use the new API and to split IPv4 and IPv6
results. Updated traceroute graphs to use the new API.
Replaced most of the dropdowns for the amp-icmp data with a modal
bootstrap window that allows selecting which streams (or groups of
streams) to display. Wrote most of the code to insert these views into
the database if they have not been seen before, and to fetch them out
when required. There are a couple of edge cases around determining
members of a view that look like they will require a slight redesign of
the database to accommodate.