Brendon Jones's blog
Started to mock up an interface to the graphs that would allow multiple
data series to be shown at once and hopefully still be fairly simple to
select what to display. Had a good look around the bootstrap library to
see what it is capable of. Will continue with this once we have the
capability to plot multiple series on a graph.
Moved on to making use of the new datastreams available once Shane split
apart those that had the same test parameters (including destination
name) but different target addresses. Updated the matrix so it can
display summaries of groups of streams as well as information on
individual ones. While looking at the tooltip graph
generation I found it to be very slow for targets with multiple
addresses - it turned out the aggregated queries were being split apart
at the last moment, so we were hitting the same data in the database
once per series. These queries are now mostly aggregated again in
situations where the query durations are similar.
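In outline, the grouping looks something like this (a hypothetical Python sketch, not the real code - names are made up): series are bucketed by how close their query durations are, and each bucket can then be fetched with a single aggregated query instead of one query per series.

```python
def group_series(series, tolerance=60):
    """Group (stream_id, duration) pairs whose query durations are
    within `tolerance` seconds of the first member of their group.
    Each group can then be fetched with one aggregated query."""
    groups = []
    for stream_id, duration in sorted(series, key=lambda s: s[1]):
        if groups and duration - groups[-1][0][1] <= tolerance:
            groups[-1].append((stream_id, duration))
        else:
            groups.append([(stream_id, duration)])
    return groups
```

With this, two series covering ~5 minutes end up in one query while a 15 minute series gets its own, which matches the "similar durations" condition above.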
Added a simple method to fetch AMP schedule files over HTTP. Currently
runs on startup, but will be scheduled to run regularly. Will hopefully
not be too much work to move to HTTPS and use information in the client
certificates to serve different files to different clients.
Spent a lot of time working on performance improvements to the web
interface, in particular the matrix and how it communicates to the
backend. It was previously creating a new connection to NNTSC to fetch
every cell it displayed, which was quite slow. The NNTSC backend already
supported querying data for multiple streams at once, but this needed to
be exposed to the rest of the code along the datapath. It can now query
data for any number of streams from a single collection.
Fixed the query for the single most recent bin of data so that it no
longer uses the regular binning code - modulo maths on the timestamp
could create two half-full bins, of which we would use only the first.
This should mean that the data on the matrix is now
slightly more accurate and quicker to respond to changes.
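To illustrate the problem (a minimal Python sketch, not the actual query code): when bins are aligned to multiples of the bin size, the most recent bin-sized window of data can straddle a bin boundary and get split across two partial bins.

```python
def aligned_bins(timestamps, binsize):
    """Bucket timestamps into bins aligned to multiples of binsize,
    the way regular binning code does (ts - ts % binsize)."""
    bins = {}
    for ts in timestamps:
        bins.setdefault(ts - (ts % binsize), []).append(ts)
    return bins

# The most recent 600 seconds of data (timestamps 950-1400) land in
# two bins, starting at 600 and 1200, each only half full - using
# just the first bin ignores the newest measurements.
```

Querying the most recent window explicitly, rather than through the aligned binning path, avoids ever seeing those two partial bins.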
Installed the new amplet code on the REANNZ perfsonar machines and used
the install to document the process, with a few notes for more work that
the postinst scripts should do to make it easier. Started to work on a
simple method to fetch test schedules from within measured itself, to
help manage schedules on machines that are otherwise outside of our
control.
Spent some more time looking at the broken NAT in virtual box (it looks
to use lwip) and how it was affecting my traceroute tests. In doing so,
found and fixed a few issues where incorrectly sized packets could be
sent, or correctly sized packets could be sent with incorrect length
fields. After switching from NAT to bridged mode everything seems to
work correctly.
A lot more time than I would like was spent trying to differentiate between
problems caused by the NAT and problems with my tests. The ability to
run tests as standalone programs was very helpful for this and made it a
lot easier to pinpoint problems such as the NAT incorrectly sizing
embedded packets in ICMP responses.
Got the AMP packaging for CentOS working well enough that I can now
build and install packages that almost work straight out of the box.
Split the package into two parts - one that operates with a local broker
and one without. Merged some of the required changes back into the
Debian versions too.
While testing the packaging on a CentOS virtual machine I found a few
interesting issues. Tests where the targets were resolved at runtime
were being run to two targets, but with the same address - for some
reason duplicate IPv4 addresses were being returned if
getaddrinfo() was given AF_UNSPEC (which I do because I want both IPv4
and IPv6 addresses if available). Also, with the traceroute test I was
seeing some very high latency to some hops, extra hops at the end of my
paths, etc. Some of this was due to late responses arriving and being
treated the same as on time ones, and some of this appears to be related
to possibly broken behaviour in the VM NAT - ICMP TTL expired messages
are being received where the TTL in the embedded packet is too large to
have expired on that hop!
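The duplicate address problem can be worked around by deduplicating the results of name resolution. A rough Python equivalent of what the test code needs to do (the real code is C; function and variable names here are made up):

```python
import socket

def dedup_addrinfo(results):
    """Drop duplicate (family, address) pairs from a getaddrinfo()
    result list, preserving the order addresses were returned in."""
    seen = set()
    unique = []
    for family, socktype, proto, canonname, sockaddr in results:
        key = (family, sockaddr[0])
        if key not in seen:
            seen.add(key)
            unique.append((family, socktype, proto, canonname, sockaddr))
    return unique

# e.g. dedup_addrinfo(socket.getaddrinfo("example.com", None,
#                     socket.AF_UNSPEC, socket.SOCK_DGRAM))
```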
Spent some time trying to figure out exactly what was going on in the VM
case, and how best to make the test robust in these cases.
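One robustness check falls straight out of this: a router only generates a TTL expired message when it decrements a packet's TTL to zero, so the embedded copy of the probe should arrive carrying a very small TTL. A sketch of the check (hypothetical function; it assumes implementations may copy the TTL either before or after decrementing):

```python
def plausible_ttl_expiry(embedded_ttl):
    """A router sends ICMP TTL expired when it decrements a packet's
    TTL to zero, so the embedded copy of the probe should carry a TTL
    of 0 or 1. Anything larger could not have expired at that hop -
    as seen from the broken VM NAT - and should be discarded."""
    return embedded_ttl <= 1
```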
Following on from last week's authentication work allowing AMP to authenticate
against a remote rabbitmq broker using SSL, I configured a local broker
to do the same. Using a local broker to initially receive results adds
reliability and means results won't get lost if the remote server is
unavailable for any reason. The way the SSL authentication and user
validation works means that even though the measurement client and
broker are out of our control, they can't falsify their identity to
report results as another user.
Started packaging up new AMP and some of the dependencies for CentOS and
Debian. Unfortunately every packaging system seems to ship out of date
versions of the dependencies we need. Put together a patch for
librabbitmq-c with my changes for EXTERNAL auth. Will have a look at it
again to make sure I'm being sensible and then see if I can get it
accepted upstream.
Spent most of Wednesday listening to student practice talks ahead of the
honours conference. The overall quality of the presentations was quite
high, and they should do even better at the real presentations with a
week of polish and practice.
Tidied up the code I hacked in the previous week for displaying AMP ICMP
data on a smokeping style graph. There were a few edge cases around
missing test responses that were causing extra NULLs to be added to the
list of results. The database should now remove those before any other
stage sees the data, which simplifies the later code a lot. Also had to
make a few changes to how the smokeping loss data is presented to make
it match the AMP ICMP data, allowing them to use the same colouring.
Fixed the new AMP traceroute test to no longer report extraneous
unresponsive hops on the end of the path. In trying to shortcut a lot
of extra validation of incoming ICMP packets, the test was being a bit
too accepting of TTL exceeded messages, which reset some counters.
Sat down with the rabbitmq documentation and had a good think about the
best way to do authentication for AMP. I knew it could do SSL but wasn't
sure how that could tie in with rabbitmq/AMP users and ensure users
couldn't report data for others (in the situation where we don't have
total control over all the monitors). Digging deeper, the rabbitmq
server can authenticate against a list of users based on parts of the
client certificate (common name, a few others) - this means the username
is embedded in the certificate, the username must exist on the server
and the server must trust the cert.
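This maps onto RabbitMQ's EXTERNAL mechanism and certificate login (via the rabbitmq_auth_mechanism_ssl plugin). A sketch of the broker side in the classic Erlang-terms config format - option names vary between RabbitMQ versions, so treat this as illustrative rather than exact:

```erlang
[
  {rabbit, [
    %% offer certificate-based EXTERNAL auth as well as the defaults
    {auth_mechanisms, ['PLAIN', 'AMQPLAIN', 'EXTERNAL']},
    %% take the username from the client certificate's common name;
    %% a matching user must already exist on the server
    {ssl_cert_login_from, common_name},
    {ssl_listeners, [5671]},
    {ssl_options, [{cacertfile, "/path/to/cacert.pem"},
                   {certfile,   "/path/to/server-cert.pem"},
                   {keyfile,    "/path/to/server-key.pem"},
                   {verify, verify_peer},
                   {fail_if_no_peer_cert, true}]}
  ]}
].
```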
After updating the rabbitmq-c library to the newest version to get SSL
support (and editing it to enable advertising some extra authentication
mechanisms) I now have a client that establishes an SSL connection to
the server using an "EXTERNAL" authentication mechanism. The server
treats the common name as the username to login with, and validates that
the userid in each message matches the one from the cert.
Short week this week as I was in Wellington for the second half of it.
Visited REANNZ while I was there and gave a short talk about the
direction of new AMP, NNTSC and the web interface.
Installed AMP on a REANNZ perfSONAR machine to see how well it runs on
CentOS alongside the perfSONAR tests. Will monitor this for a bit and
then should hopefully be doing this on the rest of the machines.
Wrote some SQL queries to fetch AMP ICMP data and format it in a
similar way to how smokeping data naturally arrives. This means that for any ICMP graphs
where the data is binned, "smoke" can be drawn on the graph using
deciles from the bin.
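The decile idea, sketched in Python rather than SQL (the real heavy lifting happens in the database, so this is just for illustration):

```python
import statistics

def smoke_band(rtts):
    """Return the nine decile cut points for one bin of RTT values.
    Shading between successive deciles, darkest around the median,
    gives the smokeping-style 'smoke'."""
    return statistics.quantiles(rtts, n=10)
```

For a bin of latency measurements this yields nine values; the fifth is the median, and the bands either side of it widen as the measurements get more variable.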
Added traceroute data from AMP to the matrix display, which involved
feeding it through all stages of the pipeline - collection, parsing,
storage, querying. Expanded the matrix to be smarter about selecting the
collections to query and refactored some of the AMP data fetching code
to make it easier to add new AMP collections.
While doing testing with the data collected by the new AMP I found and
fixed a bug where the results of name resolution of some targets in the
schedule were not being properly used (only the first address was being
tested to). This has also shown that we need a sensible way to deal with
multiple targets having the same label in the matrix and graph displays.
Started looking at being able to format AMP ICMP test data in a similar
manner to smokeping so that it can be used with the more interesting
smokeping style graphs. Looks like the database can do most of the heavy
lifting to generate percentiles, which we can then use to plot the
shaded smokey regions.
Spent a bit of time updating the REANNZ weathermap to bring it in line
with recent network changes.
Spent most of the week working on backend things that are required to
get the AMP data into the matrix view in the way that we want.
Installed memcached and changed the way memcache keys work when fetching
recent data to enable it to be cached properly, with different timeouts
for different duration data.
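The idea, as a hypothetical Python sketch (names and timeout values are made up, not the real code): the cache key is built from everything that identifies the query, and the timeout depends on how recent the requested data is.

```python
def cache_policy(collection, stream_ids, start, end, binsize, now):
    """Build a memcache key that uniquely identifies a query and pick
    a timeout: recent data changes quickly so is only cached briefly,
    while older data can safely be cached for much longer."""
    key = "data_%s_%s_%d_%d_%d" % (
        collection, "_".join(str(s) for s in sorted(stream_ids)),
        start, end, binsize)
    if now - end < binsize:
        timeout = 60        # still inside the current bin
    elif now - end < 86400:
        timeout = 3600      # within the last day
    else:
        timeout = 86400     # historical, effectively stable
    return key, timeout
```

Sorting the stream ids means two requests for the same set of streams produce the same key regardless of the order they were asked for in.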
Updated all the data/metadata fetching to use the new NNTSC API rather
than pulling data from the old REST interface on erg. As part of this
added the ability to specify multiple aggregation functions across
(possibly duplicate) columns to efficiently fetch all the data required
(e.g. means, standard deviations, etc of the latency values). Mean and
standard deviation are now used to colour the matrix cells rather than
absolute latency differences. Also slightly tidied up the way tooltips
and sparkline graphs are drawn to better present this data.
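The aggregation interface can be pictured like this (a hypothetical sketch - the real API differs): the caller passes (function, column) pairs, possibly repeating a column, and a single SELECT clause is built from them so everything comes back in one query.

```python
def build_select(aggcols):
    """Build the SELECT clause for one aggregated query from a list of
    (function, column) pairs. The same column may appear several times
    with different functions (e.g. avg and stddev of rtt), so each
    output column is aliased by function and name."""
    return ", ".join("%s(%s) AS %s_%s" % (func, col, func, col)
                     for func, col in aggcols)
```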
Finished merging the different graph types into a single one, so all
graphs now benefit from the new navigation etc. They all also now
operate on the same data format so everything is much more consistent
and sensible without all the duplication that was present.
Started trying to get the matrix and AMP graphs working again with the
new, expanded ampy. With the work on rrd/lpi data sources this had moved
on quite a bit since I last touched it. With some help from Shane the
data is now available again; the next step is to get it into the
matrix/graph pages using the new systems.