Brendon Jones's blog
All percentile and aggregation data is now fetched as view labels rather
than by stream id, which means the database does the heavy lifting when
calculating stats across multiple streams. Tests that only display data
from a single stream are converted into the view format and use mostly
the same code path. Started work on the database to store the view
descriptions so that they are no longer hard-coded for testing.
Fixed a couple of bugs that meant the same time period could be queried
and displayed twice in some graphs. Fixed some caching problems where
the code couldn't differentiate between having no cached data and
having cached an empty result.
Continued working through writing instructions for installing the AMP
client and server software onto a machine running NNTSC. During testing
I found a few cases where the URL handling for the matrix display was
expecting a hardcoded prefix, so refactored that to work properly with
other locations and removed a lot of messy duplicated code. Found a few
parts of the database code, written since it was last installed, that
were not being set up properly.
Spent some time looking for obvious slow points in the display of AMP
graphs and fixed some locations in the smokeping style graphs where
redundant work was being done to generate line colours. A large portion
of the time fetching data is spent sorting it (using disk), so had a
quick look at how to best keep this in memory. Giving postgresql a lot
more memory to work with cut the sort time from around 3s to 100ms, but
this might not be scalable. Ideally we can find ways to limit the amount
of data that actually needs to be sorted.
Started to work on describing graphs and fetching data based on "views"
that contain a number of streams, rather than having ampweb list all the
streams that need to be fetched. Updated the amp-icmp smokeping style
graphs to work with a simple fixed mapping of views to stream ids.
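Until the view descriptions live in the database, a hard-coded mapping can stand in for testing. A hypothetical sketch (the view ids and stream ids here are made up for illustration):

```python
# Hypothetical hard-coded mapping of view ids to the stream ids that
# make up the lines on each graph, used until views are stored in the
# database.
VIEW_STREAMS = {
    1: [101, 102, 103],  # e.g. all streams for one amp-icmp source/dest pair
    2: [104],            # a single-stream view goes through the same path
}

def streams_for_view(view_id):
    # An unknown view simply produces no lines rather than an error.
    return VIEW_STREAMS.get(view_id, [])
```

Keeping single-stream views in the same format means the graph code has one path to maintain rather than a special case.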
Fixed a couple of bugs in the event grouping code that meant it was
running much slower than it should when groups got large. It should now
be a lot smarter about excluding attributes from the grouping process if
there is no way that using them could result in better groups.
Had a good meeting with Lightwire on Wednesday and got useful feedback
about our software. Spent some time talking with Nathan trying to fix
issues they were having with it, and putting together
packages/instructions so that they can install AMP alongside their other
monitoring. This is looking much more complicated than it should be, so
will have to see how much of this can be taken care of in pre/post
install scripts etc. Most of the work is in setting up the server
though, so only needs to be performed once.
Finished reformatting the data to remove some mess and unnecessary
layers of nesting that had crept in while trying different things. It
should now be set up to deal properly with representing multiple lines,
split up or grouped by however the backend wants to do so. Updated all
the tests to use the new data format.
Spent an afternoon with Shane and Brad designing how we are going to
represent graphs with multiple lines, in a way that will let us merge
and split data series based on how the user wants to view the data.
Tidied up the autogenerated colours for the smokeping graphs to use
consistent series colours across the summary and detail views, while
also being able to use the default smokeping colouring if there is only
a single series being plotted.
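One way to keep series colours consistent between the summary and detail views is to derive each colour deterministically from the series name, rather than assigning colours in plot order. A sketch under that assumption (the palette and series name are illustrative only):

```python
import hashlib

# An illustrative fixed palette; the real graphs define their own colours.
PALETTE = ["#ed1b2f", "#0f75bc", "#00a651", "#f7941e", "#92278f"]

def series_colour(series_name):
    # Hash the series name so the same series always maps to the same
    # colour, no matter which view (summary or detail) draws it or in
    # what order the series arrive. md5 is used only for a stable hash,
    # since Python's built-in hash() varies between runs.
    digest = hashlib.md5(series_name.encode()).hexdigest()
    return PALETTE[int(digest, 16) % len(PALETTE)]

# The same name yields the same colour on every graph that plots it.
assert series_colour("waikato-google") == series_colour("waikato-google")
```

A check for the single-series case can then fall back to the default smokeping colouring instead of calling this at all.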
Moved the multiple series line graphs back to using the smokegraph
module, but with colouring based on the series rather than to indicate
loss. This appears to work well for the smaller data series that I've
tested on, though I have yet to get a sensibly aggregated set of data
for those graphs with very large numbers of streams.
The new graphs with arbitrary numbers of data series were triggering
event labels on mouseover for every series except the
first, which I fixed. Now only a dummy series will trigger mouse events, so
that it doesn't try to display information about every single data point
on the graph. Through profiling I also found many extraneous loops and
checks for events that could be prevented by properly disabling events
on the summary graph as well.
Also spent some time reading and critiquing honours reports, not long to go!
Spent most of the week trying to get timeseries AMP graphs working
nicely with the new split streams. Plotting a line for every single
address tested to is not always feasible, so tried plotting some graphs
aggregated on /24 and /48 boundaries to limit the number of lines on a graph.
With sparse data (due to stream splitting) I ran into and fixed a few
issues with block fetching - data from adjacent blocks but with large
time gaps between them should have been drawn disjoint, but wasn't. Also
had to be smarter with timestamps to make sure that aggregated data from
multiple streams was being properly binned and not generating arbitrary
data timestamps within a bin.
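Snapping every datapoint's timestamp down to its bin boundary is one way to stop aggregated data from multiple streams producing arbitrary timestamps within a bin. A minimal sketch of the idea:

```python
def bin_start(timestamp, binsize):
    # Snap a timestamp down to the start of its bin, so points from
    # different streams that fall in the same bin share one timestamp
    # instead of each contributing its own arbitrary time.
    return timestamp - (timestamp % binsize)

# Points from two streams at slightly different times, with 300s bins:
points = [(1001, 5.0), (1180, 7.0), (1305, 6.0)]
binned = {}
for ts, value in points:
    binned.setdefault(bin_start(ts, 300), []).append(value)

assert sorted(binned) == [900, 1200]  # two bins, on clean boundaries
assert binned[900] == [5.0, 7.0]      # both streams land in one bin
```

The same alignment also makes the disjoint-line check simpler: two adjacent points are in consecutive bins exactly when their binned timestamps differ by one binsize.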
Graphs are working now with the new split streams, but I'm not convinced
that they are showing data optimally. Need to work out exactly what we
want to show and how - at what level should data be aggregated and how
should it be displayed?
Started to mock up an interface to the graphs that would allow multiple
data series to be shown at once and hopefully still be fairly simple to
select what to display. Had a good look around the bootstrap library to
see what it is capable of. Will continue with this once we have the
capability to plot multiple series on a graph.
Moved on to making use of the new datastreams available once Shane split
apart those that had the same test parameters (including destination
name) but different target addresses. Updated the matrix to be able to
display summaries of groups of streams while also being able to display
information on individual ones. While looking at the tooltip graph
generation I found it to be very slow for targets with multiple
addresses - turns out the aggregated queries were being separated at the
last moment so we were hitting the same data in the database once per
series. Got this mostly aggregated again in those situations where the
query durations are similar.
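The fix amounts to grouping per-series requests that cover the same time range and answering each group with one aggregated database query. A hypothetical sketch, simplified to group only identical ranges (the real code merges when durations are merely similar):

```python
from collections import defaultdict

def group_requests(requests):
    # requests: list of (series_id, start, end) tuples. Series whose
    # time ranges match can be answered by a single aggregated query
    # instead of hitting the same data once per series.
    grouped = defaultdict(list)
    for series_id, start, end in requests:
        grouped[(start, end)].append(series_id)
    return dict(grouped)

reqs = [("a", 0, 600), ("b", 0, 600), ("c", 0, 900)]
assert group_requests(reqs) == {(0, 600): ["a", "b"], (0, 900): ["c"]}
```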
Added a simple method to fetch AMP schedule files over HTTP. Currently
runs on startup, but will be scheduled to run regularly. Will hopefully
not be too much work to move to HTTPS and use information in the client
certificates to serve different files to different clients.
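A minimal sketch of the fetch itself (the URL handling and what happens to the file afterwards are assumptions; the real client presumably writes it into its schedule directory). Moving to HTTPS later only changes the URL scheme, with the per-client behaviour keyed off the client certificate on the server side:

```python
import urllib.request

def fetch_schedule(url, timeout=30):
    # Fetch the schedule file and return its contents as text.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")

# A data: URL makes a self-contained demonstration without a server.
assert fetch_schedule("data:text/plain,test-schedule") == "test-schedule"
```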
Spent a lot of time working on performance improvements to the web
interface, in particular the matrix and how it communicates with the
backend. It was previously creating a new connection to NNTSC to fetch
every cell it displayed, which was quite slow. The NNTSC backend already
supported querying data for multiple streams at once, but this needed to
be exposed to the rest of the code along the datapath. It can now query
data for any number of streams from a single collection.
Fixed the way that a single bin of the most recent data was being
queried to not use the regular binning code - modulo maths on the
timestamp could cause two half-full bins to be created, of which we
would use the first. This should mean that the data on the matrix is now
slightly more accurate and quicker to respond to changes.
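Computing the bin boundaries explicitly makes it clear why the generic binning code was the wrong tool here: the bin containing "now" is still filling, so the query should cover only the last complete bin. A sketch of that calculation:

```python
def latest_full_bin(now, binsize):
    # The bin containing `now` is still accumulating data, so step back
    # one bin and return the [start, end) range of the most recent
    # complete bin. Plain modulo on `now` would split the range into
    # two half-full bins.
    current_start = now - (now % binsize)
    return current_start - binsize, current_start

start, end = latest_full_bin(1234, 300)
assert (start, end) == (900, 1200)
```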
Installed the new amplet code on the REANNZ perfsonar machines and used
the install to document the process, with a few notes for more work that
the postinst scripts should do to make it easier. Started to work on a
simple method to fetch test schedules from within measured itself, to
help manage schedules on machines that are otherwise outside of our control.
Spent some more time looking at the broken NAT in virtual box (it looks
to use lwip) and how it was affecting my traceroute tests. In doing so,
found and fixed a few issues where incorrectly sized packets could be
sent, or correctly sized packets could be sent with incorrect length
fields. After switching from NAT to bridged mode everything seems to
work correctly.
More time than I would like was spent trying to differentiate between
problems caused by the NAT and problems with my tests. The ability to
run tests as standalone programs was very helpful for this and made it a
lot easier to pinpoint problems such as the NAT incorrectly sizing
embedded packets in ICMP responses.
Got the AMP packaging for CentOS working well enough that I can now
build and install packages that almost work straight out of the box.
Split the package into two parts - one that operates with a local broker
and one without. Merged some of the required changes back into the
Debian versions too.
While testing the packaging on a CentOS virtual machine I found a few
interesting issues. Tests where the targets were resolved at runtime
were being run to two targets, but with the same address - for some
reason duplicate IPv4 addresses were being returned if
getaddrinfo() was given AF_UNSPEC (which I do because I want both IPv4
and IPv6 addresses if available). Also, with the traceroute test I was
seeing some very high latency to some hops, extra hops at the end of my
paths, etc. Some of this was due to late responses arriving and being
treated the same as on time ones, and some of this appears to be related
to possibly broken behaviour in the VM NAT - ICMP TTL expired messages
are being received where the TTL in the embedded packet is too large to
have expired on that hop!
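The duplicate-address issue above can be guarded against by deduplicating the getaddrinfo() results before scheduling targets. A sketch in Python of the equivalent logic (the C client would walk the addrinfo list the same way; function names here are hypothetical):

```python
import socket

def dedup_addrinfo(addrinfo):
    # Drop duplicate (family, address) pairs; some resolvers return the
    # same IPv4 address more than once when queried with AF_UNSPEC.
    seen = set()
    unique = []
    for family, _type, _proto, _canon, sockaddr in addrinfo:
        key = (family, sockaddr[0])
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique

def resolve(host):
    # AF_UNSPEC asks for both IPv4 and IPv6 addresses if available,
    # which is exactly the case that produced the duplicates.
    return dedup_addrinfo(
        socket.getaddrinfo(host, None, family=socket.AF_UNSPEC))
```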
Spent some time trying to figure out exactly what was going on in the VM
case, and how best to make the test robust in these cases.
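One robustness check that falls out of the broken-NAT observation: when an ICMP "time exceeded" arrives, the probe packet embedded in it should have a TTL that has actually counted down to expiry. A hypothetical sketch of that sanity check (the threshold is an assumption about typical router behaviour):

```python
def embedded_ttl_plausible(embedded_ttl):
    # When a probe genuinely expires at a hop, the TTL of the copy of
    # the probe embedded in the ICMP "time exceeded" response should
    # have counted down to 0 or 1. A larger value means the packet
    # could not have expired at that hop - the behaviour seen behind
    # the broken VirtualBox NAT.
    return embedded_ttl <= 1

assert embedded_ttl_plausible(1)
assert not embedded_ttl_plausible(8)
```

Discarding responses that fail this check would also filter the late responses that were being treated the same as on-time ones.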