AMP

AMP, the Active Measurement Project, is a system for making active measurements. It is deployed at most universities in New Zealand and at most of the non-Telco ISPs. It has a large number of built-in tests. Performance measurements from the public system are available at http://erg.cs.waikato.ac.nz

Historical information:

The NLANR AMP active measurement project was led by Tony McGregor. At that time AMP was the largest and most widespread active measurement system. It was designed for high performance research and education networks, especially the US Internet2 networks. It was deployed by the research and education networks in 11 countries (USA, Canada, Taiwan, Norway, Finland, Australia, Thailand, Japan, Ireland, Hungary and Korea). There were approximately 140 measurement points worldwide.

05 Aug 2013

Added support for the AMP ICMP collection to ampy and amp-web, so we are now able to plot graphs of the test data Brendon has been collecting.

Spent a decent chunk of an afternoon working through the DPDK build system with Richard S., trying to make the DPDK libraries build as position-independent code so that we can link libtrace against them nicely.

Reworked a large amount of code in amp-web to move the collection-specific code out of the core source files and into separate little modules for each collection. This means that the core code should be much easier to follow and work on. Adding support for new collections should also be simpler and require less inside knowledge of how the whole system works.

29 Jul 2013

Table partitioning is now up and running inside of NNTSC. Migrated all the existing data over to partitioned tables.

Enabled per-user tracking in the LPI collector and updated Cuz to deal with multiple users sensibly. Changed the LPI collector to not export counters that have a value of zero -- the client now detects which protocols were missing counters and inserts zeroes accordingly. Also changed NNTSC to only create LPI streams when the time series has a non-zero value occur, which avoids the problem of creating hundreds of streams per user which are entirely zero because the user never uses that protocol.
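
The zero-filling on the client side can be sketched roughly as follows. This is only an illustration, not the actual client code: the function name `fill_missing_counters` and the protocol names are made up for the example, and it assumes the client already knows the full list of protocols it expects.

```python
def fill_missing_counters(reported, known_protocols):
    """Return a counter for every known protocol, using 0 where none was exported.

    reported: dict mapping protocol name -> counter value (zero counters omitted).
    known_protocols: iterable of every protocol name the client expects (assumed).
    """
    return {proto: reported.get(proto, 0) for proto in known_protocols}


# Hypothetical example: the collector left out BitTorrent because it was zero.
exported = {"HTTP": 1200, "DNS": 80}
protocols = ["HTTP", "DNS", "BitTorrent"]
print(fill_missing_counters(exported, protocols))
```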

Added ability to query NNTSC for a list of streams that had been added since a given stream was created. This is needed to allow ampy to keep up to date with streams that have been added since the connection to NNTSC was first made. This is not an ideal solution as it adds an extra database query to many ampy operations, but I'm hoping to come up with something better soon.

Revisited and thoroughly documented the ShewhartS-based event detection code in netevmon. In the process, I made a couple of tweaks that should reduce the number of 'unimportant' events that we have been getting.

22 Jul 2013

Somewhat disrupted week this week, due to illness.

Replaced the template-per-collection for the graph pages with a single template that uses TAL to automatically add the right dropdowns to the page for the collection being shown on that page. Added callback code to allow proper switching between LPI metrics when browsing the graphs -- it isn't perfect but it wasn't worth putting too much effort into it when we're probably going to completely change the graph selection method at some point.

Added code to ampy to query data from the AMP ICMP test. Also added an API function that returns details about all of the streams associated with a collection -- this will be used to populate the matrix with just one request rather than having to make a request for every stream.

Worked on getting NNTSC to use table partitioning so that we can avoid having to select from massive unwieldy data tables. Seems to be working well with my test database but the big challenge is to migrate the existing 'production' database over to a partitioned setup.

15 Jul 2013

Made a number of minor changes to my paper on open-source traffic classifiers in response to reviewer comments.

Modified the NNTSC exporter to inform clients of the frequency of the datapoints it was returning in response to a historical data request. This allows ampy to detect missing data and insert None values appropriately, which will create a break in the time series graphs rather than drawing a straight line between the points either side of the area covered by the missing data. Calculating the frequency was a little harder than anticipated, as not every stream records a measurement frequency (and that frequency may change, e.g. if someone modifies the amp test schedule) and the returned values may be binned anyway, at which point the original frequency is not suitable for determining whether a measurement is missing.
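
A minimal sketch of the gap detection, assuming the frequency has already been worked out and is given in seconds. The function name `insert_gaps` and the `tolerance` parameter are hypothetical and not part of ampy. A single None per gap is enough to break the line on the graph.

```python
def insert_gaps(points, frequency, tolerance=1.5):
    """Insert a (timestamp, None) entry wherever measurements appear to be missing.

    points: list of (timestamp, value) tuples sorted by timestamp.
    frequency: expected number of seconds between consecutive datapoints.
    tolerance: how far past the expected interval a point may be before we
               treat the space before it as a gap (an assumed fudge factor).
    """
    filled = []
    prev_ts = None
    for ts, value in points:
        if prev_ts is not None and ts - prev_ts > frequency * tolerance:
            # One None is sufficient to stop the graph joining the two points.
            filled.append((prev_ts + frequency, None))
        filled.append((ts, value))
        prev_ts = ts
    return filled


data = [(0, 10.0), (60, 11.0), (240, 12.0)]
print(insert_gaps(data, 60))
# [(0, 10.0), (60, 11.0), (120, None), (240, 12.0)]
```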

Added support for the remaining LPI metrics to NNTSC, ampy and amp-web. We are now drawing graphs for packet counts, flow counts (both new and peak concurrent) and users (both active and observed), in addition to the original byte counts. Not detecting any events on these yet, as these metrics are very different to what we have at the moment so a bit of thought will have to go into which detectors we should use for each metric.

05 Jul 2013

Added support for the Libprotoident byte counters that we have been collecting from the red cable network to netevmon, ampy and amp-web. Now we can visualise the different protocols being used on the network and receive event alerts whenever someone does something out of the ordinary.

Replaced the dropdown list code in amp-web with a much nicer object-oriented approach. This should make it a lot easier to add dropdown lists for future NNTSC collections.

Managed to get our Munin graphs showing data using a Mbps unit. This was trickier than anticipated, as Munin sneakily divides the byte counts it gets from SNMP by its polling interval but this isn't very prominently documented. It took a little while for myself, Cathy and Brad to figure out why our numbers didn't match those being reported by the original Munin graphs.
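
The arithmetic involved looks roughly like this. It is a sketch only: the function name is made up, and the 300 second interval is an assumption based on the usual five-minute Munin polling period rather than something taken from our configuration.

```python
def counter_delta_to_mbps(byte_delta, interval_seconds):
    """Convert a byte counter increase over one polling interval into Mbps."""
    # Dividing by the polling interval gives a per-second rate; this is the
    # step that is easy to miss when comparing against raw counter values.
    bytes_per_second = byte_delta / interval_seconds
    # 8 bits per byte, 1,000,000 bits per megabit.
    return bytes_per_second * 8 / 1_000_000


# 375,000,000 bytes seen during an assumed 300 second polling interval.
print(counter_delta_to_mbps(375_000_000, 300))  # 10.0
```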

Chased down and fixed a libtrace bug where converting a trace from any ERF format (including legacy) to PCAP would result in horrendously broken timestamps on Mac OS X. It turned out that the __BYTE_ORDER macro doesn't exist on BSD systems and so we were erroneously treating the timestamps as big endian regardless of what byte order the machine actually had.
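
To show how badly a wrong byte order mangles a timestamp, here is a small sketch. It is not the libtrace code (which is C); it only assumes the ERF timestamp layout of a 64-bit little-endian fixed-point value, with whole seconds in the upper 32 bits and the fraction of a second in the lower 32 bits. The function name and the sample value are made up.

```python
import struct


def erf_timestamp_to_seconds(raw, byteorder="<"):
    """Decode an 8-byte ERF timestamp into seconds since the epoch.

    byteorder is a struct prefix: "<" for little endian (correct for ERF),
    ">" for big endian (the mistaken interpretation).
    """
    (ts,) = struct.unpack(byteorder + "Q", raw)
    seconds = ts >> 32
    fraction = (ts & 0xFFFFFFFF) / 2**32
    return seconds + fraction


# An arbitrary timestamp: 1372982400 seconds plus half a second.
raw = struct.pack("<Q", (1372982400 << 32) | 0x80000000)

print(erf_timestamp_to_seconds(raw, "<"))  # 1372982400.5
print(erf_timestamp_to_seconds(raw, ">"))  # nonsense: a value just over 128 seconds
```

Passing the byte order explicitly, as above, avoids depending on a host-specific macro being defined at all.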

Migrated wdcap and the LPI collector to use the new libwandevent3.

Changed the NNTSC exporter to create a separate thread for each client rather than trying to deal with them all asynchronously. This alleviates the problem where a single client could request a large amount of history and prevent anyone else from connecting to the exporter until that request was served. Also made NNTSC and netevmon behave more robustly when a data source disappears -- rather than halting, they will now periodically try to reconnect so I don't have to restart everything from scratch when I want to apply changes to one component.

Finally, my paper on comparing the accuracy of various open-source traffic classifiers was accepted for WNM 2013. There are a few minor nits to possibly tidy up but it shouldn't require too much work to get camera-ready.

24 Jun 2013

Added manpages to netevmon to get it ready for Debian packaging. During this process, fixed a few little oversights in the netevmon script and the existing documentation.

Re-wrote much of the NNTSC API in ampy. The main goal was to reduce the amount of duplicated code in modules for individual NNTSC collections that was better suited to a more general NNTSC API. In the process I also changed the API to only use a single "NNTSC Connection" instance rather than creating and destroying one for every AJAX request. The main benefit of this is that we don't have to ask the database about collections and streams every time we make a request now -- instead we get them once and store that info for subsequent use. This will hopefully make the graph interface feel a bit more responsive.

Updated amp-web to use the new NNTSC API in ampy. I also spent a bit of time on Friday testing the web graphs on various browsers and fixing a few of the more obvious problems. Unsurprisingly, IE 10 was the biggest source of grief.

Added a new time series type to anomaly_ts -- JitterVariance. This time series tracks the standard deviation of the latencies reported by the individual smokeping pings. Using this, I've added a new event type designed to detect when the standard deviation has moved away from being near zero, e.g. the pings have started reporting variable latency. This helps us pick up on situations where the median stays roughly the same but the variance clearly indicates some issues. It also serves as a good early indicator of upcoming Plateau or Mode events on the median latency.
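
The idea can be sketched as below. This is not the anomaly_ts implementation (which is not Python): the function names, the threshold and the sample latencies are all made up, and the real detector is likely to be more careful about what counts as "near zero".

```python
from statistics import median, pstdev


def jitter_series(ping_sets):
    """Standard deviation of the individual ping latencies in each measurement."""
    return [pstdev(pings) for pings in ping_sets]


def jitter_events(series, threshold):
    """Indexes where the standard deviation first rises above the threshold."""
    events = []
    noisy = False
    for index, deviation in enumerate(series):
        if deviation > threshold and not noisy:
            events.append(index)
        noisy = deviation > threshold
    return events


# Hypothetical latencies in ms: the median barely moves, the spread does.
measurements = [
    [10.0, 10.0, 10.1, 10.0, 10.1],
    [10.1, 10.0, 10.0, 10.1, 10.0],
    [10.0, 10.1, 10.0, 35.0, 10.1],
    [10.1, 28.0, 10.0, 10.0, 10.1],
    [10.0, 10.0, 10.1, 10.1, 10.0],
]
print([median(m) for m in measurements])
print(jitter_events(jitter_series(measurements), threshold=1.0))  # [2]
```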

17 Jun 2013

Finished preparing NNTSC for packaging. Wrote an init script for the NNTSC collector and ensured that all of the subprocesses are cleaned up when the main collector process is killed. Wrote some manpages, updated the other documentation and added some licensing to NNTSC before handing it off to Brendon for packaging.

Also moved towards packaging netevmon. Again, lots of messing around with daemonisation and ensuring that the monitor can be started and stopped nicely without anyone having to manually hunt down processes.

Spent the rest of my time working on the interaction between amp-web and History.js. Only one entry is placed in the history for each visited graph now and selecting a graph from the history will actually show you the right graph. Navigating to a graph via the history will also now update the dropdown lists to match the currently viewed graph. When using click and drag to explore a graph, clicking once on the graph will return to the previous zoom level (this was already present, but only worked for exploring the detailed graph, not the summary one).

10 Jun 2013

Spent most of my week working on making the various components of NNTSC and netevmon backgroundable so that they are a lot easier to run long-term. This was pretty straightforward for the C++ programs but the python scripts have been a bit trickier, especially in terms of getting the logging going to the right place.

Also fixed a few of the outstanding issues with amp-web. In particular, I fixed the problems we were having with the X-axis of the summary graph being garbled and ensured that the summary graph will always show a sensible time period based on the region shown in the detailed view. These changes also meant I could remove the summary timestamps from the page URL, which cleans that up quite a bit.

04 Jun 2013

Finished fixing the URLs in amp-web so that they are ordered sensibly and can support NNTSC streams that are defined using more than just "source" and "target". I also changed the ordering of the timestamps in the URL so that we can specify a start and end time for the detailed graph only (sensible defaults for the summary graph are meant to be chosen in this case). This is really handy when creating URLs that link to graphs showing events.

Started looking into what needed to be done to prepare NNTSC and netevmon for packaging and a possible distribution for our friends at Lightwire. Spent a decent chunk of time writing a README that should describe exactly how to get a NNTSC instance up and running.

NNTSC and netevmon both have tracs now and I've added a series of tickets to each with the aim of getting a release ready for Lightwire by the end of the month.

05 Nov 2012

Spent some more time working on measured. Tests will now be forked and
run (currently just running touch or ping to check it works), with a timer
scheduled to kill any that run too long. Successful tests remove the timer
once they complete - catching the SIGCHLD from the test lets me do all the
required tidyup.
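
The general pattern of running a test with a watchdog looks something like the
sketch below. It is written in Python for brevity and is not how measured does
it: measured forks, uses libwandevent timers and catches SIGCHLD, whereas this
uses subprocess and threading.Timer. The function name is made up.

```python
import subprocess
import sys
import threading


def run_with_watchdog(argv, timeout):
    """Run a test command and kill it if it is still running after timeout seconds.

    Returns (returncode, killed) where killed is True if the watchdog fired.
    """
    proc = subprocess.Popen(argv)
    killed = threading.Event()

    def kill_hung_test():
        killed.set()
        proc.kill()

    watchdog = threading.Timer(timeout, kill_hung_test)
    watchdog.start()
    try:
        proc.wait()
    finally:
        # A test that finished in time removes its own timer.
        watchdog.cancel()
    return proc.returncode, killed.is_set()


if __name__ == "__main__":
    # A quick test that completes, then one that hangs and gets killed.
    print(run_with_watchdog([sys.executable, "-c", "pass"], 5.0))
    print(run_with_watchdog([sys.executable, "-c", "import time; time.sleep(30)"], 0.5))
```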

Tested it briefly on an emulation machine with 1000 tests scheduled
simultaneously every 20 seconds. Led to discovering a few small bugs with
the signal handling. After fixing them it all seems to run well, as long
as the watchdog timeout for hung tests is not too short (there isn't
always enough cpu time to go around). Everything works fine with slightly
fewer tasks or a slightly longer timeout.

Had another discussion with Shane about how we should structure tests and
started fleshing out a skeleton/example test. Basing it on a similar
structure to how Maji loads its various decoders etc, with lots of shared
objects that register various properties of the test when they are loaded.

01 Nov 2012

Spent some more time reading bits of honours reports before they were
handed in.

Updated addressing on the KAREN AMP machines so they would continue to work
with recent network changes. In the process of doing so, discovered that
CFEngine would no longer update certain sites and spent quite a while
trying to debug it. It was failing to authenticate server keys properly,
which was fixed by forcing it to refetch the (identical) key. Not
impressed that it is acting flaky over something like that.

Started work on a new implementation of AMP using some of the ideas we've
been talking about. Currently I'm working on a reimplementation of
measured using libwandevent. At this stage it can read the old schedule
file format and creates a new timer event for each entry, which runs a
dummy function when the time arrives and reschedules itself afterwards.

15 Oct 2012

Continued working with Nathan to get smokeping data successfully into the
event detection system. I generated some random data to fill the
historical buffers and then continued to run it over live data, which
generated a small number of plausible looking events. I'm now looking into
the scalability and resource usage of this as it seems a little higher
than it should be. Also polished the dashboard graphs slightly, changing
them to use more sensible axes and better resolution data.

Spent some time with Richard, Tony and Shane thinking about the future
direction of AMP. We've got some good ideas and have a whiteboard full of
initial planning for the work that needs to be done.

Read draft introductions to a number of 520 reports and gave some
hopefully useful feedback. Everyone seems to be on the right track so far,
looking forward to reading more.

02 Oct 2012

Tried to make the generated alerts more efficient and more effective by
very slightly delaying the actual alerting - doing so means that the alert
can contain any other events that arrive immediately after the triggering
event. It also now sends me emails for certain event thresholds, but I
broke the live import of AMP data so need to fix that before I can get
more than the emails generated by my test data.
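
The delayed alerting works along these lines. This is only a sketch over a
list of already-collected events. The function name, the event descriptions
and the five second delay are invented for the example, and the real code
reacts to events as they arrive rather than looping over a list.

```python
def batch_alerts(events, delay):
    """Group events into alerts, holding each alert open for delay seconds.

    events: list of (timestamp, description) tuples sorted by timestamp.
    Returns a list of alerts, each a list of event descriptions.
    """
    alerts = []
    current = []
    deadline = None
    for ts, description in events:
        if deadline is not None and ts > deadline:
            # The hold period has passed, so the pending alert goes out.
            alerts.append(current)
            current = []
            deadline = None
        if deadline is None:
            # This event triggers a new alert; wait a little before sending.
            deadline = ts + delay
        current.append(description)
    if current:
        alerts.append(current)
    return alerts


events = [(0, "latency increase"), (2, "path change"), (30, "loss")]
print(batch_alerts(events, delay=5))
# [['latency increase', 'path change'], ['loss']]
```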

Started trying to make the information presented in the default web
interface a bit more concise and relevant to what is going on right now.
Trying to use a few graphs to give an initial overview of the recent data
while keeping the ability to go look at everything in detail as you can
now.

The AMP deployment on the NLNOG RING was mentioned during a talk at RIPE
about the RING along with screenshots and links back to WAND. The slides
look pretty good and I think it went well.

17 Sep 2012

Fixed the bug in the tput test that would sometimes cause it to refuse
connections for a minute when it was meant to be re-establishing a new
test connection. It was erroneously waiting for more data when there was
no more to follow, so wouldn't continue until select timed out. Updated
the NZ mesh with all the fixes from the last couple of weeks.

Worked on the backend for the event detection web interface to use a more
flexible and secure database abstraction. Made a few small changes to the
web interface to try to hide information that wasn't always needed, but
still make it available if required.

Investigated all the historical event groupings and found a few rare cases
where it wasn't doing the right thing due to the order in which events
arrived or due to some missing common attributes. Came up with an approach
to sometimes rebuild groups as needed to minimise their number and get
better matches.

10 Sep 2012

Fixed a bug in the AMP web API that would give incorrect traceroute data
for the last measurement bin in certain situations and was causing issues
with the path change event detection algorithms. Also, after running an
AMP client with the threading fixes for a week on some of the machines
most often affected by the bug I'm pretty confident that it's fixed.

Fixed the group membership checks using common path information to
properly group events based on all items having a shared attribute. I'm
quite happy with the contents of the new groups, they make good sense and
can help show underlying problems in intermediate networks that aren't
immediately obvious from looking at just sources/destinations.
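
A simplified sketch of the membership rule, where an event only joins a group
if it shares an attribute with every existing member. The function name, the
event names and the hop labels are invented for illustration, and the real
grouping code considers more than this.

```python
def group_events(events):
    """Group events so that all members of a group share at least one attribute.

    events: list of (name, set_of_attributes) tuples in arrival order.
    Returns a list of dicts with the "common" attributes and "members" names.
    """
    groups = []
    for name, attrs in events:
        for group in groups:
            shared = group["common"] & attrs
            if shared:
                # Narrow the common set so it stays true for every member.
                group["common"] = shared
                group["members"].append(name)
                break
        else:
            groups.append({"common": set(attrs), "members": [name]})
    return groups


events = [
    ("A->X", {"hop1", "hop2"}),
    ("B->X", {"hop2", "hop3"}),
    ("C->Y", {"hop9"}),
    ("D->X", {"hop1"}),
]
for group in group_events(events):
    print(sorted(group["common"]), group["members"])
```

Here D->X shares hop1 with A->X but not with B->X, so it does not join their
group. The result also depends on the order in which events arrive.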

Started putting together a database schema and web interface for a very
simple alerting system using the event groups detected.

05 Sep 2012

Found what looks to be a threading bug in AMP measured that has been
troubling me for a while. Test threads check that the nametable is up to
date before running but it was possible for them to deadlock on accessing
the file. I've had a fixed version of the code running for a couple of
days on some of the machines that were most often affected and have yet to
see the problem again. Hopefully that's fixed!

Fixed a couple of small bugs in the AMP matrix tooltips that were
triggering events on child elements without the appropriate attributes, so
weren't displaying information.

Wrote a program to insert common path information into the event database
to use for grouping events. Testing so far with this data shows that
fewer, larger groups of events are being created. Some of the membership
is a little bit questionable, so I am now in the process of having it
describe the reasoning behind creating each of the groups.

27 Aug 2012

Ran some more tests on the IPv6 packet filtering in the AMP ICMP test and
it does indeed appear that the errors are due to packets arriving between
the socket being opened and the filter being applied. That makes most of
the warnings much less worrying, and I've lowered the priority on those
that I can confirm aren't an issue. While investigating this I also found
a situation where various test resources weren't being freed in the
traceroute test if they involved IPv6 addresses. Fixed that as well.

Finished updating the protocol between the different parts of the event
detection process to use the new protocol design. Also changed it from
using local unix sockets to run across the network, as our data sources
will likely be on different machines to the eventing system. Socket input
for the time series data is also now supported rather than only using
stdin.

Updated the sample web scripts that display event information to work with
the new database schema to confirm that everything is still working as it
should.

Pushed out the AMP matrix changes to the NLNOG RING. Also investigated
colouring cells based on current performance vs historical performance
rather than raw latency values, which was a request they had.

21 Aug 2012

Short week this week due to being in Wellington for Thursday and Friday.
While I was there I caught up with Jamie and Sam Russell at REANNZ for a
chat about AMP and perfSONAR deployments on the network. There should be a
lot of new monitors going in shortly and it would be great if we could run
both measurement platforms.

Spent some time investigating error messages that have been showing up
lately in amplet logs. It appears there is some weirdness happening with
raw icmp6 sockets receiving packets that should have been filtered out by
a socket option. Reading through the kernel source it looks like filters
are doing exactly what they should be doing and I now believe it's due to
packets arriving and being buffered in the time between the socket being
created and the filters being set.

Changed the tooltips in the matrix display to all be fetched via ajax
calls, so none of that data is sent to the client initially. This should
speed up page generation (no need to fetch data for the last week) and
shrink the raw page size further. Will hopefully deploy and test this on
the NLNOG RING matrix shortly.

13 Aug 2012

Sat down with Shane and went over what we need to do to get our event
detection programs integrated. The protocol used between data fetching,
detection and eventing needs to be updated slightly as there is more
information needing to be shared and magic numbers from testing to be
changed into real data. Started work on updating the protocol to match
what is required and updating the database schema to match.

Integrated the path change detection into the main detection code and
updated the main code path to deal properly with the slight differences
between a wider variety of data - traceroute, latency, byte counts, etc.

Did a bit of maintenance on the NZ AMP mesh and KAREN weathermap as well -
updated some addresses and got IPv6 addresses for the Citylink and
Netspace AMPlets which is neat.

07 Aug 2012

Spent some time improving the performance of the NLNOG RING AMP matrix
page - with tens of thousands of cells the page got rather large and slow.
I've culled the individual tooltips down to one reusable one, drastically
reducing the number of DOM elements on the page as well as reducing the
size of the raw HTML. It's still a monster but is almost becoming
manageable. Next step will likely be to move all the tooltip data to an
AJAX call rather than embedding it in the page.

Fixed up the getCommonPath function in AMPcentral to better fetch data
from the desired time period. The ending condition for the time period was
using an incorrect value which resulted in using much longer periods. This
now gives me correct path data through the web interface which I can use
for event detection and hopefully smarter grouping of events. Added
database support for dealing with common attributes between sources and
destinations, now need to collect the data.

Rewrote my AMP data sampling program to properly sort all data by time
rather than by source/destination pair, and to deal properly with fetching
multiple data types in a single run.
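
Sorting by time across pairs can be sketched with a merge of the per-pair
series. This is not the sampling program itself: the function name and the
sample data are made up, and it assumes each per-pair series is already
sorted by timestamp.

```python
import heapq


def merge_by_time(series_by_pair):
    """Merge per-pair time series into one list ordered by timestamp.

    series_by_pair: dict mapping (source, destination) -> list of
    (timestamp, value) tuples, each list already sorted by timestamp.
    Returns a list of (timestamp, pair, value) tuples.
    """
    tagged = [
        [(ts, pair, value) for ts, value in series]
        for pair, series in series_by_pair.items()
    ]
    # Only compare on the timestamp so values of mixed types are never compared.
    return list(heapq.merge(*tagged, key=lambda record: record[0]))


data = {
    ("a", "b"): [(0, 1.0), (60, 1.1)],
    ("a", "c"): [(30, 5.0), (90, 5.2)],
}
for record in merge_by_time(data):
    print(record)
```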