
Shane Alcock's Blog




Spent most of the week preparing for my Sydney trip. Wrote the talk I will be presenting this coming Thursday and gave a practice rendition on Friday.

The rest of my time was spent fixing minor issues in Cuz -- trying not to break anything major before I go away for a week. Replaced the bad SQLAlchemy code in the ampy netevmon engine with psycopg2 code, which should make us slightly more secure. Also tweaked some of the event display on the dashboard so that useful information is shown in a sensible format, i.e. fewer '|' characters all over the place.

Had a useful meeting with Lightwire on Wednesday. Was pleased to hear that their general impression of our software is good, and we will start working towards making it more useful to them over the summer.




The new psycopg2-based query system was generally working well but using significant amounts of memory. This turned out to be due to the default cursor being client-side, which meant that the entire result was being sent to the querier at once and stored in memory. I changed the large data queries to use a server-side cursor which immediately solved the memory problem. Instead, results are now shipped to the client in small chunks as needed -- since the NNTSC database and exporter process are typically located on the same host, this is not likely to be problematic.
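As a rough sketch of the chunked-fetch pattern (the cursor name, function names and chunk size here are illustrative, not NNTSC's actual code): with psycopg2, creating a cursor with a name, e.g. `conn.cursor(name="nntsc_export")`, makes it server-side, and `fetchmany()` then pulls rows from postgres in batches rather than loading the entire result into client memory:

```python
def fetch_in_chunks(cursor, chunk_size=1000):
    """Yield rows from a cursor in fixed-size batches.

    With psycopg2, creating the cursor with a name (e.g.
    conn.cursor(name="nntsc_export")) makes it server-side, so each
    fetchmany() call pulls the next batch from postgres on demand
    instead of materialising the whole result set in client memory.
    """
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            return
        for row in rows:
            yield row


class StubCursor:
    """Tiny stand-in for a DB-API cursor, just to demonstrate usage."""
    def __init__(self, rows):
        self.rows = rows
        self.pos = 0

    def fetchmany(self, size):
        batch = self.rows[self.pos:self.pos + size]
        self.pos += size
        return batch
```

The generator keeps the memory footprint bounded by `chunk_size` rows, at the cost of a round-trip per batch -- cheap when the database and exporter share a host.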

Netevmon now tries to use the measurement frequency reported by NNTSC for the historical data wherever possible, rather than guesstimating the frequency from the time difference between the first two measurements. The previous approach was failing badly with our new one-stream-per-tested-address approach for AMP, as individual addresses were often tested intermittently. If there is no historical data, a new algorithm is used that simply finds the smallest time difference among the first N measurements and uses that.
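The fallback estimator described above can be sketched in a few lines (the function name and defaults are mine, not netevmon's):

```python
def estimate_frequency(timestamps, n=10):
    """Estimate a stream's measurement frequency from its first n
    timestamps: take the smallest gap between consecutive
    measurements, which is robust against streams whose addresses
    are only tested intermittently.

    Returns None if there are not enough measurements to estimate.
    """
    window = sorted(timestamps[:n])
    gaps = [b - a for a, b in zip(window, window[1:]) if b > a]
    return min(gaps) if gaps else None
```

Taking the minimum gap rather than the first gap means one long pause between early measurements no longer skews the estimate.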

Changed the table structure for storing AMP traceroute data. The previous method was causing too many problems and required too much special treatment to query efficiently. In the end, we decided to bite the bullet and re-design the whole thing, at the cost of all of the traceroute data we had collected over the past few months (actually, it is still there but would be painful to convert over to the new format).

Had a long but fruitful meeting with Brendon and Brad where we worked out a 'view' system for describing what streams should be displayed on a graph. Users will be able to create and customise their own views and share them easily with other users. Stream selections will be described using expressions rather than explicitly listing stream ids as it is now (although listing specific streams will still be possible).

This will allow us to create a graph showing a single line aggregating all streams that match an expression such as "collection=amp-icmp AND family=ipv4". Our view could also include a second line for IPv6. By using expressions, the view can automatically update to include new streams that match the criteria after the view was created, e.g. new Google addresses.




Finished migrating our database query code in NNTSC from SQLAlchemy to psycopg2.

Released libwandevent 3.0 and updated netevmon to use it instead of the deprecated libwandevent 2 API.

Continued to be stymied by performance bottlenecks when querying large amounts of historical data from NNTSC using netevmon. The problems all stem from simultaneous attempts to export live data breaking down, which eventually caused the data collection to block while waiting to write live data to the exporter. Because the data collection was blocked, no new data was being collected or written to the database.

The first new problem I found was (surprisingly) caused by our trigger function that writes new data into the right partition. Because there is no "CREATE IF NOT EXISTS" for triggers in postgres, we were dropping the trigger and then re-creating it whenever we switched to a new partition. However, you can't drop a trigger from a table without having an exclusive lock on the table. If the table is under heavy query load (e.g. from netevmon) then the DROP TRIGGER command will block until the querying ends. The solution was reasonably straightforward -- check the metadata tables for the existence of the trigger and only create it if it doesn't exist.
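A minimal sketch of that existence check (the table and trigger names are made up, and this version consults the pg_trigger catalog -- the real NNTSC code may use its own metadata tables instead):

```python
def ensure_trigger(cursor, table, trigger, create_sql):
    """Create a trigger only if it does not already exist.

    Postgres has no CREATE TRIGGER IF NOT EXISTS, and DROP TRIGGER
    requires an exclusive lock that blocks behind long-running
    queries, so we check the pg_trigger catalog first and skip the
    DDL entirely when the trigger is already in place.

    Returns True if the trigger was created, False if it existed.
    """
    cursor.execute(
        "SELECT 1 FROM pg_trigger "
        "WHERE tgname = %s AND tgrelid = %s::regclass",
        (trigger, table))
    if cursor.fetchone() is None:
        cursor.execute(create_sql)
        return True
    return False


class StubCursor:
    """Minimal stand-in for a DB cursor, for illustration only."""
    def __init__(self, exists):
        self.exists = exists
        self.executed = []

    def execute(self, sql, params=None):
        self.executed.append(sql)

    def fetchone(self):
        return (1,) if self.exists else None
```

Because nothing is dropped, the partition switch never has to wait for an exclusive lock on a heavily-queried table.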

The other problem was that our select queries were happening in the same thread as the reading of live data from the exporter. Despite last week's improvements, the queries can still take a little while and live data was building up while the query was taking place. Furthermore, we were only reading one message from the live data queue before returning to querying so we would never catch up once we fell behind. To fix this, I've implemented a worker thread pool for performing the select queries instead so we can export live data while a query is ongoing.
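The shape of that fix looks roughly like this (the `do_query`/`handle_live` callables and queue are a hypothetical API, not netevmon's real one): slow SELECTs run on a small worker pool while the caller keeps draining the live-data queue in between:

```python
from concurrent.futures import ThreadPoolExecutor
import queue


def run_queries(requests, do_query, live_queue, handle_live, workers=4):
    """Dispatch each historical query to a worker thread, draining
    live measurements from live_queue while the queries are running
    instead of letting them pile up behind a blocking query."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(do_query, req) for req in requests]
        for fut in futures:
            while not fut.done():
                # keep consuming live data while the query runs
                try:
                    handle_live(live_queue.get(timeout=0.01))
                except queue.Empty:
                    pass
            results.append(fut.result())
    return results
```

The key point is that the live-data path no longer shares a thread with querying, so falling behind during a long query is recoverable.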




Attempting to run netevmon against a decent quantity of historical data has been causing significant performance problems and even preventing NNTSC from processing and storing new measurements. After a bit of hackish profiling, I realised that the biggest problem was the time taken to query for traceroute data. Unlike most of the other existing data tables, the traceroute data is spread across three tables which are joined to create a view that we query from.

Unfortunately, the join was not smart enough to recognise that the traceroute test ids it was looking for all fell within a certain set of table partitions. Instead, it would sequentially scan millions of rows across all of the test tables. After a lot of messing around with the SQL used to create the view, I found that the best approach was to instead use a procedure that figured out the test ids that fell within the time period being queried for and returned a table constructed using constraints on the test ids as well as timestamp and stream ids.

This managed to get the query time for several weeks' worth of data down from 12 seconds to 2 seconds. The next problem was using the procedure within SQLAlchemy in place of a "data table", as SQLAlchemy treats the returned table as a Result object rather than a Table object. This meant that there weren't any Column objects available for us to operate on, e.g. to apply aggregation functions for generating graph data.

At this point, it became apparent that SQLAlchemy was more of a hindrance than a benefit, and I decided we would be better off replacing it with the simpler and more intuitive psycopg2, at least for the database querying side of NNTSC. Spent the remainder of my week writing and testing the new query code.




Changed the stream definition for both the AMP ICMP and AMP traceroute collections in NNTSC to include the address that was tested to. This means that we can more easily analyse the behaviour of specific paths and show each one as a separate line on our graphs.

Also added support for multiple streams into ampy and amp-web. Previously, a graph URL would contain a single stream id which described the stream to be shown on the graph -- now the URL contains a series of stream ids separated by hyphens (although we only plot the first right now). Various ampy functions now return a list of streams rather than just one. Streams within the amp-web javascript are represented as objects rather than just the id number -- this allows us to store additional information with the stream such as the colour to use when plotting the stream and whether the stream should be plotted or not.

Added an LRU-based detector to netevmon, mainly for use with the traceroute data. The detector maintains an LRU of values that it has seen recently (e.g. hop counts) and creates an event anytime it has to add a new value to the LRU. This will also be used to check for changes in the full path returned by the traceroute test.
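A sketch of the core idea, assuming nothing about the real detector beyond what is described above (the class name and capacity are mine):

```python
from collections import OrderedDict


class LRUDetector:
    """Novelty detector: remember the N most recently seen values
    (e.g. traceroute hop counts) and raise an event whenever a value
    arrives that is not currently in the LRU."""

    def __init__(self, capacity=16):
        self.capacity = capacity
        self.seen = OrderedDict()

    def process(self, value):
        """Return True (an event) if value was not in the LRU."""
        is_new = value not in self.seen
        # move value to the most-recently-used position
        self.seen.pop(value, None)
        self.seen[value] = True
        # evict the least-recently-used entry if over capacity
        if len(self.seen) > self.capacity:
            self.seen.popitem(last=False)
        return is_new
```

Note that an evicted value triggers a fresh event if it reappears, which is the desired behaviour for "this path has changed back" style detection.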




Updated ampy to cache stream information as well as data measurements. I had noticed that multiple requests for the same stream information were being generated when loading a graph, which seemed a little wasteful. Now we cache the details of what streams are available for each collection and the description of each stream (source, dest, metric etc.). The one downside is that newly-added streams won't be obvious until the cached stream list for the collection has expired.

Added support in NNTSC for table partitioning of traceroute data. This was much more complicated than anticipated for several reasons:
* the trigger function that inserts the data must return NULL to avoid a duplicate insertion into the parent table as well as the partitioned table.
* our traceroute test table had a "test id" column that was defined as a primary key based on an auto-incremented sequence, which meant SQLAlchemy would try to return the newly inserted row by default.
* we needed the value of the test id for subsequent inserts into other tables relating to the traceroute test.
* SQLAlchemy had no error-handling for the case where an insert operation that was meant to return a row returned NULL, resulting in a crash with little to no useful error message.

Once I'd figured all this out, I implemented a (somewhat hackish) solution: disable the implicit return, so we could keep our trigger function returning NULL without crashing SQLAlchemy. Then, immediately after our insert operation, perform a SELECT to find the row we just inserted and grab the test id from that.

There was also the problem of the traceroute path table which I also wanted to partition but did not have a timestamp column. The partitioning code I had written was only designed to partition based on timestamp, so I had to re-engineer that to support any numeric column (although it defaults to using timestamp).

Finally, I had to then go and manually move all of the existing traceroute data into suitable partitions.

I also spent some time fixing up the Constant to Noisy algorithm in netevmon. Mostly this just involved refining some of the thresholds for the change detection, but I also now avoid moving from Constant to Noisy unless the most recent N measurements have all demonstrated a reasonable amount of noise, i.e. the differences between consecutive measurements are significant relative to the mean.
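The guard condition can be sketched like this (the function name, N and the 10%-of-mean significance threshold are placeholders for the real tuned values):

```python
def recently_noisy(measurements, n=5, threshold=0.1):
    """Return True only if the last n consecutive differences are all
    'significant' relative to the series mean -- here, larger than
    threshold * mean, standing in for the real tuned thresholds."""
    if len(measurements) < n + 1:
        return False
    mean = sum(measurements) / len(measurements)
    if mean == 0:
        return False
    recent = measurements[-(n + 1):]
    diffs = [abs(b - a) for a, b in zip(recent, recent[1:])]
    return all(d > threshold * mean for d in diffs)
```

Requiring every one of the last N differences to be significant stops a single outlier from flipping a constant stream into the Noisy state.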

One last thing: added timer events to the python version of libwandevent. Used this to ensure that anomalyfeed would request historical information at a sensible rate when first starting up, rather than asking for it all at once and completely hosing NNTSC with data requests.




Spent most of the week on leave, so not much got done this week.

In the time I was here, I fixed a number of bugs with the auto-scaling summary graph that occurred when there was no data to plot in the detail view.

I implemented yet another new algorithm for trying to determine if a time series is constant or noisy, as the previous one was pretty awful at recognising that the time series had moved from constant to noisy. The new one is better at that, but still appears to have problems for some of our streams -- it now tends to flick between constant and noisy a little too frequently -- so it will be back to the drawing board somewhat on that one.




Tidied up a lot of the javascript within amp-web. Moved all of the external scripts (i.e. stuff not developed by us) into a separate lib directory and ensured that everything used consistent and specific terminology.

Added config options to amp-web for specifying the location of the netevmon and amp meta-data databases. Previously we had assumed these were on the local machine, which proved troublesome when Brad tried to get Cuz running on warlock.

Capped the maximum range of the summary graph to prevent users from zooming out into empty space.

Fixed some byte-ordering bugs in libpacketdump's RadioTap and 802.11 header parsing on big endian architectures.




Added a smarter method of generating tick labels on the X axis in amp-web. Previously, if you were zoomed in far enough, the labels simply showed a time with no indication of which day you were looking at. Now, we show the date as well as the time.

Reworked the zoom behaviour of the summary graph. The zoom level is now determined dynamically based on the selected range, e.g. selecting more than 75% of the current summary range will cause it to zoom out to the next level, while selecting a small area will cause it to zoom back in.

To support arbitrary changes to the summary graph range without having to re-fetch and re-draw both graphs, I decided to rewrite our graph management scripts to operate on an instance of a class rather than just being a function that gets called whenever we want to render the graphs. The class has methods that update just the summary graph or just the detail graph, so we only end up changing the graph that we need to. Also, the class can be subclassed to support different graph styles easily, e.g. our Smokeping style. While I was rewriting, I used jQuery.when to make all of the AJAX requests for graph data simultaneously rather than sequentially as we were previously.

Unfortunately, this was a pretty painful re-write as Javascript scoping behaviour was a constant thorn in my side. Turned out that there was a reason we did everything inside of one big function, as I frequently found that I could no longer access my parent object inside of callback functions that I had defined within the new class. Often the method that was used to setup the callback did not support passing in arbitrary parameters either, so ensuring I had all the information I needed inside my callback functions took a lot longer than anticipated.




Implemented a new data caching scheme within ampy to try to limit the number of queries made to the NNTSC database. Previously, data was cached based on the start and end time given in the original query, which meant that we would only get a cache hit if the exact same query was made. Caching is now done based on time "blocks", where each block includes 12 individual datapoints, so we can more easily re-use the results from old queries that overlap with the current one.
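The block alignment works out to something like this (function name and constants are illustrative; ampy's actual cache keys may differ):

```python
DATAPOINTS_PER_BLOCK = 12


def block_boundaries(start, end, binsize):
    """Return the start timestamps of every cache block touched by a
    query for (start, end) at the given binsize.

    Each block spans DATAPOINTS_PER_BLOCK datapoints, so two queries
    with different but overlapping ranges map onto the same block
    keys and can share cached results, instead of only hitting the
    cache on an exact (start, end) match.
    """
    blocksize = binsize * DATAPOINTS_PER_BLOCK
    first = (start // blocksize) * blocksize
    last = ((end // blocksize) + 1) * blocksize
    return list(range(first, last, blocksize))
```

Any two queries whose ranges overlap now produce at least one common block key, which is what makes partial cache hits possible.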

Re-worked the JitterVariance detector in netevmon, as it had been producing some unimpressive results of late. Instead of looking at the standard deviation of the individual measurements, I now look at the standard deviation as a percentage of the mean latency. Also started running a Plateau detector against these values, which has been surprisingly effective at picking up on increases in "smoke" quickly.
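Expressing the standard deviation as a percentage of the mean is just the coefficient of variation; a minimal sketch (function name is mine, not the detector's):

```python
import statistics


def relative_jitter(latencies):
    """Standard deviation of the latency measurements expressed as a
    percentage of the mean latency (coefficient of variation).

    Normalising by the mean stops high-latency streams from looking
    artificially noisy compared to low-latency ones.
    """
    mean = statistics.mean(latencies)
    if mean == 0:
        return 0.0
    return 100.0 * statistics.pstdev(latencies) / mean
```

It is this normalised series that the Plateau detector is then run against.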

Fixed the issue in amp-web where the y-axis on the detail graph was autoscaling to the largest value in the summary graph. Also tweaked some of the behaviour of the selection area in the summary graph: single-clicking is now a null operation (i.e. it won't reset the detail graph to show the full summary graph) and you can now click and drag on the shaded area to move the selection (previously, you could only use the tiny handle for this).

Tidied up the _get_data function in the core of ampy, as this was getting messy and disorganised. ampy parsers must now implement a request_data function which forms and makes the request to NNTSC for data; in exchange, the clunky get_aggregate_columns, get_group_columns and get_aggregate_functions functions have all gone away.