Shane Alcock's Blog
Reworked how aggregation binsizes are calculated for the graphs. There is now a fixed set of aggregation levels that can be chosen, based on the time period being shown on the graph. This means that we should hit cached data a lot more often rather than choosing a new binsize every few zoom levels. Increased the minimum binsize to 300 seconds for all non-amp graphs and 60 seconds for amp graphs. This will help avoid problems where the binsize was smaller than the measurement frequency, resulting in empty bins that we had to recognise were not gaps in the data.
Added new matrices for DNS data, one showing relative latency and the other showing absolute latency. These act much like the existing latency matrices, except we have to be a lot smarter about which streams we use for colouring the matrix cell. If there are any non-recursive tests, we will use the streams for those tests as these are presumably cases where we are querying an authoritative server. Otherwise, we assume we are testing a public DNS server and use the results from querying for 'google.com', as this is a name that is most likely to be cached. This will require us to always schedule a 'google.com' test for any non-authoritative servers that we test, but that's probably not a bad idea anyway.
Wrote a script to more easily update the amp-meta databases to add new targets and update mesh memberships. Used this script to completely replace the meshes on prophet to better reflect the test schedules that we are running on hosts that report to prophet.
Merged the new ampy/amp-web into the develop branch, so hopefully Brad and I will be able to push out these changes to the main website soon.
Started working on adding support for the throughput test to ampy. Hopefully all the changes I have made over the past few weeks will make this a lot easier.
Finally went ahead with the database migration on skeptic. Upgrading NNTSC, ampy and amp-web to latest development versions went relatively smoothly and the graphs on amp.wand.net.nz are now much quicker to load. Insertion speed is now our primary concern for the future, as our attempts to speed up inserts by committing less often did not produce great returns. Firstly, we run into lock limits very quickly as each table we insert into requires a separate lock until we commit the transaction. Secondly, the commits take a significant amount of time that we are not using to process new messages -- will need to look into separating message processing and database operations into separate threads.
Finished deploying the new ampy and amp-web on prophet. The modal dialog code needed to be quite heavily modified to support the new way that ampy handles getting selection options. Also found and fixed a number of old bugs relating to how we fetch and organise time series data, including the big one where a time series would appear to jump backwards. Will do a bit more testing but soon we'll be ready to deploy this on skeptic as well.
Spent Mon-Wed on Jury service.
Continued fixing problems with gcc-isms in libtrace. Added proper checks for each of the various gcc optimisations that we use in libtrace, e.g. 'pure', 'deprecated', 'unused'. Tested the changes on a variety of system and they seem to be working as expected.
Started testing the new ampy/amp-web on prophet. Found plenty of little bugs that needed fixing, but it now seems to be capable of drawing sensible graphs for most of the collections. Just a couple more to test, along with the matrix.
Finished most of the ampy reimplementation. Implemented all of the remaining collections and documented everything that I hadn't done the previous week, including the external API. Add caching for stream->view and view->groups mappings and added extra methods for querying aspects of the amp meta-data that I had forgotten about, e.g. site information and a list of available meshes.
Started re-working amp-web to use the new ampy API, tidying up a lot of the python side of amp-web as I went. In particular, I've removed a lot of web API functions that we don't use anymore and also broken the matrix handling code down into more manageable functions. Next job is to actually install and test the new ampy and amp-web.
Spent a decent chunk of time chasing down a libtrace bug on Mac OS X, which was proving difficult to replicated. Unfortunately, it turned out that I had already fixed the bug in libtrace 3.0.19 but the reporter didn't realise they were using 3.0.18 instead. Also, received a patch to the libtrace build system to try and better support compilers other than gcc (e.g. clang) which prompted me to take a closer look at some of the gcc-isms in our build process. In the process, I found that our attempts to check if -fvisibility is available was not working at all. Once I had replaced the configure check with something that works, the whole libtrace build broke because some function symbols were no longer being exported. Managed to get it all back working again late on Friday afternoon, but I'll need to make sure the new checks work properly on other systems, particularly FreeBSD 10 which only has clang by default.
Continued the ampy reimplementation. Finished writing the code for the core API (aside from any little functions that we use in amp-web that I've forgotten about) and have implemented and tested the modules for the 3 AMP collections and Smokeping. Have also been adding detailed documentation to the new libampy classes, which has taken a fair chunk of my time.
Read over a couple of draft chapters from Meena's thesis and spent a bit of time working with her on improving the order and clarity of her writing.
Fixed a libtrace bug that Nevil reported where setting the snap length was not having any effect when using the ring: format.
Started on re-implementing ampy afresh. The ampy code-base had grown rather organically since we started on the project and the structure was quite messy and difficult to work with.
The main changes so far are as follows:
* Better use of OO to minimise code duplication, especially in the collection handling code
* Top-level API is all located in one module rather than being spread across several modules
* Added a StreamManager class that handles the dictionary hierarchy for storing stream properties. Collections can now simply express their hierarchy as an ordered list, e.g. ['source', 'dest', 'packetsize', 'family']. Inserting and searching are handled by the StreamManager -- no need to write code for each collection to manage dictionaries.
* Simplified view management code that does NOT call back into the collection modules.
* Fresh implementation of the block management and caching code, which will hopefully be easier to debug.
* Removed a whole lot of redundant or unused code.
So far, I'm about half-way through the re-implementation. Most of the API is there but I've only implemented one collection thus far. Since the goal is to make it easier to add new collections to ampy, hopefully adding the rest shouldn't take too long :)
Fixed problems we were having with netevmon causing NNTSC to fill up its queues and therefore use huge amounts of memory. There were two components to this fix: the most effective change was to modify netevmon to only ask for one stream at a time (previously we asked for them all at once because this was the most efficient way to query the old database schema). The other change was to compress the pickled query result before exporting it which reduced the queue footprint and also meant we could send the data faster, meaning that the queue would drain quicker.
Fixed a bug in ampy that was preventing events from showing up on the graphs or the dashboard. We now have a fully functioning netevmon running on prophet again.
Spent a couple of days going over the AMP event ground truth I generated a few weeks back after Meena reported that there were a number of events being reported now that didn't have ground truth. This was due to the changes and improvements I had made to netevmon while working on the ground truth -- as a result, some events disappeared but there were also a few new ones that took their place. Noticed a few bugs in Meena's new eventing script while I was doing this where it was reporting incorrect stream properties, so I tracked those down for her while I was at it.
Wrote a NNTSC dataparser for the new AMP throughput test. Found a few bugs in the test itself for Brendon to solve, but both the test and the dataparser seem to be working in the most basic cases.
Had a play with Nevil's python-libtrace code and reported a few bugs and missing features back to him. Looking forward to those being fixed as it is pretty nifty otherwise.
Updated the AMP dataparser in NNTSC to process more messages in a single batch before committing. This should improve speed when working through a large message backlog, as well as save on some I/O time during normal operation. This change required some modification to the way we handle disconnects and other errors, as we now have to re-insert all the previously uncommitted messages so we can't just disconnect and retry the current message.
Tried to bring our database cursor management in line with suggested best practice, i.e. closing cursors whenever we're done with them.
Improved exporting performance by limiting frequency calculations to the first 200 rows and using a RealDictCursor rather than a DictCursor to fetch query results. The RealDictCursor means we don't need to convert results into dictionaries ourselves -- they are already in the right format so we can avoid touching most rows by simply chucking them straight into our result.
Spent some time helping Meena write a script to batch-process her event data. This should allow us to easily repeat her event grouping and significance calculations using various parameters without requiring manual intervention. Found a few bugs along the way which have now been fixed.
Was planning to work the short week between Easter and Anzac day but fell ill with a cold instead.
Finished purging the last of the SQLAlchemy code from NNTSC. Once that was working, I was able to create a new class hierarchy for our database code to reduce the amount of duplicate code and ensure that we handle error cases consistently across all query types.
Split insertion operations across two different transactions: one for stream-related operations and one for measurement results. This allows us to commit new streams and data tables without having to commit any data results, which is an important step towards better synchronisation between the database and the messages in the Rabbit queue.
Spent a lot of time tracking down and fixing various error cases that were not being caught and handled within NNTSC. A lot of this work was focused on ensuring that no data was lost or duplicated after recovering from an error or a database restart, especially given our attempts to move towards committing less often.
Migrating the prophet development database over to the new NNTSC schema on Thursday. Generally things went pretty smoothly and we are now turning our attention to migrating skeptic and the live website as soon as possible.
Updated NNTSC to include the new 'smoke' and 'smokearray' aggregation functions. Replaced all calls to get_percentile_data in ampy with calls to get_aggregate_data using the new aggregation functions. Fixed a few glitches in amp-web resulting from changes to field names due to the switch-over.
Marked the 513 libtrace assignments. Overall, the quality of submissions was very good with many students demonstrating a high-level of understanding rather than just blindly copying from examples.
Modified NNTSC to handle a rare situation where we can try to insert a stream that already exists -- this can happen if two data-inserting NNTSCs are running on the same host. Now we detect the duplicate and return the stream id of the existing stream so NNTSC can update its own stream map to include the missing stream.
Discovered that our new table-heavy database schema was using a lot of memory due to SQLAlchemy trying to maintain a dictionary mapping all of the table names to table objects. This prompted me to finally rip out the last vestiges of SQLAlchemy from NNTSC. This involved replacing all of our table creation and data insertion code with psycopg2 and explicit SQL commands constructed programatically. Unfortunately, this will delay our database migration by at least another week but it will also end up simplifying our database code somewhat.