Shane Alcock's Blog
Released libtrace 3.0.20 on Monday.
Got as much of our development environment up and running again after the power outage over the weekend. There are still a few non-critical things that needed some assistance from Brad that I wasn't able to get going on Monday, but they can wait until next week when we're both here.
On leave from Tuesday for the rest of the week.
Libtrace 3.0.20 has been released today.
This release fixes several bugs that have been reported by users, adds support for LZMA compression to libwandio and adds an API function for getting the fragment offset for an IP packet.
The bugs fixed in this release are:
* Fixed broken snaplen option for ring: input.
* Fixed trace_get_source_port and trace_get_destination_port returning bogus port numbers when given a fragmented packet.
* Fixed timestamp byte ordering on big endian architectures.
* Removed assert failure if a bad compression level or method is provided when configuring an output trace. A libtrace error is raised instead.
* Fixed broken compiler feature checking in configure script. Compiler features are also detected for compilers other than gcc, e.g. clang.
* Fixed potential segfaults in OSPF libpacketdump parser if the packet is truncated midway through the OSPF header.
The OSPF bug fix unfortunately resulted in the 'len' field in the libtrace_ospf_t structure being renamed to 'ospf_len' -- if you are using libtrace to process OSPF packets, please make sure you update your code accordingly.
The full list of changes in this release can be found in the libtrace ChangeLog.
You can download the new version of libtrace from the libtrace website.
Carrying on from last week, storing a cache entry per stream turned out to be a bad idea. Some matrix meshes consist of 100s of streams so we spend a lot of time looking up cache entries. As a result, I rewrote the caching code to store one dictionary per collection, mapping stream ids to tuples containing the timestamps. This gets looked up once per query, so only one cache operation is required to generate a matrix.
Updating the cache when we have to query for missing values is a bit annoying, as we cannot simply update the dictionary and put it back in the cache once the query is complete as the data inserting process may have updated other cache entries with new 'most recent data' timestamps while we were fulfilling our query. Instead, we have to re-fetch the dictionary, update the one stream we're changing and then immediately store the dictionary again.
Updated ampy to no longer keep track of active streams and removed support for ACTIVE_STREAMS queries from the NNTSC protocol.
Merged Perry's lzma support into libwandio. Started working towards a new libtrace release -- managed to build and pass all tests on our various development boxes so should be able to push out a release next week.
Spent a day reading over Meenakshee's thesis. Suggested a series of mostly minor edits and changes but overall it is looking pretty good.
Found and fixed a large memory leak in netevmon that had caused prophet to run out of memory over the weekend. The problem was that I was allocated space for storing IPv6 address strings in the Traceroute detector but not freeing it properly if the address was already in our LRU. Also took the opportunity to make our memory use for Traceroute much more efficient, i.e. having a global hop LRU across all traceroute streams rather than one per stream which was leading to a lot of duplication.
Started looking into our insertion speed problems. One obvious source of slowdowns is the UPDATE that we use to remember when we last inserted data for a stream. This update is being called once per measurement interval for each collection and becomes quite onerous when the streams table gets very large. Implemented a solution where the first and last insertion for each stream is stored in memcache instead of the database. If there is no entry in memcache when a query comes in for the stream, we can query the data table for that stream for min and max timestamp instead, although this is a slightly expensive operation.
Once I had that working, I removed the 'streams' table from NNTSC entirely as it was no longer needed (each collection has its own stream table with specific details about each stream; the streams table was mainly for storing common properties across all collections like lasttimestamp). This meant I had to remove or change all references in the NNTSC database code to the streams table but was otherwise straightforward.
Spent Friday fixing a bug in libtrace where trace_get_source_port and trace_get_destination_port would return bogus values if called on fragmented packets. Added a new API function for getting the fragment offset and more fragment flag from a packet. I needed this anyway for fixing the bug and given the amount of bit-shifting, masking, multiplying and header parsing (for v6) involved, it would probably be useful to other people as well.
Short week last week, after being sick on Monday and recovering from a spot of minor surgery on Thursday and Friday.
Finished adding support for the amp-throughput test to amp-web, so we can now browse and view graphs for amp-throughput data. Once again, some fiddling with the modal code was required to ensure the modal dialog for the new collection supported all the required selection options.
Lines for basic time series graphs (i.e. non-smokeping style graphs) will now highlight and show tooltips if moused over on the detail graph, just like the smoke graphs do. This was an annoying inconsistency that had been ignored because all of the existing amp collections before now used the smoke line style. Also fixed the hit detection code for the highlighting so that it would work if a vertical line segment was moused over -- previously we were only matching on the horizontal segments.
Reworked how aggregation binsizes are calculated for the graphs. There is now a fixed set of aggregation levels that can be chosen, based on the time period being shown on the graph. This means that we should hit cached data a lot more often rather than choosing a new binsize every few zoom levels. Increased the minimum binsize to 300 seconds for all non-amp graphs and 60 seconds for amp graphs. This will help avoid problems where the binsize was smaller than the measurement frequency, resulting in empty bins that we had to recognise were not gaps in the data.
Added new matrices for DNS data, one showing relative latency and the other showing absolute latency. These act much like the existing latency matrices, except we have to be a lot smarter about which streams we use for colouring the matrix cell. If there are any non-recursive tests, we will use the streams for those tests as these are presumably cases where we are querying an authoritative server. Otherwise, we assume we are testing a public DNS server and use the results from querying for 'google.com', as this is a name that is most likely to be cached. This will require us to always schedule a 'google.com' test for any non-authoritative servers that we test, but that's probably not a bad idea anyway.
Wrote a script to more easily update the amp-meta databases to add new targets and update mesh memberships. Used this script to completely replace the meshes on prophet to better reflect the test schedules that we are running on hosts that report to prophet.
Merged the new ampy/amp-web into the develop branch, so hopefully Brad and I will be able to push out these changes to the main website soon.
Started working on adding support for the throughput test to ampy. Hopefully all the changes I have made over the past few weeks will make this a lot easier.
Finally went ahead with the database migration on skeptic. Upgrading NNTSC, ampy and amp-web to latest development versions went relatively smoothly and the graphs on amp.wand.net.nz are now much quicker to load. Insertion speed is now our primary concern for the future, as our attempts to speed up inserts by committing less often did not produce great returns. Firstly, we run into lock limits very quickly as each table we insert into requires a separate lock until we commit the transaction. Secondly, the commits take a significant amount of time that we are not using to process new messages -- will need to look into separating message processing and database operations into separate threads.
Finished deploying the new ampy and amp-web on prophet. The modal dialog code needed to be quite heavily modified to support the new way that ampy handles getting selection options. Also found and fixed a number of old bugs relating to how we fetch and organise time series data, including the big one where a time series would appear to jump backwards. Will do a bit more testing but soon we'll be ready to deploy this on skeptic as well.
Spent Mon-Wed on Jury service.
Continued fixing problems with gcc-isms in libtrace. Added proper checks for each of the various gcc optimisations that we use in libtrace, e.g. 'pure', 'deprecated', 'unused'. Tested the changes on a variety of system and they seem to be working as expected.
Started testing the new ampy/amp-web on prophet. Found plenty of little bugs that needed fixing, but it now seems to be capable of drawing sensible graphs for most of the collections. Just a couple more to test, along with the matrix.
Finished most of the ampy reimplementation. Implemented all of the remaining collections and documented everything that I hadn't done the previous week, including the external API. Add caching for stream->view and view->groups mappings and added extra methods for querying aspects of the amp meta-data that I had forgotten about, e.g. site information and a list of available meshes.
Started re-working amp-web to use the new ampy API, tidying up a lot of the python side of amp-web as I went. In particular, I've removed a lot of web API functions that we don't use anymore and also broken the matrix handling code down into more manageable functions. Next job is to actually install and test the new ampy and amp-web.
Spent a decent chunk of time chasing down a libtrace bug on Mac OS X, which was proving difficult to replicated. Unfortunately, it turned out that I had already fixed the bug in libtrace 3.0.19 but the reporter didn't realise they were using 3.0.18 instead. Also, received a patch to the libtrace build system to try and better support compilers other than gcc (e.g. clang) which prompted me to take a closer look at some of the gcc-isms in our build process. In the process, I found that our attempts to check if -fvisibility is available was not working at all. Once I had replaced the configure check with something that works, the whole libtrace build broke because some function symbols were no longer being exported. Managed to get it all back working again late on Friday afternoon, but I'll need to make sure the new checks work properly on other systems, particularly FreeBSD 10 which only has clang by default.
Continued the ampy reimplementation. Finished writing the code for the core API (aside from any little functions that we use in amp-web that I've forgotten about) and have implemented and tested the modules for the 3 AMP collections and Smokeping. Have also been adding detailed documentation to the new libampy classes, which has taken a fair chunk of my time.
Read over a couple of draft chapters from Meena's thesis and spent a bit of time working with her on improving the order and clarity of her writing.
Fixed a libtrace bug that Nevil reported where setting the snap length was not having any effect when using the ring: format.