Brendon Jones's blog
Spent the week working on updating packages for all the server-side components (ampweb, ampy, nntsc, netevmon, etc). This was harder than it should have been: I had made the build dependencies list more accurate, which caused a different build system to be used, which then tried to pull in further dependencies from PyPI (ones that are unavailable as packages in some Debian flavours) and build them on a machine that doesn't have (and shouldn't have) the libraries to do so. Ended up reverting some of the changes in the interests of getting it to work, and will revisit this another time.
Made various improvements to the packaging scripts, such as tidying up how events are enabled when ampweb and netevmon are both present (regardless of the order in which they were installed), installing the new influxdb database, fixing permissions, and making sure users exist. Went through multiple test deployments on both Wheezy and Jessie to make sure that the upgrade process will work smoothly next week.
Removed the separate amplet2-client-lite packages that ran without a local rabbitmq server and updated the regular packages to determine if rabbitmq is present or not. This makes the Debian packaging a lot easier and also means that there are fewer options that the user needs to worry about getting correct.
Started to work on making the URLs used to access the scheduling web UI a lot more consistent and sensible, and as RESTful as possible. I want to be able to expose this to users so they can programmatically modify the test schedule using outside tools, as well as making the internals a lot easier to maintain.
Made some more quality of life fixes to amp-web, including setting input focus on new modal windows, updating icons to properly show the current state of expandable areas, and better configuration instructions when running under systemd.
Fixed some problems with the rsyslog configuration that meant amplet2 client logging wasn't going to the correct files in some cases (mostly in various flavours of Ubuntu). The default configuration that is sometimes present was sorting ahead of the amplet2 configuration, causing logs to be duplicated across syslog and the amplet2 log. Permissions were also sometimes an issue when an rsyslog user was expected to be performing the logging.
Spent some time working on a script to manage multiple amplet clients on a single machine, enabling/disabling them, and making sure that their rabbitmq configuration (if required) is up to date. Got most of it implemented, but there are many edge cases that can arise if the client configuration files are changed while running. Decided to put it aside for a bit while I think about how best to tackle some of these problems, preferably without having to make major changes to how the client gets configured.
Finished splitting the amplet2 client SSL configuration into its own section, which makes a lot more sense. Made sure that it is still fully backwards compatible: the old deprecated style still works if it is present, though it will log messages about the change. Also updated the default value of the RabbitMQ SSL configuration flag so that it is set automatically based on the port being used, which means one less item that needs to be set manually.
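As a rough illustration of the port-based default, here is a minimal Python sketch. It is not the amplet2 code (the client itself is not written in Python), and the function name and the `configured` parameter are made up for the example. It assumes the decision is simply "use SSL when the port is 5671", the port conventionally registered for AMQP over TLS, and that an explicitly configured value always wins.

```python
from typing import Optional

AMQPS_PORT = 5671  # conventional port for AMQP over TLS
AMQP_PORT = 5672   # conventional port for plain AMQP


def default_ssl(port: int, configured: Optional[bool] = None) -> bool:
    """Decide whether to use SSL for the RabbitMQ connection.

    Hypothetical helper: an explicit setting takes priority, otherwise
    the default is guessed from the port number.
    """
    if configured is not None:
        return configured
    return port == AMQPS_PORT
```

With this, `default_ssl(5671)` enables SSL and `default_ssl(5672)` does not, while `default_ssl(5672, configured=True)` still respects a manual override.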
Updated the NZ AMP mesh with the new client and made the appropriate changes to the HTTP test streams/meshes to correct the target names now that they are being reported in a consistent format.
Spent some time updating the scheduling web interface. Individual test instances can now be enabled/disabled, and potentially heavy tests (e.g. throughput) being run from a mesh can have their start times automatically offset from each other. Added new tooltips to parts of the scheduling interface, and improved some error messages in the backend to more accurately describe what went wrong when debugging.
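One simple way to offset start times across a mesh is to give each source a fixed gap after the previous one. The sketch below shows only that idea. The function name, the sorting by source name, and the fixed `gap` are all assumptions for the example rather than a description of how the scheduler actually picks offsets.

```python
from typing import Dict, List


def stagger_start_times(sources: List[str], base_start: int,
                        gap: int) -> Dict[str, int]:
    """Assign each source a start time (in seconds) spaced `gap` apart.

    Sources are sorted first so the assignment is stable no matter
    what order the mesh members are listed in.
    """
    return {
        name: base_start + index * gap
        for index, name in enumerate(sorted(sources))
    }
```

For example, `stagger_start_times(["b", "a", "c"], 0, 60)` starts "a" at 0, "b" at 60 and "c" at 120 seconds, so no two sources begin a heavy test at the same moment.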
Found and fixed a few bugs that were reported after installing the new amplet2 client onto the first batch of live machines. Differences in the behaviour of libcurl meant that URL fragments were being treated differently depending on what version we were running, so I updated the parser to strip all fragments before fetching. Also updated URL splitting so that URLs with query parameters but no slash after the hostname are properly split on the '?' rather than having the whole thing treated as a hostname. Fixed some timing issues where 32-bit clients were overflowing the next timeout value, by making sure that it is correctly treated as a 64-bit value.
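The URL handling described above can be sketched in a few lines of Python. This is an illustration of the splitting rule only, not the client's parser. The function name and the choice to return a `(host, path)` pair are assumptions for the example.

```python
from typing import Tuple


def split_url(url: str) -> Tuple[str, str]:
    """Split a URL into (host, path), dropping any fragment.

    The host ends at the first '/' or '?', whichever comes first, so
    "example.com?a=1" is not mistaken for one long hostname.
    """
    # Strip the fragment up front so it is never sent anywhere.
    url = url.split("#", 1)[0]

    # Drop the scheme if one was given.
    if "://" in url:
        url = url.split("://", 1)[1]

    # Find where the hostname stops.
    cut = len(url)
    for separator in "/?":
        index = url.find(separator)
        if index != -1:
            cut = min(cut, index)

    host, rest = url[:cut], url[cut:]
    # Always produce an absolute path, even when only a query was given.
    if not rest.startswith("/"):
        rest = "/" + rest
    return host, rest
```

Here `split_url("http://example.com?a=1#frag")` gives `("example.com", "/?a=1")`: the fragment is gone and the query ends up on the path side rather than in the hostname.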
Spent a bit of time trying to improve the amplet2 client configuration file. Updated the default value of "vialocal" to check if there is a rabbitmq-server, to try to just do the right thing rather than requiring it be manually set. Started trying to split off the SSL settings away from the collector settings, as they are used for a lot more than that and it was causing confusion having them in the same section.
Had a look at doing work during the Debian postinst stage, conditional on the version numbers of the package being updated to and from. Started using this with the new ampy packages to make sure that the database constraints are updated appropriately, and have other ideas for upgrade paths where it will be useful too.
Spent some time working on the test scheduling web interface to try to make the flow clearer for tests that don't require a hostname as a target (i.e. the HTTP test). For these tests the interface now hides the hostname option and displays the other options required. Still need to do some work around creating meshes of non-hostname targets.
Updated some more documentation to fill in gaps or bring it up to date after making changes. Based on feedback I changed some default values for configuration options, making some of them compulsory rather than defaulting to values that were not particularly useful.
Made various other smaller fixes - the traceroute test now uses the same library functions to receive and timestamp results as the other tests do, the Python ampsave code was tidied up slightly and made more consistent, CentOS packaging scripts were updated to build new packages, etc.
Finally got the amplet2 code up on GitHub.
Found and fixed a few small bugs that had shown up in my recent testing packages, such as one where the DNS loss timeout was set too large (causing the test to be killed if any of the targets failed to respond). Also spent some time looking at the ASN lookup errors in my logs and added further logging around that to try to track down what was causing them - so far it appears to be the server having issues, not us.
Created a GitHub repository for the amplet2 code and started adding documentation. Cleaned up the source to the manpages so that I can generate nicer markdown from them to put in the wiki. The code should hopefully be added in the very near future too; I'm just waiting to test a few things in the latest client so I can make a new release at the same time. Made a few last minute tidy-ups to the repository ahead of this, making sure no autogenerated files remain and that I've removed unrelated content (e.g. libwandevent and librabbitmq CentOS spec files).
Updated CentOS spec files for the amplet2 client and new librabbitmq. Built and deployed new CentOS and Debian Wheezy packages onto a couple of test clients to test over the weekend.
Found and fixed a bug in the test scheduling while dealing with some user queries about it. The default value for test frequency (used if not explicitly specified) had the wrong units and so caused tests to be scheduled more frequently than intended. Also fixed a couple of places where wrapping could occur, and wrote some unit tests to cover those code paths.
Removed an unused option and related code paths from the traceroute test that weren't adding a lot except some hideous looking code. While looking at this I tightened up the timers being used for sending/timing out packets to make sure that timeouts were correctly based on the oldest outstanding probe, and that probes were being sent as close to the desired rate as possible. Also added some knobs for changing how many targets are probed at once, and now randomise the initial TTL probed to try not to hammer any nearby hops quite so hard.
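To show what basing timeouts on the oldest outstanding probe means, here is a small Python sketch. It is an illustration only: the class and method names are invented, and the real test is not written in Python. It assumes probes are recorded in the order they are sent, so the head of the queue is always the oldest one and the next wakeup only ever needs to look there.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple


class ProbeQueue:
    """Track outstanding probes and work out when the next one expires."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        # (send time, target) pairs, oldest first.
        self.outstanding: Deque[Tuple[float, str]] = deque()

    def sent(self, target: str, now: float) -> None:
        """Record that a probe to `target` was sent at time `now`."""
        self.outstanding.append((now, target))

    def next_wait(self, now: float) -> Optional[float]:
        """Seconds until the oldest probe times out, or None if idle."""
        if not self.outstanding:
            return None
        sent_at, _ = self.outstanding[0]
        return max(0.0, sent_at + self.timeout - now)

    def expire(self, now: float) -> List[str]:
        """Remove and return the targets whose probes have timed out."""
        expired = []
        while self.outstanding and self.outstanding[0][0] + self.timeout <= now:
            expired.append(self.outstanding.popleft()[1])
        return expired
```

The point of the design is that the wait is recomputed from the head of the queue each time, rather than restarting a full timeout whenever any probe is sent or expires. A real implementation would also need to remove probes when their responses arrive, which is left out here.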
Removed some old and unused files from the amplet2 repository. Updated more documentation to be accurate with the current state of things.
Fixed up another couple of minor issues that had been reported. Fixed the loss timer in the tcpping test to start after the last packet is sent so that long interpacket delays are possible if desired. Tidied up a regex to match certificate filenames more accurately. Made more documentation updates. Tried to improve packaging to make sure that default configuration files were as usable as possible without manual edits.
Built new packages and pushed them out to one of our test deployments. Worked through a few issues with getting the HTTP tests and target meshes lined up properly so that they display. Still need to figure out the correct way to fix this so that users don't need to worry about this special case.
Spent some time reworking my Debian package build system after accidentally building with incorrect source due to some release candidate versioning suffixes being missed. The new system will also better deal with release specific Debian directories.
Made lots of small changes based on things that had been reported by users or that I had noticed behaving incorrectly in the last week. Fixed a cap on a retry timer that was alternating between two different values. Updated the HTTP test to always store the full URL including scheme, even if the user didn't explicitly specify it. Updated some error messages to try to be more useful and accurate.
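Storing the full URL including scheme amounts to a small normalisation step before the URL is saved. A minimal Python sketch of that step follows. The function name is made up, and defaulting to `http` when no scheme is given is an assumption for the example.

```python
import re

# A scheme is a letter followed by letters, digits, '+', '.' or '-',
# and must appear at the very start of the URL.
_SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def with_scheme(url: str, default_scheme: str = "http") -> str:
    """Return `url` with a scheme, adding `default_scheme` if missing."""
    if _SCHEME_RE.match(url):
        return url
    return f"{default_scheme}://{url}"
```

Anchoring the check at the start of the string matters: `with_scheme("example.com/?next=http://x.org")` still gets a scheme added, even though "://" appears later in the query string.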
Fixed the apache2 configuration in the amppki packages to work properly once everything is installed in the correct system locations. There were issues with the Python path being incorrect, so the libraries could not be found, as well as naming collisions with the ampweb WSGI processes.
Spent some time with Shane trying to track down the cause of some missing data in the web graphs. Found the cause of the missing DNS data (wrong column names being used) and why some sites didn't have path length data available (it's only sourced from one style of traceroute test).
Tried to expose through the web interface the ability to force the address family to use when resolving test targets. This was a bit more complicated than expected, because new targets get automatically added to the database with the various suffixes used internally to represent address families included in their names.
Put together some new server packages to test the new changes and started working through verifying that they worked, ahead of another release.