Brendon Jones's blog
Fixed the permissions for directories and files created for keys/certs
to make sure that rabbitmq can access them. Also added exponential
backoff when trying to fetch signed certificates - hopefully a machine
that is being actively installed will query soon enough to quickly get a
new certificate, but unattended installs won't hammer the server.
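
As a rough illustration of the backoff idea (not the amplet client's actual code), the retry loop could look something like the sketch below. The function name, the delay values and the `fetch` callable are all assumptions made up for the example.

```python
import random
import time


def fetch_with_backoff(fetch, initial_delay=1.0, max_delay=3600.0,
                       max_attempts=None, sleep=time.sleep):
    """Call fetch() until it returns something other than None.

    fetch is a hypothetical callable that returns the signed certificate,
    or None if the server has not signed it yet. The wait doubles after
    every failed attempt, up to max_delay seconds.
    """
    delay = initial_delay
    attempt = 0
    while True:
        result = fetch()
        if result is not None:
            return result
        attempt += 1
        if max_attempts is not None and attempt >= max_attempts:
            return None
        # Add up to 10% jitter so clients installed at the same time
        # don't all retry in lockstep.
        sleep(delay + random.uniform(0, delay / 10))
        delay = min(delay * 2, max_delay)
```

Starting with a short delay means an attended install gets its certificate quickly once it is signed, while the cap keeps a forgotten client down to an occasional query.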
Investigated some reported issues about init scripts not performing
correctly, but could not find a fault. Also looked into two clients
that are not testing to the full list of targets - some targets just
appear to be ignored and there is no obvious reason why.
Worked with Brad to update two more amplets to Wheezy, and spent some
time trying to determine why we partially lost access to one of the few
remaining un-updated machines.
Spent some time putting together a test environment similar to how some
of the Lightwire monitors are configured, with ppp interfaces inside of
network namespaces. This allowed me to start tracking down issues with
the tcpping test that they were seeing. Firstly, the differences between
capturing on ethernet and Linux SLL/cooked interfaces weren't being
taken into account, so header offsets were incorrectly calculated.
Secondly, I spent a lot of time trying to determine why the test was not
capturing the first response packet on a ppp interface - after a lot of
digging it turns out there is a bug in libpcap to do with BPF filters on
a cooked interface that was breaking it. The bug has been fixed, but
needs a backported package to get the new library version in Debian.
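
To show why the link type matters for the header offsets mentioned above: an ethernet header is 14 bytes while a Linux SLL/cooked header is 16 bytes, so anything that assumes ethernet will look for the IP header two bytes too early on a ppp capture. A minimal sketch of picking the offset from the datalink type follows. The function names are invented for the example and it only handles IPv4 without VLAN tags.

```python
# Link type values as defined by libpcap.
DLT_EN10MB = 1       # ethernet
DLT_LINUX_SLL = 113  # Linux "cooked" capture, used on ppp and "any"

# Length in bytes of the link-layer header that precedes the IP header.
LINK_HEADER_LEN = {
    DLT_EN10MB: 14,
    DLT_LINUX_SLL: 16,
}


def ip_offset(datalink):
    """Return the offset of the IP header for a given datalink type."""
    try:
        return LINK_HEADER_LEN[datalink]
    except KeyError:
        raise ValueError("unsupported datalink type: %d" % datalink)


def tcp_offset(packet, datalink):
    """Return the offset of the TCP header in a captured IPv4 packet."""
    ip_start = ip_offset(datalink)
    version_ihl = packet[ip_start]
    if version_ihl >> 4 != 4:
        raise ValueError("not an IPv4 packet")
    # The low nibble is the IP header length in 32-bit words.
    return ip_start + (version_ihl & 0x0F) * 4
```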
Tested building and running the amplet client and all the supporting
libraries on a Raspberry Pi. I've run standalone tests (it has a newer
kernel which I thought might help debug my ppp problems) and the results
look to be sensible. Will hopefully get a chance to test general
performance while reporting results next week.
Lots more small fixes to tidy up the AMP scheduling web interface.
Updated more dropdown menus to work with the changes that Brad made to
the API, and properly set valid default meshes when using the matrix,
making sure that only meshes that are tested to are added. Put in links to the raw YAML
schedules for sites (possibly useful for debugging) and a link to an
example configuration script that will set up a client from scratch
(installing packages, configuration, etc).
Spent a morning at Lightwire doing a demo of the AMP web interface,
talking about the different data that can be collected and the ways it
can be useful. Tried to install a test client to show how that works,
but unfortunately ran into some issues with the test environment that
prevented name resolution from happening. Tracked it down to the way
that getifaddrs() describes ppp interfaces being unexpectedly different
from the ethernet interfaces we had tested on so far. Found and fixed a
heap of other smaller issues that came out of the meeting, mostly to do
with permissions and documentation.
Updated the amplet client packages to set appropriate permissions on the
key directories and to set the umask correctly so that rabbitmq is able
to use downloaded keys.
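
The general pattern here is to set the directory mode explicitly and to control the umask while key files are created. The sketch below illustrates that pattern only: the modes, the function names and the assumption that rabbitmq gets access through group ownership are mine, not taken from the packages.

```python
import os


def create_key_dir(path, mode=0o750):
    """Create a key directory that owner and group can enter.

    Assumes rabbitmq has access via the directory's group. Setting
    that group needs chown and root, so it is left out of this sketch.
    """
    os.makedirs(path, exist_ok=True)
    # chmod is not affected by the umask, so the mode is exact.
    os.chmod(path, mode)
    return path


def write_key_file(path, data, mode=0o640):
    """Write a key file readable by owner and group, but nobody else."""
    # Keep group read access while masking out everything for "other".
    old_umask = os.umask(0o027)
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, mode)
        with os.fdopen(fd, "wb") as keyfile:
            keyfile.write(data)
    finally:
        # Restore the previous umask for the rest of the process.
        os.umask(old_umask)
```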
Worked with Brad to get Andrew some useful AMP data so that he can
perform some comparison tests between the existing AMP database and the
new ones that he is investigating.
Lots of little fixes to things in the AMP scheduling interface, bringing
fetching values properly), making sure redundant information isn't being
presented, a little bit of styling. Tried to be smarter in a few places
with selecting default values so that something useful will get
displayed in the matrix.
Started looking at systemd in order to get init scripts working in
Jessie for netevmon and nntsc. The existing init scripts are able to be
used mostly as-is, though some recent changes to netevmon have meant a
few things needed tidying up.
Made some changes to the amplet client in response to things I observed
while installing test clients for the Lightwire machine. Changed the log
level of some informational messages to avoid filling logfiles,
rearranged startup to create the pidfile earlier to work better with
Puppet, and added some more smarts to guessing the ampname when one isn't
supplied. Also rearranged some directory structure to better represent
the python modules involved.
Found and fixed a few bugs in various things on the server side as well.
Values from the new dropdowns weren't being fetched appropriately in
some cases, percentage loss was sometimes calculated incorrectly and
incomplete traceroute paths weren't being stored correctly.
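
For reference, a percentage loss calculation that guards against the usual edge cases might look like the sketch below. This is only an illustration: the function name and the way edge cases are handled are assumptions, not a description of the actual bug or of the fix that went in.

```python
def percent_loss(sent, received):
    """Return packet loss as a percentage, or None if nothing was sent."""
    if sent <= 0:
        # No packets sent means loss is undefined, not 0% or 100%.
        return None
    # Duplicates could make received exceed sent, so clamp at zero.
    lost = max(sent - received, 0)
    return 100.0 * lost / sent
```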
Got the event detection systems up and running on the Lightwire machine,
which was delayed due to issues with embedded R behaving slightly
differently in the Jessie version. Also spent some time with Shane
chasing up some unusual-looking events and unusual merging of event groups.
Brad and I finished updating the last of the reachable amplets to Debian
Wheezy, which brings us up to 13 monitors all running the new code now.
Updated the website auth to allow locking down the graphs and the
configuration pages separately. Also added an option to hide/show the
terms of service on the login page. This should hopefully cover the
different public and private deployments we have.
Installed my scheduling branch on the new Lightwire server and started
working through an example client install to make sure that all the
components are operating correctly. Set up a new amplet client
configured to fetch keys, schedules etc from the server, which works
fine. Created meshes and began creating test schedules through the web
interface and fixed some missing parts - matrix metadata tables were not
being updated with mesh/test combinations, and target sites weren't being
created in some codepaths. Found some test controls that weren't
properly updating the test schedule modals which I also fixed.
Spent some more time working with Brad updating more amplets to Wheezy.
We've got the bulk of them done now, just a few more of the older (and
significantly slower!) ones to go.
Worked with Brad to update a test amplet and the first 2 production
amplets from Debian Lenny to Wheezy. Everything has gone well so far,
though some of the older machines have a lot of cruft to tidy up (some
have already been through multiple Debian upgrades in the past!).
Hopefully we can get the rest of the machines sorted over the next few
Built new 32-bit amplet2 client packages for deployment on the NZ AMP
mesh as the machines are updated. Extracted all the current
configuration from the database on erg to use as a configuration guide
while updating them.
Spent some time getting all the AMP server components
(website/events/storage/etc) installed on the new Lightwire server. This
is the first time that most of these components have been installed in
this configuration and the first time on Jessie, so the process wasn't
particularly smooth. Everything is now installed and running without any
clients, so the next step will be to see if I can configure a new client
using the new web interface.
Brad added a new select dropdown widget that includes filtering of the
option list, and I spent some time adding missing functionality to it.
Keyboard navigation should all work as expected now -
pageup/pagedown/home/end all move around the list, and tab will select
and move on to the next input element. I also integrated this with all
of the dropdowns I've added for scheduling and site management which
involved making sure all the onchange events were properly hooked up and
that they properly followed visibility changes as dynamic forms are updated.
Spent some time tidying up labels, styling, etc on the scheduling pages
to make sure they are consistent with each other, and show the right
level of information. Found and fixed a few instances where similar
fields were named differently between meshes and sites, leading to
missing data being displayed.
Spent most of my time working on input validation when editing schedules
and sites, and making sure that buttons and fields were enabled
appropriately based on user actions. Also updated some of the templates
to use the longer display names where possible (rather than short
internal names), and link them to the appropriate configuration pages.
Added permissions to the security model to allow separation between
users that are allowed to view the data and those that are allowed to
Confirmed that the data I was getting from the throughput test was the
same from both the 32-bit and 64-bit amplet clients. Initially they were
reporting quite different data, but after comparing TCP settings I
discovered that the 64-bit VM had been tuned somewhat - after applying
the same values to the 32-bit VM they now agree.
Spent most of the week continuing to work on the test scheduling web
interface. The lists of meshes and sites are now the primary entry
points, and if you click through then you have access to the metadata
about the site/mesh and the specific schedule that applies. These can
all be edited to change the names displayed in the results interface,
and schedules that are updated are made available for amplet clients to
The layout and flow are mostly settled now, though they will likely be
updated after more frequent use. I've got the base functionality working
and have started adding some of the nice features that help make sure
the right data gets added, or inform the user what is expected. Slow and