Brendon Jones's blog

15

Dec

2017

Built and released new Debian and Ubuntu packages for amplet2-client, ampy, and ampweb.

Found and fixed a few issues in the netevmon email filtering that were caused by the incorrect types being used to make comparisons. Built and installed new packages in one deployment for testing.

Started work tidying up the C modules recently created for some of the more memory hungry parts of the BGP router. Was able to simplify them in a few places, reorganising the code so that custom code could be replaced with existing library functions, and reducing the amount of memory required a little further.

09

Nov

2017

Spent some time working on the packaging scripts to upgrade ampy/ampweb/netevmon to the newest version, including moving some database tables around, populating new tables and dealing with the debconf answers during install. Installed these multiple times on a few different Debian flavours while trying to make sure that they all work.

Found and fixed a few issues in the ampweb matrix that were preventing the udpstream data from displaying properly. There are still a few issues here around udpstream data being used to generate latency graphs, but it's getting closer to working in every case.

Continued to work on tidying up some of the BGP router code.

09

Nov

2017

Found and fixed a bug where the udpstream test would not receive all the reflected packets used for RTT calculations because it gave up listening after waiting for only one inter-packet gap. It now waits until all the packets have been received or the global loss timeout (multiple seconds) is reached. Also found and fixed the problem that led to discovering this - the web interface was asking for the inter-packet gap in milliseconds and then treating the number entered as if it was in microseconds, leading to a gap 1000 times smaller than expected.
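A minimal sketch of the two fixes, in Python for illustration only (it is not the actual test code). The names `gap_ms_to_us`, `collect_replies` and `recv_one` are hypothetical, and the receive function is passed in so the example runs without a socket.

```python
import time


def gap_ms_to_us(gap_ms):
    """Convert an inter-packet gap entered in milliseconds to microseconds."""
    return int(gap_ms * 1000)


def collect_replies(recv_one, expected, loss_timeout, clock=time.monotonic):
    """Collect reflected packets until all arrive or the loss timeout expires.

    recv_one(remaining) is a hypothetical callable that returns the sequence
    number of one received packet, or None if nothing arrived within
    `remaining` seconds.
    """
    deadline = clock() + loss_timeout
    seen = set()
    while len(seen) < expected:
        remaining = deadline - clock()
        if remaining <= 0:
            # Global loss timeout reached: anything still missing is lost.
            break
        seq = recv_one(remaining)
        if seq is not None:
            seen.add(seq)
    return seen
```

The point of the loop is that the wait is bounded by the overall loss timeout rather than by a single inter-packet gap, so a late reply is still counted.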

Added the ability for an ampweb user to change their own details through the web interface without requiring admin privileges. Spent some time testing that the new permissions model works correctly and that users are limited appropriately.

Started work on adding debconf support to the ampy package as a simple way to ensure there is a usable user right from the start without hardcoding one.

24

Oct

2017

Updated the AMP user UI to allow users to view/modify their own details, which required changing the way permissions were tested in a few places to properly control access. Tidied up some of the modal dialogs to properly update the different parts of the form in response to user input, hopefully making it easy to see what needs to be fixed/completed before the form can be submitted.

Spent some time tidying up the source for the BGP router and trying to make sure that the style is consistent across all the source files. Also ran some static analysis/lint tools over the source to help make sure we are being sensible.

24

Oct

2017

Expanded the new AMP user management interface to allow different roles to be given to each user, splitting the ability to view configuration from being able to edit it. Added and tested all the backend parts required to make the user management work - add/remove/modify/etc users. Updated each of the front end components to expect the correct level of permissions.

Continued to work on the BGP design document taking some feedback into account.

Started organising the BGP code into a more sensible looking module with tests and code in standard locations. Spent a lot of time getting the test setup working properly when run from setup.py. It appears that the default setuptools test loader wants to treat every single file as a test rather than just those that match the documented filter, so this had to be changed to exclude non-test files. This was also complicated by the fact that the majority of the code is python3 only but some tests need supporting elements that only run in python2, and so parts need to be skipped based on the version of python being used.
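A minimal sketch of skipping tests based on the interpreter version, using only the standard `unittest` module. The test class and the reason string are made up for illustration and are not taken from the router's test suite.

```python
import sys
import unittest


class ExampleRouterTest(unittest.TestCase):
    def test_runs_everywhere(self):
        self.assertEqual(1 + 1, 2)

    # Assumption: this test relies on a supporting element that only
    # exists for python2, so it is skipped on python3.
    @unittest.skipIf(sys.version_info[0] >= 3, "needs a python2-only helper")
    def test_needs_python2_helper(self):
        self.assertTrue(True)


def load_suite():
    """Build the suite explicitly rather than loading every file as a test."""
    return unittest.defaultTestLoader.loadTestsFromTestCase(ExampleRouterTest)


if __name__ == "__main__":
    unittest.TextTestRunner().run(load_suite())
```

Building the suite from named test cases (or using the loader's `discover()` with a filename pattern) is one way to keep non-test files out of the run.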

24

Oct

2017

Spent most of this week working on generating email events from AMP data. Moved the user filters (configurable via the website) into the eventing database so that they can be loaded by the eventing processes and filter events in realtime. The aim is to have email alerts triggered by sensible filters in the backend rather than triggering on any group that crosses a threshold in size. The web front end does a lot of work trying to improve the quality of the event groups that I've not replicated, but host filtering and event types will now be taken into account.
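A rough sketch of what filtering on hosts and event types could look like. The `UserFilter` fields and the event dictionary keys are assumptions made for this example, not the real eventing database schema.

```python
from dataclasses import dataclass, field


@dataclass
class UserFilter:
    email: str
    hosts: set = field(default_factory=set)        # empty set means any host
    event_types: set = field(default_factory=set)  # empty set means any type


def recipients(event, filters):
    """Return the email addresses whose filters accept this event."""
    matched = []
    for f in filters:
        host_ok = (not f.hosts
                   or event["source"] in f.hosts
                   or event["target"] in f.hosts)
        type_ok = not f.event_types or event["type"] in f.event_types
        if host_ok and type_ok:
            matched.append(f.email)
    return matched
```

For example, a filter restricted to one host and the hypothetical event type "latency" would ignore a loss event between two unrelated hosts, while a filter with no restrictions would receive both.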

Added a basic web UI to configure users so that they can have email addresses associated with them for alerting.

24

Oct

2017

Started looking into prometheus as an option for extracting useful statistics out of the BGP router. It's pretty simple to get working, though doing monitoring across multiple processes isn't as clean as I would like. Added tracking of simple route/prefix statistics to a test branch, and had a think about ways to get more detailed routing information (such as a looking glass might present) without interfering with updating routes.

Updated the pregenerated/prepackaged martians filter to include IPv6 martians.

Investigated an issue with some AMP graphs that prevents the interactive sliders from working on certain browser/OS combinations (mostly Windows). Hard to replicate. Looked like it might be an issue with out of date javascript libraries, but updating them didn't fix it. Need to find an easier way to replicate it to get any further I expect.

Spent a bit more time working on the design document from last week.

28

Sep

2017

Rewrote the prefix filtering code to use a radix trie rather than the naive list I started with. It was an interesting challenge to allow a rule to match a range of prefixes that might not be even close to the prefix specified in the rule itself. The new filtering is approximately 300 times faster than the old, as well as using much less memory.
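A simplified sketch of the idea, using an uncompressed binary trie (one bit per level) rather than a true path-compressed radix trie. The class name and the `min_len`/`max_len` parameters are assumptions for illustration: a rule matches any route that falls inside the rule's prefix and has a prefix length within the given range.

```python
import ipaddress


class PrefixTrie:
    def __init__(self):
        # One trie per address family; each node is a dict that may hold
        # children under "0" and "1" and a list of ranges under "rules".
        self.roots = {4: {}, 6: {}}

    @staticmethod
    def _bits(net):
        """Return the network bits of a prefix as a string of 0s and 1s."""
        width = net.max_prefixlen
        value = int(net.network_address)
        return format(value, "0{}b".format(width))[:net.prefixlen]

    def add_rule(self, prefix, min_len=None, max_len=None):
        net = ipaddress.ip_network(prefix)
        lo = net.prefixlen if min_len is None else min_len
        hi = net.prefixlen if max_len is None else max_len
        node = self.roots[net.version]
        for bit in self._bits(net):
            node = node.setdefault(bit, {})
        node.setdefault("rules", []).append((lo, hi))

    def matches(self, prefix):
        """True if any rule covers this prefix with an allowed length."""
        net = ipaddress.ip_network(prefix)
        node = self.roots[net.version]
        bits = self._bits(net)
        depth = 0
        while True:
            # Every node on the path is a rule prefix covering the route.
            for lo, hi in node.get("rules", ()):
                if lo <= net.prefixlen <= hi:
                    return True
            if depth == len(bits):
                return False
            node = node.get(bits[depth])
            if node is None:
                return False
            depth += 1
```

A lookup walks at most one node per bit of the route's prefix, regardless of how many rules are loaded, which is where the gain over scanning a list comes from.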

Changed the format of messages passed between peer and table processes in the BGP router to allow tables to also talk to other tables. This allows filters to be applied in stages, or export routes to peers at different stages. If desired, work specific to groups of peers can all be performed at once by a single table rather than multiple times by each individual peer.

Started to write up some basic design documentation describing the current state of the BGP router, what it is capable of, and how it all fits together.

28

Sep

2017

Spent some time chasing down issues in my BIRD configuration in my BGP resiliency testbed that meant routes were being shared inappropriately between peers (missing filters, which wouldn't have been a problem except I'm also messing with settings allowing the local AS to appear). Added further peers and edge devices in different configurations to make sure that they are all properly isolated. Everything looks to be working pretty well, and enabling/disabling specific peering sessions causes the appropriate route updates. Adding a second controller to the test correctly keeps the best routes available even when one of them is unavailable.

Noticed that as the number of peers increased, the number of full route recalculations was getting large, so tried to remove some unnecessary causes of updates being sent. Often we already had enough information in a peer process to do the work without asking the table to do work as well (and possibly triggering it to send unnecessary updates to other peers). Also added a very short dampening period to updates so that many consecutive messages in a short time period cause only a single route recalculation to occur.
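A minimal sketch of the dampening idea, assuming a polling loop that supplies the current time. The class and method names are hypothetical and the real router may coalesce updates differently.

```python
class UpdateDampener:
    """Coalesce bursts of updates into a single recalculation."""

    def __init__(self, period, recalculate):
        self.period = period            # dampening period in seconds
        self.recalculate = recalculate  # callback doing the expensive work
        self.deadline = None            # None means no update is pending

    def notify(self, now):
        """Record an update; open a dampening window if none is open."""
        if self.deadline is None:
            self.deadline = now + self.period

    def poll(self, now):
        """Run one recalculation once the window has expired."""
        if self.deadline is not None and now >= self.deadline:
            self.deadline = None
            self.recalculate()
            return True
        return False
```

Any number of `notify()` calls that arrive inside the window result in a single call to `recalculate`, at the cost of delaying the first update by at most one period.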

Fixed a few bugs that would allow saved routes to be modified by filters, meaning the next time these routes ran through filters the results would be cumulative. Hopefully the saved raw/original routes and filtered routes for distribution are now quite separate.
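A small sketch of this class of bug and one way to avoid it: filter a copy, never the saved route. The route layout (a dict with an `as_path` list) and the AS number 64512 are assumptions chosen for the example.

```python
import copy


def prepend_as(route):
    """Example filter that modifies the route it is given in place."""
    route["as_path"].insert(0, 64512)
    return route


def apply_filters(saved_route, filters):
    """Run filters over a deep copy so the saved route stays untouched."""
    working = copy.deepcopy(saved_route)
    for f in filters:
        working = f(working)
        if working is None:
            # A filter rejected the route.
            return None
    return working
```

Without the copy, running the same saved route through `prepend_as` twice would prepend the AS twice, which is the cumulative behaviour described above.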

28

Sep

2017

Had a very interesting chat with Perry about the BGP router project, and how I was going about trying to make it more resilient. He suggested a few ways to go about it that were much simpler than what I was planning, and that also did away with the nastiness of shared state between the redundant controllers. Each controller can independently do its own thing and use BGP route selection on the managed devices to settle any differences arising.

Started setting up a test environment so that I can trial the changes made to help make the BGP router more resilient to failures. It's currently a simple network using docker containers running BIRD to act as my peers/routers, with another couple running redundant instances of my code. Most of the work so far has gone into getting my edge devices running BIRD to do the right thing with the routes, importing and exporting using the correct tables to make sure they don't get inadvertently modified or shared at the wrong location.

Updated the router to better track which peers are in an active state, and to add communities to exported routes when peers are missing in order to flag the degraded state to the recipient.
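A minimal sketch of flagging degraded state with a community. The function name, the peer names and the community value (64512, 666) are all made up for illustration; the real router's choice of community is not stated here.

```python
def export_communities(base, expected_peers, active_peers,
                       degraded=(64512, 666)):
    """Return the communities to attach to an exported route.

    The hypothetical `degraded` community is added when any expected
    peer is not currently in an active state.
    """
    communities = list(base)
    missing = set(expected_peers) - set(active_peers)
    if missing and degraded not in communities:
        communities.append(degraded)
    return communities
```

A recipient can then match on that community to lower its preference for routes from a controller that cannot currently see all of its peers.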