Brendon Jones's blog

04 Sep 2017

Spent most of the week working with the TCP throughput test, investigating what I can actually do with the new information and integrating it into the test. Retransmit counters, RTT, etc. are easy to extract and explain. I also get information about time spent blocked due to the receive window on the remote end, or the send buffer on the near end of the connection, but I'm not convinced about how accurate these are (or I'm not understanding correctly what they mean): drastically limiting my send buffer size only sometimes reports any time spent limited by the send buffer, and querying when I know for certain that there is no outstanding data doesn't always report the connection as application limited. It's a starting point at least, so I'll keep looking at the data and see what can be done about it.
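One way to make the blocked-time counters easier to sanity check is to express each as a fraction of the time the connection was busy sending. This is only a sketch: it assumes the three inputs are the microsecond counters `tcpi_busy_time`, `tcpi_rwnd_limited` and `tcpi_sndbuf_limited` from `struct tcp_info`, and that the busy counter also covers the limited periods. The function name and the output keys are hypothetical and not part of the throughput test.

```python
def limit_breakdown(busy_us, rwnd_limited_us, sndbuf_limited_us):
    """Split busy time into fractions by what was limiting the sender.

    Inputs are assumed to be microsecond counters from tcp_info.
    Returns None if the connection has not been busy at all.
    """
    if busy_us <= 0:
        return None
    rwnd = rwnd_limited_us / busy_us
    sndbuf = sndbuf_limited_us / busy_us
    # Whatever is left was spent busy but not limited by either buffer
    # (assumption: busy time includes both limited periods).
    other = max(0.0, 1.0 - rwnd - sndbuf)
    return {"rwnd_limited": rwnd, "sndbuf_limited": sndbuf, "other": other}
```

Sampling the counters before and after a test run and feeding in the differences would give a per-test breakdown rather than one for the lifetime of the connection.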

Kept working on making the BGP disaggregated router more resilient. Implemented a few new messages to communicate the state of peers between different processes within a single instance so that they can be compared with other instances. Got some simple logic working that will disable any instance that is known to have an incomplete view of the peers.
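As a rough illustration of the "disable any instance with an incomplete view" logic, the sketch below compares the peer sets reported by each instance. It assumes the reference view is simply the union of every peer that any instance reports as established; the source does not say how completeness is actually decided, and the function and instance names are hypothetical.

```python
def instances_to_disable(peer_views):
    """Return the instances whose view of the peers is incomplete.

    peer_views maps an instance name to the set of peers that instance
    reports as established. The union of all views is treated as the
    complete picture (an assumption made for this sketch).
    """
    all_peers = set().union(*peer_views.values())
    return sorted(name for name, peers in peer_views.items()
                  if peers != all_peers)
```

For example, with views `{"a": {"p1", "p2"}, "b": {"p1"}}` only instance `b` would be disabled, since it is missing peer `p2`.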

04 Sep 2017

Wrote up a sample program to test out some heartbeat ideas, and some fairly simple ones look like they should work OK while sharing minimal state between instances of the routing engine. Started to implement that inside the actual BGP disaggregated router code to see how it will behave with real data. Started setting up my test environment to allow multiple instances to be running and connected to the same BIRD process so that they all get the same routes.
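A heartbeat scheme that shares very little state can be as small as "remember when each instance was last heard from". The sketch below is not the sample program described above, just one minimal form such a thing could take; the class name, the timeout value and the `now` parameter (there to make it testable) are all hypothetical.

```python
import time


class HeartbeatTracker:
    """Track the last heartbeat seen from each instance."""

    def __init__(self, timeout=3.0):
        self.timeout = timeout  # seconds of silence before presumed dead
        self.last_seen = {}

    def beat(self, instance, now=None):
        """Record a heartbeat from an instance."""
        self.last_seen[instance] = time.monotonic() if now is None else now

    def alive(self, now=None):
        """Return the instances heard from within the timeout."""
        now = time.monotonic() if now is None else now
        return sorted(name for name, seen in self.last_seen.items()
                      if now - seen <= self.timeout)
```

The only state shared between instances here is the heartbeat message itself; each instance keeps its own table of timestamps.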

Had another look at the new TCP info available to processes now that I'm running a 4.10 kernel. My userspace hasn't been updated, but all the new information is available to me if I use the updated version of the struct. Looks like I should be able to get timing information about what is causing send to block (which end is at fault), as well as retransmit counts, RTT, etc. that can be used to try to determine why the throughput test reported a particular result.
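Using the updated struct without updated userspace headers amounts to asking the kernel for a larger buffer and decoding it by hand. The sketch below does that from Python. The field layout is an assumption based on a reading of `linux/tcp.h` around 4.10 (192 bytes, ending with `tcpi_busy_time`, `tcpi_rwnd_limited` and `tcpi_sndbuf_limited`) and should be checked against the headers for the kernel in use; the function names and dictionary keys are hypothetical.

```python
import socket
import struct

# Assumed layout: 8 one-byte fields, 24 u32 fields, 4 u64, 6 u32, then
# u64 delivery_rate, busy_time, rwnd_limited and sndbuf_limited.
TCP_INFO_FMT = "=8B24I4Q6I4Q"
TCP_INFO_LEN = struct.calcsize(TCP_INFO_FMT)


def parse_tcp_info(raw):
    """Decode selected tcp_info fields, or None if the buffer is too short
    (for example, a kernel that predates the newer fields)."""
    if len(raw) < TCP_INFO_LEN:
        return None
    fields = struct.unpack(TCP_INFO_FMT, raw[:TCP_INFO_LEN])
    return {
        "rtt_us": fields[23],             # tcpi_rtt
        "total_retrans": fields[31],      # tcpi_total_retrans
        "busy_us": fields[43],            # tcpi_busy_time
        "rwnd_limited_us": fields[44],    # tcpi_rwnd_limited
        "sndbuf_limited_us": fields[45],  # tcpi_sndbuf_limited
    }


def read_tcp_info(sock):
    """Query TCP_INFO on a connected TCP socket (Linux only)."""
    raw = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_INFO, TCP_INFO_LEN)
    return parse_tcp_info(raw)
```

An older kernel returns fewer bytes than requested, which is why the parser checks the length rather than assuming the newer fields are present.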

04 Sep 2017

Implemented basic route refresh functionality in the BGP disaggregated router, and wrestled with exabgp to find out how to pass through the messages I required to do so. Also spent some time chasing down what looked like bugs in the topology module, but which turned out to be a broken data file that didn't correctly describe the layout of the network.

Had another attempt at getting my Chromium/YouTube test working. It works fine when I build it within the Chromium source tree alongside their example headless applications, but otherwise fails. It appears to be linking against a lot of object files deep inside Chromium, as well as the headless static library (which I thought should contain everything needed to build a headless application?), as well as all the normal shared libraries. Back into the too-hard basket until they sort their stuff out or I have some more time to push through this.

Spent a lot of time reading about different approaches to HA/resilience, what sort of information nodes often pass around and how they go about sharing state (or avoiding sharing state).

04 Sep 2017

Finished up writing and testing the new address family selection options in AMP and making sure that all of the tests work properly when they are set. Changing the way the config files worked to allow globally setting options (but able to be overridden at test level) meant there were a few more edge cases than anticipated.
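The "set globally, override per test" behaviour can be pictured as a layered merge, where the most specific setting wins. This is only an illustration of that idea: the option names `ipv4` and `ipv6`, the defaults and the function name are hypothetical and are not AMP's actual configuration keys.

```python
# Hypothetical built-in defaults: both address families enabled.
DEFAULTS = {"ipv4": True, "ipv6": True}


def effective_options(global_opts, test_opts):
    """Merge options so test-level settings override global ones,
    and global ones override the built-in defaults."""
    merged = dict(DEFAULTS)
    merged.update(global_opts)
    merged.update(test_opts)
    return merged
```

One edge case this makes visible is a test re-enabling a family that was disabled globally: `effective_options({"ipv6": False}, {"ipv6": True})` leaves both families enabled for that test only.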

Started thinking and writing about how we might go about making the BGP disaggregated router more resilient, and what situations may arise that it will need to deal with.

04 Sep 2017

Spent some more time trying to fix the Chromium/YouTube test to get around segfaulting when JavaScript is present on a page, without any success yet.

Started work on improving manual IPv4/IPv6 address selection in AMP tests. Previously you could specify addresses to use for each address family, but generally couldn't control which family would be used (that was up to DNS). The plan is now to allow enabling/disabling individual address families globally or per test, and not require that a particular address be specified (but still allowing it if desired).
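To illustrate what enabling or disabling a family means for a test's targets, the sketch below filters a list of already-resolved addresses by family. It is a stand-in written with the Python standard library, not how the AMP tests implement it, and the function and parameter names are hypothetical.

```python
import ipaddress


def select_addresses(candidates, ipv4=True, ipv6=True):
    """Keep only the addresses whose family is enabled, preserving order."""
    allowed = set()
    if ipv4:
        allowed.add(4)
    if ipv6:
        allowed.add(6)
    return [addr for addr in candidates
            if ipaddress.ip_address(addr).version in allowed]
```

With both families enabled the list passes through unchanged, so which address gets used is still up to DNS ordering, as it was before.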

08 Aug 2017

Spent some time thinking and reading about ways to improve resilience and reliability of the BGP project, and how multiple instances of it could interact. Wrote up some failure cases and some ideas around how they can be dealt with.

Had another look at the tcp_info struct and how much of the web10g style information has made its way into the kernel. There is almost enough there now to do interesting things around reporting possible causes of slowness in throughput tests, though no probes in any of our deployments are running new enough kernels to do so. Will put this aside for a little while until kernels are updated or I get some ideas how to use the information that is available now.

Picked up the Chrome/YouTube test again now that headless mode is part of the main branch and included in the current Ubuntu packages. I have my test program building and running using the libraries in the package, though I still have to link against a static headless Chromium library that doesn't get distributed. The test runs, but crashes if there is JavaScript on the webpage that it loads. Unsure so far what I'm missing that could be causing this.

24 Jul 2017

Spent some time working towards releasing a new version of the amplet2-client packages, which has been stalled for a few months. Tidied up the CentOS build to remove the -lite packages and ported the Debian init/postinst scripts. Merged some outstanding changes and tightened up some dependency checking. Just need to confirm that my transitional packages work correctly and then I think it's ready for release.

Had a look into some edge cases of the amp-web code to support an outside deployment that was seeing unusual behaviour. Found a few areas that need to be cleaned up to support viewing different data in the web interface - we don't use certain combinations of test options in our own deployment so viewing them hasn't been a priority.

Had a short chat with Florin about the BGP project that he is starting to work on this week. Got him up to speed with how it all works and how it fits together.

24 Jul 2017

Had a quick look at moving the BGP code from python2.7 to python3, and it doesn't look like it will require much work. Didn't go ahead though, as moving to python3 appears to also require moving to exabgp4 which is still rather poorly documented. Probably still worth a look, sooner rather than later.

Spent some time hardening prefix creation to deal with incorrectly formed prefixes, and wrote some unit tests to make sure that it behaves correctly. Also updated unit tests for filters to bring them up to date with recent changes. Wrote some brief overview documentation to describe how it all fits together ahead of Florin starting work on this next week.
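The kind of checking involved in hardening prefix creation can be sketched with Python's standard `ipaddress` module (Python 3). This is an assumption-laden stand-in: the source does not say how the BGP code represents prefixes, and the function name `make_prefix` is hypothetical.

```python
import ipaddress


def make_prefix(text):
    """Return an ip_network for a well-formed prefix, or None otherwise.

    strict=True rejects prefixes with host bits set (e.g. 10.0.0.1/8),
    and ip_network itself rejects bad addresses and bad mask lengths.
    """
    try:
        return ipaddress.ip_network(text.strip(), strict=True)
    except (ValueError, AttributeError):
        # ValueError: malformed prefix; AttributeError: input not a string.
        return None
```

Unit tests for something like this would mostly be a table of bad inputs (host bits set, mask too long, garbage strings, non-string values) alongside a few good IPv4 and IPv6 prefixes.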

10 Jul 2017

Created configuration within my BGP program to peer with routers both internal and external to my test network, and to filter/distribute routes between them. I had to slightly change the way that the network topology was described to better account for loopback addresses and routed interfaces (my earlier networks were just graphs of node names), and update the nexthop calculations to take this into account (especially in the case where my BGP speaker was not directly connected to the external peer). Routes can now be sent around my test network, and filtered or modified as they are received from or sent to peers.

Pushed out some updates to the ampweb packages to fix some reported bugs.

10 Jul 2017

Brad set up a physical network using the lab Junipers for me to test my BGP code on, and I made a start working with it. Configured interfaces and set up multihop BGP to get the external peers talking to my controller running a few hops inside the network.

Spent most of the week working on the performance report.