Spent some time putting together a test environment similar to how some
of the Lightwire monitors are configured, with ppp interfaces inside of
network namespaces. This allowed me to start tracking down issues with
the tcpping test that they were seeing. Firstly, the differences between
capturing on Ethernet and Linux SLL/cooked interfaces weren't being
taken into account, so header offsets were incorrectly calculated.
Secondly, I spent a lot of time trying to determine why the test was not
capturing the first response packet on a ppp interface. After a lot of
digging it turns out there is a bug in libpcap to do with BPF filters on
a cooked interface that was breaking it. The bug has been fixed, but a
backported package is needed to get the new library version in Debian.
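As a rough illustration of the header offset issue, the sketch below picks the link-layer header length from the capture's link type. It is written in Python for brevity and is not the amplet code. The DLT values and header sizes are the standard libpcap ones; the function name link_header is hypothetical, and VLAN tags are ignored.

```python
import struct

# Link-layer type values as defined by libpcap.
DLT_EN10MB = 1       # Ethernet
DLT_LINUX_SLL = 113  # Linux "cooked" capture, as seen on ppp interfaces


def link_header(datalink, packet):
    """Return (header_length, protocol) for a captured frame.

    Hypothetical helper: the network-layer header starts at
    packet[header_length:].
    """
    if datalink == DLT_EN10MB:
        # dst MAC (6) + src MAC (6) + ethertype (2) = 14 bytes
        (protocol,) = struct.unpack_from("!H", packet, 12)
        return 14, protocol
    if datalink == DLT_LINUX_SLL:
        # packet type (2) + ARPHRD type (2) + address length (2)
        # + address (8) + protocol (2) = 16 bytes
        (protocol,) = struct.unpack_from("!H", packet, 14)
        return 16, protocol
    raise ValueError("unsupported link type: %d" % datalink)
```

Using the Ethernet offset on a cooked capture would read the IP and TCP headers two bytes early, which is the kind of miscalculation described above.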
Tested building and running the amplet client and all the supporting
libraries on a Raspberry Pi. I've run standalone tests (it has a newer
kernel which I thought might help debug my ppp problems) and the results
look to be sensible. Will hopefully get a chance to test general
performance while reporting results next week.
Tracked down a rare bug where packets from the previous packet-out test were still being transmitted after that test completed, presumably due to some buffering on the switch. This caused some tests to segfault and fail to produce results.
Continued to work on drawing graphs from my resulting data and rerunning any failed tests, such as those caused by the above bug. As per Matthew's suggestions I went with gnuplot, which has made it possible to plot multiple data sources on the same graph. This has been particularly useful for plotting both packets in and packets out together.
Wrote a skeleton for a centralised collector of progger data for Harris to start filling in with actual useful code.
Continued writing up the implementation chapter of the libtrace paper. It's turning out to be a pretty long paper, as there are a lot of design decisions that warrant discussion (memory management, combiners, hashers etc.).
Succumbed to my head cold on Thursday, so had a day at home to rest and recover.
We calculated that for the latency chart, issuing m x n separate queries at current speeds would take a few seconds on InfluxDB (where m is the number of sources and n is the number of targets). I experimented with querying for whole rows and for the whole grid at once, and found significant speed-ups (about 10x faster for the whole grid).
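To make the difference in query counts concrete, here is a minimal sketch that only builds InfluxQL query strings and does not talk to a database. The measurement name "latency", the field "rtt", the tags "source" and "target", and the 10 minute window are all assumptions for illustration, not the actual schema.

```python
def per_cell_queries(sources, targets):
    """One query per (source, target) cell: m x n queries in total."""
    return [
        'SELECT mean("rtt") FROM "latency" '
        "WHERE \"source\" = '%s' AND \"target\" = '%s' "
        "AND time > now() - 10m" % (s, t)
        for s in sources
        for t in targets
    ]


def per_row_queries(sources):
    """One query per source, grouped by target: m queries in total."""
    return [
        'SELECT mean("rtt") FROM "latency" '
        "WHERE \"source\" = '%s' AND time > now() - 10m "
        'GROUP BY "target"' % s
        for s in sources
    ]


def whole_grid_query():
    """A single query for the whole grid, grouped by both tags."""
    return (
        'SELECT mean("rtt") FROM "latency" '
        'WHERE time > now() - 10m GROUP BY "source", "target"'
    )
```

With 3 sources and 2 targets this gives 6 per-cell queries, 3 per-row queries, or 1 whole-grid query, so any fixed per-query cost is paid far fewer times.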
Have been investigating why InfluxDB seems to have a baseline query time of about 2.5ms by posting in forums etc, but have had no breakthrough. Influx has just upgraded their storage engine, so I will look into testing this when it comes out.
Have rewritten traceroute tests to group by IP paths over the past 48 hours, which has slowed the query down to take around two and a half seconds on average.
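The grouping being asked of the database can be sketched in plain Python as below. This is only an illustration of the grouping logic, not the query that was benchmarked; the input format (timestamp, list of hop addresses) and the function name are assumptions.

```python
from collections import defaultdict

WINDOW_SECONDS = 48 * 60 * 60  # the 48 hour window


def group_by_path(results, now):
    """Group traceroute results from the last 48 hours by IP path.

    `results` is an iterable of (timestamp, hops) pairs, where hops is a
    sequence of IP address strings (hypothetical input format).
    Returns a dict mapping each distinct path to its timestamps.
    """
    groups = defaultdict(list)
    for timestamp, hops in results:
        if now - timestamp <= WINDOW_SECONDS:
            groups[tuple(hops)].append(timestamp)
    return dict(groups)
```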
Also investigated whether we can use retention policies to discard old data while backing it up elsewhere. This is not really what they're designed for, but it seems that something like this could be done with clustering. May need to test.
Lots more small fixes to tidy up the AMP scheduling web interface.
Updated more dropdown menus to work with the changes that Brad made to
the API, and properly set valid default meshes when using the matrix,
making sure that only meshes that are tested to are added. Put in links to the raw YAML
schedules for sites (possibly useful for debugging) and a link to an
example configuration script that will set up a client from scratch
(installing packages, configuration, etc).
Spent a morning at Lightwire doing a demo of the AMP web interface,
talking about the different data that can be collected and the ways it
can be useful. Tried to install a test client to show how that works,
but unfortunately ran into some issues with the test environment that
prevented name resolution from happening. Tracked it down to the way
that getifaddrs() describes ppp interfaces being unexpectedly different
from the ethernet interfaces we had tested on so far. Found and fixed a
heap of other smaller issues that came out of the meeting, mostly to do
with permissions and documentation.
Last week I had been looking at some traffic coming from Taobao servers (a shopping site in China that rivals AliExpress and eBay) that was using port 80, but wasn't necessarily HTTP traffic.
I downloaded a few Taobao applications on an Android emulator to capture some traffic and try to replicate the traces we have been observing. It seemed promising, as the captured traffic followed almost the same trends as the traffic we had observed.
When I ran these traces through libprotoident this week, they were classified as SPDY, a protocol that carries HTTP traffic with the aim of decreasing loading time for web pages. Looking at the protocol manual, it appears that the traffic conforms to SPDY ping packets. I have now extended the module to account for this type of packet.
Last week, I worked on adding ports automatically to the virtual switch for Rhea based on the number of ports reported by the OpenFlow switch to the Ryu controller at startup. This would do away with having to manually add and map ports on the virtual switch to ports on OpenFlow switches, as is done in RouteFlow.
This week, I will be looking at how to convert routes received from the BGP daemon and Netlink update messages that will be sent to the controller into OpenFlow messages.
Tony McGregor has critiqued the background chapter of my thesis, and I have been making changes.
I reread the entire thesis and made a few corrections.
Started writing some content for the parallel libtrace paper. Managed to churn out an introduction, a background and a little bit of the implementation section.
Fixed a couple of bugs in netevmon prior to the deployment: crashing when trying to reconnect to a restarted NNTSC and some confusing event descriptions for changepoint events.
Finished setting up a mobile app test environment for JP. I've configured my old iPhone to act as an extra client for 2-way communication apps (messaging etc.). So far the environment has already been helpful, as we've managed to identify one of the major outstanding patterns as being used by the Taobao mobile shopping app.
Have run all common queries and their equivalents on both PostgreSQL and InfluxDB and made a table of results. Only a few gains were made by InfluxDB, but these were in some of the most common queries, and were reasonably significant.
I have noticed that InfluxDB queries seem to have a lower bound on query time of about 2.5 milliseconds. I've also noticed that InfluxDB itself takes a much bigger share of the CPU than PostgreSQL during testing. This means that my testing may be partially limited by CPU.
Also used run length encoding to save space on traceroute data in InfluxDB and added a unique id to each AS path, with the help of a second table for storing unique ids and paths. This is sort of using Influx for something it isn't designed for (as a relational DB), but it seems to be working for the limited purpose of reading a dictionary of already encountered paths and ids into memory before beginning to insert new data.
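A minimal sketch of the two ideas, assuming a path is a list of per-hop AS numbers: consecutive hops in the same AS collapse into (AS, count) pairs, and an in-memory dictionary hands out ids for paths it has not seen before. The class and function names, the id numbering, and the example AS numbers (from the private range) are hypothetical, and loading or saving the second table is left out.

```python
def rle_encode(path):
    """Collapse runs of identical hops into (value, count) pairs."""
    encoded = []
    for hop in path:
        if encoded and encoded[-1][0] == hop:
            encoded[-1] = (hop, encoded[-1][1] + 1)
        else:
            encoded.append((hop, 1))
    return encoded


def rle_decode(encoded):
    """Expand (value, count) pairs back into the original path."""
    return [hop for hop, count in encoded for _ in range(count)]


class PathTable:
    """In-memory dictionary of already encountered paths and their ids."""

    def __init__(self, known=None):
        # `known` stands in for the contents of the second table,
        # mapping an encoded path (tuple of pairs) to its id.
        self.ids = dict(known or {})
        self.next_id = max(self.ids.values(), default=0) + 1

    def path_id(self, path):
        """Return the id for a path, allocating a new one if unseen."""
        key = tuple(rle_encode(path))
        if key not in self.ids:
            self.ids[key] = self.next_id
            self.next_id += 1
        return self.ids[key]
```

Each data point then only needs to store the small integer id, while the full encoded path is stored once.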