Tracked down a segfault in the Ostinato drone whenever I tried to halt packet generation on the DPDK interfaces. This took a lot longer than it normally would have, since valgrind doesn't work too well with DPDK and there are about 10 threads active when the problem occurs. It eventually proved to be a simple case of a '<=' being used instead of a '<', but that was enough to corrupt the return pointer for the function that was running at the time, causing the segfault.
Once I fixed that, I was able to write some scripts to orchestrate sending packets at specific rates for a period of time, while a libtrace program runs at the other end of the link trying to capture and process those packets. Once the packet generation is over, the libtrace program is halted. This will form the basis of my experiments to determine how much traffic we can capture and process with parallel libtrace. The experiments will use different capture methods (ring, DAG, DPDK, PF_RING, etc.), different packet rates, different numbers of processing threads (from 1 to 16) and different workloads, ranging from just counting packets to cryptopan anonymisation.
My initial tests have shown that the numbers of dropped packets are not particularly consistent across captures with otherwise identical parameters, so I'll have to run each experiment multiple times so that I can get some more statistically valid results.
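A minimal sketch of how such an experiment sweep might be orchestrated (the parameter values and the `run_once` stub are illustrative, not my actual scripts): every combination of capture method, rate, thread count and workload is run several times so that drop counts can be summarised statistically.

```python
import itertools
import statistics

# Illustrative parameter space; the real experiments may differ.
CAPTURE_METHODS = ["ring", "dag", "dpdk", "pfring"]
RATES_MBPS = [1000, 5000, 10000]
THREADS = [1, 2, 4, 8, 16]
WORKLOADS = ["count", "cryptopan"]
REPEATS = 5  # repeat each run to get statistically valid results

def run_once(method, rate, threads, workload):
    """Placeholder for launching the generator and the libtrace
    capture program; would return the dropped-packet count."""
    return 0

def sweep():
    results = {}
    for combo in itertools.product(CAPTURE_METHODS, RATES_MBPS,
                                   THREADS, WORKLOADS):
        drops = [run_once(*combo) for _ in range(REPEATS)]
        results[combo] = (statistics.mean(drops), statistics.stdev(drops))
    return results
```

Recording mean and standard deviation per combination is one simple way to expose the run-to-run inconsistency in drop counts mentioned above.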
Also spent a bit of time helping Brendon capture some traces of his ICMP packets to help figure out whether his timing issues are network-based or host-based.
Updated amplet client init scripts to return proper LSB error codes when
starting without configuration, so systemd no longer falsely believes
the client was started ok. Updated the key permissions enforced by the
puppet configuration to match those of the amplet client packages.
Built new packages and deployed them for testing.
Spent some time investigating start-stop-daemon and killing process
groups. There doesn't appear to be any nice way to make this happen
without writing our own code to stop everything, which is starting to
look worthwhile to make sure everything is tidied up properly when
using start-stop-daemon.
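The "write our own code" option amounts to putting the daemon in its own process group and signalling the whole group on stop. A hedged sketch of that idea (POSIX-only, not the amplet client's actual code):

```python
import os
import signal
import subprocess

def start_in_own_group(cmd):
    # os.setsid in the child makes it the leader of a new session and
    # process group, so later signals can reach everything it spawns.
    return subprocess.Popen(cmd, preexec_fn=os.setsid)

def stop_group(proc, sig=signal.SIGTERM):
    # Signal the entire process group, not just the group leader.
    os.killpg(os.getpgid(proc.pid), sig)
    proc.wait()
```

This is the tidy-up guarantee that start-stop-daemon alone doesn't seem to provide: child processes forked by the daemon are killed along with it.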
Wrote functions to format raw AS traceroute and path length data for
download from the graph pages. Still need to do full IP path traceroutes.
Found some interesting results when comparing the amplet ICMP test with
a few other data sources. Something is introducing delay and jitter in
one that isn't present in the others. Spent some time looking at source
code and traces to try to figure out what is going on (unsuccessfully so
far, will continue on Monday).
Added a new graph type to the AMP website showing loss as a percentage over time. This graph is now shown when clicking on a cell in the loss matrix, and can also be accessed through the graph browser. Fixed a complaint regarding the matrix where clicking on a cell in an IPv4-only matrix would take you to a graph showing lines for both IPv4 and IPv6, meaning you could never get the smokeping-style colouring via the matrix.
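The quantity behind the new graph is just per-bin loss as a percentage of probes sent; a sketch of that calculation (function and inputs are my own illustration, not the website code):

```python
def loss_percent(sent, received):
    """Loss for one time bin, as a percentage of probes sent."""
    if sent == 0:
        # No probes in this bin: report zero rather than divide by zero.
        return 0.0
    return 100.0 * (sent - received) / sent
```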
Started messing around with Ostinato scripting on the 10g dev boxes and using DPDK to generate packets at 10G rates. Had a few issues initially because I was using an old version of the DPDK-enabled Ostinato that Richard had lying around; updating to Dan's most recent version seemed to fix that.
Spent a bit of time looking at the data collected during the CSC and how it might be able to be used as ground truth for developing some security event detection techniques.
Fixed the permissions for directories and files created for keys/certs
to make sure that rabbitmq can access them. Also added exponential
backoff when trying to fetch signed certificates - hopefully a machine
that is being actively installed will query soon enough to quickly get a
new certificate, but unattended installs won't hammer the server.
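The retry schedule looks something like the following sketch (the base delay and cap are illustrative values, not the actual amplet client settings): the delay doubles after each failed attempt up to a ceiling, so an actively installed machine retries quickly at first while an unattended install backs off.

```python
def backoff_delays(base=10, cap=3600, attempts=10):
    """Return the sequence of delays (seconds) between fetch attempts.

    Doubles the wait after each failure, capped at `cap` so the
    client still polls occasionally rather than giving up entirely.
    """
    delays = []
    delay = base
    for _ in range(attempts):
        delays.append(delay)
        delay = min(delay * 2, cap)
    return delays
```

A real implementation would likely add random jitter to each delay so that a fleet of freshly installed clients doesn't retry in lockstep.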
Investigated some reported issues with init scripts not performing
correctly, but so far haven't been able to find a fault. Also looked
into two clients that are not testing to the full list of targets -
they just appear to be ignored and there is no obvious reason why.
Worked with Brad to update two more amplets to Wheezy, and spent some
time trying to determine why we partially lost access to one of the few
remaining un-updated machines.
Finished up the implementation chapter of the libtrace paper. Added a couple of diagrams to augment some of the textual explanations. Got Richard S. to read over what I've got so far and made a few tweaks based on his feedback.
Spent a decent chunk of time looking at Unknown UDP port 80 traffic in libprotoident. Found a clear pattern that was contributing most of the traffic, which I traced back to Tencent. Unfortunately, Tencent publishes a lot of applications, so that knowledge wasn't conclusive on its own.
My initial suspicion was that it might have been game traffic so I downloaded and played a few popular multiplayer games via the Tencent games client, capturing the network traffic and comparing it against my current unknown traffic. No luck, but then I had the bright idea to look a bit more closely at video call traffic in WeChat (a messaging app). Sure enough, once I was able to successfully create two WeChat accounts and get a video call going between them, I started seeing the traffic I wanted.
Also added rules for Acer Cloud and OpenTracker over UDP.
This week I have collated some results from tests and produced graphs to demonstrate the difference in query times between InfluxDB and PostgreSQL. I have also updated the version of InfluxDB I am using and have been testing the new storage engine.
I have spent some time investigating Elasticsearch and getting it installed on a VM. I have it running now, so I will start working on filling it with production data when I get back after the break.
Spent some time putting together a test environment similar to how some
of the Lightwire monitors are configured, with ppp interfaces inside of
network namespaces. This allowed me to start tracking down issues with
the tcpping test that they were seeing. Firstly, the differences between
capturing on Ethernet and Linux SLL/cooked interfaces weren't being
taken into account, so header offsets were incorrectly calculated.
Secondly, I spent a lot of time trying to determine why the test was not
capturing the first response packet on a ppp interface - after a lot of
digging it turns out there is a bug in libpcap to do with BPF filters on
a cooked interface that was breaking it. The bug has been fixed, but
needs a backported package to get the new library version into Debian.
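The first fix comes down to selecting the link-layer header length by pcap link type instead of assuming Ethernet everywhere. A sketch using the standard pcap link-type values (Ethernet, DLT_EN10MB, carries a 14-byte header; Linux cooked captures, DLT_LINUX_SLL, use a 16-byte pseudo-header):

```python
# Standard pcap DLT_* values.
DLT_EN10MB = 1       # Ethernet
DLT_LINUX_SLL = 113  # Linux "cooked" capture

LINK_HEADER_LEN = {
    DLT_EN10MB: 14,     # dst MAC + src MAC + ethertype
    DLT_LINUX_SLL: 16,  # SLL pseudo-header
}

def ip_header_offset(linktype):
    """Offset of the IP header within a captured frame."""
    try:
        return LINK_HEADER_LEN[linktype]
    except KeyError:
        raise ValueError("unhandled link type %d" % linktype)
```

(This ignores VLAN tags and other variable-length cases, which a full implementation would also need to handle.)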
Tested building and running the amplet client and all the supporting
libraries on a Raspberry Pi. I've run standalone tests (it has a newer
kernel which I thought might help debug my ppp problems) and the results
look to be sensible. Will hopefully get a chance to test general
performance while reporting results next week.
Tracked down a rare bug where packets from the previous packet-out test were still being transmitted after that test completed, presumably due to some buffering on the switch. This caused some tests to segfault and fail to produce results.
Continued to work on drawing graphs from my resulting data and rerunning any failed tests, such as those caused by the above bug. As per Matthew's suggestion I went with gnuplot, which makes it possible to plot multiple data sources on the same graph; this has been particularly useful for plotting both packets in and packets out together.
Wrote a skeleton for a centralised collector of progger data for Harris to start filling in with actual useful code.
Continued writing up the implementation chapter of the libtrace paper. It's turning out to be a pretty long paper, as there are a lot of design decisions that warrant discussion (memory management, combiners, hashers etc.).
Succumbed to my head cold on Thursday, so had a day at home to rest and recover.
We calculated that for the latency chart, issuing m x n separate queries with current speeds would take a few seconds on InfluxDB (where m is the number of sources and n is the number of targets). I experimented with querying for whole rows and for the whole grid at once, and found significant speed-ups (about 10x for the whole grid).
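The difference is easiest to see in how the queries are constructed (measurement and field names here are invented for illustration): the naive approach builds one query per (source, target) cell, while the batched approach asks InfluxDB to GROUP BY the tag pair and returns the whole grid in a single round trip.

```python
def per_cell_queries(sources, targets):
    # One query per matrix cell: m x n round trips.
    return [
        "SELECT LAST(rtt) FROM latency "
        "WHERE source = '%s' AND target = '%s'" % (s, t)
        for s in sources for t in targets
    ]

def whole_grid_query():
    # One query for the entire grid, grouped by tag pair.
    return "SELECT LAST(rtt) FROM latency GROUP BY source, target"
```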
Have been investigating why InfluxDB seems to have a baseline query latency of 2.5ms by posting in forums etc., but have had no breakthrough. InfluxDB has just upgraded their storage engine, so I will look into testing this when it comes out.
Have rewritten traceroute tests to group by IP paths over the past 48 hours, which has slowed the query down to take around two and a half seconds on average.
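The grouping itself reduces to collapsing each observation in the 48-hour window to its hop sequence and counting identical sequences (the data shape below is assumed, not the actual schema):

```python
from collections import Counter

def group_by_path(observations):
    """observations: iterable of hop-IP lists, one per traceroute run.

    Returns a Counter mapping each distinct path (as a tuple of hops)
    to the number of times it was observed, so the graph needs only
    one line per distinct path.
    """
    return Counter(tuple(hops) for hops in observations)
```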
Also investigated whether we can use retention policies to discard old data but back it up elsewhere first. That's not really what they're designed for, but it seems something like this could be done with clustering. May need to test.