Spent most of the week installing the server-side components of AMP, which took a lot longer than I expected. Ran into issues with deploying ampweb when it is not at the root of the website: many URLs were absolute, and the URL parsing expected a particular layout and number of path elements that no longer held. Fixed all the obvious cases, but more kept surfacing over the following days and have now also been fixed.
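The fix amounts to building every URL relative to a configurable mount point instead of hard-coding absolute paths. A minimal Python sketch of the idea (make_url and the "/amp/" prefix are hypothetical illustrations, not ampweb's actual API):

```python
def make_url(prefix, *parts):
    """Build a site URL under a configurable deployment prefix.

    'prefix' is where the web app is mounted: "/" for the root, or
    e.g. "/amp/" when deployed under a sub-path.
    """
    path = "/".join(str(p).strip("/") for p in parts)
    return prefix.rstrip("/") + "/" + path

# At the root the URLs look unchanged...
assert make_url("/", "view", "amp-icmp") == "/view/amp-icmp"
# ...but under a sub-path they pick up the mount point.
assert make_url("/amp/", "view", "amp-icmp") == "/amp/view/amp-icmp"
```

Parsing then strips the known prefix before counting path elements, rather than assuming a fixed number of components from the root.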
Found and fixed a few small edge cases in recent ampweb/ampy features that hadn't been tested with the sort of data I wanted to look at. Also had trouble verifying that my changes worked elsewhere, as the Influx database on prophet was misbehaving and making it difficult to fetch data in some circumstances.
Had a couple of days off sick this week, however this report also covers the week before.
Continued with work getting sample flows, this time from vandervecken. I set up the vandervecken ISO in a KVM machine with a small mininet network. As with the other applications, I'm trying to keep the environment as contained as possible so it can easily be run on any machine. I also ran vandervecken on the new DPDK OVS software switch with some simple iperf3 benchmarks comparing the fast path to the slow path; this gives me some solid results for the paper showing the improvement and demonstrating that line rate is reached.
I've given more thought to the problem of creating dependency graphs. While CacheFlow gives a starting point, it is still not entirely clear to me what form a multi-table implementation will take. The complex dependencies between tables, especially with apply-actions versus write-actions, are still not clear. This comes down to defining what a dependency is, which depends somewhat on what restrictions later transformations will impose. A dependency between tables could simply be any rule which sends traffic to a rule in the next table, together with that rule's dependencies. However, for some transformations this may not be the best definition: for instance, if rules only write an action to the action set, it may be possible to move them to an earlier table, in which case it is not obvious that a dependency exists. The compression to a single table, as in FlowAdapter, is still a viable approach to overcoming this part of the problem.
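As a concrete strawman for the simplest definition above (a rule depends on the next-table rules its traffic can reach), here is a hedged Python sketch; the rule representation is invented purely for illustration:

```python
# Each rule: a dict with an id, a match (field -> required value,
# missing fields are wildcards) and a list of actions.
def overlaps(m1, m2):
    """True if some packet could satisfy both matches."""
    return all(m1[f] == m2[f] for f in m1 if f in m2)

def table_dependencies(table_a, table_b):
    """Naive dependency definition: rule a in table_a depends on rule b
    in the next table if a forwards traffic onwards and a packet
    matching a could also match b."""
    deps = []
    for a in table_a:
        if "goto" not in a["actions"]:
            continue
        for b in table_b:
            if overlaps(a["match"], b["match"]):
                deps.append((a["id"], b["id"]))
    return deps

t0 = [{"id": 1, "match": {"vlan": 10}, "actions": ["goto"]},
      {"id": 2, "match": {"vlan": 20}, "actions": ["output:1"]}]
t1 = [{"id": 3, "match": {"vlan": 10, "ip_dst": "10.0.0.1"}, "actions": ["output:2"]},
      {"id": 4, "match": {"vlan": 30}, "actions": ["drop"]}]

assert table_dependencies(t0, t1) == [(1, 3)]
```

This deliberately ignores priorities, action-set writes and shadowing, which is exactly where the definition gets murky for real multi-table pipelines.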
My next step is to take rules independent of tables and dependencies and detect which tables they could be placed in in a new pipeline. This involves reading in a hardware description, for which TTPs still give a good starting point, despite very few vendors having released them. In fact, I'm only aware of a TTP for ofdpa and a couple of samples released with the spec.
Made some progress on the InfluxDB memory issues we were having when catching up on old data. Now we are a lot less likely to drive the machine into swap, at the cost of taking a bit longer for backfilled data to be aggregated. Part of the problem was caused by my fix last week for the change in behaviour for the first() and last() aggregation functions in Influx 0.11 -- I've put in a new hacky fix but I'm basically waiting for Influx 0.13 which will hopefully provide us a way to get the old behaviour back.
Found another weird bug in Influx where queries for certain streams sometimes return a result row split into two "half-rows". This was messing with our querying code in NNTSC, which assumes that the database will return only complete rows, so I've had to add extra code to deal with this possibility.
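The workaround boils down to coalescing consecutive partial rows that share a timestamp before handing them to the rest of the pipeline. A simplified sketch (not the actual NNTSC code; rows are modelled here as dicts with None for missing columns):

```python
def merge_half_rows(rows):
    """Merge consecutive rows with the same timestamp, taking the
    first non-None value seen for each column."""
    merged = []
    for row in rows:
        if merged and merged[-1]["time"] == row["time"]:
            for col, val in row.items():
                if merged[-1].get(col) is None:
                    merged[-1][col] = val
        else:
            merged.append(dict(row))
    return merged

rows = [
    {"time": 100, "rtt": 12.5, "loss": None},
    {"time": 100, "rtt": None, "loss": 0.0},   # second half of the same row
    {"time": 110, "rtt": 13.1, "loss": 0.0},
]
assert merge_half_rows(rows) == [
    {"time": 100, "rtt": 12.5, "loss": 0.0},
    {"time": 110, "rtt": 13.1, "loss": 0.0},
]
```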
More influx issues: we aren't allowed to perform aggregation on the timestamp column in an Influx table, which was breaking our loss calculation for DNS -- we were using count(timestamp) to determine how many DNS requests we had sent as this was the only non-NULLable column in the DNS data table. Instead, I've had to add an extra "requests" column to the DNS data table so that we have an explicit count available in our aggregated data.
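With an explicit, aggregatable request count the loss calculation becomes trivial. A hypothetical sketch of the idea (the column names are invented for illustration):

```python
def dns_loss(aggregated_row):
    """Loss fraction for an aggregated DNS bin. We can sum() an
    explicit 'requests' column in the rollup, whereas aggregating
    the timestamp column (count(timestamp)) is not permitted."""
    requests = aggregated_row["requests"]
    responses = aggregated_row["responses"]
    if not requests:
        return None                      # no measurements in this bin
    return (requests - responses) / requests

assert dns_loss({"requests": 10, "responses": 9}) == 0.1
assert dns_loss({"requests": 0, "responses": 0}) is None
```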
Lots of little fixes on the website. The changes to modals for Bootstrap 3.3 are continuing to have a number of interesting flow-on effects, such as the "add new series" modal no longer working after the first time it is used. Added an AS path tab to latency and loss graphs that are only showing a single series, as we've often seen some interesting change and wondered whether the path changed at the same time. Also fixed an issue where the last datapoint was often not visible on the graphs.
Finally, submitted my unexpected traffic paper to IMC on Thursday. Fingers crossed.
Spent some time tidying up control messages and configuration when scheduling tests that require cooperation from the server. As part of the previous changes the port number was no longer being sent to tests, which meant it could only operate using the default port - this is now fixed and works for both scheduled and standalone tests. Also fixed up some parameter parsing when running standalone tests where empty parameter lists were not being created properly.
Wrote some basic unit tests for the udpstream test and its control messages. Fixed a possible memory leak when failing to send udpstream packets. Made sure documentation and protobuf files agreed on default values of test parameters.
Started to install the server-side components of AMP on another machine for a test deployment so that I can use the documentation I write as I go to help build/update the packaging for the most recent versions.
Started adding support for the new AMP UDPStream test to NNTSC, ampy and amp-web. Test results are now successfully inserted into the database and we can plot simple latency and loss graphs for the UDP streams. Next major tasks are to produce a new graph type that can be used to represent the jitter observed in the stream and to get some event detection working.
Spent much of my week chasing Influx issues. The first was that a change in how the last() function worked in 0.11 was messing with our enforced rollup approach -- the timestamp returned with the last row was no longer the timestamp of the last datapoint in the table; it was now the timestamp of the start of the period covered by the 'where' clause in your query. However, we had been using last() to figure out when we had last inserted an aggregated datapoint into the rollup tables, so this no longer worked.
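One way around relying on last() is to track the rollup high-water mark ourselves from the raw data we insert. A hedged sketch of that bookkeeping (not the actual NNTSC implementation; names are illustrative):

```python
class RollupTracker:
    """Remember the newest raw timestamp per stream so we know which
    aggregation bins are complete and ready to roll up, instead of
    asking Influx's last() (whose returned timestamp changed in 0.11)."""
    def __init__(self, binsize):
        self.binsize = binsize
        self.last_done = {}     # stream id -> end of last completed bin

    def observe(self, stream, timestamp):
        """Record a raw datapoint; return bins now ready to aggregate."""
        start = self.last_done.get(
            stream, (timestamp // self.binsize) * self.binsize)
        ready = []
        while start + self.binsize <= timestamp:
            ready.append((start, start + self.binsize))
            start += self.binsize
        self.last_done[stream] = start
        return ready

t = RollupTracker(binsize=300)
assert t.observe("s1", 100) == []                  # first bin not finished
assert t.observe("s1", 650) == [(0, 300), (300, 600)]
```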
The other issue I've been chasing (with mixed success) is memory usage when backfilling old data after NNTSC has been down for a little while. I believe this is mostly related to Influx caching our enforced rollup query results, which will be a lot of data if we're trying to catch up on the AMP queue. The end result on prophet is a machine that spends a lot of time swapping when you restart NNTSC with a bit of a backlog. I need to find a way to stop Influx from caching those query results or at least to flush them a lot sooner.
Added a latency measure to the udpstream test by reflecting probe packets at the receiver. The original sender can combine the RTT information with jitter and loss to calculate Mean Opinion Scores, which was slightly annoying as (depending on the test direction) the remote end of the test now has to collate and send back partial result data. Updated the ampsave function to reflect the new data reported by the test.
Updated the display of tcpping test information in the scheduling website to reflect the new packet size options. Worked with Shane to update the lamp deployment to the newest version of all the event detection and web display/management software.
Tidied up some more documentation and sent it to a prospective AMP user. Will hopefully get some feedback next week as they try to install it and I can see which areas of the documentation are still lacking.
Spent the better part of this week reviewing literature and thinking about the best starting point and the first issue to tackle.
CacheFlow gives a good outline of building dependency graphs, and the header space work it builds its solution upon seems like a good approach: that is, looking at packet headers as a series of bits rather than a set of fields. If I take this approach I will have to extend the solution to deal with multiple tables. Alternatively, the FlowAdapter approach of normalising to a single table is still a possibility (some type of dependency graph is the first part of that step anyway). My current thinking is that a dependency graph is likely to result in better optimisations than one big table, which would essentially have to be undone when placed back onto a multi-table switch.
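The bit-level view is easy to demonstrate: each match is a string over {0,1,*}, and two matches overlap iff no bit position pins them to different values. A minimal sketch (the ternary-string representation is my own toy illustration, not CacheFlow's data structure):

```python
def bits_overlap(a, b):
    """Ternary match strings over {'0','1','*'}: the matches overlap
    iff no bit is fixed to different values in the two strings."""
    assert len(a) == len(b)
    return all(x == y or x == "*" or y == "*" for x, y in zip(a, b))

# An 8-bit toy field: the first match pins the top nibble to 0001,
# the second only pins the low bit to 1 -- some packet satisfies both.
assert bits_overlap("0001****", "*******1") is True
assert bits_overlap("0001****", "0010****") is False
```

The appeal is that field boundaries disappear: overlap, subsumption and shadowing all reduce to the same per-bit comparison.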
I looked at the state of the art in minimising TCAM entries. Most past work has been on prefix-based optimisation (as seen in routing tables etc.). More recently, OpenFlow has sparked interest in generic TCAM rule optimisation (without the prefix restriction), though there currently appears to be only a single online solution. I don't think this is going to be a main area of my research, but if I have the time I could apply an existing solution in the pipeline directly before installing rules on the switches.
I read a few related papers which focused on spreading rules amongst multiple switches. These tend to be limited to spreading only the policy, not the forwarding, and tend to construct subsets of rules in such a way that order does not matter, allowing the rules to be placed in any table along a packet's path. This restriction is not needed within the bounds of a single switch, as the order of tables is known and there is essentially only a single path. So while interesting and useful as inspiration for algorithms, the order restriction is what makes their problem hard; without it, it is actually easy to move rules around: lower-priority rules can be moved to a later table.
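The observation about priorities can be sketched directly: within one switch the table order gives a fixed evaluation order, so filling tables in priority order with a fall-through to the next table preserves which rule wins. A hypothetical illustration (rule format and capacity model invented for the sketch):

```python
def split_by_priority(rules, capacity):
    """Place rules into successive tables in priority order; each full
    table gets a lowest-priority fall-through that sends unmatched
    traffic to the next table, so the overall match semantics hold."""
    ordered = sorted(rules, key=lambda r: -r["priority"])
    tables = []
    for i in range(0, len(ordered), capacity):
        table = list(ordered[i:i + capacity])
        if i + capacity < len(ordered):
            table.append({"priority": 0, "match": {}, "actions": ["goto"]})
        tables.append(table)
    return tables

rules = [{"priority": p, "match": {"dst": p}, "actions": ["output"]}
         for p in (50, 40, 30)]
tables = split_by_priority(rules, capacity=2)
assert [r["priority"] for r in tables[0]] == [50, 40, 0]   # goto appended
assert [r["priority"] for r in tables[1]] == [30]
```

This is exactly the freedom the multi-switch work gives up by requiring order-independent rule subsets.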
Finished up the first release version of the event filtering for amp-web and rolled it out to lamp on Thursday morning. Most of this week's work was polishing up some of the rough edges and making sure the UI behaves in a reasonable fashion -- Brad was very helpful playing the role of an average user and finding bad behaviour.
Post-release, tracked down and fixed the issue that was causing netevmon to not run the loss detector. Added support for loss events to eventing and the dashboard.
Released a new version of libprotoident, which includes all of my recent additions from the unexpected traffic study.
Marked the last libtrace assignment and pushed out the marks to the students.
After what seems like forever, I've finally managed to put together a new libprotoident release that includes all of the new protocol rules I've developed over the past couple of years. This release adds support for around 70 new protocols, including QUIC, SPDY, Cisco SSL VPN, Weibo and Line. A further 28 protocols have had their rules refined and improved, including BitTorrent, QQ, WeChat, Xunlei and DNS.
The lpi_live tool has been removed in this release, as this has been decommissioned in favour of the lpicollector tool.
Also, please note that libflowmanager 2.0.4 is required to build the libprotoident tools. Older versions of libflowmanager will fail the configure check.
The full list of changes can be found in the libprotoident ChangeLog.
Lots of minor fixes this week. Fixed the commands to properly kill the entire process group when stopping the AMP client using the init scripts. Still need a cleaner way to do this as part of the main process. Updated the AMP schedule fetching to follow HTTP redirects, which was required to make it work on the Lightwire deployment. Fixed the tcpping test to properly match response packets when the initial SYN contains payload. Different behaviour was observed in some cases where RSTs would acknowledge a different sequence number compared to a SYN ACK, and only one of these was being checked for.
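The tcpping matching fix can be illustrated like this: when the probing SYN carries payload, one response type has been observed acknowledging the SYN plus its payload while the other acknowledges only the SYN, so both acknowledgement values must be accepted. A hedged Python sketch (not the actual AMP C code; the exact offsets are illustrative):

```python
def is_matching_response(isn, payload_len, ack):
    """Accept a response if it acknowledges either just the SYN
    (ISN + 1) or the SYN plus its payload (ISN + payload + 1);
    SYN ACKs and RSTs have been seen to differ here."""
    return ack in (isn + 1, isn + 1 + payload_len)

isn = 1000
assert is_matching_response(isn, 64, isn + 1)       # acks the SYN only
assert is_matching_response(isn, 64, isn + 65)      # acks SYN + payload
assert not is_matching_response(isn, 64, isn + 2)
```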
Updated all the tests to report the DSCP settings that they used. They are not currently saved into the database, but they are being sent to the collector now.
Set the default packet interval of the udpstream test to 20ms, which is closer to VoIP than the global AMP minimum interval that it was using. Also wrote most of the code for the test to calculate Mean Opinion Scores based on the ITU recommendations, just need to add a latency measure to complete the calculation.
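For reference, the final step of that calculation, converting an ITU-T G.107 E-model R-factor into a MOS, is compact. A sketch of just that conversion (the surrounding computation of R from delay, jitter and loss is omitted, and this is not the test's actual code):

```python
def r_to_mos(r):
    """ITU-T G.107 mapping from transmission rating factor R to MOS."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6

# The E-model's default R of 93.2 corresponds to a MOS of about 4.41.
assert abs(r_to_mos(93.2) - 4.41) < 0.01
assert r_to_mos(-5) == 1.0 and r_to_mos(120) == 4.5
```

The missing RTT measurement feeds the delay impairment term of R, which is why the score cannot be completed until the latency measure is in place.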