Made some changes to the amplet client in response to things I observed
while installing test clients for the Lightwire machine. Changed the log
level of some informational messages to avoid filling logfiles,
rearranged startup to create the pidfile earlier to work better with
puppet and added some more smarts to guessing the ampname when one isn't
supplied. Also rearranged some directory structure to better represent
the python modules involved.
Found and fixed a few bugs in various things on the server side as well.
Values from the new dropdowns weren't being fetched appropriately in
some cases, percentage loss was sometimes calculated incorrectly and
incomplete traceroute paths weren't being stored correctly.
Got the event detection systems up and running on the Lightwire machine,
which was delayed due to issues with embedded R behaving slightly
differently in the Jessie version. Also spent some time with Shane
chasing up some unusual looking events and unusual merging of event groups.
Brad and I finished updating the last of the reachable amplets to Debian
Wheezy, which brings us up to 13 monitors all running the new code now.
I spent last week examining the Vandervecken code to understand how the interfaces on the OpenFlow switches and the OVS switch in the Vandervecken VM are mapped together.
This week, my focus is to code the mapping and port association functionalities needed for the new RouteFlow.
I've re-run a couple of tests this week. First were the add-flow tests on the Pica8, because these had a threading bug in them. I then re-ran the HP tests, as these were unintentionally run on 100Mbit ports; for consistency, they were moved to the 1Gbit ports. I also ran the tests against OVS.
I've been working through processing and verifying the results, looking for any anomalies that I might need to take into account. I found some very low figures from the Brocade tests in some cases; these could be related to it becoming unstable and crashing (either as the cause or, if it is a gradual leak, just timing).
Some aspects of the tests that I have been looking at include receiving packets out of order along with rates and latencies.
Had a meeting with Tony McGregor about the introduction. Took away some critique sheets and used these to make changes and improvements.
Continued to read through the rest of the chapters so that I can pass on a draft and get some critiques from Matthew and Richard, in particular on the later chapters.
Played around with getting netevmon to produce some useful events from the Ceilometer data and updated amp-web to be able to show those events on the dashboard. Some of our existing detection algorithms (Plateau, BinSegChangepoint, Changepoint) worked surprisingly well so we should have something useful to demo at the STRATUS forum on Friday.
Helped Brendon get netevmon up and running on lamp. There were a few issues unfortunately, mostly due to permissions and R being terrible, but we managed to get things running eventually. Spent a bit of time fixing some redundant event groups that we observed from the lamp data, which were a side-effect of the fact that a group of traceroute events can be combined with both latency increase and decrease events. We also worked together to track down some bad IP traceroute paths that were being inserted into the database -- new amplets were not including a 'None' entry for non-responsive hops, which NNTSC was expecting, so an 11 hop path with 6 non-responsive hops was being recorded as a 5 hop contiguous path. Updated NNTSC to recognise a missing address field as a non-responsive hop.
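A minimal sketch of the traceroute path problem and the fix, for illustration only. The hop layout (a list of dicts with an optional 'address' key) and the function names are assumptions, not the actual NNTSC code.

```python
def naive_path(hops):
    """Buggy behaviour: hops without an address are silently dropped,
    so the stored path looks shorter and contiguous."""
    return [hop["address"] for hop in hops if "address" in hop]


def normalise_path(hops):
    """Fixed behaviour: a missing address field is treated as a
    non-responsive hop and recorded as None, keeping hop positions."""
    return [hop.get("address") for hop in hops]


# Hypothetical three hop path where the middle hop did not respond.
hops = [
    {"address": "192.0.2.1"},
    {},                          # no address field at all
    {"address": "192.0.2.3"},
]

print(naive_path(hops))       # two entries, hop positions lost
print(normalise_path(hops))   # three entries, None marks the gap
```

Keeping one entry per hop means the path length and the position of each responsive hop survive, which is what the path comparison relies on.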
Gave JP a crash course in libprotoident development so he can get started on his summer project.
I have put together a demonstration script for getting the mode traceroute over a period of time. Had some issues with timezones but have sorted these out. Looks like this will be a reasonably quick operation.
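As a rough illustration of what "mode traceroute" means here, the sketch below counts whole paths and returns the most common one. It assumes the paths have already been fetched for the period of interest and are plain lists of hop addresses; the function name `mode_path` is hypothetical and this is not the demonstration script itself.

```python
from collections import Counter


def mode_path(paths):
    """Return (path, count) for the most frequently observed path.

    Each path is a sequence of hop addresses (None for a
    non-responsive hop). Paths are turned into tuples so that they
    are hashable and can be counted as whole units.
    """
    counts = Counter(tuple(path) for path in paths)
    if not counts:
        return None, 0
    return counts.most_common(1)[0]


# Hypothetical observations over some period of time.
observed = [
    ["192.0.2.1", "192.0.2.2", "192.0.2.9"],
    ["192.0.2.1", "192.0.2.7", "192.0.2.9"],
    ["192.0.2.1", "192.0.2.2", "192.0.2.9"],
]

path, count = mode_path(observed)
print(path, count)
```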
Also encountered an issue with InfluxDB earlier in the week where it was using up massive amounts of memory and the system was killing it. I have learned that the reason for this is that it stores an in-memory inverted index of the unique series. I had given the measurements too many tags with too many possible unique combinations, so I have redesigned these so that there are only 21938 unique series instead of over a million. This seems to have fixed the issue.
I have been backfilling data over the last month, though this is a slow process, as InfluxDB is designed for recent points to be inserted rather than older ones. Once I have a decent amount of data I will run some benchmarks against our PostgreSQL database.
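To see why the number of tags matters so much, here is a small sketch of how series counts grow. Each unique combination of tag values is a separate series, so the worst case is the product of the number of distinct values per tag. The tag names and counts below are made up for illustration and are not the real schema.

```python
from math import prod


def max_series(tag_values):
    """Upper bound on series count: product of distinct values per tag."""
    return prod(len(values) for values in tag_values.values())


def observed_series(points, tag_keys):
    """Actual series count: distinct tag combinations seen in the data."""
    return len({tuple(point.get(key) for key in tag_keys) for point in points})


# Hypothetical tag sets: adding one more tag multiplies the bound.
few_tags = {"source": range(10), "destination": range(100)}
many_tags = dict(few_tags, packet_size=range(20), dscp=range(8))

print(max_series(few_tags))    # 10 * 100
print(max_series(many_tags))   # 10 * 100 * 20 * 8

points = [
    {"source": "a", "destination": "x"},
    {"source": "a", "destination": "y"},
    {"source": "a", "destination": "x"},
]
print(observed_series(points, ["source", "destination"]))
```

Values that vary a lot but are never used for filtering or grouping are better stored as fields, since fields do not contribute to the series index.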
My project is to try and classify packets within the university network that cannot be classified by libtrace protoident. This week I will be looking at all unidentified packets that are coming in or going out on TCP port 80 to see what applications may be using this for anything other than a web server.
Gave a talk at the internal PhD student conference on my thesis chapters.
I have been checking through my entire thesis for logical flow and grammar errors, including the use of 'which' and 'that', and commas. After I finish, it will be forwarded to two more people to critique.
This week I've been running more testing on the Brocade, which has a tendency to crash during testing. Testing over the secure channel seems to be problematic but I've managed to get through most tests now.
I've also run through the packet in and out tests on the HP. I will run those tests adding flows next week in case these cause issues (as I saw in preliminary testing).
With the results collected so far I've worked through the tests and constructed some gnuplot graphs to visualise some of the results.
On Friday I had the CS PhD student conference where I presented some of my recent results that I prepared throughout the week. The presentation was primarily based upon my earlier proposal, but with more of a focus on the work I've done so far rather than purely what I propose to do for the PhD.
I have been working on investigating time series databases for the AMP system. Some of the more promising systems included KairosDB (http://kairosdb.github.io/), Prometheus (http://prometheus.io/) and InfluxDB (https://influxdb.com/index.html). I also looked at other systems including Druid, OpenTSDB and ElasticSearch, but for a number of reasons these didn't seem to be ideal for our use case.
A few things that were important to consider for AMP include:
- The server may not necessarily be clustered, so systems that rely on clustering aren't particularly useful
- The data taken from each test (http, tcpping etc.) is reasonably multi-dimensional, so time series databases that rely on univariate data aren't useful either
- Not all data from tests is numeric (eg AS numbers in traceroute), so systems that use numeric typing wouldn't be ideal either
- We would prefer not to use a Java based system if possible
- We want queries that are used on the website to be as fast as possible, so a database that supports fast querying and aggregation over time would be great.
After reviewing these options, I decided the best place to start would be InfluxDB. A few useful features of InfluxDB include:
- Aggregation is done on the fly through the use of continuous queries. We can set up queries to aggregate data that will be queried often into smaller tables. For example, one might make a table that contains just the mean rtt for tcpping between hosts in 5 minute intervals. This speeds up querying a lot.
- It has a very fast write speed
- It is written in Go
- It has an HTTP API
- One series per measurement, so the database schema is nice and simple
- There is a python API available (http://influxdb-python.readthedocs.org/en/latest/)
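To make the aggregation idea concrete, the sketch below shows in plain Python what a 5 minute mean rollup computes. It is only a conceptual illustration of the result a continuous query would store, not how InfluxDB implements it; the function name and the sample values are assumptions.

```python
from collections import defaultdict


def mean_by_interval(points, interval=300):
    """Average (timestamp, value) pairs into fixed-width time buckets.

    Timestamps are seconds since the epoch; each bucket is keyed by
    its start time. The default interval of 300 seconds gives the
    5 minute means described above.
    """
    buckets = defaultdict(list)
    for timestamp, value in points:
        buckets[timestamp - timestamp % interval].append(value)
    return {
        start: sum(values) / len(values)
        for start, values in sorted(buckets.items())
    }


# Hypothetical rtt samples: two fall in the first bucket, one in the next.
samples = [(0, 10.0), (100, 20.0), (300, 30.0)]
print(mean_by_interval(samples))
```

Queries against the rollup only need to read one row per interval rather than every raw measurement, which is where the speed gain comes from.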
A couple of issues with InfluxDB are that:
- Custom functions are not yet supported (https://github.com/influxdb/influxdb/issues/68). It would be great if we could create a 'mode' function to help with traceroute measurements
- The system is based on what are called tags and fields (sort of both a feature and an issue). This means that each column in the table needs to be labelled as either an independent or a dependent variable. You cannot construct queries that select the values of all tags, and likewise you cannot have a 'group by' clause that is based on a particular value
I have successfully filled this database with data from the RabbitMQ dev queue, and have been testing times for standard queries, such as the mean rtts of tcpping over the last couple of days in 5m intervals. Some goals for next week include:
- compare query times for InfluxDB with our current system
- test InfluxDB on the prod stream of data, and backfill a longer period of time with random data, to test how it works with larger volumes of data
- demonstrate and test how a traceroute query may work to find the mode path over a period of time
A couple of results from initial tests on a few queries:
Testing query: 'Select * from "icmp_rtt.means.5m" where time > now() - 2d' for 5 clients (query on aggregated table):
Rows returned: 5243
Average Speed: 0.1007314 seconds
Testing query: 'Select mean(rtt) from icmp where time > now() - 2d group by time(5m)' for 5 clients (query on non-aggregated table):
Rows returned: 577
Average Speed: 6.2435292 seconds
So querying a pre-aggregated table is around 60 times faster, which is a great speed gain. It will be interesting to see how these queries do compared to SQL queries.
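The speedup figure can be checked directly from the two averages above. The sketch below also includes a simple timing helper of the kind that could produce such averages; the helper name, the number of repeats and the idea of passing the query in as a callable are assumptions, not the actual test script.

```python
import time


def average_time(run_query, repeats=5):
    """Return the mean wall-clock time in seconds of calling run_query()."""
    total = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        total += time.perf_counter() - start
    return total / repeats


# Averages reported above, in seconds.
aggregated = 0.1007314
non_aggregated = 6.2435292

speedup = non_aggregated / aggregated
print(f"speedup: {speedup:.1f}x")   # about 62
```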