
Andrew Bell's blog

09 Feb 2016

Have finished the implementation using InfluxDB, and have spent some time ironing out bugs. Influx Continuous Queries only automatically aggregate the last couple of bins of data in real time, so when backfilling old data it needs to be rolled up explicitly. Have written code to do this.
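A minimal sketch of the manual rollup, using the influxdb-python client and the 5-minute mean rollup measurement mentioned elsewhere in these posts; the function name, connection details and field names are assumptions, not the actual AMP code:

    from influxdb import InfluxDBClient

    client = InfluxDBClient('localhost', 8086, database='amp')

    def rollup_window(start, end):
        # Re-run the aggregation the continuous query would have done in
        # real time, but over an old window of backfilled data.
        result = client.query(
            "SELECT mean(rtt) FROM icmp WHERE time >= '%s' AND time < '%s' "
            "GROUP BY time(5m)" % (start, end))
        # Write the aggregated points into the rollup measurement by hand.
        client.write_points([{
            'measurement': 'icmp_rtt.means.5m',
            'time': point['time'],
            'fields': {'mean': point['mean']},
        } for point in result.get_points()])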

Also did a bit of a hack to get the smoke arrays behaving more sensibly when there are fewer than 20 observations per bin. It doesn't make sense to take 20 percentiles of 3 results, so the raw results need to be returned instead when there are fewer than 20 of them. Currently doing this by still rolling up the percentiles and then taking only as many percentiles as there are results. Seems to be working, but not an ideal implementation; it would be better to do this at the database level, which isn't possible with InfluxDB.
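A sketch of the workaround as described, with hypothetical names; the rollup always stores 20 percentiles, and the query side trims them back when the bin was nearly empty:

    def trim_smoke(percentiles, count, npercentiles=20):
        # A bin with fewer observations than percentiles can't support a
        # meaningful distribution, so keep only as many values as there
        # were raw results, per the approach described above.
        if count < npercentiles:
            return percentiles[:count]
        return percentiles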

Will walk Brendon through what I've done this week so someone else can revisit the code once I leave.

02 Feb 2016

I have completed wiring up the querying functionality for amp-web to influx.

Will now spend some time refactoring, documenting and tidying up a few corners of the code. Graphs and matrices seem to be doing what they're supposed to, but I will double-check that everything still works once more data is inserted and that I haven't missed any edge cases. Will also see if I can speed up the matrix query by searching for multiple streams at once.
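One hedged possibility for the multi-stream matrix query, assuming each series carries a 'stream' tag (the tag name is an assumption):

    def matrix_query(client, streams):
        # OR together the stream ids so influx answers one query for the
        # whole matrix, then splits the results back out per stream tag.
        where = " OR ".join("stream = '%s'" % s for s in streams)
        return client.query(
            "SELECT mean(rtt) FROM icmp "
            "WHERE (%s) AND time > now() - 1h GROUP BY stream" % where)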

25 Jan 2016

This week I began investigating how easy it is to store data from InfluxDB on an external disk, and found that Influx has put out a new update (0.10.0) in beta which provides functionality to back up and restore data. Combining this with retention policies means that we will be able to easily send data to a backup disk and have it drop off the main storage system after a given amount of time.

I have begun to integrate InfluxDB into the AMP system, and have got the dataparsers sending data to Influx for every test except traceroute, while still using PostgreSQL for information about streams. I have also added code to automatically generate continuous queries and retention policies on setup. The plan now is to get client queries using the InfluxDB data.
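A rough sketch of what that setup step generates, via the influxdb-python client; the database name, durations and measurement names are assumptions:

    from influxdb import InfluxDBClient

    client = InfluxDBClient('localhost', 8086, database='amp')

    def setup_influx():
        # Raw data drops off the main store after 30 days, which together
        # with the new backup functionality gives the behaviour above.
        client.query('CREATE RETENTION POLICY "raw" ON "amp" '
                     'DURATION 30d REPLICATION 1 DEFAULT')
        # Continuously roll up mean latency into 5 minute bins.
        client.query('CREATE CONTINUOUS QUERY "icmp_means_5m" ON "amp" BEGIN '
                     'SELECT mean(rtt) INTO "icmp_rtt.means.5m" '
                     'FROM icmp GROUP BY time(5m) END')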

18 Jan 2016

This week I have been working on implementing and testing an Elasticsearch database to test against InfluxDB. I have written tests for all measurements except the mode traceroute ones, as I haven't found an easy way to do that in Elasticsearch. Results so far indicate that Elasticsearch tends to come in second best in most tests.
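For comparison, the Elasticsearch equivalent of the binned mean-RTT tests looks roughly like this with the elasticsearch-py client; the index and field names are assumptions:

    from elasticsearch import Elasticsearch

    es = Elasticsearch(['localhost:9200'])

    def mean_rtt_5m(days=2):
        # date_histogram plays the role of influx's GROUP BY time(5m).
        return es.search(index='amp', body={
            'query': {'range': {'time': {'gte': 'now-%dd' % days}}},
            'aggs': {
                'bins': {
                    'date_histogram': {'field': 'time', 'interval': '5m'},
                    'aggs': {'mean_rtt': {'avg': {'field': 'rtt'}}},
                },
            },
            'size': 0,
        })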

Have also been encountering a lot of issues with memory usage and crashes, and am not finding Elasticsearch to be very reliable under many concurrent searches. I don't think Elasticsearch will be the right choice for AMP: despite the positive results from this CERN evaluation (http://cds.cern.ch/record/2011172/files/LHCb-TALK-2015-060.pdf), it doesn't consistently outperform Influx in speed or reliability for our use case.

18 Dec 2015

This week I have collated some results from tests and produced graphs to demonstrate the difference in query times between InfluxDB and PostgreSQL. I have also updated the version of Influx I am using and have been testing the new storage engine.

I have spent some time investigating Elasticsearch and getting it installed on a VM. I have it running now, so I will start working on filling it with production data when I get back after the break.

14 Dec 2015

We calculated that for the latency chart, issuing m x n separate queries at current speeds would take a few seconds on InfluxDB (where m is the number of sources and n is the number of targets). I experimented with querying for whole rows and for the whole grid at once, and found significant speed-ups (about 10x for the whole grid).
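The whole-grid version relies on grouping by tags, so a single query returns a separate series per source/destination pair; a hedged sketch, assuming 'source' and 'destination' tags:

    from influxdb import InfluxDBClient

    client = InfluxDBClient('localhost', 8086, database='amp')

    # One query for the entire latency matrix instead of m x n queries:
    # influx returns one series per source/destination combination.
    grid = client.query(
        "SELECT mean(rtt) FROM icmp "
        "WHERE time > now() - 1h GROUP BY source, destination")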

Have been investigating why Influx seems to have a baseline query latency of 2.5 ms by posting in forums etc., but have had no breakthrough. Influx has just upgraded their storage engine, so I will test against it when it comes out.

Have rewritten the traceroute tests to group by IP paths over the past 48 hours, which has slowed the query down to around two and a half seconds on average.

Also investigated whether we can use retention policies to discard old data but back it up elsewhere. That's not really what they're designed for, but it seems something like this could be done with clustering. May need to test.

04 Dec 2015

Have run all common queries and their equivalents on both PostgreSQL and InfluxDB and made a table of results. InfluxDB only came out ahead on a few queries, but these were among the most common ones, and the gains were reasonably significant.

I have noticed that InfluxDB queries seem to have a lower bound of about 2.5 milliseconds. I've also noticed that the Influx database itself takes a much bigger portion of the CPU than PostgreSQL during testing, which means my results may be partially CPU-limited.

Also used run-length encoding to save space on traceroute data in InfluxDB, and added a unique ID to each AS path, with the help of a second table for storing the unique IDs and paths. This is sort of using Influx for something it isn't designed for (as a relational DB), but it seems to be working for the limited purpose of reading a dictionary of already-encountered paths and IDs into memory before beginning to insert new data.
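A minimal sketch of both pieces, with hypothetical names: run-length encoding an AS path, and the in-memory dictionary of already-encountered paths:

    def run_length_encode(path):
        # e.g. ['64496', '64496', '64511'] -> [('64496', 2), ('64511', 1)]
        encoded = []
        for hop in path:
            if encoded and encoded[-1][0] == hop:
                encoded[-1] = (hop, encoded[-1][1] + 1)
            else:
                encoded.append((hop, 1))
        return encoded

    # Mapping of already-encountered paths to ids, loaded from the second
    # table at startup; new paths get fresh ids and are written back.
    path_ids = {}

    def path_id(path):
        key = tuple(path)
        if key not in path_ids:
            path_ids[key] = len(path_ids)
        return path_ids[key]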

30 Nov 2015

Spent the first half of the week cleaning up and refactoring some of my code. I have begun benchmarking query speeds for InfluxDB against our current database. I have been provided with a list of common queries, so I am making my way through those and comparing them against equivalent queries on InfluxDB. Preliminary results show that PostgreSQL is generally faster for most queries, but the queries that use InfluxDB's Continuous Queries are sped up quite a bit and often beat PostgreSQL speeds significantly. Continuous queries are like views which continually aggregate data and store it.

23 Nov 2015

I have put together a demonstration script for getting the mode traceroute path over a period of time. Had some issues with timezones, but these are sorted out now. It looks like this will be a reasonably quick operation.
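The core of the script amounts to something like the following hedged sketch, assuming paths are stored by id against 'source' and 'destination' tags (all names are assumptions):

    from collections import Counter

    def mode_path(client, source, destination, hours=48):
        # Fetch every observed path id in the window and take the most
        # common one; ties fall to whichever Counter happens to return.
        result = client.query(
            "SELECT path_id FROM traceroute "
            "WHERE source = '%s' AND destination = '%s' "
            "AND time > now() - %dh" % (source, destination, hours))
        counts = Counter(p['path_id'] for p in result.get_points())
        return counts.most_common(1)[0][0] if counts else None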

Also encountered an issue with InfluxDB earlier in the week where it was using up massive amounts of memory and the system was killing it. I have learned that the reason for this is that it stores an in-memory inverted index of the unique series. I had given the measurements too many tags with too many possible unique combinations, so I have redesigned them so that there are only 21938 unique series instead of over a million. This seems to have fixed the issue.
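The arithmetic behind the redesign is just the product of distinct values per tag; a toy helper (the example tag names and counts are made up) shows how quickly this blows up:

    from functools import reduce
    from operator import mul

    def worst_case_series(tag_cardinalities):
        # Upper bound on unique series: every combination of tag values.
        return reduce(mul, tag_cardinalities.values(), 1)

    # e.g. 30 sources x 30 destinations x 4 packet sizes = 3600 series;
    # one extra high-cardinality tag multiplies this enormously.
    worst_case_series({'source': 30, 'destination': 30, 'packet_size': 4})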

I have been backfilling data over the last month, though this is a slow process, as InfluxDB is designed for inserting recent points rather than older ones. Once I have a decent amount of data I will run some benchmarks against our PostgreSQL database.

16 Nov 2015

I have been working on investigating time series databases for the AMP system. Some of the more promising systems included KairosDB (http://kairosdb.github.io/), Prometheus (http://prometheus.io/) and InfluxDB (https://influxdb.com/index.html). I also looked at other systems including Druid, OpenTSDB and ElasticSearch, but for a number of reasons these didn't seem to be ideal for our use case.

A few things that were important to consider for AMP include:
- The server may not necessarily be clustered, so systems that rely on clustering aren't particularly useful
- The data taken from each test (http, tcpping etc.) is reasonably multi-dimensional, so time series databases that only handle univariate data aren't useful either
- Not all data from tests is numeric (e.g. AS numbers in traceroute), so systems that enforce numeric typing wouldn't be ideal either
- We would prefer not to use a Java-based system if possible
- We want queries that are used on the website to be as fast as possible, so a database that supports fast querying and aggregation over time would be great

After reviewing these options, I decided the best place to start would be InfluxDB. A few useful features of InfluxDB include:
- Aggregation is done on the fly through the use of continuous queries. We can set up queries to aggregate data that will be queried often into smaller tables. For example, one might make a table that contains just the mean rtt for tcpping between hosts in 5 minute intervals. This speeds up querying a lot.
- It has a very fast write speed
- It is written in Go
- It has an HTTP API
- One series per measurement, so the database schema is nice and simple
- There is a python API available (http://influxdb-python.readthedocs.org/en/latest/)

A couple of issues with InfluxDB are that:
- Custom functions are not yet supported (https://github.com/influxdb/influxdb/issues/68). It would be great if we could create a 'mode' function to help with traceroute measurements
- The system is based on what are called tags and fields (sort of both a feature and an issue). This means that each column in a table needs to be labelled as either an independent variable (a tag) or a dependent one (a field). You cannot construct queries that select only tag values, and likewise you cannot have a 'group by' clause based on a field value (see the sketch below)
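A hedged illustration of the split using the python API (the measurement and value names are assumptions): tags are indexed and usable in WHERE and GROUP BY clauses, while fields carry the measured values:

    from influxdb import InfluxDBClient

    client = InfluxDBClient('localhost', 8086, database='amp')

    client.write_points([{
        'measurement': 'icmp',
        # tags: indexed independent variables; these can be grouped by
        'tags': {'source': 'ampz-auckland', 'destination': 'ampz-waikato'},
        # fields: dependent values; these cannot be grouped by, and a
        # query cannot return tag values without selecting a field
        'fields': {'rtt': 1234.0, 'loss': 0},
    }])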

I have successfully filled this database with data from the RabbitMQ dev queue (a sketch of the ingest path follows the list below), and have been testing times for standard queries, such as the mean rtts of tcpping over the last couple of days in 5m intervals. Some goals for next week include:
- compare query times for InfluxDB with our current system
- test InfluxDB on the prod stream of data, and backfill a longer period of time with random data, to test how it works with larger volumes of data
- demonstrate and test how a traceroute query may work to find the mode path over a period of time
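The ingest path from the dev queue looks roughly like the following sketch, using pika's older (pre-1.0) callback signature; the queue name and message layout are assumptions:

    import json
    import pika
    from influxdb import InfluxDBClient

    client = InfluxDBClient('localhost', 8086, database='amp')

    def on_message(channel, method, properties, body):
        # Translate one test result from the queue into an influx point.
        msg = json.loads(body)
        client.write_points([{
            'measurement': msg['test'],
            'tags': {'source': msg['source'],
                     'destination': msg['destination']},
            'fields': msg['results'],
            'time': msg['timestamp'],
        }])
        channel.basic_ack(delivery_tag=method.delivery_tag)

    connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    channel = connection.channel()
    channel.basic_consume(on_message, queue='amp-dev')
    channel.start_consuming()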

A couple of results from initial tests on a few queries:

Testing query: 'Select * from "icmp_rtt.means.5m" where time > now() - 2d' for 5 clients (query on aggregated table):
Rows returned: 5243
Average Speed: 0.1007314 seconds

Testing query: 'Select mean(rtt) from icmp where time > now() - 2d group by time(5m)' for 5 clients (query on non-aggregated table):
Rows returned: 577
Average Speed: 6.2435292 seconds

So querying a pre-aggregated table is around 60 times faster, which is a great speed gain. It will be interesting to see how these queries compare with the equivalent SQL queries.