
Managing Big Data Time Series – Part 1

by Maximilian Schmidt on January 06, 2017
When it comes to the visualization of primitive data that is stored somewhere in your application architecture, you will basically encounter the following choice of options:
  • Retrieve the raw data and transform it into the form you need (pay me later)
  • Retrieve an already transformed form of the data (pay me now)

The data can be transformed into a different technical format (e.g. raw bytes to JSON) or into a new aggregation (e.g. raw data points to downsampled data points, or just a sum of data points).
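To make the aggregation idea concrete, here is a minimal sketch that downsamples raw (timestamp, value) points into fixed time buckets and averages them; the bucket size and data layout are assumptions made for this example, not a description of our actual pipeline.

    from collections import defaultdict

    def downsample(points, bucket_seconds=60):
        """Average raw (timestamp, value) points into fixed time buckets.

        points: iterable of (unix_timestamp, value) tuples
        returns: sorted list of (bucket_start, mean_value) tuples
        """
        buckets = defaultdict(list)
        for ts, value in points:
            bucket_start = ts - (ts % bucket_seconds)
            buckets[bucket_start].append(value)
        return sorted((start, sum(vals) / len(vals)) for start, vals in buckets.items())

    # Example: three raw latency samples collapse into one 60-second bucket
    raw = [(1483693200, 12.5), (1483693230, 13.1), (1483693259, 12.9)]
    print(downsample(raw))  # [(1483693200, 12.83...)]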

But these discussions aren't new. Everyone knows the pros and cons of both options.

The "pay me later" option is easy to implement, but the transformation logic tends to move towards the frontend of your application and cause it to run slower.

The "pay me now" option may require a broader part of your application to store and manage the different types of transformations, and can thus be hard to manage, but overall it can be much faster. Of course, a mix of these two approaches could also be feasible, but I don't want to cover that here.

What is time series data?

One of the key problems we faced at Datapath.io was: how do we want to show the internet latency data we are constantly measuring? First of all, since we are constantly collecting this data, we should call it "time series data". Everyone has heard of this term – a series of data points shown over a period of time. There are plenty of libraries, full-stack applications, and tools for this – some of them take the "lazy" approach, some the "early" one, others both.

The other important problem we had was the "key" itself. This might sound like a strange issue, but consider what we want to show and what the cardinality of the keys looks like (a snapshot corresponding to the current deployment of the datapath.io architecture).

A time series that is measured from a source (a transit link within an AWS region) to a given network prefix:

  • 3 regions
  • 4 transit links per region
  • 650,000 network prefixes
  • → 7,800,000 time series keys

A time series that is measured from a source (a transit link within an AWS region) to a given network prefix (optimized):

  • 3 regions
  • 1 optimized transit link per region
  • 650,000 network prefixes
  • → 1,950,000 time series keys

A time series that is measured from a source (see above) to a given geolocation (extracted from a geodatabase):

  • 3 regions
  • 4 transit links per region
  • 25,000 geolocations
  • → 300,000 time series keys

A time series that is measured from a source (see above) to a given Autonomous System:

  • 3 regions
  • 4 transit links per region
  • 52,000 ASNs
  • → 624,000 time series keys

Of course, there are some other smaller special cases that we are also interested in, but these numbers should be enough to give you an idea of how much time series data we have.

To sum up: more than 10 million keys can be shown on the frontend.
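A quick back-of-the-envelope check of the numbers from the lists above (the figures are taken directly from those lists; only the arithmetic is added here):

    regions = 3
    transit_links = 4
    prefixes = 650000
    geolocations = 25000
    asns = 52000

    prefix_keys = regions * transit_links * prefixes        # 7,800,000
    optimized_prefix_keys = regions * 1 * prefixes          # 1,950,000
    geo_keys = regions * transit_links * geolocations       # 300,000
    asn_keys = regions * transit_links * asns               # 624,000

    total = prefix_keys + optimized_prefix_keys + geo_keys + asn_keys
    print(total)  # 10,674,000 – more than 10 million time series keys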

Reducing the key cardinality

The first question you may ask yourself is: "Why the hell aren't they reducing the key cardinality?"

Well, basically because we have the source data available and everyone wants to see it.

But what is the source data? As mentioned before in our blog post about AMQP and RabbitMQ, the raw data for all internet latency measurements is stored on an HDFS cluster in a byte format. Initially, this data is neither human-readable nor directly usable by another service. To transform the data (format conversion, calculations, etc.), we use Apache Spark.
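As a rough sketch of what such a Spark transformation job can look like (the HDFS paths and the per-line record layout are hypothetical assumptions for this example, not our actual format):

    from pyspark import SparkContext

    sc = SparkContext(appName="latency-transform")

    def parse_measurement(line):
        # Hypothetical record layout: "series_key,timestamp,latency_ms" per line.
        key, ts, latency = line.split(",")
        return key, int(ts), float(latency)

    # Read the raw byte blobs from HDFS, decode them into records, and
    # compute the mean latency per time series key.
    raw = sc.binaryFiles("hdfs:///measurements/2017/01/06/")   # (path, bytes) pairs
    lines = raw.flatMap(lambda pair: pair[1].decode("utf-8").splitlines())
    records = lines.map(parse_measurement)

    averages = (records
                .map(lambda r: (r[0], (r[2], 1)))
                .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
                .mapValues(lambda s: s[0] / s[1]))             # mean latency per key

    averages.saveAsTextFile("hdfs:///measurements/aggregated/2017/01/06/")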

InfluxDB and other solutions

Now to the final question – where can we put the transformed data so that it will be available for use by the frontend?

When thinking about how to handle time series data, most engineers will remember the famous RRDtool – but to be honest, the "API" and concept of this tool are not applicable to most modern application scenarios, and manipulating data with it is painful. But there are some other options: IBM's Informix time series database is one of the enterprise software solutions; however, such enterprise software was not considered. There are also open-source solutions such as Graphite and InfluxDB.

InfluxDB is currently very popular on GitHub and is used by many companies because of some very nice features (illustrated in the short sketch after this list), such as:
  • SQL-like queries
  • aggregation functions
  • distributed backend
  • REST interface
  • fast read/write access
  • high key cardinality
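
Here is a small sketch of what the SQL-like queries, aggregation functions, and REST interface look like in practice; the database name, measurement, and field are made up for this example:

    import requests

    # Hypothetical database "latency" with a measurement "internet_latency";
    # InfluxDB exposes InfluxQL queries over HTTP via GET /query.
    query = (
        'SELECT MEAN("latency_ms") FROM "internet_latency" '
        'WHERE time > now() - 1h GROUP BY time(5m)'
    )
    response = requests.get(
        "http://localhost:8086/query",
        params={"db": "latency", "q": query},
    )
    print(response.json())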

This all seems really nice, but when we delve deeper to find out exactly what they mean by "high key cardinality", we will find in the documentation that the maximum key cardinality is around 10 million keys. So, unfortunately, InfluxDB will probably be infeasible for our purposes.

This drove us to an attempt to build our own solution.

To see how we did this, stay tuned for Part 2 (coming soon).

