
GermanWeatherDataExample

Project

Going through a lot of pain has taught me that you have to know the capabilities and pitfalls of a technology before putting it into production. These days it's all about "Big Data", but how well do these technologies work when we actually throw realistic data at them? What knobs have to be turned to make them scale? How far does a single machine take us?

In this project I want to evaluate TimescaleDB, Elasticsearch, SQL Server and InfluxDB on the 10 Minute Weather Data for Germany.

Dataset

The DWD Open Data portal of the Deutscher Wetterdienst (DWD) provides access to historical weather data for Germany. I decided to analyze the available historical air temperature data for Germany, which is given at a 10-minute resolution (FTP). If you want to recreate the example, you can find the list of files here: GermanWeatherDataExample/Resources/files.md.

The DWD dataset comes as CSV files with a total size of approximately 25.5 GB.

Experiment Setup

TODO

Status

SQL Server 2017

TODO

TimescaleDB

TODO

InfluxDB

InfluxDB 1.7.1 is unable to import the entire dataset without changes to the default configuration. It consumes too much RAM under load and at some point could no longer write the batches. After reading through the documentation, I am quite confident that the Retention Policy has to be adjusted, so that the shards do not stay in memory forever:

This is because the default configuration of InfluxDB is optimized for realtime data with a short retention duration and a short shard duration. It makes InfluxDB chew up the entire RAM, simply because too many shards are created and the cached data is never actually written to disk.

So I am now creating the database with DURATION set to infinite (inf), to keep measurements forever. The SHARD DURATION is set to 4 weeks to limit the number of shards created during the import:

CREATE DATABASE "weather_data" WITH DURATION inf REPLICATION 1 SHARD DURATION 4w NAME "weather_data_policy"
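To check that the policy is in place, the retention policies can be listed from the influx shell. This is a sketch; the exact column layout varies by InfluxDB version, but an infinite duration is displayed as 0s and 4 weeks as 672h:

```sql
-- Inside the influx CLI, after creating the database:
SHOW RETENTION POLICIES ON "weather_data"
--
-- name                 duration  shardGroupDuration  replicaN  default
-- ----                 --------  ------------------  --------  -------
-- weather_data_policy  0s        672h0m0s            1         true
```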

In the influxdb.conf I am setting the cache-snapshot-write-cold-duration to 5 seconds, so the caches are flushed more aggressively:

cache-snapshot-write-cold-duration = "5s"
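For context, this setting belongs to the [data] section of influxdb.conf. A minimal sketch of the relevant part, with everything else left at its defaults:

```toml
[data]
  # Time after which a cache that received no writes is
  # snapshotted to disk as a TSM file. The default is "10m";
  # "5s" flushes much more aggressively during the bulk import.
  cache-snapshot-write-cold-duration = "5s"
```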

Elasticsearch

The default configuration of Elasticsearch 6.5.1 is not optimized for bulk loading large amounts of data. To improve the initial load, the first thing I did was to disable replication and the periodic index refresh, by creating the index with 0 replicas (since it is all running locally anyway) and disabling the index refresh interval, so the index isn't refreshed on every insert.

All the performance hints are taken from the Elasticsearch documentation on tuning for indexing speed: https://www.elastic.co/guide/en/elasticsearch/reference/master/tune-for-indexing-speed.html

In Elasticsearch 6.5.1 these settings apparently have to be configured on the index itself, instead of editing the config/elasticsearch.yml. Anyway, it can easily be achieved with NEST, the official .NET client for Elasticsearch:

// We are creating the Index with special indexing options for initial load, 
// as suggested in the Elasticsearch documentation at [1].
//
// We disable the performance-heavy indexing during the initial load and also 
// disable any replicas of the data. This comes at a price of not being able 
// to query the data in realtime, but it will enhance the import speed.
//
// After the initial load I will revert to the standard settings for the Index
// and set the default values for Shards and Refresh Interval.
//
// [1]: https://www.elastic.co/guide/en/elasticsearch/reference/master/tune-for-indexing-speed.html
//
// The index name "weather_data" is the one used throughout this example:
client.CreateIndex("weather_data", createIndex => createIndex
    .Settings(settings => settings
        .NumberOfReplicas(0)
        .RefreshInterval(-1)));
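As mentioned in the comment above, these settings should be reverted once the initial load is done. A sketch with NEST, assuming the index is called weather_data and the Elasticsearch defaults of 1 replica and a 1 second refresh interval:

```csharp
// Restore the default Refresh Interval and number of Replicas
// after the bulk import has finished:
client.UpdateIndexSettings("weather_data", settings => settings
    .IndexSettings(s => s
        .NumberOfReplicas(1)
        .RefreshInterval("1s")));
```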

Additionally I made sure I am running a 64-bit JVM, so the heap size can scale to more than 2 GB, for a fair comparison with systems like InfluxDB that aggressively take ownership of the RAM. You can configure the Elasticsearch JVM settings in the config/jvm.options file.

I have set the initial and maximum size of the total heap space to 6 GB, so Elasticsearch should be able to allocate enough RAM to play with:

# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space
-Xms6g
-Xmx6g

And finally I wanted to make sure that the Elasticsearch process isn't swapped out. According to the Elasticsearch documentation, this can be configured in the config/elasticsearch.yml by adding:

bootstrap.memory_lock: true
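Whether the memory lock actually succeeded can be verified after startup through the Nodes Info API; if mlockall reports false, the process may still get swapped out:

```
GET _nodes?filter_path=**.mlockall
```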

More information on Heap Sizing and Swapping can be found in the Elasticsearch documentation.

