Elasticsearch Daily Indexing on Old-Fashioned Spinning Disks

I believe that if something takes too long, something is wrong. In keeping with this minimalistic approach, I recently resolved a problem for one of my clients.

There was a single node running an Elasticsearch instance, and I had to keep it as fast as it had been in its first days of running. I had no cluster and no scale-out option, because the rack was full! It was a powerful old-fashioned machine with 32 cores, 96 GB of RAM, and 10 TB of spinning-disk capacity. Moreover, there was a 160 GB hourly incoming log stream!

The thing is, when you use Elasticsearch you need to be careful about index size. Very large indexes make Elasticsearch’s engine slow. Actually, it is not an Elasticsearch issue; it is the growing burden of indexing, which is highly I/O-consuming. Inside each index there are many segments. ES creates a new segment on each refresh interval (1 second or more), and each new segment eventually gets merged with the previous ones. All of this is pretty I/O-consuming, so the ingestion process loses performance as the data grows. The river that is supposed to import the incoming data in bulk slows down as well, because rivers are heavily I/O-bound. Eventually the disks fill up with raw data, or billions of records get lost.
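As an aside, one knob worth trying on spinning disks is the refresh interval itself: fewer refreshes mean fewer tiny segments to merge later. A minimal sketch, with a hypothetical index name and an arbitrary 30s value (not a recommendation):

curl -XPUT 'localhost:9200/logs-2014-08-21/_settings' -d '
{
  "index" : { "refresh_interval" : "30s" }
}'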

Sharding and partitioning data into different indexes are similar approaches: both slice your data to prepare it for handling massive volumes. We noticed that for a short period of time (such as half a week) a single index works like a charm; later it slows down. One index per day was good enough for our case.
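Slicing by day does not complicate searching much, because Elasticsearch can query several indexes at once, either listed by name or matched with a wildcard. A rough sketch, assuming the naming pattern from the script below and a hypothetical message field:

curl 'localhost:9200/--index-2014-*-index--/_search?q=message:error'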

The following script is really simple and easy to run and adapt. I wrote it when I needed a solution for the I/O read/write burden of multi-terabyte indexes sitting on mechanical spinning disks.

amirsedighi@amirMacBookPro~: cat idaily.sh
#!/bin/bash
##########################################################
## This simply creates and maintains daily
## elasticsearch indexes.
## by: amir sedighi
##########################################################

today=$(date '+%Y-%m-%d')

### INDEX CREATING
curl -XPUT 'localhost:9200/--index-'$today'-index--'

sleep 5

### MAPPER CREATING
curl -XPUT 'localhost:9200/--index-'$today'-index--/--type-'$today'-type--/_mapping' -d '
{
  "--type-'$today'-type--" : {
    "properties" : {
      MAPPING COMES HERE...
    }
  }
}'

sleep 5

### TODAY RIVER CREATION
curl -XPUT 'localhost:9200/_river/--river-'$today'-river--/_meta' -d '
{
  RIVER DEFINITION COMES HERE...
  "index" : {
    "index" : "--index-'$today'-index--",
    "type" : "--type-'$today'-type--",
    "bulk_size" : 4000,
    "bulk_threshold" : 10
  }
}'

### DELETING YESTERDAY'S RIVER
yest=$(date --date="yesterday" +"%Y-%m-%d")

curl -XDELETE 'localhost:9200/_river/--river-'$yest'-river--'

I set up simple scheduling with crontab, which runs idaily.sh every 10 minutes. Re-creating an object that already exists has no effect, and deleting a non-existent item does no harm either. So just let it run.
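For completeness, the crontab entry looks roughly like this; the script and log paths are just examples, so put in your own:

*/10 * * * * /home/amirsedighi/idaily.sh >> /var/log/idaily.log 2>&1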

The front-end decides how many indexes should be queried. As a result, the ES instance now imports and indexes data blazingly fast, regardless of the total volume.
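If you would rather not teach the front-end the daily naming scheme, an index alias could hide it. A sketch, not something I use here, assuming a hypothetical alias named logs-today:

today=$(date '+%Y-%m-%d')
yest=$(date --date="yesterday" +"%Y-%m-%d")
curl -XPOST 'localhost:9200/_aliases' -d '
{
  "actions" : [
    { "remove" : { "index" : "--index-'$yest'-index--", "alias" : "logs-today" } },
    { "add" : { "index" : "--index-'$today'-index--", "alias" : "logs-today" } }
  ]
}'

On the very first run the remove action would fail, since the alias does not exist yet; start with only the add action.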

While this works fine, I would appreciate it if you let me know whether there is a more automated method for resolving the slowdown on single-node, spinning-disk hardware.

NOTE:
I asked @foundsays for an opinion on this post, and she just answered. The following is the transcript:
@foundsays: We would recommend avoiding using rivers. Also, you could consider using Logstash to achieve your goal…
@amirsedighi: I’m already using Logstash for another client. This one has its own log collector and insists on using it.


4 Responses to Elasticsearch Daily Indexing on Old-Fashioned Spinning Disks

  1. Mehdi says:

    Hi
    I think this is a good idea for big indexes, but there is also the ttl field (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-ttl-field.html) inside ElasticSearch, which may be useful for keeping the index size smaller and getting rid of old, unnecessary information.
    Mehdi

    • admin says:

      Hi Mehdi,

      You’re right, and thank you for sharing. I’ve already used it, but it didn’t work in our scenario. Here the problem was the huge volume of the input stream: the sustained rate is around 160 GB per hour, and at peak moments it goes up to 200 GB per hour. The machine was using spinning disks. Right now we maintain around 10 TB of log files with full-text search, enabled by this simple method.
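      With daily indexes we don’t really need TTL: instead of expiring documents one by one, we can drop a whole day at once, which is much cheaper on spinning disks. A sketch, assuming a 30-day retention (the cutoff is just an example):

      old=$(date --date="30 days ago" +"%Y-%m-%d")
      curl -XDELETE 'localhost:9200/--index-'$old'-index--'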

      At your service.

  2. Reza says:

    Hi! It’s a good approach. I’m also learning and working with ELK for a project, and I used the small-indices approach too (without sharding). I’ve got a question about your way of creating daily indices: why don’t you use aliases in front of the daily indices, and also a template for the mappings and settings of the indices? Don’t these approaches work with “rivers”?

    Note: I’m new to studying English! Please forgive me for my mistakes 🙂
