Elasticsearch Throttling Indexing
Recent adventures in Zapierland had me in a somewhat scary predicament. Starting at some point last week I’d get an alert that the Elasticsearch cluster we run Graylog against had hit very high CPU, load, and memory usage. I was confused… we had more than enough horsepower to handle queries, and normally the cluster only uses about 10% of CPU on each node. I paused message processing (a nice feature of Graylog: it just sticks incoming messages into its local journal) and began looking around. Nothing seemed off. Nothing stuck out in the logs. The CPU and load eventually dropped, so I shrugged my shoulders and resumed message processing. Maybe a rogue query? Who knows.
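As an aside, that pause/resume cycle doesn’t have to go through the web UI; Graylog also exposes it through its REST API, which is handy when you’re scripting the investigation. The host, port, credentials, and exact endpoint below are assumptions for illustration and may differ by Graylog version; check the built-in API browser for yours.

# Assumption: Graylog 1.x/2.x REST API on its default port with admin credentials.
# Pause message processing (incoming messages accumulate in the on-disk journal)...
curl -u admin:password -XPUT 'http://graylog.example.com:12900/system/processing/pause'
# ...and resume it once the cluster looks healthy again.
curl -u admin:password -XPUT 'http://graylog.example.com:12900/system/processing/resume'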
A couple of days later it happened again. Pause, resume. Everything began operating normally. While I had some hunches, I had no solid evidence as to the cause and assumed that maybe it was time to expand the cluster. Since it wasn’t a fire (we just use Graylog for internal log searching of API requests), I added a card to our Trello board to kill two birds with one stone: upgrade Graylog and Elasticsearch, and make sure the rebuilt cluster is bigger. I could handle the pause/unpause cycle in the short term.
As it continued happening daily, I got very curious. It only happened during peak load, when a lot of staff members were using it, and it always went back to 100% fine after a pause/unpause cycle. I looked around some more and decided we should tune the Elasticsearch logging level a bit and see if anything stood out the next time it hit.
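The post doesn’t show how that logging tweak was made; on the Elasticsearch versions of that era (1.x/2.x), logger levels can be adjusted dynamically through the cluster settings API using the logger. prefix. The specific logger name below (index.engine, the category that emits the throttling messages you’ll see shortly) and the localhost:9200 endpoint are my assumptions.

# Assumption: ES 1.x/2.x cluster reachable on localhost:9200.
# Bump the engine logger so index-level events are easier to spot; set it back to INFO afterwards.
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient" : {
    "logger.index.engine" : "DEBUG"
  }
}'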
Sure enough, when our engineering team did a quick support blitz it happened again. This time the Elasticsearch logs were filled to the brim with entries like the two below.
[2016-06-06 18:20:34,880][INFO ][index.engine ] [ip-10-0-0-2.ec2.internal] [graylog2_761][0] now throttling indexing: numMergesInFlight=5, maxNumMerges=4
[2016-06-06 18:20:40,804][INFO ][index.engine ] [ip-10-0-0-2.ec2.internal] [graylog2_761][0] stop throttling indexing: numMergesInFlight=3, maxNumMerges=4
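Those lines come from Elasticsearch itself. If you want to confirm that merge activity is what’s burning CPU while the throttling is on, the nodes hot threads API is a quick check (the host and port below are assumptions):

# Dumps the busiest threads on each node; during heavy merging you will
# typically see Lucene merge threads near the top of the output.
curl 'http://localhost:9200/_nodes/hot_threads'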
Well, that’s a bit curious. Off to Google. After a little searching I came across this entry in the Elasticsearch reference.
Segment merging is computationally expensive, and can eat up a lot of disk I/O. Merges are scheduled to operate in the background because they can take a long time to finish, especially large segments. This is normally fine, because the rate of large segment merges is relatively rare.
But sometimes merging falls behind the ingestion rate. If this happens, Elasticsearch will automatically throttle indexing requests to a single thread. This prevents a segment explosion problem, in which hundreds of segments are generated before they can be merged. Elasticsearch will log INFO-level messages stating now throttling indexing when it detects merging falling behind indexing.
Elasticsearch defaults here are conservative: you don’t want search performance to be impacted by background merging. But sometimes (especially on SSD, or logging scenarios), the throttle limit is too low.
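To get a feel for how close an index is to that segment explosion, the cat segments API lists every segment per shard. The index name below is just the one from the log lines above, and localhost:9200 is an assumption:

# One row per segment per shard; a count that keeps climbing during ingest is the symptom described above.
curl 'http://localhost:9200/_cat/segments/graylog2_761?v'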
A key point right after that statement is what the default rate actually is.
The default is 20 MB/s, which is a good setting for spinning disks. If you have SSDs, you might consider increasing this to 100–200 MB/s.
Well, this seems promising, because we DO use SSDs. Following the documentation, I issued a curl call to raise the indices.store.throttle.max_bytes_per_sec rate.
PUT /_cluster/settings
{
  "persistent" : {
    "indices.store.throttle.max_bytes_per_sec" : "100mb"
  }
}
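For reference, here is the same update as a plain curl call, plus a read-back to confirm it stuck (localhost:9200 is an assumption; adjust for your cluster):

# Apply the setting persistently (it survives a full cluster restart).
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "persistent" : {
    "indices.store.throttle.max_bytes_per_sec" : "100mb"
  }
}'

# Read the cluster settings back to verify the change took effect.
curl 'http://localhost:9200/_cluster/settings?pretty'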
With this setting in place I had a few hours until the next support blitz. Suffice it to say, it has been several days and we haven’t had to pause message processing yet. This is definitely an important setting to get right when you tune the defaults on your cluster.
Reference: Elasticsearch Throttling Indexing from our SCG partner James Carr at the Rants and Musings of an Agile Developer blog.