
Better stream management #76

Closed
orangejulius opened this issue Jan 30, 2018 · 15 comments

Comments

@orangejulius
Member

orangejulius commented Jan 30, 2018

We have seen recently that dbclient does a poor job of throttling bulk requests to Elasticsearch slowly enough to avoid overloading small clusters. In particular, if one bulk request fails, we suspect it is retried while new requests continue to be sent.

This can cause a feedback loop: a single request timing out due to high load triggers retries, which further increase the load on the cluster until the import fails.

One option to consider is using an existing Elasticsearch bulk import tool like https://github.com/hmalphettes/elasticsearch-streams

Connects pelias/openaddresses#328

@orangejulius
Member Author

Based on default elasticsearch-js and Node.js settings, it looks like this module will send out a potentially unlimited number of concurrent batch indexing requests to Elasticsearch. This only happens if the importer feeds records to dbclient much faster than Elasticsearch can process them, but that is easily possible for a number of reasons:

  • In a development environment, the resources allocated to Elasticsearch will often be very small.
  • As an import progresses, Elasticsearch gets slower and slower at indexing new records, since the total number of records in Elasticsearch increases.
  • Oftentimes multiple importers are run in parallel, putting extra strain on Elasticsearch.

We should be able to use the createNodeAgent configuration option in elasticsearch-js to limit the maximum number of concurrent requests.
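A rough sketch of what that could look like with the legacy elasticsearch-js client, assuming we hand it a standard Node.js http.Agent with a socket cap (the exact callback signature can vary between client versions, and maxSockets: 5 is only an illustrative value, not a tested recommendation):

  const http = require('http');
  const elasticsearch = require('elasticsearch');

  const client = new elasticsearch.Client({
    host: 'localhost:9200',
    // createNodeAgent lets us supply our own HTTP agent per connection,
    // so we can cap how many sockets (and therefore concurrent requests)
    // are open to each Elasticsearch node at once.
    createNodeAgent: function (connection, config) {
      return new http.Agent({
        keepAlive: true,
        maxSockets: 5 // hypothetical cap; tune to the size of the cluster
      });
    }
  });

With a cap like this, extra bulk requests queue up inside the importer process instead of piling onto the Elasticsearch bulk queue.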

@orangejulius
Member Author

This may also be related to elastic/elasticsearch-js#196

orangejulius added a commit that referenced this issue Mar 11, 2018
The BatchManager (roughly) manages the maximum number of bulk index
requests that can be processed by Elasticsearch simultaneously.

The default of 50 is good for very large clusters, but not small ones.

In order to make Pelias work better out of the box on smaller setups,
the defaults should be changed. Worst case, this will make imports on
larger Elasticsearch clusters slightly slower, but I doubt we'll even
notice. It might even make them faster.

Connects pelias/openaddresses#328
Connects #76
@orangejulius
Member Author

orangejulius commented Mar 11, 2018

I did some accidental research on this yesterday.

The setup was:
1x Elasticsearch host on a t2.micro (about as small as you can get: only one CPU)
Importing the NYC OpenAddresses data (about 1M records)

With the default configuration settings, I get many, many errors like this:

2018-03-10T15:40:09.484Z - error: [dbclient] [429] type=es_rejected_execution_exception, reason=rejected execution of org.elasticsearch.transport.TransportService$4@6bd7866a on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@3a6f01e9[Running, pool size = 1, active threads = 1, queued tasks = 50, completed tasks = 1180423]]

@chriswhong mentioned he fixed this by setting the batch size to 1 (see example here). It had to be exactly 1; even 2 would break. I was seeing the same behavior. The downside to this workaround is that it drastically reduces import speed. The whole point of the bulk index endpoint is to perform many operations at once and avoid per-request overhead.

Then I did some research to remind myself how Elasticsearch works. Basically, almost all processes within Elasticsearch follow a model where there is a queue of work items to be done (search queries, shard management, or, in this case, bulk indexing) and a thread pool sized according to the number of CPUs.

Importantly, in Elasticsearch 2, the bulk index thread pool has a maximum queue size of 50. You can see in the message above that Elasticsearch is complaining there are already 50 queued bulk index tasks, so it has to reject new ones.
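For anyone who wants to watch this happen on their own cluster, the cat thread pool API shows the bulk queue filling up and rejections accumulating (this is the Elasticsearch 2.x form; on 5.x the URL is _cat/thread_pool/bulk and the column names differ slightly):

  curl 'localhost:9200/_cat/thread_pool?v&h=host,bulk.active,bulk.queue,bulk.rejected'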

Meanwhile, our code in pelias-dbclient is configured to allow a certain number of requests to be in flight at once. That number is... 50.

This is clearly too high for a small Elasticsearch instance, especially considering we usually run multiple importers in parallel.

The good news is that in Elasticsearch 5, the bulk index queue size was raised to 200.

In the meantime, I've opened #83 to lower the defaults a little bit to values that worked fine during my testing.

@otbutz

otbutz commented Mar 12, 2018

Did you tune the number of shards?

@orangejulius
Member Author

@otbutz In this case we were using 5 shards, the Elasticsearch default. I think that many shards likely makes things worse on such a small machine, but it's a good stress test case.

@mapmeld

mapmeld commented Mar 12, 2018

Is there a way that I could set batch size to 1 when I am using the Docker installation?

@orangejulius
Member Author

@mapmeld currently it's only configurable in the importer's code. And to be clear, batch size 1 is not the fix, because it has an extreme impact on import times.

However, you can test out the actual proposed fix for the OpenAddresses importer, and doing so would be much appreciated :)

Change the image property in docker-compose.yml for the openaddresses importer to read like this:
image: pelias/openaddresses:test-new-dbclient-2018-03-11-b6522df219022092c5f6de47135dfcc67f05bc15
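For reference, in context that would look roughly like this (assuming the importer service is named openaddresses in your docker-compose.yml):

  services:
    openaddresses:
      image: pelias/openaddresses:test-new-dbclient-2018-03-11-b6522df219022092c5f6de47135dfcc67f05bc15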

Let me know if it improves things.

orangejulius added a commit to pelias/openstreetmap that referenced this issue Mar 12, 2018
@orangejulius
Member Author

Okay, I've tested the new dbclient branch with OA and OSM. It appears to completely resolve the rejected execution issue. Time to merge :)

@orangejulius
Member Author

For anyone else with this issue, an alternative solution put forth by @paraskashyap is to increase the bulk threadpool on Elasticsearch:

curl -XPUT localhost:9200/_cluster/settings -d '{ "transient" : { "threadpool.bulk.queue_size" : 500 } }'

@mapmeld

mapmeld commented Mar 30, 2018

@orangejulius as a Docker newbie, what's the right place to make this change in the Dockerfile?

@orangejulius
Member Author

@mapmeld that last command can be run straight on the command line once you've started Elasticsearch. I don't think it's something that would go in a Dockerfile or docker-compose.yml.
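For the Docker setup specifically, assuming the Elasticsearch container publishes port 9200 to the host, the command can be run from the host shell once the container is up:

  curl -XPUT localhost:9200/_cluster/settings -d '{ "transient" : { "threadpool.bulk.queue_size" : 500 } }'
  # confirm the transient setting was applied
  curl 'localhost:9200/_cluster/settings?pretty'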

@orangejulius
Member Author

Also, to be clear: we expect this issue to be resolved with code changes, so that workaround shouldn't be needed in the long term.

@otbutz

otbutz commented Apr 3, 2018

@orangejulius quoting an Elastic blog post from Nov 2017:

Increasing the size of the queue is not likely to improve the indexing performance or throughput of your cluster. Instead it would just make the cluster queue up more data in memory, which is likely to result in bulk requests taking longer to complete. The more bulk requests there are in the queue, the more precious heap space will be consumed. If the pressure on the heap gets too large, it can cause a lot of other performance problems and even cluster instability.

Adjusting the queue sizes is therefore strongly discouraged, as it is like putting a temporary band-aid on the problem rather than actually fixing the underlying issue.

@orangejulius
Member Author

@otbutz Right, it's a good temporary workaround until our code changes, which reduce importer concurrency to something less excessive, make it into the production branches (that should happen soon).

Interestingly, they did raise the default from 50 to 200 in Elasticsearch 5.
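If you want to check which value a running cluster is actually using, the node info API reports the configured thread pool settings (the exact output shape differs a bit between versions):

  curl 'localhost:9200/_nodes/thread_pool?pretty'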

@orangejulius
Member Author

It's been nearly 6 months, and I haven't heard any complaints from folks running small builds regarding this issue. It's definitely still possible to overwhelm Elasticsearch if several importers are running in parallel and Elasticsearch has few resources dedicated to it, but it no longer happens routinely.

I'm going to close this, but if anyone else is still having issues, please let us know.

orangejulius added a commit that referenced this issue May 12, 2019
This package has historically been very aggressive regarding how many
requests it will allow to be in flight to Elasticsearch.

We lowered the maximum number of in-flight requests to 10 recently
(see #76), but I think this is still too high. Recently we have seen
some Elasticsearch timeouts when running highly parallel imports.

My suspicion is that it's very unlikely a high number of in-flight bulk
index requests is the best way to ensure high performance. For
geocode.earth, we run planet builds on a 36 core machine, with a total
of 6 importer processes running at once at the start (2 OA, OSM,
polylines, geonames, WOF).

Since the bulk import endpoint already allows importing many records in
parallel (500 by default in this package), 6 importers could lead to up
to 60 bulk requests in flight at once. My guess is even 2-3 bulk
requests is enough to keep Elasticsearch busy.

Eventually I'd like to allow us to configure this option easily across
all importers, but for now let's test this value.

Connects #76
Connects #83