
Better stream management #76

Closed
orangejulius opened this issue Jan 30, 2018 · 15 comments

Comments

@orangejulius
Member

orangejulius commented Jan 30, 2018

We have seen recently that dbclient does a poor job of throttling bulk requests to Elasticsearch slowly enough to avoid overloading small clusters. In particular, if one bulk request fails, we suspect it is retried while new requests continue to be sent.

This can cause a feedback loop: a single request timing out due to high load triggers retries, which further increase the load on the cluster until the import fails.

One option to consider is using an existing Elasticsearch bulk import tool like https://github.com/hmalphettes/elasticsearch-streams

Connects pelias/openaddresses#328

@orangejulius
Member Author

Based on default elasticsearch-js and Node.js settings, it looks like this module will send out a potentially unlimited number of concurrent batch indexing requests to Elasticsearch. This only happens if the importer feeds records to dbclient much faster than Elasticsearch can process them, but that is easily possible for a number of reasons:

  • In a development environment, the resources allocated to Elasticsearch will often be very small.
  • As an import progresses, Elasticsearch gets slower and slower at indexing new records, since the total number of records in Elasticsearch increases.
  • Oftentimes multiple importers are run in parallel, putting extra strain on Elasticsearch.

We should be able to use the createNodeAgent configuration option in elasticsearch-js to limit the maximum number of concurrent requests.
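A rough sketch of what that could look like with the legacy elasticsearch-js client, assuming we hand it a standard Node.js http.Agent with a socket cap (the exact callback signature can vary between client versions, and maxSockets: 5 is only an illustrative value, not a tested recommendation):

  const http = require('http');
  const elasticsearch = require('elasticsearch');

  const client = new elasticsearch.Client({
    host: 'localhost:9200',
    // createNodeAgent lets us supply our own HTTP agent per connection,
    // so we can cap how many sockets (and therefore concurrent requests)
    // are open to each Elasticsearch node at once.
    createNodeAgent: function (connection, config) {
      return new http.Agent({
        keepAlive: true,
        maxSockets: 5 // hypothetical cap; tune to the size of the cluster
      });
    }
  });

With a cap like this, extra bulk requests queue up inside the importer process instead of piling onto the Elasticsearch bulk queue.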

@orangejulius
Member Author

This may also be related to elastic/elasticsearch-js#196

orangejulius added a commit that referenced this issue Mar 11, 2018
The BatchManager (roughly) manages the maximum number of bulk index
requests that can be processed by Elasticsearch simultaneously.

The default of 50 is good for very large clusters, but not small ones.

In order to make Pelias work better out of the box on smaller setups,
the defaults should be changed. Worst case, this will make imports on
larger Elasticsearch clusters slightly slower, but I doubt we'll even
notice. It might even make them faster.

Connects pelias/openaddresses#328
Connects #76
@orangejulius
Member Author

orangejulius commented Mar 11, 2018

I did some accidental research on this yesterday.

The setup was:
1x Elasticsearch host on a t2.micro (about as small as you can get: only one CPU)
Importing the NYC OpenAddresses data (about 1M records)

With the default configuration settings, I get many, many errors like this:

2018-03-10T15:40:09.484Z - error: [dbclient] [429] type=es_rejected_execution_exception, reason=rejected execution of org.elasticsearch.transport.TransportService$4@6bd7866a on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@3a6f01e9[Running, pool size = 1, active threads = 1, queued tasks = 50, completed tasks = 1180423]]

@chriswhong mentioned he fixed this by setting the batch size to 1 (see example here). It had to be exactly 1; even 2 would break. I was seeing the same behavior. The downside to this workaround is that it drastically reduces import speed. The whole point of the bulk index endpoint is to perform many operations at once and avoid per-request overhead.

Then I did some research to remind myself how Elasticsearch works. Basically, almost all processes within Elasticsearch follow a model where there is a queue of work items to be done (search queries, shard management, or, in this case, bulk indexing) and a thread pool sized according to the number of CPUs.

Importantly, in Elasticsearch 2, the bulk index thread pool has a maximum queue size of 50. You can see in the message above that Elasticsearch is complaining there are already 50 queued bulk index tasks, so it has to reject new ones.
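For anyone who wants to watch this happen on their own cluster, the cat thread pool API shows the bulk queue filling up and rejections accumulating (this is the Elasticsearch 2.x form; on 5.x the URL is _cat/thread_pool/bulk and the column names differ slightly):

  curl 'localhost:9200/_cat/thread_pool?v&h=host,bulk.active,bulk.queue,bulk.rejected'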

Meanwhile, our code in pelias-dbclient is configured to allow a certain number of requests to be in flight at once. That number is... 50.

This is clearly too high for a small Elasticsearch instance, especially considering we usually run multiple importers in parallel.

The good news is that in Elasticsearch 5, the bulk index queue size was raised to 200.

In the meantime, I've opened #83 to lower the defaults a little bit to values that worked fine during my testing.

@otbutz

otbutz commented Mar 12, 2018

Did you tune the number of shards?

@orangejulius
Member Author

@otbutz In this case we were using 5 shards, the Elasticsearch default. I think that many shards likely makes things worse on such a small machine, but it's a good stress test case.

@mapmeld

mapmeld commented Mar 12, 2018

Is there a way that I could set batch size to 1 when I am using the Docker installation?

@orangejulius
Member Author

@mapmeld currently it's only configurable in the importer's code. And to be clear, batch size 1 is not the fix, because it has an extreme impact on import times.

However, you can test out the actual proposed fix for the OpenAddresses importer, and doing so would be much appreciated :)

Change the image property in docker-compose.yml for the openaddresses importer to read like this:
image: pelias/openaddresses:test-new-dbclient-2018-03-11-b6522df219022092c5f6de47135dfcc67f05bc15
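For reference, in context that would look roughly like this (assuming the importer service is named openaddresses in your docker-compose.yml):

  services:
    openaddresses:
      image: pelias/openaddresses:test-new-dbclient-2018-03-11-b6522df219022092c5f6de47135dfcc67f05bc15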

Let me know if it improves things.

orangejulius added a commit to pelias/openstreetmap that referenced this issue Mar 12, 2018
@orangejulius
Member Author

Okay, I've tested the new dbclient branch with OA and OSM. It appears to completely resolve the rejected execution issue. Time to merge :)

@orangejulius
Member Author

For anyone else with this issue, an alternative solution put forth by @paraskashyap is to increase the bulk threadpool on Elasticsearch:

curl -XPUT localhost:9200/_cluster/settings -d '{ "transient" : { "threadpool.bulk.queue_size" : 500 } }'

@mapmeld

mapmeld commented Mar 30, 2018

@orangejulius as a Docker newbie, what's the right place to make this change in the Dockerfile?

@orangejulius
Member Author

@mapmeld that last command can be run straight on the command line once you've started Elasticsearch. I don't think it's something that would go in a Dockerfile or docker-compose.yml.
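For the Docker setup specifically, assuming the Elasticsearch container publishes port 9200 to the host, the command can be run from the host shell once the container is up:

  curl -XPUT localhost:9200/_cluster/settings -d '{ "transient" : { "threadpool.bulk.queue_size" : 500 } }'
  # confirm the transient setting was applied
  curl 'localhost:9200/_cluster/settings?pretty'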

@orangejulius
Member Author

Also, to be clear: we expect this issue to be resolved with code changes, so that workaround shouldn't be needed in the long term.

@otbutz

otbutz commented Apr 3, 2018

@orangejulius quoting an Elastic blog post from Nov 2017:

Increasing the size of the queue is not likely to improve the indexing performance or throughput of your cluster. Instead it would just make the cluster queue up more data in memory, which is likely to result in bulk requests taking longer to complete. The more bulk requests there are in the queue, the more precious heap space will be consumed. If the pressure on the heap gets too large, it can cause a lot of other performance problems and even cluster instability.

Adjusting the queue sizes is therefore strongly discouraged, as it is like putting a temporary band-aid on the problem rather than actually fixing the underlying issue.

@orangejulius
Member Author

@otbutz Right, it's a good temporary workaround until our code changes, which reduce importer concurrency to something less excessive, make it into the production branches (that should happen soon).

Interestingly, they did raise the default from 50 to 200 in Elasticsearch 5.
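If you want to check which value a running cluster is actually using, the node info API reports the configured thread pool settings (the exact output shape differs a bit between versions):

  curl 'localhost:9200/_nodes/thread_pool?pretty'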

@orangejulius
Member Author

It's been nearly 6 months, and I haven't heard any complaints from folks running small builds regarding this issue. It's definitely still possible to overwhelm Elasticsearch if several importers are running in parallel and Elasticsearch has few resources dedicated to it, but it no longer happens routinely.

I'm going to close this, but if anyone else is still having issues, please let us know.

orangejulius added a commit that referenced this issue May 12, 2019
This package has historically been very aggressive regarding how many
requests it will allow to be in flight to Elasticsearch.

We lowered the maximum number of in-flight requests to 10 recently
(see #76), but I think this is still too high. Recently we have seen
some Elasticsearch timeouts when running highly parallel imports.

My suspicion is that it's very unlikely a high number of in-flight bulk
index requests is the best way to ensure high performance. For
geocode.earth, we run planet builds on a 36 core machine, with a total
of 6 importer processes running at once at the start (2 OA, OSM,
polylines, geonames, WOF).

Since the bulk import endpoint already allows importing many records in
parallel (500 by default in this package), 6 importers could lead to up
to 60 bulk requests in flight at once. My guess is even 2-3 bulk
requests is enough to keep Elasticsearch busy.

Eventually I'd like to allow us to configure this option easily across
all importers, but for now let's test this value.

Connects #76
Connects #83