Better stream management #76
Comments
It looks like based on default
We should be able to use the
This may also be related to elastic/elasticsearch-js#196
The BatchManager (roughly) manages the maximum number of bulk index requests that can be in flight to Elasticsearch simultaneously. The default of 50 is good for very large clusters, but not small ones. To make Pelias work better out of the box on smaller setups, the defaults should be lowered. Worst case, this will make imports on larger Elasticsearch clusters slightly slower, but I doubt we'll even notice; it might even make them faster. Connects pelias/openaddresses#328 Connects #76
I did some accidental research on this yesterday. The setup was: With the default configuration settings, I get many, many errors like this:
@chriswhong mentioned he fixed this by setting the batch size to 1 (see example here). It had to be exactly 1, even two would break. I was seeing the same behavior. The downside to this workaround is it drastically reduces import speed. The whole point of the bulk index endpoint is to perform many operations at once and avoid overhead.

Then I did some research to remember how Elasticsearch works. Basically, almost all processes within Elasticsearch work with a model where there is a queue of work items to be done (it might be search queries, managing shards, or in this case, bulk indexes), and a thread pool which is sized according to the number of CPUs. Importantly, in Elasticsearch 2, the bulk index thread pool has a max queue size of 50. You can see in the message above that Elasticsearch is complaining there are 50 bulk index tasks and so it has to drop some of them.

Meanwhile, our code in pelias-dbclient is configured to allow a certain number of requests to be in flight at once. That number is... 50. This is clearly too high for a small Elasticsearch instance, especially considering we usually run multiple importers in parallel.

The good news is in Elasticsearch 5, the bulk index queue size is changed to 200. In the meantime, I've opened #83 to lower the defaults a little bit to values that worked fine during my testing.
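To check whether a cluster is hitting this limit, the thread pool stats show queued and rejected bulk tasks directly. Below is a minimal sketch using the legacy elasticsearch-js client; the pool and column names are assumptions that vary by Elasticsearch version (the bulk pool was renamed to write in Elasticsearch 6).

```js
// Sketch: inspect thread pool activity to spot bulk rejections.
// Assumes the legacy `elasticsearch` npm client and Elasticsearch 5+ style
// column names; in other versions the available columns differ.
const elasticsearch = require('elasticsearch');
const client = new elasticsearch.Client({ host: 'localhost:9200' });

client.cat.threadPool({
  h: 'node_name,name,active,queue,rejected', // columns to show
  v: true                                    // include a header row
}).then((table) => {
  // A growing `queue` or a non-zero `rejected` count on the bulk (or `write`)
  // pool means bulk requests are arriving faster than they can be indexed.
  console.log(table);
}).catch(console.error);
```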
Did you tune the number of shards?
@otbutz In this case we were using 5 shards, the Elasticsearch default. I think on such a small machine shards likely make things worse, but it's a good stress test case.
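As an aside, shard count is fixed when an index is created, so reducing it for a small single-node setup means creating the index with fewer shards up front. A minimal sketch with the elasticsearch-js client is below; the index name and values are placeholders, and Pelias itself normally creates its index through pelias/schema rather than a direct call like this.

```js
// Illustrative only: create an index with a single primary shard for a small,
// single-node Elasticsearch setup. The name and settings below are placeholders.
const elasticsearch = require('elasticsearch');
const client = new elasticsearch.Client({ host: 'localhost:9200' });

client.indices.create({
  index: 'pelias',
  body: {
    settings: {
      number_of_shards: 1,   // one primary shard is plenty for a small build
      number_of_replicas: 0  // no replicas on a single node
    }
  }
}).then(() => console.log('index created'))
  .catch(console.error);
```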
Is there a way that I could set batch size to 1 when I am using the Docker installation?
@mapmeld currently it's only configurable by code in the importer. And to be clear, batch size 1 is not the fix, because it has an extreme impact on import times. However, you can test out the actual proposed fix for the OpenAddresses importer, and doing so would be much appreciated :) Let me know if it improves things.
Okay, I've tested the new
For anyone else with this issue, an alternative solution put forth by @paraskashyap is to increase the bulk thread pool queue on Elasticsearch.
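Judging from the follow-up below, the workaround was a one-line command run against the live cluster. Here is a hedged sketch of the equivalent change through the elasticsearch-js client, assuming Elasticsearch 2.x, where thread pool settings can still be updated dynamically via the cluster settings API (in 5.x and later they are static node settings that belong in elasticsearch.yml):

```js
// Sketch only: raise the bulk queue size on a running cluster. Assumes
// Elasticsearch 2.x, where thread pool settings are dynamic cluster settings;
// the setting name and the value 500 are illustrative, not from this thread.
const elasticsearch = require('elasticsearch');
const client = new elasticsearch.Client({ host: 'localhost:9200' });

client.cluster.putSettings({
  body: {
    transient: {
      // Let more bulk requests wait in the queue before being rejected.
      'threadpool.bulk.queue_size': 500
    }
  }
}).then(() => console.log('bulk queue size updated'))
  .catch(console.error);
```

As the later comments note, this only hides the pressure from over-eager importers; the real fix is to send fewer concurrent requests.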
@orangejulius as a Docker newbie, what's the right place to make this change in the Dockerfile?
@mapmeld that last command can be run straight on the command line once you've started Elasticsearch. I don't think it's something that would go in a Dockerfile or docker-compose.yml
Also, to be clear: we expect this issue to be resolved with code changes, so that workaround shouldn't be needed in the long term.
@orangejulius quoting an elastic blog post from Nov 2017:
@otbutz Right, it's a good temporary workaround until our code changes, which reduce the concurrency of our importers to something less excessive, make it to the production branches (they should land soon). Interestingly, they did raise the default from 50 to 200 in Elasticsearch 5.
It's been nearly 6 months, and I haven't heard any complaints from folks running small builds regarding this issue. It's definitely still possible to overwhelm Elasticsearch if several importers are running in parallel and there aren't many resources dedicated to Elasticsearch, but it no longer happens routinely. I'm going to close this, but if anyone else is still having issues, please let us know.
This package has historically been very aggressive about how many requests it will allow to be in flight to Elasticsearch. We lowered the maximum number of in-flight requests to 10 recently (see #76), but I think this is still too high: we have recently seen some Elasticsearch timeouts when running highly parallel imports.

I suspect a high number of in-flight bulk index requests is unlikely to be the best way to ensure high performance. For geocode.earth, we run planet builds on a 36-core machine, with a total of 6 importer processes running at once at the start (2 OA, OSM, polylines, geonames, WOF). Since each bulk request already contains many records (500 by default in this package), and each importer allows 10 requests in flight, 6 importers could lead to up to 60 bulk requests in flight at once. My guess is even 2-3 bulk requests is enough to keep Elasticsearch busy.

Eventually I'd like to allow us to configure this option easily across all importers, but for now let's test this value. Connects #76 Connects #83
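As a rough sketch of the idea, capping in-flight bulk requests is essentially a small semaphore around whatever function sends one bulk request. Everything below (the names, the sendBulk stand-in, the limit of 2) is illustrative rather than the actual pelias-dbclient implementation.

```js
// Illustrative sketch of capping concurrent bulk requests; not the actual
// pelias-dbclient code. `sendBulk` stands in for whatever function performs
// one bulk index request and returns a promise.
function createBulkLimiter(sendBulk, maxInFlight = 2) {
  let inFlight = 0;
  const waiting = [];

  function next() {
    if (inFlight >= maxInFlight || waiting.length === 0) return;
    inFlight++;
    const { batch, resolve, reject } = waiting.shift();
    sendBulk(batch)
      .then(resolve, reject)
      .finally(() => { inFlight--; next(); });
  }

  // Queue a batch; it is only sent when fewer than `maxInFlight` requests are active.
  return function submit(batch) {
    return new Promise((resolve, reject) => {
      waiting.push({ batch, resolve, reject });
      next();
    });
  };
}
```

With a cap of 2-3 per importer, six parallel importers produce at most 12-18 concurrent bulk requests instead of 60.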
We have seen recently that dbclient does a poor job of throttling bulk requests to Elasticsearch slowly enough to avoid overloading small clusters. In particular, if there is an error with one bulk request, we suspect it will retry that request, but it will do so while new requests are still being sent.
This can cause a feedback loop where a single request timing out due to high load causes retries that further increase the load on the cluster, until the import fails.
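One way to break that loop is to retry a failed bulk request with exponential backoff, so a struggling cluster gets breathing room instead of an immediate second hit; combined with a cap on in-flight requests, a retrying batch keeps holding its slot and new batches are held back too. The sketch below is illustrative, not the current dbclient behaviour, and assumes each bulk request returns a promise.

```js
// Illustrative retry helper, not the actual dbclient behaviour: retry a failed
// bulk request with exponential backoff so a timeout does not immediately add
// more load to an already struggling cluster.
function sendWithBackoff(sendBulk, batch, { retries = 5, baseDelayMs = 1000 } = {}) {
  const attempt = (n) => sendBulk(batch).catch((err) => {
    if (n >= retries) throw err;                 // give up after `retries` attempts
    const delay = baseDelayMs * Math.pow(2, n);  // 1s, 2s, 4s, 8s, ...
    return new Promise((resolve) => setTimeout(resolve, delay))
      .then(() => attempt(n + 1));
  });
  return attempt(0);
}
```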
One option to consider is using an existing elasticsearch bulk import tool like https://github.com/hmalphettes/elasticsearch-streams
Connects pelias/openaddresses#328