Allow Ignoring failures for indices that have just been created #65846
Pinging @elastic/es-search (Team:Search)
//CC @elastic/ml-core
Other related issues:
/cc @henningandersen and @fcofdez because I think some work the distributed team is doing might be able to solve this
Yes, we've been exploring different approaches to overcome these kinds of issues. I wrote a small POC (#64315) to explore the problem further, but we have been hesitant to change how search currently works, and we're leaning towards a different model for async searches. We think that for blocking searches it makes sense to fail fast, as it does now. Additionally, even if we end up implementing a way to wait for shards to be allocated, we should bound that wait with a timeout, and there's no guarantee that these kinds of spurious failures won't happen again. Maybe what we can do is provide some kind of primitive for tests to wait until the shards are allocated, so that failures caused by delayed shard allocation are easier to find.
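A primitive like that could be a thin wrapper around the cluster health API, which already supports a bounded wait. A minimal sketch, assuming code running inside an `ESIntegTestCase` subclass and a hypothetical index name:

```java
import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
import org.elasticsearch.common.unit.TimeValue;

// Bounded wait until the primaries of "results-index" (hypothetical name)
// are allocated, so later searches no longer race against index creation.
ClusterHealthResponse health = client().admin().cluster()
        .prepareHealth("results-index")
        .setWaitForYellowStatus()                   // primaries assigned is enough to search
        .setTimeout(TimeValue.timeValueSeconds(30)) // bound the wait, as suggested above
        .get();
if (health.isTimedOut()) {
    fail("primary shards were not allocated within the bounded wait");
}
```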
But what about end users? It seems like we're going in the wrong direction if we're writing special functionality to make the tests pass. It's as though the reason Elasticsearch exists is to make the tests pass rather than to be a useful product for end users. The scenario where it breaks is: one piece of code creates an index on demand while another thread is concurrently searching it, and the search lands in the window before the primary shard is assigned.
I think it will be hard to add test functionality that solves that. One way to solve it in tests is to proactively create all indices up-front, then wait for yellow status on the cluster before proceeding to the rest of the test code. But if that is not what the production system does then the test isn't testing what end users observe. Some pieces of ML functionality create indices only when they're needed, and if a different thread is searching against those indices, not caring whether they exist yet, that is where the problem arises.
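As an illustration of the up-front approach, a test could create every index it will touch and then wait for yellow before exercising the code under test. A sketch, assuming an `ESIntegTestCase` subclass and illustrative index names:

```java
// inside an ESIntegTestCase subclass; index names are illustrative
String[] indices = { ".ml-state", ".ml-anomalies-shared" };
for (String index : indices) {
    client().admin().indices().prepareCreate(index).get();
}
ensureYellow(indices);   // test-framework helper: waits for primaries to be assigned
```

As noted above, though, this only changes what the test does, not what production code and end users observe.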
If it's going to fail fast, could it fail fast with a specific exception type meaning "all shards failed because this index has only just been created"? At the moment we get the search exception with "all shards failed", but we have no idea whether this is because the index was only just created or because of some major problem in the cluster that made both primaries and replicas inaccessible. This would solve the problem for both tests and end users: ML code that is happy for a search to return no results, because the index being searched does not exist yet, could treat that specific exception type as meaning "no results".
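Such an exception type does not exist today; a sketch of what the calling ML code could then look like, with `IndexJustCreatedException` and `processHits` as hypothetical names:

```java
import org.elasticsearch.action.search.SearchPhaseExecutionException;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.search.SearchHits;

try {
    SearchResponse response = client.search(searchRequest).actionGet();
    processHits(response.getHits());
} catch (IndexJustCreatedException e) {
    // hypothetical type: all shards failed only because the index was
    // created moments ago, so "no results" is the correct interpretation
    processHits(SearchHits.empty());
} catch (SearchPhaseExecutionException e) {
    // a genuine "all shards failed": primaries and replicas unreachable
    throw e;
}
```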
I do think it is possible to solve this without waiting, since an empty index should result in no hits. I am mainly worried about the potential edge cases that could result from such an up-front check (for instance, if the coordinator is isolated, it may return no hits when there really are hits). I agree that we should not solve this for tests only (unless it turns out to be a test-only issue). We could make the check when receiving the failure instead, which would mitigate at least some of that.
I think this is related to some configuration on the http layer, maybe …
That should result in shard search failures too, as long as the connection is closed at some point (since the internal search …)
When creating an index, the primary will be either unassigned or initializing for a short while. This causes issues when concurrent processes run searches that hit those indices, either explicitly or through patterns, aliases, or data streams. This commit changes the behavior to disregard shards where the primary is inactive due to the index having just been created. Closes elastic#65846
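A sketch of the kind of check that commit describes, assuming the coordinator inspects the routing entry of each unavailable primary (the method name and placement are illustrative, not the actual patch):

```java
import org.elasticsearch.cluster.routing.ShardRouting;
import org.elasticsearch.cluster.routing.UnassignedInfo;

// Only disregard a shard when its primary is inactive because the index
// was just created; any other reason still counts as a real failure.
static boolean canIgnoreMissingShard(ShardRouting primary) {
    return primary.active() == false
            && primary.unassignedInfo() != null
            && primary.unassignedInfo().getReason() == UnassignedInfo.Reason.INDEX_CREATED;
}
```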
Pinging @elastic/es-search-foundations (Team:Search Foundations)
Description of the problem including expected versus actual behavior:
When a search expands to a recently created index, and the index is not fully initialized, there is a window of time where the search will fail due to missing shards.
For many uses inside the machine learning plugin this is problematic, because we cannot tell whether it is a "true" failure or not.
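The race is simple to state even though it is hard to hit. A minimal sketch of the two concurrent operations, assuming a transport `Client` and illustrative index names:

```java
import org.elasticsearch.action.search.SearchPhaseExecutionException;
import org.elasticsearch.client.Client;

// Thread A creates an index on demand; thread B searches a pattern that
// expands to it. If the search lands before the primary is assigned,
// it fails with "all shards failed".
new Thread(() -> client.admin().indices().prepareCreate("results-000001").get()).start();

try {
    client.prepareSearch("results-*").get();
} catch (SearchPhaseExecutionException e) {
    // today this is indistinguishable from a genuine cluster problem
}
```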
Current ML failures caused by this scenario.
Steps to reproduce:
Reproducing this is exceptionally difficult.
Probably the best chance to reproduce this locally at the moment is with DeleteExpiredDataIT.testDeleteExpiredDataNoThrottle or DeleteExpiredDataIT.testDeleteExpiredDataWithStandardThrottle. However, it's tricky, because these tests are much more likely to fail if other tests have run before them, so that they start immediately after the cluster cleanup that runs between tests. You can unmute those two by reverting 309684f. Then, if you run all the tests in DeleteExpiredDataIT repeatedly, one of them is bound to run after another's cleanup.
This situation might be addressed by adding an option to wait for the primary shard to be assigned (if it is still initializing).
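Purely as an illustration of what such an option might look like on the request (this setter does not exist in Elasticsearch today):

```java
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.common.unit.TimeValue;

SearchRequest request = new SearchRequest("results-*");
// hypothetical setter: bound the wait for primaries still initializing
// after index creation, instead of failing the shard immediately
request.setWaitForPrimaryAssignment(TimeValue.timeValueSeconds(10));
client.search(request).actionGet();
```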