Should we consider "Shard not available exception" as regular search failures #47700
Comments
Pinging @elastic/es-search (:Search/Search)
Pinging @elastic/es-distributed (:Distributed/Distributed)
Here's the place in code where we ignore the ShardNotAvailableException: elasticsearch/server/src/main/java/org/elasticsearch/action/search/AbstractSearchAsyncAction.java Line 418 in 4e6756f
And we compute the number of failed shards here: elasticsearch/server/src/main/java/org/elasticsearch/action/search/SearchResponse.java Line 186 in 4e6756f
Note that in this case we return the number of shard failures minus the ShardNotAvailableException instances that we ignored in the previous snippet.
I'm not sure if we should add the "shard not available" exceptions to the |
Practically speaking, clients like Kibana should be telling their users if the results are lacking in any way. This isn't just about failed vs unavailable - it also includes "timed out" flags. It seems trappy that end-users might not appreciate some important results might be missing because the client application's logic fails to check all the various flags elasticsearch provides. Maybe we can make this easier with a |
While boring they also contain the name of the index and shard id that is not available. I think it's an important piece of information to return. IMO the best way for a consumer to check if a result is partial or not is to check if
I am not sure we should consider the |
Isn't it I was looking to provide a flag that had exactly the same yes/no logic that we apply when |
No, skipped shards are considered successful too.
Maybe that's the issue then, we shouldn't apply this flag to the |
Timeouts could be introduced by a cluster administrator through a new cluster default rather than by a client application making search requests. In that context a timeout could be an unexpected result. I tend to look at this from an accuracy perspective: the error margin for a result where we haven't looked at all the data is unbounded. That's a hell of a thing to overlook if a client app misinterprets our current set of properties.
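To make the "single yes/no flag" idea in this discussion concrete, here is a minimal client-side sketch. It is not part of Elasticsearch or any client library; the helper name `is_partial` is hypothetical, and it only assumes the `timed_out` flag and the `_shards` counters that a search response already carries.

```python
def is_partial(response: dict) -> bool:
    """Return True if the search response may be missing results.

    Hypothetical helper: combines the separate signals a client
    would otherwise have to check one by one.
    """
    shards = response.get("_shards", {})
    total = shards.get("total", 0)
    successful = shards.get("successful", 0)  # includes skipped shards
    failed = shards.get("failed", 0)

    timed_out = bool(response.get("timed_out", False))
    # successful < total also catches unavailable shards that are
    # not counted under "failed".
    return timed_out or failed > 0 or successful < total
```

A client could call `is_partial(resp)` once after each search and surface a warning to the user, instead of inspecting each property separately.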
Perhaps we can leave out the exception body to avoid blowing up the search response in an unhealthy cluster; otherwise I'm ok with this.
We discussed this in FixIt Thursday, no one has any objections and it appears we are converging in an approach, so I'm removing the |
Today search responses do not report failures for shards that were not available for the search. So if one shard is not assigned on a search over 5 shards, the search response will report:

```
"_shards": {
  "total": 5,
  "successful": 4,
  "skipped": 0,
  "failed": 0
}
```

If all shards are unassigned, we report a generic search phase exception with no cause. It's easy to spot that `successful` is less than `total` in the response, but not reporting the failure is misleading for users.

This change removes the special handling of the shard-not-available exception in search responses and treats it like any other failure that could occur on a shard. These exceptions will count in the `failed` section and will be reported in detail in the `shard_failures` section. If all shards are unavailable, the search API will now return 404 NOT_FOUND as an indication that the search failed because it couldn't find any of the resources.

Closes elastic#47700
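Once unavailable shards are reported like any other failure, a client can list which index and shard were affected. The sketch below is illustrative only: the helper name `summarize_failures` is hypothetical, and it assumes the failure entries appear as a `failures` array under `_shards`, each with `index`, `shard`, and `reason.type` fields. The sample response is made up for the example.

```python
def summarize_failures(response: dict) -> list:
    """Return (index, shard, reason type) for each reported shard failure.

    Hypothetical helper; the entry layout is an assumption, so
    missing keys fall back to None instead of raising.
    """
    failures = response.get("_shards", {}).get("failures", [])
    return [
        (f.get("index"), f.get("shard"), f.get("reason", {}).get("type"))
        for f in failures
    ]


# Made-up response for illustration: one of five shards unavailable.
sample = {
    "_shards": {
        "total": 5,
        "successful": 4,
        "skipped": 0,
        "failed": 1,
        "failures": [
            {
                "shard": 2,
                "index": "my-index",
                "node": None,
                "reason": {"type": "no_shard_available_action_exception"},
            }
        ],
    }
}
```

This only applies to responses that come back at all; per the description above, a search where every shard is unavailable returns 404 instead.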
Today when we detect that no replicas are available for a shard we mark the shard as failed but don't count it as `failed` in the search response header. We also don't return the exception in the response, so the only way to detect that a set of replicas is not available for a shard is to check whether `total` is equal to `successful` in the response. While this could be documented properly, I wonder if we should not treat this exception as a regular failure in order to simplify the handling/detection of search failures. This behavior is quite old so this would be a breaking change, but I doubt that a lot of users are aware of this. We also have a way to be not lenient when shard failures are detected (`allow_partial_search_results`), so it feels more natural to me to consider this kind of failure as any other exception.
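Under the behavior described in this issue, the number of unavailable shards has to be inferred from the counters. A minimal sketch of that arithmetic follows; the helper name `unaccounted_shards` is hypothetical, and it relies on the point made in the comments that skipped shards are already counted as successful.

```python
def unaccounted_shards(shards: dict) -> int:
    """Shards that are neither successful nor reported as failed.

    Hypothetical helper: under the behavior described in the issue,
    unavailable shards show up only as this gap.
    """
    return shards["total"] - shards["successful"] - shards["failed"]


# Counters from a search over 5 shards with one shard unassigned.
header = {"total": 5, "successful": 4, "skipped": 0, "failed": 0}
missing = unaccounted_shards(header)  # 5 - 4 - 0 = 1
```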