Should we consider "Shard not available exception" as regular search failures #47700

jimczi · 2019-10-07T19:58:15Z

Today when we detect that no replicas are available for a shard we mark the shard as failed but don't count it as failed in the search response header. We also don't return the exception in the response so the only way to detect that a set of replicas is not available for a shard is to check whether total is equal to successful in the response. While this could be documented properly I wonder if we should not treat this exception as a regular failure in order to simplify the handling/detection of search failures. This behavior is quite old so this would be a breaking change but I doubt that a lot of users are aware of this. We also have a way to be not lenient when shard failures are detected (accept_partial_results) so it feels more natural to me to consider this kind of failure as any other exceptions.

The text was updated successfully, but these errors were encountered:

elasticmachine · 2019-10-07T19:58:17Z

Pinging @elastic/es-search (:Search/Search)

elasticmachine · 2019-10-07T19:58:18Z

Pinging @elastic/es-distributed (:Distributed/Distributed)

jimczi · 2019-10-08T08:34:07Z

Here's the place in code where we ignore the ShardNotAvailableException:

elasticsearch/server/src/main/java/org/elasticsearch/action/search/AbstractSearchAsyncAction.java

Line 418 in 4e6756f

if (TransportActions.isShardNotAvailableException(e) == false) {

And we compute the number of failed shards here:

elasticsearch/server/src/main/java/org/elasticsearch/action/search/SearchResponse.java

Line 186 in 4e6756f

    
           // we don't return totalShards - successfulShards, we don't count "no shards available" as a failed shard, just don't

Note that in this case we return the number of shard failures minus the ShardNotAvailableException
that we ignored in the previous snippet.

ywelsch · 2019-10-08T09:49:49Z

I'm not sure if we should add the "shard not available" exceptions to the failed counter. The main reason against this is that we would then also have to return these boring exceptions as part of the "failures" array. I wonder if we should instead return a field called unavailable which counts these. We would then have total = successful + failed + unavailable.

markharwood · 2019-10-08T16:23:41Z

We would then have total = successful + failed + unavailable.

Practically speaking, clients like Kibana should be telling their users if the results are lacking in any way. This isn't just about failed vs unavailable - it also includes "timed out" flags.

It seems trappy that end-users might not appreciate some important results might be missing because the client application's logic fails to check all the various flags elasticsearch provides.

Maybe we can make this easier with a partial_results boolean that is the summarisation of the more detailed reasons why this can happen (failed/timed out/unavailable)

jimczi · 2019-10-10T10:20:38Z

I'm not sure if we should add the "shard not available" exceptions to the failed counter. The main reason against this is that we would then also have to return these boring exceptions as part of the "failures" array.

While boring they also contain the name of the index and shard id that is not available. I think it's an important piece of information to return. IMO the best way for a consumer to check if a result is partial or not is to check if total == successful, that's simple and should be emphasized in the documentation. However every failure should be considered equal and reported, otherwise the results look inconsistent and we rely on documentation to educate users.

Maybe we can make this easier with a partial_results boolean that is the summarisation of the more detailed reasons why this can happen (failed/timed out/unavailable)

I am not sure we should consider the timeout and terminate_after case the same as failures.
Results may be partial in both cases but you still need to look at the individual flags to differentiate a partial result due to a failure from one that timed out or use terminate after. In terms of correctness I'd also say that a failure is different than a requested timeout because unlike the timeout, the failure is unexpected. IMO it should remain simple in the general case, successful not equal to total means unexpected failures. For advanced options that are not set by default the documentation should be clear on how to detect if the request met the condition or not but imo we should not try to group them with failures.

Relates to elastic#47700

markharwood · 2019-10-10T10:50:49Z

check if total == successful, that's simple

Isn't it total == successful + skipped?
Even if not true, the number of properties returned just emphasises the problem here.

I was looking to provide a flag that had exactly the same yes/no logic that we apply when allow_partial_search_results= false.
The goal is to summarise whether the results can be trusted and are complete - regardless of why that condition may have come about. The low-level details are still there if people want them.

jimczi · 2019-10-10T11:59:49Z

Isn't it total == successful + skipped?

No, skipped shards are considered successful too.

I was looking to provide a flag that had exactly the same yes/no logic that we apply when allow_partial_search_results= false.

Maybe that's the issue then, we shouldn't apply this flag to the timeout parameter. If it is set in the request the user expects partial results on shards that timed out, otherwise a timeout client side is more appropriate. The more I think of it and the more I want to remove this timeout option. I opened another issue to discuss the options we have #47716 but my point here is that we shouldn't add a new flag just for this feature. It is simple to rely on total == successful in the general case so we should just document it clearly ?

markharwood · 2019-10-10T12:28:20Z

Timeouts could be introduced by a cluster administrator through a new cluster default rather than by a client application making search requests - in that context it could be an unexpected result.

I tend to look at this from an accuracy perspective - the error margin for a result where we haven't looked at all the data is unbounded. That's a hell of a thing to overlook if a client app misinterprets our current set of properties.

ywelsch · 2019-10-18T14:37:43Z

While boring they also contain the name of the index and shard id that is not available. I think it's an important piece of information to return. IMO the best way for a consumer to check if a result is partial or not is to check if total == successful, that's simple and should be emphasized in the documentation. However every failure should be considered equal and reported, otherwise the results look inconsistent and we rely on documentation to educate users.

Perhaps we can leave out the exception body to avoid blowing up the search response in an unhealthy cluster, o.w. I'm ok with this.

pcsanwald · 2019-10-24T20:52:44Z

We discussed this in FixIt Thursday, no one has any objections and it appears we are converging in an approach, so I'm removing the discuss label.

Today search responses do not report failures for shard that were not available for the search. So if one shard is not assigned on a search over 5 shards, the search response will report: ``` "_shards": { "total": 5, "successful": 4, "skipped": 0, "failed": 0 } ``` If all shards are unassigned, we report a generic search phase exception with no cause. It's easy to spot that `successful` is less than `total` in the response but not reporting the failure is misleading for users. This change removes the special handling of not available shards exception in search responses and treat them as any other failure that could occur on a shard. These exceptions will count in the `failed` section and will be reported in details in the `shard_failures` section. If all shards are unavailable, the search API will now return 404 NOT_FOUND as an indication that the search failed because it couldn't find any of the resources. Closes elastic#47700

Today search responses do not report failures for shard that were not available for the search. So if one shard is not assigned on a search over 5 shards, the search response will report: ``` "_shards": { "total": 5, "successful": 4, "skipped": 0, "failed": 0 } ``` If all shards are unassigned, we report a generic search phase exception with no cause. It's easy to spot that `successful` is less than `total` in the response but not reporting the failure is misleading for users. This change removes the special handling of not available shards exception in search responses and treat them as any other failure that could occur on a shard. These exceptions will count in the `failed` section and will be reported in details in the `shard_failures` section. If all shards are unavailable, the search API will now return 404 NOT_FOUND as an indication that the search failed because it couldn't find any of the resources. Closes #47700

jimczi added >enhancement discuss :Search/Search Search-related issues that do not fall into other categories :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. labels Oct 7, 2019

markharwood mentioned this issue Oct 10, 2019

Search - add “partial_results” summary flag to SearchResponse #47847

Closed

markharwood added a commit to markharwood/elasticsearch that referenced this issue Oct 10, 2019

Initial commit of new “partial_results” summary flag

54ed18e

Relates to elastic#47700

pcsanwald removed the discuss label Oct 24, 2019

rjernst added Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. Team:Search Meta label for search team labels May 4, 2020

jimczi mentioned this issue Oct 29, 2020

Do not skip not available shard exception in search response #64337

Merged

jimczi closed this as completed in #64337 Nov 24, 2020

stevejgordon mentioned this issue Dec 17, 2020

7.11.0 Meta Ticket elastic/elasticsearch-net#5198

Closed

benwtrent mentioned this issue Oct 14, 2022

Searches against a large number of unavailable shards result in very large responses #90622

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Should we consider "Shard not available exception" as regular search failures #47700

Should we consider "Shard not available exception" as regular search failures #47700

jimczi commented Oct 7, 2019

elasticmachine commented Oct 7, 2019

elasticmachine commented Oct 7, 2019

jimczi commented Oct 8, 2019

ywelsch commented Oct 8, 2019

markharwood commented Oct 8, 2019

jimczi commented Oct 10, 2019

markharwood commented Oct 10, 2019

jimczi commented Oct 10, 2019

markharwood commented Oct 10, 2019

ywelsch commented Oct 18, 2019

pcsanwald commented Oct 24, 2019

Should we consider "Shard not available exception" as regular search failures #47700

Should we consider "Shard not available exception" as regular search failures #47700

Comments

jimczi commented Oct 7, 2019

elasticmachine commented Oct 7, 2019

elasticmachine commented Oct 7, 2019

jimczi commented Oct 8, 2019

ywelsch commented Oct 8, 2019

markharwood commented Oct 8, 2019

jimczi commented Oct 10, 2019

markharwood commented Oct 10, 2019

jimczi commented Oct 10, 2019

markharwood commented Oct 10, 2019

ywelsch commented Oct 18, 2019

pcsanwald commented Oct 24, 2019