Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should we consider "Shard not available exception" as regular search failures #47700

Closed
jimczi opened this issue Oct 7, 2019 · 11 comments · Fixed by #64337
Closed

Should we consider "Shard not available exception" as regular search failures #47700

jimczi opened this issue Oct 7, 2019 · 11 comments · Fixed by #64337
Labels
:Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. >enhancement :Search/Search Search-related issues that do not fall into other categories Team:Distributed Meta label for distributed team (obsolete) Team:Search Meta label for search team

Comments

@jimczi
Copy link
Contributor

jimczi commented Oct 7, 2019

Today when we detect that no replicas are available for a shard we mark the shard as failed but don't count it as failed in the search response header. We also don't return the exception in the response so the only way to detect that a set of replicas is not available for a shard is to check whether total is equal to successful in the response. While this could be documented properly I wonder if we should not treat this exception as a regular failure in order to simplify the handling/detection of search failures. This behavior is quite old so this would be a breaking change but I doubt that a lot of users are aware of this. We also have a way to be not lenient when shard failures are detected (accept_partial_results) so it feels more natural to me to consider this kind of failure as any other exceptions.

@jimczi jimczi added >enhancement discuss :Search/Search Search-related issues that do not fall into other categories :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. labels Oct 7, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (:Search/Search)

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (:Distributed/Distributed)

@jimczi
Copy link
Contributor Author

jimczi commented Oct 8, 2019

Here's the place in code where we ignore the ShardNotAvailableException:

if (TransportActions.isShardNotAvailableException(e) == false) {

And we compute the number of failed shards here:
// we don't return totalShards - successfulShards, we don't count "no shards available" as a failed shard, just don't

Note that in this case we return the number of shard failures minus the ShardNotAvailableException
that we ignored in the previous snippet.

@ywelsch
Copy link
Contributor

ywelsch commented Oct 8, 2019

I'm not sure if we should add the "shard not available" exceptions to the failed counter. The main reason against this is that we would then also have to return these boring exceptions as part of the "failures" array. I wonder if we should instead return a field called unavailable which counts these. We would then have total = successful + failed + unavailable.

@markharwood
Copy link
Contributor

We would then have total = successful + failed + unavailable.

Practically speaking, clients like Kibana should be telling their users if the results are lacking in any way. This isn't just about failed vs unavailable - it also includes "timed out" flags.

It seems trappy that end-users might not appreciate some important results might be missing because the client application's logic fails to check all the various flags elasticsearch provides.

Maybe we can make this easier with a partial_results boolean that is the summarisation of the more detailed reasons why this can happen (failed/timed out/unavailable)

@jimczi
Copy link
Contributor Author

jimczi commented Oct 10, 2019

I'm not sure if we should add the "shard not available" exceptions to the failed counter. The main reason against this is that we would then also have to return these boring exceptions as part of the "failures" array.

While boring they also contain the name of the index and shard id that is not available. I think it's an important piece of information to return. IMO the best way for a consumer to check if a result is partial or not is to check if total == successful, that's simple and should be emphasized in the documentation. However every failure should be considered equal and reported, otherwise the results look inconsistent and we rely on documentation to educate users.

Maybe we can make this easier with a partial_results boolean that is the summarisation of the more detailed reasons why this can happen (failed/timed out/unavailable)

I am not sure we should consider the timeout and terminate_after case the same as failures.
Results may be partial in both cases but you still need to look at the individual flags to differentiate a partial result due to a failure from one that timed out or use terminate after. In terms of correctness I'd also say that a failure is different than a requested timeout because unlike the timeout, the failure is unexpected. IMO it should remain simple in the general case, successful not equal to total means unexpected failures. For advanced options that are not set by default the documentation should be clear on how to detect if the request met the condition or not but imo we should not try to group them with failures.

markharwood added a commit to markharwood/elasticsearch that referenced this issue Oct 10, 2019
@markharwood
Copy link
Contributor

check if total == successful, that's simple

Isn't it total == successful + skipped?
Even if not true, the number of properties returned just emphasises the problem here.

I was looking to provide a flag that had exactly the same yes/no logic that we apply when allow_partial_search_results= false.
The goal is to summarise whether the results can be trusted and are complete - regardless of why that condition may have come about. The low-level details are still there if people want them.

@jimczi
Copy link
Contributor Author

jimczi commented Oct 10, 2019

Isn't it total == successful + skipped?

No, skipped shards are considered successful too.

I was looking to provide a flag that had exactly the same yes/no logic that we apply when allow_partial_search_results= false.

Maybe that's the issue then, we shouldn't apply this flag to the timeout parameter. If it is set in the request the user expects partial results on shards that timed out, otherwise a timeout client side is more appropriate. The more I think of it and the more I want to remove this timeout option. I opened another issue to discuss the options we have #47716 but my point here is that we shouldn't add a new flag just for this feature. It is simple to rely on total == successful in the general case so we should just document it clearly ?

@markharwood
Copy link
Contributor

Timeouts could be introduced by a cluster administrator through a new cluster default rather than by a client application making search requests - in that context it could be an unexpected result.

I tend to look at this from an accuracy perspective - the error margin for a result where we haven't looked at all the data is unbounded. That's a hell of a thing to overlook if a client app misinterprets our current set of properties.

@ywelsch
Copy link
Contributor

ywelsch commented Oct 18, 2019

While boring they also contain the name of the index and shard id that is not available. I think it's an important piece of information to return. IMO the best way for a consumer to check if a result is partial or not is to check if total == successful, that's simple and should be emphasized in the documentation. However every failure should be considered equal and reported, otherwise the results look inconsistent and we rely on documentation to educate users.

Perhaps we can leave out the exception body to avoid blowing up the search response in an unhealthy cluster, o.w. I'm ok with this.

@pcsanwald
Copy link
Contributor

We discussed this in FixIt Thursday, no one has any objections and it appears we are converging in an approach, so I'm removing the discuss label.

@pcsanwald pcsanwald removed the discuss label Oct 24, 2019
@rjernst rjernst added Team:Distributed Meta label for distributed team (obsolete) Team:Search Meta label for search team labels May 4, 2020
jimczi added a commit to jimczi/elasticsearch that referenced this issue Oct 29, 2020
 Today search responses do not report failures for shard that were not available
 for the search.
 So if one shard is not assigned on a search over 5 shards, the
 search response will report:

 ```
 "_shards": {
    "total": 5,
    "successful": 4,
    "skipped": 0,
    "failed": 0
}
```

If all shards are unassigned, we report a generic search phase exception with no cause.
It's easy to spot that `successful` is less than `total` in the response but not reporting
the failure is misleading for users.

This change removes the special handling of not available shards exception in search responses
and treat them as any other failure that could occur on a shard.
These exceptions will count in the `failed` section and will be reported in details in
the `shard_failures` section.
If all shards are unavailable, the search API will now return 404 NOT_FOUND as an indication
that the search failed because it couldn't find any of the resources.

Closes elastic#47700
jimczi added a commit that referenced this issue Nov 24, 2020
Today search responses do not report failures for shard that were not available
 for the search.
 So if one shard is not assigned on a search over 5 shards, the
 search response will report:

 ```
 "_shards": {
    "total": 5,
    "successful": 4,
    "skipped": 0,
    "failed": 0
}
```

If all shards are unassigned, we report a generic search phase exception with no cause.
It's easy to spot that `successful` is less than `total` in the response but not reporting
the failure is misleading for users.

This change removes the special handling of not available shards exception in search responses
and treat them as any other failure that could occur on a shard.
These exceptions will count in the `failed` section and will be reported in details in
the `shard_failures` section.
If all shards are unavailable, the search API will now return 404 NOT_FOUND as an indication
that the search failed because it couldn't find any of the resources.

Closes #47700
jimczi added a commit that referenced this issue Nov 24, 2020
Today search responses do not report failures for shard that were not available
 for the search.
 So if one shard is not assigned on a search over 5 shards, the
 search response will report:

 ```
 "_shards": {
    "total": 5,
    "successful": 4,
    "skipped": 0,
    "failed": 0
}
```

If all shards are unassigned, we report a generic search phase exception with no cause.
It's easy to spot that `successful` is less than `total` in the response but not reporting
the failure is misleading for users.

This change removes the special handling of not available shards exception in search responses
and treat them as any other failure that could occur on a shard.
These exceptions will count in the `failed` section and will be reported in details in
the `shard_failures` section.
If all shards are unavailable, the search API will now return 404 NOT_FOUND as an indication
that the search failed because it couldn't find any of the resources.

Closes #47700
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. >enhancement :Search/Search Search-related issues that do not fall into other categories Team:Distributed Meta label for distributed team (obsolete) Team:Search Meta label for search team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants