-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle all exceptions in data nodes can match #117469
Handle all exceptions in data nodes can match #117469
Conversation
During the can match phase, prior to the query phase, we may have exceptions that are returned back to the coordinating node, handled gracefully as if the shard returned canMatch=true. During the query phase, we perform an additional rewrite and can match phase to eventually shortcut the query phase for the shard. That needs to handle exceptions as well. Currently, an exception there causes shard failures, while we should rather go ahead and execute the query on the shard. Closes elastic#104994
Hi @javanna, I've created a changelog YAML for you. |
} | ||
} | ||
|
||
static class CanMatchContext implements Releasable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
most of the changes in SearchService have a simple goal: make the canMatch method unit testable. Given how difficult it is to recreate a SearchService instance (no wonder it requires spinning up a node like we do in SearchServiceTests), I went for making the method static and have some extensible way to provide what it needs as an argument.
final MinAndMax<?> minMax = sortBuilder != null ? FieldSortBuilder.getMinMaxOrNull(context, sortBuilder) : null; | ||
return new CanMatchShardResponse(true, minMax); | ||
} catch (Exception e) { | ||
return new CanMatchShardResponse(true, null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the core of the change: I am thinking that we should perhaps move the catching entirely to this method. Why send back exceptions when they are consumed by the coord node as can match = true, min max = null? The lack of exception handling in this method has caused this bug in the first place, because it requires all its callers to handle exceptions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have now replaced the catch with a broader one and removed the specific one that I had initially added.
l.onFailure(exc); | ||
return; | ||
// if can_match throws for some reason, we go ahead with the query phase | ||
canMatchResp = new CanMatchShardResponse(true, null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a targeted catch exception for sorting related exceptions, but we still need here a broader one. Based on my proposal below to handle all exceptions in the can match method, perhaps this can go away.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed this, given that exception handling is now in callers code.
Pinging @elastic/es-search-foundations (Team:Search Foundations) |
@@ -1619,6 +1634,7 @@ public void canMatch(CanMatchNodeRequest request, ActionListener<CanMatchNodeRes | |||
final List<CanMatchNodeResponse.ResponseOrFailure> responses = new ArrayList<>(shardLevelRequests.size()); | |||
for (var shardLevelRequest : shardLevelRequests) { | |||
try { | |||
// TODO remove the exception handling as it's now in canMatch itself | |||
responses.add(new CanMatchNodeResponse.ResponseOrFailure(canMatch(request.createShardSearchRequest(shardLevelRequest)))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left this for a follow-up as I need to add a new transport version and bw comp handling.
*/ | ||
package org.elasticsearch.search; | ||
|
||
import org.apache.lucene.index.DirectoryReader; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The diff is hard to follow, but I just renamed the existing SearchServiceTests to SearchServiceSingleNodeTests, and added a new test class with unit tests only, which I called SearchServiceTests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review aid here
null | ||
) | ||
).canMatch() | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the previous version of the assertion came from retrying PIT requests on search context missing. I don't think it's a good reason to make can match fail on it though, hence I adjusted the test. Retries will be attempted in the query phase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing this Luca.
Left a couple of minor, optional, suggestions.
server/src/main/java/org/elasticsearch/search/SearchService.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/search/SearchService.java
Outdated
Show resolved
Hide resolved
*/ | ||
package org.elasticsearch.search; | ||
|
||
import org.apache.lucene.index.DirectoryReader; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review aid here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, would have been nice to do this without the new context class but I see the point :) but 🤞 I guess this goes away somewhat shortly anyway :)
💔 Backport failed
You can use sqren/backport to manually backport by running |
During the can match phase, prior to the query phase, we may have exceptions that are returned back to the coordinating node, handled gracefully as if the shard returned canMatch=true. During the query phase, we perform an additional rewrite and can match phase to eventually shortcut the query phase for the shard. That needs to handle exceptions as well. Currently, an exception there causes shard failures, while we should rather go ahead and execute the query on the shard. Instead of adding another try catch on consumers code, this commit adds exception handling to the method itself so that it can no longer throw exceptions and similar mistakes can no longer be made in the future. At the same time, this commit makes the can match method more easily testable without requiring a full-blown SearchService instance. Closes elastic#104994
* Handle all exceptions in data nodes can match (#117469) During the can match phase, prior to the query phase, we may have exceptions that are returned back to the coordinating node, handled gracefully as if the shard returned canMatch=true. During the query phase, we perform an additional rewrite and can match phase to eventually shortcut the query phase for the shard. That needs to handle exceptions as well. Currently, an exception there causes shard failures, while we should rather go ahead and execute the query on the shard. Instead of adding another try catch on consumers code, this commit adds exception handling to the method itself so that it can no longer throw exceptions and similar mistakes can no longer be made in the future. At the same time, this commit makes the can match method more easily testable without requiring a full-blown SearchService instance. Closes #104994 * fix compile
…lastic#118533) * Handle all exceptions in data nodes can match (elastic#117469) During the can match phase, prior to the query phase, we may have exceptions that are returned back to the coordinating node, handled gracefully as if the shard returned canMatch=true. During the query phase, we perform an additional rewrite and can match phase to eventually shortcut the query phase for the shard. That needs to handle exceptions as well. Currently, an exception there causes shard failures, while we should rather go ahead and execute the query on the shard. Instead of adding another try catch on consumers code, this commit adds exception handling to the method itself so that it can no longer throw exceptions and similar mistakes can no longer be made in the future. At the same time, this commit makes the can match method more easily testable without requiring a full-blown SearchService instance. Closes elastic#104994 * fix compile
…lastic#118533) * Handle all exceptions in data nodes can match (elastic#117469) During the can match phase, prior to the query phase, we may have exceptions that are returned back to the coordinating node, handled gracefully as if the shard returned canMatch=true. During the query phase, we perform an additional rewrite and can match phase to eventually shortcut the query phase for the shard. That needs to handle exceptions as well. Currently, an exception there causes shard failures, while we should rather go ahead and execute the query on the shard. Instead of adding another try catch on consumers code, this commit adds exception handling to the method itself so that it can no longer throw exceptions and similar mistakes can no longer be made in the future. At the same time, this commit makes the can match method more easily testable without requiring a full-blown SearchService instance. Closes elastic#104994 * fix compile
… (#118570) * Handle all exceptions in data nodes can match (#117469) During the can match phase, prior to the query phase, we may have exceptions that are returned back to the coordinating node, handled gracefully as if the shard returned canMatch=true. During the query phase, we perform an additional rewrite and can match phase to eventually shortcut the query phase for the shard. That needs to handle exceptions as well. Currently, an exception there causes shard failures, while we should rather go ahead and execute the query on the shard. Instead of adding another try catch on consumers code, this commit adds exception handling to the method itself so that it can no longer throw exceptions and similar mistakes can no longer be made in the future. At the same time, this commit makes the can match method more easily testable without requiring a full-blown SearchService instance. Closes #104994 * fix compile
… (#118572) * Handle all exceptions in data nodes can match (#117469) During the can match phase, prior to the query phase, we may have exceptions that are returned back to the coordinating node, handled gracefully as if the shard returned canMatch=true. During the query phase, we perform an additional rewrite and can match phase to eventually shortcut the query phase for the shard. That needs to handle exceptions as well. Currently, an exception there causes shard failures, while we should rather go ahead and execute the query on the shard. Instead of adding another try catch on consumers code, this commit adds exception handling to the method itself so that it can no longer throw exceptions and similar mistakes can no longer be made in the future. At the same time, this commit makes the can match method more easily testable without requiring a full-blown SearchService instance. Closes #104994 * fix compile
During the can match phase, prior to the query phase, we may have exceptions that are returned back to the coordinating node, handled gracefully as if the shard returned canMatch=true.
During the query phase, we perform an additional rewrite and can match phase to eventually shortcut the query phase for the shard. That needs to handle exceptions as well. Currently, an exception there causes shard failures, while we should rather go ahead and execute the query on the shard.
The issue is addressed by folding the exception handling in SearchService#canMatch, rather than leaving it to its consumers. There isn't a single usecase where a failure on can match does not translate to canMatch=true (besides the bug that this fix is addressing). This has the nice benefit that data nodes will stop sending back exceptions for the coordinator to handle, which is unnecessary given that the handling is always the same.
Closes #104994