Handle all exceptions in data nodes can match #117469

Merged
merged 16 commits into from
Dec 10, 2024

@javanna (Member) commented Nov 25, 2024

During the can match phase that precedes the query phase, a shard may throw an exception. The exception is returned to the coordinating node, which handles it gracefully, as if the shard had returned canMatch=true.

During the query phase, we perform an additional rewrite and can match step so that the query phase can be short-circuited for the shard when nothing can match. That step needs to handle exceptions as well. Currently, an exception there causes a shard failure, whereas we should go ahead and execute the query on the shard.

The issue is addressed by folding the exception handling into SearchService#canMatch, rather than leaving it to its consumers. There isn't a single use case where a failure in can match does not translate to canMatch=true (besides the bug that this fix addresses). This has the nice benefit that data nodes will stop sending exceptions back for the coordinator to handle, which is unnecessary given that the handling is always the same.
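
To illustrate the idea, here is a minimal sketch rather than the actual Elasticsearch code. `CanMatchResult`, `CanMatchFoldedHandling`, and the `Supplier` parameters are hypothetical stand-ins for `CanMatchShardResponse`, `SearchService`, and the real rewrite and query logic. The only point is that the check itself converts any exception into "canMatch=true, no min/max", so the query-phase caller needs no try/catch of its own.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for CanMatchShardResponse: a flag plus optional sort bound. */
record CanMatchResult(boolean canMatch, Long minSortValue) {}

final class CanMatchFoldedHandling {

    /**
     * Runs the (possibly failing) pre-filter check. Any exception is folded into
     * "canMatch = true, no min/max", so callers never see a failure from this step.
     */
    static CanMatchResult canMatch(Supplier<CanMatchResult> check) {
        try {
            return check.get();
        } catch (Exception e) {
            // Assumption for this sketch: a failed check simply means "cannot rule the shard out".
            return new CanMatchResult(true, null);
        }
    }

    /** Query-phase caller: skip the shard only when the check positively says "no match". */
    static String executeQueryPhase(Supplier<CanMatchResult> check, Supplier<String> query) {
        CanMatchResult result = canMatch(check);
        if (result.canMatch() == false) {
            return "empty"; // shortcut: nothing on this shard can match
        }
        return query.get(); // also reached when the check threw
    }
}
```

With this shape, a check that throws (for example an `UnsupportedOperationException`) leads to the query being executed instead of a shard failure.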

Closes #104994

@elasticsearchmachine
Collaborator

Hi @javanna, I've created a changelog YAML for you.

}
}

static class CanMatchContext implements Releasable {
Member Author

Most of the changes in SearchService have a simple goal: make the canMatch method unit testable. Given how difficult it is to recreate a SearchService instance (no wonder it requires spinning up a node, like we do in SearchServiceTests), I went for making the method static and giving it an extensible way to receive what it needs as an argument.
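
As a rough illustration of that testability argument, the sketch below uses hypothetical names (`CanMatchInputs`, `StaticCanMatchSketch`, `StubInputs`); it is not the real `CanMatchContext`. It assumes the context is released inside the method, which may differ from the actual code. Because the method is static and depends only on its argument, a test can hand it a small stub instead of starting a node.

```java
/** Hypothetical stand-in for the data a can-match check needs; not the real CanMatchContext. */
interface CanMatchInputs extends AutoCloseable {
    boolean queryRewritesToMatchNone();

    @Override
    void close(); // narrowed to throw no checked exception
}

final class StaticCanMatchSketch {
    /** Static and dependent only on its argument, so a unit test can pass a stub. */
    static boolean canMatch(CanMatchInputs inputs) {
        try (inputs) { // releases the inputs when done (assumption for this sketch)
            return inputs.queryRewritesToMatchNone() == false;
        } catch (Exception e) {
            return true; // any failure means the shard cannot be ruled out
        }
    }
}

/** Test stub that records whether it was released. */
final class StubInputs implements CanMatchInputs {
    private final boolean matchNone;
    boolean closed;

    StubInputs(boolean matchNone) {
        this.matchNone = matchNone;
    }

    @Override
    public boolean queryRewritesToMatchNone() {
        return matchNone;
    }

    @Override
    public void close() {
        closed = true;
    }
}
```

A unit test can then assert on the return value and on `closed` directly, and can cover the exception path with a stub whose `queryRewritesToMatchNone()` throws.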

final MinAndMax<?> minMax = sortBuilder != null ? FieldSortBuilder.getMinMaxOrNull(context, sortBuilder) : null;
return new CanMatchShardResponse(true, minMax);
} catch (Exception e) {
return new CanMatchShardResponse(true, null);
Member Author

This is the core of the change: I am thinking that we should perhaps move the catching entirely to this method. Why send back exceptions when the coordinating node consumes them as canMatch=true, minMax=null? The lack of exception handling in this method caused this bug in the first place, because it requires every caller to handle exceptions.

Member Author

I have now replaced the catch with a broader one and removed the specific one that I had initially added.

l.onFailure(exc);
return;
// if can_match throws for some reason, we go ahead with the query phase
canMatchResp = new CanMatchShardResponse(true, null);
Member Author

I added a targeted catch for sorting-related exceptions, but we still need a broader one here. Based on my proposal below to handle all exceptions in the can match method, perhaps this can go away.

Member Author

I removed this, given that exception handling is now in the canMatch method itself.

@javanna marked this pull request as ready for review December 2, 2024 20:12
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@elasticsearchmachine added the Team:Search Foundations label Dec 2, 2024
@@ -1619,6 +1634,7 @@ public void canMatch(CanMatchNodeRequest request, ActionListener<CanMatchNodeRes
final List<CanMatchNodeResponse.ResponseOrFailure> responses = new ArrayList<>(shardLevelRequests.size());
for (var shardLevelRequest : shardLevelRequests) {
try {
// TODO remove the exception handling as it's now in canMatch itself
responses.add(new CanMatchNodeResponse.ResponseOrFailure(canMatch(request.createShardSearchRequest(shardLevelRequest))));
Member Author

I left this for a follow-up, as I need to add a new transport version and backwards-compatibility handling.
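
For context on why the failure branch becomes redundant, here is a minimal sketch of the coordinator-side behaviour described in this PR, where a failed can match is treated as canMatch=true. `ShardAnswer` and `CoordinatorCanMatchSketch` are hypothetical names, not the real `CanMatchNodeResponse.ResponseOrFailure` API, and the sketch says nothing about the wire format or the transport version change.

```java
import java.util.List;

/** Hypothetical stand-in for a per-shard "response or failure" entry: exactly one field is set. */
record ShardAnswer(Boolean canMatch, Exception failure) {
    static ShardAnswer ofResponse(boolean canMatch) {
        return new ShardAnswer(canMatch, null);
    }

    static ShardAnswer ofFailure(Exception e) {
        return new ShardAnswer(null, e);
    }
}

final class CoordinatorCanMatchSketch {
    /** A failure is read as "may match", so it never removes a shard from the search. */
    static boolean shouldSearchShard(ShardAnswer answer) {
        return answer.failure() != null || answer.canMatch();
    }

    /** Counts the shards that still need to run the query phase. */
    static long shardsToSearch(List<ShardAnswer> answers) {
        return answers.stream().filter(CoordinatorCanMatchSketch::shouldSearchShard).count();
    }
}
```

Since `ofFailure(...)` and `ofResponse(true)` lead to the same decision here, a data node that already maps failures to canMatch=true has no need to send the exception at all.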

*/
package org.elasticsearch.search;

import org.apache.lucene.index.DirectoryReader;
Member Author

The diff is hard to follow, but I just renamed the existing SearchServiceTests to SearchServiceSingleNodeTests and added a new test class containing only unit tests, which I called SearchServiceTests.

Contributor

Thanks for the review aid here

@javanna changed the title from "Handle exceptions in query phase can match" to "Handle all exceptions in data nodes can match" Dec 2, 2024
null
)
).canMatch()
);
Member Author

The previous version of the assertion came from retrying PIT requests when the search context is missing. I don't think that's a good reason to make can match fail, though, hence I adjusted the test. Retries will be attempted in the query phase.

@javanna removed the v8.16.2 label Dec 4, 2024
@javanna added the v8.18.0 label Dec 4, 2024
@javanna requested a review from andreidan December 4, 2024 16:15
@andreidan (Contributor) left a comment

Thanks for fixing this Luca.

Left a couple of minor, optional, suggestions.

@original-brownbear (Member) left a comment

LGTM, would have been nice to do this without the new context class but I see the point :) but 🤞 I guess this goes away somewhat shortly anyway :)

@javanna added the v8.16.2 label Dec 9, 2024
@javanna added the auto-backport label Dec 10, 2024
@javanna merged commit 730f42a into elastic:main Dec 10, 2024
16 checks passed
@javanna deleted the fix/can_match_query_handle_exception branch December 10, 2024 21:54
@elasticsearchmachine
Collaborator

💔 Backport failed

Status Branch Result
8.16 Commit could not be cherrypicked due to conflicts
8.17 Commit could not be cherrypicked due to conflicts
8.x Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 117469

javanna added a commit to javanna/elasticsearch that referenced this pull request Dec 12, 2024
During the can match phase, prior to the query phase, we may have exceptions
that are returned back to the coordinating node, handled gracefully as if the
shard returned canMatch=true.

During the query phase, we perform an additional rewrite and can match phase
to eventually shortcut the query phase for the shard. That needs to handle
exceptions as well. Currently, an exception there causes shard failures, while
we should rather go ahead and execute the query on the shard.

Instead of adding another try catch on consumers code, this commit adds exception handling to the method itself so that it can no longer throw exceptions and similar mistakes can no longer be made in the future.

At the same time, this commit makes the can match method more easily testable without requiring a full-blown SearchService instance.

Closes elastic#104994
elasticsearchmachine pushed a commit that referenced this pull request Dec 12, 2024
* Handle all exceptions in data nodes can match (#117469)

* fix compile
javanna added a commit to javanna/elasticsearch that referenced this pull request Dec 12, 2024
…lastic#118533)

* Handle all exceptions in data nodes can match (elastic#117469)

* fix compile
javanna added a commit to javanna/elasticsearch that referenced this pull request Dec 12, 2024
…lastic#118533)

* Handle all exceptions in data nodes can match (elastic#117469)

* fix compile
elasticsearchmachine pushed a commit that referenced this pull request Dec 12, 2024
… (#118570)

* Handle all exceptions in data nodes can match (#117469)

* fix compile
elasticsearchmachine pushed a commit that referenced this pull request Dec 12, 2024
… (#118572)

* Handle all exceptions in data nodes can match (#117469)

* fix compile
Labels
auto-backport, >bug, :Search Foundations/Search, Team:Search Foundations, v8.16.2, v8.17.1, v8.18.0, v9.0.0
Development

Successfully merging this pull request may close these issues.

Unsupported Operation Exception querying Frozen Tier data
5 participants