-
Notifications
You must be signed in to change notification settings - Fork 25.7k
Improve error handling in async search code #57925
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve error handling in async search code #57925
Conversation
|
Pinging @elastic/es-search (:Search/Search) |
x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/AsyncSearchTask.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have not found a simple way to test this. Unit testing a transport action is a bit of a nightmare with all the required dependencies. And from an integ test, how do I trigger a failure when scheduling the wait for completion thread?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here I added the same catch that we have below for storeFinalResponse. It's based on paranoia, but should not hurt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also make the try/catch in deleteResponse and call onFailure instead of throwing an exception ?
jimczi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments
x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/AsyncSearchTask.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking of this and I don't think we should add the failure here. It can be transient so a retry may fix the issue. I'd prefer that we try/catch the call to toAsyncSearchResponse and use a plain ActionListener to notify the failure, wdyt ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also make the try/catch in deleteResponse and call onFailure instead of throwing an exception ?
- The exception that we caught when failing to schedule a thread was incorrect. - We may have failures when reducing the response before returning it, which were not handled correctly and may have caused get or submit async search task to not be properly unregistered from the task manager - when the completion listener onFailure method is invoked, the search task has to be unregistered. Not doing so may cause the search task to be stuck in the task manager although it has completed.
0c342e9 to
5184732
Compare
|
@jimczi I pushed an update. I am still working on a new integ test for the submit and get action in case getResponse fails, but I wanted to give you the chance to comment on the recent changes sooner rather than later. |
| } else { | ||
| this.failure = rootCauses[0]; | ||
| } | ||
| this.failure = ExceptionsHelper.convertToElastic(exc); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is simpler and even preserves status codes, not sure why we were using guessRootCauses
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++
|
@jimczi can you have a look please? There are still a couple of TODOs that you may have ideas about, but it should be close. At least I managed to test almost everything I wanted to test. |
jimczi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left more comments
| listener.accept(finalResponse); | ||
| } | ||
| completionListeners.clear(); | ||
| getResponse(completionListeners); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we copy the completion listeners in the synchronized block to avoid concurrent delete (unregister) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good idea
| listener.onResponse(resp); | ||
| } | ||
| }, | ||
| listener::onFailure |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we still need to cancel the cancellable on failure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes indeed :)
| listener.onResponse(asyncSearchResponse); | ||
| } | ||
|
|
||
| AsyncSearchResponse buildErrorResponse(SearchResponse searchResponse, Exception exception) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other solution would be to build the error response when catching the exception in getResponse ? This way we don't need to differentiate between a failure during partial reduce and a fatal failure. They both return an async search response that contains a failure ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is along the lines of what I had in the first iteration, and I believe it defeats the purpose of holding ActionListeners instead of Consumers. We have three places that require different behaviour currently when we get reduction failures:
- submit when the timeout expires, we treat is a fatal failure and cancel the search task
- submit when the response completes, we already returned, we store async search response
- get: we treat it as a transient failure
I think we could maintain the behaviour listed above even if we did not hold action listeners? Only when getting a response the consumer should check if it holds a failure and act accordingly?
I am not sure though if you meant on changing also some of this behaviour, especially because I don't completely follow your statement " This way we don't need to differentiate between a failure during partial reduce and a fatal failure.".
I was actually wondering if we should be more clear with the user, maybe wrapping the exception, and let them know when the exception comes from search and when it comes from async search. I can see how using suppressed exceptions gives you all that happened but it's hard to decipher and debug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I meant is that the context of the failure is important so we should leave it to the consumer ? In general, and sorry for the back and forth on this, I think we should always return an AsyncSearchResponse even in the case of a failure.
The failure can be transient, in such case is_running should be true (the search action is still running).
I also don't think we should cancel the search if we have a failure when reducing a partial search response. We should return the response with the transient failure and the metadata associated with the current search (number of shards, id, ...). That would simplify the handling of failures and would be consistent with transient failures in get ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I need to give this a try, especially as the first iteration did not have yet the described different behaviour in submit and get. I think that the main argument for this is to simplify things, adding action listeners adds complexity and having to call buildErrorResponse from submit is weird and should be removed if possible, which is kind of why I initially went down that route :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
having gone back to the consumer approach, I see the "context" argument better. Returning async search response is more convenient as it holds the needed info to tell what is happening, while an exception alone does not say much besides that an error has happened.
| } else { | ||
| this.failure = rootCauses[0]; | ||
| } | ||
| this.failure = ExceptionsHelper.convertToElastic(exc); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++
| import java.util.List; | ||
| import java.util.Map; | ||
|
|
||
| public class FailReduceAggPlugin extends Plugin implements SearchPlugin { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't really need a full plugin since we only use the FailReduceInternalAgg and are in charge of the registry in AsyncSearchTaskTests ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the plugin is used in AsyncSearchActionIT not in AsyncSearchTaskTests
|
@jimczi I pushed an update, I haven't updated tests yet. |
jimczi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a minor comment but I like it better. Thanks for iterating on this.
| error = this.failure; | ||
| error.addSuppressed(exception); | ||
| } | ||
| //TODO add some search response here rather than null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++, we should return the current stats (number of shards, shard failures, ...) and just omit the partial aggs.
| searchResponse.get().updateWithFailure(exc); | ||
| // if the failure occurred before calling onListShards | ||
| searchResponse.compareAndSet(null, new MutableSearchResponse(-1, -1, null, threadPool.getThreadContext())); | ||
| searchResponse.get().updateWithFailure(new ElasticsearchStatusException("error while executing search", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the additional wrapping ok? I think it's odd that we have to have it, but useful to clarify where errors come from: async search or search execution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++
|
@jimczi I pushed an update, this should be ready. I removed the IT tests as they were super fragile and at this point not needed given that there are no functional changes to the submit and get actions, and all that has changed in the async search task can be tested through unit tests which are much easier to write and reason about. |
jimczi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left one comment regarding a non-BWC change in AsyncSearchResponse. The rest looks good to me.
| searchResponse.get().updateWithFailure(exc); | ||
| // if the failure occurred before calling onListShards | ||
| searchResponse.compareAndSet(null, new MutableSearchResponse(-1, -1, null, threadPool.getThreadContext())); | ||
| searchResponse.get().updateWithFailure(new ElasticsearchStatusException("error while executing search", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++
| public void onFailure(Exception exc) { | ||
| submitListener.onFailure(exc); | ||
| //this will only ever be called when there's an issue scheduling the thread will invoke | ||
| //the completion listener once the wait for completion timeout expires |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you reword this comment to make it understandable ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ops ok
| private final SearchResponse searchResponse; | ||
| @Nullable | ||
| private final Exception error; | ||
| private final ElasticsearchException error; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you can make that change without breaking BWC ? You'd need to wrap the exception if we read from an earlier version ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although I think we should stick to an Exception here to keep things simple
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought I have checked and this is fine. I haven't changed how the exception gets serialized? it was in fact always an elasticsearch exception before I think? I will check again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am double checking: this change is not required but I think it simplifies things (I could also simplify the status method which I missed before) yet I agree that we have to make sure it does not break anything.
The failure in MutableSearchResponse has always been an ElasticsearchException , and it used to be the only exception that gets passed in when building an AsyncSearchResponse. With this change we can also have a reduce exception, but that is still an ElasticsearchException. So, effectively, ElasticsearchException is the only exception that AsyncSearchResponse will ever hold. I think as long as we don't modify how we serialize it over the wire (by removing the type of exception because we already know its type) we should be ok? Is there anything I am missing that could cause a bw comp breakage?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We had issues there before. The ElasticsearchException can be serialized over the wire and deserialized as another exception if it is not registered so I'd prefer that we keep Exception for now. I am not sure why do you think it simplifies things ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it simplifies things because it is super confusing to declare an exception when what we carry is always ElasticsearchException, and it simplifies returning the correct status, no guessing needed. I do get nervous though about the cast in StreamInput#readException, it is trappy and I was also wondering if this is not too risky. I reverted this bit, but I still don't get what it would break :) Possibly though it is wise to keep things as-is because async responses are stored in the index using the wire format which makes things tricky when it comes to bw comp.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep the casting is trappy so +1 to keep it as is for the moment. We can change the way exceptions are handled in a follow up but that should be only for new versions imo.
jimczi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
- The exception that we caught when failing to schedule a thread was incorrect. - We may have failures when reducing the response before returning it, which were not handled correctly and may have caused get or submit async search task to not be properly unregistered from the task manager - when the completion listener onFailure method is invoked, the search task has to be unregistered. Not doing so may cause the search task to be stuck in the task manager although it has completed. Closes #58995
- The exception that we caught when failing to schedule a thread was incorrect. - We may have failures when reducing the response before returning it, which were not handled correctly and may have caused get or submit async search task to not be properly unregistered from the task manager - when the completion listener onFailure method is invoked, the search task has to be unregistered. Not doing so may cause the search task to be stuck in the task manager although it has completed. Closes #58995
Closes #58995