Improve error handling in async search code #57925

javanna · 2020-06-10T10:37:22Z

- The exception that we caught when failing to schedule a thread was incorrect.
- We may have failures when reducing the response before returning it. These were not handled correctly and may have caused the get or submit async search task not to be properly unregistered from the task manager.
- When the completion listener's onFailure method is invoked, the search task has to be unregistered. Not doing so may cause the search task to be stuck in the task manager although it has completed.
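To illustrate the last point, here is a minimal sketch of a listener wrapper that unregisters the task on both the success and the failure path. `Listener`, `TaskRegistry` and `UnregisteringListener` are hypothetical stand-ins written for this example; they are not the Elasticsearch `ActionListener` or `TaskManager` types, and the actual change in this PR may be structured differently.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Entry point: shows that the task is unregistered on the failure path too.
final class UnregisterDemo {
    static boolean failureUnregisters() {
        TaskRegistry registry = new TaskRegistry();
        registry.register(1L);
        Listener<String> delegate = new Listener<String>() {
            @Override public void onResponse(String response) { }
            @Override public void onFailure(Exception e) { }
        };
        new UnregisteringListener<>(delegate, registry, 1L).onFailure(new RuntimeException("reduce failed"));
        return registry.isRegistered(1L) == false;
    }

    public static void main(String[] args) {
        System.out.println("unregistered after failure: " + failureUnregisters());
    }
}

// Hypothetical callback interface (assumption, not the Elasticsearch type).
interface Listener<T> {
    void onResponse(T response);
    void onFailure(Exception e);
}

// Hypothetical task manager that only tracks registered task ids.
final class TaskRegistry {
    private final Set<Long> tasks = ConcurrentHashMap.newKeySet();
    void register(long id) { tasks.add(id); }
    void unregister(long id) { tasks.remove(id); }
    boolean isRegistered(long id) { return tasks.contains(id); }
}

// Delegates to the wrapped listener and always unregisters the task afterwards.
final class UnregisteringListener<T> implements Listener<T> {
    private final Listener<T> delegate;
    private final TaskRegistry registry;
    private final long taskId;

    UnregisteringListener(Listener<T> delegate, TaskRegistry registry, long taskId) {
        this.delegate = delegate;
        this.registry = registry;
        this.taskId = taskId;
    }

    @Override
    public void onResponse(T response) {
        try {
            delegate.onResponse(response);
        } finally {
            registry.unregister(taskId);
        }
    }

    @Override
    public void onFailure(Exception e) {
        try {
            delegate.onFailure(e);
        } finally {
            registry.unregister(taskId);
        }
    }
}
```

The `finally` blocks mean the task is removed even if the delegate itself throws.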

elasticmachine · 2020-06-10T10:37:24Z

Pinging @elastic/es-search (:Search/Search)

x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/AsyncSearchTask.java

javanna · 2020-06-10T12:14:01Z

...nc-search/src/main/java/org/elasticsearch/xpack/search/TransportSubmitAsyncSearchAction.java

I have not found a simple way to test this. Unit testing a transport action is a bit of a nightmare with all the required dependencies. And from an integ test, how do I trigger a failure when scheduling the wait for completion thread?

javanna · 2020-06-10T12:14:29Z

...nc-search/src/main/java/org/elasticsearch/xpack/search/TransportSubmitAsyncSearchAction.java

here I added the same catch that we have below for storeFinalResponse. It's based on paranoia, but should not hurt?

We could also make the try/catch in deleteResponse and call onFailure instead of throwing an exception ?

jimczi

I left some comments

x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/AsyncSearchTask.java

jimczi · 2020-06-10T19:01:17Z

.../plugin/async-search/src/main/java/org/elasticsearch/xpack/search/MutableSearchResponse.java

I was thinking of this and I don't think we should add the failure here. It can be transient so a retry may fix the issue. I'd prefer that we try/catch the call to toAsyncSearchResponse and use a plain ActionListener to notify the failure, wdyt ?

jimczi · 2020-06-10T19:02:14Z

...nc-search/src/main/java/org/elasticsearch/xpack/search/TransportSubmitAsyncSearchAction.java

We could also make the try/catch in deleteResponse and call onFailure instead of throwing an exception ?

javanna · 2020-06-25T15:54:41Z

@jimczi I pushed an update. I am still working on a new integ test for the submit and get action in case getResponse fails, but I wanted to give you the chance to comment on the recent changes sooner rather than later.

javanna · 2020-06-28T21:45:44Z

.../plugin/async-search/src/main/java/org/elasticsearch/xpack/search/MutableSearchResponse.java

-        } else {
-            this.failure = rootCauses[0];
-        }
+        this.failure = ExceptionsHelper.convertToElastic(exc);


I think this is simpler and even preserves status codes, not sure why we were using guessRootCauses

javanna · 2020-06-28T21:47:29Z

@jimczi can you have a look please? There are still a couple of TODOs that you may have ideas about, but it should be close. At least I managed to test almost everything I wanted to test.

jimczi

I left more comments

jimczi · 2020-06-30T13:44:44Z

x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/AsyncSearchTask.java

-            listener.accept(finalResponse);
-        }
-        completionListeners.clear();
+        getResponse(completionListeners);


should we copy the completion listeners in the synchronized block to avoid concurrent delete (unregister) ?
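A rough sketch of what copying the listeners inside the synchronized block could look like. `CompletionListeners` is a hypothetical class for this example, with a plain `String` standing in for the response; it is not the code in `AsyncSearchTask`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Entry point: one listener unregisters the other while being notified.
final class SnapshotDemo {
    static int notifyAll(String response) {
        CompletionListeners listeners = new CompletionListeners();
        int[] notified = {0};
        long[] ids = new long[2];
        ids[0] = listeners.add(r -> { notified[0]++; listeners.remove(ids[1]); });
        ids[1] = listeners.add(r -> notified[0]++);
        listeners.complete(response);
        return notified[0];
    }

    public static void main(String[] args) {
        System.out.println("listeners notified: " + notifyAll("done"));
    }
}

// Hypothetical holder of completion listeners (assumption for illustration).
final class CompletionListeners {
    private final Map<Long, Consumer<String>> listeners = new HashMap<>();
    private long nextId = 0;

    synchronized long add(Consumer<String> listener) {
        long id = nextId++;
        listeners.put(id, listener);
        return id;
    }

    synchronized void remove(long id) {
        listeners.remove(id);
    }

    void complete(String response) {
        List<Consumer<String>> snapshot;
        synchronized (this) {
            // Copy and clear under the lock so concurrent add/remove calls cannot
            // modify the collection while it is being iterated.
            snapshot = new ArrayList<>(listeners.values());
            listeners.clear();
        }
        // Notify outside the lock so a slow listener does not block other callers.
        for (Consumer<String> listener : snapshot) {
            listener.accept(response);
        }
    }
}
```

Because iteration happens over the snapshot, a listener that calls `remove` during notification does not disturb the loop.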

jimczi · 2020-06-30T13:46:43Z

x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/AsyncSearchTask.java

+                                listener.onResponse(resp);
+                            }
+                        },
+                        listener::onFailure


we still need to cancel the cancellable on failure

yes indeed :)
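For illustration only, one way to make sure a pending timeout is cancelled on every path is to wrap each callback. This sketch uses `ScheduledFuture` from the JDK as the "cancellable"; the actual code uses Elasticsearch's own scheduler types, and `TimeoutGuard` is a hypothetical helper.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

final class TimeoutGuard {
    // Wraps a callback so that the pending timeout is cancelled before it runs.
    static <T> Consumer<T> cancelling(ScheduledFuture<?> timeout, Consumer<T> delegate) {
        return value -> {
            timeout.cancel(false);
            delegate.accept(value);
        };
    }

    // Schedules a far-away timeout, triggers the failure callback, and reports
    // whether the timeout was cancelled as a result.
    static boolean failureCancelsTimeout() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        try {
            ScheduledFuture<?> timeout = scheduler.schedule(() -> { }, 1, TimeUnit.HOURS);
            Consumer<Exception> onFailure = cancelling(timeout, e -> { });
            onFailure.accept(new IllegalStateException("reduce failed"));
            return timeout.isCancelled();
        } finally {
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("timeout cancelled on failure: " + failureCancelsTimeout());
    }
}
```

The same wrapper can be applied to the success callback, so both paths share the cancellation logic.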

jimczi · 2020-06-30T13:49:00Z

x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/AsyncSearchTask.java

+        listener.onResponse(asyncSearchResponse);
+    }
+
+    AsyncSearchResponse buildErrorResponse(SearchResponse searchResponse, Exception exception) {


The other solution would be to build the error response when catching the exception in getResponse ? This way we don't need to differentiate between a failure during partial reduce and a fatal failure. They both return an async search response that contains a failure ?

That is along the lines of what I had in the first iteration, and I believe it defeats the purpose of holding ActionListeners instead of Consumers. We have three places that require different behaviour currently when we get reduction failures:

- submit, when the timeout expires: we treat it as a fatal failure and cancel the search task
- submit, when the response completes: we have already returned, so we store the async search response
- get: we treat it as a transient failure

I think we could maintain the behaviour listed above even if we did not hold action listeners? Only when getting a response the consumer should check if it holds a failure and act accordingly?

I am not sure, though, whether you also meant to change some of this behaviour, especially because I don't completely follow your statement "This way we don't need to differentiate between a failure during partial reduce and a fatal failure.".

I was actually wondering if we should be more clear with the user, maybe wrapping the exception, and let them know when the exception comes from search and when it comes from async search. I can see how using suppressed exceptions gives you all that happened but it's hard to decipher and debug.

What I meant is that the context of the failure is important so we should leave it to the consumer ? In general, and sorry for the back and forth on this, I think we should always return an AsyncSearchResponse even in the case of a failure.
The failure can be transient, in which case is_running should be true (the search action is still running).
I also don't think we should cancel the search if we have a failure when reducing a partial search response. We should return the response with the transient failure and the metadata associated with the current search (number of shards, id, ...). That would simplify the handling of failures and would be consistent with transient failures in get ?

Ok, I need to give this a try, especially as the first iteration did not yet have the different behaviour in submit and get described above. I think that the main argument for this is to simplify things: adding action listeners adds complexity, and having to call buildErrorResponse from submit is weird and should be removed if possible, which is kind of why I initially went down that route :)

having gone back to the consumer approach, I see the "context" argument better. Returning async search response is more convenient as it holds the needed info to tell what is happening, while an exception alone does not say much besides that an error has happened.
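A minimal sketch of the "always return a response" idea discussed in this thread, assuming a simplified response type. `AsyncResult` and `AsyncResults` are hypothetical names for this example and do not reflect the real `AsyncSearchResponse` fields or constructor.

```java
import java.util.function.Supplier;

// Entry point: a failing reduction still produces a response with context.
final class AsyncResultDemo {
    public static void main(String[] args) {
        AsyncResult result = AsyncResults.build("abc", true, () -> {
            throw new IllegalStateException("reduce failed");
        });
        System.out.println("id=" + result.id + " running=" + result.isRunning + " failure=" + result.failure);
    }
}

// Hypothetical, simplified stand-in for an async search response.
final class AsyncResult {
    final String id;
    final boolean isRunning;
    final String partialResult; // null when the reduction failed
    final Exception failure;    // null when the reduction succeeded

    AsyncResult(String id, boolean isRunning, String partialResult, Exception failure) {
        this.id = id;
        this.isRunning = isRunning;
        this.partialResult = partialResult;
        this.failure = failure;
    }
}

final class AsyncResults {
    // Builds a response even when reducing fails: the failure travels inside the
    // response together with the id and the running state, instead of replacing it.
    static AsyncResult build(String id, boolean searchStillRunning, Supplier<String> reduce) {
        try {
            return new AsyncResult(id, searchStillRunning, reduce.get(), null);
        } catch (RuntimeException e) {
            return new AsyncResult(id, searchStillRunning, null, e);
        }
    }
}
```

The consumer then decides what a failure means in its context: submit and get can inspect `failure` and `isRunning` and react differently.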

jimczi · 2020-06-30T13:49:57Z

.../plugin/async-search/src/main/java/org/elasticsearch/xpack/search/MutableSearchResponse.java

-        } else {
-            this.failure = rootCauses[0];
-        }
+        this.failure = ExceptionsHelper.convertToElastic(exc);


jimczi · 2020-06-30T13:55:45Z

...ck/plugin/async-search/src/test/java/org/elasticsearch/xpack/search/FailReduceAggPlugin.java

+import java.util.List;
+import java.util.Map;
+
+public class FailReduceAggPlugin extends Plugin implements SearchPlugin {


You don't really need a full plugin since we only use the FailReduceInternalAgg and are in charge of the registry in AsyncSearchTaskTests ?

the plugin is used in AsyncSearchActionIT not in AsyncSearchTaskTests

javanna · 2020-07-01T11:04:26Z

@jimczi I pushed an update, I haven't updated tests yet.

jimczi

I left a minor comment but I like it better. Thanks for iterating on this.

jimczi · 2020-07-01T11:09:13Z

.../plugin/async-search/src/main/java/org/elasticsearch/xpack/search/MutableSearchResponse.java

+            error = this.failure;
+            error.addSuppressed(exception);
+        }
+        //TODO add some search response here rather than null


++, we should return the current stats (number of shards, shard failures, ...) and just omit the partial aggs.

javanna · 2020-07-02T09:10:51Z

x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/AsyncSearchTask.java

-            searchResponse.get().updateWithFailure(exc);
+            // if the failure occurred before calling onListShards
+            searchResponse.compareAndSet(null, new MutableSearchResponse(-1, -1, null, threadPool.getThreadContext()));
+            searchResponse.get().updateWithFailure(new ElasticsearchStatusException("error while executing search",


is the additional wrapping ok? I think it's odd that we have to have it, but useful to clarify where errors come from: async search or search execution.
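For readers less familiar with the pattern in the hunk above, here is a self-contained sketch of it using only JDK types. `MutableState` and `FailureHandler` are hypothetical, and `RuntimeException` stands in for the wrapping exception; the `-1` placeholder mirrors the diff but everything else is an assumption.

```java
import java.util.concurrent.atomic.AtomicReference;

// Entry point: a failure arrives before any shard information is known.
final class EarlyFailureDemo {
    public static void main(String[] args) {
        FailureHandler handler = new FailureHandler();
        handler.onFailure(new IllegalStateException("no shards available"));
        System.out.println("totalShards=" + handler.state().totalShards
            + " cause=" + handler.state().failure().getCause().getMessage());
    }
}

// Hypothetical mutable holder for the search state.
final class MutableState {
    final int totalShards; // -1 when the shard count is not known yet
    private Exception failure;

    MutableState(int totalShards) {
        this.totalShards = totalShards;
    }

    synchronized void updateWithFailure(Exception e) {
        this.failure = e;
    }

    synchronized Exception failure() {
        return failure;
    }
}

final class FailureHandler {
    private final AtomicReference<MutableState> state = new AtomicReference<>();

    void onListShards(int totalShards) {
        state.set(new MutableState(totalShards));
    }

    void onFailure(Exception cause) {
        // Only installs the placeholder if no state was set yet; otherwise a no-op.
        state.compareAndSet(null, new MutableState(-1));
        // Wrapping keeps the original exception as the cause and makes it clear
        // that the error comes from the search execution.
        state.get().updateWithFailure(new RuntimeException("error while executing search", cause));
    }

    MutableState state() {
        return state.get();
    }
}
```

If `onListShards` ran first, `compareAndSet` leaves the existing state untouched and only the failure is recorded.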

javanna · 2020-07-02T09:12:09Z

@jimczi I pushed an update, this should be ready. I removed the IT tests as they were super fragile and at this point not needed given that there are no functional changes to the submit and get actions, and all that has changed in the async search task can be tested through unit tests which are much easier to write and reason about.

jimczi

I left one comment regarding a non-BWC change in AsyncSearchResponse. The rest looks good to me.

jimczi · 2020-07-02T16:44:24Z

x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/AsyncSearchTask.java

-            searchResponse.get().updateWithFailure(exc);
+            // if the failure occurred before calling onListShards
+            searchResponse.compareAndSet(null, new MutableSearchResponse(-1, -1, null, threadPool.getThreadContext()));
+            searchResponse.get().updateWithFailure(new ElasticsearchStatusException("error while executing search",


jimczi · 2020-07-02T16:45:27Z

...nc-search/src/main/java/org/elasticsearch/xpack/search/TransportSubmitAsyncSearchAction.java

                public void onFailure(Exception exc) {
-                    submitListener.onFailure(exc);
+                    //this will only ever be called when there's an issue scheduling the thread will invoke
+                    //the completion listener once the wait for completion timeout expires


Can you reword this comment to make it understandable ?

jimczi · 2020-07-02T16:47:23Z

...lugin/core/src/main/java/org/elasticsearch/xpack/core/search/action/AsyncSearchResponse.java

    private final SearchResponse searchResponse;
    @Nullable
-    private final Exception error;
+    private final ElasticsearchException error;


I don't think you can make that change without breaking BWC ? You'd need to wrap the exception if we read from an earlier version ?

Although I think we should stick to an Exception here to keep things simple

I thought I had checked and this is fine. I haven't changed how the exception gets serialized? It was in fact always an ElasticsearchException before, I think? I will check again.

I am double checking: this change is not required but I think it simplifies things (I could also simplify the status method which I missed before) yet I agree that we have to make sure it does not break anything.

The failure in MutableSearchResponse has always been an ElasticsearchException , and it used to be the only exception that gets passed in when building an AsyncSearchResponse. With this change we can also have a reduce exception, but that is still an ElasticsearchException. So, effectively, ElasticsearchException is the only exception that AsyncSearchResponse will ever hold. I think as long as we don't modify how we serialize it over the wire (by removing the type of exception because we already know its type) we should be ok? Is there anything I am missing that could cause a bw comp breakage?

We had issues there before. The ElasticsearchException can be serialized over the wire and deserialized as another exception if it is not registered, so I'd prefer that we keep Exception for now. I am not sure why you think it simplifies things?

I think it simplifies things because it is super confusing to declare an exception when what we carry is always ElasticsearchException, and it simplifies returning the correct status, no guessing needed. I do get nervous though about the cast in StreamInput#readException, it is trappy and I was also wondering if this is not too risky. I reverted this bit, but I still don't get what it would break :) Possibly though it is wise to keep things as-is because async responses are stored in the index using the wire format which makes things tricky when it comes to bw comp.

Yep the casting is trappy so +1 to keep it as is for the moment. We can change the way exceptions are handled in a follow up but that should be only for new versions imo.
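As a generic illustration of why the field type matters here (not the actual `AsyncSearchResponse#status` implementation): when the field is declared as a plain `Exception`, the status has to be derived with a type check rather than an unconditional cast, which stays safe if a different exception type comes back after deserialization. `StatusException` and `StatusResolver` are hypothetical names, and the fallback codes are assumptions.

```java
import java.io.IOException;

// Entry point: prints the status derived for three kinds of stored errors.
final class StatusDemo {
    public static void main(String[] args) {
        System.out.println(StatusResolver.statusOf(null));
        System.out.println(StatusResolver.statusOf(new StatusException("too many requests", 429, null)));
        System.out.println(StatusResolver.statusOf(new IOException("unexpected type")));
    }
}

// Hypothetical exception that carries an HTTP-like status code.
final class StatusException extends RuntimeException {
    final int status;

    StatusException(String message, int status, Throwable cause) {
        super(message, cause);
        this.status = status;
    }
}

final class StatusResolver {
    // Derives a status without assuming the concrete exception type.
    static int statusOf(Exception error) {
        if (error == null) {
            return 200; // no failure recorded
        }
        if (error instanceof StatusException) {
            return ((StatusException) error).status;
        }
        return 500; // unknown exception type: fall back to a generic server error
    }
}
```

An unconditional cast would throw `ClassCastException` in the third case, which is the kind of trap discussed above.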

This reverts commit e5b8596.

jimczi

LGTM

- The exception that we caught when failing to schedule a thread was incorrect.
- We may have failures when reducing the response before returning it, which were not handled correctly and may have caused get or submit async search task to not be properly unregistered from the task manager
- when the completion listener onFailure method is invoked, the search task has to be unregistered. Not doing so may cause the search task to be stuck in the task manager although it has completed.

Closes #58995

javanna added the >bug, :Search/Search, v8.0.0, v7.8.1, v7.9.0 labels Jun 10, 2020

javanna requested a review from jimczi June 10, 2020 10:37

elasticmachine added the Team:Search label Jun 10, 2020

javanna commented Jun 10, 2020

jimczi reviewed Jun 10, 2020

This was referenced Jun 18, 2020

Handle failures with no explicit cause in async search #58319

Merged

Unable to get progress in _async_search forever #58311

Closed

javanna added 3 commits June 25, 2020 12:22

notify listener instead of try catch

1a01112

iter

5184732

javanna force-pushed the fix/completion_listener_failures branch from 0c342e9 to 5184732 Compare June 25, 2020 15:49

revert assertion error

cb8d10e

javanna added 4 commits June 26, 2020 11:50

store initial response together with failure

cbcb08d

fix compile error

27c98db

add tests

3f91a20

remove temporary changes

f01df9b

javanna commented Jun 28, 2020

jimczi reviewed Jun 30, 2020

javanna added 2 commits July 1, 2020 11:17

iter

42af6bf

wip: go back to consumer

7e73685

jimczi approved these changes Jul 1, 2020

javanna added 2 commits July 2, 2020 11:04

iter

4376ff5

iter

9a8927c

javanna commented Jul 2, 2020

iter

57aa863

jimczi requested changes Jul 2, 2020

javanna added 6 commits July 2, 2020 21:11

Merge branch 'master' into fix/completion_listener_failures

b52b779

clarify comment

0c737ab

simplify AsyncSearchResponse#status method

e5b8596

adapt to internalaggs changes

6b6fb54

Revert "simplify AsyncSearchResponse#status method"

fd26c74

This reverts commit e5b8596.

iter

a2d0eb5

javanna mentioned this pull request Jul 3, 2020

Async Search is not Catching some Exceptions (and Leaking Listeners as a Result?) #58995

Closed

jimczi approved these changes Jul 3, 2020

javanna merged commit 4366360 into elastic:master Jul 3, 2020

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
