Add new x-pack endpoints to track the progress of a search asynchronously #49931
Pinging @elastic/es-search (:Search/Search)
@elasticmachine run elasticsearch-ci/packaging-sample-matrix
Woohoo! This looks great in general, some questions:
- If I read correctly, it's up to the user to garbage collect responses manually. Should we do this automatically when a final response has been retrieved? We already have a `wait_for_completion` parameter that reduces the number of roundtrips for fast requests, so it doesn't feel consistent to always require a new roundtrip to delete the response. I'm also a bit biased towards reducing the number of cases where responses need to be garbage-collected via ILM, as you could accumulate a large volume of responses in 5 days.
- If we moved from time-based ids to true uuids, which can't be guessed, I wonder whether we'd still need to require that the user who views a response is the same as the user who submitted the request. I don't think it would surprise users that sharing the id of an async search has pretty much the same consequences as sharing the response of the search request.
- Since `response` and `partial_response` should have mostly the same format, I wonder whether we should use the same `response` key combined with a `partial` flag?
x-pack/plugin/core/src/main/resources/async-search-history.json
I like the idea. Getting the same final response twice is something that our regular caches should handle transparently, so this would also emphasize the fact that these responses are not meant to be used as an additional cache.
+1 for true uuids. I agree with the response sharing analogy, but since we want to delete final responses when they are reported back, I think it would be nice to have this extra layer. It's not a lot of work and it's something that we already implement in scrolls.
Are you talking about the REST format or the internal response? I think it's important to keep the distinction internally (for the HLRC), but I agree that the response could look like this:
Is that what you meant?
Yes, that's what I meant.
This all sounds great. Two suggestions to consider:
I pushed another iteration that addresses @jpountz's comments. We also discussed the best options to secure the system index with @elastic/es-security. The simplest way today would be to add the async search index in the
I wonder if we should add "pending_shards" to the partial_response section so it's explicit that we are expecting those shards to return? Also, I wonder if we should group the shard counts together into their own object?
FYI, you'll need to merge in the latest changes from
…s to the user that submitted the initial request
I left a couple more small comments but LGTM. Can you also update the description and remove the mention of the 304 status code which I believe is outdated? Thanks for taking this to the finish line.
...ugin/async-search/src/main/java/org/elasticsearch/xpack/search/RestGetAsyncSearchAction.java
ActionRequestValidationException validationException = submit.validate();
if (validationException != null) {
    throw validationException;
}
I still see it, did you mean to remove it?
request.setCcsMinimizeRoundtrips(false);
request.setPreFilterShardSize(1);
request.setBatchedReduceSize(5);
request.requestCache(true);
I am still missing where we reject ccs minimize roundtrips set to false
@Override
public Task createTask(long id, String type, String action, TaskId parentTaskId, Map<String, String> headers) {
    return new CancellableTask(id, type, action, "", parentTaskId, headers) {
could you address this too?
x-pack/plugin/src/test/resources/rest-api-spec/api/async_search.get.json
"path":"/_async_search",
"methods":[
    "GET",
    "POST"
one more thing I had missed before, here we may want to remove GET? I tend to think that POST is the only method that suits an API that submits something. Was it here only for consistency with search?
x-pack/plugin/src/test/resources/rest-api-spec/api/async_search.get.json
},
"keep_alive": {
    "type": "time",
    "description": "Specify the time that the request should remain reachable in the cluster."
maybe rephrase to something like "Specify the time interval in which the results (partial or final) for this search will be available"
AsyncSearchActionTests#testCleanupOnFailure fails sporadically in CI but not locally. This commit switches the tests into a SuiteScopeTestCase that creates internal states once on static members in order to make the tests more reproducible. Relates #49931
Deleting an async search id can throw a ResourceNotFoundException even if the query was successfully cancelled. We delete the stored response automatically if the query is cancelled so that creates a race with the delete action that also ensures that the task is removed. This change ensures that we ignore missing async search ids in the async search index if they were successfully cancelled. Relates #53360 Relates #49931
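The race described in this commit message can be illustrated with a minimal sketch. It is not the Elasticsearch implementation: `running_tasks` and `stored_responses` are hypothetical in-memory dictionaries standing in for the task manager and the async search index, and `ResourceNotFound` is a local stand-in for `ResourceNotFoundException`.

```python
class ResourceNotFound(Exception):
    """Raised when an async search id is neither running nor stored."""


def delete_async_search(search_id, running_tasks, stored_responses):
    """Cancel the task (if any), then drop the stored response (if any).

    running_tasks and stored_responses are hypothetical in-memory stand-ins
    for the task manager and the async search index.
    """
    cancelled = running_tasks.pop(search_id, None) is not None
    # Cancellation may already have removed the stored response on its own,
    # so a missing document is only an error when nothing was cancelled.
    removed = stored_responses.pop(search_id, None) is not None
    if not cancelled and not removed:
        raise ResourceNotFound(search_id)
    return {"acknowledged": True}
```

With this shape, deleting an id whose task was just cancelled succeeds even when the stored document is already gone, while an id that is unknown to both sides still raises.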
…usly (#49931) (#53591)

This change introduces a new API in x-pack basic that allows tracking the progress of a search. Users can submit an asynchronous search through a new endpoint called `_async_search` that works exactly the same as the `_search` endpoint, but instead of blocking and returning the final response when available, it returns a response after a provided `wait_for_completion` time.

````
GET my_index_pattern*/_async_search?wait_for_completion=100ms
{
  "aggs": {
    "date_histogram": {
      "field": "@timestamp",
      "fixed_interval": "1h"
    }
  }
}
````

If after 100ms the final response is not available, a `partial_response` is included in the body:

````
{
  "id": "9N3J1m4BgyzUDzqgC15b",
  "version": 1,
  "is_running": true,
  "is_partial": true,
  "response": {
    "_shards": {
      "total": 100,
      "successful": 5,
      "failed": 0
    },
    "total_hits": {
      "value": 1653433,
      "relation": "eq"
    },
    "aggs": {
      ...
    }
  }
}
````

The partial response contains the total number of requested shards, the number of shards that successfully returned and the number of shards that failed. It also contains the total hits as well as partial aggregations computed from the successful shards.

To continue to monitor the progress of the search, users can call the get `_async_search` API like the following:

````
GET _async_search/9N3J1m4BgyzUDzqgC15b/?wait_for_completion=100ms
````

That returns a new response that can contain the same partial response as the previous call if the search didn't progress; in that case the returned `version` should be the same. If new partial results are available, the version is incremented and the `partial_response` contains the updated progress. Finally, if the response is fully available while or after waiting for completion, the `partial_response` is replaced by a `response` section that contains the usual _search response:

````
{
  "id": "9N3J1m4BgyzUDzqgC15b",
  "version": 10,
  "is_running": false,
  "response": {
    "is_partial": false,
    ...
  }
}
````

Asynchronous searches are stored in a restricted index called `.async-search` if they survive (still running) after the initial submit. Each request has a keep alive that defaults to 5 days, but this value can be changed/updated at any time:

`````
GET my_index_pattern*/_async_search?wait_for_completion=100ms&keep_alive=10d
`````

The default can be changed when submitting the search; the example above raises the default value for the search to `10d`.

`````
GET _async_search/9N3J1m4BgyzUDzqgC15b/?wait_for_completion=100ms&keep_alive=10d
`````

The time to live for a specific search can be extended when getting the progress/result. In the example above we extend the keep alive to 10 more days.

A background service that runs only on the node that holds the first primary shard of the `.async-search` index is responsible for deleting the expired results. It runs every hour, but the expiration is also checked by running queries (if they take longer than the keep_alive) and when getting a result.

Like a normal `_search`, if the http channel that is used to submit a request is closed before getting a response, the search is automatically cancelled. Note that this behavior applies only to the submit API; subsequent GET requests will not cancel the search if they are closed.

Asynchronous searches are not persistent: if the coordinator node crashes or is restarted during the search, the asynchronous search will stop. To know whether the search is still running, the response contains a field called `is_running` that indicates if the task is up or not. It is the responsibility of the user to resume an asynchronous search that didn't reach a final response by re-submitting the query. However, final responses and failures are persisted in a system index, which allows a response to be retrieved even if the task finishes.

````
DELETE _async_search/9N3J1m4BgyzUDzqgC15b
````

The response is also not stored if the initial submit action returns a final response. This avoids adding any overhead to queries that complete within the initial `wait_for_completion`.

The `.async-search` index is a restricted index (should be migrated to a system index in +8.0) that is accessible only through the async search APIs. These APIs also ensure that only the user that submitted the initial query can retrieve or delete the running search. Note that admins/superusers would still be able to cancel the search task through the task manager like any other task.

Relates #49091

Co-authored-by: Luca Cavanna <javanna@users.noreply.github.com>
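The submit-then-poll flow from the commit message above could be driven from a client roughly as follows. This is only a sketch: `fetch` is a hypothetical callable standing in for the get `_async_search` call (it takes the search id and returns the decoded JSON body), and `max_polls` and `pause_seconds` are illustrative parameters, not part of the API.

```python
import time


def poll_async_search(fetch, search_id, max_polls=100, pause_seconds=0.0):
    """Poll until is_running is false; return (final_body, versions_seen).

    fetch is a hypothetical callable standing in for
    GET _async_search/<id>?wait_for_completion=...; it takes the id and
    returns the decoded JSON body as a dict.
    """
    versions = []
    for _ in range(max_polls):
        body = fetch(search_id)
        # An unchanged version means the search made no progress since the
        # previous call, so only record versions that are new.
        if not versions or body["version"] != versions[-1]:
            versions.append(body["version"])
        if not body["is_running"]:
            return body, versions
        time.sleep(pause_seconds)
    raise TimeoutError(f"search {search_id} still running after {max_polls} polls")
```

The loop relies only on the `version` and `is_running` fields shown in the example responses; a real client would also pass `wait_for_completion` on each call so that the server does most of the waiting.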
Could you please provide an example of how to update the keep alive time of a currently running async search? I tried this without success. Also, the final GET returns all of the data in a single response, without pagination. Is that the way this should always be?
The response returns the initial keep alive instead of the updated one; I opened #55435 to fix the bug.
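For reference, the PR description updates the keep alive by passing `keep_alive` on the get call. A small sketch of building that request path follows; the helper name `async_search_get_path` is hypothetical, and only the `wait_for_completion` and `keep_alive` query parameters come from the description.

```python
from urllib.parse import quote, urlencode


def async_search_get_path(search_id, wait_for_completion=None, keep_alive=None):
    """Build the path and query string for a get-async-search call."""
    params = {}
    if wait_for_completion is not None:
        params["wait_for_completion"] = wait_for_completion
    if keep_alive is not None:
        # Passing keep_alive on a get call is how the description extends
        # the time to live of an existing search.
        params["keep_alive"] = keep_alive
    path = "/_async_search/" + quote(search_id, safe="")
    return path + ("?" + urlencode(params) if params else "")
```

Calling it with the id from the description, `"100ms"` and `"10d"` yields the same query string as the documented example, minus the trailing slash after the id.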
What do you mean by pagination? It should return the same final response as a normal
High level view

This change introduces a new API in x-pack basic that allows tracking the progress of a search. Users can submit an asynchronous search through a new endpoint called `_async_search` that works exactly the same as the `_search` endpoint, but instead of blocking and returning the final response when available, it returns a response after a provided `wait_for_completion` time.

If after 100ms the final response is not available, a `partial_response` is included in the body:

The partial response contains the total number of requested shards, the number of shards that successfully returned and the number of shards that failed. It also contains the total hits as well as partial aggregations computed from the successful shards.

To continue to monitor the progress of the search, users can call the get `_async_search` API like the following:

That returns a new response that can contain the same partial response as the previous call if the search didn't progress; in that case the returned `version` should be the same. If new partial results are available, the version is incremented and the `partial_response` contains the updated progress. Finally, if the response is fully available while or after waiting for completion, the `partial_response` is replaced by a `response` section that contains the usual _search response:

Persistency

Asynchronous searches are stored in a restricted index called `.async-search` if they survive (still running) after the initial submit. Each request has a keep alive that defaults to 5 days, but this value can be changed/updated at any time:

The default can be changed when submitting the search; the example above raises the default value for the search to `10d`.

The time to live for a specific search can be extended when getting the progress/result. In the example above we extend the keep alive to 10 more days.

A background service that runs only on the node that holds the first primary shard of the `.async-search` index is responsible for deleting the expired results. It runs every hour, but the expiration is also checked by running queries (if they take longer than the keep_alive) and when getting a result.

Like a normal `_search`, if the http channel that is used to submit a request is closed before getting a response, the search is automatically cancelled. Note that this behavior applies only to the submit API; subsequent GET requests will not cancel the search if they are closed.

Resiliency

Asynchronous searches are not persistent: if the coordinator node crashes or is restarted during the search, the asynchronous search will stop. To know whether the search is still running, the response contains a field called `is_running` that indicates if the task is up or not. It is the responsibility of the user to resume an asynchronous search that didn't reach a final response by re-submitting the query. However, final responses and failures are persisted in a system index, which allows a response to be retrieved even if the task finishes.

The response is also not stored if the initial submit action returns a final response. This avoids adding any overhead to queries that complete within the initial `wait_for_completion`.

Security

The `.async-search` index is a restricted index (should be migrated to a system index in +8.0) that is accessible only through the async search APIs. These APIs also ensure that only the user that submitted the initial query can retrieve or delete the running search. Note that admins/superusers would still be able to cancel the search task through the task manager like any other task.

Relates #49091
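The keep-alive and ownership rules described above can be modelled with a small in-memory sketch. Everything here is an assumption made for illustration: `AsyncSearchStore` is a hypothetical stand-in for the restricted index, timestamps are plain seconds, and treating a request from a different user like a missing id is a choice made for this sketch, not something the description specifies.

```python
FIVE_DAYS = 5 * 24 * 60 * 60  # default keep alive from the description, in seconds


class AsyncSearchStore:
    """Hypothetical in-memory stand-in for the restricted .async-search index."""

    def __init__(self):
        self._docs = {}

    def put(self, search_id, user, response, now, keep_alive=FIVE_DAYS):
        self._docs[search_id] = {
            "user": user,
            "response": response,
            "expires_at": now + keep_alive,
        }

    def get(self, search_id, user, now, keep_alive=None):
        doc = self._docs.get(search_id)
        # Expiration is checked on read. Assumption for this sketch: a
        # different user is treated exactly like a missing id.
        if doc is None or doc["expires_at"] <= now or doc["user"] != user:
            raise KeyError(search_id)
        if keep_alive is not None:
            doc["expires_at"] = now + keep_alive  # extend the time to live
        return doc["response"]

    def delete_expired(self, now):
        """Roughly what the hourly background service would do."""
        expired = [i for i, d in self._docs.items() if d["expires_at"] <= now]
        for i in expired:
            del self._docs[i]
        return expired
```

A stored search survives cleanup until its keep alive elapses, and a get call with a new `keep_alive` moves the expiration forward from the time of that call.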