-
Notifications
You must be signed in to change notification settings - Fork 25.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add timeout to ESQL #99526
Add timeout to ESQL #99526
Conversation
|
||
public class TransportEsqlQueryAction extends HandledTransportAction<EsqlQueryRequest, EsqlQueryResponse> { | ||
|
||
public static TimeValue DEFAULT_TIMEOUT = new TimeValue(4, TimeUnit.MINUTES); | ||
public static TimeValue MAX_TIMEOUT = new TimeValue(4, TimeUnit.MINUTES); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default and max values are supposed to be lower than other timeouts that can happen in the infrastructure, eg. the a http timeouts on cloud proxies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't have MAX_TIMEOUT since ES can be deployed to an environment without the proxy timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can move it to a Setting and disable it by default, so that we can enable it when we know we have a proxy timeout.
Pinging @elastic/es-ql (Team:QL) |
Hi @luigidellaquila, I've created a changelog YAML for you. |
Pinging @elastic/elasticsearch-esql (:Query Languages/ES|QL) |
this.pragmas = new QueryPragmas(in); | ||
this.resultTruncationMaxSize = in.readVInt(); | ||
this.timeout = in.readOptionalTimeValue(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it need a TransportVersion bump since we are in tech preview and serverless is not out yet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to bump the version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Serverless is "out" in the sense that it's running in a production environment and serialization errors during upgrades may trip assertions and block a future release. Please don't skip this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are disabling ES|QL in Serverless now to simplify the developement process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @luigidellaquila. I've left some comments.
Response response = client().performRequest(bulk); | ||
Assert.assertEquals("{\"errors\":false}", EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8)); | ||
|
||
// let's make it a bit expensive, so that the timeout will most likely trigger |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should create a test plugin to avoid the flakiness here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was considering to add a sleep(x)
function only in test, to simulate slow queries. Not sure it's what you mean, if you think there is a better way please let me know
|
||
public class TransportEsqlQueryAction extends HandledTransportAction<EsqlQueryRequest, EsqlQueryResponse> { | ||
|
||
public static TimeValue DEFAULT_TIMEOUT = new TimeValue(4, TimeUnit.MINUTES); | ||
public static TimeValue MAX_TIMEOUT = new TimeValue(4, TimeUnit.MINUTES); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't have MAX_TIMEOUT since ES can be deployed to an environment without the proxy timeout.
this.pragmas = new QueryPragmas(in); | ||
this.resultTruncationMaxSize = in.readVInt(); | ||
this.timeout = in.readOptionalTimeValue(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to bump the version.
private TimeValue timeout(EsqlQueryRequest request) { | ||
TimeValue rTimeout = request.timeout(); | ||
if (rTimeout == null) { | ||
return DEFAULT_TIMEOUT; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if we can have the default timeout at the request level here. Once we have the default, it will be very hard to remove as it's considered a breaking change. I think we need to discuss with the team. Let's remove it from this PR, then re-introduce it in a follow-up after the discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a default timeout could be a useful feature, but I'll move it to a Setting and I'll set it to -1
(ie. no timeout) by default.
Let's discuss it with the team though, maybe there are better options.
if (configuration.timeout() != null) { | ||
return ListenerTimeouts.wrapWithTimeout(pool, configuration.timeout(), esqlExecutor, l, (x) -> { | ||
String timeoutMessage = "ESQL query timed out after {}"; | ||
l.onFailure(new ElasticsearchTimeoutException(timeoutMessage, configuration.timeout())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also need to cancel the ESQL request when it's timed out.
@elasticmachine update branch |
Turns out that the implementation I did with |
@luigidellaquila The issue is that we need to call @Override
protected void doExecute(Task task, EsqlQueryRequest request, ActionListener<EsqlQueryResponse> listener) {
// workaround for https://github.com/elastic/elasticsearch/issues/97916 - TODO remove this when we can
final ActionListener<EsqlQueryResponse> wrappedListener;
final TimeValue timeout = timeout(request);
if (timeout != null) {
final ThreadPool threadPool = transportService.getThreadPool();
final var executor = threadPool.executor(EsqlPlugin.ESQL_THREAD_POOL_NAME);
wrappedListener = ListenerTimeouts.wrapWithTimeout(threadPool, timeout, executor, listener, l -> {
logger.debug("cancelling ESQL task {} on timeout", task);
final TaskManager taskManager = transportService.getTaskManager();
taskManager.cancelTaskAndDescendants((CancellableTask) task, "timeout", false, ActionListener.noop());
listener.onFailure(new ElasticsearchTimeoutException("ESQL query timed out after {}", timeout));
});
} else {
wrappedListener = listener;
}
requestExecutor.execute(ActionRunnable.wrap(wrappedListener, l -> doExecuteForked(task, request, l)));
} |
I've pushed 7b71306 |
wrappedListener = ListenerTimeouts.wrapWithTimeout(threadPool, timeout, executor, listener, l -> { | ||
logger.debug("cancelling ESQL task {} on timeout", task); | ||
final TaskManager taskManager = transportService.getTaskManager(); | ||
taskManager.cancelTaskAndDescendants((CancellableTask) task, "timeout", false, ActionListener.noop()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even on a timeout we should wait for the child tasks to complete before responding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree; we should wait for all outstanding tasks before responding. I deliberately chose not to wait because of the specific use case we are trying to address. Here, we want users to see the timeout error from Elasticsearch rather than the Cloud proxy. If we wait for outstanding tasks, it's possible that it might exceed the proxy's timeout before we can respond. However, thinking about this more, I believe we should wait and reduce the default timeout instead. Having a parent task completed before its child tasks isn't good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW I still think we should not try and undercut the proxy timeout by default. Our response is just an exception saying ESQL query timed out after {}
, this is not practically any more useful than the proxy timeout message. Put differently, I expect most users who see this response will just increase the timeout in the request until they start to get the proxy timeout message instead. And then there's a good chance they'll open a support case complaining that they can't control the proxy timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to what you said, @DaveCTurner. I would like to hear from @nik9000 and @costin. If we all agree, then we should not introduce the timeout parameter until we really need it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see some value in a timeout parameter. Although generally it's be better to time out on the client side since server-side timeouts will always be a little unreliable and/or late, I think there are some clients that make it harder than they should to do this so a server-side parameter is a nice feature. And server-side timeouts do save the cost of tearing down the connection and opening a new one at least. I just don't think we should have a server-side timeout by default.
logger.debug("cancelling ESQL task {} on timeout", task); | ||
final TaskManager taskManager = transportService.getTaskManager(); | ||
taskManager.cancelTaskAndDescendants((CancellableTask) task, "timeout", false, ActionListener.noop()); | ||
listener.onFailure(new ElasticsearchTimeoutException("ESQL query timed out after {}", timeout)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should complete the supplied listener l
rather than the outer listener
, otherwise listener
may be completed twice (and also to avoid unnecessarily capturing listener
in this lambda)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh wait sorry, this API is very confusing. What you've written is right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we even pass in a listener to the onTimeout
callback? This listener is a no-op in all cases isn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think the API is confusing. I will look into simplifying it.
Thanks for fixing it @dnhatn! I still see some random failures in the timeout tests:
maybe there is something missing in the cancelation? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the approach of this PR is to fail requests when hitting the timeout, I'm curious if we should consider returning a partial response instead.
@@ -166,6 +179,7 @@ private static ObjectParser<EsqlQueryRequest, Void> objectParser(Supplier<EsqlQu | |||
); | |||
parser.declareField(EsqlQueryRequest::params, EsqlQueryRequest::parseParams, PARAMS_FIELD, VALUE_ARRAY); | |||
parser.declareString((request, localeTag) -> request.locale(Locale.forLanguageTag(localeTag)), LOCALE_FIELD); | |||
parser.declareInt((request, value) -> request.timeout(new TimeValue(value, TimeUnit.MILLISECONDS)), TIMEOUT_FIELD); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we take the timeout as a time value directly like the _search
API rather than an int?
@jpountz we considered that, but at this stage it's complicated. In the future we will have pagination and partial results, but it will require some additional work |
@elasticmachine update branch |
expected head sha didn’t match current head ref. |
GitHub is having a bad day today, it doesn't see latest commits. Trying to switch branches to see if it recovers |
@elasticmachine update branch |
@elasticmachine update branch |
Hi @luigidellaquila, I've updated the changelog YAML for you. |
Folks, thanks for the feedback.
As such, introducing yet another timeout on the ESQL endpoint doesn't add any value rather more confusion. |
Adding a timeout to ESQL requests.
By default a query will have no timeout, but it can be overridden adding a
timeout
parameter to the payloador configuring ES with
esql.query.default_timeout: <value>
. A value of-1
means no timeout.The implementation also defines a maximum timeout for queries, disabled by default, that can be configured setting
esql.query.max_timeout
.If the value passed in the payload is > max timeout or < 0, then the max timeout will be used.
Fixes #99141