Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add timeout to ESQL #99526

Closed
wants to merge 19 commits into from
Closed

Conversation

luigidellaquila
Copy link
Contributor

@luigidellaquila luigidellaquila commented Sep 13, 2023

Adding a timeout to ESQL requests.

By default a query will have no timeout, but it can be overridden adding a timeout parameter to the payload

{
   "timeout": "500ms",
   "query": "from ..."
}

or configuring ES with esql.query.default_timeout: <value>. A value of -1 means no timeout.

The implementation also defines a maximum timeout for queries, disabled by default, that can be configured setting esql.query.max_timeout.
If the value passed in the payload is > max timeout or < 0, then the max timeout will be used.

Fixes #99141


public class TransportEsqlQueryAction extends HandledTransportAction<EsqlQueryRequest, EsqlQueryResponse> {

public static TimeValue DEFAULT_TIMEOUT = new TimeValue(4, TimeUnit.MINUTES);
public static TimeValue MAX_TIMEOUT = new TimeValue(4, TimeUnit.MINUTES);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default and max values are supposed to be lower than other timeouts that can happen in the infrastructure, eg. the a http timeouts on cloud proxies.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't have MAX_TIMEOUT since ES can be deployed to an environment without the proxy timeout.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can move it to a Setting and disable it by default, so that we can enable it when we know we have a proxy timeout.

@luigidellaquila luigidellaquila marked this pull request as ready for review September 13, 2023 12:41
@elasticsearchmachine elasticsearchmachine added the Team:QL (Deprecated) Meta label for query languages team label Sep 13, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-ql (Team:QL)

@elasticsearchmachine
Copy link
Collaborator

Hi @luigidellaquila, I've created a changelog YAML for you.

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/elasticsearch-esql (:Query Languages/ES|QL)

this.pragmas = new QueryPragmas(in);
this.resultTruncationMaxSize = in.readVInt();
this.timeout = in.readOptionalTimeValue();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it need a TransportVersion bump since we are in tech preview and serverless is not out yet?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to bump the version.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Serverless is "out" in the sense that it's running in a production environment and serialization errors during upgrades may trip assertions and block a future release. Please don't skip this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are disabling ES|QL in Serverless now to simplify the developement process.

Copy link
Member

@dnhatn dnhatn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @luigidellaquila. I've left some comments.

Response response = client().performRequest(bulk);
Assert.assertEquals("{\"errors\":false}", EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8));

// let's make it a bit expensive, so that the timeout will most likely trigger
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should create a test plugin to avoid the flakiness here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was considering to add a sleep(x) function only in test, to simulate slow queries. Not sure it's what you mean, if you think there is a better way please let me know


public class TransportEsqlQueryAction extends HandledTransportAction<EsqlQueryRequest, EsqlQueryResponse> {

public static TimeValue DEFAULT_TIMEOUT = new TimeValue(4, TimeUnit.MINUTES);
public static TimeValue MAX_TIMEOUT = new TimeValue(4, TimeUnit.MINUTES);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't have MAX_TIMEOUT since ES can be deployed to an environment without the proxy timeout.

this.pragmas = new QueryPragmas(in);
this.resultTruncationMaxSize = in.readVInt();
this.timeout = in.readOptionalTimeValue();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to bump the version.

private TimeValue timeout(EsqlQueryRequest request) {
TimeValue rTimeout = request.timeout();
if (rTimeout == null) {
return DEFAULT_TIMEOUT;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if we can have the default timeout at the request level here. Once we have the default, it will be very hard to remove as it's considered a breaking change. I think we need to discuss with the team. Let's remove it from this PR, then re-introduce it in a follow-up after the discussion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a default timeout could be a useful feature, but I'll move it to a Setting and I'll set it to -1 (ie. no timeout) by default.
Let's discuss it with the team though, maybe there are better options.

if (configuration.timeout() != null) {
return ListenerTimeouts.wrapWithTimeout(pool, configuration.timeout(), esqlExecutor, l, (x) -> {
String timeoutMessage = "ESQL query timed out after {}";
l.onFailure(new ElasticsearchTimeoutException(timeoutMessage, configuration.timeout()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need to cancel the ESQL request when it's timed out.

@luigidellaquila
Copy link
Contributor Author

@elasticmachine update branch

@luigidellaquila
Copy link
Contributor Author

Turns out that the implementation I did with ListenerTimeouts is incorrect (the tests fail).
@dnhatn let's discuss it off-line, I'm afraid I need some help to better understand where to plug the timeout. I have the feeling (and a TODO in the code someway confirms it) that there are some missing pieces to be able to implement proper cancelation.

@dnhatn
Copy link
Member

dnhatn commented Sep 18, 2023

@luigidellaquila The issue is that we need to call onFailure() on the original listener, not the wrapped listener, as the isDone flag is already set to true. Also, we need to wrap the timeout in TransportEsqlQueryAction to ensure that the request can time out even during the resolution. Below is the snippet that we can use in TransportEsqlQueryAction#execute.

@Override
protected void doExecute(Task task, EsqlQueryRequest request, ActionListener<EsqlQueryResponse> listener) {
    // workaround for https://github.com/elastic/elasticsearch/issues/97916 - TODO remove this when we can
    final ActionListener<EsqlQueryResponse> wrappedListener;
    final TimeValue timeout = timeout(request);
    if (timeout != null) {
        final ThreadPool threadPool = transportService.getThreadPool();
        final var executor = threadPool.executor(EsqlPlugin.ESQL_THREAD_POOL_NAME);
        wrappedListener = ListenerTimeouts.wrapWithTimeout(threadPool, timeout, executor, listener, l -> {
            logger.debug("cancelling ESQL task {} on timeout", task);
            final TaskManager taskManager = transportService.getTaskManager();
            taskManager.cancelTaskAndDescendants((CancellableTask) task, "timeout", false, ActionListener.noop());
            listener.onFailure(new ElasticsearchTimeoutException("ESQL query timed out after {}", timeout));
        });
    } else {
        wrappedListener = listener;
    }
    requestExecutor.execute(ActionRunnable.wrap(wrappedListener, l -> doExecuteForked(task, request, l)));
}

@dnhatn
Copy link
Member

dnhatn commented Sep 18, 2023

I've pushed 7b71306

@dnhatn dnhatn self-requested a review September 18, 2023 18:55
wrappedListener = ListenerTimeouts.wrapWithTimeout(threadPool, timeout, executor, listener, l -> {
logger.debug("cancelling ESQL task {} on timeout", task);
final TaskManager taskManager = transportService.getTaskManager();
taskManager.cancelTaskAndDescendants((CancellableTask) task, "timeout", false, ActionListener.noop());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even on a timeout we should wait for the child tasks to complete before responding.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree; we should wait for all outstanding tasks before responding. I deliberately chose not to wait because of the specific use case we are trying to address. Here, we want users to see the timeout error from Elasticsearch rather than the Cloud proxy. If we wait for outstanding tasks, it's possible that it might exceed the proxy's timeout before we can respond. However, thinking about this more, I believe we should wait and reduce the default timeout instead. Having a parent task completed before its child tasks isn't good.

Copy link
Contributor

@DaveCTurner DaveCTurner Sep 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I still think we should not try and undercut the proxy timeout by default. Our response is just an exception saying ESQL query timed out after {}, this is not practically any more useful than the proxy timeout message. Put differently, I expect most users who see this response will just increase the timeout in the request until they start to get the proxy timeout message instead. And then there's a good chance they'll open a support case complaining that they can't control the proxy timeout.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to what you said, @DaveCTurner. I would like to hear from @nik9000 and @costin. If we all agree, then we should not introduce the timeout parameter until we really need it.

Copy link
Contributor

@DaveCTurner DaveCTurner Sep 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see some value in a timeout parameter. Although generally it's be better to time out on the client side since server-side timeouts will always be a little unreliable and/or late, I think there are some clients that make it harder than they should to do this so a server-side parameter is a nice feature. And server-side timeouts do save the cost of tearing down the connection and opening a new one at least. I just don't think we should have a server-side timeout by default.

logger.debug("cancelling ESQL task {} on timeout", task);
final TaskManager taskManager = transportService.getTaskManager();
taskManager.cancelTaskAndDescendants((CancellableTask) task, "timeout", false, ActionListener.noop());
listener.onFailure(new ElasticsearchTimeoutException("ESQL query timed out after {}", timeout));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should complete the supplied listener l rather than the outer listener, otherwise listener may be completed twice (and also to avoid unnecessarily capturing listener in this lambda)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wait sorry, this API is very confusing. What you've written is right.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we even pass in a listener to the onTimeout callback? This listener is a no-op in all cases isn't it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think the API is confusing. I will look into simplifying it.

@luigidellaquila
Copy link
Contributor Author

Thanks for fixing it @dnhatn!

I still see some random failures in the timeout tests:

java.lang.AssertionError: Leftover exchanges ExchangeService{sinks=[AMxhb_ukRK2ZVb_mqZOq_w:38], sources=[]} on node node_s3	
at __randomizedtesting.SeedInfo.seed([20BE10A7D69C7EA3:74AC932D7B1A6D7A]:0)	
at org.junit.Assert.fail(Assert.java:88)	
at org.junit.Assert.assertTrue(Assert.java:41)	
at org.elasticsearch.xpack.esql.action.AbstractEsqlIntegTestCase.lambda$ensureExchangesAreReleased$0(AbstractEsqlIntegTestCase.java:38)	

maybe there is something missing in the cancelation?

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the approach of this PR is to fail requests when hitting the timeout, I'm curious if we should consider returning a partial response instead.

@@ -166,6 +179,7 @@ private static ObjectParser<EsqlQueryRequest, Void> objectParser(Supplier<EsqlQu
);
parser.declareField(EsqlQueryRequest::params, EsqlQueryRequest::parseParams, PARAMS_FIELD, VALUE_ARRAY);
parser.declareString((request, localeTag) -> request.locale(Locale.forLanguageTag(localeTag)), LOCALE_FIELD);
parser.declareInt((request, value) -> request.timeout(new TimeValue(value, TimeUnit.MILLISECONDS)), TIMEOUT_FIELD);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we take the timeout as a time value directly like the _search API rather than an int?

@luigidellaquila
Copy link
Contributor Author

I'm curious if we should consider returning a partial response instead.

@jpountz we considered that, but at this stage it's complicated. In the future we will have pagination and partial results, but it will require some additional work

@luigidellaquila
Copy link
Contributor Author

@elasticmachine update branch

@elasticmachine
Copy link
Collaborator

expected head sha didn’t match current head ref.

@luigidellaquila
Copy link
Contributor Author

expected head sha didn’t match current head ref.

GitHub is having a bad day today, it doesn't see latest commits. Trying to switch branches to see if it recovers

@luigidellaquila luigidellaquila changed the base branch from main to feature/esql September 19, 2023 13:22
@luigidellaquila luigidellaquila requested a review from a team as a code owner September 19, 2023 13:22
@luigidellaquila luigidellaquila changed the base branch from feature/esql to main September 19, 2023 13:23
@luigidellaquila luigidellaquila changed the base branch from main to feature/esql September 19, 2023 13:23
@luigidellaquila luigidellaquila removed the request for review from a team September 19, 2023 13:23
@luigidellaquila
Copy link
Contributor Author

@elasticmachine update branch

@luigidellaquila luigidellaquila changed the base branch from feature/esql to main September 19, 2023 13:25
@luigidellaquila
Copy link
Contributor Author

@elasticmachine update branch

@costin costin mentioned this pull request Sep 19, 2023
32 tasks
@elasticsearchmachine
Copy link
Collaborator

Hi @luigidellaquila, I've updated the changelog YAML for you.

@mattc58 mattc58 added v8.12.0 and removed v8.11.0 labels Oct 4, 2023
@costin
Copy link
Member

costin commented Oct 10, 2023

Folks, thanks for the feedback.
Luigi has been looking into this issue for a while and discovered multiple timeout mechanisms in place:

  • at the REST layer in ES
  • in the HTTP(s) library used by the client (Kibana)
  • at the socket level used by the client infra (e.g. nodeJS)
  • on any proxies, transparent or not, between the client and ES

As such, introducing yet another timeout on the ESQL endpoint doesn't add any value rather more confusion.
Hence why, it was agreed to close this PR without merging - potential improvements around the cancellation infrastructure will be handled in separate PRs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL >enhancement Team:QL (Deprecated) Meta label for query languages team v8.12.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ESQL: default request timeout
8 participants