Cancel query if given detector already have one #54

zhanghg08 · 2020-03-05T23:37:39Z

Issue #55, if available:
Cancel running query if given detector already has one running.

Description of changes:

./gradlew build
Started my own cluster and tested with a query delayed by 20 seconds to simulate a long-running query. If a detector is already running a query, a second query request cancels the previous one and then starts itself.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

kaituo · 2020-03-10T04:12:56Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/util/ClientUtil.java

+        TaskInfo matchedTask = null;
+        for (TaskInfo task : tasks) {
+            if (!task.getHeaders().isEmpty() && task.getHeaders().get(Task.X_OPAQUE_ID) != null) {
+                if (task.getHeaders().get(Task.X_OPAQUE_ID).contains(detectorId)) {


Would
"if (task.getHeaders().get(Task.X_OPAQUE_ID).equals(CommonName.ANOMALY_DETECTOR + ":" + detectorId))"
improve performance, since you might need to do this comparison many times if there are a lot of tasks?

equals is O(n), while contains can be O(n*m), where m is the length of the string to match and n is the length of the string to search.

See the implementation of contains (depends on indexOf):

http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/lang/String.java#l1740

Thanks for the suggestion, will change it.
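
To illustrate the suggestion above, here is a minimal sketch of an exact-match lookup. It uses plain maps as a hypothetical stand-in for the headers of each TaskInfo; the class name, the method name, and the value of PREFIX are assumptions made for illustration (the PR itself uses Task.X_OPAQUE_ID and CommonName.ANOMALY_DETECTOR).

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper; the real code iterates over Elasticsearch TaskInfo objects.
final class OpaqueIdMatcher {
    // Assumed values standing in for Task.X_OPAQUE_ID and CommonName.ANOMALY_DETECTOR.
    static final String X_OPAQUE_ID = "X-Opaque-Id";
    static final String PREFIX = "anomaly_detector";

    // Returns the first header map whose opaque id equals "<prefix>:<detectorId>" exactly.
    static Optional<Map<String, String>> findTask(List<Map<String, String>> taskHeaders, String detectorId) {
        String expected = PREFIX + ":" + detectorId; // built once, outside the loop
        for (Map<String, String> headers : taskHeaders) {
            // Calling equals on the non-null expected value also covers a missing header,
            // because get() then returns null and equals(null) is false.
            if (expected.equals(headers.get(X_OPAQUE_ID))) {
                return Optional.of(headers);
            }
        }
        return Optional.empty();
    }
}
```

Besides the cost argument, an exact comparison avoids matching a detector whose id is merely a substring of another detector's id, which contains would accept.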

kaituo · 2020-03-10T04:38:34Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/util/ClientUtil.java

+            String detectorId = detector.getDetectorId();
+            if (!throttler.insertFilteredQuery(detectorId, request)) {
+                LOG.info("There is one query running for detectorId: {}. Trying to cancel the long running query", detectorId);
+                cancelRunningQuery(client, detectorId, LOG);


Should we return after cancelling, since we don't know when the cancel will actually take effect? Otherwise we might keep piling up new queries while the old ones are still not cancelled.

Also, we need to use InternalFailure, not EndRunException. EndRunException is for scenarios where we might need to terminate the running AD job soon.

Agreed. Returning after cancelling is safer. Will update.
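
The control flow agreed on above could be sketched roughly as follows. This is a simplified, hypothetical stand-in for the Throttler and ClientUtil interaction, not the PR's code: SingleQueryGate, submit, and clear are invented names, and requests are plain strings.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical gate that allows at most one in-flight query per detector.
final class SingleQueryGate {
    private final Map<String, String> running = new ConcurrentHashMap<>();

    // Returns true if the request was sent, false if a running query was found
    // and only a cancel was issued.
    boolean submit(String detectorId, String request, Consumer<String> send, Consumer<String> cancel) {
        // putIfAbsent returns null only when no query was registered for this detector.
        if (running.putIfAbsent(detectorId, request) != null) {
            cancel.accept(detectorId); // cancellation is asynchronous; completion time is unknown
            return false;              // so do not pile a new query on top of the old one
        }
        send.accept(request);
        return true;
    }

    // Called when a query finishes or its cancellation is confirmed.
    void clear(String detectorId) {
        running.remove(detectorId);
    }
}
```

The caller only gets to send again after clear has been called for that detector, which mirrors clearing the throttler entry once the cancel succeeds.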

kaituo · 2020-03-10T04:53:39Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/util/ClientUtil.java

+     * @param LOG Logger
+     */
+    private void cancelRunningQuery(Client client, String detectorId, Logger LOG) {
+        ListTasksRequest listTasksRequest = new ListTasksRequest();


You can add some parameters to speed up task search:

group_by=parents: so you only need to check the header once per group

actions=*search: since our queries are searches. We don't care about write or update.

Thanks for the advice. group_by=parents is a little odd: the REST API supports it (https://www.elastic.co/guide/en/elasticsearch/reference/current/tasks.html#_task_grouping), but as far as I can see the Java API does not.

For actions=*search, I will add it as "actions=search", since I can see that some child queries have actions like "indices:data/read/search[phase/query]"

You meant we need to use "actions=*search*", right? Yes, please do that. Your current code uses "*search".
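
The difference between the two filters can be checked with a small wildcard matcher. This is a minimal sketch that supports only '*'; it is a simplified stand-in written for illustration, not Elasticsearch's own matching code, and ActionFilter is a hypothetical name.

```java
// Hypothetical glob matcher used only to compare the two filter patterns.
final class ActionFilter {
    // Returns true if the whole action name matches the pattern, where '*'
    // stands for any run of characters (including none).
    static boolean matches(String pattern, String action) {
        return matches(pattern, 0, action, 0);
    }

    private static boolean matches(String p, int pi, String s, int si) {
        if (pi == p.length()) {
            return si == s.length(); // pattern exhausted: input must be exhausted too
        }
        if (p.charAt(pi) == '*') {
            // Let '*' consume zero or more characters of the input.
            for (int k = si; k <= s.length(); k++) {
                if (matches(p, pi + 1, s, k)) {
                    return true;
                }
            }
            return false;
        }
        return si < s.length()
            && p.charAt(pi) == s.charAt(si)
            && matches(p, pi + 1, s, si + 1);
    }
}
```

With this matcher, "*search" accepts "indices:data/read/search" but rejects the child action "indices:data/read/search[phase/query]", while "*search*" accepts both.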

kaituo · 2020-03-10T05:17:57Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/AnomalyDetectorPlugin.java

@@ -199,7 +199,7 @@ private static Void initGson() {
        Settings settings = environment.settings();


Is there a unit/integration test for the cancel mechanism? If not, I strongly suggest we add one.

You can add an ESIntegTestCase where we

1) define a SearchOperationListener such that index operations are delayed to simulate long running queries;

2) create a fake plugin to use the listener defined in 1)

3) add the AD plugin and the fake plugin together

4) ... automate what you did in manual testing ..

This is also one of my concerns. I noticed we don't have any unit tests for ClientUtil and tried to add one, but it's too complicated. During manual testing, I used a similar listener that delays the search to simulate a long-running query. I'm not sure whether that can be done in an integration test; I will sync up with you offline.

ylwu-amzn · 2020-03-10T18:48:22Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/util/ClientUtil.java

@@ -162,11 +180,11 @@ public ClientUtil(Settings setting, Client client, Throttler throttler) {
     * Send a nonblocking request with a timeout and return response. The request will first be put into
     * the negative cache. Once the request complete, it will be removed from the negative cache.


Please add more description of the cancel request process.

ylwu-amzn · 2020-03-10T18:50:39Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/util/ClientUtil.java

                throw e;
            }

            if (!latch.await(requestTimeout.getSeconds(), TimeUnit.SECONDS)) {
+


Remove empty line

ylwu-amzn · 2020-03-10T18:57:19Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/util/ClientUtil.java

+            return;
+        }
+        // case 2: we can find the task for given detectorId
+        TaskId parentTaskId = matchedTask.getParentTaskId().isSet() ? matchedTask.getParentTaskId() : matchedTask.getTaskId();


Is it possible that the parent task has a parent too? If yes, should we find the root task and kill all of them?

For our search query, there is only a two-level parent-child relationship.

ylwu-amzn · 2020-03-10T19:00:52Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/util/ClientUtil.java

+        List<ElasticsearchException> nodeFailures = cancelTasksResponse.getNodeFailures();
+        List<TaskOperationFailure> taskFailures = cancelTasksResponse.getTaskFailures();
+        if (nodeFailures.isEmpty() && taskFailures.isEmpty()) {
+            LOG.info("Cancelling query for detectorId: {} succeeds. Clear entry from Throttler", detectorId);


It would be better to add a retry for these failed tasks. Otherwise, we have to wait for the next detector run to cancel again.

Will add a todo comment for now.

ylwu-amzn · 2020-03-10T19:20:40Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/util/ClientUtil.java

-            try {
+            try (ThreadContext.StoredContext context = threadPool.getThreadContext().stashContext()) {
+                assert context != null;
+                threadPool.getThreadContext().putHeader(Task.X_OPAQUE_ID, CommonName.ANOMALY_DETECTOR + ":" + detectorId);
                consumer.accept(request, new LatchedActionListener<Response>(ActionListener.wrap(response -> {


It's possible that cancelRunningQuery is still in progress, or has failed, when a new request starts. If cancelRunningQuery is not time-consuming, it's better to start the new request after we get the response from cancelRunningQuery. If it's a heavy action, we may need to monitor the cancellation status and retry on failure, so that we can terminate unnecessary AD queries and protect cluster performance. It's OK to add some TODOs and comments here and refactor later if you think the change will be big.

I think this is similar to Kaituo's comment. To be safe, I will not start a new request if we need to cancel the running one, in case the cancel fails somehow and we keep adding new requests. We can revisit this later.

ylwu-amzn · 2020-03-10T19:25:14Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/util/ClientUtil.java

-            try {
+            try (ThreadContext.StoredContext context = threadPool.getThreadContext().stashContext()) {
+                assert context != null;
+                threadPool.getThreadContext().putHeader(Task.X_OPAQUE_ID, CommonName.ANOMALY_DETECTOR + ":" + detectorId);


Just asking, will the X_OPAQUE_ID header be passed to child tasks?

Yes, it will. In the manual test I can see that both parent and child tasks have the same header.

ylwu-amzn · 2020-03-10T19:31:38Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/util/ClientUtil.java

+        // case 2: we can find the task for given detectorId
+        TaskId parentTaskId = matchedTask.getParentTaskId().isSet() ? matchedTask.getParentTaskId() : matchedTask.getTaskId();
+        CancelTasksRequest cancelTaskRequest = new CancelTasksRequest();
+        cancelTaskRequest.setParentTaskId(parentTaskId);


From line 292: if matchedTask.getParentTaskId().isSet() is false, parentTaskId becomes matchedTask.getTaskId(). In that case, cancelTaskRequest.setParentTaskId(parentTaskId) will cancel tasks whose parent task id is matchedTask.getTaskId(). Is it possible that matchedTask has no child tasks? If so, will cancelTaskRequest.setParentTaskId(parentTaskId) throw an exception, or will it cancel matchedTask?

I got your point. To avoid this corner case, I will go through the entire task list (previously it terminated early once a match was found). If there is only one task (no parent), we need to call setTaskId; otherwise, setParentTaskId.
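
One way to read the fix described above is the following sketch. It scans the whole task list and decides whether to cancel by parent id or by the task's own id. TaskRef and CancelTarget are hypothetical stand-ins for TaskInfo and for the setParentTaskId/setTaskId choice on CancelTasksRequest; a null parentId is assumed to mean "no parent set".

```java
import java.util.List;
import java.util.Optional;

// Hypothetical selector for the cancel target of a detector's query.
final class CancelTargetSelector {
    // Stand-in for TaskInfo; parentId is null when the task has no parent.
    static final class TaskRef {
        final String id;
        final String parentId;
        final String opaqueId;

        TaskRef(String id, String parentId, String opaqueId) {
            this.id = id;
            this.parentId = parentId;
            this.opaqueId = opaqueId;
        }
    }

    // byParent == true corresponds to setParentTaskId, false to setTaskId.
    static final class CancelTarget {
        final String taskId;
        final boolean byParent;

        CancelTarget(String taskId, boolean byParent) {
            this.taskId = taskId;
            this.byParent = byParent;
        }
    }

    static Optional<CancelTarget> select(List<TaskRef> tasks, String expectedOpaqueId) {
        TaskRef lone = null;
        for (TaskRef task : tasks) {
            if (!expectedOpaqueId.equals(task.opaqueId)) {
                continue; // belongs to another detector or has no header
            }
            if (task.parentId != null) {
                // A matching child exists, so cancel everything under its parent.
                return Optional.of(new CancelTarget(task.parentId, true));
            }
            if (lone == null) {
                lone = task; // remember a parentless match, but keep scanning for children
            }
        }
        // No matching child was found: cancel the single task directly, if there is one.
        return lone == null ? Optional.empty() : Optional.of(new CancelTarget(lone.id, false));
    }
}
```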

ylwu-amzn · 2020-03-10T19:37:59Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/util/ClientUtil.java

+            throttler.clearFilteredQuery(detectorId);
+            return;
+        }
+        throw new InternalFailure(detectorId, "Failed to cancel current tasks due to node or task failures");


log failures?

1. Adding description to throttledTimedRequest
2. Don't send new request if there is one running query. We only cancel the running one.
3. Adding logic to deal with single task cancelling (no parent task)
4. Adding log info/error and removing extra space.

ylwu-amzn

LGTM. Thanks for the change!

zhanghg08 added 8 commits February 18, 2020 14:03

Add daily cron job to clean negative cache

3375396

Adding missing java doc and simplify code.

87badb1

Create CancelQueryUtil to handle query canceling logic

4c18971

Add cron job test cases

08dfb54

Add additional clean

8c64199

Commit all changes

99e9487

Merge branch 'cron' into cancel_query

df38d00

Cancel long running query if a new request coming for given detector id

d797543

zhanghg08 requested review from wnbts and ylwu-amzn March 5, 2020 23:42

zhanghg08 marked this pull request as ready for review March 5, 2020 23:45

kaituo reviewed Mar 10, 2020

ylwu-amzn reviewed Mar 10, 2020

Address feedback:

8b9d699

kaituo approved these changes Mar 13, 2020

1. change listtask filter from "*search" to "*search*"

fd1883b

ylwu-amzn approved these changes Mar 13, 2020

Merge branch 'development' into cancel_query

0673741

zhanghg08 merged commit f40c532 into opendistro-for-elasticsearch:development Mar 13, 2020

kaituo mentioned this pull request Jul 16, 2020

Cancel query code path needs to be updated #189

Open
