-
Notifications
You must be signed in to change notification settings - Fork 36
Cancel query if given detector already have one #54
Cancel query if given detector already have one #54
Conversation
TaskInfo matchedTask = null; | ||
for (TaskInfo task : tasks) { | ||
if (!task.getHeaders().isEmpty() && task.getHeaders().get(Task.X_OPAQUE_ID) != null) { | ||
if (task.getHeaders().get(Task.X_OPAQUE_ID).contains(detectorId)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would
"if (task.getHeaders().get(Task.X_OPAQUE_ID).equals(CommonName.ANOMALY_DETECTOR + ":" + detectorId))"
improves performance since you might need do this comparison a lot of times if there are a lot of tasks?
equals is O(n), while contains can be O(n*m) where m is the string to match and n is the string to search.
See the implementation of contains (depends on indexOf):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion, will change it.
String detectorId = detector.getDetectorId(); | ||
if (!throttler.insertFilteredQuery(detectorId, request)) { | ||
LOG.info("There is one query running for detectorId: {}. Trying to cancel the long running query", detectorId); | ||
cancelRunningQuery(client, detectorId, LOG); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Return after cancelling since we don't know when the cancel would actually happen? We might keep piling up new queries when the previous old queries are not cancelled.
Also, we need to send InternalFailure not EndRunException. EndRunException is used for scenarios when we might need to terminate AD job running soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. Return after cancelling will be more safer. Will update.
* @param LOG Logger | ||
*/ | ||
private void cancelRunningQuery(Client client, String detectorId, Logger LOG) { | ||
ListTasksRequest listTasksRequest = new ListTasksRequest(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can add some parameters to speed up task search:
- group_by=parents: so each group you only need to check header once
- actions=*search: since our queries are searches. We don't care about write or update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the advice. For the group_by=parents, it's a little weird. For the api it supports this feature(https://www.elastic.co/guide/en/elasticsearch/reference/current/tasks.html#_task_grouping). However when it comes to java api, it's not supported as far as I see.
For actions=*search, I will added it as "actions=search" since I can see some child query has something like "indices:data/read/search[phase/query]"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you meant we need to use "actions=*search*", right? Yes, please do that. Your current code uses "*search".
@@ -199,7 +199,7 @@ private static Void initGson() { | |||
Settings settings = environment.settings(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a unit/integration test for the cancel mechanism? If not, I strongly suggest we add one.
You can add an ESIntegTestCase where we
- define a SearchOperationListener such that index operations are delayed to simulate long running queries;
- create a fake plugin to use listener defined in 1)
- add AD plugin and the fake plugin together
- ... automate what you did on manual testing ..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also one of my concern. I noticed we don't have any unit test for clientUtil and tried to add one but it's too complicated. When manual testing, I use similar listener which will delay the search to make up the long running query. Not sure if that can be done in integration test, I will sync up with you offline.
@@ -162,11 +180,11 @@ public ClientUtil(Settings setting, Client client, Throttler throttler) { | |||
* Send a nonblocking request with a timeout and return response. The request will first be put into | |||
* the negative cache. Once the request complete, it will be removed from the negative cache. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add more description about cancel request process
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added.
throw e; | ||
} | ||
|
||
if (!latch.await(requestTimeout.getSeconds(), TimeUnit.SECONDS)) { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove empty line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
return; | ||
} | ||
// case 2: we can find the task for given detectorId | ||
TaskId parentTaskId = matchedTask.getParentTaskId().isSet() ? matchedTask.getParentTaskId() : matchedTask.getTaskId(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible the parent task has parent too? If yes, should we find the root task and kill all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For our search query, there is only two-level parent-child relationship.
List<ElasticsearchException> nodeFailures = cancelTasksResponse.getNodeFailures(); | ||
List<TaskOperationFailure> taskFailures = cancelTasksResponse.getTaskFailures(); | ||
if (nodeFailures.isEmpty() && taskFailures.isEmpty()) { | ||
LOG.info("Cancelling query for detectorId: {} succeeds. Clear entry from Throttler", detectorId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to add some retry for these failed tasks. Otherwise, will wait for next detector run to cancel again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will add a todo comment for now.
try { | ||
try (ThreadContext.StoredContext context = threadPool.getThreadContext().stashContext()) { | ||
assert context != null; | ||
threadPool.getThreadContext().putHeader(Task.X_OPAQUE_ID, CommonName.ANOMALY_DETECTOR + ":" + detectorId); | ||
consumer.accept(request, new LatchedActionListener<Response>(ActionListener.wrap(response -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's possible the cancelRunningQuery
in progress or fail when start a new request. If cancelRunningQuery
is not time consuming, better to start a new request when we get respond of cancelRunningQuery
. If it's heavy action, may need to monitor the cancelation status and retry if failed; so we can terminate unnecessary AD query to protect cluster performance. It's ok to add some todo&comments here and refactor it later if you think the change will be big.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is similar with Kaituo's comments. For safety concern, I will not start new request if we need to cancel the running one, just in case the cancel failed somehow and we keep adding new requests. We can revisit it later.
try { | ||
try (ThreadContext.StoredContext context = threadPool.getThreadContext().stashContext()) { | ||
assert context != null; | ||
threadPool.getThreadContext().putHeader(Task.X_OPAQUE_ID, CommonName.ANOMALY_DETECTOR + ":" + detectorId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just asking, will the X_OPAQUE_ID
header be passed to child tasks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it will. I can see both parent and children tasks have the same header from the manual test.
// case 2: we can find the task for given detectorId | ||
TaskId parentTaskId = matchedTask.getParentTaskId().isSet() ? matchedTask.getParentTaskId() : matchedTask.getTaskId(); | ||
CancelTasksRequest cancelTaskRequest = new CancelTasksRequest(); | ||
cancelTaskRequest.setParentTaskId(parentTaskId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From line292, if matchedTask.getParentTaskId().isSet()
is false, will get matchedTask.getTaskId()
as parentTaskId
. For this case, cancelTaskRequest.setParentTaskId(parentTaskId)
will cancel tasks which has parent task id as matchedTask.getTaskId()
. Is it possible the matchedTask
has no child tasks? If it's possible, will cancelTaskRequest.setParentTaskId(parentTaskId)
throw exception or cancel matchedTask
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got your point. To avoid this corner case, I will go through the entire tasks list(previously it will early terminate once found matched). If there is only one task(no parent), we need to setTaskId, otherwise setParentTaskId
throttler.clearFilteredQuery(detectorId); | ||
return; | ||
} | ||
throw new InternalFailure(detectorId, "Failed to cancel current tasks due to node or task failures"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
log failures?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
1. Adding description to throttledTimedRequest 2. Don't send new request if there is one running query. We only cancel the running one. 3. Adding logic to deal with single task cancelling(no parent task) 4. Adding log info/error and removing extra space.
* @param LOG Logger | ||
*/ | ||
private void cancelRunningQuery(Client client, String detectorId, Logger LOG) { | ||
ListTasksRequest listTasksRequest = new ListTasksRequest(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you meant we need to use "actions=*search*", right? Yes, please do that. Your current code uses "*search".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for the change!
Issue #55, if available:
Cancel running query if given detector already has one running.
Description of changes:
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.