Added resource usage trackers for in-flight cancellation of SearchShardTask #4805

@ketanv3 ketanv3 commented Oct 17, 2022

Description

The following implementations of TaskResourceUsageTracker have been added:

  1. CpuUsageTracker: cancels tasks if they consume too much CPU time
  2. ElapsedTimeTracker: cancels tasks if they have been running for too long
  3. HeapUsageTracker: cancels tasks if they consume too much heap memory
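As a rough illustration of the three trackers listed above, here is a minimal sketch of threshold-based checks. Every name in it (TaskSnapshot, UsageTracker, the *Sketch classes) and the ">= threshold" comparison are assumptions made for this example; they are not the classes or logic added by this PR.

```java
import java.util.Optional;

// Hypothetical, simplified stand-in for a task's resource stats (not an OpenSearch class).
final class TaskSnapshot {
    final long cpuTimeNanos;
    final long elapsedTimeNanos;
    final long heapBytes;

    TaskSnapshot(long cpuTimeNanos, long elapsedTimeNanos, long heapBytes) {
        this.cpuTimeNanos = cpuTimeNanos;
        this.elapsedTimeNanos = elapsedTimeNanos;
        this.heapBytes = heapBytes;
    }
}

// Hypothetical tracker contract: a reason is returned only when the task
// looks eligible for cancellation; the tracker itself cancels nothing.
interface UsageTracker {
    Optional<String> cancellationReason(TaskSnapshot task);
}

final class CpuUsageTrackerSketch implements UsageTracker {
    private final long thresholdNanos;

    CpuUsageTrackerSketch(long thresholdNanos) {
        this.thresholdNanos = thresholdNanos;
    }

    @Override
    public Optional<String> cancellationReason(TaskSnapshot task) {
        // Assumed rule: flag the task once CPU time reaches the threshold.
        return task.cpuTimeNanos >= thresholdNanos
            ? Optional.of("cpu usage exceeded")
            : Optional.<String>empty();
    }
}

final class ElapsedTimeTrackerSketch implements UsageTracker {
    private final long thresholdNanos;

    ElapsedTimeTrackerSketch(long thresholdNanos) {
        this.thresholdNanos = thresholdNanos;
    }

    @Override
    public Optional<String> cancellationReason(TaskSnapshot task) {
        // Assumed rule: flag the task once it has run for the threshold duration.
        return task.elapsedTimeNanos >= thresholdNanos
            ? Optional.of("elapsed time exceeded")
            : Optional.<String>empty();
    }
}

final class HeapUsageTrackerSketch implements UsageTracker {
    private final long thresholdBytes;

    HeapUsageTrackerSketch(long thresholdBytes) {
        this.thresholdBytes = thresholdBytes;
    }

    @Override
    public Optional<String> cancellationReason(TaskSnapshot task) {
        // Assumed rule: flag the task once its heap allocation reaches the threshold.
        return task.heapBytes >= thresholdBytes
            ? Optional.of("heap usage exceeded")
            : Optional.<String>empty();
    }
}
```

In this sketch each tracker only reports eligibility; whether a flagged task is actually cancelled is left to the caller.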

Issues Resolved

#1181

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Ketan Verma <ketan9495@gmail.com>

@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

@ketanv3 ketanv3 force-pushed the feature/inflight-cancellation-trackers branch 2 times, most recently from 831a425 to 87923a9 Compare October 17, 2022 08:35
@ketanv3 ketanv3 marked this pull request as ready for review October 17, 2022 09:00
@ketanv3 ketanv3 requested review from a team and reta as code owners October 17, 2022 09:00

@nssuresh2007 nssuresh2007 left a comment

LGTM

@ketanv3 ketanv3 force-pushed the feature/inflight-cancellation-trackers branch from 87923a9 to 7719d71 Compare October 20, 2022 06:39
@ketanv3 ketanv3 force-pushed the feature/inflight-cancellation-trackers branch 2 times, most recently from ae3269d to c42859e Compare October 20, 2022 19:25
@ketanv3 ketanv3 force-pushed the feature/inflight-cancellation-trackers branch from c42859e to aeb6521 Compare October 20, 2022 20:02
@ketanv3 ketanv3 force-pushed the feature/inflight-cancellation-trackers branch from aeb6521 to 0dc3749 Compare October 21, 2022 05:00
@ketanv3 ketanv3 force-pushed the feature/inflight-cancellation-trackers branch from 0dc3749 to 5cbfa8a Compare October 21, 2022 06:11

ketanv3 commented Oct 21, 2022

Gradle check is repeatedly failing with the same test failures in MixedClusterClientYamlTestSuiteIT. This started after pulling the latest commit 515f84b from the 'main' branch to avoid merge conflicts. Gradle check also failed with the same test failures for that commit (https://build.ci.opensearch.org/job/gradle-check/5078/).

Related to this issue - #4852

Update: This issue has been resolved with 49a9b81.

@ketanv3 ketanv3 requested a review from Bukhtawar October 21, 2022 09:27
@ketanv3 ketanv3 force-pushed the feature/inflight-cancellation-trackers branch from 5cbfa8a to de64339 Compare October 22, 2022 09:11
@ketanv3 ketanv3 force-pushed the feature/inflight-cancellation-trackers branch from de64339 to a00e440 Compare October 22, 2022 09:39

codecov-commenter commented Oct 22, 2022

Codecov Report

Merging #4805 (442ff82) into main (782dc59) will decrease coverage by 0.59%.
The diff coverage is 63.28%.

@@             Coverage Diff              @@
##               main    #4805      +/-   ##
============================================
- Coverage     71.37%   70.77%   -0.60%     
+ Complexity    58338    57879     -459     
============================================
  Files          4689     4692       +3     
  Lines        277022   276959      -63     
  Branches      40315    40301      -14     
============================================
- Hits         197718   196018    -1700     
- Misses        63304    64680    +1376     
- Partials      16000    16261     +261     
Impacted Files                                         Coverage Δ
.../opensearch/gradle/info/GlobalBuildInfoPlugin.java  37.15% <ø> (+0.59%) ⬆️
...ternal/InternalDistributionArchiveCheckPlugin.java  0.00% <0.00%> (ø)
...ternal/InternalDistributionArchiveSetupPlugin.java  0.00% <ø> (ø)
...ain/java/org/opensearch/painless/antlr/Walker.java  85.01% <ø> (+0.18%) ⬆️
...admin/indices/segments/IndicesSegmentResponse.java  73.86% <ø> (+8.86%) ⬆️
...main/java/org/opensearch/bootstrap/JNANatives.java  18.85% <0.00%> (+2.06%) ⬆️
...ava/org/opensearch/cluster/node/DiscoveryNode.java  89.83% <ø> (+3.68%) ⬆️
...rg/opensearch/common/settings/ClusterSettings.java  91.89% <ø> (ø)
...main/java/org/opensearch/index/engine/Segment.java  71.42% <ø> (+4.05%) ⬆️
.../opensearch/index/mapper/DocumentMapperParser.java  90.14% <ø> (+7.02%) ⬆️
... and 529 more


…rdTask

1. CpuUsageTracker: cancels tasks if they consume too much CPU
2. ElapsedTimeTracker: cancels tasks if they consume too much time
3. HeapUsageTracker: cancels tasks if they consume too much heap

Signed-off-by: Ketan Verma <ketan9495@gmail.com>
@ketanv3 ketanv3 force-pushed the feature/inflight-cancellation-trackers branch from a00e440 to 98610e3 Compare October 26, 2022 05:31
Signed-off-by: Ketan Verma <ketan9495@gmail.com>
@ketanv3 ketanv3 force-pushed the feature/inflight-cancellation-trackers branch from 98610e3 to 962cc05 Compare October 26, 2022 07:23

@@ -213,7 +215,7 @@ TaskCancellation getTaskCancellation(CancellableTask task) {
         List<Runnable> callbacks = new ArrayList<>();

         for (TaskResourceUsageTracker tracker : taskResourceUsageTrackers) {
-            Optional<TaskCancellation.Reason> reason = tracker.cancellationReason(task);
+            Optional<TaskCancellation.Reason> reason = tracker.checkAndMaybeGetCancellationReason(task);
Collaborator

@Bukhtawar Bukhtawar Oct 28, 2022

Thanks for this, but in general we should think about decoupling tracking from the action taken once thresholds have been breached. Today that action might be search cancellation, but I do envision an action that modifies the threadpool size/queue in a manner that creates backpressure.
We can think about that refactor as a fast follow-up, as it will help us add more actions.

Collaborator

Also, cancellation isn't truly backpressure :) it's load shedding.

Contributor Author

That makes sense. Trackers can recommend actions once thresholds are met, and cancellation of tasks can be one such action. This will however influence how dissimilar actions from different trackers are grouped/compared with each other in the SearchBackpressureService. For example, we need to aggregate the cancellation scores from each tracker before we start cancelling tasks. With generic actions, this might become really complicated.

Let's do a detailed design of this first and refactor as a follow-up. Enhancement: #4985
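To make the aggregation concern above concrete, here is a small sketch of combining per-tracker cancellation scores into one ranking per task. The types (ScoredReason, ScoringTracker, CancellationPlanner) are hypothetical, and summing the scores is an assumption chosen for illustration; the comment only says the scores need to be aggregated, not how.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical types for illustration; not the OpenSearch API.
final class ScoredReason {
    final String message;
    final int score;

    ScoredReason(String message, int score) {
        this.message = message;
        this.score = score;
    }
}

interface ScoringTracker {
    // Returns a scored reason only if this tracker considers the task eligible.
    Optional<ScoredReason> check(String taskId);
}

final class CancellationPlanner {
    // Sums the scores reported by all trackers for each task (assumed
    // aggregation rule), skips tasks that no tracker flagged, and returns
    // the remaining tasks ordered by total score, highest first.
    static List<Map.Entry<String, Integer>> rank(List<String> taskIds, List<ScoringTracker> trackers) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (String taskId : taskIds) {
            int total = 0;
            boolean flagged = false;
            for (ScoringTracker tracker : trackers) {
                Optional<ScoredReason> reason = tracker.check(taskId);
                if (reason.isPresent()) {
                    total += reason.get().score;
                    flagged = true;
                }
            }
            if (flagged) {
                totals.put(taskId, total);
            }
        }
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(totals.entrySet());
        ranked.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return ranked;
    }
}
```

This works because every tracker returns the same kind of result (a score). With generic actions such as resizing a queue there is no obvious common unit to sum or compare, which is the complication raised above.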

-    public abstract Optional<TaskCancellation.Reason> cancellationReason(Task task);
+    public abstract Optional<TaskCancellation.Reason> checkAndMaybeGetCancellationReason(Task task);
Collaborator

The function signature still doesn't look right; it doesn't make clear whether cancellation will occur or not.

Contributor Author

I still feel that it's correct – this only returns a TaskCancellation.Reason for a task eligible for cancellation. It doesn't say anything about whether the task will actually be cancelled or not.

@ketanv3 ketanv3 force-pushed the feature/inflight-cancellation-trackers branch from 6e9e7a6 to 57667ef Compare October 31, 2022 09:15

ketanv3 commented Oct 31, 2022

Gradle checks are failing due to a missing dependency JAR issue.

A problem occurred configuring root project 'OpenSearch'.
> Could not resolve all files for configuration ':classpath'.
   > Could not find spotless-lib-extra-2.30.0.jar (com.diffplug.spotless:spotless-lib-extra:2.30.0).
     Searched in the following locations:
         https://plugins.gradle.org/m2/com/diffplug/spotless/spotless-lib-extra/2.30.0/spotless-lib-extra-2.30.0.jar
   > Could not find spotless-lib-2.30.0.jar (com.diffplug.spotless:spotless-lib:2.30.0).
     Searched in the following locations:
         https://plugins.gradle.org/m2/com/diffplug/spotless/spotless-lib/2.30.0/spotless-lib-2.30.0.jar

It looks like the JAR has vanished from this repo: https://repo.gradle.org/ui/native/jcenter/com/diffplug/spotless/spotless-lib/2.30.0/

Related: #4987

@ketanv3 ketanv3 force-pushed the feature/inflight-cancellation-trackers branch from 57667ef to e0ca2c8 Compare October 31, 2022 11:56
Signed-off-by: Ketan Verma <ketan9495@gmail.com>
@ketanv3 ketanv3 force-pushed the feature/inflight-cancellation-trackers branch from e0ca2c8 to 442ff82 Compare October 31, 2022 12:24
@Bukhtawar Bukhtawar merged commit 6a56b39 into opensearch-project:main Oct 31, 2022
ketanv3 added a commit to ketanv3/OpenSearch that referenced this pull request Nov 1, 2022
…on of SearchShardTask (opensearch-project#4805)

1. CpuUsageTracker: cancels tasks if they consume too much CPU
2. ElapsedTimeTracker: cancels tasks if they consume too much time
3. HeapUsageTracker: cancels tasks if they consume too much heap

Signed-off-by: Ketan Verma <ketan9495@gmail.com>
ketanv3 added a commit to ketanv3/OpenSearch that referenced this pull request Nov 2, 2022
…on of SearchShardTask (opensearch-project#4805)

Bukhtawar pushed a commit that referenced this pull request Nov 3, 2022
…ource consumption (#5039)

* [Backport 2.x] Added in-flight cancellation of SearchShardTask based on resource consumption (#4575)

This feature aims to identify and cancel resource intensive SearchShardTasks if they have breached certain
thresholds. This will help in terminating problematic queries which can put nodes in duress and degrade the
cluster performance.

* [Backport 2.x] Added resource usage trackers for in-flight cancellation of SearchShardTask (#4805)

1. CpuUsageTracker: cancels tasks if they consume too much CPU
2. ElapsedTimeTracker: cancels tasks if they consume too much time
3. HeapUsageTracker: cancels tasks if they consume too much heap

* [Backport 2.x]Added search backpressure stats API

Added search backpressure stats to the existing node/stats API to describe:
1. the number of cancellations (currently for SearchShardTask only)
2. the current state of TaskResourceUsageTracker

Signed-off-by: Ketan Verma <ketan9495@gmail.com>
ketanv3 added a commit to ketanv3/OpenSearch that referenced this pull request Nov 3, 2022
…ource consumption (opensearch-project#5039)

opensearch-trigger-bot bot pushed a commit that referenced this pull request Nov 3, 2022
…ource consumption (#5039)

(cherry picked from commit 7c521b9)
Bukhtawar pushed a commit that referenced this pull request Nov 3, 2022
…ource consumption (#5039) (#5058)

(cherry picked from commit 7c521b9)

Co-authored-by: Ketan Verma <ketanv3@users.noreply.github.com>