Data node changes for master task throttling #4204

dhwanilpatel · 2022-08-12T13:31:28Z

Data node side changes for master task throttling

Signed-off-by: Dhwanil Patel dhwanip@amazon.com

Description

This is one of multiple PRs planned for master task throttling. In this PR we are making changes to TransportClusterManagerNodeAction and MasterTaskThrottlingRetryListener.

MasterTaskThrottlingRetryListener: Introduces a new action listener that listens for throttling exceptions from the master node and performs retries with exponential backoff. It also handles tasks that time out while retrying.
TransportClusterManagerNodeAction: Plugs in the MasterTaskThrottlingRetryListener when sending tasks to the master so that it can retry on master throttling.
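
The retry behavior described above (retry on throttling, back off exponentially, give up once the task's timeout is reached) could look roughly like the following. This is a minimal, self-contained sketch and not the PR's listener: `SketchThrottlingException`, `ThrottlingRetrySketch`, and the plain doubling of the delay are assumptions made for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical stand-in for the throttling exception returned by the master node. */
class SketchThrottlingException extends RuntimeException {
    SketchThrottlingException(String message) {
        super(message);
    }
}

/** Hypothetical helper: retries a throttled action with doubling delays until a deadline. */
final class ThrottlingRetrySketch {
    private final ScheduledExecutorService scheduler;
    private final long baseDelayMillis;
    private final long timeoutMillis;

    ThrottlingRetrySketch(ScheduledExecutorService scheduler, long baseDelayMillis, long timeoutMillis) {
        this.scheduler = scheduler;
        this.baseDelayMillis = baseDelayMillis;
        this.timeoutMillis = timeoutMillis;
    }

    /** Runs the action now and retries it only when it fails with a throttling exception. */
    <T> CompletableFuture<T> execute(Supplier<T> action) {
        CompletableFuture<T> result = new CompletableFuture<>();
        long deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        attempt(action, result, baseDelayMillis, deadlineNanos);
        return result;
    }

    private <T> void attempt(Supplier<T> action, CompletableFuture<T> result, long delayMillis, long deadlineNanos) {
        try {
            result.complete(action.get());
        } catch (SketchThrottlingException e) {
            long remainingNanos = deadlineNanos - System.nanoTime();
            if (remainingNanos <= TimeUnit.MILLISECONDS.toNanos(delayMillis)) {
                // The next retry would start after the deadline, so surface the failure.
                result.completeExceptionally(e);
                return;
            }
            // Double the delay for the following attempt (exponential backoff).
            scheduler.schedule(
                () -> attempt(action, result, delayMillis * 2, deadlineNanos),
                delayMillis,
                TimeUnit.MILLISECONDS
            );
        } catch (RuntimeException e) {
            // Any other failure is not retried.
            result.completeExceptionally(e);
        }
    }
}

final class ThrottlingRetryDemo {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger attempts = new AtomicInteger();
        ThrottlingRetrySketch retrier = new ThrottlingRetrySketch(scheduler, 10, 1_000);
        // The simulated master throttles the first two attempts.
        String response = retrier.execute(() -> {
            if (attempts.incrementAndGet() < 3) {
                throw new SketchThrottlingException("throttled");
            }
            return "acknowledged";
        }).join();
        System.out.println(response + " after " + attempts.get() + " attempts");
        scheduler.shutdown();
    }
}
```

The deadline check happens before scheduling, so a retry is never queued if it could only start after the timeout has passed.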

Issues Resolved

Relates : #479

Check List

New functionality includes testing.
- All tests pass
New functionality has been documented.
- New functionality has javadoc added
Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>

github-actions · 2022-08-12T13:50:17Z

Gradle Check (Jenkins) Run Completed with:

RESULT: FAILURE ❌
URL: https://build.ci.opensearch.org/job/gradle-check/1697/
CommitID: 14acc5c

dreamer-89 · 2022-08-12T16:43:22Z

Gradle check is failing on the bwc distribution task. @dhwanilpatel : I think rebasing your changes against main should solve this issue.

Execution failed for task ':distribution:bwc:minor:buildBwcLinuxTar'.
> Building 2.2.0 didn't generate expected file /var/jenkins/workspace/gradle-check/search/distribution/bwc/minor/build/bwc/checkout-2.x/distribution/archives/linux-tar/build/distributions/opensearch-min-2.2.0-SNAPSHOT-linux-x64.tar.gz

...rc/main/java/org/opensearch/action/support/clustermanager/MasterThrottlingRetryListener.java

shwetathareja · 2022-08-17T07:26:31Z

...er/src/main/java/org/opensearch/action/support/clustermanager/ClusterManagerNodeRequest.java

@@ -110,6 +114,11 @@ public final Request masterNodeTimeout(String timeout) {
        return clusterManagerNodeTimeout(timeout);
    }

+    public final Request setRemoteRequest(boolean remoteRequest) {
+        this.remoteRequest = remoteRequest;


I am still unclear on why we need this flag. The action knows where it should be executed; the retry listener should just help run the same action after the scheduled delay on the same threadpool.

We need this distinction between requests coming to the master node because the same code block (in TransportClusterManagerNodeAction) executes in both cases (local/remote master), and our retry logic sits on top of it.

Using this flag, we determine whether the request was generated by the local node or by a remote node. If it is the local node's request, we need to perform the retries on this node. If it is a remote node's request, we do not retry on this node and let the remote node perform the retries.

If the request is from a remote data node, the data node sets the remoteRequest flag in {@link MasterNodeRequest} and sends the request to the master; using that flag, the master node can determine whether the request was local or remote.

gbbafna · 2022-08-17T11:51:45Z

...ain/java/org/opensearch/action/support/clustermanager/TransportClusterManagerNodeAction.java

+        public void start() {
+            ClusterState state = clusterService.state();
+            logger.trace("starting processing request [{}] with cluster state version [{}]", request, state.version());
+            doStart(state);
+        }


Why do we need start()? Can't we log directly in doStart itself?

gbbafna · 2022-08-17T13:23:42Z

...ain/java/org/opensearch/action/support/clustermanager/TransportClusterManagerNodeAction.java

        if (task != null) {
            request.setParentTask(clusterService.localNode().getId(), task.getId());
        }
-        new AsyncSingleAction(task, request, listener).doStart(state);
+        new AsyncSingleAction(task, request, listener).start();


what are we doing for cases like below:

OpenSearch/server/src/main/java/org/opensearch/cluster/action/shard/ShardStateAction.java

Line 192 in 3a97d4c

transportService.sendRequest(clusterManagerNode, actionName, request, new EmptyTransportResponseHandler(ThreadPool.Names.SAME) {

Here, on the sender side, TransportClusterManagerNodeAction is not extended.

Good point. I scanned the code and found the same pattern in NodeMappingRefreshAction as well. Let me know if I am missing any other places.

We will need to plug the retry logic in here as well. Once we have finalized the approach (RetryableAction vs. ThrottlingRetryListener), I will add it here too.

Until then, I will keep this comment open for tracking.

...rc/main/java/org/opensearch/action/support/clustermanager/MasterThrottlingRetryListener.java

dhwanilpatel · 2022-08-19T14:16:25Z

I have incorporated the high-level comment about modifying RetryableAction and using it for our use case with a custom backoff policy. I have made the changes and am updating the PR for early feedback on the approach.

@shwetathareja / @Bukhtawar / @gbbafna please provide your thoughts on it.

I will add the relevant UTs and clean up MasterThrottlingRetryListener after initial feedback on the approach (to avoid throwaway effort).

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>

github-actions · 2022-08-19T14:44:22Z

Gradle Check (Jenkins) Run Completed with:

RESULT: FAILURE ❌
URL: https://build.ci.opensearch.org/job/gradle-check/1936/
CommitID: 3b85f13

server/src/main/java/org/opensearch/action/support/RetryableAction.java

gbbafna · 2022-08-22T07:34:54Z

...ain/java/org/opensearch/action/support/clustermanager/TransportClusterManagerNodeAction.java

+
+        @Override
+        public boolean shouldRetry(Exception e) {
+            if (localRequest) {


How is the remote node retrying MasterTaskThrottlingException now?

When a request is made to a remote master node, we set the remoteRequest flag on it and send it to the master.

For a throttling exception, the master will not retry, based on this check; it lets the exception flow back to the data node, and the data node performs the retry.

Since the same code block runs for both remote and local master, we need this segregation.

Since RetryableAction triggers the final listener only after all retries are exhausted, is that why we need to differentiate local vs. remote calls? Can we differentiate by checking the transport request's sourceNode instead of changing the request object?

Sure, made changes to rely on the remoteAddress of the request, which is null for a local request; for a remote request it holds the remote node's transport address.

Removed the new field from the request.
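
A rough sketch of the local-versus-remote check discussed in this thread follows. It is not the PR's code: `SketchRequest`, `ThrottledSketchException`, and `LocalRetryDecision` are hypothetical stand-ins, and the only behavior taken from the thread is that a null remote address marks a request that originated on this node, and that the throttling exception may arrive wrapped in another exception.

```java
import java.net.InetSocketAddress;

/** Hypothetical stand-in for the cluster manager throttling exception. */
class ThrottledSketchException extends RuntimeException {
    ThrottledSketchException(String message) {
        super(message);
    }
}

/** Hypothetical request type that only models the remote address. */
final class SketchRequest {
    private final InetSocketAddress remoteAddress;

    SketchRequest(InetSocketAddress remoteAddress) {
        this.remoteAddress = remoteAddress;
    }

    /** Null when the request originated on this node (assumption taken from the thread). */
    InetSocketAddress remoteAddress() {
        return remoteAddress;
    }
}

final class LocalRetryDecision {
    private LocalRetryDecision() {}

    static boolean isLocalRequest(SketchRequest request) {
        return request.remoteAddress() == null;
    }

    /** Retries only local requests, and only when a throttling exception is in the cause chain. */
    static boolean shouldRetry(SketchRequest request, Throwable failure) {
        if (!isLocalRequest(request)) {
            return false; // The originating node owns the retries.
        }
        for (Throwable t = failure; t != null; t = t.getCause()) {
            if (t instanceof ThrottledSketchException) {
                return true;
            }
        }
        return false;
    }
}

final class LocalRetryDecisionDemo {
    public static void main(String[] args) {
        SketchRequest local = new SketchRequest(null);
        SketchRequest remote = new SketchRequest(InetSocketAddress.createUnresolved("10.0.0.2", 9300));
        Exception throttled = new ThrottledSketchException("throttled");
        System.out.println(LocalRetryDecision.shouldRetry(local, throttled));  // true
        System.out.println(LocalRetryDecision.shouldRetry(remote, throttled)); // false
    }
}
```

Deriving the flag from the address keeps the request object unchanged, which is the point of the reviewer's suggestion.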

server/src/main/java/org/opensearch/action/support/RetryableAction.java

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>

github-actions · 2022-08-24T07:47:24Z

Gradle Check (Jenkins) Run Completed with:

RESULT: FAILURE ❌
URL: https://build.ci.opensearch.org/job/gradle-check/2054/
CommitID: a9cd2d1

github-actions · 2022-08-24T12:12:27Z

Gradle Check (Jenkins) Run Completed with:

RESULT: FAILURE ❌
URL: https://build.ci.opensearch.org/job/gradle-check/2061/
CommitID: a9cd2d1

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>

github-actions · 2022-08-30T14:56:35Z

Gradle Check (Jenkins) Run Completed with:

RESULT: FAILURE ❌
URL: https://build.ci.opensearch.org/job/gradle-check/2252/
CommitID: 3ae89ea

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>

github-actions · 2022-09-01T11:18:25Z

Gradle Check (Jenkins) Run Completed with:

RESULT: FAILURE ❌
URL: https://build.ci.opensearch.org/job/gradle-check/2419/
CommitID: 45760f5

shwetathareja · 2022-09-01T12:45:07Z

server/src/main/java/org/opensearch/action/bulk/BackoffPolicy.java

+        @Override
+        public TimeValue next() {
+            TimeValue delayToReturn = TimeValue.timeValueMillis(Randomness.get().nextInt(Math.toIntExact(currentDelay)) + 1);
+            currentDelay = Math.min(2 * currentDelay, Integer.MAX_VALUE);


Should this be the first statement in the method (the currentDelay calculation)?

We first want to calculate the random delay from the current delay, which is what we need to return for the current call, and then double it and store it in currentDelay for the next call.
That is why we compute the return delay first and then update currentDelay to double its value.
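
The ordering explained above can be sketched in isolation as follows. This is an illustrative stand-in rather than the `BackoffPolicy` class from the diff: `ExponentialRandomBackoffSketch` is a hypothetical name, and `java.util.Random` replaces the `Randomness` helper used in the PR.

```java
import java.util.Iterator;
import java.util.Random;

/** Hypothetical sketch of a randomized exponential backoff iterator. */
final class ExponentialRandomBackoffSketch implements Iterator<Long> {
    private final Random random;
    private long currentDelayMillis;

    ExponentialRandomBackoffSketch(long baseDelayMillis, Random random) {
        if (baseDelayMillis < 1 || baseDelayMillis > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("base delay must be in [1, Integer.MAX_VALUE]");
        }
        this.currentDelayMillis = baseDelayMillis;
        this.random = random;
    }

    @Override
    public boolean hasNext() {
        return true; // The sequence is unbounded; the caller's timeout ends the retries.
    }

    @Override
    public Long next() {
        // 1) Pick a random delay in [1, currentDelay] for THIS call.
        long delayToReturn = random.nextInt(Math.toIntExact(currentDelayMillis)) + 1L;
        // 2) Only then double the bound, capped at Integer.MAX_VALUE, for the NEXT call.
        currentDelayMillis = Math.min(2 * currentDelayMillis, Integer.MAX_VALUE);
        return delayToReturn;
    }
}

final class ExponentialRandomBackoffDemo {
    public static void main(String[] args) {
        ExponentialRandomBackoffSketch backoff = new ExponentialRandomBackoffSketch(10, new Random());
        long upperBound = 10;
        for (int attempt = 1; attempt <= 5; attempt++) {
            System.out.println("attempt " + attempt + ": " + backoff.next() + " ms (upper bound " + upperBound + " ms)");
            upperBound *= 2;
        }
    }
}
```

If the doubling ran first, the very first delay would be drawn from twice the configured base, which is the behavior the reply is avoiding.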

server/src/main/java/org/opensearch/action/bulk/BackoffPolicy.java

...ava/org/opensearch/action/support/clustermanager/TransportClusterManagerNodeActionTests.java

…se TransportClusterManagerNodeAction Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>

github-actions · 2022-09-01T13:43:23Z

Gradle Check (Jenkins) Run Completed with:

RESULT: FAILURE ❌
URL: https://build.ci.opensearch.org/job/gradle-check/2427/
CommitID: 4bf4e8b

shwetathareja · 2022-09-01T14:33:41Z

...ava/org/opensearch/action/support/clustermanager/TransportClusterManagerNodeActionTests.java

+        assertFalse(exception.get());
+    }
+
+    public void testShouldRetry() {


There is no action.execute? How is it checking shouldRetry?

I started writing this one and then realized shouldRetry is already tested in another UT, so this UT was redundant. I missed removing it; will remove it.

shouldRetry is tested in testThrottlingRetryLocalMaster and testThrottlingRetryRemoteMaster, where we throw the throttling exception both directly and via a transport exception and verify that retries are made.

Bukhtawar · 2022-09-01T15:20:01Z

server/src/main/java/org/opensearch/action/support/RetryableAction.java

+            timeoutValue,
+            listener,
+            BackoffPolicy.exponentialRandomBackoff(initialDelay.getMillis()),
+            ThreadPool.Names.SAME


Which threadpool is this?

As per my understanding, this is simply the caller's threadpool. We will not create a new threadpool for retries; the retries run on the caller's threadpool.
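
To illustrate the reply above: assuming `ThreadPool.Names.SAME` behaves like a direct executor, a task handed to it runs inline on whichever thread submits it, and no separate pool is involved. The sketch below uses only the JDK; `SameThreadExecutorSketch` and its `SAME` constant are hypothetical names, not the OpenSearch classes.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicReference;

final class SameThreadExecutorSketch {
    /** Runs every task inline on the submitting thread; no pool is involved. */
    static final Executor SAME = Runnable::run;

    private SameThreadExecutorSketch() {}

    /** Returns the name of the thread on which a synchronous executor ran a probe task. */
    static String runningThreadName(Executor executor) {
        AtomicReference<String> name = new AtomicReference<>();
        executor.execute(() -> name.set(Thread.currentThread().getName()));
        return name.get();
    }

    public static void main(String[] args) throws Exception {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            Callable<String> probe = () -> runningThreadName(SAME);
            // Submitted from the main thread: the inline task reports the main thread.
            System.out.println(runningThreadName(SAME));
            // Submitted from the worker: the inline task reports the worker thread.
            System.out.println(worker.submit(probe).get());
        } finally {
            worker.shutdown();
        }
    }
}
```

One consequence worth keeping in mind is that inline execution also means a slow retry body occupies the submitting thread for its full duration.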

Bukhtawar · 2022-09-01T15:22:37Z

server/src/main/java/org/opensearch/cluster/action/index/NodeMappingRefreshAction.java

+    /**
+     * RetryableAction for performing retires for cluster manager throttling.
+     */
+    private class NodeMappingRefreshClusterManagerAction extends RetryableAction {
+
+        private final DiscoveryNode clusterManagerNode;
+        private final NodeMappingRefreshRequest request;
+        private static final int BASE_DELAY_MILLIS = 10;
+        private static final int MAX_DELAY_MILLIS = 10;
+
+        private NodeMappingRefreshClusterManagerAction(DiscoveryNode clusterManagerNode, NodeMappingRefreshRequest request) {
+            super(
+                logger,
+                threadPool,
+                TimeValue.timeValueMillis(BASE_DELAY_MILLIS),
+                TimeValue.timeValueMillis(Integer.MAX_VALUE), // Shard tasks are internal and don't have timeout
+                new ActionListener() {
+                    @Override
+                    public void onResponse(Object o) {}
+
+                    @Override
+                    public void onFailure(Exception e) {
+                        logger.warn("Mapping refresh for [{}] failed due to [{}]", request.index, e.getMessage());
+                    }
+                },
+                BackoffPolicy.exponentialEqualJitterBackoff(BASE_DELAY_MILLIS, MAX_DELAY_MILLIS),
+                ThreadPool.Names.SAME
+            );
+            this.clusterManagerNode = clusterManagerNode;
+            this.request = request;
+        }
+
+        @Override
+        public void tryAction(ActionListener listener) {
+            sendNodeMappingRefreshToClusterManager(clusterManagerNode, request, listener);
+        }
+


This would make onboarding more actions to the task throttling framework more tedious. Can we abstract out all the complexity and avoid creating a new class per action altogether?

Ideally, new actions that need to be performed on the master should extend TransportClusterManagerNodeAction. All of this retry logic is abstracted out in it.

These two actions, refresh-mapping and shard-state, were not extending TransportClusterManagerNodeAction and were sending requests directly to the master node, so we need to add the retry logic here as well.

[NOTE]: To keep this data node side PR clean, I am going to remove the NodeMappingRefreshAction and ShardStateAction changes. I will add them in the PR where we onboard those tasks to this framework.

server/src/main/java/org/opensearch/cluster/action/shard/ShardStateAction.java

shwetathareja · 2022-09-01T15:59:25Z

server/src/main/java/org/opensearch/cluster/action/index/NodeMappingRefreshAction.java


    @Inject
-    public NodeMappingRefreshAction(TransportService transportService, MetadataMappingService metadataMappingService) {
+    public NodeMappingRefreshAction(


These changes should have been raised in a different PR.

Yes Shweta, I am going to remove this change to keep the data node side changes clean. I will add these changes in a different PR where we onboard the refresh-mapping and shard-state actions into the throttling framework.

…h dont use TransportClusterManagerNodeAction" This reverts commit 4bf4e8b. Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>

gbbafna · 2022-09-02T05:41:13Z

server/src/main/java/org/opensearch/cluster/action/index/NodeMappingRefreshAction.java

+    /**
+     * RetryableAction for performing retires for cluster manager throttling.
+     */


Can we extend TransportClusterManagerNodeAction for this instead of writing another action?

Actually, I tried to check that as well; it might need more changes. I will give it one more look.

Anyway, I am going to remove these changes from here and add them back when we onboard this task type to the framework. We can discuss more in that PR.

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>

shwetathareja

Thanks for the changes @dhwanilpatel . LGTM.

github-actions · 2022-09-02T06:55:26Z

Gradle Check (Jenkins) Run Completed with:

RESULT: FAILURE ❌
URL: https://build.ci.opensearch.org/job/gradle-check/2518/
CommitID: 5a4eaca

…' into throttling-data-change-pr Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>

github-actions · 2022-09-02T13:04:36Z

Gradle Check (Jenkins) Run Completed with:

RESULT: SUCCESS ✅
URL: https://build.ci.opensearch.org/job/gradle-check/2542/
CommitID: b3721e4

* Data node changes for master task throttling
  Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
* Using Retryable action for retries
* Used RemoteAddress instead of new field for checking local Request

Basic Throttler Framework / Exponential Basic back off policy. Add basic thorttler/exponential backoff policy for retry/Defination o… #3527
Changes required in Master node to perform throttling. Master node changes for master task throttling #3882
Changes required in Data node to perform retry on throttling. Data node changes for master task throttling #4204
Provide support for all task type in throttling framework. Onboarding of few task types to throttling #4542
Integration Tests (Fix timeout exception and Add Integ test for Master task throttling #4588

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>

…t#4986)
Basic Throttler Framework / Exponential Basic back off policy. Add basic thorttler/exponential backoff policy for retry/Defination o… opensearch-project#3527
Changes required in Master node to perform throttling. Master node changes for master task throttling opensearch-project#3882
Changes required in Data node to perform retry on throttling. Data node changes for master task throttling opensearch-project#4204
Provide support for all task type in throttling framework. Onboarding of few task types to throttling opensearch-project#4542
Integration Tests (Fix timeout exception and Add Integ test for Master task throttling opensearch-project#4588

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>

* Add basic thorttler/exponential backoff policy for retry/Defination of throttling exception (#3856)
* Corrected Java doc for Throttler
* Changed the default behaviour of Throttler to return Optional
* Removed generics from Throttler and used String as key
* Ignore backport / autocut / dependabot branches for gradle checks on push
* Master node changes for master task throttling (#3882)
* Data node changes for master task throttling (#4204)
* Onboarding of few task types to throttling (#4542)
* Fix timeout exception and Add Integ test for Master task throttling (#4588)
* Complete TODO for version change and removed unused classes(Throttler and Semaphore) (#4846)
* Remove V1 version from throttling testcase

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>

Data node changes for master task throttling

14acc5c

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>

dhwanilpatel requested review from a team and reta as code owners August 12, 2022 13:31

dhwanilpatel mentioned this pull request Aug 12, 2022

Cluster Manager Task Throttling #479

Closed

shwetathareja reviewed Aug 17, 2022

gbbafna reviewed Aug 17, 2022

Bukhtawar reviewed Aug 18, 2022

...rc/main/java/org/opensearch/action/support/clustermanager/MasterThrottlingRetryListener.java (outdated, resolved)

Bukhtawar reviewed Aug 18, 2022

...rc/main/java/org/opensearch/action/support/clustermanager/MasterThrottlingRetryListener.java (outdated, resolved)

Using Retryable action for retries

3b85f13

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>

Bukhtawar reviewed Aug 19, 2022

server/src/main/java/org/opensearch/action/support/RetryableAction.java (outdated, resolved)

Bukhtawar reviewed Aug 19, 2022

server/src/main/java/org/opensearch/action/support/RetryableAction.java (outdated, resolved)

gbbafna reviewed Aug 22, 2022

shwetathareja reviewed Aug 23, 2022

server/src/main/java/org/opensearch/action/support/RetryableAction.java (outdated, resolved)

server/src/main/java/org/opensearch/action/support/RetryableAction.java (resolved)

Incorporated comments 08/24

a9cd2d1

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>

Moved back backoff policy to BackOffPolicy class

3ae89ea

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>

Used RemoteAddress instead of new field for checking localRequest

45760f5

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>

shwetathareja reviewed Sep 1, 2022

Add retryable action for refres-mapping and shard action which dont u…

4bf4e8b

…se TransportClusterManagerNodeAction Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>

shwetathareja reviewed Sep 1, 2022

Bukhtawar reviewed Sep 1, 2022

server/src/main/java/org/opensearch/cluster/action/shard/ShardStateAction.java (outdated, resolved)

shwetathareja reviewed Sep 1, 2022

Revert "Add retryable action for refres-mapping and shard action whic…

0f66ede

…h dont use TransportClusterManagerNodeAction" This reverts commit 4bf4e8b. Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>

gbbafna reviewed Sep 2, 2022

Incorporated Comments

5a4eaca

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>

shwetathareja approved these changes Sep 2, 2022

dhwanilpatel added 2 commits September 2, 2022 18:03

Merge remote-tracking branch 'upstream/feature/master-task-throttling…

4b3afa5

…' into throttling-data-change-pr Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>

Changed throttling exception name due to merge from main

b3721e4

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>

gbbafna approved these changes Sep 6, 2022

shwetathareja merged commit 23f15a5 into opensearch-project:feature/master-task-throttling Sep 6, 2022

dhwanilpatel mentioned this pull request Oct 31, 2022

[Feature]Cluster manager task throttling feature #4986

Merged

6 tasks

This was referenced Nov 2, 2022

[Backport 2.x] Cluster Manager task throttling #5041

Merged

Cluster manager task throttling [DOC] opensearch-project/documentation-website#1792

Closed

andrross mentioned this pull request Jul 27, 2023

[BUG] MasterThrottlingRetryListener looked up at runtime, but does it exist? opensearch-project/performance-analyzer#514

Closed
