Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data node changes for master task throttling #4204

Conversation

dhwanilpatel
Copy link
Contributor

@dhwanilpatel dhwanilpatel commented Aug 12, 2022

Data node side changes for master task throttling

Signed-off-by: Dhwanil Patel dhwanip@amazon.com

Description

This is one of multiple PR planned for master task throttling. In this PR we are making changes in TransportClusterManagerNodeAction and MasterTaskThrottlingRetryListener.

MasterTaskThrottlingRetryListener : Introduced new action listener which listens to throttling exception from master node and perform the retries with exponential backoff. It also takes care of tasks getting timed out in retry.
TransportClusterManagerNodeAction : Plugged the MasterTaskThrottlingRetryListener while sending tasks to master so it can perform the retries on master throttling.

Issues Resolved

Relates : #479

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
@dhwanilpatel dhwanilpatel requested review from a team and reta as code owners August 12, 2022 13:31
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@dreamer-89
Copy link
Member

Gradle check failing on bwc distribution task. @dhwanilpatel : I think rebasing your changes against main should solve this issue.

Execution failed for task ':distribution:bwc:minor:buildBwcLinuxTar'.
> Building 2.2.0 didn't generate expected file /var/jenkins/workspace/gradle-check/search/distribution/bwc/minor/build/bwc/checkout-2.x/distribution/archives/linux-tar/build/distributions/opensearch-min-2.2.0-SNAPSHOT-linux-x64.tar.gz

@@ -110,6 +114,11 @@ public final Request masterNodeTimeout(String timeout) {
return clusterManagerNodeTimeout(timeout);
}

public final Request setRemoteRequest(boolean remoteRequest) {
this.remoteRequest = remoteRequest;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am still unclear why do we need this flag? The action knows where it should be executed, the retry listener should just help run the same action after scheduled delay on same threadpool

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need this distinguishion between requests coming to Master node as we have same code block (In TransportClusterManagerNodeAction) executing for both the case (local/remote master) and our retry logic is also on top of it only.

Using this flag we will determine whether request is generated from local node or from remote node. If it is local node's request we need to perform the retries on this node. If it is remote node's request, we will not perform retries on this node and let remote node perform the retries.

If request is from remote data node, then data node will set remoteRequest flag in {@link MasterNodeRequest} and send request to master, using that on master node we can determine if the request was localRequest or remoteRequest.

Comment on lines 163 to 167
public void start() {
ClusterState state = clusterService.state();
logger.trace("starting processing request [{}] with cluster state version [{}]", request, state.version());
doStart(state);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need start ? can't we directly log in doStart itself ?

if (task != null) {
request.setParentTask(clusterService.localNode().getId(), task.getId());
}
new AsyncSingleAction(task, request, listener).doStart(state);
new AsyncSingleAction(task, request, listener).start();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are we doing for cases like below:

transportService.sendRequest(clusterManagerNode, actionName, request, new EmptyTransportResponseHandler(ThreadPool.Names.SAME) {

Here on the sender side , TransportClusterManagerNodeAction is not extended .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I tried to scan over the code and found similar for NodeMappingRefreshAction as well. Let me know if I am missing from somewhere else as well.

We will need to plug the retry logic over here as well. Once we have finalized approach on it using RetryableAction/ThrottlingRetryListener will add that over here as well.

Till then will keep this comment open for tracking.

@dhwanilpatel
Copy link
Contributor Author

I have incorporated high level comment on modifying RetryableAction and using it for our usecase with our custom backoff policy. I have made changes around that and updating PR for early feedback on the approach.

@shwetathareja / @Bukhtawar / @gbbafna please provide your thoughts on it.

Will add relevant UTs and cleanup MasterThrottlingRetryListener after initial feedback on approach(to save throwaway efforts)

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:


@Override
public boolean shouldRetry(Exception e) {
if (localRequest) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is remote node retrying MasterTaskThrottlingException now ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When request would be made to remote master node, we will set remoteRequest flag in it and send it to master.

For throttling exception, master will not perform the retry on it based on this check and let the exception flow to the data node and data node will perform the retry.

Since same code block is getting run for both remote/local master we need this segregation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As RetryableAction is triggering final listener after all the retries have exhausted, that explains why there is a need to differentiate local vs remote call? Can we differentiate by checking the transport request sourceNode instead of changing the request object?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure made changes to rely on remoteAddress of the request, which will be null for local request for remote address it will have remote node's transport address.

Removed the new filed from the request.

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
@github-actions
Copy link
Contributor

github-actions bot commented Sep 1, 2022

Gradle Check (Jenkins) Run Completed with:

@Override
public TimeValue next() {
TimeValue delayToReturn = TimeValue.timeValueMillis(Randomness.get().nextInt(Math.toIntExact(currentDelay)) + 1);
currentDelay = Math.min(2 * currentDelay, Integer.MAX_VALUE);
Copy link
Member

@shwetathareja shwetathareja Sep 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be first statement in the method (currentDelay calculation)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to first calculate the randomDelay out of current delay which we needs to return as part of current call and then double it and store in currentDelay for next call.
So that's why we are first doing retrun delay calculation and then update current delay with double of it.

…se TransportClusterManagerNodeAction

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
@github-actions
Copy link
Contributor

github-actions bot commented Sep 1, 2022

Gradle Check (Jenkins) Run Completed with:

assertFalse(exception.get());
}

public void testShouldRetry() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no action.execute? how is it checking shouldRetry?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started writing this one and then realized shouldRetry is being tested in other UT, so it was redundant UT. Missed to remove it. Will Remove it.

ShouldRetry is being tested in testThrottlingRetryLocalMaster and testThrottlingRetryRemoteMaster. Where we are throwing Throttling exception directly and via transport exception as well and verifying that retries are being made.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok thanks.

timeoutValue,
listener,
BackoffPolicy.exponentialRandomBackoff(initialDelay.getMillis()),
ThreadPool.Names.SAME
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which is this threadpool?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per my understanding, This is basically caller's threadpool only. We will not create new threadpool for retries but it will perform the retries on the same threadpool of the caller's.

Comment on lines 115 to 151
/**
* RetryableAction for performing retires for cluster manager throttling.
*/
private class NodeMappingRefreshClusterManagerAction extends RetryableAction {

private final DiscoveryNode clusterManagerNode;
private final NodeMappingRefreshRequest request;
private static final int BASE_DELAY_MILLIS = 10;
private static final int MAX_DELAY_MILLIS = 10;

private NodeMappingRefreshClusterManagerAction(DiscoveryNode clusterManagerNode, NodeMappingRefreshRequest request) {
super(
logger,
threadPool,
TimeValue.timeValueMillis(BASE_DELAY_MILLIS),
TimeValue.timeValueMillis(Integer.MAX_VALUE), // Shard tasks are internal and don't have timeout
new ActionListener() {
@Override
public void onResponse(Object o) {}

@Override
public void onFailure(Exception e) {
logger.warn("Mapping refresh for [{}] failed due to [{}]", request.index, e.getMessage());
}
},
BackoffPolicy.exponentialEqualJitterBackoff(BASE_DELAY_MILLIS, MAX_DELAY_MILLIS),
ThreadPool.Names.SAME
);
this.clusterManagerNode = clusterManagerNode;
this.request = request;
}

@Override
public void tryAction(ActionListener listener) {
sendNodeMappingRefreshToClusterManager(clusterManagerNode, request, listener);
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would make onboarding more actions to task throttling framework more tedious. Can we abstract out all complexities and avoid creating a new class per action altogether.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally new actions which needs to be perform on master should extend TransportClusterManagerNodeAction. All this retry logic is abstracted out in it.

These two actions refresh-mapping/shard-state was not extending this TransportClusterManagerNodeAction and directly sending request to master node, so we need to add retryable logic here as well.

[NOTE]: To keep this Data node side PR clean, I am going to remove this changes of NodeMappingRefreshAction and ShardStateAction. Will add those in PR where we onboard those task in these framework.


@Inject
public NodeMappingRefreshAction(TransportService transportService, MetadataMappingService metadataMappingService) {
public NodeMappingRefreshAction(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should have raised a different PR for these changes,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes Shweta, I am going to remove this change to keep Data node side changes clean. Will add this changes into different PR where we will onboard refresh-mapping and shard-state actions into throttling framework.

…h dont use TransportClusterManagerNodeAction"

This reverts commit 4bf4e8b.

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
Comment on lines 115 to 117
/**
* RetryableAction for performing retires for cluster manager throttling.
*/
Copy link
Collaborator

@gbbafna gbbafna Sep 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we extend TransportClusterManagerNodeAction for same instead of writing another Action ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I tried to check that as well, for that we might need more changes. Will give it one more look.

Anyway I am going to remove this changes from here, will add it back when we onboard this task type to framework. We can discuss more in that PR.

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
Copy link
Member

@shwetathareja shwetathareja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes @dhwanilpatel . LGTM.

@github-actions
Copy link
Contributor

github-actions bot commented Sep 2, 2022

Gradle Check (Jenkins) Run Completed with:

…' into throttling-data-change-pr

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
@github-actions
Copy link
Contributor

github-actions bot commented Sep 2, 2022

Gradle Check (Jenkins) Run Completed with:

@shwetathareja shwetathareja merged commit 23f15a5 into opensearch-project:feature/master-task-throttling Sep 6, 2022
dhwanilpatel added a commit to dhwanilpatel/OpenSearch-1 that referenced this pull request Nov 2, 2022
* Data node changes for master task throttling

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>

* Using Retryable action for retries
* Used RemoteAddress instead of new field for checking local Request
Bukhtawar pushed a commit that referenced this pull request Nov 4, 2022
Basic Throttler Framework / Exponential Basic back off policy. 
Add basic thorttler/exponential backoff policy for retry/Defination o… #3527
Changes required in Master node to perform throttling.
Master node changes for master task throttling #3882
Changes required in Data node to perform retry on throttling.
Data node changes for master task throttling #4204
Provide support for all task type in throttling framework.
Onboarding of few task types to throttling #4542
Integration Tests (Fix timeout exception and Add Integ test for Master task throttling #4588

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
dhwanilpatel added a commit to dhwanilpatel/OpenSearch-1 that referenced this pull request Nov 4, 2022
…t#4986)

Basic Throttler Framework / Exponential Basic back off policy.
Add basic thorttler/exponential backoff policy for retry/Defination o… opensearch-project#3527
Changes required in Master node to perform throttling.
Master node changes for master task throttling opensearch-project#3882
Changes required in Data node to perform retry on throttling.
Data node changes for master task throttling opensearch-project#4204
Provide support for all task type in throttling framework.
Onboarding of few task types to throttling opensearch-project#4542
Integration Tests (Fix timeout exception and Add Integ test for Master task throttling opensearch-project#4588

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
dhwanilpatel added a commit to dhwanilpatel/OpenSearch-1 that referenced this pull request Nov 4, 2022
…t#4986)

Basic Throttler Framework / Exponential Basic back off policy.
Add basic thorttler/exponential backoff policy for retry/Defination o… opensearch-project#3527
Changes required in Master node to perform throttling.
Master node changes for master task throttling opensearch-project#3882
Changes required in Data node to perform retry on throttling.
Data node changes for master task throttling opensearch-project#4204
Provide support for all task type in throttling framework.
Onboarding of few task types to throttling opensearch-project#4542
Integration Tests (Fix timeout exception and Add Integ test for Master task throttling opensearch-project#4588

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
dhwanilpatel added a commit to dhwanilpatel/OpenSearch-1 that referenced this pull request Nov 4, 2022
…t#4986)

Basic Throttler Framework / Exponential Basic back off policy.
Add basic thorttler/exponential backoff policy for retry/Defination o… opensearch-project#3527
Changes required in Master node to perform throttling.
Master node changes for master task throttling opensearch-project#3882
Changes required in Data node to perform retry on throttling.
Data node changes for master task throttling opensearch-project#4204
Provide support for all task type in throttling framework.
Onboarding of few task types to throttling opensearch-project#4542
Integration Tests (Fix timeout exception and Add Integ test for Master task throttling opensearch-project#4588

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
dhwanilpatel added a commit to dhwanilpatel/OpenSearch-1 that referenced this pull request Nov 4, 2022
…t#4986)

Basic Throttler Framework / Exponential Basic back off policy.
Add basic thorttler/exponential backoff policy for retry/Defination o… opensearch-project#3527
Changes required in Master node to perform throttling.
Master node changes for master task throttling opensearch-project#3882
Changes required in Data node to perform retry on throttling.
Data node changes for master task throttling opensearch-project#4204
Provide support for all task type in throttling framework.
Onboarding of few task types to throttling opensearch-project#4542
Integration Tests (Fix timeout exception and Add Integ test for Master task throttling opensearch-project#4588

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
dhwanilpatel added a commit to dhwanilpatel/OpenSearch-1 that referenced this pull request Nov 4, 2022
…t#4986)

Basic Throttler Framework / Exponential Basic back off policy.
Add basic thorttler/exponential backoff policy for retry/Defination o… opensearch-project#3527
Changes required in Master node to perform throttling.
Master node changes for master task throttling opensearch-project#3882
Changes required in Data node to perform retry on throttling.
Data node changes for master task throttling opensearch-project#4204
Provide support for all task type in throttling framework.
Onboarding of few task types to throttling opensearch-project#4542
Integration Tests (Fix timeout exception and Add Integ test for Master task throttling opensearch-project#4588

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
dhwanilpatel added a commit to dhwanilpatel/OpenSearch-1 that referenced this pull request Nov 4, 2022
…t#4986)

Basic Throttler Framework / Exponential Basic back off policy.
Add basic thorttler/exponential backoff policy for retry/Defination o… opensearch-project#3527
Changes required in Master node to perform throttling.
Master node changes for master task throttling opensearch-project#3882
Changes required in Data node to perform retry on throttling.
Data node changes for master task throttling opensearch-project#4204
Provide support for all task type in throttling framework.
Onboarding of few task types to throttling opensearch-project#4542
Integration Tests (Fix timeout exception and Add Integ test for Master task throttling opensearch-project#4588

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
Bukhtawar pushed a commit that referenced this pull request Nov 4, 2022
* Add basic thorttler/exponential backoff policy for retry/Defination of throttling exception (#3856)

* Corrected Java doc for Throttler

* Changed the default behaviour of Throttler to return Optional

* Removed generics from Throttler and used String as key

* Ignore backport / autocut / dependabot branches for gradle checks on push

* Master node changes for master task throttling (#3882)

* Data node changes for master task throttling (#4204)

* Onboarding of few task types to throttling (#4542)

* Fix timeout exception and Add Integ test for Master task throttling (#4588)

* Complete TODO for version change and removed unused classes(Throttler and Semaphore) (#4846)

* Remove V1 version from throttling testcase

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
Bukhtawar pushed a commit that referenced this pull request Nov 4, 2022
Basic Throttler Framework / Exponential Basic back off policy.
Add basic thorttler/exponential backoff policy for retry/Defination o… #3527
Changes required in Master node to perform throttling.
Master node changes for master task throttling #3882
Changes required in Data node to perform retry on throttling.
Data node changes for master task throttling #4204
Provide support for all task type in throttling framework.
Onboarding of few task types to throttling #4542
Integration Tests (Fix timeout exception and Add Integ test for Master task throttling #4588

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants