Data node changes for master task throttling #4204
Conversation
Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
Gradle Check (Jenkins) Run Completed with:
Gradle check is failing on the bwc distribution task. @dhwanilpatel: I think rebasing your changes against main should solve this issue.
@@ -110,6 +114,11 @@ public final Request masterNodeTimeout(String timeout) {
         return clusterManagerNodeTimeout(timeout);
     }

+    public final Request setRemoteRequest(boolean remoteRequest) {
+        this.remoteRequest = remoteRequest;
I am still unclear why we need this flag. The action knows where it should be executed; the retry listener should just re-run the same action on the same threadpool after the scheduled delay.
We need this distinction for requests arriving at the master node because the same code block (in TransportClusterManagerNodeAction) executes for both cases (local/remote master), and our retry logic sits on top of it.
Using this flag we determine whether the request was generated on the local node or on a remote node. If it is the local node's request, we need to perform the retries on this node. If it is a remote node's request, we will not retry on this node and will let the remote node perform the retries.
If the request comes from a remote data node, the data node sets the remoteRequest flag in {@link MasterNodeRequest} and sends the request to the master; on the master node that flag tells us whether the request was a local request or a remote request.
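A hedged fragment of that flow; only setRemoteRequest comes from the diff above, the surrounding wiring and names are illustrative placeholders:

// On the data node, before forwarding the request to a remote cluster manager:
if (localNodeIsElectedClusterManager == false) {   // placeholder check: "is this node the cluster manager?"
    request.setRemoteRequest(true);                 // mark the request as generated on a remote (data) node
}
transportService.sendRequest(clusterManagerNode, actionName, request, handler);
// On the master, the retry listener can then skip retries for requests whose
// remoteRequest flag is set and leave the retries to the sending node.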
public void start() {
    ClusterState state = clusterService.state();
    logger.trace("starting processing request [{}] with cluster state version [{}]", request, state.version());
    doStart(state);
}
why do we need start? can't we log directly in doStart itself?
if (task != null) {
    request.setParentTask(clusterService.localNode().getId(), task.getId());
}
-new AsyncSingleAction(task, request, listener).doStart(state);
+new AsyncSingleAction(task, request, listener).start();
what are we doing for cases like the one below?
OpenSearch/server/src/main/java/org/opensearch/cluster/action/shard/ShardStateAction.java
Line 192 in 3a97d4c
transportService.sendRequest(clusterManagerNode, actionName, request, new EmptyTransportResponseHandler(ThreadPool.Names.SAME) {
Here on the sender side, TransportClusterManagerNodeAction is not extended.
Good point. I scanned the code and found a similar case for NodeMappingRefreshAction as well. Let me know if I am missing it anywhere else.
We will need to plug the retry logic in here as well. Once we have finalized the approach using RetryableAction/ThrottlingRetryListener, I will add that here too.
Until then I will keep this comment open for tracking.
I have incorporated the high-level comments. @shwetathareja / @Bukhtawar / @gbbafna please provide your thoughts on it. Will add relevant UTs and cleanup.
Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
Gradle Check (Jenkins) Run Completed with:
@Override
public boolean shouldRetry(Exception e) {
    if (localRequest) {
how is the remote node retrying MasterTaskThrottlingException now?
When a request is made to a remote master node, we set the remoteRequest flag on it and send it to the master.
On a throttling exception, the master will not retry (based on this check) and will let the exception flow back to the data node, which will perform the retry.
Since the same code block runs for both the remote and the local master case, we need this segregation.
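A simplified sketch of the check being discussed (per the tests mentioned later, the real method may also need to unwrap transport-wrapped exceptions):

@Override
public boolean shouldRetry(Exception e) {
    // Retry only when the request originated on this node; for a remote data
    // node's request return false, so the MasterTaskThrottlingException flows
    // back to the sender, which owns the retries.
    return localRequest && e instanceof MasterTaskThrottlingException;
}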
Since RetryableAction triggers the final listener only after all the retries have been exhausted, is that why there is a need to differentiate a local vs. remote call? Can we differentiate by checking the transport request's source node instead of changing the request object?
Sure, made changes to rely on the remoteAddress of the request, which will be null for a local request; for a remote request it will hold the remote node's transport address.
Removed the new field from the request.
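A minimal sketch of that check, assuming the usual TransportRequest.remoteAddress() behaviour (null for a request built on the local node, the sender's transport address for a request received over the wire):

boolean localRequest = request.remoteAddress() == null;
if (localRequest) {
    // this node generated the request, so it performs the throttling retries
} else {
    // the request came from a data node; let the exception propagate back so
    // that node can retry
}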
Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
Gradle Check (Jenkins) Run Completed with:
Gradle Check (Jenkins) Run Completed with:
Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
Gradle Check (Jenkins) Run Completed with:
Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
Gradle Check (Jenkins) Run Completed with:
@Override
public TimeValue next() {
    TimeValue delayToReturn = TimeValue.timeValueMillis(Randomness.get().nextInt(Math.toIntExact(currentDelay)) + 1);
    currentDelay = Math.min(2 * currentDelay, Integer.MAX_VALUE);
shouldn't this (the currentDelay calculation) be the first statement in the method?
We want to first compute the random delay from the current delay, which is what we return for the current call, and only then double it and store it in currentDelay for the next call.
That's why we first do the return-delay calculation and then update currentDelay to double its value.
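A small self-contained sketch of that ordering, using plain java.util.Random in place of OpenSearch's Randomness purely for illustration:

import java.util.Random;

final class ExponentialRandomBackoffSketch {
    private final Random random = new Random();
    private long currentDelay;

    ExponentialRandomBackoffSketch(long initialDelayMillis) {
        this.currentDelay = initialDelayMillis;
    }

    // Returns a jittered delay in [1, currentDelay] and only afterwards doubles
    // currentDelay (capped at Integer.MAX_VALUE) for the next invocation.
    long nextDelayMillis() {
        long delayToReturn = random.nextInt(Math.toIntExact(currentDelay)) + 1;
        currentDelay = Math.min(2 * currentDelay, Integer.MAX_VALUE);
        return delayToReturn;
    }
}

Swapping the two statements would draw the jitter from the already doubled delay, which is exactly the behaviour being avoided here.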
…se TransportClusterManagerNodeAction Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
Gradle Check (Jenkins) Run Completed with:
    assertFalse(exception.get());
}

public void testShouldRetry() {
there is no action.execute here? how is it checking shouldRetry?
I started writing this one and then realized shouldRetry is already being tested in another UT, so this one was redundant. I missed removing it; will remove it.
shouldRetry is covered by testThrottlingRetryLocalMaster and testThrottlingRetryRemoteMaster, where we throw the throttling exception both directly and wrapped in a transport exception and verify that retries are made.
ok thanks.
    timeoutValue,
    listener,
    BackoffPolicy.exponentialRandomBackoff(initialDelay.getMillis()),
    ThreadPool.Names.SAME
which threadpool is this?
As per my understanding, this is basically the caller's threadpool. We do not create a new threadpool for retries; the retries run on the caller's threadpool.
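Illustrative only, not a quote from the PR: ThreadPool.Names.SAME maps to OpenSearch's direct executor, so there is no hand-off to a dedicated retry pool; the scheduled work simply runs on whichever thread invokes it once the delay fires. retryRunnable and delay below are assumed placeholder variables.

// With ThreadPool.Names.SAME the scheduled task is executed via the direct
// executor instead of being queued on a separate pool.
threadPool.schedule(retryRunnable, delay, ThreadPool.Names.SAME);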
/**
 * RetryableAction for performing retries for cluster manager throttling.
 */
private class NodeMappingRefreshClusterManagerAction extends RetryableAction {

    private final DiscoveryNode clusterManagerNode;
    private final NodeMappingRefreshRequest request;
    private static final int BASE_DELAY_MILLIS = 10;
    private static final int MAX_DELAY_MILLIS = 10;

    private NodeMappingRefreshClusterManagerAction(DiscoveryNode clusterManagerNode, NodeMappingRefreshRequest request) {
        super(
            logger,
            threadPool,
            TimeValue.timeValueMillis(BASE_DELAY_MILLIS),
            TimeValue.timeValueMillis(Integer.MAX_VALUE), // Shard tasks are internal and don't have a timeout
            new ActionListener() {
                @Override
                public void onResponse(Object o) {}

                @Override
                public void onFailure(Exception e) {
                    logger.warn("Mapping refresh for [{}] failed due to [{}]", request.index, e.getMessage());
                }
            },
            BackoffPolicy.exponentialEqualJitterBackoff(BASE_DELAY_MILLIS, MAX_DELAY_MILLIS),
            ThreadPool.Names.SAME
        );
        this.clusterManagerNode = clusterManagerNode;
        this.request = request;
    }

    @Override
    public void tryAction(ActionListener listener) {
        sendNodeMappingRefreshToClusterManager(clusterManagerNode, request, listener);
    }
This would make onboarding more actions to the task throttling framework more tedious. Can we abstract out all the complexity and avoid creating a new class per action altogether?
Ideally, new actions which need to be performed on the master should extend TransportClusterManagerNodeAction; all this retry logic is abstracted out in it.
These two actions (refresh-mapping/shard-state) were not extending TransportClusterManagerNodeAction and were sending requests directly to the master node, so we need to add the retry logic here as well.
[NOTE]: To keep this data node side PR clean, I am going to remove these changes to NodeMappingRefreshAction and ShardStateAction. Will add them in the PR where we onboard those tasks onto this framework.
    @Inject
-   public NodeMappingRefreshAction(TransportService transportService, MetadataMappingService metadataMappingService) {
+   public NodeMappingRefreshAction(
Should have raised a different PR for these changes.
Yes Shweta, I am going to remove this change to keep the data node side changes clean. Will add these changes in a different PR where we onboard the refresh-mapping and shard-state actions onto the throttling framework.
…h dont use TransportClusterManagerNodeAction" This reverts commit 4bf4e8b. Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
/**
 * RetryableAction for performing retries for cluster manager throttling.
 */
can we extend TransportClusterManagerNodeAction for this instead of writing another Action?
Actually I tried checking that as well; for that we might need more changes. Will give it one more look.
Anyway, I am going to remove these changes from here and will add them back when we onboard this task type onto the framework. We can discuss more in that PR.
Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
Thanks for the changes @dhwanilpatel. LGTM.
Gradle Check (Jenkins) Run Completed with:
…' into throttling-data-change-pr Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
Gradle Check (Jenkins) Run Completed with:
Merged 23f15a5 into opensearch-project:feature/master-task-throttling
* Data node changes for master task throttling
Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
* Using Retryable action for retries
* Used RemoteAddress instead of new field for checking local Request
Basic Throttler Framework / Exponential Basic back off policy: Add basic thorttler/exponential backoff policy for retry/Defination o… #3527
Changes required in Master node to perform throttling: Master node changes for master task throttling #3882
Changes required in Data node to perform retry on throttling: Data node changes for master task throttling #4204
Provide support for all task types in throttling framework: Onboarding of few task types to throttling #4542
Integration Tests: Fix timeout exception and Add Integ test for Master task throttling #4588
Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
* Add basic thorttler/exponential backoff policy for retry/Defination of throttling exception (#3856)
* Corrected Java doc for Throttler
* Changed the default behaviour of Throttler to return Optional
* Removed generics from Throttler and used String as key
* Ignore backport / autocut / dependabot branches for gradle checks on push
* Master node changes for master task throttling (#3882)
* Data node changes for master task throttling (#4204)
* Onboarding of few task types to throttling (#4542)
* Fix timeout exception and Add Integ test for Master task throttling (#4588)
* Complete TODO for version change and removed unused classes (Throttler and Semaphore) (#4846)
* Remove V1 version from throttling testcase
Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
Data node side changes for master task throttling
Signed-off-by: Dhwanil Patel dhwanip@amazon.com
Description
This is one of multiple PRs planned for master task throttling. In this PR we make changes to TransportClusterManagerNodeAction and introduce MasterTaskThrottlingRetryListener.
MasterTaskThrottlingRetryListener: a new action listener which listens for the throttling exception from the master node and performs the retries with exponential backoff. It also takes care of tasks timing out during retries.
TransportClusterManagerNodeAction: plugs in the MasterTaskThrottlingRetryListener while sending tasks to the master so it can perform the retries on master throttling.
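A minimal sketch of the retry-listener idea described above, with assumed names and wiring; the real MasterTaskThrottlingRetryListener differs, and its timeout handling is omitted here. Imports assume this branch's package layout.

import java.util.Iterator;

import org.opensearch.action.ActionListener;
import org.opensearch.common.unit.TimeValue;
import org.opensearch.threadpool.ThreadPool;

// MasterTaskThrottlingException import omitted; its package is taken from this PR.
class ThrottlingRetrySketch<R> implements ActionListener<R> {
    private final ActionListener<R> delegate;         // the caller's original listener
    private final Runnable retryAction;               // re-submits the request to the cluster manager
    private final Iterator<TimeValue> backoffDelays;  // e.g. exponential backoff with jitter
    private final ThreadPool threadPool;

    ThrottlingRetrySketch(ActionListener<R> delegate, Runnable retryAction,
                          Iterator<TimeValue> backoffDelays, ThreadPool threadPool) {
        this.delegate = delegate;
        this.retryAction = retryAction;
        this.backoffDelays = backoffDelays;
        this.threadPool = threadPool;
    }

    @Override
    public void onResponse(R response) {
        delegate.onResponse(response);                // success needs no special handling
    }

    @Override
    public void onFailure(Exception e) {
        // Retry only throttling failures, and only while backoff delays remain
        // (a real implementation would also respect the task timeout).
        if (e instanceof MasterTaskThrottlingException && backoffDelays.hasNext()) {
            threadPool.schedule(retryAction, backoffDelays.next(), ThreadPool.Names.SAME);
        } else {
            delegate.onFailure(e);
        }
    }
}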
Issues Resolved
Relates : #479
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.