Implementing retry for remote connector to mitigate throttling issue #2462

Merged on Jun 6, 2024 (36 commits)


zhichao-aws (Member) commented May 20, 2024

Description

Please see #2438 for more details and context.
What this PR changes:

  1. Use the core class RetryableAction to implement a retry policy for the RemoteConnectorExecutor.invokeRemoteModel method, so that SageMaker throttling exceptions are retried.
    a. A new class, RetryableException, is added. We can extend it when we find other retryable cases.
  2. For sub-request cases, we need to retry only the failed sub-request instead of all sub-requests. So we refactor RemoteConnectorExecutor.executePredict and MLSdkAsyncHttpResponseHandler to use GroupedActionListener instead of CountDownLatch.
    With the previous code, onResponse/onFailure was called only after all sub-requests had a response. With the refactor, onResponse/onFailure is called for each sub-request, and onFailure further triggers the retry logic for that sub-request in RetryableAction.
  3. Add ConnectorRetryOption to control the retry behavior. Per the suggestion from @zane-neo and @ylwu-amzn, the retry settings are moved from cluster settings to ConnectorClientConfig.
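
To illustrate item 2, here is a minimal, self-contained sketch of retrying only the failed sub-request while a grouped listener waits for all of them. It does not use the OpenSearch classes: `RetryableSketchException`, `GroupedListenerSketch`, and `SubRequestRetrySketch` are hypothetical stand-ins, the loop is synchronous, and there is no backoff, so treat it as an outline of the idea rather than the code in this PR.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntFunction;

class SubRequestRetrySketch {
    // Runs one sub-request and retries only that sub-request when it fails
    // with a retryable error. Synchronous and without backoff, for brevity.
    static <T> void invokeWithRetry(int index, IntFunction<T> call, int maxRetries,
                                    GroupedListenerSketch<T> listener) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                T value = call.apply(index);
                listener.onResponse(index, value);
                return;
            } catch (RetryableSketchException e) {
                if (attempt == maxRetries) {
                    listener.onFailure(e); // retries exhausted
                }
            } catch (RuntimeException e) {
                listener.onFailure(e); // not retryable, give up immediately
                return;
            }
        }
    }

    public static void main(String[] args) {
        AtomicInteger callsToSecond = new AtomicInteger();
        GroupedListenerSketch<String> listener = new GroupedListenerSketch<>(3);
        for (int i = 0; i < 3; i++) {
            invokeWithRetry(i, idx -> {
                // Only sub-request 1 is throttled, and only on its first attempt.
                if (idx == 1 && callsToSecond.incrementAndGet() == 1) {
                    throw new RetryableSketchException("throttled");
                }
                return "r" + idx;
            }, 2, listener);
        }
        System.out.println(listener.future().join()); // prints [r0, r1, r2]
    }
}

// Hypothetical stand-in for the PR's RetryableException; not the real class.
class RetryableSketchException extends RuntimeException {
    RetryableSketchException(String message) {
        super(message);
    }
}

// Simplified stand-in for a grouped listener: completes once every
// sub-request has reported a response, or fails on the first failure.
class GroupedListenerSketch<T> {
    private final List<T> results;
    private final CompletableFuture<List<T>> done = new CompletableFuture<>();
    private int remaining;

    GroupedListenerSketch(int groupSize) {
        this.results = new ArrayList<>(Collections.<T>nCopies(groupSize, null));
        this.remaining = groupSize;
    }

    synchronized void onResponse(int index, T value) {
        results.set(index, value);
        if (--remaining == 0) {
            done.complete(new ArrayList<>(results));
        }
    }

    synchronized void onFailure(Exception e) {
        done.completeExceptionally(e);
    }

    CompletableFuture<List<T>> future() {
        return done;
    }
}
```

In this sketch the successful sub-requests are invoked exactly once, and only the throttled one is re-invoked, which is the behavior the refactor away from CountDownLatch is meant to enable.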

Note: retry is disabled by default to keep the behavior consistent with previous versions.
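
As a rough illustration of a retry option that is off by default, the sketch below pairs a maximum retry count with a backoff policy enum, including an equal-jitter delay computed in milliseconds. All names (`RetryOptionSketch`, `BackoffPolicySketch`, the field names) and the default values are assumptions made for this example; they are not taken from ConnectorRetryOption or ConnectorClientConfig.

```java
import java.util.Random;

// Hypothetical retry option holder; field names and defaults are assumptions.
final class RetryOptionSketch {
    final int maxRetryTimes;
    final long backoffMillis;
    final BackoffPolicySketch policy;

    RetryOptionSketch(int maxRetryTimes, long backoffMillis, BackoffPolicySketch policy) {
        this.maxRetryTimes = maxRetryTimes;
        this.backoffMillis = backoffMillis;
        this.policy = policy;
    }

    // Zero retries by default, so existing behavior is unchanged.
    static RetryOptionSketch defaults() {
        return new RetryOptionSketch(0, 200L, BackoffPolicySketch.CONSTANT);
    }

    boolean retryEnabled() {
        return maxRetryTimes > 0;
    }

    public static void main(String[] args) {
        RetryOptionSketch options = new RetryOptionSketch(3, 100L, BackoffPolicySketch.EXPONENTIAL_EQUAL_JITTER);
        Random random = new Random();
        for (int attempt = 0; attempt < options.maxRetryTimes; attempt++) {
            long delay = options.policy.delayMillis(options.backoffMillis, attempt, random);
            System.out.println("attempt " + attempt + " would wait " + delay + " ms");
        }
    }
}

// Hypothetical backoff policies; illustrative names, not the PR's enum.
enum BackoffPolicySketch {
    CONSTANT {
        @Override
        long delayMillis(long baseMillis, int attempt, Random random) {
            return baseMillis;
        }
    },
    EXPONENTIAL_EQUAL_JITTER {
        @Override
        long delayMillis(long baseMillis, int attempt, Random random) {
            // base * 2^attempt, with the shift capped to avoid overflow
            long exponential = baseMillis << Math.min(attempt, 20);
            long half = exponential / 2;
            // Equal jitter: keep half of the delay, randomize the other half.
            return half + (long) (random.nextDouble() * half);
        }
    };

    // Returns the delay in milliseconds before the given retry attempt (0-based).
    abstract long delayMillis(long baseMillis, int attempt, Random random);
}
```

With equal jitter, the delay for a given attempt falls between half of the exponential value and the full value, which spreads out retries from concurrent callers while still guaranteeing a minimum wait.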

Issues Resolved

#2438

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: zhichao-aws <zhichaog@amazon.com>
@zane-neo zane-neo merged commit 399825f into opensearch-project:main Jun 6, 2024
12 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jun 6, 2024
…2462)

* use retryable action; execution context

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* change to groupedActionListener

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* fix group

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* retry policy

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* base time

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* retry option, cluster settings

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* nit

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* lint

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* change interface to class

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* fix ut due to code change

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* license header

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* add ut

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* add test

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* fix core interface

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* test

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* license header

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* use exception holder

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* add max retry times settings

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* fix typo

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* nit

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* change the order to avoid misleading log

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* license header

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* move settings to connector

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* remove settings

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* add test

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* add retry_backoff_policy setting

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* changes for comments

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* fix retry times

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* make the error handling more neat in MLSdkAsyncHttpResponseHandler

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* change to SageMakerThrottlingException

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* use enum for retry backoff policy

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* fix seconds to milliseconds in equal jitter policy

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* disable retry by default

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

---------

Signed-off-by: zhichao-aws <zhichaog@amazon.com>
(cherry picked from commit 399825f)
ylwu-amzn pushed a commit that referenced this pull request Jun 6, 2024
…2462) (#2509)

Signed-off-by: zhichao-aws <zhichaog@amazon.com>
(cherry picked from commit 399825f)

Co-authored-by: zhichao-aws <zhichaog@amazon.com>
opensearch-trigger-bot bot pushed a commit that referenced this pull request Sep 30, 2024
…2462)

Signed-off-by: zhichao-aws <zhichaog@amazon.com>
(cherry picked from commit 399825f)
dhrubo-os pushed a commit that referenced this pull request Sep 30, 2024
…2462) (#3013)

Signed-off-by: zhichao-aws <zhichaog@amazon.com>
(cherry picked from commit 399825f)

Co-authored-by: zhichao-aws <zhichaog@amazon.com>