[Remote Store] Add segment transfer timeout dynamic setting #13679

Merged
sachinpkale merged 12 commits into opensearch-project:main from segment-upload-timeout on May 23, 2024

Conversation

@linuxpi linuxpi commented May 15, 2024

Description

  • Segment uploads for Remote Store run without any timeout, so a thread can block on latch.await() forever if an error is silently swallowed by the code.
  • Because the future is never completed, the shard stays stuck and can never retry the upload.
  • As a result, the remote time lag keeps increasing and does not come down until the process is restarted.
  • This PR adds a timeout to segment uploads so that the shard recovers automatically from such situations once the timeout elapses.
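
For illustration only, the failure mode and the fix described above can be sketched with a plain CountDownLatch. This is not the code from this PR: the class name SegmentUploadTimeoutSketch and the helper awaitUpload are hypothetical, and the timeout value is arbitrary. It only shows how a bounded latch.await(timeout, unit) lets the caller regain control when the completion callback never fires.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not the OpenSearch implementation.
public class SegmentUploadTimeoutSketch {

    /**
     * Waits for the upload latch, but gives up after timeoutMillis.
     * Returns true if the latch reached zero, false on timeout or interrupt.
     */
    public static boolean awaitUpload(CountDownLatch latch, long timeoutMillis) {
        try {
            // CountDownLatch.await(long, TimeUnit) returns false if the
            // timeout elapses before the count reaches zero.
            return latch.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can still observe it.
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Case 1: an error was swallowed, so nobody ever counts the latch down.
        // An unbounded latch.await() would block here indefinitely.
        CountDownLatch stuck = new CountDownLatch(1);
        System.out.println("stuck upload completed: " + awaitUpload(stuck, 100));

        // Case 2: the upload callback fired normally.
        CountDownLatch finished = new CountDownLatch(1);
        finished.countDown();
        System.out.println("normal upload completed: " + awaitUpload(finished, 100));
    }
}
```

When awaitUpload returns false, the caller can fail the pending upload and let the next refresh retry it, instead of leaving the shard stuck.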

Related Issues

Resolves #13783

Check List

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Varun Bansal <bansvaru@amazon.com>

❌ Gradle check result for 8f59a8d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


❌ Gradle check result for 85416eb: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

linuxpi added 2 commits May 20, 2024 12:45
…rrupted latch.await

Signed-off-by: Varun Bansal <bansvaru@amazon.com>
Signed-off-by: Varun Bansal <bansvaru@amazon.com>

❌ Gradle check result for a4ac198: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


linuxpi commented May 22, 2024

Please add changelog entry, fix the gradle build and complete the check-list in PR description.

Thanks @sachinpkale for approving. I've added the necessary details and addressed your comment.

@linuxpi linuxpi self-assigned this May 22, 2024
@github-actions github-actions bot added the enhancement, Storage, Storage:Resiliency, and v2.15.0 labels May 22, 2024
@linuxpi linuxpi added the backport 2.x label and removed the enhancement and v2.15.0 labels May 22, 2024

❕ Gradle check result for 251ecd0: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@sachinpkale sachinpkale merged commit b3049fb into opensearch-project:main May 23, 2024
32 of 35 checks passed
@opensearch-trigger-bot

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.x
# Create a new branch
git switch --create backport/backport-13679-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 b3049fb5ec860f63fdfe33c5b176869b1e4255e6
# Push it to GitHub
git push --set-upstream origin backport/backport-13679-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-13679-to-2.x.

@linuxpi linuxpi deleted the segment-upload-timeout branch May 23, 2024 07:29
linuxpi added a commit to linuxpi/OpenSearch that referenced this pull request May 23, 2024
…ch-project#13679)

* [Remote Store] Add segment transfer timeout dynamic setting

Signed-off-by: Varun Bansal <bansvaru@amazon.com>
(cherry picked from commit b3049fb)
linuxpi added a commit to linuxpi/OpenSearch that referenced this pull request May 24, 2024
…ch-project#13679)

* [Remote Store] Add segment transfer timeout dynamic setting

Signed-off-by: Varun Bansal <bansvaru@amazon.com>
(cherry picked from commit b3049fb)
sachinpkale pushed a commit that referenced this pull request May 24, 2024
…13793)

* [Remote Store] Add segment transfer timeout dynamic setting

Signed-off-by: Varun Bansal <bansvaru@amazon.com>
(cherry picked from commit b3049fb)
parv0201 pushed a commit to parv0201/OpenSearch that referenced this pull request Jun 10, 2024
…ch-project#13679)

* [Remote Store] Add segment transfer timeout dynamic setting

Signed-off-by: Varun Bansal <bansvaru@amazon.com>
kkewwei pushed a commit to kkewwei/OpenSearch that referenced this pull request Jul 24, 2024
…ch-project#13679) (opensearch-project#13793)

* [Remote Store] Add segment transfer timeout dynamic setting

Signed-off-by: Varun Bansal <bansvaru@amazon.com>
(cherry picked from commit b3049fb)
Signed-off-by: kkewwei <kkewwei@163.com>
Labels
backport 2.x, backport-failed, Storage:Resiliency, Storage
Projects
Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

[Remote Store] Add support to timeout segment uploads
4 participants