Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimizations in s3 async upload flow and guards against S3 async SDK… #11327

Merged

Conversation

vikasvb90
Copy link
Contributor

@vikasvb90 vikasvb90 commented Nov 24, 2023

Description

This PR optimizes S3 async upload flow and add guards against unexpected S3 SDK errors.

  1. Guards against access to stream after callback from SDK. Cases were observed in which SDK was trying to access stream even after callback was invoked by it. This was leading to segment fault errors in segment upload path as MMap files were being removed from memory on close and attempts to read invalid references were happening by SDK. Also, JDK 17 is not handing these errors gracefully which was leading to JVM crash.
  2. Multiple invocations to close was another cause of segment fault errors.
  3. In special case like force merge invoked with max_num_segments = 1 and in stressed environments, repeated failures in uploads were observed since at present all parts of a file are uploaded at once which eventually get timed out at different points. Temporarily, such uploads are redirected to slow sync client which provides natural backpressure by blocking the main thread on a single part upload.
  4. For, small files like translog, checkpoint files or remote cluster state, streams are wrapped with buffered stream to enable SDK retries on failures.

For points 1 and 2, a new PR is forked out : #11334

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@andrross
Copy link
Member

andrross commented Nov 25, 2023

Can we split this PR up? Seems like there are a couple of mostly unrelated changes in this PR (namely the guard stream access after callback seems unrelated to the large upload changes).

repeated failures in uploads were observed since at present all parts of a file are uploaded at once which eventually get timed out at different points. Temporarily, such uploads are redirected to slow sync client which provides natural backpressure by blocking the main thread on a single part upload.

Since this is intended as a temporary fix, can we do it without introducing a user-facing repository setting? Similar question related to the upload retry setting...is this something we want to be configurable in the long term?

@vikasvb90
Copy link
Contributor Author

@andrross
I can split if review is getting difficult. They were minor async flow changes so kept them under one umbrella.

Settings should be present due to added functionalities and these are dynamic repository settings which users can control. Tomorrow when we replace any of it then settings will just be ineffective. We won't run into any backward compatibility issue. We can call this out in CHANGELOG for informational purposes when these are replaced.

@vikasvb90 vikasvb90 force-pushed the s3_optimizations_and_sdk_fixes branch from 26350ff to 94cde48 Compare November 25, 2023 02:20
Copy link
Contributor

❌ Gradle check result for 94cde48: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@vikasvb90 vikasvb90 force-pushed the s3_optimizations_and_sdk_fixes branch from 94cde48 to cb924a1 Compare November 25, 2023 03:09
Copy link
Contributor

❕ Gradle check result for cb924a1: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.remotestore.RemoteStoreStatsIT.testNonZeroPrimaryStatsOnNewlyCreatedIndexWithZeroDocs

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Copy link

codecov bot commented Nov 25, 2023

Codecov Report

Attention: 9 lines in your changes are missing coverage. Please review.

Comparison is base (74b2d7d) 71.25% compared to head (a1ebc61) 71.38%.
Report is 1 commits behind head on main.

Files Patch % Lines
...va/org/opensearch/repositories/s3/S3BlobStore.java 50.00% 3 Missing ⚠️
...a/org/opensearch/repositories/s3/S3Repository.java 50.00% 2 Missing ⚠️
...ch/repositories/s3/async/AsyncTransferManager.java 75.00% 1 Missing and 1 partial ⚠️
...rg/opensearch/repositories/s3/S3BlobContainer.java 96.55% 0 Missing and 1 partial ⚠️
...earch/repositories/s3/async/AsyncPartsHandler.java 85.71% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #11327      +/-   ##
============================================
+ Coverage     71.25%   71.38%   +0.13%     
- Complexity    58936    58981      +45     
============================================
  Files          4890     4890              
  Lines        277447   277497      +50     
  Branches      40313    40317       +4     
============================================
+ Hits         197696   198101     +405     
+ Misses        63338    62914     -424     
- Partials      16413    16482      +69     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

❕ Gradle check result for 29d11f6: UNSTABLE

  • TEST FAILURES:
      2 org.opensearch.common.util.concurrent.QueueResizableOpenSearchThreadPoolExecutorTests.classMethod
      1 org.opensearch.common.util.concurrent.QueueResizableOpenSearchThreadPoolExecutorTests.testResizeQueueDown

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Copy link
Contributor

✅ Gradle check result for 7103f3d: SUCCESS

@vikasvb90 vikasvb90 force-pushed the s3_optimizations_and_sdk_fixes branch from 7103f3d to 9463c0a Compare November 27, 2023 06:25
Copy link
Contributor

❌ Gradle check result for 9463c0a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: vikasvb90 <vikasvb@amazon.com>
@vikasvb90 vikasvb90 force-pushed the s3_optimizations_and_sdk_fixes branch from 9463c0a to a1ebc61 Compare November 27, 2023 06:54
Copy link
Contributor

❕ Gradle check result for a1ebc61: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.search.SearchWeightedRoutingIT.testMultiGetWithNetworkDisruption_FailOpenEnabled
      1 org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@gbbafna gbbafna merged commit 77a4daf into opensearch-project:main Dec 1, 2023
34 of 35 checks passed
@gbbafna gbbafna added the backport 2.x Backport to 2.x branch label Dec 1, 2023
@opensearch-trigger-bot
Copy link
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.x
# Create a new branch
git switch --create backport/backport-11327-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 77a4daf3a13866ab1887dc4174f63ff1fb9912fb
# Push it to GitHub
git push --set-upstream origin backport/backport-11327-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-11327-to-2.x.

@vikasvb90 vikasvb90 added backport 2.x Backport to 2.x branch and removed backport 2.x Backport to 2.x branch labels Dec 4, 2023
opensearch-trigger-bot bot pushed a commit that referenced this pull request Dec 4, 2023
Signed-off-by: vikasvb90 <vikasvb@amazon.com>
(cherry picked from commit 77a4daf)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
gbbafna pushed a commit that referenced this pull request Dec 4, 2023
(cherry picked from commit 77a4daf)

Signed-off-by: vikasvb90 <vikasvb@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
deshsidd pushed a commit to deshsidd/OpenSearch that referenced this pull request Dec 11, 2023
Signed-off-by: vikasvb90 <vikasvb@amazon.com>
rayshrey pushed a commit to rayshrey/OpenSearch that referenced this pull request Mar 18, 2024
Signed-off-by: vikasvb90 <vikasvb@amazon.com>
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
Signed-off-by: vikasvb90 <vikasvb@amazon.com>
Signed-off-by: Shivansh Arora <hishiv@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch backport-failed skip-changelog v2.12.0 Issues and PRs related to version 2.12.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants