Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Remote Segments] Retry segment uploads to remote store on failure #7400

Merged
merged 10 commits into from
May 9, 2023

Conversation

ashking94
Copy link
Member

@ashking94 ashking94 commented May 3, 2023

Description

This PR does the following -

  • Introduces retry on failed sync segment attempts during remote refresh.
  • Uses dedicated threadpool for RemoteStoreRefreshListener.afterRefresh(didRefresh)
  • Integ test that introduces random failures during remote store operations and retries to sync the segments from local to remote store.

Related Issues

Resolves #7363

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Ashish Singh <ssashish@amazon.com>
@ashking94 ashking94 self-assigned this May 3, 2023
@ashking94 ashking94 added Storage:Durability Issues and PRs related to the durability framework skip-changelog v2.8.0 'Issues and PRs related to version v2.8.0' labels May 3, 2023
@github-actions
Copy link
Contributor

github-actions bot commented May 3, 2023

Gradle Check (Jenkins) Run Completed with:

Signed-off-by: Ashish Singh <ssashish@amazon.com>
@ashking94
Copy link
Member Author

ashking94 commented May 4, 2023

FYI - One of the other approach is to make changes in scheduleRefresh() method of IndexShard class.

public boolean scheduledRefresh() {
verifyNotClosed();
boolean listenerNeedsRefresh = refreshListeners.refreshNeeded();
if (isReadAllowed() && (listenerNeedsRefresh || getEngine().refreshNeeded())) {
if (listenerNeedsRefresh == false // if we have a listener that is waiting for a refresh we need to force it
&& isSearchIdle()
&& indexSettings.isExplicitRefresh() == false
&& indexSettings.isSegRepEnabled() == false
// Indices with segrep enabled will never wait on a refresh and ignore shard idle. Primary shards push out new segments only
// after a refresh, so we don't want to wait for a search to trigger that cycle. Replicas will only refresh after receiving
// a new set of segments.
&& active.get()) { // it must be active otherwise we might not free up segment memory once the shard became inactive
// lets skip this refresh since we are search idle and
// don't necessarily need to refresh. the next searcher access will register a refreshListener and that will
// cause the next schedule to refresh.
final Engine engine = getEngine();
engine.maybePruneDeletes(); // try to prune the deletes in the engine if we accumulated some
setRefreshPending(engine);
return false;
} else {
if (logger.isTraceEnabled()) {
logger.trace("refresh with source [schedule]");
}
return getEngine().maybeRefresh("schedule");
}
}
final Engine engine = getEngine();
engine.maybePruneDeletes(); // try to prune the deletes in the engine if we accumulated some
return false;
}

In the first if condition can be changed to
if (isReadAllowed() && (listenerNeedsRefresh || getEngine().refreshNeeded()) || noRemoteSegmentsLag()) {
However, this is relatively more intrusive and will lead to complete refresh process getting triggered although most of it will be no-op. And on failures this will continue to get retried after every index.refresh_interval time which can be catastrophic if the retries induce more failures (throttling due to more retries).

@github-actions
Copy link
Contributor

github-actions bot commented May 4, 2023

Gradle Check (Jenkins) Run Completed with:

Signed-off-by: Ashish Singh <ssashish@amazon.com>
@github-actions
Copy link
Contributor

github-actions bot commented May 4, 2023

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testIndexDeletionDuringSnapshotCreationInQueue
      1 org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness

@codecov-commenter
Copy link

codecov-commenter commented May 4, 2023

Codecov Report

Merging #7400 (b7be6f2) into main (01e5c29) will decrease coverage by 0.12%.
The diff coverage is 75.53%.

📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

@@             Coverage Diff              @@
##               main    #7400      +/-   ##
============================================
- Coverage     70.70%   70.59%   -0.12%     
+ Complexity    59652    59552     -100     
============================================
  Files          4877     4877              
  Lines        285806   285847      +41     
  Branches      41165    41171       +6     
============================================
- Hits         202085   201793     -292     
- Misses        67144    67453     +309     
- Partials      16577    16601      +24     
Impacted Files Coverage Δ
...opensearch/snapshots/mockstore/MockRepository.java 0.00% <0.00%> (ø)
...search/index/shard/RemoteStoreRefreshListener.java 81.88% <82.92%> (+0.57%) ⬆️
...ain/java/org/opensearch/threadpool/ThreadPool.java 81.89% <100.00%> (-0.72%) ⬇️

... and 503 files with indirect coverage changes

Signed-off-by: Ashish Singh <ssashish@amazon.com>
@github-actions
Copy link
Contributor

github-actions bot commented May 8, 2023

Gradle Check (Jenkins) Run Completed with:

@ashking94
Copy link
Member Author

org.opensearch.cluster.shards.ClusterShardLimitIT.testCreateIndexWithMaxClusterShardSetting
Looks to be a flaky test, will create an issue for the same if it does not exist.

Signed-off-by: Ashish Singh <ssashish@amazon.com>
@github-actions
Copy link
Contributor

github-actions bot commented May 8, 2023

Gradle Check (Jenkins) Run Completed with:

Signed-off-by: Ashish Singh <ssashish@amazon.com>
@github-actions
Copy link
Contributor

github-actions bot commented May 8, 2023

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

github-actions bot commented May 8, 2023

Gradle Check (Jenkins) Run Completed with:

@ashking94
Copy link
Member Author

Looks to be unrelated flaky test again.

Signed-off-by: Ashish Singh <ssashish@amazon.com>
@github-actions
Copy link
Contributor

github-actions bot commented May 8, 2023

Gradle Check (Jenkins) Run Completed with:

Signed-off-by: Ashish Singh <ssashish@amazon.com>
@github-actions
Copy link
Contributor

github-actions bot commented May 9, 2023

Gradle Check (Jenkins) Run Completed with:

@gbbafna gbbafna merged commit 769a0ea into opensearch-project:main May 9, 2023
@gbbafna gbbafna added the backport 2.x Backport to 2.x branch label May 9, 2023
@opensearch-trigger-bot
Copy link
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-7400-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 769a0ea449845ff5a462020ed674a917386ad517
# Push it to GitHub
git push --set-upstream origin backport/backport-7400-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-7400-to-2.x.

ashking94 added a commit to ashking94/OpenSearch that referenced this pull request May 23, 2023
ashking94 added a commit to ashking94/OpenSearch that referenced this pull request May 23, 2023
ashking94 added a commit to ashking94/OpenSearch that referenced this pull request May 23, 2023
ashking94 added a commit to ashking94/OpenSearch that referenced this pull request May 23, 2023
gbbafna pushed a commit that referenced this pull request May 23, 2023
indrajohn7 pushed a commit to indrajohn7/OpenSearch that referenced this pull request Jun 29, 2023
…pensearch-project#7400)

Signed-off-by: Ashish Singh <ssashish@amazon.com>
Signed-off-by: INDRAJIT BANERJEE <indrajohn7@gmail.com>
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
…pensearch-project#7400)

Signed-off-by: Ashish Singh <ssashish@amazon.com>
Signed-off-by: Shivansh Arora <hishiv@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch skip-changelog Storage:Durability Issues and PRs related to the durability framework v2.8.0 'Issues and PRs related to version v2.8.0'
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Remote Segments] Retry segment uploads to remote store on failure
4 participants