Use CloseableRetryableRefreshListener to drain ongoing after refresh tasks on relocation #8683

Merged
merged 7 commits into from
Jul 18, 2023

Conversation

ashking94
Member

@ashking94 ashking94 commented Jul 13, 2023

Description

Refresh listeners are asynchronous and are triggered after segments are refreshed. Today, during relocation handoff, the older primary can still be uploading segments to the remote store after the relocation has already happened. This PR introduces CloseableRetryableRefreshListener, which RemoteStoreRefreshListener and CheckpointRefreshListener now extend. A CloseableRetryableRefreshListener can be closed, and once it is closed, refreshes are guaranteed not to trigger any after-refresh operations on that listener.

In summary, the PR does the following:

  1. Introduces CloseableRetryableRefreshListener, which can be closed. Closing acquires all available permits, so later calls to void afterRefresh(boolean didRefresh) trigger no further after-refresh work.
  2. CloseableRetryableRefreshListener invokes performAfterRefresh(boolean didRefresh, boolean isRetry) synchronously on the calling thread. The performAfterRefresh method returns true if the invocation was successful and false otherwise.
  3. CloseableRetryableRefreshListener can schedule a retry if the original performAfterRefresh returns false. It retries the same performAfterRefresh after an interval returned by the implementor of the CloseableRetryableRefreshListener abstract class.
  4. CloseableRetryableRefreshListener ensures internally that at most one retry is scheduled at any given time. It also uses semaphore permits to ensure that performAfterRefresh and retry runs do not overlap.
  5. In IndexShard, the relocation method has been updated to close the refresh listeners before the handoff, which ensures consistency of the uploaded segment data.
  6. The REMOTE_REFRESH threadpool has been renamed to REMOTE_REFRESH_RETRY.
  7. The afterRefresh invocation in RemoteStoreRefreshListener remains synchronous, as before, and now also runs on the original REFRESH threadpool.
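
The close and retry behaviour described in the list above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: the class name CloseableRetryableListenerSketch, the single-permit semaphore, the retryIntervalMillis() hook, and the plain ScheduledExecutorService standing in for the REMOTE_REFRESH_RETRY threadpool are all assumptions made for the example.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/** Illustrative sketch only; hypothetical names, not the OpenSearch class. */
abstract class CloseableRetryableListenerSketch implements AutoCloseable {
    // Assumption: a single permit serialises normal runs, retries and close().
    private final Semaphore permits = new Semaphore(1);
    // Guards against more than one retry being queued at a time.
    private final AtomicBoolean retryScheduled = new AtomicBoolean(false);
    // Assumption: stands in for the REMOTE_REFRESH_RETRY threadpool.
    private final ScheduledExecutorService retryExecutor;

    CloseableRetryableListenerSketch(ScheduledExecutorService retryExecutor) {
        this.retryExecutor = retryExecutor;
    }

    /** Returns true on success; false requests a retry. */
    protected abstract boolean performAfterRefresh(boolean didRefresh, boolean isRetry);

    /** Delay before the next retry, chosen by the implementor (hypothetical hook). */
    protected abstract long retryIntervalMillis();

    /** Runs synchronously on the calling (refresh) thread. */
    public final void afterRefresh(boolean didRefresh) {
        run(didRefresh, false);
    }

    private void run(boolean didRefresh, boolean isRetry) {
        // No permit means another run is in flight or the listener is closed.
        // Simplification: a skipped run is dropped rather than retried.
        boolean acquired = permits.tryAcquire();
        boolean successful = false;
        try {
            successful = acquired && performAfterRefresh(didRefresh, isRetry);
        } finally {
            if (acquired) {
                permits.release();
            }
        }
        // Queue a retry only if this run failed and none is already pending.
        if (acquired && !successful && retryScheduled.compareAndSet(false, true)) {
            retryExecutor.schedule(() -> {
                retryScheduled.set(false);
                run(didRefresh, true);
            }, retryIntervalMillis(), TimeUnit.MILLISECONDS);
        }
    }

    /** Waits for any in-flight run, then keeps the permit so nothing runs again. */
    @Override
    public void close() {
        permits.acquireUninterruptibly();
    }
}
```

Under these assumptions, a relocation path would call close() on each listener before the handoff. The call blocks until an in-flight performAfterRefresh finishes, and any later afterRefresh call or already-queued retry fails to get a permit and does nothing.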

Related Issues

Resolves #8345

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…tasks on relocation

Signed-off-by: Ashish Singh <ssashish@amazon.com>
Signed-off-by: Ashish Singh <ssashish@amazon.com>
@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.remotestore.multipart.RemoteStoreMultipartIT.testStaleCommitDeletionWithoutInvokeFlush
      1 org.opensearch.cluster.allocation.ClusterRerouteIT.testDelayWithALargeAmountOfShards

@codecov

codecov bot commented Jul 13, 2023

Codecov Report

Merging #8683 (fabb98b) into main (064f265) will increase coverage by 0.07%.
The diff coverage is 81.08%.

@@             Coverage Diff              @@
##               main    #8683      +/-   ##
============================================
+ Coverage     70.87%   70.95%   +0.07%     
- Complexity    57201    57252      +51     
============================================
  Files          4771     4772       +1     
  Lines        270312   270352      +40     
  Branches      39505    39513       +8     
============================================
+ Hits         191590   191823     +233     
+ Misses        62619    62414     -205     
- Partials      16103    16115      +12     
| Impacted Files | Coverage Δ | |
|---|---|---|
| ...nsearch/index/shard/CheckpointRefreshListener.java | 84.61% <50.00%> (+1.28%) | ⬆️ |
| ...index/shard/CloseableRetryableRefreshListener.java | 73.33% <73.33%> (ø) | |
| ...search/index/shard/RemoteStoreRefreshListener.java | 82.82% <94.73%> (+1.96%) | ⬆️ |
| ...in/java/org/opensearch/index/shard/IndexShard.java | 70.15% <100.00%> (+0.13%) | ⬆️ |
| ...ain/java/org/opensearch/threadpool/ThreadPool.java | 82.52% <100.00%> (+0.57%) | ⬆️ |

... and 443 files with indirect coverage changes

@ashking94 ashking94 changed the title Use CloesableRetryableRefreshListener to drain ongoing after refresh tasks on relocation Use CloseableRetryableRefreshListener to drain ongoing after refresh tasks on relocation Jul 14, 2023
@ashking94 ashking94 marked this pull request as ready for review July 14, 2023 03:51
@ashking94 ashking94 requested a review from sachinpkale as a code owner July 14, 2023 03:51
Signed-off-by: Ashish Singh <ssashish@amazon.com>
Collaborator

@Bukhtawar Bukhtawar left a comment

Where are we closing the listeners?

@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness

Signed-off-by: Ashish Singh <ssashish@amazon.com>
@ashking94
Member Author

> Where are we closing the listeners?

Have added it now.

@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.search.SearchWeightedRoutingIT.testSearchAggregationWithNetworkDisruption_FailOpenEnabled
      1 org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreIT.testDropPrimaryDuringReplication
      1 org.opensearch.indices.replication.SegmentReplicationRelocationIT.testPrimaryRelocation

Signed-off-by: Ashish Singh <ssashish@amazon.com>
@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.indices.replication.SegmentReplicationIT.testScrollCreatedOnReplica

Signed-off-by: Ashish Singh <ssashish@amazon.com>
@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.indices.replication.SegmentReplicationRelocationIT.testPrimaryRelocationWithSegRepFailure

Signed-off-by: Ashish Singh <ssashish@amazon.com>
@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.remotestore.multipart.RemoteStoreMultipartIT.testStaleCommitDeletionWithInvokeFlush
      1 org.opensearch.remotestore.RemoteStoreIT.testStaleCommitDeletionWithInvokeFlush

@sachinpkale sachinpkale merged commit 2ba1157 into opensearch-project:main Jul 18, 2023
@sachinpkale sachinpkale added the backport 2.x Backport to 2.x branch label Jul 18, 2023
@opensearch-trigger-bot
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-8683-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 2ba1157947c84418234386ad5671719a99f4b889
# Push it to GitHub
git push --set-upstream origin backport/backport-8683-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-8683-to-2.x.

ashking94 added a commit to ashking94/OpenSearch that referenced this pull request Jul 19, 2023
…tasks on relocation (opensearch-project#8683)

* Use CloesableRetryableRefreshListener to drain ongoing after refresh tasks on relocation
---------
Signed-off-by: Ashish Singh <ssashish@amazon.com>
sachinpkale pushed a commit that referenced this pull request Jul 19, 2023
…tasks on relocation (#8683) (#8773)

* Use CloesableRetryableRefreshListener to drain ongoing after refresh tasks on relocation
---------
Signed-off-by: Ashish Singh <ssashish@amazon.com>
baba-devv pushed a commit to baba-devv/OpenSearch that referenced this pull request Jul 29, 2023
…tasks on relocation (opensearch-project#8683)

* Use CloesableRetryableRefreshListener to drain ongoing after refresh tasks on relocation
---------
Signed-off-by: Ashish Singh <ssashish@amazon.com>
kaushalmahi12 pushed a commit to kaushalmahi12/OpenSearch that referenced this pull request Sep 12, 2023
…tasks on relocation (opensearch-project#8683)

* Use CloesableRetryableRefreshListener to drain ongoing after refresh tasks on relocation
---------
Signed-off-by: Ashish Singh <ssashish@amazon.com>
Signed-off-by: Kaushal Kumar <ravi.kaushal97@gmail.com>
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
…tasks on relocation (opensearch-project#8683)

* Use CloesableRetryableRefreshListener to drain ongoing after refresh tasks on relocation
---------
Signed-off-by: Ashish Singh <ssashish@amazon.com>
Signed-off-by: Shivansh Arora <hishiv@amazon.com>
Labels
backport 2.x Backport to 2.x branch skip-changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Standardise & Simplify relocation hand-off for segrep and remote store
3 participants