[Remote Store] Fix relocation failure due to transport receive timeout #10761

Merged
merged 3 commits into from
Oct 20, 2023

Conversation

ashking94
Member

Description

This PR fixes peer recovery failures caused by receive timeout exceptions during the prepare translog phase of peer recovery. This type of failure is most prominent in zero-replica, remote-store-backed indexes, but it can also happen when remote interactions are slow. This PR handles both kinds of issues.

This PR makes the following changes:

  • Removes searchIdle support for remote-store-enabled indexes. These indexes now continue to refresh and upload segments periodically at the configured refresh interval.
  • Fixes the bug where the actual translog retention was higher than the translog flush threshold size.
  • Makes the transport timeout configurable for the prepare translog phase of peer recovery. It is now controlled through the indices.recovery.internal_action_long_timeout setting, which defaults to 30 minutes.
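
For anyone who wants to tune the new timeout, here is a minimal sketch of a request body for the cluster settings API (PUT _cluster/settings). The setting name comes from the description above. The helper name build_timeout_update and the value "45m" are illustrative only, and the sketch assumes the setting can be updated dynamically through that API, which this PR description does not state.

```python
import json

# Setting name taken from the PR description.
LONG_TIMEOUT_SETTING = "indices.recovery.internal_action_long_timeout"


def build_timeout_update(value: str, persistent: bool = True) -> dict:
    """Build a request body for PUT _cluster/settings.

    Assumption: the setting is dynamically updatable via the cluster
    settings API. This helper is hypothetical and not part of the PR.
    """
    scope = "persistent" if persistent else "transient"
    return {scope: {LONG_TIMEOUT_SETTING: value}}


# "45m" is an illustrative value, not a recommendation from the PR.
body = build_timeout_update("45m")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a PUT _cluster/settings request with any HTTP client. Passing persistent=False places the value under "transient" instead.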

Related Issues

Resolves #10727

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Ashish Singh <ssashish@amazon.com>
@ashking94
Member Author

Gradle Check (Jenkins) Run Completed with:

Rebasing with main and retrying.

@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

Signed-off-by: Ashish Singh <ssashish@amazon.com>
Signed-off-by: Ashish Singh <ssashish@amazon.com>

@codecov

codecov bot commented Oct 20, 2023

Codecov Report

Merging #10761 (9589791) into main (200ad5d) will increase coverage by 0.13%.
Report is 1 commits behind head on main.
The diff coverage is 78.57%.

@@             Coverage Diff              @@
##               main   #10761      +/-   ##
============================================
+ Coverage     71.12%   71.25%   +0.13%     
- Complexity    58545    58659     +114     
============================================
  Files          4859     4859              
  Lines        276252   276280      +28     
  Branches      40191    40196       +5     
============================================
+ Hits         196473   196872     +399     
+ Misses        63347    63016     -331     
+ Partials      16432    16392      -40     
| Files | Coverage Δ |
|---|---|
| ...in/java/org/opensearch/index/shard/IndexShard.java | 70.10% <100.00%> (+0.49%) ⬆️ |
| ...search/index/translog/InternalTranslogManager.java | 66.66% <100.00%> (-4.57%) ⬇️ |
| ...rg/opensearch/index/translog/RemoteFsTranslog.java | 74.50% <100.00%> (+1.13%) ⬆️ |
| ...n/java/org/opensearch/index/translog/Translog.java | 80.51% <100.00%> (-0.11%) ⬇️ |
| .../indices/replication/common/ReplicationTarget.java | 81.25% <100.00%> (+2.30%) ⬆️ |
| ...ch/indices/recovery/PeerRecoverySourceService.java | 61.93% <50.00%> (+6.74%) ⬆️ |
| .../main/java/org/opensearch/index/IndexSettings.java | 86.40% <0.00%> (-0.39%) ⬇️ |
| .../indices/recovery/RemoteRecoveryTargetHandler.java | 89.70% <50.00%> (-2.61%) ⬇️ |
| ...nsearch/index/store/RemoteStoreFileDownloader.java | 92.15% <84.00%> (-0.87%) ⬇️ |
| ...ices/replication/RemoteStoreReplicationSource.java | 85.24% <75.00%> (-5.83%) ⬇️ |

... and 461 files with indirect coverage changes

Collaborator

@Bukhtawar Bukhtawar left a comment

LGTM

@andrross andrross merged commit a1fde65 into opensearch-project:main Oct 20, 2023
16 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Oct 20, 2023
#10761)

* [Remote Store] Fix relocation failure due to transport receive timeout

Signed-off-by: Ashish Singh <ssashish@amazon.com>

* Fix existing extended shardIdle for remote backed shards

Signed-off-by: Ashish Singh <ssashish@amazon.com>

* Incorporate PR review comments

Signed-off-by: Ashish Singh <ssashish@amazon.com>

---------

Signed-off-by: Ashish Singh <ssashish@amazon.com>
(cherry picked from commit a1fde65)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
ashking94 added a commit to ashking94/OpenSearch that referenced this pull request Oct 21, 2023
opensearch-project#10761)

* [Remote Store] Fix relocation failure due to transport receive timeout

Signed-off-by: Ashish Singh <ssashish@amazon.com>

* Fix existing extended shardIdle for remote backed shards

Signed-off-by: Ashish Singh <ssashish@amazon.com>

* Incorporate PR review comments

Signed-off-by: Ashish Singh <ssashish@amazon.com>

---------

Signed-off-by: Ashish Singh <ssashish@amazon.com>
@ashking94 ashking94 deleted the 10727 branch October 21, 2023 19:27
ashking94 added a commit to ashking94/OpenSearch that referenced this pull request Oct 21, 2023
ashking94 added a commit to ashking94/OpenSearch that referenced this pull request Oct 22, 2023
gbbafna pushed a commit that referenced this pull request Oct 22, 2023
#10761) (#10788)

* [Remote Store] Fix relocation failure due to transport receive timeout

* Fix existing extended shardIdle for remote backed shards

* Incorporate PR review comments

---------

(cherry picked from commit a1fde65)

Signed-off-by: Ashish Singh <ssashish@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
austintlee pushed a commit to austintlee/OpenSearch that referenced this pull request Oct 23, 2023
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
opensearch-project#10761)

* [Remote Store] Fix relocation failure due to transport receive timeout

Signed-off-by: Ashish Singh <ssashish@amazon.com>

* Fix existing extended shardIdle for remote backed shards

Signed-off-by: Ashish Singh <ssashish@amazon.com>

* Incorporate PR review comments

Signed-off-by: Ashish Singh <ssashish@amazon.com>

---------

Signed-off-by: Ashish Singh <ssashish@amazon.com>
Signed-off-by: Shivansh Arora <hishiv@amazon.com>
Labels
  • backport 2.x: Backport to 2.x branch
  • bug: Something isn't working
  • skip-changelog
  • Storage:Durability: Issues and PRs related to the durability framework
  • v2.12.0: Issues and PRs related to version 2.12.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] [Remote Store] Timeout on shard relocation
3 participants