[Remote Store] Download segments from remote segment store in failover flow #5579

sachinpkale
Member

@sachinpkale sachinpkale commented Dec 15, 2022

Description

  • For an index with the remote segment store enabled, if the primary goes down while the replica is lagging, the replica (the new primary) can download the segments it has not yet copied from the remote store instead of replaying the translog.
  • This speeds up failover, since translog replays are time-consuming.
  • With remote translog this is required, because the remote translog may not contain all the operations needed for a replay.
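
As a rough illustration of the idea above (not the code in this PR), the files a promoted replica needs to fetch are those present in the remote store but missing locally. The class and method names here are hypothetical, and plain string sets stand in for the remote and local segment directories.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: not part of OpenSearch, only a sketch of the selection step.
final class FailoverSegmentPlan {

    /** Returns the remote file names that are absent locally, in sorted order. */
    static List<String> segmentsToDownload(Set<String> remoteFiles, Set<String> localFiles) {
        List<String> missing = new ArrayList<>();
        // TreeSet gives a deterministic (lexicographic) iteration order.
        for (String file : new TreeSet<>(remoteFiles)) {
            if (!localFiles.contains(file)) {
                missing.add(file);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Set<String> remote = Set.of("_0.cfs", "_1.cfs", "_2.cfs", "segments_3");
        Set<String> local = Set.of("_0.cfs", "_1.cfs");
        System.out.println(segmentsToDownload(remote, local)); // prints [_2.cfs, segments_3]
    }
}
```

A name-only comparison like this is not sufficient on its own; the review discussion further down covers why same-named files also need a checksum comparison.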

Issues Resolved

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@sachinpkale
Member Author

This change is built on top of #5253. Once the refresh-level durability PR merges, we need to rebase and open this PR for review.

@sachinpkale sachinpkale changed the title Feature/remote segment store failover integration [Remote Store] Download segments from remote segment store in failover flow Dec 15, 2022
@sachinpkale sachinpkale force-pushed the feature/remote-segment-store-failover-integration branch from 3f57b9b to 5fb9bce Compare December 19, 2022 04:57
@sachinpkale sachinpkale force-pushed the feature/remote-segment-store-failover-integration branch from 0ed52cf to 4e87f36 Compare December 23, 2022 06:59
@codecov-commenter

codecov-commenter commented Dec 23, 2022

Codecov Report

Merging #5579 (88ca9ec) into main (361133c) will increase coverage by 0.09%.
The diff coverage is 88.23%.

@@             Coverage Diff              @@
##               main    #5579      +/-   ##
============================================
+ Coverage     70.96%   71.06%   +0.09%     
- Complexity    58554    58662     +108     
============================================
  Files          4760     4760              
  Lines        279515   279566      +51     
  Branches      40348    40357       +9     
============================================
+ Hits         198363   198670     +307     
+ Misses        64965    64717     -248     
+ Partials      16187    16179       -8     
| Impacted Files | Coverage Δ | |
|---|---|---|
| ...java/org/opensearch/index/shard/StoreRecovery.java | 68.23% <50.00%> (-0.41%) | ⬇️ |
| ...in/java/org/opensearch/index/shard/IndexShard.java | 71.12% <85.96%> (+0.84%) | ⬆️ |
| ...ation/OpenSearchIndexLevelReplicationTestCase.java | 90.30% <93.75%> (+0.03%) | ⬆️ |
| ...earch/index/store/RemoteSegmentStoreDirectory.java | 99.31% <100.00%> (+<0.01%) | ⬆️ |
| ...org/opensearch/index/shard/IndexShardTestCase.java | 94.05% <100.00%> (+0.26%) | ⬆️ |
| ...g/opensearch/index/analysis/CharFilterFactory.java | 0.00% <0.00%> (-100.00%) | ⬇️ |
| ...port/ResponseHandlerFailureTransportException.java | 0.00% <0.00%> (-60.00%) | ⬇️ |
| .../java/org/opensearch/node/NodeClosedException.java | 50.00% <0.00%> (-50.00%) | ⬇️ |
| ...a/org/opensearch/tasks/TaskCancelledException.java | 50.00% <0.00%> (-50.00%) | ⬇️ |
| ...opensearch/persistent/PersistentTasksExecutor.java | 22.22% <0.00%> (-44.45%) | ⬇️ |

... and 487 more

@sachinpkale sachinpkale force-pushed the feature/remote-segment-store-failover-integration branch from e6a5e13 to cb5f829 Compare December 26, 2022 05:26
@sachinpkale sachinpkale marked this pull request as ready for review December 26, 2022 05:34
@github-actions
Contributor

github-actions bot commented Jan 2, 2023

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      2 org.opensearch.cluster.service.MasterServiceTests.classMethod
      1 org.opensearch.cluster.service.MasterServiceTests.testThrottlingForMultipleTaskTypes

@@ -623,6 +634,7 @@ public void updateShardState(
if (indexSettings.isSegRepEnabled()) {
// this Shard's engine was read only, we need to update its engine before restoring local history from xlog.
assert newRouting.primary() && currentRouting.primary() == false;

Collaborator

nit: remove empty lines, here and elsewhere.

Sachin Kale added 13 commits January 2, 2023 16:01
Signed-off-by: Sachin Kale <kalsac@amazon.com>
@sachinpkale sachinpkale force-pushed the feature/remote-segment-store-failover-integration branch from 65adf0f to 08e72f6 Compare January 2, 2023 10:32

Comment on lines +4227 to +4228
logger.info("Downloaded segments: {}", downloadedSegments);
logger.info("Skipped download for segments: {}", skippedSegments);
Collaborator

nit: combine both log lines

Member Author

What do we gain by combining these log lines? Won't it impact the ability to debug?

* @param override flag to override local segment files with those in remote store
* @throws IOException if exception occurs while reading segments from remote store
*/
public void syncSegmentsFromRemoteSegmentStore(boolean override) throws IOException {
Collaborator

nit: Rename the param to be more specific

    if (checksum == CodecUtil.retrieveChecksum(indexInput)) {
        return true;
    } else {
        logger.warn("Checksum mismatch between local and remote segment file: {}, will override local file", file);
Collaborator

Could a checksum mismatch be a sign of corruption that should be flagged and would otherwise go unnoticed?

Member Author

@sachinpkale sachinpkale Jan 3, 2023

A mismatch could be due to two things:

  1. The node we are downloading files to already has a segment file with the same name that was created independently of the primary. This could happen if two primaries are creating segments.
  2. Corruption, as you pointed out.

Here, we are handling case 1. While downloading segments, if a local segment with the same name exists, we don't assume the two files are identical; we compare their checksums as well.
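
To make the check described above concrete, here is a minimal sketch of the skip-or-download decision. It is not the PR's implementation: the PR reads the checksum with Lucene's `CodecUtil.retrieveChecksum(indexInput)`, whereas this sketch assumes a CRC32 computed over in-memory bytes as a stand-in, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper: CRC32 over raw bytes stands in for the segment file checksum.
final class SegmentChecksumCheck {

    static long checksumOf(byte[] content) {
        CRC32 crc = new CRC32();
        crc.update(content);
        return crc.getValue();
    }

    /**
     * Returns true when the local copy can be reused and the download skipped.
     * A null localContent models "no local file with this name".
     */
    static boolean canSkipDownload(byte[] localContent, long remoteChecksum) {
        if (localContent == null) {
            return false; // nothing local, must download
        }
        // Same name is not enough: reuse the file only if the checksums agree.
        return checksumOf(localContent) == remoteChecksum;
    }

    public static void main(String[] args) {
        byte[] uploaded = "bytes written by the old primary".getBytes(StandardCharsets.UTF_8);
        byte[] sameName = "bytes written by a second primary".getBytes(StandardCharsets.UTF_8);
        long remoteChecksum = checksumOf(uploaded);

        System.out.println("identical copy skipped: " + canSkipDownload(uploaded, remoteChecksum));
        System.out.println("same-named file skipped: " + canSkipDownload(sameName, remoteChecksum));
        System.out.println("missing file skipped: " + canSkipDownload(null, remoteChecksum));
    }
}
```

Note that a mismatch here cannot distinguish case 1 from case 2; either way the local file is replaced by the remote copy.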

Member Author

I will create a tracking issue to add checksum to metadata file and compare checksum while reading.

Member Author

Already being tracked here: #4605

        logger.warn("Checksum mismatch between local and remote segment file: {}, will override local file", file);
    }
} catch (IOException e) {
    logger.debug("Exception while reading checksum of file: {}, this can happen if file does not exist", file);
Collaborator

Not sure we have the right handling here either.

Member Author

It is safe to just log the exception as we will go ahead with downloading the segment file from remote store.
Or are you talking about the log level?

Member Author

Changing the log level to warn and adding details about possible corruption to the log message.
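
The fallback discussed in this thread could look roughly like the sketch below: any failure to read the local checksum is logged at warn level and treated as "no usable local copy", so the caller downloads the file. This is an assumption-laden illustration, not the PR's code: it uses `java.nio.file` and CRC32 instead of Lucene's directory and `CodecUtil` APIs, prints to stderr instead of using the shard logger, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.CRC32;

// Hypothetical helper: shows only the error-handling shape, not OpenSearch APIs.
final class LocalChecksumReader {

    /** Returns true only if the local file is readable and its checksum matches. */
    static boolean localCopyMatches(Path localFile, long remoteChecksum) {
        try {
            CRC32 crc = new CRC32();
            crc.update(Files.readAllBytes(localFile));
            if (crc.getValue() == remoteChecksum) {
                return true;
            }
            System.err.println("WARN checksum mismatch for " + localFile + ", will override local file");
        } catch (IOException e) {
            // Expected when the file does not exist; could also indicate a corrupt or unreadable file.
            System.err.println("WARN could not read checksum of " + localFile + ": " + e);
        }
        return false; // caller proceeds to download from the remote store
    }

    public static void main(String[] args) {
        Path missing = Paths.get("no-such-segment-file.si");
        System.out.println(localCopyMatches(missing, 0L)); // prints false
    }
}
```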

Signed-off-by: Sachin Kale <kalsac@amazon.com>
Signed-off-by: Sachin Kale <kalsac@amazon.com>
@sachinpkale sachinpkale requested a review from Bukhtawar January 3, 2023 09:17