Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor translog download flow and Add support to run SegRep integ tests using remote store settings #6405

Conversation

sachinpkale
Copy link
Member

@sachinpkale sachinpkale commented Feb 21, 2023

Description

  • We want to test various scenarios involving replica promotion, peer recovery, index close and re-open with remote store.
  • These scenarios are already being tested in various integ tests of Segment Replication.
  • In this change, we re-use the same tests by enabling remote store settings.

Issues Resolved

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@sachinpkale sachinpkale force-pushed the refactor-translog-download branch from 9b3a4f0 to ed57c70 Compare March 1, 2023 07:45
@github-actions
Copy link
Contributor

github-actions bot commented Mar 1, 2023

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

github-actions bot commented Mar 1, 2023

Gradle Check (Jenkins) Run Completed with:

@sachinpkale
Copy link
Member Author

Following test is failing, looking into it.

[org.opensearch.index.translog.RemoteFSTranslogTests.testMetadataFileDeletion]

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@sachinpkale sachinpkale force-pushed the refactor-translog-download branch from 69172da to 897d263 Compare March 20, 2023 07:33
@sachinpkale sachinpkale requested review from andrross and removed request for adnapibar March 20, 2023 07:44
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      2 org.opensearch.cluster.service.MasterServiceTests.classMethod
      1 org.opensearch.indices.replication.SegmentReplicationRelocationIT.testPrimaryRelocation
      1 org.opensearch.cluster.service.MasterServiceTests.testThrottlingForMultipleTaskTypes
      1 org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness

@codecov-commenter
Copy link

codecov-commenter commented Mar 20, 2023

Codecov Report

Merging #6405 (0fa7a9c) into main (8c9cfdc) will increase coverage by 0.11%.
The diff coverage is 76.47%.

📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

@@             Coverage Diff              @@
##               main    #6405      +/-   ##
============================================
+ Coverage     70.73%   70.85%   +0.11%     
- Complexity    59281    59313      +32     
============================================
  Files          4812     4812              
  Lines        283614   283641      +27     
  Branches      40896    40900       +4     
============================================
+ Hits         200606   200962     +356     
+ Misses        66552    66201     -351     
- Partials      16456    16478      +22     
Impacted Files Coverage Δ
...java/org/opensearch/index/shard/StoreRecovery.java 67.39% <33.33%> (+1.43%) ⬆️
...in/java/org/opensearch/index/shard/IndexShard.java 69.83% <53.33%> (-0.10%) ⬇️
...va/org/opensearch/index/engine/ReadOnlyEngine.java 72.34% <66.66%> (+0.19%) ⬆️
...dex/translog/transfer/TranslogTransferManager.java 81.43% <87.50%> (-1.18%) ⬇️
...rg/opensearch/index/translog/RemoteFsTranslog.java 76.40% <95.23%> (+0.77%) ⬆️
...anslog/RemoteBlobStoreInternalTranslogFactory.java 76.92% <100.00%> (+1.92%) ⬆️

... and 459 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

Comment on lines 4380 to 4386
public void syncTranslogFilesFromRemoteTranslog() throws IOException {
TranslogFactory translogFactory = translogFactorySupplier.apply(indexSettings, shardRouting);
assert translogFactory instanceof RemoteBlobStoreInternalTranslogFactory;
Repository repository = ((RemoteBlobStoreInternalTranslogFactory) translogFactory).getRepository();
assert repository instanceof BlobStoreRepository : "repository should be instance of BlobStoreRepository";
BlobStoreRepository blobStoreRepository = (BlobStoreRepository) repository;
FileTransferTracker fileTransferTracker = new FileTransferTracker(shardId);
TranslogTransferManager translogTransferManager = RemoteFsTranslog.buildTranslogTransferManager(
blobStoreRepository,
getThreadPool(),
shardId,
fileTransferTracker
);
RemoteFsTranslog.download(translogTransferManager, shardPath().resolveTranslog());
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This entire logic should sit in RemoteFsTranslog and IndexShard should ever interact with a common Translog interface. Otherwise every consumer will need to be aware of the underlying translog which will make the interfaces too tightly coupled.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, will make the changes accordingly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracking it here: #5679

Sachin Kale added 7 commits March 27, 2023 11:52
Signed-off-by: Sachin Kale <kalsac@amazon.com>
Signed-off-by: Sachin Kale <kalsac@amazon.com>
Signed-off-by: Sachin Kale <kalsac@amazon.com>
Signed-off-by: Sachin Kale <kalsac@amazon.com>
Signed-off-by: Sachin Kale <kalsac@amazon.com>
Signed-off-by: Sachin Kale <kalsac@amazon.com>
Signed-off-by: Sachin Kale <kalsac@amazon.com>
@sachinpkale sachinpkale force-pushed the refactor-translog-download branch from 897d263 to af69749 Compare March 27, 2023 08:07
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

Signed-off-by: Sachin Kale <kalsac@amazon.com>
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT.test {yaml=pit/10_basic/Delete all}


import static org.opensearch.test.hamcrest.OpenSearchAssertions.assertAcked;

@OpenSearchIntegTestCase.ClusterScope(scope = OpenSearchIntegTestCase.Scope.TEST, numDataNodes = 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a question - what numDataNodes = 0 means here? I checked the code base and saw multiple such occurrences.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to control number of nodes in the test cluster. That is why we provide 0 and start nodes as per the test requirement.

More on numDataNodes: https://github.com/opensearch-project/OpenSearch/blob/main/test/framework/src/main/java/org/opensearch/test/OpenSearchIntegTestCase.java#L1722

@@ -137,7 +137,9 @@ public ReadOnlyEngine(
}
if (seqNoStats == null) {
seqNoStats = buildSeqNoStats(config, lastCommittedSegmentInfos);
ensureMaxSeqNoEqualsToGlobalCheckpoint(seqNoStats);
Copy link
Member

@ashking94 ashking94 Mar 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit - would it make this code more readable, if we offload this check to within ensureMaxSeqNoEqualsToGlobalCheckpoint method -

 if (requireCompleteHistory == false || engineConfig.getIndexSettings().isRemoteTranslogStoreEnabled()) {
            return;
        }

protected void ensureMaxSeqNoEqualsToGlobalCheckpoint(final SeqNoStats seqNoStats) {
if (requireCompleteHistory == false) {
return;

@@ -186,7 +188,7 @@ protected void ensureMaxSeqNoEqualsToGlobalCheckpoint(final SeqNoStats seqNoStat
// In addition to that we only execute the check if the index the engine belongs to has been
// created after the refactoring of the Close Index API and its TransportVerifyShardBeforeCloseAction
// that guarantee that all operations have been flushed to Lucene.
assert assertMaxSeqNoEqualsToGlobalCheckpoint(seqNoStats.getMaxSeqNo(), seqNoStats.getGlobalCheckpoint());
assertMaxSeqNoEqualsToGlobalCheckpoint(seqNoStats.getMaxSeqNo(), seqNoStats.getGlobalCheckpoint());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we also change the contract of the method to return void since the return value is not being used any more?

@@ -3079,7 +3085,8 @@ public void updateGlobalCheckpointOnReplica(final long globalCheckpoint, final S
* while the global checkpoint update may have emanated from the primary when we were in that state, we could subsequently move
* to recovery finalization, or even finished recovery before the update arrives here.
*/
assert state() != IndexShardState.POST_RECOVERY && state() != IndexShardState.STARTED
assert (state() != IndexShardState.POST_RECOVERY && state() != IndexShardState.STARTED)
|| (indexSettings.isRemoteTranslogStoreEnabled() == true && state() != IndexShardState.RECOVERING)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a short comment or add to method's java doc on why this condition has been added?

@@ -141,6 +154,25 @@ public static void download(TranslogTransferManager translogTransferManager, Pat
}
}

private static void deleteTranslogFilesNotUploaded(Path location, long uploadedGeneration) throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I recall the earlier check on deleting files before downloading translog files was added as otherwise the download used to fail, right? Now we are deleting files that are greater than the max generation referenced by the remote translog metadata file. While this is really a good optimisation, I think we should check the local file's checksum and the expected checksum. If there is a story on the checksum part already, then do let me know. If not, let's follow it up in next PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On high level, comparing checksum makes sense. This is what we do for segments and can be added here as well.
The only reason I have not added this check was due to the invariant that we have: only one node will be uploading a translog file at any given time. So, if the translog file name matches, checksum must be same. Let me create a tracking issue to discuss this in more detail.

Copy link
Member Author

@sachinpkale sachinpkale Mar 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created tracking issue: #6896

Copy link
Member

@ashking94 ashking94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed on high level. Do take a look at the comments and address.

Signed-off-by: Sachin Kale <kalsac@amazon.com>
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=indices.split/30_copy_settings/Copy settings during split index}

@sachinpkale sachinpkale marked this pull request as draft April 4, 2023 13:06
@sachinpkale
Copy link
Member Author

This PR is trying to solve multiple things. Breaking it down into multiple PRs

@sachinpkale sachinpkale closed this May 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants