Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Segment Replication] Adding PrimaryMode check before publishing checkpoint and processing a received checkpoint. #4157

Conversation

Rishikesh1159
Copy link
Member

Signed-off-by: Rishikesh1159 rishireddy1159@gmail.com

Description

This PR adds a check to publish checkpoint only in primary mode.

Issues Resolved

#3923

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
@github-actions
Copy link
Contributor

github-actions bot commented Aug 8, 2022

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

github-actions bot commented Aug 8, 2022

Gradle Check (Jenkins) Run Completed with:

@codecov-commenter
Copy link

codecov-commenter commented Aug 8, 2022

Codecov Report

Merging #4157 (4989d2b) into main (4751f3b) will decrease coverage by 0.02%.
The diff coverage is 100.00%.

@@             Coverage Diff              @@
##               main    #4157      +/-   ##
============================================
- Coverage     70.61%   70.58%   -0.03%     
+ Complexity    57083    57058      -25     
============================================
  Files          4604     4604              
  Lines        274552   274555       +3     
  Branches      40210    40211       +1     
============================================
- Hits         193862   193788      -74     
- Misses        64396    64539     +143     
+ Partials      16294    16228      -66     
Impacted Files Coverage Δ
...nsearch/index/shard/CheckpointRefreshListener.java 100.00% <100.00%> (ø)
...in/java/org/opensearch/index/shard/IndexShard.java 68.85% <100.00%> (-0.14%) ⬇️
...adonly/AddIndexBlockClusterStateUpdateRequest.java 0.00% <0.00%> (-75.00%) ⬇️
...readonly/TransportVerifyShardIndexBlockAction.java 9.75% <0.00%> (-73.18%) ⬇️
...a/org/opensearch/client/cluster/ProxyModeInfo.java 0.00% <0.00%> (-55.00%) ⬇️
...n/admin/indices/readonly/AddIndexBlockRequest.java 17.85% <0.00%> (-53.58%) ⬇️
.../java/org/opensearch/node/NodeClosedException.java 50.00% <0.00%> (-50.00%) ⬇️
...ava/org/opensearch/action/NoSuchNodeException.java 0.00% <0.00%> (-50.00%) ⬇️
...cluster/coordination/PendingClusterStateStats.java 20.00% <0.00%> (-48.00%) ⬇️
...adcast/BroadcastShardOperationFailedException.java 55.55% <0.00%> (-44.45%) ⬇️
... and 467 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@Rishikesh1159 Rishikesh1159 requested a review from Bukhtawar August 8, 2022 07:01
* creates a new initializing shard. The shard will will be put in its proper path under the
* here we are starting a new primary shard in PrimaryMode and testing if the shard publishes checkpoint after refresh.
*/
public void testPublishCheckpointOnPrimaryMode() throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move these segrep specific tests to SegmentReplicationIndexShardTests?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, moved to SegmentReplicationIndexShardTests in new PR.

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
@github-actions
Copy link
Contributor

github-actions bot commented Aug 8, 2022

Gradle Check (Jenkins) Run Completed with:

Comment on lines -43 to 45
if (didRefresh) {
if (didRefresh && shard.getReplicationTracker().isPrimaryMode()) {
publisher.publish(shard);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also reject checkpoints if the replica copy(where the checkpoint was sent to) is operating in primary mode.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also on that note - we will want to cancel any ongoing replication events on both sides. I've added #4136 to cover this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it also worth checking shardRouting here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added Logic to reject checkpoints if shard is in PrimaryMode in latest commit.

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
….com/Rishikesh1159/OpenSearch into segrep/checkpointPublish-PrimaryMode

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

Comment on lines 1438 to 1442
logger.trace(
() -> new ParameterizedMessage(
"Ignoring new replication checkpoint - shard is in primaryMode and cannot receive any checkpoints."
)
);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am concerned about the silent logging, ideally this should be rare, if we are seeing this quite often, it might reflect a bug. Let log with a WARN level

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I will change this to warn level.

….java class.

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@Rishikesh1159 Rishikesh1159 changed the title [Segment Replication] Adding PrimaryMode check before publishing checkpoint. [Segment Replication] Adding PrimaryMode check before publishing checkpoint and processing a received checkpoint. Aug 10, 2022
Copy link
Member

@kartg kartg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one tiny nitpick

@@ -1434,6 +1434,14 @@ public final boolean shouldProcessCheckpoint(ReplicationCheckpoint requestCheckp
logger.trace(() -> new ParameterizedMessage("Ignoring new replication checkpoint - shard is not started {}", state()));
return false;
}
if (getReplicationTracker().isPrimaryMode()) {
logger.warn(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick - You don't really need lazy logging here since:

  1. warn is a pretty high log level and is unlikely to be suppressed
  2. The actual log line is a simple string and doesn't involve any heavyweight object dumps

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes makes sense, I will remove lazy loading before merging in next commit.

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@Rishikesh1159 Rishikesh1159 merged commit 96bfd1f into opensearch-project:main Aug 15, 2022
Rishikesh1159 added a commit to Rishikesh1159/OpenSearch that referenced this pull request Aug 17, 2022
…kpoint and processing a received checkpoint. (opensearch-project#4157)

* Adding PrimaryMode check before publishing checkpoint.

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Applying spotless check

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Moving segrep specific tests to SegmentReplicationIndexShardTests.

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Adding logic and tests for rejecting checkpoints if shard is in PrimaryMode.

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Applying ./gradlew :server:spotlessApply.

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Applying ./gradlew :server:spotlessApply

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Changing log level to warn in shouldProcessCheckpoint() of IndexShard.java class.

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Removing unnecessary lazy logging in shouldProcessCheckpoint().

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
Rishikesh1159 added a commit that referenced this pull request Aug 17, 2022
…aining segment replication changes (#4243)

* [Segment Replication] Checkpoint Replay on Replica Shard (#3658)

* Adding Latest Recevied checkpoint, replay checkpoint logic along with tests

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* [Segment Replication] Wire up segment replication with peer recovery and add ITs. (#3743)

* Add null check when computing max segment version.

With segment replication enabled it is possible Lucene does not set the SegmentInfos
min segment version, leaving the default value as null.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Update peer recovery to set the translogUUID of replicas to the UUID generated on the primary.

This change updates the UUID when the translog is created to the value stored in the passed segment userdata.
This is to ensure during failover scenarios that the replica can be promoted and not have a uuid mismatch with the value stored in user data.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Wire up Segment Replication under the feature flag.

This PR wires up segment replication and adds some initial integration tests.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Add test to ensure replicas use primary translog uuid with segrep.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Update SegmentReplicationIT to assert previous commit points are valid and SegmentInfos can be built.
Fix nitpicks in PR feedback.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Fix test with Assert.fail to include a message.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Disable shard idle with segment replication. (#4118)

This change disables shard idle when segment replication is enabled.
Primary shards will only push out new segments on refresh, we do not want to block this based on search behavior.
Replicas will only refresh on an externally provided SegmentInfos, so we do not want search requests to hang waiting for a refresh.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Fix isAheadOf logic for ReplicationCheckpoint comparison (#4112)

Signed-off-by: Suraj Singh <surajrider@gmail.com>

* Fix waitUntil refresh policy for segrep enabled indices. (#4137)

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Add IndexShard#getLatestReplicationCheckpoint behind segrep enable feature flag (#4163)

* Add IndexShard#getLatestReplicationCheckpoint behind segrep enable feature flag

Signed-off-by: Suraj Singh <surajrider@gmail.com>

* Address review comment. Move tests to SegmentReplicationIndexShardTests

Signed-off-by: Suraj Singh <surajrider@gmail.com>

* Add segrep enbaled index settings in TargetServiceTests, SourceHandlerTests

Signed-off-by: Suraj Singh <surajrider@gmail.com>

* [Segment Replication] Fix OngoingSegmentReplications to key by allocation ID instead of DiscoveryNode. (#4182)

* Fix OngoingSegmentReplications to key by allocation ID instead of DiscoveryNode.

This change fixes segrep to work with multiple shards per node by keying ongoing replications on
allocation ID.  This also updates cancel methods to ensure state is properly cleared on shard cancel.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Clean up cancel methods.

Signed-off-by: Marc Handalian <handalm@amazon.com>

Signed-off-by: Marc Handalian <handalm@amazon.com>

* [Bug] [Segment Replication] Update store metadata recovery diff logic to ignore missing files causing exception (#4185)

* Update Store for segment replication dif

Signed-off-by: Poojita Raj <poojiraj@amazon.com>

Signed-off-by: Poojita Raj <poojiraj@amazon.com>

* Update recoveryDiff logic to ingore missing files causing exception on replica during copy

Signed-off-by: Suraj Singh <surajrider@gmail.com>

* Address review comments

Signed-off-by: Suraj Singh <surajrider@gmail.com>

Signed-off-by: Poojita Raj <poojiraj@amazon.com>
Signed-off-by: Suraj Singh <surajrider@gmail.com>
Co-authored-by: Poojita Raj <poojiraj@amazon.com>

* [Segment Replication] Adding PrimaryMode check before publishing checkpoint and processing a received checkpoint. (#4157)

* Adding PrimaryMode check before publishing checkpoint.

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Applying spotless check

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Moving segrep specific tests to SegmentReplicationIndexShardTests.

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Adding logic and tests for rejecting checkpoints if shard is in PrimaryMode.

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Applying ./gradlew :server:spotlessApply.

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Applying ./gradlew :server:spotlessApply

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Changing log level to warn in shouldProcessCheckpoint() of IndexShard.java class.

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Removing unnecessary lazy logging in shouldProcessCheckpoint().

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* [Segment Replication] Wait for documents to replicate to replica shards (#4236)

* [Segment Replication] Add thread sleep to account for replica lag in delete operations test

Signed-off-by: Suraj Singh <surajrider@gmail.com>

* Address review comments, assertBusy on doc count rather than sleep

Signed-off-by: Suraj Singh <surajrider@gmail.com>

Signed-off-by: Suraj Singh <surajrider@gmail.com>

* Segment Replication - Remove unnecessary call to markAllocationIdAsInSync. (#4224)

This PR Removes an unnecessary call to markAllocationIdAsInSync on the primary shard when replication events complete.
Recovery will manage this initial call.

Signed-off-by: Marc Handalian <handalm@amazon.com>

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Segment Replication - Add additional unit tests for update & delete ops. (#4237)

* Segment Replication - Add additional unit tests for update & delete operations.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Fix spotless.

Signed-off-by: Marc Handalian <handalm@amazon.com>

Signed-off-by: Marc Handalian <handalm@amazon.com>

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
Signed-off-by: Marc Handalian <handalm@amazon.com>
Signed-off-by: Suraj Singh <surajrider@gmail.com>
Signed-off-by: Poojita Raj <poojiraj@amazon.com>
Co-authored-by: Marc Handalian <handalm@amazon.com>
Co-authored-by: Suraj Singh <surajrider@gmail.com>
Co-authored-by: Poojita Raj <poojiraj@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants