[Segment Replication] Update PrimaryShardAllocator to prefer replicas with higher replication checkpoint #4041

dreamer-89 · 2022-07-29T00:21:01Z

Signed-off-by: Suraj Singh surajrider@gmail.com

Description

Update the PrimaryShardAllocator (PSA) to include highest ReplicationCheckpoint (RC) info for evaluating shard to be promoted as primary.

Today, PSA orders shard copies from nodes holding active copy of unassigned primary shard and orders using below comparators and selects the first the first shard (satisfying existing allocation deciders).

Readable shards with no exception on index open
Existing primary. Shards which are marked primary are preferred over replica copies.

As part of this change, we are adding another comparator (runs after above two) which orders shard copies based on highest RC (a > b or a == b && c > d here a, b are primary terms & c,d are segment info version fields of RC. Please check #4112 for details). This is to ensure in case of failover master always chooses the shard copy which is furthest ahead (after satisfying first 2 comparators) and keeps the translog operations to rerun at minimum.

Please note, there are two places where shard promotion logic is used. This PR handles the PrimaryShardAllocator. The other one is RoutingNodes.failShard which will be taken up in follow up PR. Please check #3988 for more details.

Pending

Integration tests which needs #3989 changes (performs InternalEngine <-> NRTEngineReplication) in main.

Issues Resolved

#3988

Gradle Check (Jenkins) Run Completed with:

RESULT: SUCCESS ✅
URL: https://build.ci.opensearch.org/job/gradle-check/1413/
CommitID: f0e264c6e3bc021389bd1ad29d6d31df5103e401

codecov-commenter · 2022-08-04T00:36:18Z

Codecov Report

Merging #4041 (1f07d13) into main (6993ac9) will decrease coverage by 0.06%.
The diff coverage is 62.85%.

@@             Coverage Diff              @@
##               main    #4041      +/-   ##
============================================
- Coverage     70.68%   70.61%   -0.07%     
+ Complexity    57125    57041      -84     
============================================
  Files          4603     4603              
  Lines        274535   274563      +28     
  Branches      40209    40216       +7     
============================================
- Hits         194058   193886     -172     
- Misses        64196    64355     +159     
- Partials      16281    16322      +41

Impacted Files	Coverage Δ
...ateway/TransportNodesListGatewayStartedShards.java	`51.67% <47.61%> (-1.36%)`	⬇️
.../opensearch/test/gateway/TestGatewayAllocator.java	`94.00% <71.42%> (-3.73%)`	⬇️
.../org/opensearch/gateway/PrimaryShardAllocator.java	`75.24% <100.00%> (+0.50%)`	⬆️
.../replication/checkpoint/ReplicationCheckpoint.java	`56.81% <100.00%> (+3.32%)`	⬆️
...java/org/opensearch/client/indices/DataStream.java	`0.00% <0.00%> (-76.09%)`	⬇️
.../opensearch/client/indices/CloseIndexResponse.java	`17.50% <0.00%> (-73.75%)`	⬇️
...a/org/opensearch/client/cluster/SniffModeInfo.java	`0.00% <0.00%> (-52.95%)`	⬇️
.../java/org/opensearch/node/NodeClosedException.java	`50.00% <0.00%> (-50.00%)`	⬇️
...ch/transport/ReceiveTimeoutTransportException.java	`50.00% <0.00%> (-50.00%)`	⬇️
...cluster/coordination/PendingClusterStateStats.java	`20.00% <0.00%> (-48.00%)`	⬇️
... and 475 more

Help us with your feedback. Take ten seconds to tell us how you rate us.

server/src/main/java/org/opensearch/gateway/PrimaryShardAllocator.java

mch2 · 2022-08-08T18:01:38Z

server/src/main/java/org/opensearch/gateway/TransportNodesListGatewayStartedShards.java

@@ -399,6 +430,14 @@ public void writeTo(StreamOutput out) throws IOException {
            } else {
                out.writeBoolean(false);
            }
+            if (out.getVersion().onOrAfter(Version.V_3_0_0)) {


Lets make sure we update this to V_2_3_0 when this is backported?

… having higher replication checkpoint Signed-off-by: Suraj Singh <surajrider@gmail.com>

Signed-off-by: Suraj Singh <surajrider@gmail.com>

…Checkpoint field Signed-off-by: Suraj Singh <surajrider@gmail.com>

…oint

…comparator only for segrep enabled indices

…Allocator

Signed-off-by: Suraj Singh <surajrider@gmail.com>

dreamer-89 · 2022-08-09T18:31:53Z

@Bukhtawar @andrross @mch2 : Addressed the review comments, please have a look.

dreamer-89 · 2022-08-09T18:45:11Z

Trying to write an IT which simulate no active shard copies (to ensure RoutingNodes#failShard doesn't select any random active replica) during primary failure but at the same time shard copies should be part of in-sync allocation ids. This is not a blocker for this PR.

https://gist.github.com/dreamer-89/9e6a781e57dbf1008c97034103b6a9a8

github-actions · 2022-08-09T18:46:39Z

Gradle Check (Jenkins) Run Completed with:

RESULT: FAILURE ❌
URL: https://build.ci.opensearch.org/job/gradle-check/1626/
CommitID: de07dac

Signed-off-by: Suraj Singh <surajrider@gmail.com>

github-actions · 2022-08-09T19:49:27Z

Gradle Check (Jenkins) Run Completed with:

RESULT: UNSTABLE ❌
URL: https://build.ci.opensearch.org/job/gradle-check/1627/
CommitID: 1f07d13

dreamer-89 · 2022-08-09T20:25:55Z

Failure looks unrelated and not reproducible locally. Refiring

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.indices.replication.SegmentReplicationTargetServiceTests.testReplicationOnDone" -Dtests.seed=18F9CB5CD66407EA -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=ms -Dtests.timezone=Europe/Sofia -Druntime.java=17

org.opensearch.indices.replication.SegmentReplicationTargetServiceTests > testReplicationOnDone FAILED
    org.mockito.exceptions.verification.TooFewActualInvocations: 
    segmentReplicationTargetService.onNewCheckpoint(
        ReplicationCheckpoint{shardId=[index][0], primaryTerm=94, segmentsGen=3, seqNo=-1, version=9},
        <any>
    );
    Wanted 2 times:
    -> at org.opensearch.indices.replication.SegmentReplicationTargetServiceTests.testReplicationOnDone(SegmentReplicationTargetServiceTests.java:236)
    But was 1 time:
    -> at org.opensearch.indices.replication.SegmentReplicationTargetServiceTests.testReplicationOnDone(SegmentReplicationTargetServiceTests.java:230)
        at __randomizedtesting.SeedInfo.seed([18F9CB5CD66407EA:2371B911C98E848]:0)
        at app//org.opensearch.indices.replication.SegmentReplicationTargetServiceTests.testReplicationOnDone(SegmentReplicationTargetServiceTests.java:236)

github-actions · 2022-08-09T20:52:25Z

Gradle Check (Jenkins) Run Completed with:

RESULT: SUCCESS ✅
URL: https://build.ci.opensearch.org/job/gradle-check/1632/
CommitID: 1f07d13

dreamer-89 · 2022-08-16T04:47:45Z

@Bukhtawar @andrross @mch2 : Addressed the review comments, please have a look.

Bukhtawar

LGTM

… with higher replication checkpoint (opensearch-project#4041) * [Segment Replication] Update PrimaryShardAllocator to prefer replicas having higher replication checkpoint Signed-off-by: Suraj Singh <surajrider@gmail.com> * Use empty replication checkpoint to avoid NPE Signed-off-by: Suraj Singh <surajrider@gmail.com> * Update NodeGatewayStartedShards to optionally wire in/out ReplicationCheckpoint field Signed-off-by: Suraj Singh <surajrider@gmail.com> * Use default replication checkpoint causing EOF errors on empty checkpoint * Add indexSettings to GatewayAllocator to allow ReplicationCheckpoint comparator only for segrep enabled indices * Add unit tests for primary term first replica promotion & comparator fix * Fix NPE on empty IndexMetadata * Remove settings from AllocationService and directly inject in GatewayAllocator * Add more unit tests and minor code clean up Signed-off-by: Suraj Singh <surajrider@gmail.com> * Address review comments & integration test Signed-off-by: Suraj Singh <surajrider@gmail.com> * Fix comparator on null ReplicationCheckpoint Signed-off-by: Suraj Singh <surajrider@gmail.com> Signed-off-by: Suraj Singh <surajrider@gmail.com>

… with higher replication checkpoint (#4041) (#4252) * [Segment Replication] Update PrimaryShardAllocator to prefer replicas having higher replication checkpoint Signed-off-by: Suraj Singh <surajrider@gmail.com> * Use empty replication checkpoint to avoid NPE Signed-off-by: Suraj Singh <surajrider@gmail.com> * Update NodeGatewayStartedShards to optionally wire in/out ReplicationCheckpoint field Signed-off-by: Suraj Singh <surajrider@gmail.com> * Use default replication checkpoint causing EOF errors on empty checkpoint * Add indexSettings to GatewayAllocator to allow ReplicationCheckpoint comparator only for segrep enabled indices * Add unit tests for primary term first replica promotion & comparator fix * Fix NPE on empty IndexMetadata * Remove settings from AllocationService and directly inject in GatewayAllocator * Add more unit tests and minor code clean up Signed-off-by: Suraj Singh <surajrider@gmail.com> * Address review comments & integration test Signed-off-by: Suraj Singh <surajrider@gmail.com> * Fix comparator on null ReplicationCheckpoint Signed-off-by: Suraj Singh <surajrider@gmail.com> Signed-off-by: Suraj Singh <surajrider@gmail.com> Signed-off-by: Suraj Singh <surajrider@gmail.com>

dreamer-89 requested review from a team and reta as code owners July 29, 2022 00:21

dreamer-89 marked this pull request as draft July 29, 2022 00:21

dreamer-89 force-pushed the segrep_failover_primary_selection branch 8 times, most recently from ebd18e5 to 5b5b36d Compare August 3, 2022 21:49

dreamer-89 mentioned this pull request Aug 3, 2022

[Bug] [Segment Replication] Fix isAheadOf logic for ReplicationCheckpoint comparison #4112

Merged

dreamer-89 force-pushed the segrep_failover_primary_selection branch from 5b5b36d to f0e264c Compare August 4, 2022 00:08

opensearch-project deleted a comment from github-actions bot Aug 4, 2022

mch2 reviewed Aug 8, 2022

View reviewed changes

dreamer-89 mentioned this pull request Aug 8, 2022

Add IndexShard#getLatestReplicationCheckpoint behind segrep enable feature flag #4163

Merged

dreamer-89 added 10 commits August 9, 2022 11:16

[Segment Replication] Update PrimaryShardAllocator to prefer replicas…

a083c2d

… having higher replication checkpoint Signed-off-by: Suraj Singh <surajrider@gmail.com>

Use empty replication checkpoint to avoid NPE

dbc8295

Signed-off-by: Suraj Singh <surajrider@gmail.com>

Update NodeGatewayStartedShards to optionally wire in/out Replication…

2d6096e

…Checkpoint field Signed-off-by: Suraj Singh <surajrider@gmail.com>

Use default replication checkpoint causing EOF errors on empty checkp…

2fc0310

…oint

Add indexSettings to GatewayAllocator to allow ReplicationCheckpoint …

94cbf50

…comparator only for segrep enabled indices

Add unit tests for primary term first replica promotion & comparator fix

4909746

Fix NPE on empty IndexMetadata

907df84

Remove settings from AllocationService and directly inject in Gateway…

a62d02c

…Allocator

Add more unit tests and minor code clean up

ac47ffd

Signed-off-by: Suraj Singh <surajrider@gmail.com>

Address review comments & integration test

de07dac

Signed-off-by: Suraj Singh <surajrider@gmail.com>

dreamer-89 force-pushed the segrep_failover_primary_selection branch from dd79408 to de07dac Compare August 9, 2022 18:24

Fix comparator on null ReplicationCheckpoint

1f07d13

Signed-off-by: Suraj Singh <surajrider@gmail.com>

This was referenced Aug 10, 2022

[Backport 2.x] [Segment Replication] Backport PR's : #3525 #3533 #3540 #3943 #3963 From main branch #4181

Merged

[BUG] [Segment Replication] Flaky test SegmentReplicationTargetServiceTests.testReplicationOnDone #4184

Closed

Bukhtawar approved these changes Aug 16, 2022

View reviewed changes

andrross approved these changes Aug 16, 2022

View reviewed changes

dreamer-89 merged commit d308a29 into opensearch-project:main Aug 18, 2022

kartg mentioned this pull request Aug 30, 2022

Adding @dreamer-89 to Opensearch maintainers. #4342

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Segment Replication] Update PrimaryShardAllocator to prefer replicas with higher replication checkpoint #4041

[Segment Replication] Update PrimaryShardAllocator to prefer replicas with higher replication checkpoint #4041

dreamer-89 commented Jul 29, 2022 •

edited

Loading

github-actions bot commented Aug 4, 2022

codecov-commenter commented Aug 4, 2022 •

edited

Loading

mch2 Aug 8, 2022

dreamer-89 commented Aug 9, 2022 •

edited

Loading

dreamer-89 commented Aug 9, 2022 •

edited

Loading

github-actions bot commented Aug 9, 2022

github-actions bot commented Aug 9, 2022

dreamer-89 commented Aug 9, 2022

github-actions bot commented Aug 9, 2022

dreamer-89 commented Aug 16, 2022

Bukhtawar left a comment

[Segment Replication] Update PrimaryShardAllocator to prefer replicas with higher replication checkpoint #4041

[Segment Replication] Update PrimaryShardAllocator to prefer replicas with higher replication checkpoint #4041

Conversation

dreamer-89 commented Jul 29, 2022 • edited Loading

Description

Pending

Issues Resolved

Related

github-actions bot commented Aug 4, 2022

Gradle Check (Jenkins) Run Completed with:

codecov-commenter commented Aug 4, 2022 • edited Loading

Codecov Report

mch2 Aug 8, 2022

Choose a reason for hiding this comment

dreamer-89 commented Aug 9, 2022 • edited Loading

dreamer-89 commented Aug 9, 2022 • edited Loading

github-actions bot commented Aug 9, 2022

Gradle Check (Jenkins) Run Completed with:

github-actions bot commented Aug 9, 2022

Gradle Check (Jenkins) Run Completed with:

dreamer-89 commented Aug 9, 2022

github-actions bot commented Aug 9, 2022

Gradle Check (Jenkins) Run Completed with:

dreamer-89 commented Aug 16, 2022

Bukhtawar left a comment

Choose a reason for hiding this comment

dreamer-89 commented Jul 29, 2022 •

edited

Loading

codecov-commenter commented Aug 4, 2022 •

edited

Loading

dreamer-89 commented Aug 9, 2022 •

edited

Loading

dreamer-89 commented Aug 9, 2022 •

edited

Loading