[Segment Replication] Update PrimaryShardAllocator to prefer replicas with higher replication checkpoint #4041
Codecov Report
@@ Coverage Diff @@
## main #4041 +/- ##
============================================
- Coverage 70.68% 70.61% -0.07%
+ Complexity 57125 57041 -84
============================================
Files 4603 4603
Lines 274535 274563 +28
Branches 40209 40216 +7
============================================
- Hits 194058 193886 -172
- Misses 64196 64355 +159
- Partials 16281 16322 +41
server/src/main/java/org/opensearch/gateway/PrimaryShardAllocator.java
@@ -399,6 +430,14 @@ public void writeTo(StreamOutput out) throws IOException {
        } else {
            out.writeBoolean(false);
        }
        if (out.getVersion().onOrAfter(Version.V_3_0_0)) {
Let's make sure we update this to V_2_3_0 when this is backported.
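To illustrate why the version constant matters on backport, here is a minimal, self-contained sketch of a version-gated optional field. It does not use the OpenSearch `StreamOutput` or `Version` classes: the numeric wire versions, the `write` and `readCheckpoint` helpers, and the single `long` standing in for the checkpoint are all assumptions made for illustration. The point is only that the writer and the reader must apply the same gate, so the constant has to change on both sides together.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class VersionGatedFieldSketch {

    // Hypothetical numeric wire versions; the real code compares Version constants.
    static final int V_OLD = 1;
    static final int V_WITH_CHECKPOINT = 2;

    // Writes the optional checkpoint only when the receiving side understands it.
    static byte[] write(int wireVersion, String allocationId, Long checkpointVersion) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(allocationId);
            if (wireVersion >= V_WITH_CHECKPOINT) {
                // Presence flag first, then the value if there is one.
                out.writeBoolean(checkpointVersion != null);
                if (checkpointVersion != null) {
                    out.writeLong(checkpointVersion);
                }
            }
        }
        return bytes.toByteArray();
    }

    // Mirror of write: applies the same gate and returns the checkpoint, or null if absent.
    static Long readCheckpoint(int wireVersion, byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            in.readUTF(); // skip the allocation id
            if (wireVersion >= V_WITH_CHECKPOINT && in.readBoolean()) {
                return in.readLong();
            }
            return null;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] newer = write(V_WITH_CHECKPOINT, "alloc-1", 7L);
        byte[] older = write(V_OLD, "alloc-1", 7L);
        System.out.println(readCheckpoint(V_WITH_CHECKPOINT, newer));
        System.out.println(readCheckpoint(V_OLD, older));
    }
}
```

If the gate on one side were left at a newer version than the other after a backport, the reader would either skip bytes that were written or try to read bytes that were never sent.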
… having higher replication checkpoint Signed-off-by: Suraj Singh <surajrider@gmail.com>
Signed-off-by: Suraj Singh <surajrider@gmail.com>
…Checkpoint field Signed-off-by: Suraj Singh <surajrider@gmail.com>
…comparator only for segrep enabled indices
Signed-off-by: Suraj Singh <surajrider@gmail.com>
Signed-off-by: Suraj Singh <surajrider@gmail.com>
@Bukhtawar @andrross @mch2: Addressed the review comments, please have a look.
I'm trying to write an IT that simulates having no active shard copies during primary failure (to ensure RoutingNodes#failShard doesn't select a random active replica), while the shard copies still remain part of the in-sync allocation IDs. This is not a blocker for this PR. https://gist.github.com/dreamer-89/9e6a781e57dbf1008c97034103b6a9a8
Signed-off-by: Suraj Singh <surajrider@gmail.com>
The failure looks unrelated and is not reproducible locally. Refiring.
LGTM
… with higher replication checkpoint (opensearch-project#4041)
* [Segment Replication] Update PrimaryShardAllocator to prefer replicas having higher replication checkpoint
Signed-off-by: Suraj Singh <surajrider@gmail.com>
* Use empty replication checkpoint to avoid NPE
Signed-off-by: Suraj Singh <surajrider@gmail.com>
* Update NodeGatewayStartedShards to optionally wire in/out ReplicationCheckpoint field
Signed-off-by: Suraj Singh <surajrider@gmail.com>
* Use default replication checkpoint causing EOF errors on empty checkpoint
* Add indexSettings to GatewayAllocator to allow ReplicationCheckpoint comparator only for segrep enabled indices
* Add unit tests for primary term first replica promotion & comparator fix
* Fix NPE on empty IndexMetadata
* Remove settings from AllocationService and directly inject in GatewayAllocator
* Add more unit tests and minor code clean up
Signed-off-by: Suraj Singh <surajrider@gmail.com>
* Address review comments & integration test
Signed-off-by: Suraj Singh <surajrider@gmail.com>
* Fix comparator on null ReplicationCheckpoint
Signed-off-by: Suraj Singh <surajrider@gmail.com>
Signed-off-by: Suraj Singh <surajrider@gmail.com>
… with higher replication checkpoint (#4041) (#4252)
* [Segment Replication] Update PrimaryShardAllocator to prefer replicas having higher replication checkpoint
Signed-off-by: Suraj Singh <surajrider@gmail.com>
* Use empty replication checkpoint to avoid NPE
Signed-off-by: Suraj Singh <surajrider@gmail.com>
* Update NodeGatewayStartedShards to optionally wire in/out ReplicationCheckpoint field
Signed-off-by: Suraj Singh <surajrider@gmail.com>
* Use default replication checkpoint causing EOF errors on empty checkpoint
* Add indexSettings to GatewayAllocator to allow ReplicationCheckpoint comparator only for segrep enabled indices
* Add unit tests for primary term first replica promotion & comparator fix
* Fix NPE on empty IndexMetadata
* Remove settings from AllocationService and directly inject in GatewayAllocator
* Add more unit tests and minor code clean up
Signed-off-by: Suraj Singh <surajrider@gmail.com>
* Address review comments & integration test
Signed-off-by: Suraj Singh <surajrider@gmail.com>
* Fix comparator on null ReplicationCheckpoint
Signed-off-by: Suraj Singh <surajrider@gmail.com>
Signed-off-by: Suraj Singh <surajrider@gmail.com>
Signed-off-by: Suraj Singh <surajrider@gmail.com>
Signed-off-by: Suraj Singh surajrider@gmail.com
Description
Update the PrimaryShardAllocator (PSA) to take the highest ReplicationCheckpoint (RC) into account when evaluating which shard copy to promote to primary.
Today, the PSA gathers shard copies from the nodes holding an active copy of the unassigned primary shard, orders them using two existing comparators, and selects the first shard copy that satisfies the existing allocation deciders.
As part of this change, we are adding another comparator (which runs after those two) that orders shard copies by highest RC. One RC is higher than another when a > b, or a == b && c > d, where a and b are the primary terms and c and d are the segment info version fields of the two RCs. Please check #4112 for details. This ensures that on failover the master always chooses the shard copy that is furthest ahead (after satisfying the first two comparators), which keeps the number of translog operations to replay to a minimum.
Please note that shard promotion logic is used in two places. This PR handles the PrimaryShardAllocator. The other is RoutingNodes.failShard, which will be taken up in a follow-up PR. Please check #3988 for more details.
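The ordering rule described above (primary term first, then segment info version) can be sketched as a standalone comparator. This is an illustration only, not the comparator added in this PR: the `Checkpoint` record, its field names, and the `order` helper are hypothetical stand-ins, and the sketch ignores the two earlier comparators and the allocation deciders.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class CheckpointOrderingSketch {

    // Hypothetical stand-in for a shard copy's replication checkpoint;
    // the real ReplicationCheckpoint class is not modelled here.
    record Checkpoint(String nodeId, long primaryTerm, long segmentInfosVersion) {}

    // Highest checkpoint first: compare primary terms, break ties on segment info version.
    static final Comparator<Checkpoint> HIGHEST_CHECKPOINT_FIRST =
        Comparator.comparingLong(Checkpoint::primaryTerm)
            .thenComparingLong(Checkpoint::segmentInfosVersion)
            .reversed();

    // Returns a sorted copy so the caller's list is left untouched.
    static List<Checkpoint> order(List<Checkpoint> copies) {
        List<Checkpoint> sorted = new ArrayList<>(copies);
        sorted.sort(HIGHEST_CHECKPOINT_FIRST);
        return sorted;
    }

    public static void main(String[] args) {
        List<Checkpoint> sorted = order(List.of(
            new Checkpoint("node-a", 2, 5),
            new Checkpoint("node-b", 3, 1),
            new Checkpoint("node-c", 3, 4)));
        // Prints node-c, node-b, node-a (one per line).
        sorted.forEach(c -> System.out.println(c.nodeId()));
    }
}
```

In this example node-a has the largest segment info version but the lowest primary term, so it sorts last: the primary term always dominates, and the version only breaks ties.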
Pending
Integration tests, which need the #3989 changes (which perform InternalEngine <-> NRTEngineReplication) to be in main.
Issues Resolved
#3988
Related
#2212
#4112
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.