-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
IndicesClusterStateService should replace an init. replica with an init. primary with the same aId #32374
IndicesClusterStateService should replace an init. replica with an init. primary with the same aId #32374
Conversation
…it. primary with the same aId In rare cases it is possible that a nodes gets an instruction to replace a replica shard that's in POST_RECOVERY with a new initializing primary with the same allocation id. This can happen by batching cluster states that include the starting of the replica, with closing of the indices, opening it up again and allocating the primary shard to the node in question. The node should then clean it's initializing replica and replace it with a new initializing primary. Closes elastic#32308
Pinging @elastic/es-distributed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've suggested an (imo simpler) alternative.
@@ -420,6 +420,12 @@ private void removeShards(final ClusterState state) { | |||
// state may result in a new shard being initialized while having the same allocation id as the currently started shard. | |||
logger.debug("{} removing shard (not active, current {}, new {})", shardId, currentRoutingEntry, newShardRouting); | |||
indexService.removeShard(shardId.id(), "removing shard (stale copy)"); | |||
} else if (newShardRouting.primary() && currentRoutingEntry.primary() == false && newShardRouting.initializing()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we're over-special-casing this, while also possibly missing some more murky cases. May I suggest the following alternative instead, which handles both this case as well as the previous (odd) one:
diff --git a/server/src/main/java/org/elasticsearch/indices/cluster/IndicesClusterStateService.java b/server/src/main/java/org/elasticsearch/indices/cluster/IndicesClusterStateService.java
index e6a86d47f55..42a88383937 100644
--- a/server/src/main/java/org/elasticsearch/indices/cluster/IndicesClusterStateService.java
+++ b/server/src/main/java/org/elasticsearch/indices/cluster/IndicesClusterStateService.java
@@ -219,10 +219,10 @@ public class IndicesClusterStateService extends AbstractLifecycleComponent imple
removeUnallocatedIndices(event); // also removes shards of removed indices
- failMissingShards(state);
-
removeShards(state); // removes any local shards that doesn't match what the master expects
+ failMissingShards(state);
+
updateIndices(event); // can also fail shards, but these are then guaranteed to be in failedShardsCache
createIndices(state);
@@ -414,17 +414,12 @@ public class IndicesClusterStateService extends AbstractLifecycleComponent imple
logger.debug("{} removing shard (stale allocation id, stale {}, new {})", shardId,
currentRoutingEntry, newShardRouting);
indexService.removeShard(shardId.id(), "removing shard (stale copy)");
- } else if (newShardRouting.initializing() && currentRoutingEntry.active()) {
+ } else if (newShardRouting.primary() &&
+ state.metaData().index(shardId.getIndex()).primaryTerm(shardId.id()) != shard.getPrimaryTerm()) {
// this can happen if the node was isolated/gc-ed, rejoins the cluster and a new shard with the same allocation id
// is assigned to it. Batch cluster state processing or if shard fetching completes before the node gets a new cluster
// state may result in a new shard being initialized while having the same allocation id as the currently started shard.
- logger.debug("{} removing shard (not active, current {}, new {})", shardId, currentRoutingEntry, newShardRouting);
- indexService.removeShard(shardId.id(), "removing shard (stale copy)");
- } else if (newShardRouting.primary() && currentRoutingEntry.primary() == false && newShardRouting.initializing()) {
- assert currentRoutingEntry.initializing() : currentRoutingEntry; // see above if clause
- // this can happen when cluster state batching batches activation of the shard, closing an index, reopening it
- // and assigning an initializing primary to this node
- logger.debug("{} removing shard (not active, current {}, new {})", shardId, currentRoutingEntry, newShardRouting);
+ logger.debug("{} removing shard (stale primary term)", shardId);
indexService.removeShard(shardId.id(), "removing shard (stale copy)");
}
}
@@ -728,6 +723,11 @@ public class IndicesClusterStateService extends AbstractLifecycleComponent imple
*/
ShardRouting routingEntry();
+ /**
+ * Returns the current primary term of this shard
+ */
+ long getPrimaryTerm();
+
/**
* Returns the latest internal shard state.
*/
diff --git a/server/src/test/java/org/elasticsearch/indices/cluster/AbstractIndicesClusterStateServiceTestCase.java b/server/src/test/java/org/elasticsearch/indices/cluster/AbstractIndicesClusterStateServiceTestCase.java
index 5e9ba653d40..bfcca1946a6 100644
--- a/server/src/test/java/org/elasticsearch/indices/cluster/AbstractIndicesClusterStateServiceTestCase.java
+++ b/server/src/test/java/org/elasticsearch/indices/cluster/AbstractIndicesClusterStateServiceTestCase.java
@@ -383,6 +383,11 @@ public abstract class AbstractIndicesClusterStateServiceTestCase extends ESTestC
return shardRouting;
}
+ @Override
+ public long getPrimaryTerm() {
+ return term;
+ }
+
@Override
public IndexShardState state() {
return null;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After discussion, we agreed this is not quite right. We decided to go with the initial condition proposed here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My previous suggestion was too good to be true. The condition you've proposed is the best I can think of right now until we distinguish between allocation ids and shard copy ids. I also think it's ok to keep the test, it's relatively short and succinctly documents an edge case.
@@ -420,6 +420,12 @@ private void removeShards(final ClusterState state) { | |||
// state may result in a new shard being initialized while having the same allocation id as the currently started shard. | |||
logger.debug("{} removing shard (not active, current {}, new {})", shardId, currentRoutingEntry, newShardRouting); | |||
indexService.removeShard(shardId.id(), "removing shard (stale copy)"); | |||
} else if (newShardRouting.primary() && currentRoutingEntry.primary() == false && newShardRouting.initializing()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After discussion, we agreed this is not quite right. We decided to go with the initial condition proposed here.
@@ -74,6 +74,10 @@ public void injectRandomFailures() { | |||
enableRandomFailures = randomBoolean(); | |||
} | |||
|
|||
protected void disalbeRandomFailures() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spelling
Thx @ywelsch . I tend towards pushing this to 6.3 too. Do you agree? |
…tate_init_replcia_primary
yes |
thx @ywelsch |
…it. primary with the same aId (#32374) In rare cases it is possible that a nodes gets an instruction to replace a replica shard that's in `POST_RECOVERY` with a new initializing primary with the same allocation id. This can happen by batching cluster states that include the starting of the replica, with closing of the indices, opening it up again and allocating the primary shard to the node in question. The node should then clean it's initializing replica and replace it with a new initializing primary. I'm not sure whether the test I added really adds enough value as existing tests found this. The main reason I added is to allow for simpler reproduction and to double check I fixed it. I'm open to discuss if we should keep. Closes #32308
…it. primary with the same aId (#32374) In rare cases it is possible that a nodes gets an instruction to replace a replica shard that's in `POST_RECOVERY` with a new initializing primary with the same allocation id. This can happen by batching cluster states that include the starting of the replica, with closing of the indices, opening it up again and allocating the primary shard to the node in question. The node should then clean it's initializing replica and replace it with a new initializing primary. I'm not sure whether the test I added really adds enough value as existing tests found this. The main reason I added is to allow for simpler reproduction and to double check I fixed it. I'm open to discuss if we should keep. Closes #32308
…it. primary with the same aId (#32374) In rare cases it is possible that a nodes gets an instruction to replace a replica shard that's in `POST_RECOVERY` with a new initializing primary with the same allocation id. This can happen by batching cluster states that include the starting of the replica, with closing of the indices, opening it up again and allocating the primary shard to the node in question. The node should then clean it's initializing replica and replace it with a new initializing primary. I'm not sure whether the test I added really adds enough value as existing tests found this. The main reason I added is to allow for simpler reproduction and to double check I fixed it. I'm open to discuss if we should keep. Closes #32308
* elastic/ccr: (57 commits) ShardFollowNodeTask should fetch operation once (elastic#32455) Do not expose hard-deleted docs in Lucene history (elastic#32333) Tests: Fix convert error tests to use fixed value (elastic#32415) IndicesClusterStateService should replace an init. replica with an init. primary with the same aId (elastic#32374) REST high-level client: parse back _ignored meta field (elastic#32362) [CI] Mute DocumentSubsetReaderTests testSearch Reject follow request if following setting not enabled on follower (elastic#32448) TEST: testDocStats should always use forceMerge (elastic#32450) TEST: avoid merge in testSegmentMemoryTrackedInBreaker TEST: Avoid deletion in FlushIT AwaitsFix IndexShardTests#testDocStats Painless: Add method type to method. (elastic#32441) Remove reference to non-existent store type (elastic#32418) [TEST] Mute failing FlushIT test Fix ordering of bootstrap checks in docs (elastic#32417) [TEST] Mute failing InternalEngineTests#testSeqNoAndCheckpoints Validate source of an index in LuceneChangesSnapshot (elastic#32288) [TEST] Mute failing testConvertLongHexError bump lucene version after backport Upgrade to Lucene-7.5.0-snapshot-608f0277b0 (elastic#32390) ...
* 6.x: Fix scriptdocvalues tests with dates Correct minor typo in explain.asciidoc for HLRC Fix painless whitelist and warnings from backporting #31441 Build: Add elastic maven to repos used by BuildPlugin (#32549) Scripting: Conditionally use java time api in scripting (#31441) [ML] Improve error when no available field exists for rule scope (#32550) [ML] Improve error for functions with limited rule condition support (#32548) [ML] Remove multiple_bucket_spans [ML] Fix thread leak when waiting for job flush (#32196) (#32541) Painless: Clean Up PainlessField (#32525) Add @AwaitsFix for #32554 Remove broken @link in Javadoc Add AwaitsFix to failing test - see #32546 SQL: Added support for string manipulating functions with more than one parameter (#32356) [DOCS] Reloadable Secure Settings (#31713) Fix compilation error introduced by #32339 [Rollup] Remove builders from TermsGroupConfig (#32507) Use hostname instead of IP with SPNEGO test (#32514) Switch x-pack rolling restart to new style Requests (#32339) [DOCS] Small fixes in rule configuration page (#32516) Painless: Clean up PainlessMethod (#32476) SQL: Add test for handling of partial results (#32474) Docs: Add missing migration doc for logging change Build: Remove shadowing from benchmarks (#32475) Docs: Add all JDKs to CONTRIBUTING.md Logging: Make node name consistent in logger (#31588) High-level client: fix clusterAlias parsing in SearchHit (#32465) REST high-level client: parse back _ignored meta field (#32362) backport fix of reduceRandom fix (#32508) Add licensing enforcement for FIPS mode (#32437) INGEST: Clean up Java8 Stream Usage (#32059) (#32485) Improve the error message when an index is incompatible with field aliases. (#32482) Mute testFilterCacheStats Scripting: Fix painless compiler loader to know about context classes (#32385) [ML][DOCS] Fix typo applied_to => applies_to Mute SSLTrustRestrictionsTests on JDK 11 Changed ReindexRequest to use Writeable.Reader (#32401) Increase max chunk size to 256Mb for repo-azure (#32101) Mute KerberosAuthenticationIT fix no=>not typo (#32463) HLRC: Add delete watch action (#32337) Fix calculation of orientation of polygons (#27967) [Kerberos] Add missing javadocs (#32469) Fix missing JavaDoc for @throws in several places in KerberosTicketValidator. Make get all app privs requires "*" permission (#32460) Ensure KeyStoreWrapper decryption exceptions are handled (#32472) update rollover to leverage write-alias semantics (#32216) [Kerberos] Remove Kerberos bootstrap checks (#32451) Switch security to new style Requests (#32290) Switch security spi example to new style Requests (#32341) Painless: Add PainlessConstructor (#32447) Update Fuzzy Query docs to clarify default behavior re max_expansions (#30819) Remove > from Javadoc (fatal with Java 11) Tests: Fix convert error tests to use fixed value (#32415) IndicesClusterStateService should replace an init. replica with an init. primary with the same aId (#32374) auto-interval date histogram - 6.x backport (#32107) [CI] Mute DocumentSubsetReaderTests testSearch [TEST] Mute failing InternalEngineTests#testSeqNoAndCheckpoints TEST: testDocStats should always use forceMerge (#32450) TEST: Avoid deletion in FlushIT AwaitsFix IndexShardTests#testDocStats Painless: Add method type to method. (#32441) Remove reference to non-existent store type (#32418) [TEST] Mute failing FlushIT test Fix ordering of bootstrap checks in docs (#32417) Wrong discovery.type for azure in breaking changes (#32432) Mute ConvertProcessorTests failing tests TESTS: Move netty leak detection to paranoid level (#32354) (#32425) Upgrade to Lucene-7.5.0-snapshot-608f0277b0 (#32390) [Kerberos] Avoid vagrant update on precommit (#32416) TEST: Avoid triggering merges in FlushIT [DOCS] Fixes formatting of scope object in job resource Switch x-pack/plugin to new style Requests (#32327) Release requests in cors handle (#32410) Remove BouncyCastle dependency from runtime (#32402) Copy missing segment attributes in getSegmentInfo (#32396) Rest HL client: Add put license action (#32214) Docs: Correcting a typo in tophits (#32359) Build: Stop double generating buildSrc pom (#32408) Switch x-pack full restart to new style Requests (#32294) Painless: Clean Up PainlessClass Variables (#32380) [ML] Consistent pattern for strict/lenient parser names (#32399) Add Restore Snapshot High Level REST API Update update-settings.asciidoc (#31378) Introduce index store plugins (#32375) Rank-Eval: Reduce scope of an unchecked supression Make sure _forcemerge respects `max_num_segments`. (#32291)
In rare cases it is possible that a nodes gets an instruction to replace a replica
shard that's in
POST_RECOVERY
with a new initializing primary with the same allocation id.This can happen by batching cluster states that include the starting of the replica, with
closing of the indices, opening it up again and allocating the primary shard to the node in
question. The node should then clean it's initializing replica and replace it with a new
initializing primary.
I'm not sure whether the test I added really adds enough value as existing tests found this. The main reason I added is to allow for simpler reproduction and to double check I fixed it. I'm open to discuss if we should keep.
Closes #32308