Improve CloseWhileRelocatingShardsIT #37348
Pinging @elastic/es-distributed
}
// Closing is not always acknowledged when shards are relocating: this is the case when the target shard is initializing
we need to improve this, can you add it as an item to the meta issue?
I wonder if we can fix this by acquiring an operation permit for each batch of operations that we send as part of phase 2 during peer recovery, and then also check whether there's a read-only block under the permit.
> we need to improve this, can you add it as an item to the meta issue?
Done.
> I wonder if we can fix this by acquiring an operation permit for each batch of operations that we send as part of phase 2 during peer recovery, and then also check whether there's a read-only block under the permit.
I'm not sure I see how it would fix the issue: acquiring an operation permit for a batch of operations does not ensure that all operations have been recovered by the time the verify shard before close action is executed. Or are you thinking of failing the recovery because of the block detected under the permit?
Yes, I thought about failing the recovery in the case where the block suddenly appears during the recovery. This has some other adverse consequences though. Needs more thought.
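To make the idea under discussion concrete, here is a minimal, self-contained sketch of "apply each phase 2 batch under an operation permit and fail the recovery if a read-only block is visible under that permit". It is a toy model only: `ToyShard`, `applyBatch`, and `recover` are hypothetical names invented for illustration, a plain `Semaphore` stands in for operation permits, and none of this is Elasticsearch's actual recovery code.

```java
import java.util.List;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicBoolean;

public class PermitGuardedRecoverySketch {

    /** Hypothetical stand-in for a recovery target shard; not the Elasticsearch IndexShard class. */
    static final class ToyShard {
        private final Semaphore permits = new Semaphore(Integer.MAX_VALUE);
        private final AtomicBoolean readOnlyBlock = new AtomicBoolean(false);
        private int applied = 0;

        /** Simulates the close operation adding a read-only block to the index. */
        void addReadOnlyBlock() {
            readOnlyBlock.set(true);
        }

        /** Applies one batch under a permit and throws if the block is visible under it. */
        void applyBatch(List<String> operations) {
            if (!permits.tryAcquire()) {
                throw new IllegalStateException("no operation permit available");
            }
            try {
                if (readOnlyBlock.get()) {
                    throw new IllegalStateException("read-only block appeared during recovery");
                }
                applied += operations.size();
            } finally {
                permits.release();
            }
        }

        int appliedOperations() {
            return applied;
        }
    }

    /**
     * Replays the batches in order. The block is added just before the batch at
     * index blockBeforeBatch (use -1 for "never"). Returns false if the recovery
     * was failed because of the block.
     */
    static boolean recover(ToyShard shard, List<List<String>> batches, int blockBeforeBatch) {
        for (int i = 0; i < batches.size(); i++) {
            if (i == blockBeforeBatch) {
                shard.addReadOnlyBlock();
            }
            try {
                shard.applyBatch(batches.get(i));
            } catch (IllegalStateException e) {
                return false; // the recovery is failed instead of silently missing operations
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<List<String>> batches = List.of(List.of("op1", "op2"), List.of("op3"));
        ToyShard shard = new ToyShard();
        boolean completed = recover(shard, batches, 1);
        System.out.println("completed=" + completed + " applied=" + shard.appliedOperations());
    }
}
```

In this toy model the second batch is rejected once the block appears, so the recovery fails rather than finishing with an unverified shard. The sketch does not model the adverse consequences mentioned above, such as what happens to the failed relocation afterwards.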
((MockTransportService) internalCluster().getInstance(TransportService.class, targetNode))
    .addSendBehavior(internalCluster().getInstance(TransportService.class, sourceNode.getName()),
        (connection, requestId, action, request, options) -> {
            if (PeerRecoverySourceService.Actions.START_RECOVERY.equals(action)) {
can we just avoid relocating primaries for now until we have a fix?
The behavior is the same for primary or replica (as we don't fail replicas that cannot be verified by the verify shard before close action) so I don't see why we should only relocate replicas in this test.
I think we could keep the current test and relocate primary and replicas even if some close operations are not acknowledged, and if we fix this we could simply change the test to ensure that all closes are acknowledged.
ok
Thanks @ywelsch
* master: (28 commits)
  Introduce retention lease serialization (elastic#37447)
  Update Delete Watch to allow unknown fields (elastic#37435)
  Make finalize step of recovery source non-blocking (elastic#37388)
  Update the default for include_type_name to false. (elastic#37285)
  Security: remove SSL settings fallback (elastic#36846)
  Adding mapping for hostname field (elastic#37288)
  Relax assertSameDocIdsOnShards assertion
  Reduce recovery time with compress or secure transport (elastic#36981)
  Implement ccr file restore (elastic#37130)
  Fix Eclipse specific compilation issue (elastic#37419)
  Performance fix. Reduce deprecation calls for the same bulk request (elastic#37415)
  [ML] Use String rep of Version in map for serialisation (elastic#37416)
  Cleanup Deadcode in Rest Tests (elastic#37418)
  Mute IndexShardRetentionLeaseTests.testCommit elastic#37420
  unmuted test
  Remove unused index store in directory service
  Improve CloseWhileRelocatingShardsIT (elastic#37348)
  Fix ClusterBlock serialization and Close Index API logic after backport to 6.x (elastic#37360)
  Update the scroll example in the docs (elastic#37394)
  Update analysis.asciidoc (elastic#37404)
  ...
The test CloseWhileRelocatingShardsIT creates one or more indices and tries to close them while their shards are relocating. In some not-so-rare cases the test fails because no index was successfully closed. This happens when shards were still initializing or catching up on missing operations when the test tried to close the index.

This pull request improves the test so that it always creates an empty index. Closing this index should always succeed, as there are no operations to recover. The test also creates a second index with documents but no active indexing, and finally creates one or more indices with ongoing document indexing. The test now also detects that the relocations have started before executing the close operation, and the shard relocations are now started using a single Reroute request instead of N requests.
Closes #37274
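
As a rough illustration of the "single Reroute request instead of N requests" change, the sketch below collects one move command per shard and submits them all in one call. It is a toy model under stated assumptions: `MoveCommand`, `ToyCluster`, and `moveAllShards` are hypothetical names made up for this example and are not the Elasticsearch reroute API or the test's actual code.

```java
import java.util.ArrayList;
import java.util.List;

public class SingleRerouteSketch {

    /** Hypothetical description of one shard relocation; not an Elasticsearch class. */
    static final class MoveCommand {
        final String index;
        final int shard;
        final String fromNode;
        final String toNode;

        MoveCommand(String index, int shard, String fromNode, String toNode) {
            this.index = index;
            this.shard = shard;
            this.fromNode = fromNode;
            this.toNode = toNode;
        }
    }

    /** Hypothetical cluster that only counts requests and the relocations they start. */
    static final class ToyCluster {
        int rerouteRequests = 0;
        int relocationsStarted = 0;

        /** One call is one request, regardless of how many commands it carries. */
        void reroute(List<MoveCommand> commands) {
            rerouteRequests++;
            relocationsStarted += commands.size();
        }
    }

    /** Builds one move command per shard of every index, all going from one node to another. */
    static List<MoveCommand> moveAllShards(List<String> indices, int shardsPerIndex,
                                           String fromNode, String toNode) {
        List<MoveCommand> commands = new ArrayList<>();
        for (String index : indices) {
            for (int shard = 0; shard < shardsPerIndex; shard++) {
                commands.add(new MoveCommand(index, shard, fromNode, toNode));
            }
        }
        return commands;
    }

    public static void main(String[] args) {
        ToyCluster cluster = new ToyCluster();
        List<MoveCommand> commands =
            moveAllShards(List.of("index-0", "index-1", "index-2"), 2, "node_source", "node_target");
        cluster.reroute(commands); // all relocations are requested together
        System.out.println("requests=" + cluster.rerouteRequests
            + " relocations=" + cluster.relocationsStarted);
    }
}
```

Batching the commands means all relocations are requested together, which fits the description's goal of detecting that relocations have started before the close operation is executed.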