
Improve CloseWhileRelocatingShardsIT #37348

Merged · tlrx merged 1 commit into elastic:master on Jan 14, 2019

Conversation

@tlrx (Member) commented on Jan 11, 2019

The test CloseWhileRelocatingShardsIT creates one or more indices and tries to close them while their shards are relocating. In some not-so-rare cases the test fails because no index was successfully closed; this happens when shards were still initializing or catching up on missing operations at the time the test tried to close the index.

This pull request improves the test so that it always creates an empty index; closing this index should always succeed, as there are no operations to recover. It also creates a second index that contains documents but has no active indexing, and finally creates one or more indices with ongoing document indexing. The test now also waits for the relocations to have started before executing the close operation, and the shard relocations are started using a single reroute request instead of N requests.
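
As an illustration of the single-reroute pattern mentioned above, here is a minimal sketch in the style of an ESIntegTestCase. It assumes the usual test context (client(), a static import of assertAcked) plus a numberOfShards variable; the index and node names are hypothetical placeholders, not the values used by the actual test:

import org.elasticsearch.action.admin.cluster.reroute.ClusterRerouteRequestBuilder;
import org.elasticsearch.cluster.routing.allocation.command.MoveAllocationCommand;

// Batch all shard moves into one reroute request instead of issuing N requests.
ClusterRerouteRequestBuilder reroute = client().admin().cluster().prepareReroute();
for (int shardId = 0; shardId < numberOfShards; shardId++) {
    // "index-0", "node-a" and "node-b" are hypothetical placeholders
    reroute.add(new MoveAllocationCommand("index-0", shardId, "node-a", "node-b"));
}
assertAcked(reroute.get());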

Closes #37274

@tlrx added labels on Jan 11, 2019: >test (Issues or PRs that are addressing/adding tests), :Distributed Indexing/Distributed, v7.0.0, v6.7.0
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed

Review thread on the test code:

    }
    // Closing is not always acknowledged when shards are relocating: this is the case when the target shard is initializing
@ywelsch (Contributor) commented:

we need to improve this, can you add it as an item to the meta issue?

I wonder if we can fix this by acquiring an operation permit for each batch of operations that we send as part of phase 2 during peer recovery, and then also check whether there's a read-only block under the permit.

@tlrx (Member, author) replied:

> we need to improve this, can you add it as an item to the meta issue?

Done.

> I wonder if we can fix this by acquiring an operation permit for each batch of operations that we send as part of phase 2 during peer recovery, and then also check whether there's a read-only block under the permit.

I'm not sure I see how that would fix the issue: acquiring an operation permit for a batch of operations does not ensure that all operations have been recovered by the time the verify-shard-before-close action is executed. Or are you thinking of failing the recovery because of the block detected under the permit?

@ywelsch (Contributor) replied:

Yes, I thought about failing the recovery in the case where the block suddenly appears during the recovery. This has some other adverse consequences, though. Needs more thought.
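
To make the idea discussed above concrete, here is a rough, speculative sketch of checking for a block under an operation permit during phase 2. It is not the actual implementation: primary is assumed to be an IndexShard reference, and isIndexClosingBlocked, sendPhase2Batch and onFailure are invented names, not Elasticsearch APIs:

import org.elasticsearch.action.ActionListener;
import org.elasticsearch.common.lease.Releasable;
import org.elasticsearch.threadpool.ThreadPool;

// Hypothetical: before shipping each batch of phase-2 operations, acquire a
// permit on the primary and fail the recovery if a close block has appeared.
primary.acquirePrimaryOperationPermit(ActionListener.wrap(permit -> {
    try (Releasable ignored = permit) {
        if (isIndexClosingBlocked(primary.shardId().getIndex())) { // invented helper
            throw new IllegalStateException("index is being closed, failing recovery");
        }
        sendPhase2Batch(); // invented helper: sends the current batch of operations
    }
}, onFailure), ThreadPool.Names.GENERIC, "phase2 batch under close check");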

Review thread on the test code that installs the recovery-interception behavior:

    ((MockTransportService) internalCluster().getInstance(TransportService.class, targetNode))
        .addSendBehavior(internalCluster().getInstance(TransportService.class, sourceNode.getName()),
            (connection, requestId, action, request, options) -> {
                if (PeerRecoverySourceService.Actions.START_RECOVERY.equals(action)) {
                    // intercept start-recovery requests sent to the source node (handler body elided)
                }
                connection.sendRequest(requestId, action, request, options);
            });
@ywelsch (Contributor) commented on this change:

can we just avoid relocating primaries for now until we have a fix?

@tlrx (Member, author) replied:

The behavior is the same for primaries and replicas (as we don't fail replicas that cannot be verified by the verify-shard-before-close action), so I don't see why we should relocate only replicas in this test.

I think we could keep the current test and relocate primaries and replicas even if some close operations are not acknowledged; once we fix this, we can simply change the test to ensure that all closes are acknowledged.
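
A minimal sketch of that tolerant assertion pattern, assuming hypothetical helpers assertIndexIsClosed and assertIndexIsOpened (the actual test code may differ):

import org.elasticsearch.action.support.master.AcknowledgedResponse;

// Closing may legitimately not be acknowledged while shards relocate, so the
// test accepts both outcomes; a fix would tighten this to always-acknowledged.
AcknowledgedResponse response = client().admin().indices().prepareClose("index-0").get();
if (response.isAcknowledged()) {
    assertIndexIsClosed("index-0");  // hypothetical helper
} else {
    assertIndexIsOpened("index-0");  // hypothetical helper: the index must remain open
}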

@ywelsch (Contributor) replied:

ok

@tlrx mentioned this pull request on Jan 11, 2019
@tlrx merged commit 07dc8c7 into elastic:master on Jan 14, 2019
tlrx added a commit that referenced this pull request Jan 14, 2019
@tlrx (Member, author) commented on Jan 14, 2019

Thanks @ywelsch

jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jan 15, 2019
* master: (28 commits)
  Introduce retention lease serialization (elastic#37447)
  Update Delete Watch to allow unknown fields (elastic#37435)
  Make finalize step of recovery source non-blocking (elastic#37388)
  Update the default for include_type_name to false. (elastic#37285)
  Security: remove SSL settings fallback (elastic#36846)
  Adding mapping for hostname field (elastic#37288)
  Relax assertSameDocIdsOnShards assertion
  Reduce recovery time with compress or secure transport (elastic#36981)
  Implement ccr file restore (elastic#37130)
  Fix Eclipse specific compilation issue (elastic#37419)
  Performance fix. Reduce deprecation calls for the same bulk request (elastic#37415)
  [ML] Use String rep of Version in map for serialisation (elastic#37416)
  Cleanup Deadcode in Rest Tests (elastic#37418)
  Mute IndexShardRetentionLeaseTests.testCommit elastic#37420
  unmuted test
  Remove unused index store in directory service
  Improve CloseWhileRelocatingShardsIT (elastic#37348)
  Fix ClusterBlock serialization and Close Index API logic after backport to 6.x (elastic#37360)
  Update the scroll example in the docs (elastic#37394)
  Update analysis.asciidoc (elastic#37404)
  ...
@tlrx deleted the improve-CloseWhileRelocatingIT branch on February 5, 2019
Linked issue (closed by this PR): [CI] CloseWhileRelocatingShardsIT.testCloseWhileRelocatingShards (#37274)