Improve CloseWhileRelocatingShardsIT #37348

tlrx · 2019-01-11T09:25:28Z

The test CloseWhileRelocatingShardsIT creates one or more indices and tries to close them while their shards are relocating. In some not-so-rare cases the test fails because no index was successfully closed. This happens when shards are still initializing or catching up on missing operations at the time the test tries to close the index.

This pull request improves the test so that it always creates an empty index. Closing this index should always succeed because there are no operations to recover. The test also creates a second index with documents but no active indexing, and finally creates one or more indices with ongoing document indexing. It now detects that the relocations have started before executing the close operation, and the shard relocations are started using a single Reroute request instead of N requests.

Closes #37274

elasticmachine · 2019-01-11T09:25:30Z

Pinging @elastic/es-distributed

ywelsch · 2019-01-11T11:19:50Z

server/src/test/java/org/elasticsearch/indices/state/CloseWhileRelocatingShardsIT.java

                    }
+                    // Closing is not always acknowledged when shards are relocating: this is the case when the target shard is initializing


we need to improve this, can you add it as an item to the meta issue?

I wonder if we can fix this by acquiring an operation permit for each batch of operations that we send as part of phase 2 during peer recovery, and then also check whether there's a read-only block under the permit.

> we need to improve this, can you add it as an item to the meta issue?

Done.

> I wonder if we can fix this by acquiring an operation permit for each batch of operations that we send as part of phase 2 during peer recovery, and then also check whether there's a read-only block under the permit.

I'm not sure I see how it would fix the issue: acquiring an operation permit for a batch of operations does not ensure that all operations have been recovered by the time the verify shard before close action is executed. Or are you thinking of failing the recovery because of the block detected under the permit?

Yes, I was thinking of failing the recovery when the block suddenly appears during the recovery. This has some other adverse consequences, though. Needs more thought.
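The idea discussed in this thread (take an operation permit for each phase 2 batch, check for a block under the permit, and fail the recovery if the block shows up) can be pictured with a toy model. This is only a sketch in plain Java, not Elasticsearch code: `ToyShard`, `applyBatch`, and `addReadOnlyBlock` are hypothetical names, a single `Semaphore` stands in for the shard's operation permits, and the adverse consequences mentioned above are not modeled.

```java
import java.util.List;
import java.util.concurrent.Semaphore;

// Toy model only: none of these names are Elasticsearch classes or methods.
class RecoveryPermitSketch {

    static final class ToyShard {
        // A single permit stands in for the shard's operation permits (assumption).
        private final Semaphore permit = new Semaphore(1);
        private volatile boolean readOnlyBlock = false;
        private int appliedOps = 0;

        // Simulates the close operation adding a block while a recovery is running.
        void addReadOnlyBlock() {
            readOnlyBlock = true;
        }

        // Number of operations applied so far.
        int appliedOps() {
            return appliedOps;
        }

        // Applies one batch of recovered operations under the permit.
        // The block is checked while the permit is held; if it is present,
        // the batch is rejected so the caller can fail the recovery.
        void applyBatch(List<String> operations) {
            permit.acquireUninterruptibly();
            try {
                if (readOnlyBlock) {
                    throw new IllegalStateException("block appeared during recovery");
                }
                appliedOps += operations.size();
            } finally {
                permit.release();
            }
        }
    }
}
```

In this model, calling `applyBatch` after `addReadOnlyBlock()` throws, so a recovery loop built on it would stop at the first batch sent after the block appears. Batches applied earlier are not undone, which is one reason the approach needs more thought.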

ywelsch · 2019-01-11T11:25:56Z

server/src/test/java/org/elasticsearch/indices/state/CloseWhileRelocatingShardsIT.java

+                ((MockTransportService) internalCluster().getInstance(TransportService.class, targetNode))
+                    .addSendBehavior(internalCluster().getInstance(TransportService.class, sourceNode.getName()),
+                        (connection, requestId, action, request, options) -> {
+                            if (PeerRecoverySourceService.Actions.START_RECOVERY.equals(action)) {


can we just avoid relocating primaries for now until we have a fix?

The behavior is the same for primary or replica (as we don't fail replicas that cannot be verified by the verify shard before close action) so I don't see why we should only relocate replicas in this test.

I think we could keep the current test and relocate both primaries and replicas, even if some close operations are not acknowledged. Once we fix this, we could simply change the test to ensure that all closes are acknowledged.

tlrx · 2019-01-14T12:31:18Z

Thanks @ywelsch

* master: (28 commits)
  Introduce retention lease serialization (elastic#37447)
  Update Delete Watch to allow unknown fields (elastic#37435)
  Make finalize step of recovery source non-blocking (elastic#37388)
  Update the default for include_type_name to false. (elastic#37285)
  Security: remove SSL settings fallback (elastic#36846)
  Adding mapping for hostname field (elastic#37288)
  Relax assertSameDocIdsOnShards assertion
  Reduce recovery time with compress or secure transport (elastic#36981)
  Implement ccr file restore (elastic#37130)
  Fix Eclipse specific compilation issue (elastic#37419)
  Performance fix. Reduce deprecation calls for the same bulk request (elastic#37415)
  [ML] Use String rep of Version in map for serialisation (elastic#37416)
  Cleanup Deadcode in Rest Tests (elastic#37418)
  Mute IndexShardRetentionLeaseTests.testCommit elastic#37420
  unmuted test
  Remove unused index store in directory service
  Improve CloseWhileRelocatingShardsIT (elastic#37348)
  Fix ClusterBlock serialization and Close Index API logic after backport to 6.x (elastic#37360)
  Update the scroll example in the docs (elastic#37394)
  Update analysis.asciidoc (elastic#37404)
  ...

Improve CloseWhileRelocatingShardsIT

80b419d

tlrx added the >test, v7.0.0, :Distributed Indexing/Distributed, and v6.7.0 labels Jan 11, 2019

tlrx requested a review from ywelsch January 11, 2019 09:25

tlrx mentioned this pull request Jan 11, 2019

[CI] CloseWhileRelocatingShardsIT.testCloseWhileRelocatingShards #37274

Closed

ywelsch reviewed Jan 11, 2019

tlrx mentioned this pull request Jan 11, 2019

Replicate closed indices #33888

Closed

50 tasks

ywelsch approved these changes Jan 11, 2019

tlrx merged commit 07dc8c7 into elastic:master Jan 14, 2019

tlrx added a commit that referenced this pull request Jan 14, 2019

Improve CloseWhileRelocatingShardsIT (#37348)

4e8a44a

tlrx deleted the improve-CloseWhileRelocatingIT branch February 5, 2019 09:26

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
