ShardBulkAction ignore primary response on primary #38901

henningandersen · 2019-02-14T14:12:06Z

Previously, if a version conflict occurred and a previous primary
response was present, the original primary response would be used both
for sending to the replica and for replying to the client. This was an
attempt to fix issues with conflicts after relocations, where a bulk
request would hit a closed shard halfway through and thus have to retry
on the new primary.

With sequence numbers, this leads to an issue: if a primary is demoted
(for example, by a network partition), it sends along the original
response in the request. In case of a conflict on the new primary, the
old response is sent to the replica. That data could be stale, leading
to inconsistency between primary and replica.

Relocations now do an explicit hand-off from the old to the new primary
and ensure that no operations are active while doing this. The special
handling described above is thus no longer necessary. This change
removes the special handling of conflicts and ignores primary responses
when executing shard bulk requests on the primary.

In a follow-up PR, we should consider removing the mutation of the
request, so that the old primary response is no longer sent along to
the new primary.
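
To make the behavioral difference concrete, here is a minimal, self-contained sketch. It is not the actual `TransportShardBulkAction` code: `ItemResult`, `resolveOld`, and `resolveNew` are hypothetical names used only to model "fall back to the carried response on conflict" versus "ignore the carried response".

```java
// Simplified model of the change. All names here are hypothetical and
// do not correspond to real Elasticsearch classes or methods.
final class PrimaryResponseSketch {

    /** Outcome of executing one bulk item on a primary. */
    static final class ItemResult {
        final boolean conflict; // true if the operation hit a version conflict
        final long seqNo;       // sequence number assigned by the primary

        ItemResult(boolean conflict, long seqNo) {
            this.conflict = conflict;
            this.seqNo = seqNo;
        }
    }

    /** Old behavior: on conflict, reuse the response carried in the request. */
    static ItemResult resolveOld(ItemResult carried, ItemResult fresh) {
        if (fresh.conflict && carried != null) {
            return carried; // may be stale if it came from a demoted primary
        }
        return fresh;
    }

    /** New behavior: the carried response is ignored on the primary. */
    static ItemResult resolveNew(ItemResult carried, ItemResult fresh) {
        return fresh;
    }
}
```

In this model, a request retried on a new primary that hits a conflict is reported (and replicated) as a conflict, instead of reusing whatever the demoted primary recorded.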

elasticmachine · 2019-02-14T14:12:08Z

Pinging @elastic/es-distributed

bleskes

LGTM

ywelsch · 2019-02-14T14:23:41Z

server/src/test/java/org/elasticsearch/action/bulk/TransportShardBulkActionTests.java

@@ -809,6 +828,14 @@ public void testRetries() throws Exception {
        assertThat(response.getSeqNo(), equalTo(13L));
    }

+    private void randomSetIgnoredPrimaryResponse(BulkItemRequest primaryRequest) {


nit: s/randomly/random/

ywelsch · 2019-02-14T14:24:35Z

server/src/test/java/org/elasticsearch/discovery/ClusterDisruptionIT.java

+
+
+    // once this has proven to work out fine in all cases, we can revert this to randomly picking the conflict mode.
+    public void testAckedIndexCreateOnly() throws Exception {


nit: testAckedIndexingWithCreateOpType

ywelsch · 2019-02-14T14:24:51Z

server/src/test/java/org/elasticsearch/discovery/ClusterDisruptionIT.java

+        testAckedIndexing(ConflictMode.create);
+    }
+
+    public void testAckedIndexExternalVersioning() throws Exception {


nit: testAckedIndexingWithExternalVersioning

ywelsch · 2019-02-14T14:25:53Z

server/src/test/java/org/elasticsearch/discovery/ClusterDisruptionIT.java

@@ -111,7 +140,9 @@ public void testAckedIndexing() throws Exception {
        final AtomicReference<CountDownLatch> countDownLatchRef = new AtomicReference<>();
        final List<Exception> exceptedExceptions = new CopyOnWriteArrayList<>();

-        logger.info("starting indexers");
+//        final ConflictMode conflictMode = ConflictMode.randomMode();


I think it's ok to choose this randomly instead of having three separate tests, especially as this test typically takes a bit of time to run.

I wanted my PR builds to have all 3 variants running. Will change to randomMode before merging to master.
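
For illustration, picking the mode at random could look roughly like the sketch below. Only `create`, `external`, and `randomMode` appear in this thread; the name `none` for the third variant and the explicit `Random` parameter are assumptions made to keep the example standalone.

```java
import java.util.Random;

// Standalone sketch; the real test enum may differ in names and signature.
final class ConflictModeSketch {

    enum ConflictMode {
        none,     // assumed name for the third variant, not shown in this thread
        external, // index with an external version
        create;   // index with the create op type

        /** Picks one of the declared modes uniformly at random. */
        static ConflictMode randomMode(Random random) {
            ConflictMode[] values = values();
            return values[random.nextInt(values.length)];
        }
    }
}
```

A single test method can then call `ConflictMode.randomMode(...)` once per run, so that repeated CI runs cover all variants over time without tripling the runtime of each build.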

ywelsch · 2019-02-14T14:26:53Z

server/src/test/java/org/elasticsearch/discovery/ClusterDisruptionIT.java

+                                    .setTimeout(timeout);
+
+                                if (conflictMode == ConflictMode.external) {
+                                    indexRequestBuilder.setVersion(10).setVersionType(VersionType.EXTERNAL);


Randomly choose a version, e.g. randomIntBetween(1, 10)?
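
As a rough sketch of what this suggestion amounts to, the helper below draws an inclusive random integer. It is a standalone stand-in written for this example, taking an explicit `Random`; it is not the test framework's own `randomIntBetween`.

```java
import java.util.Random;

// Standalone stand-in for a test-framework helper; assumes min <= max.
final class RandomVersionSketch {

    /** Returns a value in [min, max], both bounds inclusive. */
    static int randomIntBetween(Random random, int min, int max) {
        return min + random.nextInt(max - min + 1);
    }
}
```

The hard-coded `setVersion(10)` in the hunk above would then take the result of such a call, e.g. a value between 1 and 10, instead of a fixed number.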

Better naming of test methods and use a random external version.

Collapse 3 tests into one and pick the mode randomly instead.

henningandersen · 2019-02-15T07:52:43Z

@elasticmachine run elasticsearch-ci/1

Previously, if a version conflict occurred and a previous primary response was present, the original primary response would be used both for sending to the replica and for replying to the client. This was introduced in the past as an attempt to fix issues with conflicts after relocations, where a bulk request would hit a closed shard halfway through and thus have to retry on the new primary. It could then fail on its own update.

With sequence numbers, this leads to an issue: if a primary is demoted (for example, by a network partition), it sends along the original response in the request. In case of a conflict on the new primary, the old response is sent to the replica. That data could be stale, leading to inconsistency between primary and replica.

Relocations now do an explicit hand-off from the old to the new primary and ensure that no operations are active while doing this. The special handling described above is thus no longer necessary. This change removes the special handling of conflicts and ignores primary responses when executing shard bulk requests on the primary.

* elastic/master:
  Avoid double term construction in DfsPhase (elastic#38716)
  Fix typo in DateRange docs (yyy → yyyy) (elastic#38883)
  Introduced class reuses follow parameter code between ShardFollowTasks (elastic#38910)
  Ensure random timestamps are within search boundary (elastic#38753)
  [CI] Muting method testFollowIndex in IndexFollowingIT
  Update Lucene snapshot repo for 7.0.0-beta1 (elastic#38946)
  SQL: Doc on syntax (identifiers in particular) (elastic#38662)
  Upgrade to Gradle 5.2.1 (elastic#38880)
  Tie break search shard iterator comparisons on cluster alias (elastic#38853)
  Also mmap cfs files for hybridfs (elastic#38940)
  Build: Fix issue with test status logging (elastic#38799)
  Adapt FullClusterRestartIT on master (elastic#38856)
  Fix testAutoFollowing test to use createLeaderIndex() helper method.
  Migrate muted auto follow rolling upgrade test and unmute this test (elastic#38900)
  ShardBulkAction ignore primary response on primary (elastic#38901)
  Recover peers from translog, ignoring soft deletes (elastic#38904)
  Fix NPE on Stale Index in IndicesService (elastic#38891)
  Smarter CCR concurrent file chunk fetching (elastic#38841)
  Fix intermittent failure in ApiKeyIntegTests (elastic#38627)
  re-enable SmokeTestWatcherWithSecurityIT (elastic#38814)

henningandersen added the >bug, :Distributed Indexing/CRUD, v7.0.0, v6.7.0, v8.0.0, v7.2.0, and v7.0.0-beta1 labels Feb 14, 2019

ywelsch removed the v7.0.0-beta1 label Feb 14, 2019

henningandersen requested review from ywelsch and bleskes February 14, 2019 14:23

bleskes approved these changes Feb 14, 2019

ywelsch approved these changes Feb 14, 2019

henningandersen added 2 commits February 14, 2019 15:43

ShardBulkAction ignore primary response on primary

bbddad0

Better naming of test methods and use a random external version.

ShardBulkAction ignore primary response on primary

d8d5039

Collapse 3 tests into one and pick the mode randomly instead.

henningandersen merged commit dacb0df into elastic:master Feb 15, 2019

henningandersen added the backport pending label Feb 15, 2019

henningandersen removed the backport pending label Feb 15, 2019

jakelandis added v7.0.0-rc2 and removed v7.0.0 labels Apr 3, 2019

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
