Conversation

@dnhatn (Member) commented Apr 11, 2019

A stuck peer recovery in #40913 revealed that we retry indefinitely (on new cluster states) if replaying translog operations hits a MapperException. We should not wait and retry if the mapping on the target is at least as recent as the mapping version that the primary used to index the operations being replayed.

Relates #40913
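
To make the intended behaviour concrete, here is a minimal sketch of that decision, under assumed names (MapperFailureDecision, shouldRetryOnNewClusterState, localMappingVersion, mappingVersionOnPrimary are illustrative, not the actual Elasticsearch identifiers):

// Hypothetical sketch only: class and method names are illustrative, not the real
// Elasticsearch APIs. It captures the decision described above: retry on a newer
// cluster state only while the target's mapping is older than the mapping version
// the primary used to index the operations being replayed.
final class MapperFailureDecision {

    /** Returns true if recovery should wait for a newer cluster state and retry. */
    static boolean shouldRetryOnNewClusterState(long localMappingVersion, long mappingVersionOnPrimary) {
        // If the target's mapping is already at least as recent as the one the primary
        // used, retrying cannot help, so the recovery should fail instead of looping.
        return localMappingVersion < mappingVersionOnPrimary;
    }
}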

@dnhatn dnhatn added the >enhancement, :Distributed Indexing/Recovery (anything around constructing a new shard, either from a local or a remote source), v8.0.0, and v7.2.0 labels Apr 11, 2019
@elasticmachine (Collaborator) commented:

Pinging @elastic/es-distributed

@s1monw (Contributor) left a comment

LGTM

@henningandersen (Contributor) left a comment

LGTM.

I added a few comments to consider.

private final long maxSeenAutoIdTimestampOnPrimary;
private final long maxSeqNoOfUpdatesOrDeletesOnPrimary;
private final RetentionLeases retentionLeases;
private final long mappingVersion;
@henningandersen (Contributor) commented on the diff:

This version is an upper bound on the mappingVersion necessary to handle the translog operations. I think I would like to rename it to mappingVersionUpperBound to reflect that.

@dnhatn (Member, Author) replied Apr 11, 2019:

I renamed it to mappingVersionOnPrimary and documented it. It's not an upper bound of the mapping version, since the replica can have a higher mapping version than that parameter when we index those translog operations. Let me know if you still feel mappingVersionUpperBound is more appropriate.
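
For illustration, a sketch of what the documented parameter could look like; the enclosing class name and the Javadoc wording here are assumptions, not copied from the change:

// Illustrative sketch; TranslogOpsRequestSketch and the Javadoc wording are assumptions,
// not the actual request class touched by this PR.
final class TranslogOpsRequestSketch {

    /**
     * The mapping version on the primary at the time it indexed the operations being
     * replayed to the target. This is not an upper bound for the target: the target may
     * legitimately hold a higher mapping version when it applies these operations.
     */
    private final long mappingVersionOnPrimary;

    TranslogOpsRequestSketch(long mappingVersionOnPrimary) {
        this.mappingVersionOnPrimary = mappingVersionOnPrimary;
    }

    long mappingVersionOnPrimary() {
        return mappingVersionOnPrimary;
    }
}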

@dnhatn (Member, Author) commented Apr 19, 2019

@s1monw @henningandersen I have pushed 95072ae to bypass the no-failure assertion for the newly added test. We need to keep this assertion since it has caught bugs before. I am requesting your reviews again to make sure we agree on this change. Thank you!

@henningandersen (Contributor) left a comment

Thanks @dnhatn. Revisiting this PR did make me wonder about the assertion; please have a look at my comments.

assertThat(recoveryStates.get(0).getTranslog().recoveredOperations(), greaterThan(0));
}

public void testDoNotInfinitelyWaitForMapping() {
@henningandersen (Contributor) commented on the new test:

This test triggers the assertion failure and then tests the behavior without the assertion. I wonder if it would be more appropriate to either:

  1. fail harder on the assertion failure when assertions are not enabled (i.e., not throw a MapperException but wrap it) and then only handle MapperException in PeerRecoveryTargetService,
  2. allow MapperExceptions that come out of translog indexing (which should normally not happen?) since that can apparently happen, or
  3. check that we have the right mapping version up front and fail hard on any mapping failures during translog indexing (a rough sketch of this option follows below).

I am not too fond of disabling assertions to test behaviour without them. I think assertions should be thought of as set in stone and unbreakable so that we do not have to worry about what happens when assertions are broken. On the other hand, I recognize that this change does directly address the specific situation encountered. Let me know your thoughts on this.
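
For comparison, a rough sketch of what option 3 might look like; all names here are hypothetical, and this is not what the PR ultimately implemented (the second approach was chosen later in this thread):

// Hypothetical sketch of option 3 above; names are illustrative, not Elasticsearch APIs.
// The target's mapping version is verified before any operation is replayed, so a mapping
// failure during replay can be treated as a hard, unexpected failure rather than retried.
import java.util.List;

final class UpFrontMappingCheckSketch {

    static void replayOperations(List<Runnable> operations,
                                 long localMappingVersion,
                                 long mappingVersionOnPrimary) {
        if (localMappingVersion < mappingVersionOnPrimary) {
            // Mapping is stale: signal the caller to wait for a newer cluster state first.
            throw new IllegalStateException("local mapping is older than the primary's; retry on a newer cluster state");
        }
        for (Runnable op : operations) {
            // Any mapping failure here would now indicate a bug, not a stale mapping.
            op.run();
        }
    }
}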

if (action.equals(PeerRecoverySourceService.Actions.START_RECOVERY)) {
if (recoveryBlocked.tryAcquire()) {
PluginsService pluginService = internalCluster().getInstance(PluginsService.class, node.getName());
for (TestAnalysisPlugin plugin : pluginService.filterPlugins(TestAnalysisPlugin.class)) {
@henningandersen (Contributor) commented on the diff:

Would it be possible to add a test that hits the regular mapping update required result type, checked for here: https://github.com/elastic/elasticsearch/pull/41099/files#diff-f9ecc51fd8c3001dd782c93d9a040546L350 ?

@dnhatn (Member, Author) replied:

Yes, I will add this test in a follow-up.

@henningandersen (Contributor) replied:

Thanks!

@dnhatn (Member, Author) commented Apr 23, 2019

Thanks so much @henningandersen for your suggestions. I pushed 1a31c4b, which implements your second approach. Can you have another look?

@dnhatn dnhatn requested a review from henningandersen April 23, 2019 14:00
@henningandersen (Contributor) left a comment

LGTM.

Thanks @dnhatn

@dnhatn (Member, Author) commented Apr 24, 2019

@s1monw and @henningandersen Thanks for reviewing.

@dnhatn dnhatn merged commit 24e3145 into elastic:master Apr 24, 2019
@dnhatn dnhatn deleted the recovery-retry-mapping branch April 24, 2019 01:44
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Apr 27, 2019
A stuck peer recovery in elastic#40913 reveals that we indefinitely retry on
new cluster states if indexing translog operations hits a mapper
exception. We should not wait and retry if the mapping on the target is
as recent as the mapping that the primary used to index the replaying
operations.

Relates elastic#40913
dnhatn added a commit that referenced this pull request Apr 27, 2019
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Apr 27, 2019
…ble-map-v1

* elastic/master:
  Adjust bwc version (elastic#41099)
  Fix multi-node parsing in voting config exclusions REST API (elastic#41588)
  Add missing skip: arbitrary_key (elastic#41492)
  [ML] cleanup + adding description field to transforms (elastic#41554)
akhil10x5 pushed a commit to akhil10x5/elasticsearch that referenced this pull request May 2, 2019 (same commit message as above)
akhil10x5 pushed a commit to akhil10x5/elasticsearch that referenced this pull request May 2, 2019
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019 (same commit message as above)
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019