Conversation

@dnhatn (Member) commented Apr 11, 2019

A stuck peer recovery in #40913 revealed that we retry indefinitely (on new cluster states) if replaying translog operations hits a MapperException. We should not wait and retry if the mapping on the target is at least as recent as the mapping version that the primary used to index the operations being replayed.

Relates #40913
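
To make the intended behaviour concrete, here is a minimal sketch of that decision, under assumed names (MapperFailureDecision, shouldRetryOnNewClusterState, localMappingVersion, mappingVersionOnPrimary are illustrative, not the actual Elasticsearch identifiers):

// Hypothetical sketch only: class and method names are illustrative, not the real
// Elasticsearch APIs. It captures the decision described above: retry on a newer
// cluster state only while the target's mapping is older than the mapping version
// the primary used to index the operations being replayed.
final class MapperFailureDecision {

    /** Returns true if recovery should wait for a newer cluster state and retry. */
    static boolean shouldRetryOnNewClusterState(long localMappingVersion, long mappingVersionOnPrimary) {
        // If the target's mapping is already at least as recent as the one the primary
        // used, retrying cannot help, so the recovery should fail instead of looping.
        return localMappingVersion < mappingVersionOnPrimary;
    }
}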

@dnhatn dnhatn added the >enhancement, :Distributed Indexing/Recovery (anything around constructing a new shard, either from a local or a remote source), v8.0.0, and v7.2.0 labels Apr 11, 2019
@elasticmachine (Collaborator) commented:

Pinging @elastic/es-distributed

@s1monw (Contributor) left a comment

LGTM

@henningandersen (Contributor) left a comment

LGTM.

I added a few comments to consider.

private final long maxSeenAutoIdTimestampOnPrimary;
private final long maxSeqNoOfUpdatesOrDeletesOnPrimary;
private final RetentionLeases retentionLeases;
private final long mappingVersion;
@henningandersen (Contributor) commented on the diff:

This version is an upper bound on the mappingVersion necessary to handle the translog operations. I think I would like to rename it to mappingVersionUpperBound to reflect that.

@dnhatn (Member, Author) replied Apr 11, 2019:

I renamed it to mappingVersionOnPrimary and documented it. It's not an upper bound of the mapping version, since the replica can have a higher mapping version than that parameter when we index those translog operations. Let me know if you still feel mappingVersionUpperBound is more appropriate.
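
For illustration, a sketch of what the documented parameter could look like; the enclosing class name and the Javadoc wording here are assumptions, not copied from the change:

// Illustrative sketch; TranslogOpsRequestSketch and the Javadoc wording are assumptions,
// not the actual request class touched by this PR.
final class TranslogOpsRequestSketch {

    /**
     * The mapping version on the primary at the time it indexed the operations being
     * replayed to the target. This is not an upper bound for the target: the target may
     * legitimately hold a higher mapping version when it applies these operations.
     */
    private final long mappingVersionOnPrimary;

    TranslogOpsRequestSketch(long mappingVersionOnPrimary) {
        this.mappingVersionOnPrimary = mappingVersionOnPrimary;
    }

    long mappingVersionOnPrimary() {
        return mappingVersionOnPrimary;
    }
}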

@dnhatn (Member, Author) commented Apr 19, 2019

@s1monw @henningandersen I have pushed 95072ae to bypass the no-failure assertion for the newly added test. We need to keep this assertion since it has caught bugs before. I am requesting your reviews again to make sure we agree on this change. Thank you!

@henningandersen (Contributor) left a comment

Thanks @dnhatn. Revisiting this PR did make me wonder about the assertion; please have a look at my comments.

assertThat(recoveryStates.get(0).getTranslog().recoveredOperations(), greaterThan(0));
}

public void testDoNotInfinitelyWaitForMapping() {
@henningandersen (Contributor) commented on the new test:

This test triggers the assertion failure and then tests the behavior without the assertion. I wonder if it would be more appropriate to either:

  1. fail harder on the assertion failure when assertions are not enabled (i.e., not throw a MapperException but wrap it) and then only handle MapperException in PeerRecoveryTargetService,
  2. allow MapperExceptions that come out of translog indexing (which should normally not happen?) since that can apparently happen, or
  3. check that we have the right mapping version up front and fail hard on any mapping failures during translog indexing (a rough sketch of this option follows below).

I am not too fond of disabling assertions to test behaviour without them. I think assertions should be thought of as set in stone and unbreakable so that we do not have to worry about what happens when assertions are broken. On the other hand, I recognize that this change does directly address the specific situation encountered. Let me know your thoughts on this.
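
For comparison, a rough sketch of what option 3 might look like; all names here are hypothetical, and this is not what the PR ultimately implemented (the second approach was chosen later in this thread):

// Hypothetical sketch of option 3 above; names are illustrative, not Elasticsearch APIs.
// The target's mapping version is verified before any operation is replayed, so a mapping
// failure during replay can be treated as a hard, unexpected failure rather than retried.
import java.util.List;

final class UpFrontMappingCheckSketch {

    static void replayOperations(List<Runnable> operations,
                                 long localMappingVersion,
                                 long mappingVersionOnPrimary) {
        if (localMappingVersion < mappingVersionOnPrimary) {
            // Mapping is stale: signal the caller to wait for a newer cluster state first.
            throw new IllegalStateException("local mapping is older than the primary's; retry on a newer cluster state");
        }
        for (Runnable op : operations) {
            // Any mapping failure here would now indicate a bug, not a stale mapping.
            op.run();
        }
    }
}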

if (action.equals(PeerRecoverySourceService.Actions.START_RECOVERY)) {
if (recoveryBlocked.tryAcquire()) {
PluginsService pluginService = internalCluster().getInstance(PluginsService.class, node.getName());
for (TestAnalysisPlugin plugin : pluginService.filterPlugins(TestAnalysisPlugin.class)) {
@henningandersen (Contributor) commented on the diff:

Would it be possible to add a test that hits the regular mapping update required result type, checked for here: https://github.com/elastic/elasticsearch/pull/41099/files#diff-f9ecc51fd8c3001dd782c93d9a040546L350 ?

@dnhatn (Member, Author) replied:

Yes, I will add this test in a follow-up.

@henningandersen (Contributor) replied:

Thanks!

@dnhatn (Member, Author) commented Apr 23, 2019

Thanks so much @henningandersen for your suggestions. I pushed 1a31c4b, which implements your second approach. Can you have another look?

@dnhatn dnhatn requested a review from henningandersen April 23, 2019 14:00
@henningandersen (Contributor) left a comment

LGTM.

Thanks @dnhatn

@dnhatn (Member, Author) commented Apr 24, 2019

@s1monw and @henningandersen Thanks for reviewing.

@dnhatn dnhatn merged commit 24e3145 into elastic:master Apr 24, 2019
@dnhatn dnhatn deleted the recovery-retry-mapping branch April 24, 2019 01:44
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Apr 27, 2019
A stuck peer recovery in elastic#40913 reveals that we indefinitely retry on
new cluster states if indexing translog operations hits a mapper
exception. We should not wait and retry if the mapping on the target is
as recent as the mapping that the primary used to index the replaying
operations.

Relates elastic#40913
dnhatn added a commit that referenced this pull request Apr 27, 2019
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Apr 27, 2019
…ble-map-v1

* elastic/master:
  Adjust bwc version (elastic#41099)
  Fix multi-node parsing in voting config exclusions REST API (elastic#41588)
  Add missing skip: arbitrary_key (elastic#41492)
  [ML] cleanup + adding description field to transforms (elastic#41554)
akhil10x5 pushed a commit to akhil10x5/elasticsearch that referenced this pull request May 2, 2019 (same commit message as above)
akhil10x5 pushed a commit to akhil10x5/elasticsearch that referenced this pull request May 2, 2019
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019 (same commit message as above)
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019