-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Peer recovery should not indefinitely retry on mapping error #41099
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Pinging @elastic/es-distributed |
s1monw
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
server/src/main/java/org/elasticsearch/indices/recovery/RecoveryTranslogOperationsRequest.java
Outdated
Show resolved
Hide resolved
henningandersen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
I added a few comments to consider.
server/src/main/java/org/elasticsearch/indices/recovery/RecoveryTranslogOperationsRequest.java
Show resolved
Hide resolved
| private final long maxSeenAutoIdTimestampOnPrimary; | ||
| private final long maxSeqNoOfUpdatesOrDeletesOnPrimary; | ||
| private final RetentionLeases retentionLeases; | ||
| private final long mappingVersion; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This version is an upper bound on the mappingVersion necessary to handle the translog operations. I think I would like to rename it to mappingVersionUpperBound to reflect that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I renamed it to mappingVersionOnPrimary and documented it. It's not an upper bound of mapping version since the replica can have a higher mapping version than that parameter when we index those translog operations. Let me know if you still feel mappingVersionUpperBound is more appropriate.
server/src/main/java/org/elasticsearch/indices/recovery/RemoteRecoveryTargetHandler.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/indices/recovery/RecoveryTargetHandler.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/indices/recovery/RecoveryTarget.java
Outdated
Show resolved
Hide resolved
|
@s1monw @henningandersen I have pushed 95072ae to bypass the no failure assertion for the newly added test. We need to keep this assertion since it caught bug before. I am requesting your reviews again to make sure we agree on this change. Thank you! |
henningandersen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @nhat, a revisit of this PR did make me wonder about the assertion, please have a look at my comments.
| assertThat(recoveryStates.get(0).getTranslog().recoveredOperations(), greaterThan(0)); | ||
| } | ||
|
|
||
| public void testDoNotInfinitelyWaitForMapping() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test triggers the assertion failure and then tests the behavior without the assertion. I wonder if it was more appropriate to either:
- fail harder on the assertion failure when assertions are not enabled, i.e., not throw a
MappingExceptionby wrapping it) and then only handleMappingExceptioninPeerTargetRecoveryService. - allow MappingExceptions that come out of translog indexing (which should normally not happen?) since that can apparently happen.
- check that we have the right mapping version up front and fail hard on any mapping failures during translog indexing.
I am not too fond of disabling assertions to test behaviour without them. I think assertions should be thought of as set in stone and unbreakable so that we do not have to worry about what happens when assertions are broken. On the other hand, I recognize that this does directly address the specific situation encountered. Let me know your thoughts on this.
| if (action.equals(PeerRecoverySourceService.Actions.START_RECOVERY)) { | ||
| if (recoveryBlocked.tryAcquire()) { | ||
| PluginsService pluginService = internalCluster().getInstance(PluginsService.class, node.getName()); | ||
| for (TestAnalysisPlugin plugin : pluginService.filterPlugins(TestAnalysisPlugin.class)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to add a test that hits the regular mapping update required result type, checked for here: https://github.com/elastic/elasticsearch/pull/41099/files#diff-f9ecc51fd8c3001dd782c93d9a040546L350 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I will add this test in a follow-up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
|
Thanks so much @henningandersen for your suggestions. I pushed 1a31c4b for your second approach. Can you have another look? |
henningandersen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Thanks @dnhatn
|
@s1monw and @henningandersen Thanks for reviewing. |
A stuck peer recovery in elastic#40913 reveals that we indefinitely retry on new cluster states if indexing translog operations hits a mapper exception. We should not wait and retry if the mapping on the target is as recent as the mapping that the primary used to index the replaying operations. Relates elastic#40913
…ble-map-v1 * elastic/master: Adjust bwc version (elastic#41099) Fix multi-node parsing in voting config exclusions REST API (elastic#41588) Add missing skip: arbitrary_key (elastic#41492) [ML] cleanup + adding description field to transforms (elastic#41554)
A stuck peer recovery in elastic#40913 reveals that we indefinitely retry on new cluster states if indexing translog operations hits a mapper exception. We should not wait and retry if the mapping on the target is as recent as the mapping that the primary used to index the replaying operations. Relates elastic#40913
A stuck peer recovery in elastic#40913 reveals that we indefinitely retry on new cluster states if indexing translog operations hits a mapper exception. We should not wait and retry if the mapping on the target is as recent as the mapping that the primary used to index the replaying operations. Relates elastic#40913
A stuck peer recovery in #40913 reveals that we indefinitely retry (on new cluster states) if replaying translog operations hits a MapperException. We should not wait and retry if the mapping on the target is as recent as the mapping version that the primary used to index the replaying operations.
Relates #40913