Handle Concurrent Repo Modification to Fix Test #48433
Conversation
Just like elastic#48329 (and using the changes from that PR), we can run into a concurrent repo modification that we will throw on, and we must retry until consistent handling of this situation is implemented. Closes elastic#47384
Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)
```java
@@ -215,6 +215,10 @@ public void testRetentionWhileSnapshotInProgress() throws Exception {
             SnapshotsStatusResponse s =
                 client().admin().cluster().prepareSnapshotStatus(REPO).setSnapshots(completedSnapshotName).get();
             assertNull("expected no snapshot but one was returned", s.getSnapshots().get(0));
+        } catch (RepositoryException e) {
+            // Concurrent status calls and write operations may lead to failures in determining the current repository generation
+            // TODO: Remove this hack once tracking the current repository generation has been made consistent
```
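The catch-and-retry workaround above can be generalized as a small helper that retries an action while a specific, known-transient exception is thrown. This is only a sketch of the pattern, not the actual Elasticsearch implementation; the `Retry.retryOn` name and signature are illustrative.

```java
import java.util.function.Supplier;

public class Retry {
    /**
     * Runs {@code action}, retrying up to {@code maxAttempts} times while it
     * throws the given retriable exception type (e.g. a concurrent repository
     * modification), and propagating any other failure immediately.
     * A hypothetical helper illustrating the pattern used in the test above.
     */
    public static <T> T retryOn(Class<? extends RuntimeException> retriable,
                                int maxAttempts, Supplier<T> action) {
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                if (retriable.isInstance(e) == false) {
                    throw e; // unexpected failure: propagate
                }
                last = e; // transient concurrent modification: try again
            }
        }
        throw last; // budget exhausted: surface the last transient failure
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulate an API call that fails twice with a transient error,
        // then succeeds once the repository state is consistent again.
        String result = retryOn(IllegalStateException.class, 5, () -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("repo generation changed concurrently");
            }
            return "consistent";
        });
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

In a test, the same effect can be had with the test framework's own retry utilities (e.g. wrapping the status call in `assertBusy`), which is why the inline `try`/`catch` above is explicitly marked as a temporary hack.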
Is there an issue we could link to here so that we have something we can reference to see if it's safe to remove the hack?
Not yet I'm afraid. We have #38941 but the fix to the corruption issue may not fully resolve this situation yet.
Not even sure we have to solve this one with any kind of priority outside of tests.

What's basically happening here is that the snapshot status API breaks for a tiny window during snapshot delete and create. In practice that window is a little more than the latency of one API request, so it's really hard to actually run into it, and you'll probably hit other IO issues more often than this. As it turns out, though, the SLM tests are the only tests running these APIs in hot loops, and they are shaking these kinds of issues out. Maybe this is even an OK long-term solution? (I'll gather some feedback on that tomorrow and will create an issue accordingly :))
We discussed this today during snapshot resiliency sync and I'll code up a fix for this shortly and will remove the todos :)
Is that really the issue you want to close?
@original-brownbear This test was muted in #48441. When you do the backports for this fix, can you unmute the test in the other branches if necessary? (I'll take care of unmuting in master.)
Was this ever backported?
Yeah, my bad for not linking things properly; this change was pulled into 7.6 by Gordon in 5021410.
Just like #48329 (and using the changes from that PR),
we can run into a concurrent repo modification that we
will throw on, and we must retry until consistent handling
of this situation is implemented.
Closes #47834