Handle Concurrent Repo Modification to Fix Test #48433

Merged (1 commit) on Oct 24, 2019

original-brownbear
Member

@original-brownbear original-brownbear commented Oct 23, 2019

Just like in #48329 (and using the changes from that PR),
we can run into a concurrent repo modification here. We
throw on it and must retry until this situation is
handled consistently.

Closes #47834

@original-brownbear added the >test (Issues or PRs that are addressing/adding tests), :Data Management/ILM+SLM (Index and Snapshot lifecycle management), v8.0.0, and v7.6.0 labels on Oct 23, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)

@@ -215,6 +215,10 @@ public void testRetentionWhileSnapshotInProgress() throws Exception {
SnapshotsStatusResponse s =
client().admin().cluster().prepareSnapshotStatus(REPO).setSnapshots(completedSnapshotName).get();
assertNull("expected no snapshot but one was returned", s.getSnapshots().get(0));
} catch (RepositoryException e) {
// Concurrent status calls and write operations may lead to failures in determining the current repository generation
// TODO: Remove this hack once tracking the current repository generation has been made consistent
Contributor

Is there an issue we could link to here so that we have something we can reference to see if it's safe to remove the hack?

Member Author

@original-brownbear original-brownbear Oct 23, 2019

Not yet I'm afraid. We have #38941 but the fix to the corruption issue may not fully resolve this situation yet.

Not even sure we have to solve this one with any kind of priority outside of tests:

What's basically happening here is that the snapshot status API breaks for a tiny window during snapshot delete and create. In practice that window is little more than the latency of one API request, so it's really hard to actually run into it, and you'll probably hit other IO issues more often than this one. As it turns out, though, SLM tests are the only tests running these APIs in hot loops, so they are the ones shaking these kinds of issues out. Maybe this is even an OK long-term solution? (I'll gather some feedback on that tomorrow and will create an issue accordingly :))
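
For illustration only, here is a minimal Java sketch of the retry idea described in this thread: keep re-running a status check while it fails with a repository exception, and give up after a bounded number of attempts. It is not the code from this PR (the diff above is only partially visible). `RetryOnConcurrentModification`, `retryOnRepositoryException`, and the nested `RepositoryException` are hypothetical stand-ins, not Elasticsearch classes or test utilities.

```java
import java.util.function.Supplier;

class RetryOnConcurrentModification {

    /** Hypothetical stand-in for the repository exception seen during the race window. */
    static class RepositoryException extends RuntimeException {
        RepositoryException(String message) {
            super(message);
        }
    }

    /**
     * Runs the check until it completes without a RepositoryException.
     * Rethrows the last RepositoryException once maxAttempts is exhausted.
     */
    static <T> T retryOnRepositoryException(Supplier<T> check, int maxAttempts, long sleepMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return check.get();
            } catch (RepositoryException e) {
                if (attempt >= maxAttempts) {
                    throw e; // out of attempts: surface the failure
                }
                sleepQuietly(sleepMillis); // wait for the concurrent write to finish
            }
        }
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting to retry", ie);
        }
    }
}
```

A caller would wrap the status request in the supplier, for example `retryOnRepositoryException(() -> fetchStatus(), 10, 100L)`, where `fetchStatus` is again a hypothetical helper. Bounding the attempts keeps a persistent repository failure from being silently retried forever.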

Member Author

We discussed this today during the snapshot resiliency sync. I'll code up a fix for this shortly and will remove the TODOs :)

@tlrx
Member

tlrx commented Oct 24, 2019

Closes #47384

Is that really the issue you want to close?

@original-brownbear
Member Author

@tlrx thanks for taking a look and spotting that! Obviously mixed up the last two numbers there and want to close #47834 instead :)

@gwbrown
Contributor

gwbrown commented Oct 24, 2019

@original-brownbear This test was muted in #48441. When you do the backports for this fix, can you unmute the test in the other branches if necessary? (I'll take care of unmuting in master.)

@ywelsch
Contributor

ywelsch commented Feb 27, 2020

Was this ever backported?

@original-brownbear
Member Author

Yeah, my bad for not linking things properly. This change was pulled into 7.6 by Gordon in 5021410.

Labels
:Data Management/ILM+SLM (Index and Snapshot lifecycle management), >test (Issues or PRs that are addressing/adding tests), v7.6.0, v8.0.0-alpha1
Development

Successfully merging this pull request may close these issues.

[CI] SLMSnapshotBlockingIntegTests.testRetentionWhileSnapshotInProgress failing
6 participants