[CI] SLMSnapshotBlockingIntegTests.testRetentionWhileSnapshotInProgress failure on master #46508
Pinging @elastic/es-core-features
This commit adds a wait/check for all running snapshots to be cleared before taking another snapshot. The previous snapshot was successful but had not yet been cleared from the cluster state, so the second snapshot failed due to a `ConcurrentSnapshotException`. Resolves elastic#46508
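The wait/check described in that commit can be sketched as a simple poll that blocks until no snapshots are in progress. This is a minimal illustration, not the actual Elasticsearch test code; `awaitNoRunningSnapshots` and the `IntSupplier` standing in for the in-progress snapshot count in the cluster state are hypothetical:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntSupplier;

public class AwaitNoRunningSnapshots {
    // Poll until the reported number of in-progress snapshots drops to zero,
    // or give up after timeoutMillis.
    static boolean awaitNoRunningSnapshots(IntSupplier inProgressCount, long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            if (inProgressCount.getAsInt() == 0) {
                return true; // safe to start the next snapshot
            }
            if (System.currentTimeMillis() >= deadline) {
                return false; // starting now would risk a ConcurrentSnapshotException
            }
            try {
                Thread.sleep(10); // brief back-off before re-checking
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Simulated cluster state: the first snapshot clears after three polls.
        AtomicInteger polls = new AtomicInteger();
        boolean cleared = awaitNoRunningSnapshots(() -> polls.incrementAndGet() >= 3 ? 0 : 1, 1000);
        System.out.println(cleared); // prints true once the simulated snapshot has cleared
    }
}
```

The point of the fix is that "SUCCESS" status alone is not enough: the entry must also be gone from the cluster state before a second snapshot is started.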
This is no longer failing with the previous exception, but has been failing with the following errors:
And
However, these seem not to have stack traces attached. I'm re-opening this and will continue investigating (I have been unable to reproduce either of these locally).
There are stack traces available via build stats and the Gradle build scans:
and
Neither of the stack traces looks particularly useful. Interestingly,
Just in case it helps, this line reproduces it for me locally 100% of the time on
One more failure in 7.x: https://gradle-enterprise.elastic.co/s/7yo2x4zsaetea/console-log?task=:x-pack:plugin:ilm:test
This doesn't reproduce for me locally with:
I got some more failures with the increased logging level, so I've muted this again until I can look into and fix this.
This commit adds a check to the `SnapshotHistoryStore.putAsync` method that verifies SLM is enabled prior to indexing the history document. Resolves failures from elastic#46508 that were caused by a retention job being kicked off, the test finishing and deleting all indices (including the `.slm-history-*` index), and then retention completing and indexing another history document. This caused the `.slm-history-*` index to be re-created and the test to fail due to the shard lock check in `InternalTestCluster.assertAfterTest`, which verifies that all shards are unlocked.
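A minimal sketch of the guard that commit describes: skip indexing a history document when SLM is disabled, so a late retention run cannot re-create the `.slm-history-*` index after the test cluster has wiped all indices. The `Indexer` interface and the boolean `slmEnabled` flag are hypothetical stand-ins for the real cluster-settings check and indexing call:

```java
public class HistoryStoreGuard {
    // Stand-in for whatever actually writes the history document.
    interface Indexer { void indexDocument(String doc); }

    // Sketch of putAsync: only index the history document while SLM is enabled.
    static boolean putAsync(boolean slmEnabled, String doc, Indexer indexer) {
        if (!slmEnabled) {
            // Dropping the record is safe here: history is only useful while SLM runs.
            return false;
        }
        indexer.indexDocument(doc);
        return true;
    }

    public static void main(String[] args) {
        StringBuilder sink = new StringBuilder();
        putAsync(true, "{\"policy\":\"nightly\"}", sink::append);
        putAsync(false, "{\"policy\":\"nightly\"}", sink::append); // no-op when disabled
        System.out.println(sink.length() > 0); // prints true: only the first call indexed
    }
}
```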
This commit adds the `/_slm/_execute_retention` API endpoint. This endpoint kicks off SLM retention and then returns immediately. This in particular allows us to run retention without scheduling it (for entirely manual invocation) or perform a one-off cleanup. This commit also includes HLRC for the new API, and fixes an issue in SLMSnapshotBlockingIntegTests where retention invoked prior to the test completing could resurrect an index the internal test cluster cleanup had already deleted. Resolves elastic#46508 Relates to elastic#43663
* Add API to execute SLM retention on-demand (#47405). This is a backport of #47405.
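The "kicks off retention and then returns immediately" behaviour of the new endpoint can be sketched as a fire-and-forget handler: submit the retention job to an executor and acknowledge without waiting for completion. `handleExecuteRetention` and the executor wiring are illustrative, not the real transport action:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ExecuteRetentionSketch {
    // Sketch of the endpoint handler: retention runs asynchronously,
    // the response is sent right away.
    static String handleExecuteRetention(ExecutorService executor, Runnable retentionJob) {
        executor.submit(retentionJob);
        return "{\"acknowledged\":true}";
    }

    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CountDownLatch ran = new CountDownLatch(1);
        String response = handleExecuteRetention(executor, ran::countDown);
        System.out.println(response); // acknowledged before the job finishes
        ran.await(); // the retention job still completes in the background
        executor.shutdown();
    }
}
```

This shape is what lets a test (or an operator) trigger a one-off retention run without waiting on the scheduler.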
There's still a possible failure here, but it's a core snapshot bug/inconsistency. Reopening this; it will be closed via a PR shortly.
If we fail to read the global metadata in a snapshot we throw `SnapshotMissingException`, but we don't do so for the index metadata. This breaks SLM tests at a low rate because they rely on the `SnapshotMissingException` thrown from the snapshot status APIs to wait for a snapshot to be gone. Also, we should be consistent here in general and not leak the `NoSuchFileException` to the transport layer for index metadata. Closes elastic#46508
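The consistency fix amounts to translating the low-level `NoSuchFileException` into `SnapshotMissingException` when reading index metadata, the same way it is already handled for global metadata. A hypothetical sketch; only `NoSuchFileException` is the real JDK class, the other names are illustrative:

```java
import java.io.IOException;
import java.nio.file.NoSuchFileException;

public class MetadataReadSketch {
    // Illustrative stand-in for Elasticsearch's exception of the same name.
    static class SnapshotMissingException extends RuntimeException {
        SnapshotMissingException(String snapshot, Throwable cause) {
            super("snapshot [" + snapshot + "] is missing", cause);
        }
    }

    // Stand-in for reading a blob out of the snapshot repository.
    interface BlobReader { byte[] read(String path) throws IOException; }

    static byte[] readIndexMetadata(BlobReader reader, String snapshot, String path) throws IOException {
        try {
            return reader.read(path);
        } catch (NoSuchFileException e) {
            // Previously this leaked to the transport layer; wrapping it means
            // callers that wait for SnapshotMissingException see the expected type.
            throw new SnapshotMissingException(snapshot, e);
        }
    }

    public static void main(String[] args) throws Exception {
        try {
            readIndexMetadata(p -> { throw new NoSuchFileException(p); }, "snap-1", "indices/0/meta.dat");
        } catch (SnapshotMissingException e) {
            System.out.println(e.getMessage()); // prints: snapshot [snap-1] is missing
        }
    }
}
```

With this in place, a test polling the snapshot status API reliably observes `SnapshotMissingException` once the snapshot is gone, whichever metadata blob happens to be read first.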
From https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+periodic/583/console and https://gradle-enterprise.elastic.co/s/6x67ha6426acy/console-log
Likely from this exception when trying to kick off the second snapshot:
My hunch is that the first snapshot has a "SUCCESS" status but is still present in the cluster state. We should ensure it is no longer in the cluster state before issuing the second execute-policy request.