
[CI] SLMSnapshotBlockingIntegTests.testBasicPartialRetention failed #48122

Closed
droberts195 opened this issue Oct 16, 2019 · 6 comments · Fixed by #48329
Assignees
Labels
:Data Management/ILM+SLM Index and Snapshot lifecycle management >test-failure Triaged test failures from CI

Comments

@droberts195
Contributor

SLMSnapshotBlockingIntegTests.testBasicPartialRetention failed in https://gradle-enterprise.elastic.co/s/shm65wzp2dh54

The error is:

org.elasticsearch.xpack.slm.SLMSnapshotBlockingIntegTests > testBasicPartialRetention FAILED
org.elasticsearch.snapshots.SnapshotMissingException: [my-repo:snap-hfs4h9aptqkjnchyhkltla] is missing

This did not reproduce locally using:

./gradlew ':x-pack:plugin:ilm:test' --tests "org.elasticsearch.xpack.slm.SLMSnapshotBlockingIntegTests.testBasicPartialRetention" \
  -Dtests.seed=EAF550063D3EF24D \
  -Dtests.security.manager=true \
  -Dtests.locale=nn \
  -Dtests.timezone=America/Kentucky/Louisville \
  -Dcompiler.java=12 \
  -Druntime.java=11
@droberts195 droberts195 added >test-failure Triaged test failures from CI :Data Management/ILM+SLM Index and Snapshot lifecycle management labels Oct 16, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)

@droberts195
Contributor Author

It's a different test on a different branch (7.x), but SLMSnapshotBlockingIntegTests.testRetentionWhileSnapshotInProgress also failed in https://gradle-enterprise.elastic.co/s/pmnulxslshgs4 with the same error message:

[2019-10-16T13:28:40,809][INFO ][o.e.c.r.a.AllocationService] [node_tm1] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[test][0]]]).
[2019-10-16T13:28:40,811][INFO ][o.e.x.s.SnapshotLifecycleTask] [node_tm1] snapshot lifecycle policy [slm-policy] issuing create snapshot [snap-x-u632e9qjmkrbe08kz_cq]
[2019-10-16T13:28:40,815][INFO ][o.e.x.s.SLMSnapshotBlockingIntegTests] [testRetentionWhileSnapshotInProgress] --> kicked off snapshot snap-x-u632e9qjmkrbe08kz_cq
[2019-10-16T13:28:40,823][ERROR][o.e.x.s.SLMSnapshotBlockingIntegTests] [testRetentionWhileSnapshotInProgress] expected a snapshot but it was missing
org.elasticsearch.snapshots.SnapshotMissingException: [repo-id:snap-x-u632e9qjmkrbe08kz_cq] is missing
	at org.elasticsearch.action.admin.cluster.snapshots.status.TransportSnapshotsStatusAction.buildResponse(TransportSnapshotsStatusAction.java:214) ~[elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
	at org.elasticsearch.action.admin.cluster.snapshots.status.TransportSnapshotsStatusAction.masterOperation(TransportSnapshotsStatusAction.java:100) ~[elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
	at org.elasticsearch.action.admin.cluster.snapshots.status.TransportSnapshotsStatusAction.masterOperation(TransportSnapshotsStatusAction.java:61) ~[elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
	at org.elasticsearch.action.support.master.TransportMasterNodeAction.masterOperation(TransportMasterNodeAction.java:98) ~[elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
	at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.lambda$doStart$3(TransportMasterNodeAction.java:169) ~[elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
	at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73) ~[elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:773) ~[elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_221]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_221]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_221]

The repro command for that one is:

./gradlew ':x-pack:plugin:ilm:test' --tests "org.elasticsearch.xpack.slm.SLMSnapshotBlockingIntegTests.testRetentionWhileSnapshotInProgress" \
  -Dtests.seed=278E4966E79E2EA5 \
  -Dtests.security.manager=true \
  -Dtests.locale=fi-FI \
  -Dtests.timezone=MET \
  -Dcompiler.java=12 \
  -Druntime.java=8

This doesn't reproduce locally on the 7.x branch.

@gwbrown
Contributor

gwbrown commented Oct 16, 2019

I don't think these are the same issue. The latter does have the same exception in the logs, but that exception isn't the cause of the failure: in testRetentionWhileSnapshotInProgress the message is logged each time an assertBusy attempt fails, but the test eventually progresses past that point once the snapshot exists, and then fails on a different assertion. I'll open a new issue for this shortly.
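For context, assertBusy retries an assertion until it passes or a timeout elapses, which is why each failed attempt can show up as a logged exception without being the test's final failure. A minimal sketch of that retry loop (hypothetical simplification, not the actual ESTestCase implementation):

```java
public class BusyAssert {
    // Retry an assertion until it stops throwing or the timeout elapses;
    // rethrow the most recent failure if it never passes. Each caught
    // AssertionError here corresponds to one "expected a snapshot but it
    // was missing" line in the test logs.
    static void assertBusy(Runnable assertion, long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        long waitMillis = 1;
        while (true) {
            try {
                assertion.run();
                return; // assertion finally passed
            } catch (AssertionError e) {
                if (System.currentTimeMillis() >= deadline) {
                    throw e; // timed out: surface the last failure
                }
                Thread.sleep(waitMillis);
                waitMillis = Math.min(waitMillis * 2, 1000); // back off between attempts
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        // Fails (and would be logged) for the first ~50 ms, then passes.
        assertBusy(() -> {
            if (System.currentTimeMillis() - start < 50) {
                throw new AssertionError("snapshot not there yet");
            }
        }, 10_000);
        System.out.println("assertion eventually passed");
    }
}
```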

In the testBasicPartialRetention failure, the snapshot that goes missing is one we had previously verified exists and which, per the logs, was never deleted, so I am very confused.

@gwbrown
Contributor

gwbrown commented Oct 16, 2019

I've opened #48159 for the testRetentionWhileSnapshotInProgress failures so that this issue can stay focused on the testBasicPartialRetention failure.

@romseygeek
Contributor

@original-brownbear original-brownbear self-assigned this Oct 22, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Oct 22, 2019
Loading `RepositoryData` is not an atomic operation:
it uses a list + get combination of calls.
This led to accidentally returning empty repository data
for generations >= 0, which can never legitimately be empty unless
the repository is corrupted.
In the test elastic#48122 (and other SLM tests) there was a low chance of
running into this concurrent-modification scenario: the repository
moves forward two index generations between listing the
index-N blobs and loading the latest one. Since we only keep
two index-N blobs around at a time, this led to unexpectedly absent
snapshots in the status APIs.
Fixing the behavior to be more resilient is non-trivial but in the works.
For now I think we should simply throw in this scenario. This will also
help prevent corruption in the unlikely but possible event of running into
this issue during a snapshot create or delete operation on master failover
with a repository like S3, which doesn't have "no overwrites" protection
when writing a new index-N.

Fixes elastic#48122
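The race described in the commit message can be sketched as follows. This is a hypothetical toy model (the class and method names are invented for illustration, not Elasticsearch's actual BlobStoreRepository code): the repository prunes all but the two most recent index-N blobs, and the non-atomic "list, then get" load can target a generation that is pruned in between.

```java
import java.util.concurrent.ConcurrentSkipListMap;

// Toy model of a blob repository that keeps only the two most recent
// index-N generations. Loading repository data is a non-atomic
// "list the generations, then get one" sequence; if two more generations
// are written in between, the listed generation is gone by the time we read it.
public class IndexGenRace {
    // generation number -> index-N blob contents
    private final ConcurrentSkipListMap<Long, String> blobs = new ConcurrentSkipListMap<>();

    void writeGeneration(long gen, String data) {
        blobs.put(gen, data);
        // prune: keep only the two most recent index-N blobs
        while (blobs.size() > 2) {
            blobs.pollFirstEntry();
        }
    }

    // step 1 of the non-atomic load: list to find the latest generation
    long listLatestGeneration() {
        return blobs.lastKey();
    }

    // step 2: fetch that generation; returns null if it was pruned meanwhile
    String getGeneration(long gen) {
        return blobs.get(gen);
    }

    public static void main(String[] args) {
        IndexGenRace repo = new IndexGenRace();
        repo.writeGeneration(0, "index-0");
        repo.writeGeneration(1, "index-1");

        long latest = repo.listLatestGeneration(); // sees generation 1

        // concurrently, two more generations land before the get
        repo.writeGeneration(2, "index-2");
        repo.writeGeneration(3, "index-3"); // generation 1 is pruned here

        String data = repo.getGeneration(latest);
        System.out.println(data == null
            ? "generation " + latest + " vanished between list and get"
            : data);
        // prints "generation 1 vanished between list and get"
    }
}
```

In the real failure this manifested as an empty `RepositoryData` being returned, which in turn made the status APIs report a snapshot as missing even though it still existed.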
@original-brownbear
Member

Phew :) -> tracked this down in #48329

original-brownbear added a commit that referenced this issue Oct 23, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Nov 2, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Nov 2, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Nov 2, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Nov 2, 2019
original-brownbear added a commit that referenced this issue Nov 2, 2019
original-brownbear added a commit that referenced this issue Nov 2, 2019