
[CI] SLMSnapshotBlockingIntegTests.testBasicPartialRetention failed #48122

Closed
droberts195 opened this issue Oct 16, 2019 · 6 comments · Fixed by #48329
Assignees
Labels
:Data Management/ILM+SLM Index and Snapshot lifecycle management >test-failure Triaged test failures from CI

Comments

@droberts195
Contributor

SLMSnapshotBlockingIntegTests.testBasicPartialRetention failed in https://gradle-enterprise.elastic.co/s/shm65wzp2dh54

The error is:

org.elasticsearch.xpack.slm.SLMSnapshotBlockingIntegTests > testBasicPartialRetention FAILED
org.elasticsearch.snapshots.SnapshotMissingException: [my-repo:snap-hfs4h9aptqkjnchyhkltla] is missing

This did not reproduce locally using:

./gradlew ':x-pack:plugin:ilm:test' --tests "org.elasticsearch.xpack.slm.SLMSnapshotBlockingIntegTests.testBasicPartialRetention" \
  -Dtests.seed=EAF550063D3EF24D \
  -Dtests.security.manager=true \
  -Dtests.locale=nn \
  -Dtests.timezone=America/Kentucky/Louisville \
  -Dcompiler.java=12 \
  -Druntime.java=11
@droberts195 droberts195 added >test-failure Triaged test failures from CI :Data Management/ILM+SLM Index and Snapshot lifecycle management labels Oct 16, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)

@droberts195
Contributor Author

It's a different test on a different branch (7.x), but SLMSnapshotBlockingIntegTests.testRetentionWhileSnapshotInProgress also failed in https://gradle-enterprise.elastic.co/s/pmnulxslshgs4 with the same error message:

[2019-10-16T13:28:40,809][INFO ][o.e.c.r.a.AllocationService] [node_tm1] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[test][0]]]).
[2019-10-16T13:28:40,811][INFO ][o.e.x.s.SnapshotLifecycleTask] [node_tm1] snapshot lifecycle policy [slm-policy] issuing create snapshot [snap-x-u632e9qjmkrbe08kz_cq]
[2019-10-16T13:28:40,815][INFO ][o.e.x.s.SLMSnapshotBlockingIntegTests] [testRetentionWhileSnapshotInProgress] --> kicked off snapshot snap-x-u632e9qjmkrbe08kz_cq
[2019-10-16T13:28:40,823][ERROR][o.e.x.s.SLMSnapshotBlockingIntegTests] [testRetentionWhileSnapshotInProgress] expected a snapshot but it was missing
org.elasticsearch.snapshots.SnapshotMissingException: [repo-id:snap-x-u632e9qjmkrbe08kz_cq] is missing
	at org.elasticsearch.action.admin.cluster.snapshots.status.TransportSnapshotsStatusAction.buildResponse(TransportSnapshotsStatusAction.java:214) ~[elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
	at org.elasticsearch.action.admin.cluster.snapshots.status.TransportSnapshotsStatusAction.masterOperation(TransportSnapshotsStatusAction.java:100) ~[elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
	at org.elasticsearch.action.admin.cluster.snapshots.status.TransportSnapshotsStatusAction.masterOperation(TransportSnapshotsStatusAction.java:61) ~[elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
	at org.elasticsearch.action.support.master.TransportMasterNodeAction.masterOperation(TransportMasterNodeAction.java:98) ~[elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
	at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.lambda$doStart$3(TransportMasterNodeAction.java:169) ~[elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
	at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73) ~[elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:773) ~[elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_221]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_221]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_221]

The repro command for that one is:

./gradlew ':x-pack:plugin:ilm:test' --tests "org.elasticsearch.xpack.slm.SLMSnapshotBlockingIntegTests.testRetentionWhileSnapshotInProgress" \
  -Dtests.seed=278E4966E79E2EA5 \
  -Dtests.security.manager=true \
  -Dtests.locale=fi-FI \
  -Dtests.timezone=MET \
  -Dcompiler.java=12 \
  -Druntime.java=8

This doesn't reproduce locally on the 7.x branch.

@gwbrown
Contributor

gwbrown commented Oct 16, 2019

I don't think these are the same issue. The latter does have the same exception in the logs, but that exception isn't the cause of the failure: in testRetentionWhileSnapshotInProgress the message is logged each time an assertBusy attempt fails, but the test eventually progresses past that point once the snapshot exists, and then fails on a different assertion. I'll open a new issue for this shortly.
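For context, assertBusy retries an assertion until it passes or a timeout elapses, which is why each failed attempt can show up as a logged exception without being the test's final failure. A minimal sketch of that retry loop (hypothetical simplification, not the actual ESTestCase implementation):

```java
public class BusyAssert {
    // Retry an assertion until it stops throwing or the timeout elapses;
    // rethrow the most recent failure if it never passes. Each caught
    // AssertionError here corresponds to one "expected a snapshot but it
    // was missing" line in the test logs.
    static void assertBusy(Runnable assertion, long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        long waitMillis = 1;
        while (true) {
            try {
                assertion.run();
                return; // assertion finally passed
            } catch (AssertionError e) {
                if (System.currentTimeMillis() >= deadline) {
                    throw e; // timed out: surface the last failure
                }
                Thread.sleep(waitMillis);
                waitMillis = Math.min(waitMillis * 2, 1000); // back off between attempts
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        // Fails (and would be logged) for the first ~50 ms, then passes.
        assertBusy(() -> {
            if (System.currentTimeMillis() - start < 50) {
                throw new AssertionError("snapshot not there yet");
            }
        }, 10_000);
        System.out.println("assertion eventually passed");
    }
}
```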

In the testBasicPartialRetention failure, the snapshot that goes missing is one we had previously verified exists and which, per the logs, was never deleted, so I am very confused.

@gwbrown
Contributor

gwbrown commented Oct 16, 2019

I've opened #48159 for the testRetentionWhileSnapshotInProgress failures so that this issue can stay focused on the testBasicPartialRetention failure.

@romseygeek
Contributor

@original-brownbear original-brownbear self-assigned this Oct 22, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Oct 22, 2019
Loading `RepositoryData` is not an atomic operation:
it uses a list + get combination of calls.
This led to accidentally returning empty repository data
for generations >= 0, which can never legitimately be empty unless
the repository is corrupted.
In the test elastic#48122 (and other SLM tests) there was a low chance of
running into this concurrent-modification scenario: the repository
moves forward two index generations between listing the
index-N blobs and loading the latest one. Since we only keep
two index-N blobs around at a time, this led to unexpectedly absent
snapshots in the status APIs.
Fixing the behavior to be more resilient is non-trivial but in the works.
For now I think we should simply throw in this scenario. This will also
help prevent corruption in the unlikely but possible event of running into
this issue during a snapshot create or delete operation on master failover
with a repository like S3, which doesn't have "no overwrites" protection
when writing a new index-N.

Fixes elastic#48122
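The race described in the commit message can be sketched as follows. This is a hypothetical toy model (the class and method names are invented for illustration, not Elasticsearch's actual BlobStoreRepository code): the repository prunes all but the two most recent index-N blobs, and the non-atomic "list, then get" load can target a generation that is pruned in between.

```java
import java.util.concurrent.ConcurrentSkipListMap;

// Toy model of a blob repository that keeps only the two most recent
// index-N generations. Loading repository data is a non-atomic
// "list the generations, then get one" sequence; if two more generations
// are written in between, the listed generation is gone by the time we read it.
public class IndexGenRace {
    // generation number -> index-N blob contents
    private final ConcurrentSkipListMap<Long, String> blobs = new ConcurrentSkipListMap<>();

    void writeGeneration(long gen, String data) {
        blobs.put(gen, data);
        // prune: keep only the two most recent index-N blobs
        while (blobs.size() > 2) {
            blobs.pollFirstEntry();
        }
    }

    // step 1 of the non-atomic load: list to find the latest generation
    long listLatestGeneration() {
        return blobs.lastKey();
    }

    // step 2: fetch that generation; returns null if it was pruned meanwhile
    String getGeneration(long gen) {
        return blobs.get(gen);
    }

    public static void main(String[] args) {
        IndexGenRace repo = new IndexGenRace();
        repo.writeGeneration(0, "index-0");
        repo.writeGeneration(1, "index-1");

        long latest = repo.listLatestGeneration(); // sees generation 1

        // concurrently, two more generations land before the get
        repo.writeGeneration(2, "index-2");
        repo.writeGeneration(3, "index-3"); // generation 1 is pruned here

        String data = repo.getGeneration(latest);
        System.out.println(data == null
            ? "generation " + latest + " vanished between list and get"
            : data);
        // prints "generation 1 vanished between list and get"
    }
}
```

In the real failure this manifested as an empty `RepositoryData` being returned, which in turn made the status APIs report a snapshot as missing even though it still existed.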
@original-brownbear
Member

Phew :) -> tracked this down in #48329

original-brownbear added a commit that referenced this issue Oct 23, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Nov 2, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Nov 2, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Nov 2, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Nov 2, 2019
original-brownbear added a commit that referenced this issue Nov 2, 2019
original-brownbear added a commit that referenced this issue Nov 2, 2019