
[CI] AbstractSnapshotRepoTestKitRestTestCase testRepositoryAnalysis failures #72229

Closed
davidkyle opened this issue Apr 26, 2021 · 4 comments

@davidkyle

Build scans:
https://gradle-enterprise.elastic.co/s/gervfulv6lrku
https://gradle-enterprise.elastic.co/s/o4qzst6pglg36
https://gradle-enterprise.elastic.co/s/u56mg6brfudq6
https://gradle-enterprise.elastic.co/s/m2kldzfhmictm

Repro lines:

 ./gradlew ':x-pack:plugin:snapshot-repo-test-kit:qa:s3:integTest' --tests "org.elasticsearch.repositories.blobstore.testkit.S3SnapshotRepoTestKitIT.testRepositoryAnalysis" \
  -Dtests.seed=987C4B3C71034593 \
  -Dtests.locale=fr-CA \
  -Dtests.timezone=Atlantic/Faeroe \
  -Druntime.java=11
./gradlew ':x-pack:plugin:snapshot-repo-test-kit:internalClusterTest' --tests "org.elasticsearch.repositories.blobstore.testkit.RepositoryAnalysisSuccessIT.testRepositoryAnalysis" \
  -Dtests.seed=4C5D2CBA58FFB44A \
  -Dtests.locale=ms-MY \
  -Dtests.timezone=MST \
  -Druntime.java=8

Reproduces locally?:
NO

Applicable branches:
master, 7.x, 7.13

Failure history:
Regularly, although I cannot say the cause is the same for all of these:
https://build-stats.elastic.co/goto/5c313f0165b5e7db32c26bec167ef280

Failure excerpt:
The actual tests and errors vary:

https://gradle-enterprise.elastic.co/s/gervfulv6lrku/tests/:x-pack:plugin:snapshot-repo-test-kit:internalClusterTest/org.elasticsearch.repositories.blobstore.testkit.RepositoryAnalysisSuccessIT/testRepositoryAnalysis?expanded-stacktrace=WyIwLTEiXQ

org.elasticsearch.repositories.RepositoryVerificationException: [test-repo] analysis failed, you may need to manually remove [temp-analysis-9_zJzWeWQ4a320bWiQRD8w]

Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [node_s0][127.0.0.1:49320][cluster:admin/repository/analyze/blob] request_id [25] timed out after [5007ms]

https://gradle-enterprise.elastic.co/s/o4qzst6pglg36/tests/:x-pack:plugin:snapshot-repo-test-kit:qa:s3:integTest/org.elasticsearch.repositories.blobstore.testkit.S3SnapshotRepoTestKitIT/testRepositoryAnalysis?top-execution=1


org.elasticsearch.client.ResponseException: method [PUT], host [http://127.0.0.1:43133], URI [_snapshot/repository?verify=true], status line [HTTP/1.1 503 Service Unavailable]
{"error":{"root_cause":[{"type":"process_cluster_event_timeout_exception","reason":"failed to process cluster event (put_repository [repository]) within 30s"}],"type":"process_cluster_event_timeout_exception","reason":"failed to process cluster event (put_repository [repository]) within 30s"},"status":503}

https://gradle-enterprise.elastic.co/s/u56mg6brfudq6/tests/:x-pack:plugin:snapshot-repo-test-kit:qa:s3:integTest/org.elasticsearch.repositories.blobstore.testkit.S3SnapshotRepoTestKitIT/testRepositoryAnalysis?top-execution=1


org.elasticsearch.client.ResponseException: method [POST], host [http://127.0.0.1:36341], URI [/_snapshot/repository/_analyze?blob_count=10&max_blob_size=1mb&timeout=120s&concurrency=4], status line [HTTP/1.1 500 Internal Server Error]
{"error":{"root_cause":[{"type":"uncategorized_execution_exception","reason":"Failed execution"}],"type":"repository_verification_exception","reason":"[repository] analysis failed, you may need to manually remove [temp-analysis-DRAYv4byTmeQ9WrF8qKuzQ]","caused_by":{"type":"repository_verification_exception","reason":"[repository] failure processing [blob analysis [repository:temp-analysis-DRAYv4byTmeQ9WrF8qKuzQ/test-blob-7-6Sn4wUgZQXuaBegKSwPCjA, length=262144, seed=-3175868635373160859, readEarly=false, writeAndOverwrite=false]]","caused_by":{"type":"uncategorized_execution_exception","reason":"Failed execution","caused_by":{"type":"execution_exception","reason":"java.io.IOException: Unable to upload object [base_path/temp-analysis-DRAYv4byTmeQ9WrF8qKuzQ/test-blob-7-6Sn4wUgZQXuaBegKSwPCjA] using a single upload","caused_by":{"type":"i_o_exception","reason":"Unable to upload object [base_path/temp-analysis-DRAYv4byTmeQ9WrF8qKuzQ/test-blob-7-6Sn4wUgZQXuaBegKSwPCjA] using a single upload","caused_by":{"type":"sdk_client_exception","reason":"Unable to verify integrity of data upload. Client calculated content hash (contentMD5: 4J5vz4CfLgJcHZYO3qGtsw== in base 64) didn't match hash (etag: 93131e5ff888533a740cc7c0913276eb in hex) calculated by Amazon S3.  You may need to delete the data stored in Amazon S3. (metadata.contentMD5: null, md5DigestStream: com.amazonaws.services.s3.internal.MD5DigestCalculatingInputStream@75d4ac01, bucketName: bucket, key: base_path/temp-analysis-DRAYv4byTmeQ9WrF8qKuzQ/test-blob-7-6Sn4wUgZQXuaBegKSwPCjA)"}}}}}},"status":500}

davidkyle added the :Distributed Coordination/Snapshot/Restore and >test-failure labels Apr 26, 2021
elasticmachine added the Team:Distributed (Obsolete) label Apr 26, 2021
@elasticmachine

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner self-assigned this Apr 27, 2021
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Apr 27, 2021
In elastic#72229 a test run was observed to exceed this 5-second timeout. This
commit increases it to 20 seconds.
DaveCTurner commented Apr 27, 2021

Yes, there are actually quite a number of different failures here. Going through them one by one:

https://gradle-enterprise.elastic.co/s/gervfulv6lrku

      1> [2021-04-26T04:17:14,386][INFO ][o.e.r.b.t.RepositoryAnalyzeAction] [testRepositoryAnalysis] running analysis of repository [test-repo] using path [temp-analysis-9_zJzWeWQ4a320bWiQRD8w]	
	  1> [2021-04-26T04:17:19,399][WARN ][o.e.r.b.t.RepositoryAnalyzeAction] [node_s4] analysis of repository [test-repo] failed before cleanup phase, attempting best-effort cleanup but you may need to manually remove [temp-analysis-9_zJzWeWQ4a320bWiQRD8w]	
	  1> [2021-04-26T04:17:19,405][INFO ][o.e.r.b.t.RepositoryAnalysisSuccessIT] [testRepositoryAnalysis] --> skipped repo consistency checks because [repository is not used for snapshots]	
	  1> [2021-04-26T04:17:19,406][INFO ][o.e.r.b.t.RepositoryAnalysisSuccessIT] [testRepositoryAnalysis] [RepositoryAnalysisSuccessIT#testRepositoryAnalysis]: cleaning up after test	
	  1> [2021-04-26T04:17:19,501][INFO ][o.e.c.m.MetadataIndexTemplateService] [node_s1] removing template [random_index_template]	
	  1> [2021-04-26T04:17:19,584][INFO ][o.e.r.RepositoriesService] [node_s1] deleted repositories [[test-repo]]	
	  1> [2021-04-26T04:17:19,636][WARN ][o.e.t.TransportService   ] [node_s4] Received response for a request that has timed out, sent [5.2s/5207ms] ago, timed out [200ms/200ms] ago, action [cluster:admin/repository/analyze/blob], node [{node_s0}{nC6fsqDaTWCCfWW-vwz03w}{j9tgqpJTSuqh8dCh7HqZcg}{127.0.0.1}{127.0.0.1:49320}{imr}{xpack.installed=true}], id [25]	
	  1> [2021-04-26T04:17:19,640][INFO ][o.e.r.b.t.RepositoryAnalysisSuccessIT] [testRepositoryAnalysis] [RepositoryAnalysisSuccessIT#testRepositoryAnalysis]: cleaned up after test	
	  1> [2021-04-26T04:17:19,641][INFO ][o.e.r.b.t.RepositoryAnalysisSuccessIT] [testRepositoryAnalysis] after test	
	  2> REPRODUCE WITH: ./gradlew ':x-pack:plugin:snapshot-repo-test-kit:internalClusterTest' --tests "org.elasticsearch.repositories.blobstore.testkit.RepositoryAnalysisSuccessIT.testRepositoryAnalysis" -Dtests.seed=4C5D2CBA58FFB44A -Dtests.locale=ms-MY -Dtests.timezone=MST -Druntime.java=8	
	  2> RepositoryVerificationException[[test-repo] analysis failed, you may need to manually remove [temp-analysis-9_zJzWeWQ4a320bWiQRD8w]]; nested: ReceiveTimeoutTransportException[[node_s0][127.0.0.1:49320][cluster:admin/repository/analyze/blob] request_id [25] timed out after [5007ms]];	
	        at __randomizedtesting.SeedInfo.seed([4C5D2CBA58FFB44A:B5608DD4866487A3]:0)	
		
	        Caused by:	
	        ReceiveTimeoutTransportException[[node_s0][127.0.0.1:49320][cluster:admin/repository/analyze/blob] request_id [25] timed out after [5007ms]]	
	            at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1185)	
	            at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:673)	
	            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)	
	            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)	
	            at java.lang.Thread.run(Thread.java:748)	

I'm surprised that we exceeded the 5-second timeout here, but maybe it's possible if this CI worker was running quite slowly. The request completed shortly after the timeout and I don't see an obvious deadlock that could otherwise explain this, so I opened #72314 to give this a more generous timeout.


https://gradle-enterprise.elastic.co/s/m2kldzfhmictm
https://gradle-enterprise.elastic.co/s/o4qzst6pglg36

  1> [2021-04-26T00:13:31,036][INFO ][o.e.r.b.t.S3SnapshotRepoTestKitIT] [testRepositoryAnalysis] creating repository [repository] of type [s3]	
	  1> [2021-04-26T00:14:48,556][INFO ][o.e.r.b.t.S3SnapshotRepoTestKitIT] [testRepositoryAnalysis] after test	
	  2> REPRODUCE WITH: ./gradlew ':x-pack:plugin:snapshot-repo-test-kit:qa:s3:integTest' --tests "org.elasticsearch.repositories.blobstore.testkit.S3SnapshotRepoTestKitIT.testRepositoryAnalysis" -Dtests.seed=2A6F99B8C383414F -Dtests.locale=da-DK -Dtests.timezone=America/Regina -Druntime.java=15	
	  2> org.elasticsearch.client.ResponseException: method [PUT], host [http://127.0.0.1:43133], URI [_snapshot/repository?verify=true], status line [HTTP/1.1 503 Service Unavailable]	
	    {"error":{"root_cause":[{"type":"process_cluster_event_timeout_exception","reason":"failed to process cluster event (put_repository [repository]) within 30s"}],"type":"process_cluster_event_timeout_exception","reason":"failed to process cluster event (put_repository [repository]) within 30s"},"status":503}	
	        at __randomizedtesting.SeedInfo.seed([2A6F99B8C383414F:D35238D61D1872A6]:0)

These two were just processing cluster state updates excruciatingly slowly, even before the test started running:

[2021-04-26T06:13:23,614][INFO ][o.e.n.Node               ] [integTest-0] started
[2021-04-26T06:13:23,756][INFO ][o.e.c.s.MasterService    ] [integTest-0] elected-as-master ([1] nodes joined)[{integTest-0}{cVjFb4J7QxCd567sAErqRQ}{LndVCvfnTaOZMeSZGmozzQ}{127.0.0.1}{127.0.0.1:37333}{cdfhilmrstw} elect leader, _BECOME_MASTER_TASK_, _FINISH_ELECTION_], term: 1, version: 1, delta: master node
 changed {previous [], current [{integTest-0}{cVjFb4J7QxCd567sAErqRQ}{LndVCvfnTaOZMeSZGmozzQ}{127.0.0.1}{127.0.0.1:37333}{cdfhilmrstw}]}
[2021-04-26T06:13:23,810][INFO ][o.e.c.c.CoordinationState] [integTest-0] cluster UUID set to [uETKElK5Q8Od6WM8cKDEdg]
[2021-04-26T06:13:26,258][INFO ][o.e.c.s.ClusterApplierService] [integTest-0] master node changed {previous [], current [{integTest-0}{cVjFb4J7QxCd567sAErqRQ}{LndVCvfnTaOZMeSZGmozzQ}{127.0.0.1}{127.0.0.1:37333}{cdfhilmrstw}]}, term: 1, version: 1, reason: Publication{term=1, version=1}
[2021-04-26T06:13:27,905][INFO ][o.e.g.GatewayService     ] [integTest-0] recovered [0] indices into cluster_state
[2021-04-26T06:13:28,244][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-0] adding template [.ml-anomalies-] for index patterns [.ml-anomalies-*]
[2021-04-26T06:13:31,079][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-0] adding template [.ml-notifications-000001] for index patterns [.ml-notifications-000001]
[2021-04-26T06:13:34,819][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-0] adding template [.ml-state] for index patterns [.ml-state*]
[2021-04-26T06:13:38,311][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-0] adding template [.ml-stats] for index patterns [.ml-stats-*]
[2021-04-26T06:13:41,703][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-0] adding component template [logs-settings]
[2021-04-26T06:13:45,009][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-0] adding component template [synthetics-mappings]
[2021-04-26T06:13:48,208][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-0] adding component template [logs-mappings]
[2021-04-26T06:13:49,871][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-0] adding component template [metrics-mappings]
[2021-04-26T06:13:51,669][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-0] adding component template [metrics-settings]
[2021-04-26T06:13:55,051][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-0] adding component template [synthetics-settings]
[2021-04-26T06:13:57,883][ERROR][o.e.x.m.e.l.LocalExporter] [integTest-0] failed to set monitoring template [.monitoring-alerts-7]
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (create-index-template [.monitoring-alerts-7], cause [api]) within 30s
        at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$0(MasterService.java:132) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at java.util.ArrayList.forEach(ArrayList.java:1511) [?:?]
        at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$1(MasterService.java:131) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:669) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
        at java.lang.Thread.run(Thread.java:832) [?:?]
...

We require the repo registration to complete in <30s, but it didn't because other cluster state updates had still not completed. I'm not sure what to do about this; there's definitely something wrong with how we're running these clusters if they take ≥3s for each cluster state update. TBD.
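
For illustration, the 30s in the 503 above is the default master_timeout on the put-repository cluster state update. The registration that timed out corresponds to something like the sketch below; the settings body and the extended master_timeout are illustrative (they show where the limit comes from, not a proposed fix):

    curl -X PUT 'http://localhost:9200/_snapshot/repository?verify=true&master_timeout=120s' \
      -H 'Content-Type: application/json' \
      -d '{"type":"s3","settings":{"bucket":"bucket","base_path":"base_path"}}'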


https://gradle-enterprise.elastic.co/s/u56mg6brfudq6

This one fails because the MD5 checksum we compute on upload doesn't match the checksum the server computes. I opened #72358 to track this separately.
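
The contentMD5 in that error is the base64-encoded MD5 digest computed by the SDK on the client, while the ETag S3 returns for a single-part upload like this one is the same digest in hex, so the two values are directly comparable once decoded. A quick shell sketch (blob.bin is a placeholder file):

    # hex digest, as S3 reports it in the ETag of a single-part upload
    openssl dgst -md5 blob.bin
    # base64-encoded digest, as the SDK sends it in the Content-MD5 header
    openssl dgst -md5 -binary blob.bin | base64
    # decode the client-side value from the error into hex for comparison with the ETag
    echo '4J5vz4CfLgJcHZYO3qGtsw==' | base64 --decode | xxd -p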

DaveCTurner added a commit that referenced this issue Apr 27, 2021
In #72229 a test run was observed to exceed this 5-second timeout. This
commit increases it to 20 seconds.
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Apr 27, 2021
Today we do not specify the `?seed=` parameter when running the
repository analyzer in REST tests, so we cannot reproduce the set
of operations that led to a failure. This commit introduces a
deterministic value for this parameter.

Relates elastic#72229 which seems to indicate some kind of bug in how certain
checksums are calculated in the test fixtures.
@DaveCTurner

I opened #72358 specifically for the checksum failures because they're definitely unrelated to the other failures.
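
With the deterministic ?seed= from the commit above, a failing analysis run can be repeated against the same repository with the same set of generated blobs; a minimal sketch (host, port and seed value are placeholders):

    curl -X POST 'http://localhost:9200/_snapshot/repository/_analyze?blob_count=10&max_blob_size=1mb&seed=42'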

@DaveCTurner

I opened #72404 to discuss the cases where cluster state updates were just desperately slow, as I think this needs help from the delivery folks.

Since these failures are all either addressed or tracked elsewhere, I am closing this.
