
[CI] 400 Bad request creating snapshot in GoogleCloudStorageThirdPartyTests #49429

Closed
markharwood opened this issue Nov 21, 2019 · 8 comments
Assignees
original-brownbear
Labels
:Distributed Coordination/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs)
>test-failure (Triaged test failures from CI)

Comments

@markharwood
Contributor

org.elasticsearch.repositories.gcs.GoogleCloudStorageThirdPartyTests.testCreateSnapshot failed with a "400 Bad Request".

Stacktrace:

com.google.cloud.storage.StorageException: 400 Bad Request
	at __randomizedtesting.SeedInfo.seed([4A7CE93B6DE34759:B41A7C885B50B540]:0)
	at com.google.cloud.storage.spi.v1.HttpStorageRpc.translate(HttpStorageRpc.java:227)
	at com.google.cloud.storage.spi.v1.HttpStorageRpc.create(HttpStorageRpc.java:308)
	at com.google.cloud.storage.StorageImpl$3.call(StorageImpl.java:192)
	at com.google.cloud.storage.StorageImpl$3.call(StorageImpl.java:189)
	at com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:105)
	at com.google.cloud.RetryHelper.run(RetryHelper.java:76)
	at com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50)
	at com.google.cloud.storage.StorageImpl.internalCreate(StorageImpl.java:188)
	at com.google.cloud.storage.StorageImpl.create(StorageImpl.java:150)
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore.lambda$writeBlobMultipart$8(GoogleCloudStorageBlobStore.java:308)
	at org.elasticsearch.repositories.gcs.SocketAccess.lambda$doPrivilegedVoidIOException$0(SocketAccess.java:54)
	at java.base/java.security.AccessController.doPrivileged(Native Method)
	at org.elasticsearch.repositories.gcs.SocketAccess.doPrivilegedVoidIOException(SocketAccess.java:53)
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore.writeBlobMultipart(GoogleCloudStorageBlobStore.java:307)
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore.writeBlob(GoogleCloudStorageBlobStore.java:221)
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobContainer.writeBlob(GoogleCloudStorageBlobContainer.java:67)
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobContainer.writeBlobAtomic(GoogleCloudStorageBlobContainer.java:72)
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.startVerification(BlobStoreRepository.java:907)
	at org.elasticsearch.repositories.RepositoriesService$3.doRun(RepositoriesService.java:246)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:688)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request
	at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:150)
	at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
	at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:528)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:448)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:565)
	at com.google.cloud.storage.spi.v1.HttpStorageRpc.create(HttpStorageRpc.java:305)

Repro line (did not reproduce for me):

REPRODUCE WITH: ./gradlew ':plugins:repository-gcs:qa:google-cloud-storage:thirdPartyTest' --tests "org.elasticsearch.repositories.gcs.GoogleCloudStorageThirdPartyTests.testCreateSnapshot" -Dtests.seed=4A7CE93B6DE34759 -Dtests.security.manager=false -Dtests.locale=en-GB -Dtests.timezone=Pacific/Samoa -Dcompiler.java=12

Not muting for now given that I can't reproduce it.

@markharwood added the :Distributed Coordination/Snapshot/Restore and >test-failure labels Nov 21, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@original-brownbear self-assigned this Nov 21, 2019
@original-brownbear
Member

This is running against a GCS mock as well (not a real third-party test run). Likely this has similar origins to #49400 and all the associated test failures.

original-brownbear added a commit that referenced this issue Nov 25, 2019
This commit ensures that even for requests that are known to have an empty body
we at least attempt to read one byte from the request body input stream.
This is done to work around the behavior in `sun.net.httpserver.ServerImpl.Dispatcher#handleEvent`,
which closes a TCP/HTTP connection whose input stream does not have the `eof` flag set (see `sun.net.httpserver.LeftOverInputStream#isEOF`).
As far as I can tell, the only way to set this flag is to do a read when there are no more bytes buffered.
This fixes the numerous connection-closing issues, because the `ServerImpl` stops closing connections that it thinks
weren't fully drained.

Also, I removed a now-redundant drain loop in the Azure handler, as well as the connection closing in the error handler's
drain action (this shouldn't have an effect, but it makes things more predictable/easier to reason about IMO).

I would suggest merging this and closing the related issues after verifying that this fixes things on CI.

The way to locally reproduce the issues we're seeing in tests is to make the retry timings more aggressive in e.g. the Azure tests
and move them to single-digit values. This makes the retries happen quickly enough that they run into the asynchronous closing
of allegedly non-eof connections by `ServerImpl` and produces the exact kinds of failures we're seeing currently.

Relates #49401, #49429
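
For illustration, here is a minimal sketch of the workaround described above, using the JDK's built-in `com.sun.net.httpserver` API. The `DrainingHandler` class and its body are hypothetical, not the actual Elasticsearch mock handler; the point is only that draining the request body to EOF lets `LeftOverInputStream#isEOF` return true, so `ServerImpl` no longer treats the connection as incompletely read and closes it.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical handler illustrating the workaround: always read the request body
// to EOF, even when the request is expected to have an empty body, so that
// sun.net.httpserver's LeftOverInputStream records the eof flag and ServerImpl
// does not asynchronously close the connection as "not fully drained".
public class DrainingHandler implements HttpHandler {

    @Override
    public void handle(HttpExchange exchange) throws IOException {
        drainRequestBody(exchange);
        // ... the real mock logic would build the response here ...
        exchange.sendResponseHeaders(200, -1); // -1 means "no response body"
        exchange.close();
    }

    // At least one read always happens; for an empty body that single read
    // returns -1 immediately and marks the underlying stream as EOF.
    private static void drainRequestBody(HttpExchange exchange) throws IOException {
        try (InputStream body = exchange.getRequestBody()) {
            byte[] buffer = new byte[1024];
            while (body.read(buffer) >= 0) {
                // discard the bytes; we only care about reaching EOF
            }
        }
    }
}
```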
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Nov 25, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Nov 25, 2019
original-brownbear added a commit that referenced this issue Nov 25, 2019
original-brownbear added a commit that referenced this issue Nov 25, 2019
@tlrx
Member

tlrx commented Nov 26, 2019

Thanks to @original-brownbear, #49518 should fix this issue. I'm closing for now and we'll reopen or reference this if the issue happens again.

@tlrx closed this as completed Nov 26, 2019
@benwtrent
Member

New build failure: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=debian-10&&immutable/443/console

13:08:09 org.elasticsearch.repositories.gcs.GoogleCloudStorageThirdPartyTests > testCreateSnapshot FAILED
13:08:09     com.google.cloud.storage.StorageException: 400 Bad Request
13:08:09 
13:08:09         Caused by:
13:08:09         com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request

Reproduce line:

./gradlew ':plugins:repository-gcs:qa:google-cloud-storage:thirdPartyTest' --tests "org.elasticsearch.repositories.gcs.GoogleCloudStorageThirdPartyTests.testCreateSnapshot" -Dtests.seed=D18AB3DA9ABE738D -Dtests.security.manager=false -Dtests.locale=en-SG -Dtests.timezone=Etc/GMT-3 -Dcompiler.java=12

@benwtrent reopened this Dec 10, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Dec 11, 2019
Two things:
1. We should just throw a descriptive assertion error and figure out why we're not reading a multi-part, instead of
returning a `400` and failing the tests that way, since we can't reproduce these 400s locally.
2. We were not logging the exception on a cleanup delete failure that coincides with the `400` issue in tests.

Relates elastic#49429
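
As a minimal, self-contained sketch of the first point (hypothetical code, not the actual `GoogleCloudStorageHttpHandler`; `tryParseMultipart` stands in for the mock's real parser): fail the test with a descriptive `AssertionError` rather than answering `400`, so the CI log shows what the unparseable request contained.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical sketch: surface multipart parsing failures in the GCS mock as
// descriptive assertion errors instead of opaque 400 responses.
final class MultipartAssertionSketch {

    // Stand-in for the mock's real multipart parser; Optional.empty() means "could not parse".
    static Optional<byte[]> tryParseMultipart(byte[] requestBody) {
        return requestBody.length == 0 ? Optional.empty() : Optional.of(requestBody);
    }

    static byte[] parseMultipartOrFail(byte[] requestBody) {
        // Instead of answering 400 as described above, throwing fails the test
        // with the offending request body attached to the failure message.
        return tryParseMultipart(requestBody).orElseThrow(() -> new AssertionError(
            "could not parse multipart request body:\n" + new String(requestBody, StandardCharsets.UTF_8)));
    }
}
```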
original-brownbear added a commit that referenced this issue Dec 12, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Dec 12, 2019
original-brownbear added a commit that referenced this issue Dec 12, 2019
@ywelsch
Contributor

ywelsch commented Jan 6, 2020

@original-brownbear is this still an issue?

@original-brownbear
Member

@ywelsch this particular exception is impossible now => let's close this one. There are still some test failures in these tests, though; I'll open a new issue or PR for those.

@original-brownbear
Member

original-brownbear commented Jan 6, 2020

Never mind ... my bad, this is still a possible issue; sorry for the noise. Leaving this open.

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jan 6, 2020
We were incorrectly handling blobs starting with `\r\n`, which broke
tests randomly when blob contents started with these bytes.

Relates elastic#49429
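
To illustrate why a blob whose content begins with `\r\n` can trip up multipart parsing, here is a small self-contained sketch (hypothetical code, not the actual Elasticsearch mock parser): the part content must be delimited by the full `\r\n--boundary` token rather than the first stray `\r\n`, otherwise the blob's leading CRLF bytes are mistaken for the end of the part.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch of boundary-aware multipart extraction that preserves a
// blob whose content starts with "\r\n".
public class MultipartBoundarySketch {

    // Extracts the body of the first part, given the boundary, keeping any leading CRLF bytes.
    static byte[] firstPartContent(byte[] multipart, String boundary) {
        String text = new String(multipart, StandardCharsets.ISO_8859_1); // 1 char == 1 byte
        int headersEnd = text.indexOf("\r\n\r\n");                        // end of the part's header block
        int contentStart = headersEnd + 4;
        int contentEnd = text.indexOf("\r\n--" + boundary, contentStart); // full delimiter, not just "\r\n"
        return Arrays.copyOfRange(multipart, contentStart, contentEnd);
    }

    public static void main(String[] args) {
        String boundary = "__END_OF_PART__";
        byte[] body = ("--" + boundary + "\r\n"
            + "Content-Type: application/octet-stream\r\n\r\n"
            + "\r\nblob-that-starts-with-crlf"                            // blob content begins with \r\n
            + "\r\n--" + boundary + "--\r\n").getBytes(StandardCharsets.ISO_8859_1);
        byte[] blob = firstPartContent(body, boundary);
        System.out.println(blob.length);                                  // 28: the leading \r\n is preserved
    }
}
```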
@original-brownbear
Member

Closing this after all. The 400 that was thrown here has been removed from our code. I found another spot of broken multi-part request parsing that might apply here and could have caused this past 400; it is now an assertion error (fix for that incoming in #50666), but the exact error seen here isn't possible any longer.

original-brownbear added a commit that referenced this issue Jan 6, 2020
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jan 6, 2020
original-brownbear added a commit that referenced this issue Jan 6, 2020
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this issue Jan 23, 2020
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this issue Jan 23, 2020