
[CI] 400 Bad request creating snapshot in GoogleCloudStorageThirdPartyTests #49429

Closed
markharwood opened this issue Nov 21, 2019 · 8 comments
Assignees
original-brownbear
Labels
:Distributed Coordination/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs)
>test-failure (Triaged test failures from CI)

Comments

@markharwood
Contributor

org.elasticsearch.repositories.gcs.GoogleCloudStorageThirdPartyTests.testCreateSnapshot failed with a "400 Bad Request".

Stacktrace:

com.google.cloud.storage.StorageException: 400 Bad Request
	at __randomizedtesting.SeedInfo.seed([4A7CE93B6DE34759:B41A7C885B50B540]:0)
	at com.google.cloud.storage.spi.v1.HttpStorageRpc.translate(HttpStorageRpc.java:227)
	at com.google.cloud.storage.spi.v1.HttpStorageRpc.create(HttpStorageRpc.java:308)
	at com.google.cloud.storage.StorageImpl$3.call(StorageImpl.java:192)
	at com.google.cloud.storage.StorageImpl$3.call(StorageImpl.java:189)
	at com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:105)
	at com.google.cloud.RetryHelper.run(RetryHelper.java:76)
	at com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50)
	at com.google.cloud.storage.StorageImpl.internalCreate(StorageImpl.java:188)
	at com.google.cloud.storage.StorageImpl.create(StorageImpl.java:150)
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore.lambda$writeBlobMultipart$8(GoogleCloudStorageBlobStore.java:308)
	at org.elasticsearch.repositories.gcs.SocketAccess.lambda$doPrivilegedVoidIOException$0(SocketAccess.java:54)
	at java.base/java.security.AccessController.doPrivileged(Native Method)
	at org.elasticsearch.repositories.gcs.SocketAccess.doPrivilegedVoidIOException(SocketAccess.java:53)
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore.writeBlobMultipart(GoogleCloudStorageBlobStore.java:307)
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore.writeBlob(GoogleCloudStorageBlobStore.java:221)
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobContainer.writeBlob(GoogleCloudStorageBlobContainer.java:67)
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobContainer.writeBlobAtomic(GoogleCloudStorageBlobContainer.java:72)
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.startVerification(BlobStoreRepository.java:907)
	at org.elasticsearch.repositories.RepositoriesService$3.doRun(RepositoriesService.java:246)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:688)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request
	at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:150)
	at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
	at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:528)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:448)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:565)
	at com.google.cloud.storage.spi.v1.HttpStorageRpc.create(HttpStorageRpc.java:305)

Repro line (did not reproduce for me):

REPRODUCE WITH: ./gradlew ':plugins:repository-gcs:qa:google-cloud-storage:thirdPartyTest' --tests "org.elasticsearch.repositories.gcs.GoogleCloudStorageThirdPartyTests.testCreateSnapshot" -Dtests.seed=4A7CE93B6DE34759 -Dtests.security.manager=false -Dtests.locale=en-GB -Dtests.timezone=Pacific/Samoa -Dcompiler.java=12

Not muting for now given that I can't reproduce it.

@markharwood added the :Distributed Coordination/Snapshot/Restore and >test-failure labels Nov 21, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@original-brownbear self-assigned this Nov 21, 2019
@original-brownbear
Member

This is running against a GCS mock as well (not a real third-party test run). Likely this has similar origins to #49400 and all the associated test failures.

original-brownbear added a commit that referenced this issue Nov 25, 2019
This commit ensures that even for requests that are known to have an empty body
we at least attempt to read one byte from the request body input stream.
This is done to work around the behavior in `sun.net.httpserver.ServerImpl.Dispatcher#handleEvent`,
which closes a TCP/HTTP connection whose input stream does not have the `eof` flag set (see `sun.net.httpserver.LeftOverInputStream#isEOF`).
As far as I can tell, the only way to set this flag is to do a read when there are no more bytes buffered.
This fixes the numerous connection-closing issues, because the `ServerImpl` stops closing connections that it thinks
weren't fully drained.

Also, I removed a now-redundant drain loop in the Azure handler, as well as the connection closing in the error handler's
drain action (this shouldn't have an effect, but it makes things more predictable/easier to reason about IMO).

I would suggest merging this and closing the related issues after verifying that this fixes things on CI.

The way to locally reproduce the issues we're seeing in tests is to make the retry timings more aggressive in e.g. the Azure tests
and move them to single-digit values. This makes the retries happen quickly enough that they run into the asynchronous closing
of allegedly non-eof connections by `ServerImpl` and produces the exact kinds of failures we're seeing currently.

Relates #49401, #49429
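
For illustration, here is a minimal sketch of the workaround described above, using the JDK's built-in `com.sun.net.httpserver` API. The `DrainingHandler` class and its body are hypothetical, not the actual Elasticsearch mock handler; the point is only that draining the request body to EOF lets `LeftOverInputStream#isEOF` return true, so `ServerImpl` no longer treats the connection as incompletely read and closes it.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical handler illustrating the workaround: always read the request body
// to EOF, even when the request is expected to have an empty body, so that
// sun.net.httpserver's LeftOverInputStream records the eof flag and ServerImpl
// does not asynchronously close the connection as "not fully drained".
public class DrainingHandler implements HttpHandler {

    @Override
    public void handle(HttpExchange exchange) throws IOException {
        drainRequestBody(exchange);
        // ... the real mock logic would build the response here ...
        exchange.sendResponseHeaders(200, -1); // -1 means "no response body"
        exchange.close();
    }

    // At least one read always happens; for an empty body that single read
    // returns -1 immediately and marks the underlying stream as EOF.
    private static void drainRequestBody(HttpExchange exchange) throws IOException {
        try (InputStream body = exchange.getRequestBody()) {
            byte[] buffer = new byte[1024];
            while (body.read(buffer) >= 0) {
                // discard the bytes; we only care about reaching EOF
            }
        }
    }
}
```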
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Nov 25, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Nov 25, 2019
original-brownbear added a commit that referenced this issue Nov 25, 2019
original-brownbear added a commit that referenced this issue Nov 25, 2019
@tlrx
Member

tlrx commented Nov 26, 2019

Thanks to @original-brownbear, #49518 should fix this issue. I'm closing for now and we'll reopen or reference this if the issue happens again.

@tlrx closed this as completed Nov 26, 2019
@benwtrent
Member

New build failure: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=debian-10&&immutable/443/console

13:08:09 org.elasticsearch.repositories.gcs.GoogleCloudStorageThirdPartyTests > testCreateSnapshot FAILED
13:08:09     com.google.cloud.storage.StorageException: 400 Bad Request
13:08:09 
13:08:09         Caused by:
13:08:09         com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request

Reproduce line:

./gradlew ':plugins:repository-gcs:qa:google-cloud-storage:thirdPartyTest' --tests "org.elasticsearch.repositories.gcs.GoogleCloudStorageThirdPartyTests.testCreateSnapshot" -Dtests.seed=D18AB3DA9ABE738D -Dtests.security.manager=false -Dtests.locale=en-SG -Dtests.timezone=Etc/GMT-3 -Dcompiler.java=12

@benwtrent reopened this Dec 10, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Dec 11, 2019
Two things:
1. We should just throw a descriptive assertion error and figure out why we're not reading a multi-part, instead of
returning a `400` and failing the tests that way, since we can't reproduce these 400s locally.
2. We were not logging the exception on a cleanup delete failure that coincides with the `400` issue in tests.

Relates elastic#49429
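
As a minimal, self-contained sketch of the first point (hypothetical code, not the actual `GoogleCloudStorageHttpHandler`; `tryParseMultipart` stands in for the mock's real parser): fail the test with a descriptive `AssertionError` rather than answering `400`, so the CI log shows what the unparseable request contained.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical sketch: surface multipart parsing failures in the GCS mock as
// descriptive assertion errors instead of opaque 400 responses.
final class MultipartAssertionSketch {

    // Stand-in for the mock's real multipart parser; Optional.empty() means "could not parse".
    static Optional<byte[]> tryParseMultipart(byte[] requestBody) {
        return requestBody.length == 0 ? Optional.empty() : Optional.of(requestBody);
    }

    static byte[] parseMultipartOrFail(byte[] requestBody) {
        // Instead of answering 400 as described above, throwing fails the test
        // with the offending request body attached to the failure message.
        return tryParseMultipart(requestBody).orElseThrow(() -> new AssertionError(
            "could not parse multipart request body:\n" + new String(requestBody, StandardCharsets.UTF_8)));
    }
}
```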
original-brownbear added a commit that referenced this issue Dec 12, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Dec 12, 2019
original-brownbear added a commit that referenced this issue Dec 12, 2019
@ywelsch
Contributor

ywelsch commented Jan 6, 2020

@original-brownbear is this still an issue?

@original-brownbear
Member

@ywelsch this particular exception is impossible now => let's close this one. There are still some test failures in these tests, though; I'll open a new issue or PR for those.

@original-brownbear
Member

original-brownbear commented Jan 6, 2020

Never mind ... my bad, this is still a possible issue; sorry for the noise. Leaving this open.

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jan 6, 2020
We were incorrectly handling blobs starting with `\r\n`, which broke
tests randomly when blob contents started with these bytes.

Relates elastic#49429
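
To illustrate why a blob whose content begins with `\r\n` can trip up multipart parsing, here is a small self-contained sketch (hypothetical code, not the actual Elasticsearch mock parser): the part content must be delimited by the full `\r\n--boundary` token rather than the first stray `\r\n`, otherwise the blob's leading CRLF bytes are mistaken for the end of the part.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch of boundary-aware multipart extraction that preserves a
// blob whose content starts with "\r\n".
public class MultipartBoundarySketch {

    // Extracts the body of the first part, given the boundary, keeping any leading CRLF bytes.
    static byte[] firstPartContent(byte[] multipart, String boundary) {
        String text = new String(multipart, StandardCharsets.ISO_8859_1); // 1 char == 1 byte
        int headersEnd = text.indexOf("\r\n\r\n");                        // end of the part's header block
        int contentStart = headersEnd + 4;
        int contentEnd = text.indexOf("\r\n--" + boundary, contentStart); // full delimiter, not just "\r\n"
        return Arrays.copyOfRange(multipart, contentStart, contentEnd);
    }

    public static void main(String[] args) {
        String boundary = "__END_OF_PART__";
        byte[] body = ("--" + boundary + "\r\n"
            + "Content-Type: application/octet-stream\r\n\r\n"
            + "\r\nblob-that-starts-with-crlf"                            // blob content begins with \r\n
            + "\r\n--" + boundary + "--\r\n").getBytes(StandardCharsets.ISO_8859_1);
        byte[] blob = firstPartContent(body, boundary);
        System.out.println(blob.length);                                  // 28: the leading \r\n is preserved
    }
}
```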
@original-brownbear
Member

Closing this after all. The 400 that was thrown here has been removed from our code. I found another spot of broken multi-part request parsing that might apply here and could have caused this past 400; it is now an assertion error (fix for that incoming in #50666), but the exact error seen here isn't possible any longer.

original-brownbear added a commit that referenced this issue Jan 6, 2020
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jan 6, 2020
original-brownbear added a commit that referenced this issue Jan 6, 2020
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this issue Jan 23, 2020
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this issue Jan 23, 2020