Reenable azure repository tests and lower down max error per request #48283

tlrx · 2019-10-21T09:34:31Z

This pull request reenables some repository integration tests, decreases the maximum number of failures per request in Azure tests and sets an ExecutorService for the HttpServer used in tests.

I wasn't able to reproduce to failures reported in #47948 and #47380 but I suspect that the highest max. errors per request of 3 is picked up and the fact that requests are processed using the same thread that started the HTTP server might be the cause of the failures. I also think that with more parallelization added in snapshot/restore using an executor service in tests makes sense.

Closes #47948
Closes #47380

elasticmachine · 2019-10-21T09:34:33Z

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

tlrx · 2019-10-21T10:46:53Z

@elasticmachine run elasticsearch-ci/1

(Test is green but I'd like to have more runs)

tlrx · 2019-10-21T11:16:37Z

@elasticmachine run elasticsearch-ci/1

(Test is green but I'd like to have more runs)

tlrx · 2019-10-21T12:33:13Z

@elasticmachine run elasticsearch-ci/1

(Test is green but I'd like to have more runs)

original-brownbear · 2019-10-21T13:59:20Z

@tlrx I'm not so sure about this one. Can slowness due to executing the handlers on the HTTP server's callbacks actually lead to EOF exceptions? (to me it seems no, but I haven't worked much with the HttpServer before this) Isn't it more likely we're failing to consume the full input here in some corner case and corrupt the Http/1.1 stream that way?

original-brownbear

See my question, I'm a little unsure about the fix, but if the failures are all due to the 3x retry I think we're looking at the right thing probably.

tlrx · 2019-10-21T15:14:02Z

Well I did not claim it's a fix :)

I tried to reproduce the Azure failures by having the tests run over the weekend but I did not manage to reproduce. Also, by looking at the build stats I've found some GCS failures with read timeouts and assertion errors on server side.

I'd expect any failure related to read timeouts or errors/retries to be reproductible, but actually it is not because of the randomBoolean() (added to make the tests faster) that randomizes things too much and let requests fail or succeed depending on the order they are executed.

Can slowness due to executing the handlers on the HTTP server's callbacks actually lead to EOF exceptions?

Honestly I'm not sure. But the internal HTTP servers are still single threaded and we are adding more parallelization to snapshot/restore so I think we should change that so that it better reflects how storage services are working and it avoid to have one operation to be dependent on how another unrelated operation performs.

but if the failures are all due to the 3x retry I think we're looking at the right thing probably.

That was my initial guess but the fact that the failures do not reproduce easily and that Azure retries test are testing the max. number of retries let me think that it's maybe a factor but maybe not the root cause of the failures. Since Azure executes more requests (HEAD+GET) maybe they are just failing more often than the others.

At this point I agree that this PR is maybe changing too many things at once.

We could maybe do instead:

remove the randomBoolean() on the server side
use an executor service on the server side
reenable azure tests and see how it performs (also monitor S3/GCS)

Depending of the result, revisit the decrease of 3 to 2 max. errors per request. What do you think?

original-brownbear · 2019-10-21T15:28:39Z

remove the randomBoolean() on the server side

I like this a lot :) Making things more deterministic sounds like the right move.

use an executor service on the server side

I don't like this tbh. I can't see how lowering the latency of the IO thread by forking cheap operations will help us here. Also, even if it does help, then it's a strange spot because it basically means that our tests are broken and just accidentally work on fast boxes?

On this topic:

I think we should change that so that it better reflects how storage services are working and it avoid to have one operation to be dependent on how another unrelated operation performs.

I'm not sure this is a valid point. Even though not using a thread-pool for handling requests does make all the request handling sequential it does not however introduce any artificial ordering as far as I can see. In the same way I argued above, I'd say here: If we fail because of sequential handling of requests, then something must either be seriously wrong with the SDK or with our code mustn't it? :)

reenable azure tests and see how it performs (also monitor S3/GCS)

Jup let's do suggestion 1 (remove the random boolean) and reenable ? :)

tlrx · 2019-10-21T15:55:25Z

Thanks Armin, I tend to follow your networking expertise :) Still, I'm not sure all operations are that cheap on CI boxes (I'm thinking of few megabytes uploads or multiple uploads/deletions at once that "hold on" connections but do nothing with them).

Jup let's do suggestion 1 (remove the random boolean) and reenable ? :)

Sure, I'll adapt this PR.

This reverts commit e8e9610.

original-brownbear · 2019-10-21T16:54:16Z

Jenkins run elasticsearch-ci/bwc

original-brownbear

LGTM thanks Tanguy!

tlrx · 2019-10-21T17:23:53Z

@elasticmachine run elasticsearch-ci/bwc

tlrx · 2019-10-22T07:20:48Z

@elasticmachine update branch

tlrx · 2019-10-22T09:40:52Z

Thanks @original-brownbear.

* elastic/master: [Docs] Fix opType options in IndexRequest API example. (elastic#48290) Simplify Shard Snapshot Upload Code (elastic#48155) Mute ClassificationIT tests (elastic#48338) Reenable azure repository tests and remove some randomization in http servers (elastic#48283) Use an env var for the classpath of jar hell task (elastic#48240) Refactor FIPS BootstrapChecks to simple checks (elastic#47499) Add "format" to "range" queries resulted from optimizing a logical AND (elastic#48073) [DOCS][Transform] document limitation regarding rolling upgrade with 7.2, 7.3 (elastic#48118) Fail with a better error when if there are no ingest nodes (elastic#48272) Fix executing enrich policies stats (elastic#48132) Use MultiFileTransfer in CCR remote recovery (elastic#44514) Make BytesReference an interface (elastic#48171) Also validate source index at put enrich policy time. (elastic#48254) Add 'javadoc' task to lifecycle check tasks (elastic#48214) Remove option to enable direct buffer pooling (elastic#47956) [DOCS] Add 'Selecting gateway and seed nodes' section to CCS docs (elastic#48297) Add Enrich Origin (elastic#48098) fix incorrect comparison (elastic#48208)

… servers (#48283) Relates #47948 Relates #47380

tlrx added 2 commits October 21, 2019 10:59

Reenable azure repository tests and lower down max error per request

a42ded9

Use Executor service

e8e9610

tlrx added >test Issues or PRs that are addressing/adding tests :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v8.0.0 v7.5.0 v7.6.0 v7.4.2 labels Oct 21, 2019

tlrx requested a review from original-brownbear October 21, 2019 09:34

original-brownbear reviewed Oct 21, 2019

View reviewed changes

tlrx added 3 commits October 21, 2019 17:56

Revert "Use Executor service"

d587675

This reverts commit e8e9610.

Restore randomIntBetween(2, 3)

af9718b

Remove randomBoolean()

26a490a

tlrx requested a review from original-brownbear October 21, 2019 15:58

original-brownbear approved these changes Oct 21, 2019

View reviewed changes

Merge branch 'master' into reenable-azure-tests

6a2be12

tlrx merged commit a6b8d0d into elastic:master Oct 22, 2019

tlrx deleted the reenable-azure-tests branch October 22, 2019 09:40

tlrx added the backport pending label Oct 22, 2019

tlrx added a commit that referenced this pull request Oct 23, 2019

Reenable azure repository tests and remove some randomization in http…

4790ee4

… servers (#48283) Relates #47948 Relates #47380

tlrx added a commit that referenced this pull request Oct 23, 2019

Reenable azure repository tests and remove some randomization in http…

2d10bdd

… servers (#48283) Relates #47948 Relates #47380

tlrx removed backport pending v7.4.2 labels Oct 23, 2019

DaveCTurner mentioned this pull request Oct 25, 2019

CI Failures: AzureBlobStoreRepositoryTests.testIndicesDeletedFromRepository #47948

Closed

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reenable azure repository tests and lower down max error per request #48283

Reenable azure repository tests and lower down max error per request #48283

tlrx commented Oct 21, 2019

elasticmachine commented Oct 21, 2019

tlrx commented Oct 21, 2019 •

edited

Loading

tlrx commented Oct 21, 2019

tlrx commented Oct 21, 2019

original-brownbear commented Oct 21, 2019

original-brownbear left a comment

tlrx commented Oct 21, 2019

original-brownbear commented Oct 21, 2019

tlrx commented Oct 21, 2019

original-brownbear commented Oct 21, 2019

original-brownbear left a comment

tlrx commented Oct 21, 2019

tlrx commented Oct 22, 2019

tlrx commented Oct 22, 2019

Reenable azure repository tests and lower down max error per request #48283

Reenable azure repository tests and lower down max error per request #48283

Conversation

tlrx commented Oct 21, 2019

elasticmachine commented Oct 21, 2019

tlrx commented Oct 21, 2019 • edited Loading

tlrx commented Oct 21, 2019

tlrx commented Oct 21, 2019

original-brownbear commented Oct 21, 2019

original-brownbear left a comment

Choose a reason for hiding this comment

tlrx commented Oct 21, 2019

original-brownbear commented Oct 21, 2019

tlrx commented Oct 21, 2019

original-brownbear commented Oct 21, 2019

original-brownbear left a comment

Choose a reason for hiding this comment

tlrx commented Oct 21, 2019

tlrx commented Oct 22, 2019

tlrx commented Oct 22, 2019

tlrx commented Oct 21, 2019 •

edited

Loading