Reenable azure repository tests and lower down max error per request #48283

Merged
merged 6 commits into elastic:master from reenable-azure-tests on Oct 22, 2019

Conversation

tlrx
Member

@tlrx tlrx commented Oct 21, 2019

This pull request reenables some repository integration tests, decreases the maximum number of failures per request in Azure tests, and sets an ExecutorService for the HttpServer used in tests.

I wasn't able to reproduce the failures reported in #47948 and #47380, but I suspect that the maximum of 3 errors per request being picked up, together with the fact that requests are processed on the same thread that started the HTTP server, might be the cause of the failures. I also think that, with more parallelization added to snapshot/restore, using an executor service in tests makes sense.
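For illustration, here is a minimal sketch of what "setting an ExecutorService for the HttpServer" can look like with the JDK's built-in com.sun.net.httpserver.HttpServer; the class name, pool size, and timeout below are illustrative, not the actual test fixture code:

```java
import com.sun.net.httpserver.HttpServer;

import java.net.InetSocketAddress;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class TestHttpServerSetup {

    public static void main(String[] args) throws Exception {
        // Bind the test HTTP server to an ephemeral port on localhost
        HttpServer httpServer = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);

        // With no executor set, every exchange is handled on the single thread created by start();
        // handing exchanges off to a small pool lets slow handlers run concurrently.
        ExecutorService executor = Executors.newFixedThreadPool(4);
        httpServer.setExecutor(executor);
        httpServer.start();

        // ... register handlers and run requests against httpServer.getAddress() ...

        httpServer.stop(0);
        executor.shutdown();
        executor.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```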

Closes #47948
Closes #47380

@tlrx tlrx added >test Issues or PRs that are addressing/adding tests :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v8.0.0 v7.5.0 v7.6.0 v7.4.2 labels Oct 21, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@tlrx
Member Author

tlrx commented Oct 21, 2019

@elasticmachine run elasticsearch-ci/1

(Test is green but I'd like to have more runs)

@tlrx
Member Author

tlrx commented Oct 21, 2019

@elasticmachine run elasticsearch-ci/1

(Test is green but I'd like to have more runs)

@tlrx
Member Author

tlrx commented Oct 21, 2019

@elasticmachine run elasticsearch-ci/1

(Test is green but I'd like to have more runs)

@original-brownbear
Member

@tlrx I'm not so sure about this one. Can slowness due to executing the handlers on the HTTP server's callbacks actually lead to EOF exceptions? (To me it seems not, but I haven't worked much with the HttpServer before this.) Isn't it more likely that we're failing to consume the full input here in some corner case and corrupting the HTTP/1.1 stream that way?

Member

@original-brownbear original-brownbear left a comment

See my question; I'm a little unsure about the fix, but if the failures are all due to the 3x retry, then I think we're probably looking at the right thing.

@tlrx
Member Author

tlrx commented Oct 21, 2019

Well I did not claim it's a fix :)

I tried to reproduce the Azure failures by having the tests run over the weekend, but I did not manage to. Also, looking at the build stats, I found some GCS failures with read timeouts and assertion errors on the server side.

I'd expect any failure related to read timeouts or errors/retries to be reproducible, but in practice it is not, because the randomBoolean() (added to make the tests faster) randomizes things too much and lets requests fail or succeed depending on the order in which they are executed.
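To make that concern concrete, here is a hypothetical handler sketch (not the actual fixture code) of the kind of randomBoolean()-style failure injection being discussed; whether a given request fails is a coin flip, so a failing scenario depends on how many requests ran before it and in which order, and does not reproduce deterministically:

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical handler: on a coin flip it answers with a transient 500 error that the
// storage SDK is expected to retry, otherwise it returns a normal response.
class FlakyBlobHandler implements HttpHandler {

    @Override
    public void handle(HttpExchange exchange) throws IOException {
        if (ThreadLocalRandom.current().nextBoolean()) {
            // -1 means "no response body"
            exchange.sendResponseHeaders(500, -1);
            exchange.close();
            return;
        }
        byte[] body = "{}".getBytes(StandardCharsets.UTF_8);
        exchange.sendResponseHeaders(200, body.length);
        exchange.getResponseBody().write(body);
        exchange.close();
    }
}
```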

Can slowness due to executing the handlers on the HTTP server's callbacks actually lead to EOF exceptions?

Honestly, I'm not sure. But the internal HTTP servers are still single-threaded and we are adding more parallelization to snapshot/restore, so I think we should change that so that the tests better reflect how storage services work and so that one operation does not depend on how another, unrelated operation performs.

but if the failures are all due to the 3x retry I think we're looking at the right thing probably.

That was my initial guess, but the fact that the failures do not reproduce easily and that the Azure retries tests exercise the maximum number of retries makes me think it is maybe a factor but not necessarily the root cause of the failures. Since Azure executes more requests (HEAD + GET), maybe its tests are just failing more often than the others.

At this point I agree that this PR is maybe changing too many things at once.

We could maybe do instead:

  • remove the randomBoolean() on the server side
  • use an executor service on the server side
  • reenable Azure tests and see how they perform (also monitor S3/GCS)

Depending on the results, we could revisit decreasing the max. errors per request from 3 to 2. What do you think?

@original-brownbear
Member

remove the randomBoolean() on the server side

I like this a lot :) Making things more deterministic sounds like the right move.

use an executor service on the server side

I don't like this tbh. I can't see how lowering the latency of the IO thread by forking cheap operations will help us here. Also, even if it does help, it's a strange spot, because it basically means that our tests are broken and just accidentally work on fast boxes.

On this topic:

I think we should change that so that it better reflects how storage services are working and it avoid to have one operation to be dependent on how another unrelated operation performs.

I'm not sure this is a valid point. Even though not using a thread pool for handling requests does make all request handling sequential, it does not introduce any artificial ordering as far as I can see. In the same way I argued above, I'd say here: if we fail because of sequential handling of requests, then something must be seriously wrong either with the SDK or with our code, mustn't it? :)

reenable azure tests and see how it performs (also monitor S3/GCS)

Yup, let's do suggestion 1 (remove the random boolean) and reenable? :)

@tlrx
Member Author

tlrx commented Oct 21, 2019

Thanks Armin, I tend to follow your networking expertise :) Still, I'm not sure all operations are that cheap on CI boxes (I'm thinking of multi-megabyte uploads, or multiple uploads/deletions at once that "hold on" to connections but do nothing with them).

Jup let's do suggestion 1 (remove the random boolean) and reenable ? :)

Sure, I'll adapt this PR.

@original-brownbear
Member

Jenkins run elasticsearch-ci/bwc

Member

@original-brownbear original-brownbear left a comment

LGTM thanks Tanguy!

@tlrx
Member Author

tlrx commented Oct 21, 2019

@elasticmachine run elasticsearch-ci/bwc

@tlrx
Member Author

tlrx commented Oct 22, 2019

@elasticmachine update branch

@tlrx tlrx merged commit a6b8d0d into elastic:master Oct 22, 2019
@tlrx tlrx deleted the reenable-azure-tests branch October 22, 2019 09:40
@tlrx
Member Author

tlrx commented Oct 22, 2019

Thanks @original-brownbear.

jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Oct 22, 2019
* elastic/master:
  [Docs] Fix opType options in IndexRequest API example. (elastic#48290)
  Simplify Shard Snapshot Upload Code (elastic#48155)
  Mute ClassificationIT tests (elastic#48338)
  Reenable azure repository tests and remove some randomization in http servers  (elastic#48283)
  Use an env var for the classpath of jar hell task (elastic#48240)
  Refactor FIPS BootstrapChecks to simple checks (elastic#47499)
  Add "format" to "range" queries resulted from optimizing a logical AND (elastic#48073)
  [DOCS][Transform] document limitation regarding rolling upgrade with 7.2, 7.3 (elastic#48118)
  Fail with a better error when if there are no ingest nodes (elastic#48272)
  Fix executing enrich policies stats (elastic#48132)
  Use MultiFileTransfer in CCR remote recovery (elastic#44514)
  Make BytesReference an interface (elastic#48171)
  Also validate source index at put enrich policy time. (elastic#48254)
  Add 'javadoc' task to lifecycle check tasks (elastic#48214)
  Remove option to enable direct buffer pooling (elastic#47956)
  [DOCS] Add 'Selecting gateway and seed nodes' section to CCS docs (elastic#48297)
  Add Enrich Origin (elastic#48098)
  fix incorrect comparison (elastic#48208)
tlrx added a commit that referenced this pull request Oct 23, 2019
tlrx added a commit that referenced this pull request Oct 23, 2019