When many tests are affected, CI stability jobs will time out #7660

Closed
foolip opened this issue Oct 10, 2017 · 34 comments

foolip (Member) commented Oct 10, 2017

In #7654 I changed 778 files, and as a result all of the stability jobs failed, with messages like "No output has been received in the last 10m0s".

This is similar to #7073.

jgraham (Contributor) commented Oct 11, 2017

There are two issues here. One is that we should have a periodic keepalive message when the tests are still running but not producing output. The other is that if we get no output for 10 minutes we're pretty much guaranteed to hit the hard timeout (since, with 10 repeat runs, that implies a minimum runtime of 100 minutes, which is about twice the threshold).
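
A periodic keepalive along these lines could be as simple as a background thread that emits a heartbeat message while the harness is busy. The Python below is a minimal sketch only; the function name, logging call, and five-minute interval are assumptions rather than wptrunner's actual implementation.

```python
# Minimal keepalive sketch: print a heartbeat while tests run but produce no
# output, so CI does not kill the job for inactivity. Illustrative names only.
import threading

def start_keepalive(log, interval_seconds=300):
    """Emit a heartbeat message every `interval_seconds` until stopped."""
    stop_event = threading.Event()

    def heartbeat():
        # Event.wait returns False on timeout, True once stop_event is set.
        while not stop_event.wait(interval_seconds):
            log("Tests are still running; no new results yet (keepalive).")

    threading.Thread(target=heartbeat, daemon=True).start()
    return stop_event

# Hypothetical usage around the test iterations:
# stop = start_keepalive(print)
# run_all_iterations()
# stop.set()
```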

foolip (Member, Author) commented Feb 25, 2018

Changed the title a bit because in #9641 not that many files were changed, but many tests included the changed file, and that's what matters. Also, not all browsers necessarily fail.

gsnedders (Member) commented

Does it make sense to run with --no-restart-on-unexpected in CI?

foolip (Member, Author) commented Feb 26, 2018

@jgraham, what kinds of things would you expect to blow up if we do that? And would that still restart between each of the 10 runs?

jgraham (Contributor) commented Feb 26, 2018

Nothing should blow up; the default assumes that if you have one failing test then you don't want it to infect subsequent tests if it happens to set some bad state. In this case what you might see is that one unstable test would be more likely to make subsequent tests in the run look unstable. It won't affect restarting between runs at all.

An alternate approach would be to set the expectation data after the first run and use that to avoid restarting on subsequent runs unless something actually is unstable. At the moment the way we update the metadata is really slow, so that would itself have a performance impact.
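
To make the trade-off concrete, here is a rough Python sketch of the restart decision being described; the class, field, and function names are hypothetical rather than wptrunner's actual internals.

```python
# Illustrative sketch of the restart-on-unexpected trade-off; names are
# hypothetical, not wptrunner internals.
from dataclasses import dataclass

@dataclass
class TestResult:
    crashed: bool
    is_unexpected: bool

def should_restart_browser(result, restart_on_unexpected=True):
    # A crashed browser always has to be relaunched.
    if result.crashed:
        return True
    # Default: restart after any unexpected result, so a test that leaves bad
    # state behind cannot make later tests in the same run look unstable.
    # With --no-restart-on-unexpected this returns False instead: faster, but
    # one genuinely unstable test can affect later tests in that run.
    # Restarts *between* the full runs are unaffected either way.
    return restart_on_unexpected and result.is_unexpected
```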

foolip (Member, Author) commented Mar 14, 2018

@jgraham, as you can see, this is quite a common occurrence, requiring regular manual intervention, and our current setup isn't robust enough to build #7475 on top of. Are you open to revisiting the idea of running fewer iterations if we know we're going to exceed the timeout?

jgraham (Contributor) commented Mar 14, 2018

So I'm not opposed to that, but it is a limitation of the current system that we should look to address in a better way.

But this issue seems to be about the "10m without output" problem which is different to the entire run timing out and requires a different solution.

foolip (Member, Author) commented Mar 14, 2018

I think most of the PRs I've linked to here were actual timeouts close to 50 minutes; at least https://travis-ci.org/w3c/web-platform-tests/jobs/352958109 ended with just "The job exceeded the maximum time limit for jobs, and has been terminated".

Some of the timeouts have probably been due to verbose logging (now fixed), and maybe I've misattributed some other things, but plain hangs that result in no output for 10 minutes aren't most of the problem, I think.

As for a better way, ultimately I think there are cases where it's just not a good use of resources to even try to run all affected tests 10 times, like if testharness.js has changed or if many tests are affected in some trivial way like in #9718.

No matter how much Travis capacity we have, or whether we are able to reuse the wpt.fyi running infra, I think it makes sense to put a cap on resource usage per PR and just do the most useful things that can be done within those constraints. It should be possible to run all tests once, so that we can actually make testharness.js changes with confidence, and maybe twice to compare results with and without the changes, but I wouldn't want to allocate more resources than that per PR.

I think the next step is to just buy more Travis capacity which will increase the timeout to 120 minutes. @plehegar is looking into that.

foolip (Member, Author) commented Jan 16, 2020

@stephenmcgruer can you retriage this for priority? #7660 (comment) is the only idea I have for fixing this.

stephenmcgruer (Contributor) commented May 8, 2020

I wrote a 'brief' document that summarizes the (I believe) generally agreed upon solution: https://docs.google.com/document/d/1dAlCSHUQldtgWDDTrGJR-ksm19FZZ3k8ppqc5-kSwIk/edit# . It's world-readable, comment-able by @chromiums, and I'm happy to give comment access to anyone else who wants it (just can't make it world comment-able due to spam).

stephenmcgruer (Contributor) commented

Update: unfortunately there was disagreement on the proposed approach for avoiding the CI timeout, and I haven't had the time/energy to drive it to a successful conclusion. I still hope to address this, ideally sometime in Q3 once the Taskcluster Checks migration has comfortably landed.

stephenmcgruer self-assigned this Jul 12, 2020

gsnedders (Member) commented

FWIW, my view is still that the ideal solution is to do something in the decision task that chooses how to shard based on the number of files changed (I don't think we have the tests-affected information in the decision task?)
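
For illustration, a decision-task heuristic along those lines might look roughly like the Python below; the thresholds, cap, and function name are assumptions, not the actual Taskcluster decision task logic.

```python
# Illustrative only: pick a shard count for the stability tasks from the
# number of changed files. Thresholds and names are assumptions.
def plan_stability_shards(changed_files, files_per_shard=50, max_shards=8):
    """Return how many parallel stability tasks to schedule."""
    if not changed_files:
        return 0  # nothing to check
    wanted = -(-len(changed_files) // files_per_shard)  # ceiling division
    return min(wanted, max_shards)

# e.g. plan_stability_shards(["a.html"]) == 1
#      plan_stability_shards(["f%d.html" % i for i in range(778)]) == 8
```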

Hexcles (Member) commented Aug 11, 2020

Well, even if we do that, we still have widely used resource files that affect way too many tests (such that we can't afford to run 10x).

stephenmcgruer (Contributor) commented

So we're in a better place to do this now, since we switched to repeat-only stability checks for the CI. We can check for an impending timeout between each iteration of the test run (using previous runs to estimate running time), and bail if we're over our time limit.

Now it just needs me (or someone) to put in the time to create a PR ;)
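
As a concrete sketch of that idea (in Python, with a hypothetical budget and helper names, not the actual wpt CI code): estimate the next iteration from the ones already completed and stop before the job would hit the CI limit.

```python
# Illustrative sketch: stop repeating the test run early if the next
# iteration is predicted to blow the overall CI time budget. The 50-minute
# budget and all names here are assumptions, not the real wpt CI code.
import time

def run_with_budget(run_one_iteration, iterations=10, budget_seconds=50 * 60):
    start = time.monotonic()
    durations = []
    for i in range(iterations):
        elapsed = time.monotonic() - start
        # Predict the next iteration from the slowest one seen so far.
        if durations and elapsed + max(durations) > budget_seconds:
            print("Stopping after %d iterations to avoid the CI timeout." % i)
            break
        iteration_start = time.monotonic()
        run_one_iteration()
        durations.append(time.monotonic() - iteration_start)
    return durations
```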

DanielRyanSmith (Contributor) commented

With the new changes that were implemented based on @stephenmcgruer's proposal, is it safe to close this issue now?

jgraham (Contributor) commented Feb 25, 2022

I think it's still possible for this to occur in cases where we can't run two iterations of a selection of tests inside the task timeout, but hopefully that's rarer. I think it makes sense to track any remaining failures as part of a new issue.

jgraham closed this as completed Feb 25, 2022