When many tests are affected, CI stability jobs will time out #7660

Closed
foolip opened this issue Oct 10, 2017 · 34 comments

foolip (Member) commented Oct 10, 2017

In #7654 I changed 778 files, and as a result all of the stability jobs failed, with messages like "No output has been received in the last 10m0s".

This is similar to #7073.

jgraham (Contributor) commented Oct 11, 2017

There are two issues here. One is that we should have a periodic keepalive message when the tests are still running but not producing output. The other is that if we get no output for 10 minutes we're pretty much guaranteed to hit the hard timeout (since, with 10 repeat runs, that implies a minimum runtime of 100 minutes, which is about twice the threshold).
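
A periodic keepalive along these lines could be as simple as a background thread that emits a heartbeat message while the harness is busy. The Python below is a minimal sketch only; the function name, logging call, and five-minute interval are assumptions rather than wptrunner's actual implementation.

```python
# Minimal keepalive sketch: print a heartbeat while tests run but produce no
# output, so CI does not kill the job for inactivity. Illustrative names only.
import threading

def start_keepalive(log, interval_seconds=300):
    """Emit a heartbeat message every `interval_seconds` until stopped."""
    stop_event = threading.Event()

    def heartbeat():
        # Event.wait returns False on timeout, True once stop_event is set.
        while not stop_event.wait(interval_seconds):
            log("Tests are still running; no new results yet (keepalive).")

    threading.Thread(target=heartbeat, daemon=True).start()
    return stop_event

# Hypothetical usage around the test iterations:
# stop = start_keepalive(print)
# run_all_iterations()
# stop.set()
```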

foolip (Member, Author) commented Feb 25, 2018

Changed the title a bit because in #9641 not that many files were changed, but many tests included the changed file, and that's what matters. Also, not all browsers necessarily fail.

gsnedders (Member) commented

Does it make sense to run with --no-restart-on-unexpected in CI?

foolip (Member, Author) commented Feb 26, 2018

@jgraham, what kinds of things would you expect to blow up if we do that? And would that still restart between each of the 10 runs?

jgraham (Contributor) commented Feb 26, 2018

Nothing should blow up; the default assumes that if you have one failing test then you don't want it to infect subsequent tests if it happens to set some bad state. In this case what you might see is that one unstable test would be more likely to make subsequent tests in the run look unstable. It won't affect restarting between runs at all.

An alternate approach would be to set the expectation data after the first run and use that to avoid restarting on subsequent runs unless something actually is unstable. At the moment the way we update the metadata is really slow, so that would itself have a performance impact.
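
To make the trade-off concrete, here is a rough Python sketch of the restart decision being described; the class, field, and function names are hypothetical rather than wptrunner's actual internals.

```python
# Illustrative sketch of the restart-on-unexpected trade-off; names are
# hypothetical, not wptrunner internals.
from dataclasses import dataclass

@dataclass
class TestResult:
    crashed: bool
    is_unexpected: bool

def should_restart_browser(result, restart_on_unexpected=True):
    # A crashed browser always has to be relaunched.
    if result.crashed:
        return True
    # Default: restart after any unexpected result, so a test that leaves bad
    # state behind cannot make later tests in the same run look unstable.
    # With --no-restart-on-unexpected this returns False instead: faster, but
    # one genuinely unstable test can affect later tests in that run.
    # Restarts *between* the full runs are unaffected either way.
    return restart_on_unexpected and result.is_unexpected
```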

foolip (Member, Author) commented Mar 14, 2018

@jgraham, as you can see, this is quite a common occurrence, requiring regular manual intervention, and our current setup isn't robust enough to build #7475 on top of. Are you open to revisiting the idea of running fewer iterations if we know we're going to exceed the timeout?

jgraham (Contributor) commented Mar 14, 2018

So I'm not opposed to that, but it is a limitation of the current system that we should look to address in a better way.

But this issue seems to be about the "10m without output" problem which is different to the entire run timing out and requires a different solution.

foolip (Member, Author) commented Mar 14, 2018

I think most of the PRs I've linked to here were actual timeouts close to 50 minutes; at least https://travis-ci.org/w3c/web-platform-tests/jobs/352958109 ended with just "The job exceeded the maximum time limit for jobs, and has been terminated".

Some of the timeouts have probably been due to verbose logging (now fixed), and maybe I've misattributed some other things, but plain hangs that result in no output for 10 minutes aren't most of the problem, I think.

As for a better way, ultimately I think there are cases where it's just not a good use of resources to even try to run all affected tests 10 times, like if testharness.js has changed or if many tests are affected in some trivial way like in #9718.

No matter how much Travis capacity we have, or whether we are able to reuse the wpt.fyi running infra, I think it makes sense to put a cap on resource usage per PR and just do the most useful things that can be done within those constraints. It should be possible to run all tests once, so that we can actually make testharness.js changes with confidence, and maybe twice to compare results with and without the changes, but I wouldn't want to allocate more resources than that per PR.

I think the next step is to just buy more Travis capacity which will increase the timeout to 120 minutes. @plehegar is looking into that.

foolip (Member, Author) commented Jan 16, 2020

@stephenmcgruer can you retriage this for priority? #7660 (comment) is the only idea I have for fixing this.

stephenmcgruer (Contributor) commented May 8, 2020

I wrote a 'brief' document that summarizes the (I believe) generally agreed upon solution: https://docs.google.com/document/d/1dAlCSHUQldtgWDDTrGJR-ksm19FZZ3k8ppqc5-kSwIk/edit# . It's world-readable, comment-able by @chromiums, and I'm happy to give comment access to anyone else who wants it (just can't make it world comment-able due to spam).

stephenmcgruer (Contributor) commented

Update: unfortunately there was disagreement on the proposed approach for avoiding the CI timeout, and I haven't had the time/energy to drive it to a successful conclusion. I still hope to address this, ideally sometime in Q3 once the Taskcluster Checks migration has comfortably landed.

stephenmcgruer self-assigned this Jul 12, 2020

gsnedders (Member) commented

FWIW, my view is still that the ideal solution is to do something in the decision task that chooses how to shard based on the number of files changed (I don't think we have the tests-affected information in the decision task?)
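
For illustration, a decision-task heuristic along those lines might look roughly like the Python below; the thresholds, cap, and function name are assumptions, not the actual Taskcluster decision task logic.

```python
# Illustrative only: pick a shard count for the stability tasks from the
# number of changed files. Thresholds and names are assumptions.
def plan_stability_shards(changed_files, files_per_shard=50, max_shards=8):
    """Return how many parallel stability tasks to schedule."""
    if not changed_files:
        return 0  # nothing to check
    wanted = -(-len(changed_files) // files_per_shard)  # ceiling division
    return min(wanted, max_shards)

# e.g. plan_stability_shards(["a.html"]) == 1
#      plan_stability_shards(["f%d.html" % i for i in range(778)]) == 8
```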

Hexcles (Member) commented Aug 11, 2020

Well, even if we do that, we still have widely used resource files that affect way too many tests (such that we can't afford to run 10x).

stephenmcgruer (Contributor) commented

So we're in a better place to do this now, since we switched to repeat-only stability checks for the CI. We can check for an impending timeout between each iteration of the test run (using previous runs to estimate running time), and bail if we're over our time limit.

Now it just needs me (or someone) to put in the time to create a PR ;)
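
As a concrete sketch of that idea (in Python, with a hypothetical budget and helper names, not the actual wpt CI code): estimate the next iteration from the ones already completed and stop before the job would hit the CI limit.

```python
# Illustrative sketch: stop repeating the test run early if the next
# iteration is predicted to blow the overall CI time budget. The 50-minute
# budget and all names here are assumptions, not the real wpt CI code.
import time

def run_with_budget(run_one_iteration, iterations=10, budget_seconds=50 * 60):
    start = time.monotonic()
    durations = []
    for i in range(iterations):
        elapsed = time.monotonic() - start
        # Predict the next iteration from the slowest one seen so far.
        if durations and elapsed + max(durations) > budget_seconds:
            print("Stopping after %d iterations to avoid the CI timeout." % i)
            break
        iteration_start = time.monotonic()
        run_one_iteration()
        durations.append(time.monotonic() - iteration_start)
    return durations
```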

DanielRyanSmith (Contributor) commented

With the new changes that were implemented based on @stephenmcgruer's proposal, is it safe to close this issue now?

jgraham (Contributor) commented Feb 25, 2022

I think it's still possible for this to occur in cases where we can't run two iterations of a selection of tests inside the task timeout, but hopefully that's rarer. I think it makes sense to track any remaining failures as part of a new issue.

jgraham closed this as completed Feb 25, 2022