Discussion #1

foolip · 2017-02-21T10:38:15Z

Filing an issue to have the discussion in public.

Which was the test that this flakiness was seen on? In IRC I see http://s92097608.onlinehome.us/tmp/17.02.17-flaky-reftest.html, is it http://w3c-test.org/html/rendering/replaced-elements/attributes-for-embedded-content-and-images/img-dim.html that you minified?

Looks like https://github.com/w3c/wptrunner/blob/master/wptrunner/executors/reftest-wait_webdriver.js is the "wait" script and that this test doesn't use the "reftest-wait" mechanism.

I don't know WebDriver, but are the sequence of commands https://github.com/w3c/wptrunner/blob/38435bc6714ae83bbf759b04395fe13f08388396/wptrunner/executors/executorselenium.py#L253-L264 intended to wait until after the load event? (The load event on the window object is delayed until images are loaded.)

How/where is webdriver.get(url) implemented?

Even given the note in https://github.com/jugglinmike/chrome-screenshot-race/blob/master/LICENSE my guess is that there isn't anything that really guarantees the timing?

@jgraham

The text was updated successfully, but these errors were encountered:

foolip · 2017-02-21T10:46:57Z

@samuong for ChromeDriver. Is it correct to assume that it's perfectly possible and valid to use ChromeDriver to execute async scripts before the page has finished loading? https://w3c.github.io/webdriver/webdriver-spec.html has things like "pageLoadStrategy", but this is all unfamiliar to me so I don't know where to start :)

jgraham · 2017-02-21T11:12:48Z

By default it is supposed to wait for the page to finish loading. However we also check the readyState attribute (the reftest-wait script is run for each page to detect if it has the class=reftest-wait attribute on <body>). So if the screenshot is happening before the load event something is very broken somewhere. Isn't it more likely that Chrome is delaying paint until after load? Or is that not a thing that happens?

foolip · 2017-02-21T11:29:04Z

Oh, I didn't pay attention to what readyState "complete" actually means. @jugglinmike, can you confirm that the load event is in fact fired, perhaps add some bit of red that's removed in the load event? Also, can you check if the img.naturalWidth has the right value?

If it looks to scripts like everything has loaded normally and the problem still reproduces on the same runs where scripts verify this, then a repaint problem of some kind like @jgraham suggests seems likely.

jugglinmike · 2017-02-21T16:52:10Z

Sure thing, @foolip. I've extended the test script to capture more data. Here's
a screenshot of the results:

As you can see, the ready event has indeed fired and the <img> element's
naturalHeight and naturalWidth report the size of the fully-rendered image.
This is all true prior to capturing the screenshot for both the "fully
rendered" case and the "not rendered" case.

samuong · 2017-02-21T17:58:31Z

ChromeDriver will wait for pages to load before executing both the screenshot and the execute async script commands. It then hands it off to DevTools (Page.captureScreenshot) to take the screenshot.

It's interesting that the naturalHeight/Width of the image is out of sync with what's being captured. I'll ask around within DevTools to see who can look into Page.captureScreenshot.

jugglinmike · 2017-02-21T18:10:20Z

From my local testing, this "appears to be avoidable" by forcing a repaint in
the onload event handler, as in:

--- a/index.html
+++ b/index.html
@@ -1,14 +1,15 @@
 <!DOCTYPE html>
 <html>
 <head>
   <script>
 window.ONLOAD_FIRED = null;
 onload = function() {
+  document.body.style.backgroundColor = 'hotpink';
   window.ONLOAD_FIRED = 'the `onload` event fired';
 };
   </script>
 </head>
 <body style="outline: solid teal 2px;">
   <img src="blue.png" style="border-width:50px; border-style:solid;"/>
 </body>
 </html>

With that change in place, the image is rendered consistently through thousands
of trials. However, even if "appears to be avoidable" were an acceptable
resolution, that pattern isn't generally usable for Web Platform Tests--it's
unreasonable to expect all reference tests to include the boilerplate logic
just for the sake of Chromium.

A related-but-more-tenable approach would be to force that repaint in the
WebDriver wait script itself:

--- a/main.py
+++ b/main.py
@@ -87,40 +87,41 @@ def take_reading(session_id):
         script = '''
             var callback = arguments[0];
             function done(strategy) {
+              document.body.style.backgroundColor = 'hotpink';
               callback({
                 callbackStrategy: strategy,
                 readyState: document.readyState,
                 scrollHeight: document.documentElement.scrollHeight,
                 onload_fired: window.ONLOAD_FIRED,
                 natural_dimensions: [
                   document.getElementsByTagName('img')[0].naturalWidth,
                   document.getElementsByTagName('img')[0].naturalHeight
                 ]
               });
             }
             if (document.readyState === 'complete') {
               done('synchronous');
             } else {
               onload = done.bind(null, 'asynchronous');
             }
         '''
         result = request('post', 'session/%s/execute_async' % session_id, dict(script=script, args=[]))

This does not solve the problem. It actually makes it more pronounced,
somehow, raising further questions about the paint/repaint cycle:

What we're seeing here is that the repaint is not taking effect
synchronously... Or, if it is, the frame buffer is not updating before the
screen shot is captured. But that may not be accurate at all; I'm really
stretching my knowledge of the rendering stack at this point.

samuong · 2017-02-21T18:21:53Z

Chrome should already force a redraw before it captures a screenshot: https://cs.chromium.org/chromium/src/content/browser/renderer_host/render_widget_host_impl.cc?l=1454. I think there's something else going wrong here...

jugglinmike · 2017-02-21T23:22:38Z

@foolip Sorry, I forgot to answer your original question. This is related to
that test, but I found a whole set of similar tests to behave inconsistently in
Chrome:

/html/rendering/replaced-elements/attributes-for-embedded-content-and-images/img-dim.html
/html/rendering/replaced-elements/attributes-for-embedded-content-and-images/img_border_percent.xhtml
/html/rendering/replaced-elements/attributes-for-embedded-content-and-images/object_border_perc.xhtml
/html/rendering/non-replaced-elements/the-page/body_text_00ffff.xhtml
/html/rendering/non-replaced-elements/phrasing-content-0/font-element-text-decoration-color/001-a.html
/html/rendering/non-replaced-elements/phrasing-content-0/font-element-text-decoration-color/001-q.html
/html/rendering/replaced-elements/images/space.html
/html/rendering/replaced-elements/attributes-for-embedded-content-and-images/object_border_pixel.xhtml

There are other inconsistent reference tests that may be related. These do not
concern rendering directly, but they do rely on image rendering:

/html/semantics/embedded-content/the-img-element/document-base-url.html
/html/semantics/embedded-content/the-video-element/video_dynamic_poster_absolute.htm
/html/semantics/forms/the-input-element/image01.html

foolip · 2017-02-23T06:35:56Z

The first experiment #1 (comment) demonstrates clearly that the load event is in fact being fired before the screenshot is captured.

However, it is perhaps not a given that the layout, even if invalidated, is actually recomputed before the screenshot is taken. While a script calling img.getBoundingClientRect() or similar would force the layout, perhaps in the absence of such a call the layout might happen on the animation frame after the load event? @eaenet, would you happen to know the sequence of events from the size of an image becoming known to the layout being updated and affecting the paint?

foolip · 2017-02-23T06:37:25Z

Chrome should already force a redraw before it captures a screenshot: https://cs.chromium.org/chromium/src/content/browser/renderer_host/render_widget_host_impl.cc?l=1454. I think there's something else going wrong here...

RenderViewImpl::OnForceRedraw(), which handles that message, calls RenderWidget::ScheduleCompositeWithForcedRedraw(). Are you sure there isn't something asynchronous about that, such that the the semantics "paint real soon now" and not "paint now"?

jugglinmike · 2017-02-23T22:33:49Z

At @RByers's suggestion, I modified the "wait" script to defer for the next
animation frame using requestAnimationFrame:

--- a/main.py
+++ b/main.py
@@ -116,7 +116,11 @@ try:
               });
             }
             if (document.readyState === 'complete') {
-              done('synchronous');
+              requestAnimationFrame(function() {
+                done('synchronous');
+              });
             } else {
               onload = done.bind(null, 'asynchronous');
}

This modification did not address the instability: I observed the race
condition after 10,723 trials. Rick anticipated this, though, and explained how
the semantics of the requestAnimationFrame function meant that we would
actually have to apply it twice to accurately describe the condition we are
interested in here:

--- a/main.py
+++ b/main.py
@@ -116,7 +116,11 @@ try:
               });
             }
             if (document.readyState === 'complete') {
-              done('synchronous');
+              requestAnimationFrame(function() {
+                requestAnimationFrame(function() {
+                  done('synchronous');
+                });
+              });
             } else {
               onload = done.bind(null, 'asynchronous');
}

This ran for almost three hours without exhibiting the race condition.

While this is encouraging, it is by no means conclusive. There is no indication
that the delay introduced here is in any way related to the delay in the
browser's rendering behavior. We are at the limits of what guesswork can tell
us, and we really need a more authoritative voice to conclusively describe an
deferral technique based on the Chromium source code.

RByers · 2017-02-23T22:47:42Z

Yeah, we'll get a Chrome rendering expert to investigate and get back to you. But in the interim I do think it's a reasonable workaround, IIRC some blink LayoutTests follow such a double-rAF pattern already, eg. here.

We recommend a different internal API which probably has stronger guarantees than just a double-rAF. Eg. it may be possible for rAF to fire while raster is still pending, so you could possible see "checkerboarding" (which today is just white, not an actual checkerboard pattern). But I suspect the WebDriver screenshot API blocks on raster (at least it should - probably doesn't make to generate a screenshot of something that hasn't finished rastering yet).

foolip · 2017-03-03T07:26:27Z

@chrishtr, can you help us take a look here? I would be interesting to start from the question I posed to @eaenet in #1 (comment) about the order of events here.

In Document::checkLoadEventSoon() I see that there's just a timer that's started to possibly fire the load event. ImageLoader seems to use that (indirectly), and if that timer is started at the same time as the layout is invalidated, then I don't think there's any guarantees about order, the load event can come before the layout. Could the paint that we use for the screenshot happen between the load event and the layout? Or do we never paint anything if the layout is dirty?

chrishtr · 2017-03-07T00:30:44Z

Blink never paints with dirty layout.

I'm trying to repro locally now.

chrishtr · 2017-03-07T00:55:16Z

I am not yet able to reproduce this bug. I tried the steps in https://github.com/jugglinmike/chrome-screenshot-race/blob/master/readme.md, but when I run

python main.py

it errors out with:

Traceback (most recent call last):
File "main.py", line 91, in
session_id = request('post', 'session', dict(desiredCapabilities={}))['sessionId']
File "main.py", line 20, in request
response = getattr(requests, method)('http://localhost:4444/%s' % uri, json=data)
File "/usr/lib/python2.7/dist-packages/requests/api.py", line 88, in post
return request('post', url, data=data, **kwargs)
File "/usr/lib/python2.7/dist-packages/requests/api.py", line 44, in request
return session.request(method=method, url=url, **kwargs)
TypeError: request() got an unexpected keyword argument 'json'

As far as I can tell I have installed the "requests" module correctly.

I also tried putting https://github.com/jugglinmike/chrome-screenshot-race/blob/master/index.html into the Blink Layout Test machinery, and was unable to make it fail in 1000 trials. (not that layout tests are the same as Chrome Driver, just one data point).

foolip · 2017-03-07T09:31:08Z

@jugglinmike, just to ensure that we're not chasing after something that was recently fixed, can you try to repro also in Chrome Canary, and if successful note the exact version in case someone wants to try building and testing from the same source?

jugglinmike · 2017-03-07T17:41:25Z

Thanks for taking a look, @chrishtr!

As far as I can tell I have installed the "requests" module correctly.

I'm not sure what could be going wrong here, either. I've just removed and
re-installed that module and the script still functions on my end. To be safe,
I've updated the installation instructions to use the PIP tool and provided a
requirements.txt file specifying the exact version of requests that I am
using. You probably don't have to use PIP, but could you verify that the
version of the library matches?

@jugglinmike, just to ensure that we're not chasing after something that was
recently fixed, can you try to repro also in Chrome Canary, and if successful
note the exact version in case someone wants to try building and testing from
the same source?

That's a good idea, Philip. I'm unable to run Chrome Canary (no suitable
Windows machine), but I've verified using Chrome's latest "dev" release,
(58.0.3026.3) which appears to have been published last
Wednesday.
I've updated the project's readme.md file to include this information along
with additional details about my system architecture and version of
ChromeDriver (2.27.440175, which appears to be the latest at
present).

chrishtr · 2017-03-07T19:59:04Z

I can now get it to run with the updated instructions. Should I expect the test to automatically fail with
a wrong screenshot? So far I haven't seen that. How long should it take?

jugglinmike · 2017-03-07T20:23:06Z

The script should capture an initial screenshot and then begin reloading the
page in quick succession, capturing another screenshot each time, until an
abberation is detected. It should then write a report to a file named
results.html and exit.

Here are the results of my last three executions (taken today):

Captured 439 screen shots over 96.2317750454 seconds.
Captured 251 screen shots over 56.5082240105 seconds.
Captured 315 screen shots over 68.7743799686 seconds.

This problem may be related to available system resources. In case that
information helps:

Processor		: 4x AMD Athlon(tm) II X4 640 Processor
Memory		: 8171MB

chrishtr · 2017-03-08T01:12:13Z

Ok now I can get it to fail. Next I'm having trouble setting a custom Chrome binary, can you help with
that? I have found python code to set up a new driver, but your code is accessing via HTTP not the
python library it appears.

jugglinmike · 2017-03-08T16:13:45Z

Sure thing. I just pushed an update that introduces a new command-line flag,
--chrome-binary. The invocation should now look something like this:

$ python main.py --chrome-binary /custom/path/to/chrome

You can also use the new --help flag to verify.

jugglinmike · 2017-03-08T16:47:49Z

I've just verified that this behavior persists in a more recent build of Chrome: 58.0.3029.6, released yesterday.

jugglinmike · 2017-03-17T17:22:15Z

Hi @chrishtr, is the new --chrome-binary flag working for you?

chrishtr · 2017-03-17T20:33:10Z

Hi,

Yes it worked. Sorry for the delay. I still haven't found time to debug further. This week I got caught
up in some difficult other work. Hope to test on Monday.

jugglinmike · 2017-03-20T14:39:30Z

Got it. Just making sure it's still on your radar!

chrishtr · 2017-03-23T22:45:53Z

Apologies, I still haven't found time to debug this in any depth this week.
I hope next week.

jugglinmike · 2017-03-24T23:54:00Z

Got it. Thanks for keeping us posted

chrishtr · 2017-04-04T23:08:33Z

FYI a similar/same bug got reported via a Google-internal channel, and a bisect to find
the cause of that issue is underway...it may be that Chrome 56 regressed something. If the
bisect fails, I'll try to bisect this issue directly next week, since I was able to reproduce
it locally.

jugglinmike · 2017-04-04T23:23:27Z

Excellent! I'm excited to learn more :)

chrishtr · 2017-04-05T18:15:58Z

For a workaround in the meantime, the double-rAF method mentioned in a comment on Feb 23 sounds like it would work (and I agree it should help). Does it?

RByers · 2017-04-05T18:18:56Z

Yeah given that there are a bunch of questions about CSS test instability potentially impacted by this issue, I suggest we temporarily land the double-rAF workaround while we continue to investigate. @jugglinmike @gsnedders WDYT?

gsnedders · 2017-04-05T18:48:04Z

@jgraham @jugglinmike we should try and land that in wptrunner ASAP

jugglinmike · 2017-04-05T19:15:45Z

For a workaround in the meantime, the double-rAF method mentioned in a
comment on Feb 23 sounds like it would work (and I agree it should help).
Does it?

Yes, see my comment from that day. As I mentioned, there, though, without some concrete justification, that technique is no better than (for instance) setTimeout with a value of 100 milliseconds.

Yeah given that there are a bunch of questions about CSS test instability
potentially impacted by this issue, I suggest we temporarily land the
double-rAF workaround while we continue to investigate. @jugglinmike
@gsnedders WDYT?

@jgraham @jugglinmike we should try and land that in wptrunner ASAP

I'm sensing some urgency here. I'm concerned that implementing an
unsubstantiated workaround will not guarantee stability but that it will
delay a verifiable solution still further. I doubt this is going to make me any
friends, but an alternative would be to mark WPT's Chrome job as an "allowed
failure" until this can be resolved.

gsnedders · 2017-04-05T19:25:39Z

@jugglinmike It's not unsubstantiated insofar as it forces a paint and guarantees a paint has occurred, though it is a hypothesis that it is failing to paint the image… I don't think this will get deprioritised as a result of the workaround landing (as it likely effects far more than just this), hence I don't think that's a reason not to land this.

Ideally, if we did make it an allowed failure, we'd do so for reftests only (which probably means in check_stability.py rather than at a Travis level).

See <jugglinmike/chrome-screenshot-race#1>. This appears to avoid it from happening, and guarantees we've painted after the load event and (potentially) reftest-wait being removed.

jugglinmike · 2017-04-05T20:47:03Z

@chrishtr @RByers Can the internal bug mentioned above be made public? @jgraham has pointed out that it would make a much better reference point for documenting this workaround.

chrishtr · 2017-04-05T20:57:43Z

It cannot unfortunately, but I have filed this bug for you to use as a reference point instead:

https://bugs.chromium.org/p/chromium/issues/detail?id=708757

jugglinmike · 2017-04-05T21:00:54Z

Thank you, @chrishtr!

I discussed this with @gsnedders and @jgraham on IRC, and we're going to move forward with the requestAnimationFrame workaround for the short-term.

See <jugglinmike/chrome-screenshot-race#1>. This appears to avoid it from happening, and guarantees we've painted after the load event and (potentially) reftest-wait being removed. r=jgraham

gsnedders · 2017-04-05T21:12:43Z

Note that this shouldn't de-prioritise the upstream issue, at least until we know that the workaround (i.e., the double requestAnimationFrame) will guarantee we won't hit the underlying bug. Of course, it's likely to affect more than just us, so it probably still matters for web developers generally.

Svennisen · 2017-07-19T21:41:44Z

@jugglinmike @gsnedders @chrishtr
I think I am encountering this exact issue when updating the DOM and trying to take a capture after.

Using a mutation observer and the requestAnimationFrame fix works when I run my app under normal load. However when putting very high stress on CPU and GPU the capture will still run before the render has finished. Using chrome 58 in electron. We tested this because we have customers running on low end computers where this is an issue when their computers run virus scans.

Did you try your test under heavy CPU and GPU load?
I used this CPU stress program to put load on the CPU (windows).
Im fairly certain your double requestAnimationFrame solution for readyState complete will also fail when doing this.

Any progress on the internal bug mentioned related to this?

chrishtr · 2017-07-24T17:16:27Z

Please comment on crbug.com/708757. I don't think there has been any progress recently
on it.

Svennisen · 2017-07-24T17:34:51Z

@chrishtr I just did. Good news! They are saying it is fixed and will be released in v60. Don't have access to that issue though so can't look into it further.

"I thought this issue was fixed, maybe. https://bugs.chromium.org/p/chromium/issues/detailid=382170
That would be in M-60 and will not be merged back. So going to Stable soon."

So I am just waiting and hoping for the best.

gsnedders mentioned this issue Apr 5, 2017

[css-ui-3] Add tests for text-overflow web-platform-tests/wpt#5285

Merged

This was referenced Apr 5, 2017

[css-text-3] White space pre wrap end of line web-platform-tests/wpt#5283

Merged

Remove "interact" flags for tests that can be run automatically web-platform-tests/wpt#5370

Merged

gsnedders mentioned this issue Apr 5, 2017

Add workaround for Chrome not painting images w3c/wptrunner#248

Merged

frivoal mentioned this issue Apr 6, 2017

Can't merge PR due to Chrome failing stability check and Firefox just timing out web-platform-tests/wpt#5395

Closed

jgraham mentioned this issue Apr 17, 2017

Need to wait for a paint before taking a screenshot w3c/webdriver#895

Open

jugglinmike mentioned this issue Sep 12, 2023

Proposing a get endpoint for Take Screenshot that with fullpage option w3c/webdriver#1536

Open

Discussion #1

Discussion #1

Comments

foolip commented Feb 21, 2017

foolip commented Feb 21, 2017

jgraham commented Feb 21, 2017

foolip commented Feb 21, 2017

jugglinmike commented Feb 21, 2017

samuong commented Feb 21, 2017

jugglinmike commented Feb 21, 2017

samuong commented Feb 21, 2017

jugglinmike commented Feb 21, 2017

foolip commented Feb 23, 2017

foolip commented Feb 23, 2017

jugglinmike commented Feb 23, 2017

RByers commented Feb 23, 2017 • edited Loading

foolip commented Mar 3, 2017

chrishtr commented Mar 7, 2017

chrishtr commented Mar 7, 2017

foolip commented Mar 7, 2017

jugglinmike commented Mar 7, 2017

chrishtr commented Mar 7, 2017

jugglinmike commented Mar 7, 2017

chrishtr commented Mar 8, 2017

jugglinmike commented Mar 8, 2017

jugglinmike commented Mar 8, 2017

jugglinmike commented Mar 17, 2017

chrishtr commented Mar 17, 2017

jugglinmike commented Mar 20, 2017

chrishtr commented Mar 23, 2017

jugglinmike commented Mar 24, 2017

chrishtr commented Apr 4, 2017

jugglinmike commented Apr 4, 2017

chrishtr commented Apr 5, 2017

RByers commented Apr 5, 2017

gsnedders commented Apr 5, 2017

jugglinmike commented Apr 5, 2017

gsnedders commented Apr 5, 2017

jugglinmike commented Apr 5, 2017

chrishtr commented Apr 5, 2017

jugglinmike commented Apr 5, 2017

gsnedders commented Apr 5, 2017

Svennisen commented Jul 19, 2017

chrishtr commented Jul 24, 2017

Svennisen commented Jul 24, 2017

RByers commented Feb 23, 2017 •

edited

Loading