Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion #1

Open
foolip opened this issue Feb 21, 2017 · 41 comments
Open

Discussion #1

foolip opened this issue Feb 21, 2017 · 41 comments

Comments

@foolip
Copy link

foolip commented Feb 21, 2017

Filing an issue to have the discussion in public.

Which was the test that this flakiness was seen on? In IRC I see http://s92097608.onlinehome.us/tmp/17.02.17-flaky-reftest.html, is it http://w3c-test.org/html/rendering/replaced-elements/attributes-for-embedded-content-and-images/img-dim.html that you minified?

Looks like https://github.com/w3c/wptrunner/blob/master/wptrunner/executors/reftest-wait_webdriver.js is the "wait" script and that this test doesn't use the "reftest-wait" mechanism.

I don't know WebDriver, but are the sequence of commands https://github.com/w3c/wptrunner/blob/38435bc6714ae83bbf759b04395fe13f08388396/wptrunner/executors/executorselenium.py#L253-L264 intended to wait until after the load event? (The load event on the window object is delayed until images are loaded.)

How/where is webdriver.get(url) implemented?

Even given the note in https://github.com/jugglinmike/chrome-screenshot-race/blob/master/LICENSE my guess is that there isn't anything that really guarantees the timing?

@jgraham

@foolip
Copy link
Author

foolip commented Feb 21, 2017

@samuong for ChromeDriver. Is it correct to assume that it's perfectly possible and valid to use ChromeDriver to execute async scripts before the page has finished loading? https://w3c.github.io/webdriver/webdriver-spec.html has things like "pageLoadStrategy", but this is all unfamiliar to me so I don't know where to start :)

@jgraham
Copy link

jgraham commented Feb 21, 2017

By default it is supposed to wait for the page to finish loading. However we also check the readyState attribute (the reftest-wait script is run for each page to detect if it has the class=reftest-wait attribute on <body>). So if the screenshot is happening before the load event something is very broken somewhere. Isn't it more likely that Chrome is delaying paint until after load? Or is that not a thing that happens?

@foolip
Copy link
Author

foolip commented Feb 21, 2017

Oh, I didn't pay attention to what readyState "complete" actually means. @jugglinmike, can you confirm that the load event is in fact fired, perhaps add some bit of red that's removed in the load event? Also, can you check if the img.naturalWidth has the right value?

If it looks to scripts like everything has loaded normally and the problem still reproduces on the same runs where scripts verify this, then a repaint problem of some kind like @jgraham suggests seems likely.

@jugglinmike
Copy link
Owner

Sure thing, @foolip. I've extended the test script to capture more data. Here's
a screenshot of the results:

2017-02-21-rendering-info

As you can see, the ready event has indeed fired and the <img> element's
naturalHeight and naturalWidth report the size of the fully-rendered image.
This is all true prior to capturing the screenshot for both the "fully
rendered" case and the "not rendered" case.

@samuong
Copy link

samuong commented Feb 21, 2017

ChromeDriver will wait for pages to load before executing both the screenshot and the execute async script commands. It then hands it off to DevTools (Page.captureScreenshot) to take the screenshot.

It's interesting that the naturalHeight/Width of the image is out of sync with what's being captured. I'll ask around within DevTools to see who can look into Page.captureScreenshot.

@jugglinmike
Copy link
Owner

From my local testing, this "appears to be avoidable" by forcing a repaint in
the onload event handler, as in:

--- a/index.html
+++ b/index.html
@@ -1,14 +1,15 @@
 <!DOCTYPE html>
 <html>
 <head>
   <script>
 window.ONLOAD_FIRED = null;
 onload = function() {
+  document.body.style.backgroundColor = 'hotpink';
   window.ONLOAD_FIRED = 'the `onload` event fired';
 };
   </script>
 </head>
 <body style="outline: solid teal 2px;">
   <img src="blue.png" style="border-width:50px; border-style:solid;"/>
 </body>
 </html>

With that change in place, the image is rendered consistently through thousands
of trials. However, even if "appears to be avoidable" were an acceptable
resolution, that pattern isn't generally usable for Web Platform Tests--it's
unreasonable to expect all reference tests to include the boilerplate logic
just for the sake of Chromium.

A related-but-more-tenable approach would be to force that repaint in the
WebDriver wait script itself:

--- a/main.py
+++ b/main.py
@@ -87,40 +87,41 @@ def take_reading(session_id):
         script = '''
             var callback = arguments[0];
             function done(strategy) {
+              document.body.style.backgroundColor = 'hotpink';
               callback({
                 callbackStrategy: strategy,
                 readyState: document.readyState,
                 scrollHeight: document.documentElement.scrollHeight,
                 onload_fired: window.ONLOAD_FIRED,
                 natural_dimensions: [
                   document.getElementsByTagName('img')[0].naturalWidth,
                   document.getElementsByTagName('img')[0].naturalHeight
                 ]
               });
             }
             if (document.readyState === 'complete') {
               done('synchronous');
             } else {
               onload = done.bind(null, 'asynchronous');
             }
         '''
         result = request('post', 'session/%s/execute_async' % session_id, dict(script=script, args=[]))

This does not solve the problem. It actually makes it more pronounced,
somehow, raising further questions about the paint/repaint cycle:

screenshot from 2017-02-21 13-06-39

What we're seeing here is that the repaint is not taking effect
synchronously... Or, if it is, the frame buffer is not updating before the
screen shot is captured. But that may not be accurate at all; I'm really
stretching my knowledge of the rendering stack at this point.

@samuong
Copy link

samuong commented Feb 21, 2017

Chrome should already force a redraw before it captures a screenshot: https://cs.chromium.org/chromium/src/content/browser/renderer_host/render_widget_host_impl.cc?l=1454. I think there's something else going wrong here...

@jugglinmike
Copy link
Owner

@foolip Sorry, I forgot to answer your original question. This is related to
that test, but I found a whole set of similar tests to behave inconsistently in
Chrome:

  • /html/rendering/replaced-elements/attributes-for-embedded-content-and-images/img-dim.html
  • /html/rendering/replaced-elements/attributes-for-embedded-content-and-images/img_border_percent.xhtml
  • /html/rendering/replaced-elements/attributes-for-embedded-content-and-images/object_border_perc.xhtml
  • /html/rendering/non-replaced-elements/the-page/body_text_00ffff.xhtml
  • /html/rendering/non-replaced-elements/phrasing-content-0/font-element-text-decoration-color/001-a.html
  • /html/rendering/non-replaced-elements/phrasing-content-0/font-element-text-decoration-color/001-q.html
  • /html/rendering/replaced-elements/images/space.html
  • /html/rendering/replaced-elements/attributes-for-embedded-content-and-images/object_border_pixel.xhtml

There are other inconsistent reference tests that may be related. These do not
concern rendering directly, but they do rely on image rendering:

  • /html/semantics/embedded-content/the-img-element/document-base-url.html
  • /html/semantics/embedded-content/the-video-element/video_dynamic_poster_absolute.htm
  • /html/semantics/forms/the-input-element/image01.html

@foolip
Copy link
Author

foolip commented Feb 23, 2017

The first experiment #1 (comment) demonstrates clearly that the load event is in fact being fired before the screenshot is captured.

However, it is perhaps not a given that the layout, even if invalidated, is actually recomputed before the screenshot is taken. While a script calling img.getBoundingClientRect() or similar would force the layout, perhaps in the absence of such a call the layout might happen on the animation frame after the load event? @eaenet, would you happen to know the sequence of events from the size of an image becoming known to the layout being updated and affecting the paint?

@foolip
Copy link
Author

foolip commented Feb 23, 2017

Chrome should already force a redraw before it captures a screenshot: https://cs.chromium.org/chromium/src/content/browser/renderer_host/render_widget_host_impl.cc?l=1454. I think there's something else going wrong here...

RenderViewImpl::OnForceRedraw(), which handles that message, calls RenderWidget::ScheduleCompositeWithForcedRedraw(). Are you sure there isn't something asynchronous about that, such that the the semantics "paint real soon now" and not "paint now"?

@jugglinmike
Copy link
Owner

At @RByers's suggestion, I modified the "wait" script to defer for the next
animation frame using requestAnimationFrame:

--- a/main.py
+++ b/main.py
@@ -116,7 +116,11 @@ try:
               });
             }
             if (document.readyState === 'complete') {
-              done('synchronous');
+              requestAnimationFrame(function() {
+                done('synchronous');
+              });
             } else {
               onload = done.bind(null, 'asynchronous');
}

This modification did not address the instability: I observed the race
condition after 10,723 trials. Rick anticipated this, though, and explained how
the semantics of the requestAnimationFrame function meant that we would
actually have to apply it twice to accurately describe the condition we are
interested in here:

--- a/main.py
+++ b/main.py
@@ -116,7 +116,11 @@ try:
               });
             }
             if (document.readyState === 'complete') {
-              done('synchronous');
+              requestAnimationFrame(function() {
+                requestAnimationFrame(function() {
+                  done('synchronous');
+                });
+              });
             } else {
               onload = done.bind(null, 'asynchronous');
}

This ran for almost three hours without exhibiting the race condition.

While this is encouraging, it is by no means conclusive. There is no indication
that the delay introduced here is in any way related to the delay in the
browser's rendering behavior. We are at the limits of what guesswork can tell
us, and we really need a more authoritative voice to conclusively describe an
deferral technique based on the Chromium source code.

@RByers
Copy link

RByers commented Feb 23, 2017

Yeah, we'll get a Chrome rendering expert to investigate and get back to you. But in the interim I do think it's a reasonable workaround, IIRC some blink LayoutTests follow such a double-rAF pattern already, eg. here.

We recommend a different internal API which probably has stronger guarantees than just a double-rAF. Eg. it may be possible for rAF to fire while raster is still pending, so you could possible see "checkerboarding" (which today is just white, not an actual checkerboard pattern). But I suspect the WebDriver screenshot API blocks on raster (at least it should - probably doesn't make to generate a screenshot of something that hasn't finished rastering yet).

@foolip
Copy link
Author

foolip commented Mar 3, 2017

@chrishtr, can you help us take a look here? I would be interesting to start from the question I posed to @eaenet in #1 (comment) about the order of events here.

In Document::checkLoadEventSoon() I see that there's just a timer that's started to possibly fire the load event. ImageLoader seems to use that (indirectly), and if that timer is started at the same time as the layout is invalidated, then I don't think there's any guarantees about order, the load event can come before the layout. Could the paint that we use for the screenshot happen between the load event and the layout? Or do we never paint anything if the layout is dirty?

@chrishtr
Copy link

chrishtr commented Mar 7, 2017

Blink never paints with dirty layout.

I'm trying to repro locally now.

@chrishtr
Copy link

chrishtr commented Mar 7, 2017

I am not yet able to reproduce this bug. I tried the steps in https://github.com/jugglinmike/chrome-screenshot-race/blob/master/readme.md, but when I run

python main.py

it errors out with:

Traceback (most recent call last):
File "main.py", line 91, in
session_id = request('post', 'session', dict(desiredCapabilities={}))['sessionId']
File "main.py", line 20, in request
response = getattr(requests, method)('http://localhost:4444/%s' % uri, json=data)
File "/usr/lib/python2.7/dist-packages/requests/api.py", line 88, in post
return request('post', url, data=data, **kwargs)
File "/usr/lib/python2.7/dist-packages/requests/api.py", line 44, in request
return session.request(method=method, url=url, **kwargs)
TypeError: request() got an unexpected keyword argument 'json'

As far as I can tell I have installed the "requests" module correctly.

I also tried putting https://github.com/jugglinmike/chrome-screenshot-race/blob/master/index.html into the Blink Layout Test machinery, and was unable to make it fail in 1000 trials. (not that layout tests are the same as Chrome Driver, just one data point).

@foolip
Copy link
Author

foolip commented Mar 7, 2017

@jugglinmike, just to ensure that we're not chasing after something that was recently fixed, can you try to repro also in Chrome Canary, and if successful note the exact version in case someone wants to try building and testing from the same source?

@jugglinmike
Copy link
Owner

Thanks for taking a look, @chrishtr!

As far as I can tell I have installed the "requests" module correctly.

I'm not sure what could be going wrong here, either. I've just removed and
re-installed that module and the script still functions on my end. To be safe,
I've updated the installation instructions to use the PIP tool and provided a
requirements.txt file specifying the exact version of requests that I am
using. You probably don't have to use PIP, but could you verify that the
version of the library matches?

@jugglinmike, just to ensure that we're not chasing after something that was
recently fixed, can you try to repro also in Chrome Canary, and if successful
note the exact version in case someone wants to try building and testing from
the same source?

That's a good idea, Philip. I'm unable to run Chrome Canary (no suitable
Windows machine), but I've verified using Chrome's latest "dev" release,
(58.0.3026.3) which appears to have been published last
Wednesday
.
I've updated the project's readme.md file to include this information along
with additional details about my system architecture and version of
ChromeDriver (2.27.440175, which appears to be the latest at
present
).

@chrishtr
Copy link

chrishtr commented Mar 7, 2017

I can now get it to run with the updated instructions. Should I expect the test to automatically fail with
a wrong screenshot? So far I haven't seen that. How long should it take?

@jugglinmike
Copy link
Owner

The script should capture an initial screenshot and then begin reloading the
page in quick succession, capturing another screenshot each time, until an
abberation is detected. It should then write a report to a file named
results.html and exit.

Here are the results of my last three executions (taken today):

  • Captured 439 screen shots over 96.2317750454 seconds.

  • Captured 251 screen shots over 56.5082240105 seconds.

  • Captured 315 screen shots over 68.7743799686 seconds.

This problem may be related to available system resources. In case that
information helps:

Processor		: 4x AMD Athlon(tm) II X4 640 Processor
Memory		: 8171MB

@chrishtr
Copy link

chrishtr commented Mar 8, 2017

Ok now I can get it to fail. Next I'm having trouble setting a custom Chrome binary, can you help with
that? I have found python code to set up a new driver, but your code is accessing via HTTP not the
python library it appears.

@jugglinmike
Copy link
Owner

Sure thing. I just pushed an update that introduces a new command-line flag,
--chrome-binary. The invocation should now look something like this:

$ python main.py --chrome-binary /custom/path/to/chrome

You can also use the new --help flag to verify.

@jugglinmike
Copy link
Owner

I've just verified that this behavior persists in a more recent build of Chrome: 58.0.3029.6, released yesterday.

@jugglinmike
Copy link
Owner

Hi @chrishtr, is the new --chrome-binary flag working for you?

@chrishtr
Copy link

Hi,

Yes it worked. Sorry for the delay. I still haven't found time to debug further. This week I got caught
up in some difficult other work. Hope to test on Monday.

@jugglinmike
Copy link
Owner

Got it. Just making sure it's still on your radar!

@chrishtr
Copy link

Apologies, I still haven't found time to debug this in any depth this week.
I hope next week.

@jugglinmike
Copy link
Owner

Got it. Thanks for keeping us posted

@chrishtr
Copy link

chrishtr commented Apr 4, 2017

FYI a similar/same bug got reported via a Google-internal channel, and a bisect to find
the cause of that issue is underway...it may be that Chrome 56 regressed something. If the
bisect fails, I'll try to bisect this issue directly next week, since I was able to reproduce
it locally.

@jugglinmike
Copy link
Owner

Excellent! I'm excited to learn more :)

@chrishtr
Copy link

chrishtr commented Apr 5, 2017

For a workaround in the meantime, the double-rAF method mentioned in a comment on Feb 23 sounds like it would work (and I agree it should help). Does it?

@RByers
Copy link

RByers commented Apr 5, 2017

Yeah given that there are a bunch of questions about CSS test instability potentially impacted by this issue, I suggest we temporarily land the double-rAF workaround while we continue to investigate. @jugglinmike @gsnedders WDYT?

@gsnedders
Copy link

@jgraham @jugglinmike we should try and land that in wptrunner ASAP

@jugglinmike
Copy link
Owner

For a workaround in the meantime, the double-rAF method mentioned in a
comment on Feb 23 sounds like it would work (and I agree it should help).
Does it?

Yes, see my comment from that day. As I mentioned, there, though, without some concrete justification, that technique is no better than (for instance) setTimeout with a value of 100 milliseconds.

Yeah given that there are a bunch of questions about CSS test instability
potentially impacted by this issue, I suggest we temporarily land the
double-rAF workaround while we continue to investigate. @jugglinmike
@gsnedders WDYT?

@jgraham @jugglinmike we should try and land that in wptrunner ASAP

I'm sensing some urgency here. I'm concerned that implementing an
unsubstantiated workaround will not guarantee stability but that it will
delay a verifiable solution still further. I doubt this is going to make me any
friends, but an alternative would be to mark WPT's Chrome job as an "allowed
failure" until this can be resolved.

@gsnedders
Copy link

@jugglinmike It's not unsubstantiated insofar as it forces a paint and guarantees a paint has occurred, though it is a hypothesis that it is failing to paint the image… I don't think this will get deprioritised as a result of the workaround landing (as it likely effects far more than just this), hence I don't think that's a reason not to land this.

Ideally, if we did make it an allowed failure, we'd do so for reftests only (which probably means in check_stability.py rather than at a Travis level).

gsnedders added a commit to gsnedders/wptrunner that referenced this issue Apr 5, 2017
See <jugglinmike/chrome-screenshot-race#1>.

This appears to avoid it from happening, and guarantees we've painted
after the load event and (potentially) reftest-wait being removed.
@jugglinmike
Copy link
Owner

@chrishtr @RByers Can the internal bug mentioned above be made public? @jgraham has pointed out that it would make a much better reference point for documenting this workaround.

@chrishtr
Copy link

chrishtr commented Apr 5, 2017

It cannot unfortunately, but I have filed this bug for you to use as a reference point instead:

https://bugs.chromium.org/p/chromium/issues/detail?id=708757

@jugglinmike
Copy link
Owner

Thank you, @chrishtr!

I discussed this with @gsnedders and @jgraham on IRC, and we're going to move forward with the requestAnimationFrame workaround for the short-term.

gsnedders added a commit to w3c/wptrunner that referenced this issue Apr 5, 2017
See <jugglinmike/chrome-screenshot-race#1>.

This appears to avoid it from happening, and guarantees we've painted
after the load event and (potentially) reftest-wait being removed.

r=jgraham
@gsnedders
Copy link

Note that this shouldn't de-prioritise the upstream issue, at least until we know that the workaround (i.e., the double requestAnimationFrame) will guarantee we won't hit the underlying bug. Of course, it's likely to affect more than just us, so it probably still matters for web developers generally.

@Svennisen
Copy link

@jugglinmike @gsnedders @chrishtr
I think I am encountering this exact issue when updating the DOM and trying to take a capture after.

Using a mutation observer and the requestAnimationFrame fix works when I run my app under normal load. However when putting very high stress on CPU and GPU the capture will still run before the render has finished. Using chrome 58 in electron. We tested this because we have customers running on low end computers where this is an issue when their computers run virus scans.

Did you try your test under heavy CPU and GPU load?
I used this CPU stress program to put load on the CPU (windows).
Im fairly certain your double requestAnimationFrame solution for readyState complete will also fail when doing this.

Any progress on the internal bug mentioned related to this?

@chrishtr
Copy link

Please comment on crbug.com/708757. I don't think there has been any progress recently
on it.

@Svennisen
Copy link

@chrishtr I just did. Good news! They are saying it is fixed and will be released in v60. Don't have access to that issue though so can't look into it further.

"I thought this issue was fixed, maybe. https://bugs.chromium.org/p/chromium/issues/detailid=382170
That would be in M-60 and will not be merged back. So going to Stable soon."

So I am just waiting and hoping for the best.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants