
Google Chrome CI job occasionally does not run any tests #5406

Closed
gsnedders opened this issue Apr 6, 2017 · 18 comments

Comments

@gsnedders
Member

See:

https://travis-ci.org/w3c/web-platform-tests/builds/218464208

cc/ @jgraham @jugglinmike @bobholt

@gsnedders
Member Author

Potentially has overlap with #5408.

@jugglinmike jugglinmike changed the title Stability check running different sets of tests on Chrome/Firefox Google Chrome CI job occasionally does not run any tests Apr 6, 2017
@jugglinmike
Contributor

The problem here is actually specific to the Chrome job, so I'm updating the
title to reflect that.

In both reported cases, ChromeDriver failed to run any tests at all. The full
log (hosted on Travis CI but not included in the automated GitHub.com comment)
includes the following output:

PROCESS | 12034 | Starting ChromeDriver 2.29.461571 (8a88bbe0775e2a23afda0ceaf2ef7ee74e822cc5) on port 4454
PROCESS | 12034 | Only local connections are allowed.
Failed to connect to navigate initial page
Init failed 1
u'log' (u'debug', {'message': 'Hanging up on Selenium session'})
u'runner_teardown' ()
PROCESS | 12165 | Starting ChromeDriver 2.29.461571 (8a88bbe0775e2a23afda0ceaf2ef7ee74e822cc5) on port 4454
PROCESS | 12165 | Only local connections are allowed.
Failed to connect to navigate initial page
Init failed 2

...repeated 3 more times (for a total of 5 "Init failed" reports), followed by
a JSON-encoded version of the following error message:

Connecting to Selenium failed:
Traceback (most recent call last):
  File "/home/travis/virtualenv/python2.7.12/lib/python2.7/site-packages/wptrunner/executors/executorselenium.py", line 59, in setup
    desired_capabilities=self.capabilities)
  File "/home/travis/virtualenv/python2.7.12/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 98, in __init__
    self.start_session(desired_capabilities, browser_profile)
  File "/home/travis/virtualenv/python2.7.12/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 185, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/home/travis/virtualenv/python2.7.12/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 247, in execute
    response = self.command_executor.execute(driver_command, params)
  File "/home/travis/virtualenv/python2.7.12/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 464, in execute
    return self._request(command_info[0], url, body=data)
  File "/home/travis/virtualenv/python2.7.12/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 525, in _request
    resp = opener.open(request, timeout=self._timeout)
  File "/opt/python/2.7.12/lib/python2.7/urllib2.py", line 429, in open
    response = self._open(req, data)
  File "/opt/python/2.7.12/lib/python2.7/urllib2.py", line 447, in _open
    '_open', req)
  File "/opt/python/2.7.12/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/opt/python/2.7.12/lib/python2.7/urllib2.py", line 1228, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/opt/python/2.7.12/lib/python2.7/urllib2.py", line 1201, in do_open
    r = h.getresponse(buffering=True)
  File "/opt/python/2.7.12/lib/python2.7/httplib.py", line 1136, in getresponse
    response.begin()
  File "/opt/python/2.7.12/lib/python2.7/httplib.py", line 453, in begin
    version, status, reason = self._read_status()
  File "/opt/python/2.7.12/lib/python2.7/httplib.py", line 417, in _read_status
    raise BadStatusLine(line)
BadStatusLine: ''
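The shape of the retry behavior visible in the log is easy to sketch: the harness makes up to five session-setup attempts, logging "Init failed N" after each one, and only gives up after the last. This is a minimal illustrative sketch; the function and variable names are assumptions, not code from wptrunner.

```python
# Illustrative sketch of the retry pattern seen in the log above.
# connect_with_retries() is a hypothetical name, not wptrunner's real API.

def connect_with_retries(connect, max_attempts=5):
    """Call connect() until it succeeds, or re-raise after max_attempts."""
    last_exc = None
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except Exception as exc:  # e.g. BadStatusLine from a dropped connection
            last_exc = exc
            print("Init failed %d" % attempt)
    raise last_exc
```

In the failing builds, all five attempts hit the same dropped-connection error, so the final exception (the `BadStatusLine` above) is what surfaces in the log.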

@bobholt
Contributor

bobholt commented Apr 6, 2017

The relevant issues here are #5336 and #5341. References to this in #5395 and #5343 are mis-attributed, as both of those PRs ran tests to completion with results.

The issue that distinguishes this from #5407 is that here, ChromeDriver tries to start 5 times each on various ports starting at 4444 and continuing to 4463.
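The start pattern described above can be sketched as up to five launch attempts on each port in the 4444-4463 range. The helper name and the exact iteration order are assumptions for illustration, not harness code.

```python
# Illustrative sketch of the ChromeDriver start pattern described above:
# five attempts per port, over the port range 4444-4463.

START_PORT, END_PORT = 4444, 4463
ATTEMPTS_PER_PORT = 5

def start_schedule():
    """Yield (port, attempt) pairs in the order they might be tried."""
    for port in range(START_PORT, END_PORT + 1):
        for attempt in range(1, ATTEMPTS_PER_PORT + 1):
            yield port, attempt
```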

@bobholt
Contributor

bobholt commented Apr 11, 2017

Going back through PRs, every PR since 11:15am EDT on April 7 has completed its Chrome tests, matching the Firefox tests. I am also no longer able to recreate the timeouts on my fork.

Unless this run of PR successes is due to re-triggering Chrome jobs, I propose that this be closed. This is unsatisfying to me personally, because I was not able to track down a root cause for the timeouts, but as I can no longer recreate the behavior, I don't hold out much hope of figuring it out.

@jugglinmike
Contributor

Looks like we're still having trouble:

This is pretty elusive, though: of forty builds that occurred over the last 5 days, it's only occurred once.

This may be a regression in the ChromeDriver binary. I say this because the "check stability" script was authored to fetch the most recently published version, and version 2.29 of ChromeDriver was released on April 4, which roughly corresponds to when we started experiencing this instability. Nothing in that change log seems particularly relevant, but you never know.

I would like to follow up with the ChromeDriver development group, but we have very little information to share right now. As a preliminary step, I am attempting to capture ChromeDriver debugging output at the moment of failure. This output is highly verbose, so we can't enable logging generally (doing so would cause log truncation and potentially obscure output that is relevant to test contributors). Instead, I've opened a dedicated pull request to collect the data. I plan on manually re-triggering that build until the timeout occurs.

I'll report back here when I've got some data.
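The "capture but don't flood" approach described above can be sketched as buffering only the tail of the verbose driver output in memory and emitting it only when a run fails, so passing runs don't truncate the CI log. `DebugBuffer` is a hypothetical name for illustration, not anything in the WPT tooling.

```python
# Hypothetical sketch: keep a bounded tail of verbose driver output and
# only emit it when the run fails, so ordinary logs stay readable.

from collections import deque

class DebugBuffer:
    def __init__(self, max_lines=2000):
        self._lines = deque(maxlen=max_lines)  # older lines are dropped

    def write(self, line):
        self._lines.append(line)

    def dump(self, failed):
        """Return the buffered tail on failure, and nothing otherwise."""
        return list(self._lines) if failed else []
```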

@gsnedders
Member Author

Does 2.28 work with the current dev channel of Chrome? Should we try rolling back to 2.28 and seeing if we get stability back?

@jugglinmike
Contributor

I discussed that possibility with @bobholt. Given the infrequency of the
timeout (again, one in forty builds), it would be some time before we could
determine if that was effective, even informally.
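A back-of-the-envelope calculation supports this: with a roughly 1-in-40 failure rate, "no failures since the rollback" only becomes strong evidence after a long streak of green builds. The model here (independent builds with a constant failure probability) is an assumption for illustration.

```python
# Back-of-the-envelope check: how many consecutive green builds would we
# need before "no failures" is meaningful evidence that a rollback helped?
# Assumes independent builds with a constant 1-in-40 failure probability.

import math

P_FAIL = 1.0 / 40  # observed: roughly one failure in forty builds

def builds_needed(confidence=0.95):
    """Smallest n such that, if the bug were still present, the chance of
    seeing n all-green builds in a row would be below (1 - confidence)."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - P_FAIL))
```

At 95% confidence this works out to over a hundred builds, which is why an informal "roll back and watch" experiment would take so long to be conclusive.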

@jugglinmike
Contributor

Alright, we now have some relevant debugging information. gh-5544 references a
passing and a failing build, each with logs extended to include ChromeDriver's
debugging output.

I'll skip the analysis here since it's not immediately relevant to WPT.
Suffice it to say, the communication between ChromeDriver and Chrome
(specifically the DevTools component) is definitely failing here.

The good news is that this is a known problem: there are already two separate
bugs in the ChromeDriver issue tracker documenting the behavior.

The bad news is that they have not received very much traction from the
maintainers (which is a shame because that first report is incredibly
thorough). Note that this is starting to look more like an issue with Chromium
rather than ChromeDriver, so maybe a new bug report (this one filed against
Chromium) is in order.

@foolip is there anyone on either team that you can talk to about increasing
the priority?

@foolip
Member

foolip commented Apr 13, 2017

@pavelfeldman or @RByers, do you know who's responsible for triaging ChromeDriver bugs?

@RByers
Contributor

RByers commented Apr 15, 2017

@pavelfeldman or @RByers, do you know who's responsible for triaging ChromeDriver bugs?

Unfortunately our main ChromeDriver owner (Sam) has recently left Google. I'm trying to find a web platform team to help own it. @NavidZ may be able to help.

@jgraham
Contributor

jgraham commented Apr 20, 2017

Based on the bugs that @jugglinmike linked, #5626 may offer a band-aid for the problem (although I'm not yet sure).

@RByers: Assuming it does, is it more beneficial to land the band-aid fix, or to leave Chrome running but not blocking PRs? I know that @jugglinmike would like to avoid the band-aid since we don't really understand it, and it may delay a real fix. However, I don't want to leave you in a situation where you are importing tests that are unstable in a way we could have caught.

@jugglinmike
Contributor

(reposting from IRC) if we implement the workaround, we may never see a fix from Chromium. The bug will continue to trip up application developers. I think ignoring Chromium failures is the better solution because it avoids the workflow interruptions in WPT, places some pressure on the Chromium team, and (importantly) is something we actually understand.

@RByers
Contributor

RByers commented Apr 22, 2017

For the record here, we agreed to land the temporary work-around, but I'm also pushing to get a work-around landed in Chrome and/or a proper fix to glib shipped at high priority. This is a problem for anyone trying to automate Chrome, so it is important regardless of whether WPT has a work-around.

@jugglinmike
Contributor

Thank you, Rick! 🌈

@RByers
Contributor

RByers commented May 12, 2017

FYI, the work-around has now landed in Chromium and is included in the latest Chrome dev-channel build (60.0.3095.5). I suggest we try removing the work-around in WPT.

@gsnedders
Member Author

#6438 ran tests only on Firefox. Is the issue with Safari/Edge related to this?

@RByers
Contributor

RByers commented Aug 16, 2017

This issue is specific to the problem of Chrome hanging on startup, causing timeout errors during WebDriver connection. We believe that issue is now fixed, so I'm closing this (but there appear to still be other issues that can cause tests not to run).
