Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: windows ci timeouts related to child_process #1383

Closed
silverwind opened this issue Apr 9, 2015 · 31 comments
Closed

test: windows ci timeouts related to child_process #1383

silverwind opened this issue Apr 9, 2015 · 31 comments
Labels
test Issues and PRs related to the tests. windows Issues and PRs related to the Windows platform.

Comments

@silverwind
Copy link
Contributor

With ARM timeouts fixed, this looks to be the last persistent test failure on the CI. If we can solve this, the builds should start to get blue, and all that's left are some intermittent failures on OS X and ARM.

https://jenkins-iojs.nodesource.com/job/iojs+any-pr+multi/473/nodes=iojs-win2012r2/

cc: @rvagg @piscisaureus

=== release test-child-process-double-pipe ===
Path: parallel/test-child-process-double-pipe
grep stdin write 7
grep stdin write 6
grep stdin write 5
grep stdin write 7
echo exit
Command: c:\workspace\iojs+any-pr+multi\nodes\iojs-win2012r2\Release\iojs.exe c:\workspace\iojs+any-pr+multi\nodes\iojs-win2012r2\test\parallel\test-child-process-double-pipe.js
--- TIMEOUT ---


=== release test-child-process-exit-code ===
Path: parallel/test-child-process-exit-code
Command: c:\workspace\iojs+any-pr+multi\nodes\iojs-win2012r2\Release\iojs.exe c:\workspace\iojs+any-pr+multi\nodes\iojs-win2012r2\test\parallel\test-child-process-exit-code.js
--- TIMEOUT ---


=== release test-child-process-spawnsync ===
Path: parallel/test-child-process-spawnsync
sleep started
Command: c:\workspace\iojs+any-pr+multi\nodes\iojs-win2012r2\Release\iojs.exe c:\workspace\iojs+any-pr+multi\nodes\iojs-win2012r2\test\parallel\test-child-process-spawnsync.js
--- TIMEOUT ---


=== release test-child-process-spawnsync-input ===
Path: parallel/test-child-process-spawnsync-input
Command: c:\workspace\iojs+any-pr+multi\nodes\iojs-win2012r2\Release\iojs.exe c:\workspace\iojs+any-pr+multi\nodes\iojs-win2012r2\test\parallel\test-child-process-spawnsync-input.js
--- TIMEOUT ---


=== release test-http-curl-chunk-problem ===
Path: parallel/test-http-curl-chunk-problem
dd command:  "c:\workspace\iojs+any-pr+multi\nodes\iojs-win2012r2\Release\iojs.exe" "c:\workspace\iojs+any-pr+multi\nodes\iojs-win2012r2\test\fixtures\create-file.js" "c:\workspace\iojs+any-pr+multi\nodes\iojs-win2012r2\test\tmp.3\big" 10485760
Server running at http://localhost:8080
making curl request
Command: c:\workspace\iojs+any-pr+multi\nodes\iojs-win2012r2\Release\iojs.exe c:\workspace\iojs+any-pr+multi\nodes\iojs-win2012r2\test\parallel\test-http-curl-chunk-problem.js
--- TIMEOUT ---


=== release test-process-kill-null ===
Path: parallel/test-process-kill-null
Command: c:\workspace\iojs+any-pr+multi\nodes\iojs-win2012r2\Release\iojs.exe c:\workspace\iojs+any-pr+multi\nodes\iojs-win2012r2\test\parallel\test-process-kill-null.js
--- TIMEOUT ---


=== release test-pipe-head ===
Path: sequential/test-pipe-head
Command: c:\workspace\iojs+any-pr+multi\nodes\iojs-win2012r2\Release\iojs.exe c:\workspace\iojs+any-pr+multi\nodes\iojs-win2012r2\test\sequential\test-pipe-head.js
--- TIMEOUT ---
@silverwind silverwind added windows Issues and PRs related to the Windows platform. test Issues and PRs related to the tests. labels Apr 9, 2015
@Fishrock123
Copy link
Contributor

And ya beat me to it.

Some analysis on when the errors started on the CI will be coming soon.

@Fishrock123
Copy link
Contributor

Note: this picks out only the builds that were done against v1.x.

The last consistently stable v1.x build was 209

Since then, there have been 5 stable v1.xbuilds: 363 364 367 368 369

Here are some builds after 209 where errors started cropping up. The errors seem slightly different than the ones currently on v1.x: 210 212 213 217 (aborted) 218 (rebuild of 217, git failed) 219 (rebuild of 217) 220 (rebuild of 217)

The commits from 209 into 220 are:

  • [2b47fd2eb6] - stream_base: .writev() has limited support (Fedor Indutny) #1008
  • [c9151547bc] - Working on v1.4.3 (Rod Vagg)
  • [b6b9628df7] - 2015-02-28 io.js v1.4.2 Release (Rod Vagg)
  • [1ab7e80838] - tls: proxy handle.reading back to parent handle (Fedor Indutny) #1004
  • [7887e119ed] - install: new performance counters provider guid (Russell Dempsey) #1001
  • [755461219d] - tls: fix typo handle._reading => handle.reading (Fedor Indutny) #994
  • [675cffb33e] - http: don't confuse automatic headers for others (Christian Tellnes) #828
  • [25da0742ee] - build: improve vcbuild.bat (Bert Belder) #998
  • [b8310cbd3e] - build: reduce tarball size by 8-10% (Johan Bergström) #961

@Fishrock123
Copy link
Contributor

Hmm, of note:

Did not have a CI run against the PR.

@silverwind
Copy link
Contributor Author

Hmm, nothing suspicious in that commit log, except maybe the vcbuild thing. Maybe an CI issue?

@Fishrock123
Copy link
Contributor

@silverwind
Copy link
Contributor Author

Maybe related to #1005? A lot of the same timeouts there too.

@Fishrock123
Copy link
Contributor

Everything in #1005 should have been fixed... and, iirc, was fixed..

@Fishrock123
Copy link
Contributor

Ok, I think I was wrong.

The current timeouts appear to have been introduced in 370, and first manifested in 389

373 contains @indutny's last fix for #1005, which should have been green.. That "narrows" down the commits to:

  • [36f017afaf] - js2c: fix module id generation on windows (Ben Noordhuis) #1281
  • [2903410aa8] - src: don't lazy-load timer globals (Ben Noordhuis) #1280
  • [5e609e9324] - _Revert_ "doc: clarify real name requirement" (Jeremiah Senkpiel) #1276
  • [08acf1352c] - win,node-gyp: make delay-load hook optional (Bert Belder) #1266
  • [776b73b243] - timers: cleanup interval handling (Jeremiah Senkpiel) #1272
  • [caf0b36de3] - timers: assure setTimeout callback only runs once (Roman Reiss) #1231
  • [2ccc8f3970] - tls_wrap: fix this incredibly stupid leak (Fedor Indutny) #1244
  • [e74b5d278c] - tls_wrap: fix BIO leak on SSL error (Fedor Indutny) #1244
  • [2db758c562] - iojs: introduce internal modules (Vladimir Kurchatkin) #848
  • [45814216ee] - doc: fix format docs discrepancy (Brendan Ashworth) #1255
  • [8a945814dd] - string_decoder: optimize write() (Brian White) #1209

(probably in the commits below, see the below comment)

  • [3d46fefe0c] - win,node-gyp: allow node.exe/iojs.exe to be renamed (Bert Belder) #1251
  • [c9207f7fc2] - fs: fix corruption in writeFile and writeFileSync (Olov Lassus) #1063
  • [4e9bf93e9c] - doc: clarify real name requirement (Roman Reiss) #1250
  • [e84dd5f651] - doc: document repl on-demand module loading (Roman Reiss) #1249
  • [1832743e18] - lib: add missing new for errors lib/*.js (Mayhem) #1246
  • [7dd5e824be] - assert: simplify logic of testing buffer equality (Alex Yursha) #1171
  • [269e46be37] - deps: make node-gyp work with io.js (cjihrig) #990
  • [b542fb94a4] - deps: upgrade npm to 2.7.3 (Forrest L Norvell) #1219
  • [2e5b87a147] - src: remove unnecessary environment lookups (Ben Noordhuis) #1238
  • [7e88a9322c] - src: make accessors immune to context confusion (Ben Noordhuis) #1238
  • [20c4498e76] - Working on v1.6.3 (Rod Vagg)
  • [33c57354aa] - 2015-03-23 io.js v1.6.2 Release (Rod Vagg)
  • [849319a260] - util: Check input to util.inherits (Connor Peet) #1240

@Fishrock123
Copy link
Contributor

That being said, due to 373 I suspect it was caused before:

@Fishrock123
Copy link
Contributor

https://jenkins-iojs.nodesource.com/view/iojs/job/iojs+any-pr+multi/476/

Narrowed it down to

  • [c9207f7fc2] - fs: fix corruption in writeFile and writeFileSync (Olov Lassus) #1063
  • [4e9bf93e9c] - doc: clarify real name requirement (Roman Reiss) #1250
  • [e84dd5f651] - doc: document repl on-demand module loading (Roman Reiss) #1249
  • [1832743e18] - lib: add missing new for errors lib/*.js (Mayhem) #1246
  • [7dd5e824be] - assert: simplify logic of testing buffer equality (Alex Yursha) #1171
  • [269e46be37] - deps: make node-gyp work with io.js (cjihrig) #990
  • [b542fb94a4] - deps: upgrade npm to 2.7.3 (Forrest L Norvell) #1219
  • [2e5b87a147] - src: remove unnecessary environment lookups (Ben Noordhuis) #1238
  • [7e88a9322c] - src: make accessors immune to context confusion (Ben Noordhuis) #1238
  • [20c4498e76] - Working on v1.6.3 (Rod Vagg)
  • [33c57354aa] - 2015-03-23 io.js v1.6.2 Release (Rod Vagg)
  • [849319a260] - util: Check input to util.inherits (Connor Peet) #1240

@Fishrock123
Copy link
Contributor

https://jenkins-iojs.nodesource.com/view/iojs/job/iojs+any-pr+multi/477/

1.6.3 fails.

I Suspect the culprit is:

(No CI was run on that PR either)

@silverwind
Copy link
Contributor Author

All tests seem to timeout on spawn or spawnSync.

@silverwind
Copy link
Contributor Author

Hmm, not sure about that, there are three uses of inherits in child_process.js, which all look valid, but of course the issue can happen elsewhere.

util.inherits(SocketListSend, EventEmitter);
util.inherits(SocketListReceive, EventEmitter);
util.inherits(ChildProcess, EventEmitter);

@Fishrock123
Copy link
Contributor

It's the windows machine itself.

Both CI's are at the same ref. The older one passes, the current one fails. Same commits.

https://jenkins-iojs.nodesource.com/view/iojs/job/iojs+any-pr+multi/478/

https://jenkins-iojs.nodesource.com/view/iojs/job/iojs+any-pr+multi/368/

@silverwind
Copy link
Contributor Author

Ah, that explains why the commits were misleading. Great detective work 👍

@silverwind
Copy link
Contributor Author

Maybe it's just a intermittent failure that happens to happen very often. We need tests from a second machine to verify.

@Fishrock123
Copy link
Contributor

It has happened every single run since 368 though.

@rvagg
Copy link
Member

rvagg commented Apr 11, 2015

The problem here is that these are only reproducible within Jenkins. I can't even get them to happen if I run the same code on the same machines from a cmd window! So it's apparently something to do with the way Jenkins is being run or the way Jenkins is spawning child processes. Perhaps to do with stdio, perhaps to do with the way it's forking, perhaps ......

@Fishrock123
Copy link
Contributor

Jenkins :trollface:

So it looks like win2008r2 has decided to also begin timing out on similar tests.

@jbergstroem
Copy link
Member

Just wanted to reference #19 here. Seems like we've had similar child process issues prior to the commits triaged above. Since both windows bots has been seeing similar issues lately, we should probably update the issue topic to just "windows" 😢

@silverwind silverwind changed the title test: timeouts on win2012 CI test: windows ci timeouts related to child_process Apr 16, 2015
@jbergstroem
Copy link
Member

So, I just tried running jenkins straight out of powershell instead of wrapping through windows services (that in turn uses nssm.exe). Results here: https://jenkins-iojs.nodesource.com/job/iojs+any-pr+multi/529/nodes=win2008r2/console. No more timeouts. The test that failed is irrelevant to timeouts (we should fix it though :). I'll debug further.

Edit: Here's a run using nssm (and not proxied through windows services): https://jenkins-iojs.nodesource.com/job/iojs+any-pr+multi/nodes=win2008r2/530/console (hint: full pass)

Edit2: If you look at https://jenkins-iojs.nodesource.com/job/iojs+any-pr+multi/531/ windows2012 and 2008 doesn't have any timeouts. I restarted all by avoiding starting/stopping through windows services and only using nssm.

I'm far from a windows expert, but there is something going on with how and what services launches what.

@rvagg
Copy link
Member

rvagg commented Apr 17, 2015

Give it time. We've had passes under the current config before but they end up at failure over time. I.e. don't celebrate yet.

@jbergstroem
Copy link
Member

Not celebrating at all. It pretty much feels like I have no clue what I'm doing :)

@silverwind
Copy link
Contributor Author

Seems the recent switch to running through cmd instead of a service did fix these at least for the CI. Now we need some testing done if the issues surface on a non-CI box when ran through a service.

@jbergstroem
Copy link
Member

Lets give it a week so we can confirm that we still don't see them. As mentioned above, I managed to get a pretty stable environment by launching through nssm.

@silverwind
Copy link
Contributor Author

I think this is new on Windows CI:

https://jenkins-iojs.nodesource.com/job/iojs+any-pr+multi/777/nodes=win2008r2/tapTestReport/test.tap-134/

Error: Not enough storage is available to process this command.

@Fishrock123
Copy link
Contributor

Sounds like a build machine needs to be cleaned. cc @rvagg

@jbergstroem
Copy link
Member

Currebtly on the run, but last I checked it had lots of disk space. I think it might be related to something else. Will investigate when I'm back tomorrow.

@rvagg
Copy link
Member

rvagg commented Jun 8, 2015

I've had a look, enough space on all of the windows machines but I've done a clean-out and restart on them all anyway, want to try again @silverwind?

@jbergstroem
Copy link
Member

I recall starting to see this when libuv 1.6.0 landed (when we had all other strange fs-max errors)

@silverwind
Copy link
Contributor Author

Haven't seen this recently.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
test Issues and PRs related to the tests. windows Issues and PRs related to the Windows platform.
Projects
None yet
Development

No branches or pull requests

4 participants