Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate flaky test/async-hooks/test-callback-error #13527

Closed
Trott opened this issue Jun 7, 2017 · 66 comments
Closed

Investigate flaky test/async-hooks/test-callback-error #13527

Trott opened this issue Jun 7, 2017 · 66 comments
Labels
flaky-test Issues and PRs related to the tests with unstable failures on the CI. macos Issues and PRs related to the macOS platform / OSX. post-mortem Issues and PRs related to the post-mortem diagnostics of Node.js. test Issues and PRs related to the tests.

Comments

@Trott
Copy link
Member

Trott commented Jun 7, 2017

  • Version: v9.0.0-pre
  • Platform: osx1010
  • Subsystem: test, async_hooks, child_process

https://ci.nodejs.org/job/node-test-commit-osx/10336/nodes=osx1010/console

not ok 167 async-hooks/test-callback-error
  ---
  duration_ms: 60.132
  severity: fail
  stack: |-
    timeout
@Trott Trott added async_hooks Issues and PRs related to the async hooks subsystem. child_process Issues and PRs related to the child_process subsystem. test Issues and PRs related to the tests. labels Jun 7, 2017
@AndreasMadsen
Copy link
Member

I think we should make the test more robust by only running the spawnSync part if process.argv[2] is not set. The way it is currently written, If there is a bug it will cause a recursion resulting in a timeout.

@Trott
Copy link
Member Author

Trott commented Jun 8, 2017

Happened again today:

https://ci.nodejs.org/job/node-test-commit-osx/10360/nodes=osx1010/console

not ok 167 async-hooks/test-callback-error
  ---
  duration_ms: 60.68
  severity: fail
  stack: |-
    timeout

@AndreasMadsen
Copy link
Member

/cc @nodejs/async_hooks

@Trott
Copy link
Member Author

Trott commented Jun 8, 2017

It might be good to split the three test cases in that file into three separate tests/test files so that we can see which one is the one that's timing out. (I'm pretty sure it's test_callback_abort.)

@refack
Copy link
Contributor

refack commented Jun 8, 2017

could add console.log('staring X'/'stopping X')
Cooking now.

@refack
Copy link
Contributor

refack commented Jun 9, 2017

So finally I have data, seems like it is test_callback_abort. Maybe the time to create a coredump? anyway it fluctuates from 19s to over 60s. @AndreasMadsen what can we do? skip this case on macOS?

+ python tools/test.py -p tap --mode=release --timeout 35 async-hooks/test-callback-error
TAP version 13
1..1
not ok 1 async-hooks/test-callback-error
  ---
  duration_ms: 31.297
  severity: fail
  stack: |-
    2017-06-09T04:03:20.423Z init
    2017-06-09T04:03:20.457Z required
    2017-06-09T04:03:20.467Z required ./init-hooks
    2017-06-09T04:03:20.467Z staring test_init_callback
    2017-06-09T04:03:20.547Z finished test_init_callback
    2017-06-09T04:03:20.548Z staring test_callback
    2017-06-09T04:03:20.626Z finished test_callback
    2017-06-09T04:03:20.626Z staring test_callback_abort
    2017-06-09T04:03:51.210Z finished test_callback_abort
    assert.js:586
    assert.ifError = function ifError(err) { if (err) throw err; };
                                                      ^
    
    Error: spawnSync /Users/iojs/build/workspace/node-stress-single-test/nodes/osx1010/out/Release/node ETIMEDOUT
        at exports._errnoException (util.js:1012:11)
        at spawnSync (child_process.js:546:20)
        at Object.<anonymous> (/Users/iojs/build/workspace/node-stress-single-test/nodes/osx1010/test/async-hooks/test-callback-error.js:56:12)
        at Module._compile (module.js:569:30)
        at Object.Module._extensions..js (module.js:580:10)
        at Module.load (module.js:503:32)
        at tryModuleLoad (module.js:466:12)
        at Function.Module._load (module.js:458:3)
        at Function.Module.runMain (module.js:605:10)
        at startup (bootstrap_node.js:158:16)
  ...
+ NOK=1
+ echo '1   OK: 0   NOT OK: 1   TOTAL: 100'
1   OK: 0   NOT OK: 1   TOTAL: 100
+ for i in '`seq $RUN_TIMES`'
+ python tools/test.py -p tap --mode=release --timeout 35 async-hooks/test-callback-error
TAP version 13
1..1
ok 1 async-hooks/test-callback-error
  ---
  duration_ms: 19.5
  ...
+ OK=1
+ echo '2   OK: 1   NOT OK: 1   TOTAL: 100'
2   OK: 1   NOT OK: 1   TOTAL: 100
+ for i in '`seq $RUN_TIMES`'
+ python tools/test.py -p tap --mode=release --timeout 35 async-hooks/test-callback-error
TAP version 13
1..1
ok 1 async-hooks/test-callback-error
  ---
  duration_ms: 18.279
  ...
+ OK=2
+ echo '3   OK: 2   NOT OK: 1   TOTAL: 100'
3   OK: 2   NOT OK: 1   TOTAL: 100
+ for i in '`seq $RUN_TIMES`'
+ python tools/test.py -p tap --mode=release --timeout 35 async-hooks/test-callback-error
TAP version 13
1..1
ok 1 async-hooks/test-callback-error
  ---
  duration_ms: 17.398
  ...
+ OK=3
+ echo '4   OK: 3   NOT OK: 1   TOTAL: 100'
4   OK: 3   NOT OK: 1   TOTAL: 100

@Trott
Copy link
Member Author

Trott commented Jun 9, 2017

/cc @DavidCai1993

@AndreasMadsen
Copy link
Member

Skip this case on macOS?

We don't have a good reason to expect spawnSync to timeout on MacOS. Also, we do we get an ETIMEDOUT now, if it is not a recursion then I would expect the same error as before.

@DavidCai1111
Copy link
Member

-1 on skip this case on macOS at this time, too. Seems that the problem is somehow related to process.abort() ...?

@refack
Copy link
Contributor

refack commented Jun 9, 2017

We don't have a good reason to expect spawnSync to timeout on MacOS. Also, we do we get an ETIMEDOUT now, if it is not a recursion then I would expect the same error as before.

I added a timeout on spawnSync

const c3 = spawnSync(process.execPath, a, { timeout: 30000 });
console.log((new Date()).toISOString() + ' finished test_callback_abort');
assert.ifError(c3.error);

and ifError so we'd see if and why, the call failed.
I could add:

if (common.isOSX && c3.error && c3.error.code === 'ETIMEDOUT') return common.skip()

Or simply return (no skip) so we acknowledge this might fail, but essentially that's ok.

@AndreasMadsen
Copy link
Member

AndreasMadsen commented Jun 9, 2017

I ran this test 1000 times on my Mac, all tests finished within 200-400 ms.

Thanks for explaining the ETIMEOUT. But I still think, we don't have a good reason to expect spawnSync to timeout on MacOS.

does duration_ms: 31.297 mean 31 seconds? That is confusing.

@AndreasMadsen
Copy link
Member

@Trott why does the abort tests not run on the CI? At least that is what the test/README.md says.

@cjihrig
Copy link
Contributor

cjihrig commented Jun 9, 2017

I think the idea was to avoid possibly creating a bunch of core files.

@AndreasMadsen
Copy link
Member

I see, then maybe we should move the test to that directory.

@Trott
Copy link
Member Author

Trott commented Jun 9, 2017

does duration_ms: 31.297 mean 31 seconds? That is confusing.

Yes, that's what it means, and yes it's confusing, but duration_ms is TAP format, not something we determined. So we may be kinda sorta stuck with it (unless/until TAP changes and the Jenkins plugin changes too).

@Trott
Copy link
Member Author

Trott commented Jun 9, 2017

I see, then maybe we should move the test to that directory.

I'd agree with that. Curious if we ever run the abort tests, like maybe before a release or something? @nodejs/release

@Trott
Copy link
Member Author

Trott commented Jun 9, 2017

Oh, by the way, another one on CI today: https://ci.nodejs.org/job/node-test-commit-osx/10391/nodes=osx1010/console

not ok 175 async-hooks/test-callback-error
  ---
  duration_ms: 60.25
  severity: fail
  stack: |-
    timeout

@Trott
Copy link
Member Author

Trott commented Jun 9, 2017

I'd agree with that.

Or at least moving the abort-on-uncaught-exception thing into its own test? I don't know what I think anymore.

@Trott
Copy link
Member Author

Trott commented Jun 9, 2017

Looks like what we have here may be a bug in Node.js master branch on macOS and not a problem with the test.

With this one-line test file called test.js:

throw new Error('foo');

If I run it with current master:

$ ./node --abort-on-uncaught-exception test.js
Uncaught Error: foo

FROM
Object.<anonymous> (/Users/trott/io.js/test.js:1:1)
Module._compile (module.js:1:1)
Object.Module._extensions..js (module.js:1:1)
Module.load (module.js:1:1)
tryModuleLoad (module.js:1:1)
Function.Module._load (module.js:1:1)
Function.Module.runMain (module.js:1:1)
startup (bootstrap_node.js:1:1)
bootstrap_node.js:1:1
Received signal 4 <unknown> 000100c84632

==== C stack trace ===============================

 [0x000100c83442]
 [0x7fff8200652a]
 [0x000000000000]
 [0x000100954b4b]
 [0x0db5628843fd]
 [0x0db562987c7a]
 [0x0db56293be7c]
[end of stack trace]
Illegal instruction: 4
$

There is a looooong pause between printing [end of stack trace] and Illegal instruction: 4.

If I use Node.js 8.1.0, I get similar output, but there's no long pause.

Can someone try to replicate that just to confirm it's not some oddity in my own setup?

@Trott Trott added process Issues and PRs related to the process subsystem. and removed process Issues and PRs related to the process subsystem. labels Jun 9, 2017
@aqrln
Copy link
Contributor

aqrln commented Jun 9, 2017

@Trott I cannot reproduce it on macOS with neither the master branch, nor 8.1.0. It just aborts immediately.

@Trott
Copy link
Member Author

Trott commented Jun 9, 2017

@aqrln Do you get the Illegal instruction: 4 as the last line of your output? (If I run as root, I don't get that line and it exits immediately.)

@AndreasMadsen
Copy link
Member

AndreasMadsen commented Jun 9, 2017

I can't reproduce on master, this is what I'm getting:

Uncaught Error: foo

FROM
Object.<anonymous> (/Users/Andreas/Sites/node/node-master/test.js:1:1)
Module._compile (module.js:1:1)
Object.Module._extensions..js (module.js:1:1)
Module.load (module.js:1:1)
tryModuleLoad (module.js:1:1)
Function.Module._load (module.js:1:1)
Function.Module.runMain (module.js:1:1)
startup (bootstrap_node.js:1:1)
bootstrap_node.js:1:1
Received signal 4 <unknown> 000100c3eff2

==== C stack trace ===============================

 [0x000100c3de24]
 [0x7fff8d423bba]
 [0x7fff5fbfdea0]
 [0x0001009205cb]
 [0x0608333043fd]
 [0x060833407c7a]
 [0x0608333bbe7c]
[end of stack trace]
[1]    44955 illegal hardware instruction  ./node --abort-on-uncaught-exception test.js

@aqrln
Copy link
Contributor

aqrln commented Jun 9, 2017

@Trott

Uncaught Error: foo

FROM
Object.<anonymous> (/Users/alex/projects/aqrln/node/test.js:1:1)
Module._compile (module.js:1:1)
Object.Module._extensions..js (module.js:1:1)
Module.load (module.js:1:1)
tryModuleLoad (module.js:1:1)
Function.Module._load (module.js:1:1)
Function.Module.runMain (module.js:1:1)
startup (bootstrap_node.js:1:1)
bootstrap_node.js:1:1
Received signal 4 <unknown> 000100c3e022

==== C stack trace ===============================

 [0x000100c3ce54]
 [0x7fffaedc8b3a]
 [0x7fff5fbfd960]
 [0x00010092014b]
 [0x23ac401043fd]
 [0x23ac40207c7a]
 [0x23ac401bbe7c]
[end of stack trace]
[1]    32357 illegal hardware instruction  ./node --abort-on-uncaught-exception test.js

@Trott
Copy link
Member Author

Trott commented Jun 21, 2017

Interesting that the stack traces are so very different....

@Trott
Copy link
Member Author

Trott commented Jun 21, 2017

I did a git checkout upstream/v8.x-staging && git checkout a021720 && ./configure && make -j4 to create 8.1.2, but it still takes a long time to exit:

$ ./node -v && time ./node --abort-on-uncaught-exception -e 'throw new Error("foo")'
v8.1.2
Uncaught Error: foo

FROM
[eval]:1:1
ContextifyScript.Script.runInThisContext (vm.js:44:33)
Object.runInThisContext (vm.js:116:38)
Object.<anonymous> ([eval]-wrapper:6:22)
Module._compile (module.js:569:30)
evalScript (bootstrap_node.js:432:27)
startup (bootstrap_node.js:139:9)
bootstrap_node.js:575:3
Illegal instruction: 4

real	0m12.163s
user	0m0.035s
sys	0m0.011s

@Trott
Copy link
Member Author

Trott commented Jun 21, 2017

Now the stack traces match at least, but the time difference is certainly troubling.

@Trott
Copy link
Member Author

Trott commented Jun 21, 2017

When run as root, there is no delay and there is no "Illegal instruction: 4".

@Trott
Copy link
Member Author

Trott commented Jun 21, 2017

I think the issue is in V8 (or perhaps more precisely, an interaction of macOS with V8 code--not sure if the bug is in one place, the other, or really neither). If I don't pass --abort-on-uncaught-exception on to V8, I don't get the ~12 second delay on exit. @nodejs/v8

@bmeurer
Copy link
Member

bmeurer commented Jun 21, 2017

cc @gsathya @ajklein

@gsathya
Copy link
Member

gsathya commented Jun 21, 2017

If I'm reading the history correctly, this seems like a bug with --abort-on-uncaught-exception and not async-hooks. Correct?

Re-routing to @schuay @fhinkel

@Trott
Copy link
Member Author

Trott commented Jun 21, 2017

If I'm reading the history correctly, this seems like a bug with --abort-on-uncaught-exception and not async-hooks. Correct?

Yes. Although I now have the confounding additional information that the issue manifests itself with only some users on my laptop. If I switch to other users, the issue goes away. It's entirely possible the problem is entirely my setup, but I suspect it's not quite that simple because we're seeing this in CI as well.

@Trott
Copy link
Member Author

Trott commented Jun 21, 2017

OK, it's almost definitely user setup at this point. When I run the program, while it's hanging, a process named ReportCrash goes way up to 99% CPU. Main thing at this point is probably to:

  • figure out how to disable it or give it the permissions it needs or whatever, then...
  • ..fix the issue on CI, then...
  • ...document the issue somewhere so Collaborators like me and @benjamingr who run into the issue know what to do :-D

@Trott
Copy link
Member Author

Trott commented Jun 21, 2017

Although...if I run as root, things work just fine, no delay, etc. So maybe the real issue is to figure that out rather than disabling crash reporting... ¯\(ツ)

Again, I wouldn't be going on and on about problems with my own personal setup if it weren't for the fact that we're seeing it in CI too, and we need to fix it there....

addaleax pushed a commit that referenced this issue Jun 24, 2017
PR-URL: #13559
Fixes: #13527
Reviewed-By: Andreas Madsen <amwebdk@gmail.com>
@Trott
Copy link
Member Author

Trott commented Jun 26, 2017

I wonder what the node::Chdir is about in the stack trace...

@Trott
Copy link
Member Author

Trott commented Jun 26, 2017

Logging on to the CI machine, I don't get the long delay with simple cases like that described in
#13527 (comment). But I do get a long delay if I do what the test does from the command line:

test-nodesource-osx1010-x64-1:osx1010 iojs$ time ./node --abort-on-uncaught-exception test/async-hooks/test-callback-error.js test_callback_abort
Error: test_callback_abort
    at ActivityCollector.initHooks.oninit.common.mustCall (/Users/iojs/build/workspace/node-stress-single-test/nodes/osx1010/test/async-hooks/test-callback-error.js:30:45)
    at ActivityCollector.oninit (/Users/iojs/build/workspace/node-stress-single-test/nodes/osx1010/test/common/index.js:517:15)
    at ActivityCollector._init (/Users/iojs/build/workspace/node-stress-single-test/nodes/osx1010/test/async-hooks/init-hooks.js:170:10)
    at init (async_hooks.js:467:43)
    at Object.emitInitS [as emitInit] (async_hooks.js:337:3)
    at Object.<anonymous> (/Users/iojs/build/workspace/node-stress-single-test/nodes/osx1010/test/async-hooks/test-callback-error.js:33:17)
    at Module._compile (module.js:569:30)
    at Object.Module._extensions..js (module.js:580:10)
    at Module.load (module.js:503:32)
    at tryModuleLoad (module.js:466:12)
 1: node::Abort() [/Users/iojs/build/workspace/node-stress-single-test/nodes/osx1010/./node]
 2: node::Chdir(v8::FunctionCallbackInfo<v8::Value> const&) [/Users/iojs/build/workspace/node-stress-single-test/nodes/osx1010/./node]
 3: v8::internal::FunctionCallbackArguments::Call(void (*)(v8::FunctionCallbackInfo<v8::Value> const&)) [/Users/iojs/build/workspace/node-stress-single-test/nodes/osx1010/./node]
 4: v8::internal::MaybeHandle<v8::internal::Object> v8::internal::(anonymous namespace)::HandleApiCallHelper<false>(v8::internal::Isolate*, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::FunctionTemplateInfo>, v8::internal::Handle<v8::internal::Object>, v8::internal::BuiltinArguments) [/Users/iojs/build/workspace/node-stress-single-test/nodes/osx1010/./node]
 5: v8::internal::Builtin_Impl_HandleApiCall(v8::internal::BuiltinArguments, v8::internal::Isolate*) [/Users/iojs/build/workspace/node-stress-single-test/nodes/osx1010/./node]
 6: 0x3a7608e043fd
 7: 0x3a7608ef7c49
Received signal 6

==== C stack trace ===============================

 [0x000100c4402b]
 [0x7fff8d42bf1a]
 [0x000000000002]
 [0x7fff8a9d79a3]
 [0x000100ac4e0d]
 [0x000100ac8a2e]
 [0x000100216c92]
 [0x0001002925d0]
 [0x000100291b59]
 [0x3a7608e043fd]
 [0x3a7608ef7c49]
[end of stack trace]

Abort trap: 6

real	2m2.042s
user	0m0.072s
sys	0m0.051s
test-nodesource-osx1010-x64-1:osx1010 iojs$ 

And if I run top -o cpu in another terminal, it shows ReportCrash at the top, unsurprisingly.

I don't have root or sudo access on the machine so I can't see what happens if I run as root and I can't experiment with using launchctl to try to disable ReportCrash. (Running launchctl list without root doesn't show ReportCrash.) @nodejs/build

@gibfahn
Copy link
Member

gibfahn commented Jun 26, 2017

Assuming you're using test-requireio-osx1010-x64-1, then nodejs/build#753 (comment) is the discussion around getting root.

@rvagg could you put the test key in the rvagg user? Presumably we'd also need the password to run sudo commands with it, IDK how we handle that normally.

@refack
Copy link
Contributor

refack commented Jun 26, 2017

...
real	2m2.042s
...

😵

@Trott
Copy link
Member Author

Trott commented Jun 26, 2017

If the plan is to move to MacStadium and not use requireio for Mac testing, then this may all be a moot point, at least in terms of CI. How soon can we move to MacStadium?

@Trott
Copy link
Member Author

Trott commented Jun 27, 2017

Good news! @rvagg reports that disabling ReportCrash with launchctl reduces the time from over two minutes to just 220ms when running time ./node --abort-on-uncaught-exception test/async-hooks/test-callback-error.js test_callback_abort. So this should be enough to turn CI green for OS X. All that's needed is to document this in the ansible scripts that build out the OS X hosts. That's a PR for the build repo.

@Trott
Copy link
Member Author

Trott commented Jun 28, 2017

This has been fixed since @rvagg ran these two commands on the host:

launchctl unload -w /System/Library/LaunchAgents/com.apple.ReportCrash.plist
sudo launchctl unload -w /System/Library/LaunchDaemons/com.apple.ReportCrash.Root.plist

Note that they need to be run from a GUI login or else they error out.

There's an issue open to get this into the provisioning for OS X machines. But for now, the problem is resolved, so I'm going to close this.

@Trott Trott closed this as completed Jun 28, 2017
addaleax pushed a commit that referenced this issue Jun 29, 2017
PR-URL: #13559
Fixes: #13527
Reviewed-By: Andreas Madsen <amwebdk@gmail.com>
@refack refack added the flaky-test Issues and PRs related to the tests with unstable failures on the CI. label Jun 30, 2017
addaleax pushed a commit that referenced this issue Jul 11, 2017
PR-URL: #13559
Fixes: #13527
Reviewed-By: Andreas Madsen <amwebdk@gmail.com>
addaleax pushed a commit that referenced this issue Jul 18, 2017
PR-URL: #13559
Fixes: #13527
Reviewed-By: Andreas Madsen <amwebdk@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
flaky-test Issues and PRs related to the tests with unstable failures on the CI. macos Issues and PRs related to the macOS platform / OSX. post-mortem Issues and PRs related to the post-mortem diagnostics of Node.js. test Issues and PRs related to the tests.
Projects
None yet
Development

No branches or pull requests