Test hangs on OSX 10.9.5 test-domain-no-error-handler-abort-on-uncaught.js #9979

chexmixbbq · 2016-12-01T17:07:24Z

Version: v8.0.0-pre
Platform: Darwin PA013136.local 13.4.0 Darwin Kernel Version 13.4.0: Mon Jan 11 18:17:34 PST 2016; root:xnu-2422.115.15~1/RELEASE_X86_64 x86_64
Subsystem: test

This test hangs on:

=== release test-domain-no-error-handler-abort-on-uncaught ===
Path: parallel/test-domain-no-error-handler-abort-on-uncaught
Command: out/Release/node /Users/youngj/JsProjects/node/test/parallel/test-domain-no-error-handler-abort-on-uncaught.js

Trott · 2016-12-02T04:29:59Z

Thanks for reporting the issue. I am unable to replicate this on Darwin Kernel Version 16.1.0 and I have not seen it on the test machines. I'm not sure what could be causing it. Hopefully someone else will have more insight. Please do comment here if you manage to get it working.

joyeecheung · 2016-12-02T11:56:02Z

I've run into this a few times today too, and I can reproduce this one when running out/Release/node test/parallel/test-domain-no-error-handler-abort-on-uncaught.js on my mac(Darwin Kernel Version 15.6.0).

I got "Illegal instruction: 4" when trying master, v4.4.7 and v6.9.1 to run individual test cases(passing child 1 or other available textIndex) earlier, but currently I can run the individual cases with expected results, although the test script spawning multiple processes for each test case still hangs.

joyeecheung · 2016-12-02T12:29:00Z

Turns out I can finish the test, it just takes a long time(long enough to fail the test because of timeout when running with the python script).

real	0m16.144s
user	0m0.981s
sys	0m0.330s

another one

real	0m26.977s
user	0m0.960s
sys	0m0.339s

and another one

real	0m57.213s
user	0m0.988s
sys	0m0.359s

And I notice that there are two test/fixtures/child-process-persistent.js still running after the test finishes, that's odd.

Trott · 2016-12-02T16:07:52Z

Looks like this test launches 13 child processes at once. Can one or both of you see what happens if you only have the first six items in the array and remove the last seven? And then again but with the last seven in the array and the first six removed? If both of those still exhibit problems, how about if there's just one item in the array?

Trott · 2016-12-02T16:11:29Z

Just for clarity: The array I'm talking about is tests defined starting on line 14. Each item in it is a function that spans several lines. There are 13 functions in the array.

joyeecheung · 2016-12-02T16:58:26Z

I have tried some combinations, it looks like whenever I hit the else branch(the one launching child processes from the array), the test will run for a long time. I have tried all the individual cases with child arguments(thus hitting the if branch), and they are all fine. Only spawning child processes will make it run for a long time, and all the combinations of test cases I've tried do that(first half, last half, first one, second one, ...).

BTW aren't there 10 functions in that array? That's what I get from

 if (process.argv[2] === 'child') {
   const testIndex = +process.argv[3];
+  console.log('running test ' + testIndex);
   tests[testIndex]();
 } else {
-
   tests.forEach(function(test, testIndex) {
     var testCmd = '';
     if (!common.isWindows) {
       // Do not create core files, as it can take a lot of disk space on
@@ -156,6 +175,7 @@ if (process.argv[2] === 'child') {
     testCmd += ' ' + 'child';
     testCmd += ' ' + testIndex;

+    console.log('spawn child ' + testIndex);
     var child = child_process.exec(testCmd);

     child.on('exit', function onExit(exitCode, signal) {
@@ -164,5 +184,9 @@ if (process.argv[2] === 'child') {
                      ' and signal ' + signal;
       assert(common.nodeProcessAborted(exitCode, signal), errMsg);
     });
+
+    child.stdout.on('data', function(data) {
+      console.log(data);
+    });
   });
 }

I've tried adding some logs in the tests and see what they would do. All test cases can run to the point before they throw an error out.

joyeecheung · 2016-12-02T17:19:25Z

Update:

I tried to log in the onexit callback. Most of the times half of the test cases hit that line and log immediately, and after dozens of seconds, another half hit there and log and then quit. It's pretty random which ones will be in the first batch, which ones the second, and it's not ordered. Is there a race or anything?

@@ -156,13 +175,19 @@ if (process.argv[2] === 'child') {
     testCmd += ' ' + 'child';
     testCmd += ' ' + testIndex;

     var child = child_process.exec(testCmd);

     child.on('exit', function onExit(exitCode, signal) {
+      console.log(exitCode, signal, testIndex);
       const errMsg = 'Test at index ' + testIndex + ' should have aborted ' +
                      'but instead exited with exit code ' + exitCode +
                      ' and signal ' + signal;
       assert(common.nodeProcessAborted(exitCode, signal), errMsg);
     });
+
+    child.stdout.on('data', function(data) {
+      console.log(data);
+    });
   });
 }

Trott · 2016-12-02T18:32:07Z

BTW aren't there 10 functions in that array?

When I do console.log(tests.length), I get 13. And when I add logging in the child exit handler, I get 13 numbers (0 through 12). If you only get 10, that's curious....

Trott · 2016-12-02T18:38:19Z

Given how many processes this spawns, and given how we've seen timers be less reliable on heavily-loaded machines, and how some of these tests cases rely on timers, I wonder if the solution might be to move those tests to their own files. In a way, this is a test that is bypassing the way test.py regulates how many parallel jobs should run at once.

Other possible solution might be to rewrite this to run the tests one at a time and have each child process kick off the next one. But it will probably be easier code to understand to just have the tests in separate files.

@nodejs/testing

joyeecheung · 2016-12-02T20:10:05Z

There could be use cases like this test(spawning multiple processes to max CPU utilization, using --abort-on-uncaught-exception for coredumps in production), but given that domain is pending deprecation anyway, maybe it's OK to just let it go.

If there is no objection I'll be happy to pick up the splitting task.

When I do console.log(tests.length), I get 13. And when I add logging in the child exit handler, I get 13 numbers (0 through 12). If you only get 10, that's curious....

I get 10 when I console.log(tests.length) and can confirm this number counting them with my eyes...that's weird

Trott · 2016-12-02T20:39:38Z

@joyeecheung I figured it out. You are correctly looking at test-domain-no-error-handler-abort-on-uncaught.js and I am mistakenly looking at test-domain-abort-on-uncaught.js.

Trott · 2016-12-09T06:14:46Z

If there is no objection I'll be happy to pick up the splitting task.

@joyeecheung Yes, please feel free to split this test up into multiple test files to reduce concurrency!

joyeecheung · 2016-12-09T08:11:29Z

@Trott Ok, working on it!

I think I still need to spawn one child process per test file, because they need ulimit -c to avoid filling up the test machines disk with core dumps, but at least that's better than 10 child processes for one file?

Trott · 2016-12-09T22:45:45Z

but at least that's better than 10 child processes for one file?

Sounds about right to me, if I'm understanding correctly.

joyeecheung · 2016-12-10T07:26:41Z

Hmm..turns out there is something off with the --abort-on-uncaught-exception in the first place. If I put the test function in a single file:

'use strict';

const domain = require('domain');

function test() {
  const d = domain.create();

  d.run(function() {
    throw new Error('boom!');
  });
}

test();

and run it with --abort-on-uncaught-exception(after I ulimit -c 0), it won't crash immediately. Instead, on my mac it will hang for another ~12 seconds.

0.05s user
0.02s system
0% cpu
12.091 total

But, if I run this test without --abort-on-uncaught-exception, it crashes immediately.

I think the problem is not about child processes. It's about --abort-on-uncaught-exception. The abortion is triggered like this:

(lldb) v8 bt
 * thread #1: tid = 0x5353f5, 0x0000000100a19fc2 node`v8::base::OS::Abort() + 18 at platform-posix.cc:240, queue = 'com.apple.main-thread', stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)
  * frame #0: 0x0000000100a19fc2 node`v8::base::OS::Abort() + 18 at platform-posix.cc:240 [opt]
    frame #1: 0x00000001006be521 node`v8::internal::Runtime_Throw(int, v8::internal::Object**, v8::internal::Isolate*) [inlined] v8::internal::__RT_impl_Runtime_Throw(v8::internal::Arguments, v8::internal::Isolate*) + 34 at runtime-internal.cc:82 [opt]
    frame #2: 0x00000001006be4ff node`v8::internal::Runtime_Throw(args_length=<unavailable>, args_object=0x00007fff77a383e0, isolate=0x00000001038040c8) + 319 at runtime-internal.cc:79 [opt]
    frame #3: 0x000038d9b56092a7 <exit>
    frame #4: 0x000038d9b574b5e0 (anonymous)(this=0x00000675340ed831:<Object: Domain>) at /Users/joyee/projects/node-test/domain.js:8:17 fn=0x00000675340edc31
    frame #5: 0x000038d9b574ae96 Domain.run(this=0x00000675340ed831:<Object: Domain>, 0x00000675340edc31:<function: (anonymous) at /Users/joyee/projects/node-test/domain.js:8:17>) at domain.js:205:32 fn=0x0000007eaa762c79
    frame #6: 0x000038d9b574a41f test(this=0x00003788c2504381:<undefined>) at /Users/joyee/projects/node-test/domain.js:5:14 fn=0x00000675340e9aa9
    frame #7: 0x000038d9b57483e2 (anonymous)(this=0x00000675340e7fa1:<Object: Object>, 0x00000675340e7fa1:<Object: Object>, 0x00000675340e95d1:<function: require at internal/module.js:17:19>, 0x00000675340e7ea9:<Object: Module>, 0x00000675340e5bf9:<String: "/Users/joyee/pro...">, 0x00000675340e94f9:<String: "/Users/joyee/pro...">) at /Users/joyee/projects/node-test/domain.js:1:10 fn=0x00000675340e9499
    frame #8: 0x000038d9b5748090 Module._compile(this=0x00000675340e7ea9:<Object: Module>, 0x00000675340e8fa9:<String: "'use strict';

c...">, 0x00000675340e5bf9:<String: "/Users/joyee/pro...">) at module.js:510:37 fn=0x0000007eaa752dd1
    frame #9: 0x000038d9b5740f2b Module._extensions..js(this=0x00000675340e0ee1:<Object: Object>, 0x00000675340e7ea9:<Object: Module>, 0x00000675340e5bf9:<String: "/Users/joyee/pro...">) at module.js:577:37 fn=0x0000007eaa752e91
    frame #10: 0x000038d9b573f952 Module.load(this=0x00000675340e7ea9:<Object: Module>, 0x00000675340e5bf9:<String: "/Users/joyee/pro...">) at module.js:478:33 fn=0x0000007eaa752c71
    frame #11: 0x000038d9b573f47d tryModuleLoad(this=0x00003788c2504381:<undefined>, 0x00000675340e7ea9:<Object: Module>, 0x00000675340e5bf9:<String: "/Users/joyee/pro...">) at module.js:443:23 fn=0x00000675340aa701
    frame #12: 0x000038d9b5734d22 Module._load(this=0x00000675340aa509:<function: Module at module.js:37:16>, 0x00000675340a4b21:<String: "/Users/joyee/pro...">, 0x00003788c2504201:<null>, 0x00003788c25043c1:<true>) at module.js:412:24 fn=0x0000007eaa752aa9
    frame #13: 0x000038d9b57347ea Module.runMain(this=0x00003788c2504381:<undefined>) at module.js:602:26 fn=0x0000007eaa7530d1
    frame #14: 0x000038d9b5734670 run(this=0x00003788c2504381:<undefined>, 0x0000007eaa7530d1:<function: Module.runMain at module.js:602:26>) at bootstrap_node.js:375:15 fn=0x0000153f113e23a1
    frame #15: 0x000038d9b56478f2 startup(this=0x00003788c2504381:<undefined>) at bootstrap_node.js:12:19 fn=0x00003788c25e2349
    frame #16: 0x000038d9b5642759 (anonymous)(this=0x00003788c2504201:<null>, 0x00003788c25e22e9:<Object: process>) at bootstrap_node.js:10:10 fn=0x00003788c25e2391
    frame #17: 0x000038d9b563b7c3 <internal>
    frame #18: 0x000038d9b562508f <entry>
    frame #19: 0x0000000100457840 node`v8::internal::(anonymous namespace)::Invoke(isolate=0x0000000103003cb8, is_construct=<unavailable>, target=<unavailable>, receiver=<unavailable>, argc=1, args=<unavailable>, new_target=<unavailable>) + 400 at execution.cc:97 [opt]
    frame #20: 0x000000010045767d node`v8::internal::Execution::Call(isolate=0x0000000103003c00, callable=<unavailable>, receiver=<unavailable>, argc=<unavailable>, argv=0x00007fff5fbff0d8) + 333 at execution.cc:153 [opt]
    frame #21: 0x000000010015c5c9 node`v8::Function::Call(this=<unavailable>, context=<unavailable>, recv=<unavailable>, argc=<unavailable>, argv=<unavailable>) + 473 at api.cc:4535 [opt]
    frame #22: 0x000000010087c69a node`node::LoadEnvironment(env=0x0000000102803000) + 625 at node.cc:3540 [opt]
    frame #23: 0x000000010087e1de node`node::Start(int, char**) + 505 at node.cc:4632 [opt]
    frame #24: 0x000000010087dfe5 node`node::Start(argc=<unavailable>, argv=0x0000000102204a60) + 182 at node.cc:4733 [opt]
    frame #25: 0x0000000100000c34 node`start + 52

The abortion is triggered with V8_IMMEDIATE_CRASH(), ~~which is #define V8_IMMEDIATE_CRASH() ((void(*)())0)() on my mac~~

#if V8_CC_GNU
#define V8_IMMEDIATE_CRASH() __builtin_trap()
#else
#define V8_IMMEDIATE_CRASH() ((void(*)())0)()
#endif

and that is why I get "Illegal instruction: 4" when the process aborted.

Process 14612 stopped
* thread #1: tid = 0x534ec6, 0x0000000100a19fc2 node`v8::base::OS::Abort() + 18 at platform-posix.cc:240, queue = 'com.apple.main-thread', stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)
    frame #0: 0x0000000100a19fc2 node`v8::base::OS::Abort() + 18 at platform-posix.cc:240 [opt]
   237
   238 	void OS::Abort() {
   239 	  if (g_hard_abort) {
-> 240 	    V8_IMMEDIATE_CRASH();
   241 	  }
   242 	  // Redirect to std abort to signal abnormal program termination.
   243 	  abort();

But if I simply do ((void(*)())0)()or __builtin_trap() in another C++ program(I am not sure under what mode is node compiled on my mac so I tried both), it will just crash with illegal instruction warnings instead of hanging for ~12s before crashing.

I am not very familiar with the domains and how --abort-on-uncaught-exception works at the moment, so I might need a little bit more time to figure this out.

The original test lauches 10 child processes at once and bypass `test.py`'s process regulation. This PR reduces the unmanaged parallelism and is a temporary workaround for nodejs#9979 (not a real fix).

joyeecheung · 2016-12-19T04:42:57Z

After splitting the test, the hanging time doesn't change significantly, but at least now they are under more control of test.py. The hanging time on my mac is still high compared to what I get from a ubuntu docker container.

Lauching this:

tools/test.py --mode=release -J test/parallel/test-domain-no-error-handler-abort-on-uncaught-0.js test/parallel/test-domain-no-error-handler-abort-on-uncaught-1.js test/parallel/test-domain-no-error-handler-abort-on-uncaught-2.js test/parallel/test-domain-no-error-handler-abort-on-uncaught-3.js test/parallel/test-domain-no-error-handler-abort-on-uncaught-4.js test/parallel/test-domain-no-error-handler-abort-on-uncaught-5.js test/parallel/test-domain-no-error-handler-abort-on-uncaught-6.js test/parallel/test-domain-no-error-handler-abort-on-uncaught-7.js test/parallel/test-domain-no-error-handler-abort-on-uncaught-8.js test/parallel/test-domain-no-error-handler-abort-on-uncaught-9.js

In a ubuntu container(16.04):

[00:02|% 100|+  10|-   0]: Done

On the host machine(Darwin Kernel Version 15.6.0)

[00:18|% 100|+  10|-   0]: Done

Without further investigation all I observe from the lldb debugger is that it keeps hitting this line at platform-posix.cc again and again. I assume this is supposed to be hit only once and then the process will get abort right after that?

   237
   238 	void OS::Abort() {
   239 	  if (g_hard_abort) {
-> 240 	    V8_IMMEDIATE_CRASH();
   241 	  }
   242 	  // Redirect to std abort to signal abnormal program termination.
   243 	  abort();

Making a PR for the test splitting anyways...

addaleax · 2016-12-19T05:01:39Z

Without further investigation all I observe from the lldb debugger is that it keeps hitting this line at platform-posix.cc again and again. I assume this is supposed to be hit only once and then the process will get abort right after that?

Uh, this may be a long shot, but is there a signal handler attached that does not terminate the process (e.g. one that was attached by libuv/from JS land)? That would lead to a “hanging” loop in that place, since every time the signal handler returns the process attempts to execute the crashing instruction again?

joyeecheung · 2016-12-19T05:12:09Z

Uh, this may be a long shot, but is there a signal handler attached that does not terminate the process (e.g. one that was attached by libuv/from JS land)? That would lead to a “hanging” loop in that place, since every time the signal handler returns the process attempts to execute the crashing instruction again?

I am not sure will there be any signal handlers attached if I run these tests with out/Realease/node --abort-on-uncaught-exception $file, but running this simple demo:

#if defined(__clang__)
#if defined(__GNUC__)  // Clang in gcc mode.
# define V8_CC_GNU 1
#endif
#elif defined(__GNUC__)
# define V8_CC_GNU 1
#endif

#if V8_CC_GNU
#define V8_IMMEDIATE_CRASH() __builtin_trap()
#else
#define V8_IMMEDIATE_CRASH() ((void(*)())0)()
#endif

// the lines above are copied from the V8 source
int main(int argc, char *argv[]) {
  V8_IMMEDIATE_CRASH();
  return 0;
}

crashes immediately on my machine. Maybe there are some compile flags that I should try?

The original test lauches 10 child processes at once and bypass `test.py`'s process regulation. This PR reduces the unmanaged parallelism and is a temporary workaround for nodejs#9979 (not a real fix). PR-URL: nodejs#10329 Reviewed-By: Anna Henningsen <anna@addaleax.net>

The original test lauches 10 child processes at once and bypass `test.py`'s process regulation. This PR reduces the unmanaged parallelism and is a temporary workaround for #9979 (not a real fix). PR-URL: #10329 Reviewed-By: Anna Henningsen <anna@addaleax.net>

Trott · 2017-07-15T09:33:12Z

@joyeecheung Does this still hang for you?

If so, what is the output of launchctl list | grep ReportCrash and sudo launchctl list | grep ReportCrash?

joyeecheung · 2017-07-16T03:32:35Z

@Trott Can not reproduce anymore, although I have updated my OS since then. I think we can reopen this issue when it shows up again.

mscdex added domain Issues and PRs related to the domain subsystem. macos Issues and PRs related to the macOS platform / OSX. test Issues and PRs related to the tests. labels Dec 1, 2016

joyeecheung mentioned this issue Dec 19, 2016

test: reduce unmanaged parallelism in test #10329

Closed

3 tasks

joyeecheung closed this as completed Jul 16, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Test hangs on OSX 10.9.5 test-domain-no-error-handler-abort-on-uncaught.js #9979

Test hangs on OSX 10.9.5 test-domain-no-error-handler-abort-on-uncaught.js #9979

chexmixbbq commented Dec 1, 2016 •

edited

Loading

Trott commented Dec 2, 2016

joyeecheung commented Dec 2, 2016 •

edited

Loading

joyeecheung commented Dec 2, 2016 •

edited

Loading

Trott commented Dec 2, 2016

Trott commented Dec 2, 2016

joyeecheung commented Dec 2, 2016 •

edited

Loading

joyeecheung commented Dec 2, 2016 •

edited

Loading

Trott commented Dec 2, 2016

Trott commented Dec 2, 2016

joyeecheung commented Dec 2, 2016 •

edited

Loading

Trott commented Dec 2, 2016

Trott commented Dec 9, 2016 •

edited

Loading

joyeecheung commented Dec 9, 2016

Trott commented Dec 9, 2016

joyeecheung commented Dec 10, 2016 •

edited

Loading

joyeecheung commented Dec 19, 2016 •

edited

Loading

addaleax commented Dec 19, 2016

joyeecheung commented Dec 19, 2016 •

edited

Loading

Trott commented Jul 15, 2017

joyeecheung commented Jul 16, 2017 •

edited

Loading

Test hangs on OSX 10.9.5 test-domain-no-error-handler-abort-on-uncaught.js #9979

Test hangs on OSX 10.9.5 test-domain-no-error-handler-abort-on-uncaught.js #9979

Comments

chexmixbbq commented Dec 1, 2016 • edited Loading

Trott commented Dec 2, 2016

joyeecheung commented Dec 2, 2016 • edited Loading

joyeecheung commented Dec 2, 2016 • edited Loading

Trott commented Dec 2, 2016

Trott commented Dec 2, 2016

joyeecheung commented Dec 2, 2016 • edited Loading

joyeecheung commented Dec 2, 2016 • edited Loading

Trott commented Dec 2, 2016

Trott commented Dec 2, 2016

joyeecheung commented Dec 2, 2016 • edited Loading

Trott commented Dec 2, 2016

Trott commented Dec 9, 2016 • edited Loading

joyeecheung commented Dec 9, 2016

Trott commented Dec 9, 2016

joyeecheung commented Dec 10, 2016 • edited Loading

joyeecheung commented Dec 19, 2016 • edited Loading

addaleax commented Dec 19, 2016

joyeecheung commented Dec 19, 2016 • edited Loading

Trott commented Jul 15, 2017

joyeecheung commented Jul 16, 2017 • edited Loading

chexmixbbq commented Dec 1, 2016 •

edited

Loading

joyeecheung commented Dec 2, 2016 •

edited

Loading

joyeecheung commented Dec 2, 2016 •

edited

Loading

joyeecheung commented Dec 2, 2016 •

edited

Loading

joyeecheung commented Dec 2, 2016 •

edited

Loading

joyeecheung commented Dec 2, 2016 •

edited

Loading

Trott commented Dec 9, 2016 •

edited

Loading

joyeecheung commented Dec 10, 2016 •

edited

Loading

joyeecheung commented Dec 19, 2016 •

edited

Loading

joyeecheung commented Dec 19, 2016 •

edited

Loading

joyeecheung commented Jul 16, 2017 •

edited

Loading