Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test hangs on OSX 10.9.5 test-domain-no-error-handler-abort-on-uncaught.js #9979

Closed
chexmixbbq opened this issue Dec 1, 2016 · 20 comments
Closed
Labels
domain Issues and PRs related to the domain subsystem. macos Issues and PRs related to the macOS platform / OSX. test Issues and PRs related to the tests.

Comments

@chexmixbbq
Copy link

chexmixbbq commented Dec 1, 2016

  • Version: v8.0.0-pre
  • Platform: Darwin PA013136.local 13.4.0 Darwin Kernel Version 13.4.0: Mon Jan 11 18:17:34 PST 2016; root:xnu-2422.115.15~1/RELEASE_X86_64 x86_64
  • Subsystem: test

This test hangs on:

=== release test-domain-no-error-handler-abort-on-uncaught ===
Path: parallel/test-domain-no-error-handler-abort-on-uncaught
Command: out/Release/node /Users/youngj/JsProjects/node/test/parallel/test-domain-no-error-handler-abort-on-uncaught.js

@mscdex mscdex added domain Issues and PRs related to the domain subsystem. macos Issues and PRs related to the macOS platform / OSX. test Issues and PRs related to the tests. labels Dec 1, 2016
@Trott
Copy link
Member

Trott commented Dec 2, 2016

Thanks for reporting the issue. I am unable to replicate this on Darwin Kernel Version 16.1.0 and I have not seen it on the test machines. I'm not sure what could be causing it. Hopefully someone else will have more insight. Please do comment here if you manage to get it working.

@joyeecheung
Copy link
Member

joyeecheung commented Dec 2, 2016

I've run into this a few times today too, and I can reproduce this one when running out/Release/node test/parallel/test-domain-no-error-handler-abort-on-uncaught.js on my mac(Darwin Kernel Version 15.6.0).

I got "Illegal instruction: 4" when trying master, v4.4.7 and v6.9.1 to run individual test cases(passing child 1 or other available textIndex) earlier, but currently I can run the individual cases with expected results, although the test script spawning multiple processes for each test case still hangs.

@joyeecheung
Copy link
Member

joyeecheung commented Dec 2, 2016

Turns out I can finish the test, it just takes a long time(long enough to fail the test because of timeout when running with the python script).

real	0m16.144s
user	0m0.981s
sys	0m0.330s

another one

real	0m26.977s
user	0m0.960s
sys	0m0.339s

and another one

real	0m57.213s
user	0m0.988s
sys	0m0.359s

And I notice that there are two test/fixtures/child-process-persistent.js still running after the test finishes, that's odd.

@Trott
Copy link
Member

Trott commented Dec 2, 2016

Looks like this test launches 13 child processes at once. Can one or both of you see what happens if you only have the first six items in the array and remove the last seven? And then again but with the last seven in the array and the first six removed? If both of those still exhibit problems, how about if there's just one item in the array?

@Trott
Copy link
Member

Trott commented Dec 2, 2016

Just for clarity: The array I'm talking about is tests defined starting on line 14. Each item in it is a function that spans several lines. There are 13 functions in the array.

@joyeecheung
Copy link
Member

joyeecheung commented Dec 2, 2016

I have tried some combinations, it looks like whenever I hit the else branch(the one launching child processes from the array), the test will run for a long time. I have tried all the individual cases with child arguments(thus hitting the if branch), and they are all fine. Only spawning child processes will make it run for a long time, and all the combinations of test cases I've tried do that(first half, last half, first one, second one, ...).

BTW aren't there 10 functions in that array? That's what I get from

 if (process.argv[2] === 'child') {
   const testIndex = +process.argv[3];
+  console.log('running test ' + testIndex);
   tests[testIndex]();
 } else {
-
   tests.forEach(function(test, testIndex) {
     var testCmd = '';
     if (!common.isWindows) {
       // Do not create core files, as it can take a lot of disk space on
@@ -156,6 +175,7 @@ if (process.argv[2] === 'child') {
     testCmd += ' ' + 'child';
     testCmd += ' ' + testIndex;

+    console.log('spawn child ' + testIndex);
     var child = child_process.exec(testCmd);

     child.on('exit', function onExit(exitCode, signal) {
@@ -164,5 +184,9 @@ if (process.argv[2] === 'child') {
                      ' and signal ' + signal;
       assert(common.nodeProcessAborted(exitCode, signal), errMsg);
     });
+
+    child.stdout.on('data', function(data) {
+      console.log(data);
+    });
   });
 }

I've tried adding some logs in the tests and see what they would do. All test cases can run to the point before they throw an error out.

@joyeecheung
Copy link
Member

joyeecheung commented Dec 2, 2016

Update:

I tried to log in the onexit callback. Most of the times half of the test cases hit that line and log immediately, and after dozens of seconds, another half hit there and log and then quit. It's pretty random which ones will be in the first batch, which ones the second, and it's not ordered. Is there a race or anything?

@@ -156,13 +175,19 @@ if (process.argv[2] === 'child') {
     testCmd += ' ' + 'child';
     testCmd += ' ' + testIndex;

     var child = child_process.exec(testCmd);

     child.on('exit', function onExit(exitCode, signal) {
+      console.log(exitCode, signal, testIndex);
       const errMsg = 'Test at index ' + testIndex + ' should have aborted ' +
                      'but instead exited with exit code ' + exitCode +
                      ' and signal ' + signal;
       assert(common.nodeProcessAborted(exitCode, signal), errMsg);
     });
+
+    child.stdout.on('data', function(data) {
+      console.log(data);
+    });
   });
 }

@Trott
Copy link
Member

Trott commented Dec 2, 2016

BTW aren't there 10 functions in that array?

When I do console.log(tests.length), I get 13. And when I add logging in the child exit handler, I get 13 numbers (0 through 12). If you only get 10, that's curious....

@Trott
Copy link
Member

Trott commented Dec 2, 2016

Given how many processes this spawns, and given how we've seen timers be less reliable on heavily-loaded machines, and how some of these tests cases rely on timers, I wonder if the solution might be to move those tests to their own files. In a way, this is a test that is bypassing the way test.py regulates how many parallel jobs should run at once.

Other possible solution might be to rewrite this to run the tests one at a time and have each child process kick off the next one. But it will probably be easier code to understand to just have the tests in separate files.

@nodejs/testing

@joyeecheung
Copy link
Member

joyeecheung commented Dec 2, 2016

There could be use cases like this test(spawning multiple processes to max CPU utilization, using --abort-on-uncaught-exception for coredumps in production), but given that domain is pending deprecation anyway, maybe it's OK to just let it go.

If there is no objection I'll be happy to pick up the splitting task.

When I do console.log(tests.length), I get 13. And when I add logging in the child exit handler, I get 13 numbers (0 through 12). If you only get 10, that's curious....

I get 10 when I console.log(tests.length) and can confirm this number counting them with my eyes...that's weird

@Trott
Copy link
Member

Trott commented Dec 2, 2016

@joyeecheung I figured it out. You are correctly looking at test-domain-no-error-handler-abort-on-uncaught.js and I am mistakenly looking at test-domain-abort-on-uncaught.js.

@Trott
Copy link
Member

Trott commented Dec 9, 2016

If there is no objection I'll be happy to pick up the splitting task.

@joyeecheung Yes, please feel free to split this test up into multiple test files to reduce concurrency!

@joyeecheung
Copy link
Member

@Trott Ok, working on it!

I think I still need to spawn one child process per test file, because they need ulimit -c to avoid filling up the test machines disk with core dumps, but at least that's better than 10 child processes for one file?

@Trott
Copy link
Member

Trott commented Dec 9, 2016

but at least that's better than 10 child processes for one file?

Sounds about right to me, if I'm understanding correctly.

@joyeecheung
Copy link
Member

joyeecheung commented Dec 10, 2016

Hmm..turns out there is something off with the --abort-on-uncaught-exception in the first place. If I put the test function in a single file:

'use strict';

const domain = require('domain');

function test() {
  const d = domain.create();

  d.run(function() {
    throw new Error('boom!');
  });
}

test();

and run it with --abort-on-uncaught-exception(after I ulimit -c 0), it won't crash immediately. Instead, on my mac it will hang for another ~12 seconds.

0.05s user
0.02s system
0% cpu
12.091 total

But, if I run this test without --abort-on-uncaught-exception, it crashes immediately.

I think the problem is not about child processes. It's about --abort-on-uncaught-exception. The abortion is triggered like this:

(lldb) v8 bt
 * thread #1: tid = 0x5353f5, 0x0000000100a19fc2 node`v8::base::OS::Abort() + 18 at platform-posix.cc:240, queue = 'com.apple.main-thread', stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)
  * frame #0: 0x0000000100a19fc2 node`v8::base::OS::Abort() + 18 at platform-posix.cc:240 [opt]
    frame #1: 0x00000001006be521 node`v8::internal::Runtime_Throw(int, v8::internal::Object**, v8::internal::Isolate*) [inlined] v8::internal::__RT_impl_Runtime_Throw(v8::internal::Arguments, v8::internal::Isolate*) + 34 at runtime-internal.cc:82 [opt]
    frame #2: 0x00000001006be4ff node`v8::internal::Runtime_Throw(args_length=<unavailable>, args_object=0x00007fff77a383e0, isolate=0x00000001038040c8) + 319 at runtime-internal.cc:79 [opt]
    frame #3: 0x000038d9b56092a7 <exit>
    frame #4: 0x000038d9b574b5e0 (anonymous)(this=0x00000675340ed831:<Object: Domain>) at /Users/joyee/projects/node-test/domain.js:8:17 fn=0x00000675340edc31
    frame #5: 0x000038d9b574ae96 Domain.run(this=0x00000675340ed831:<Object: Domain>, 0x00000675340edc31:<function: (anonymous) at /Users/joyee/projects/node-test/domain.js:8:17>) at domain.js:205:32 fn=0x0000007eaa762c79
    frame #6: 0x000038d9b574a41f test(this=0x00003788c2504381:<undefined>) at /Users/joyee/projects/node-test/domain.js:5:14 fn=0x00000675340e9aa9
    frame #7: 0x000038d9b57483e2 (anonymous)(this=0x00000675340e7fa1:<Object: Object>, 0x00000675340e7fa1:<Object: Object>, 0x00000675340e95d1:<function: require at internal/module.js:17:19>, 0x00000675340e7ea9:<Object: Module>, 0x00000675340e5bf9:<String: "/Users/joyee/pro...">, 0x00000675340e94f9:<String: "/Users/joyee/pro...">) at /Users/joyee/projects/node-test/domain.js:1:10 fn=0x00000675340e9499
    frame #8: 0x000038d9b5748090 Module._compile(this=0x00000675340e7ea9:<Object: Module>, 0x00000675340e8fa9:<String: "'use strict';

c...">, 0x00000675340e5bf9:<String: "/Users/joyee/pro...">) at module.js:510:37 fn=0x0000007eaa752dd1
    frame #9: 0x000038d9b5740f2b Module._extensions..js(this=0x00000675340e0ee1:<Object: Object>, 0x00000675340e7ea9:<Object: Module>, 0x00000675340e5bf9:<String: "/Users/joyee/pro...">) at module.js:577:37 fn=0x0000007eaa752e91
    frame #10: 0x000038d9b573f952 Module.load(this=0x00000675340e7ea9:<Object: Module>, 0x00000675340e5bf9:<String: "/Users/joyee/pro...">) at module.js:478:33 fn=0x0000007eaa752c71
    frame #11: 0x000038d9b573f47d tryModuleLoad(this=0x00003788c2504381:<undefined>, 0x00000675340e7ea9:<Object: Module>, 0x00000675340e5bf9:<String: "/Users/joyee/pro...">) at module.js:443:23 fn=0x00000675340aa701
    frame #12: 0x000038d9b5734d22 Module._load(this=0x00000675340aa509:<function: Module at module.js:37:16>, 0x00000675340a4b21:<String: "/Users/joyee/pro...">, 0x00003788c2504201:<null>, 0x00003788c25043c1:<true>) at module.js:412:24 fn=0x0000007eaa752aa9
    frame #13: 0x000038d9b57347ea Module.runMain(this=0x00003788c2504381:<undefined>) at module.js:602:26 fn=0x0000007eaa7530d1
    frame #14: 0x000038d9b5734670 run(this=0x00003788c2504381:<undefined>, 0x0000007eaa7530d1:<function: Module.runMain at module.js:602:26>) at bootstrap_node.js:375:15 fn=0x0000153f113e23a1
    frame #15: 0x000038d9b56478f2 startup(this=0x00003788c2504381:<undefined>) at bootstrap_node.js:12:19 fn=0x00003788c25e2349
    frame #16: 0x000038d9b5642759 (anonymous)(this=0x00003788c2504201:<null>, 0x00003788c25e22e9:<Object: process>) at bootstrap_node.js:10:10 fn=0x00003788c25e2391
    frame #17: 0x000038d9b563b7c3 <internal>
    frame #18: 0x000038d9b562508f <entry>
    frame #19: 0x0000000100457840 node`v8::internal::(anonymous namespace)::Invoke(isolate=0x0000000103003cb8, is_construct=<unavailable>, target=<unavailable>, receiver=<unavailable>, argc=1, args=<unavailable>, new_target=<unavailable>) + 400 at execution.cc:97 [opt]
    frame #20: 0x000000010045767d node`v8::internal::Execution::Call(isolate=0x0000000103003c00, callable=<unavailable>, receiver=<unavailable>, argc=<unavailable>, argv=0x00007fff5fbff0d8) + 333 at execution.cc:153 [opt]
    frame #21: 0x000000010015c5c9 node`v8::Function::Call(this=<unavailable>, context=<unavailable>, recv=<unavailable>, argc=<unavailable>, argv=<unavailable>) + 473 at api.cc:4535 [opt]
    frame #22: 0x000000010087c69a node`node::LoadEnvironment(env=0x0000000102803000) + 625 at node.cc:3540 [opt]
    frame #23: 0x000000010087e1de node`node::Start(int, char**) + 505 at node.cc:4632 [opt]
    frame #24: 0x000000010087dfe5 node`node::Start(argc=<unavailable>, argv=0x0000000102204a60) + 182 at node.cc:4733 [opt]
    frame #25: 0x0000000100000c34 node`start + 52

The abortion is triggered with V8_IMMEDIATE_CRASH(), which is #define V8_IMMEDIATE_CRASH() ((void(*)())0)() on my mac

#if V8_CC_GNU
#define V8_IMMEDIATE_CRASH() __builtin_trap()
#else
#define V8_IMMEDIATE_CRASH() ((void(*)())0)()
#endif

and that is why I get "Illegal instruction: 4" when the process aborted.

Process 14612 stopped
* thread #1: tid = 0x534ec6, 0x0000000100a19fc2 node`v8::base::OS::Abort() + 18 at platform-posix.cc:240, queue = 'com.apple.main-thread', stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)
    frame #0: 0x0000000100a19fc2 node`v8::base::OS::Abort() + 18 at platform-posix.cc:240 [opt]
   237
   238 	void OS::Abort() {
   239 	  if (g_hard_abort) {
-> 240 	    V8_IMMEDIATE_CRASH();
   241 	  }
   242 	  // Redirect to std abort to signal abnormal program termination.
   243 	  abort();

But if I simply do ((void(*)())0)()or __builtin_trap() in another C++ program(I am not sure under what mode is node compiled on my mac so I tried both), it will just crash with illegal instruction warnings instead of hanging for ~12s before crashing.

I am not very familiar with the domains and how --abort-on-uncaught-exception works at the moment, so I might need a little bit more time to figure this out.

joyeecheung added a commit to joyeecheung/node that referenced this issue Dec 19, 2016
The original test lauches 10 child processes at once
and bypass `test.py`'s process regulation.
This PR reduces the unmanaged parallelism and is a
temporary workaround for nodejs#9979 (not a real fix).
@joyeecheung
Copy link
Member

joyeecheung commented Dec 19, 2016

After splitting the test, the hanging time doesn't change significantly, but at least now they are under more control of test.py. The hanging time on my mac is still high compared to what I get from a ubuntu docker container.

Lauching this:

tools/test.py --mode=release -J test/parallel/test-domain-no-error-handler-abort-on-uncaught-0.js test/parallel/test-domain-no-error-handler-abort-on-uncaught-1.js test/parallel/test-domain-no-error-handler-abort-on-uncaught-2.js test/parallel/test-domain-no-error-handler-abort-on-uncaught-3.js test/parallel/test-domain-no-error-handler-abort-on-uncaught-4.js test/parallel/test-domain-no-error-handler-abort-on-uncaught-5.js test/parallel/test-domain-no-error-handler-abort-on-uncaught-6.js test/parallel/test-domain-no-error-handler-abort-on-uncaught-7.js test/parallel/test-domain-no-error-handler-abort-on-uncaught-8.js test/parallel/test-domain-no-error-handler-abort-on-uncaught-9.js

In a ubuntu container(16.04):

[00:02|% 100|+  10|-   0]: Done

On the host machine(Darwin Kernel Version 15.6.0)

[00:18|% 100|+  10|-   0]: Done

Without further investigation all I observe from the lldb debugger is that it keeps hitting this line at platform-posix.cc again and again. I assume this is supposed to be hit only once and then the process will get abort right after that?

   237
   238 	void OS::Abort() {
   239 	  if (g_hard_abort) {
-> 240 	    V8_IMMEDIATE_CRASH();
   241 	  }
   242 	  // Redirect to std abort to signal abnormal program termination.
   243 	  abort();

Making a PR for the test splitting anyways...

@addaleax
Copy link
Member

Without further investigation all I observe from the lldb debugger is that it keeps hitting this line at platform-posix.cc again and again. I assume this is supposed to be hit only once and then the process will get abort right after that?

Uh, this may be a long shot, but is there a signal handler attached that does not terminate the process (e.g. one that was attached by libuv/from JS land)? That would lead to a “hanging” loop in that place, since every time the signal handler returns the process attempts to execute the crashing instruction again?

@joyeecheung
Copy link
Member

joyeecheung commented Dec 19, 2016

Uh, this may be a long shot, but is there a signal handler attached that does not terminate the process (e.g. one that was attached by libuv/from JS land)? That would lead to a “hanging” loop in that place, since every time the signal handler returns the process attempts to execute the crashing instruction again?

I am not sure will there be any signal handlers attached if I run these tests with out/Realease/node --abort-on-uncaught-exception $file, but running this simple demo:

#if defined(__clang__)
#if defined(__GNUC__)  // Clang in gcc mode.
# define V8_CC_GNU 1
#endif
#elif defined(__GNUC__)
# define V8_CC_GNU 1
#endif

#if V8_CC_GNU
#define V8_IMMEDIATE_CRASH() __builtin_trap()
#else
#define V8_IMMEDIATE_CRASH() ((void(*)())0)()
#endif

// the lines above are copied from the V8 source
int main(int argc, char *argv[]) {
  V8_IMMEDIATE_CRASH();
  return 0;
}

crashes immediately on my machine. Maybe there are some compile flags that I should try?

Trott pushed a commit to Trott/io.js that referenced this issue Dec 22, 2016
The original test lauches 10 child processes at once
and bypass `test.py`'s process regulation.
This PR reduces the unmanaged parallelism and is a
temporary workaround for nodejs#9979 (not a real fix).

PR-URL: nodejs#10329
Reviewed-By: Anna Henningsen <anna@addaleax.net>
italoacasas pushed a commit to italoacasas/node that referenced this issue Jan 18, 2017
The original test lauches 10 child processes at once
and bypass `test.py`'s process regulation.
This PR reduces the unmanaged parallelism and is a
temporary workaround for nodejs#9979 (not a real fix).

PR-URL: nodejs#10329
Reviewed-By: Anna Henningsen <anna@addaleax.net>
italoacasas pushed a commit to italoacasas/node that referenced this issue Jan 19, 2017
The original test lauches 10 child processes at once
and bypass `test.py`'s process regulation.
This PR reduces the unmanaged parallelism and is a
temporary workaround for nodejs#9979 (not a real fix).

PR-URL: nodejs#10329
Reviewed-By: Anna Henningsen <anna@addaleax.net>
italoacasas pushed a commit to italoacasas/node that referenced this issue Jan 25, 2017
The original test lauches 10 child processes at once
and bypass `test.py`'s process regulation.
This PR reduces the unmanaged parallelism and is a
temporary workaround for nodejs#9979 (not a real fix).

PR-URL: nodejs#10329
Reviewed-By: Anna Henningsen <anna@addaleax.net>
italoacasas pushed a commit to italoacasas/node that referenced this issue Jan 27, 2017
The original test lauches 10 child processes at once
and bypass `test.py`'s process regulation.
This PR reduces the unmanaged parallelism and is a
temporary workaround for nodejs#9979 (not a real fix).

PR-URL: nodejs#10329
Reviewed-By: Anna Henningsen <anna@addaleax.net>
italoacasas pushed a commit to italoacasas/node that referenced this issue Jan 30, 2017
The original test lauches 10 child processes at once
and bypass `test.py`'s process regulation.
This PR reduces the unmanaged parallelism and is a
temporary workaround for nodejs#9979 (not a real fix).

PR-URL: nodejs#10329
Reviewed-By: Anna Henningsen <anna@addaleax.net>
italoacasas pushed a commit to italoacasas/node that referenced this issue Jan 30, 2017
The original test lauches 10 child processes at once
and bypass `test.py`'s process regulation.
This PR reduces the unmanaged parallelism and is a
temporary workaround for nodejs#9979 (not a real fix).

PR-URL: nodejs#10329
Reviewed-By: Anna Henningsen <anna@addaleax.net>
MylesBorins pushed a commit that referenced this issue Mar 8, 2017
The original test lauches 10 child processes at once
and bypass `test.py`'s process regulation.
This PR reduces the unmanaged parallelism and is a
temporary workaround for #9979 (not a real fix).

PR-URL: #10329
Reviewed-By: Anna Henningsen <anna@addaleax.net>
MylesBorins pushed a commit that referenced this issue Mar 9, 2017
The original test lauches 10 child processes at once
and bypass `test.py`'s process regulation.
This PR reduces the unmanaged parallelism and is a
temporary workaround for #9979 (not a real fix).

PR-URL: #10329
Reviewed-By: Anna Henningsen <anna@addaleax.net>
@Trott
Copy link
Member

Trott commented Jul 15, 2017

@joyeecheung Does this still hang for you?

If so, what is the output of launchctl list | grep ReportCrash and sudo launchctl list | grep ReportCrash?

@joyeecheung
Copy link
Member

joyeecheung commented Jul 16, 2017

@Trott Can not reproduce anymore, although I have updated my OS since then. I think we can reopen this issue when it shows up again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain Issues and PRs related to the domain subsystem. macos Issues and PRs related to the macOS platform / OSX. test Issues and PRs related to the tests.
Projects
None yet
Development

No branches or pull requests

5 participants