promises: add --abort-on-unhandled-rejection #15335

addaleax · 2017-09-11T10:48:32Z

Support creating core dumps of unhandled promise rejections on POSIX systems by keeping fork()s of the process alive until it is known whether they are handled synchronously or not.

This is a kind of crude approach, but with some basic experimentation it looks like it can work well enough.

@nodejs/diagnostics @nodejs/post-mortem

Checklist

make -j4 test (UNIX), or vcbuild test (Windows) passes
tests and/or benchmarks are included (to the degree to which that is possible? we can’t test core dumps themselves)
documentation is changed or added
commit message follows commit guidelines

Affected core subsystem(s)

promises

bnoordhuis

LGTM modulo comments.

bnoordhuis · 2017-09-11T11:37:31Z

doc/api/cli.md

+
+This option only affects `Promise` rejections. It is only available on POSIX
+systems. It is implemented by keeping `fork()`ed copies of the process alive
+until it is known whether the `Promise` is handled within the same eventloop


'event loop'

bnoordhuis · 2017-09-11T11:39:15Z

doc/api/cli.md

+This option only affects `Promise` rejections. It is only available on POSIX
+systems. It is implemented by keeping `fork()`ed copies of the process alive
+until it is known whether the `Promise` is handled within the same eventloop
+iteration or is left unhandled, which may come with performance implications.


With the current wording it's somewhat ambiguous if 'performance implications' applies to forking or unhandled rejections.

bnoordhuis · 2017-09-11T11:39:33Z

doc/node.1

+systems. It is implemented by keeping \fBfork()\fRed copies of the process alive
+until it is known whether the \fIPromise\fR is handled within the same event
+loop iteration or is left unhandled, which may come with performance
+implications.


bnoordhuis · 2017-09-11T11:40:07Z

lib/internal/process/promises.js

@@ -1,11 +1,13 @@
 'use strict';

 const { safeToString } = process.binding('util');
+const abort_regex = /^--abort[_-]on[_-]unhandled[_-]rejection$/;


abortRegex?

bnoordhuis · 2017-09-11T11:41:30Z

lib/internal/process/promises.js

+    if (wantAbort) {
+      // Defer core dump closing until the next tick since another Promise
+      // might become an unhandled rejection by adopting this promise's state.
+      process.nextTick(process._closeCoreDump, promise, false);


I'd assign process._closeCoreDump to a const and delete it from process.

bnoordhuis · 2017-09-11T11:50:48Z

src/node.cc

+  }
+  if (pid == 0) {
+    bool do_abort;
+    int rc = read(pipes[0], &do_abort, sizeof(do_abort));


You need to retry on EINTR:

int rc; do rc = read(...); while (rc == -1 && errno == EINTR);

(And #include <errno.h> at the top.)

As well, I'd make do_abort a char, not a bool. sizeof(char) is always 1, sizeof(bool) need not be.

bnoordhuis · 2017-09-11T11:52:49Z

src/node.cc

+    if (do_abort) {
+      // This may happen when e.g. fork()ing itself failed due to resource
+      // constraints.
+      abort();


ABORT(), here and elsewhere? It prints a stack trace.

I’ve added it here, but the other instances of abort(); are used for an “expected” abort after printing the Promise’s stack trace.

bnoordhuis · 2017-09-11T11:53:24Z

src/node.cc

+    if (info->refcount > 0) return;
+  }
+  env->promise_reject_reason_map()->Delete(promise->Result());
+  if (write(info->fd, &do_abort, sizeof(do_abort)) <= 0) abort();


Retry on EINTR.

bnoordhuis · 2017-09-11T11:53:47Z

src/node.cc

+  if (write(info->fd, &do_abort, sizeof(do_abort)) <= 0) abort();
+  if (close(info->fd) == -1) abort();
+  int status = 0;
+  if (waitpid(info->pid, &status, 0) == -1) abort();


cjihrig

LGTM once @bnoordhuis nits are addressed.

addaleax · 2017-09-11T13:03:14Z

Updated with @bnoordhuis nits addressed

richardlau · 2017-09-11T13:06:57Z

lib/internal/process/promises.js

@@ -1,11 +1,13 @@
 'use strict';

 const { safeToString } = process.binding('util');
+const abortRegex = /^--abort[_-]on[_-]unhandled[_-]rejection$/;


Do we need to support the _ variant? We do for --abort-on-uncaught-exception, but that's because it's a V8 option.

It’s here for parity with --abort_on_uncaught_exception but yes, we can choose to not support it if we like.

It doesn't cost much to keep parity; I would suggest keeping it.

I'm +0 either way but if we do support it then we should probably have a test variant for it.

richardlau · 2017-09-11T13:26:26Z

src/node.cc

@@ -3861,6 +4008,7 @@ static void CheckIfAllowedInEnv(const char* exe, bool is_env,
    "--force-fips",
    "--openssl-config",
    "--icu-data-dir",
+    "--abort_on_unhandled_rejection",


Please also add this to the documented list of options that are allowed in NODE_OPTIONS in doc/api/cli.md.

@richardlau done!

richardlau · 2017-09-11T19:02:01Z

test/abort/test-abort-unhandled-rejection.js

+  Promise.reject(new Error('child error'));
+} else {
+  run('', null);
+  if (!common.isWindows) {


Should we common.skip() on Windows, or perhaps assert that the flag is rejected?

davepacheco · 2017-09-11T19:37:13Z

doc/api/cli.md

+systems. It is implemented by keeping `fork()`ed copies of the process alive
+until it is known whether the `Promise` is handled within the same event loop
+iteration or is left unhandled.
+This makes unhandled rejections a significantly more complex operation,


As scary as this already sounds, I think it remains a massive understatement. Fork, particularly at high frequency, has some pretty significant performance impact:

fork itself can take a bunch of time

after the fork completes, performance is impacted in both parent and child processes as subsequent writes require additional page faults, even if the memory is already resident

on systems that avoid overcommitting swap (including illumos and Linux, when configured to do so), the memory used by the child process is counted against the swap cap. In cloud environments with tuned caps, this can quickly exhaust swap, causing lots of other operations to fail.

If the child ends up writing to the shared pages, then the total memory used by the parent and child goes up, which can result in paging (again, particularly in environments with memory caps configured) that can dramatically reduce performance.

I think it's good that the documentation mentions the use of fork, and I think it would benefit from a firmer message about the possible impact.

Fork, particularly at high frequency, has some pretty significant performance impact:

Yes, I am aware that that makes this option unsuitable depending on your application type. I can try to incorporate some of your points into the doc.

If the child ends up writing to the shared pages, then the total memory used by the parent and child goes up, which can result in paging (again, particularly in environments with memory caps configured) that can dramatically reduce performance.

Why would this happen (for more than one or two pages of the stack)? The child process really isn’t doing much here.

Any pages marked copy-on-write by fork() get cloned when the parent mutates them. If the parent does (for example) a compacting GC, that's a significant number of pages.

Is there any way we can limit GC or at least discourage it while the forked process lives?

It may be obvious but I wonder if reminding people they should ensure ulimit is set to allow core dumps would be helpful.

davepacheco · 2017-09-11T19:39:33Z

src/node.cc

+
+  if (pipe(pipes) == -1) return;
+  pid = fork();
+  if (pid == -1) {


Am I reading this correctly that if we fail to fork (or do any of these operations), we just won't get a core file for that, and the user will never know? I think the documentation should probably reflect that this is a best-effort approach. It would also be great if there were some way for a user to know that a dump has been skipped because of a failure like this. I'm afraid that when combined with some of the problems I mentioned earlier (e.g., swap exhaustion, or exhaustion of some other cap -- like number of processes), users won't get a core file, and they won't know that they won't get one until it's too late.

In this case, the original process itself will abort() once it finds that there is no promise_abort_info struct attached to the promise when it would otherwise signal the child process to abort:

https://github.com/addaleax/node/blob/29fcd36f789a2f9913661ba7f6b964362517c9b6/src/node.cc#L1369-L1373

It’s certainly not ideal, but what’s the alternative?

I think that aborting the original process if we can't be sure is the reasonable approach.

davepacheco · 2017-09-11T19:41:47Z

src/node.cc

+  pid_t pid;
+
+  if (pipe(pipes) == -1) return;
+  pid = fork();


How do we avoid the problem where another thread is holding a lock that will remain held in the child process with no thread to unlock it?

That doesn't matter, does it? The child does nothing but read() and _exit(), it doesn't take out any locks.

fork() only clones the calling thread into the child process, if that answers your question. Plus, what Ben said.

That's true, although it relies on nobody adding anything more sophisticated to the child process in the future.

That also raises something I missed the first time: this only captures stack information about the forking thread. State about thread pool threads or other threads created by libraries or other add-ons will be missing from the core file. That might be another caveat worth documenting.

(edit: Sorry for the redundant note. I hadn't caught up to Julien's comment.)

davepacheco · 2017-09-11T19:47:31Z

I've made a couple of notes on the implementation, but just to summarize: this has come up several times, and major issues were brought up with the approach. I think it's pretty likely that end users won't appreciate the severity of the impact that this can have, or the limitations of the resulting core files, until it's too late.

addaleax · 2017-09-11T22:23:25Z

@davepacheco I know this isn’t ideal, but I’ve spent a lot of time thinking about this issue too. I’ll try to write up a bit more advisory text tomorrow; but generally, since this is an opt-in feature that – as you mention – has been brought up a few times, I think it’s something worth trying out. Maybe we can get V8 to give better predictions of whether an rejection is unhandled, but if you want a core dump at failure time, I’ve come to the conclusion that this is as good as it gets.

Also, regarding the performance issues: In the end, for all suggested solutions that I know of, avoiding uncaught rejections as much as possible is the key to reasonable performance. One nice thing is that with the increasing adoption of async/await, that’s going to become slightly less of an issue over time.

jasnell · 2017-09-12T09:43:21Z

I appreciate the work on this and will do a more thorough review later, but I have to say that I'm not too keen on approaches for this that only work on posix. Will definitely have to give this a lot of thought.

benjamingr · 2017-09-12T12:01:45Z

but I have to say that I'm not too keen on approaches for this that only work on posix

All the postmortem analysis tools that care about core dumps for Node.js don't work on windows at the moment anyway IIRC ( @nodejs/platform-windows - please do correct me if there is news or I missed anything?).

In particular, I'd like to ping @misterdjules to assert that this solves the promises post-mortem issue once and for all.

The approach here is the one we discussed during nodejs/post-mortem#45

benjamingr

This is the best path forward in terms of making sure post mortem analysis tools are compatible with promises.

misterdjules · 2017-09-12T14:07:55Z

Is it correct that core dumps generated by such a mechanism would have information about only one thread, that is the thread where the promise reject callback is called?

If so, maybe that's one more caveat to add to the documentation.

@benjamingr

In particular, I'd like to ping @misterdjules to assert that this solves the promises post-mortem issue once and for all.

With all the potential problems that were mentioned in this GH issue, it doesn't seem to me that this would solve problems with promises and post-mortem debugging "once and for all".

benjamingr · 2017-09-12T14:25:18Z

@misterdjules

Is it correct that core dumps generated by such a mechanism would have information about only one thread, that is the thread where the promise reject callback is called?

Since the whole process is forked I would assume this would contain all information about the process (fork is literally called).

Promise rejections are supposed to be rare, promise rejections that are handled asynchronously are supposed to be very rare so I'd assume the penalty for the average program would be low and I intend to run my own code with this flag when I can.

With all the potential problems that were mentioned in this GH issue, it doesn't seem to me that this would solve problems with promises and post-mortem debugging "once and for all".

Would you mind elaborating? I would like us to set up action items and a plan after which we can all agree that promises are compatible with post-mortem analysis tools.

In case it isn't abundantly clear - a lot of us (me included) are not particularly versed with these tools and take core dumps rarely and we could really use your knowledge on the topic.

With this pull request we don't lose any stack/heap information about the promise which I think is pretty neat and it's what I've been advocating for at nodejs/post-mortem#45

misterdjules · 2017-09-12T15:14:47Z

Is it correct that core dumps generated by such a mechanism would have information about only one thread, that is the thread where the promise reject callback is called?

Since the whole process is forked I would assume this would contain all information about the process (fork is literally called).

From man -s 2 fork on SmartOS:

   Threads
       A call to forkall() or forkallx() replicates in the child process all
       of the threads (see thr_create(3C) and pthread_create(3C)) in the
       parent process. A call to fork1() or forkx() replicates only the
       calling thread in the child process.


       A call to fork() is identical to a call to fork1(); only the calling
       thread is replicated in the child process. This is the POSIX-specified
       behavior for fork().

This is confirmed empirically by running the following (./node is a node binary with the changes in this GH PR applied to it):

[root@ff16a225-3889-4644-a2a7-b5b963bcdded ~/dev/node]# ls -lt /var/tmp/cores/
total 0
[root@ff16a225-3889-4644-a2a7-b5b963bcdded ~/dev/node]# ./node --abort-on-unhandled-rejection -e "Promise.reject(new Error('boom'));"
Unhandled promise rejection
Error: boom
    at [eval]:1:16
    at ContextifyScript.Script.runInThisContext (vm.js:50:33)
    at Object.runInThisContext (vm.js:139:38)
    at Object.<anonymous> ([eval]-wrapper:6:22)
    at Module._compile (module.js:600:30)
    at evalScript (bootstrap_node.js:480:27)
    at startup (bootstrap_node.js:177:9)
    at bootstrap_node.js:626:3
Abort (core dumped)
[root@ff16a225-3889-4644-a2a7-b5b963bcdded ~/dev/node]# ls -lt /var/tmp/cores/
total 48942
-rw------- 1 root root 59272149 Sep 12 14:59 core.node.28515
-rw------- 1 root root 58714981 Sep 12 14:59 core.node.28518
[root@ff16a225-3889-4644-a2a7-b5b963bcdded ~/dev/node]# pstack /var/tmp/cores/core.node.28515 
core '/var/tmp/cores/core.node.28515' of 28515: ./node --abort-on-unhandled-rejection -e Promise.reject(new Error('boo
-----------------  lwp# 1 / thread# 1  --------------------
 ffffbf7fff22184a _lwp_kill () + a
 ffffbf7fff1b5960 raise (6) + 20
 ffffbf7fff18faa8 abort () + 98
 0000000000cf682c _ZN4node12_GLOBAL__N_113CloseCoreDumpERKN2v820FunctionCallbackInfoINS1_5ValueEEE () + 22c
 0000000001260f4b _ZN2v88internal25FunctionCallbackArguments4CallEPFvRKNS_20FunctionCallbackInfoINS_5ValueEEEE () + 14b
 00000000012e26d5 _ZN2v88internal12_GLOBAL__N_119HandleApiCallHelperILb0EEENS0_11MaybeHandleINS0_6ObjectEEEPNS0_7IsolateENS0_6HandleINS0_10HeapObjectEEESA_NS8_INS0_20FunctionTemplateInfoEEENS8_IS4_EENS0_16BuiltinArgumentsE () + 155
 00000000012e2f8f _ZN2v88internal21Builtin_HandleApiCallEiPPNS0_6ObjectEPNS0_7IsolateE () + bf
 00000447682846fd ???????? ()
 000004476837aa82 ???????? ()
 000004476833f075 ???????? ()
 000004476837a7e1 ???????? ()
 000004476833f075 ???????? ()
 000004476837a7e1 ???????? ()
 000004476833f075 ???????? ()
 000004476837aa82 ???????? ()
 000004476833f075 ???????? ()
 0000044768379e29 ???????? ()
 000004476833f075 ???????? ()
 0000044768398052 ???????? ()
 000004476833f075 ???????? ()
 000004476837a7e1 ???????? ()
 000004476833f075 ???????? ()
 000004476833e339 ???????? ()
 000004476828412d ???????? ()
 0000000001662a3c _ZN2v88internal9Execution4CallEPNS0_7IsolateENS0_6HandleINS0_6ObjectEEES6_iPS6_ () + 11c
 000000000124418a _ZN2v88Function4CallENS_5LocalINS_7ContextEEENS1_INS_5ValueEEEiPS5_ () + 1fa
 0000000001252171 _ZN2v88Function4CallENS_5LocalINS_5ValueEEEiPS3_ () + 41
 0000000000cfc461 _ZN4node15LoadEnvironmentEPNS_11EnvironmentE () + 231
 0000000000d06dec _ZN4node5StartEP9uv_loop_siPKPKciS5_ () + 8ac
 0000000000cff286 _ZN4node5StartEiPPc () + 156
 0000000000cca36c _start () + 6c
-----------------  lwp# 2 / thread# 2  --------------------
 ffffbf7fff21a617 __lwp_park () + 17
 ffffbf7fff213b6b cond_wait_queue (2c74e98, 2c74e80, 0) + 5b
 ffffbf7fff214011 cond_wait_common (2c74e98, 2c74e80, 0) + 1d1
 ffffbf7fff2141c2 __cond_wait (2c74e98, 2c74e80) + 52
 ffffbf7fff21425a cond_wait (2c74e98, 2c74e80) + 2a
 ffffbf7fff214295 pthread_cond_wait (2c74e98, 2c74e80) + 15
 000000000115cd19 uv_cond_wait () + 9
 0000000000d4eedb _ZN4nodeL16BackgroundRunnerEPv () + ab
 ffffbf7fff21a2ba _thrp_setup (ffffbf7ffe7e0240) + 8a
 ffffbf7fff21a5d0 _lwp_start ()
-----------------  lwp# 3 / thread# 3  --------------------
 ffffbf7fff21a617 __lwp_park () + 17
 ffffbf7fff213b6b cond_wait_queue (2c74e98, 2c74e80, 0) + 5b
 ffffbf7fff214011 cond_wait_common (2c74e98, 2c74e80, 0) + 1d1
 ffffbf7fff2141c2 __cond_wait (2c74e98, 2c74e80) + 52
 ffffbf7fff21425a cond_wait (2c74e98, 2c74e80) + 2a
 ffffbf7fff214295 pthread_cond_wait (2c74e98, 2c74e80) + 15
 000000000115cd19 uv_cond_wait () + 9
 0000000000d4eedb _ZN4nodeL16BackgroundRunnerEPv () + ab
 ffffbf7fff21a2ba _thrp_setup (ffffbf7ffe7e0a40) + 8a
 ffffbf7fff21a5d0 _lwp_start ()
-----------------  lwp# 4 / thread# 4  --------------------
 ffffbf7fff21a617 __lwp_park () + 17
 ffffbf7fff213b6b cond_wait_queue (2c74e98, 2c74e80, 0) + 5b
 ffffbf7fff214011 cond_wait_common (2c74e98, 2c74e80, 0) + 1d1
 ffffbf7fff2141c2 __cond_wait (2c74e98, 2c74e80) + 52
 ffffbf7fff21425a cond_wait (2c74e98, 2c74e80) + 2a
 ffffbf7fff214295 pthread_cond_wait (2c74e98, 2c74e80) + 15
 000000000115cd19 uv_cond_wait () + 9
 0000000000d4eedb _ZN4nodeL16BackgroundRunnerEPv () + ab
 ffffbf7fff21a2ba _thrp_setup (ffffbf7ffe7e1240) + 8a
 ffffbf7fff21a5d0 _lwp_start ()
-----------------  lwp# 5 / thread# 5  --------------------
 ffffbf7fff21a617 __lwp_park () + 17
 ffffbf7fff213b6b cond_wait_queue (2c74e98, 2c74e80, 0) + 5b
 ffffbf7fff214011 cond_wait_common (2c74e98, 2c74e80, 0) + 1d1
 ffffbf7fff2141c2 __cond_wait (2c74e98, 2c74e80) + 52
 ffffbf7fff21425a cond_wait (2c74e98, 2c74e80) + 2a
 ffffbf7fff214295 pthread_cond_wait (2c74e98, 2c74e80) + 15
 000000000115cd19 uv_cond_wait () + 9
 0000000000d4eedb _ZN4nodeL16BackgroundRunnerEPv () + ab
 ffffbf7fff21a2ba _thrp_setup (ffffbf7ffe7e1a40) + 8a
 ffffbf7fff21a5d0 _lwp_start ()
-----------------  lwp# 6 / thread# 6  --------------------
 ffffbf7fff21a617 __lwp_park () + 17
 ffffbf7fff20cb63 sema_wait (2b8dd80) + 13
 ffffbf7fff200a78 sem_wait (2b8dd80) + 38
 000000000115cbb4 uv_sem_wait () + 24
 0000000000d95d3a _ZN4node9inspector12_GLOBAL__N_117StartIoThreadMainEPv () + 1a
 ffffbf7fff21a2ba _thrp_setup (ffffbf7ffe7e2240) + 8a
 ffffbf7fff21a5d0 _lwp_start ()
[root@ff16a225-3889-4644-a2a7-b5b963bcdded ~/dev/node]# pstack /var/tmp/cores/core.node.28518 
core '/var/tmp/cores/core.node.28518' of 28518: ./node --abort-on-unhandled-rejection -e Promise.reject(new Error('boo
 ffffbf7fff22184a _lwp_kill () + a
 ffffbf7fff1b5960 raise (6) + 20
 ffffbf7fff18faa8 abort () + 98
 0000000000cf4988 _ZN4node12_GLOBAL__N_121PromiseRejectCallbackEN2v820PromiseRejectMessageE () + 288
 000000000177fb35 _ZN2v88internal7Isolate19ReportPromiseRejectENS0_6HandleINS0_8JSObjectEEENS2_INS0_6ObjectEEENS_18PromiseRejectEventE () + 45
 000000000197640e _ZN2v88internal35Runtime_PromiseRejectEventFromStackEiPPNS0_6ObjectEPNS0_7IsolateE () + ce
 00000447682846fd ???????? ()
 00000447682fcb73 ???????? ()
 0000044768379f7c ???????? ()
 000004476833f075 ???????? ()
 000004476833e339 ???????? ()
 000004476828412d ???????? ()
 0000000001662a3c _ZN2v88internal9Execution4CallEPNS0_7IsolateENS0_6HandleINS0_6ObjectEEES6_iPS6_ () + 11c
 00000000012345cb _ZN2v86Script3RunENS_5LocalINS_7ContextEEE () + 21b
 00000000012522e2 _ZN2v86Script3RunEv () + 32
 0000000000d26868 _ZN4node12_GLOBAL__N_116ContextifyScript11EvalMachineEPNS_11EnvironmentElbbRKN2v820FunctionCallbackInfoINS4_5ValueEEEPNS4_8TryCatchE () + 388
 0000000000d28ba8 _ZN4node12_GLOBAL__N_116ContextifyScript16RunInThisContextERKN2v820FunctionCallbackInfoINS2_5ValueEEE () + 2d8
 0000000001260f4b _ZN2v88internal25FunctionCallbackArguments4CallEPFvRKNS_20FunctionCallbackInfoINS_5ValueEEEE () + 14b
 00000000012e26d5 _ZN2v88internal12_GLOBAL__N_119HandleApiCallHelperILb0EEENS0_11MaybeHandleINS0_6ObjectEEEPNS0_7IsolateENS0_6HandleINS0_10HeapObjectEEESA_NS8_INS0_20FunctionTemplateInfoEEENS8_IS4_EENS0_16BuiltinArgumentsE () + 155
 00000000012e2f8f _ZN2v88internal21Builtin_HandleApiCallEiPPNS0_6ObjectEPNS0_7IsolateE () + bf
 00000447682846fd ???????? ()
 000004476837a0ea ???????? ()
 000004476833f075 ???????? ()
 0000044768379f7c ???????? ()
 000004476833f075 ???????? ()
 000004476837a0ea ???????? ()
 000004476833f075 ???????? ()
 0000044768379948 ???????? ()
 000004476833f075 ???????? ()
 000004476837a0ea ???????? ()
 000004476833f075 ???????? ()
 0000044768398052 ???????? ()
 000004476833f075 ???????? ()
 000004476837a7e1 ???????? ()
 000004476833f075 ???????? ()
 000004476833e339 ???????? ()
 000004476828412d ???????? ()
 0000000001662a3c _ZN2v88internal9Execution4CallEPNS0_7IsolateENS0_6HandleINS0_6ObjectEEES6_iPS6_ () + 11c
 000000000124418a _ZN2v88Function4CallENS_5LocalINS_7ContextEEENS1_INS_5ValueEEEiPS5_ () + 1fa
 0000000001252171 _ZN2v88Function4CallENS_5LocalINS_5ValueEEEiPS3_ () + 41
 0000000000cfc461 _ZN4node15LoadEnvironmentEPNS_11EnvironmentE () + 231
 0000000000d06dec _ZN4node5StartEP9uv_loop_siPKPKciS5_ () + 8ac
 0000000000cff286 _ZN4node5StartEiPPc () + 156
 0000000000cca36c _start () + 6c
[root@ff16a225-3889-4644-a2a7-b5b963bcdded ~/dev/node]#

In the output above, we can see that two core files where generated (I thought only one was supposed to be generated after reading https://github.com/nodejs/node/pull/15335/files#diff-cd53544f44aab2c697bcd7b6a57f23ccR1395), and the one with the promise reject callback on the stack has only one thread, while the other has more than one thread, that is all the threads you'd expected from a node process.

Promise rejections are supposed to be rare, promise rejections that are handled asynchronously are supposed to be very rare so I'd assume the penalty for the average program would be low and I intend to run my own code with this flag when I can.

I'm not sure these are safe assumptions to make. At scale, in pathological cases, or in any other unforeseen circumstance what used to be rare and insignificant can become quite problematic. Moreover, when occurrences of a bug are rare, we want the tools we use to investigate them to be robust, reliable and as complete as possible.

With all the potential problems that were mentioned in this GH issue, it doesn't seem to me that this would solve problems with promises and post-mortem debugging "once and for all".

Would you mind elaborating?

@davepacheco listed many potential problems with this approach with his comments on this PR, and in many other GH issues/PRs.

With this pull request we don't lose any stack/heap information about the promise which I think is pretty neat and it's what I've been advocating for at nodejs/post-mortem#45

It's definitely an interesting experiment and contribution, but as a post-mortem debugging user I don't think I would use it in its current form in production if it landed because of, among other things, the potential problems that @davepacheco and myself presented in this PR.

mhdawson · 2017-09-12T15:43:34Z

@uzun FYI as another approach being taken to capture additional info related to promises.

rnchamberlain · 2017-09-12T15:51:58Z

fork() only preserves the single calling thread, but since that is the main node loop you'll get the thread and JavaScript stack that you need to diagnose the problem, also the entire heap. Missing the other (e.g libuv) native threads is not a show-stopper. I agree we should mention this in the documentation.

mhdawson · 2017-09-12T15:53:40Z

@uzun can you also run the test cases you were building in nodejs/post-mortem#45 with this to see what the results are.

benjamingr · 2017-09-12T17:14:13Z

@davepacheco

I've made a couple of notes on the implementation, but just to summarize: this has come up several times, and major issues were brought up with the approach.

The idea in the thread you linked to there was to start the cluster and treat the first unhandled rejection differently. This is similar (by forking the process and relying on CoW) but I think the performance profile can be very different. I think in the vast majority of bugs this approach can work well.

@misterdjules

From man -s 2 fork on SmartOS:

You are absolutely right, I'm going to send a very angry mail to my old CS professor :D Thanks.

Promise rejections are supposed to be rare, promise rejections that are handled asynchronously are supposed to be very rare so I'd assume the penalty for the average program would be low and I intend to run my own code with this flag when I can.

I'm not sure these are safe assumptions to make. At scale, in pathological cases, or in any other unforeseen circumstance what used to be rare and insignificant can become quite problematic. Moreover, when occurrences of a bug are rare, we want the tools we use to investigate them to be robust, reliable and as complete as possible.

I agree with everything you've written. I think that at this point we have to make painful tradeoffs and Node.js has been reasonably good at making those unsafe assumptions responsibly. I think determinism when debugging is extremely important and I acknowledge the value of postmortem analysis.

I don't think there is a way to solve the fundamental issue with errors thrown asynchronously (would a util.throwSynchronouslyOnRejection(promise) (name pending) help?). It's likely that the majority of Node.js users are using promises in some way. This might not be some big company but I know several companies (including the one I'm working for) that use promises in large scale projects with concurrency in the millions. These people need a good post-mortem story.

Given all that - I would really like an actionable list of items from the post-mortem side of things that should happen. I know very little beyond a consumer perspective about post-mortem analysis but I've been involved with promise work for more than 4 years. Except for @mhdawson the post-mortem working group has been awfully silent about making post-mortem analysis better with promises in the last year.

By "once and for all" I meant achieving the best outcome given these constraints. I realize this might require a lot of work but I think it's worth exploring.

@rnchamberlain

fork() only preserves the single calling thread, but since that is the main node loop you'll get the thread and JavaScript stack that you need to diagnose the problem, also the entire heap. Missing the other (e.g libuv) native threads is not a show-stopper. I agree we should mention this in the documentation.

If this is sufficient in the general case it sounds good to me. I've only used core dumps a handful of times in order to debug things and it was always the JavaScript or Buffers that were problematic and by the point Buffers were problematic they were already in the JavaScript heap.

davepacheco · 2017-09-12T19:27:44Z

Given all that - I would really like an actionable list of items from the post-mortem side of things that should happen. I know very little beyond a consumer perspective about post-mortem analysis but I've been involved with promise work for more than 4 years. Except for @mhdawson the post-mortem working group has been awfully silent about making post-mortem analysis better with promises in the last year.

First of all, I appreciate all the work people have done to try to satisfy both the postmortem debugging use-cases and people's desires to use promises. I know this PR represents a lot of work, and as a very heavy postmortem debugging user I'm grateful that Node is trying so hard to support this use-case.

I don't think it's fair to say that members of the postmortem working group have been silent on this issue. Several of us have weighed in on several different threads now explaining why we believe it's impossible to satisfy both use cases. We're not saying that it's hard, or that people just haven't figured out how to do it yet, but that we believe the requirements are mutually exclusive. I understand people want to explore the space anyway. That's great! But I can't give you "an actionable list of items from the post-mortem side of things that should happen" because I don't think there's anything that can be done to resolve the starting constraints. I think @addaleax is correct in concluding that this is likely as good as it gets. Unfortunately, it's still a non-starter for a lot of postmortem debugging users. I think the best we can hope for is that this approach is helpful for some users, and that the documentation warns them about the limitations and risks of this approach so that they're not caught off guard if they wind up with dozens of extra processes in production or core files that don't have some of the state they expect.

Relatedly, I think it would be useful to know if there are people who use core-file-based postmortem debugging today who are ready to deploy with the proposed --abort-on-unhandled-rejection. It would be especially useful to know if any of those people have already used this in its prototype form to debug real issues. I know I haven't waded into use of promises much myself yet because of these issues, and it sounds like some of the people interested in this proposal haven't used core files much before, so it would be good to know from somebody firmly in both camps that we're really addressing their issue. I apologize for not knowing already if there are end users consuming this work already!

For those new to these discussions, to summarize the problem again most concisely: postmortem debugging with --abort-on-uncaught-exception collects the entire JavaScript and native state of the program (including all threads) in the precise context of the failure (so that stack state is preserved) with zero runtime cost until the failure happens, while promises explicitly specify that the program must continue executing after the point of the failure, meaning that state is necessarily mutated. Forking is the closest thing to having it both ways, but it has all of the drawbacks brought up here -- it's a heavy cost and it doesn't include all of the same state.

misterdjules · 2017-09-12T20:15:57Z

Relatedly, I think it would be useful to know if there are people who use core-file-based postmortem debugging today who are ready to deploy with the proposed --abort-on-unhandled-rejection. It would be especially useful to know if any of those people have already used this in its prototype form to debug real issues. I know I haven't waded into use of promises much myself yet because of these issues, and it sounds like some of the people interested in this proposal haven't used core files much before, so it would be good to know from somebody firmly in both camps that we're really addressing their issue.

It seems @yunong would potentially be a good person to ask.

rnchamberlain · 2017-09-13T16:31:12Z

I tried the PR with a simple testcase, on Ubuntu 16.04. Running vanilla, as expected I get
(node:20138) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 1): FAILED and the application continues.

With --abort-on-unhandled-rejection I get a core dump, but it is triggered from the assert in node.cc line 1306: CHECK_EQ(promise->State(), Promise::kRejected); so the dump is not of the saved state from the forked process. I'll do some more digging.

addaleax · 2017-09-13T17:10:33Z

@rnchamberlain do you have the example code you used for that?

rnchamberlain · 2017-09-14T09:55:11Z

@addaleax code is here: https://gist.github.com/rnchamberlain/563d575f129867bd23c3c7da001d888b

BridgeAR · 2017-09-22T20:53:13Z

Ping @addaleax

rnchamberlain · 2017-10-03T10:50:12Z

The assert in node.cc line 1306: CHECK_EQ(promise->State(), Promise::kRejected); happens because promise->State() is zero (enum PromiseState { kPending, kFulfilled, kRejected }. Then we abort immediately. The calling code that supplies the promise is this:

void PromiseRejectCallback(PromiseRejectMessage message) {
  Local<Promise> promise = message.GetPromise();
  Isolate* isolate = promise->GetIsolate();
  Local<Value> value = message.GetValue();
  Local<Integer> event = Integer::New(isolate, message.GetEvent());

  Environment* env = Environment::GetCurrent(isolate);
  Local<Function> callback = env->promise_reject_function();

  if (message.GetEvent() == v8::kPromiseRejectWithNoHandler) {
    OpenCoreDump(env, promise);
  }

addaleax · 2017-10-03T13:45:18Z

@rnchamberlain I still can’t reproduce this :( It sounds like a V8 bug though, because promise->State() == Promise::kRejected should be guaranteed by the V8 API when it calls its PromiseRejectCallback, if I understand it correctly? @gsathya am I mistaken about that?

gsathya · 2017-10-03T14:35:19Z

https://cs.chromium.org/chromium/src/v8/src/builtins/builtins-promise-gen.cc?rcl=9c4f80520da791c712944265478d0272293e5d2b&l=980 does call out to PromiseRejectCallback before fulfilling the promise, so it is possible.

It does seem odd that this isn't guaranteed by the API. Filed https://bugs.chromium.org/p/v8/issues/detail?id=6880 to track this. Thanks @addaleax !

uzun · 2017-10-04T18:50:37Z

We reproduced this in macOS 10.13.

/Users/mu/Downloads/node/out/Release/node[810]: ../src/node.cc:1306:void node::(anonymous namespace)::OpenCoreDump(node::Environment , Localv8::Promise): Assertion (promise->State()) == (Promise::kRejected)' failed.
1: node::Abort() [/Users/mu/Downloads/node/./node]
2: node::MakeCallback(v8::Isolate, v8::Localv8::Object, char const*, int, v8::Localv8::Value, node::async_context) [/Users/mu/Downloads/node/./node]
3: node::(anonymous namespace)::PromiseRejectCallback(v8::PromiseRejectMessage) [/Users/mu/Downloads/node/./node]
4: v8::internal::Isolate::ReportPromiseReject(v8::internal::Handlev8::internal::JSObject, v8::internal::Handlev8::internal::Object, v8::PromiseRejectEvent) [/Users/mu/Downloads/node/./node]
5: v8::internal::Runtime_ReportPromiseReject(int, v8::internal::Object**, v8::internal::Isolate) [/Users/mu/Downloads/node/./node]
6: 0x38667f2046fd
7: 0x38667f2853a5
8: 0x38667f284173
9: 0x38667f2be339
Abort trap: 6 (core dumped)

addaleax · 2017-10-04T18:54:41Z

Fwiw it sounds like @gsathya already figured out what the issue in V8 is, so it seems like all we need is an upstream change that we’ll cherry-pick into this PR

Support creating core dumps of unhandled promise rejections on POSIX systems by keeping `fork()`s of the process alive until it is known whether they are handled synchronously or not.

BridgeAR · 2017-11-22T17:38:46Z

Is there any update here? @gsathya Is there a fix in V8 that we can cherry-pick?

addaleax · 2017-11-22T17:41:47Z

@BridgeAR I think for now I’m abandoning this in favor of looking into more creative userland solutions … if anybody else thinks this is a good idea and wants to push it over the line, please feel free to do that

BridgeAR · 2017-11-22T18:31:18Z

@addaleax thanks for the info. I am closing this therefore. If anyone feels it should be reopened, please feel free to do so.

nodejs-github-bot added c++ Issues and PRs that require attention from people who are familiar with C++. process Issues and PRs related to the process subsystem. labels Sep 11, 2017

bnoordhuis approved these changes Sep 11, 2017

View reviewed changes

cjihrig approved these changes Sep 11, 2017

View reviewed changes

addaleax force-pushed the abort-on-unhandled-rejection branch 2 times, most recently from 60cff61 to bebefa8 Compare September 11, 2017 13:02

addaleax force-pushed the abort-on-unhandled-rejection branch from bebefa8 to 0eca52a Compare September 11, 2017 13:04

richardlau reviewed Sep 11, 2017

View reviewed changes

addaleax force-pushed the abort-on-unhandled-rejection branch from 0eca52a to 58ef2eb Compare September 11, 2017 13:08

richardlau reviewed Sep 11, 2017

View reviewed changes

addaleax force-pushed the abort-on-unhandled-rejection branch 2 times, most recently from e1271fb to 29fcd36 Compare September 11, 2017 16:33

richardlau reviewed Sep 11, 2017

View reviewed changes

davepacheco reviewed Sep 11, 2017

View reviewed changes

benjamingr approved these changes Sep 12, 2017

View reviewed changes

benjamingr mentioned this pull request Sep 12, 2017

Investigation of post-mortem debugging with Promises. nodejs/post-mortem#45

Closed

promises: add --abort-on-unhandled-rejection

eff4265

Support creating core dumps of unhandled promise rejections on POSIX systems by keeping `fork()`s of the process alive until it is known whether they are handled synchronously or not.

addaleax force-pushed the abort-on-unhandled-rejection branch from 29fcd36 to eff4265 Compare October 4, 2017 19:02

rnchamberlain mentioned this pull request Oct 11, 2017

Add node-report to node core nodejs/node-report#103

Closed

BridgeAR closed this Nov 22, 2017

promises: add --abort-on-unhandled-rejection #15335

promises: add --abort-on-unhandled-rejection #15335

Conversation

addaleax commented Sep 11, 2017 • edited Loading

Checklist

Affected core subsystem(s)

bnoordhuis left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cjihrig left a comment

Choose a reason for hiding this comment

addaleax commented Sep 11, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bnoordhuis Sep 12, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

davepacheco Sep 12, 2017 • edited Loading

Choose a reason for hiding this comment

davepacheco commented Sep 11, 2017

addaleax commented Sep 11, 2017

jasnell commented Sep 12, 2017

benjamingr commented Sep 12, 2017

benjamingr left a comment

Choose a reason for hiding this comment

misterdjules commented Sep 12, 2017

benjamingr commented Sep 12, 2017

misterdjules commented Sep 12, 2017

mhdawson commented Sep 12, 2017

rnchamberlain commented Sep 12, 2017

mhdawson commented Sep 12, 2017

benjamingr commented Sep 12, 2017

davepacheco commented Sep 12, 2017

misterdjules commented Sep 12, 2017

rnchamberlain commented Sep 13, 2017

addaleax commented Sep 13, 2017

rnchamberlain commented Sep 14, 2017

BridgeAR commented Sep 22, 2017

rnchamberlain commented Oct 3, 2017

addaleax commented Oct 3, 2017

gsathya commented Oct 3, 2017

uzun commented Oct 4, 2017

addaleax commented Oct 4, 2017

BridgeAR commented Nov 22, 2017

addaleax commented Nov 22, 2017

BridgeAR commented Nov 22, 2017

addaleax commented Sep 11, 2017 •

edited

Loading

bnoordhuis Sep 12, 2017 •

edited

Loading

davepacheco Sep 12, 2017 •

edited

Loading