worker: refactor thread life cycle management #26099

gireeshpunathil · 2019-02-14T14:01:00Z

The current mechanism uses two async handles: one owned by the
creator of the worker thread to terminate a running worker,
and another employed by the worker to interrupt its creator on its
natural termination. Forced termination piggybacks on the
message-passing mechanism to tell the worker to quiesce.

There are also a few flags that represent the other thread's state /
request state, because certain code paths are shared by multiple
control flows, and there are certain code paths where the async
handles may not have come to life yet.

Refactor into a LoopStopper abstraction that exposes routines to
install a handle as well as to save a state.

Refs: #21283

The approach can be re-used for stopping the main Node application thread in-flight.

cc @addaleax @nodejs/workers

Checklist

make -j4 test (UNIX), or vcbuild test (Windows) passes
tests and/or benchmarks are included
documentation is changed or added
commit message follows commit guidelines

nodejs-github-bot · 2019-02-14T14:01:02Z

@gireeshpunathil build started: https://ci.nodejs.org/blue/organizations/jenkins/node-test-pull-request-lite-pipeline/detail/node-test-pull-request-lite-pipeline/2602/pipeline

gireeshpunathil · 2019-02-15T10:51:14Z

@addaleax - One thing I noticed is that the Environment that is newly created in Worker::Run is never attached to the worker object; instead, the worker object keeps the creator's Environment, while within the run loop the new env_ is in effect. While I don't see any issues with this approach, is this working as designed? What advantage it has over attaching the new one onto it?

addaleax · 2019-02-15T11:09:34Z

While I don't see any issues with this approach, is this working as designed?

Yes, it’s working as designed.

What advantage it has over attaching the new one onto it?

It’s a question of correctness: the Worker object is an object in the parent thread, so env() should return the parent thread’s Environment for it. We also don’t want the child thread’s Environment to be used outside of the child thread, so making it local to that thread and not exposing it otherwise seems to make sense?

addaleax · 2019-02-15T11:54:11Z

CI: https://ci.nodejs.org/job/node-test-pull-request/20791/

gireeshpunathil · 2019-02-15T12:54:42Z

parallel/test-worker-messageport-transfer-terminate is consistently failing on Power Linux, and I am able to reproduce it locally. Investigating...

gireeshpunathil · 2019-02-15T12:57:54Z

parallel/test-worker-messageport-transfer-terminate fails on CentOS and OS X too.
parallel/test-worker-debug fails on Windows.

addaleax · 2019-02-15T13:30:50Z

@gireeshpunathil Let me know if you need anything investigating those failures :)

gireeshpunathil · 2019-02-15T13:58:26Z

Program terminated with signal SIGSEGV, Segmentation fault.
#0  __GI___pthread_mutex_lock (mutex=0x18) at ../nptl/pthread_mutex_lock.c:67
67	../nptl/pthread_mutex_lock.c: No such file or directory.
[Current thread is 1 (Thread 0x3fffa1d54e60 (LWP 32165))]
(gdb) where
#0  __GI___pthread_mutex_lock (mutex=0x18) at ../nptl/pthread_mutex_lock.c:67
#1  0x000000001077a3d0 in ?? () at ../deps/uv/src/unix/thread.c:288
#2  0x0000000010669de4 in node::worker::Worker::StopThread(v8::FunctionCallbackInfo<v8::Value> const&)
    ()
#3  0x00000000108dfc4c in v8::internal::MaybeHandle<v8::internal::Object> v8::internal::(anonymous namespace)::HandleApiCallHelper<false>(v8::internal::Isolate*, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::FunctionTemplateInfo>, v8::internal::Handle<v8::internal::Object>, v8::internal::BuiltinArguments) ()
#4  0x00000000108e09e4 in v8::internal::Builtin_HandleApiCall(int, v8::internal::Object**, v8::internal::Isolate*) ()
#5  0x0000000011b5bc68 in Builtins_CEntry_Return1_DontSaveFPRegs_ArgvOnStack_NoBuiltinExit ()

void Worker::Exit(int code) {
  Mutex::ScopedLock lock(mutex_);

  Debug(this, "Worker %llu called Exit(%d)", thread_id_, code);
  if (!thread_stopper_->IsStopped()) { // --------------------------> here
    exit_code_ = code;
    Debug(this, "Received StopEventLoop request");
    thread_stopper_->Stop();
    if (isolate_ != nullptr)
      isolate_->TerminateExecution();
  }
}

Looking at this line (where it crashed), I believe the main thread is trying to terminate the worker before it has even come to life, which means thread_stopper_ has not been created yet.

Prior to this PR this was not an issue, as we were using primitive fields in the worker.

Right now I am constructing thread_stopper_ as soon as I enter the thread (::Run), but it looks like even that is too late!

I guess we could create it in the parent itself, and the async handle can be attached later in the worker thread, of course.

With that change I ran the test on ppc Linux 1000 times and saw no crashes.

@addaleax - what do you think?
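
A minimal sketch of the idea proposed above, using a simplified stand-in rather than the real class: the stop-state object is built in the parent, and the wake-up handle is attached later from the worker thread. `StopRequest`, `Install`, and the `std::function` that stands in for a `uv_async_t` are illustrative assumptions, not the actual node_worker.cc API.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical stand-in for the stopper object discussed above. It can be
// queried or stopped before any wake-up handle has been installed.
class StopRequest {
 public:
  // Called later, from the worker thread, once its event loop exists.
  void Install(std::function<void()> wake) {
    assert(wake && "Install() expects a callable");
    std::lock_guard<std::mutex> lock(mutex_);
    wake_ = std::move(wake);
  }

  // Records the stop request; wakes the worker only if a handle is attached.
  void Stop() {
    std::function<void()> wake;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stopped_ = true;
      wake = wake_;
    }
    if (wake) wake();
  }

  bool IsStopped() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return stopped_;
  }

 private:
  mutable std::mutex mutex_;
  bool stopped_ = false;
  std::function<void()> wake_;  // Stand-in for a uv_async_t; may be empty.
};

// Simulates the parent terminating the worker before the thread has started.
// Returns true if the early stop was recorded without touching any handle.
bool StopBeforeInstallIsSafe() {
  int wakeups = 0;
  StopRequest stopper;                         // Constructed in the parent.
  stopper.Stop();                              // Parent terminates early.
  bool recorded = stopper.IsStopped();
  stopper.Install([&wakeups] { ++wakeups; });  // Worker attaches later.
  return recorded && wakeups == 0;
}
```

In this model an early `Stop()` only flips the flag, so the worker can check `IsStopped()` once it starts and skip its run loop.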

gireeshpunathil · 2019-02-15T13:59:43Z

btw AIX also failed the same test, and I was anticipating it!

gireeshpunathil · 2019-02-15T14:22:38Z

new CI with the changes: https://ci.nodejs.org/job/node-test-pull-request/20794/

addaleax · 2019-02-15T14:33:43Z

@gireeshpunathil Yes, that makes a lot of sense. Do these AsyncRequest fields need to be allocated separately, now that they are both constructed in the constructor anyway? They could be direct members of Worker, right?

gireeshpunathil · 2019-02-15T14:39:09Z

you mean to unwrap the AsyncRequest object and make everything part of the worker::Worker? but then we can't re-use it elsewhere? for example #21283 where we would want this (which is outside the scope of worker_threads)

gireeshpunathil · 2019-02-15T14:48:59Z

btw wrote a new test that recreated the crash in xlinux with the old code and confirmed the theory. When the main thread went to terminate, the worker was still being cloned.

(gdb) where
#0  0x00007f1d0de23c30 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x0000000001019239 in uv_mutex_lock (mutex=0x18) at ../deps/uv/src/unix/thread.c:287
#2  0x0000000000d9b0a0 in node::LibuvMutexTraits::mutex_lock (mutex=0x18) at ../src/node_mutex.h:108
#3  0x0000000000d9ce86 in node::MutexBase<node::LibuvMutexTraits>::ScopedLock::ScopedLock (
    this=0x7ffe42fea980, mutex=...) at ../src/node_mutex.h:164
#4  0x0000000000efd41a in node::worker::AsyncRequest::IsStopped (this=0x0) at ../src/node_worker.cc:87
#5  0x0000000000f00105 in node::worker::Worker::Exit (this=0x5060d20, code=1)
    at ../src/node_worker.cc:549
#6  0x0000000000efff41 in node::worker::Worker::StopThread (args=...) at ../src/node_worker.cc:526
#7  0x0000000001185b4d in v8::internal::FunctionCallbackArguments::Call (
    this=this@entry=0x7ffe42feab90, handler=handler@entry=0x18d96d7f5391)
    at ../deps/v8/src/api-arguments-inl.h:140
#8  0x0000000001186802 in v8::internal::(anonymous namespace)::HandleApiCallHelper<false> (
    isolate=isolate@entry=0x4f6da90, function=..., function@entry=..., new_target=..., 
    new_target@entry=..., fun_data=..., receiver=..., receiver@entry=..., args=...)
    at ../deps/v8/src/builtins/builtins-api.cc:109
#9  0x000000000118ac2b in v8::internal::Builtin_Impl_HandleApiCall (args=..., 
    isolate=isolate@entry=0x4f6da90) at ../deps/v8/src/builtins/builtins-api.cc:139
#10 0x000000000118b641 in v8::internal::Builtin_HandleApiCall (args_length=5, 
    args_object=0x7ffe42feadb8, isolate=0x4f6da90) at ../deps/v8/src/builtins/builtins-api.cc:127
#11 0x000000000249ea95 in Builtins_CEntry_Return1_DontSaveFPRegs_ArgvOnStack_NoBuiltinExit ()
#12 0x00000840f988cb8e in ?? ()
#13 0x000004c1108025a1 in ?? ()
#14 0x000018d96d7f5621 in ?? ()
#15 0x0000000500000000 in ?? ()
#16 0x000004c110802681 in ?? ()
#17 0x00003a17f9dd3be9 in ?? ()
(gdb) f 5
#5  0x0000000000f00105 in node::worker::Worker::Exit (this=0x5060d20, code=1)
    at ../src/node_worker.cc:549
549	  if (!thread_stopper_->IsStopped()) {
(gdb) thr 8
[Switching to thread 8 (Thread 0x7f1d06ffd700 (LWP 42711))]
#0  0x00007f1d0db4bb01 in clone () from /lib64/libc.so.6
(gdb) where
#0  0x00007f1d0db4bb01 in clone () from /lib64/libc.so.6
#1  0x00007f1d0de21d10 in ?? () from /lib64/libpthread.so.0
#2  0x00007f1d06ffd700 in ?? ()
#3  0x0000000000000000 in ?? ()
(gdb)

addaleax · 2019-02-15T18:19:29Z

you mean to unwrap the AsyncRequest object and make everything part of the worker::Worker?

Sorry for being unclear – what I meant was to use AsyncRequest thread_stopper_; instead of std::unique_ptr<AsyncRequest> thread_stopper_; (and to perform the necessary replacements of -> with . etc.).
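
The difference between the two declarations can be sketched as follows. The types here are hypothetical, trimmed-down stand-ins (the real AsyncRequest carries a mutex and a handle pointer); the point is only that a direct member has no null state.

```cpp
#include <cassert>
#include <memory>

// Hypothetical, trimmed-down stand-in for AsyncRequest.
struct Stopper {
  bool stopped = false;
  bool IsStopped() const { return stopped; }
};

// Before: separately allocated, and null until something assigns it.
struct WorkerWithPointer {
  std::unique_ptr<Stopper> thread_stopper_;
};

// After: a direct member, constructed together with its owner.
struct WorkerWithMember {
  Stopper thread_stopper_;
};

// The pointer variant starts out null, so an early -> access would crash.
bool PointerStartsNull() {
  WorkerWithPointer w;
  return w.thread_stopper_ == nullptr;
}

// The member variant can be used as soon as the owner exists.
bool MemberIsUsableImmediately() {
  WorkerWithMember w;
  assert(!w.thread_stopper_.stopped);
  return !w.thread_stopper_.IsStopped();
}
```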

gireeshpunathil · 2019-02-16T05:54:35Z

thanks @addaleax - I followed that suggestion.

Several tests were crashing on Windows. Here is what I found:

00007FF6A9505BE0  push        rbx  
00007FF6A9505BE2  sub         rsp,20h  
00007FF6A9505BE6  mov         rbx,qword ptr [rcx]  
00007FF6A9505BE9  mov         edx,0E0h  
00007FF6A9505BEE  mov         rax,qword ptr [rbx]  
>> 00007FF6A9505BF1  dec         dword ptr [rax+7D0h]  
00007FF6A9505BF7  mov         rax,qword ptr [rbx+10h]  
00007FF6A9505BFB  mov         qword ptr [rcx],rax  
00007FF6A9505BFE  call        operator delete (07FF6A93C1B4Ah)  
00007FF6A9505C03  mov         edx,18h  
00007FF6A9505C08  mov         rcx,rbx  
00007FF6A9505C0B  add         rsp,20h  
00007FF6A9505C0F  pop         rbx

This points to the data->env access in the CloseHandle method:

  uv_close(reinterpret_cast<uv_handle_t*>(handle), [](uv_handle_t* handle) {
    std::unique_ptr<CloseData> data { static_cast<CloseData*>(handle->data) };
    data->env->handle_cleanup_waiting_--;
    handle->data = data->original_data;
    data->callback(reinterpret_cast<T*>(handle));
  });

It is possible that the Environment was live when we issued the CloseHandle call, but had been torn down by the time the callback ran (maybe on the next tick?). And on Windows we have seen in the past that freed memory gets filled with garbage.

If I nullify the env_ field in the OnScopeLeave and skip the CloseHandle call based on env_'s existence, the problem surfaces as a handle leak, reported by CheckedUvLoopClose.

Looking for suggestions at this point!

gireeshpunathil · 2019-02-16T11:13:06Z

ok, I guess I found the issue - I was issuing thread_stopper_.Uninstall() very late in the cycle, based on my original premise of keeping the pointer as late as possible. Following the existing code, I see it closes the handle in MessagePort::OnClose. Following the same sequence, I am able to solve the issue. Currently running tests in Windows locally.
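
The ordering constraint can be illustrated with a small model that does not use libuv. `FakeLoop`, `Env`, and `CloseHandleLater` are hypothetical names; the sketch assumes only what the thread describes, namely that the close callback runs on a later loop iteration and touches the environment when it does.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <queue>
#include <utility>

// Hypothetical stand-in for the Environment counter the callback touches.
struct Env {
  int handle_cleanup_waiting = 0;
};

// Hypothetical event loop that defers callbacks to a later "tick".
class FakeLoop {
 public:
  void Defer(std::function<void()> cb) { pending_.push(std::move(cb)); }
  void RunPending() {
    while (!pending_.empty()) {
      std::function<void()> cb = std::move(pending_.front());
      pending_.pop();
      cb();
    }
  }

 private:
  std::queue<std::function<void()>> pending_;
};

// Requests an asynchronous close; env is only touched again on a later tick.
void CloseHandleLater(FakeLoop* loop, Env* env) {
  env->handle_cleanup_waiting++;
  loop->Defer([env] { env->handle_cleanup_waiting--; });
}

// Safe ordering: request the close, drain the loop, and only then destroy
// the environment. Returns the number of closes still outstanding.
int CloseThenDrainThenDestroy() {
  FakeLoop loop;
  std::unique_ptr<Env> env = std::make_unique<Env>();
  CloseHandleLater(&loop, env.get());
  assert(env->handle_cleanup_waiting == 1);
  loop.RunPending();  // The callback runs while env is still alive.
  int waiting = env->handle_cleanup_waiting;
  env.reset();        // Tear down only after the callback has run.
  return waiting;
}
```

Swapping the last two steps, so that `env.reset()` runs before `RunPending()`, would make the deferred callback dereference freed memory, which is the failure mode described above.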

addaleax · 2019-02-16T14:25:13Z

@gireeshpunathil Yeah, I think that makes sense. :)

addaleax · 2019-02-17T11:00:36Z

src/node_worker.h

@@ -34,14 +55,13 @@ class Worker : public AsyncWrap {
    tracker->TrackFieldWithSize(
        "isolate_data", sizeof(IsolateData), "IsolateData");
    tracker->TrackFieldWithSize("env", sizeof(Environment), "Environment");
-    tracker->TrackField("thread_exit_async", *thread_exit_async_);


We’re not tracking the uv_async_t anymore, right? Maybe we should add something like tracker->TrackInlineField() that allows us to keep track of MemoryRetainers that are direct members of the class…

gireeshpunathil · 2019-02-17

You mean the one represented by thread_exit_async_? That is replaced with AsyncRequest objects, which create uv_async_t objects and track them through the interface method. The async_ field in AsyncRequest is still a pointer, so the uv_async_t is a direct member of neither AsyncRequest nor Worker.

addaleax · 2019-02-17

Neither *thread_stopper_.async_ nor *on_thread_finished_.async_ is tracked, yes, because we don’t inform the tracker about the existence of the AsyncRequest fields.

Also, side note: I’m just noticing that we have the IsolateData and Environment fields listed here as well, which I’m not sure makes sense given that they are no longer directly allocated by this object…

gireeshpunathil · 2019-02-17

Sorry, I don't understand. *thread_stopper_.async_ and *on_thread_finished_.async_ are not tracked through the tracker instance of Worker, but they are tracked through the tracker instance of the AsyncRequest object (line 98):

void AsyncRequest::MemoryInfo(MemoryTracker* tracker) const {
  Mutex::ScopedLock lock(mutex_);
  if (async_ != nullptr) tracker->TrackField("async_request", *async_);
}

Isn't that enough? I hope we don't need multiple trackers for the same allocation?

For the IsolateData and Environment: I just removed those from being actively tracked by the worker and pushed that change under this PR itself.

addaleax · 2019-02-17

@gireeshpunathil The problem is that the memory tracker doesn’t know that it should call AsyncRequest::MemoryInfo. Currently, the way to inform it would be adding tracker->TrackField("thread_stopper_", &thread_stopper_);, but then we would end up tracking the memory for the AsyncRequest itself twice.

addaleax · 2019-02-28

@gireeshpunathil Should we change this PR to use TrackInlineField now?

gireeshpunathil · 2019-03-01

@addaleax - yes. Though I knew this depends on #26161, for a moment I forgot about that!
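
The double-counting concern can be sketched with a toy tracker. This is a simplified model with hypothetical signatures, not Node's actual MemoryTracker API: `TrackField` counts the object itself plus whatever it reports, while `TrackInlineField` only descends into the object, because the owner's own size already covers a direct member.

```cpp
#include <cassert>
#include <cstddef>

class Tracker;

// Simplified stand-in for a MemoryRetainer.
struct Retainer {
  virtual ~Retainer() = default;
  virtual void MemoryInfo(Tracker* tracker) const = 0;
  virtual std::size_t SelfSize() const = 0;
};

// Toy tracker that just sums bytes (hypothetical signatures).
class Tracker {
 public:
  // For separately allocated objects: count the object, then its children.
  void TrackField(const Retainer& r) {
    total_ += r.SelfSize();
    r.MemoryInfo(this);
  }
  // For direct members: the owner's SelfSize() already includes the object,
  // so only its children are visited.
  void TrackInlineField(const Retainer& r) { r.MemoryInfo(this); }
  void TrackRaw(std::size_t bytes) { total_ += bytes; }
  std::size_t total() const { return total_; }

 private:
  std::size_t total_ = 0;
};

// Pretend size of a heap-allocated handle owned by each request.
constexpr std::size_t kHandleSize = 128;

struct Request : Retainer {
  void MemoryInfo(Tracker* t) const override { t->TrackRaw(kHandleSize); }
  std::size_t SelfSize() const override { return sizeof(*this); }
};

struct Owner : Retainer {
  Request stopper;  // Direct member, as in the refactored Worker.
  bool use_inline;
  explicit Owner(bool use_inline_field) : use_inline(use_inline_field) {}
  void MemoryInfo(Tracker* t) const override {
    if (use_inline) {
      t->TrackInlineField(stopper);
    } else {
      t->TrackField(stopper);
    }
  }
  std::size_t SelfSize() const override { return sizeof(*this); }
};

// Total bytes reported for one Owner under either tracking strategy.
std::size_t TotalFor(bool use_inline) {
  Owner owner(use_inline);
  Tracker tracker;
  tracker.TrackField(owner);
  assert(tracker.total() >= sizeof(Owner));
  return tracker.total();
}
```

In this model `TotalFor(false)` exceeds `TotalFor(true)` by exactly `sizeof(Request)`, which is the memory counted twice when a direct member is reported as if it were a separate allocation.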

gireeshpunathil · 2019-02-28T15:57:07Z

CI: https://ci.nodejs.org/job/node-test-pull-request/21037/

addaleax · 2019-02-28T20:25:05Z

Resume CI: https://ci.nodejs.org/job/node-test-pull-request/21057/

gireeshpunathil · 2019-03-01T06:11:40Z

full CI: https://ci.nodejs.org/job/node-test-pull-request/21078/

addaleax · 2019-03-01T09:15:22Z

Landed in d14cba4 :)

The current mechanism uses two async handles, one owned by the creator of the worker thread to terminate a running worker, and another one employed by the worker to interrupt its creator on its natural termination. The force termination piggybacks on the message-passing mechanism to inform the worker to quiesce.

Also there are a few flags that represent the other thread's state / request state because certain code paths are shared by multiple control flows, and there are certain code paths where the async handles may not have come to life.

Refactor into an AsyncRequest abstraction that exposes routines to install a handle as well as to save a state.

PR-URL: #26099
Refs: #21283
Reviewed-By: Anna Henningsen <anna@addaleax.net>

nodejs-github-bot added the c++ Issues and PRs that require attention from people who are familiar with C++. label Feb 14, 2019

gireeshpunathil requested review from addaleax and bnoordhuis February 14, 2019 14:02

gireeshpunathil added the worker Issues and PRs related to Worker support. label Feb 14, 2019

addaleax reviewed Feb 14, 2019

addaleax reviewed Feb 14, 2019

addaleax reviewed Feb 15, 2019

addaleax approved these changes Feb 15, 2019

gireeshpunathil force-pushed the uvcontrol branch 3 times, most recently from 0035d2c to f53a1e4 Compare February 17, 2019 10:55

addaleax reviewed Feb 17, 2019

addaleax mentioned this pull request Feb 17, 2019

src: track memory retainer fields #26161

Merged

4 tasks

addaleax mentioned this pull request Feb 23, 2019

worker: make MessagePort uv_async_t inline field #26271

Closed

2 tasks

mhdawson mentioned this pull request Feb 28, 2019

shutdown node in-flight #21283

Merged

4 tasks

addaleax added the author ready PRs that have at least one approval, no pending requests for changes, and a CI started. label Feb 28, 2019

gireeshpunathil added 4 commits March 1, 2019 00:45

fixup: address review comments

8c48265

fixup: don't track irrelevant fields

df06cc4

fixup: leverage TrackInlineField

db9edc1

gireeshpunathil force-pushed the uvcontrol branch from 338d590 to db9edc1 Compare March 1, 2019 06:09

addaleax closed this Mar 1, 2019

BridgeAR mentioned this pull request Mar 4, 2019

v11.11.0 proposal #26322

Merged

gireeshpunathil mentioned this pull request Mar 17, 2019

Investigate flaky pummel/test-heapdump-worker #26712

Closed
