Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

src: fix cb scope bugs involved in termination #45596

Merged
merged 1 commit into from
Feb 28, 2023

Conversation

ywave620
Copy link
Contributor

@ywave620 ywave620 commented Nov 23, 2022

An inheritor of #45422 and a fix to #43084. I move the original failing test back to the test/parallel/ to increase the chance that it exercises the defect, at which this PR is targeted.

To see the bug in 43084, apply this patch and run the script

diff --git a/src/env.cc b/src/env.cc
index 2f0720d402..5a4d9b2146 100644
--- a/src/env.cc
+++ b/src/env.cc
@@ -27,6 +27,8 @@
 #include <iostream>
 #include <limits>
 #include <memory>
+#include <thread> 
+#include <chrono>
 
 namespace node {
 
@@ -907,6 +909,7 @@ void Environment::InitializeLibuv() {
 void Environment::ExitEnv() {
   set_can_call_into_js(false);
   set_stopping(true);
+  std::this_thread::sleep_for(std::chrono::milliseconds(2000));
   isolate_->TerminateExecution();
   SetImmediateThreadsafe([](Environment* env) { uv_stop(env->event_loop()); });
 }
// Flags: --expose-internals

'use strict';
const common = require('../common');
const { Worker } = require('worker_threads');

/**
 * This file is a minimal possible reproduction and an
 * explanation of https://github.com/nodejs/node/issues/43084.
 *
 * It can not trigger the bug consistently because
 * that would require a rare thread execution interleaving
 * condition.
 *
 * test-worker-http2-stream-terminate.js has a bigger chance
 * to trigger the bug.
 */

if (!process.env.HAS_STARTED_WORKER) {
  process.env.HAS_STARTED_WORKER = 1;

  const workerData = new Int32Array(new SharedArrayBuffer(4));
  const w = new Worker(__filename, { workerData });
  w.on('exit', common.mustCall());

  // Wait for the worker to enter the _write() below
  // eslint-disable-next-line
  while (Atomics.load(workerData, 0) !== 1) {}
  w.terminate(); // node::Environment::ExitEnv() is called under the hood

  return;
}

// --------- worker thread ---------

const JSStreamSocket = require('internal/js_stream_socket');
const { Duplex } = require('stream');

class DummyDuplex extends Duplex {
  _read() {
  }

  _write(chunk, encoding, callback) {
    /**
     * Now, this worker thread has a stack looks like:
     * _pthread_start
     * ...
     * node::CheckImmediate
     * ...
     * node::MakeCallback
     * ...
     * v8::internal::FunctionCallbackArguments::Call
     * ...
     * node::MakeCallback
     * ...
     */

    const { workerData } = require('worker_threads');
    Atomics.store(workerData, 0, 1);

    // Make some time for the main thread to enter node::Environment::ExitEnv()
    // eslint-disable-next-line
    for (let i = 0; i < 1000000; i++) {}

    // Just before this function returns, if the main thread
    // 1. called node::Environment::ExitEnv()/set_can_call_into_js(false)
    // 2. but did not called node::Environment::ExitEnv()/TerminateExecution() yet
    // 3. and got suspended by the OS
    // Without the fix, this bug https://github.com/nodejs/node/issues/43084 occurs
    // and emitAfter() in processImmediate() in lib/internal/timers.js complains
    // Error: async hook stack has become corrupted (actual: 6, expected: 5)
  }

  _final(callback) {
  }
}

setImmediate(() => {
  const sock = new JSStreamSocket(new DummyDuplex());
  sock.write(Buffer.alloc(10));
  // JSStreamSocket basically delegate r/w to the Duplex paased to its constructor
  // It is a net.Socket
  // lib/net.js write()
  //  lib/internal/stream_base_commons.js writeGeneric()
  //    handleWriteReq()
  //      handle.writeBuffer()
  //        node::StreamBase::WriteBuffer()
  //          node::StreamBase::Write()
  //            node::JSStream::DoWrite()
  //              call to the above DummyDuplex
});

@nodejs-github-bot nodejs-github-bot added c++ Issues and PRs that require attention from people who are familiar with C++. needs-ci PRs that need a full CI run. labels Nov 23, 2022
@ywave620
Copy link
Contributor Author

ywave620 commented Dec 2, 2022

Any idea? :)

@legendecas
Copy link
Member

@legendecas legendecas added the request-ci Add this label to start a Jenkins CI on a PR. label Dec 6, 2022
@github-actions github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Dec 6, 2022
@nodejs-github-bot
Copy link
Collaborator

@ywave620
Copy link
Contributor Author

@legendecas Could you run the CI for this?

@legendecas legendecas added the request-ci Add this label to start a Jenkins CI on a PR. label Dec 15, 2022
@github-actions github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Dec 15, 2022
@nodejs-github-bot
Copy link
Collaborator

auto perform_stopping_check = [&]() {
if (env_->is_stopping()) {
if (!env_->can_call_into_js()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference here between !env_->can_call_into_js() and env_->is_stopping() is trivial. Environment::can_call_into_js() also checks if Environment::is_stopping_ is set. Maybe @addaleax can chime in here?

If this line is updated, the lambda variable should also be renamed as perform_call_js_check or similar. Also, I'd find the checks if (!env_->can_call_into_js()) return; below can be merged with this lambda call too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, semantically, this should be checking for whether we’re currently stopping the current environment through a termination exception, so is_stopping() would seem to be the correct choice?

@ywave620
Copy link
Contributor Author

Any thoughts on this PR?

@legendecas
Copy link
Member

Would you mind updating the PR according to addaleax's comment above? Thanks!

@ywave620
Copy link
Contributor Author

@legendecas addaleax's suggestion does not fix the bug. Please see above for why.

I suggest that remove set_can_call_into_js(false) from Environment::ExitEnv(), this is ok because can_call_into_js() also check is_stopping():

node/src/env-inl.h

Lines 637 to 639 in c203921

inline bool Environment::can_call_into_js() const {
return can_call_into_js_ && !is_stopping();
}

@legendecas
Copy link
Member

With #45907, env->is_stopping() implies !env->can_call_into_js() on the JavaScript thread. With regard to the comment of #45596 (comment), the current check on if env is stopping is the expected semantic.

Be more aggresive to clean up the async id stack,
and ensure the cleanup when terminating.

Calling SetIdle() when terminating is not harmless.
When node terminates due to an unhandled exception,
v8 preseves the vm state, which is JS and notifies
node through PerIsolateMessageListener(). If node
calls SetIdle() later, v8 complains because it
requires the vm state to either be EXTERNEL or IDLE
when embedder calling SetIdle().
@ywave620
Copy link
Contributor Author

@legendecas Got you. Changed, PTAL

Copy link
Member

@legendecas legendecas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@legendecas legendecas added the request-ci Add this label to start a Jenkins CI on a PR. label Feb 20, 2023
@github-actions github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Feb 20, 2023
@nodejs-github-bot
Copy link
Collaborator

@ywave620
Copy link
Contributor Author

ywave620 commented Feb 21, 2023

A test crashed. I wonder if there is a way to get a full stacktrace 🤔️
https://ci.nodejs.org/job/node-test-commit-osx-arm/10087/nodes=osx11/testReport/junit/(root)/test/parallel_test_tls_server_verify/

Failing for the past 1 build (Since [#10087](https://ci.nodejs.org/job/node-test-commit-osx-arm/10087/nodes=osx11/) )
[Took 0.85 sec.](https://ci.nodejs.org/job/node-test-commit-osx-arm/10087/nodes=osx11/testReport/junit/(root)/test/parallel_test_tls_server_verify/history)
Error Message
crashed (-10)
Stacktrace
0 0   connecting with agent1
0 1   connecting with agent2
0 2   connecting with agent3
0 3   connecting with nocert
1 0   connecting with agent1
1 1   connecting with agent2
1 2   connecting with agent3
1 3   connecting with nocert
2 0   connecting with agent1
2 1   connecting with agent2
2 2   connecting with agent3
2 3   connecting with nocert
3 0   connecting with agent1
3 1   connecting with agent2
3 2   connecting with agent3
3 3   connecting with nocert
4 0   connecting with agent1
4 1   connecting with agent2
4 2   connecting with agent3
4 3   connecting with nocert
5 0   connecting with agent1
5 1   connecting with agent2
5 2   connecting with agent3
5 3   connecting with agent4
0 Running 'Do not request certs. Everyone is unauthorized.'
0 - unauthed connection: null
0 - unauthed connection: null
0 2   * unauthed
0 1   * unauthed
0 - unauthed connection: null
0 0   * unauthed
0 - unauthed connection: null
0 3   * unauthed
1 Running 'Allow both authed and unauthed connections with CA1'
1 - unauthed connection: DEPTH_ZERO_SELF_SIGNED_CERT
1 - authed connection: agent1
1 - unauthed connection: UNABLE_TO_VERIFY_LEAF_SIGNATURE
1 1   * unauthed
1 0   * authed
1 2   * unauthed
1 - unauthed connection: UNABLE_TO_GET_ISSUER_CERT
1 3   * unauthed
2 Running 'Do not request certs at connection. Do that later'
2 - connected, renegotiating
2 - authed connection: agent1
2 0   * authed
2 - unauthed connection: null
2 1   * unauthed
2 - unauthed connection: null
2 2   * unauthed
2 - unauthed connection: null
2 3   * unauthed
3 Running 'Allow only authed connections with CA1'
3 - authed connection: agent1
3 0   * authed
4 Running 'Allow only authed connections with CA1 and CA2'
4 - authed connection: agent3
4 - authed connection: agent1
4 2   * authed
4 0   * authed
5 Running 'Allow only certs signed by CA2 but not in the CRL'

@nodejs-github-bot
Copy link
Collaborator

@legendecas
Copy link
Member

According to the reliability report nodejs/reliability#503, the failed test doesn't seem to be introduced in this PR. I've resumed the build.

@ywave620
Copy link
Contributor Author

@legendecas Could you merge this?

@legendecas legendecas added the commit-queue Add this label to land a pull request using GitHub Actions. label Feb 28, 2023
@nodejs-github-bot nodejs-github-bot removed the commit-queue Add this label to land a pull request using GitHub Actions. label Feb 28, 2023
@nodejs-github-bot nodejs-github-bot merged commit 7a37829 into nodejs:main Feb 28, 2023
@nodejs-github-bot
Copy link
Collaborator

Landed in 7a37829

targos pushed a commit that referenced this pull request Mar 13, 2023
Be more aggresive to clean up the async id stack,
and ensure the cleanup when terminating.

Calling SetIdle() when terminating is not harmless.
When node terminates due to an unhandled exception,
v8 preseves the vm state, which is JS and notifies
node through PerIsolateMessageListener(). If node
calls SetIdle() later, v8 complains because it
requires the vm state to either be EXTERNEL or IDLE
when embedder calling SetIdle().

PR-URL: #45596
Reviewed-By: Chengzhong Wu <legendecas@gmail.com>
targos pushed a commit that referenced this pull request Mar 14, 2023
Be more aggresive to clean up the async id stack,
and ensure the cleanup when terminating.

Calling SetIdle() when terminating is not harmless.
When node terminates due to an unhandled exception,
v8 preseves the vm state, which is JS and notifies
node through PerIsolateMessageListener(). If node
calls SetIdle() later, v8 complains because it
requires the vm state to either be EXTERNEL or IDLE
when embedder calling SetIdle().

PR-URL: #45596
Reviewed-By: Chengzhong Wu <legendecas@gmail.com>
danielleadams pushed a commit that referenced this pull request Apr 11, 2023
Be more aggresive to clean up the async id stack,
and ensure the cleanup when terminating.

Calling SetIdle() when terminating is not harmless.
When node terminates due to an unhandled exception,
v8 preseves the vm state, which is JS and notifies
node through PerIsolateMessageListener(). If node
calls SetIdle() later, v8 complains because it
requires the vm state to either be EXTERNEL or IDLE
when embedder calling SetIdle().

PR-URL: #45596
Reviewed-By: Chengzhong Wu <legendecas@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c++ Issues and PRs that require attention from people who are familiar with C++. needs-ci PRs that need a full CI run.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants