
Node.js 20 Upgrade: Segmentation Fault Core Dump During Pipeline Lage Build Step #56236


Open
Umesh-daiict opened this issue Dec 12, 2024 · 16 comments


@Umesh-daiict

Umesh-daiict commented Dec 12, 2024

Version

20.15.0

Platform

linux-gnu

Subsystem

No response

What steps will reproduce the bug?

  1. Use a Linux-based CI agent.
  2. Execute the following pipeline commands:
    yarn lage build bundle stage-deployment --concurrency 32
  3. Observe the segmentation fault during execution.

How often does it reproduce? Is there a required condition?

The segmentation fault during the yarn lage build bundle step occurs intermittently. Sometimes the build and bundle process passes successfully, while other times it results in a core dump. We have not identified any specific condition required for the issue to reproduce.

What is the expected behavior? Why is that the expected behavior?

The expected behavior is for the yarn lage build bundle command to successfully build and bundle the project without encountering segmentation faults. This is expected because the same process worked correctly with Node.js 18, and the upgrade to Node.js 20 should not introduce such critical issues.

What do you see instead?

We are encountering a segmentation fault (core dump) when running our pipeline after upgrading to Node.js 20. The issue arises specifically during the yarn lage build bundle steps on a Linux-based CI agent, and the core dump logs do not provide sufficient insight into the root cause. This behaviour was not observed with Node.js 18; it has only appeared since we started upgrading from Node 18.15.0 to Node 20.15.0.

Additional information

We have tried the troubleshooting approaches listed below but are still unable to resolve this issue. If anyone has encountered similar segmentation faults with Node.js 20 or has suggestions for further debugging, please share your thoughts.

  1. System Vitals Monitoring:
    ○ Monitored memory usage, agent configurations, and system vitals to identify any anomalies.
    ○ Found no indications of memory pressure or system resource limitations.
  2. Core Dump Analysis:
    ○ Installed Valgrind and a segfault handler to capture and analyze core dump logs.
    ○ Unfortunately, no meaningful insights were captured from the VM or agent machine logs.
  3. Node Version Update:
    ○ Verified and updated all Node.js native modules and dependencies for compatibility with Node.js 20.
  4. Heap Space Configuration:
    ○ Ensured no misconfigurations targeting the "new" space in the heap.
    ○ Adjusted heap settings with the following command:
      node --max-old-space-size=8192 dist/server.js
    ○ Introduced the --max-semi-space-size parameter to configure the "new" space in the heap.
  5. Node Environment Cleanup:
    ○ Removed older versions of Node.js from the environment to prevent conflicts during pipeline execution.

Any assistance or guidance would be appreciated. Please tag anyone relevant who can help us out.
@koirodev

Were you able to solve this problem?

@lforst

lforst commented Feb 12, 2025

We're also seeing this.

@Umesh-daiict
Author

Hi @riverego, @avivkeller
If possible, could you please take a look at this issue? To me, it seems similar to your issue (#54692). How were you able to solve it? Could you please help us here?

@sunilsurana

Can someone please help with how to go about investigating the error "Segmentation fault (core dumped)"?

@lforst

lforst commented Mar 26, 2025

@sunilsurana we figured this out after some help from a Node maintainer! See https://bsky.app/profile/joyeecheung.bsky.social/post/3lhy7xpe3ok2h and #51555 (comment)

Setting DISABLE_V8_COMPILE_CACHE=1 as an environment variable may fix your problem.

@sunilsurana

Thanks @lforst, will try this.

@AlekhyaYalla

AlekhyaYalla commented Apr 3, 2025

Attaching the stack trace for this issue.

 [*** build] ERROR DETECTED
ERR! started
ERR! hash: 160c7864a0129cc2c44d2ced6dac2cb8269f7584, cache hit? false
ERR! Running yarn run build
ERR! [2:15:22 PM] ■ started 'prebuild'
ERR! [2:15:22 PM] ■ started 'buildCategoryNameToLocalizedListMap'
ERR! [2:15:22 PM] ■ finished 'buildCategoryNameToLocalizedListMap' in 0s
ERR! [2:15:22 PM] ■ finished 'prebuild' in 0.01s
ERR! PID 51004 received SIGSEGV for address: 0x78
ERR! /workspaces/***/***/.store/segfault-handler@1.3.0-15f5af3b2a125f88cad7/node_modules/segfault-handler/build/Release/segfault-handler.node(+0x3340)[0x71717b091340]
ERR! /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x71717b17f520]
ERR! /home/vscode/.nvm/versions/node/v20.19.0/bin/node(uv_async_send+0x0)[0x18c54c0]
ERR! /home/vscode/.nvm/versions/node/v20.19.0/bin/node(_ZN2v88internal4wasm15AsyncCompileJob19StartForegroundTaskEv+0x69)[0x1751e49]
ERR! /home/vscode/.nvm/versions/node/v20.19.0/bin/node(_ZN2v88internal4wasm15AsyncCompileJob24CompilationStateCallback4callENS1_16CompilationEventE+0x21e)[0x1752aee]
ERR! /home/vscode/.nvm/versions/node/v20.19.0/bin/node[0x174e988]
ERR! /home/vscode/.nvm/versions/node/v20.19.0/bin/node[0x175ab04]
ERR! /home/vscode/.nvm/versions/node/v20.19.0/bin/node[0x175c0c9]
ERR! /home/vscode/.nvm/versions/node/v20.19.0/bin/node[0x175c3f8]
ERR! /home/vscode/.nvm/versions/node/v20.19.0/bin/node(_ZN2v88platform16DefaultJobWorker3RunEv+0x88)[0x1d75538]
ERR! /home/vscode/.nvm/versions/node/v20.19.0/bin/node[0xd43601]
ERR! /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x71717b1d1ac3]
ERR! /lib/x86_64-linux-gnu/libc.so.6(+0x126850)[0x71717b263850]
ERR! Segmentation fault (core dumped)
ERR! Exiting yarn run build with exit code 1
ERR! failed

It seems like WASM module compilation is failing for different packages in our monorepo. I came to that conclusion because, per the stack traces, it consistently fails in this call every time: _ZN2v88internal4wasm15AsyncCompileJob19StartForegroundTaskEv.
Also, as mentioned in this thread by @joyeecheung, we have v8-compile-cache in our repo through other dependencies (webpack-cli and eslint). I tried disabling the compilation cache too, yet the issue has not been resolved.

@joyeecheung Any inputs here based on the stack trace? We have nearly ~2000 packages in our monorepo, and every time we try to bundle a huge package that has multiple dependent packages, with the cache disabled, we land on this issue at concurrency 8. With concurrency as low as 2 or 4, it works fine. But from concurrency 8 upwards, it consistently throws this error in a random package (mostly the packages that involve WASM modules), and 90% of the time the crash stack trace is the same.

@sunilsurana

We are getting this issue as well. It would be great if we could get some help here. Disabling the v8 compilation cache did not help.

@AlekhyaYalla

@avivkeller Any inputs from the stack trace, please?

@AlekhyaYalla

AlekhyaYalla commented Apr 10, 2025

After looking at the Node code, I could see that the issue is coming from this function. Based on that, it's clear that V8 compiling WASM modules is the issue.

I have tried the flags below to check if anything helps.

--wasm-lazy-compilation
--wasm-tier-up
--no-liftoff 
--no-experimental-wasm-inlining
--no-wasm-loop-unrolling 
--turboshaft-wasm 

The error below is now showing up instead, most likely an OOM:

ERR! [8:53:31 AM] x Error: Command terminated by signal SIGTERM: /home/vscode/.nvm/versions/node/v20.19.0/bin/node --max-old-space-size=23480 --wasm-lazy-compilation --no-wasm-native-module-cache-enabled --turboshaft-wasm --no-liftoff --dns-result-order=ipv4first /workspaces/****/****/apps/*****/node_modules/webpack-cli/bin/cli.js --env suffix=DS --env flavor=debug --env forceDebug=true
ERR!     at ChildProcess.<anonymous> (/workspaces/****/****/.store/just-scripts-utils@1.2.0-ced31b28d3c02b71b138/node_modules/just-scripts-utils/lib/exec.js:98:31)
ERR!     at ChildProcess.emit (node:events:524:28)
ERR!     at ChildProcess.emit (node:domain:552:15)
ERR!     at Process.ChildProcess._handle.onexit (node:internal/child_process:293:12)
ERR!     at Process.callbackTrampoline (node:internal/async_hooks:130:17)

@joyeecheung Is there anything else you suggest we try? We are blocked, please do help.

@joyeecheung
Member

I am afraid there isn't enough information to provide specific suggestions - if you have a coredump, you can try debugging it using lldb or gdb.

@AlekhyaYalla

AlekhyaYalla commented Apr 15, 2025

@joyeecheung With gdb debugging, it seems like a nullptr is getting passed in this line, according to the stack trace. Also, the issue occurs on the Linux platform.

#0  **uv_async_send (handle=0x0)** at ../deps/uv/src/unix/async.c:73
#1  0x0000000001751e49 in v8::internal::wasm::AsyncCompileJob::StartForegroundTask() ()
#2  0x0000000001752aee in v8::internal::wasm::AsyncCompileJob::CompilationStateCallback::call(v8::internal::wasm::CompilationEvent)
    ()
#3  0x000000000174e988 in v8::internal::wasm::(anonymous namespace)::CompilationStateImpl::TriggerCallbacks() ()
#4  0x000000000175ab04 in v8::internal::wasm::(anonymous namespace)::CompilationStateImpl::SchedulePublishCompilationResults(std::vector<std::unique_ptr<v8::internal::wasm::WasmCode, std::default_delete<v8::internal::wasm::WasmCode> >, std::allocator<std::unique_ptr<v8::internal::wasm::WasmCode, std::default_delete<v8::internal::wasm::WasmCode> > > >) ()
#5  0x000000000175c0c9 in v8::internal::wasm::(anonymous namespace)::ExecuteCompilationUnits(std::weak_ptr<v8::internal::wasm::NativeModule>, v8::internal::Counters*, v8::JobDelegate*, v8::internal::wasm::(anonymous namespace)::CompilationTier) ()
#6  0x000000000175c3f8 in v8::internal::wasm::(anonymous namespace)::BackgroundCompileJob::Run(v8::JobDelegate*) ()
#7  0x0000000001d75538 in v8::platform::DefaultJobWorker::Run() ()
#8  0x0000000000d43601 in node::(anonymous namespace)::PlatformWorkerThread(void*) ()
#9  0x00007f7c7ed1aac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#10 0x00007f7c7edac850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

@codebytere
Member

This was bisected to v20.2.0...v20.3.0, and within that range to 9e68f94. I believe a likely culprit within that deps roll is libuv/libuv#3879 - @vtjnash or @bnoordhuis, do you have any ideas what might be the issue here or how it could be worked around?

@bnoordhuis
Member

#0 **uv_async_send (handle=0x0)** at ../deps/uv/src/unix/async.c:73

handle=0x0 indicates it's not a libuv issue but node passing in a nullptr.

I'm 95% sure it's flush_tasks_ in src/node_platform.cc that either hasn't been initialized yet or has been freed already. PostTask and PostDelayedTask have if (flush_tasks_ == nullptr) guards but that's unsound when called from different threads (which they are.)

My money is on a pre-existing race condition that shows up now because libuv got faster.
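
For illustration only, here is a minimal standalone sketch of the check-then-use pattern described above and the kind of lock that closes the window. The names (TaskPoster, AsyncHandle) are hypothetical stand-ins, not Node's actual node_platform.cc or libuv code:

```cpp
// Hypothetical stand-ins for uv_async_t / uv_async_send(); not real libuv types.
#include <memory>
#include <mutex>
#include <thread>

struct AsyncHandle {
  void Send() { /* wake the event loop */ }
};

class TaskPoster {
 public:
  // Called from V8 worker threads. Without the lock, another thread could run
  // Shutdown() between the nullptr check and the Send() call, so Send() would
  // touch a freed (or null) handle -- the kind of crash in the backtrace above.
  void PostTask() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (flush_tasks_ == nullptr) return;  // isolate is already being disposed
    flush_tasks_->Send();
  }

  // Called from the main thread during isolate disposal.
  void Shutdown() {
    std::lock_guard<std::mutex> lock(mutex_);
    flush_tasks_.reset();  // the handle is released under the same lock
  }

 private:
  std::mutex mutex_;
  std::unique_ptr<AsyncHandle> flush_tasks_ = std::make_unique<AsyncHandle>();
};

int main() {
  TaskPoster poster;
  std::thread worker([&] { for (int i = 0; i < 1000; ++i) poster.PostTask(); });
  poster.Shutdown();  // safe: PostTask() now sees either a live handle or nullptr
  worker.join();
}
```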

@codebytere
Member

codebytere commented Apr 16, 2025

Would an appropriate potential solution be something like this? If yes, I'm happy to PR that.

diff --git a/src/node_platform.cc b/src/node_platform.cc
index 743ac069ad5..e659d75b718 100644
--- a/src/node_platform.cc
+++ b/src/node_platform.cc
@@ -252,6 +252,7 @@ void PerIsolatePlatformData::PostIdleTaskImpl(std::unique_ptr<v8::IdleTask> task
 
 void PerIsolatePlatformData::PostTaskImpl(std::unique_ptr<Task> task,
                                           const v8::SourceLocation& location) {
+  Mutex::ScopedLock lock(flush_tasks_mutex_);
   if (flush_tasks_ == nullptr) {
     // V8 may post tasks during Isolate disposal. In that case, the only
     // sensible path forward is to discard the task.
@@ -265,6 +266,7 @@ void PerIsolatePlatformData::PostDelayedTaskImpl(
     std::unique_ptr<Task> task,
     double delay_in_seconds,
     const v8::SourceLocation& location) {
+  Mutex::ScopedLock lock(flush_tasks_mutex_);
   if (flush_tasks_ == nullptr) {
     // V8 may post tasks during Isolate disposal. In that case, the only
     // sensible path forward is to discard the task.
@@ -300,6 +302,7 @@ void PerIsolatePlatformData::AddShutdownCallback(void (*callback)(void*),
 }
 
 void PerIsolatePlatformData::Shutdown() {
+  Mutex::ScopedLock lock(flush_tasks_mutex_);
   if (flush_tasks_ == nullptr)
     return;
 
diff --git a/src/node_platform.h b/src/node_platform.h
index 0a99f5b4b5e..b18fc11bc71 100644
--- a/src/node_platform.h
+++ b/src/node_platform.h
@@ -105,6 +105,7 @@ class PerIsolatePlatformData :
 
   v8::Isolate* const isolate_;
   uv_loop_t* const loop_;
+  Mutex flush_tasks_mutex_;
   uv_async_t* flush_tasks_ = nullptr;
   TaskQueue<v8::Task> foreground_tasks_;
   TaskQueue<DelayedTask> foreground_delayed_tasks_;

also cc @joyeecheung

@bnoordhuis
Member

TaskQueue (what PostTask and PostDelayedTask push into) also takes out a lock. That makes it a little too easy to end up with ABBA deadlocks.

It'd be better to come up with an abstraction that turns TaskQueue inside out, where you can only get at the inner queue when you lock the mutex first, like how Rust's std::sync::Mutex works.
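
As a rough illustration of that idea (hypothetical names, not an actual Node API): a wrapper whose inner queue is only reachable through a guard returned by Lock(), so forgetting the mutex becomes a compile error rather than a race, and the check and the mutation always happen under the same critical section.

```cpp
// Sketch of a lock-first queue wrapper in the spirit of Rust's std::sync::Mutex.
#include <cstdio>
#include <mutex>
#include <queue>

template <typename T>
class LockedQueue {
 public:
  // Guard holds the mutex for its whole lifetime and is the only way in.
  class Guard {
   public:
    explicit Guard(LockedQueue* owner) : owner_(owner), lock_(owner->mutex_) {}
    std::queue<T>& operator*() { return owner_->queue_; }
    std::queue<T>* operator->() { return &owner_->queue_; }

   private:
    LockedQueue* owner_;
    std::unique_lock<std::mutex> lock_;
  };

  Guard Lock() { return Guard(this); }

 private:
  std::mutex mutex_;
  std::queue<T> queue_;  // unreachable without going through Lock()
};

int main() {
  LockedQueue<int> tasks;
  {
    auto q = tasks.Lock();  // mutex acquired here
    q->push(42);
  }                         // mutex released when the guard goes out of scope
  std::printf("queued %zu task(s)\n", tasks.Lock()->size());
}
```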
