
Node.js 20 Upgrade: Segmentation Fault Core Dump During Pipeline Lage Build Step #56236


Open
Umesh-daiict opened this issue Dec 12, 2024 · 16 comments


@Umesh-daiict

Umesh-daiict commented Dec 12, 2024

Version

20.15.0

Platform

linux-gnu

Subsystem

No response

What steps will reproduce the bug?

  1. Use a Linux-based CI agent.
  2. Execute the following pipeline commands:
    yarn lage build bundle stage-deployment --concurrency 32
  3. Observe the segmentation fault during execution.

How often does it reproduce? Is there a required condition?

The segmentation fault during the yarn lage build bundle step occurs intermittently. Sometimes the build and bundle process passes successfully, while other times it results in a core dump. We have not identified any specific condition required for the issue to reproduce.

What is the expected behavior? Why is that the expected behavior?

The expected behavior is for the yarn lage build bundle command to successfully build and bundle the project without encountering segmentation faults. This is expected because the same process worked correctly with Node.js 18, and the upgrade to Node.js 20 should not introduce such critical issues.

What do you see instead?

We are encountering a segmentation fault (core dump) when running our pipeline after upgrading to Node.js 20. The issue arises specifically during the yarn lage build bundle steps on a Linux-based CI agent, and the core dump logs do not provide sufficient insight into the root cause. This behaviour was not observed with Node.js 18; it has only appeared since we started upgrading from Node 18.15.0 to Node 20.15.0.

Additional information

We have tried the troubleshooting approaches listed below but are still unable to resolve this issue. If anyone has encountered similar segmentation faults with Node.js 20 or has suggestions for further debugging, please share your thoughts.

  1. System Vitals Monitoring:
    ○ Monitored memory usage, agent configurations, and system vitals to identify any anomalies.
    ○ Found no indications of memory pressure or system resource limitations.
  2. Core Dump Analysis:
    ○ Installed Valgrind and a segfault handler to capture and analyze core dump logs.
    ○ Unfortunately, no meaningful insights were captured from the VM or agent machine logs.
  3. Node Version Update:
    ○ Verified and updated all Node.js native modules and dependencies for compatibility with Node.js 20.
  4. Heap Space Configuration:
    ○ Ensured no misconfigurations targeting the "new" space in the heap.
    ○ Adjusted heap settings with the following command:
      node --max-old-space-size=8192 dist/server.js
    ○ Introduced the --max-semi-space-size parameter to configure the "new" space in the heap.
  5. Node Environment Cleanup:
    ○ Removed older versions of Node.js from the environment to prevent conflicts during pipeline execution.

Any assistance or guidance would be appreciated. Please tag anyone relevant who can help us out.
@koirodev

Were you able to solve this problem?

@lforst

lforst commented Feb 12, 2025

We're also seeing this.

@Umesh-daiict
Author

Hi @riverego, @avivkeller
If possible, could you please take a look at this issue? To me, it seems similar to your issue (#54692). How were you able to solve it? Could you please help us here?

@sunilsurana

Can someone please help with how to go about investigating the error "Segmentation fault (core dumped)"?

@lforst

lforst commented Mar 26, 2025

@sunilsurana we figured this out after some help from a Node maintainer! See https://bsky.app/profile/joyeecheung.bsky.social/post/3lhy7xpe3ok2h and #51555 (comment)

Setting DISABLE_V8_COMPILE_CACHE=1 as an environment variable may fix your problem.

@sunilsurana

Thanks @lforst, will try this.

@AlekhyaYalla

AlekhyaYalla commented Apr 3, 2025

Attaching the stack trace for this issue.

 [*** build] ERROR DETECTED
ERR! started
ERR! hash: 160c7864a0129cc2c44d2ced6dac2cb8269f7584, cache hit? false
ERR! Running yarn run build
ERR! [2:15:22 PM] ■ started 'prebuild'
ERR! [2:15:22 PM] ■ started 'buildCategoryNameToLocalizedListMap'
ERR! [2:15:22 PM] ■ finished 'buildCategoryNameToLocalizedListMap' in 0s
ERR! [2:15:22 PM] ■ finished 'prebuild' in 0.01s
ERR! PID 51004 received SIGSEGV for address: 0x78
ERR! /workspaces/***/***/.store/segfault-handler@1.3.0-15f5af3b2a125f88cad7/node_modules/segfault-handler/build/Release/segfault-handler.node(+0x3340)[0x71717b091340]
ERR! /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x71717b17f520]
ERR! /home/vscode/.nvm/versions/node/v20.19.0/bin/node(uv_async_send+0x0)[0x18c54c0]
ERR! /home/vscode/.nvm/versions/node/v20.19.0/bin/node(_ZN2v88internal4wasm15AsyncCompileJob19StartForegroundTaskEv+0x69)[0x1751e49]
ERR! /home/vscode/.nvm/versions/node/v20.19.0/bin/node(_ZN2v88internal4wasm15AsyncCompileJob24CompilationStateCallback4callENS1_16CompilationEventE+0x21e)[0x1752aee]
ERR! /home/vscode/.nvm/versions/node/v20.19.0/bin/node[0x174e988]
ERR! /home/vscode/.nvm/versions/node/v20.19.0/bin/node[0x175ab04]
ERR! /home/vscode/.nvm/versions/node/v20.19.0/bin/node[0x175c0c9]
ERR! /home/vscode/.nvm/versions/node/v20.19.0/bin/node[0x175c3f8]
ERR! /home/vscode/.nvm/versions/node/v20.19.0/bin/node(_ZN2v88platform16DefaultJobWorker3RunEv+0x88)[0x1d75538]
ERR! /home/vscode/.nvm/versions/node/v20.19.0/bin/node[0xd43601]
ERR! /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x71717b1d1ac3]
ERR! /lib/x86_64-linux-gnu/libc.so.6(+0x126850)[0x71717b263850]
ERR! Segmentation fault (core dumped)
ERR! Exiting yarn run build with exit code 1
ERR! failed

It seems like WASM module compilation is failing for different packages in our monorepo. I came to that conclusion because, per the stack traces, it consistently fails in this call every time: _ZN2v88internal4wasm15AsyncCompileJob19StartForegroundTaskEv.
Also, as mentioned in this thread by @joyeecheung, we have v8-compile-cache in our repo through other dependencies (webpack-cli and eslint). I tried disabling the compilation cache too, yet the issue has not been resolved.

@joyeecheung Any inputs here based on the stack trace? We have nearly ~2000 packages in our monorepo, and every time we try to bundle a huge package that has multiple dependent packages, with the cache disabled, we land on this issue at concurrency 8. With concurrency as low as 2 or 4, it works fine. But from concurrency 8 upwards, it consistently throws this error in a random package (mostly the packages that involve WASM modules), and 90% of the time the crash stack trace is the same.

@sunilsurana

We are getting this issue as well. It would be great if we could get some help here. Disabling the v8 compilation cache did not help.

@AlekhyaYalla

@avivkeller Any inputs from the stack trace, please?

@AlekhyaYalla

AlekhyaYalla commented Apr 10, 2025

After looking at the Node code, I could see that the issue is coming from this function. Based on that, it's clear that V8 compiling WASM modules is the issue.

I have tried the flags below to check if anything helps.

--wasm-lazy-compilation
--wasm-tier-up
--no-liftoff 
--no-experimental-wasm-inlining
--no-wasm-loop-unrolling 
--turboshaft-wasm 

The error below is now showing up instead, most likely an OOM:

ERR! [8:53:31 AM] x Error: Command terminated by signal SIGTERM: /home/vscode/.nvm/versions/node/v20.19.0/bin/node --max-old-space-size=23480 --wasm-lazy-compilation --no-wasm-native-module-cache-enabled --turboshaft-wasm --no-liftoff --dns-result-order=ipv4first /workspaces/****/****/apps/*****/node_modules/webpack-cli/bin/cli.js --env suffix=DS --env flavor=debug --env forceDebug=true
ERR!     at ChildProcess.<anonymous> (/workspaces/****/****/.store/just-scripts-utils@1.2.0-ced31b28d3c02b71b138/node_modules/just-scripts-utils/lib/exec.js:98:31)
ERR!     at ChildProcess.emit (node:events:524:28)
ERR!     at ChildProcess.emit (node:domain:552:15)
ERR!     at Process.ChildProcess._handle.onexit (node:internal/child_process:293:12)
ERR!     at Process.callbackTrampoline (node:internal/async_hooks:130:17)

@joyeecheung Is there anything else you suggest we try? We are blocked, please do help.

@joyeecheung
Member

I am afraid there isn't enough information to provide specific suggestions - if you have a coredump, you can try debugging it using lldb or gdb.

@AlekhyaYalla

AlekhyaYalla commented Apr 15, 2025

@joyeecheung With gdb debugging, it seems like a nullptr is getting passed in this line, according to the stack trace. Also, the issue occurs on the Linux platform.

#0  **uv_async_send (handle=0x0)** at ../deps/uv/src/unix/async.c:73
#1  0x0000000001751e49 in v8::internal::wasm::AsyncCompileJob::StartForegroundTask() ()
#2  0x0000000001752aee in v8::internal::wasm::AsyncCompileJob::CompilationStateCallback::call(v8::internal::wasm::CompilationEvent)
    ()
#3  0x000000000174e988 in v8::internal::wasm::(anonymous namespace)::CompilationStateImpl::TriggerCallbacks() ()
#4  0x000000000175ab04 in v8::internal::wasm::(anonymous namespace)::CompilationStateImpl::SchedulePublishCompilationResults(std::vector<std::unique_ptr<v8::internal::wasm::WasmCode, std::default_delete<v8::internal::wasm::WasmCode> >, std::allocator<std::unique_ptr<v8::internal::wasm::WasmCode, std::default_delete<v8::internal::wasm::WasmCode> > > >) ()
#5  0x000000000175c0c9 in v8::internal::wasm::(anonymous namespace)::ExecuteCompilationUnits(std::weak_ptr<v8::internal::wasm::NativeModule>, v8::internal::Counters*, v8::JobDelegate*, v8::internal::wasm::(anonymous namespace)::CompilationTier) ()
#6  0x000000000175c3f8 in v8::internal::wasm::(anonymous namespace)::BackgroundCompileJob::Run(v8::JobDelegate*) ()
#7  0x0000000001d75538 in v8::platform::DefaultJobWorker::Run() ()
#8  0x0000000000d43601 in node::(anonymous namespace)::PlatformWorkerThread(void*) ()
#9  0x00007f7c7ed1aac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#10 0x00007f7c7edac850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

@codebytere
Member

This was bisected to v20.2.0...v20.3.0, and within that range to 9e68f94. I believe a likely culprit within that deps roll is libuv/libuv#3879 - @vtjnash or @bnoordhuis, do you have any ideas what might be the issue here or how it could be worked around?

@bnoordhuis
Member

#0 **uv_async_send (handle=0x0)** at ../deps/uv/src/unix/async.c:73

handle=0x0 indicates it's not a libuv issue but node passing in a nullptr.

I'm 95% sure it's flush_tasks_ in src/node_platform.cc that either hasn't been initialized yet or has been freed already. PostTask and PostDelayedTask have if (flush_tasks_ == nullptr) guards but that's unsound when called from different threads (which they are.)

My money is on a pre-existing race condition that shows up now because libuv got faster.
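
For illustration only, here is a minimal standalone sketch of the check-then-use pattern described above and the kind of lock that closes the window. The names (TaskPoster, AsyncHandle) are hypothetical stand-ins, not Node's actual node_platform.cc or libuv code:

```cpp
// Hypothetical stand-ins for uv_async_t / uv_async_send(); not real libuv types.
#include <memory>
#include <mutex>
#include <thread>

struct AsyncHandle {
  void Send() { /* wake the event loop */ }
};

class TaskPoster {
 public:
  // Called from V8 worker threads. Without the lock, another thread could run
  // Shutdown() between the nullptr check and the Send() call, so Send() would
  // touch a freed (or null) handle -- the kind of crash in the backtrace above.
  void PostTask() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (flush_tasks_ == nullptr) return;  // isolate is already being disposed
    flush_tasks_->Send();
  }

  // Called from the main thread during isolate disposal.
  void Shutdown() {
    std::lock_guard<std::mutex> lock(mutex_);
    flush_tasks_.reset();  // the handle is released under the same lock
  }

 private:
  std::mutex mutex_;
  std::unique_ptr<AsyncHandle> flush_tasks_ = std::make_unique<AsyncHandle>();
};

int main() {
  TaskPoster poster;
  std::thread worker([&] { for (int i = 0; i < 1000; ++i) poster.PostTask(); });
  poster.Shutdown();  // safe: PostTask() now sees either a live handle or nullptr
  worker.join();
}
```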

@codebytere
Member

codebytere commented Apr 16, 2025

Would an appropriate potential solution be something like this? If yes, I'm happy to PR that.

diff --git a/src/node_platform.cc b/src/node_platform.cc
index 743ac069ad5..e659d75b718 100644
--- a/src/node_platform.cc
+++ b/src/node_platform.cc
@@ -252,6 +252,7 @@ void PerIsolatePlatformData::PostIdleTaskImpl(std::unique_ptr<v8::IdleTask> task
 
 void PerIsolatePlatformData::PostTaskImpl(std::unique_ptr<Task> task,
                                           const v8::SourceLocation& location) {
+  Mutex::ScopedLock lock(flush_tasks_mutex_);
   if (flush_tasks_ == nullptr) {
     // V8 may post tasks during Isolate disposal. In that case, the only
     // sensible path forward is to discard the task.
@@ -265,6 +266,7 @@ void PerIsolatePlatformData::PostDelayedTaskImpl(
     std::unique_ptr<Task> task,
     double delay_in_seconds,
     const v8::SourceLocation& location) {
+  Mutex::ScopedLock lock(flush_tasks_mutex_);
   if (flush_tasks_ == nullptr) {
     // V8 may post tasks during Isolate disposal. In that case, the only
     // sensible path forward is to discard the task.
@@ -300,6 +302,7 @@ void PerIsolatePlatformData::AddShutdownCallback(void (*callback)(void*),
 }
 
 void PerIsolatePlatformData::Shutdown() {
+  Mutex::ScopedLock lock(flush_tasks_mutex_);
   if (flush_tasks_ == nullptr)
     return;
 
diff --git a/src/node_platform.h b/src/node_platform.h
index 0a99f5b4b5e..b18fc11bc71 100644
--- a/src/node_platform.h
+++ b/src/node_platform.h
@@ -105,6 +105,7 @@ class PerIsolatePlatformData :
 
   v8::Isolate* const isolate_;
   uv_loop_t* const loop_;
+  Mutex flush_tasks_mutex_;
   uv_async_t* flush_tasks_ = nullptr;
   TaskQueue<v8::Task> foreground_tasks_;
   TaskQueue<DelayedTask> foreground_delayed_tasks_;

also cc @joyeecheung

@bnoordhuis
Member

TaskQueue (what PostTask and PostDelayedTask push into) also takes out a lock. That makes it a little too easy to end up with ABBA deadlocks.

It'd be better to come up with an abstraction that turns TaskQueue inside out, where you can only get at the inner queue when you lock the mutex first, like how Rust's std::sync::Mutex works.
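
As a rough illustration of that idea (hypothetical names, not an actual Node API): a wrapper whose inner queue is only reachable through a guard returned by Lock(), so forgetting the mutex becomes a compile error rather than a race, and the check and the mutation always happen under the same critical section.

```cpp
// Sketch of a lock-first queue wrapper in the spirit of Rust's std::sync::Mutex.
#include <cstdio>
#include <mutex>
#include <queue>

template <typename T>
class LockedQueue {
 public:
  // Guard holds the mutex for its whole lifetime and is the only way in.
  class Guard {
   public:
    explicit Guard(LockedQueue* owner) : owner_(owner), lock_(owner->mutex_) {}
    std::queue<T>& operator*() { return owner_->queue_; }
    std::queue<T>* operator->() { return &owner_->queue_; }

   private:
    LockedQueue* owner_;
    std::unique_lock<std::mutex> lock_;
  };

  Guard Lock() { return Guard(this); }

 private:
  std::mutex mutex_;
  std::queue<T> queue_;  // unreachable without going through Lock()
};

int main() {
  LockedQueue<int> tasks;
  {
    auto q = tasks.Lock();  // mutex acquired here
    q->push(42);
  }                         // mutex released when the guard goes out of scope
  std::printf("queued %zu task(s)\n", tasks.Lock()->size());
}
```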
