Node.js 20 Upgrade: Segmentation Fault Core Dump During Pipeline Lage Build Step #56236
Comments
Were you able to solve this problem?
We're also seeing this.
Hi @riverego, @avivkeller
Can someone please help with how to go about investigating the error Segmentation fault (core dumped)?
@sunilsurana we figured this out after some help from a Node maintainer! See https://bsky.app/profile/joyeecheung.bsky.social/post/3lhy7xpe3ok2h and #51555 (comment). Setting
Thanks @lforst, will try this.
Attaching the stack trace for this issue.
It seems like wasm module compilation is failing for different packages in our monorepo. I came to that conclusion because I can see that it consistently fails in this call every time: @joyeecheung any inputs here based on the stack trace? We have nearly ~2000 packages in our monorepo, and every time we try to bundle a huge package that has multiple dependent packages, with cache disabled, we hit this issue with concurrency 8. With lower concurrency, as low as 2 or 4, it works fine.
We are getting this issue as well. It would be great if we could get some help here. Setting the v8 compilation cache to false did not help.
@avivkeller any inputs from the stack trace, please?
After looking at the Node code, I can see that the issue is coming from this function. Based on that, it's clear that v8 compiling wasm modules is the problem. I have tried the below flags to check if anything helps.
The below error occurs, most likely an OOM.
@joyeecheung is there anything else you suggest trying? We are blocked, please do help.
I am afraid there isn't enough information to provide specific suggestions. If you have a core dump, you can try debugging it using lldb or gdb.
@joyeecheung With gdb debugging, it seems a nullptr is getting passed on this line according to the stack trace. Also, the issue occurs on the Linux platform.
This was bisected to v20.2.0...v20.3.0 and, within that range, to 9e68f94. I believe within that deps roll a likely culprit is libuv/libuv#3879. @vtjnash or @bnoordhuis, do you have any ideas what the issue might be here or how it could be worked around?
handle=0x0 indicates it's not a libuv issue but Node passing in a nullptr. I'm 95% sure it's a pre-existing race condition that shows up now because libuv got faster.
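To make the suspected race concrete, here is a minimal standalone sketch (hypothetical code, not Node's implementation; FakeAsyncHandle, PostTask, and Shutdown are simplified stand-ins for the flush_tasks_ handle and the methods discussed above) of a check-then-use pattern where a shutdown path clears a handle while another thread is still posting:

```cpp
#include <thread>

// Standalone sketch of the suspected check-then-use race.
struct FakeAsyncHandle { int pending = 0; };

struct PlatformDataSketch {
  FakeAsyncHandle* flush_tasks_ = new FakeAsyncHandle();

  // Racy: nothing prevents Shutdown() from running between the null check
  // and the dereference, so this may touch a handle that was just freed.
  void PostTask() {
    if (flush_tasks_ == nullptr) return;  // check
    flush_tasks_->pending++;              // use
  }

  // Shutdown frees the handle and clears the pointer; a concurrent PostTask()
  // can observe this transition halfway and crash, which is what taking a
  // lock around both paths (as in the patch below) would prevent.
  void Shutdown() {
    delete flush_tasks_;
    flush_tasks_ = nullptr;
  }
};

int main() {
  PlatformDataSketch data;
  std::thread poster([&] {
    for (int i = 0; i < 1000000; i++) data.PostTask();
  });
  data.Shutdown();  // races with PostTask() above; may segfault
  poster.join();
}
```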
Would an appropriate potential solution be something like this? If yes, I'm happy to PR that.
diff --git a/src/node_platform.cc b/src/node_platform.cc
index 743ac069ad5..e659d75b718 100644
--- a/src/node_platform.cc
+++ b/src/node_platform.cc
@@ -252,6 +252,7 @@ void PerIsolatePlatformData::PostIdleTaskImpl(std::unique_ptr<v8::IdleTask> task
void PerIsolatePlatformData::PostTaskImpl(std::unique_ptr<Task> task,
const v8::SourceLocation& location) {
+ Mutex::ScopedLock lock(flush_tasks_mutex_);
if (flush_tasks_ == nullptr) {
// V8 may post tasks during Isolate disposal. In that case, the only
// sensible path forward is to discard the task.
@@ -265,6 +266,7 @@ void PerIsolatePlatformData::PostDelayedTaskImpl(
std::unique_ptr<Task> task,
double delay_in_seconds,
const v8::SourceLocation& location) {
+ Mutex::ScopedLock lock(flush_tasks_mutex_);
if (flush_tasks_ == nullptr) {
// V8 may post tasks during Isolate disposal. In that case, the only
// sensible path forward is to discard the task.
@@ -300,6 +302,7 @@ void PerIsolatePlatformData::AddShutdownCallback(void (*callback)(void*),
}
void PerIsolatePlatformData::Shutdown() {
+ Mutex::ScopedLock lock(flush_tasks_mutex_);
if (flush_tasks_ == nullptr)
return;
diff --git a/src/node_platform.h b/src/node_platform.h
index 0a99f5b4b5e..b18fc11bc71 100644
--- a/src/node_platform.h
+++ b/src/node_platform.h
@@ -105,6 +105,7 @@ class PerIsolatePlatformData :
v8::Isolate* const isolate_;
uv_loop_t* const loop_;
+ Mutex flush_tasks_mutex_;
uv_async_t* flush_tasks_ = nullptr;
TaskQueue<v8::Task> foreground_tasks_;
TaskQueue<DelayedTask> foreground_delayed_tasks_;
also cc @joyeecheung
TaskQueue (what PostTask and PostDelayedTask push into) also takes out a lock. That makes it a little too easy to end up with ABBA deadlocks. It'd be better to come up with an abstraction that turns TaskQueue inside out, where you can only get at the inner queue when you lock the mutex first, like how Rust's std::sync::Mutex works.
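As an illustration of that suggestion, here is a minimal sketch of such an "inside out" locked container, assuming C++17; Locked and Guard are invented names for this example, not existing Node types:

```cpp
#include <mutex>
#include <queue>

// Sketch only: the inner container is private, and the only way to reach it
// is through a Guard that holds the mutex for its whole lifetime, similar to
// the MutexGuard handed out by Rust's std::sync::Mutex.
template <typename T>
class Locked {
 public:
  class Guard {
   public:
    explicit Guard(Locked& parent)
        : lock_(parent.mutex_), data_(parent.data_) {}
    T* operator->() { return &data_; }
    T& operator*() { return data_; }

   private:
    std::unique_lock<std::mutex> lock_;  // released when the Guard dies
    T& data_;
  };

  // Taking the lock is the only entry point; there is no unlocked accessor.
  Guard Lock() { return Guard(*this); }

 private:
  std::mutex mutex_;
  T data_;
};

int main() {
  Locked<std::queue<int>> tasks;
  tasks.Lock()->push(42);      // lock taken and released in one statement
  auto guard = tasks.Lock();   // or held across several operations
  int next = guard->front();
  guard->pop();
  (void)next;
}
```

Because the queue cannot be named without first producing a Guard, every access is serialized by construction and the lock scope is visible at each call site.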
Version
20.15.0
Platform
Subsystem
No response
What steps will reproduce the bug?
yarn lage build bundle stage-deployment --concurrency 32
How often does it reproduce? Is there a required condition?
The issue with the segmentation fault during the yarn lage build bundle step occurs intermittently. Sometimes the build and bundle process passes successfully, while other times it results in a core dump. There is no specific condition required for the issue to reproduce.
What is the expected behavior? Why is that the expected behavior?
The expected behavior is for the yarn lage build bundle command to successfully build and bundle the project without encountering segmentation faults. This is expected because the same process worked correctly with Node.js 18, and the upgrade to Node.js 20 should not introduce such critical issues.
What do you see instead?
We are encountering a segmentation fault (core dump) when running our pipeline after upgrading to Node.js 20. This issue specifically arises during the execution of the yarn lage build bundle steps on a Linux-based CI agent. The core dump logs do not provide sufficient insight into the root cause. This behaviour was not observed when running with Node.js 18; it has only been seen since we started trying to upgrade from Node 18.15.0 to Node 20.15.0.
Additional information
We have tried the troubleshooting approaches listed below but are still unable to resolve this issue. If anyone has encountered similar segmentation faults with Node.js 20 or has suggestions for further debugging, please share your thoughts.
1. Resource Monitoring:
○ Monitored memory usage, agent configurations, and system vitals to identify any anomalies.
○ Found no indications of memory pressure or system resource limitations.
2. Core Dump Analysis:
○ Installed Valgrind and a segfault handler to capture and analyze core dump logs.
○ Unfortunately, no meaningful insights were captured from the VM or agent machine logs.
3. Node Version Update:
○ Verified and updated all Node.js native modules and dependencies for compatibility with Node.js 20.
4. Heap Space Configuration:
○ Ensured no misconfigurations targeting the "new" space in the heap.
○ Adjusted heap settings with the following commands:
node --max-old-space-size=8192 dist/server.js
○ Introduced the --max-semi-space-size parameter to configure the "new" space in the heap.
5. Node Environment Cleanup:
○ Removed older versions of Node.js from the environment to prevent conflicts during pipeline execution.
Any assistance or guidance would be appreciated. Please tag anyone relevant who can help us out.