MacOS: use-after-free during debuginfo registration leads to segfault #44562
Comments
I see this on x86 macOS on Intel Macs too. Could this be related to #42836?
A very minor part of the fix:

diff --git a/src/signals-mach.c b/src/signals-mach.c
index 130603931c..b8bf25e84e 100644
--- a/src/signals-mach.c
+++ b/src/signals-mach.c
@@ -326,6 +326,7 @@ kern_return_t catch_exception_raise(mach_port_t exception_port,
return KERN_SUCCESS;
}
else {
+ thread0_exit_count++;
jl_exit_thread0(128 + SIGSEGV, NULL, 0);
return KERN_SUCCESS;
}

I would need access to the M1 to see what the disassembly and registers look like at that crash site.
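For anyone decoding CI logs: the 128 + SIGSEGV argument passed to jl_exit_thread0 in the signals-mach.c diff appears to follow the shell convention that a process killed by signal N reports exit status 128 + N. Assuming that value becomes the worker's exit status, a crash on this path would surface as 139, since SIGSEGV is 11 on macOS. The helpers below are a hypothetical sketch of that mapping, not Julia functions.

```cpp
// Hypothetical helpers (not part of Julia) for the "128 + signal number"
// exit-status convention used by POSIX shells.
#include <cassert>
#include <csignal>  // SIGSEGV and friends, for callers of these helpers

// Exit status reported for a process terminated by signal `sig`.
inline int exit_status_for_signal(int sig) {
    assert(sig > 0 && sig < 128 && "signal number out of range");
    return 128 + sig;
}

// Inverse mapping: the signal number encoded in `status`, or 0 when the
// status does not follow the 128 + N convention (0 is never a signal number).
inline int signal_from_exit_status(int status) {
    return (status > 128 && status < 256) ? status - 128 : 0;
}
```

For example, exit_status_for_signal(SIGSEGV) yields 139 on macOS and Linux, where SIGSEGV is 11, and signal_from_exit_status(139) maps it back to SIGSEGV.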
I am trying to catch it again in lldb, but it isn't cooperating :)
@vtjnash do you want me to run other commands?
Hm, those numbers disagree in lldb. One states the exception happened while writing to
It's becoming clear to me that this issue has nothing to do with the actual code being run. For example, here's a case where it triggers on a worker that is running the p7zip_jll tests. That test suite is about as simple as it gets: https://github.com/JuliaLang/julia/blob/master/stdlib/p7zip_jll/test/runtests.jl#L6
My guess is that this actually happens during process teardown, as the distributed workers exit after finishing their work. Notice in the log above how many different workers all hit the issue at the same time, as they run out of work items to process.
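If the teardown hypothesis is right, the crash fits a classic pattern: state shared with another thread (here, the JIT's debug-info tables) is destroyed or freed during exit while a different thread still uses it. One common mitigation, sketched below purely as an illustration, is a registry that is allocated once and intentionally never destroyed, so no static destructor can free it out from under a late caller. CodeRangeRegistry is a hypothetical name; nothing in this thread says Julia does this.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <mutex>

// Hypothetical registry of JIT-emitted code ranges (start address -> size).
// Illustrative only; not one of Julia's real data structures.
class CodeRangeRegistry {
public:
    // Leaky singleton: constructed on first use (function-local static
    // initialization is thread-safe since C++11) and deliberately never
    // deleted, so no static destructor can free it while another thread is
    // still registering or looking up code during process exit.
    static CodeRangeRegistry &instance() {
        static CodeRangeRegistry *registry = new CodeRangeRegistry();
        return *registry;
    }

    // Record (or overwrite) the range starting at `start`.
    void add(std::uintptr_t start, std::size_t size) {
        assert(size > 0 && "empty code range");
        std::lock_guard<std::mutex> lock(mutex_);
        ranges_[start] = size;
    }

    // Number of distinct start addresses recorded so far.
    std::size_t count() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return ranges_.size();
    }

private:
    CodeRangeRegistry() = default;
    mutable std::mutex mutex_;
    std::map<std::uintptr_t, std::size_t> ranges_;
};
```

This only protects the registry itself; anything an entry points into (such as a loaded object file) must still outlive that entry, and the registry's memory is reclaimed only by the OS at exit.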
I got it to segfault inside lldb, but I forgot to attach the child processes :(
I can trigger this locally by just doing:
It takes a while though. Now trying to catch it in LLDB using Expect:
Seems like the segfault cascade is caused by the
EDIT: putting a
So it's a double free or something similar in LLVM during process cleanup. Maybe ASAN could help here.
Finally got an ASAN build, and it caught the following:
Not sure it's related yet. Would be so much easier to debug with
Another one in the debuginfo registration:
Does anybody know why the ASAN reports here are truncated?
So I'm not sure what's wrong, but it's definitely related to the two use-after-frees that ASAN reported in the debuginfo registration code. With that code disabled, I can't reproduce the segfaults:

diff --git a/src/debuginfo.cpp b/src/debuginfo.cpp
index 0e246160a3..aabd6c96c0 100644
--- a/src/debuginfo.cpp
+++ b/src/debuginfo.cpp
@@ -123,7 +123,7 @@ static std::string mangle(StringRef Name, const DataLayout &DL)
}
void jl_add_code_in_flight(StringRef name, jl_code_instance_t *codeinst, const DataLayout &DL)
{
- codeinst_in_flight[mangle(name, DL)] = codeinst;
+ //codeinst_in_flight[mangle(name, DL)] = codeinst;
}
@@ -363,20 +363,20 @@ public:
codeinst = codeinst_it->second;
codeinst_in_flight.erase(codeinst_it);
}
- jl_profile_atomic([&]() {
- if (codeinst)
- linfomap[Addr] = std::make_pair(Size, codeinst->def);
- if (first) {
- ObjectInfo tmp = {&Object,
- (size_t)SectionSize,
- (ptrdiff_t)(SectionAddr - SectionLoadAddr),
- *Section,
- nullptr,
- };
- objectmap[SectionLoadAddr] = tmp;
- first = false;
- }
- });
+ // jl_profile_atomic([&]() {
+ // if (codeinst)
+ // linfomap[Addr] = std::make_pair(Size, codeinst->def);
+ // if (first) {
+ // ObjectInfo tmp = {&Object,
+ // (size_t)SectionSize,
+ // (ptrdiff_t)(SectionAddr - SectionLoadAddr),
+ // *Section,
+ // nullptr,
+ // };
+ // objectmap[SectionLoadAddr] = tmp;
+ // first = false;
+ // }
+ // });
}
jl_gc_safe_leave(ptls, gc_state);
}
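The diff above disables two steps of debuginfo registration: jl_add_code_in_flight records a code instance under its mangled name before emission, and a later step, once the object has been loaded, moves it out of codeinst_in_flight and records ranges in linfomap and objectmap. Note that ObjectInfo keeps &Object, a pointer to an object handed to that callback, plus a copy of *Section. If the loader frees that object after the callback returns, any later read through those entries is a use-after-free; that is one plausible reading of the ASAN reports, not something this thread confirms. The sketch below uses hypothetical names, with standard-library stand-ins for the Julia and LLVM types (std::mutex in place of jl_profile_atomic), to show the same two-phase pattern storing an owned snapshot instead of a borrowed pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for jl_code_instance_t; assumed to be kept alive
// elsewhere (e.g. rooted by the runtime) for as long as the registry uses it.
struct CodeInstance { std::string def; };

// Owned snapshot of what we need from a loaded object. Copying the bytes,
// instead of storing a pointer to the loader's object, keeps the entry valid
// even after the loader frees its own buffer.
struct ObjectSnapshot {
    std::vector<char> bytes;
    std::size_t section_size;
    std::ptrdiff_t slide;  // SectionAddr - SectionLoadAddr
};

class DebugInfoRegistry {
public:
    // Phase 1: before emission, remember which code instance a symbol belongs to.
    void add_in_flight(const std::string &mangled, const CodeInstance *ci) {
        std::lock_guard<std::mutex> lock(mutex_);
        in_flight_[mangled] = ci;
    }

    // Phase 2: after loading, move the entry out of the in-flight map and
    // record its address range plus an owned snapshot of the object.
    void on_object_loaded(const std::string &mangled, std::uintptr_t addr,
                          std::size_t size, std::uintptr_t section_load_addr,
                          ObjectSnapshot snapshot) {
        assert(size > 0 && "empty code range");
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = in_flight_.find(mangled);
        if (it != in_flight_.end()) {
            linfo_[addr] = std::make_pair(size, it->second);
            in_flight_.erase(it);
        }
        objects_[section_load_addr] = std::move(snapshot);
    }

    // Find the code instance whose range contains `pc`, or nullptr.
    const CodeInstance *lookup(std::uintptr_t pc) const {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = linfo_.upper_bound(pc);  // first range starting after pc
        if (it == linfo_.begin()) return nullptr;
        --it;                               // last range starting at or before pc
        return pc - it->first < it->second.first ? it->second.second : nullptr;
    }

    std::size_t in_flight_count() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return in_flight_.size();
    }

private:
    mutable std::mutex mutex_;
    std::map<std::string, const CodeInstance *> in_flight_;
    std::map<std::uintptr_t, std::pair<std::size_t, const CodeInstance *>> linfo_;
    std::map<std::uintptr_t, ObjectSnapshot> objects_;
};
```

Copying the bytes trades memory for safety; an alternative is to extend the object's lifetime (for example, with shared ownership) for as long as an entry refers to it. Which fits better depends on who owns the object in the real JIT, which the thread does not settle.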
Regarding the title change: I see the repeated empty segfaults on my Intel Mac too, so I don't believe this is M1-only.
Yeah, we see this pretty often on CI as well, on both x86_64 and aarch64 Macs.
When running the read test on the M1 Mac, I get non-fatal segfaults like
I caught it in lldb on a non-debug build: