SVD test segfaults on Apple M1 #41440
Sigh, this is some sort of nasty something. Goes away under LLDB. Also no rr available 😢. |
This might be related: JuliaStats/Distributions.jl#1344 |
Also @chriselrod found some weird segmentation faults referencing |
Yes, ntuple seems to be a recurring theme in these. |
What I found most fun in JuliaStats/Distributions.jl#1344 is that it isn't deterministic, one more reason to miss rr 🙃 |
For the record, I just ran
on latest.
Edit: I can reproduce the issue by running the tests with
which I guess is what Keno did. |
Ok, I was running this test and got a kernel panic. I haven't tried to reproduce it yet; the panicked task was WindowServer, but I don't know if it was just a coincidence. |
Could this be related? |
I don't think so, since we don't use LLVM libomp anywhere. |
`hvcat` seems to be a common denominator: https://gist.github.com/gbaraldi/56f2d34fe841a182d6f29a0078830f83 |
Is there anything I can do to help get to the bottom of these errors? It's very common and pretty significantly limits the amount of the ecosystem usable on M1. |
Yes, it's the same issue as #42295. Next step is for somebody to file an upstream issue about it. Probably ping Lang Hames and Tim Northover on the Apple LLVM teams there. We may also need an extension to the MachO spec. |
Do you know these people? Does it make more sense for you to contact them, or should I basically just send them links to these issues? |
I'd start by filing a bug at https://bugs.llvm.org/, complaining that the large code model is not properly implemented on Darwin Aarch64, causing Orc JIT to crash if the memory allocator does not allocate sections within 4GB of each other (which the default allocator does not). CC the two of them on the issue. Lang is the Orc JIT maintainer, Tim is the Aarch64 backend maintainer. Apple has in general said that they're happy to help with M1 issues, so if the issue is filed there and you get no response I can send it through those channels. If you need more help understanding what the issue is, I can walk you through it. |
Okay, I think I did this right: https://bugs.llvm.org/show_bug.cgi?id=52029 |
Couldn't we just change the code model to Small? What would be the implications of doing so? I.e. adding

```c
#elif _CPU_AARCH64_ && _OS_DARWIN_
        CodeModel::Small;
#else
```

at: Lines 8275 to 8298 in 690517a
I tried adding that and ran the linalg tests and #42295 and didn't get any test failures. |
No, that's effectively what it does now. |
That was true of the old MCJIT, but OrcJIT is supposed to be better able to allocate farcall stubs on-demand if we used the small code model. |
It looks like LLVM sets the CodeModel to Large by default, unless I misunderstood:

```cpp
static CodeModel::Model
getEffectiveAArch64CodeModel(const Triple &TT, Optional<CodeModel::Model> CM,
                             bool JIT) {
  if (CM) {
    if (*CM != CodeModel::Small && *CM != CodeModel::Tiny &&
        *CM != CodeModel::Large) {
      report_fatal_error(
          "Only small, tiny and large code models are allowed on AArch64");
    } else if (*CM == CodeModel::Tiny && !TT.isOSBinFormatELF())
      report_fatal_error("tiny code model is only supported on ELF");
    return *CM;
  }
  // The default MCJIT memory managers make no guarantees about where they can
  // find an executable page; JITed code needs to be able to refer to globals
  // no matter how far away they are.
  // We should set the CodeModel::Small for Windows ARM64 in JIT mode,
  // since with large code model LLVM generating 4 MOV instructions, and
  // Windows doesn't support relocating these long branch (4 MOVs).
  if (JIT && !TT.isOSWindows())
    return CodeModel::Large;
  return CodeModel::Small;
}
```
|
For fun, confirming it doesn't help the

```diff
diff --git a/src/codegen.cpp b/src/codegen.cpp
index 754499d502..5517365f9d 100644
--- a/src/codegen.cpp
+++ b/src/codegen.cpp
@@ -8289,6 +8289,8 @@ extern "C" void jl_init_llvm(void)
     // Make sure we are using the large code model on 64bit
     // Let LLVM pick a default suitable for jitting on 32bit
         CodeModel::Large;
+#elif _CPU_AARCH64_ && _OS_DARWIN_
+        CodeModel::Small;
 #else
         None;
 #endif
```

```
] test Distributions
....
Test mvnormal | 5883 5883
signal (11): Segmentation fault: 11
in expression starting at /Users/chriselrod/.julia/packages/Distributions/1WSG5/test/mvlognormal.jl:114
ntuple at ./ntuple.jl:0
```

LinearAlgebra tests passed. |
@chriselrod Probably silly question — but is |
Ah, yes, simply setting
|
I believe we need to switch to the upcoming llvm::JITLink module (from the old RTDyldMemoryManager code) before that option is feasible. |
Hi all, I wanted to report that this problem also occurs with
|
… Darwin

This change is inspired by this comment [0], switching our linking layer to the newer one as recommended by [1]. With these changes, I can pass the `Distributions` test suite on aarch64 darwin, and so it appears it fixes at least one of the segfault issues noted on apple silicon.

[0] #41440 (comment)
[1] https://llvm.org/docs/JITLink.html#jitlink-and-objectlinkinglayer
I've got a WIP patch that ports Julia to LLVM Git main and ObjectLinkingLayer/CodeModel::Small – while I'm still working on debug info integration, the Distributions.jl tests pass with it on darwin-aarch64. |
I will note that |
(I'm assuming "entire Julia test suite" refers to |
I have not either, but I have also not been able to use Julia on M1 ARM for more than 15 minutes without this crash occurring somewhere. The failure is generally some low level operation in |
@staticfloat Try running them multiple times; they still crash for me:
This is not deterministic: it crashes most of the time, but every now and then the tests do pass.
|
My WIP branch is here: https://github.com/dnadlinger/julia/commits/aarch64-darwin This isn't usable for end users yet, as debug info registration (backtraces, …) isn't working yet (just using |
Now with eh frame and debug info registration fixed: dnadlinger@6feb722. See commit message for a few caveats re LLVM patches – the code is also rather janky and uncommented, but passes almost all of the main test suite again. Note that this replaces the |
…ll code model

This fixes JuliaLang#41440, JuliaLang#43285 and similar issues, which stem from CodeModel::Large not being correctly implemented on MachO/ARM64.

Requires LLVM 13.x or Git main (tested: 1dd5e6fed5db with patches from the JuliaLang/llvm-project julia-release/13.x branch, available at https://github.com/dnadlinger/llvm-project/commits/julia-main).

Requires an LLVM patch to pass through __eh_frame unwind information, without which backtraces silently won't work: llvm/llvm-project#52921

```diff
diff --git a/llvm/lib/ExecutionEngine/JITLink/MachO_arm64.cpp b/llvm/lib/ExecutionEngine/JITLink/MachO_arm64.cpp
index f2a029d35cd5..4d958b302ff9 100644
--- a/llvm/lib/ExecutionEngine/JITLink/MachO_arm64.cpp
+++ b/llvm/lib/ExecutionEngine/JITLink/MachO_arm64.cpp
@@ -705,6 +705,10 @@ void link_MachO_arm64(std::unique_ptr<LinkGraph> G,
     Config.PrePrunePasses.push_back(
         CompactUnwindSplitter("__LD,__compact_unwind"));

+    Config.PrePrunePasses.push_back(EHFrameSplitter("__TEXT,__eh_frame"));
+    Config.PrePrunePasses.push_back(EHFrameEdgeFixer("__TEXT,__eh_frame",
+                                                     8, Delta64, Delta32, NegDelta32));
+
     // Add an in-place GOT/Stubs pass.
     Config.PostPrunePasses.push_back(
         PerGraphGOTAndPLTStubsBuilder_MachO_arm64::asPass);
```
PR now at #43664. |
…ll code model

This fixes JuliaLang#41440, JuliaLang#43285 and similar issues, which stem from CodeModel::Large not being correctly implemented on MachO/ARM64.

Requires LLVM 13.x or Git main (tested: 1dd5e6fed5db with patches from the JuliaLang/llvm-project julia-release/13.x branch, available at https://github.com/dnadlinger/llvm-project/commits/julia-main).

Requires an LLVM patch to pass through __eh_frame unwind information, without which backtraces silently won't work (already applied on JuliaLang/llvm-project@julia-release/13.x): llvm/llvm-project#52921