MPI implementations intercepting Signals is incompatible with Julia GC safepoint #725
Comments
In a multi-threaded environment, Julia uses segmentation faults on special addresses for its safepoint implementation. If the MPI implementation intercepts signals, this will cause spurious aborts. UCX is a library that does this, so for a better experience we tell it not to. Generally, Julia will handle signals for the user.
That's right, @vchuravy. The point we are documenting here is that this problem is not limited to UCX: it affects other MPI implementations, including some that are currently in MPI.jl's set of test cases (see "test-intel-linux" in #724, which shows that MPI.jl with Intel MPI currently crashes when GC happens in a multithreaded context).
If you can figure out how to tell Intel MPI not to intercept signals, we can add that as a vendor-specific workaround.
We will do some research on that, thank you. However, a more principled approach would seem to be to tell Julia to use another signal for GC coordination, since in any situation where Julia is used as a child process, GC combined with multithreading would trigger a crash. This leads to a kind of Whac-A-Mole situation where the issue has to be addressed for every possible parent process, some of which could be closed source (as is the case here).
Also, it looks like this issue was reported here: https://discourse.julialang.org/t/julia-crashes-inside-threads-with-mpi/52400/5 From a quick look, there is no obvious ENV-based workaround for Intel MPI. Add to the list of MPI systems incompatible with GC+multithreading: MPICH 4.0 (but not MPICH 4.1!).
Let's be precise here. Julia does not crash; the MPI implementation is misreporting a signal as a crash. The Julia GC safepoint needs to be very low-overhead and is implemented as a load from an address. When GC needs to be triggered, Julia sets the safepoint to hot, i.e. it maps the page from which the load happens as inaccessible. The OS then delivers a signal to the process, and Julia inspects the faulting address to confirm that the signal was caused by the safepoint. While there are alternatives one could implement, this method has the lowest overhead during execution of the program.
I would encourage you to file a ticket with the vendor of the software.
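The page-protection safepoint described above can be sketched in a few lines of C. This is a simplified illustration, not Julia's runtime code: the names (`safepoint_init`, `safepoint_arm`, `safepoint_poll`, `safepoint_hits`) are hypothetical, it assumes Linux, and instead of parking the thread for GC the handler just counts the hit and makes the page readable again so that the faulting load succeeds on retry.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical names throughout; this is not Julia's runtime code. */
static volatile char *safepoint_page;         /* page that running code polls */
static size_t safepoint_page_size;
static volatile sig_atomic_t safepoint_hits;  /* faults recognised as safepoints */

/* SIGSEGV handler: a fault on the safepoint page is a GC request,
   any other fault is handed back to the default action. */
static void on_segv(int sig, siginfo_t *info, void *ucontext)
{
    (void)ucontext;
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)safepoint_page;
    if (addr >= base && addr < base + safepoint_page_size) {
        safepoint_hits++;
        /* A real runtime would wait here until GC finishes. This sketch
           only disarms the page so the retried load succeeds. mprotect is
           not on the POSIX async-signal-safe list; assumed OK on Linux. */
        mprotect((void *)safepoint_page, safepoint_page_size, PROT_READ);
        return;
    }
    /* Genuine crash: restore the default action; the fault re-triggers. */
    signal(sig, SIG_DFL);
}

/* Map the readable safepoint page and install the handler. */
static int safepoint_init(void)
{
    safepoint_page_size = (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, safepoint_page_size, PROT_READ,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    safepoint_page = p;

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}

/* Set the safepoint to "hot": the next poll will fault. */
static int safepoint_arm(void)
{
    return mprotect((void *)safepoint_page, safepoint_page_size, PROT_NONE);
}

/* The poll itself is a single load, which is why it is so cheap. */
static void safepoint_poll(void)
{
    char c = *safepoint_page;
    (void)c;
}
```

While the page is readable, `safepoint_poll()` costs one load and nothing else. After `safepoint_arm()`, the next poll faults, the handler recognises the address, and execution resumes. If another library replaces the `SIGSEGV` handler and treats every such fault as fatal, this benign fault is reported as a crash.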
Can you see which libfabric version Intel MPI is using? There was a signal-handler-related bugfix that landed in v1.10.0rc1 (ofiwg/libfabric#5613).
According to MPI.jl/.github/workflows/UnitTests.yml (line 242 in 6d513bb), this particular failed test is on intelmpi-2019.9.304.
@simonbyrne the latest is |
Is that the same as oneAPI MPI? We already test that (thanks to @giordano).
@alexandrebouchard what version of Intel MPI are you using? And what is your libfabric version?
I am travelling this week, but let me get back to you on this soon!
Intel PSM also has the same issue as this, and requires the existence of the environment variable Similar to UCX as documented here: |
Also OpenMPI sets the same environment variable for a similar reason: https://docs.open-mpi.org/en/main/news/news-v2.x.html
It doesn't look like setting |
Thanks again for your help with #720. This one is unrelated, except that issue #720 led us to create more comprehensive unit tests, which revealed this new, probably unrelated segfault.
Summary of this problem: a segfault occurs when GC is triggered in a multithreaded+MPI context.
How to reproduce: I have created a draft PR adding a GC.gc() call to one of MPI.jl's existing multithreaded tests: see pull request #724
The draft PR is based off the most recent commit where all tests passed (Tag 0.20.8). In the output of "test-intel-linux", the salient output is
The change we made is in the file test/test_threads.jl, where we added the following if clause:
We experience similar problems with MPICH 4.0 in our package (https://github.com/Julia-Tempering/Pigeons.jl), but not with MPICH 4.1.
Related discussions
This describes a similar issue in the context of UCX. However this problem does not seem limited to UCX from our investigations so far.
This describes a similar issue in the context of OpenMPI. However, it seems that certain versions of MPICH and Intel MPI (which is MPICH-derived) might suffer from a similar issue.
In light of these two sources, perhaps other environment variables in the style of
MPI.jl/src/MPI.jl
Line 133 in 6d513bb
Thank you so much for your time.