segmentation faults when combined with @threads with memory allocation #337
What version of MPI.jl are you using? You can get it via `]status MPI` at the Julia REPL.
Hmm, I can reproduce on our cluster (with OpenMPI 3.1.4 and 4.0.1). It might have something to do with OpenMPI/UCX's funny malloc hooks, which have caused problems before. cc'ing @kpamnany and @vchuravy who might have some ideas? MPICH doesn't seem to have this problem, so I would suggest using that in the meantime?
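(Background, as I understand it: UCX and some OpenMPI builds install malloc hooks to maintain an RDMA registration cache, and that kind of interception is exactly what can conflict with a garbage-collected runtime like Julia's.)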
Noticed after I wrote this that the exception part is not critical. The program simply has to spend enough time (or maybe do enough allocating?) inside the `@threads` loop:

```julia
using MPI

function main()
    MPI.Init()
    Threads.@threads for i in 1:2
        A = rand(1000, 1000)
        A1 = inv(A)
    end
    MPI.Finalize()
end

main()
```

I'll adjust the title accordingly.
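(For reference: reproducing this requires launching under MPI with more than one Julia thread; the invocation that appears later in the thread is `JULIA_NUM_THREADS=2` with `mpirun julia example.jl`.)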
It's something to do with how MPI and Julia's malloc interact. The simplest example (which doesn't require this package) that I can reproduce is:

```julia
# threads.jl
function main()
    ccall((:MPI_Init, :libmpi), Nothing, (Ptr{Cint}, Ptr{Cint}), C_NULL, C_NULL)
    Threads.@threads for i in 1:100
        A = rand(1000, 1000)
    end
end

main()
```

and then calling it under `mpirun` (e.g. `mpirun julia threads.jl`).
My guess is that it's due to UCX, which has caused similar problems in the past (#298). They've fixed quite a few issues on their main branch, but haven't made a new release for quite a while (I think they've made some breaking API changes, so they need to coordinate with OpenMPI). Hopefully this is one of those that will get fixed.
I tried this on a master build of UCX + OpenMPI and see the same issue.
Try calling
Yes, that seems to give the same result.
It still blows up for me with OpenMPI 3.1.2 when I force just using tcp and self (this is after allocating 2 tasks with 2 cores each and setting `JULIA_NUM_THREADS=2`).
A few generic comments (that might not apply here, but nevertheless):
Yes, I printed the result: it returns the same value.
Yes, I tried it with a specified path for `libmpi`.
They are at runtime: with one rank and printing between the loop and finalize, I get
One way forward would be to see if we can recreate the behavior in C: I guess this would consist of writing a multithreaded C program that internally calls
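Roughly (my outline of the idea, not code from the thread): call `MPI_Init`, spawn several pthreads that each `malloc` and free large buffers in a loop, and see whether that alone crashes; that would take Julia out of the picture entirely.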
Another datapoint: it works correctly with GC disabled:
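Presumably a check along these lines (my reconstruction on top of the earlier reproducer; `GC.enable` toggles Julia's garbage collector):

```julia
using MPI

function main()
    MPI.Init()
    GC.enable(false)                 # datapoint: no crash with the GC off
    Threads.@threads for i in 1:100
        A = rand(1000, 1000)         # allocate heavily on each thread
    end
    GC.enable(true)
    MPI.Finalize()
end

main()
```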
Ah, running it under GDB:
#1 points to this "FIXME":
IIUC, Julia uses signals to communicate between threads (e.g. to implement stop-the-world), so the "error" above is normal and is a case where you need to tell GDB to ignore that signal.
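(The usual way to do that is GDB's `handle` command, e.g. `handle SIGSEGV noprint nostop pass`, which the Julia developer docs recommend for the signals the runtime uses internally.)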
@simonbyrne can you try running with
Hmm, that seems to work correctly. |
More data:
@vchuravy and @kpamnany narrowed this down to UCX intercepting signals. According to the Julia developer docs, Julia's runtime itself uses SIGSEGV internally (the GC uses it for thread synchronization), so a library that catches that signal breaks Julia.
So basically users need to disable UCX intercepting the SIGSEGV error signal, which can be done by setting the (undocumented) environment variable `UCX_ERROR_SIGNALS=""`. @twhitehead Can you confirm if this fixes your users' problem?
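Concretely, the proposed workaround looks something like the sketch below; the variable has to be set before MPI (and hence UCX) initializes:

```julia
# Tell UCX not to install handlers for any error signals; in particular this
# keeps it from intercepting SIGSEGV, which Julia's runtime uses internally.
ENV["UCX_ERROR_SIGNALS"] = ""

using MPI
MPI.Init()
# ... threaded, allocating work as in the reproducer above ...
MPI.Finalize()
```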
Our user reports that this (setting `UCX_ERROR_SIGNALS=""`) doesn't fix their problem. My testing also seems to indicate that: while it changes the nature of the error reported, it does not stop the sample code from crashing.
I think, though, that, as has long been suggested, it may still be UCX related. While looking at the shared libraries loaded, I noticed my earlier attempt to avoid UCX through MCA parameters was not quite complete, as it also gets sucked in from the OSC layer. I'm currently getting no crashes when run as follows:
I'll play around more with variants (e.g., try OpenIB in the BTL layer, etc.) to ensure I'm not missing anything, and also verify with our user whether they exported the `UCX_ERROR_SIGNALS` variable. Thanks everyone for all the work digging into this.
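(For the record, excluding UCX from both frameworks on the command line would be spelled something like `mpirun --mca pml ob1 --mca btl tcp,self --mca osc ^ucx ...`; that is my rendering of the flags, not necessarily the exact command used above.)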
Interesting: it looks like it is passing on the node you are launching it from. Does it work if you set the environment variable in your script, i.e. add `ENV["UCX_ERROR_SIGNALS"] = ""` to the top of example.jl?
If that works, we can add that to the package.
@simonbyrne that was a very good idea. Unfortunately it doesn't seem to work though (or at least not under OpenMPI 3.1.2 and UCX 1.5.2). That is, I added a printout line at the top to verify it is set (and checked that it does throw an exception if the environment lookup fails):

```julia
print(gethostname(), ": UCX_ERROR_SIGNALS=", ENV["UCX_ERROR_SIGNALS"], "\n")
```

but the program still crashes:

```
[tyson@gra-login1 ~]$ UCX_ERROR_SIGNALS= JULIA_NUM_THREADS=2 salloc --ntasks 2 --nodes 2 --cpus-per-task 2 -t 3:0:0 --mem-per-cpu 2g -A def-tyson-ab
[tyson@gra2 ~]$ cd julia
[tyson@gra2 julia]$ mpirun julia example.jl
```
Hmm, I see the same thing using the same versions of UCX and OpenMPI, but not with the latest versions, so I think you will need to upgrade. |
Adds `MPI.Init_thread` and the `ThreadLevel` enum, along with a threaded test. Additionally, set the `UCX_ERROR_SIGNALS` environment variable if not already set, to fix #337.
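A usage sketch for the new API (names from the PR description; the exact call signature is my assumption):

```julia
using MPI

# Ask for full multithreading support; MPI reports the level it actually provides.
provided = MPI.Init_thread(MPI.THREAD_MULTIPLE)
provided == MPI.THREAD_MULTIPLE || error("MPI library does not provide THREAD_MULTIPLE")

# ... safe to make MPI calls from multiple threads here ...

MPI.Finalize()
```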
FYI: I ran into the same problem; however, on our cluster (Lise@HLRN) OpenMPI is built without UCX support. Updating from 3.1.5 to 4.1.1 didn't help either. One potential solution here is to change the PML from cm to ob1 with openib as the BTL; a second (better) solution is to use Intel MPI's implementation instead of OpenMPI.
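(In mpirun terms that first option would be something like `--mca pml ob1 --mca btl openib,self`; my rendering of the flags, not the poster's exact command.)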
One of our users was having problems with their hybrid MPI/threaded Julia code segfaulting on our clusters.
OS: Linux (CentOS 7)
Julia: 1.3.0
OpenMPI: 3.1.2
I simplified their code down to the following demo: exceptions sometimes turn into segmentation faults inside of `@threads for` loops. Here is some possibly relevant info from `ompi_info` as well.
EDIT: Removed the exception bit as, as noted below, it isn't required.