Tests segfaulting on Ookami #684

Closed
giordano opened this issue Dec 7, 2022 · 34 comments

giordano (Member) commented Dec 7, 2022

With OpenMPI:
[3925494] signal (11.1): Segmentation fault                                                                                                                                                                       
in expression starting at /lustre/home/mosgiordano/.julia/packages/MPI/tJjHF/test/test_allgather.jl:9                                                                                                             
hcoll_create_mpi_type at /opt/mellanox/hcoll/lib/libhcoll.so.1 (unknown line)                                                                                                                                     
ompi_dtype_2_hcoll_dtype at /lustre/software/openmpi/llvm14/4.1.4/lib/openmpi/mca_coll_hcoll.so (unknown line)                                                                                                    
mca_coll_hcoll_allgather at /lustre/software/openmpi/llvm14/4.1.4/lib/openmpi/mca_coll_hcoll.so (unknown line)                                                                                                    
MPI_Allgather at /lustre/software/openmpi/llvm14/4.1.4/lib/libmpi.so (unknown line)                                                                                                                               
MPI_Allgather at /lustre/home/mosgiordano/.julia/packages/MPI/tJjHF/src/api/generated_api.jl:252 [inlined]                                                                                                        
Allgather! at /lustre/home/mosgiordano/.julia/packages/MPI/tJjHF/src/collective.jl:367                                                                                                                            
Allgather! at /lustre/home/mosgiordano/.julia/packages/MPI/tJjHF/src/collective.jl:371 [inlined]                                                                                                                  
Allgather! at /lustre/home/mosgiordano/.julia/packages/MPI/tJjHF/src/collective.jl:374                                                                                                                            
Allgather at /lustre/home/mosgiordano/.julia/packages/MPI/tJjHF/src/collective.jl:398                                                                                                                             
unknown function (ip: 0x40001520335b)                                                                                                                                                                             
                                                                                                                                                                                                                  
[3925498] signal (11.1): Segmentation fault                                                                                                                                                                       
in expression starting at /lustre/home/mosgiordano/.julia/packages/MPI/tJjHF/test/test_allgather.jl:9                                                                                                             
hcoll_create_mpi_type at /opt/mellanox/hcoll/lib/libhcoll.so.1 (unknown line)                                                                                                                                     
ompi_dtype_2_hcoll_dtype at /lustre/software/openmpi/llvm14/4.1.4/lib/openmpi/mca_coll_hcoll.so (unknown line)                                                                                                    
mca_coll_hcoll_allgather at /lustre/software/openmpi/llvm14/4.1.4/lib/openmpi/mca_coll_hcoll.so (unknown line)                                                                                                    
MPI_Allgather at /lustre/software/openmpi/llvm14/4.1.4/lib/libmpi.so (unknown line)                                                                                                                               
MPI_Allgather at /lustre/home/mosgiordano/.julia/packages/MPI/tJjHF/src/api/generated_api.jl:252 [inlined]                                                                                                        
Allgather! at /lustre/home/mosgiordano/.julia/packages/MPI/tJjHF/src/collective.jl:367                                                                                                                            
Allgather! at /lustre/home/mosgiordano/.julia/packages/MPI/tJjHF/src/collective.jl:371 [inlined]                                                                                                                  
Allgather! at /lustre/home/mosgiordano/.julia/packages/MPI/tJjHF/src/collective.jl:374                                                                                                                            
                                                                                                                                                                                                                  
[3925496] signal (11.1): Segmentation fault                                                                                                                                                                       
in expression starting at /lustre/home/mosgiordano/.julia/packages/MPI/tJjHF/test/test_allgather.jl:9                                                                                                             
Allgather at /lustre/home/mosgiordano/.julia/packages/MPI/tJjHF/src/collective.jl:398
unknown function (ip: 0x40001520335b)

[3925493] signal (11.1): Segmentation fault
in expression starting at /lustre/home/mosgiordano/.julia/packages/MPI/tJjHF/test/test_allgather.jl:9
_jl_invoke at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2524 [inlined]
ijl_apply_generic at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2706
hcoll_create_mpi_type at /opt/mellanox/hcoll/lib/libhcoll.so.1 (unknown line)
hcoll_create_mpi_type at /opt/mellanox/hcoll/lib/libhcoll.so.1 (unknown line)
ompi_dtype_2_hcoll_dtype at /lustre/software/openmpi/llvm14/4.1.4/lib/openmpi/mca_coll_hcoll.so (unknown line)
mca_coll_hcoll_allgather at /lustre/software/openmpi/llvm14/4.1.4/lib/openmpi/mca_coll_hcoll.so (unknown line)
ompi_dtype_2_hcoll_dtype at /lustre/software/openmpi/llvm14/4.1.4/lib/openmpi/mca_coll_hcoll.so (unknown line)
mca_coll_hcoll_allgather at /lustre/software/openmpi/llvm14/4.1.4/lib/openmpi/mca_coll_hcoll.so (unknown line)
MPI_Allgather at /lustre/software/openmpi/llvm14/4.1.4/lib/libmpi.so (unknown line)
MPI_Allgather at /lustre/home/mosgiordano/.julia/packages/MPI/tJjHF/src/api/generated_api.jl:252 [inlined]
Allgather! at /lustre/home/mosgiordano/.julia/packages/MPI/tJjHF/src/collective.jl:367
MPI_Allgather at /lustre/software/openmpi/llvm14/4.1.4/lib/libmpi.so (unknown line)
Allgather! at /lustre/home/mosgiordano/.julia/packages/MPI/tJjHF/src/collective.jl:371 [inlined]
Allgather! at /lustre/home/mosgiordano/.julia/packages/MPI/tJjHF/src/collective.jl:374
MPI_Allgather at /lustre/home/mosgiordano/.julia/packages/MPI/tJjHF/src/api/generated_api.jl:252 [inlined]
Allgather! at /lustre/home/mosgiordano/.julia/packages/MPI/tJjHF/src/collective.jl:367
Allgather at /lustre/home/mosgiordano/.julia/packages/MPI/tJjHF/src/collective.jl:398
unknown function (ip: 0x40001520335b)
Allgather! at /lustre/home/mosgiordano/.julia/packages/MPI/tJjHF/src/collective.jl:371 [inlined]
Allgather! at /lustre/home/mosgiordano/.julia/packages/MPI/tJjHF/src/collective.jl:374
Allgather at /lustre/home/mosgiordano/.julia/packages/MPI/tJjHF/src/collective.jl:398
unknown function (ip: 0x400015203363)
_jl_invoke at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2524 [inlined]
ijl_apply_generic at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2706
jl_apply at /cache/build/default-armageddon-2/julialang/julia-master/src/julia.h:1875 [inlined]
do_call at /cache/build/default-armageddon-2/julialang/julia-master/src/interpreter.c:126
_jl_invoke at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2524 [inlined]
ijl_apply_generic at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2706
_jl_invoke at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2524 [inlined]
ijl_apply_generic at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2706
jl_apply at /cache/build/default-armageddon-2/julialang/julia-master/src/julia.h:1875 [inlined]
do_call at /cache/build/default-armageddon-2/julialang/julia-master/src/interpreter.c:126
eval_value at /cache/build/default-armageddon-2/julialang/julia-master/src/interpreter.c:226
jl_apply at /cache/build/default-armageddon-2/julialang/julia-master/src/julia.h:1875 [inlined]
do_call at /cache/build/default-armageddon-2/julialang/julia-master/src/interpreter.c:126
jl_apply at /cache/build/default-armageddon-2/julialang/julia-master/src/julia.h:1875 [inlined]
do_call at /cache/build/default-armageddon-2/julialang/julia-master/src/interpreter.c:126
eval_value at /cache/build/default-armageddon-2/julialang/julia-master/src/interpreter.c:226
eval_body at /cache/build/default-armageddon-2/julialang/julia-master/src/interpreter.c:478
eval_value at /cache/build/default-armageddon-2/julialang/julia-master/src/interpreter.c:226
eval_value at /cache/build/default-armageddon-2/julialang/julia-master/src/interpreter.c:226
eval_body at /cache/build/default-armageddon-2/julialang/julia-master/src/interpreter.c:478
jl_interpret_toplevel_thunk at /cache/build/default-armageddon-2/julialang/julia-master/src/interpreter.c:762
eval_body at /cache/build/default-armageddon-2/julialang/julia-master/src/interpreter.c:478
eval_body at /cache/build/default-armageddon-2/julialang/julia-master/src/interpreter.c:478
jl_interpret_toplevel_thunk at /cache/build/default-armageddon-2/julialang/julia-master/src/interpreter.c:762
jl_interpret_toplevel_thunk at /cache/build/default-armageddon-2/julialang/julia-master/src/interpreter.c:762
jl_interpret_toplevel_thunk at /cache/build/default-armageddon-2/julialang/julia-master/src/interpreter.c:762
jl_toplevel_eval_flex at /cache/build/default-armageddon-2/julialang/julia-master/src/toplevel.c:912
jl_interpret_toplevel_thunk at /cache/build/default-armageddon-2/julialang/julia-master/src/interpreter.c:762
jl_toplevel_eval_flex at /cache/build/default-armageddon-2/julialang/julia-master/src/toplevel.c:912
jl_toplevel_eval_flex at /cache/build/default-armageddon-2/julialang/julia-master/src/toplevel.c:856
ijl_toplevel_eval_in at /cache/build/default-armageddon-2/julialang/julia-master/src/toplevel.c:971
jl_toplevel_eval_flex at /cache/build/default-armageddon-2/julialang/julia-master/src/toplevel.c:912
jl_toplevel_eval_flex at /cache/build/default-armageddon-2/julialang/julia-master/src/toplevel.c:912
jl_toplevel_eval_flex at /cache/build/default-armageddon-2/julialang/julia-master/src/toplevel.c:856
ijl_toplevel_eval_in at /cache/build/default-armageddon-2/julialang/julia-master/src/toplevel.c:971
jl_toplevel_eval_flex at /cache/build/default-armageddon-2/julialang/julia-master/src/toplevel.c:856
ijl_toplevel_eval_in at /cache/build/default-armageddon-2/julialang/julia-master/src/toplevel.c:971
jl_toplevel_eval_flex at /cache/build/default-armageddon-2/julialang/julia-master/src/toplevel.c:856
ijl_toplevel_eval_in at /cache/build/default-armageddon-2/julialang/julia-master/src/toplevel.c:971
eval at ./boot.jl:370 [inlined]
include_string at ./loading.jl:1522
_jl_invoke at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2524 [inlined]
ijl_apply_generic at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2706
eval at ./boot.jl:370 [inlined]
include_string at ./loading.jl:1522
_jl_invoke at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2524 [inlined]
ijl_apply_generic at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2706
eval at ./boot.jl:370 [inlined]
include_string at ./loading.jl:1522
eval at ./boot.jl:370 [inlined]
include_string at ./loading.jl:1522
_jl_invoke at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2524 [inlined]
ijl_apply_generic at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2706
_jl_invoke at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2524 [inlined]
ijl_apply_generic at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2706
_include at ./loading.jl:1582
_include at ./loading.jl:1582
_include at ./loading.jl:1582
_include at ./loading.jl:1582
include at ./Base.jl:450
include at ./Base.jl:450
include at ./Base.jl:450
include at ./Base.jl:450
jfptr_include_49529 at /lustre/software/julia-5da8d5f17a/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2524 [inlined]
ijl_apply_generic at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2706
jfptr_include_49529 at /lustre/software/julia-5da8d5f17a/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2524 [inlined]
ijl_apply_generic at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2706
jfptr_include_49529 at /lustre/software/julia-5da8d5f17a/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2524 [inlined]
ijl_apply_generic at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2706
ijl_apply_generic at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2706
jfptr_include_49529 at /lustre/software/julia-5da8d5f17a/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2524 [inlined]
ijl_apply_generic at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2706
exec_options at ./client.jl:307
exec_options at ./client.jl:307
exec_options at ./client.jl:307
exec_options at ./client.jl:307
_start at ./client.jl:522
_start at ./client.jl:522
_start at ./client.jl:522
_start at ./client.jl:522
jfptr__start_36420 at /lustre/software/julia-5da8d5f17a/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2524 [inlined]
ijl_apply_generic at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2706
jl_apply at /cache/build/default-armageddon-2/julialang/julia-master/src/julia.h:1875 [inlined]
true_main at /cache/build/default-armageddon-2/julialang/julia-master/src/jlapi.c:573
jl_repl_entrypoint at /cache/build/default-armageddon-2/julialang/julia-master/src/jlapi.c:717
jfptr__start_36420 at /lustre/software/julia-5da8d5f17a/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2524 [inlined]
ijl_apply_generic at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2706
main at /cache/build/default-armageddon-2/julialang/julia-master/cli/loader_exe.c:58
__libc_start_main at /lib64/libc.so.6 (unknown line)
_start at /lustre/software/julia-5da8d5f17a/bin/julia (unknown line)
_start at /lustre/software/julia-5da8d5f17a/bin/julia (unknown line)
Allocations: 1287421 (Pool: 1286553; Big: 868); GC: 2
jfptr__start_36420 at /lustre/software/julia-5da8d5f17a/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2524 [inlined]
ijl_apply_generic at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2706
jl_apply at /cache/build/default-armageddon-2/julialang/julia-master/src/julia.h:1875 [inlined]
true_main at /cache/build/default-armageddon-2/julialang/julia-master/src/jlapi.c:573
jl_repl_entrypoint at /cache/build/default-armageddon-2/julialang/julia-master/src/jlapi.c:717
main at /cache/build/default-armageddon-2/julialang/julia-master/cli/loader_exe.c:58
__libc_start_main at /lib64/libc.so.6 (unknown line)
_start at /lustre/software/julia-5da8d5f17a/bin/julia (unknown line)
_start at /lustre/software/julia-5da8d5f17a/bin/julia (unknown line)
Allocations: 1287421 (Pool: 1286553; Big: 868); GC: 2
jl_apply at /cache/build/default-armageddon-2/julialang/julia-master/src/julia.h:1875 [inlined]
true_main at /cache/build/default-armageddon-2/julialang/julia-master/src/jlapi.c:573
jl_repl_entrypoint at /cache/build/default-armageddon-2/julialang/julia-master/src/jlapi.c:717
main at /cache/build/default-armageddon-2/julialang/julia-master/cli/loader_exe.c:58
__libc_start_main at /lib64/libc.so.6 (unknown line)
_start at /lustre/software/julia-5da8d5f17a/bin/julia (unknown line)
_start at /lustre/software/julia-5da8d5f17a/bin/julia (unknown line)
Allocations: 1287421 (Pool: 1286553; Big: 868); GC: 2
jfptr__start_36420 at /lustre/software/julia-5da8d5f17a/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2524 [inlined]
ijl_apply_generic at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2706
jl_apply at /cache/build/default-armageddon-2/julialang/julia-master/src/julia.h:1875 [inlined]
true_main at /cache/build/default-armageddon-2/julialang/julia-master/src/jlapi.c:573
jl_repl_entrypoint at /cache/build/default-armageddon-2/julialang/julia-master/src/jlapi.c:717
main at /cache/build/default-armageddon-2/julialang/julia-master/cli/loader_exe.c:58
__libc_start_main at /lib64/libc.so.6 (unknown line)
_start at /lustre/software/julia-5da8d5f17a/bin/julia (unknown line)
_start at /lustre/software/julia-5da8d5f17a/bin/julia (unknown line)
Allocations: 1287421 (Pool: 1286553; Big: 868); GC: 2
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 3925494 on node fj001 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
test_allgather.jl: Error During Test at /lustre/home/mosgiordano/.julia/packages/MPI/tJjHF/test/runtests.jl:43

With MVAPICH2:
[fj001:mpi_rank_0][MPID_Init] [Performance Suggestion]: Application has requested for multi-thread capability. If allocating memory from different pthreads/OpenMP threads, please consider setting MV2_USE_ALIGNED_ALLOC=1 for improved performance.
Use MV2_USE_THREAD_WARNING=0 to suppress this error message.                                                                                                                                                      
┌ Warning: MPI thread level requested = MPI.ThreadLevel(2), provided = MPI.ThreadLevel(0)                                                                                                                         
└ @ MPI ~/.julia/packages/MPI/tJjHF/src/environment.jl:96                                                                                                                                                         
┌ Warning: MPI thread level requested = MPI.ThreadLevel(2), provided = MPI.ThreadLevel(0)                                                                                                                         
└ @ MPI ~/.julia/packages/MPI/tJjHF/src/environment.jl:96                                                                                                                                                         
┌ Warning: MPI thread level requested = MPI.ThreadLevel(2), provided = MPI.ThreadLevel(0)                                                                                                                         
└ @ MPI ~/.julia/packages/MPI/tJjHF/src/environment.jl:96                                                                                                                                                         
┌ Warning: MPI thread level requested = MPI.ThreadLevel(2), provided = MPI.ThreadLevel(0)                                                                                                                         
└ @ MPI ~/.julia/packages/MPI/tJjHF/src/environment.jl:96                                                                                                                                                         
                                                                                                                                                                                                                  
[3924385] signal (11.1): Segmentation fault                                                                                                                                                                       
in expression starting at /lustre/home/mosgiordano/.julia/packages/MPI/tJjHF/test/test_allgather.jl:56                                                                                                            
                                                                                                                                                                                                                  
[3924388] signal (11.1): Segmentation fault                                                                                                                                                                       
in expression starting at /lustre/home/mosgiordano/.julia/packages/MPI/tJjHF/test/test_allgather.jl:56                                                                                                            
                                                                                                                                                                                                                  
[3924386] signal (11.1): Segmentation fault                                                                                                                                                                       
in expression starting at /lustre/home/mosgiordano/.julia/packages/MPI/tJjHF/test/test_allgather.jl:56                                                                                                            
                                                                                                                                                                                                                  
[3924387] signal (11.1): Segmentation fault                                                                                                                                                                       
in expression starting at /lustre/home/mosgiordano/.julia/packages/MPI/tJjHF/test/test_allgather.jl:56
MPIR_Call_attr_delete at /lustre/software/mvapich2/gcc11/2.3.6/lib/libmpi.so (unknown line)
MPIR_Call_attr_delete at /lustre/software/mvapich2/gcc11/2.3.6/lib/libmpi.so (unknown line)
MPIR_Attr_delete_list at /lustre/software/mvapich2/gcc11/2.3.6/lib/libmpi.so (unknown line)
Allocations: 2594242 (Pool: 2592819; Big: 1423); GC: 3
MPIR_Attr_delete_list at /lustre/software/mvapich2/gcc11/2.3.6/lib/libmpi.so (unknown line)
Allocations: 2594242 (Pool: 2592819; Big: 1423); GC: 3
MPIR_Call_attr_delete at /lustre/software/mvapich2/gcc11/2.3.6/lib/libmpi.so (unknown line)
MPIR_Call_attr_delete at /lustre/software/mvapich2/gcc11/2.3.6/lib/libmpi.so (unknown line)
MPIR_Attr_delete_list at /lustre/software/mvapich2/gcc11/2.3.6/lib/libmpi.so (unknown line)
Allocations: 2594242 (Pool: 2592819; Big: 1423); GC: 3
MPIR_Attr_delete_list at /lustre/software/mvapich2/gcc11/2.3.6/lib/libmpi.so (unknown line)
Allocations: 2594242 (Pool: 2592819; Big: 1423); GC: 3
srun: error: fj001: tasks 0-3: Segmentation fault (core dumped)
test_allgather.jl: Error During Test at /lustre/home/mosgiordano/.julia/packages/MPI/tJjHF/test/runtests.jl:43

With MVAPICH2 there may be some threading issues, because loading the package issues the warning

┌ Warning: MPI thread level requested = MPI.ThreadLevel(2), provided = MPI.ThreadLevel(0)                                                                                                                         
└ @ MPI ~/.julia/packages/MPI/tJjHF/src/environment.jl:96                                                                                                                                                         
julia> versioninfo()
Julia Version 1.10.0-DEV.77
Commit 5da8d5f17ad (2022-11-30 11:11 UTC)
Platform Info:
  OS: Linux (aarch64-linux-gnu)
  CPU: 48 × unknown
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, a64fx)
  Threads: 1 on 48 virtual cores

Information about MPI on Ookami.

giordano (Member Author) commented Dec 7, 2022

The problems with threading in MVAPICH2 are probably a red herring, with

diff --git a/src/environment.jl b/src/environment.jl
index a9d7e40..c889ae3 100644
--- a/src/environment.jl
+++ b/src/environment.jl
@@ -78,7 +78,7 @@ it after calling [`MPI.Finalize`](@ref).
 $(_doc_external("MPI_Init"))
 $(_doc_external("MPI_Init_thread"))
 """
-function Init(;threadlevel=:serialized, finalize_atexit=true, errors_return=true)
+function Init(;threadlevel=:single, finalize_atexit=true, errors_return=true)
     if threadlevel isa Symbol
         threadlevel = ThreadLevel(threadlevel)
     end

to force single-thread initialisation (is there a better way to do that? In the tests the call is always MPI.Init(), with no way to control the initialisation settings), I still get a segfault:

[2258005] signal (11.1): Segmentation fault                                                                                                                                                                       
in expression starting at /lustre/home/mosgiordano/.julia/dev/MPI/test/test_allgather.jl:56                                                                                                                       
                                                                                                                                                                                                                  
[2258006] signal (11.1): Segmentation fault                                                                                                                                                                       
in expression starting at /lustre/home/mosgiordano/.julia/dev/MPI/test/test_allgather.jl:56                                                                                                                       
                                                                                                                                                                                                                  
[2258007] signal (11.1): Segmentation fault                                                                                                                                                                       
in expression starting at /lustre/home/mosgiordano/.julia/dev/MPI/test/test_allgather.jl:56                                                                                                                       
                                                                                                                                                                                                                  
[2258008] signal (11.1): Segmentation fault                                                                                                                                                                       
in expression starting at /lustre/home/mosgiordano/.julia/dev/MPI/test/test_allgather.jl:56                                                                                                                       
MPIR_Call_attr_delete at /lustre/software/mvapich2/gcc11/2.3.6/lib/libmpi.so (unknown line)                                                                                                                       
MPIR_Attr_delete_list at /lustre/software/mvapich2/gcc11/2.3.6/lib/libmpi.so (unknown line)                                                                                                                       
Allocations: 2593986 (Pool: 2592537; Big: 1449); GC: 3                                                                                                                                                            
MPIR_Call_attr_delete at /lustre/software/mvapich2/gcc11/2.3.6/lib/libmpi.so (unknown line)                                                                                                                       
MPIR_Attr_delete_list at /lustre/software/mvapich2/gcc11/2.3.6/lib/libmpi.so (unknown line)                                                                                                                       
Allocations: 2593986 (Pool: 2592537; Big: 1449); GC: 3
MPIR_Call_attr_delete at /lustre/software/mvapich2/gcc11/2.3.6/lib/libmpi.so (unknown line)
MPIR_Attr_delete_list at /lustre/software/mvapich2/gcc11/2.3.6/lib/libmpi.so (unknown line)
Allocations: 2593986 (Pool: 2592537; Big: 1449); GC: 3
MPIR_Call_attr_delete at /lustre/software/mvapich2/gcc11/2.3.6/lib/libmpi.so (unknown line)
MPIR_Attr_delete_list at /lustre/software/mvapich2/gcc11/2.3.6/lib/libmpi.so (unknown line)
Allocations: 2593986 (Pool: 2592537; Big: 1449); GC: 3
srun: error: fj002: tasks 0-3: Segmentation fault (core dumped)
test_allgather.jl: Error During Test at /lustre/home/mosgiordano/.julia/dev/MPI/test/runtests.jl:43

@williamfgc's comment was marked as off-topic.

giordano (Member Author) commented Dec 7, 2022

MPI_Init() on Ookami works fine with both OpenMPI and MVAPICH2 (apart from the fact that threading should be disabled with MVAPICH2), so your problem is likely unrelated; I'd recommend opening a dedicated issue.

For the record, MPICH_jll and OpenMPI_jll work fine on Ookami: all tests pass. I'm trying to make a small reproducer for the failing tests with the system libraries.

giordano (Member Author) commented Dec 7, 2022

It's sufficient to do

using MPI
MPI.Init(;threadlevel=:single)
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
A = Array{Char}([rank + 1])
C = MPI.Allgather(A, comm)

to reproduce the segfault in test_allgather.jl with MVAPICH2. The problem is specifically with the Char datatype, not any of the others.
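
For contrast, here is a minimal sketch of the same pattern with a plain integer type (an assumption based on the observation above that only Char is affected; Int64 should map to a predefined MPI datatype rather than a derived one):

using MPI
MPI.Init(;threadlevel=:single)
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
A = Int64[rank + 1]          # built-in numeric type, no derived MPI datatype needed
C = MPI.Allgather(A, comm)   # per the report above, this case does not segfault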

giordano (Member Author) commented Dec 7, 2022

Am I doing anything obviously wrong in

#include <mpi.h>
#include <stdlib.h>

int main(void)
{
    MPI_Init(NULL, NULL);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    char A = rank + 97;
    char *C = (char *)malloc(sizeof(char) * size);
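    /* NB: the fifth argument (recvcount) below is passed as size, but it
       should be 1, i.e. the per-rank receive count; see the edit at the end
       of this comment. */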
    MPI_Allgather(&A, 1, MPI_CHAR, C, size, MPI_CHAR, MPI_COMM_WORLD);
    free(C);
    MPI_Finalize();
    return 0;
}

? With this code, which should be pretty much a C equivalent of the Julia code above, I get

[mosgiordano@fj003 temp-env]$ module purge
[mosgiordano@fj003 temp-env]$ module load slurm gcc/11.1.0 mvapich2/gcc11/2.3.6
[mosgiordano@fj003 temp-env]$ mpicc -o repro repro.c && srun -n 6 ./repro 
[fj003:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
[fj003:mpi_rank_2][error_sighandler] Caught error: Segmentation fault (signal 11)
[fj003:mpi_rank_3][error_sighandler] Caught error: Segmentation fault (signal 11)
[fj003:mpi_rank_4][error_sighandler] Caught error: Segmentation fault (signal 11)
[fj003:mpi_rank_5][error_sighandler] Caught error: Segmentation fault (signal 11)
[fj003:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
srun: error: fj003: tasks 0,2-5: Segmentation fault (core dumped)
srun: error: fj003: task 1: Segmentation fault (core dumped)

Edit: nevermind, I got recvcount (the fifth argument to MPI_Allgather) wrong: it should have been 1, not size. So I'm back to not knowing what's going on ☹️

eschnett (Contributor) commented Dec 8, 2022

@giordano If the problem is related to Char: note that Char != Cchar; sizeof(Char) == 4.
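
A quick REPL check of the sizes involved (Cchar is an alias for Int8):

julia> sizeof(Char), sizeof(Cchar)
(4, 1)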

eschnett (Contributor) commented Dec 8, 2022

MPI.jl will create a new MPI datatype for Char. It should choose MPI_UINT32_T. Maybe this non-trivial mechanism is somehow broken? Does it work if you explicitly use MPI_INT as the datatype for Char?

giordano (Member Author) commented Dec 8, 2022

I think you got it, this code works:

using MPI
MPI.Init(;threadlevel=:single)
comm = MPI.COMM_WORLD
size = MPI.Comm_size(comm)
rank = MPI.Comm_rank(comm)
A = Array{Char}([rank + 1])
C = Char.(zeros(Int32, size))
MPI.Allgather!(MPI.Buffer(A, 1, MPI.Datatype(Int32)),
               MPI.UBuffer(C, 1, nothing, MPI.Datatype(Int32)), comm)
@show rank, C
$ srun -n 2 julia --project repro.jl 
(rank, C) = (0, ['\x01', '\x02'])
(rank, C) = (1, ['\x01', '\x02'])

Where is the Julia datatype converted to the MPI datatype in the ccall?

Note that the use of Char in the tests is coming from

const MPIDatatype = Union{Char,

giordano (Member Author) commented Dec 8, 2022

This doesn't look good, right?

julia> MPI.get_name(MPI.Datatype(Char))
""

I get this also with MPICH_jll, so it doesn't seem to be specific to MVAPICH2.

giordano (Member Author) commented Dec 8, 2022

For the record, other broken tests with MVAPICH2 include:

  • bcast (I think this is again related to Chars)
  • datatypes (I couldn't quickly understand what the issue is; there are some Chars here too, but there may be something else as well)
  • error (it hangs after the end of the test, which should have been fixed by use new exitcode atexit hook in Julia 1.9 #680, but there seems to be something else going on. I don't see the same with OpenMPI, so it looks specific to MVAPICH2)
  • io (I get many errors like File does not exist or Invalid file name)
  • spawn (MPID_Open_port(70)............: Function not implemented inside MPI_Comm_spawn)

Most failing OpenMPI tests seem to be related to Chars again. There is also a failure in spawn, again with the error wireup.c:1335 Fatal: endpoint reconfiguration not supported yet.

simonbyrne (Member) commented

The issue looks to be something with custom datatypes. It would be good to see what is going on.

simonbyrne (Member) commented

For Open MPI:

hcoll_create_mpi_type at /opt/mellanox/hcoll/lib/libhcoll.so.1 (unknown line)                                                                                                                                     
ompi_dtype_2_hcoll_dtype at /lustre/software/openmpi/llvm14/4.1.4/lib/openmpi/mca_coll_hcoll.so (unknown line)                                                                                                    
mca_coll_hcoll_allgather at /lustre/software/openmpi/llvm14/4.1.4/lib/openmpi/mca_coll_hcoll.so (unknown line)                                                                                                    
MPI_Allgather at /lustre/software/openmpi/llvm14/4.1.4/lib/libmpi.so (unknown line)                                                                                                                               

It looks like the Mellanox HCOLL library doesn't like custom MPI Datatypes. I think we've had issues with it before.

My suggestions

simonbyrne (Member) commented

The following should be a reproducer of your example:

#include <mpi.h>
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <stdlib.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    int n, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &n);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    uint32_t sendarr[1];
    sendarr[0] = rank;

    uint32_t *recvbuf;
    recvbuf = (uint32_t *)malloc(n*sizeof(uint32_t));
    
    MPI_Datatype dup_type;
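    /* MPI_Type_dup yields a derived (non-predefined) datatype handle; this is
       meant to mimic the Datatype that MPI.jl builds for Char. */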
    MPI_Type_dup(MPI_UINT32_T, &dup_type);

    MPI_Allgather(sendarr, 1, dup_type, recvbuf, 1, dup_type, MPI_COMM_WORLD);

    if (rank == 0) {
      for (int i = 0; i < n; i++) {
        printf("recvbuf[%i] = %"PRIu32"\n", i, recvbuf[i]);
      }
    }
    
    MPI_Finalize();
    return 0;
}

giordano (Member Author) commented

Yes, that's probably it:

[mosgiordano@fj003 openmpi]$ mpiexec -n 1 ./test
recvbuf[0] = 0
[mosgiordano@fj003 openmpi]$ mpiexec -n 2 ./test
[fj003:1748529:0:1748529] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2000002021)
[fj003:1748528:0:1748528] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2000002021)
==== backtrace (tid:1748528) ====
=================================
==== backtrace (tid:1748529) ====
=================================
[fj003:1748529] *** Process received signal ***
[fj003:1748529] Signal: Segmentation fault (11)
[fj003:1748529] Signal code:  (-6)
[fj003:1748529] Failing at address: 0xa267ae8001aae31
[fj003:1748529] [ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0x4000000707a0]
[fj003:1748529] [ 1] /opt/mellanox/hcoll/lib/libhcoll.so.1(hcoll_create_mpi_type+0x8c4)[0x400003e70694]
[fj003:1748529] [ 2] /lustre/software/openmpi/gcc12.1.0/4.1.4/lib/openmpi/mca_coll_hcoll.so(+0x70a0)[0x400003d570a0]
[fj003:1748529] [ 3] /lustre/software/openmpi/gcc12.1.0/4.1.4/lib/openmpi/mca_coll_hcoll.so(mca_coll_hcoll_allgather+0x8c)[0x400003d5767c]
[fj003:1748529] [ 4] /lustre/software/openmpi/gcc12.1.0/4.1.4/lib/libmpi.so.40(MPI_Allgather+0x100)[0x4000000e6410]
[fj003:1748529] [ 5] ./test[0x400b78]
[fj003:1748529] [ 6] /lib64/libc.so.6(__libc_start_main+0xe4)[0x400000260de4]
[fj003:1748529] [ 7] ./test[0x400a0c]
[fj003:1748529] *** End of error message ***
[fj003:1748528] *** Process received signal ***
[fj003:1748528] Signal: Segmentation fault (11)
[fj003:1748528] Signal code:  (-6)
[fj003:1748528] Failing at address: 0xa267ae8001aae30
[fj003:1748528] [ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0x4000000707a0]
[fj003:1748528] [ 1] /opt/mellanox/hcoll/lib/libhcoll.so.1(hcoll_create_mpi_type+0x8c4)[0x400003e70694]
[fj003:1748528] [ 2] /lustre/software/openmpi/gcc12.1.0/4.1.4/lib/openmpi/mca_coll_hcoll.so(+0x70a0)[0x400003d570a0]
[fj003:1748528] [ 3] /lustre/software/openmpi/gcc12.1.0/4.1.4/lib/openmpi/mca_coll_hcoll.so(mca_coll_hcoll_allgather+0x8c)[0x400003d5767c]
[fj003:1748528] [ 4] /lustre/software/openmpi/gcc12.1.0/4.1.4/lib/libmpi.so.40(MPI_Allgather+0x100)[0x4000000e6410]
[fj003:1748528] [ 5] ./test[0x400b78]
[fj003:1748528] [ 6] /lib64/libc.so.6(__libc_start_main+0xe4)[0x400000260de4]
[fj003:1748528] [ 7] ./test[0x400a0c]
[fj003:1748528] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 1748529 on node fj003 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
[mosgiordano@fj003 openmpi]$ OMPI_MCA_coll_hcoll_enable="0" mpirun -np 2 ./test
recvbuf[0] = 0
recvbuf[1] = 1
[mosgiordano@fj003 openmpi]$ OMPI_MCA_coll_hcoll_enable="0" mpirun -np 4 ./test
recvbuf[0] = 0
recvbuf[1] = 1
recvbuf[2] = 2
recvbuf[3] = 3
[mosgiordano@fj003 openmpi]$ OMPI_MCA_coll_hcoll_enable="0" mpirun -np 8 ./test
recvbuf[0] = 0
recvbuf[1] = 1
recvbuf[2] = 2
recvbuf[3] = 3
recvbuf[4] = 4
recvbuf[5] = 5
recvbuf[6] = 6
recvbuf[7] = 7

Exporting OMPI_MCA_coll_hcoll_enable="0" (found at https://docs.dkrz.de/doc/levante/running-jobs/runtime-settings.html) seems to be enough to disable HCOLL. With that variable set, all OpenMPI tests apart from test_spawn (already mentioned above; see below for the full wall of errors) now pass.
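
For reference, the same switch can also be applied from within Julia before initialising MPI (a sketch, assuming Open MPI reads MCA parameters from the process environment at MPI_Init time):

ENV["OMPI_MCA_coll_hcoll_enable"] = "0"

using MPI
MPI.Init()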

`test_spawn` error
[1670880440.887724] [fj003:1752826:0]          wireup.c:1037 UCX  ERROR   old: am_lane 0 wireup_msg_lane 3 cm_lane <none> reachable_mds 0xfb ep_check_map 0x0                                                     
[1670880440.887813] [fj003:1752826:0]          wireup.c:1047 UCX  ERROR   old: lane[0]:  0:posix/memory.0 md[0]          -> md[0]/posix/sysdev[255] am am_bw#0                                                    
[1670880440.887836] [fj003:1752826:0]          wireup.c:1047 UCX  ERROR   old: lane[1]: 13:xpmem/memory.0 md[7]          -> md[7]/xpmem/sysdev[255] rkey_ptr                                                      
[1670880440.887854] [fj003:1752826:0]          wireup.c:1047 UCX  ERROR   old: lane[2]: 11:cma/memory.0 md[5]            -> md[5]/cma/sysdev[255] rma_bw#0                                                        
[1670880440.887871] [fj003:1752826:0]          wireup.c:1047 UCX  ERROR   old: lane[3]:  7:rc_mlx5/mlx5_0:1.0 md[4]      -> md[4]/ib/sysdev[255] rma_bw#1 wireup                                                  
[1670880440.887885] [fj003:1752826:0]          wireup.c:1037 UCX  ERROR   new: am_lane 0 wireup_msg_lane 3 cm_lane <none> reachable_mds 0xfb ep_check_map 0x0                                                     
[1670880440.887901] [fj003:1752826:0]          wireup.c:1047 UCX  ERROR   new: lane[0]:  0:posix/memory.0 md[0]          -> md[0]/posix/sysdev[255] am am_bw#0                                                    
[1670880440.887915] [fj003:1752826:0]          wireup.c:1047 UCX  ERROR   new: lane[1]: 13:xpmem/memory.0 md[7]          -> md[7]/xpmem/sysdev[255] rkey_ptr                                                      
[1670880440.887931] [fj003:1752826:0]          wireup.c:1047 UCX  ERROR   new: lane[2]: 12:knem/memory.0 md[6]           -> md[6]/knem/sysdev[255] rma_bw#0                                                       
[1670880440.887945] [fj003:1752826:0]          wireup.c:1047 UCX  ERROR   new: lane[3]:  7:rc_mlx5/mlx5_0:1.0 md[4]      -> md[4]/ib/sysdev[255] rma_bw#1 wireup                                                  
[1670880440.887712] [fj003:1752827:0]          wireup.c:1037 UCX  ERROR   old: am_lane 0 wireup_msg_lane 3 cm_lane <none> reachable_mds 0xfb ep_check_map 0x0                                                     
[1670880440.887801] [fj003:1752827:0]          wireup.c:1047 UCX  ERROR   old: lane[0]:  0:posix/memory.0 md[0]          -> md[0]/posix/sysdev[255] am am_bw#0                                                    
[1670880440.887825] [fj003:1752827:0]          wireup.c:1047 UCX  ERROR   old: lane[1]: 13:xpmem/memory.0 md[7]          -> md[7]/xpmem/sysdev[255] rkey_ptr                                                      
[1670880440.887843] [fj003:1752827:0]          wireup.c:1047 UCX  ERROR   old: lane[2]: 11:cma/memory.0 md[5]            -> md[5]/cma/sysdev[255] rma_bw#0
[1670880440.887868] [fj003:1752827:0]          wireup.c:1047 UCX  ERROR   old: lane[3]:  7:rc_mlx5/mlx5_0:1.0 md[4]      -> md[4]/ib/sysdev[255] rma_bw#1 wireup
[1670880440.887884] [fj003:1752827:0]          wireup.c:1037 UCX  ERROR   new: am_lane 0 wireup_msg_lane 3 cm_lane <none> reachable_mds 0xfb ep_check_map 0x0
[1670880440.887899] [fj003:1752827:0]          wireup.c:1047 UCX  ERROR   new: lane[0]:  0:posix/memory.0 md[0]          -> md[0]/posix/sysdev[255] am am_bw#0
[1670880440.887914] [fj003:1752827:0]          wireup.c:1047 UCX  ERROR   new: lane[1]: 13:xpmem/memory.0 md[7]          -> md[7]/xpmem/sysdev[255] rkey_ptr
[1670880440.887928] [fj003:1752827:0]          wireup.c:1047 UCX  ERROR   new: lane[2]: 12:knem/memory.0 md[6]           -> md[6]/knem/sysdev[255] rma_bw#0
[1670880440.887944] [fj003:1752827:0]          wireup.c:1047 UCX  ERROR   new: lane[3]:  7:rc_mlx5/mlx5_0:1.0 md[4]      -> md[4]/ib/sysdev[255] rma_bw#1 wireup
[1670880440.887713] [fj003:1752829:0]          wireup.c:1037 UCX  ERROR   old: am_lane 0 wireup_msg_lane 3 cm_lane <none> reachable_mds 0xfb ep_check_map 0x0
[1670880440.887801] [fj003:1752829:0]          wireup.c:1047 UCX  ERROR   old: lane[0]:  0:posix/memory.0 md[0]          -> md[0]/posix/sysdev[255] am am_bw#0
[1670880440.887826] [fj003:1752829:0]          wireup.c:1047 UCX  ERROR   old: lane[1]: 13:xpmem/memory.0 md[7]          -> md[7]/xpmem/sysdev[255] rkey_ptr
[1670880440.887844] [fj003:1752829:0]          wireup.c:1047 UCX  ERROR   old: lane[2]: 11:cma/memory.0 md[5]            -> md[5]/cma/sysdev[255] rma_bw#0
[1670880440.887860] [fj003:1752829:0]          wireup.c:1047 UCX  ERROR   old: lane[3]:  7:rc_mlx5/mlx5_0:1.0 md[4]      -> md[4]/ib/sysdev[255] rma_bw#1 wireup
[1670880440.887876] [fj003:1752829:0]          wireup.c:1037 UCX  ERROR   new: am_lane 0 wireup_msg_lane 3 cm_lane <none> reachable_mds 0xfb ep_check_map 0x0
[1670880440.887891] [fj003:1752829:0]          wireup.c:1047 UCX  ERROR   new: lane[0]:  0:posix/memory.0 md[0]          -> md[0]/posix/sysdev[255] am am_bw#0
[1670880440.887907] [fj003:1752829:0]          wireup.c:1047 UCX  ERROR   new: lane[1]: 13:xpmem/memory.0 md[7]          -> md[7]/xpmem/sysdev[255] rkey_ptr
[1670880440.887921] [fj003:1752829:0]          wireup.c:1047 UCX  ERROR   new: lane[2]: 12:knem/memory.0 md[6]           -> md[6]/knem/sysdev[255] rma_bw#0
[1670880440.887937] [fj003:1752829:0]          wireup.c:1047 UCX  ERROR   new: lane[3]:  7:rc_mlx5/mlx5_0:1.0 md[4]      -> md[4]/ib/sysdev[255] rma_bw#1 wireup
[fj003:1752826:0:1752826]      wireup.c:1335 Fatal: endpoint reconfiguration not supported yet
[fj003:1752827:0:1752827]      wireup.c:1335 Fatal: endpoint reconfiguration not supported yet
[fj003:1752829:0:1752829]      wireup.c:1335 Fatal: endpoint reconfiguration not supported yet
==== backtrace (tid:1752827) ====
=================================
==== backtrace (tid:1752826) ====
=================================

[1752827] signal (6.-6): Aborted
in expression starting at /lustre/home/mosgiordano/.julia/packages/MPI/5cAQG/test/spawned_worker.jl:4

[1752826] signal (6.-6): Aborted
in expression starting at /lustre/home/mosgiordano/.julia/packages/MPI/5cAQG/test/spawned_worker.jl:4
==== backtrace (tid:1752829) ====
=================================

[1752829] signal (6.-6): Aborted
in expression starting at /lustre/home/mosgiordano/.julia/packages/MPI/5cAQG/test/spawned_worker.jl:4
gsignal at /lib64/libc.so.6 (unknown line)
abort at /lib64/libc.so.6 (unknown line)
gsignal at /lib64/libc.so.6 (unknown line)
abort at /lib64/libc.so.6 (unknown line)
gsignal at /lib64/libc.so.6 (unknown line)
abort at /lib64/libc.so.6 (unknown line)
ucs_fatal_error_message at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/ucs/debug/assert.c:38
ucs_fatal_error_message at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/ucs/debug/assert.c:38
ucs_fatal_error_message at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/ucs/debug/assert.c:38
ucs_fatal_error_format at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/ucs/debug/assert.c:53
ucs_fatal_error_format at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/ucs/debug/assert.c:53
ucs_fatal_error_format at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/ucs/debug/assert.c:53
ucp_wireup_init_lanes at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/ucp/wireup/wireup.c:1335
ucp_wireup_init_lanes at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/ucp/wireup/wireup.c:1335
ucp_wireup_init_lanes at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/ucp/wireup/wireup.c:1335
ucp_wireup_init_lanes_by_request at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/ucp/wireup/wireup.c:434 [inlined]
ucp_wireup_process_request at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/ucp/wireup/wireup.c:561
ucp_wireup_init_lanes_by_request at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/ucp/wireup/wireup.c:434 [inlined]
ucp_wireup_process_request at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/ucp/wireup/wireup.c:561
ucp_wireup_init_lanes_by_request at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/ucp/wireup/wireup.c:434 [inlined]
ucp_wireup_process_request at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/ucp/wireup/wireup.c:561
ucp_wireup_msg_handler at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/ucp/wireup/wireup.c:791
ucp_wireup_msg_handler at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/ucp/wireup/wireup.c:791
ucp_wireup_msg_handler at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/ucp/wireup/wireup.c:791
uct_iface_invoke_am at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/uct/base/uct_iface.h:769 [inlined]
uct_ib_iface_invoke_am_desc at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/uct/ib/base/ib_iface.h:365 [inlined]
uct_ud_ep_process_rx at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/uct/ib/ud/base/ud_ep.c:952
uct_iface_invoke_am at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/uct/base/uct_iface.h:769 [inlined]
uct_ib_iface_invoke_am_desc at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/uct/ib/base/ib_iface.h:365 [inlined]
uct_ud_ep_process_rx at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/uct/ib/ud/base/ud_ep.c:952
uct_iface_invoke_am at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/uct/base/uct_iface.h:769 [inlined]
uct_ib_iface_invoke_am_desc at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/uct/ib/base/ib_iface.h:365 [inlined]
uct_ud_ep_process_rx at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/uct/ib/ud/base/ud_ep.c:952
uct_ud_mlx5_iface_poll_rx at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/uct/ib/ud/accel/ud_mlx5.c:510 [inlined]
uct_ud_mlx5_iface_progress at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/uct/ib/ud/accel/ud_mlx5.c:559
uct_ud_mlx5_iface_poll_rx at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/uct/ib/ud/accel/ud_mlx5.c:510 [inlined]
uct_ud_mlx5_iface_progress at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/uct/ib/ud/accel/ud_mlx5.c:559
uct_ud_mlx5_iface_poll_rx at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/uct/ib/ud/accel/ud_mlx5.c:510 [inlined]
uct_ud_mlx5_iface_progress at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/uct/ib/ud/accel/ud_mlx5.c:559
ucs_callbackq_dispatch at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/ucs/datastruct/callbackq.h:211 [inlined]
uct_worker_progress at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/uct/api/uct.h:2592 [inlined]
ucp_worker_progress at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/ucp/core/ucp_worker.c:2455
ucs_callbackq_dispatch at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/ucs/datastruct/callbackq.h:211 [inlined]
uct_worker_progress at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/uct/api/uct.h:2592 [inlined]
ucp_worker_progress at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/ucp/core/ucp_worker.c:2455
ucs_callbackq_dispatch at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/ucs/datastruct/callbackq.h:211 [inlined]
uct_worker_progress at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/uct/api/uct.h:2592 [inlined]
ucp_worker_progress at /lustre/projects/hpc_support_ookami/ucx-1.11.2/src/ucp/core/ucp_worker.c:2455
opal_progress at /lustre/software/openmpi/gcc12.1.0/4.1.4/lib/libopen-pal.so.40 (unknown line)
ompi_sync_wait_mt at /lustre/software/openmpi/gcc12.1.0/4.1.4/lib/libopen-pal.so.40 (unknown line)
opal_progress at /lustre/software/openmpi/gcc12.1.0/4.1.4/lib/libopen-pal.so.40 (unknown line)
ompi_sync_wait_mt at /lustre/software/openmpi/gcc12.1.0/4.1.4/lib/libopen-pal.so.40 (unknown line)
opal_progress at /lustre/software/openmpi/gcc12.1.0/4.1.4/lib/libopen-pal.so.40 (unknown line)
ompi_sync_wait_mt at /lustre/software/openmpi/gcc12.1.0/4.1.4/lib/libopen-pal.so.40 (unknown line)
ompi_comm_nextcid at /lustre/software/openmpi/gcc12.1.0/4.1.4/lib/libmpi.so (unknown line)
ompi_dpm_connect_accept at /lustre/software/openmpi/gcc12.1.0/4.1.4/lib/libmpi.so (unknown line)
ompi_dpm_dyn_init at /lustre/software/openmpi/gcc12.1.0/4.1.4/lib/libmpi.so (unknown line)
ompi_mpi_init at /lustre/software/openmpi/gcc12.1.0/4.1.4/lib/libmpi.so (unknown line)
PMPI_Init_thread at /lustre/software/openmpi/gcc12.1.0/4.1.4/lib/libmpi.so (unknown line)
ompi_comm_nextcid at /lustre/software/openmpi/gcc12.1.0/4.1.4/lib/libmpi.so (unknown line)
ompi_dpm_connect_accept at /lustre/software/openmpi/gcc12.1.0/4.1.4/lib/libmpi.so (unknown line)
ompi_dpm_dyn_init at /lustre/software/openmpi/gcc12.1.0/4.1.4/lib/libmpi.so (unknown line)
ompi_mpi_init at /lustre/software/openmpi/gcc12.1.0/4.1.4/lib/libmpi.so (unknown line)
ompi_comm_nextcid at /lustre/software/openmpi/gcc12.1.0/4.1.4/lib/libmpi.so (unknown line)
ompi_dpm_connect_accept at /lustre/software/openmpi/gcc12.1.0/4.1.4/lib/libmpi.so (unknown line)
PMPI_Init_thread at /lustre/software/openmpi/gcc12.1.0/4.1.4/lib/libmpi.so (unknown line)
ompi_dpm_dyn_init at /lustre/software/openmpi/gcc12.1.0/4.1.4/lib/libmpi.so (unknown line)
ompi_mpi_init at /lustre/software/openmpi/gcc12.1.0/4.1.4/lib/libmpi.so (unknown line)
PMPI_Init_thread at /lustre/software/openmpi/gcc12.1.0/4.1.4/lib/libmpi.so (unknown line)
MPI_Init_thread at /lustre/home/mosgiordano/.julia/packages/MPI/5cAQG/src/api/generated_api.jl:1899 [inlined]
_init_thread at /lustre/home/mosgiordano/.julia/packages/MPI/5cAQG/src/environment.jl:174 [inlined]
#Init#30 at /lustre/home/mosgiordano/.julia/packages/MPI/5cAQG/src/environment.jl:94
Init at /lustre/home/mosgiordano/.julia/packages/MPI/5cAQG/src/environment.jl:81
unknown function (ip: 0x400017027d47)
MPI_Init_thread at /lustre/home/mosgiordano/.julia/packages/MPI/5cAQG/src/api/generated_api.jl:1899 [inlined]
_init_thread at /lustre/home/mosgiordano/.julia/packages/MPI/5cAQG/src/environment.jl:174 [inlined]
#Init#30 at /lustre/home/mosgiordano/.julia/packages/MPI/5cAQG/src/environment.jl:94
MPI_Init_thread at /lustre/home/mosgiordano/.julia/packages/MPI/5cAQG/src/api/generated_api.jl:1899 [inlined]
_init_thread at /lustre/home/mosgiordano/.julia/packages/MPI/5cAQG/src/environment.jl:174 [inlined]
#Init#30 at /lustre/home/mosgiordano/.julia/packages/MPI/5cAQG/src/environment.jl:94
Init at /lustre/home/mosgiordano/.julia/packages/MPI/5cAQG/src/environment.jl:81
unknown function (ip: 0x400017027d47)
Init at /lustre/home/mosgiordano/.julia/packages/MPI/5cAQG/src/environment.jl:81
unknown function (ip: 0x400016ff7d47)
_jl_invoke at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2524 [inlined]
ijl_apply_generic at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2706
_jl_invoke at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2524 [inlined]
ijl_apply_generic at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2706
_jl_invoke at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2524 [inlined]
ijl_apply_generic at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2706
jl_apply at /cache/build/default-armageddon-2/julialang/julia-master/src/julia.h:1875 [inlined]
do_call at /cache/build/default-armageddon-2/julialang/julia-master/src/interpreter.c:126
jl_apply at /cache/build/default-armageddon-2/julialang/julia-master/src/julia.h:1875 [inlined]
do_call at /cache/build/default-armageddon-2/julialang/julia-master/src/interpreter.c:126
jl_apply at /cache/build/default-armageddon-2/julialang/julia-master/src/julia.h:1875 [inlined]
do_call at /cache/build/default-armageddon-2/julialang/julia-master/src/interpreter.c:126
eval_value at /cache/build/default-armageddon-2/julialang/julia-master/src/interpreter.c:226
eval_value at /cache/build/default-armageddon-2/julialang/julia-master/src/interpreter.c:226
eval_value at /cache/build/default-armageddon-2/julialang/julia-master/src/interpreter.c:226
eval_stmt_value at /cache/build/default-armageddon-2/julialang/julia-master/src/interpreter.c:177 [inlined]
eval_body at /cache/build/default-armageddon-2/julialang/julia-master/src/interpreter.c:624
eval_stmt_value at /cache/build/default-armageddon-2/julialang/julia-master/src/interpreter.c:177 [inlined]
eval_body at /cache/build/default-armageddon-2/julialang/julia-master/src/interpreter.c:624
eval_stmt_value at /cache/build/default-armageddon-2/julialang/julia-master/src/interpreter.c:177 [inlined]
eval_body at /cache/build/default-armageddon-2/julialang/julia-master/src/interpreter.c:624
jl_interpret_toplevel_thunk at /cache/build/default-armageddon-2/julialang/julia-master/src/interpreter.c:762
jl_interpret_toplevel_thunk at /cache/build/default-armageddon-2/julialang/julia-master/src/interpreter.c:762
jl_interpret_toplevel_thunk at /cache/build/default-armageddon-2/julialang/julia-master/src/interpreter.c:762
jl_toplevel_eval_flex at /cache/build/default-armageddon-2/julialang/julia-master/src/toplevel.c:912
jl_toplevel_eval_flex at /cache/build/default-armageddon-2/julialang/julia-master/src/toplevel.c:912
jl_toplevel_eval_flex at /cache/build/default-armageddon-2/julialang/julia-master/src/toplevel.c:912
jl_toplevel_eval_flex at /cache/build/default-armageddon-2/julialang/julia-master/src/toplevel.c:856
ijl_toplevel_eval_in at /cache/build/default-armageddon-2/julialang/julia-master/src/toplevel.c:971
jl_toplevel_eval_flex at /cache/build/default-armageddon-2/julialang/julia-master/src/toplevel.c:856
ijl_toplevel_eval_in at /cache/build/default-armageddon-2/julialang/julia-master/src/toplevel.c:971
jl_toplevel_eval_flex at /cache/build/default-armageddon-2/julialang/julia-master/src/toplevel.c:856
ijl_toplevel_eval_in at /cache/build/default-armageddon-2/julialang/julia-master/src/toplevel.c:971
eval at ./boot.jl:370 [inlined]
include_string at ./loading.jl:1522
_jl_invoke at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2524 [inlined]
ijl_apply_generic at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2706
eval at ./boot.jl:370 [inlined]
include_string at ./loading.jl:1522
_jl_invoke at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2524 [inlined]
ijl_apply_generic at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2706
eval at ./boot.jl:370 [inlined]
include_string at ./loading.jl:1522
_jl_invoke at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2524 [inlined]
ijl_apply_generic at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2706
_include at ./loading.jl:1582
_include at ./loading.jl:1582
_include at ./loading.jl:1582
include at ./Base.jl:450
include at ./Base.jl:450
include at ./Base.jl:450
jfptr_include_49529 at /lustre/software/julia-5da8d5f17a/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2524 [inlined]
ijl_apply_generic at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2706
jfptr_include_49529 at /lustre/software/julia-5da8d5f17a/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2524 [inlined]
ijl_apply_generic at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2706
jfptr_include_49529 at /lustre/software/julia-5da8d5f17a/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2524 [inlined]
ijl_apply_generic at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2706
exec_options at ./client.jl:307
exec_options at ./client.jl:307
exec_options at ./client.jl:307
_start at ./client.jl:522
_start at ./client.jl:522
_start at ./client.jl:522
jfptr__start_36420 at /lustre/software/julia-5da8d5f17a/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2524 [inlined]
ijl_apply_generic at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2706
jfptr__start_36420 at /lustre/software/julia-5da8d5f17a/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2524 [inlined]
ijl_apply_generic at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2706
jfptr__start_36420 at /lustre/software/julia-5da8d5f17a/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2524 [inlined]
ijl_apply_generic at /cache/build/default-armageddon-2/julialang/julia-master/src/gf.c:2706
jl_apply at /cache/build/default-armageddon-2/julialang/julia-master/src/julia.h:1875 [inlined]
true_main at /cache/build/default-armageddon-2/julialang/julia-master/src/jlapi.c:573
jl_repl_entrypoint at /cache/build/default-armageddon-2/julialang/julia-master/src/jlapi.c:717
main at /cache/build/default-armageddon-2/julialang/julia-master/cli/loader_exe.c:58
__libc_start_main at /lib64/libc.so.6 (unknown line)
_start at /lustre/software/julia-5da8d5f17a/bin/julia (unknown line)
_start at /lustre/software/julia-5da8d5f17a/bin/julia (unknown line)
Allocations: 2975 (Pool: 2963; Big: 12); GC: 0
jl_apply at /cache/build/default-armageddon-2/julialang/julia-master/src/julia.h:1875 [inlined]
true_main at /cache/build/default-armageddon-2/julialang/julia-master/src/jlapi.c:573
jl_repl_entrypoint at /cache/build/default-armageddon-2/julialang/julia-master/src/jlapi.c:717
main at /cache/build/default-armageddon-2/julialang/julia-master/cli/loader_exe.c:58
__libc_start_main at /lib64/libc.so.6 (unknown line)
_start at /lustre/software/julia-5da8d5f17a/bin/julia (unknown line)
_start at /lustre/software/julia-5da8d5f17a/bin/julia (unknown line)
Allocations: 2975 (Pool: 2963; Big: 12); GC: 0
jl_apply at /cache/build/default-armageddon-2/julialang/julia-master/src/julia.h:1875 [inlined]
true_main at /cache/build/default-armageddon-2/julialang/julia-master/src/jlapi.c:573
jl_repl_entrypoint at /cache/build/default-armageddon-2/julialang/julia-master/src/jlapi.c:717
main at /cache/build/default-armageddon-2/julialang/julia-master/cli/loader_exe.c:58
__libc_start_main at /lib64/libc.so.6 (unknown line)
_start at /lustre/software/julia-5da8d5f17a/bin/julia (unknown line)
_start at /lustre/software/julia-5da8d5f17a/bin/julia (unknown line)
Allocations: 2975 (Pool: 2963; Big: 12); GC: 0
--------------------------------------------------------------------------
Child job 2 terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 1752827 on node fj003 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
2 total processes killed (some possibly by mpiexec during cleanup)
test_spawn.jl: Error During Test at /lustre/home/mosgiordano/.julia/packages/MPI/5cAQG/test/runtests.jl:43
  Got exception outside of a @test
  failed process: Process(`mpiexec -n 1 /lustre/software/julia-5da8d5f17a/bin/julia -Cnative -J/lustre/software/julia-5da8d5f17a/lib/julia/sys.so --depwarn=yes --check-bounds=yes -g1 --color=yes --startup-file=no /lustre/home/mosgiordano/.julia/packages/MPI/5cAQG/test/test_spawn.jl`, ProcessExited(134)) [134]

New entry for the known OpenMPI issues section? Was it reported upstream already?

@simonbyrne
Copy link
Member

Was it reported upstream already?

No: can you open an issue? They will probably want to know the version of HCOLL you're using.

@vchuravy is there someone at Nvidia we should contact?

@giordano
Copy link
Member Author

No: can you open an issue?

Yes, but where, OpenMPI or HCOLL? Couldn't find where to report bugs in HCOLL.

@simonbyrne
Copy link
Member

I would open it on Open MPI, and let them or @vchuravy contact the appropriate Mellanox/Nvidia folks

@giordano
Copy link
Member Author

For the last error with OpenMPI, in test_spawn, it looks like we're hitting https://github.com/openucx/ucx/blob/ef2bbcf6f8a653fb70a7be0673644db5d3ca10c2/src/ucp/wireup/wireup.c#L1316-L1336 (this block is also still on master: https://github.com/openucx/ucx/blob/5f26dd48122b588fd6aad6746fcebd35522e1afe/src/ucp/wireup/wireup.c#L1531-L1551). Probably not much we can do about this? Any ideas how to handle this more gently (e.g. how to skip the test if we can detect it won't work)?

@simonbyrne
Copy link
Member

We should probably have a generic mechanism for skipping tests
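
A generic skip mechanism could be as simple as an exclude list read from the environment. The sketch below only illustrates the idea; the JULIA_MPI_TEST_EXCLUDE variable name and the surrounding test-runner structure are assumptions, not an existing MPI.jl feature.

# Hypothetical runtests.jl fragment: skip test files listed (comma-separated)
# in an assumed JULIA_MPI_TEST_EXCLUDE environment variable.
excluded = split(get(ENV, "JULIA_MPI_TEST_EXCLUDE", ""), ',', keepempty=false)
testfiles = filter(f -> startswith(f, "test_") && endswith(f, ".jl"), readdir(@__DIR__))
for f in testfiles
    if f in excluded
        @info "Skipping $f"
        continue
    end
    # launch `mpiexec -n <nprocs> julia ... $f` here, as the test suite already does
end

That way a site like Ookami could opt out of test_spawn.jl without patching the package.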

@simonbyrne
Copy link
Member

@giordano can you try now with #693 merged?

@t-bltg
Copy link
Collaborator

t-bltg commented Dec 16, 2022

From @giordano's experiments in #693 (comment), it seems it didn't fully help.

@giordano
Copy link
Member Author

giordano commented Dec 19, 2022

Ok, this is weird. Now on master the allgather tests still segfault

[1112653] signal (11.1): Segmentation fault                                                                                                                                                                                                 
in expression starting at /lustre/home/mosgiordano/.julia/dev/MPI/test/test_allgather.jl:56                                                                                                                                                 
                                                                                                                                                                                                                                            
[1112656] signal (11.1): Segmentation fault                                                                                                                                                                                                 
in expression starting at /lustre/home/mosgiordano/.julia/dev/MPI/test/test_allgather.jl:56

[1112654] signal (11.1): Segmentation fault
in expression starting at /lustre/home/mosgiordano/.julia/dev/MPI/test/test_allgather.jl:56

[1112655] signal (11.1): Segmentation fault
in expression starting at /lustre/home/mosgiordano/.julia/dev/MPI/test/test_allgather.jl:56
MPIR_Call_attr_delete at /lustre/software/mvapich2/gcc11/2.3.6/lib/libmpi.so (unknown line)
MPIR_Attr_delete_list at /lustre/software/mvapich2/gcc11/2.3.6/lib/libmpi.so (unknown line)
Allocations: 2975 (Pool: 2963; Big: 12); GC: 0
MPIR_Call_attr_delete at /lustre/software/mvapich2/gcc11/2.3.6/lib/libmpi.so (unknown line)
MPIR_Call_attr_delete at /lustre/software/mvapich2/gcc11/2.3.6/lib/libmpi.so (unknown line)
MPIR_Attr_delete_list at /lustre/software/mvapich2/gcc11/2.3.6/lib/libmpi.so (unknown line)
Allocations: 2975 (Pool: 2963; Big: 12); GC: 0
MPIR_Attr_delete_list at /lustre/software/mvapich2/gcc11/2.3.6/lib/libmpi.so (unknown line)
Allocations: 2975 (Pool: 2963; Big: 12); GC: 0
MPIR_Call_attr_delete at /lustre/software/mvapich2/gcc11/2.3.6/lib/libmpi.so (unknown line)
MPIR_Attr_delete_list at /lustre/software/mvapich2/gcc11/2.3.6/lib/libmpi.so (unknown line)
Allocations: 2975 (Pool: 2963; Big: 12); GC: 0
srun: error: fj003: tasks 0-3: Segmentation fault (core dumped)

even though a standalone program like #684 (comment) doesn't anymore 😕 However, as mentioned in #693 (comment), this program

using MPI
MPI.Init(;threadlevel=:single)
MPI.Datatype(Char)
MPI.Finalize()

also segfaults:

$ srun -n 2 julia --project test.jl 

[1113525] signal (11.1): Segmentation fault
in expression starting at /lustre/home/mosgiordano/tmp/mvapich/test.jl:4

[1113526] signal (11.1): Segmentation fault
in expression starting at /lustre/home/mosgiordano/tmp/mvapich/test.jl:4
MPIR_Call_attr_delete at /lustre/software/mvapich2/gcc11/2.3.6/lib/libmpi.so (unknown line)
MPIR_Attr_delete_list at /lustre/software/mvapich2/gcc11/2.3.6/lib/libmpi.so (unknown line)
Allocations: 2975 (Pool: 2963; Big: 12); GC: 0
MPIR_Call_attr_delete at /lustre/software/mvapich2/gcc11/2.3.6/lib/libmpi.so (unknown line)
MPIR_Attr_delete_list at /lustre/software/mvapich2/gcc11/2.3.6/lib/libmpi.so (unknown line)
Allocations: 2975 (Pool: 2963; Big: 12); GC: 0
srun: error: fj003: tasks 0-1: Segmentation fault (core dumped)

But this C program

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    MPI_Datatype dup_type;
    MPI_Type_dup(MPI_UINT32_T, &dup_type);

    MPI_Finalize();
    return 0;
}

runs without problems. Does MPI.Datatype(Char) do anything else?

Edit: yes, MPI.Datatype(Char) does quite a lot more than MPI_Type_dup; I'm trying to follow all the steps, which will take a while...

@giordano
Copy link
Member Author

giordano commented Dec 19, 2022

If I inline MPI.Datatype(T) and its inner branches, then I don't get a segfault:

using MPI
MPI.Init(; threadlevel=:single)
T = Char

# inline MPI.Datatype(T)
get!(MPI.created_datatypes, T) do
    datatype = MPI.Datatype(MPI.API.MPI_DATATYPE_NULL[])
    @assert MPI.Initialized()
    MPI.Types.duplicate!(datatype, MPI.Datatype(UInt32))
end

MPI.Finalize()

This is getting weirder and weirder.... 😢

Edit: ignore this message, it's wrong, see below.

@giordano
Copy link
Member Author

Scratch my previous message: I had inlined a bit too much and removed the actually offending lines.

This is a better reproducer for the segfault:

using MPI
MPI.Init(; threadlevel=:single)

datatype = MPI.Datatype(MPI.API.MPI_DATATYPE_NULL[])
MPI.API.MPI_Type_dup(MPI.Datatype(UInt32), datatype)
MPI.API.MPI_Type_commit(datatype)
MPI.API.MPI_Type_set_attr(datatype, MPI.JULIA_TYPE_PTR_ATTR[], pointer_from_objref(Char))

MPI.Finalize()

This is in principle relatively easy to translate to C, except that I don't know what MPI.JULIA_TYPE_PTR_ATTR[] and pointer_from_objref(Char) would correspond to outside of Julia, and the MPI_Type_set_attr call is necessary to trigger the segfault in MPI.Finalize(). Any ideas?

@Gnimuc
Copy link
Contributor

Gnimuc commented Dec 19, 2022

I noticed MPI_TYPE_NULL_COPY_FN and MPI_TYPE_NULL_DELETE_FN are defined (initialized) in openmpi.jl and mpt.jl, but not in mpich.jl and microsoftmpi.jl.

The segfault occurred in MPIR_Attr_delete_list, so I suspect those copy/delete callback functions are not correctly registered.

@giordano
Copy link
Member Author

Those are null pointers also in the headers of MVAPICH2 (see also #688 (comment)):

$ echo '#include <mpi.h>' | mpicc -dM -E - | grep -E 'MPI_TYPE_NULL_(COPY|DELETE)_FN'
#define MPI_TYPE_NULL_DELETE_FN ((MPI_Type_delete_attr_function*)0)
#define MPI_TYPE_NULL_COPY_FN ((MPI_Type_copy_attr_function*)0)

But yes, maybe we aren't registering the callbacks correctly and those null pointers end up being called? Using a debugger here is a bit complicated.

@simonbyrne
Copy link
Member

@giordano I think I figured it out: the issue is that MVAPICH doesn't like it when objects are not cleaned up before MPI_Finalize. From your sample code:

using MPI
MPI.Init(; threadlevel=:single)

datatype = MPI.Datatype(MPI.API.MPI_DATATYPE_NULL[])
MPI.API.MPI_Type_dup(MPI.Datatype(UInt32), datatype)
MPI.API.MPI_Type_commit(datatype)
MPI.API.MPI_Type_set_attr(datatype, MPI.JULIA_TYPE_PTR_ATTR[], pointer_from_objref(Char))

# calling either of these will prevent a segfault:
#   MPI.API.MPI_Type_delete_attr(datatype, MPI.JULIA_TYPE_PTR_ATTR[])
#   MPI.free(datatype)
MPI.Finalize()

@simonbyrne
Copy link
Member

The easiest fix for now is probably to free every entry in MPI.created_datatypes as part of MPI_Finalize
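
A rough sketch of that idea follows; whether add_finalize_hook! and the created_datatypes cache are the right hooks to use is an assumption here, not a statement about how the fix was actually implemented.

using MPI

# Sketch only: free every cached user-created datatype just before MPI_Finalize,
# so MVAPICH's attribute cleanup never sees a dangling Julia-side attribute.
MPI.add_finalize_hook!() do
    for datatype in values(MPI.created_datatypes)
        MPI.free(datatype)
    end
    empty!(MPI.created_datatypes)
end

As the reproducer above shows, either deleting the attribute or freeing the datatype before Finalize avoids the crash, so freeing each cached datatype should be enough.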

@simonbyrne
Copy link
Member

simonbyrne commented Dec 19, 2022

This should be a C reproducer:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    MPI_Datatype dup_type;
    MPI_Type_dup(MPI_UINT32_T, &dup_type);

    MPI_Type_commit(&dup_type);

    int keyval;
    MPI_Type_create_keyval(MPI_TYPE_NULL_COPY_FN,
                           MPI_TYPE_NULL_DELETE_FN,
                           &keyval, NULL);
    
    MPI_Type_set_attr(dup_type, keyval, NULL);
      
    MPI_Finalize();
    return 0;
}

@giordano
Copy link
Member Author

That code does segfault, but the error message is different: it doesn't mention MPIR_Call_attr_delete or MPIR_Attr_delete_list:

$ srun -n 2 ./test
[fj023:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
[fj023:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
srun: error: fj023: tasks 0-1: Segmentation fault (core dumped)

With #696, #684 (comment) still segfaults, always with the same stacktrace 😢

@simonbyrne
Copy link
Member

#696 won't fix #684 (comment) (since we don't track that type), but it should fix #684 (comment)

@simonbyrne
Copy link
Member

@giordano did you want to send a bug report to MVAPICH? It looks like you have to do it through the mailing list: https://mvapich.cse.ohio-state.edu/help/

@simonbyrne
Copy link
Member

I emailed the list. Closing this for now, re-open if more issues arise.

@giordano
Copy link
Member Author

giordano commented Jan 17, 2023

For the record, segmentation faults are gone when using MVAPICH after applying the patch

--- a/src/mpi/attr/attrutil.c
+++ b/src/mpi/attr/attrutil.c
@@ -266,6 +266,7 @@
 	   corresponding keyval */
 	/* Still to do: capture any error returns but continue to 
 	   process attributes */
+    if (p->keyval) {
 	mpi_errno = MPIR_Call_attr_delete( handle, p );
 
 	/* We must also remove the keyval reference.  If the keyval
@@ -282,6 +283,7 @@
 		MPIU_Handle_obj_free( &MPID_Keyval_mem, p->keyval );
 	    }
 	}
+	}
 	
 	MPIU_Handle_obj_free( &MPID_Attr_mem, p );
 	

suggested on the MVAPICH mailing list. The patch guards the MPIR_Call_attr_delete call (and the keyval dereference) against attributes with a null keyval, which matches the MPIR_Call_attr_delete / MPIR_Attr_delete_list frames in the traces above.
