Cray CXI SHS11.1 and openmpi@main fail with intra-node communication #13148
You need to set some non-default MCA parameters for the OFI BTL. If you use environment variables to set the MCA params, here they are:
You may also need to set the following PRRTE MCA params:
We set these in the Spack-generated module files we use at NERSC and on our internal SS11 systems. We have not done extensive testing of Open MPI with the libfabric 2.0.0 or 2.1.0 RCs, if you are using one of those. Here's a snippet from the modules.yaml file I like to use:
We're setting FI_CXI_RX_MATCH_MODE to software as it typically seems to give better performance, although mileage varies depending on the app. We set OMPI_MCA_pml to cm as it's the quickest way to find out if something's not working. We are finding, though, that a number of apps do better using the ob1 PML.
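For reference, a minimal sketch of what such environment-variable settings might look like, based only on the values named above; the full list and the modules.yaml snippet from the original comment were not captured in this extract:
# Illustrative sketch only
export FI_CXI_RX_MATCH_MODE=software   # software matching mode, as suggested above
export OMPI_MCA_pml=cm                 # cm PML as a quick sanity check; some apps do better with ob1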
Hmm...this may be a new problem. Could you see what happens if you try to force the run to use the OB1 PML?
Also, what happens if you use libfabric@1.22.0?
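As a sketch, forcing the OB1 PML from the command line could look like this (rank count and benchmark binary assumed):
mpirun --mca pml ob1 -np 2 ./osu_bw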
Dear Howard, thank you so much for your really quick answer! Unfortunately I still get the exact same error, even with all the variables set... This is what I get when running the command requested above:
[gpu005][[60577,1],0][btl_ofi_module.c:88:mca_btl_ofi_add_procs] error receiving modex
[gpu005][[60577,1],0][btl_ofi_component.c:244:mca_btl_ofi_exit] BTL OFI will now abort.
I do not use libfabric@1.22.0 because, if I remember correctly, it either fails to build when trying to use the open-source CXI provider or ends up failing performance-wise.
Edit: I tried pinning the libfabric version to 1.22.0; unfortunately I get the exact same error.
Does the test run successfully if we get OFI out of the picture?
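One way to take OFI out of the picture, sketched under the assumption that the OSU bandwidth test is the reproducer:
# Exclude the OFI BTL and MTL so only non-OFI transports (e.g. shared memory) are used
mpirun --mca pml ob1 --mca btl ^ofi --mca mtl ^ofi -np 2 ./osu_bw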
Indeed, it's then successful! But obviously with poor bandwidth.
Ah, something I forgot to mention that is really important: this exact same Open MPI build was working with SHS 2.1.3. We upgraded the system image and are unfortunately no longer able to run intra-node Open MPI jobs correctly.
Could you try running with OB1 and the OFI BTL, and set FI_LOG_LEVEL=debug?
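A sketch of such a run (rank count and benchmark assumed):
# OB1 PML with the OFI BTL, libfabric debug logging enabled
FI_LOG_LEVEL=debug mpirun --mca pml ob1 --mca btl self,ofi -np 2 ./osu_bw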
Something else to try: if you build Open MPI against the system libfabric, does it work?
Here is the output of FI_LOG_LEVEL=debug. This line in particular is definitely suspicious:
Notice I am still setting:
Another interesting data point: using the system libfabric doesn't solve the problem either...
Currently Loaded Modules:
1) zstd/1.5.6-jcbw 2) gcc/14.2.0
3) hwloc/2.11.1-GH200-gpu 4) libfabric/1.22.0
5) xpmem/2.9.6-1.1 6) zlib-ng/2.2.3-o7xb
7) openmpi/main-v6sz-GH200-gpu 8) cuda/12.8.0-ne7u
9) osu-micro-benchmarks/7.5-GH200-gpu
[gpu003][[39322,1],0][btl_ofi_module.c:88:mca_btl_ofi_add_procs] error receiving modex
[gpu003][[39322,1],0][btl_ofi_component.c:244:mca_btl_ofi_exit] BTL OFI will now abort.
--------------------------------------------------------------------------
This help section is empty because PRRTE was built without Sphinx.
--------------------------------------------------------------------------
And I checked with ldd: no OSS libraries, only the system ones.
Edit: it might only be a problem with the newest version of Open MPI; it seems I don't hit it when using 5.0.3!
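For reference, a sketch of the kind of ldd check mentioned above; the install prefix is a placeholder, and Open MPI normally places its MCA components under lib/openmpi/:
# $OMPI_PREFIX is a placeholder for the Open MPI install prefix
ldd $OMPI_PREFIX/lib/openmpi/mca_btl_ofi.so | grep -i fabric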
OK... I kind of found a solution: adding --exclusive to our node allocations solves the problem, which is definitely not the best approach for us, but it is at least a workaround.
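As a sketch, the workaround looks roughly like this (allocation and benchmark commands assumed):
# Request the node exclusively before launching
salloc -N 1 --exclusive
mpirun -np 2 ./osu_bw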
I think we need to poke around with how Slurm is configured. You may want to see if there were changes made to the Slurm version and configuration as part of the upgrade to SHS 11.1. I am pointing our Slurm admin to this issue to take a look.
Could you run this command
and paste the output into this issue? On one of our XC systems this reports:
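The exact command was not captured in this extract; a hedged guess at the kind of query meant here is a dump of Slurm's switch-plugin settings, for example:
# Illustrative only, not necessarily the command requested above
scontrol show config | grep -i switch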
We tried the three following switch parameters on our test system; unfortunately it doesn't help. I will try to recompile Slurm, because it was compiled against libfabric 1.15.2.0; I am not sure that should be a problem, but still.
FYI: I just experimented with another version of Slurm and got roughly the same problem. This was just to make sure it's not Slurm 24.11.3 that is the problem.
Thank you very much for all the help, and apologies for the time this took. The problem was a CXI network setting on the machine that had been configured incorrectly; fortunately we were able to correct it. It has nothing to do with Slurm, Open MPI, or libfabric themselves.
Would you mind sharing that CXI network setting here? I'm sure it's only a matter of time before this resurfaces somewhere else.
Sure, sorry, I was just checking how much is OK to share. So, after the upgrade from SHS 2.1.3 to SHS 11.1 the module configuration had changed, and we fixed it by re-enabling the CXI service on each device:
cxi_service -d cxi0 enable -s 1
cxi_service -d cxi1 enable -s 1
cxi_service -d cxi2 enable -s 1
cxi_service -d cxi3 enable -s 1
Or directly by setting the proper boot parameters. With that fixed, the intra-node run now looks like this:
mpirun --mca mtl ofi --mca opal_common_ofi_provider_include "shm+cxi:linkx" --map-by ppr:1:l3cache --bind-to core --np 2 osu_bw -d cuda D D
# OSU MPI-CUDA Bandwidth Test v7.5
# Datatype: MPI_CHAR.
# Size Bandwidth (MB/s)
1 0.23
2 0.47
4 0.94
8 1.89
16 3.81
32 7.60
64 15.24
128 30.43
256 60.89
512 120.38
1024 241.64
2048 483.08
4096 962.28
8192 1928.12
16384 3841.19
32768 7663.60
65536 15275.76
131072 30226.04
262144 58536.63
524288 80619.33
1048576 98711.76
2097152 111745.70
4194304 120530.83
However, using Linkx between two nodes, we still get very poor performance. A new issue will be opened in case this is not resolved, but for now this is being addressed at the libfabric level.
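For comparison, a sketch of the corresponding two-node run (one rank per node; only the mapping option differs from the command above):
mpirun --mca mtl ofi --mca opal_common_ofi_provider_include "shm+cxi:linkx" --map-by ppr:1:node --bind-to core --np 2 osu_bw -d cuda D D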
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
Branch main, 10 March 2025
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Installed via Spack, using the OSS libfabric with CXI support. Compiler args:
spack spec:
Please describe the system on which you are running
Details of the problem
Multi-node jobs run without any problem. Multi-task jobs on a single node fail with the following error:
ompi_info reports that the btl ofi component is present, but it still seems to fail.
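For completeness, a sketch of the failing single-node case (commands assumed, using the OSU bandwidth test referenced elsewhere in this thread):
# Two ranks on a single node: this is the case that fails; multi-node runs work
salloc -N 1 -n 2
mpirun --np 2 ./osu_bw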
It seems very similar to issue #12038, but since I am using the main branch, this should have been fixed in the meantime...
Thanks a lot for any help in advance!