btl smcuda hang in v4.1.5 #12270

Closed
lrbison opened this issue Jan 23, 2024 · 3 comments · Fixed by #12338

@lrbison
Contributor

lrbison commented Jan 23, 2024

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

v4.1.x

commit 6c1ecd00767b54c70e524d9d551db1f132c1fca8 (HEAD -> v4.1.x, origin/v4.1.x)
Date:   Thu Jan 18 12:06:31 2024 -0500

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git clone:

git clone --recurse-submodules -j4 https://github.com/open-mpi/ompi.git  --branch v4.1.x

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Oddly, git submodule status returns 0 but prints nothing. Of course: there are no submodules in v4.1.x!

Please describe the system on which you are running

  • Operating system/version: Ubuntu 20.04, Rocky 9, Others
  • Computer hardware: Graviton 3 (*7g)
  • Network type: None

Details of the problem

Started in https://gitlab.com/eessi/support/-/issues/41#note_1738867500

There are some applications that crash or hang when run on c7g.4xlarge. EasyBuild configurations include a patch to always build against CUDA. However, I can reproduce hangs with a fresh build, without the EB patches, against the Debian-provided CUDA (no crashes yet).

The symptom I've been able to reproduce is a hang in the smcuda btl, so I must configure with CUDA support, and ofi must be excluded at either configure time or run time. Note that we are not using CUDA memory, only the smcuda btl.

./configure --with-cuda --prefix=/fsx/tmp/ompi-with-cuda --enable-debug
make -j && make -j install
module load <my-new-build>
mpirun --mca btl ^ofi --mca mtl ^ofi  -n 64 /fsx/lrbison/eessi/mpi-benchmarks/src_c/IMB-MPI1 alltoall -npmin 64

I'm compiling with gcc 12.3.0.

In fully loaded runs (i.e., 64 ranks on hpc7g), IMB's allgather or alltoall test hangs relatively frequently (10%?). A lightly loaded node (6/64) may take as many as 300 executions to hit a hang.

A backtrace looks like this:

#0  0x0000ffff84320a14 in sm_fifo_read (fifo=0xffff7010d480) at /dev/shm/ompi/opal/mca/btl/smcuda/btl_smcuda.h:315
#1  0x0000ffff84322ce4 in mca_btl_smcuda_component_progress () at btl_smcuda_component.c:1036
#2  0x0000ffff863b5aa4 in opal_progress () at runtime/opal_progress.c:231
#3  0x0000ffff867c3258 in sync_wait_st (sync=0xffffff8f7cf0) at ../opal/threads/wait_sync.h:83
#4  0x0000ffff867c3c50 in ompi_request_default_wait_all (count=5, requests=0x1f7d4708, statuses=0x0) at request/req_wait.c:234
#5  0x0000ffff8688c638 in ompi_coll_base_barrier_intra_basic_linear (comm=0x1f7356d0, module=0x1f7c4d10) at base/coll_base_barrier.c:366
#6  0x0000ffff6fa2dc28 in ompi_coll_tuned_barrier_intra_do_this (comm=0x1f7356d0, module=0x1f7c4d10, algorithm=1, faninout=0, segsize=0)
    at coll_tuned_barrier_decision.c:99
#7  0x0000ffff6fa249f8 in ompi_coll_tuned_barrier_intra_dec_fixed (comm=0x1f7356d0, module=0x1f7c4d10) at coll_tuned_decision_fixed.c:500
#8  0x0000ffff867e6c54 in PMPI_Barrier (comm=0x1f7356d0) at pbarrier.c:66
#9  0x000000000040e22c in IMB_alltoall ()

I find all ranks in a barrier, and every fifo read comes up with SM_FIFO_FREE, but they are all waiting on some completion. To me this means a message was overwritten/missed/dropped.

I have attempted to reproduce this on the 5.0.x branch; however, smcuda deactivates itself there when it cannot initialize an accelerator stream.

@jsquyres
Member

This may be related to issue #12011 and PR #11999 (#12005 is the v4.1.x PR).

@jsquyres
Member

Per the Jan 30 discussion: this may or may not be a blocker for the v4.1.7 release. Let's re-evaluate as we get closer to the end of Q1 CY2024 / the expected release of v4.1.7.

@lrbison
Contributor Author

lrbison commented Feb 13, 2024

I am making this investigation my top priority this week.

lrbison added a commit to lrbison/ompi that referenced this issue Feb 14, 2024
This change fixes open-mpi#12270

Testing on c7g instance type (arm64) confirms this change eliminates
hangs and crashes that were previously observed in 1 in 30 runs of
IMB alltoall benchmark.  Tested with over 300 runs and no failures.

The write memory barrier prevents other CPUs from observing the fifo
get updated before they observe the updated contents of the header
itself.  Without the barrier, uninitialized header contents caused
the crashes and invalid data.

Signed-off-by: Luke Robison <lrbison@amazon.com>
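
The ordering requirement described in the commit message above can be sketched in portable C11. This is not the actual Open MPI patch: the names demo_fifo_t, demo_hdr_t, demo_fifo_write, demo_fifo_read and DEMO_FIFO_FREE are hypothetical stand-ins, and the one-slot queue is far simpler than the real smcuda fifo. It only illustrates the idea that the writer must make the header contents visible before it publishes the header's address, and that the reader needs a matching barrier before it dereferences what it found.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel for an empty slot; stands in for SM_FIFO_FREE. */
#define DEMO_FIFO_FREE ((uintptr_t)0)

/* Hypothetical message header, not the real smcuda header layout. */
typedef struct {
    int tag;
    int len;
} demo_hdr_t;

/* Hypothetical one-slot fifo placed in memory shared by writer and reader. */
typedef struct {
    _Atomic uintptr_t slot;
} demo_fifo_t;

static void demo_fifo_init(demo_fifo_t *fifo)
{
    atomic_init(&fifo->slot, DEMO_FIFO_FREE);
}

/* Writer side. Returns 0 on success, -1 if the slot is still occupied. */
static int demo_fifo_write(demo_fifo_t *fifo, demo_hdr_t *hdr, int tag, int len)
{
    if (atomic_load_explicit(&fifo->slot, memory_order_relaxed) != DEMO_FIFO_FREE) {
        return -1;
    }

    /* Step 1: fill in the header with plain stores. */
    hdr->tag = tag;
    hdr->len = len;

    /* Step 2: write barrier. Without it, a weakly ordered CPU such as
     * arm64 may let the slot store below become visible before the
     * header stores above, so the reader could see a stale header. */
    atomic_thread_fence(memory_order_release);

    /* Step 3: publish the header address. */
    atomic_store_explicit(&fifo->slot, (uintptr_t)hdr, memory_order_relaxed);
    return 0;
}

/* Reader side. Returns NULL when the slot is free. */
static demo_hdr_t *demo_fifo_read(demo_fifo_t *fifo)
{
    uintptr_t value = atomic_load_explicit(&fifo->slot, memory_order_relaxed);
    if (value == DEMO_FIFO_FREE) {
        return NULL;
    }

    /* Read barrier pairing with the writer's release fence: header
     * loads done by the caller are ordered after the slot load. */
    atomic_thread_fence(memory_order_acquire);

    /* Hand the slot back to the writer. */
    atomic_store_explicit(&fifo->slot, DEMO_FIFO_FREE, memory_order_relaxed);
    return (demo_hdr_t *)value;
}
```

In this sketch the caller is assumed to pass a header that the reader is no longer using. Dropping the release fence would still appear to work on strongly ordered hardware such as x86, which is consistent with the hang showing up on arm64 in this report.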
lrbison added a commit to lrbison/ompi that referenced this issue Feb 15, 2024
This change fixes open-mpi#12270

Testing on c7g instance type (arm64) confirms this change eliminates
hangs and crashes that were previously observed in 1 in 30 runs of
IMB alltoall benchmark.  Tested with over 300 runs and no failures.

The write memory barrier prevents other CPUs from observing the fifo
get updated before they observe the updated contents of the header
itself.  Without the barrier, uninitialized header contents caused
the crashes and invalid data.

Signed-off-by: Luke Robison <lrbison@amazon.com>
(cherry picked from commit 71f378d)
jiaxiyan pushed a commit to jiaxiyan/ompi that referenced this issue Mar 1, 2024
This change fixes open-mpi#12270

Testing on c7g instance type (arm64) confirms this change eliminates
hangs and crashes that were previously observed in 1 in 30 runs of
IMB alltoall benchmark.  Tested with over 300 runs and no failures.

The write memory barrier prevents other CPUs from observing the fifo
get updated before they observe the updated contents of the header
itself.  Without the barrier, uninitialized header contents caused
the crashes and invalid data.

Signed-off-by: Luke Robison <lrbison@amazon.com>