UCX: Hang combining exclusive/shared window lock #6549

Closed

devreal opened this issue Mar 30, 2019 · 8 comments
devreal (Contributor) commented Mar 30, 2019

Running Open MPI 4.0.1 in combination with UCX 1.5, I am seeing my application hang when one process attempts to release an exclusive lock while the target attempts to acquire a shared lock. The code below can be used to reproduce the issue (tested on our InfiniBand cluster):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
  MPI_Win win;
  int elem_per_unit = 1;
  int *baseptr;
  int rank, size;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  MPI_Win_allocate(
    elem_per_unit*sizeof(int), 1, MPI_INFO_NULL,
    MPI_COMM_WORLD, &baseptr, &win);

  if (size == 2) {
    // get exclusive lock
    if (rank != 0) {
      int val = 0;
      printf("[%d] Acquiring exclusive lock\n", rank);
      MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
      MPI_Put(&val, 1, MPI_INT, 0, 0, 1, MPI_INT, win);
      MPI_Win_flush(0, win);
    }

    MPI_Barrier(MPI_COMM_WORLD);

    // release exclusive lock
    if (rank != 0) {
      printf("[%d] Releasing exclusive lock\n", rank);
      // Rank 1 hangs here
      MPI_Win_unlock(0, win);
    }
  }

  // Rank 0 hangs here
  printf("[%d] Acquiring shared lock\n", rank);
  MPI_Win_lock_all(0, win);

  MPI_Win_unlock_all(win);
  MPI_Win_free(&win);
  MPI_Finalize();

  return 0;
}

Build with:

$ mpicc mpi_shared_excl_lock.c -o mpi_shared_excl_lock

Run with:

$ mpirun -n 2 -N 1 ./mpi_shared_excl_lock
[1] Acquiring exclusive lock
[1] Releasing exclusive lock
[0] Acquiring shared lock

Interestingly, leaving out the barrier between acquiring and releasing the lock lets the example run successfully. Things also run fine when using the openib BTL instead of UCX.
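For anyone trying to reproduce the comparison, a minimal sketch of how the transport selection could be pinned explicitly; the MCA component names (pml/ucx, osc/ucx, pml/ob1) are assumptions based on typical Open MPI 4.0.x builds, not part of the original report.

Force the UCX components (the configuration that hangs here):

$ mpirun -n 2 -N 1 --mca pml ucx --mca osc ucx ./mpi_shared_excl_lock

Exclude the UCX components so the ob1/openib path is used instead:

$ mpirun -n 2 -N 1 --mca pml ob1 --mca osc ^ucx ./mpi_shared_excl_lock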

yosefe (Contributor) commented Mar 31, 2019

@devreal what is the Open MPI configure command and git revision?

devreal (Author) commented Apr 1, 2019

I used Open MPI version 4.0.1 downloaded from the website and configured using:

./configure CC=icc CXX=icpc FTN=ifort --with-ucx=/path/to/ucx-1.5.0/ --without-verbs --enable-debug

That is the Intel compiler, version 19.0.1. I could try switching to the GNU compiler, but I'm not sure that makes a difference; let me know if I should give it a shot.

gpaulsen (Member) commented:
@jladd-mlnx Any progress?

jladd-mlnx assigned janjust and jladd-mlnx and unassigned yosefe on Jul 9, 2019
jladd-mlnx (Member) commented:
@janjust please take it.

janjust (Contributor) commented Jul 19, 2019

@devreal I have a hard time reproducing this issue; which UCX and Open MPI commits are you on?

devreal (Author) commented Jul 20, 2019

@janjust That was done using UCX 1.5.0 and Open MPI 4.0.1, both built from release branches. I will give the 4.0.x branch a try with the latest UCX release and report back.

devreal (Author) commented Jul 22, 2019

@janjust I cannot reproduce this on latest 4.0.x (git 368da00) with UCX 1.6.x (git 0309365) so I'm closing this issue. Thanks for checking though :)

devreal closed this as completed on Jul 22, 2019
janjust (Contributor) commented Jul 22, 2019

@devreal, thanks. I believe a lot of your issues are related to software atomics.
