
mpi4py: Remaining spawn/accept/connect issues #12307

Open
dalcinl opened this issue Feb 4, 2024 · 25 comments

dalcinl (Contributor) commented Feb 4, 2024

There are remaining issues related to spawn when running the mpi4py testsuite. I'm able to reproduce them locally.

First, you need to switch to the testing/ompi-dpm branch, otherwise some of the reproducers below will be skipped as known failures.

cd mpi4py # git repo clone
git fetch && git checkout testing/ompi-dpm

I'm configuring ompi@main the following way:

options=(
    --prefix=/home/devel/mpi/openmpi/dev
    --without-ofi
    --without-ucx
    --without-psm2
    --without-cuda
    --without-rocm
    --with-pmix=internal
    --with-prrte=internal
    --with-libevent=internal
    --with-hwloc=internal
    --enable-debug
    --enable-mem-debug
    --disable-man-pages
    --disable-sphinx
)
./configure "${options[@]}"

I've enabled oversubscription via both Open MPI and PRTE config files.

$ cat ~/.openmpi/mca-params.conf 
rmaps_default_mapping_policy = :oversubscribe
$ cat ~/.prte/mca-params.conf 
rmaps_default_mapping_policy = :oversubscribe

Afterwards, try the following:

  1. I cannot run in singleton mode:
$ python test/test_spawn.py -v
[kw61149:525865] shmem: mmap: an error occurred while determining whether or not /tmp/ompi.kw61149.1000/jf.0/3608084480/shared_mem_cuda_pool.kw61149 could be created.
[kw61149:525865] create_and_attach: unable to create shared memory BTL coordinating structure :: size 134217728 
[0@kw61149] Python 3.12.1 (/usr/bin/python)
[0@kw61149] numpy 1.26.3 (/home/dalcinl/.local/lib/python3.12/site-packages/numpy)
[0@kw61149] MPI 3.1 (Open MPI 5.1.0)
[0@kw61149] mpi4py 4.0.0.dev0 (/home/dalcinl/Devel/mpi4py/src/mpi4py)
testArgsBad (__main__.TestSpawnMultipleSelf.testArgsBad) ... ok
testArgsOnlyAtRoot (__main__.TestSpawnMultipleSelf.testArgsOnlyAtRoot) ... ok
testCommSpawn (__main__.TestSpawnMultipleSelf.testCommSpawn) ... ok
testCommSpawnDefaults1 (__main__.TestSpawnMultipleSelf.testCommSpawnDefaults1) ... prte: ../../../../../ompi/3rd-party/openpmix/src/class/pmix_list.c:62: pmix_list_item_destruct: Assertion `0 == item->pmix_list_item_refcount' failed.
ERROR
testCommSpawnDefaults2 (__main__.TestSpawnMultipleSelf.testCommSpawnDefaults2) ... ERROR
...
  2. The following test fails when using a large number of MPI processes, say 10 (you may need more):
mpiexec -n 10 python test/test_spawn.py -v

Sometimes I get a segfault, sometimes a deadlock, and occasionally the run completes successfully.

The following narrowed-down test may help pinpoint the problem:

mpiexec -n 10 python test/test_spawn.py -v -k testArgsOnlyAtRoot

It may run OK many times, but eventually I get a failure and the following output:

testArgsOnlyAtRoot (__main__.TestSpawnSingleSelfMany.testArgsOnlyAtRoot) ... [kw61149:00000] *** An error occurred in Socket closed

This other narrowed-down test also has issues, but it does not always fail:

mpiexec -n 10 python test/test_spawn.py -v -k testNoArgs
[kw61149:1826801] *** Process received signal ***
[kw61149:1826801] Signal: Segmentation fault (11)
[kw61149:1826801] Signal code: Address not mapped (1)
[kw61149:1826801] Failing at address: 0x180
[kw61149:1826801] [ 0] /lib64/libc.so.6(+0x3e9a0)[0x7fea10eaa9a0]
[kw61149:1826801] [ 1] /home/devel/mpi/openmpi/dev/lib/libmpi.so.0(+0x386b4a)[0x7fea02786b4a]
[kw61149:1826801] [ 2] /home/devel/mpi/openmpi/dev/lib/libmpi.so.0(mca_pml_ob1_recv_frag_callback_match+0x123)[0x7fea02788d32]
[kw61149:1826801] [ 3] /home/devel/mpi/openmpi/dev/lib/libopen-pal.so.0(+0xc7661)[0x7fea02384661]
[kw61149:1826801] [ 4] /home/devel/mpi/openmpi/dev/lib/libevent_core-2.1.so.7(+0x1c645)[0x7fea02ea6645]
[kw61149:1826801] [ 5] /home/devel/mpi/openmpi/dev/lib/libevent_core-2.1.so.7(event_base_loop+0x47f)[0x7fea02ea6ccf]
[kw61149:1826801] [ 6] /home/devel/mpi/openmpi/dev/lib/libopen-pal.so.0(+0x23ef1)[0x7fea022e0ef1]
[kw61149:1826801] [ 7] /home/devel/mpi/openmpi/dev/lib/libopen-pal.so.0(opal_progress+0xa7)[0x7fea022e0faa]
[kw61149:1826801] [ 8] /home/devel/mpi/openmpi/dev/lib/libopen-pal.so.0(ompi_sync_wait_mt+0x1cd)[0x7fea0232ca1a]
[kw61149:1826801] [ 9] /home/devel/mpi/openmpi/dev/lib/libmpi.so.0(+0x6c4bf)[0x7fea0246c4bf]
[kw61149:1826801] [10] /home/devel/mpi/openmpi/dev/lib/libmpi.so.0(ompi_comm_nextcid+0x7a)[0x7fea0246e0cf]
[kw61149:1826801] [11] /home/devel/mpi/openmpi/dev/lib/libmpi.so.0(ompi_dpm_connect_accept+0x3cd6)[0x7fea0247ffca]
[kw61149:1826801] [12] /home/devel/mpi/openmpi/dev/lib/libmpi.so.0(ompi_dpm_dyn_init+0xc6)[0x7fea02489df8]
[kw61149:1826801] [13] /home/devel/mpi/openmpi/dev/lib/libmpi.so.0(ompi_mpi_init+0x750)[0x7fea024abd47]
[kw61149:1826801] [14] /home/devel/mpi/openmpi/dev/lib/libmpi.so.0(PMPI_Init_thread+0xdc)[0x7fea02513c4a]
  3. The following test deadlocks when running with 4 or more MPI processes:
mpiexec -n 4 python test/test_dynproc.py -v

It may occasionally run to completion, but most of the time it deadlocks.

[kw61149:00000] *** reported by process [3119841281,6]
[kw61149:00000] *** on a NULL communicator
[kw61149:00000] *** Unknown error
[kw61149:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[kw61149:00000] ***    and MPI will try to terminate your MPI job as well)

cc @hppritcha

janjust (Contributor) commented Feb 13, 2024

@dalcinl Are you using Open MPI v5.0.2?

janjust assigned janjust and hppritcha and unassigned janjust Feb 13, 2024
dalcinl (Contributor, Author) commented Feb 13, 2024

@janjust Last time I tried, it was ompi@master. At this point I'm losing track of all the accumulated issues with their respective branches.

hppritcha (Member) commented

@dalcinl thanks for putting this together. Just to check: for these failures, you are not trying to run oversubscribed, correct?

dalcinl (Contributor, Author) commented Feb 13, 2024

Just to check: for these failures, you are not trying to run oversubscribed

Oh, hold on... Yes, I may eventually run oversubscribed if the tests spawn too many processes, but I'm setting the proper MCA parameters to allow for that. Am I missing something? Also, see above: I'm reporting failures even in singleton init mode, and in that case I don't think I'm oversubscribing the machine. Note also that the deadlocks are not always reproducible, so any potential issue with oversubscription does not seem to be perfectly reproducible either.

dalcinl (Contributor, Author) commented Feb 13, 2024

I can repeat my local tests tomorrow with current main and then report the outcome.

dalcinl (Contributor, Author) commented Feb 14, 2024

Folks, I've updated the description. All my local tests are with ompi@main.

@janjust My CI also failed with deadlock using ompi@v5.0.x, see here.

hppritcha (Member) commented

I think I have a fix in prrte for the singleton problem:

python test/test_spawn.py -v

Odd that you don't see the assert that I observed.

I'm having problems reproducing this one:

mpiexec -n 10 python test/test_spawn.py -v -k testArgsOnlyAtRoot

Could you do a run with

mpiexec --display allocation

so I can better try to reproduce?

dalcinl (Contributor, Author) commented Feb 19, 2024

@hppritcha This is what I get from mpiexec --display allocation ...

======================   ALLOCATED NODES   ======================
    kw61149: slots=1 max_slots=0 slots_inuse=0 state=UP
	Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED
	aliases: kw61149
=================================================================

======================   ALLOCATED NODES   ======================
    kw61149: slots=64 max_slots=0 slots_inuse=0 state=UP
	Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
	aliases: kw61149
=================================================================

======================   ALLOCATED NODES   ======================
    kw61149: slots=64 max_slots=0 slots_inuse=0 state=UP
	Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
	aliases: kw61149
=================================================================

rhc54 (Contributor) commented Feb 19, 2024

I could possibly help, but I can do nothing without at least some description of what these tests do. Just naming a test in somebody's test suite and saying "it doesn't work" isn't very helpful for those of us not dedicated to using that test suite.

Ultimately, I don't really care - if I can help, then I will. But if you'd rather not provide the info, then that's okay too - Howard can struggle on his own.

dalcinl (Contributor, Author) commented Feb 19, 2024

@rhc54 I'll submit a trivial reproducer here as soon as I can. The issue is not particular to my test suite; any spawn example in singleton init mode with a relocated ompi install tree should suffice (the issue: setting OPAL_PREFIX is not enough, PATH has to be updated as well for spawn to succeed).

rhc54 (Contributor) commented Feb 19, 2024

I'm not concerned about that one - @hppritcha indicates he is already addressing it. I'm talking about the other ones you cite here.

@hppritcha FWIW: I'm reworking dpm.c to use PMIx_Group instead of the flakier "publish/lookup" handshake. No idea how that will impact these issues - as I have no idea what these issues are 😄

dalcinl (Contributor, Author) commented Feb 19, 2024

Sorry, I mixed up issues; I was talking about #12349. Regarding spawn test suites, what mine does that probably no other one does is issue spawn/spawn_multiple calls in rapid succession from both COMM_SELF and COMM_WORLD, asking for a lot of short-lived child processes, possibly oversubscribing the machine heavily, and testing things like spawn arguments that are only relevant at the root process. The failures smell to me like race conditions. Maybe the key to the issue is the flaky "publish/lookup" handshake you mentioned above. Your update may very well fix things for good.
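
To illustrate the pattern described above, here is a minimal, self-contained sketch (an illustration only, not code taken from the mpi4py test suite; the counts of two commands and one process each are arbitrary) of a spawn_multiple call made collectively over COMM_WORLD, where the command/maxprocs/info arrays are significant only at the root:

#include <mpi.h>

int main(int argc, char *argv[])
{
  MPI_Init(&argc, &argv);

  MPI_Comm parent;
  MPI_Comm_get_parent(&parent);

  if (MPI_COMM_NULL != parent) {     /* spawned child: just disconnect */
    MPI_Comm_disconnect(&parent);
  } else {
    /* Spawn two groups of children by re-launching this executable.
     * The command/maxprocs/info arrays are examined only at the root
     * (rank 0); the other ranks merely participate in the collective. */
    char *cmds[2]     = {argv[0], argv[0]};
    int maxprocs[2]   = {1, 1};
    MPI_Info infos[2] = {MPI_INFO_NULL, MPI_INFO_NULL};

    MPI_Comm inter;
    MPI_Comm_spawn_multiple(2, cmds, MPI_ARGVS_NULL, maxprocs, infos,
                            0, MPI_COMM_WORLD, &inter, MPI_ERRCODES_IGNORE);
    MPI_Comm_disconnect(&inter);
  }

  MPI_Finalize();
  return 0;
}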

dalcinl (Contributor, Author) commented Feb 20, 2024

@rhc54 @hppritcha Here is a C reproducer, about as simple as it can get.

#include <mpi.h>
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
  int maxnp = argc >= 2 ? atoi(argv[1]) : 1;

  int provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  MPI_Comm comm;
  MPI_Comm_get_parent(&comm);

  if (MPI_COMM_NULL == comm) { /* parent: spawn children in a tight loop */

    for (int i=0; i<100; i++) {
      if (0 == rank) printf("%d\n",i);
      MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, maxnp,
                     MPI_INFO_NULL, 0,
                     MPI_COMM_SELF, &comm,
                     MPI_ERRCODES_IGNORE);
      MPI_Barrier(comm);
      MPI_Comm_disconnect(&comm);
    }

  } else { /* spawned child: sync with parent and disconnect */

    MPI_Barrier(comm);
    MPI_Comm_disconnect(&comm);

  }

  MPI_Finalize();
  return 0;
}

Build and run as shown below. Other failing test cases can be generated by changing the -np arg to mpiexec and the command-line arg to the program. You can also change SELF to WORLD in the C code above.

$ mpicc spawn.c

$ mpiexec -n 10 ./a.out 1
0
1
2
...
11
[kw61149:1737636] *** Process received signal ***
[kw61149:1737636] Signal: Segmentation fault (11)
[kw61149:1737636] Signal code: Address not mapped (1)
[kw61149:1737636] Failing at address: 0x180
[kw61149:1737636] [ 0] /lib64/libc.so.6(+0x3e9a0)[0x7f053345c9a0]
[kw61149:1737636] [ 1] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(+0x386b4a)[0x7f0533986b4a]
[kw61149:1737636] [ 2] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(mca_pml_ob1_recv_frag_callback_match+0x123)[0x7f0533988d32]
[kw61149:1737636] [ 3] /home/devel/mpi/openmpi/main/lib/libopen-pal.so.0(+0xc76c5)[0x7f05333a26c5]
[kw61149:1737636] [ 4] /home/devel/mpi/openmpi/main/lib/libevent_core-2.1.so.7(+0x1c645)[0x7f0533c5b645]
[kw61149:1737636] [ 5] /home/devel/mpi/openmpi/main/lib/libevent_core-2.1.so.7(event_base_loop+0x47f)[0x7f0533c5bccf]
[kw61149:1737636] [ 6] /home/devel/mpi/openmpi/main/lib/libopen-pal.so.0(+0x23ef1)[0x7f05332feef1]
[kw61149:1737636] [ 7] /home/devel/mpi/openmpi/main/lib/libopen-pal.so.0(opal_progress+0xa7)[0x7f05332fefaa]
[kw61149:1737636] [ 8] /home/devel/mpi/openmpi/main/lib/libopen-pal.so.0(ompi_sync_wait_mt+0x1cd)[0x7f053334aa1a]
[kw61149:1737636] [ 9] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(+0x6c4bf)[0x7f053366c4bf]
[kw61149:1737636] [10] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(ompi_comm_nextcid+0x7a)[0x7f053366e0cf]
[kw61149:1737636] [11] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(ompi_dpm_connect_accept+0x3cd6)[0x7f053367ffca]
[kw61149:1737636] [12] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(ompi_dpm_dyn_init+0xc6)[0x7f0533689df8]
[kw61149:1737636] [13] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(ompi_mpi_init+0x750)[0x7f05336abd47]
[kw61149:1737636] [14] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(PMPI_Init_thread+0xb6)[0x7f0533713c24]
[kw61149:1737636] [15] ./a.out[0x4011f6]
[kw61149:1737636] [16] /lib64/libc.so.6(+0x2814a)[0x7f053344614a]
[kw61149:1737636] [17] /lib64/libc.so.6(__libc_start_main+0x8b)[0x7f053344620b]
[kw61149:1737636] [18] ./a.out[0x4010e5]
[kw61149:1737636] *** End of error message ***
[kw61149:00000] *** An error occurred in Socket closed
[kw61149:00000] *** reported by process [1000800257,5]
[kw61149:00000] *** on a NULL communicator
[kw61149:00000] *** Unknown error
[kw61149:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[kw61149:00000] ***    and MPI will try to terminate your MPI job as well)
--------------------------------------------------------------------------
    This help section is empty because PRRTE was built without Sphinx.
--------------------------------------------------------------------------
[kw61149:1737659] OPAL ERROR: Server not available in file ../../ompi/ompi/dpm/dpm.c at line 406
--------------------------------------------------------------------------
    This help section is empty because PRRTE was built without Sphinx.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
    This help section is empty because PRRTE was built without Sphinx.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
    This help section is empty because PRRTE was built without Sphinx.
--------------------------------------------------------------------------
LAUNCHER JOB OBJECT NOT FOUND

rhc54 (Contributor) commented Feb 20, 2024

I can somewhat reproduce this - it either hangs at the very end, or it segfaults somewhere before the end. Always happens in that "next_cid" code. Afraid that code is deep voodoo and I have no idea what it is doing, or why.

My rewrite gets rid of all that stuff, but it will take me a while to complete it. I've also been asked to leave the old code in parallel, so I'll add an MCA param to select between the two methods.

dalcinl (Contributor, Author) commented Feb 20, 2024

I'll add an MCA param to select between the two methods.

I hope your new method will become the default... The old code is evidently broken.

dalcinl (Contributor, Author) commented Feb 20, 2024

@rhc54 After your diagnosis, what would you suggest for the mpi4py test suite? Should I just skip all these spawn tests as known failures, at least until we all have your new implementation available?

rhc54 (Contributor) commented Feb 20, 2024

You might as well skip them - they will just keep failing for now, and I doubt anyone will take the time to try and work thru that code in OMPI to figure out the problem.

rhc54 (Contributor) commented Feb 20, 2024

FWIW: in fairness, the existing code seems to work fine when not heavily stressed, as we haven't heard complaints from the field. Not saying there's anything wrong with your test - it is technically correct and we therefore should pass it. Just noting that the test is higher-stress than we see in practice.

hppritcha added a commit to hppritcha/prrte that referenced this issue Feb 22, 2024
Some recent changes broke singleton support - twice in a month.

First, remove a problematic PMIX_RELEASE of jdata when it's not ready to be removed.

For some reason this showed up in singleton mode with debug enabled.
Various asserts would fail when this PMIX_RELEASE was invoked.
This was due to the fact that the jdata had been put on a list of jdatas,
so the opal_list destructor was having a fit trying to release a jdata
which was still in a list.

It turns out this jdata is being released in the code starting at
line 95 of prte_finalize.c. I assume that with debug not enabled the
jdata is released twice, rather than failing the assert in prted_comm.c.

Some work to add session IDs for tracking allocations also broke
singleton support.

This patch restores the singleton functionality.

Related to issue open-mpi/ompi#12307

Signed-off-by: Howard Pritchard <howardp@lanl.gov>

hppritcha added a commit to hppritcha/prrte that referenced this issue Feb 22, 2024 (same commit message as above)

rhc54 pushed a commit to openpmix/prrte that referenced this issue Feb 23, 2024 (same commit message as above)

bosilca (Member) commented Feb 23, 2024

We skipped all the connect/accept/spawn tests for years; that's why we are in this mess.

rhc54 (Contributor) commented Feb 23, 2024

Let's be fair here - we actually do run connect/accept/spawn tests, but they are low-stress versions. The only reason this one fails is that it is a high-stress test with a tight loop over comm-spawn. The current code works just fine for people who actually use it, which is why we don't see complaints (well, plus of course the fact that hardly anyone uses comm-spawn).

Modifying it to support high-stress operations isn't trivial, but probably doable. I concur with other comments, though, that this isn't a high priority.

dalcinl (Contributor, Author) commented Feb 23, 2024

which is why we don't see complaints (well, plus of course the fact that hardly anyone uses comm-spawn).

I have had quite a few emails over the years from Python folks using mpi4py asking about spawn-related issues. That's the reason I stress-test implementations in my test suite.

Modifying it to support high-stress operations isn't trivial,

I'm not even sure how "high-stress" is precisely defined. How could I modify my tests to be lower-stress? I've already bracketed my spawn calls with barriers, but that's clearly not enough. Should I use sleep() or something like that? Should I serialize all spawn calls from COMM_SELF? I'm really afraid that if I stop testing spawn functionality and don't keep an eye on it, at some point it will become simply unusable.
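
As an illustration of the "serialize all spawn calls from COMM_SELF" idea, here is a minimal sketch (one possible approach assumed for illustration, not code from the mpi4py test suite; whether it actually avoids the failures is unknown): each world rank takes its turn spawning from COMM_SELF, gated by barriers on COMM_WORLD.

#include <mpi.h>

int main(int argc, char *argv[])
{
  MPI_Init(&argc, &argv);

  MPI_Comm parent;
  MPI_Comm_get_parent(&parent);

  if (MPI_COMM_NULL != parent) {        /* spawned child: just disconnect */
    MPI_Comm_disconnect(&parent);
  } else {                              /* parent: one rank spawns at a time */
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    for (int turn = 0; turn < size; turn++) {
      if (turn == rank) {
        MPI_Comm child;
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);
        MPI_Comm_disconnect(&child);
      }
      MPI_Barrier(MPI_COMM_WORLD);      /* wait for this turn to complete */
    }
  }

  MPI_Finalize();
  return 0;
}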

bosilca (Member) commented Feb 23, 2024

This storyline (nobody uses this feature) is getting old. I have had contacts similar to @dalcinl's: people tried to use it, but it was broken, so they found another way.

I have no idea what low-stress and high-stress testing could be. It works or it doesn't.

rhc54 (Contributor) commented Feb 23, 2024

Sigh - seems a rather pointless debate, doesn't it? Fact is, nobody in the OMPI community has historically been inclined to spend time worrying about it, so the debate is rather moot.

Kudos to @hppritcha for trying to take it on, or at least portions of it (e.g., the singleton comm-spawn case).

bosilca (Member) commented Feb 23, 2024

Looking at the last 2 years of updates in the DPM-related code, many of us (you/ICM/LANL/Amazon/UTK) tried to do so. Smaller steps, but it got to a point where it kind of works. The only thing left is to have a solution that makes it work everywhere, because this is a critical feature outside the HPC market.

rhc54 (Contributor) commented Feb 23, 2024

Agreed - my "low stress" was simply a single invocation of "comm_spawn" by a process in a job. "High-stress" is when a process in a job sits there and calls "comm_spawn" in a loop. Fills the system with lots of coupled jobs, requires that the system (both MPI and RTE) be able to fully cleanup/recover between jobs, etc.

We have historically been satisfied with making the "low stress" operation work. Occasionally, I'd take a crack at the "loop spawn" test and at least keep it from hanging, but it was always a struggle and didn't last for very long. And my "loop spawn" test was very simple, just looping over spawn and disconnecting. Never actually had the conjoined jobs do anything.

I agree that it is an important feature outside HPC, and it obviously should work. Perhaps my new approach will succeed where we previously failed - have to wait and see.
