
mpi4py: Remaining spawn/accept/connect issues #12307

Open
dalcinl opened this issue Feb 4, 2024 · 25 comments

dalcinl (Contributor) commented Feb 4, 2024

There are remaining issues related to spawn when running the mpi4py testsuite. I'm able to reproduce them locally.

First, you need to switch to the testing/ompi-dpm branch, otherwise some of the reproducers below will be skipped as known failures.

cd mpi4py # git repo clone
git fetch && git checkout testing/ompi-dpm

I'm configuring ompi@main the following way:

options=(
    --prefix=/home/devel/mpi/openmpi/dev
    --without-ofi
    --without-ucx
    --without-psm2
    --without-cuda
    --without-rocm
    --with-pmix=internal
    --with-prrte=internal
    --with-libevent=internal
    --with-hwloc=internal
    --enable-debug
    --enable-mem-debug
    --disable-man-pages
    --disable-sphinx
)
./configure "${options[@]}"

I've enabled oversubscription via both Open MPI and PRTE config files.

$ cat ~/.openmpi/mca-params.conf 
rmaps_default_mapping_policy = :oversubscribe
$ cat ~/.prte/mca-params.conf 
rmaps_default_mapping_policy = :oversubscribe

Afterwards, try the following:

  1. I cannot run in singleton mode:
$ python test/test_spawn.py -v
[kw61149:525865] shmem: mmap: an error occurred while determining whether or not /tmp/ompi.kw61149.1000/jf.0/3608084480/shared_mem_cuda_pool.kw61149 could be created.
[kw61149:525865] create_and_attach: unable to create shared memory BTL coordinating structure :: size 134217728 
[0@kw61149] Python 3.12.1 (/usr/bin/python)
[0@kw61149] numpy 1.26.3 (/home/dalcinl/.local/lib/python3.12/site-packages/numpy)
[0@kw61149] MPI 3.1 (Open MPI 5.1.0)
[0@kw61149] mpi4py 4.0.0.dev0 (/home/dalcinl/Devel/mpi4py/src/mpi4py)
testArgsBad (__main__.TestSpawnMultipleSelf.testArgsBad) ... ok
testArgsOnlyAtRoot (__main__.TestSpawnMultipleSelf.testArgsOnlyAtRoot) ... ok
testCommSpawn (__main__.TestSpawnMultipleSelf.testCommSpawn) ... ok
testCommSpawnDefaults1 (__main__.TestSpawnMultipleSelf.testCommSpawnDefaults1) ... prte: ../../../../../ompi/3rd-party/openpmix/src/class/pmix_list.c:62: pmix_list_item_destruct: Assertion `0 == item->pmix_list_item_refcount' failed.
ERROR
testCommSpawnDefaults2 (__main__.TestSpawnMultipleSelf.testCommSpawnDefaults2) ... ERROR
...
  2. The following test fails when using a large number of MPI processes, say 10 (you may need more):
mpiexec -n 10 python test/test_spawn.py -v

Sometimes I get a segfault, sometimes a deadlock, and occasionally the run completes successfully.

The following narrowed-down test may help pinpoint the problem:

mpiexec -n 10 python test/test_spawn.py -v -k testArgsOnlyAtRoot

It may run OK many times, but eventually I get a failure and the following output:

testArgsOnlyAtRoot (__main__.TestSpawnSingleSelfMany.testArgsOnlyAtRoot) ... [kw61149:00000] *** An error occurred in Socket closed

This other narrowed-down test also has issues, but it does not always fail:

mpiexec -n 10 python test/test_spawn.py -v -k testNoArgs
[kw61149:1826801] *** Process received signal ***
[kw61149:1826801] Signal: Segmentation fault (11)
[kw61149:1826801] Signal code: Address not mapped (1)
[kw61149:1826801] Failing at address: 0x180
[kw61149:1826801] [ 0] /lib64/libc.so.6(+0x3e9a0)[0x7fea10eaa9a0]
[kw61149:1826801] [ 1] /home/devel/mpi/openmpi/dev/lib/libmpi.so.0(+0x386b4a)[0x7fea02786b4a]
[kw61149:1826801] [ 2] /home/devel/mpi/openmpi/dev/lib/libmpi.so.0(mca_pml_ob1_recv_frag_callback_match+0x123)[0x7fea02788d32]
[kw61149:1826801] [ 3] /home/devel/mpi/openmpi/dev/lib/libopen-pal.so.0(+0xc7661)[0x7fea02384661]
[kw61149:1826801] [ 4] /home/devel/mpi/openmpi/dev/lib/libevent_core-2.1.so.7(+0x1c645)[0x7fea02ea6645]
[kw61149:1826801] [ 5] /home/devel/mpi/openmpi/dev/lib/libevent_core-2.1.so.7(event_base_loop+0x47f)[0x7fea02ea6ccf]
[kw61149:1826801] [ 6] /home/devel/mpi/openmpi/dev/lib/libopen-pal.so.0(+0x23ef1)[0x7fea022e0ef1]
[kw61149:1826801] [ 7] /home/devel/mpi/openmpi/dev/lib/libopen-pal.so.0(opal_progress+0xa7)[0x7fea022e0faa]
[kw61149:1826801] [ 8] /home/devel/mpi/openmpi/dev/lib/libopen-pal.so.0(ompi_sync_wait_mt+0x1cd)[0x7fea0232ca1a]
[kw61149:1826801] [ 9] /home/devel/mpi/openmpi/dev/lib/libmpi.so.0(+0x6c4bf)[0x7fea0246c4bf]
[kw61149:1826801] [10] /home/devel/mpi/openmpi/dev/lib/libmpi.so.0(ompi_comm_nextcid+0x7a)[0x7fea0246e0cf]
[kw61149:1826801] [11] /home/devel/mpi/openmpi/dev/lib/libmpi.so.0(ompi_dpm_connect_accept+0x3cd6)[0x7fea0247ffca]
[kw61149:1826801] [12] /home/devel/mpi/openmpi/dev/lib/libmpi.so.0(ompi_dpm_dyn_init+0xc6)[0x7fea02489df8]
[kw61149:1826801] [13] /home/devel/mpi/openmpi/dev/lib/libmpi.so.0(ompi_mpi_init+0x750)[0x7fea024abd47]
[kw61149:1826801] [14] /home/devel/mpi/openmpi/dev/lib/libmpi.so.0(PMPI_Init_thread+0xdc)[0x7fea02513c4a]
  3. The following test deadlocks when running with 4 or more MPI processes:
mpiexec -n 4 python test/test_dynproc.py -v

It may occasionally run to completion, but most of the time it deadlocks.

[kw61149:00000] *** reported by process [3119841281,6]
[kw61149:00000] *** on a NULL communicator
[kw61149:00000] *** Unknown error
[kw61149:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[kw61149:00000] ***    and MPI will try to terminate your MPI job as well)

cc @hppritcha

janjust (Contributor) commented Feb 13, 2024

@dalcinl Are you using Open MPI v5.0.2?

janjust assigned janjust and hppritcha and unassigned janjust Feb 13, 2024
dalcinl (Contributor, Author) commented Feb 13, 2024

@janjust Last time I tried, it was ompi@master. At this point I'm losing track of all the accumulated issues with their respective branches.

hppritcha (Member) commented

@dalcinl thanks for putting this together. Just to check: for these failures, you are not trying to run oversubscribed, correct?

dalcinl (Contributor, Author) commented Feb 13, 2024

Just to check: for these failures, you are not trying to run oversubscribed

Oh, hold on... Yes, I may eventually run oversubscribed if the tests spawn too many processes, but I'm setting the proper MCA parameters to allow for that. Am I missing something? Also, see above: I'm reporting failures even in singleton init mode, and in that case I don't think I'm oversubscribing the machine. Note also that the deadlocks are not always reproducible, so any potential issue with oversubscription does not seem to be perfectly reproducible either.

dalcinl (Contributor, Author) commented Feb 13, 2024

I can repeat my local tests tomorrow with current main and then report the outcome.

dalcinl (Contributor, Author) commented Feb 14, 2024

Folks, I've updated the description. All my local tests are with ompi@main.

@janjust My CI also failed with deadlock using ompi@v5.0.x, see here.

hppritcha (Member) commented

I think I have a fix in prrte for the singleton problem:

python test/test_spawn.py -v

Odd that you don't see the assert that I observed.

I'm having problems reproducing this one:

mpiexec -n 10 python test/test_spawn.py -v -k testArgsOnlyAtRoot

Could you do a run with

mpiexec --display allocation

so I can better try to reproduce?

dalcinl (Contributor, Author) commented Feb 19, 2024

@hppritcha This is what I get from mpiexec --display allocation ...

======================   ALLOCATED NODES   ======================
    kw61149: slots=1 max_slots=0 slots_inuse=0 state=UP
	Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED
	aliases: kw61149
=================================================================

======================   ALLOCATED NODES   ======================
    kw61149: slots=64 max_slots=0 slots_inuse=0 state=UP
	Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
	aliases: kw61149
=================================================================

======================   ALLOCATED NODES   ======================
    kw61149: slots=64 max_slots=0 slots_inuse=0 state=UP
	Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
	aliases: kw61149
=================================================================

rhc54 (Contributor) commented Feb 19, 2024

I could possibly help, but I can do nothing without at least some description of what these tests do. Just naming a test in somebody's test suite and saying "it doesn't work" isn't very helpful for those of us not dedicated to using that test suite.

Ultimately, I don't really care - if I can help, then I will. But if you'd rather not provide the info, then that's okay too - Howard can struggle on his own.

dalcinl (Contributor, Author) commented Feb 19, 2024

@rhc54 I'll submit a trivial reproducer here as soon as I can. The issue is not particular to my test suite; any spawn example in singleton init mode with a relocated ompi install tree should suffice (the issue: setting OPAL_PREFIX is not enough, PATH has to be updated as well for spawn to succeed).

rhc54 (Contributor) commented Feb 19, 2024

I'm not concerned about that one - @hppritcha indicates he is already addressing it. I'm talking about the other ones you cite here.

@hppritcha FWIW: I'm reworking dpm.c to use PMIx_Group instead of the flakier "publish/lookup" handshake. No idea how that will impact these issues - as I have no idea what these issues are 😄

dalcinl (Contributor, Author) commented Feb 19, 2024

Sorry, I mixed up issues; I was talking about #12349. Regarding spawn test suites, what mine does that probably no other one does is issue spawn/spawn_multiple calls in rapid succession from both COMM_SELF and COMM_WORLD, asking for a lot of short-lived child processes, possibly oversubscribing the machine heavily, and testing things like spawn arguments that are only relevant at the root process. The failures smell to me like race conditions. Maybe the key to the issue is the flaky "publish/lookup" handshake you mentioned above. Your update may very well fix things for good.
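
To illustrate the pattern described above, here is a minimal, self-contained sketch (an illustration only, not code taken from the mpi4py test suite; the counts of two commands and one process each are arbitrary) of a spawn_multiple call made collectively over COMM_WORLD, where the command/maxprocs/info arrays are significant only at the root:

#include <mpi.h>

int main(int argc, char *argv[])
{
  MPI_Init(&argc, &argv);

  MPI_Comm parent;
  MPI_Comm_get_parent(&parent);

  if (MPI_COMM_NULL != parent) {     /* spawned child: just disconnect */
    MPI_Comm_disconnect(&parent);
  } else {
    /* Spawn two groups of children by re-launching this executable.
     * The command/maxprocs/info arrays are examined only at the root
     * (rank 0); the other ranks merely participate in the collective. */
    char *cmds[2]     = {argv[0], argv[0]};
    int maxprocs[2]   = {1, 1};
    MPI_Info infos[2] = {MPI_INFO_NULL, MPI_INFO_NULL};

    MPI_Comm inter;
    MPI_Comm_spawn_multiple(2, cmds, MPI_ARGVS_NULL, maxprocs, infos,
                            0, MPI_COMM_WORLD, &inter, MPI_ERRCODES_IGNORE);
    MPI_Comm_disconnect(&inter);
  }

  MPI_Finalize();
  return 0;
}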

dalcinl (Contributor, Author) commented Feb 20, 2024

@rhc54 @hppritcha Here is a C reproducer, about as simple as it can get.

#include <mpi.h>
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
  int maxnp = argc >= 2 ? atoi(argv[1]) : 1;

  int provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  MPI_Comm comm;
  MPI_Comm_get_parent(&comm);

  if (MPI_COMM_NULL == comm) { /* parent: spawn children in a tight loop */

    for (int i=0; i<100; i++) {
      if (0 == rank) printf("%d\n",i);
      MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, maxnp,
                     MPI_INFO_NULL, 0,
                     MPI_COMM_SELF, &comm,
                     MPI_ERRCODES_IGNORE);
      MPI_Barrier(comm);
      MPI_Comm_disconnect(&comm);
    }

  } else { /* spawned child: sync with parent and disconnect */

    MPI_Barrier(comm);
    MPI_Comm_disconnect(&comm);

  }

  MPI_Finalize();
  return 0;
}

Build and run as shown below. Other failing test cases can be generated by changing the -np arg to mpiexec and the command-line arg to the program. You can also change SELF to WORLD in the C code above.

$ mpicc spawn.c

$ mpiexec -n 10 ./a.out 1
0
1
2
...
11
[kw61149:1737636] *** Process received signal ***
[kw61149:1737636] Signal: Segmentation fault (11)
[kw61149:1737636] Signal code: Address not mapped (1)
[kw61149:1737636] Failing at address: 0x180
[kw61149:1737636] [ 0] /lib64/libc.so.6(+0x3e9a0)[0x7f053345c9a0]
[kw61149:1737636] [ 1] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(+0x386b4a)[0x7f0533986b4a]
[kw61149:1737636] [ 2] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(mca_pml_ob1_recv_frag_callback_match+0x123)[0x7f0533988d32]
[kw61149:1737636] [ 3] /home/devel/mpi/openmpi/main/lib/libopen-pal.so.0(+0xc76c5)[0x7f05333a26c5]
[kw61149:1737636] [ 4] /home/devel/mpi/openmpi/main/lib/libevent_core-2.1.so.7(+0x1c645)[0x7f0533c5b645]
[kw61149:1737636] [ 5] /home/devel/mpi/openmpi/main/lib/libevent_core-2.1.so.7(event_base_loop+0x47f)[0x7f0533c5bccf]
[kw61149:1737636] [ 6] /home/devel/mpi/openmpi/main/lib/libopen-pal.so.0(+0x23ef1)[0x7f05332feef1]
[kw61149:1737636] [ 7] /home/devel/mpi/openmpi/main/lib/libopen-pal.so.0(opal_progress+0xa7)[0x7f05332fefaa]
[kw61149:1737636] [ 8] /home/devel/mpi/openmpi/main/lib/libopen-pal.so.0(ompi_sync_wait_mt+0x1cd)[0x7f053334aa1a]
[kw61149:1737636] [ 9] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(+0x6c4bf)[0x7f053366c4bf]
[kw61149:1737636] [10] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(ompi_comm_nextcid+0x7a)[0x7f053366e0cf]
[kw61149:1737636] [11] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(ompi_dpm_connect_accept+0x3cd6)[0x7f053367ffca]
[kw61149:1737636] [12] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(ompi_dpm_dyn_init+0xc6)[0x7f0533689df8]
[kw61149:1737636] [13] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(ompi_mpi_init+0x750)[0x7f05336abd47]
[kw61149:1737636] [14] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(PMPI_Init_thread+0xb6)[0x7f0533713c24]
[kw61149:1737636] [15] ./a.out[0x4011f6]
[kw61149:1737636] [16] /lib64/libc.so.6(+0x2814a)[0x7f053344614a]
[kw61149:1737636] [17] /lib64/libc.so.6(__libc_start_main+0x8b)[0x7f053344620b]
[kw61149:1737636] [18] ./a.out[0x4010e5]
[kw61149:1737636] *** End of error message ***
[kw61149:00000] *** An error occurred in Socket closed
[kw61149:00000] *** reported by process [1000800257,5]
[kw61149:00000] *** on a NULL communicator
[kw61149:00000] *** Unknown error
[kw61149:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[kw61149:00000] ***    and MPI will try to terminate your MPI job as well)
--------------------------------------------------------------------------
    This help section is empty because PRRTE was built without Sphinx.
--------------------------------------------------------------------------
[kw61149:1737659] OPAL ERROR: Server not available in file ../../ompi/ompi/dpm/dpm.c at line 406
--------------------------------------------------------------------------
    This help section is empty because PRRTE was built without Sphinx.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
    This help section is empty because PRRTE was built without Sphinx.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
    This help section is empty because PRRTE was built without Sphinx.
--------------------------------------------------------------------------
LAUNCHER JOB OBJECT NOT FOUND

rhc54 (Contributor) commented Feb 20, 2024

I can somewhat reproduce this - it either hangs at the very end, or it segfaults somewhere before the end. Always happens in that "next_cid" code. Afraid that code is deep voodoo and I have no idea what it is doing, or why.

My rewrite gets rid of all that stuff, but it will take me a while to complete it. I've also been asked to leave the old code in parallel, so I'll add an MCA param to select between the two methods.

dalcinl (Contributor, Author) commented Feb 20, 2024

I'll add an MCA param to select between the two methods.

I hope your new method will become the default... The old code is evidently broken.

dalcinl (Contributor, Author) commented Feb 20, 2024

@rhc54 After your diagnosis, what would you suggest for the mpi4py test suite? Should I just skip all these spawn tests as known failures, at least until we all have your new implementation available?

rhc54 (Contributor) commented Feb 20, 2024

You might as well skip them - they will just keep failing for now, and I doubt anyone will take the time to try and work thru that code in OMPI to figure out the problem.

rhc54 (Contributor) commented Feb 20, 2024

FWIW: in fairness, the existing code seems to work fine when not heavily stressed, as we haven't heard complaints from the field. Not saying there's anything wrong with your test - it is technically correct and we therefore should pass it. Just noting that the test is higher-stress than we see in practice.

hppritcha added a commit to hppritcha/prrte that referenced this issue Feb 22, 2024
Some recent changes broke singleton support - twice in a month.

First, remove a problematic PMIX_RELEASE of jdata when it's not ready to be removed.

For some reason this showed up in singleton mode with debug enabled.
Various asserts would fail when this PMIX_RELEASE was invoked.
This was due to the fact that the jdata had been put on a list of jdatas,
so the opal_list destructor was having a fit trying to release a jdata
which was still in a list.

It turns out this jdata is being released in the code starting at
line 95 of prte_finalize.c. I assume that with debug not enabled the
jdata is released twice, rather than failing the assert in prted_comm.c.

Some work to add session IDs for tracking allocations also broke
singleton support.

This patch restores the singleton functionality.

Related to issue open-mpi/ompi#12307

Signed-off-by: Howard Pritchard <howardp@lanl.gov>

hppritcha added a commit to hppritcha/prrte that referenced this issue Feb 22, 2024 (same commit message as above)

rhc54 pushed a commit to openpmix/prrte that referenced this issue Feb 23, 2024 (same commit message as above)

bosilca (Member) commented Feb 23, 2024

We skipped all the connect/accept/spawn tests for years; that's why we are in this mess.

rhc54 (Contributor) commented Feb 23, 2024

Let's be fair here - we actually do run connect/accept/spawn tests, but they are low-stress versions. The only reason this one fails is that it is a high-stress test with a tight loop over comm-spawn. The current code works just fine for people who actually use it, which is why we don't see complaints (well, plus of course the fact that hardly anyone uses comm-spawn).

Modifying it to support high-stress operations isn't trivial, but probably doable. I concur with other comments, though, that this isn't a high priority.

dalcinl (Contributor, Author) commented Feb 23, 2024

which is why we don't see complaints (well, plus of course the fact that hardly anyone uses comm-spawn).

I have had quite a few emails over the years from Python folks using mpi4py asking about spawn-related issues. That's the reason I stress-test implementations in my test suite.

Modifying it to support high-stress operations isn't trivial,

I'm not even sure how "high-stress" is precisely defined. How could I modify my tests to be lower-stress? I've already bracketed my spawn calls with barriers, but that's clearly not enough. Should I use sleep() or something like that? Should I serialize all spawn calls from COMM_SELF? I'm really afraid that if I stop testing spawn functionality and don't keep an eye on it, at some point it will become simply unusable.
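
As an illustration of the "serialize all spawn calls from COMM_SELF" idea, here is a minimal sketch (one possible approach assumed for illustration, not code from the mpi4py test suite; whether it actually avoids the failures is unknown): each world rank takes its turn spawning from COMM_SELF, gated by barriers on COMM_WORLD.

#include <mpi.h>

int main(int argc, char *argv[])
{
  MPI_Init(&argc, &argv);

  MPI_Comm parent;
  MPI_Comm_get_parent(&parent);

  if (MPI_COMM_NULL != parent) {        /* spawned child: just disconnect */
    MPI_Comm_disconnect(&parent);
  } else {                              /* parent: one rank spawns at a time */
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    for (int turn = 0; turn < size; turn++) {
      if (turn == rank) {
        MPI_Comm child;
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);
        MPI_Comm_disconnect(&child);
      }
      MPI_Barrier(MPI_COMM_WORLD);      /* wait for this turn to complete */
    }
  }

  MPI_Finalize();
  return 0;
}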

bosilca (Member) commented Feb 23, 2024

This storyline (nobody uses this feature) is getting old. I have had contacts similar to @dalcinl's: people tried to use it, but it was broken, so they found another way.

I have no idea what low-stress and high-stress testing could be. It works or it doesn't.

rhc54 (Contributor) commented Feb 23, 2024

Sigh - seems a rather pointless debate, doesn't it? Fact is, nobody in the OMPI community has historically been inclined to spend time worrying about it, so the debate is rather moot.

Kudos to @hppritcha for trying to take it on, or at least portions of it (e.g., the singleton comm-spawn case).

bosilca (Member) commented Feb 23, 2024

Looking at the last 2 years of updates in the DPM-related code, many of us (you/ICM/LANL/Amazon/UTK) tried to do so. Smaller steps, but it got to a point where it kind of works. The only thing left is to have a solution that makes it work everywhere, because this is a critical feature outside the HPC market.

rhc54 (Contributor) commented Feb 23, 2024

Agreed - my "low stress" was simply a single invocation of "comm_spawn" by a process in a job. "High-stress" is when a process in a job sits there and calls "comm_spawn" in a loop. Fills the system with lots of coupled jobs, requires that the system (both MPI and RTE) be able to fully cleanup/recover between jobs, etc.

We have historically been satisfied with making the "low stress" operation work. Occasionally, I'd take a crack at the "loop spawn" test and at least keep it from hanging, but it was always a struggle and didn't last for very long. And my "loop spawn" test was very simple, just looping over spawn and disconnecting. Never actually had the conjoined jobs do anything.

I agree that it is an important feature outside HPC, and it obviously should work. Perhaps my new approach will succeed where we previously failed - have to wait and see.
