A hierarchical, architecture-aware collective communication module #7735
Conversation
Can one of the admins verify this patch? |
@bosilca Just curious - when you say "architecture-aware", are you speaking of knowing (for example) which procs are on nodes sharing a common switch, which procs are lowest-ranked on a given switch, etc? In other words, are you following the hardware wiring topology? If so, I'm curious as to where you get that topology info. I'm working now on a PMIx plugin to provide it for certain types of fabrics, if that would help here. |
@rhc54 Right now we are working on a 2-level hierarchy, with the information extracted from OMPI via the |
@jsquyres did we expect the ompiteam-bot to ask for "an admin to verify this patch"? I mean that even if George did not author the commits, he pushed them, so they could/should not have been flagged as untrusted. |
@bosilca Kewl - I'll coordinate with you and your team as I move forward. The info I'll be providing is basically a "wiring diagram". It will consist of the following:
etc. Basically, following the wires to minimize any hops. We are working on methods for handling multi-NIC systems; a little more complicated 😄 |
@bwbarrett can you please check whether it should be added, or whether this platform/compiler should be removed from the "matrix"? |
Wasn't this feature supposed to be tagged for the v5.0.0 milestone? |
Thanks, added it for tracking. |
Force-pushed from 0bcf2e5 to 1290951.
bot:aws:retest |
bot:aws:retest Same error on a different platform (see #7847) |
I looked through the code and found a few things regarding memory management; maybe I didn't fully understand the control flow, though. It also seems like there may be an integer overflow in the selection logic (see my comments about the msg_size variables and members). It seems that some of the functions can be made static and removed from the header files, while some can probably be made static inline in the headers. Plus some typos in comments I found.
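For illustration only (not code from this PR; all names are made up): the kind of change being suggested here, keeping a tiny helper as static inline in a header and doing the message-size arithmetic in size_t so it cannot overflow an int.

```c
/* Illustration only; names are hypothetical, not taken from the PR. */
#include <stddef.h>

/* In the header: small helper callers may inline, no external symbol needed. */
static inline size_t coll_msg_size(size_t type_size, size_t count)
{
    /* size_t arithmetic avoids the int overflow mentioned above */
    return type_size * count;
}

/* In the .c file: helper used only in this translation unit. */
static int select_algorithm(size_t msg_size)
{
    return (msg_size < 8192) ? 0 : 1;   /* small- vs. large-message path */
}
```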
All comments have been addressed. |
I'm sorry, I've gotten distracted by internal Cisco shiny objects, and I haven't finished my review yet. But I at least wanted to submit what I have done so far...
Force-pushed from 70471ae to 23e98bf.
I think all but one of my points have been addressed (https://github.com/open-mpi/ompi/pull/7735/files#diff-a4e1605aa2038222adeba1c42f218130R231) but that one is minor. LGTM, thanks @bosilca!
There are no changes; I simply cleaned the history in preparation for the merge. Regarding your report, I tried all possible process placements, with and without oversubscription (as I don't have 36 nodes available), but as soon as I curate my list of allowed BTLs, I can't replicate it. As you suggested, the error seems to indicate that a communicator with 2 processes failed to be created, but just from the stack I can't figure out what's going on. |
I hit the same error with IMB-MPI1 on your latest force-pushed PR; it's not a segfault though.
With a7cc337, I see a small number of regressions running the IBM test suite (collective subdir) with HAN vs. without HAN:
intercomm/allreduce_nocommute_gap_inter: MPI_Abort with error code 100
intercomm/reduce_nocommute_gap_inter: MPI_Abort with error code 100
bcast_struct: segv
gather_in_place: wrong answer
gather_in_place2: segv
scatter_in_place: wrong answer
@bosilca I can hit similar errors when running
The errors reported by @jsquyres show up in my IBM test suite as well. |
@bosilca @jsquyres I believe there is a bug here. I have a hunch at what causes the gather tests to fail; working on a fix. |
@bosilca the latest commits fix the gather_in_place failure reported by @jsquyres. I am still getting errors in ompi-tests/ibm/collective/scatter_in_place and ompi-tests/ibm/collective/int_overflow:
With 1634ba0, I get the same results on master with and without han:
Yes, I hit those failures on master with and without han as well, so I didn't include them in the errors I reported earlier. |
With 1634ba0,
Running on 2 nodes (72 procs) does not show this issue.
@wckzhang Does the same issue you fixed in https://github.com/open-mpi/ompi-tests/pull/137 also apply to these two test failures, perchance? |
I highly doubt it; I checked those two tests and they both use malloc, and have done so for many years. |
We found out what the problem was: the final reshuffle of data for the gather/scatter operation is incorrect because the translation between the different hierarchies was computed incorrectly. Fix to come tomorrow. |
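For readers following along, a sketch of my own (not the PR's code) of the kind of global-rank / (node, local-rank) translation the reshuffle depends on, assuming a balanced by-node layout with ppn ranks per node; if this mapping is wrong, the gathered/scattered data comes out reordered.

```c
/* Hypothetical example: round-trip between the global rank and the
 * (node, local-rank) coordinates used by the two hierarchy levels.
 * Assumes a balanced by-node layout with ppn ranks per node. */
#include <assert.h>

static int node_of(int rank, int ppn)  { return rank / ppn; }
static int local_of(int rank, int ppn) { return rank % ppn; }
static int global_of(int node, int local, int ppn) { return node * ppn + local; }

int main(void)
{
    const int ppn = 4;
    for (int rank = 0; rank < 2 * ppn; ++rank) {
        /* The translation must be the identity, otherwise the final
         * reshuffle of gather/scatter data ends up reordered. */
        assert(global_of(node_of(rank, ppn), local_of(rank, ppn), ppn) == rank);
    }
    return 0;
}
```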
@bosilca With the latest commits this morning, I get this compiler warning:
I saw it, but I was hoping to finish the entire datatype patch before pushing. I pushed a partial fix (only for this issue); I'm working on fixing the support for MPI_IN_PLACE. |
FWIW, I ran again with 356e089 and still see no differences on master with and without han. |
With b43145f, I still hit errors in the following tests (which do not fail with master):
collective/gather_in_place3
collective/scatter_in_place: hang
collective/intercomm/reduce_scatter_block_nocommute_stride_inter
collective/int_overflow
With cea7be6, the errors in collective/int_overflow and collective/intercomm/reduce_scatter_block_nocommute_stride_inter
Ran osu_iallgather with tcp and HAN (commit 1462363) and saw a hang.
I'm running without HAN to double-check. |
@zhngaj The hang in iallgather seems independent of HAN; I can reproduce it with current master. |
Hmm, I ran it without HAN by removing
This is a DDT aggregated stack trace from current master:
Command:
I don't think it's a deadlock, but something is definitely going wrong with the test itself. I left it running while going for a coffee and more output kept coming, but as you can see the time somehow exploded.
Ahh yes, here is another data point, scaling the number of ranks with current master (on 4 nodes):
The measurements at higher rank counts seem rather unstable; the very next run with 42 ranks yields:
As a comparison, with pml/ucx things look OK:
Among many other things:
- Fix an imbalance bug in MPI_Allgather.
- Accept more human-readable configuration files. We can now specify the collective by name instead of a magic number, and also the component we want to use by name.
- Add the capability to have optional arguments in the collective communication configuration file. Right now the capability exists for segment lengths, but it is yet to be connected with the algorithms.
- Redo the initialization of all HAN collectives. Clean up the fallback collective support.
- In case the module is unable to deliver the expected result, it will fall back to executing the collective operation with another collective component. This change makes the support for this fallback simpler to use.
- Implement a fallback allowing a HAN module to remove itself as a potential active collective module and instead fall back to the next module in line.
- Completely disable the HAN modules on error. From the moment an error is encountered they remove themselves from the communicator, and in case some other module calls them they simply behave as a pass-through.

Communicator: provide ompi_comm_split_with_info to split and provide info at the same time.

Add ompi_comm_coll_preference info key to control collective component selection.

COLL HAN: use info keys instead of a component-level variable to communicate the topology level between abstraction layers.
- The info value is a comma-separated list of entries, chosen with decreasing priorities. This overrides the priority of the component, unless the component has disqualified itself. An entry prefixed with ^ starts the ignore-list; any entry following this character will be ignored during the collective component selection for the communicator. Example: "sm,libnbc,^han,adapt" gives sm the highest preference, followed by libnbc. The components han and adapt are ignored in the selection process.
- Allocate a temporary buffer for all lower-level leaders (length 2 segments).
- Fix the handling of MPI_IN_PLACE for gather and scatter.

COLL HAN: Fix topology handling.
- HAN should not rely on node names to determine the ordering of ranks. Instead, use the node leaders as identifiers and short-cut if the node leaders agree that ranks are consecutive. Also, error out if the rank distribution is imbalanced for now.

Signed-off-by: Xi Luo <xluo12@vols.utk.edu>
Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
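For illustration, a minimal sketch of how an application might use the ompi_comm_coll_preference info key described above. That the key is honored when the communicator is created via MPI_Comm_dup_with_info is my assumption; the commit message only documents the key name and its value format.

```c
/* Sketch only: requesting a per-communicator collective preference via the
 * ompi_comm_coll_preference info key.  Assumes the key is read when the new
 * communicator is created with MPI_Comm_dup_with_info. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    /* sm gets the highest preference, then libnbc; han and adapt are ignored. */
    MPI_Info_set(info, "ompi_comm_coll_preference", "sm,libnbc,^han,adapt");

    MPI_Comm comm;
    MPI_Comm_dup_with_info(MPI_COMM_WORLD, info, &comm);
    MPI_Barrier(comm);   /* collectives on comm follow the preference list */

    MPI_Comm_free(&comm);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```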
There was a bug allowing partial packing of non-data elements (such as loop and end_loop markers) during the exit condition of a pack/unpack call, which has basically no meaning. Prevent this from happening by making sure the element points to actual data before trying to partially pack it.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
HAN is a flexible, hierarchical, autotuned collective module. In HAN, the processes of a collective operation are grouped into sub-communicators based on their topological levels.
HAN breaks hierarchical collective operations into multiple collective operations on the sub-communicators.
HAN uses the existing collective modules in OMPI as sub-modules to perform the collective operations (blocking and nonblocking) on the sub-communicators, and orchestrates them to perform hierarchical, architecture-aware collective operations.
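As an illustration of that decomposition, here is a minimal sketch using plain MPI calls (this is not HAN's source): a broadcast split into an inter-node step among node leaders followed by an intra-node step. It assumes the broadcast root is global rank 0.

```c
/* Not HAN's implementation: a minimal sketch of the hierarchical idea with
 * plain MPI calls.  Assumes the broadcast root is global rank 0. */
#include <mpi.h>

void hierarchical_bcast(void *buf, int count, MPI_Datatype dtype, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    /* Low level: ranks sharing a node. */
    MPI_Comm low_comm;
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL, &low_comm);

    int low_rank;
    MPI_Comm_rank(low_comm, &low_rank);

    /* Up level: one leader per node (the lowest rank on each node). */
    MPI_Comm up_comm;
    MPI_Comm_split(comm, low_rank == 0 ? 0 : MPI_UNDEFINED, rank, &up_comm);

    /* Step 1: inter-node broadcast among the node leaders. */
    if (MPI_COMM_NULL != up_comm) {
        MPI_Bcast(buf, count, dtype, 0, up_comm);
        MPI_Comm_free(&up_comm);
    }
    /* Step 2: intra-node broadcast from each leader to the ranks on its node. */
    MPI_Bcast(buf, count, dtype, 0, low_comm);
    MPI_Comm_free(&low_comm);
}
```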
HAN provides several features: