
A hierarchical, architecture-aware collective communication module #7735

Merged 4 commits on Oct 26, 2020

Conversation

@bosilca (Member) commented May 13, 2020

HAN is a flexible, hierarchical, autotuned collective module. In HAN, the processes of a collective operation are grouped into sub-communicators based on their topological levels.
HAN then breaks a hierarchical collective operation into multiple collective operations on those sub-communicators.

HAN uses the existing collective modules in OMPI as sub-modules to perform the (blocking and non-blocking) collective operations on the sub-communicators, and orchestrates them into hierarchical, architecture-aware collective operations. A minimal sketch of this two-level decomposition is shown after the feature list below.
HAN provides several features:

  1. It can adapt easily to hardware and software updates by swapping sub-modules in and out while keeping HAN's algorithms intact.
  2. It pipelines large messages, overlapping communications to improve performance.
  3. It defines a configuration file to guide the selection of sub-modules for different collective operations, topological levels, and configuration sizes.
  4. (In progress) An autotuning component that can automatically find the best combination of sub-modules. It is turned off for now, but we will update it later in the release cycle.
  5. Any improvement in the tuned collective component (such as the ongoing effort to update the decision functions) will directly benefit HAN.
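
As a minimal sketch of the two-level decomposition described above (illustrative only, not HAN's actual internals), a broadcast can be expressed as a broadcast among per-node leaders followed by a node-local broadcast. The low_comm/up_comm names are assumptions; a later sketch shows how such sub-communicators can be built with MPI_Comm_split_type.

#include <mpi.h>

/* Illustrative two-level broadcast: data flows root -> per-node leaders ->
 * node-local processes.  Assumes the global root is local rank 0 on its node
 * and holds rank root_up in up_comm, and that only the per-node leaders are
 * members of up_comm (everyone else has up_comm == MPI_COMM_NULL). */
static void two_level_bcast(void *buf, int count, MPI_Datatype dtype,
                            int root_up, MPI_Comm low_comm, MPI_Comm up_comm)
{
    if (MPI_COMM_NULL != up_comm) {
        /* Step 1: broadcast among the per-node leaders. */
        MPI_Bcast(buf, count, dtype, root_up, up_comm);
    }
    /* Step 2: each leader re-broadcasts inside its own node. */
    MPI_Bcast(buf, count, dtype, 0, low_comm);
}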

@ompiteam-bot

Can one of the admins verify this patch?

@rhc54 (Contributor) commented May 13, 2020

@bosilca Just curious - when you say "architecture-aware", are you speaking of knowing (for example) which procs are on nodes sharing a common switch, which procs are lowest-ranked on a given switch, etc? In other words, are you following the hardware wiring topology?

If so, I'm curious as to where you get that topology info. I'm working now on a PMIx plugin to provide it for certain types of fabrics, if that would help here.

@bosilca (Member, Author) commented May 13, 2020

@rhc54 right now we are working on a 2-level hierarchy, with the information extracted from OMPI via MPI_Comm_split_type(COMM_TYPE_SHARED). With additional information we should be able to extend the framework and build additional levels in the hierarchy.
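
For reference, a minimal standalone sketch of that 2-level construction using only standard MPI calls (not HAN's code): MPI_Comm_split_type builds the node-local communicator, and MPI_Comm_split keeps one leader (local rank 0) per node for the upper level.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm low_comm, up_comm;
    int world_rank, local_rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Lower level: all processes sharing a node. */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &low_comm);
    MPI_Comm_rank(low_comm, &local_rank);

    /* Upper level: one leader (local rank 0) per node; non-leaders get
     * MPI_COMM_NULL. */
    MPI_Comm_split(MPI_COMM_WORLD,
                   (0 == local_rank) ? 0 : MPI_UNDEFINED,
                   world_rank, &up_comm);

    if (MPI_COMM_NULL != up_comm) {
        MPI_Comm_free(&up_comm);
    }
    MPI_Comm_free(&low_comm);
    MPI_Finalize();
    return 0;
}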

@ggouaillardet (Contributor)

@jsquyres did we expect the ompiteam-bot to ask for "an admin to verify this patch"?

I mean that even if George did not author the commits, he pushed them, so they should not have been flagged as untrusted.

@rhc54 (Contributor) commented May 14, 2020

> @rhc54 right now we are working on a 2-level hierarchy, with the information extracted from OMPI via MPI_Comm_split_type(COMM_TYPE_SHARED). With additional information we should be able to extend the framework and build additional levels in the hierarchy.

@bosilca Kewl - I'll coordinate with you and your team as I move forward. The info I'll be providing is basically a "wiring diagram". It will consist of the following:

  • the local ranks, so you can identify the lowest local rank participating in the collective - expected to be used to collect all node-local contributions
  • a switch rank for each process in the job, from which you can select someone to aggregate the collective at the switch level

etc. etc. Basically, following the wires to minimize hops. We are working on methods for handling multi-NIC systems - a little more complicated 😄
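
As a sketch of how such a wiring diagram could be consumed (the switch_id value is assumed to come from a topology provider such as the PMIx plugin mentioned above; names are illustrative): splitting the communicator by switch identifier yields a per-switch communicator, and rank 0 in each one is a natural switch-level aggregator.

#include <mpi.h>

static int pick_switch_aggregator(int switch_id, MPI_Comm comm,
                                  MPI_Comm *switch_comm)
{
    int rank, sw_rank;
    MPI_Comm_rank(comm, &rank);
    /* Group the processes behind the same switch together. */
    MPI_Comm_split(comm, switch_id, rank, switch_comm);
    MPI_Comm_rank(*switch_comm, &sw_rank);
    /* Nonzero if this process aggregates for its switch. */
    return (0 == sw_rank);
}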

@ggouaillardet (Contributor)

@bwbarrett gcc6 is not supported on ubuntu by https://github.com/open-mpi/ompi-scripts/blob/master/jenkins/open-mpi-build-script.sh

Can you please check whether it should be added, or whether this platform/compiler combination should be removed from the "matrix"?

@EmmanuelBRELLE (Contributor)

Wasn't this feature supposed to be tagged for the v5.0.0 milestone?

@awlauria (Contributor)

Thanks, added it for tracking.

@jsquyres (Member)

bot:aws:retest

@jsquyres (Member)

bot:aws:retest

Same error on a different platform (see #7847)

@devreal (Contributor) left a comment

I looked through the code and found a few things regarding memory management, though maybe I didn't fully understand the control flow. It also seems like there may be an integer overflow in the selection logic (see my comments about the msg_size variables and members). Some of the functions can be made static and removed from the header files, while others can probably be made static inline in the headers. Plus a few typos I found in comments.
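
To illustrate the overflow concern (hypothetical names, not the actual HAN code): a message size computed as count times the datatype size can exceed INT_MAX, so the selection logic should carry it in a size_t rather than an int.

#include <stddef.h>

/* Safe: the product is computed in size_t.  An 'int msg_size = count * dtype_size;'
 * with int operands would silently wrap for large counts. */
static size_t coll_msg_size(size_t count, size_t dtype_size)
{
    return count * dtype_size;
}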

Review comments (since resolved) on:
ompi/mca/coll/han/coll_han.h
ompi/mca/coll/han/coll_han_allgather.c
ompi/mca/coll/han/coll_han_dynamic.h
ompi/mca/coll/han/coll_han_dynamic.c
ompi/mca/coll/han/coll_han_module.c
ompi/mca/coll/han/coll_han_utils.c
@bosilca (Member, Author) commented Jul 22, 2020

All comments have been addressed.

@jsquyres (Member) left a comment

I'm sorry, I've gotten distracted by internal Cisco shiny objects, and I haven't finished my review yet. But I at least wanted to submit what I have done so far...

Review comments (since resolved) on:
ompi/mca/coll/han/Makefile.am
ompi/mca/coll/han/coll_han.h
ompi/mca/coll/base/coll_base_util.c
ompi/mca/coll/han/coll_han_allgather.c
ompi/mca/coll/han/coll_han_component.c
@bosilca force-pushed the coll/han branch 3 times, most recently from 70471ae to 23e98bf on July 27, 2020 03:58
@devreal (Contributor) left a comment

I think all but one of my points have been addressed (https://github.com/open-mpi/ompi/pull/7735/files#diff-a4e1605aa2038222adeba1c42f218130R231) but that one is minor. LGTM, thanks @bosilca!

@bosilca (Member, Author) commented Oct 10, 2020

There are no changes, I simply cleaned the history in preparation for the merge.

Regarding your report, I tried all possible process placements, with and without oversubscription (as I don't have 36 nodes available), but as soon as I curate my list of allowed BTLs I can't replicate it. As you suggested, the error seems to indicate that a communicator with 2 processes failed to be created, but just from the stack I can't figure out what's going on.

@zhngaj (Contributor) commented Oct 12, 2020

I hit the same error with IMB-MPI1 on your latest force-pushed PR; it is not a segfault, though.

#----------------------------------------------------------------
# Benchmarking PingPong
# ( 0 groups of 2 processes each running simultaneous )
# Group # ( 288 additional processes waiting in MPI_Barrier)
#----------------------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec      defects

#----------------------------------------------------------------
# Benchmarking PingPing
# ( 0 groups of 2 processes each running simultaneous )
# Group # ( 288 additional processes waiting in MPI_Barrier)
#----------------------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec      defects

@jsquyres (Member) commented Oct 12, 2020

With a7cc337, I see a small number of regressions when running the IBM test suite (collective subdir) with HAN vs. without HAN:

intercomm/allreduce_nocommute_gap_inter: MPI_Abort with error code 100

Running test
Running test: mpirun --mca pml ob1 --mca btl usnic,vader,self --timeout 120 --mca coll_han_priority 100 -np 4 --hostfile /home/jsquyres/ibm-hostfile.txt ./allreduce_nocommute_gap_inter
rbuf[0] = 0, ans[0] = 4
rbuf[1] = 0, ans[1] = 5
rbuf[2] = 0, ans[2] = 16
rbuf[3] = 0, ans[3] = 21
...snipped lots of output...
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
  Proc: [[27787,1],0]
  Errorcode: 100

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

intercomm/reduce_nocommute_gap_inter: MPI_Abort with error code 100

same output as previous

bcast_struct: segv

Running test: mpirun --mca pml ob1 --mca btl usnic,vader,self --timeout 120 --mca coll_han_priority 100 -np 4 --hostfile /home/jsquyres/ibm-hostfile.txt ./bcast_struct
[mpi002:04869] *** Process received signal ***
[mpi002:04869] Signal: Segmentation fault (11)
[mpi002:04869] Signal code: Address not mapped (1)
[mpi002:04869] Failing at address: 0x2aab008dd000
[mpi002:04869] [ 0] /lib64/libpthread.so.0(+0xf630)[0x2aaaab0da630]
[mpi002:04869] [ 1] /lib64/libc.so.6(+0x1573a0)[0x2aaaab43e3a0]
[mpi002:04869] [ 2] /home/jsquyres/bogus/lib/libopen-pal.so.0(+0x82a7e)[0x2aaaab737a7e]
[mpi002:04869] [ 3] /home/jsquyres/bogus/lib/libopen-pal.so.0(opal_generic_simple_pack+0x320)[0x2aaaab73950f]
[mpi002:04869] [ 4] /home/jsquyres/bogus/lib/libopen-pal.so.0(opal_convertor_pack+0x323)[0x2aaaab72a786]
[mpi002:04869] [ 5] /home/jsquyres/bogus/lib/openmpi/mca_coll_sm.so(mca_coll_sm_bcast_intra+0x3a4)[0x2aaafca2516d]
[mpi002:04869] [ 6] /home/jsquyres/bogus/lib/openmpi/mca_coll_han.so(+0x3397)[0x2aaafce33397]
[mpi002:04869] [ 7] /home/jsquyres/bogus/lib/openmpi/mca_coll_han.so(issue_task+0x21) [0x2aaafce3e682]
[mpi002:04869] [ 8] /home/jsquyres/bogus/lib/openmpi/mca_coll_han.so(mca_coll_han_bcast_intra+0x4af)[0x2aaafce32f13]
[mpi002:04869] [ 9] /home/jsquyres/bogus/lib/openmpi/mca_coll_han.so(mca_coll_han_bcast_intra_dynamic+0x32d)[0x2aaafce3fea9]
[mpi002:04869] [10] /home/jsquyres/bogus/lib/libmpi.so.0(MPI_Bcast+0x2e2)[0x2aaaaad66c21]
[mpi002:04869] [11] ./bcast_struct[0x401917]
[mpi002:04869] [12] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaaab309555]
[mpi002:04869] [13] ./bcast_struct[0x4010d9]
[mpi002:04869] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 3 with PID 0 on node mpi002 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Exit status: 139
FAIL bcast_struct (exit status: 139)

gather_in_place: wrong answer:

Running test
Running test: mpirun --mca pml ob1 --mca btl usnic,vader,self --timeout 120 --mca coll_han_priority 100 -np 4 --hostfile /home/jsquyres/ibm-hostfile.txt ./gather_in_place
[**ERROR**]: MPI_COMM_WORLD rank 1, file gather_in_place.c:80:
bad answer (-1) at index 0 of 4 (should be 0)
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
  Proc: [[27553,1],1]
  Errorcode: 1

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
Exit status: 1
FAIL gather_in_place (exit status: 1)

gather_in_place2: segv

Running test
Running test: mpirun --mca pml ob1 --mca btl usnic,vader,self --timeout 120 --mca coll_han_priority 100 -np 4 --hostfile /home/jsquyres/ibm-hostfile.txt ./gather_in_place
2
[mpi002:05334] *** Process received signal ***
[mpi002:05334] Signal: Segmentation fault (11)
[mpi002:05334] Signal code: Address not mapped (1)
[mpi002:05334] Failing at address: 0x30
[mpi002:05334] [ 0] /lib64/libpthread.so.0(+0xf630)[0x2aaaab0da630]
[mpi002:05334] [ 1] /home/jsquyres/bogus/lib/openmpi/mca_coll_han.so(+0xe6c0)[0x2aaafce3e6c0]
[mpi002:05334] [ 2] /home/jsquyres/bogus/lib/openmpi/mca_coll_han.so(+0xe6f5)[0x2aaafce3e6f5]
[mpi002:05334] [ 3] /home/jsquyres/bogus/lib/openmpi/mca_coll_han.so(mca_coll_han_gather_intra_dynamic+0x53)[0x2aaafce3ff07]
[mpi002:05334] [ 4] /home/jsquyres/bogus/lib/libmpi.so.0(MPI_Gather+0x57d)[0x2aaaaad88ade]
[mpi002:05334] [ 5] ./gather_in_place2[0x400f34]
[mpi002:05334] [ 6] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaaab309555]
[mpi002:05334] [ 7] ./gather_in_place2[0x400d59]
[mpi002:05334] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node mpi002 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Exit status: 139
FAIL gather_in_place2 (exit status: 139)

scatter_in_place: wrong answer

Running test
Running test: mpirun --mca pml ob1 --mca btl usnic,vader,self --timeout 120 --mca coll_han_priority 100 -np 4 --hostfile /home/jsquyres/ibm-hostfile.txt ./scatter_in_place
malloc debug: Request for 0 bytes (coll_han_scatter.c, 186)
[**ERROR**]: MPI_COMM_WORLD rank 1, file scatter_in_place.c:78:
task 1: bad answer (0) at index 0 of 1 (should be 1)
[**ERROR**]: MPI_COMM_WORLD rank 3, file scatter_in_place.c:78:
task 3: bad answer (0) at index 0 of 1 (should be 3)
[**ERROR**]: MPI_COMM_WORLD rank 2, file scatter_in_place.c:78:
task 2: bad answer (0) at index 0 of 1 (should be 2)
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
  Proc: [[59252,1],1]
  Errorcode: 1

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
  Proc: [[59252,1],3]
  Errorcode: 1

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
  Proc: [[59252,1],2]
  Errorcode: 1

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
Exit status: 1
FAIL scatter_in_place (exit status: 1)

@shijin-aws (Contributor)

@bosilca I can hit similar errors when running ompi-tests/ibm/collective/bcast with HAN on 2 nodes, which might be easier to reproduce. There are similar errors in more IBM collective tests; the bcast here is just one example:

[ec2-user@ip-172-31-11-68 ibm]$ /fsx/ompi-han/install/bin/mpirun --prefix /fsx/ompi-han/install --mca coll_han_priority 100 --mca pml ob1 --hostfile /fsx/hosts.file -n 72 collective/bcast
...
[ip-172-31-15-97:29070] Read -1, expected 65536, errno = 14
[ip-172-31-15-97:29069] Read -1, expected 65536, errno = 14
[ip-172-31-15-97:29075] Read -1, expected 65536, errno = 14
[ip-172-31-15-97:29075] *** Process received signal ***
[ip-172-31-15-97:29068] Read -1, expected 65536, errno = 14
[ip-172-31-15-97:29097] Read -1, expected 65536, errno = 14
[ip-172-31-15-97:29125] Read -1, expected 65536, errno = 14
[ip-172-31-15-97:29075] Signal: Segmentation fault (11)
[ip-172-31-15-97:29075] Signal code:  (-6)
[ip-172-31-15-97:29075] Failing at address: 0x1f400007193
[ip-172-31-15-97:29069] *** Process received signal ***
[ip-172-31-15-97:29069] Signal: Segmentation fault (11)
[ip-172-31-15-97:29069] Signal code:  (-6)
[ip-172-31-15-97:29069] Failing at address: 0x1f40000718d
[ip-172-31-15-97:29072] Read -1, expected 65536, errno = 14
[ip-172-31-15-97:29096] Read -1, expected 65536, errno = 14
[ip-172-31-15-97:29074] Read -1, expected 65536, errno = 14
[ip-172-31-15-97:29095] Read -1, expected 65536, errno = 14
[ip-172-31-8-146:29178] Read -1, expected 65536, errno = 14
[ip-172-31-15-97:29073] Read -1, expected 65536, errno = 14
[ip-172-31-8-146:29176] Read -1, expected 65536, errno = 14
[ip-172-31-15-97:29094] Read -1, expected 65536, errno = 14
[ip-172-31-15-97:29072] *** Process received signal ***
[ip-172-31-15-97:29072] Signal: Segmentation fault (11)
[ip-172-31-15-97:29072] Signal code:  (-6)
[ip-172-31-15-97:29072] Failing at address: 0x1f400007190
[ip-172-31-8-146:29177] Read -1, expected 65536, errno = 14
[ip-172-31-15-97:29096] *** Process received signal ***
[ip-172-31-15-97:29096] Signal: Segmentation fault (11)
[ip-172-31-15-97:29096] Signal code:  (-6)
[ip-172-31-15-97:29096] Failing at address: 0x1f4000071a8
[ip-172-31-8-146:29203] Read -1, expected 65536, errno = 14
[ip-172-31-15-97:29074] *** Process received signal ***
[ip-172-31-15-97:29074] Signal: Segmentation fault (11)
[ip-172-31-15-97:29074] Signal code:  (-6)
[ip-172-31-15-97:29074] Failing at address: 0x1f400007192
[ip-172-31-8-146:29183] Read -1, expected 65536, errno = 14
[ip-172-31-15-97:29095] *** Process received signal ***
[ip-172-31-15-97:29095] Signal: Segmentation fault (11)
[ip-172-31-15-97:29095] Signal code:  (-6)
[ip-172-31-15-97:29095] Failing at address: 0x1f4000071a7
[ip-172-31-8-146:29179] Read -1, expected 65536, errno = 14
[ip-172-31-15-97:29069] [ 0] [ip-172-31-8-146:29232] Read -1, expected 65536, errno = 14
[ip-172-31-8-146:29232] *** Process received signal ***
[ip-172-31-8-146:29232] Signal: Segmentation fault (11)
[ip-172-31-8-146:29232] Signal code:  (-6)
[ip-172-31-8-146:29232] Failing at address: 0x1f400007230
[ip-172-31-15-97:29073] *** Process received signal ***
[ip-172-31-15-97:29073] Signal: Segmentation fault (11)
[ip-172-31-15-97:29073] Signal code:  (-6)
[ip-172-31-15-97:29073] Failing at address: 0x1f400007191
[ip-172-31-8-146:29234] Read -1, expected 65536, errno = 14
/lib64/libpthread.so.0(+0xf600)[0x7fdfa53ff600]
[ip-172-31-15-97:29069] [ 1] [ip-172-31-8-146:29233] Read -1, expected 65536, errno = 14
[ip-172-31-15-97:29094] *** Process received signal ***
[ip-172-31-15-97:29094] Signal: Segmentation fault (11)
[ip-172-31-15-97:29094] Signal code:  (-6)
[ip-172-31-15-97:29094] Failing at address: 0x1f4000071a6
[ip-172-31-8-146:29201] Read -1, expected 65536, errno = 14
[ip-172-31-8-146:29201] *** Process received signal ***
[ip-172-31-8-146:29201] Signal: Segmentation fault (11)
[ip-172-31-8-146:29201] Signal code:  (-6)
[ip-172-31-8-146:29201] Failing at address: 0x1f400007211
...
[ip-172-31-8-146:29202] *** End of error message ***
[ip-172-31-15-97:00000] *** An error occurred in Socket closed
[ip-172-31-15-97:00000] *** reported by process [3797155841,66]
[ip-172-31-15-97:00000] *** on a NULL communicator
[ip-172-31-15-97:00000] *** Unknown error
[ip-172-31-15-97:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ip-172-31-15-97:00000] ***    and MPI will try to terminate your MPI job as well)
[ip-172-31-8-146:00000] *** An error occurred in Socket closed
[ip-172-31-8-146:00000] *** reported by process [3797155841,22]
[ip-172-31-8-146:00000] *** on a NULL communicator
[ip-172-31-8-146:00000] *** Unknown error
[ip-172-31-8-146:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ip-172-31-8-146:00000] ***    and MPI will try to terminate your MPI job as well)
[ip-172-31-8-146:29165] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2257
[ip-172-31-8-146:29165] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2257
[ip-172-31-8-146:29165] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2257
[ip-172-31-8-146:29165] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2257
[ip-172-31-8-146:29165] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2257
[ip-172-31-8-146:29165] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2257
[ip-172-31-8-146:29165] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2257

@shijin-aws (Contributor)

The errors reported by @jsquyres also show up in my IBM test suite runs.

@devreal (Contributor) commented Oct 13, 2020

@bosilca @jsquyres I believe there is a bug in the gather_in_place2 test: the test passes NULL instead of MPI_DATATYPE_NULL as the sendtype in https://github.com/open-mpi/ompi-tests/blob/master/ibm/collective/gather_in_place2.c#L31. This leads to the segfault you're seeing.

I have a hunch about what causes the gather tests to fail; working on a fix.
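
For context, the in-place gather usage the test presumably intends looks like the sketch below (illustrative buffer names). At the root the send arguments are ignored, but they should still be well-defined values such as MPI_DATATYPE_NULL rather than a bare NULL pointer.

#include <mpi.h>

static void gather_in_place_example(int *rbuf, int count, int root, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);
    if (rank == root) {
        /* The root's own contribution is already resident in rbuf. */
        MPI_Gather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                   rbuf, count, MPI_INT, root, comm);
    } else {
        /* Non-roots only send; their receive arguments are ignored. */
        MPI_Gather(rbuf, count, MPI_INT,
                   NULL, 0, MPI_DATATYPE_NULL, root, comm);
    }
}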

@shijin-aws (Contributor)

@bosilca I can confirm 37f8a93 fixed the ibm/bcast error I reported earlier.

@shijin-aws (Contributor) commented Oct 14, 2020

@bosilca the latest commits fix the gather_in_place failure reported by @jsquyres. Still hitting errors in ompi-tests/ibm/collective/scatter_in_place and ompi-tests/ibm/collective/int_overflow:

  • scatter_in_place: bad answer
/fsx/ompi-han/install/bin/mpirun --prefix /fsx/ompi-han/install --mca coll_han_priority 100 --mca pml ob1 --hostfile /fsx/hosts.file -n 72 collective/scatter_in_place
[**ERROR**]: MPI_COMM_WORLD rank 34, file scatter_in_place.c:78:
task 34: bad answer (-958849888) at index 0 of 1 (should be 34)
[**ERROR**]: MPI_COMM_WORLD rank 33, file scatter_in_place.c:78:
task 33: bad answer (-558907667) at index 0 of 1 (should be 33)
[**ERROR**]: MPI_COMM_WORLD rank 2, file scatter_in_place.c:78:
task 2: bad answer (28257) at index 0 of 1 (should be 2)
[**ERROR**]: MPI_COMM_WORLD rank 3, file scatter_in_place.c:78:
task 3: bad answer (0) at index 0 of 1 (should be 3)
[**ERROR**]: MPI_COMM_WORLD rank 35, file scatter_in_place.c:78:
task 35: bad answer (32739) at index 0 of 1 (should be 35)
[**ERROR**]: MPI_COMM_WORLD rank 16, file scatter_in_place.c:78:
task 16: bad answer (1936482662) at index 0 of 1 (should be 16)
[**ERROR**]: MPI_COMM_WORLD rank 32, file scatter_in_place.c:78:
task 32: bad answer (-558907667) at index 0 of 1 (should be 32)
[**ERROR**]: MPI_COMM_WORLD rank 8, file scatter_in_place.c:78:
task 8: bad answer (1936482662) at index 0 of 1 (should be 8)
[**ERROR**]: MPI_COMM_WORLD rank 9, file scatter_in_place.c:78:
task 9: bad answer (101) at index 0 of 1 (should be 9)
[**ERROR**]: MPI_COMM_WORLD rank 27, file scatter_in_place.c:78:
task 27: bad answer (512) at index 0 of 1 (should be 27)
[**ERROR**]: MPI_COMM_WORLD rank 24, file scatter_in_place.c:78:
task 24: bad answer (1936482662) at index 0 of 1 (should be 24)
[**ERROR**]: MPI_COMM_WORLD rank 10, file scatter_in_place.c:78:
task 10: bad answer (-954845208) at index 0 of 1 (should be 10)
[**ERROR**]: MPI_COMM_WORLD rank 18, file scatter_in_place.c:78:
task 18: bad answer (-954845240) at index 0 of 1 (should be 18)
[**ERROR**]: MPI_COMM_WORLD rank 26, file scatter_in_place.c:78:
  • int_overflow: segfault
/fsx/ompi-han/install/bin/mpirun --prefix /fsx/ompi-han/install --mca pml ob1 --hostfile /fsx/hosts.file -n 72 collective/int_overflow
seed value: -1201409492
sys_query:
- R0 (ip-172-31-4-68) : 36 ranks, 198349 Mb
- R36 (ip-172-31-6-125) : 36 ranks, 198349 Mb

Running up to 34710 Mb/rank

**** comm nranks=4 :  0 1 2 3

- pt2pt count=1000000 dtsize=8 :  sbuf=8.0Mb rbuf=8.0Mb
- pt2pt count=500000000 dtsize=8 :  sbuf=4000.0Mb rbuf=4000.0Mb
- pt2pt count=1000000000 dtsize=8 :  sbuf=8000.0Mb rbuf=8000.0Mb
- pt2pt count=2000000000 dtsize=8 :  sbuf=16000.0Mb rbuf=16000.0Mb
- pt2pt count=453069600 dtsize=8 :  sbuf=3624.6Mb rbuf=3624.6Mb
- allgather count=1000000 dtsize=8 :  sbuf=8.0Mb/rank rbuf=32.0Mb
- allgather count=500000000 dtsize=8 :  sbuf=4000.0Mb/rank rbuf=16000.0Mb
- allgather count=1000000000 dtsize=8 :  sbuf=8000.0Mb/rank rbuf=32000.0Mb [SKIP]
- allgather count=2000000000 dtsize=8 :  sbuf=16000.0Mb/rank rbuf=64000.0Mb [SKIP]
- allgather count=1165739880 dtsize=8 :  sbuf=9325.9Mb/rank rbuf=37303.7Mb [SKIP]
[ip-172-31-4-68:35465] *** Process received signal ***
[ip-172-31-4-68:35465] Signal: Segmentation fault (11)
[ip-172-31-4-68:35465] Signal code:  (-6)
[ip-172-31-4-68:35465] Failing at address: 0x1f400008a89
[ip-172-31-4-68:35465] [ 0] /lib64/libpthread.so.0(+0xf600)[0x7fb2446a5600]
[ip-172-31-4-68:35465] [ 1] /fsx/ompi-han/install/lib/libmpi.so.0(ompi_comm_split_with_info+0xf8)[0x7fb2448e9278]
[ip-172-31-4-68:35465] [ 2] /fsx/ompi-han/install/lib/libmpi.so.0(ompi_comm_split+0x41)[0x7fb2448e9b1e]
[ip-172-31-4-68:35465] [ 3] /fsx/ompi-han/install/lib/libmpi.so.0(MPI_Comm_split+0x160)[0x7fb244955ea3]
[ip-172-31-4-68:35465] [ 4] collective/int_overflow[0x402027]
[ip-172-31-4-68:35465] [ 5] collective/int_overflow[0x40257c]
[ip-172-31-4-68:35465] [ 6] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fb2442ea575]
[ip-172-31-4-68:35465] [ 7] collective/int_overflow[0x401479]
[ip-172-31-4-68:35465] *** End of error message ***
[ip-172-31-6-125:35480] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2257
[ip-172-31-6-125:35480] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2257
[ip-172-31-6-125:35480] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2257

@jsquyres (Member)

With 1634ba0, I get the same results on master with and without han:

FAIL: op
FAIL: reduce_scatter_block_nocommute_stride
FAIL: reduce_scatter_block_nocommute_stride_in_place
FAIL: op_mpifh
FAIL: op_usempi
FAIL: op_usempif08

@shijin-aws (Contributor) commented Oct 14, 2020

Yes, I hit those failures on master with and without han as well, so I didn't include them in the errors I reported earlier.

@shijin-aws (Contributor) commented Oct 15, 2020

With 1634ba0:

  • ompi-tests/ibm/collective/gather_in_place still fails when running with 144 procs on 4 nodes:
$ /fsx/ompi-han/install/bin/mpirun --prefix /fsx/ompi-han/install --mca coll_han_priority 100 --mca pml ob1 --hostfile /fsx/hosts.file -n 144 collective/gather_in_place
Warning: Permanently added 'ip-172-31-9-86,172.31.9.86' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ip-172-31-14-214,172.31.14.214' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ip-172-31-2-172,172.31.2.172' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ip-172-31-2-154,172.31.2.154' (ECDSA) to the list of known hosts.
gather_in_place: opal_datatype_copy.h:147: non_overlap_copy_content_same_ddt: Assertion `((iov_len_local) != 0) && ((count) != 0)' failed.
gather_in_place: opal_datatype_copy.h:147: non_overlap_copy_content_same_ddt: Assertion `((iov_len_local) != 0) && ((count) != 0)' failed.
[ip-172-31-2-154:35803] *** Process received signal ***
[ip-172-31-2-154:35803] Signal: Aborted (6)
[ip-172-31-2-154:35803] Signal code:  (-6)
[ip-172-31-2-154:35801] *** Process received signal ***
gather_in_place: opal_datatype_copy.h:147: non_overlap_copy_content_same_ddt: Assertion `((iov_len_local) != 0) && ((count) != 0)' failed.
[ip-172-31-2-154:35801] Signal: Aborted (6)
[ip-172-31-2-154:35801] Signal code:  (-6)
[ip-172-31-2-154:35803] [ 0] [ip-172-31-2-154:35802] *** Process received signal ***
[ip-172-31-2-154:35802] Signal: Aborted (6)
[ip-172-31-2-154:35802] Signal code:  (-6)
/lib64/libpthread.so.0(+0xf600)[0x7fd37d530600]
[ip-172-31-2-154:35803] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7fd37d1893a7]
[ip-172-31-2-154:35803] [ 2] /lib64/libc.so.6(abort+0x148)[0x7fd37d18aa98]
[ip-172-31-2-154:35803] [ 3] /lib64/libc.so.6(+0x2f1c6)[0x7fd37d1821c6]
[ip-172-31-2-154:35803] [ 4] /lib64/libc.so.6(+0x2f272)[0x7fd37d182272]
[ip-172-31-2-154:35803] [ 5] [ip-172-31-2-154:35802] [ 0] /lib64/libpthread.so.0(+0xf600)[0x7f06e5756600]

Running on 2 nodes (72 procs) does not show this issue.

  • reduce_scatter_block: bad answer when running with 144 procs on 4 nodes:
/fsx/ompi-han/install/bin/mpirun --prefix /fsx/ompi-han/install --mca coll_han_priority 100 --mca pml ob1 --hostfile /fsx/hosts.file -n 144 collective/reduce_scatter_block
[**ERROR**]: MPI_COMM_WORLD rank 16, file reduce_scatter_block.c:80:
[**ERROR**]: MPI_COMM_WORLD rank 0, file reduce_scatter_block.c:80:
bad answer (10368) at index 0 of 1 (should be 0)
[**ERROR**]: MPI_COMM_WORLD rank 68, file reduce_scatter_block.c:80:
bad answer (4608) at index 0 of 1 (should be 9792)
[**ERROR**]: MPI_COMM_WORLD rank 72, file reduce_scatter_block.c:80:
[**ERROR**]: MPI_COMM_WORLD rank 32, file reduce_scatter_block.c:80:

@jsquyres (Member)

@wckzhang Does the same issue you fixed in https://github.com/open-mpi/ompi-tests/pull/137 also apply to these two test failures, perchance?

@wckzhang (Contributor)

> @wckzhang Does the same issue you fixed in open-mpi/ompi-tests#137 also apply to these two test failures, perchance?

I highly doubt it; I checked those two tests, and they both use malloc and have done so for many years.

@bosilca (Member, Author) commented Oct 16, 2020

We found the problem: the final reshuffle of data for the gather/scatter operations is incorrect, because the translation between the different hierarchies was computed incorrectly. Fix to come tomorrow.
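
As a rough picture of the reshuffle in question (illustrative only, not the actual fix): in a two-level gather the data reaches the root ordered by (node, local rank), so a translation table from that arrival order back to communicator ranks is needed before the final copy into the user buffer. The topo array below is a hypothetical mapping of that kind.

#include <string.h>
#include <stddef.h>

/* topo[i] is assumed to be the communicator rank whose contribution landed
 * as the i-th block of the gathered temporary buffer. */
static void reorder_gathered(const char *tmp, char *rbuf, const int *topo,
                             int nprocs, size_t block_size)
{
    for (int i = 0; i < nprocs; i++) {
        memcpy(rbuf + (size_t)topo[i] * block_size,
               tmp  + (size_t)i * block_size,
               block_size);
    }
}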

@jsquyres (Member)

@bosilca With the latest commits this morning, I get this compiler warning:

coll_han_gather.c: In function ‘mca_coll_han_gather_intra’:
coll_han_gather.c:103:22: warning: ‘w_rank’ may be used uninitialized in this function [-Wmaybe-uninitialized]
     ompi_datatype_t *dtype = (w_rank == root) ? rdtype : sdtype;
                      ^~~~~
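
A hedged sketch of the kind of change this warning calls for (not necessarily the fix that was pushed): assign w_rank from the communicator unconditionally before the datatype selection the warning points at.

/* Inside mca_coll_han_gather_intra(), assuming the usual OMPI helpers: */
int w_rank = ompi_comm_rank(comm);   /* set on every path before use */
ompi_datatype_t *dtype = (w_rank == root) ? rdtype : sdtype;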

@bosilca (Member, Author) commented Oct 16, 2020

I saw it, but I was hoping to finish the entire datatype patch before pushing. I pushed a partial fix (only for this issue); I'm working on fixing the support for MPI_IN_PLACE.

@jsquyres (Member)

FWIW, I ran again with 356e089 and still see no differences on master with and without han.

@shijin-aws (Contributor)

With b43145f, I still hit errors in the following tests (which do not fail with master):

collective/gather_in_place3:

[ec2-user@ip-172-31-11-68 ibm]$ /fsx/ompi-han/install/bin/mpirun --prefix /fsx/ompi-han/install --mca pml ob1 --mca coll_han_priority 100 --hostfile /fsx/hosts.file -n 144 collective/gather_in_place3
Warning: Permanently added 'ip-172-31-5-167,172.31.5.167' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ip-172-31-0-172,172.31.0.172' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ip-172-31-0-155,172.31.0.155' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ip-172-31-10-179,172.31.10.179' (ECDSA) to the list of known hosts.
--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now hang or fail.

  Local host: ip-172-31-10-179
  PID:        24183
  Message:    connect() to 172.31.0.155:1024 failed
  Error:      Connection reset by peer (104)
--------------------------------------------------------------------------
[ip-172-31-10-179:24183] pml_ob1_sendreq.c:189 FATAL
[ip-172-31-0-172:24275] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2257
--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now hang or fail.

  Local host: ip-172-31-0-172
  PID:        24326
  Message:    connect() to 172.31.0.155:1024 failed
  Error:      Connection refused (111)
--------------------------------------------------------------------------
[ip-172-31-0-172:24326] pml_ob1_sendreq.c:189 FATAL

collective/scatter_in_place: hang

/fsx/ompi-han/install/bin/mpirun --timeout 120 --prefix /fsx/ompi-han/install --mca pml ob1 --mca coll_han_priority 100 --hostfile /fsx/hosts.file -n 144 collective/scatter_in_place
Warning: Permanently added 'ip-172-31-0-155,172.31.0.155' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ip-172-31-5-167,172.31.5.167' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ip-172-31-10-179,172.31.10.179' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ip-172-31-0-172,172.31.0.172' (ECDSA) to the list of known hosts.
--------------------------------------------------------------------------
The user-provided time limit for job execution has been reached:

  Timeout: 120 seconds

The job will now be aborted.  Please check your code and/or
adjust/remove the job execution time limit (as specified by --timeout
command line option oror MPIEXEC_TIMEOUT environment variable).
--------------------------------------------------------------------------

collective/intercomm/reduce_scatter_block_nocommute_stride_inter

[ec2-user@ip-172-31-11-68 ibm]$ /fsx/ompi-han/install/bin/mpirun --prefix /fsx/ompi-han/install --mca pml ob1 --mca coll_han_priority 100 --hostfile /fsx/hosts.file -n 144 collective/intercomm/reduce_scatter_block_nocommute_stride_inter
Warning: Permanently added 'ip-172-31-5-167,172.31.5.167' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ip-172-31-0-172,172.31.0.172' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ip-172-31-10-179,172.31.10.179' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ip-172-31-0-155,172.31.0.155' (ECDSA) to the list of known hosts.
[ip-172-31-0-172:25203] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2257
[ip-172-31-10-179:25072] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2257
[ip-172-31-5-167:09836] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2257
[ip-172-31-0-155:24959] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2257
[ip-172-31-10-179:25072] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2257
[ip-172-31-10-179:25072] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2257
[ip-172-31-10-179:25072] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2257
[ip-172-31-0-155:24959] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2257
[ip-172-31-0-155:24959] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2257
[ip-172-31-0-155:24959] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2257
[ip-172-31-0-155:24959] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2257
[ip-172-31-0-172:25203] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2257
[ip-172-31-0-172:25203] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2257
[ip-172-31-0-172:25203] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2257
[ip-172-31-5-167:09836] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2257
[ip-172-31-5-167:09836] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2257
[ip-172-31-5-167:09836] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2257
[ip-172-31-5-167:09836] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2257
[ip-172-31-10-179:25072] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2257
[ip-172-31-0-155:00000] *** An error occurred in Socket closed
[ip-172-31-0-155:00000] *** reported by process [4286251009,1]
[ip-172-31-0-155:00000] *** on a NULL communicator
[ip-172-31-0-155:00000] *** Unknown error
[ip-172-31-0-155:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ip-172-31-0-155:00000] ***    and MPI will try to terminate your MPI job as well)
[ip-172-31-0-155:00000] *** An error occurred in Socket closed
[ip-172-31-0-155:00000] *** reported by process [4286251009,0]
[ip-172-31-0-155:00000] *** on a NULL communicator
[ip-172-31-0-155:00000] *** Unknown error
[ip-172-31-0-155:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ip-172-31-0-155:00000] ***    and MPI will try to terminate your MPI job as well)
[ip-172-31-0-155:00000] *** An error occurred in Socket closed
[ip-172-31-0-155:00000] *** reported by process [4286251009,33]
[ip-172-31-0-155:00000] *** on a NULL communicator
[ip-172-31-0-155:00000] *** Unknown error
[ip-172-31-0-155:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ip-172-31-0-155:00000] ***    and MPI will try to terminate your MPI job as well)
[ip-172-31-0-172:00000] *** An error occurred in Socket closed
[ip-172-31-0-172:00000] *** reported by process [4286251009,48]
[ip-172-31-0-172:00000] *** on a NULL communicator
[ip-172-31-0-172:00000] *** Unknown error
[ip-172-31-0-172:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ip-172-31-0-172:00000] ***    and MPI will try to terminate your MPI job as well)
[ip-172-31-0-155:00000] *** An error occurred in Socket closed
[ip-172-31-0-155:00000] *** reported by process [4286251009,25]
[ip-172-31-0-155:00000] *** on a NULL communicator
[ip-172-31-0-155:00000] *** Unknown error
[ip-172-31-0-155:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ip-172-31-0-155:00000] ***    and MPI will try to terminate your MPI job as well)
[ip-172-31-0-172:00000] *** An error occurred in Socket closed
[ip-172-31-0-172:00000] *** reported by process [4286251009,64]
[ip-172-31-0-172:00000] *** on a NULL communicator
[ip-172-31-0-172:00000] *** Unknown error
[ip-172-31-0-172:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ip-172-31-0-172:00000] ***    and MPI will try to terminate your MPI job as well)
[ip-172-31-10-179:00000] *** An error occurred in Socket closed
[ip-172-31-10-179:00000] *** reported by process [4286251009,121]
[ip-172-31-10-179:00000] *** on a NULL communicator
[ip-172-31-10-179:00000] *** Unknown error
[ip-172-31-10-179:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ip-172-31-10-179:00000] ***    and MPI will try to terminate your MPI job as well)
[ip-172-31-0-155:00000] *** An error occurred in Socket closed
[ip-172-31-0-155:00000] *** reported by process [4286251009,20]
[ip-172-31-0-155:00000] *** on a NULL communicator
[ip-172-31-0-155:00000] *** Unknown error
[ip-172-31-0-155:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ip-172-31-0-155:00000] ***    and MPI will try to terminate your MPI job as well)
[ip-172-31-0-172:00000] *** An error occurred in Socket closed
[ip-172-31-0-172:00000] *** reported by process [4286251009,40]
[ip-172-31-0-172:00000] *** on a NULL communicator
[ip-172-31-0-172:00000] *** Unknown error
[ip-172-31-0-172:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ip-172-31-0-172:00000] ***    and MPI will try to terminate your MPI job as well)
[ip-172-31-10-179:00000] *** An error occurred in Socket closed
[ip-172-31-10-179:00000] *** reported by process [4286251009,110]
[ip-172-31-10-179:00000] *** on a NULL communicator
[ip-172-31-10-179:00000] *** Unknown error
[ip-172-31-10-179:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ip-172-31-10-179:00000] ***    and MPI will try to terminate your MPI job as well)
[ip-172-31-0-172:00000] *** An error occurred in Socket closed
[ip-172-31-0-172:00000] *** reported by process [4286251009,37]
[ip-172-31-0-172:00000] *** on a NULL communicator
[ip-172-31-0-172:00000] *** Unknown error
[ip-172-31-0-172:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ip-172-31-0-172:00000] ***    and MPI will try to terminate your MPI job as well)
[ip-172-31-10-179:00000] *** An error occurred in Socket closed
[ip-172-31-10-179:00000] *** reported by process [4286251009,108]
[ip-172-31-10-179:00000] *** on a NULL communicator
[ip-172-31-10-179:00000] *** Unknown error
[ip-172-31-10-179:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ip-172-31-10-179:00000] ***    and MPI will try to terminate your MPI job as well)
[ip-172-31-10-179:00000] *** An error occurred in Socket closed
[ip-172-31-10-179:00000] *** reported by process [4286251009,113]
[ip-172-31-10-179:00000] *** on a NULL communicator
[ip-172-31-10-179:00000] *** Unknown error
[ip-172-31-10-179:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ip-172-31-10-179:00000] ***    and MPI will try to terminate your MPI job as well)
[ip-172-31-5-167:00000] *** An error occurred in Socket closed
[ip-172-31-5-167:00000] *** reported by process [4286251009,80]
[ip-172-31-5-167:00000] *** on a NULL communicator
[ip-172-31-5-167:00000] *** Unknown error
[ip-172-31-5-167:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ip-172-31-5-167:00000] ***    and MPI will try to terminate your MPI job as well)
[ip-172-31-10-179:00000] *** An error occurred in Socket closed
[ip-172-31-10-179:00000] *** reported by process [4286251009,116]
[ip-172-31-10-179:00000] *** on a NULL communicator
[ip-172-31-10-179:00000] *** Unknown error
[ip-172-31-10-179:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ip-172-31-10-179:00000] ***    and MPI will try to terminate your MPI job as well)
[ip-172-31-5-167:00000] *** An error occurred in Socket closed
[ip-172-31-5-167:00000] *** reported by process [4286251009,89]
[ip-172-31-5-167:00000] *** on a NULL communicator
[ip-172-31-5-167:00000] *** Unknown error
[ip-172-31-5-167:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ip-172-31-5-167:00000] ***    and MPI will try to terminate your MPI job as well)
[ip-172-31-5-167:00000] *** An error occurred in Socket closed
[ip-172-31-5-167:00000] *** reported by process [4286251009,97]
[ip-172-31-5-167:00000] *** on a NULL communicator
[ip-172-31-5-167:00000] *** Unknown error
[ip-172-31-5-167:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ip-172-31-5-167:00000] ***    and MPI will try to terminate your MPI job as well)
[ip-172-31-5-167:00000] *** An error occurred in Socket closed
[ip-172-31-5-167:00000] *** reported by process [4286251009,73]
[ip-172-31-5-167:00000] *** on a NULL communicator
[ip-172-31-5-167:00000] *** Unknown error
[ip-172-31-5-167:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ip-172-31-5-167:00000] ***    and MPI will try to terminate your MPI job as well)
[ip-172-31-5-167:00000] *** An error occurred in Socket closed
[ip-172-31-5-167:00000] *** reported by process [4286251009,84]
[ip-172-31-5-167:00000] *** on a NULL communicator
[ip-172-31-5-167:00000] *** Unknown error
[ip-172-31-5-167:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ip-172-31-5-167:00000] ***    and MPI will try to terminate your MPI job as well)
--------------------------------------------------------------------------
mpirun noticed that process rank 17 with PID 0 on node ip-172-31-0-155 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

collective/int_overflow:

/fsx/ompi-han/install/bin/mpirun --prefix /fsx/ompi-han/install --mca pml ob1 --mca coll_han_priority 100 --hostfile /fsx/hosts.file -n 144 collective/int_overflow
Warning: Permanently added 'ip-172-31-0-172,172.31.0.172' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ip-172-31-5-167,172.31.5.167' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ip-172-31-0-155,172.31.0.155' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ip-172-31-10-179,172.31.10.179' (ECDSA) to the list of known hosts.
seed value: -1201409492
sys_query:
- R0 (ip-172-31-0-155) : 36 ranks, 198349 Mb
- R36 (ip-172-31-0-172) : 36 ranks, 198349 Mb
- R72 (ip-172-31-5-167) : 36 ranks, 198349 Mb
- R108 (ip-172-31-10-179) : 36 ranks, 198349 Mb

Running up to 34710 Mb/rank

**** comm nranks=4 :  0 1 2 3

- pt2pt count=1000000 dtsize=8 :  sbuf=8.0Mb rbuf=8.0Mb
- pt2pt count=500000000 dtsize=8 :  sbuf=4000.0Mb rbuf=4000.0Mb
- pt2pt count=1000000000 dtsize=8 :  sbuf=8000.0Mb rbuf=8000.0Mb
- pt2pt count=2000000000 dtsize=8 :  sbuf=16000.0Mb rbuf=16000.0Mb
- pt2pt count=453069600 dtsize=8 :  sbuf=3624.6Mb rbuf=3624.6Mb
- allgather count=1000000 dtsize=8 :  sbuf=8.0Mb/rank rbuf=32.0Mb
- allgather count=500000000 dtsize=8 :  sbuf=4000.0Mb/rank rbuf=16000.0Mb
- allgather count=1000000000 dtsize=8 :  sbuf=8000.0Mb/rank rbuf=32000.0Mb [SKIP]
- allgather count=2000000000 dtsize=8 :  sbuf=16000.0Mb/rank rbuf=64000.0Mb [SKIP]
- allgather count=1165739880 dtsize=8 :  sbuf=9325.9Mb/rank rbuf=37303.7Mb [SKIP]
*** Error in `collective/int_overflow': realloc(): invalid next size: 0x00000000010bc290 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x7f834)[0x7f7d754fe834]
/lib64/libc.so.6(+0x84c21)[0x7f7d75503c21]
/lib64/libc.so.6(realloc+0x1d2)[0x7f7d755051d2]
/fsx/ompi-han/install/lib/openmpi/mca_coll_libnbc.so(+0x88c1)[0x7f7d6c8f28c1]
/fsx/ompi-han/install/lib/openmpi/mca_coll_libnbc.so(+0x8950)[0x7f7d6c8f2950]
/fsx/ompi-han/install/lib/openmpi/mca_coll_libnbc.so(+0x8abe)[0x7f7d6c8f2abe]
/fsx/ompi-han/install/lib/openmpi/mca_coll_libnbc.so(NBC_Sched_send+0x4f)[0x7f7d6c8f2b43]
/fsx/ompi-han/install/lib/openmpi/mca_coll_libnbc.so(+0xe0fd)[0x7f7d6c8f80fd]
/fsx/ompi-han/install/lib/openmpi/mca_coll_libnbc.so(+0xd604)[0x7f7d6c8f7604]
/fsx/ompi-han/install/lib/openmpi/mca_coll_libnbc.so(ompi_coll_libnbc_iallgather+0x4e)[0x7f7d6c8f7999]
/fsx/ompi-han/install/lib/openmpi/mca_coll_han.so(mca_coll_han_topo_init+0x2e5)[0x7f7d6c4c13fb]
/fsx/ompi-han/install/lib/openmpi/mca_coll_han.so(mca_coll_han_allgather_intra+0x10da)[0x7f7d6c4b73dd]
/fsx/ompi-han/install/lib/openmpi/mca_coll_han.so(mca_coll_han_allgather_intra_dynamic+0x35b)[0x7f7d6c4be224]
/fsx/ompi-han/install/lib/libmpi.so.0(ompi_comm_split_with_info+0x16e)[0x7f7d75aa02ee]
/fsx/ompi-han/install/lib/libmpi.so.0(ompi_comm_split+0x41)[0x7f7d75aa0b1e]
/fsx/ompi-han/install/lib/libmpi.so.0(MPI_Comm_split+0x160)[0x7f7d75b0cea3]
collective/int_overflow[0x402027]
collective/int_overflow[0x40257c]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f7d754a1575]
collective/int_overflow[0x401479]

@shijin-aws (Contributor) commented Oct 21, 2020

With cea7be6, the errors in scatter_in_place and gather_in_place3 are resolved, but int_overflow and reduce_scatter_block_nocommute_stride_inter still hit the same error reported earlier:

collective/int_overflow

[ec2-user@ip-172-31-11-68 ibm]$ /fsx/ompi-han/install/bin/mpirun --prefix /fsx/ompi-han/install --mca pml ob1 --mca coll_han_priority 100 --hostfile /fsx/hosts.file -n 144 collective/int_overflow
Warning: Permanently added 'ip-172-31-12-126,172.31.12.126' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ip-172-31-3-49,172.31.3.49' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ip-172-31-13-65,172.31.13.65' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ip-172-31-9-139,172.31.9.139' (ECDSA) to the list of known hosts.
seed value: -1201409492
sys_query:
- R0 (ip-172-31-3-49) : 36 ranks, 198349 Mb
- R36 (ip-172-31-9-139) : 36 ranks, 198349 Mb
- R72 (ip-172-31-12-126) : 36 ranks, 198349 Mb
- R108 (ip-172-31-13-65) : 36 ranks, 198349 Mb

Running up to 34710 Mb/rank

**** comm nranks=4 :  0 1 2 3

- pt2pt count=1000000 dtsize=8 :  sbuf=8.0Mb rbuf=8.0Mb
- pt2pt count=500000000 dtsize=8 :  sbuf=4000.0Mb rbuf=4000.0Mb
- pt2pt count=1000000000 dtsize=8 :  sbuf=8000.0Mb rbuf=8000.0Mb
- pt2pt count=2000000000 dtsize=8 :  sbuf=16000.0Mb rbuf=16000.0Mb
- pt2pt count=453069600 dtsize=8 :  sbuf=3624.6Mb rbuf=3624.6Mb
- allgather count=1000000 dtsize=8 :  sbuf=8.0Mb/rank rbuf=32.0Mb
- allgather count=500000000 dtsize=8 :  sbuf=4000.0Mb/rank rbuf=16000.0Mb
- allgather count=1000000000 dtsize=8 :  sbuf=8000.0Mb/rank rbuf=32000.0Mb [SKIP]
- allgather count=2000000000 dtsize=8 :  sbuf=16000.0Mb/rank rbuf=64000.0Mb [SKIP]
- allgather count=1165739880 dtsize=8 :  sbuf=9325.9Mb/rank rbuf=37303.7Mb [SKIP]
*** Error in `collective/int_overflow': realloc(): invalid next size: 0x0000000000da5290 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x7f834)[0x7fdfb6ee4834]
/lib64/libc.so.6(+0x84c21)[0x7fdfb6ee9c21]
/lib64/libc.so.6(realloc+0x1d2)[0x7fdfb6eeb1d2]
/fsx/ompi-han/install/lib/openmpi/mca_coll_libnbc.so(+0x88c1)[0x7fdfaa2618c1]
/fsx/ompi-han/install/lib/openmpi/mca_coll_libnbc.so(+0x8950)[0x7fdfaa261950]
/fsx/ompi-han/install/lib/openmpi/mca_coll_libnbc.so(+0x8abe)[0x7fdfaa261abe]
/fsx/ompi-han/install/lib/openmpi/mca_coll_libnbc.so(NBC_Sched_send+0x4f)[0x7fdfaa261b43]
/fsx/ompi-han/install/lib/openmpi/mca_coll_libnbc.so(+0xe0fd)[0x7fdfaa2670fd]
/fsx/ompi-han/install/lib/openmpi/mca_coll_libnbc.so(+0xd604)[0x7fdfaa266604]
/fsx/ompi-han/install/lib/openmpi/mca_coll_libnbc.so(ompi_coll_libnbc_iallgather+0x4e)[0x7fdfaa266999]
/fsx/ompi-han/install/lib/openmpi/mca_coll_han.so(mca_coll_han_topo_init+0x2e5)[0x7fdfa9e30474]
/fsx/ompi-han/install/lib/openmpi/mca_coll_han.so(mca_coll_han_allgather_intra+0x10da)[0x7fdfa9e26428]
/fsx/ompi-han/install/lib/openmpi/mca_coll_han.so(mca_coll_han_allgather_intra_dynamic+0x35b)[0x7fdfa9e2d26f]
/fsx/ompi-han/install/lib/libmpi.so.0(ompi_comm_split_with_info+0x16e)[0x7fdfb74862ee]
/fsx/ompi-han/install/lib/libmpi.so.0(ompi_comm_split+0x41)[0x7fdfb7486b1e]
/fsx/ompi-han/install/lib/libmpi.so.0(MPI_Comm_split+0x160)[0x7fdfb74f2ea3]
collective/int_overflow[0x402027]
collective/int_overflow[0x40257c]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fdfb6e87575]
collective/int_overflow[0x401479]

collective/intercomm/reduce_scatter_block_nocommute_stride_inter

[ec2-user@ip-172-31-11-68 ibm]$ /fsx/ompi-han/install/bin/mpirun --prefix /fsx/ompi-han/install --mca pml ob1 --mca coll_han_priority 100 --hostfile /fsx/hosts.file -n 144 collective/intercomm/reduce_scatter_block_nocommute_stride_inter
Warning: Permanently added 'ip-172-31-3-49,172.31.3.49' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ip-172-31-12-126,172.31.12.126' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ip-172-31-13-65,172.31.13.65' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ip-172-31-9-139,172.31.9.139' (ECDSA) to the list of known hosts.
[ip-172-31-3-49:00000] *** An error occurred in Socket closed
[ip-172-31-3-49:00000] *** reported by process [3006398465,1]
[ip-172-31-3-49:00000] *** on a NULL communicator
[ip-172-31-3-49:00000] *** Unknown error
[ip-172-31-3-49:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ip-172-31-3-49:00000] ***    and MPI will try to terminate your MPI job as well)
[ip-172-31-3-49:00000] *** An error occurred in Socket closed
[ip-172-31-3-49:00000] *** reported by process [3006398465,0]
[ip-172-31-3-49:00000] *** on a NULL communicator
[ip-172-31-3-49:00000] *** Unknown error
[ip-172-31-3-49:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ip-172-31-3-49:00000] ***    and MPI will try to terminate your MPI job as well)
[ip-172-31-9-139:22990] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2257

@zhngaj (Contributor) commented Oct 22, 2020

I ran osu_iallgather over tcp with HAN (commit 1462363) and saw a hang.

mpirun --prefix /fsx/ompi/han-install --wdir results/omb/collective/osu_iallgather -n 288 -N 36 --tag-output  --mca pml ob1 --mca btl tcp,self --mca coll_han_priority 100 --hostfile /fsx/hosts -x PATH -x LD_LIBRARY_PATH /fsx/SubspaceBenchmarks/spack/opt/spack/linux-amzn2018-x86_64/gcc-4.8.5/osu-micro-benchmarks-5.6-zwhv66m6o6wvgohrcaqbbgjie57hh5xo/libexec/osu-micro-benchmarks/mpi/collective/osu_iallgather

# OSU MPI Non-blocking Allgather Latency Test v5.6
# Overall = Coll. Init + Compute + MPI_Test + MPI_Wait

# Size           Overall(us)       Compute(us)    Pure Comm.(us)        Overlap(%)
--------------------------------------------------------------------------
The user-provided time limit for job execution has been reached:

  Timeout: 7200 seconds

The job will now be aborted.  Please check your code and/or
adjust/remove the job execution time limit (as specified by --timeout
command line option oror MPIEXEC_TIMEOUT environment variable).

I'm running without HAN to double check.

@devreal (Contributor) commented Oct 22, 2020

@zhngaj The hang in iallgather seems independent of HAN, I can reproduce it with current master.

@zhngaj (Contributor) commented Oct 22, 2020

> @zhngaj The hang in iallgather seems independent of HAN, I can reproduce it with current master.

Hmm, I ran it without HAN by removing --mca coll_han_priority 100 and it passed, though. I haven't tried with the latest master.

Running collective/osu_iallgather on 8 nodes with 288 procs
==== starting mpirun --prefix /fsx/ompi/han-install --wdir results/omb/collective/osu_iallgather -n 288 -N 36 --tag-output  --mca pml ob1 --mca btl tcp,self --hostfile /fsx/hosts -x PATH -x LD_LIBRARY_PATH /fsx/SubspaceBenchmarks/spack/opt/spack/linux-amzn2018-x86_64/gcc-4.8.5/osu-micro-benchmarks-5.6-zwhv66m6o6wvgohrcaqbbgjie57hh5xo/libexec/osu-micro-benchmarks/mpi/collective/osu_iallgather  : Thu Oct 22 17:41:17 UTC 2020 ====
Warning: Permanently added 'ip-172-31-4-100,172.31.4.100' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ip-172-31-10-6,172.31.10.6' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ip-172-31-0-132,172.31.0.132' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ip-172-31-6-24,172.31.6.24' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ip-172-31-12-29,172.31.12.29' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ip-172-31-14-184,172.31.14.184' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ip-172-31-15-211,172.31.15.211' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ip-172-31-7-110,172.31.7.110' (ECDSA) to the list of known hosts.

# OSU MPI Non-blocking Allgather Latency Test v5.6
# Overall = Coll. Init + Compute + MPI_Test + MPI_Wait

# Size           Overall(us)       Compute(us)    Pure Comm.(us)        Overlap(%)
1                    7047.81           4025.43           3861.91             21.74
2                    7107.00           4080.86           3914.84             22.70
4                    7100.11           4069.58           3904.16             22.38
8                    7090.47           4042.16           3877.97             21.39
16                   7076.84           4019.07           3855.61             20.69
32                   7146.29           4060.85           3895.75             20.80
64                   7237.81           4102.35           3935.62             20.33
128                  7385.42           4176.15           4006.54             19.90
256                  7937.54           4466.55           4285.11             19.00
512                  8189.45           4604.19           4417.29             18.84
1024                66765.21          62823.00          60286.46             93.46
2048               128812.53         121218.01         116321.82             93.47
4096               250335.38         229206.74         219945.64             90.39
8192               375148.73         356624.66         342202.82             94.59
16384              378097.98         352531.39         338274.57             92.44
32768              313779.39         290950.77         279175.88             91.82
65536              579585.27         309760.17         297228.94              9.22
131072             661252.56         343024.35         329148.81              3.32
262144             824517.45         421888.85         404824.64              0.54
524288            1401573.88         699084.46         670811.12              0.00
1048576           2320442.88        1156049.10        1109301.24              0.00
return status: 0
==== finished mpirun --prefix /fsx/ompi/han-install --wdir results/omb/collective/osu_iallgather -n 288 -N 36 --tag-output  --mca pml ob1 --mca btl tcp,self --hostfile /fsx/hosts -x PATH -x LD_LIBRARY_PATH /fsx/SubspaceBenchmarks/spack/opt/spack/linux-amzn2018-x86_64/gcc-4.8.5/osu-micro-benchmarks-5.6-zwhv66m6o6wvgohrcaqbbgjie57hh5xo/libexec/osu-micro-benchmarks/mpi/collective/osu_iallgather  : Thu Oct 22 19:14:47 UTC 2020 ====
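
For reference, the Overlap(%) column in this output is consistent with overlap = 100 * (1 - (Overall - Compute) / Pure Comm.), clamped at zero; the formula below is inferred from the rows above rather than taken from the OSU sources, so treat it as a sanity-check sketch only:

#include <stdio.h>

/* Fraction of the pure communication time hidden behind compute,
 * clamped to 0 when the overall time exceeds compute + pure comm. */
static double overlap_pct(double overall_us, double compute_us, double pure_comm_us)
{
    double pct = 100.0 * (1.0 - (overall_us - compute_us) / pure_comm_us);
    return pct > 0.0 ? pct : 0.0;
}

int main(void)
{
    /* 1-byte and 64 KiB rows from the table above */
    printf("%.2f\n", overlap_pct(7047.81, 4025.43, 3861.91));       /* ~21.74 */
    printf("%.2f\n", overlap_pct(579585.27, 309760.17, 297228.94)); /* ~9.22 */
    return 0;
}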

@devreal
Contributor

devreal commented Oct 22, 2020

This is a DDT aggregated stack trace from current master:

Processes,Threads,Function
14,14,main (osu_iallgather.c:110)
14,14,  PMPI_Wait (pwait.c:74)
14,14,    ompi_request_default_wait (req_wait.c:42)
14,14,      ompi_request_wait_completion (request.h:418)
2,2,        opal_progress (opal_progress.c:231)
2,2,          mca_btl_sm_component_progress (btl_sm_component.c:762)
1,1,            mca_btl_sm_check_fboxes (btl_sm_fbox.h:188)
1,1,            mca_btl_sm_check_fboxes (btl_sm_fbox.h:196)
1,1,              mca_btl_sm_fbox_read_header (btl_sm_fbox.h:71)
2,2,        opal_progress (opal_progress.c:245)
2,2,          opal_progress_events (opal_progress.c:191)
2,2,            opal_libevent2022_event_base_loop (event.c:1630)
2,2,              poll_dispatch (poll.c:165)
2,2,                poll
10,10,        opal_progress (opal_progress.c:247)
10,10,          opal_progress_events (opal_progress.c:191)
10,10,            opal_libevent2022_event_base_loop (event.c:1630)
10,10,              poll_dispatch (poll.c:165)
10,10,                poll
82,82,main (osu_iallgather.c:117)
82,82,  PMPI_Barrier (pbarrier.c:66)
82,82,    ompi_coll_tuned_barrier_intra_dec_fixed (coll_tuned_decision_fixed.c:530)
82,82,      ompi_coll_tuned_barrier_intra_do_this (coll_tuned_barrier_decision.c:101)
82,82,        ompi_coll_base_barrier_intra_bruck (coll_base_barrier.c:271)
82,82,          ompi_coll_base_sendrecv_zero (coll_base_barrier.c:64)
82,82,            ompi_request_default_wait (req_wait.c:42)
82,82,              ompi_request_wait_completion (request.h:418)
6,6,                opal_progress (opal_progress.c:231)
6,6,                  mca_btl_sm_component_progress (btl_sm_component.c:762)
1,1,                    mca_btl_sm_check_fboxes (btl_sm_fbox.h:189)
1,1,                    mca_btl_sm_check_fboxes (btl_sm_fbox.h:192)
3,3,                    mca_btl_sm_check_fboxes (btl_sm_fbox.h:196)
2,2,                      mca_btl_sm_fbox_read_header (btl_sm_fbox.h:71)
1,1,                      mca_btl_sm_fbox_read_header (btl_sm_fbox.h:74)
1,1,                    mca_btl_sm_check_fboxes (btl_sm_fbox.h:199)
6,6,                opal_progress (opal_progress.c:245)
6,6,                  opal_progress_events (opal_progress.c:191)
6,6,                    opal_libevent2022_event_base_loop (event.c:1630)
6,6,                      poll_dispatch (poll.c:165)
6,6,                        poll
70,70,                opal_progress (opal_progress.c:247)
70,70,                  opal_progress_events (opal_progress.c:191)
1,1,                    opal_libevent2022_event_base_loop (event.c:1626)
1,1,                      gettime (event.c:372)
1,1,                        clock_gettime
69,69,                    opal_libevent2022_event_base_loop (event.c:1630)
69,69,                      poll_dispatch (poll.c:165)
69,69,                        poll
96,96,progress_engine (pmix_progress_threads.c:232)
96,96,  opal_libevent2022_event_base_loop (event.c:1630)
96,96,    epoll_dispatch (epoll.c:407)
96,96,      epoll_wait

Command:

mpirun -n 96 -N 24 --mca pml ob1 ./mpi/collective/osu_iallgather

With --mca pml ucx the benchmark runs to completion.

@bosilca
Member Author

bosilca commented Oct 22, 2020

I don't think it's a deadlock, but something is definitely going wrong with the test itself. I left it running while going for a coffee and more output kept coming, but as you can see the times somehow exploded.

[1,0]<stdout>:1                    2046.78           1460.86           1399.36             58.13
[1,0]<stdout>:2                    2025.67           1449.24           1388.27             58.48
[1,0]<stdout>:4                    2031.15           1452.78           1391.68             58.44
[1,0]<stdout>:8                    2027.17           1451.04           1389.98             58.55
[1,0]<stdout>:16                   2028.36           1446.23           1385.29             57.98
[1,0]<stdout>:32                   2032.47           1455.32           1394.09             58.60
[1,0]<stdout>:64                   2050.21           1464.63           1402.98             58.26
[1,0]<stdout>:128                  2073.28           1481.20           1418.88             58.27
[1,0]<stdout>:256                  2113.55           1520.03           1456.13             59.24
[1,0]<stdout>:512                  2152.94           1551.94           1486.71             59.58
[1,0]<stdout>:1024                 2223.50           1607.81           1540.36             60.03
[1,0]<stdout>:2048                 4615.58           3709.80           3555.43             74.52
[1,0]<stdout>:4096               103998.66          74668.03          71578.04             59.02
[1,0]<stdout>:8192               177358.74         160091.06         153475.64             88.75
[1,0]<stdout>:16384              215117.44         200831.64         192524.81             92.58
[1,0]<stdout>:32768              246830.98         233625.89         223959.42             94.10
[1,0]<stdout>:65536              428522.71         282090.24         270417.72             45.85
[1,0]<stdout>:131072             608281.84         324874.21         311430.16              9.00
[1,0]<stdout>:262144             959911.51         484099.29         463255.69              0.00
[1,0]<stdout>:524288            1879094.81         937309.86         897415.63              0.00

@devreal
Contributor

devreal commented Oct 22, 2020

Ahh yes, here is another data point, scaling the number of ranks with current master (on 4 nodes):

$ mpirun -n 16 --map-by node --bind-to core --mca pml ob1 ./mpi/collective/osu_iallgather -m 1:1

# OSU MPI Non-blocking Allgather Latency Test v5.3.2
# Overall = Coll. Init + Compute + MPI_Test + MPI_Wait

# Size           Overall(us)       Compute(us)    Pure Comm.(us)        Overlap(%)
1                     216.50            140.95            135.61             44.29

$ mpirun -n 24 --map-by node --bind-to core --mca pml ob1 ./mpi/collective/osu_iallgather -m 1:1

# OSU MPI Non-blocking Allgather Latency Test v5.3.2
# Overall = Coll. Init + Compute + MPI_Test + MPI_Wait

# Size           Overall(us)       Compute(us)    Pure Comm.(us)        Overlap(%)
1                     305.55            191.38            184.28             38.04

$ mpirun -n 28 --map-by node --bind-to core --mca pml ob1 ./mpi/collective/osu_iallgather -m 1:1

# OSU MPI Non-blocking Allgather Latency Test v5.3.2
# Overall = Coll. Init + Compute + MPI_Test + MPI_Wait

# Size           Overall(us)       Compute(us)    Pure Comm.(us)        Overlap(%)
1                     361.14            228.21            219.84             39.53

$ mpirun -n 36 --map-by node --bind-to core --mca pml ob1 ./mpi/collective/osu_iallgather -m 1:1

# OSU MPI Non-blocking Allgather Latency Test v5.3.2
# Overall = Coll. Init + Compute + MPI_Test + MPI_Wait

# Size           Overall(us)       Compute(us)    Pure Comm.(us)        Overlap(%)
1                     528.13            350.10            337.65             47.28

$ mpirun -n 38 --map-by node --bind-to core --mca pml ob1 ./mpi/collective/osu_iallgather -m 1:1

# OSU MPI Non-blocking Allgather Latency Test v5.3.2
# Overall = Coll. Init + Compute + MPI_Test + MPI_Wait

# Size           Overall(us)       Compute(us)    Pure Comm.(us)        Overlap(%)
1                     740.92            419.82            404.95             20.71

$ mpirun -n 40 --map-by node --bind-to core --mca pml ob1 ./mpi/collective/osu_iallgather -m 1:1

# OSU MPI Non-blocking Allgather Latency Test v5.3.2
# Overall = Coll. Init + Compute + MPI_Test + MPI_Wait

# Size           Overall(us)       Compute(us)    Pure Comm.(us)        Overlap(%)
1                    1523.82            406.88            392.41              0.00

$ mpirun -n 42 --map-by node --bind-to core --mca pml ob1 ./mpi/collective/osu_iallgather -m 1:1

# OSU MPI Non-blocking Allgather Latency Test v5.3.2
# Overall = Coll. Init + Compute + MPI_Test + MPI_Wait

# Size           Overall(us)       Compute(us)    Pure Comm.(us)        Overlap(%)
1                   10038.85            405.94            391.57              0.00

The measurements at higher rank counts seem rather unstable; the very next run with 42 ranks yields:

$ mpirun -n 42 --map-by node --bind-to core --mca pml ob1 ./mpi/collective/osu_iallgather -m 1:1

# OSU MPI Non-blocking Allgather Latency Test v5.3.2
# Overall = Coll. Init + Compute + MPI_Test + MPI_Wait

# Size           Overall(us)       Compute(us)    Pure Comm.(us)        Overlap(%)
1                   24004.91            686.51            662.49              0.00

As a comparison, with pml/ucx things look ok:

$ mpirun -n 42 --map-by node --bind-to core --mca pml ucx ./mpi/collective/osu_iallgather -m 1:1

# OSU MPI Non-blocking Allgather Latency Test v5.3.2
# Overall = Coll. Init + Compute + MPI_Test + MPI_Wait

# Size           Overall(us)       Compute(us)    Pure Comm.(us)        Overlap(%)
1                     319.24            293.19            282.63             90.78

Among many other things:
- Fix an imbalance bug in MPI_Allgather
- Accept more human-readable configuration files. We can now specify
  the collective by name instead of a magic number, and the component
  we want to use also by name.
- Add the capability to have optional arguments in the collective
  communication configuration file. Right now the capability exists
  for segment lengths, but it is not yet connected to the algorithms.
- Redo the initialization of all HAN collectives.

Clean up the fallback collective support.
- In case the module is unable to deliver the expected result, it will fall back
  to executing the collective operation on another collective component. This change
  makes this fallback support simpler to use.
- Implement a fallback allowing a HAN module to remove itself as a
  potential active collective module and instead fall back to the
  next module in line.
- Completely disable the HAN modules on error. From the moment an error is
  encountered they remove themselves from the communicator, and if some
  other module calls them they simply behave as a pass-through.

Communicator: provide ompi_comm_split_with_info to split and provide info at the same time
Add ompi_comm_coll_preference info key to control collective component selection

COLL HAN: use info keys instead of a component-level variable to communicate the topology level between abstraction layers
- The info value is a comma-separated list of entries, given in decreasing
  order of preference. This overrides the priority of the component,
  unless the component has disqualified itself.
  An entry prefixed with ^ starts the ignore-list. Any entry following this
  character will be ignored during the collective component selection for the
  communicator.
  Example: "sm,libnbc,^han,adapt" gives sm the highest preference, followed
  by libnbc. The components han and adapt are ignored in the selection process.
  (A usage sketch follows this list.)
- Allocate a temporary buffer for all lower-level leaders (length 2 segments)
- Fix the handling of MPI_IN_PLACE for gather and scatter.
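
As a usage sketch (not part of this patch): the preference list can be attached to a communicator through a standard MPI info object. The key name and value grammar are the ones described above; whether the key is honored when supplied at communicator-creation time through the standard MPI_Comm_dup_with_info call, rather than the internal ompi_comm_split_with_info path, is an assumption here.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Build an info object carrying the collective component preference.
     * "sm,libnbc,^han,adapt": prefer sm, then libnbc; never consider
     * han or adapt on the resulting communicator. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "ompi_comm_coll_preference", "sm,libnbc,^han,adapt");

    /* Hand the info to a newly created communicator so the collective
     * component selection can see it (assumed to work via the standard
     * MPI_Comm_dup_with_info entry point). */
    MPI_Comm newcomm;
    MPI_Comm_dup_with_info(MPI_COMM_WORLD, info, &newcomm);

    /* Collectives on newcomm are selected according to the preference list. */
    MPI_Barrier(newcomm);

    MPI_Comm_free(&newcomm);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}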

COLL HAN: Fix topology handling
 - HAN should not rely on node names to determine the ordering of ranks.
   Instead, use the node leaders as identifiers and short-cut if the
   node leaders agree that ranks are consecutive. Also, for now, error out
   if the rank distribution is imbalanced. (A minimal sketch of such a
   consecutiveness check follows.)
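
A minimal sketch of such a consecutiveness check, written against the public MPI API rather than HAN's internals (so the shape of the test is illustrative, not the actual patch): each node leader is the lowest world rank on its node, and ranks are consecutive exactly when every process's world rank equals its leader's world rank plus its node-local rank.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int wrank;
    MPI_Comm_rank(MPI_COMM_WORLD, &wrank);

    /* Group the ranks that share a node. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int lrank;
    MPI_Comm_rank(node_comm, &lrank);

    /* The node leader is local rank 0; share its world rank with the node. */
    int leader_wrank = wrank;
    MPI_Bcast(&leader_wrank, 1, MPI_INT, 0, node_comm);

    /* Consecutive iff, on every node, my world rank equals the leader's
     * world rank plus my node-local rank. */
    int local_ok = (wrank == leader_wrank + lrank);
    int all_ok;
    MPI_Allreduce(&local_ok, &all_ok, 1, MPI_INT, MPI_LAND, MPI_COMM_WORLD);

    if (0 == wrank) {
        printf("rank distribution is %sconsecutive per node\n",
               all_ok ? "" : "NOT ");
    }

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}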

Signed-off-by: Xi Luo <xluo12@vols.utk.edu>
Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
There was a bug allowing partial packing of non-data elements (such as loop
and end_loop markers) during the exit condition of a pack/unpack call, which has
no meaningful semantics. Prevent this from happening by making sure the element
points to actual data before trying to partially pack it.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
@bosilca bosilca merged commit ce97090 into open-mpi:master Oct 26, 2020