A hierarchical, architecture-aware collective communication module #7735
Conversation
Can one of the admins verify this patch? |
@bosilca Just curious - when you say "architecture-aware", are you speaking of knowing (for example) which procs are on nodes sharing a common switch, which procs are lowest-ranked on a given switch, etc? In other words, are you following the hardware wiring topology? If so, I'm curious as to where you get that topology info. I'm working now on a PMIx plugin to provide it for certain types of fabrics, if that would help here. |
@rhc54 Right now we are working on a 2-level hierarchy, with the information extracted from OMPI via the |
@jsquyres did we expect the ompiteam-bot to ask for "an admin to verify this patch"? I mean that even if George did not author the commits, he pushed them, so they could/should not have been flagged as untrusted. |
@bosilca Kewl - I'll coordinate with you and your team as I move forward. The info I'll be providing is basically a "wiring diagram". It will consist of the following:
etc. Basically, following the wires to minimize any hops. We are working on methods for handling multi-NIC systems; a little more complicated 😄 |
@bwbarrett can you please check whether it should be added, or whether this platform/compiler should be removed from the "matrix"? |
Wasn't this feature supposed to be tagged for the v5.0.0 milestone? |
Thanks, added it for tracking. |
Force-pushed from 0bcf2e5 to 1290951.
bot:aws:retest |
bot:aws:retest Same error on a different platform (see #7847) |
I looked through the code and found a few things regarding memory management; maybe I didn't fully understand the control flow, though. It also seems like there may be an integer overflow in the selection logic (see my comments about the msg_size variables and members). It seems that some of the functions can be made static and removed from the header files, while some can probably be made static inline in the headers. Plus some typos in comments I found.
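For illustration only (not code from this PR; all names are made up): the kind of change being suggested here, keeping a tiny helper as static inline in a header and doing the message-size arithmetic in size_t so it cannot overflow an int.

```c
/* Illustration only; names are hypothetical, not taken from the PR. */
#include <stddef.h>

/* In the header: small helper callers may inline, no external symbol needed. */
static inline size_t coll_msg_size(size_t type_size, size_t count)
{
    /* size_t arithmetic avoids the int overflow mentioned above */
    return type_size * count;
}

/* In the .c file: helper used only in this translation unit. */
static int select_algorithm(size_t msg_size)
{
    return (msg_size < 8192) ? 0 : 1;   /* small- vs. large-message path */
}
```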
All comments have been addressed. |
I'm sorry, I've gotten distracted by internal Cisco shiny objects, and I haven't finished my review yet. But I at least wanted to submit what I have done so far...
Force-pushed from 70471ae to 23e98bf.
I think all but one of my points have been addressed (https://github.com/open-mpi/ompi/pull/7735/files#diff-a4e1605aa2038222adeba1c42f218130R231) but that one is minor. LGTM, thanks @bosilca!
There are no changes; I simply cleaned the history in preparation for the merge. Regarding your report, I tried all possible process placements, with and without oversubscription (as I don't have 36 nodes available), but as soon as I curate my list of allowed BTLs, I can't replicate it. As you suggested, the error seems to indicate that a communicator with 2 processes failed to be created, but just from the stack I can't figure out what's going on. |
I hit the same error with IMB-MPI1 on your latest force-pushed PR; it's not a segfault though.
With a7cc337, I see a small number of regressions running the IBM test suite (collective subdir) with HAN vs. without HAN:
intercomm/allreduce_nocommute_gap_inter: MPI_Abort with error code 100
intercomm/reduce_nocommute_gap_inter: MPI_Abort with error code 100
bcast_struct: segv
gather_in_place: wrong answer
gather_in_place2: segv
scatter_in_place: wrong answer
@bosilca I can hit similar errors when running
The errors reported by @jsquyres show up in my IBM test suite as well. |
@bosilca @jsquyres I believe there is a bug here. I have a hunch at what causes the gather tests to fail; working on a fix. |
@bosilca the latest commits fix the gather_in_place failure reported by @jsquyres. I am still getting errors in ompi-tests/ibm/collective/scatter_in_place and ompi-tests/ibm/collective/int_overflow:
With 1634ba0, I get the same results on master with and without han:
Yes, I hit those failures on master with and without han as well, so I didn't include them in the errors I reported earlier. |
With 1634ba0,
Running on 2 nodes (72 procs) does not show this issue.
@wckzhang Does the same issue you fixed in https://github.com/open-mpi/ompi-tests/pull/137 also apply to these two test failures, perchance? |
I highly doubt it; I checked those two tests and they both use malloc, and have done so for many years. |
We found out what the problem was: the final reshuffle of data for the gather/scatter operation is incorrect because the translation between the different hierarchies was computed incorrectly. Fix to come tomorrow. |
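For readers following along, a sketch of my own (not the PR's code) of the kind of global-rank / (node, local-rank) translation the reshuffle depends on, assuming a balanced by-node layout with ppn ranks per node; if this mapping is wrong, the gathered/scattered data comes out reordered.

```c
/* Hypothetical example: round-trip between the global rank and the
 * (node, local-rank) coordinates used by the two hierarchy levels.
 * Assumes a balanced by-node layout with ppn ranks per node. */
#include <assert.h>

static int node_of(int rank, int ppn)  { return rank / ppn; }
static int local_of(int rank, int ppn) { return rank % ppn; }
static int global_of(int node, int local, int ppn) { return node * ppn + local; }

int main(void)
{
    const int ppn = 4;
    for (int rank = 0; rank < 2 * ppn; ++rank) {
        /* The translation must be the identity, otherwise the final
         * reshuffle of gather/scatter data ends up reordered. */
        assert(global_of(node_of(rank, ppn), local_of(rank, ppn), ppn) == rank);
    }
    return 0;
}
```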
@bosilca With the latest commits this morning, I get this compiler warning:
I saw it, but I was hoping to finish the entire datatype patch before pushing. I pushed a partial fix (only for this issue); I'm working on fixing the support for MPI_IN_PLACE. |
FWIW, I ran again with 356e089 and still see no differences on master with and without han. |
With b43145f, I still hit errors in the following tests (which do not fail with master):
collective/gather_in_place3
collective/scatter_in_place: hang
collective/intercomm/reduce_scatter_block_nocommute_stride_inter
collective/int_overflow
With cea7be6, the errors in collective/int_overflow and collective/intercomm/reduce_scatter_block_nocommute_stride_inter
Ran osu_iallgather with tcp and HAN (commit 1462363) and saw a hang.
I'm running without HAN to double-check. |
@zhngaj The hang in iallgather seems independent of HAN; I can reproduce it with current master. |
Hmm, I ran it without HAN by removing
This is a DDT aggregated stack trace from current master:
Command:
I don't think it's a deadlock, but something is definitely going wrong with the test itself. I left it running while going for a coffee and more output kept coming, but as you can see the time somehow exploded.
Ahh yes, here is another data point, scaling the number of ranks with current master (on 4 nodes):
The measurements at higher rank counts seem rather unstable; the very next run with 42 ranks yields:
As a comparison, with pml/ucx things look OK:
Among many other things:
- Fix an imbalance bug in MPI_Allgather.
- Accept more human-readable configuration files. We can now specify the collective by name instead of a magic number, and also the component we want to use by name.
- Add the capability to have optional arguments in the collective communication configuration file. Right now the capability exists for segment lengths, but it is yet to be connected with the algorithms.
- Redo the initialization of all HAN collectives. Clean up the fallback collective support.
- In case the module is unable to deliver the expected result, it will fall back to executing the collective operation with another collective component. This change makes the support for this fallback simpler to use.
- Implement a fallback allowing a HAN module to remove itself as a potential active collective module and instead fall back to the next module in line.
- Completely disable the HAN modules on error. From the moment an error is encountered they remove themselves from the communicator, and in case some other module calls them they simply behave as a pass-through.

Communicator: provide ompi_comm_split_with_info to split and provide info at the same time.

Add ompi_comm_coll_preference info key to control collective component selection.

COLL HAN: use info keys instead of a component-level variable to communicate the topology level between abstraction layers.
- The info value is a comma-separated list of entries, chosen with decreasing priorities. This overrides the priority of the component, unless the component has disqualified itself. An entry prefixed with ^ starts the ignore-list; any entry following this character will be ignored during the collective component selection for the communicator. Example: "sm,libnbc,^han,adapt" gives sm the highest preference, followed by libnbc. The components han and adapt are ignored in the selection process.
- Allocate a temporary buffer for all lower-level leaders (length 2 segments).
- Fix the handling of MPI_IN_PLACE for gather and scatter.

COLL HAN: Fix topology handling.
- HAN should not rely on node names to determine the ordering of ranks. Instead, use the node leaders as identifiers and short-cut if the node leaders agree that ranks are consecutive. Also, error out if the rank distribution is imbalanced for now.

Signed-off-by: Xi Luo <xluo12@vols.utk.edu>
Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
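For illustration, a minimal sketch of how an application might use the ompi_comm_coll_preference info key described above. That the key is honored when the communicator is created via MPI_Comm_dup_with_info is my assumption; the commit message only documents the key name and its value format.

```c
/* Sketch only: requesting a per-communicator collective preference via the
 * ompi_comm_coll_preference info key.  Assumes the key is read when the new
 * communicator is created with MPI_Comm_dup_with_info. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    /* sm gets the highest preference, then libnbc; han and adapt are ignored. */
    MPI_Info_set(info, "ompi_comm_coll_preference", "sm,libnbc,^han,adapt");

    MPI_Comm comm;
    MPI_Comm_dup_with_info(MPI_COMM_WORLD, info, &comm);
    MPI_Barrier(comm);   /* collectives on comm follow the preference list */

    MPI_Comm_free(&comm);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```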
There was a bug allowing partial packing of non-data elements (such as loop and end_loop markers) during the exit condition of a pack/unpack call, which has basically no meaning. Prevent this from happening by making sure the element points to actual data before trying to partially pack it.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
HAN is a flexible, hierarchical, autotuned collective module. In HAN, the processes of a collective operation are grouped into sub-communicators based on their topological levels.
HAN breaks hierarchical collective operations into multiple collective operations on the sub-communicators.
HAN uses the existing collective modules in OMPI as sub-modules to perform the collective operations (blocking and nonblocking) on the sub-communicators, and orchestrates them to perform hierarchical, architecture-aware collective operations.
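As an illustration of that decomposition, here is a minimal sketch using plain MPI calls (this is not HAN's source): a broadcast split into an inter-node step among node leaders followed by an intra-node step. It assumes the broadcast root is global rank 0.

```c
/* Not HAN's implementation: a minimal sketch of the hierarchical idea with
 * plain MPI calls.  Assumes the broadcast root is global rank 0. */
#include <mpi.h>

void hierarchical_bcast(void *buf, int count, MPI_Datatype dtype, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    /* Low level: ranks sharing a node. */
    MPI_Comm low_comm;
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL, &low_comm);

    int low_rank;
    MPI_Comm_rank(low_comm, &low_rank);

    /* Up level: one leader per node (the lowest rank on each node). */
    MPI_Comm up_comm;
    MPI_Comm_split(comm, low_rank == 0 ? 0 : MPI_UNDEFINED, rank, &up_comm);

    /* Step 1: inter-node broadcast among the node leaders. */
    if (MPI_COMM_NULL != up_comm) {
        MPI_Bcast(buf, count, dtype, 0, up_comm);
        MPI_Comm_free(&up_comm);
    }
    /* Step 2: intra-node broadcast from each leader to the ranks on its node. */
    MPI_Bcast(buf, count, dtype, 0, low_comm);
    MPI_Comm_free(&low_comm);
}
```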
HAN provides several features: