Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ch4: add MPIDI_NM_am_tag_send/recv #7183

Open
wants to merge 12 commits into
base: main
Choose a base branch
from
Open

Conversation

hzhou
Copy link
Contributor

@hzhou hzhou commented Oct 21, 2024

Pull Request Description

In mpidig RNDV, if the netmod supports tag -- query via
MPIDI_NM_am_can_do_tag() -- add MPIDI_NM_am_tag_{send,recv} to support
direct send/recv. This not only avoids the extra copy in UCX or extra
hdr message and additional RDMA read in OFI am_send_long, it also
supports native GPU RDMA direct when the netmod supports it.

This is how it works -
* sender sends RTS
* receiver posts MPIDI_NM_am_tag_recv and reply CTS
* sender do MPIDI_NM_am_tag_send

Strength

image

  • Simple - avoid additional transport-layer protocols
  • Flexible - ch4-layer unified am protocols
  • Efficient - native matching + native RDMA

Comparisons

  • Existing OFI

  • Existing UCX

Follow-up

  • OFI native path transition to AM protocols

[skip warnings]

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your companies PR approval manager.

@hzhou hzhou force-pushed the 2410_am_tag_send branch 9 times, most recently from 077265c to 07f43ac Compare October 26, 2024 16:58
In mpidig RNDV, if the netmod supports tag -- query via
MPIDI_NM_am_can_do_tag() -- add MPIDI_NM_am_tag_{send,recv} to support
direct send/recv. This not only avoids the extra copy in UCX or extra
hdr message and additional RDMA read in OFI am_send_long, it also
supports native GPU RDMA direct when the netmod supports it.

This is how it works -
    * sender sends RTS
    * receiver posts MPIDI_NM_am_tag_recv and reply CTS
    * sender do MPIDI_NM_am_tag_send
Some adjustment to make part of {ofi,ucx}_{send,recv}.h available
in AM_ONLY build to provide MPIDI_NM_am_tag_send and MPIDI_NM_am_tag_recv.
* remove err_flag -- we no longer propagate err_flag (ref e1c2ee8)
* remove the "noreq" param -- we no longer use pre-created request
* derive sender_nic, receiver_nic, match_bits in MPIDI_OFI_send and pass
  them down to avoid code duplication.
Remove the type argument from MPIDI_OFI_init_sendtag. Instead, always
explicitly set the protocol bit to the match_bits. This makes it clear
that we are using separate tag spaces. This is symmetric to what we do
with MPIDI_OFI_init_recvtag.
When MPIDI_OFI_CONTEXT_BITS, MPIDI_OFI_SOURCE_BITS, and
MPIDI_OFI_TAG_BITS are defined as constants, all its derived macros are
compile-time constants as well. Thus, they do not need be individually
defined in capability sets in order to avoid the runtime bit-calculation
cost.

Thus we can define these macros in ofi_types.h without distinguishing
MPIDI_OFI_ENABLE_RUNTIME_CHECKS.
@hzhou
Copy link
Contributor Author

hzhou commented Oct 27, 2024

test:mpich/ch4/most ✔️
test:mpich/ch4/ofi/more - am-only has some issue

MPIDI_NM_am_tag_send is similar to MPIDI_NM_mpi_isend except -

* already inside per-vci critical section
* uses MPIDI_OFI_TAG_AM protocol bit (am tag space)
* sets am_req and am_handler_id so at event completion it will invoke
  MPIDIG_global.tag_recv_cbs.
It depends on MPIDI_OFI_ENABLE_TAGGED.
Now we can support a true RNDV mode, where sender will not wait for
receiver complete but only wait for receiver to match.
* remove the have_request parameter in MPIDI_UCX_send. We always
allocate a new request.
MPIDI_NM_am_tag_{send,recv} is similar to MPIDI_NM_mpi_i{send,recv} except -

* already inside per-vci critical section
* uses MPIDI_UCX_TAG_AM protocol bit (am tag space)
When both netmod and shm are using active messages, we need check the
is_local field in matching or they may mismatch when it is anysource
recv or probe.

The previous code happen to work because we reply based on the is_local
from the am header, thus even with MPIDI_SHM_mpi_isend mismatched an
MPIDI_NM_mpi_isend (am_only mode), the matched request switched to reply
over netmod after matching. This won't work when netmod am path deviates
from shm, such as when we introduce MPIDI_NM_am_tag_{send,recv}.

This was fixed in 21870d1, but got lost during recvq refactor.
We should call MPIDI_anysrc_try_cancel_partner upon matching, and call
MPIDI_anysrc_free_partner upon completion.

* In MPIDIG_do_irecv, we only called MPIDI_anysrc_try_cancel_partner
upon matching unexpected request but only for has_request branch. We
should do that for both branches.

* MPIDI_anysrc_free_partner need copy the status when the partner
request is the user visible request. Thus we only should free partner
when rreq completes. Unfortunately, we have multiple completion
branches.
  1. Eager unexpected path
  2. recv_target_cmpl_cb
  3. MPIDIG_tag_recv_complete
@hzhou
Copy link
Contributor Author

hzhou commented Oct 28, 2024

test:mpich/ch4/ofi/more ✔️

@hzhou hzhou requested a review from raffenet October 28, 2024 13:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant