
UCP: Send ATP on a correct lane to fix CI failures #10191

Merged
1 commit merged into openucx:master on Oct 2, 2024

Conversation

brminich (Contributor)

What

Fixes the following bug in CI:

[ RUN      ] dcx/test_ucp_am_nbx_rndv_memtype.rndv/32 <dc_x,cuda_copy,rocm_copy/cuda-managed/cuda-managed>
common/mem_buffer.cc:321: Failure
Pattern check failed at 0x7fab92100200 offset 524288 (length 1 mask: 0xff): Expected: 0xb4 Actual: 0x0
Trace/breakpoint trap (core dumped)

The issue occurs when using the rndv put pipeline protocol, and data is split into two fragments: the first being 512KB and the second only 1 byte. For the second fragment, UCP sends the 1 byte starting from lane 1, transmitting all of it on that lane. However, ATP is sent starting from lane 0. Each ATP, except the last one, confirms just 1 byte to prevent premature receive completion on the peer side.

The problem arises because the data size is only 1 byte. The first ATP (on lane 0) informs the receiver that the data is ready, but since it is sent on a different lane than the data, it arrives before the data itself. As a result, the receiver completes the request even though the actual data hasn't arrived yet.
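
To make the accounting concrete, below is a toy sketch of the per-lane ATP sizes described above. This is not UCX code; the function and variable names are made up for illustration, and only the byte accounting is modeled, not the lane on which each ATP actually travels. With a 1-byte fragment and more than one ATP lane, the very first ATP already accounts for the entire payload, so a receiver that trusts it can complete before the data lane has delivered anything.

#include <stdio.h>

/* Toy model (illustrative names, not UCX code): the number of bytes confirmed
 * by the ATP sent on lane index 'i' out of 'num_atp_lanes', for a fragment of
 * 'length' bytes. Every ATP but the last confirms 1 byte; the last one
 * confirms the remainder. */
static size_t atp_size(size_t length, int i, int num_atp_lanes)
{
    return (i < num_atp_lanes - 1) ? 1 : length - (num_atp_lanes - 1);
}

int main(void)
{
    size_t length        = 1; /* second pipeline fragment from the CI failure */
    int    num_atp_lanes = 2; /* assumed number of ATP lanes, for illustration */
    size_t confirmed     = 0;

    for (int i = 0; i < num_atp_lanes; ++i) {
        confirmed += atp_size(length, i, num_atp_lanes);
        printf("ATP #%d: confirmed %zu of %zu bytes\n", i, confirmed, length);
        /* With length == 1, the first ATP already covers the whole fragment,
         * even though the data itself travels on a different lane. */
    }
    return 0;
}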

The change under review (excerpt):

     * lane, ATP may never be sent on the data lane. */
    if (ucs_unlikely((req->send.state.dt_iter.length < rpriv->atp_num_lanes) &&
                     (lane < req->send.multi_lane_idx))) {
        return UCS_OK;

Contributor:
So the issue is that ucp_proto_rndv_bulk_request_init can set req->send.multi_lane_idx to >0, depending on the initial offset (which is >0 for a pipeline fragment).

First, IMO this fix is not correct: if we have 3 lanes and start send.multi_lane_idx from lane[1], then after sending put_zcopy send.multi_lane_idx becomes 2, and in the ATP stage we would skip lane[1] (because lane=1 < multi_lane_idx=2), which is wrong because we actually want to send ATP on lane[1].

Second, we probably need some kind of debug-build protection ensuring that we send ATP on the same lanes we sent put_zcopy on.

How about the following alternatives to fix the issue:

  1. Initialize atp_lane to req->send.multi_lane_idx in ucp_proto_rndv_put_common_request_init, but only if req->send.state.dt_iter.length < rpriv->atp_num_lanes. It is similar to the current fix but avoids one branch and extra iterations over ATP lanes. However, it still has a potential issue: if we need to send ATP on 2 (out of 4) lanes, starting from lane[3] and then lane[0], send.multi_lane_idx starts at lane[3], so we would never send ATP on lane[0] because of the way the mask is used.

  2. Instead of using the ucp_proto_multi_lane_map_progress function, save in the send request the first and last lane indexes on which we did put_zcopy (in addition to, or instead of, atp_lane), including wraparound, and then iterate over that range of lanes when sending ATP; a rough sketch of this idea is shown below.
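
A rough sketch of what option 2 could look like, purely for illustration: the struct and function names below are hypothetical, not actual UCX identifiers, and only the lane-range iteration with wraparound is shown.

#include <stdint.h>

/* Hypothetical fields saved in the send request (illustrative names only):
 * the first and last lane indexes on which put_zcopy was issued. */
typedef struct {
    uint8_t put_first_lane;
    uint8_t put_last_lane;
    uint8_t num_lanes;       /* total lanes configured for the protocol */
} example_put_lane_range_t;

/* Send ATP only on the lanes that actually carried data, handling wraparound:
 * e.g. first == 3, last == 0 with 4 lanes visits lane 3 and then lane 0. */
static void example_send_atp_on_data_lanes(const example_put_lane_range_t *r,
                                           void (*send_atp)(uint8_t lane))
{
    uint8_t lane = r->put_first_lane;

    for (;;) {
        send_atp(lane);
        if (lane == r->put_last_lane) {
            break;
        }
        lane = (uint8_t)((lane + 1) % r->num_lanes);
    }
}

Two uint8_t fields like these would plausibly fit into existing request padding, which is the point about request size raised later in the discussion.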

brminich (Contributor Author), Sep 28, 2024:

I think the fix is correct, because all the data is sent on a single lane and multi_lane_idx is not advanced in this case (rather, the completion function is invoked). So, in your example we will send ATP on lane 1. I.e., at the ATP stage multi_lane_idx is always the last lane where data was sent, and in the case of small data it is also the only lane where the data was sent (a small simulation of the guard for this example is shown after this comment).
Option #1 will not work in certain cases (as you mentioned, because of the mask usage in ucp_proto_multi_lane_map_progress).
Option #2 may be challenging if we do not want to increase the request size. AFAIR, you already optimized req->send.rndv.put to decrease the request size.
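
For illustration, here is a tiny standalone simulation (not UCX code; the constants are hard-coded from the 3-lane example above) of which lanes the proposed guard skips at the ATP stage. It only models the guard condition itself, not the completion logic that runs afterwards.

#include <stdio.h>

int main(void)
{
    /* Values taken from the 3-lane example in the discussion. */
    int length         = 1; /* fragment size in bytes                 */
    int atp_num_lanes  = 3; /* lanes the ATP stage would iterate over */
    int multi_lane_idx = 1; /* the single lane that carried the data  */

    for (int lane = 0; lane < atp_num_lanes; ++lane) {
        if ((length < atp_num_lanes) && (lane < multi_lane_idx)) {
            /* Mirrors the guard in the fix: return UCS_OK, i.e. skip. */
            printf("lane %d: skipped by the guard\n", lane);
        } else {
            printf("lane %d: passes the guard (ATP may be sent here)\n", lane);
        }
    }
    return 0;
}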

Contributor:

I am not sure all data is sent on a single lane; since the rndv protocol in general supports multiple lanes, why is it only a single lane in this case? Maybe add an assert to verify that?
IMO option 2 will not increase the request size, because we would add 1-2 uint8_t fields which would fit into the padding.
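
As a sketch of the kind of debug-build protection suggested above (toy code, not UCX; the type and field names are made up): track a bitmap of lanes that actually carried put_zcopy data and assert that every ATP goes out on one of them.

#include <assert.h>
#include <stdint.h>

/* Hypothetical per-request state, for illustration only. */
typedef struct {
    uint64_t data_lane_map; /* bit i set => put_zcopy was issued on lane i */
} example_req_state_t;

/* In a debug build this would trip if an ATP were about to be sent on a lane
 * that never carried data for this request. */
static void example_check_atp_lane(const example_req_state_t *state, int lane)
{
    assert((state->data_lane_map & (UINT64_C(1) << lane)) != 0);
}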

brminich (Contributor Author):

A single lane will be used because of this condition: req->send.state.dt_iter.length < rpriv->atp_num_lanes.

Contributor:

How does this condition enforce that atp_num_lanes == 1?

brminich (Contributor Author):

The problem exists when the data length is smaller than the number of ATP lanes, because every ATP confirms 1 byte, with the last ATP confirming the rest. In the faulty case there is not enough data for every ATP, so the message is confirmed even before ATP has been sent on every lane it was supposed to go on. For this to happen, the data size must be smaller than the number of ATP lanes (1 byte in our case), and such data is sent on a single lane.

brminich commented Oct 2, 2024:

@yosefe, ok to merge?

yosefe (Contributor) left a comment:

Merging as a temp solution.

@yosefe merged commit 875125b into openucx:master on Oct 2, 2024
141 checks passed