
UCP: Send ATP on a correct lane to fix CI failures #10191

Merged
1 commit merged into openucx:master on Oct 2, 2024

Conversation

brminich (Contributor)

What

Fixes the following bug in CI:

[ RUN      ] dcx/test_ucp_am_nbx_rndv_memtype.rndv/32 <dc_x,cuda_copy,rocm_copy/cuda-managed/cuda-managed>
common/mem_buffer.cc:321: Failure
Pattern check failed at 0x7fab92100200 offset 524288 (length 1 mask: 0xff): Expected: 0xb4 Actual: 0x0
Trace/breakpoint trap (core dumped)

The issue occurs when using the rndv put pipeline protocol, and data is split into two fragments: the first being 512KB and the second only 1 byte. For the second fragment, UCP sends the 1 byte starting from lane 1, transmitting all of it on that lane. However, ATP is sent starting from lane 0. Each ATP, except the last one, confirms just 1 byte to prevent premature receive completion on the peer side.

The problem arises because the data size is only 1 byte. The first ATP (on lane 0) informs the receiver that the data is ready, but since it is sent on a different lane than the data, it arrives before the data itself. As a result, the receiver completes the request even though the actual data hasn't arrived yet.
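
To make the accounting concrete, below is a toy sketch of the per-lane ATP sizes described above. This is not UCX code; the function and variable names are made up for illustration, and only the byte accounting is modeled, not the lane on which each ATP actually travels. With a 1-byte fragment and more than one ATP lane, the very first ATP already accounts for the entire payload, so a receiver that trusts it can complete before the data lane has delivered anything.

#include <stdio.h>

/* Toy model (illustrative names, not UCX code): the number of bytes confirmed
 * by the ATP sent on lane index 'i' out of 'num_atp_lanes', for a fragment of
 * 'length' bytes. Every ATP but the last confirms 1 byte; the last one
 * confirms the remainder. */
static size_t atp_size(size_t length, int i, int num_atp_lanes)
{
    return (i < num_atp_lanes - 1) ? 1 : length - (num_atp_lanes - 1);
}

int main(void)
{
    size_t length        = 1; /* second pipeline fragment from the CI failure */
    int    num_atp_lanes = 2; /* assumed number of ATP lanes, for illustration */
    size_t confirmed     = 0;

    for (int i = 0; i < num_atp_lanes; ++i) {
        confirmed += atp_size(length, i, num_atp_lanes);
        printf("ATP #%d: confirmed %zu of %zu bytes\n", i, confirmed, length);
        /* With length == 1, the first ATP already covers the whole fragment,
         * even though the data itself travels on a different lane. */
    }
    return 0;
}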

The change under review (excerpt):

     * lane, ATP may never be sent on the data lane. */
    if (ucs_unlikely((req->send.state.dt_iter.length < rpriv->atp_num_lanes) &&
                     (lane < req->send.multi_lane_idx))) {
        return UCS_OK;

Contributor:
So the issue is that ucp_proto_rndv_bulk_request_init can set req->send.multi_lane_idx to >0, depending on the initial offset (which is >0 for a pipeline fragment).

First, IMO this fix is not correct: if we have 3 lanes and start send.multi_lane_idx from lane[1], then after sending put_zcopy send.multi_lane_idx becomes 2, and in the ATP stage we would skip lane[1] (because lane=1 < multi_lane_idx=2), which is wrong because we actually want to send ATP on lane[1].

Second, we probably need some kind of debug-build protection ensuring that we send ATP on the same lanes we sent put_zcopy on.

How about the following alternatives to fix the issue:

  1. Initialize atp_lane to req->send.multi_lane_idx in ucp_proto_rndv_put_common_request_init, but only if req->send.state.dt_iter.length < rpriv->atp_num_lanes. It is similar to the current fix but avoids one branch and extra iterations over ATP lanes. However, it still has a potential issue: if we need to send ATP on 2 (out of 4) lanes, starting from lane[3] and then lane[0], send.multi_lane_idx starts at lane[3], so we would never send ATP on lane[0] because of the way the mask is used.

  2. Instead of using the ucp_proto_multi_lane_map_progress function, save in the send request the first and last lane indexes on which we did put_zcopy (in addition to, or instead of, atp_lane), including wraparound, and then iterate over that range of lanes when sending ATP; a rough sketch of this idea is shown below.
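
A rough sketch of what option 2 could look like, purely for illustration: the struct and function names below are hypothetical, not actual UCX identifiers, and only the lane-range iteration with wraparound is shown.

#include <stdint.h>

/* Hypothetical fields saved in the send request (illustrative names only):
 * the first and last lane indexes on which put_zcopy was issued. */
typedef struct {
    uint8_t put_first_lane;
    uint8_t put_last_lane;
    uint8_t num_lanes;       /* total lanes configured for the protocol */
} example_put_lane_range_t;

/* Send ATP only on the lanes that actually carried data, handling wraparound:
 * e.g. first == 3, last == 0 with 4 lanes visits lane 3 and then lane 0. */
static void example_send_atp_on_data_lanes(const example_put_lane_range_t *r,
                                           void (*send_atp)(uint8_t lane))
{
    uint8_t lane = r->put_first_lane;

    for (;;) {
        send_atp(lane);
        if (lane == r->put_last_lane) {
            break;
        }
        lane = (uint8_t)((lane + 1) % r->num_lanes);
    }
}

Two uint8_t fields like these would plausibly fit into existing request padding, which is the point about request size raised later in the discussion.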

brminich (Contributor Author), Sep 28, 2024:

I think the fix is correct, because all the data is sent on a single lane and multi_lane_idx is not advanced in this case (rather, the completion function is invoked). So, in your example we will send ATP on lane 1. I.e., at the ATP stage multi_lane_idx is always the last lane where data was sent, and in the case of small data it is also the only lane where the data was sent (a small simulation of the guard for this example is shown after this comment).
Option #1 will not work in certain cases (as you mentioned, because of the mask usage in ucp_proto_multi_lane_map_progress).
Option #2 may be challenging if we do not want to increase the request size. AFAIR, you already optimized req->send.rndv.put to decrease the request size.
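
For illustration, here is a tiny standalone simulation (not UCX code; the constants are hard-coded from the 3-lane example above) of which lanes the proposed guard skips at the ATP stage. It only models the guard condition itself, not the completion logic that runs afterwards.

#include <stdio.h>

int main(void)
{
    /* Values taken from the 3-lane example in the discussion. */
    int length         = 1; /* fragment size in bytes                 */
    int atp_num_lanes  = 3; /* lanes the ATP stage would iterate over */
    int multi_lane_idx = 1; /* the single lane that carried the data  */

    for (int lane = 0; lane < atp_num_lanes; ++lane) {
        if ((length < atp_num_lanes) && (lane < multi_lane_idx)) {
            /* Mirrors the guard in the fix: return UCS_OK, i.e. skip. */
            printf("lane %d: skipped by the guard\n", lane);
        } else {
            printf("lane %d: passes the guard (ATP may be sent here)\n", lane);
        }
    }
    return 0;
}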

Contributor:

I am not sure all data is sent on a single lane; since the rndv protocol in general supports multiple lanes, why is it only a single lane in this case? Maybe add an assert to verify that?
IMO option 2 will not increase the request size, because we would add 1-2 uint8_t fields which would fit into the padding.
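
As a sketch of the kind of debug-build protection suggested above (toy code, not UCX; the type and field names are made up): track a bitmap of lanes that actually carried put_zcopy data and assert that every ATP goes out on one of them.

#include <assert.h>
#include <stdint.h>

/* Hypothetical per-request state, for illustration only. */
typedef struct {
    uint64_t data_lane_map; /* bit i set => put_zcopy was issued on lane i */
} example_req_state_t;

/* In a debug build this would trip if an ATP were about to be sent on a lane
 * that never carried data for this request. */
static void example_check_atp_lane(const example_req_state_t *state, int lane)
{
    assert((state->data_lane_map & (UINT64_C(1) << lane)) != 0);
}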

brminich (Contributor Author):

A single lane will be used because of this condition: req->send.state.dt_iter.length < rpriv->atp_num_lanes.

Contributor:

How does this condition enforce that atp_num_lanes == 1?

brminich (Contributor Author):

The problem exists when the data length is smaller than the number of ATP lanes, because every ATP confirms 1 byte, with the last ATP confirming the rest. In the faulty case there is not enough data for every ATP, so the message is confirmed even before ATP has been sent on every lane it was supposed to go on. For this to happen, the data size must be smaller than the number of ATP lanes (1 byte in our case), and such data is sent on a single lane.

brminich commented Oct 2, 2024:

@yosefe, ok to merge?

yosefe (Contributor) left a comment:

Merging as a temp solution.

@yosefe merged commit 875125b into openucx:master on Oct 2, 2024
141 checks passed