Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prov/efa: Enable inject rdma write #9512

Merged
merged 2 commits into from
Nov 6, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
27 changes: 22 additions & 5 deletions prov/efa/src/rdm/efa_rdm_ope.c
Original file line number Diff line number Diff line change
Expand Up @@ -1451,14 +1451,14 @@ int efa_rdm_ope_post_remote_write(struct efa_rdm_ope *ope)
{
int err;
int iov_idx = 0, rma_iov_idx = 0;
ssize_t copied;
size_t iov_offset = 0, rma_iov_offset = 0;
size_t write_once_len, max_write_once_len;
struct efa_rdm_ep *ep;
struct efa_rdm_pke *pkt_entry;

assert(ope->iov_count > 0);
assert(ope->rma_iov_count > 0);
efa_rdm_ope_try_fill_desc(ope, 0, FI_WRITE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reasoning for avoiding an MR reg call for zero byte write?

Copy link
Contributor Author

@a-szegel a-szegel Nov 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

efa_rdm_ope_try_fill_desc() calls fi_mr_regv() on the buffer. This line assumes that the user will pass a real registered buffer to do a 0 byte write. If you are doing a 0 byte write, then your buffer could be NULL, and your desc could be NULL b/c EFA shouldn't actually do anything with them. The fi_mr_regv() will segfault on NULL desc. Also segfaults for FI_INJECT which always gives a NULL desc.

The reason this is in its own commit is you could hit this bug and segfault via fi_write() with 0 bytes and an invalid buffer in main. I wanted to be explicit about fixing this bug is separate to adding fi_inject_write().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you observe any performance impact for this commit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a speedup usec/xfer 1.30 (before), .98 (after) for

            "program": "/home/ec2-user/libfabric/fabtests/install/bin/fi_rma_bw",
            "args": ["-e", "rdm", "-o", "write", "-S", "10", "-p", "efa", "-E=9235"],

ep = ope->ep;
if (ope->bytes_write_total_len == 0) {
/* According to libfabric document
Expand All @@ -1473,15 +1473,22 @@ int efa_rdm_ope_post_remote_write(struct efa_rdm_ope *ope)
if (OFI_UNLIKELY(!pkt_entry))
return -FI_EAGAIN;

/* Provide the registered bounce buffer and its desc to rdma-core.
* The user provided buffer/desc will not be used for 0 byte writes.
* This allows the user to pass NULL for buff/desc.
*/
efa_rdm_pke_init_write_context(
pkt_entry, ope, ope->iov[0].iov_base, 0, ope->desc[0],
pkt_entry, ope, pkt_entry->wiredata, 0, fi_mr_desc(pkt_entry->mr),
ope->rma_iov[0].addr, ope->rma_iov[0].key);
err = efa_rdm_pke_write(pkt_entry);
if (err)
efa_rdm_pke_release_tx(pkt_entry);
return err;
}

if (!(ope->fi_flags & FI_INJECT))
efa_rdm_ope_try_fill_desc(ope, 0, FI_WRITE);

assert(ope->bytes_write_submitted < ope->bytes_write_total_len);
max_write_once_len = MIN(efa_env.efa_write_segment_size, efa_rdm_ep_domain(ep)->device->max_rdma_size);

Expand Down Expand Up @@ -1513,7 +1520,7 @@ int efa_rdm_ope_post_remote_write(struct efa_rdm_ope *ope)
if (ep->efa_outstanding_tx_ops == ep->efa_max_outstanding_tx_ops)
return -FI_EAGAIN;

if (!ope->desc[iov_idx]) {
if (!ope->desc[iov_idx] && !(ope->fi_flags & FI_INJECT)) {
/* efa_rdm_ope_try_fill_desc() did not fill the desc,
* which means memory registration failed.
* return -FI_EAGAIN here will cause user to run progress
Expand All @@ -1527,6 +1534,16 @@ int efa_rdm_ope_post_remote_write(struct efa_rdm_ope *ope)
if (OFI_UNLIKELY(!pkt_entry))
return -FI_EAGAIN;

if (ope->fi_flags & FI_INJECT) {
assert(ope->iov_count == 1);
assert(ope->total_len <= ep->inject_size);
copied = ofi_copy_from_hmem_iov(pkt_entry->wiredata + sizeof(struct efa_rdm_rma_context_pkt),
Copy link
Contributor

@shijin-aws shijin-aws Nov 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line is too long

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EFA doesn't enforce line length.

ope->total_len, FI_HMEM_SYSTEM, 0, ope->iov, ope->iov_count, 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I am wrong here, it should be system as there is no MR passed in for inject

assert(copied == ope->total_len);
Copy link
Contributor

@shijin-aws shijin-aws Nov 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be an assert. It should be warning and return error when copied != total_len

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (iov_mr && FI_HMEM_CUDA == iov_mr->peer.iface &&
(iov_mr->peer.flags & OFI_HMEM_DATA_DEV_REG_HANDLE)) {
assert(iov_mr->peer.hmem_data);
copied = ofi_gdrcopy_from_cuda_iov((uint64_t)iov_mr->peer.hmem_data,
pke->wiredata + payload_offset,
ope->iov, ope->iov_count,
segment_offset, data_size);
} else {
copied = ofi_copy_from_hmem_iov(pke->wiredata + payload_offset,
data_size,
iov_mr ? iov_mr->peer.iface : FI_HMEM_SYSTEM,
iov_mr ? iov_mr->peer.device.reserved : 0,
ope->iov, ope->iov_count, segment_offset);
}
assert(copied == data_size);
pke->payload = pke->wiredata + payload_offset;

ope->desc[0] = fi_mr_desc(pkt_entry->mr);
ope->iov[0].iov_base = pkt_entry->wiredata + sizeof(struct efa_rdm_rma_context_pkt);
}

write_once_len = MIN(ope->iov[iov_idx].iov_len - iov_offset,
ope->rma_iov[rma_iov_idx].len - rma_iov_offset);
write_once_len = MIN(write_once_len, max_write_once_len);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You copy the data from user buffer to the packet entry but the efa_rdm_pke_init_write_context call a few lines below is still using the user buffer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hack this by:

			ope->desc[0] = fi_mr_desc(pkt_entry->mr);
			ope->iov[0].iov_base = pkt_entry->wiredata + sizeof(struct efa_rdm_rma_context_pkt);

and that works b/c:

		efa_rdm_pke_init_write_context(
			pkt_entry, ope,
			(char *) ope->iov[iov_idx].iov_base + iov_offset,
			write_once_len, ope->desc[iov_idx],
			ope->rma_iov[rma_iov_idx].addr + rma_iov_offset,
			ope->rma_iov[rma_iov_idx].key);

passes in ope->iov[iov_idx].iov_base + iov_offset and ope->desc[iov_idx], and it will only ever go around this loop once.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The txe has a separate cq_entry field that has the buffer and length data. That information is copied to txe when the txe is created. So changing the ope->iov doesn't affect the completion and this OK to do

txe->cq_entry.len = ofi_total_iov_len(txe->iov, txe->iov_count);
txe->cq_entry.buf = OFI_LIKELY(txe->cq_entry.len > 0) ? txe->iov[0].iov_base : NULL;

Expand Down Expand Up @@ -1589,10 +1606,10 @@ int efa_rdm_ope_post_remote_read_or_queue(struct efa_rdm_ope *ope)

/**
* @brief post a local read request, queue it if necessary
*
*
* a local read request is posted to copy data from a packet
* entry to user posted receive buffer on device.
*
*
* @param[in] rxe which has the receive buffer information
* @param[in] rx_data_offset offset of data in the receive buffer
* @param[in] pkt_entry which has the data
Expand Down
4 changes: 4 additions & 0 deletions prov/efa/src/rdm/efa_rdm_pkt_type.c
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@
#include "efa_rdm_peer.h"
#include "efa_rdm_protocol.h"
#include "efa_rdm_pkt_type.h"
#include "efa_rdm_pke_nonreq.h"

struct efa_rdm_pkt_type_req_info EFA_RDM_PKT_TYPE_REQ_INFO_VEC[] = {
/* rtm header */
Expand Down Expand Up @@ -137,5 +138,8 @@ size_t efa_rdm_pkt_type_get_max_hdr_size(void)
pkt_type += 1;
}

/* Non-emulated (real) rdma inject write requires a header */
max_hdr_size = MAX(max_hdr_size, sizeof(struct efa_rdm_rma_context_pkt));

return max_hdr_size;
}
8 changes: 0 additions & 8 deletions prov/efa/src/rdm/efa_rdm_rma.c
Original file line number Diff line number Diff line change
Expand Up @@ -317,14 +317,6 @@ ssize_t efa_rdm_rma_read(struct fid_ep *ep, void *buf, size_t len, void *desc,
static inline
bool efa_rdm_rma_should_write_using_rdma(struct efa_rdm_ep *ep, struct efa_rdm_ope *txe, struct efa_rdm_peer *peer)
{
/*
* RDMA_WRITE does not support FI_INJECT, because device may
* need to re-send data and FI_INJECT allows user to re-use
* these buffers immediately.
*/
if (txe->fi_flags & FI_INJECT)
return false;

/*
* Because EFA is unordered and EFA iov descriptions can be more
* expressive than the IBV sge's, we only implement
Expand Down