Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prov/efa: Enable inject rdma write #9512

Merged
merged 2 commits into from
Nov 6, 2023

Conversation

a-szegel
Copy link
Contributor

@a-szegel a-szegel commented Nov 3, 2023

No description provided.

@a-szegel a-szegel requested a review from a team November 3, 2023 21:28
@a-szegel a-szegel force-pushed the rdma-write-with-inject branch 2 times, most recently from a09f91e to d004f5f Compare November 3, 2023 21:44
@a-szegel
Copy link
Contributor Author

a-szegel commented Nov 3, 2023

This file prov/efa/src/rdm/efa_rdm_pkt_type.c doesn't have a new line at the end... and it looks ugly in the diff so tried to add one to see if it made it look prettier, but it didn't, so then removed it. Also modified first commit message to be more clear.

@@ -1473,15 +1472,17 @@ int efa_rdm_ope_post_remote_write(struct efa_rdm_ope *ope)
if (OFI_UNLIKELY(!pkt_entry))
return -FI_EAGAIN;

/* Use bounce buffer (registered) for buffer/desc, user input doesn't matter */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:

Use bounce buffer (registered) for buffer/desc, the buffer we pass to rdma-core doesn't matter

sounds better

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

didn't quite use your words exactly, but made this more clear.

@@ -1458,7 +1458,6 @@ int efa_rdm_ope_post_remote_write(struct efa_rdm_ope *ope)

assert(ope->iov_count > 0);
assert(ope->rma_iov_count > 0);
efa_rdm_ope_try_fill_desc(ope, 0, FI_WRITE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reasoning for avoiding an MR reg call for zero byte write?

Copy link
Contributor Author

@a-szegel a-szegel Nov 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

efa_rdm_ope_try_fill_desc() calls fi_mr_regv() on the buffer. This line assumes that the user will pass a real registered buffer to do a 0 byte write. If you are doing a 0 byte write, then your buffer could be NULL, and your desc could be NULL b/c EFA shouldn't actually do anything with them. The fi_mr_regv() will segfault on NULL desc. Also segfaults for FI_INJECT which always gives a NULL desc.

The reason this is in its own commit is you could hit this bug and segfault via fi_write() with 0 bytes and an invalid buffer in main. I wanted to be explicit about fixing this bug is separate to adding fi_inject_write().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you observe any performance impact for this commit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a speedup usec/xfer 1.30 (before), .98 (after) for

            "program": "/home/ec2-user/libfabric/fabtests/install/bin/fi_rma_bw",
            "args": ["-e", "rdm", "-o", "write", "-S", "10", "-p", "efa", "-E=9235"],

ope->desc[0] = fi_mr_desc(pkt_entry->mr);
ope->iov[0].iov_base = pkt_entry->wiredata + sizeof(struct efa_rdm_rma_context_pkt);
}

write_once_len = MIN(ope->iov[iov_idx].iov_len - iov_offset,
ope->rma_iov[rma_iov_idx].len - rma_iov_offset);
write_once_len = MIN(write_once_len, max_write_once_len);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You copy the data from user buffer to the packet entry but the efa_rdm_pke_init_write_context call a few lines below is still using the user buffer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hack this by:

			ope->desc[0] = fi_mr_desc(pkt_entry->mr);
			ope->iov[0].iov_base = pkt_entry->wiredata + sizeof(struct efa_rdm_rma_context_pkt);

and that works b/c:

		efa_rdm_pke_init_write_context(
			pkt_entry, ope,
			(char *) ope->iov[iov_idx].iov_base + iov_offset,
			write_once_len, ope->desc[iov_idx],
			ope->rma_iov[rma_iov_idx].addr + rma_iov_offset,
			ope->rma_iov[rma_iov_idx].key);

passes in ope->iov[iov_idx].iov_base + iov_offset and ope->desc[iov_idx], and it will only ever go around this loop once.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The txe has a separate cq_entry field that has the buffer and length data. That information is copied to txe when the txe is created. So changing the ope->iov doesn't affect the completion and this OK to do

txe->cq_entry.len = ofi_total_iov_len(txe->iov, txe->iov_count);
txe->cq_entry.buf = OFI_LIKELY(txe->cq_entry.len > 0) ? txe->iov[0].iov_base : NULL;

@a-szegel a-szegel force-pushed the rdma-write-with-inject branch from d004f5f to 872a1c5 Compare November 4, 2023 01:05
The non-inject 0 byte write path assumed that users would pass in
valid buffers, and valid descriptors (memory registrations) for
those buffers. The user should not have to pass in a valid buffer to do
a 0 byte write. Use pre-registered bounce buffers for buffer/MR, to
allow user to pass in whatever they choose.

Signed-off-by: Seth Zegelstein <szegel@amazon.com>
Turns off emulated inject rdma write, and enables real inject rdma
write. Inject rdma write is special for two reasons:

1. The user gets to re-use their immediately after we return
2. RDMA operations require MR's, and the inject write interfaces do not
   provide a memory descritpor

In order to handle both cases, the users data is copied into a
pre-registered bounce buffer. The bounce buffer is allocated to be the
sizeof(struct efa_rdm_pke) + ep->mtu_size, where the wire data starts at
ep->mtu_size. The RDMA operations use part of the wire data for a RDMA
header struct efa_rdm_rma_context_pkt.  This change needs to lower the
maximum inject size to never be more than mtu_size - sizeof(struct
efa_rdm_rma_context_pkt). I created a github issue for libfabric 2.0 API
to seperate inject sizes for send/recv, atomic, and rdma operations
(issue 9510).

Signed-off-by: Seth Zegelstein <szegel@amazon.com>
@a-szegel a-szegel force-pushed the rdma-write-with-inject branch from 872a1c5 to 876ba80 Compare November 4, 2023 01:09
@a-szegel a-szegel requested review from sunkuamzn and a team and removed request for a team November 4, 2023 01:10
@j-xiong j-xiong merged commit 6840302 into ofiwg:main Nov 6, 2023
8 checks passed
if (ope->fi_flags & FI_INJECT) {
assert(ope->iov_count == 1);
assert(ope->total_len <= ep->inject_size);
copied = ofi_copy_from_hmem_iov(pkt_entry->wiredata + sizeof(struct efa_rdm_rma_context_pkt),
Copy link
Contributor

@shijin-aws shijin-aws Nov 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line is too long

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EFA doesn't enforce line length.

assert(ope->iov_count == 1);
assert(ope->total_len <= ep->inject_size);
copied = ofi_copy_from_hmem_iov(pkt_entry->wiredata + sizeof(struct efa_rdm_rma_context_pkt),
ope->total_len, FI_HMEM_SYSTEM, 0, ope->iov, ope->iov_count, 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I am wrong here, it should be system as there is no MR passed in for inject

assert(ope->total_len <= ep->inject_size);
copied = ofi_copy_from_hmem_iov(pkt_entry->wiredata + sizeof(struct efa_rdm_rma_context_pkt),
ope->total_len, FI_HMEM_SYSTEM, 0, ope->iov, ope->iov_count, 0);
assert(copied == ope->total_len);
Copy link
Contributor

@shijin-aws shijin-aws Nov 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be an assert. It should be warning and return error when copied != total_len

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (iov_mr && FI_HMEM_CUDA == iov_mr->peer.iface &&
(iov_mr->peer.flags & OFI_HMEM_DATA_DEV_REG_HANDLE)) {
assert(iov_mr->peer.hmem_data);
copied = ofi_gdrcopy_from_cuda_iov((uint64_t)iov_mr->peer.hmem_data,
pke->wiredata + payload_offset,
ope->iov, ope->iov_count,
segment_offset, data_size);
} else {
copied = ofi_copy_from_hmem_iov(pke->wiredata + payload_offset,
data_size,
iov_mr ? iov_mr->peer.iface : FI_HMEM_SYSTEM,
iov_mr ? iov_mr->peer.device.reserved : 0,
ope->iov, ope->iov_count, segment_offset);
}
assert(copied == data_size);
pke->payload = pke->wiredata + payload_offset;

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants