-
Notifications
You must be signed in to change notification settings - Fork 389
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
prov/efa: Enable inject rdma write #9512
Conversation
a09f91e
to
d004f5f
Compare
This file |
prov/efa/src/rdm/efa_rdm_ope.c
Outdated
@@ -1473,15 +1472,17 @@ int efa_rdm_ope_post_remote_write(struct efa_rdm_ope *ope) | |||
if (OFI_UNLIKELY(!pkt_entry)) | |||
return -FI_EAGAIN; | |||
|
|||
/* Use bounce buffer (registered) for buffer/desc, user input doesn't matter */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit:
Use bounce buffer (registered) for buffer/desc, the buffer we pass to rdma-core doesn't matter
sounds better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
didn't quite use your words exactly, but made this more clear.
@@ -1458,7 +1458,6 @@ int efa_rdm_ope_post_remote_write(struct efa_rdm_ope *ope) | |||
|
|||
assert(ope->iov_count > 0); | |||
assert(ope->rma_iov_count > 0); | |||
efa_rdm_ope_try_fill_desc(ope, 0, FI_WRITE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reasoning for avoiding an MR reg call for zero byte write?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
efa_rdm_ope_try_fill_desc()
calls fi_mr_regv()
on the buffer. This line assumes that the user will pass a real registered buffer to do a 0 byte write. If you are doing a 0 byte write, then your buffer could be NULL, and your desc could be NULL b/c EFA shouldn't actually do anything with them. The fi_mr_regv()
will segfault on NULL desc. Also segfaults for FI_INJECT which always gives a NULL desc.
The reason this is in its own commit is you could hit this bug and segfault via fi_write() with 0 bytes and an invalid buffer in main. I wanted to be explicit about fixing this bug is separate to adding fi_inject_write().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did you observe any performance impact for this commit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a speedup usec/xfer 1.30 (before), .98 (after) for
"program": "/home/ec2-user/libfabric/fabtests/install/bin/fi_rma_bw",
"args": ["-e", "rdm", "-o", "write", "-S", "10", "-p", "efa", "-E=9235"],
ope->desc[0] = fi_mr_desc(pkt_entry->mr); | ||
ope->iov[0].iov_base = pkt_entry->wiredata + sizeof(struct efa_rdm_rma_context_pkt); | ||
} | ||
|
||
write_once_len = MIN(ope->iov[iov_idx].iov_len - iov_offset, | ||
ope->rma_iov[rma_iov_idx].len - rma_iov_offset); | ||
write_once_len = MIN(write_once_len, max_write_once_len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You copy the data from user buffer to the packet entry but the efa_rdm_pke_init_write_context
call a few lines below is still using the user buffer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hack this by:
ope->desc[0] = fi_mr_desc(pkt_entry->mr);
ope->iov[0].iov_base = pkt_entry->wiredata + sizeof(struct efa_rdm_rma_context_pkt);
and that works b/c:
efa_rdm_pke_init_write_context(
pkt_entry, ope,
(char *) ope->iov[iov_idx].iov_base + iov_offset,
write_once_len, ope->desc[iov_idx],
ope->rma_iov[rma_iov_idx].addr + rma_iov_offset,
ope->rma_iov[rma_iov_idx].key);
passes in ope->iov[iov_idx].iov_base + iov_offset
and ope->desc[iov_idx]
, and it will only ever go around this loop once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The txe has a separate cq_entry
field that has the buffer and length data. That information is copied to txe when the txe is created. So changing the ope->iov
doesn't affect the completion and this OK to do
libfabric/prov/efa/src/rdm/efa_rdm_ope.c
Lines 85 to 86 in d43c492
txe->cq_entry.len = ofi_total_iov_len(txe->iov, txe->iov_count); | |
txe->cq_entry.buf = OFI_LIKELY(txe->cq_entry.len > 0) ? txe->iov[0].iov_base : NULL; |
d004f5f
to
872a1c5
Compare
The non-inject 0 byte write path assumed that users would pass in valid buffers, and valid descriptors (memory registrations) for those buffers. The user should not have to pass in a valid buffer to do a 0 byte write. Use pre-registered bounce buffers for buffer/MR, to allow user to pass in whatever they choose. Signed-off-by: Seth Zegelstein <szegel@amazon.com>
Turns off emulated inject rdma write, and enables real inject rdma write. Inject rdma write is special for two reasons: 1. The user gets to re-use their immediately after we return 2. RDMA operations require MR's, and the inject write interfaces do not provide a memory descritpor In order to handle both cases, the users data is copied into a pre-registered bounce buffer. The bounce buffer is allocated to be the sizeof(struct efa_rdm_pke) + ep->mtu_size, where the wire data starts at ep->mtu_size. The RDMA operations use part of the wire data for a RDMA header struct efa_rdm_rma_context_pkt. This change needs to lower the maximum inject size to never be more than mtu_size - sizeof(struct efa_rdm_rma_context_pkt). I created a github issue for libfabric 2.0 API to seperate inject sizes for send/recv, atomic, and rdma operations (issue 9510). Signed-off-by: Seth Zegelstein <szegel@amazon.com>
872a1c5
to
876ba80
Compare
if (ope->fi_flags & FI_INJECT) { | ||
assert(ope->iov_count == 1); | ||
assert(ope->total_len <= ep->inject_size); | ||
copied = ofi_copy_from_hmem_iov(pkt_entry->wiredata + sizeof(struct efa_rdm_rma_context_pkt), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
line is too long
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EFA doesn't enforce line length.
assert(ope->iov_count == 1); | ||
assert(ope->total_len <= ep->inject_size); | ||
copied = ofi_copy_from_hmem_iov(pkt_entry->wiredata + sizeof(struct efa_rdm_rma_context_pkt), | ||
ope->total_len, FI_HMEM_SYSTEM, 0, ope->iov, ope->iov_count, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I am wrong here, it should be system as there is no MR passed in for inject
assert(ope->total_len <= ep->inject_size); | ||
copied = ofi_copy_from_hmem_iov(pkt_entry->wiredata + sizeof(struct efa_rdm_rma_context_pkt), | ||
ope->total_len, FI_HMEM_SYSTEM, 0, ope->iov, ope->iov_count, 0); | ||
assert(copied == ope->total_len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be an assert. It should be warning and return error when copied != total_len
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
libfabric/prov/efa/src/rdm/efa_rdm_pke_utils.c
Lines 123 to 139 in 387a733
if (iov_mr && FI_HMEM_CUDA == iov_mr->peer.iface && | |
(iov_mr->peer.flags & OFI_HMEM_DATA_DEV_REG_HANDLE)) { | |
assert(iov_mr->peer.hmem_data); | |
copied = ofi_gdrcopy_from_cuda_iov((uint64_t)iov_mr->peer.hmem_data, | |
pke->wiredata + payload_offset, | |
ope->iov, ope->iov_count, | |
segment_offset, data_size); | |
} else { | |
copied = ofi_copy_from_hmem_iov(pke->wiredata + payload_offset, | |
data_size, | |
iov_mr ? iov_mr->peer.iface : FI_HMEM_SYSTEM, | |
iov_mr ? iov_mr->peer.device.reserved : 0, | |
ope->iov, ope->iov_count, segment_offset); | |
} | |
assert(copied == data_size); | |
pke->payload = pke->wiredata + payload_offset; |
No description provided.