-
Notifications
You must be signed in to change notification settings - Fork 395
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
prov/efa: Make handshake code use TXE, and provide perf improvements to EFA #9682
prov/efa: Make handshake code use TXE, and provide perf improvements to EFA #9682
Conversation
13192c7
to
7542362
Compare
prov/efa/src/rdm/efa_rdm_ep_utils.c
Outdated
txe->addr = addr; | ||
txe->peer = efa_rdm_ep_get_peer(ep, txe->addr); | ||
assert(txe->peer); | ||
dlist_insert_tail(&txe->peer_entry, &txe->peer->txe_list); | ||
txe->msg_id = -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For handshake my impression is that txe->msg_id
and txe->fi_flags
shouldn't matter that much. I doubt we need to reset them here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in both places, grep shows txe->msg_id
isn't being used in anywhere critical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shi requested I restore this to -1... "don't fix it if it isn't broken".
prov/efa/src/rdm/efa_rdm_ep_utils.c
Outdated
txe->msg_id = -1; | ||
/* efa_rdm_ep_alloc_txe() joins ep->base_ep.util_ep.tx_op_flags and passed in flags, | ||
* reset to passed in flags | ||
*/ | ||
txe->fi_flags = EFA_RDM_TXE_NO_COMPLETION | EFA_RDM_TXE_NO_COUNTER; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably not needed - I think we can remove them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in both places... my only concern is that the user can potentially set flags that make us not behave as expected... (like FI_DELEIVERY_COMPLETE)... I think our tests should catch this though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested my theory in the debugger:
#define FI_DELIVERY_COMPLETE (1ULL << 28)
p tx_op_flags & (1ULL << 28)
$8 = 268435456
#define FI_COMPLETION (1ULL << 24)
p tx_op_flags & (1ULL << 24)
$9 = 16777216
Looks like we do need these so the common utilities work correctly.
4ce38a5
to
c109240
Compare
prov/efa/src/rdm/efa_rdm_ep_utils.c
Outdated
@@ -580,8 +576,8 @@ ssize_t efa_rdm_ep_trigger_handshake(struct efa_rdm_ep *ep, | |||
|
|||
txe->ep = ep; | |||
txe->total_len = 0; | |||
txe->addr = addr; | |||
txe->peer = efa_rdm_ep_get_peer(ep, txe->addr); | |||
txe->addr = peer->efa_fiaddr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you still need addr
in txe since you already have peer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good idea, will fix in separate PR
txe->iov_count = 0; | ||
/* efa_rdm_ep_alloc_txe() joins ep->base_ep.util_ep.tx_op_flags and passed in flags, | ||
* reset to desired flags (remove things like FI_DELIVERY_COMPLETE, and FI_COMPLETION) | ||
*/ | ||
txe->fi_flags = EFA_RDM_TXE_NO_COMPLETION | EFA_RDM_TXE_NO_COUNTER; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks odd, if the flag set in efa_rdm_ep_alloc_txe
doesn't do what we want, why don't we pass these flags as 0 in the call and only set it here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
prov/efa/src/rdm/efa_rdm_ep_utils.c
Outdated
msg.addr = addr; | ||
|
||
/* ofi_op_write is ignored in handshake path */ | ||
txe = efa_rdm_ep_alloc_txe(ep, peer, &msg, ofi_op_write, 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this extra txe allocation impact perf? Probably not... I understand you want to make all pkt type have a valid ope so you can save the peer lookup in the fast path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes the handshake path take slightly longer but it should amortize out, and our fast path should get faster. I expect to see on the order of 2-3ns improvement from this PR in send path.
struct ibv_sge sg_list[2]; /* efa device support up to 2 iov */ | ||
struct ibv_data_buf inline_data_list[2]; | ||
int ret, pkt_idx, iov_cnt; | ||
|
||
assert(pkt_entry_cnt); | ||
ep = pkt_entry_vec[0]->ep; | ||
assert(ep); | ||
peer = efa_rdm_ep_get_peer(ep, pkt_entry_vec[0]->addr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You only need peer instead of txe in this function, why not just assign peer as txe->peer once so you don't have to reference it as txe->peer
everywhere in the function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
struct efa_rdm_ope *ope; | ||
|
||
ope = pkt_entry->ope; | ||
/* | ||
* peer can be NULL when the pkt_entry is a RMA_CONTEXT_PKT, | ||
* and the RMA is a local read toward the endpoint itself | ||
*/ | ||
peer = efa_rdm_ep_get_peer(ep, pkt_entry->addr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar comment. I prefer you assign peer = ope->peer
once to save instructions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
need to look at the AWS CI failures. |
c109240
to
7d93de6
Compare
Looking at CI failures |
Pass a pointer to the peer instead of the peer's address in efa_rdm_ep_trigger_handshake(). This lets us avoid doing redundant peer lookups. Signed-off-by: Seth Zegelstein <szegel@amazon.com>
7d93de6
to
ea0b742
Compare
Fixed the failing UT's, lets see if we pass CI. |
ret = fi_close(&efa_rdm_ep->shm_ep->fid); | ||
assert_int_equal(ret, 0); | ||
efa_rdm_ep->shm_ep = NULL; | ||
if (efa_rdm_ep->shm_ep) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did we disable shm in this test? how come it wasn't broken?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my env had the env set, and I figured this is a nice change to have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can move this to a separate commit, or remove it if you want me to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved to separate commit, and got other cases.
bot:aws:retest |
Prevously, efa_rdm_txe_construct() was not initializing rma_iov_count to 0 when creating a txe. This value needs to be initialized to 0 b/c the OPE buff pool is shared between the fi_send()/fi_rma() path, and if there is garbage in that value, than our common recv utilities can end up going down the wrong path. This bug was exposed when we switched the efa_rdm_ep_trigger_handshake() code to use efa_rdm_ep_alloc_txe() with IntelMPI osu_mbw_mr collective test. Signed-off-by: Seth Zegelstein <szegel@amazon.com>
Switch the handshake code to re-use the efa utility efa_rdm_ep_alloc_txe() to alloc the txe used in the handshake. Signed-off-by: Seth Zegelstein <szegel@amazon.com>
ea0b742
to
ca0345d
Compare
Users can set FI_EFA_ENABLE_SHM_TRANSFER=0 in their environment, and then they do not have a SHM EP. The unit tests in their current state fail if the env is set, but this patch fixes that. Signed-off-by: Seth Zegelstein <szegel@amazon.com>
In order to optimize the fast path send path, we need the handshake response code to create a txe. Now all sends will have a txe with a valid peer so there is no need to do the lookup multiple times. Fixes a bug in the error path to report issue to CQ instead of EQ. Updates EFA Unit test to make the corresponding changes pass. Signed-off-by: Seth Zegelstein <szegel@amazon.com>
Now that all handshake requests, and responses use a txe, we can rely on all send packets having a txe and we no longer need to look up the peer in the send path. This is an optimization that will make all EFA sends faster. Signed-off-by: Seth Zegelstein <szegel@amazon.com>
Since all EFA send operations use a txe, we can remove the peer lookup from the critical path offering a performance improvement to all EFA sends. Signed-off-by: Seth Zegelstein <szegel@amazon.com>
ca0345d
to
edc23c0
Compare
Switch handshake send/response functions to create a txe. Optimize the EFA send path by removing redundant efa_rdm_ep_get_peer() functions from it.