Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prov/efa: Make handshake code use TXE, and provide perf improvements to EFA #9682

Merged
merged 7 commits into from
Jan 4, 2024

Conversation

a-szegel
Copy link
Contributor

@a-szegel a-szegel commented Dec 29, 2023

Switch handshake send/response functions to create a txe. Optimize the EFA send path by removing redundant efa_rdm_ep_get_peer() functions from it.

@a-szegel a-szegel requested a review from a team December 29, 2023 19:27
@a-szegel a-szegel changed the title prov/efa: Remove peer lookup from handshake prov/efa: Cleanup Handshake Code Dec 29, 2023
@a-szegel a-szegel changed the title prov/efa: Cleanup Handshake Code prov/efa: Make handshake code use TXE, and provide perf improvements to EFA Dec 29, 2023
@a-szegel a-szegel force-pushed the remove-peer-lookup-from-handshake branch from 13192c7 to 7542362 Compare December 30, 2023 00:10
txe->addr = addr;
txe->peer = efa_rdm_ep_get_peer(ep, txe->addr);
assert(txe->peer);
dlist_insert_tail(&txe->peer_entry, &txe->peer->txe_list);
txe->msg_id = -1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For handshake my impression is that txe->msg_id and txe->fi_flags shouldn't matter that much. I doubt we need to reset them here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in both places, grep shows txe->msg_id isn't being used in anywhere critical.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shi requested I restore this to -1... "don't fix it if it isn't broken".

Comment on lines 619 to 624
txe->msg_id = -1;
/* efa_rdm_ep_alloc_txe() joins ep->base_ep.util_ep.tx_op_flags and passed in flags,
* reset to passed in flags
*/
txe->fi_flags = EFA_RDM_TXE_NO_COMPLETION | EFA_RDM_TXE_NO_COUNTER;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not needed - I think we can remove them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in both places... my only concern is that the user can potentially set flags that make us not behave as expected... (like FI_DELEIVERY_COMPLETE)... I think our tests should catch this though?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested my theory in the debugger:

#define FI_DELIVERY_COMPLETE	(1ULL << 28)

p  tx_op_flags & (1ULL << 28)
$8 = 268435456

#define FI_COMPLETION		(1ULL << 24)

p  tx_op_flags & (1ULL << 24)
$9 = 16777216

Looks like we do need these so the common utilities work correctly.

@a-szegel a-szegel force-pushed the remove-peer-lookup-from-handshake branch 2 times, most recently from 4ce38a5 to c109240 Compare December 30, 2023 00:39
@@ -580,8 +576,8 @@ ssize_t efa_rdm_ep_trigger_handshake(struct efa_rdm_ep *ep,

txe->ep = ep;
txe->total_len = 0;
txe->addr = addr;
txe->peer = efa_rdm_ep_get_peer(ep, txe->addr);
txe->addr = peer->efa_fiaddr;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you still need addr in txe since you already have peer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea, will fix in separate PR

txe->iov_count = 0;
/* efa_rdm_ep_alloc_txe() joins ep->base_ep.util_ep.tx_op_flags and passed in flags,
* reset to desired flags (remove things like FI_DELIVERY_COMPLETE, and FI_COMPLETION)
*/
txe->fi_flags = EFA_RDM_TXE_NO_COMPLETION | EFA_RDM_TXE_NO_COUNTER;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks odd, if the flag set in efa_rdm_ep_alloc_txe doesn't do what we want, why don't we pass these flags as 0 in the call and only set it here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

msg.addr = addr;

/* ofi_op_write is ignored in handshake path */
txe = efa_rdm_ep_alloc_txe(ep, peer, &msg, ofi_op_write, 0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this extra txe allocation impact perf? Probably not... I understand you want to make all pkt type have a valid ope so you can save the peer lookup in the fast path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes the handshake path take slightly longer but it should amortize out, and our fast path should get faster. I expect to see on the order of 2-3ns improvement from this PR in send path.

struct ibv_sge sg_list[2]; /* efa device support up to 2 iov */
struct ibv_data_buf inline_data_list[2];
int ret, pkt_idx, iov_cnt;

assert(pkt_entry_cnt);
ep = pkt_entry_vec[0]->ep;
assert(ep);
peer = efa_rdm_ep_get_peer(ep, pkt_entry_vec[0]->addr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You only need peer instead of txe in this function, why not just assign peer as txe->peer once so you don't have to reference it as txe->peer everywhere in the function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

struct efa_rdm_ope *ope;

ope = pkt_entry->ope;
/*
* peer can be NULL when the pkt_entry is a RMA_CONTEXT_PKT,
* and the RMA is a local read toward the endpoint itself
*/
peer = efa_rdm_ep_get_peer(ep, pkt_entry->addr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comment. I prefer you assign peer = ope->peer once to save instructions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@shijin-aws
Copy link
Contributor

need to look at the AWS CI failures.

@a-szegel a-szegel force-pushed the remove-peer-lookup-from-handshake branch from c109240 to 7d93de6 Compare January 2, 2024 18:30
@a-szegel
Copy link
Contributor Author

a-szegel commented Jan 2, 2024

Looking at CI failures

Pass a pointer to the peer instead of the peer's address in
efa_rdm_ep_trigger_handshake().  This lets us avoid doing redundant peer
lookups.

Signed-off-by: Seth Zegelstein <szegel@amazon.com>
@a-szegel a-szegel force-pushed the remove-peer-lookup-from-handshake branch from 7d93de6 to ea0b742 Compare January 3, 2024 00:16
@a-szegel
Copy link
Contributor Author

a-szegel commented Jan 3, 2024

Fixed the failing UT's, lets see if we pass CI.

ret = fi_close(&efa_rdm_ep->shm_ep->fid);
assert_int_equal(ret, 0);
efa_rdm_ep->shm_ep = NULL;
if (efa_rdm_ep->shm_ep) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did we disable shm in this test? how come it wasn't broken?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my env had the env set, and I figured this is a nice change to have.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can move this to a separate commit, or remove it if you want me to.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to separate commit, and got other cases.

@a-szegel
Copy link
Contributor Author

a-szegel commented Jan 3, 2024

bot:aws:retest

Prevously, efa_rdm_txe_construct() was not initializing rma_iov_count to
0 when creating a txe. This value needs to be initialized to 0 b/c the
OPE buff pool is shared between the fi_send()/fi_rma() path, and if
there is garbage in that value, than our common recv utilities can end
up going down the wrong path.  This bug was exposed when we switched the
efa_rdm_ep_trigger_handshake() code to use efa_rdm_ep_alloc_txe() with
IntelMPI osu_mbw_mr collective test.

Signed-off-by: Seth Zegelstein <szegel@amazon.com>
Switch the handshake code to re-use the efa utility
efa_rdm_ep_alloc_txe() to alloc the txe used in the handshake.

Signed-off-by: Seth Zegelstein <szegel@amazon.com>
@a-szegel a-szegel force-pushed the remove-peer-lookup-from-handshake branch from ea0b742 to ca0345d Compare January 3, 2024 23:18
Users can set FI_EFA_ENABLE_SHM_TRANSFER=0 in their environment, and
then they do not have a SHM EP.  The unit tests in their current state
fail if the env is set, but this patch fixes that.

Signed-off-by: Seth Zegelstein <szegel@amazon.com>
In order to optimize the fast path send path, we need the handshake
response code to create a txe. Now all sends will have a txe with a
valid peer so there is no need to do the lookup multiple times.

Fixes a bug in the error path to report issue to CQ instead of EQ.

Updates EFA Unit test to make the corresponding changes pass.

Signed-off-by: Seth Zegelstein <szegel@amazon.com>
Now that all handshake requests, and responses use a txe, we can rely on
all send packets having a txe and we no longer need to look up the peer
in the send path.  This is an optimization that will make all EFA sends
faster.

Signed-off-by: Seth Zegelstein <szegel@amazon.com>
Since all EFA send operations use a txe, we can remove the peer lookup
from the critical path offering a performance improvement to all EFA
sends.

Signed-off-by: Seth Zegelstein <szegel@amazon.com>
@a-szegel a-szegel force-pushed the remove-peer-lookup-from-handshake branch from ca0345d to edc23c0 Compare January 3, 2024 23:33
@shijin-aws shijin-aws merged commit ff8478a into ofiwg:main Jan 4, 2024
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants