prov/efa: Use long CTS protocol if runting read protocol fails because of memory registration limits #9493

Merged 8 commits on Nov 7, 2023


sunkuamzn
Contributor

The runting read protocol can fail when the hardware's memory registration (MR) limit is reached. This PR switches to the long CTS protocol when that happens.

The first four commits are in #9432
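As a rough illustration of the fallback described above, here is a minimal sketch in C. All names (`sketch_register`, `sketch_select_proto`, `SKETCH_ENOMR`) and the counter-based "MR budget" are hypothetical stand-ins chosen for this example; they are not the provider's actual functions or error values.

```c
#include <stddef.h>

/* Hypothetical stand-ins; not the provider's real identifiers or values. */
enum sketch_proto { SKETCH_RUNTING_READ, SKETCH_LONG_CTS };
#define SKETCH_ENOMR 1 /* placeholder for an FI_ENOMR-style error code */

/* Pretend registration: fails once the made-up MR budget is spent. */
static int sketch_register(size_t *mr_left)
{
	if (*mr_left == 0)
		return -SKETCH_ENOMR;
	(*mr_left)--;
	return 0;
}

/* Try the read-based protocol first; switch to long CTS only on ENOMR. */
static enum sketch_proto sketch_select_proto(size_t *mr_left)
{
	int err = sketch_register(mr_left);

	return (err == -SKETCH_ENOMR) ? SKETCH_LONG_CTS : SKETCH_RUNTING_READ;
}
```

With a budget of one registration, the first call would pick the read-based protocol and the second would fall back to long CTS.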

/* The data_offset will be non-zero when the long CTS RTM packet
 * is sent to continue a runting read transfer after the
 * receiver has run out of memory registrations */
assert((data_offset == 0 || (ope->internal_flags & EFA_RDM_OPE_READ_NACK)) && data_size == -1);
Contributor

So for runting read, the receiver has already received some data before it tries to register the memory. If we fall back to long CTS, we restart from data_offset. Is that correct?

Contributor Author

Yes, that's correct
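The exchange above can be pictured with a small sketch, assuming that data_offset equals the number of bytes already delivered by the runting read RTM packets. The struct and function names below are hypothetical and exist only for this illustration.

```c
#include <stddef.h>

/* Hypothetical transfer state, for illustration only. */
struct sketch_xfer {
	size_t total_len;      /* full message length */
	size_t bytes_received; /* bytes already delivered by runting read RTM packets */
};

/* Offset at which a long CTS continuation would pick up. */
static size_t sketch_resume_offset(const struct sketch_xfer *x)
{
	return x->bytes_received;
}

/* Bytes still to be carried after the fallback. */
static size_t sketch_remaining(const struct sketch_xfer *x)
{
	return x->total_len - x->bytes_received;
}
```

For a 1000-byte message of which 256 bytes have arrived, the continuation would start at offset 256 with 744 bytes left to send.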

send any data with the RTM packet. This is because the runting read RTM
packets have already delivered some of the data and the long CTS RTM
packet does not have a seg_offset field */
if (txe->internal_flags & EFA_RDM_OPE_READ_NACK) {
Contributor

So this is because the long CTS RTM packet currently doesn't support sending from a non-zero offset, while data packets do support an offset and so are not affected. Do I understand correctly?

Contributor Author

Yes, exactly
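A minimal sketch of the rule discussed above: when the transfer is continuing after a read NACK, the RTM packet carries no payload, and the remaining bytes are left to data packets, which can express an offset. The flag bit `SKETCH_OPE_READ_NACK` and the function `sketch_rtm_payload` are hypothetical names for this example, not the provider's definitions.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_OPE_READ_NACK (1u << 0) /* hypothetical flag bit */

/* Payload size for the RTM packet. It is zero when continuing after a
 * read NACK, because the RTM header in this sketch has no field that
 * could say where in the message the payload belongs. */
static size_t sketch_rtm_payload(uint32_t internal_flags,
				 size_t max_payload, size_t total_len)
{
	if (internal_flags & SKETCH_OPE_READ_NACK)
		return 0;
	return total_len < max_payload ? total_len : max_payload;
}
```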

@shijin-aws shijin-aws requested a review from wzamazon November 1, 2023 00:59
@sunkuamzn sunkuamzn force-pushed the runt-read-fallback-both branch from 22e50bc to 248ed7f Compare November 1, 2023 15:49
@shijin-aws
Contributor

@wzamazon could you have a look at it?

@j-xiong j-xiong removed the for-1.20.x label Nov 6, 2023
FI_ENOMR is returned when the hardware memory registration limit is
reached

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
The READ_NACK feature is checked before sending an EFA_RDM_READ_NACK_PKT
packet. A receiver sends the EFA_RDM_READ_NACK_PKT packet when it
fails to register a buffer to receive the RDMA read data in the long read
or runting read protocol.

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
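The feature check in the commit message above might look something like the following sketch, assuming peers advertise capabilities as a bitmask. The bit position and both function names are hypothetical, and what the receiver does when the peer lacks the feature is not stated here, so the sketch only reports that the NACK cannot be sent.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_FEATURE_READ_NACK (1ull << 3) /* hypothetical bit position */

/* True if the peer advertised support for the read NACK packet. */
static bool sketch_peer_supports_read_nack(uint64_t peer_features)
{
	return (peer_features & SKETCH_FEATURE_READ_NACK) != 0;
}

/* Gate the NACK on the feature check: 0 means "may send the NACK",
 * -1 means "peer cannot handle it" (handling left to the caller). */
static int sketch_try_send_read_nack(uint64_t peer_features)
{
	if (!sketch_peer_supports_read_nack(peer_features))
		return -1;
	return 0;
}
```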
…ENOMR

Long read protocol could fail with ENOMR if the EFA provider is unable
to register the buffer with the NIC. In that case, we should fall back
to long CTS instead

This commit is for the changes when the sender fails to register the
source buffer. The sender will switch to the long CTS protocol.

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
This change is required for the long read NACK protocol, where we get the
msg_id from the ope instead of from the pke.

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Long read protocol could fail with ENOMR if the EFA provider is unable
to register the buffer with the NIC. In that case, we should fall back
to long CTS protocol.

This commit is for the changes when the receiver fails to register the
destination memory. Receiver sends a NACK packet (packet type
EFA_RDM_READ_NACK_PKT) to the sender. The sender switches to the long
CTS protocol.

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
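The receiver-side flow in the commit message above can be sketched as two small decision functions: the receiver answers with a NACK when registration fails, and the sender switches protocol when it sees that NACK. All enum and function names are hypothetical and only model the handshake, not the provider's packet handling.

```c
#include <stdbool.h>

/* Hypothetical packet and protocol tags, for illustration only. */
enum demo_pkt { DEMO_PKT_NONE, DEMO_PKT_READ_NACK };
enum demo_proto { DEMO_LONG_READ, DEMO_LONG_CTS };

/* Receiver side: if the destination buffer could not be registered,
 * reply with a read NACK instead of issuing the RDMA read. */
static enum demo_pkt demo_receiver_on_rtm(bool mr_registered)
{
	return mr_registered ? DEMO_PKT_NONE : DEMO_PKT_READ_NACK;
}

/* Sender side: a read NACK moves the transfer to long CTS;
 * any other input leaves the current protocol unchanged. */
static enum demo_proto demo_sender_on_pkt(enum demo_proto cur, enum demo_pkt pkt)
{
	return pkt == DEMO_PKT_READ_NACK ? DEMO_LONG_CTS : cur;
}
```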
…ENOMR

Runting read protocol could fail with ENOMR if the EFA provider is unable
to register the buffer with the NIC. In that case, we should fall back
to long CTS instead

This commit is for the changes when the sender fails to register the
source buffer. The sender will switch to the long CTS protocol.

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Runting read protocol could fail with ENOMR if the EFA provider is unable
to register the buffer with the NIC. In that case, we should fall back
to long CTS protocol.

This commit is for the changes when the receiver fails to register the
destination memory. Receiver sends a NACK packet (packet type
EFA_RDM_READ_NACK_PKT) to the sender. The sender switches to the long
CTS protocol.

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Signed-off-by: Sai Sunku <sunkusa@amazon.com>
3 participants