prov/rxm: fixes to handling completion errors #6707

Merged: 4 commits, Apr 27, 2021

shefty (Member) commented Apr 22, 2021

These changes are based on a code inspection of the states for tx operations. For transfers generated for internal protocol operations, discard all completions. Do not report failures to the application, which lacks any context for the operation. Add completion error handling for missing states. In a couple of cases, mark bugs in the state machine where races impact error handling (affects write rendezvous and atomics).

Most of these changes impact the processing of transfers that complete in error. As a result, testing the changes is difficult.
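The policy described above can be sketched as a small completion dispatcher. This is only an illustration, assuming a simplified set of states: the `sketch_` names, the counter-based CQ, and the return codes are hypothetical and are not taken from the rxm provider source. It shows application transfers being reported (success or error), completions for inject and internal protocol transfers being discarded even when they fail, and an unrecognized state being flagged instead of silently ignored.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical transfer states; illustrative names only, not the
 * provider's actual state enum. */
enum sketch_tx_state {
	SKETCH_TX,           /* send posted by the application */
	SKETCH_INJECT_TX,    /* emulated inject, app expects no completion */
	SKETCH_INTERNAL_TX,  /* protocol message generated by the provider */
	SKETCH_STATE_MAX     /* sentinel, never a valid state */
};

/* Stand-in for a completion queue: just counts what would be reported. */
struct sketch_cq {
	int comp_cnt;   /* successful completions reported to the app */
	int err_cnt;    /* error completions reported to the app */
	int discarded;  /* completions consumed inside the provider */
};

/* Returns 0 when the completion was handled, -1 for an unknown state. */
static int sketch_handle_comp(struct sketch_cq *cq,
			      enum sketch_tx_state state, bool failed)
{
	switch (state) {
	case SKETCH_TX:
		/* The app posted this transfer, so it sees the outcome. */
		if (failed)
			cq->err_cnt++;
		else
			cq->comp_cnt++;
		return 0;
	case SKETCH_INJECT_TX:
	case SKETCH_INTERNAL_TX:
		/* The app has no context for these; discard success and
		 * failure alike. */
		cq->discarded++;
		return 0;
	default:
		/* A missing state is an error, not a silent no-op. */
		fprintf(stderr, "unexpected tx state %d\n", (int) state);
		return -1;
	}
}
```

Calling `sketch_handle_comp(&cq, SKETCH_INTERNAL_TX, true)` in this sketch increments only `discarded`, so a failed internal transfer never reaches the application's error count.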

shefty added 3 commits April 22, 2021 08:39
The emulated inject path sets the hdr.state to TX.  This can
result in the completion being reported to the application.
Use the INJECT_TX state instead, so we can trap for the
completion and discard it.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>

Remove duplicate setting of the hdr.state and unnecessary
progress call.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>

Do not generate CQ entries that are returned to the
application for transfers that are associated with
internal sends.  Provide error handling for missing
transfer states.

To prepare this patch a full analysis of the rxm tx/rx
states was done.  Minor code cleanups occur around the
state handling locations.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>

When establishing a loopback connection, it's possible for the
rxm_cmap_handle fi_addr and peer fields to both be valid.
However, the rxm_conn_handle_notify() path assumes that only
one is valid.  The result is that the cmap->handles_av[]
can be left with a stale entry referencing a freed handle.

Check to see if the handle's fi_addr is valid directly and
remove it from the av prior to freeing the handle.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
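The stale-entry fix in the last commit can be pictured with a reduced model of the handle table. This is a sketch under stated assumptions: the struct layouts, `SKETCH_ADDR_NOTAVAIL`, the fixed table size, and `sketch_free_handle()` are all hypothetical stand-ins, not the actual rxm_cmap code. The point it illustrates is checking the handle's fi_addr directly, rather than inferring its validity from whether the peer field is set, and clearing the table slot before the handle is freed.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical "no address" marker and table size for this sketch. */
#define SKETCH_ADDR_NOTAVAIL UINT64_MAX
#define SKETCH_AV_SIZE 8

struct sketch_handle {
	uint64_t fi_addr;  /* index into handles_av, or SKETCH_ADDR_NOTAVAIL */
	void *peer;        /* non-NULL when also tracked as a peer */
};

struct sketch_cmap {
	struct sketch_handle *handles_av[SKETCH_AV_SIZE];
};

/* Free a handle, first removing any table entry that references it. */
static void sketch_free_handle(struct sketch_cmap *cmap,
			       struct sketch_handle *handle)
{
	/* For a loopback connection both fi_addr and peer may be valid,
	 * so test fi_addr itself instead of assuming peer != NULL means
	 * the handle is absent from the table. */
	if (handle->fi_addr != SKETCH_ADDR_NOTAVAIL &&
	    handle->fi_addr < SKETCH_AV_SIZE &&
	    cmap->handles_av[handle->fi_addr] == handle)
		cmap->handles_av[handle->fi_addr] = NULL;

	free(handle);
}
```

With this ordering, a handle that has both fields set leaves a NULL slot behind rather than a dangling pointer to freed memory.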