prov/rxm: minor fixes to connection management #6953

Merged
merged 4 commits into from
Aug 3, 2021

Conversation

ooststep
Contributor

First, this addresses an issue where we weren't comparing the correct address structure (thanks @aingerson).

This exposed a second issue, in loopback tracking and cleanup, that causes problems at shutdown. We were losing track of the recipient side of a loopback connection and not closing and releasing it as we should have been.

@j-xiong
Contributor

j-xiong commented Jul 29, 2021

The "merge" commit looks suspicious. Need a local rebase first?

ooststep and others added 3 commits July 30, 2021 07:01
loopback connections are not necessarily indexed in the destination,
so track them so that we can properly clean them up when we shut down.
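
A minimal sketch of the tracking idea described above, not the actual rxm code: the receiving side of each loopback connection is pushed onto a separate list so that shutdown can find and release it even when no destination index refers to it. The names `struct conn`, `struct conn_map`, `track_loopback`, and `close_all_loopback` are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical connection object; only the list linkage is modeled. */
struct conn {
    struct conn *next_loopback;
};

/* Hypothetical per-endpoint state holding the extra loopback list. */
struct conn_map {
    struct conn *loopback_list;
};

/* Record the receiving side of a loopback connection so it is not lost. */
static struct conn *track_loopback(struct conn_map *map)
{
    struct conn *conn;

    assert(map != NULL);
    conn = calloc(1, sizeof(*conn));
    if (!conn)
        return NULL;
    conn->next_loopback = map->loopback_list;
    map->loopback_list = conn;
    return conn;
}

/* At shutdown, release every tracked loopback connection; returns the count. */
static int close_all_loopback(struct conn_map *map)
{
    int closed = 0;

    assert(map != NULL);
    while (map->loopback_list) {
        struct conn *conn = map->loopback_list;

        map->loopback_list = conn->next_loopback;
        free(conn); /* a real provider would close the endpoint first */
        closed++;
    }
    return closed;
}
```

Calling `close_all_loopback` a second time is harmless in this sketch because the list is emptied as it is walked.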

Signed-off-by: Stephen Oost <stephen.oost@intel.com>
Signed-off-by: Stephen Oost <stephen.oost@intel.com>
The address comparison to determine whether to reject or keep a duplicate
connection was incorrect and falsely identifying too many loopback
connections. The address to compare against should be the local EP's
address, not the peer's address.

This fix exposes an issue with the reject/close path, which was not being
taken before. When both sides sent a connection request simultaneously,
a race could occur: one side received a connection request, accepted it,
and closed its own outgoing connection. However, the peer had already
received that outgoing request, so it incorrectly identified the closed
connection as a lost connection, which caused send failures for any
outstanding sends on the active connection.

This patch could potentially cause issues in how lost connections are
managed, but testing to date has not shown any issues... yet...

Signed-off-by: aingerson <alexia.ingerson@intel.com>
If the error entry is not initialized, the core provider
may try to save into err_data if err_data_size is set to
a bogus value indicating there is space when there is not.
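
The hazard above can be sketched with a stand-in structure. `struct err_entry` and `core_readerr` below are hypothetical and model only the `err_data` and `err_data_size` fields named in the commit message. The stand-in core provider trusts `err_data_size`, so the caller must zero-initialize the entry unless it really supplies a buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a CQ error entry. */
struct err_entry {
    int err;
    void *err_data;
    size_t err_data_size;
};

static const char core_detail[] = "link down";

/*
 * Hypothetical core-provider read. It copies error detail into the
 * caller's buffer whenever err_data_size claims there is room, which
 * is why garbage in that field is dangerous. Returns bytes copied.
 */
static size_t core_readerr(struct err_entry *entry)
{
    size_t n;

    assert(entry != NULL);
    entry->err = 5; /* arbitrary error code for the sketch */
    n = entry->err_data_size < sizeof(core_detail) ?
        entry->err_data_size : sizeof(core_detail);
    if (n > 0)
        memcpy(entry->err_data, core_detail, n);
    entry->err_data_size = n;
    return n;
}

/* Caller without a buffer: zero-initialize so no copy is attempted. */
static size_t read_error_no_buffer(struct err_entry *out)
{
    struct err_entry entry = {0};
    size_t n = core_readerr(&entry);

    *out = entry;
    return n;
}

/* Caller with a buffer: set both fields explicitly after zeroing. */
static size_t read_error_with_buffer(char *buf, size_t len)
{
    struct err_entry entry = {0};

    entry.err_data = buf;
    entry.err_data_size = len;
    return core_readerr(&entry);
}
```

Declaring the entry without `= {0}` would leave `err_data` and `err_data_size` holding stack garbage, and the copy in `core_readerr` could then write through an invalid pointer.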

Signed-off-by: Stephen Oost <stephen.oost@intel.com>
Signed-off-by: aingerson <alexia.ingerson@intel.com>
@j-xiong j-xiong removed the evaluating label Aug 3, 2021
@j-xiong j-xiong merged commit 7d6d2a1 into ofiwg:main Aug 3, 2021
@ooststep ooststep deleted the cm_fixes branch October 13, 2022 21:02
3 participants