DAOS: stale connection with rxm #6660

Closed
frostedcmos opened this issue Mar 26, 2021 · 6 comments

@frostedcmos

In DAOS usage, we have a scenario where we launch 4 servers, stop 1 server, and restart it. The 3 other servers then fail to reconnect to the restarted server, even though we destroy all references to the restarted server's address once the server is stopped.

Experimentally, when we call the Mercury API that forces the address to also be removed from the AV table, we are then able to reconnect to the restarted server.

This issue only happens when we use the rxm provider; with the sockets provider the restarted server is reachable again just fine.
OFI: 1.12.0rc1
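
For reference, per the description above, the workaround ultimately boils down to evicting the stale peer from the libfabric address vector (AV). A minimal sketch of that eviction step at the OFI level, assuming the fi_addr_t of the dead rank is already known; evict_peer is an illustrative helper, not actual DAOS/Mercury code:

```c
#include <stdio.h>
#include <rdma/fi_domain.h>

/* Hypothetical helper (not DAOS/Mercury code): evict a dead peer's
 * entry from the address vector so that a later fi_av_insert() of the
 * same address creates a fresh mapping instead of resolving to stale
 * rxm connection state. */
static int evict_peer(struct fid_av *av, fi_addr_t peer)
{
    /* fi_av_remove() takes an array of addresses; remove just one. */
    int rc = fi_av_remove(av, &peer, 1, 0);
    if (rc != 0)
        fprintf(stderr, "fi_av_remove failed: %d\n", rc);
    return rc;
}
```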

@shefty (Member) commented Mar 30, 2021

This is likely related to issues reported in #6665. RxM isn't handling the situation when the msg endpoint is disconnected cleanly. It needs to remove the connection state data from its connection map (without leaving some state around that later results in a crash), so that a new connection can be formed.
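
In other words, on a clean disconnect rxm should consume the FI_SHUTDOWN event and fully release the per-peer entry. A rough sketch of that handling, where struct conn_entry and its fields are illustrative stand-ins for rxm's internal connection map, not the actual rxm code:

```c
#include <rdma/fi_endpoint.h>
#include <rdma/fi_eq.h>

/* Illustrative connection-map entry; rxm's real structures differ. */
struct conn_entry {
    struct fid_ep *msg_ep;  /* per-peer msg endpoint */
    int connected;          /* simplistic connection state */
};

static void handle_cm_event(struct fid_eq *eq, struct conn_entry *conn)
{
    struct fi_eq_cm_entry entry;
    uint32_t event;
    ssize_t ret;

    ret = fi_eq_read(eq, &event, &entry, sizeof(entry), 0);
    if (ret < 0)
        return;

    if (event == FI_SHUTDOWN && conn->msg_ep != NULL) {
        /* Peer disconnected cleanly: close the msg endpoint and clear
         * the map entry, so the next send to this peer forms a
         * brand-new connection instead of reusing the stale state
         * this issue describes. */
        fi_close(&conn->msg_ep->fid);
        conn->msg_ep = NULL;
        conn->connected = 0;
    }
}
```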

@shefty (Member) commented May 15, 2021

What do you mean by "we destroy all references to the restarted server's address"?

@frostedcmos (Author)

In Cart we initially get the URI of the endpoint for each server; upon the first RPC we resolve the Mercury hg_address from it and cache it, and that hg_address is then used for RPC sending.

In this scenario each server detects that one of the servers is dead, and each server destroys any hg_address associated with the server that went down.
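
A condensed sketch of that cache-and-evict pattern, with the structure and helper below as illustrative stand-ins for the actual Cart code:

```c
#include <stddef.h>

/* Illustrative per-rank cache entry: the URI is known up front, the
 * Mercury address is resolved lazily on the first RPC. */
struct rank_addr {
    char *uri;       /* endpoint URI obtained at startup */
    void *hg_addr;   /* cached Mercury address; NULL until resolved */
};

/* Called when a server is detected dead: drop the cached address so
 * the next RPC re-resolves it. The bug in this issue is that dropping
 * only this cached handle was not enough; the stale rxm connection
 * state underneath the AV entry survived until it was explicitly
 * removed. */
static void evict_rank(struct rank_addr *ra)
{
    if (ra->hg_addr != NULL) {
        /* release the cached Mercury address handle here */
        ra->hg_addr = NULL;
    }
}
```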

@frostedcmos (Author) commented Jul 12, 2021

Cart-level reproducer added to https://github.com/daos-stack/daos/pull/6167/files

To run:

1. Download and compile DAOS from the PR above.
2. Modify run_servers.sh: change the hostname on line 1, and comment/uncomment lines 9-16 to select the desired provider.
3. From the top directory of DAOS, run `sh ./run_servers.sh`. This starts 3 servers, with rank0 pinging rank2 every second.
4. In a separate terminal window, run `sh /tmp/relaunch.sh` (this file is re-created automatically by the test on every run). This kills the rank2 server, sleeps for 5 seconds, and restarts it with the same URI.

Expected output:
The servers in the old terminal window display errors while rank2 is restarted, but continue pinging the new rank2 server successfully after the restart. The server in the new terminal window displays that pings are received.

Observed output:
For the sockets provider, the observed output matches the expected. For verbs;ofi_rxm or tcp;ofi_rxm, the new server does not receive pings, and the senders fail to send RPCs to the newly restarted server.

@frostedcmos (Author)

Tried with 7d6d2a1; the reproducer now passes on tcp;rxm as well as verbs;rxm.

On tcp, however, while the server is killed during the reproducer run, the other servers display a flood of the following messages:

```
libfabric:72366:tcp:ep_ctrl:tcpx_cm_send_req():390 connection failure
libfabric:72366:tcp:ep_ctrl:tcpx_cm_send_req():390 connection failure
libfabric:72366:tcp:ep_ctrl:tcpx_cm_send_req():390 connection failure
...
```

Verbs does not display those.

@shefty changed the title from "Stale connection with rxm" to "DAOS: stale connection with rxm" on Sep 14, 2021
@shefty (Member) commented Sep 14, 2021

Based on the last comment, this issue has been resolved. The only concern is that there is debug output each time a connection is retried and fails. And since the clients are spinning trying to send, this can result in a lot of output. I'll see if there's an easy way to reduce the output without removing it entirely.
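
For illustration, one generic way to reduce that kind of repeated output without dropping it entirely is to rate-limit the log call. A sketch of that idea, not the actual libfabric logging code:

```c
#include <stdio.h>
#include <time.h>

/* Emit at most one "connection failure" line per 5-second window and
 * report how many messages were suppressed in between. */
static void log_conn_failure_throttled(void)
{
    static time_t last;
    static unsigned suppressed;
    time_t now = time(NULL);

    if (now - last >= 5) {
        if (suppressed > 0)
            fprintf(stderr,
                "connection failure (%u more suppressed)\n",
                suppressed);
        else
            fprintf(stderr, "connection failure\n");
        last = now;
        suppressed = 0;
    } else {
        suppressed++;
    }
}
```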

@shefty closed this as completed on Sep 14, 2021