DAOS: stale connection with rxm #6660
Comments
This is likely related to the issues reported in #6665. RxM isn't cleanly handling the case where the msg endpoint is disconnected. It needs to remove the connection state data from its connection map (without leaving behind state that later results in a crash), so that a new connection can be formed.
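The cleanup described above can be illustrated with a toy connection map. This is only a sketch of the idea, not RxM's actual data structures: `conn_map`, `conn_entry`, `conn_map_get`, and `conn_map_on_shutdown` are hypothetical names, and the `generation` counter exists purely to make reconnection observable.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_PEERS 8

/* Hypothetical per-peer connection state; not libfabric's real layout. */
enum conn_state { CONN_IDLE, CONN_CONNECTED };

struct conn_entry {
    enum conn_state state;
    int generation; /* how many times a connection to this peer was formed */
};

struct conn_map {
    struct conn_entry peers[MAX_PEERS];
};

/* Reset every slot to "no connection". */
void conn_map_init(struct conn_map *map)
{
    for (size_t i = 0; i < MAX_PEERS; i++) {
        map->peers[i].state = CONN_IDLE;
        map->peers[i].generation = 0;
    }
}

/* Return the connection for a peer, forming a new one if none is live. */
struct conn_entry *conn_map_get(struct conn_map *map, size_t peer)
{
    if (peer >= MAX_PEERS)
        return NULL;

    struct conn_entry *e = &map->peers[peer];
    if (e->state == CONN_IDLE) {
        e->state = CONN_CONNECTED;
        e->generation++;
    }
    return e;
}

/*
 * Handle a disconnect event: mark the slot idle so the next send forms a
 * fresh connection. Returns false if there was nothing to tear down, which
 * keeps a duplicate event from touching already-released state.
 */
bool conn_map_on_shutdown(struct conn_map *map, size_t peer)
{
    if (peer >= MAX_PEERS || map->peers[peer].state == CONN_IDLE)
        return false;

    map->peers[peer].state = CONN_IDLE;
    return true;
}
```

In this model, calling `conn_map_get` for a peer after `conn_map_on_shutdown` bumps `generation`, which corresponds to a new connection being formed instead of a stale one being reused.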
What do you mean by "we destroy all mentions of restarted servers address"?
In CaRT, we initially get the URI of the endpoint for each server. On the first RPC we resolve the Mercury hg_address from that URI and cache it; the Mercury hg_address is then used for sending RPCs. In this scenario, each server detects that one of the servers is dead and destroys any hg_address associated with the server that went down.
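The resolve-on-first-RPC, cache, and destroy-on-death pattern described above might look roughly like the following. This is a simplified model, not CaRT or Mercury code: `addr_cache`, `addr_cache_resolve`, and `addr_cache_evict` are hypothetical names, and an integer handle stands in for the resolved address.

```c
#include <stdio.h>
#include <string.h>

#define MAX_RANKS 4
#define URI_LEN 64

/* One slot per server rank (hypothetical layout). */
struct addr_slot {
    char uri[URI_LEN]; /* endpoint URI obtained up front */
    int handle;        /* stand-in for the resolved address; 0 = unresolved */
};

struct addr_cache {
    struct addr_slot slots[MAX_RANKS];
    int next_handle; /* source of fresh handle values */
    int lookups;     /* number of resolutions actually performed */
};

void addr_cache_init(struct addr_cache *c)
{
    memset(c, 0, sizeof(*c));
    c->next_handle = 1;
}

/* Record the URI for a rank; returns 0 on success, -1 on a bad rank. */
int addr_cache_set_uri(struct addr_cache *c, int rank, const char *uri)
{
    if (rank < 0 || rank >= MAX_RANKS)
        return -1;
    snprintf(c->slots[rank].uri, URI_LEN, "%s", uri);
    return 0;
}

/* Resolve on the first RPC to a rank, then reuse the cached handle. */
int addr_cache_resolve(struct addr_cache *c, int rank)
{
    if (rank < 0 || rank >= MAX_RANKS || c->slots[rank].uri[0] == '\0')
        return -1;

    if (c->slots[rank].handle == 0) {
        c->slots[rank].handle = c->next_handle++;
        c->lookups++;
    }
    return c->slots[rank].handle;
}

/* Called when a rank is declared dead: drop the cached handle only. */
void addr_cache_evict(struct addr_cache *c, int rank)
{
    if (rank >= 0 && rank < MAX_RANKS)
        c->slots[rank].handle = 0;
}
```

Note that eviction here only clears the cache at this layer. Whether lower layers keep their own state for the same peer is a separate question, which is what the AV table observation in the issue description points at.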
CaRT-level reproducer added to https://github.com/daos-stack/daos/pull/6167/files
To run: download and compile DAOS from the PR above.
Expected output:
Observed output:
Tried with 7d6d2a1; the reproducer now passes on tcp;rxm as well as verbs;rxm. On TCP, however, when a server is killed during the reproducer run, the other servers print a flood of the following message:
libfabric:72366:tcp:ep_ctrl:tcpx_cm_send_req():390 connection failure
Verbs does not print these.
Based on the last comment, this issue has been resolved. The only remaining concern is that debug output is printed each time a connection is retried and fails. Since the clients are spinning trying to send, this can produce a lot of output. I'll see if there's an easy way to reduce the output without removing it entirely.
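One generic way to reduce repeated output without removing it is to throttle by occurrence count. The sketch below logs only the 1st, 2nd, 4th, 8th, ... failure; it is an illustration of that technique and not what libfabric does. `log_throttle`, `throttle_should_log`, and `report_conn_failure` are hypothetical names.

```c
#include <stdbool.h>
#include <stdio.h>

/* Counts occurrences of one repeated event (hypothetical helper). */
struct log_throttle {
    unsigned long count;
};

/*
 * Record one occurrence and return true only when the running count is a
 * power of two, so the output grows logarithmically with the retry count.
 */
bool throttle_should_log(struct log_throttle *t)
{
    unsigned long n = ++t->count;
    return (n & (n - 1)) == 0;
}

/* Report a failed connection attempt, subject to the throttle. */
void report_conn_failure(struct log_throttle *t)
{
    if (throttle_should_log(t))
        fprintf(stderr, "connection failure (seen %lu times)\n", t->count);
}
```

With this scheme, 1000 consecutive failures would produce 10 messages (counts 1 through 512) rather than 1000, while the first failure is still reported immediately.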
In DAOS usage, we have a scenario where we launch 4 servers, then stop 1 server and restart it.
The 3 other servers then fail to reconnect to the restarted server, even though we destroy all mentions of the restarted server's address once the server is stopped.
Experimentally, when we call the Mercury API that forces the address to also be removed from the AV table, we are able to reconnect to the restarted server.
This issue only happens with the rxm provider; with the sockets provider, the restarted server is reachable without problems.
OFI: 1.12.0rc1