DAOS: stale connection with rxm #6660

Closed
frostedcmos opened this issue Mar 26, 2021 · 6 comments

@frostedcmos

In DAOS usage, we have a scenario where we launch 4 servers, stop 1 server, and restart it. The 3 other servers then fail to reconnect to the restarted server, even though we destroy all references to the restarted server's address once the server is stopped.

Experimentally, when we call the Mercury API that forces the address to also be removed from the AV table, we are then able to reconnect to the restarted server.

This issue only happens when we use the rxm provider; with the sockets provider the restarted server is reachable again just fine.
OFI: 1.12.0rc1
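
For reference, per the description above, the workaround ultimately boils down to evicting the stale peer from the libfabric address vector (AV). A minimal sketch of that eviction step at the OFI level, assuming the fi_addr_t of the dead rank is already known; evict_peer is an illustrative helper, not actual DAOS/Mercury code:

```c
#include <stdio.h>
#include <rdma/fi_domain.h>

/* Hypothetical helper (not DAOS/Mercury code): evict a dead peer's
 * entry from the address vector so that a later fi_av_insert() of the
 * same address creates a fresh mapping instead of resolving to stale
 * rxm connection state. */
static int evict_peer(struct fid_av *av, fi_addr_t peer)
{
    /* fi_av_remove() takes an array of addresses; remove just one. */
    int rc = fi_av_remove(av, &peer, 1, 0);
    if (rc != 0)
        fprintf(stderr, "fi_av_remove failed: %d\n", rc);
    return rc;
}
```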

@shefty (Member) commented Mar 30, 2021

This is likely related to issues reported in #6665. RxM isn't handling the situation when the msg endpoint is disconnected cleanly. It needs to remove the connection state data from its connection map (without leaving some state around that later results in a crash), so that a new connection can be formed.
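
In other words, on a clean disconnect rxm should consume the FI_SHUTDOWN event and fully release the per-peer entry. A rough sketch of that handling, where struct conn_entry and its fields are illustrative stand-ins for rxm's internal connection map, not the actual rxm code:

```c
#include <rdma/fi_endpoint.h>
#include <rdma/fi_eq.h>

/* Illustrative connection-map entry; rxm's real structures differ. */
struct conn_entry {
    struct fid_ep *msg_ep;  /* per-peer msg endpoint */
    int connected;          /* simplistic connection state */
};

static void handle_cm_event(struct fid_eq *eq, struct conn_entry *conn)
{
    struct fi_eq_cm_entry entry;
    uint32_t event;
    ssize_t ret;

    ret = fi_eq_read(eq, &event, &entry, sizeof(entry), 0);
    if (ret < 0)
        return;

    if (event == FI_SHUTDOWN && conn->msg_ep != NULL) {
        /* Peer disconnected cleanly: close the msg endpoint and clear
         * the map entry, so the next send to this peer forms a
         * brand-new connection instead of reusing the stale state
         * this issue describes. */
        fi_close(&conn->msg_ep->fid);
        conn->msg_ep = NULL;
        conn->connected = 0;
    }
}
```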

@shefty (Member) commented May 15, 2021

What do you mean by "we destroy all references to the restarted server's address"?

@frostedcmos (Author)

In Cart we initially get the URI of the endpoint for each server; upon the first RPC we resolve the Mercury hg_address from it and cache it, and that hg_address is then used for RPC sending.

In this scenario each server detects that one of the servers is dead, and each server destroys any hg_address associated with the server that went down.
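
A condensed sketch of that cache-and-evict pattern, with the structure and helper below as illustrative stand-ins for the actual Cart code:

```c
#include <stddef.h>

/* Illustrative per-rank cache entry: the URI is known up front, the
 * Mercury address is resolved lazily on the first RPC. */
struct rank_addr {
    char *uri;       /* endpoint URI obtained at startup */
    void *hg_addr;   /* cached Mercury address; NULL until resolved */
};

/* Called when a server is detected dead: drop the cached address so
 * the next RPC re-resolves it. The bug in this issue is that dropping
 * only this cached handle was not enough; the stale rxm connection
 * state underneath the AV entry survived until it was explicitly
 * removed. */
static void evict_rank(struct rank_addr *ra)
{
    if (ra->hg_addr != NULL) {
        /* release the cached Mercury address handle here */
        ra->hg_addr = NULL;
    }
}
```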

@frostedcmos (Author) commented Jul 12, 2021

Cart-level reproducer added to https://github.com/daos-stack/daos/pull/6167/files

To run:

1. Download and compile DAOS from the PR above.
2. Modify run_servers.sh: change the hostname on line 1, and comment/uncomment lines 9-16 to select the desired provider.
3. From the top directory of DAOS, run `sh ./run_servers.sh`. This starts 3 servers, with rank0 pinging rank2 every second.
4. In a separate terminal window, run `sh /tmp/relaunch.sh` (this file is re-created automatically by the test on every run). This kills the rank2 server, sleeps for 5 seconds, and restarts it with the same URI.

Expected output:
The servers in the old terminal window display errors while rank2 is restarted, but continue pinging the new rank2 server successfully after the restart. The server in the new terminal window displays that pings are received.

Observed output:
For the sockets provider, the observed output matches the expected. For verbs;ofi_rxm or tcp;ofi_rxm, the new server does not receive pings, and the senders fail to send RPCs to the newly restarted server.

@frostedcmos (Author)

Tried with 7d6d2a1; the reproducer now passes on tcp;rxm as well as verbs;rxm.

On tcp, however, while the server is killed during the reproducer run, the other servers display a flood of the following messages:

```
libfabric:72366:tcp:ep_ctrl:tcpx_cm_send_req():390 connection failure
libfabric:72366:tcp:ep_ctrl:tcpx_cm_send_req():390 connection failure
libfabric:72366:tcp:ep_ctrl:tcpx_cm_send_req():390 connection failure
...
```

Verbs does not display those.

@shefty changed the title from "Stale connection with rxm" to "DAOS: stale connection with rxm" on Sep 14, 2021
@shefty (Member) commented Sep 14, 2021

Based on the last comment, this issue has been resolved. The only concern is that there is debug output each time a connection is retried and fails. And since the clients are spinning trying to send, this can result in a lot of output. I'll see if there's an easy way to reduce the output without removing it entirely.
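
For illustration, one generic way to reduce that kind of repeated output without dropping it entirely is to rate-limit the log call. A sketch of that idea, not the actual libfabric logging code:

```c
#include <stdio.h>
#include <time.h>

/* Emit at most one "connection failure" line per 5-second window and
 * report how many messages were suppressed in between. */
static void log_conn_failure_throttled(void)
{
    static time_t last;
    static unsigned suppressed;
    time_t now = time(NULL);

    if (now - last >= 5) {
        if (suppressed > 0)
            fprintf(stderr,
                "connection failure (%u more suppressed)\n",
                suppressed);
        else
            fprintf(stderr, "connection failure\n");
        last = now;
        suppressed = 0;
    } else {
        suppressed++;
    }
}
```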

@shefty closed this as completed on Sep 14, 2021