EOS-24883: cas-client FAILED with -ETIMEOUT #1069
Root cause: Reordering the transport initialization sequence caused the issue. sock was placed before lnet, so cas_client in the UT tries to use the very first registered transport (sock), while the server selects its transport based on the endpoint format (lnet). Because the client and the server end up on different transports, the client cannot connect to the server, and we see ETIMEDOUT.
Fix: Changed the transport order back to lnet first, then sock.
Signed-off-by: Naga Kishore Kommuri <nagakishore.kommuri@seagate.com>
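The mismatch described in the root cause can be modelled with a small toy sketch. Everything here is hypothetical and only illustrative: `toy_xprt`, `client_pick`, `server_pick`, and `can_connect` are not Motr symbols, and the endpoint rule (a string starting with "inet:" means sock, anything else means lnet) is an assumption made for the example, not Motr's real endpoint parsing.

```c
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a transport descriptor; NOT Motr's struct m0_net_xprt. */
struct toy_xprt {
    const char *name;
};

static const struct toy_xprt toy_lnet = { "lnet" };
static const struct toy_xprt toy_sock = { "sock" };

/* Client side, as in the failing UT: take the first registered transport. */
static const struct toy_xprt *client_pick(const struct toy_xprt *const *reg,
                                          size_t nr)
{
    return nr > 0 ? reg[0] : NULL;
}

/*
 * Server side: the transport is derived from the endpoint string.
 * Assumed toy rule: "inet:" prefix means sock, anything else means lnet.
 */
static const struct toy_xprt *server_pick(const char *endpoint)
{
    return strncmp(endpoint, "inet:", 5) == 0 ? &toy_sock : &toy_lnet;
}

/* A connection can only succeed when both sides chose the same transport. */
static int can_connect(const struct toy_xprt *const *reg, size_t nr,
                       const char *endpoint)
{
    return client_pick(reg, nr) == server_pick(endpoint);
}
```

With the registry ordered `{ sock, lnet }` and an lnet-style endpoint, `can_connect` returns 0, which corresponds to the timeout in the UT. With `{ lnet, sock }` it returns 1, which is the effect of restoring the original order.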
Looks fine.
Why is this empty template in the PR description above? If it's not used, just delete it and get rid of the excess noise.
Hi @andriytk, deleted.
04motr-single-node/32m0t1fs-rconfc-fail-test test passed on my VM.
#m0t1fs/linux_kernel/st/m0t1fs_rconfc_fail_test.sh
End: m0t1fs rconfc fatal testing
m0t1fs-rconfc-fatal: test status: SUCCESS
@nikitadanilov, I am unable to determine whether changing this initialization order will impact anything else, or whether this is the correct way to fix the issue. Could you kindly review it?
If we are not confident in this fix, we could approach the issue from a different direction: call "m0_net_xprt_default_get" instead of taking the first item from the array of registered transports. Everywhere else UTs are using
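The alternative suggested in this comment can be sketched in isolation. This is a minimal toy model, not Motr code: `toy_registry`, `toy_register`, `toy_default_set`, and `toy_default_get` are hypothetical names, and the real signature and behaviour of `m0_net_xprt_default_get` are not shown in this thread, so nothing below should be read as its implementation. The point is only that an explicitly stored default does not depend on registration order.

```c
#include <stddef.h>

/* Toy stand-in for a transport descriptor; NOT Motr's struct m0_net_xprt. */
struct toy_xprt {
    const char *name;
};

#define TOY_XPRT_MAX 4

/* Hypothetical registry: an ordered array plus an explicit default. */
struct toy_registry {
    const struct toy_xprt *slots[TOY_XPRT_MAX];
    size_t                 nr;
    const struct toy_xprt *dflt;
};

/* Append a transport; returns 0 on success, -1 if the registry is full. */
static int toy_register(struct toy_registry *r, const struct toy_xprt *x)
{
    if (r->nr == TOY_XPRT_MAX)
        return -1;
    r->slots[r->nr++] = x;
    return 0;
}

/* Record which transport callers should use when they have no preference. */
static void toy_default_set(struct toy_registry *r, const struct toy_xprt *x)
{
    r->dflt = x;
}

/* Return the explicit default, regardless of registration order. */
static const struct toy_xprt *toy_default_get(const struct toy_registry *r)
{
    return r->dflt;
}
```

A caller that uses `toy_default_get()` keeps getting the same transport even if sock is registered before lnet, whereas a caller reading `slots[0]` would silently switch transports.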
@nkommuri, please review the fix and see whether @ivan-alekhin's suggestion can be adopted.
@nkommuri @nikitadanilov: This UT failure is not seen with libfabric PR#623, which already handles multiple transports and their sequence.
I am merging it so that the UTs run successfully before the libfab merge.