
EOS-24883: cas-client FAILED with -ETIMEOUT #1069

Merged 2 commits on Oct 1, 2021

Conversation

@nkommuri commented on Sep 23, 2021

Root cause: reordering of the transport initialization sequence caused the issue. sock was placed before lnet, so the cas_client in the UT tries to use the very first transport (sock) while the server uses the transport determined by the endpoint format (lnet). That is why we see ETIMEDOUT: the client cannot connect to the server because the two sides are using different transports.

Fix: changed the transport order back to lnet first, then sock.

Signed-off-by: Naga Kishore Kommuri <nagakishore.kommuri@seagate.com>
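
To illustrate the mismatch described above, here is a purely illustrative, self-contained sketch; the struct, array, and helper names below are hypothetical stand-ins, not the motr registration code. Only the transport names (lnet, sock) and the LNet-style endpoint format are taken from this PR. A client that blindly takes the first registered transport agrees with a server that resolves its transport from the endpoint format only when lnet is registered first:

/*
 * Illustrative sketch only: struct xprt, registered[], client_pick_first()
 * and server_pick_by_endpoint() are hypothetical stand-ins, not motr code.
 * With the fixed order ("lnet" first) the two sides agree; if "sock" were
 * registered first, the client would pick sock while the server, resolving
 * the "@tcp:" LNet-style endpoint, would still use lnet, and the connect
 * would time out with -ETIMEDOUT as seen in the UT.
 */
#include <stdio.h>
#include <string.h>

struct xprt {
        const char *name;
};

/* Registration order under discussion: the fix restores lnet before sock. */
static const struct xprt registered[] = { { "lnet" }, { "sock" } };

/* What the UT client effectively does: take the first registered entry. */
static const struct xprt *client_pick_first(void)
{
        return &registered[0];
}

/* What the server does: pick the transport matching the endpoint format. */
static const struct xprt *server_pick_by_endpoint(const char *ep)
{
        const char *want = strstr(ep, "@tcp:") != NULL ? "lnet" : "sock";
        size_t      i;

        for (i = 0; i < sizeof registered / sizeof registered[0]; ++i)
                if (strcmp(registered[i].name, want) == 0)
                        return &registered[i];
        return NULL;
}

int main(void)
{
        const char        *ep = "192.168.55.241@tcp:12345:33:901";
        const struct xprt *c  = client_pick_first();
        const struct xprt *s  = server_pick_by_endpoint(ep);

        printf("client=%s server=%s match=%s\n",
               c->name, s->name, strcmp(c->name, s->name) == 0 ? "yes" : "no");
        return 0;
}
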
@huanghua78 left a comment

Looks fine.

@andriytk
Contributor

Why is this empty template in the PR description above? If it's not used - just delete and get rid of the excess noise.

@cortx-admin

Jenkins CI Result: Motr#766

Motr Test Summary

❌ Failed: 1

04motr-single-node/32m0t1fs-rconfc-fail-test

🏁 Skipped: 5

01motr-single-node/37protocol
04motr-single-node/31motr-sys-kvs-kernel-test
05motr-single-node/07multi-client-tests
05motr-single-node/12sns-cc-io-tests
05motr-single-node/39mount-fail

✔️ Passed: 58

01motr-single-node/00userspace-tests
01motr-single-node/01kernel-tests
01motr-single-node/03initscript-tests
01motr-single-node/04net-tests
01motr-single-node/05rpcping-tests
01motr-single-node/06console-tests
01motr-single-node/28motr-sys-kvs-test
01motr-single-node/29confgen-test
01motr-single-node/30hagen-test
01motr-single-node/35m0singlenode
01motr-single-node/46m0crate
01motr-single-node/52singlenode-sanity
02motr-single-node/02system-tests
02motr-single-node/08degraded-mode-tests
02motr-single-node/09poolmach-tests
02motr-single-node/20pool-version-test
02motr-single-node/21spiel-test
02motr-single-node/22rpc-cancel-test
02motr-single-node/23m0d-signal-test
02motr-single-node/24m0d-fsync-test
02motr-single-node/41motr-conf-update
02motr-single-node/47motr-tests-user-kernel
02motr-single-node/51kem
02motr-single-node/53clusterusage-alert
03motr-single-node/16conf-st-test
03motr-single-node/17sss-st-test
03motr-single-node/36spare-reservation
03motr-single-node/54sns-repair-motr-1f
03motr-single-node/55sns-repair-motr-1n-1f
03motr-single-node/56sns-repair-motr-mf
03motr-single-node/57sns-repair-motr-1k-1f
03motr-single-node/58sns-repair-motr-ios-fail
03motr-single-node/59sns-repair-motr-abort
03motr-single-node/60sns-repair-motr-abort-quiesce
03motr-single-node/61spiel-multi-confd
04motr-single-node/19sns-abort-test
04motr-single-node/25spiel-sns-repair-test
04motr-single-node/26spiel-sns-repair-quiesce-test
04motr-single-node/27sns-repair-io-fail-test
04motr-single-node/33motr-st
04motr-single-node/34sns-repair-1n-1f
04motr-single-node/48motr-raid0-io
04motr-single-node/49motr-rpc-cancel
04motr-single-node/50motr-rm-lock-cc-io
04motr-single-node/51motr-rmw
05motr-single-node/10sns-single-tests
05motr-single-node/11sns-multi-tests
05motr-single-node/13sns-repair-quiesce
05motr-single-node/14m0t1fs-fsync-test
05motr-single-node/15m0t1fs-fwait-test
05motr-single-node/18m0mount-test
05motr-single-node/38sns-abort-quiesce-test
05motr-single-node/40motr-dgmode
05motr-single-node/42motr-utils
05motr-single-node/43motr-sync-replication
05motr-single-node/44motr-sns-repair
05motr-single-node/45motr-sns-repair-N-1
motr-single-node/62dix-repair-st-lookup-insert

Total: 64

CppCheck Summary

   Cppcheck: No new warnings found 👍

@nkommuri
Author

Why is this empty template in the PR description above? If it's not used - just delete and get rid of the excess noise.

Hi @andriytk, deleted.

@nkommuri
Author

04motr-single-node/32m0t1fs-rconfc-fail-test test passed on my VM.

  • id : 32m0t1fs-rconfc-fail-test
    script : m0t1fs_rconfc_fail_test.sh
    dir : src/m0t1fs/linux_kernel/st/
    executor : Xperior::Executor::MotrTest
    sandbox : /var/motr/sandbox.32m0t1fs-rconfc-fail-test
    groupname: 04motr-single-node
    polltime : 30
    timeout : 1800

#m0t1fs/linux_kernel/st/m0t1fs_rconfc_fail_test.sh
...
...
stopping /root/eos-24883/cortx-motr/motr/m0d processes...
=== pids of services: 56659 57567 57683 57830 ===
Shutting down services one by one. mdservice is the last.
----- 56659 stopping--------lt-m0d: got signal 1
motr[57830]: a440 ERROR [reqh/reqh.c:454:m0_reqh_fop_allow] <! rc=-111
motr[57830]: a530 WARN [reqh/reqh.c:561:m0_reqh_fop_handle] fop "Credit Revoke"@0x7fe888015bf0 disallowed: -111.
motr[57830]: a530 ERROR [reqh/reqh.c:569:m0_reqh_fop_handle] <! rc=-111
motr[56659]: 8b40 ERROR [fop/fom_generic.c:93:m0_rpc_item_generic_reply_rc] Receiver reported error: -111 "No service running."
motr[56659]: 8b80 ERROR [rm/rm_fops.c:489:rm_revoke_ast] revoke request 0x1813670 failed: rc=-111
----- 56659 stopped --------
----- 57567 stopping--------lt-m0d: got signal 1
motr[57567]: 8860 WARN [ha/link.c:1513:ha_link_outgoing_fom_tick] rlk_rc=-110 endpoint=192.168.55.241@tcp:12345:33:901
motr[57567]: 8860 WARN [ha/link.c:1513:ha_link_outgoing_fom_tick] rlk_rc=-110 endpoint=192.168.55.241@tcp:12345:33:902
----- 57567 stopped --------
----- 57683 stopping--------lt-m0d: got signal 1
motr[57683]: 1860 WARN [ha/link.c:1285:ha_link_outgoing_fop_replied] rc=-110 nr=1 hl=0x1ba9360 ep=192.168.55.241@tcp:12345:34:1 lq_tags=(confirmed=108 delivered=108 next=108 assign=134)
motr[57683]: 1860 WARN [ha/link.c:1289:ha_link_outgoing_fop_replied] old_rc=0 old_nr=56 hl=0x1ba9360 ep=192.168.55.241@tcp:12345:34:1
motr[57683]: 1c10 WARN [ha/entrypoint.c:563:ha_entrypoint_client_fom_tick] rlk_rc=-110
----- 57683 stopped --------
----- 57830 stopping--------lt-m0d: got signal 1
motr[57830]: 1860 WARN [ha/link.c:1285:ha_link_outgoing_fop_replied] rc=-110 nr=1 hl=0x20ef4e0 ep=192.168.55.241@tcp:12345:34:1 lq_tags=(confirmed=50 delivered=50 next=50 assign=138)
motr[57830]: 1860 WARN [ha/link.c:1289:ha_link_outgoing_fop_replied] old_rc=0 old_nr=25 hl=0x20ef4e0 ep=192.168.55.241@tcp:12345:34:1
motr[57830]: 1c10 WARN [ha/entrypoint.c:563:ha_entrypoint_client_fom_tick] rlk_rc=-110
----- 57830 stopped --------
m0tr 14049540 0
galois 22944 1 m0tr
lnet 591262 3 m0tr,ksocklnd
Motr services stopped.


End: m0t1fs rconfc fatal testing


m0t1fs-rconfc-fatal: test status: SUCCESS

@mehjoshi

@nikitadanilov, I am unable to determine whether changing this initialization order will impact anything else, or whether this is the correct way to fix the issue. Could you kindly review it?

@ivan-alekhin
Contributor

I am unable to determine whether changing this initialization order will impact anything else, or whether this is the correct way to fix the issue. Could you kindly review it?

If we are not confident in this fix, we could approach the issue from a different direction: we just need to call m0_net_xprt_default_get instead of taking the first item from the array of registered transports. Everywhere else, the UTs use m0_net_xprt_default_get whenever they need to initialize a dummy RPC client (see the usage of m0_rpc_client_start). So I guess that would be the "right way" to solve this issue (at least it would make the DIX/CAS UTs use the same approach as the other UTs).
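
For reference, a minimal sketch of what that suggestion could look like in the UT client setup. cas_client_ut_net_init() and cas_ut_net_dom are hypothetical names, and the two-argument m0_net_domain_init(dom, xprt) call is an assumption; only m0_net_xprt_default_get() and m0_rpc_client_start come from the comment above:

/*
 * Sketch only: cas_client_ut_net_init() and cas_ut_net_dom are hypothetical
 * UT-local names, and m0_net_domain_init() is assumed to take (domain, xprt).
 */
#include "net/net.h"   /* m0_net_xprt_default_get(), struct m0_net_domain */

static struct m0_net_domain cas_ut_net_dom;

static int cas_client_ut_net_init(void)
{
        /*
         * Instead of taking the first item from the array of registered
         * transports (which made the client speak sock while the server,
         * chosen by the endpoint format, spoke lnet), ask for the configured
         * default transport, as the other UTs do before m0_rpc_client_start().
         */
        struct m0_net_xprt *xprt = m0_net_xprt_default_get();

        return m0_net_domain_init(&cas_ut_net_dom, xprt);
}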

@mehjoshi

@nkommuri, please review the fix and see if @ivan-alekhin's suggestion can be adopted.

@madhavemuri
Contributor

@nkommuri @nikitadanilov: This UT failure is not seen with the libfabric PR#623, which already handles multiple transports and their initialization sequence.
@cc: @upendrapatwardhan

@madhavemuri
Contributor

I am merging it so that the UTs run successfully before the libfab merge.

@madhavemuri madhavemuri merged commit 3a8bac8 into Seagate:main Oct 1, 2021
mehjoshi pushed a commit that referenced this pull request Oct 8, 2021