Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"paramiko.ssh_exception.AuthenticationException: Authentication timeout " on remoter run #1793

Closed
2 of 3 tasks
yarongilor opened this issue Feb 10, 2020 · 7 comments
Closed
2 of 3 tasks
Assignees
Labels
P1 Urgent

Comments

@yarongilor
Copy link
Contributor

yarongilor commented Feb 10, 2020

  • Are you rebased to master ?
  • Is it reproducible ?
  • Did you perform a cursory search if this issue isn't opened ?

Versions

  • SCT: branch-3.3

see all test details here:
scylladb/scylladb#5740

sct log had the following failures:

< t:2020-02-05 01:24:21,772 f:remote.py       l:253  c:sdcm.remote          p:ERROR > Authentication timeout.
< t:2020-02-05 01:24:21,827 f:nemesis.py      l:642  c:sdcm.nemesis         p:ERROR > sdcm.nemesis.ChaosMonkey: ('Exception in random_disrupt_method %s: %s', 'restart_then_repair_node', AuthenticationException('Authentication timeout.',))
< t:2020-02-05 01:24:21,828 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR > sdcm.nemesis.ChaosMonkey: Unhandled exception in method <function ChaosMonkey.disrupt at 0x7f02af88c6a8> < t:2020-02-05 01:24:21,828 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR > sdcm.nemesis.ChaosMonkey: Unhandled exception in method <function ChaosMonkey.disrupt at 0x7f02af88c6a8>
< t:2020-02-05 01:24:21,828 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR > Traceback (most recent call last):
< t:2020-02-05 01:24:21,828 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >   File "/sct/sdcm/nemesis.py", line 1564, in wrapper
< t:2020-02-05 01:24:21,828 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >     result = method(*args, **kwargs)
< t:2020-02-05 01:24:21,828 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >   File "/sct/sdcm/nemesis.py", line 1762, in disrupt
< t:2020-02-05 01:24:21,828 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >     self.call_random_disrupt_method()
< t:2020-02-05 01:24:21,828 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >   File "/sct/sdcm/nemesis.py", line 639, in call_random_disrupt_method
< t:2020-02-05 01:24:21,828 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >     disrupt_method()
< t:2020-02-05 01:24:21,828 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >   File "/sct/sdcm/nemesis.py", line 243, in disrupt_restart_then_repair_node
< t:2020-02-05 01:24:21,828 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >     self.target_node.restart()
< t:2020-02-05 01:24:21,828 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >   File "/sct/sdcm/cluster_aws.py", line 652, in restart
< t:2020-02-05 01:24:21,828 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >     self.stop_scylla_server(verify_down=False)
< t:2020-02-05 01:24:21,828 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >   File "/sct/sdcm/utils/common.py", line 100, in inner
< t:2020-02-05 01:24:21,828 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >     return func(*args, **kwargs)
< t:2020-02-05 01:24:21,828 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >   File "/sct/sdcm/cluster.py", line 2250, in stop_scylla_server
< t:2020-02-05 01:24:21,828 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >     self.remoter.run('sudo systemctl stop scylla-server.service', timeout=timeout, ignore_status=ignore_status)
< t:2020-02-05 01:24:21,828 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >   File "/sct/sdcm/remote.py", line 261, in run
< t:2020-02-05 01:24:21,828 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >     result = _run()
< t:2020-02-05 01:24:21,828 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >   File "/sct/sdcm/utils/common.py", line 95, in inner
< t:2020-02-05 01:24:21,828 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >     return func(*args, **kwargs)
< t:2020-02-05 01:24:21,828 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >   File "/sct/sdcm/remote.py", line 248, in _run
< t:2020-02-05 01:24:21,828 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >     watchers=watchers, timeout=timeout)
< t:2020-02-05 01:24:21,828 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >   File "<decorator-gen-3>", line 2, in run
< t:2020-02-05 01:24:21,828 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >   File "/usr/local/lib/python3.6/site-packages/fabric/connection.py", line 29, in opens
< t:2020-02-05 01:24:21,828 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >     self.open()
< t:2020-02-05 01:24:21,828 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >   File "/usr/local/lib/python3.6/site-packages/fabric/connection.py", line 634, in open
< t:2020-02-05 01:24:21,828 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >     self.client.connect(**kwargs)
< t:2020-02-05 01:24:21,828 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >   File "/usr/local/lib/python3.6/site-packages/paramiko/client.py", line 446, in connect
< t:2020-02-05 01:24:21,828 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >     passphrase,
< t:2020-02-05 01:24:21,828 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >   File "/usr/local/lib/python3.6/site-packages/paramiko/client.py", line 764, in _auth
< t:2020-02-05 01:24:21,828 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >     raise saved_exception
< t:2020-02-05 01:24:21,828 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >   File "/usr/local/lib/python3.6/site-packages/paramiko/client.py", line 664, in _auth
< t:2020-02-05 01:24:21,828 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >     self._transport.auth_publickey(username, pkey)
< t:2020-02-05 01:24:21,828 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >   File "/usr/local/lib/python3.6/site-packages/paramiko/transport.py", line 1580, in auth_publickey
< t:2020-02-05 01:24:21,828 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >     return self.auth_handler.wait_for_response(my_event)
< t:2020-02-05 01:24:21,828 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >   File "/usr/local/lib/python3.6/site-packages/paramiko/auth_handler.py", line 240, in wait_for_response
< t:2020-02-05 01:24:21,828 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >     raise AuthenticationException("Authentication timeout.")
< t:2020-02-05 01:24:21,828 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR > paramiko.ssh_exception.AuthenticationException: Authentication timeout.
< t:2020-02-05 01:24:21,835 f:nemesis.py      l:1586 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.ChaosMonkey: RestartThenRepairNode Node longevity-large-partitions-200k-pks-db-node-dd772ace-3 [52.16.14.95 | 10.0.16.116] (seed: False) duration -> 256 s
< t:2020-02-05 01:24:22,198 f:sct_events.py   l:659  c:sdcm.sct_events      p:INFO  > 2020-02-05 01:24:22.186: (DisruptionEvent Severity.ERROR): type=end name=RestartThenRepairNode node=Node longevity-large-partitions-200k-pks-db-node-dd772ace-3 [52.16.14.95 | 10.0.16.116] (seed: False) duration=256 error=Authentication timeout.
< t:2020-02-05 01:24:22,198 f:sct_events.py   l:659  c:sdcm.sct_events      p:INFO  > Traceback (most recent call last):
< t:2020-02-05 01:24:22,198 f:sct_events.py   l:659  c:sdcm.sct_events      p:INFO  >   File "/sct/sdcm/nemesis.py", line 1564, in wrapper
< t:2020-02-05 01:24:22,198 f:sct_events.py   l:659  c:sdcm.sct_events      p:INFO  >     result = method(*args, **kwargs)
< t:2020-02-05 01:24:22,198 f:sct_events.py   l:659  c:sdcm.sct_events      p:INFO  >   File "/sct/sdcm/nemesis.py", line 1762, in disrupt
< t:2020-02-05 01:24:22,198 f:sct_events.py   l:659  c:sdcm.sct_events      p:INFO  >     self.call_random_disrupt_method()
< t:2020-02-05 01:24:22,198 f:sct_events.py   l:659  c:sdcm.sct_events      p:INFO  >   File "/sct/sdcm/nemesis.py", line 639, in call_random_disrupt_method
< t:2020-02-05 01:24:22,198 f:sct_events.py   l:659  c:sdcm.sct_events      p:INFO  >     disrupt_method()
< t:2020-02-05 01:24:22,198 f:sct_events.py   l:659  c:sdcm.sct_events      p:INFO  >   File "/sct/sdcm/nemesis.py", line 243, in disrupt_restart_then_repair_node
< t:2020-02-05 01:24:22,198 f:sct_events.py   l:659  c:sdcm.sct_events      p:INFO  >     self.target_node.restart()
< t:2020-02-05 01:24:22,198 f:sct_events.py   l:659  c:sdcm.sct_events      p:INFO  >   File "/sct/sdcm/cluster_aws.py", line 652, in restart
< t:2020-02-05 01:24:22,198 f:sct_events.py   l:659  c:sdcm.sct_events      p:INFO  >     self.stop_scylla_server(verify_down=False)
< t:2020-02-05 01:24:22,198 f:sct_events.py   l:659  c:sdcm.sct_events      p:INFO  >   File "/sct/sdcm/utils/common.py", line 100, in inner
< t:2020-02-05 01:24:22,198 f:sct_events.py   l:659  c:sdcm.sct_events      p:INFO  >     return func(*args, **kwargs)
< t:2020-02-05 01:24:22,198 f:sct_events.py   l:659  c:sdcm.sct_events      p:INFO  >   File "/sct/sdcm/cluster.py", line 2250, in stop_scylla_server
< t:2020-02-05 01:24:22,198 f:sct_events.py   l:659  c:sdcm.sct_events      p:INFO  >     self.remoter.run('sudo systemctl stop scylla-server.service', timeout=timeout, ignore_status=ignore_status)
< t:2020-02-05 01:24:22,198 f:sct_events.py   l:659  c:sdcm.sct_events      p:INFO  >   File "/sct/sdcm/remote.py", line 261, in run
< t:2020-02-05 01:24:22,198 f:sct_events.py   l:659  c:sdcm.sct_events      p:INFO  >     result = _run()
< t:2020-02-05 01:24:22,198 f:sct_events.py   l:659  c:sdcm.sct_events      p:INFO  >   File "/sct/sdcm/utils/common.py", line 95, in inner
< t:2020-02-05 01:24:22,198 f:sct_events.py   l:659  c:sdcm.sct_events      p:INFO  >     return func(*args, **kwargs)
< t:2020-02-05 01:24:22,198 f:sct_events.py   l:659  c:sdcm.sct_events      p:INFO  >   File "/sct/sdcm/remote.py", line 248, in _run
< t:2020-02-05 01:24:22,198 f:sct_events.py   l:659  c:sdcm.sct_events      p:INFO  >     watchers=watchers, timeout=timeout)
< t:2020-02-05 01:24:22,198 f:sct_events.py   l:659  c:sdcm.sct_events      p:INFO  >   File "<decorator-gen-3>", line 2, in run
< t:2020-02-05 01:24:22,198 f:sct_events.py   l:659  c:sdcm.sct_events      p:INFO  >   File "/usr/local/lib/python3.6/site-packages/fabric/connection.py", line 29, in opens
< t:2020-02-05 01:24:22,198 f:sct_events.py   l:659  c:sdcm.sct_events      p:INFO  >     self.open()
< t:2020-02-05 01:24:22,198 f:sct_events.py   l:659  c:sdcm.sct_events      p:INFO  >   File "/usr/local/lib/python3.6/site-packages/fabric/connection.py", line 634, in open
< t:2020-02-05 01:24:22,198 f:sct_events.py   l:659  c:sdcm.sct_events      p:INFO  >     self.client.connect(**kwargs)
< t:2020-02-05 01:24:22,198 f:sct_events.py   l:659  c:sdcm.sct_events      p:INFO  >   File "/usr/local/lib/python3.6/site-packages/paramiko/client.py", line 446, in connect
< t:2020-02-05 01:24:22,198 f:sct_events.py   l:659  c:sdcm.sct_events      p:INFO  >     passphrase,
< t:2020-02-05 01:24:22,198 f:sct_events.py   l:659  c:sdcm.sct_events      p:INFO  >   File "/usr/local/lib/python3.6/site-packages/paramiko/client.py", line 764, in _auth
< t:2020-02-05 01:24:22,198 f:sct_events.py   l:659  c:sdcm.sct_events      p:INFO  >     raise saved_exception
< t:2020-02-05 01:24:22,198 f:sct_events.py   l:659  c:sdcm.sct_events      p:INFO  >   File "/usr/local/lib/python3.6/site-packages/paramiko/client.py", line 664, in _auth
< t:2020-02-05 01:24:22,198 f:sct_events.py   l:659  c:sdcm.sct_events      p:INFO  >     self._transport.auth_publickey(username, pkey)
< t:2020-02-05 01:24:22,198 f:sct_events.py   l:659  c:sdcm.sct_events      p:INFO  >   File "/usr/local/lib/python3.6/site-packages/paramiko/transport.py", line 1580, in auth_publickey
< t:2020-02-05 01:24:22,198 f:sct_events.py   l:659  c:sdcm.sct_events      p:INFO  >     return self.auth_handler.wait_for_response(my_event)
< t:2020-02-05 01:24:22,198 f:sct_events.py   l:659  c:sdcm.sct_events      p:INFO  >   File "/usr/local/lib/python3.6/site-packages/paramiko/auth_handler.py", line 240, in wait_for_response
< t:2020-02-05 01:24:22,198 f:sct_events.py   l:659  c:sdcm.sct_events      p:INFO  >     raise AuthenticationException("Authentication timeout.")
< t:2020-02-05 01:24:22,198 f:sct_events.py   l:659  c:sdcm.sct_events      p:INFO  > paramiko.ssh_exception.AuthenticationException: Authentication timeout.

@roydahan
Copy link
Contributor

@bentsi can you please look at it?
I would like to avoid having such issues, what can be fixed here?

@bentsi
Copy link
Contributor

bentsi commented Feb 10, 2020

the server was stuck for 5 seconds for some reason, nothing happend between 01:24:17 and 01:24:22

2020-02-05T01:24:17+00:00  longevity-large-partitions-200k-pks-db-node-dd772ace-3 !INFO    | scylla: [shard 2] compaction - Compacted 2 sstables to [/var/lib/scylla/data/system_schema/scylla_tables-5d912ff1f7593
665b2c88042ab5103dd/mc-206-big-Data.db:level=0, ]. 11013 bytes to 5664 (~51% of original) in 8685ms = 0.00MB/s. ~256 total partitions merged to 1.
2020-02-05T01:24:17+00:00  longevity-large-partitions-200k-pks-db-node-dd772ace-3 !INFO    | scylla: [shard 2] compaction - Compacted 2 sstables to [/var/lib/scylla/data/system_schema/keyspaces-abac5682dea631c5b
535b3d6cffd0fb6/mc-206-big-Data.db:level=0, ]. 10910 bytes to 5609 (~51% of original) in 8377ms = 0.00MB/s. ~256 total partitions merged to 1.
2020-02-05T01:24:17+00:00  longevity-large-partitions-200k-pks-db-node-dd772ace-3 !INFO    | scylla: [shard 2] compaction - Compacted 2 sstables to [/var/lib/scylla/data/system_schema/tables-afddfb9dbc1e30688056
eed6c302ba09/mc-206-big-Data.db:level=0, ]. 14200 bytes to 7254 (~51% of original) in 8377ms = 0.00MB/s. ~256 total partitions merged to 1.
2020-02-05T01:24:22+00:00  longevity-large-partitions-200k-pks-db-node-dd772ace-3 !INFO    | sshd[4771]: Accepted publickey for centos from 10.0.188.99 port 45020 ssh2: RSA SHA256:g7OmDbWC8qLUZHxjMohQe3LsQHfcOJbG1ZHqodB6smc
2020-02-05T01:24:22+00:00  longevity-large-partitions-200k-pks-db-node-dd772ace-3 !INFO    | systemd-logind: New session 8 of user centos.
2020-02-05T01:24:22+00:00  longevity-large-partitions-200k-pks-db-node-dd772ace-3 !INFO    | systemd: Started Session 8 of user centos.
2020-02-05T01:24:22+00:00  longevity-large-partitions-200k-pks-db-node-dd772ace-3 !INFO    | sshd[4771]: pam_unix(sshd:session): session opened for user centos by (uid=0)
2020-02-05T01:24:22+00:00  longevity-large-partitions-200k-pks-db-node-dd772ace-3 !NOTICE  | sudo:  centos : TTY=unknown ; PWD=/home/centos ; USER=root ; COMMAND=/bin/coredumpctl --no-pager --no-legend
2020-02-05T01:24:22+00:00  longevity-large-partitions-200k-pks-db-node-dd772ace-3 !INFO    | sudo: pam_unix(sudo:session): session opened for user root by (uid=0)
2020-02-05T01:24:22+00:00  longevity-large-partitions-200k-pks-db-node-dd772ace-3 !INFO    | sudo: pam_unix(sudo:session): session closed for user root

and we have a timeout exception paramiko.ssh_exception.AuthenticationException: Authentication timeout.
this is probably due to some network issue or the server was frozen

@roydahan
Copy link
Contributor

@bentsi we need to be able to workaround such issues in SCT.
remoter should at least retry.
We have may connections during a longevity, we can't fail the test due to such issue.

@yarongilor
Copy link
Contributor Author

yarongilor commented Feb 17, 2020

reproduced in 3.3 2TB longevity and failed all the stress.

< t:2020-02-14 00:48:01,380 f:nemesis.py      l:642  c:sdcm.nemesis         p:ERROR > sdcm.nemesis.DisruptiveMonkey: ('Exception in random_disrupt_method %s: %s', 'restart_then_repair_node', AuthenticationException('Authentication 
timeout.',))
< t:2020-02-14 00:48:01,381 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR > sdcm.nemesis.DisruptiveMonkey: Unhandled exception in method <function DisruptiveMonkey.disrupt at 0x7f4c94fd71e0> < t:2020-02-14 00:48:01,381 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR > sdcm.nemesis.DisruptiveMonkey: Unhandled exception in method <function DisruptiveMonkey.disrupt at 0x7f4c94fd71e0>
< t:2020-02-14 00:48:01,381 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR > Traceback (most recent call last):
< t:2020-02-14 00:48:01,381 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >   File "/sct/sdcm/nemesis.py", line 1564, in wrapper
< t:2020-02-14 00:48:01,381 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >     result = method(*args, **kwargs)
< t:2020-02-14 00:48:01,381 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >   File "/sct/sdcm/nemesis.py", line 2108, in disrupt
< t:2020-02-14 00:48:01,381 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >     self.call_random_disrupt_method(disrupt_methods=self.disrupt_methods_list)
< t:2020-02-14 00:48:01,381 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >   File "/sct/sdcm/nemesis.py", line 639, in call_random_disrupt_method
< t:2020-02-14 00:48:01,381 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >     disrupt_method()
< t:2020-02-14 00:48:01,381 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >   File "/sct/sdcm/nemesis.py", line 243, in disrupt_restart_then_repair_node
< t:2020-02-14 00:48:01,381 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >     self.target_node.restart()
< t:2020-02-14 00:48:01,381 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >   File "/sct/sdcm/cluster_aws.py", line 652, in restart
< t:2020-02-14 00:48:01,381 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >     self.stop_scylla_server(verify_down=False)
< t:2020-02-14 00:48:01,381 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >   File "/sct/sdcm/utils/common.py", line 100, in inner
< t:2020-02-14 00:48:01,381 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >     return func(*args, **kwargs)
< t:2020-02-14 00:48:01,381 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >   File "/sct/sdcm/cluster.py", line 2250, in stop_scylla_server
< t:2020-02-14 00:48:01,381 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >     self.remoter.run('sudo systemctl stop scylla-server.service', timeout=timeout, ignore_status=ignore_status)
< t:2020-02-14 00:48:01,381 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >   File "/sct/sdcm/remote.py", line 261, in run
< t:2020-02-14 00:48:01,381 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >     result = _run()
< t:2020-02-14 00:48:01,381 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >   File "/sct/sdcm/utils/common.py", line 95, in inner
< t:2020-02-14 00:48:01,381 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >     return func(*args, **kwargs)
< t:2020-02-14 00:48:01,381 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >   File "/sct/sdcm/remote.py", line 248, in _run
< t:2020-02-14 00:48:01,381 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >     watchers=watchers, timeout=timeout)
< t:2020-02-14 00:48:01,381 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >   File "<decorator-gen-3>", line 2, in run
< t:2020-02-14 00:48:01,381 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >   File "/usr/local/lib/python3.6/site-packages/fabric/connection.py", line 29, in opens
< t:2020-02-14 00:48:01,381 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >     self.open()
< t:2020-02-14 00:48:01,381 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >   File "/usr/local/lib/python3.6/site-packages/fabric/connection.py", line 634, in open
< t:2020-02-14 00:48:01,381 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >     self.client.connect(**kwargs)
< t:2020-02-14 00:48:01,381 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >   File "/usr/local/lib/python3.6/site-packages/paramiko/client.py", line 446, in connect
< t:2020-02-14 00:48:01,381 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >     passphrase,
< t:2020-02-14 00:48:01,381 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >   File "/usr/local/lib/python3.6/site-packages/paramiko/client.py", line 764, in _auth
< t:2020-02-14 00:48:01,381 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >     raise saved_exception
< t:2020-02-14 00:48:01,381 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >   File "/usr/local/lib/python3.6/site-packages/paramiko/client.py", line 664, in _auth
< t:2020-02-14 00:48:01,381 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >     self._transport.auth_publickey(username, pkey)
< t:2020-02-14 00:48:01,381 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >   File "/usr/local/lib/python3.6/site-packages/paramiko/transport.py", line 1580, in auth_publickey
< t:2020-02-14 00:48:01,381 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >     return self.auth_handler.wait_for_response(my_event)
< t:2020-02-14 00:48:01,381 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >   File "/usr/local/lib/python3.6/site-packages/paramiko/auth_handler.py", line 240, in wait_for_response
< t:2020-02-14 00:48:01,381 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR >     raise AuthenticationException("Authentication timeout.")
< t:2020-02-14 00:48:01,381 f:nemesis.py      l:1574 c:sdcm.nemesis         p:ERROR > paramiko.ssh_exception.AuthenticationException: Authentication timeout.
< t:2020-02-14 00:48:01,389 f:nemesis.py      l:1586 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.DisruptiveMonkey: RestartThenRepairNode Node longevity-tls-2tb-4d-1dis-2nondis-3-db-node-4f126310-2 [34.245.102.2 | 10.0.223.236] (seed: False) duration -> 250 s

test details:
+-------------------------------------------------------------------------------------------------------------------------------------------------+
| Collected logs by test-id: 4f126310-d8b1-4830-a5e6-4545206ece2a Directory: /home/yarongilor/sct-results/20200213-105034-795744/collected_logs |
+--------------+----------------------------------------------------------------------------------------------------------------------------------+
| Cluster set | Link |
+--------------+----------------------------------------------------------------------------------------------------------------------------------+
| db-cluster | https://cloudius-jenkins-test.s3.amazonaws.com/4f126310-d8b1-4830-a5e6-4545206ece2a/20200216_104031/db-cluster-4f126310.zip |
| monitor-set | https://cloudius-jenkins-test.s3.amazonaws.com/4f126310-d8b1-4830-a5e6-4545206ece2a/20200216_104031/monitor-set-4f126310.zip |
| loader-set | https://cloudius-jenkins-test.s3.amazonaws.com/4f126310-d8b1-4830-a5e6-4545206ece2a/20200216_104031/loader-set-4f126310.zip |
| sct-runner | None |
+--------------+----------------------------------------------------------------------------------------------------------------------------------+

@yarongilor
Copy link
Contributor Author

yarongilor commented Feb 17, 2020

we may try to think on a general solution to eliminate the severe impact of such similar issues.
#1: add retries to the ssh connection.
#2: if not resolved after retries, we can optionally run a kind of terminate and replace operation for this down and disconnected node.
#3 or the healthcheck operation before nemesis can verify scylla-server status and try running start for any stopped nodes.

@roydahan roydahan added the P1 Urgent label Feb 17, 2020
@roydahan
Copy link
Contributor

@bentsi This one as well

@bentsi
Copy link
Contributor

bentsi commented Feb 18, 2020

@yarongilor your last example is from your local run, there is no sct.log and dblog of the relevant node is cut (due to logs_transport: ssh)

bentsi pushed a commit to bentsi/scylla-cluster-tests that referenced this issue Feb 18, 2020
We can safely retry the command when it didn't run on remote.
This situation can happen when SSH/channel connection was not
successfully initiated.
Related issues: scylladb#1793, scylladb#1631, scylladb#1815
bentsi pushed a commit to bentsi/scylla-cluster-tests that referenced this issue Feb 19, 2020
We can safely retry the command when it didn't run on remote.
This situation can happen when SSH/channel connection was not
successfully initiated.
Related issues: scylladb#1793, scylladb#1631, scylladb#1815
bentsi pushed a commit that referenced this issue Feb 20, 2020
We can safely retry the command when it didn't run on remote.
This situation can happen when SSH/channel connection was not
successfully initiated.
Related issues: #1793, #1631, #1815
@bentsi bentsi closed this as completed Feb 23, 2020
bentsi pushed a commit that referenced this issue Feb 24, 2020
We can safely retry the command when it didn't run on remote.
This situation can happen when SSH/channel connection was not
successfully initiated.
Related issues: #1793, #1631, #1815

(cherry picked from commit 5503f25)
amoskong pushed a commit that referenced this issue Feb 28, 2020
We can safely retry the command when it didn't run on remote.
This situation can happen when SSH/channel connection was not
successfully initiated.
Related issues: #1793, #1631, #1815

(cherry picked from commit 5503f25)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P1 Urgent
Projects
None yet
Development

No branches or pull requests

3 participants