node setup failed with: ssh_exchange_identification: Connection closed by remote host #1815

Closed
1 of 3 tasks
yarongilor opened this issue Feb 13, 2020 · 4 comments
Labels
P1 Urgent

Comments

@yarongilor
Contributor

yarongilor commented Feb 13, 2020

Prerequisites

  • Are you rebased to master?
  • Is it reproducible?
  • Did you perform a cursory search to check whether this issue is already open?

Versions

  • SCT: [branch 3.3]
  • scylla: [branch 3.3]

See the full details in:
https://jenkins.scylladb.com/job/scylla-3.3/job/rolling-upgrade/job/rolling-upgrade-centos7/19/execution/node/38/log/
Failure:

22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR > GCE Cluster rolling-upgrade-3-3-centos-db-cluster-663a4156 | Image: centos-7 | Root Disk: pd-ssd 50 GB | Local SSD: 3 | Type: n1-highmem-8: Node setup failed: Node rolling-upgrade-3-3-centos-db-node-663a4156-0-2 [34.73.2.42 | 10.142.0.40] (seed: False)
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR > Traceback (most recent call last):
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >   File "/sct/sdcm/cluster.py", line 2763, in node_setup
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >     cl_inst.node_setup(node, **setup_kwargs)
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >   File "/sct/sdcm/cluster.py", line 3264, in node_setup
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >     node.scylla_setup(disks)
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >   File "/sct/sdcm/utils/common.py", line 142, in inner
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >     res = func(*args, **kwargs)
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >   File "/sct/sdcm/cluster.py", line 2007, in scylla_setup
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >     dst='/tmp/')
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >   File "/sct/sdcm/remote.py", line 471, in send_files
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >     result = LocalCmdRunner().run(scp)
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >   File "/sct/sdcm/remote.py", line 158, in run
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >     env=os.environ, replace_env=True)
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >   File "/usr/local/lib/python3.6/site-packages/fabric/connection.py", line 748, in local
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >     return super(Connection, self).run(*args, **kwargs)
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >   File "/usr/local/lib/python3.6/site-packages/invoke/context.py", line 94, in run
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >     return self._run(runner, command, **kwargs)
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >   File "/usr/local/lib/python3.6/site-packages/invoke/context.py", line 101, in _run
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >     return runner.run(command, **kwargs)
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >   File "/usr/local/lib/python3.6/site-packages/invoke/runners.py", line 291, in run
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >     return self._run_body(command, **kwargs)
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >   File "/usr/local/lib/python3.6/site-packages/invoke/runners.py", line 442, in _run_body
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >     raise UnexpectedExit(result)
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR > invoke.exceptions.UnexpectedExit: Encountered a bad command exit code!
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR > 
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR > Command: 'scp -r -o StrictHostKeyChecking=no -o BatchMode=yes -o ConnectTimeout=300 -o ServerAliveInterval=300 -o UserKnownHostsFile=/tmp/tmp78ektk16 -P 22 -i /jenkins/.ssh/scylla-test ./configurations/io_properties.yaml \'scylla-test@[10.142.0.40]:"/tmp/"\''
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR > 
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR > Exit code: 1
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR > 
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR > Stdout:
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR > 
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR > 
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR > 
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR > Stderr:
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR > 
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR > ssh_exchange_identification: Connection closed by remote host
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR > lost connection
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR > 
@dkropachev
Collaborator

We just need to add a retry there.

@roydahan
Contributor

Yes, basically we need to add retries in all the places where we try to make connections.
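
A minimal sketch of such a retry helper, assuming a simple decorator approach (the name retry_on_failure, the attempt count, and the delay are illustrative, not SCT's actual API):

    import logging
    import time
    from functools import wraps

    LOGGER = logging.getLogger(__name__)

    def retry_on_failure(exceptions=(Exception,), attempts=3, delay=5):
        """Retry the wrapped call when it raises one of `exceptions`.

        Illustrative helper only; the real SCT implementation may differ.
        """
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                for attempt in range(1, attempts + 1):
                    try:
                        return func(*args, **kwargs)
                    except exceptions as exc:
                        if attempt == attempts:
                            raise  # out of attempts, propagate the failure
                        LOGGER.warning("attempt %d/%d failed: %s; retrying in %ds",
                                       attempt, attempts, exc, delay)
                        time.sleep(delay)
            return wrapper
        return decorator

Such a decorator could then be applied to connection-making entry points like send_files() in sdcm/remote.py.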

@bentsi
Contributor

bentsi commented Feb 17, 2020

Adding a retry will improve things a bit, but it doesn't fix the real problem: how does a reliable protocol like SSH over TCP lose the connection? The issue is in the network or in the builders.
Per my investigation, it happens on the builders that have autossh containers left running.
We need to clean up the autossh containers that are stuck or left over from previous runs...
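
A possible cleanup sketch using the Python docker SDK; the "autossh" name filter is an assumption about how those containers are named on the builders:

    # Remove leftover autossh containers from previous runs.
    # Assumes the containers carry "autossh" in their name; adjust
    # the filter to the builders' real naming scheme.
    import docker

    client = docker.from_env()
    for container in client.containers.list(all=True, filters={"name": "autossh"}):
        print(f"removing leftover container {container.name} ({container.short_id})")
        container.remove(force=True)  # force=True also stops a running container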

@roydahan roydahan added the P1 Urgent label Feb 18, 2020
@roydahan
Contributor

According to @ShlomiBalalis, he hits it as well, but maybe he is too lazy to update the issue.

I need to solve this.

bentsi pushed a commit to bentsi/scylla-cluster-tests that referenced this issue Feb 18, 2020
We can safely retry the command when it didn't run on remote.
This situation can happen when SSH/channel connection was not
successfully initiated.
Related issues: scylladb#1793, scylladb#1631, scylladb#1815
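
A hedged sketch of the retry condition that commit message describes: retry only when the failure clearly happened before the command ran on the remote host (the marker strings and helper name are illustrative, not the actual SCT code):

    from invoke.exceptions import UnexpectedExit

    # stderr markers that indicate the SSH connection was never
    # established, so the remote command cannot have run
    SAFE_TO_RETRY_MARKERS = (
        "ssh_exchange_identification",  # sshd dropped us before key exchange
        "Connection timed out",
        "Connection refused",
    )

    def command_never_ran(exc: UnexpectedExit) -> bool:
        stderr = exc.result.stderr or ""
        return any(marker in stderr for marker in SAFE_TO_RETRY_MARKERS)

A caller can wrap the command in a loop and retry only while command_never_ran() returns True; a command that actually executed and failed is re-raised instead.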
bentsi pushed a commit to bentsi/scylla-cluster-tests that referenced this issue Feb 19, 2020
We can safely retry the command when it didn't run on remote.
This situation can happen when SSH/channel connection was not
successfully initiated.
Related issues: scylladb#1793, scylladb#1631, scylladb#1815
bentsi pushed a commit that referenced this issue Feb 20, 2020
We can safely retry the command when it didn't run on remote.
This situation can happen when SSH/channel connection was not
successfully initiated.
Related issues: #1793, #1631, #1815
@bentsi bentsi closed this as completed Feb 23, 2020
bentsi pushed a commit that referenced this issue Feb 24, 2020
We can safely retry the command when it didn't run on remote.
This situation can happen when SSH/channel connection was not
successfully initiated.
Related issues: #1793, #1631, #1815

(cherry picked from commit 5503f25)
amoskong pushed a commit that referenced this issue Feb 28, 2020
We can safely retry the command when it didn't run on remote.
This situation can happen when SSH/channel connection was not
successfully initiated.
Related issues: #1793, #1631, #1815

(cherry picked from commit 5503f25)