node setup failed with: ssh_exchange_identification: Connection closed by remote host #1815

Closed
1 of 3 tasks
yarongilor opened this issue Feb 13, 2020 · 4 comments
Labels
P1 Urgent

Comments

@yarongilor
Contributor

yarongilor commented Feb 13, 2020

Prerequisites

  • Are you rebased to master?
  • Is it reproducible?
  • Did you perform a cursory search to check whether this issue is already open?

Versions

  • SCT: [branch 3.3]
  • scylla: [branch 3.3]

See the full details in:
https://jenkins.scylladb.com/job/scylla-3.3/job/rolling-upgrade/job/rolling-upgrade-centos7/19/execution/node/38/log/
Failure:

22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR > GCE Cluster rolling-upgrade-3-3-centos-db-cluster-663a4156 | Image: centos-7 | Root Disk: pd-ssd 50 GB | Local SSD: 3 | Type: n1-highmem-8: Node setup failed: Node rolling-upgrade-3-3-centos-db-node-663a4156-0-2 [34.73.2.42 | 10.142.0.40] (seed: False)
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR > Traceback (most recent call last):
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >   File "/sct/sdcm/cluster.py", line 2763, in node_setup
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >     cl_inst.node_setup(node, **setup_kwargs)
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >   File "/sct/sdcm/cluster.py", line 3264, in node_setup
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >     node.scylla_setup(disks)
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >   File "/sct/sdcm/utils/common.py", line 142, in inner
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >     res = func(*args, **kwargs)
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >   File "/sct/sdcm/cluster.py", line 2007, in scylla_setup
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >     dst='/tmp/')
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >   File "/sct/sdcm/remote.py", line 471, in send_files
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >     result = LocalCmdRunner().run(scp)
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >   File "/sct/sdcm/remote.py", line 158, in run
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >     env=os.environ, replace_env=True)
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >   File "/usr/local/lib/python3.6/site-packages/fabric/connection.py", line 748, in local
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >     return super(Connection, self).run(*args, **kwargs)
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >   File "/usr/local/lib/python3.6/site-packages/invoke/context.py", line 94, in run
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >     return self._run(runner, command, **kwargs)
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >   File "/usr/local/lib/python3.6/site-packages/invoke/context.py", line 101, in _run
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >     return runner.run(command, **kwargs)
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >   File "/usr/local/lib/python3.6/site-packages/invoke/runners.py", line 291, in run
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >     return self._run_body(command, **kwargs)
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >   File "/usr/local/lib/python3.6/site-packages/invoke/runners.py", line 442, in _run_body
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR >     raise UnexpectedExit(result)
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR > invoke.exceptions.UnexpectedExit: Encountered a bad command exit code!
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR > 
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR > Command: 'scp -r -o StrictHostKeyChecking=no -o BatchMode=yes -o ConnectTimeout=300 -o ServerAliveInterval=300 -o UserKnownHostsFile=/tmp/tmp78ektk16 -P 22 -i /jenkins/.ssh/scylla-test ./configurations/io_properties.yaml \'scylla-test@[10.142.0.40]:"/tmp/"\''
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR > 
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR > Exit code: 1
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR > 
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR > Stdout:
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR > 
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR > 
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR > 
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR > Stderr:
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR > 
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR > ssh_exchange_identification: Connection closed by remote host
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR > lost connection
22:20:28  < t:2020-02-12 20:20:25,324 f:cluster.py      l:2765 c:sdcm.cluster         p:ERROR > 
@dkropachev
Collaborator

We just need to add a retry there.

@roydahan
Contributor

Yes, basically we need to add retries in all the places where we try to make connections.
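
A minimal sketch of such a retry helper, assuming a simple decorator approach (the name retry_on_failure, the attempt count, and the delay are illustrative, not SCT's actual API):

    import logging
    import time
    from functools import wraps

    LOGGER = logging.getLogger(__name__)

    def retry_on_failure(exceptions=(Exception,), attempts=3, delay=5):
        """Retry the wrapped call when it raises one of `exceptions`.

        Illustrative helper only; the real SCT implementation may differ.
        """
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                for attempt in range(1, attempts + 1):
                    try:
                        return func(*args, **kwargs)
                    except exceptions as exc:
                        if attempt == attempts:
                            raise  # out of attempts, propagate the failure
                        LOGGER.warning("attempt %d/%d failed: %s; retrying in %ds",
                                       attempt, attempts, exc, delay)
                        time.sleep(delay)
            return wrapper
        return decorator

Such a decorator could then be applied to connection-making entry points like send_files() in sdcm/remote.py.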

@bentsi
Contributor

bentsi commented Feb 17, 2020

Adding a retry will improve things a bit, but it doesn't fix the real problem: how does a reliable protocol like SSH over TCP lose the connection? The issue is in the network or in the builders.
Per my investigation, it happens on the builders that have autossh containers left running.
We need to clean up the autossh containers that are stuck or left over from previous runs...
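
A possible cleanup sketch using the Python docker SDK; the "autossh" name filter is an assumption about how those containers are named on the builders:

    # Remove leftover autossh containers from previous runs.
    # Assumes the containers carry "autossh" in their name; adjust
    # the filter to the builders' real naming scheme.
    import docker

    client = docker.from_env()
    for container in client.containers.list(all=True, filters={"name": "autossh"}):
        print(f"removing leftover container {container.name} ({container.short_id})")
        container.remove(force=True)  # force=True also stops a running container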

@roydahan roydahan added the P1 Urgent label Feb 18, 2020
@roydahan
Contributor

According to @ShlomiBalalis, he hits it as well, but maybe he is too lazy to update the issue.

I need to solve this.

bentsi pushed a commit to bentsi/scylla-cluster-tests that referenced this issue Feb 18, 2020
We can safely retry the command when it didn't run on remote.
This situation can happen when SSH/channel connection was not
successfully initiated.
Related issues: scylladb#1793, scylladb#1631, scylladb#1815
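
A hedged sketch of the retry condition that commit message describes: retry only when the failure clearly happened before the command ran on the remote host (the marker strings and helper name are illustrative, not the actual SCT code):

    from invoke.exceptions import UnexpectedExit

    # stderr markers that indicate the SSH connection was never
    # established, so the remote command cannot have run
    SAFE_TO_RETRY_MARKERS = (
        "ssh_exchange_identification",  # sshd dropped us before key exchange
        "Connection timed out",
        "Connection refused",
    )

    def command_never_ran(exc: UnexpectedExit) -> bool:
        stderr = exc.result.stderr or ""
        return any(marker in stderr for marker in SAFE_TO_RETRY_MARKERS)

A caller can wrap the command in a loop and retry only while command_never_ran() returns True; a command that actually executed and failed is re-raised instead.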
bentsi pushed a commit to bentsi/scylla-cluster-tests that referenced this issue Feb 19, 2020
We can safely retry the command when it didn't run on remote.
This situation can happen when SSH/channel connection was not
successfully initiated.
Related issues: scylladb#1793, scylladb#1631, scylladb#1815
bentsi pushed a commit that referenced this issue Feb 20, 2020
We can safely retry the command when it didn't run on remote.
This situation can happen when SSH/channel connection was not
successfully initiated.
Related issues: #1793, #1631, #1815
@bentsi bentsi closed this as completed Feb 23, 2020
bentsi pushed a commit that referenced this issue Feb 24, 2020
We can safely retry the command when it didn't run on remote.
This situation can happen when SSH/channel connection was not
successfully initiated.
Related issues: #1793, #1631, #1815

(cherry picked from commit 5503f25)
amoskong pushed a commit that referenced this issue Feb 28, 2020
We can safely retry the command when it didn't run on remote.
This situation can happen when SSH/channel connection was not
successfully initiated.
Related issues: #1793, #1631, #1815

(cherry picked from commit 5503f25)