SSH Transport: Channel seems to be in a superposition of open and closed #4940

Open
dev-zero opened this issue May 12, 2021 · 3 comments

@dev-zero
Contributor

Describe the bug

After launching a couple of hundred workchains via the daemon (each of which in turn launches a number of calculations), about 10% of the uploads fail with this error:

15770  33m ago    Cp2kCalculation         ⏸ Waiting        Pausing after failed transport task: upload_calculation failed 5 times consecutively

and verdi process report shows:

+-> ERROR at 2021-05-12 17:59:36.582154+02:00
 | Traceback (most recent call last):
 |   File "/home/tiziano/work/aiida/aiida_core/aiida/engine/transports.py", line 110, in request_transport
 |     yield transport_request.future
 |   File "/home/tiziano/work/aiida/aiida_core/aiida/engine/processes/calcjobs/tasks.py", line 89, in do_upload
 |     execmanager.upload_calculation(node, transport, calc_info, folder)
 |   File "/home/tiziano/work/aiida/aiida_core/aiida/engine/daemon/execmanager.py", line 102, in upload_calculation
 |     remote_user = transport.whoami()
 |   File "/home/tiziano/work/aiida/aiida_core/aiida/transports/transport.py", line 707, in whoami
 |     retval, username, stderr = self.exec_command_wait(command)
 |   File "/home/tiziano/work/aiida/aiida_core/aiida/transports/plugins/ssh.py", line 1300, in exec_command_wait
 |     ssh_stdin, stdout, stderr, channel = self._exec_command_internal(command, combine_stderr, bufsize=bufsize)
 |   File "/home/tiziano/work/aiida/aiida_core/aiida/transports/plugins/ssh.py", line 1263, in _exec_command_internal
 |     channel = self.sshclient.get_transport().open_session()
 |   File "/home/tiziano/work/aiida/aiida_core/aiida/transports/plugins/ssh.py", line 490, in sshclient
 |     raise TransportInternalError('Error, ssh method called for SshTransport without opening the channel first')
 | aiida.transports.transport.TransportInternalError: Error, ssh method called for SshTransport without opening the channel first
 | 
 | During handling of the above exception, another exception occurred:
 | 
 | Traceback (most recent call last):
 |   File "/home/tiziano/work/aiida/aiida_core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
 |     result = await coro()
 |   File "/home/tiziano/work/aiida/aiida_core/aiida/engine/processes/calcjobs/tasks.py", line 92, in do_upload
 |     return skip_submit
 |   File "/usr/lib/python3.9/contextlib.py", line 135, in __exit__
 |     self.gen.throw(type, value, traceback)
 |   File "/home/tiziano/work/aiida/aiida_core/aiida/engine/transports.py", line 126, in request_transport
 |     transport_request.future.result().close()
 |   File "/home/tiziano/work/aiida/aiida_core/aiida/transports/plugins/ssh.py", line 481, in close
 |     raise InvalidOperation('Cannot close the transport: it is already closed')
 | aiida.common.exceptions.InvalidOperation: Cannot close the transport: it is already closed
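
The chained traceback above has a recognizable shape: a call on a transport that is no longer open fails first, and the cleanup step that tries to close the same transport then fails as well, so the second error is the one that surfaces. The following is a minimal sketch of that pattern, not AiiDA's actual code. `FakeTransport`, `upload`, `reproduce` and the simplified `request_transport` are hypothetical stand-ins, and the two exception classes only mirror the names seen in the traceback.

```python
from contextlib import contextmanager


class TransportInternalError(Exception):
    """Stand-in for aiida.transports.transport.TransportInternalError."""


class InvalidOperation(Exception):
    """Stand-in for aiida.common.exceptions.InvalidOperation."""


class FakeTransport:
    """Hypothetical transport that tracks only an open/closed flag."""

    def __init__(self):
        self._is_open = False

    def open(self):
        self._is_open = True

    def close(self):
        if not self._is_open:
            raise InvalidOperation('Cannot close the transport: it is already closed')
        self._is_open = False

    def whoami(self):
        if not self._is_open:
            raise TransportInternalError('ssh method called without opening the channel first')
        return 'user'


@contextmanager
def request_transport(transport):
    """Simplified assumption: hand out a transport and always close it afterwards."""
    try:
        yield transport
    finally:
        # Raises a second exception if something else already closed the transport.
        transport.close()


def upload(transport):
    with request_transport(transport) as active:
        return active.whoami()


def reproduce():
    transport = FakeTransport()
    transport.open()
    transport.close()  # simulates another code path closing the shared transport
    try:
        upload(transport)
    except InvalidOperation as exc:
        return exc
    return None
```

Calling `reproduce()` returns the `InvalidOperation` raised during cleanup, with the original `TransportInternalError` attached as its `__context__`. This is what Python reports as "During handling of the above exception, another exception occurred". The sketch does not explain why the real transport was closed early; it only shows how the two errors end up chained.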

Your environment

  • Operating system: Linux
  • Python version: 3.9.5
  • aiida-core version: 1.6.2
@dev-zero
Contributor Author

About the SSH setup: the key comes from a token/agent, but it stays in the cache for a long time.

@dev-zero
Contributor Author

Possibly related: at some point I also got the following warning:

Warning: 121% of the available daemon worker slots have been used!
Warning: Increase the number of workers with 'verdi daemon incr'.

@dev-zero
Contributor Author

Simply replaying the tasks did not help as they repeatedly ended up in the same state again with the same error message.
Only after a restart of the daemon were they able to be replayed.

dev-zero added a commit that referenced this issue Jul 19, 2021
SSH provides multiple ways to forward connections. The legacy way is via SSH ProxyCommand, which spawns a separate process for each jump host/proxy. Controlling those processes is error-prone, and lingering/hanging processes have been observed (#4940 and others, depending on the setup). This commit adds support for the SSH ProxyJump feature, which permits setting up an arbitrary number of proxy jumps without additional processes by creating TCP channels over existing (Paramiko) connections. This gives good control over the lifetime of the different connections and, since a user's SSH config is not re-read after the initial setup, gives a controlled environment.
Hence it has been decided to make this new directive the recommended default in the documentation while still supporting both ways.

Co-authored-by: Marnik Bercx <mbercx@gmail.com>
Co-authored-by: Leopold Talirz <leopold.talirz@gmail.com>
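
The idea of "creating TCP channels over existing (Paramiko) connections" can be illustrated roughly as follows. This is a hedged sketch, not the code from the referenced commit. `connect_via_jump` is a hypothetical helper, and both client arguments are assumed to behave like `paramiko.SSHClient` instances, with the jump client already connected.

```python
def connect_via_jump(jump_client, target_client, target_host, target_port=22, **connect_kwargs):
    """Connect target_client to target_host through an already connected jump_client.

    Hypothetical helper: both clients are assumed to expose the
    paramiko.SSHClient interface (get_transport() and connect()).
    """
    # Ask the jump host to open a TCP channel to the target. No extra local
    # process is spawned; the channel lives inside the existing SSH connection.
    channel = jump_client.get_transport().open_channel(
        'direct-tcpip',
        dest_addr=(target_host, target_port),
        src_addr=('', 0),
    )
    # The channel is passed in place of a regular socket for the second hop.
    target_client.connect(target_host, port=target_port, sock=channel, **connect_kwargs)
    return target_client
```

With Paramiko installed, one would create two `paramiko.SSHClient()` objects, connect the first to the jump host, and pass both to this helper. Closing the jump client then also tears down the channel, which is the lifetime control the commit message refers to. More jumps can be chained by using each connected client as the jump client for the next hop.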
sphuber pushed a commit that referenced this issue Aug 8, 2021
Cherry-pick: da179dc