
SSH issues lead to processes becoming unreachable #4801

Open
mjclarke94 opened this issue Mar 10, 2021 · 8 comments

@mjclarke94
Contributor

mjclarke94 commented Mar 10, 2021

After restarting my local machine, I started some daemons before running ssh-add -K on the appropriate SSH keys.

As such, all waiting processes ended up with errors of the form:

+-> ERROR at 2021-03-10 14:15:04.968898+00:00
 | Traceback (most recent call last):
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/aiida/engine/utils.py", line 188, in exponential_backoff_retry
 |     result = await coro()
...

 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/paramiko/ed25519key.py", line 96, in _parse_signing_key_data
 |     raise PasswordRequiredException(
 | paramiko.ssh_exception.PasswordRequiredException: Private key file is encrypted

This is fine in itself. There are multiple errors like this per process, so I assume the exponential backoff mechanism was throttling the repeated failed connection attempts. I stopped the daemon, fixed the SSH keys, restarted the daemon, and the connection issues were resolved.

The issue is that any processes which hit that error now show as "unreachable". Processes I have created since can be paused and played without issue, but none of the processes that were queued up when I made the mistake with the SSH key can be paused or played any more.

I'm unsure whether this is expected behaviour of the backoff mechanism or a bug. I'm also unsure whether this means all my jobs queued on the HPC need cancelling and resubmitting so that each has an associated "reachable" process.

@ltalirz
Member

ltalirz commented Mar 11, 2021

Thanks for reporting. @sphuber @chrisjsewell any ideas why this would cause processes to become unreachable?

@sphuber
Contributor

sphuber commented Mar 11, 2021

When you say they are unreachable, that is when you try to run verdi process play/pause on them, right? This should not happen in principle, so it would most likely point to a bug. Can I ask what version of aiida-core you are using? You can run verdi --version to determine this (as long as you did not install a particular branch directly from the repository).

Try restarting the daemon once more with verdi daemon restart --reset and wait a bit (a minute or so) for things to get running again. Then try to play them again. If they are still marked as unreachable, here is a trick you can use to get them running again. Disclaimer: this should not be used routinely, as it can cause problems if used incorrectly.

In a verdi shell, do the following:

from aiida.manage.manager import get_manager

controller = get_manager().get_process_controller()

# Add the pks of the processes that have become unreachable.
# Warning: do **not** add processes that are actually running and reachable.
pks = []

for pk in pks:
    # Re-send the continuation task for the process, without waiting for a reply.
    controller.continue_process(pk, no_reply=True, nowait=True)
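Since accidentally re-sending a continuation task for a process that is still alive can cause problems, it may be worth deduplicating and validating the pk list before running the loop. A minimal, hypothetical helper for that purpose (not part of aiida-core, just an illustration) could look like:

```python
def sanitize_pks(pks):
    """Deduplicate and validate a list of process pks before re-sending
    continuation tasks. Hypothetical helper, not part of aiida-core.

    Raises ValueError on anything that is not a positive integer, and
    preserves the original order of first occurrence.
    """
    seen = set()
    clean = []
    for pk in pks:
        if not isinstance(pk, int) or pk <= 0:
            raise ValueError(f'invalid pk: {pk!r}')
        if pk not in seen:
            seen.add(pk)
            clean.append(pk)
    return clean
```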

@mjclarke94
Contributor Author

Sorry for the slow reply. I installed from develop, specifically commit d762522. Restarting the daemon didn't help, nor did restarting my machine and the backend services (PostgreSQL/RabbitMQ).

I was in a rush, so I just deleted and resubmitted the processes (which in hindsight is probably not all that helpful for bug hunting… sorry!), but I will try the snippet above if I inadvertently recreate the problem.

@mjclarke94
Contributor Author

Unsure if this is related, but I seem to be getting the following very frequently on many of my jobs:

+-> ERROR at 2021-03-18 08:05:18.028135+00:00
 | Traceback (most recent call last):
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/paramiko/transport.py", line 2211, in _check_banner
 |     buf = self.packetizer.readline(timeout)
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/paramiko/packet.py", line 380, in readline
 |     buf += self._read_timeout(timeout)
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/paramiko/packet.py", line 607, in _read_timeout
 |     x = self.__socket.recv(128)
 | ConnectionResetError: [Errno 54] Connection reset by peer
 |
 | During handling of the above exception, another exception occurred:
 |
 | Traceback (most recent call last):
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/aiida/engine/utils.py", line 188, in exponential_backoff_retry
 |     result = await coro()
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 190, in do_update
 |     job_info = await cancellable.with_interrupt(update_request)
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/aiida/engine/utils.py", line 95, in with_interrupt
 |     result = await next(wait_iter)
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/asyncio/tasks.py", line 609, in _wait_for_one
 |     return f.result()  # May raise f.exception().
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/asyncio/futures.py", line 201, in result
 |     raise self._exception
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/asyncio/tasks.py", line 258, in __step
 |     result = coro.throw(exc)
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/manager.py", line 180, in updating
 |     await self._update_job_info()
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/manager.py", line 132, in _update_job_info
 |     self._jobs_cache = await self._get_jobs_from_scheduler()
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/manager.py", line 98, in _get_jobs_from_scheduler
 |     transport = await request
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/asyncio/futures.py", line 284, in __await__
 |     yield self  # This tells Task to wait for completion.
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/asyncio/tasks.py", line 328, in __wakeup
 |     future.result()
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/asyncio/futures.py", line 201, in result
 |     raise self._exception
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/aiida/engine/transports.py", line 89, in do_open
 |     transport.open()
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/aiida/transports/plugins/ssh.py", line 438, in open
 |     self._client.connect(self._machine, **connection_arguments)
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/paramiko/client.py", line 406, in connect
 |     t.start_client(timeout=timeout)
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/paramiko/transport.py", line 660, in start_client
 |     raise e
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/paramiko/transport.py", line 2039, in run
 |     self._check_banner()
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/paramiko/transport.py", line 2215, in _check_banner
 |     raise SSHException(
 | paramiko.ssh_exception.SSHException: Error reading SSH protocol banner[Errno 54] Connection reset by peer

I'm running on a Mac rather than a persistent server, so I wonder whether the machine going to sleep overnight is interrupting backend processes in a way AiiDA can't handle safely. I suggest that primarily because things run fine during the day, but I tend to wake up to a big stack of errors; that's an observation rather than any technical insight I have to offer!

@sphuber
Contributor

sphuber commented Mar 18, 2021

When your laptop goes to sleep, the daemon should not actually be running. This should not be a problem, as AiiDA is designed to deal with this and simply continues the processes where they left off the last time the daemon was stopped. That said, it may be that when your computer wakes up and the daemon restarts, there is a problem with the SSH agent or keys, causing connections to the remote machine to fail; that is the exception you see here. Ultimately this should not be a problem either, since the exponential backoff mechanism will retry. If it keeps failing and the process is paused as a result, you can try restarting the daemon.
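For reference, the backoff mechanism described above works roughly like the following sketch. This illustrates the general pattern only, under assumed parameters (five attempts, doubling delay); it is not the actual code in aiida/engine/utils.py:

```python
import asyncio

async def retry_with_backoff(coro_factory, initial_interval=1.0, max_attempts=5):
    """Call ``coro_factory`` until it succeeds, sleeping 1s, 2s, 4s, ...
    between failed attempts, and re-raise the last exception once the
    attempts are exhausted (at which point the process would be paused)."""
    interval = initial_interval
    for attempt in range(max_attempts):
        try:
            return await coro_factory()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: propagate the failure
            await asyncio.sleep(interval)
            interval *= 2
```

A transient failure (e.g. an SSH connection reset) is simply retried with a growing delay; only a persistent failure exhausts the attempts and surfaces as an error.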

@dev-zero
Contributor

@mjclarke94 this is only semi-related, since processes should not become unreachable either way, but are you using the proxy_command configuration (i.e. connecting via a jump host)?

@mjclarke94
Contributor Author

Nope, just plain old SSH.

@rikigigi
Member

Just to log it: I also got unreachable processes in AiiDA version 1.5.0 (sorry for the old version), after a single process crashed with

File "/u/r/rbertoss/.virtualenvs/aiida/lib/python3.7/site-packages/aiida/orm/nodes/data/array/trajectory.py", line 209, in _validate
    f'The TrajectoryData did not validate. Error: {type(exception).__name__} with message {exception}'
aiida.common.exceptions.ValidationError: The TrajectoryData did not validate. Error: MemoryError with message Unable to allocate 395. MiB for an array with shape (51710400,) and data type float64

The snippet posted above was able to get the processes started again.
