
SSH issues lead to processes becoming unreachable #4801

Open
mjclarke94 opened this issue Mar 10, 2021 · 8 comments

@mjclarke94
Contributor

mjclarke94 commented Mar 10, 2021

After restarting my local machine, I started some daemons before running ssh-add -K on the appropriate SSH keys.

As such, all waiting processes ended up with errors of the form:

+-> ERROR at 2021-03-10 14:15:04.968898+00:00
 | Traceback (most recent call last):
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/aiida/engine/utils.py", line 188, in exponential_backoff_retry
 |     result = await coro()
...

 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/paramiko/ed25519key.py", line 96, in _parse_signing_key_data
 |     raise PasswordRequiredException(
 | paramiko.ssh_exception.PasswordRequiredException: Private key file is encrypted

This is fine in itself. There are multiple errors like this per process, so I assume the exponential backoff mechanism was throttling the repeated failed connection attempts. I stopped the daemon, fixed the SSH keys, restarted the daemon, and the connection issues were resolved.

The issue is that any processes which hit that error now show as "unreachable". Processes I have created since can be paused and played without issue, but none of the processes that were queued up when I made the mistake with the SSH key can be paused or played any more.

I'm unsure whether this is expected behaviour of the backoff mechanism or a bug. I'm also unsure whether this means all my jobs queued on the HPC need cancelling and resubmitting so that each has an associated "reachable" process.

@ltalirz
Member

ltalirz commented Mar 11, 2021

Thanks for reporting. @sphuber @chrisjsewell any ideas why this would cause processes to become unreachable?

@sphuber
Contributor

sphuber commented Mar 11, 2021

When you say they are unreachable, that is when you try to run verdi process play/pause on them, right? This should not happen in principle, so it would most likely point to a bug. Can I ask what version of aiida-core you are using? You can run verdi --version to determine this (as long as you did not install a particular branch directly from the repository).

Try restarting the daemon once more with verdi daemon restart --reset and wait a bit (a minute or so) for things to get running again. Then try to play them again. If they are still marked as unreachable, here is a trick you can use to get them running again. Disclaimer: this should not be used routinely, as it can cause problems if used incorrectly.

In a verdi shell, do the following:

from aiida.manage.manager import get_manager

controller = get_manager().get_process_controller()

# Add the pks of the processes that have become unreachable.
# Warning: do **not** add processes that are actually running and reachable.
pks = []

for pk in pks:
    # Re-send the continuation task for the process, without waiting for a reply.
    controller.continue_process(pk, no_reply=True, nowait=True)
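Since accidentally re-sending a continuation task for a process that is still alive can cause problems, it may be worth deduplicating and validating the pk list before running the loop. A minimal, hypothetical helper for that purpose (not part of aiida-core, just an illustration) could look like:

```python
def sanitize_pks(pks):
    """Deduplicate and validate a list of process pks before re-sending
    continuation tasks. Hypothetical helper, not part of aiida-core.

    Raises ValueError on anything that is not a positive integer, and
    preserves the original order of first occurrence.
    """
    seen = set()
    clean = []
    for pk in pks:
        if not isinstance(pk, int) or pk <= 0:
            raise ValueError(f'invalid pk: {pk!r}')
        if pk not in seen:
            seen.add(pk)
            clean.append(pk)
    return clean
```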

@mjclarke94
Contributor Author

Sorry for the slow reply. I installed from develop, specifically commit d762522. Restarting the daemon didn't help, nor did restarting my machine and the backend services (PostgreSQL/RabbitMQ).

I was in a rush, so I just deleted and resubmitted the processes (which in hindsight is probably not all that helpful for bug hunting… sorry!), but I will try the snippet above if I inadvertently recreate the problem.

@mjclarke94
Contributor Author

Unsure if this is related, but I seem to be getting the following very frequently on many of my jobs:

+-> ERROR at 2021-03-18 08:05:18.028135+00:00
 | Traceback (most recent call last):
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/paramiko/transport.py", line 2211, in _check_banner
 |     buf = self.packetizer.readline(timeout)
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/paramiko/packet.py", line 380, in readline
 |     buf += self._read_timeout(timeout)
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/paramiko/packet.py", line 607, in _read_timeout
 |     x = self.__socket.recv(128)
 | ConnectionResetError: [Errno 54] Connection reset by peer
 |
 | During handling of the above exception, another exception occurred:
 |
 | Traceback (most recent call last):
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/aiida/engine/utils.py", line 188, in exponential_backoff_retry
 |     result = await coro()
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 190, in do_update
 |     job_info = await cancellable.with_interrupt(update_request)
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/aiida/engine/utils.py", line 95, in with_interrupt
 |     result = await next(wait_iter)
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/asyncio/tasks.py", line 609, in _wait_for_one
 |     return f.result()  # May raise f.exception().
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/asyncio/futures.py", line 201, in result
 |     raise self._exception
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/asyncio/tasks.py", line 258, in __step
 |     result = coro.throw(exc)
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/manager.py", line 180, in updating
 |     await self._update_job_info()
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/manager.py", line 132, in _update_job_info
 |     self._jobs_cache = await self._get_jobs_from_scheduler()
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/manager.py", line 98, in _get_jobs_from_scheduler
 |     transport = await request
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/asyncio/futures.py", line 284, in __await__
 |     yield self  # This tells Task to wait for completion.
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/asyncio/tasks.py", line 328, in __wakeup
 |     future.result()
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/asyncio/futures.py", line 201, in result
 |     raise self._exception
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/aiida/engine/transports.py", line 89, in do_open
 |     transport.open()
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/aiida/transports/plugins/ssh.py", line 438, in open
 |     self._client.connect(self._machine, **connection_arguments)
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/paramiko/client.py", line 406, in connect
 |     t.start_client(timeout=timeout)
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/paramiko/transport.py", line 660, in start_client
 |     raise e
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/paramiko/transport.py", line 2039, in run
 |     self._check_banner()
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/paramiko/transport.py", line 2215, in _check_banner
 |     raise SSHException(
 | paramiko.ssh_exception.SSHException: Error reading SSH protocol banner[Errno 54] Connection reset by peer

I'm running on a Mac rather than a persistent server, so I wonder whether the machine going to sleep overnight is interrupting backend processes in a way AiiDA can't handle safely. I suggest that primarily because things run fine during the day, but I tend to wake up to a big stack of errors; that's an observation rather than any technical insight I have to offer!

@sphuber
Contributor

sphuber commented Mar 18, 2021

When your laptop goes to sleep, the daemon should not actually be running. This should not be a problem, as AiiDA is designed to deal with this and simply continues the processes where they left off the last time the daemon was stopped. That said, it may be that when your computer wakes up and the daemon restarts, there is a problem with the SSH agent or keys, causing connections to the remote machine to fail; that is the exception you see here. Ultimately this should not be a problem either, since the exponential backoff mechanism will retry. If it keeps failing and the process is paused as a result, you can try restarting the daemon.
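For reference, the backoff mechanism described above works roughly like the following sketch. This illustrates the general pattern only, under assumed parameters (five attempts, doubling delay); it is not the actual code in aiida/engine/utils.py:

```python
import asyncio

async def retry_with_backoff(coro_factory, initial_interval=1.0, max_attempts=5):
    """Call ``coro_factory`` until it succeeds, sleeping 1s, 2s, 4s, ...
    between failed attempts, and re-raise the last exception once the
    attempts are exhausted (at which point the process would be paused)."""
    interval = initial_interval
    for attempt in range(max_attempts):
        try:
            return await coro_factory()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: propagate the failure
            await asyncio.sleep(interval)
            interval *= 2
```

A transient failure (e.g. an SSH connection reset) is simply retried with a growing delay; only a persistent failure exhausts the attempts and surfaces as an error.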

@dev-zero
Contributor

@mjclarke94 this is only semi-related, since processes should not become unreachable either way, but are you using the proxy_command configuration (i.e. connecting via a jump host)?

@mjclarke94
Contributor Author

Nope, just plain old SSH.

@rikigigi
Member

Just to log it: I also got unreachable processes in AiiDA version 1.5.0 (sorry for the old version), after a single process crashed with

File "/u/r/rbertoss/.virtualenvs/aiida/lib/python3.7/site-packages/aiida/orm/nodes/data/array/trajectory.py", line 209, in _validate
    f'The TrajectoryData did not validate. Error: {type(exception).__name__} with message {exception}'
aiida.common.exceptions.ValidationError: The TrajectoryData did not validate. Error: MemoryError with message Unable to allocate 395. MiB for an array with shape (51710400,) and data type float64

The snippet posted above was able to get the processes started again.
