SSH issues lead to processes becoming unreachable #4801
Comments
thanks for reporting. @sphuber @chrisjsewell any ideas why this would lead processes to become unreachable? |
When you say they are unreachable it is when you try to run Try to restart the daemon once more with In a
from aiida.manage.manager import get_manager
controller = get_manager().get_process_controller()
pks = [] # Add the pks to this list of the processes that have become unreachable. Warning do **not** add processes that are actually running and are reachable
for pk in pks:
controller.continue_process(pk, no_reply=True, nowait=True) |
Sorry for the slow reply. I've installed from I was in a rush so just deleted and resubmitted the processes (which in hindsight is probably not all that helpful for bug hunting....sorry!), but will try the snippet above if I inadvertently recreate it! |
Unsure if related, but I seem to very frequently be getting the following on many of my jobs...
I'm running on a mac rather than a persistent server so am wondering if it going to sleep overnight is causing backend processes to be interrupted in a way |
When your laptop goes to sleep, the daemon should not actually be running. This should not be a problem as AiiDA is designed to be able to deal with this and simply continue the processes where it left off last time the daemon was stopped. That being said, it can be the case that when your computer wakes up and the daemon restarts, there is a problem with the SSH agent or keys, causing connections to the remote machine to fail, which is the exception that you see here. Ultimately, this should not be a problem since the exponential backoff mechanism will retry. If it keeps failing and the process is paused as a result, you can try to restart the daemon. |
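To illustrate the retry behaviour described above, here is a minimal, generic sketch of an exponential backoff loop. It is not AiiDA's actual implementation: the function name `retry_with_backoff` and its parameters (`initial_interval`, `max_attempts`, `sleep`) are hypothetical and chosen only for illustration.

```python
import time


def retry_with_backoff(func, initial_interval=1.0, max_attempts=5, sleep=time.sleep):
    """Call ``func`` until it succeeds, doubling the wait after each failure.

    Hypothetical helper for illustration only, not part of AiiDA.
    """
    interval = initial_interval
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                # Out of attempts: re-raise so the caller can react,
                # for example by pausing the process.
                raise
            sleep(interval)
            interval *= 2  # wait twice as long before the next attempt
```

With the defaults, a call that keeps failing is retried after waits of 1, 2, 4 and 8 seconds before the last exception is re-raised. A transient failure, such as an SSH agent that is not ready yet, is absorbed as long as it is resolved before the attempts run out.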
@mjclarke94 semi related to this as processes should not become unreachable, but are you using the |
Nope, just plain old SSH. |
Just to log it:
and the snippet posted was able to get the processes to start again |
After restarting my local machine, I started some daemons before running
ssh-add -K
on any appropriate ssh keys. As such, all waiting processes ended up with errors of the form:
This is fine in itself: there are multiple errors like this per process, so I assume the exponential backoff mechanism is preventing it from trying and failing to connect repeatedly. I stopped the daemon, fixed the ssh keys, restarted the daemon and the connection issues were resolved.
The issue now is that any processes which had that error are showing as "unreachable". I have created new processes since, which can be paused and played with no issue, but all the processes which were queued up when I made the error with the ssh key can no longer be paused/played.
I'm unsure whether this is expected behaviour with the backoff mechanism or a bug. I'm also unsure whether or not this means all my jobs queued on the HPC need cancelling and resubmitting so they have an associated "reachable" process with them.
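One way to avoid starting the daemon before the keys are loaded is to check the SSH agent first. The sketch below is only an illustration and not part of AiiDA: `agent_status` is a hypothetical helper, and it relies on the documented exit codes of `ssh-add -l` (0 when identities are listed, 1 when the agent holds none, 2 when the agent cannot be contacted).

```python
import subprocess


def agent_status(run=subprocess.run):
    """Classify the ssh-agent state from the exit code of ``ssh-add -l``.

    Hypothetical helper for illustration; ``run`` is injectable for testing.
    """
    result = run(["ssh-add", "-l"], capture_output=True, text=True)
    if result.returncode == 0:
        return "keys-loaded"  # at least one identity is available
    if result.returncode == 1:
        return "no-keys"  # agent is running but holds no identities
    return "agent-unreachable"  # ssh-add could not contact the agent
```

Running `agent_status()` before starting the daemon and only proceeding on `"keys-loaded"` would have caught the situation described above.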