Stopping the daemon always seems to hit a timeout #2963

Closed
sphuber opened this issue Jun 4, 2019 · 3 comments · Fixed by #2966
Comments

@sphuber
Contributor

sphuber commented Jun 4, 2019

Calling verdi daemon stop consistently times out, and eventually the circus process is killed. This behavior is very recent and may be related to PR #2744, which attempts to kill Process instances when a local runner is interrupted. When the daemon is shut down, the workers receive the interrupt signal, which triggers the code added in #2744, and they try to kill the processes that they ran. However, since these processes are not unregistered once they complete, the daemon tries to kill processes that have already finished, which causes the "hanging".

@ltalirz
Member

ltalirz commented Jun 4, 2019

One comment: I don't know how hard it is to start and stop the daemon from the Python API, but if it is not too hard, a very powerful test to add (one that would catch at least one of the potential issues you mention) would be to:

  • start the daemon
  • submit a calculation (could be a calcfunction that simply waits forever)
  • stop the daemon & start it again
  • check that the calculation is still running
  • kill the calculation

@sphuber
Contributor Author

sphuber commented Jun 5, 2019

The test you propose might be difficult to implement and would not test the actual problem of this issue:

  • A process function can only be run, not submitted, so you would somehow need to launch the daemon and get it to "run" a process function without submitting it from the main test process. If one instead wants to use a CalcJob, which can be submitted, we would of course have to mock one that sleeps indefinitely.
  • Testing whether a calculation is still running is also not trivial when it is running with the daemon. Checking the database won't help, because it might not reflect the current state. So we would have to ask the daemon over RabbitMQ, but there is currently no simple API for this. We could just ask to kill it; if that works, it is already a relatively strong indicator that the task was correctly reloaded, but again this does not test the problem of this issue.

The problem actually stems from process functions. They create their own runner instance, so that nested process functions do not block, and they also attach their own handlers for interrupt signals to kill the process. This last part is important: if you run a process function in a local interpreter and then press CTRL+C, the AiiDA process is properly killed as well, and not just the Python process. Otherwise you would end up with a process node that is still marked as Running in your database even though it is not. However, this is problematic, because the same interrupt signal is sent to the daemon runners when the daemon is stopped.
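To illustrate why a process function wants its own interrupt handler, here is a minimal sketch using only the standard library. `ToyProcess`, `run_with_interrupt_handler` and `press_ctrl_c` are hypothetical names made up for this example; none of this is AiiDA's actual code, and it assumes Python 3.8+ running in the main thread.

```python
import signal


class ToyProcess:
    """Hypothetical stand-in for a process node; not a real AiiDA class."""

    def __init__(self):
        self.state = 'running'

    def kill(self):
        self.state = 'killed'


def run_with_interrupt_handler(process, body):
    """Run `body` while SIGINT is routed to `process.kill`."""

    def on_interrupt(signum, frame):
        # Record the interrupt on the process itself, so its state does
        # not stay 'running' after the interpreter was interrupted.
        process.kill()

    previous = signal.signal(signal.SIGINT, on_interrupt)
    try:
        body()
        if process.state == 'running':
            process.state = 'finished'
    finally:
        # Put the earlier handler back so the sketch leaves nothing behind.
        if previous is not None:
            signal.signal(signal.SIGINT, previous)
    return process.state


def press_ctrl_c():
    # Simulates CTRL+C by sending SIGINT to the current process.
    signal.raise_signal(signal.SIGINT)


state = run_with_interrupt_handler(ToyProcess(), press_ctrl_c)
```

Without the handler, the interrupt would stop the interpreter but the toy process would never leave the 'running' state, which is the stale-node situation described above.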

The problem described in this issue is caused by an error in how process functions attach their interrupt signal handlers: they are attached but never detached. That means that every one of them will be called when the daemon runner is asked to stop, even after the functions have long since finished. On top of that, the logic in the handler is incorrect, so it hangs because the process no longer exists. This is ultimately what causes the verdi daemon stop command to time out.
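One way to make sure a handler is always detached is to wrap the attach and detach steps in a context manager. The following is only a sketch under that assumption, not the change made in the actual fix; `temporary_sigint_handler` and `run_function` are hypothetical names, and the code must run in the main thread.

```python
import signal
from contextlib import contextmanager


@contextmanager
def temporary_sigint_handler(handler):
    """Install `handler` for SIGINT and restore the previous one on exit."""
    previous = signal.signal(signal.SIGINT, handler)
    try:
        yield
    finally:
        # `previous` is None if the old handler was not installed from
        # Python; in that case there is nothing that can be restored.
        if previous is not None:
            signal.signal(signal.SIGINT, previous)


def run_function(name, finished, log):
    """Hypothetical process function whose handler lives only while it runs."""

    def on_interrupt(signum, frame):
        # Guard: never try to kill something that has already finished.
        if name not in finished:
            log.append('kill ' + name)

    with temporary_sigint_handler(on_interrupt):
        log.append('run ' + name)
    finished.add(name)


log, finished = [], set()
before = signal.getsignal(signal.SIGINT)
for name in ('first', 'second'):
    run_function(name, finished, log)
after = signal.getsignal(signal.SIGINT)
```

After both functions have returned, the SIGINT handler is the same one that was installed before they ran, so a later stop request no longer reaches handlers that belong to finished functions.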

@sphuber
Contributor Author

sphuber commented Jun 5, 2019

Fixed in PR #2966

@sphuber sphuber closed this as completed Jun 5, 2019