Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Capture KeyboardInterrupt in Runner.run and kill processes #2744

Conversation

sphuber
Copy link
Contributor

@sphuber sphuber commented Apr 12, 2019

Addresses #2711 partially but does not fix it

Currently, if a user runs a Process in a local interpreter and then
interrupts the interpreter the process will be lost. However, the node
will reflect the last active state. This can be confusing to new users,
who will still for example see a "process" in the Waiting state.

To prevent this situation, the Runner._run method is modified and
attaches listeners to the SIGINT and SIGTERM signals that when
emitted will trigger a clean function that will log the interrupt and
call process.kill on the process that is currently in the scope.
This should ensure that if a user runs a process in a local interpreter
and interrupts it, the process will be properly killed, which will cause
the node to properly reflect what has happened.

The current solution works for a single process being run, but has
issues for nested processes. Sometimes only the current process is
killed and the others in the chain are unaffected before the interpreter
shuts down.

@sphuber sphuber requested review from muhrin and giovannipizzi April 12, 2019 09:52
@sphuber
Copy link
Contributor Author

sphuber commented Apr 12, 2019

@muhrin : would welcome your input if plumpy._process_stack is the advised way of getting the running process and if the location of the signal handlers is fine. I tested this locally for calc jobs and it works pretty well. Don't really know how to setup up decent unit tests for this. Problem is that with workchains it does not seem to work. The workchain might get killed, but I am getting all sort off other exceptions elsewhere, I think when the callback of the child gets excepted. Its different everytime and can even go down to postgres 😓

04/12/2019 11:48:50 AM <4723> tornado.application: [ERROR] Exception in callback <functools.partial object at 0x7fe4aa169940>
Traceback (most recent call last):
  File "/home/sphuber/.virtualenvs/aiida_dev/local/lib/python2.7/site-packages/tornado/ioloop.py", line 605, in _run_callback
    ret = callback()
  File "/home/sphuber/.virtualenvs/aiida_dev/local/lib/python2.7/site-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/sphuber/.virtualenvs/aiida_dev/local/lib/python2.7/site-packages/tornado/ioloop.py", line 626, in _discard_future_result
    future.result()
  File "/home/sphuber/.virtualenvs/aiida_dev/local/lib/python2.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "/home/sphuber/.virtualenvs/aiida_dev/local/lib/python2.7/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/sphuber/code/aiida/env/dev/plumpy/plumpy/processes.py", line 1109, in step_until_terminated
    yield self.step()
  File "/home/sphuber/.virtualenvs/aiida_dev/local/lib/python2.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/sphuber/.virtualenvs/aiida_dev/local/lib/python2.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "/home/sphuber/.virtualenvs/aiida_dev/local/lib/python2.7/site-packages/tornado/gen.py", line 1069, in run
    yielded = self.gen.send(value)
  File "/home/sphuber/code/aiida/env/dev/plumpy/plumpy/processes.py", line 1100, in step
    self.transition_to(next_state)
  File "/home/sphuber/code/aiida/env/dev/plumpy/plumpy/base/state_machine.py", line 326, in transition_to
    self.transition_failed(initial_state_label, label, *sys.exc_info()[1:])
  File "/home/sphuber/code/aiida/env/dev/plumpy/plumpy/base/state_machine.py", line 339, in transition_failed
    six.reraise(type(exception), exception, trace)
  File "/home/sphuber/code/aiida/env/dev/plumpy/plumpy/base/state_machine.py", line 310, in transition_to
    self._enter_next_state(new_state)
  File "/home/sphuber/code/aiida/env/dev/plumpy/plumpy/base/state_machine.py", line 374, in _enter_next_state
    self._fire_state_event(StateEventHook.ENTERED_STATE, last_state)
  File "/home/sphuber/code/aiida/env/dev/plumpy/plumpy/base/state_machine.py", line 288, in _fire_state_event
    callback(self, hook, state)
  File "/home/sphuber/code/aiida/env/dev/plumpy/plumpy/processes.py", line 308, in <lambda>
    lambda _s, _h, from_state: self.on_entered(from_state))
  File "/home/sphuber/code/aiida/env/dev/aiida-core/aiida/engine/processes/process.py", line 285, in on_entered
    self.update_node_state(self._state)
  File "/home/sphuber/code/aiida/env/dev/aiida-core/aiida/engine/processes/process.py", line 495, in update_node_state
    self.node.set_process_state(state.LABEL)
  File "/home/sphuber/code/aiida/env/dev/aiida-core/aiida/orm/nodes/process/process.py", line 181, in set_process_state
    return self.set_attribute(self.PROCESS_STATE_KEY, state)
  File "/home/sphuber/code/aiida/env/dev/aiida-core/aiida/orm/utils/mixins.py", line 207, in set_attribute
    super(Sealable, self).set_attribute(key, value, stored_check=False, **kwargs)
  File "/home/sphuber/code/aiida/env/dev/aiida-core/aiida/orm/nodes/node.py", line 395, in set_attribute
    self.backend_entity.set_attribute(key, value)
  File "/home/sphuber/code/aiida/env/dev/aiida-core/aiida/orm/implementation/django/nodes.py", line 178, in set_attribute
    self.ATTRIBUTE_CLASS.set_value_for_node(self.dbmodel, key, value)
  File "/home/sphuber/code/aiida/env/dev/aiida-core/aiida/backends/djsite/db/models.py", line 1167, in set_value_for_node
    stop_if_existing=stop_if_existing)
  File "/home/sphuber/code/aiida/env/dev/aiida-core/aiida/backends/djsite/db/models.py", line 663, in set_value
    transaction.savepoint_rollback(sid)
  File "/home/sphuber/.virtualenvs/aiida_dev/local/lib/python2.7/site-packages/django/db/transaction.py", line 66, in savepoint_rollback
    get_connection(using).savepoint_rollback(sid)
  File "/home/sphuber/.virtualenvs/aiida_dev/local/lib/python2.7/site-packages/django/db/backends/base/base.py", line 348, in savepoint_rollback
    self._savepoint_rollback(sid)
  File "/home/sphuber/.virtualenvs/aiida_dev/local/lib/python2.7/site-packages/django/db/backends/base/base.py", line 308, in _savepoint_rollback
    cursor.execute(self.ops.savepoint_rollback_sql(sid))
  File "/home/sphuber/.virtualenvs/aiida_dev/local/lib/python2.7/site-packages/django/db/backends/base/operations.py", line 361, in savepoint_rollback_sql
    return "ROLLBACK TO SAVEPOINT %s" % self.quote_name(sid)
  File "/home/sphuber/.virtualenvs/aiida_dev/local/lib/python2.7/site-packages/django/db/backends/postgresql/operations.py", line 111, in quote_name
    if name.startswith('"') and name.endswith('"'):
AttributeError: 'NoneType' object has no attribute 'startswith'

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.04%) to 70.538% when pulling 97c477ef18cf32eed569edba16b37eec8e2282fc on sphuber:fix_2711_kill_process_interrupt_local_runner into 3898fad on aiidateam:develop.

@coveralls
Copy link

coveralls commented Apr 12, 2019

Coverage Status

Coverage increased (+0.0006%) to 72.669% when pulling 76d7228 on sphuber:fix_2711_kill_process_interrupt_local_runner into 0353e11 on aiidateam:develop.

@giovannipizzi
Copy link
Member

@sphuber there are three except: in the code, two are in the part related to rolling back (even if in a different place than the traceback). But what could happen for instance is that you send the signal exactly in that part of the code, then except is triggered, you roll back, and then you see the problem elsewhere? Just guessing, though. If you replace those, temporarily, with except Exception: instead do you still see the problem? (BTW the try/except in TrajectoryData can be removed, and for the other two maybe we should rethink better how to deal with those rollbacks (link to the Django docs for reference).

@giovannipizzi
Copy link
Member

to test it, you could try something like os.kill(os.getpid(), signal.SIGINT) even if you should run a different process with subprocess and verdi run, otherwise you kill the test interpreter?

@sphuber
Copy link
Contributor Author

sphuber commented Apr 12, 2019

Regarding the test suggestion: the difficult thing is not the killing but to launch the sub process. It should be a sub process that runs an aiida environment in which we run a process in a local runner. This can only be done by calling run on a process with some inputs. So the only way I can envisage this happening is to write a dummy runaiida script that runs a long running dummy calcjob and then send the signal. But as you can see it is not very clean

@sphuber
Copy link
Contributor Author

sphuber commented May 14, 2019

@giovannipizzi I have a solution that works partially. It is not tested through unit tests and doesn't work in all cases, but I think it will be important to have this partial non-ideal solution in v1.0.0b3 of tomorrow. I am sure that people during the tutorial will CTRL+C after running a process and will end up with a zombie process. It would be good that it will show as killed. When launching nested processes not everything will get properly killed, but at least it is something.

giovannipizzi
giovannipizzi previously approved these changes May 14, 2019
Copy link
Member

@giovannipizzi giovannipizzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Let's not close the issue though.

Currently, if a user runs a `Process` in a local interpreter and then
interrupts the interpreter the process will be lost. However, the node
will reflect the last active state. This can be confusing to new users,
who will still for example see a "process" in the `Waiting` state.

To prevent this situation, the `Runner._run` method is modified and
attaches listeners to the `SIGINT` and `SIGTERM` signals that when
emitted will trigger a clean function that will log the interrupt and
call `process.kill` on the process that is currently in the scope.
This should ensure that if a user runs a process in a local interpreter
and interrupts it, the process will be properly killed, which will cause
the node to properly reflect what has happened.

The current solution works for a single process being run, but has
issues for nested processes. Sometimes only the current process is
killed and the others in the chain are unaffected before the interpreter
shuts down.
@sphuber sphuber force-pushed the fix_2711_kill_process_interrupt_local_runner branch from 97c477e to 76d7228 Compare May 15, 2019 06:26
@sphuber sphuber changed the title [WIP] Capture KeyboardInterrupt in Runner.run and kill processes Capture KeyboardInterrupt in Runner.run and kill processes May 15, 2019
@sphuber
Copy link
Contributor Author

sphuber commented May 15, 2019

@giovannipizzi could you please reapprove, I have rebased

Copy link
Member

@giovannipizzi giovannipizzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks a bit different than before but I guess it's ok

@sphuber sphuber merged commit 15feb96 into aiidateam:develop May 15, 2019
@sphuber sphuber deleted the fix_2711_kill_process_interrupt_local_runner branch May 15, 2019 09:22
unkcpz pushed a commit to unkcpz/aiida-core that referenced this pull request May 20, 2019
…team#2744)

Currently, if a user runs a `Process` in a local interpreter and then
interrupts the interpreter the process will be lost. However, the node
will reflect the last active state. This can be confusing to new users,
who will still for example see a "process" in the `Waiting` state.

To prevent this situation, the `Runner._run` method is modified and
attaches listeners to the `SIGINT` and `SIGTERM` signals that when
emitted will trigger a clean function that will log the interrupt and
call `process.kill` on the process that is currently in the scope.
This should ensure that if a user runs a process in a local interpreter
and interrupts it, the process will be properly killed, which will cause
the node to properly reflect what has happened.

The current solution works for a single process being run, but has
issues for nested processes. Sometimes only the current process is
killed and the others in the chain are unaffected before the interpreter
shuts down.
unkcpz pushed a commit to unkcpz/aiida-core that referenced this pull request May 20, 2019
…team#2744)

Currently, if a user runs a `Process` in a local interpreter and then
interrupts the interpreter the process will be lost. However, the node
will reflect the last active state. This can be confusing to new users,
who will still for example see a "process" in the `Waiting` state.

To prevent this situation, the `Runner._run` method is modified and
attaches listeners to the `SIGINT` and `SIGTERM` signals that when
emitted will trigger a clean function that will log the interrupt and
call `process.kill` on the process that is currently in the scope.
This should ensure that if a user runs a process in a local interpreter
and interrupts it, the process will be properly killed, which will cause
the node to properly reflect what has happened.

The current solution works for a single process being run, but has
issues for nested processes. Sometimes only the current process is
killed and the others in the chain are unaffected before the interpreter
shuts down.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants