Capture `KeyboardInterrupt` in `Runner.run` and kill processes #2744

sphuber · 2019-04-12T09:52:36Z

Addresses #2711 partially but does not fix it

Currently, if a user runs a Process in a local interpreter and then
interrupts the interpreter the process will be lost. However, the node
will reflect the last active state. This can be confusing to new users,
who will still for example see a "process" in the Waiting state.

To prevent this situation, the Runner._run method is modified and
attaches listeners to the SIGINT and SIGTERM signals that when
emitted will trigger a clean function that will log the interrupt and
call process.kill on the process that is currently in the scope.
This should ensure that if a user runs a process in a local interpreter
and interrupts it, the process will be properly killed, which will cause
the node to properly reflect what has happened.

The current solution works for a single process being run, but has
issues for nested processes. Sometimes only the current process is
killed and the others in the chain are unaffected before the interpreter
shuts down.

sphuber · 2019-04-12T09:55:31Z

@muhrin : would welcome your input if plumpy._process_stack is the advised way of getting the running process and if the location of the signal handlers is fine. I tested this locally for calc jobs and it works pretty well. Don't really know how to setup up decent unit tests for this. Problem is that with workchains it does not seem to work. The workchain might get killed, but I am getting all sort off other exceptions elsewhere, I think when the callback of the child gets excepted. Its different everytime and can even go down to postgres 😓

04/12/2019 11:48:50 AM <4723> tornado.application: [ERROR] Exception in callback <functools.partial object at 0x7fe4aa169940>
Traceback (most recent call last):
  File "/home/sphuber/.virtualenvs/aiida_dev/local/lib/python2.7/site-packages/tornado/ioloop.py", line 605, in _run_callback
    ret = callback()
  File "/home/sphuber/.virtualenvs/aiida_dev/local/lib/python2.7/site-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/sphuber/.virtualenvs/aiida_dev/local/lib/python2.7/site-packages/tornado/ioloop.py", line 626, in _discard_future_result
    future.result()
  File "/home/sphuber/.virtualenvs/aiida_dev/local/lib/python2.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "/home/sphuber/.virtualenvs/aiida_dev/local/lib/python2.7/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/sphuber/code/aiida/env/dev/plumpy/plumpy/processes.py", line 1109, in step_until_terminated
    yield self.step()
  File "/home/sphuber/.virtualenvs/aiida_dev/local/lib/python2.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/sphuber/.virtualenvs/aiida_dev/local/lib/python2.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "/home/sphuber/.virtualenvs/aiida_dev/local/lib/python2.7/site-packages/tornado/gen.py", line 1069, in run
    yielded = self.gen.send(value)
  File "/home/sphuber/code/aiida/env/dev/plumpy/plumpy/processes.py", line 1100, in step
    self.transition_to(next_state)
  File "/home/sphuber/code/aiida/env/dev/plumpy/plumpy/base/state_machine.py", line 326, in transition_to
    self.transition_failed(initial_state_label, label, *sys.exc_info()[1:])
  File "/home/sphuber/code/aiida/env/dev/plumpy/plumpy/base/state_machine.py", line 339, in transition_failed
    six.reraise(type(exception), exception, trace)
  File "/home/sphuber/code/aiida/env/dev/plumpy/plumpy/base/state_machine.py", line 310, in transition_to
    self._enter_next_state(new_state)
  File "/home/sphuber/code/aiida/env/dev/plumpy/plumpy/base/state_machine.py", line 374, in _enter_next_state
    self._fire_state_event(StateEventHook.ENTERED_STATE, last_state)
  File "/home/sphuber/code/aiida/env/dev/plumpy/plumpy/base/state_machine.py", line 288, in _fire_state_event
    callback(self, hook, state)
  File "/home/sphuber/code/aiida/env/dev/plumpy/plumpy/processes.py", line 308, in <lambda>
    lambda _s, _h, from_state: self.on_entered(from_state))
  File "/home/sphuber/code/aiida/env/dev/aiida-core/aiida/engine/processes/process.py", line 285, in on_entered
    self.update_node_state(self._state)
  File "/home/sphuber/code/aiida/env/dev/aiida-core/aiida/engine/processes/process.py", line 495, in update_node_state
    self.node.set_process_state(state.LABEL)
  File "/home/sphuber/code/aiida/env/dev/aiida-core/aiida/orm/nodes/process/process.py", line 181, in set_process_state
    return self.set_attribute(self.PROCESS_STATE_KEY, state)
  File "/home/sphuber/code/aiida/env/dev/aiida-core/aiida/orm/utils/mixins.py", line 207, in set_attribute
    super(Sealable, self).set_attribute(key, value, stored_check=False, **kwargs)
  File "/home/sphuber/code/aiida/env/dev/aiida-core/aiida/orm/nodes/node.py", line 395, in set_attribute
    self.backend_entity.set_attribute(key, value)
  File "/home/sphuber/code/aiida/env/dev/aiida-core/aiida/orm/implementation/django/nodes.py", line 178, in set_attribute
    self.ATTRIBUTE_CLASS.set_value_for_node(self.dbmodel, key, value)
  File "/home/sphuber/code/aiida/env/dev/aiida-core/aiida/backends/djsite/db/models.py", line 1167, in set_value_for_node
    stop_if_existing=stop_if_existing)
  File "/home/sphuber/code/aiida/env/dev/aiida-core/aiida/backends/djsite/db/models.py", line 663, in set_value
    transaction.savepoint_rollback(sid)
  File "/home/sphuber/.virtualenvs/aiida_dev/local/lib/python2.7/site-packages/django/db/transaction.py", line 66, in savepoint_rollback
    get_connection(using).savepoint_rollback(sid)
  File "/home/sphuber/.virtualenvs/aiida_dev/local/lib/python2.7/site-packages/django/db/backends/base/base.py", line 348, in savepoint_rollback
    self._savepoint_rollback(sid)
  File "/home/sphuber/.virtualenvs/aiida_dev/local/lib/python2.7/site-packages/django/db/backends/base/base.py", line 308, in _savepoint_rollback
    cursor.execute(self.ops.savepoint_rollback_sql(sid))
  File "/home/sphuber/.virtualenvs/aiida_dev/local/lib/python2.7/site-packages/django/db/backends/base/operations.py", line 361, in savepoint_rollback_sql
    return "ROLLBACK TO SAVEPOINT %s" % self.quote_name(sid)
  File "/home/sphuber/.virtualenvs/aiida_dev/local/lib/python2.7/site-packages/django/db/backends/postgresql/operations.py", line 111, in quote_name
    if name.startswith('"') and name.endswith('"'):
AttributeError: 'NoneType' object has no attribute 'startswith'

coveralls · 2019-04-12T10:38:43Z

Coverage decreased (-0.04%) to 70.538% when pulling 97c477ef18cf32eed569edba16b37eec8e2282fc on sphuber:fix_2711_kill_process_interrupt_local_runner into 3898fad on aiidateam:develop.

coveralls · 2019-04-12T10:38:44Z

Coverage increased (+0.0006%) to 72.669% when pulling 76d7228 on sphuber:fix_2711_kill_process_interrupt_local_runner into 0353e11 on aiidateam:develop.

giovannipizzi · 2019-04-12T16:07:57Z

@sphuber there are three except: in the code, two are in the part related to rolling back (even if in a different place than the traceback). But what could happen for instance is that you send the signal exactly in that part of the code, then except is triggered, you roll back, and then you see the problem elsewhere? Just guessing, though. If you replace those, temporarily, with except Exception: instead do you still see the problem? (BTW the try/except in TrajectoryData can be removed, and for the other two maybe we should rethink better how to deal with those rollbacks (link to the Django docs for reference).

giovannipizzi · 2019-04-12T16:12:14Z

to test it, you could try something like os.kill(os.getpid(), signal.SIGINT) even if you should run a different process with subprocess and verdi run, otherwise you kill the test interpreter?

sphuber · 2019-04-12T19:46:41Z

Regarding the test suggestion: the difficult thing is not the killing but to launch the sub process. It should be a sub process that runs an aiida environment in which we run a process in a local runner. This can only be done by calling run on a process with some inputs. So the only way I can envisage this happening is to write a dummy runaiida script that runs a long running dummy calcjob and then send the signal. But as you can see it is not very clean

sphuber · 2019-05-14T12:00:42Z

@giovannipizzi I have a solution that works partially. It is not tested through unit tests and doesn't work in all cases, but I think it will be important to have this partial non-ideal solution in v1.0.0b3 of tomorrow. I am sure that people during the tutorial will CTRL+C after running a process and will end up with a zombie process. It would be good that it will show as killed. When launching nested processes not everything will get properly killed, but at least it is something.

giovannipizzi

Agreed. Let's not close the issue though.

Currently, if a user runs a `Process` in a local interpreter and then interrupts the interpreter the process will be lost. However, the node will reflect the last active state. This can be confusing to new users, who will still for example see a "process" in the `Waiting` state. To prevent this situation, the `Runner._run` method is modified and attaches listeners to the `SIGINT` and `SIGTERM` signals that when emitted will trigger a clean function that will log the interrupt and call `process.kill` on the process that is currently in the scope. This should ensure that if a user runs a process in a local interpreter and interrupts it, the process will be properly killed, which will cause the node to properly reflect what has happened. The current solution works for a single process being run, but has issues for nested processes. Sometimes only the current process is killed and the others in the chain are unaffected before the interpreter shuts down.

sphuber · 2019-05-15T06:27:02Z

@giovannipizzi could you please reapprove, I have rebased

giovannipizzi

Looks a bit different than before but I guess it's ok

…team#2744) Currently, if a user runs a `Process` in a local interpreter and then interrupts the interpreter the process will be lost. However, the node will reflect the last active state. This can be confusing to new users, who will still for example see a "process" in the `Waiting` state. To prevent this situation, the `Runner._run` method is modified and attaches listeners to the `SIGINT` and `SIGTERM` signals that when emitted will trigger a clean function that will log the interrupt and call `process.kill` on the process that is currently in the scope. This should ensure that if a user runs a process in a local interpreter and interrupts it, the process will be properly killed, which will cause the node to properly reflect what has happened. The current solution works for a single process being run, but has issues for nested processes. Sometimes only the current process is killed and the others in the chain are unaffected before the interpreter shuts down.

sphuber requested review from muhrin and giovannipizzi April 12, 2019 09:52

giovannipizzi previously approved these changes May 14, 2019

View reviewed changes

sphuber dismissed giovannipizzi’s stale review via 76d7228 May 15, 2019 06:26

sphuber force-pushed the fix_2711_kill_process_interrupt_local_runner branch from 97c477e to 76d7228 Compare May 15, 2019 06:26

sphuber changed the title ~~[WIP] Capture KeyboardInterrupt in Runner.run and kill processes~~ Capture KeyboardInterrupt in Runner.run and kill processes May 15, 2019

giovannipizzi approved these changes May 15, 2019

View reviewed changes

sphuber merged commit 15feb96 into aiidateam:develop May 15, 2019

sphuber deleted the fix_2711_kill_process_interrupt_local_runner branch May 15, 2019 09:22

sphuber mentioned this pull request Jun 4, 2019

Stopping the daemon always seems to hit a timeout #2963

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Capture `KeyboardInterrupt` in `Runner.run` and kill processes #2744

Capture `KeyboardInterrupt` in `Runner.run` and kill processes #2744

sphuber commented Apr 12, 2019 •

edited

Loading

sphuber commented Apr 12, 2019

coveralls commented Apr 12, 2019

coveralls commented Apr 12, 2019 •

edited

Loading

giovannipizzi commented Apr 12, 2019

giovannipizzi commented Apr 12, 2019

sphuber commented Apr 12, 2019

sphuber commented May 14, 2019

giovannipizzi left a comment

sphuber commented May 15, 2019

giovannipizzi left a comment

Capture KeyboardInterrupt in Runner.run and kill processes #2744

Capture KeyboardInterrupt in Runner.run and kill processes #2744

Conversation

sphuber commented Apr 12, 2019 • edited Loading

sphuber commented Apr 12, 2019

coveralls commented Apr 12, 2019

coveralls commented Apr 12, 2019 • edited Loading

giovannipizzi commented Apr 12, 2019

giovannipizzi commented Apr 12, 2019

sphuber commented Apr 12, 2019

sphuber commented May 14, 2019

giovannipizzi left a comment

Choose a reason for hiding this comment

sphuber commented May 15, 2019

giovannipizzi left a comment

Choose a reason for hiding this comment

Capture `KeyboardInterrupt` in `Runner.run` and kill processes #2744

Capture `KeyboardInterrupt` in `Runner.run` and kill processes #2744

sphuber commented Apr 12, 2019 •

edited

Loading

coveralls commented Apr 12, 2019 •

edited

Loading