Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix monitoring ctrlc hang #1670

Merged
merged 16 commits into from
May 13, 2020
Merged

Fix monitoring ctrlc hang #1670

merged 16 commits into from
May 13, 2020

Conversation

ZhuozhaoLi
Copy link
Contributor

@ZhuozhaoLi ZhuozhaoLi commented May 6, 2020

This PR tries to fix two monitoring issues related to ctrl-c

  1. Parsl hangs on the following line after ctrl-c when monitoring is enabled.

    self.monitoring.send(MessageType.WORKFLOW_INFO,

    This PR fixes it by adding an explicit zmq_SNDTIMEO (1 second) to the channel between DFK and Hub. With this PR, if one presses ctrl+c, parsl will exit properly after ~1 second.
    I think setting a timeout here is reasonable---the monitoring should not block the DFK to process tasks.

  2. Issue workflow.time_completed is not always populated on ctrl-C #1589
    This PR fixes it by catching the KeyboardInterrupt signal and add some cleanup steps to database_manager.

The original plan to fix this issue was to add an atexit_cleanup like DFK. However, it turns out atexit is never called in multiprocessing processes, since MP processes quit via os._exit(), skipping any cleanup job (including atexit functions, __del__() and weakref finalizers). (source: https://stackoverflow.com/questions/34506638/how-to-register-atexit-function-in-pythons-multiprocessing-subprocess ).

parsl/monitoring/db_manager.py Outdated Show resolved Hide resolved
parsl/monitoring/db_manager.py Outdated Show resolved Hide resolved
parsl/monitoring/db_manager.py Show resolved Hide resolved
parsl/monitoring/monitoring.py Outdated Show resolved Hide resolved
self.logger.exception("Got exception when trying to insert to Table {}".format(table))
try:
self.db.rollback()
except Exception:
self.logger.exception("Rollback failed")
raise
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_insert and _update now pass on database errors to their caller, rather than absorbing them.
So probably the caller (the main loop process) now needs to deal with non-KeyboardInterrupt exceptions that might occur. Or, the _insert and _update code should only re-raise KeyboardInterrupt exceptions to preserve previous behaviour.

Copy link
Contributor Author

@ZhuozhaoLi ZhuozhaoLi May 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am choosing the later approach: raise KeyboardInterrupt exceptions in _insert and _update code. I think for exceptions other than KeyboardInterrupt, we should re-raise too since the db is missing some messages at that point, no?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should re-raise too since the db is missing some messages at that point, no?

the behaviour in master is to ignore most exceptions at the top level, and then carry on the main loop receiving and processing messages. I think this PR should not change that behaviour.

It would be worth investigating separately though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that makes sense. thanks

@yadudoc yadudoc added this to the 1.0 milestone May 11, 2020
@benclifford benclifford merged commit 6f8f66c into master May 13, 2020
@benclifford benclifford deleted the fix_monitoring_ctrlc_hang branch May 13, 2020 15:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants