Fix monitoring ctrlc hang #1670
…to fix_monitoring_ctrlc_hang
self.logger.exception("Got exception when trying to insert to Table {}".format(table))
try:
    self.db.rollback()
except Exception:
    self.logger.exception("Rollback failed")
raise
_insert and _update now pass on database errors to their caller, rather than absorbing them.
So probably the caller (the main loop process) now needs to deal with non-KeyboardInterrupt exceptions that might occur. Or, the _insert and _update code should only re-raise KeyboardInterrupt exceptions to preserve previous behaviour.
I am choosing the latter approach: raise KeyboardInterrupt exceptions in the _insert and _update code. I think for exceptions other than KeyboardInterrupt, we should re-raise too, since the db is missing some messages at that point, no?
"we should re-raise too since the db is missing some messages at that point, no?"
The behaviour in master is to ignore most exceptions at the top level and then carry on with the main loop, receiving and processing messages. I think this PR should not change that behaviour.
It would be worth investigating separately though.
that makes sense. thanks
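For reference, the behaviour settled on in this thread (re-raise only KeyboardInterrupt, keep logging and absorbing other database errors) could look roughly like the sketch below. It is not the PR's actual code: `FakeDB`, `insert_messages` and `_safe_rollback` are hypothetical names invented for illustration, and only the log messages are taken from the diff above.

```python
import logging

logger = logging.getLogger(__name__)


class FakeDB:
    """Hypothetical stand-in for the database wrapper (not Parsl's class)."""

    def __init__(self, error=None):
        self.error = error          # exception to raise on insert, if any
        self.rolled_back = False
        self.rows = []

    def insert(self, table, messages):
        if self.error is not None:
            raise self.error
        self.rows.extend(messages)

    def rollback(self):
        self.rolled_back = True


def _safe_rollback(db):
    # A failed rollback is logged but never allowed to mask the original error.
    try:
        db.rollback()
    except Exception:
        logger.exception("Rollback failed")


def insert_messages(db, table, messages):
    """Insert messages; only KeyboardInterrupt propagates to the caller."""
    try:
        db.insert(table=table, messages=messages)
    except KeyboardInterrupt:
        logger.exception("KeyboardInterrupt when trying to insert to Table {}".format(table))
        _safe_rollback(db)
        raise  # let the main loop see ctrl-c and shut down
    except Exception:
        # Previous behaviour: log, roll back, and carry on with the main loop.
        logger.exception("Got exception when trying to insert to Table {}".format(table))
        _safe_rollback(db)
```

The two handlers are kept separate on purpose: KeyboardInterrupt is not a subclass of Exception, so the second clause never catches it.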
This PR tries to fix two monitoring issues related to ctrl-c:
Parsl hangs on the following line after ctrl-c when monitoring is enabled.
parsl/parsl/dataflow/dflow.py
Line 960 in 4709b30
This PR fixes it by adding an explicit zmq_SNDTIMEO (1 second) to the channel between DFK and Hub. With this PR, if one presses ctrl+c, Parsl will exit properly after ~1 second. I think setting a timeout here is reasonable: monitoring should not block the DFK from processing tasks.
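To illustrate the general idea of a send timeout (not the PR's ZMQ code), here is a minimal sketch using only the standard library's `socket` module as a stand-in. In pyzmq the corresponding option is `zmq.SNDTIMEO`, given in milliseconds, and a timed-out send raises `zmq.Again`. The helper name `send_with_timeout` is hypothetical.

```python
import socket


def send_with_timeout(sock, payload, timeout_s):
    """Try to send payload; return False on timeout instead of blocking forever.

    Stand-in for a ZMQ socket with a send timeout. Note that on timeout
    part of the payload may already have been written.
    """
    sock.settimeout(timeout_s)
    try:
        sock.sendall(payload)
        return True
    except socket.timeout:
        # The peer is not draining its side; give up so the caller can
        # keep working (or shut down) rather than hang.
        return False
```

With a peer that never reads, repeated calls succeed until the buffer fills and then return False after `timeout_s` seconds, which mirrors the "exit after ~1 second" behaviour described above.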
Issue workflow.time_completed is not always populated on ctrl-C #1589
This PR fixes it by catching the KeyboardInterrupt exception and adding some cleanup steps to database_manager. The original plan to fix this issue was to add an atexit_cleanup like DFK. However, it turns out atexit is never called in multiprocessing processes, since MP processes quit via os._exit(), skipping any cleanup job (including atexit functions, __del__() and weakref finalizers).
(source: https://stackoverflow.com/questions/34506638/how-to-register-atexit-function-in-pythons-multiprocessing-subprocess ).
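A minimal sketch of the approach described here, assuming a loop function used as the target of a `multiprocessing.Process`: the cleanup runs inside the function's own KeyboardInterrupt handler rather than relying on atexit. The names `run_manager`, `next_message` and `record` are hypothetical and not taken from database_manager.

```python
import time


def run_manager(next_message, record):
    """Hypothetical process target: handle messages until interrupted.

    next_message is a callable returning the next message (it may block);
    record is a dict standing in for the workflow row in the database.
    """
    try:
        while True:
            record["messages"].append(next_message())
    except KeyboardInterrupt:
        # Cleanup is done in-line because atexit handlers cannot be
        # relied on when the child process exits via os._exit().
        record["time_completed"] = time.time()
```

Because the handler lives in the function body, it runs however the child process is eventually torn down, as long as the interrupt reaches the loop.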