Fix monitoring ctrlc hang #1670

ZhuozhaoLi · 2020-05-06T23:30:18Z

This PR tries to fix two monitoring issues related to ctrl-c

Parsl hangs on the following line after ctrl-c when monitoring is enabled (parsl/parsl/dataflow/dflow.py, line 960 in 4709b30):

    self.monitoring.send(MessageType.WORKFLOW_INFO,

This PR fixes it by adding an explicit ZMQ send timeout (zmq.SNDTIMEO, 1 second) to the channel between the DFK and the Hub. With this PR, pressing ctrl-c makes Parsl exit properly after about 1 second.
I think setting a timeout here is reasonable: monitoring should not block the DFK from processing tasks.
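
As a rough illustration of the send-timeout idea, the sketch below uses a plain standard-library socket pair rather than ZMQ, so it is an analogy and not Parsl's code. Here `settimeout` plays the role of the ZMQ send timeout, the helper name `send_with_timeout` is made up for this example, and the receiving end never reads, which stands in for an unresponsive Hub.

```python
import socket


def send_with_timeout(sock, payload, timeout_s):
    """Try to send payload; return False on timeout instead of blocking forever."""
    sock.settimeout(timeout_s)
    try:
        sock.sendall(payload)
        return True
    except socket.timeout:
        # The peer is not draining the channel; give up on this message.
        return False


# A connected pair whose receiving side never reads (hypothetical stalled hub).
sender, stalled_receiver = socket.socketpair()
chunk = b"x" * 65536

timed_out = False
for _ in range(1024):  # upper bound so the loop cannot run away
    if not send_with_timeout(sender, chunk, 0.2):
        timed_out = True  # the send buffer filled up and the timeout fired
        break

sender.close()
stalled_receiver.close()
```

Without the timeout, the send that finally fills the buffer would block indefinitely, which is the kind of hang described above. With pyzmq, the analogous pattern would presumably be to set the `zmq.SNDTIMEO` socket option and catch `zmq.Again` around the send call.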
Issue: workflow.time_completed is not always populated on ctrl-C #1589
This PR fixes it by catching KeyboardInterrupt and adding some cleanup steps to database_manager.
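
A minimal sketch of that kind of cleanup, under stated assumptions: `run_message_loop`, `interrupted_source` and the dictionary standing in for the workflow row are all hypothetical names for this example, not the actual database_manager code. The ctrl-c is simulated by a generator that raises KeyboardInterrupt.

```python
import datetime


def run_message_loop(messages, workflow_row):
    """Consume messages; on ctrl-c, stamp the completion time, then re-raise."""
    try:
        for message in messages:
            workflow_row["last_message"] = message
    except KeyboardInterrupt:
        # Cleanup step: make sure the completion time is recorded before exiting.
        workflow_row["time_completed"] = datetime.datetime.now()
        raise


def interrupted_source():
    """Yield one message, then simulate the user pressing ctrl-c."""
    yield "task 0 finished"
    raise KeyboardInterrupt


row = {"time_completed": None}
try:
    run_message_loop(interrupted_source(), row)
except KeyboardInterrupt:
    interrupt_seen = True
else:
    interrupt_seen = False
```

Re-raising after the cleanup lets the caller still see the interrupt and shut down, while the row no longer ends up with an empty completion time.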

The original plan to fix this issue was to add an atexit cleanup like the DFK's atexit_cleanup. However, it turns out that atexit handlers are never called in multiprocessing processes, since MP processes quit via os._exit(), which skips any cleanup job (including atexit functions, __del__() and weakref finalizers). (Source: https://stackoverflow.com/questions/34506638/how-to-register-atexit-function-in-pythons-multiprocessing-subprocess)
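
The os._exit() behaviour described above can be checked with a small experiment. This sketch does not use multiprocessing itself: it launches a child interpreter whose script calls os._exit() directly, as an assumed stand-in for how an MP process ends, and compares an atexit handler with explicit try/finally cleanup.

```python
import subprocess
import sys

# Script run in a child interpreter; it ends with os._exit(), which skips atexit.
CHILD_SCRIPT = """
import atexit
import os

atexit.register(lambda: print("atexit handler ran", flush=True))
try:
    pass  # stand-in for the process's main loop
finally:
    print("explicit cleanup ran", flush=True)
os._exit(0)
"""

result = subprocess.run(
    [sys.executable, "-c", CHILD_SCRIPT],
    capture_output=True,
    text=True,
)
lines = result.stdout.splitlines()
print(lines)  # only the explicit cleanup message appears
```

Only the explicit cleanup line is printed, which is why the cleanup has to be triggered from exception handling inside the process rather than from an atexit hook.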

benclifford · 2020-05-10T12:20:24Z

parsl/monitoring/db_manager.py

            self.logger.exception("Got exception when trying to insert to Table {}".format(table))
            try:
                self.db.rollback()
            except Exception:
                self.logger.exception("Rollback failed")
+            raise


_insert and _update now pass on database errors to their caller, rather than absorbing them.
So probably the caller (the main loop process) now needs to deal with non-KeyboardInterrupt exceptions that might occur. Or, the _insert and _update code should only re-raise KeyboardInterrupt exceptions to preserve previous behaviour.

ZhuozhaoLi · 2020-05-11

I am choosing the latter approach: re-raise only KeyboardInterrupt exceptions in the _insert and _update code. That said, I think we should re-raise exceptions other than KeyboardInterrupt too, since the db is missing some messages at that point, no?

benclifford · 2020-05-11

> we should re-raise too since the db is missing some messages at that point, no?

The behaviour in master is to ignore most exceptions at the top level and then carry on with the main loop, receiving and processing messages. I think this PR should not change that behaviour.

It would be worth investigating separately though.

ZhuozhaoLi · 2020-05-11

That makes sense. Thanks.
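
The outcome of this thread can be sketched as follows, as an illustration under assumptions rather than the merged code: `FakeDb` and the free function `insert` are hypothetical stand-ins for the database session and the `_insert` method. Ordinary errors are logged, rolled back and absorbed so the main loop can carry on, while KeyboardInterrupt is passed up to the caller.

```python
import logging

logger = logging.getLogger("db_manager_sketch")


class FakeDb:
    """Hypothetical stand-in for a database session."""

    def __init__(self, error=None):
        self.error = error
        self.rolled_back = False

    def insert(self, table, rows):
        if self.error is not None:
            raise self.error

    def rollback(self):
        self.rolled_back = True


def insert(db, table, rows):
    """Insert rows; absorb ordinary errors but let ctrl-c propagate."""
    try:
        db.insert(table, rows)
    except KeyboardInterrupt:
        logger.error("Interrupted while inserting into table %s", table)
        db.rollback()
        raise  # the caller needs to see the interrupt to run its cleanup
    except Exception:
        logger.exception("Got exception when trying to insert to Table %s", table)
        db.rollback()  # absorbed: the main loop keeps processing messages


broken = FakeDb(error=ValueError("bad row"))
insert(broken, "task", [{"id": 1}])  # logged and absorbed, no exception escapes

interrupted = FakeDb(error=KeyboardInterrupt())
try:
    insert(interrupted, "task", [])
    propagated = False
except KeyboardInterrupt:
    propagated = True
```

Note that KeyboardInterrupt derives from BaseException rather than Exception, so an `except Exception` clause alone never absorbs it. The separate clause here only adds the rollback and log line before passing it on.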

ZhuozhaoLi added 4 commits May 6, 2020 18:48

add a timeout to monitoring dfk channel

a1672a2

monitoring database logs workflow completion time when exiting abnormally

159337b

remove a line that sets workflow_end attribute

8c305f7

remove one unused exception e

eafbb64

ZhuozhaoLi requested review from yadudoc and benclifford May 6, 2020 23:30

ZhuozhaoLi added the monitoring label May 6, 2020

Merge branch 'master' into fix_monitoring_ctrlc_hang

88a446f

benclifford reviewed May 7, 2020

ZhuozhaoLi added 6 commits May 7, 2020 15:47

catch KeyboardInterrupt exception and raise to the upper level in db_manager

2229f64

raise KeyboardInterrupt exception in dbm_starter

6faf49a

rename workflow_message and change code sytle

ffcdcbe

Merge branch 'fix_monitoring_ctrlc_hang' of github.com:Parsl/parsl into fix_monitoring_ctrlc_hang

859b1c2

increase timeout and better exception text

6879691

change f string to format

aa20379

benclifford reviewed May 10, 2020

ZhuozhaoLi and others added 3 commits May 11, 2020 17:32

separate keyboardinterrupt and other exceptions in db_manager

ecc8d4c

do not re-raise exceptions in _insert and _update

2396072

Merge branch 'master' into fix_monitoring_ctrlc_hang

dc4512f

yadudoc added this to the 1.0 milestone May 11, 2020

Merge branch 'master' into fix_monitoring_ctrlc_hang

f0558e5

benclifford approved these changes May 13, 2020

Merge branch 'master' into fix_monitoring_ctrlc_hang

4292374

benclifford merged commit 6f8f66c into master May 13, 2020

benclifford deleted the fix_monitoring_ctrlc_hang branch May 13, 2020 15:06

benclifford added a commit that referenced this pull request May 13, 2020

Bring in master up to PR #1670 into lsst-dm-202005

12f9cc6

TomGlanzman mentioned this pull request Sep 20, 2021

Problems with unexpected Parsl workflow shutdown #2123

Open
