Check *all* child processes for liveness every HB (#2817)
Prior to this, a dead process in a pool of N processes would take, on average,
(N * heartbeat) / 2 seconds to be restarted.
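
The arithmetic behind that figure can be sketched as follows. This is a rough model, not a measurement: it assumes a worker dies at a uniformly random moment, that the liveness check itself takes negligible time, and that the old loop slept for one heartbeat after checking each worker. The function names are hypothetical, and the "new" figure is an inference from the change, not a number stated in the commit.

```python
def mean_restart_delay_old(n_workers: int, heartbeat: float) -> float:
    """Old loop: one sleep per worker, so a full pass takes n_workers * heartbeat.

    A worker that dies at a random moment waits, on average, half a pass
    before its turn comes around again.
    """
    return n_workers * heartbeat / 2


def mean_restart_delay_new(heartbeat: float) -> float:
    """New loop (assumed model): every worker is checked once per heartbeat.

    A worker that dies at a random moment waits, on average, half a heartbeat.
    """
    return heartbeat / 2


if __name__ == "__main__":
    # Example values chosen for illustration only.
    print(mean_restart_delay_old(n_workers=8, heartbeat=30.0))
    print(mean_restart_delay_new(heartbeat=30.0))
```

Under this model the delay no longer grows with the size of the pool.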
khk-globus authored Jul 12, 2023
1 parent a3cae27 commit 11509cb
Showing 1 changed file with 2 additions and 3 deletions.
5 changes: 2 additions & 3 deletions parsl/executors/high_throughput/process_worker_pool.py
@@ -370,7 +370,7 @@ def push_results(self, kill_event):
         logger.critical("Exiting")

     @wrap_with_logs
-    def worker_watchdog(self, kill_event):
+    def worker_watchdog(self, kill_event: threading.Event):
         """Keeps workers alive.
         Parameters:
@@ -381,7 +381,7 @@ def worker_watchdog(self, kill_event):

         logger.debug("Starting worker watchdog")

-        while not kill_event.is_set():
+        while not kill_event.wait(self.heartbeat_period):
             for worker_id, p in self.procs.items():
                 if not p.is_alive():
                     logger.error("Worker {} has died".format(worker_id))
@@ -409,7 +409,6 @@ def worker_watchdog(self, kill_event):
                                        name="HTEX-Worker-{}".format(worker_id))
                     self.procs[worker_id] = p
                     logger.info("Worker {} has been restarted".format(worker_id))
-                time.sleep(self.heartbeat_period)

         logger.critical("Exiting")
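
The pattern this diff moves to can be sketched in isolation. `threading.Event.wait(timeout)` returns `False` when the timeout elapses and `True` as soon as the event is set, so using it as the loop condition acts as both the heartbeat delay and a prompt shutdown check. The sketch below is not Parsl's code: `watchdog`, `liveness`, `on_dead`, and `demo` are hypothetical stand-ins, with plain callables in place of real worker processes.

```python
import threading
import time


def watchdog(kill_event, heartbeat_period, liveness, on_dead):
    """Check every worker once per heartbeat until kill_event is set.

    liveness maps a worker id to a zero-argument callable returning True
    while that worker is alive (a stand-in for Process.is_alive).
    """
    # wait() sleeps for up to heartbeat_period, but returns True at once
    # if kill_event is set, which ends the loop without a further sleep.
    while not kill_event.wait(heartbeat_period):
        for worker_id, is_alive in list(liveness.items()):
            if not is_alive():
                on_dead(worker_id)


def demo():
    """Run the watchdog briefly and report what it saw."""
    kill_event = threading.Event()
    dead = []
    liveness = {0: lambda: True, 1: lambda: False, 2: lambda: True}
    t = threading.Thread(
        target=watchdog, args=(kill_event, 0.05, liveness, dead.append)
    )
    t.start()
    time.sleep(0.3)        # let a few heartbeats pass
    kill_event.set()       # request shutdown
    t.join(timeout=1.0)
    return dead, t.is_alive()


if __name__ == "__main__":
    dead_ids, still_running = demo()
    print(sorted(set(dead_ids)), still_running)
```

One consequence of this design is that the first check happens one heartbeat after the thread starts, rather than immediately.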
