
mark running jobs as failed on exc k8s restart #3642

Merged
merged 2 commits into master from mark-job-failed-on-k8s-exc-restart
Jun 14, 2024

Conversation

busma13
Contributor

@busma13 busma13 commented Jun 5, 2024

This PR makes the following changes:

  • The `ExecutionController` `_verifyExecution()` function now marks the execution status as failed if the current status is a running status (`'recovering'`, `'running'`, `'failing'`, `'paused'`, `'stopping'`). An execution can't be started from a running state, so the job will be stopped, but previously the status was never updated. The assumption is that one of these statuses means the execution controller pod has crashed and k8s is restarting a new pod. This situation can't arise in native clustering, as there are checks before the execution controller process is forked.
  • The `ExecutionService` `_finishExecution()` function likewise marks the execution status as failed if it is in a running status. This function is designed to run after an execution controller error and shutdown, so it's safe to assume a running status means there was an error and the execution controller shut down before the status was updated.

Ref: #2673
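The checks described above can be sketched in TypeScript. This is a minimal, hypothetical illustration, not the actual Teraslice code: the `ExecutionStore` interface, `verifyExecution` signature, and status strings used here are assumptions for demonstration, chosen to mirror the behavior the PR describes (on startup, a stored running status implies the previous pod crashed, so the status is flipped to failed and startup is refused).

```typescript
// Hypothetical sketch of the restart check; names and signatures are
// illustrative, not the real Teraslice API.
const RUNNING_STATUSES: string[] = [
    'recovering', 'running', 'failing', 'paused', 'stopping'
];

function isRunningStatus(status: string): boolean {
    return RUNNING_STATUSES.includes(status);
}

// Assumed minimal store interface for reading/writing execution status.
interface ExecutionStore {
    getStatus(exId: string): Promise<string>;
    setStatus(exId: string, status: string, reason?: string): Promise<void>;
}

// Returns true if the execution may start. If the stored status is still a
// running one, the previous execution controller pod is presumed to have
// crashed, so the status is marked 'failed' and startup is refused.
async function verifyExecution(store: ExecutionStore, exId: string): Promise<boolean> {
    const status = await store.getStatus(exId);
    if (isRunningStatus(status)) {
        await store.setStatus(
            exId,
            'failed',
            `status was "${status}" on startup; previous execution controller presumed crashed`
        );
        return false;
    }
    return true;
}
```

The same `isRunningStatus`-style guard covers both code paths in the PR: `_verifyExecution()` on controller startup and `_finishExecution()` after a controller error and shutdown.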

@busma13 busma13 self-assigned this Jun 5, 2024
@godber godber added this to the v2.0.1 milestone Jun 5, 2024
@busma13 busma13 marked this pull request as ready for review June 6, 2024 15:36
@busma13 busma13 requested a review from godber June 6, 2024 15:36
@busma13 busma13 changed the title mark running jobs as failed on exc restart mark running jobs as failed on exc k8s restart Jun 6, 2024
@busma13 busma13 force-pushed the mark-job-failed-on-k8s-exc-restart branch from ecca23e to 7c83155 on June 13, 2024 23:11
@godber godber merged commit 19e70bf into master Jun 14, 2024
59 checks passed
@godber godber deleted the mark-job-failed-on-k8s-exc-restart branch June 14, 2024 20:40
busma13 added a commit that referenced this pull request Jun 18, 2024