mark running jobs as failed on exc k8s restart (#3642)
This PR makes the following changes:

- The `ExecutionController` `_verifyExecution()` function now marks the execution status as failed if the current status is a running status (`'recovering'`, `'running'`, `'failing'`, `'paused'`, `'stopping'`). We can't start from a running state, so the job will be stopped, but previously the status was not updated. The assumption is that one of these statuses means the execution controller pod has crashed and k8s is restarting a new pod. We would never get into this situation in native clustering, as there are checks before the execution controller process is forked. See the sketch after this list for an illustration of the check.
- The `ExecutionService` `_finishExecution()` function also marks the execution status as failed if it is in a running status. It looks like this function is designed to run after an execution controller error and shutdown, so it's safe to assume a running status means there was an error and the execution controller shut down before the status was updated.

Ref: #2673
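Below is a minimal sketch of the idea described above: on startup, if the stored execution status is still a "running" status, the previous pod must have died, so mark the execution as failed rather than leaving a stale status behind. The names used here (`ExecutionStatus`, `ExecutionStore`, `verifyExecutionCanStart`, `_failureReason`) are hypothetical and do not mirror the actual Teraslice internals; only the running-status list and the fail-on-restart behavior come from this PR.

```typescript
// Hypothetical types for illustration only; not the real Teraslice API.
type ExecutionStatus =
    | 'pending' | 'scheduling' | 'initializing'
    | 'recovering' | 'running' | 'failing' | 'paused' | 'stopping'
    | 'completed' | 'stopped' | 'failed' | 'terminated';

// Statuses that indicate the execution controller was mid-run when the pod died.
const RUNNING_STATUSES: ReadonlyArray<ExecutionStatus> = [
    'recovering', 'running', 'failing', 'paused', 'stopping',
];

interface ExecutionStore {
    getStatus(exId: string): Promise<ExecutionStatus>;
    setStatus(
        exId: string,
        status: ExecutionStatus,
        metadata?: Record<string, unknown>
    ): Promise<void>;
}

/**
 * Called when a new execution controller pod starts up. If the stored status
 * is still a running status, assume the previous pod crashed and k8s restarted
 * it, so mark the execution as failed instead of leaving the old status in place.
 * Returns true only if the execution is in a state it can actually start from.
 */
async function verifyExecutionCanStart(store: ExecutionStore, exId: string): Promise<boolean> {
    const status = await store.getStatus(exId);

    if (RUNNING_STATUSES.includes(status)) {
        await store.setStatus(exId, 'failed', {
            _failureReason: `Execution ${exId} was restarted while in status "${status}"`,
        });
        return false;
    }

    // Only pre-run statuses are safe to start from; terminal statuses are rejected elsewhere.
    return status === 'pending' || status === 'scheduling' || status === 'initializing';
}
```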