
mark running jobs as failed on exc k8s restart #3642

Merged
merged 2 commits into master from mark-job-failed-on-k8s-exc-restart
Jun 14, 2024

Conversation

busma13
Contributor

@busma13 busma13 commented Jun 5, 2024

This PR makes the following changes:

  • The `ExecutionController` `_verifyExecution()` function now marks the execution status as failed if the current status is a running status (`'recovering'`, `'running'`, `'failing'`, `'paused'`, `'stopping'`). An execution can't be started from a running state, so the job will be stopped, but previously the status was never updated. The assumption is that one of these statuses means the execution controller pod has crashed and k8s is restarting a new pod. This situation can't arise in native clustering, as there are checks before the execution controller process is forked.
  • The `ExecutionService` `_finishExecution()` function likewise marks the execution status as failed if it is in a running status. This function is designed to run after an execution controller error and shutdown, so it's safe to assume a running status means there was an error and the execution controller shut down before the status was updated.

Ref: #2673
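The checks described above can be sketched in TypeScript. This is a minimal, hypothetical illustration, not the actual Teraslice code: the `ExecutionStore` interface, `verifyExecution` signature, and status strings used here are assumptions for demonstration, chosen to mirror the behavior the PR describes (on startup, a stored running status implies the previous pod crashed, so the status is flipped to failed and startup is refused).

```typescript
// Hypothetical sketch of the restart check; names and signatures are
// illustrative, not the real Teraslice API.
const RUNNING_STATUSES: string[] = [
    'recovering', 'running', 'failing', 'paused', 'stopping'
];

function isRunningStatus(status: string): boolean {
    return RUNNING_STATUSES.includes(status);
}

// Assumed minimal store interface for reading/writing execution status.
interface ExecutionStore {
    getStatus(exId: string): Promise<string>;
    setStatus(exId: string, status: string, reason?: string): Promise<void>;
}

// Returns true if the execution may start. If the stored status is still a
// running one, the previous execution controller pod is presumed to have
// crashed, so the status is marked 'failed' and startup is refused.
async function verifyExecution(store: ExecutionStore, exId: string): Promise<boolean> {
    const status = await store.getStatus(exId);
    if (isRunningStatus(status)) {
        await store.setStatus(
            exId,
            'failed',
            `status was "${status}" on startup; previous execution controller presumed crashed`
        );
        return false;
    }
    return true;
}
```

The same `isRunningStatus`-style guard covers both code paths in the PR: `_verifyExecution()` on controller startup and `_finishExecution()` after a controller error and shutdown.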

@busma13 busma13 self-assigned this Jun 5, 2024
@godber godber added this to the v2.0.1 milestone Jun 5, 2024
@busma13 busma13 marked this pull request as ready for review June 6, 2024 15:36
@busma13 busma13 requested a review from godber June 6, 2024 15:36
@busma13 busma13 changed the title mark running jobs as failed on exc restart mark running jobs as failed on exc k8s restart Jun 6, 2024
@busma13 busma13 force-pushed the mark-job-failed-on-k8s-exc-restart branch from ecca23e to 7c83155 on June 13, 2024 23:11
@godber godber merged commit 19e70bf into master Jun 14, 2024
59 checks passed
@godber godber deleted the mark-job-failed-on-k8s-exc-restart branch June 14, 2024 20:40
busma13 added a commit that referenced this pull request Jun 18, 2024