Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove left-over logic from switch to finalizing everything #643

Merged
merged 3 commits into from
Aug 31, 2023

Conversation

bloodearnest
Copy link
Member

ExecutorState.ERROR can be confusing. It is meant to indicate that the
executor has errored - a job exiting with an exit code is expected
behaviour, not an executor error.

As such, get_status() should not be returning ExecutorState.ERROR when
a job is OOM killed - that's business as usual, and should be handled by
finalise() as normal.

When we refactored job-runner recently for everything to go via
finalise(), I think this behaviour was mistakenly preserved from the
old way of doing things. It resulted in job-runner treating OOM kills
as INTERNAL_ERRORS, and obscuring the message to the users and
needlessly paging teck-support.

Now we only handle the OOM case in one place, I de-factored the
functions that were previously shared between the two locations

Fixes #634

@bloodearnest bloodearnest force-pushed the fix-error-erroring branch 3 times, most recently from 71060c8 to ac2af20 Compare August 29, 2023 12:03
@bloodearnest bloodearnest requested a review from madwort August 29, 2023 12:04
Copy link
Contributor

@madwort madwort left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good fix, a couple of nits to address though

jobrunner/run.py Outdated Show resolved Hide resolved
jobrunner/job_executor.py Outdated Show resolved Hide resolved
tests/test_local_executor.py Show resolved Hide resolved
ExecutorState.ERROR can be confusing. It is meant to indicate that the
*executor* has errored - a job exiting with an exit code is expected
behaviour, not an *executor* error.

As such, `get_status()` should not be returning ExecutorState.ERROR when
a job is OOM killed - that's business as usual, and should be handled by
`finalise()` as normal.

When we refactored job-runner recently for everything to go via
`finalise()`, I think this behaviour was mistakenly preserved from the
old way of doing things. It resulted in job-runner treating OOM kills
as INTERNAL_ERRORS, and obscuring the message to the users and
needlessly paging teck-support.

Now we only handle the OOM case in one place, I de-factored the
functions that were previously shared between the two locations

Fixes #634
If a resuable action happened to be the last job to execute, then the
repo in the manifest file was set to the action's repo, not the study's
repo.

However, we no longer need the repo in the manifest files, so just set
it none to avoid this problem.

Once we have a releases UI, we can get rid of the manifest file code
alltogether.
@bloodearnest bloodearnest merged commit 68db9b0 into main Aug 31, 2023
12 checks passed
@bloodearnest bloodearnest deleted the fix-error-erroring branch August 31, 2023 12:57
@bloodearnest
Copy link
Member Author

deployed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Missing docker metrics for some jobs
2 participants