Remove left-over logic from switch to finalizing everything #643

bloodearnest · 2023-08-28T12:34:02Z

ExecutorState.ERROR can be confusing. It is meant to indicate that the
executor itself has errored - a job exiting with a non-zero exit code is
expected behaviour, not an executor error.

As such, get_status() should not be returning ExecutorState.ERROR when
a job is OOM killed - that's business as usual, and should be handled by
finalise() as normal.
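
To illustrate the intended split, here is a minimal sketch. It is not the job-runner code: `ContainerInfo`, its fields, the message strings, the reduced set of states, and the "missing container means executor error" rule are all assumptions made for the example. Only the names `ExecutorState.ERROR`, `get_status()` and `finalise()` come from this PR.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ExecutorState(Enum):
    # Reduced, illustrative subset of states (assumption, not the real enum).
    EXECUTING = "executing"
    EXECUTED = "executed"
    ERROR = "error"


@dataclass
class ContainerInfo:
    # Hypothetical stand-in for what the executor learns about a job's container.
    running: bool
    exit_code: Optional[int] = None
    oom_killed: bool = False


def get_status(container: Optional[ContainerInfo]) -> ExecutorState:
    """Report the executor's state only; never interpret how the job ended."""
    if container is None:
        # Assumed example of a genuine executor problem: the container is missing.
        return ExecutorState.ERROR
    if container.running:
        return ExecutorState.EXECUTING
    # OOM kills and non-zero exits are still a normally finished job.
    return ExecutorState.EXECUTED


def finalise(container: ContainerInfo) -> str:
    """The single place where the job's outcome becomes a user-facing message."""
    if container.oom_killed:
        return "Job ran out of memory"
    if container.exit_code != 0:
        return f"Job exited with an error code ({container.exit_code})"
    return "Completed successfully"
```

With this shape, an OOM-killed job reports `EXECUTED` from `get_status()` and only `finalise()` decides what the user is told, so nothing is escalated as an internal error.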

When we recently refactored job-runner so that everything goes via
finalise(), I think this behaviour was mistakenly preserved from the
old way of doing things. It resulted in job-runner treating OOM kills
as INTERNAL_ERRORS, obscuring the message shown to users and
needlessly paging tech-support.

Now that we only handle the OOM case in one place, I de-factored the
functions that were previously shared between the two locations.

Fixes #634

madwort

good fix, a couple of nits to address though

jobrunner/run.py

jobrunner/job_executor.py

tests/test_local_executor.py


If a reusable action happened to be the last job to execute, then the repo in the manifest file was set to the action's repo, not the study's repo. However, we no longer need the repo in the manifest file, so just set it to None to avoid this problem. Once we have a releases UI, we can get rid of the manifest file code altogether.
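
A rough sketch of the idea, assuming a manifest that is a plain dict with a `repo` key and an `outputs` mapping. The function name `update_manifest` and the manifest layout are hypothetical and are not taken from the job-runner source.

```python
from typing import Optional


def update_manifest(existing: dict, job_outputs: dict, job_repo: Optional[str]) -> dict:
    """Return an updated manifest without recording whichever repo the last job used."""
    manifest = dict(existing)
    # job_repo is deliberately ignored: it may be a reusable action's repo
    # rather than the study's repo, so the field is always set to None.
    manifest["repo"] = None
    # Merge outputs into a new dict so the caller's manifest is not mutated.
    manifest["outputs"] = {**existing.get("outputs", {}), **job_outputs}
    return manifest
```

Because the field is always None, the order in which study jobs and reusable-action jobs finish no longer affects the manifest.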

bloodearnest · 2023-08-31T13:05:59Z

deployed

bloodearnest force-pushed the fix-error-erroring branch 3 times, most recently from 71060c8 to ac2af20 on August 29, 2023 12:03

bloodearnest requested a review from madwort August 29, 2023 12:04

bloodearnest force-pushed the fix-error-erroring branch from ac2af20 to 9ae9a97 on August 29, 2023 16:08

madwort approved these changes Aug 30, 2023


bloodearnest added 3 commits August 31, 2023 11:56

Driveby fix tracing attrs error noticed in logs

430a9a9

bloodearnest force-pushed the fix-error-erroring branch from 9ae9a97 to 6d7d502 on August 31, 2023 10:56

bloodearnest merged commit 68db9b0 into main Aug 31, 2023
12 checks passed

bloodearnest deleted the fix-error-erroring branch August 31, 2023 12:57

bloodearnest mentioned this pull request Aug 31, 2023

Jobs are failing with an internal error when they should be returning an out of memory error message #636

Closed
