KubernetesJob fails due to timeout even when job_watch_timeout_secs is set to None #8345
Closed
4 tasks done
Labels
bug
Something isn't working
First check
Bug summary
`KubernetesJob` flows terminate early and end up as `Crashed` when executing long-running tasks (e.g. long SQL scripts), because `kubernetes.watch.Watch().stream()` has retries disabled and so the stream exits after a period of job inactivity. This happens even when `job_watch_timeout_secs` is set to `None`.

The reason is that the kwarg `timeout_seconds` is always passed into `watch.stream()` (code ref here), and the kubernetes package uses the presence of this kwarg (not its value) to determine whether or not to disable retries (code ref here). As such, even when `job_watch_timeout_secs` is set to `None`, `disable_retries` is always set to `True` purely because of the presence of the kwarg (even though its value is `None`).

The result is that flows with long-running tasks on k8s end up exiting early: the event stream goes quiet, retries are disabled, and so the watch exits. Note: when this happens, it hits this else clause, which produces the failure seen in the Agent log output.
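To make the presence-versus-value distinction concrete, here is a minimal sketch. The function `retries_disabled` is a hypothetical stand-in for the check described above, not the kubernetes package's actual code; it only assumes that the decision depends on whether the `timeout_seconds` key appears in the keyword arguments.

```python
def retries_disabled(**kwargs):
    # Hypothetical stand-in for the check described in this issue:
    # only the presence of the key matters, its value is never inspected.
    return "timeout_seconds" in kwargs


job_watch_timeout_secs = None

# Always passing the kwarg: the key is present even though the value is None.
always_passed = retries_disabled(timeout_seconds=job_watch_timeout_secs)

# Omitting the kwarg entirely: the key is absent.
omitted = retries_disabled()

print(always_passed, omitted)
```

Under that assumption, passing `timeout_seconds=None` behaves the same as passing a real timeout, which matches the behaviour reported here.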
A simple fix might be to pass the `timeout_seconds` arg via `**`, i.e. only pass the timeout if it's actually set, something like...
EXISTING CODE:
REPLACED WITH:
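The original snippets are not reproduced here, so the following is only a rough sketch of the idea, not the code proposed in the issue. The helper name `build_watch_kwargs` is hypothetical; it assumes the timeout is an optional integer and that the remaining `watch.stream()` arguments are unchanged.

```python
from typing import Any, Dict, Optional


def build_watch_kwargs(job_watch_timeout_secs: Optional[int]) -> Dict[str, Any]:
    """Hypothetical helper: include timeout_seconds only when a timeout is set."""
    watch_kwargs: Dict[str, Any] = {}
    if job_watch_timeout_secs is not None:
        # The key is added only for a real timeout, so a None timeout
        # leaves "timeout_seconds" out of the kwargs entirely.
        watch_kwargs["timeout_seconds"] = job_watch_timeout_secs
    return watch_kwargs


# Intended usage (not executed here, since it needs a cluster):
#   watch.stream(func, ..., **build_watch_kwargs(job_watch_timeout_secs))
```

With this shape, `job_watch_timeout_secs=None` yields an empty dict, so the `timeout_seconds` key is never passed and retries should stay enabled.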
Reproduction
Error
No response
Versions
Additional context
No response