
KubernetesJob fails due to timeout even when job_watch_timeout_secs is set to None #8345

Closed
4 tasks done
david-elliott-deliveroo opened this issue Feb 1, 2023 · 1 comment · Fixed by #8350
Labels
bug Something isn't working

Comments

@david-elliott-deliveroo

First check

  • I added a descriptive title to this issue.
  • I used the GitHub search to find a similar issue and didn't find it.
  • I searched the Prefect documentation for this issue.
  • I checked that this issue is related to Prefect and not one of its dependencies.

Bug summary

KubernetesJob flows terminate early and end up as Crashed when executing long-running tasks (e.g. long SQL scripts), because kubernetes.watch.Watch().stream() has retries disabled and the stream therefore exits after a period of job inactivity. This happens even when job_watch_timeout_secs is set to None.

The reason this happens is that the kwarg timeout_seconds is always passed into watch.stream() (code ref here), and the kubernetes package uses the presence of this kwarg, not its value, to determine whether or not to disable retries (code ref here).

As such, even when job_watch_timeout_secs is set to None, disable_retries is always set to True purely because of the presence of the kwarg (even though its value is None).
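
To make that concrete, here is a simplified, hypothetical paraphrase of that behaviour (a sketch only, not the actual kubernetes-client source); all that matters is whether the kwarg is present:

    # Hypothetical, simplified paraphrase of the behaviour described above,
    # not the real kubernetes-client implementation: the decision to disable
    # retries is based purely on the presence of timeout_seconds.
    def stream(func, **kwargs):
        disable_retries = "timeout_seconds" in kwargs  # presence check only, value ignored
        return disable_retries

    print(stream(list))                        # False -> retries stay enabled
    print(stream(list, timeout_seconds=None))  # True  -> retries disabled even though the value is None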

The result is that flows with long-running tasks on k8s exit early: the event stream goes quiet, retries are disabled, and so the watch exits. Note: when this happens it hits this else clause, resulting in the following Agent log output:

prefect.infrastructure.kubernetes-job - Job 'sapphire-beluga-n6t7d': Job did not complete within timeout of Nones.
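
(The odd "Nones" in that message is presumably just the None timeout value interpolated into the log string ahead of the "s" unit suffix, e.g.:)

    timeout = None
    print(f"Job did not complete within timeout of {timeout}s.")
    # -> Job did not complete within timeout of Nones.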

A simple suggested fix might be to pass the timeout_seconds arg via ** unpacking, i.e. only passing the timeout if it's actually set, something like...

EXISTING CODE:

            watch = kubernetes.watch.Watch()
            with self.get_batch_client() as batch_client:
                remaining_timeout = (
                    (  # subtract previous watch time
                        self.job_watch_timeout_seconds - elapsed
                    )
                    if self.job_watch_timeout_seconds
                    else None
                )

                for event in watch.stream(
                    func=batch_client.list_namespaced_job,
                    field_selector=f"metadata.name={job_name}",
                    namespace=self.namespace,
                    timeout_seconds=remaining_timeout,
                ):
                    if event["object"].status.completion_time:

REPLACED WITH:

            watch = kubernetes.watch.Watch()
            with self.get_batch_client() as batch_client:
                remaining_timeout = (
                    (  # subtract previous watch time
                        self.job_watch_timeout_seconds - elapsed
                    )
                    if self.job_watch_timeout_seconds
                    else None
                )
                watch_kwargs = {}
                if remaining_timeout:
                    watch_kwargs["timeout_seconds"] = remaining_timeout

                for event in watch.stream(
                    func=batch_client.list_namespaced_job,
                    field_selector=f"metadata.name={job_name}",
                    namespace=self.namespace,
                    **watch_kwargs,
                ):
                    if event["object"].status.completion_time:

Reproduction

Hopefully the above logic makes it clear why this is happening without a full MRE. I can create one if needed.

Error

No response

Versions

$ prefect version                                                                                                                      
Version:             2.7.10
API version:         0.8.4
Python version:      3.8.7
Git commit:          f269d49b
Built:               Thu, Jan 26, 2023 3:51 PM
OS/Arch:             darwin/x86_64
Profile:             bi-p
Server type:         cloud

Additional context

No response

@hsirah

hsirah commented Feb 1, 2023

I also ran into this same issue last night, coincidentally. I did not trace the cause of the error as far as @david-elliott-deliveroo did (nice work there), but I can confirm that my job run was also executing a task with a long-running SQL script. Here are the relevant logs I saw on the agent:

21:28:55.398 | DEBUG | prefect.infrastructure.kubernetes-job - Job 'kappa862-gorthos-c-cg5rq': Starting watch for job completion
....
22:14:28.873 | ERROR | prefect.infrastructure.kubernetes-job - Job 'kappa862-gorthos-c-cg5rq': Job did not complete within timeout of Nones.
22:14:28.899 | INFO | prefect.agent - Reported flow run 'e130ab1c-7e80-4eb6-b1d9-3f0ff5d6c75b' as crashed: Flow run infrastructure exited with non-zero status code -1.

I am also on Prefect 2.7.10 and running Python 3.9.

Bianca Hoch referred me to this issue on the Slack support channel.
