
KubernetesJob fails due to timeout even when job_watch_timeout_secs is set to None #8345

Closed
4 tasks done
david-elliott-deliveroo opened this issue Feb 1, 2023 · 1 comment · Fixed by #8350
Labels
bug Something isn't working

Comments

@david-elliott-deliveroo

First check

  • I added a descriptive title to this issue.
  • I used the GitHub search to find a similar issue and didn't find it.
  • I searched the Prefect documentation for this issue.
  • I checked that this issue is related to Prefect and not one of its dependencies.

Bug summary

KubernetesJob flows terminate early and end up as Crashed when executing long-running tasks (e.g. long SQL scripts), because kubernetes.watch.Watch().stream() has retries disabled and the stream therefore exits after a period of job inactivity. This happens even when job_watch_timeout_secs is set to None.

The reason this happens is that the kwarg timeout_seconds is always passed into watch.stream() (code ref here), and the kubernetes package uses the presence of this kwarg, not its value, to determine whether or not to disable retries (code ref here).

As such, even when job_watch_timeout_secs is set to None, disable_retries is always set to True purely because of the presence of the kwarg (even though its value is None).
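
To make that concrete, here is a simplified, hypothetical paraphrase of that behaviour (a sketch only, not the actual kubernetes-client source); all that matters is whether the kwarg is present:

    # Hypothetical, simplified paraphrase of the behaviour described above,
    # not the real kubernetes-client implementation: the decision to disable
    # retries is based purely on the presence of timeout_seconds.
    def stream(func, **kwargs):
        disable_retries = "timeout_seconds" in kwargs  # presence check only, value ignored
        return disable_retries

    print(stream(list))                        # False -> retries stay enabled
    print(stream(list, timeout_seconds=None))  # True  -> retries disabled even though the value is None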

The result is that flows with long-running tasks on k8s exit early: the event stream goes quiet, retries are disabled, and so the watch exits. Note: when this happens it hits this else clause, resulting in the following Agent log output:

prefect.infrastructure.kubernetes-job - Job 'sapphire-beluga-n6t7d': Job did not complete within timeout of Nones.
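
(The odd "Nones" in that message is presumably just the None timeout value interpolated into the log string ahead of the "s" unit suffix, e.g.:)

    timeout = None
    print(f"Job did not complete within timeout of {timeout}s.")
    # -> Job did not complete within timeout of Nones.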

A simple suggested fix might be to pass the timeout_seconds arg via ** unpacking, i.e. only passing the timeout if it's actually set, something like...

EXISTING CODE:

            watch = kubernetes.watch.Watch()
            with self.get_batch_client() as batch_client:
                remaining_timeout = (
                    (  # subtract previous watch time
                        self.job_watch_timeout_seconds - elapsed
                    )
                    if self.job_watch_timeout_seconds
                    else None
                )

                for event in watch.stream(
                    func=batch_client.list_namespaced_job,
                    field_selector=f"metadata.name={job_name}",
                    namespace=self.namespace,
                    timeout_seconds=remaining_timeout,
                ):
                    if event["object"].status.completion_time:

REPLACED WITH:

            watch = kubernetes.watch.Watch()
            with self.get_batch_client() as batch_client:
                remaining_timeout = (
                    (  # subtract previous watch time
                        self.job_watch_timeout_seconds - elapsed
                    )
                    if self.job_watch_timeout_seconds
                    else None
                )
                watch_kwargs = {}
                if remaining_timeout:
                    watch_kwargs["timeout_seconds"] = remaining_timeout

                for event in watch.stream(
                    func=batch_client.list_namespaced_job,
                    field_selector=f"metadata.name={job_name}",
                    namespace=self.namespace,
                    **watch_kwargs,
                ):
                    if event["object"].status.completion_time:

Reproduction

Hopefully the above logic makes it clear why this is happening without a full MRE. I can create one if needed.

Error

No response

Versions

$ prefect version                                                                                                                      
Version:             2.7.10
API version:         0.8.4
Python version:      3.8.7
Git commit:          f269d49b
Built:               Thu, Jan 26, 2023 3:51 PM
OS/Arch:             darwin/x86_64
Profile:             bi-p
Server type:         cloud

Additional context

No response

@hsirah

hsirah commented Feb 1, 2023

I also ran into this same issue last night, coincidentally. I did not trace the cause of the error as far as @david-elliott-deliveroo did (nice work there), but I can confirm that my job run was also executing a task with a long-running SQL script. Here are the relevant logs I saw on the agent:

21:28:55.398 | DEBUG | prefect.infrastructure.kubernetes-job - Job 'kappa862-gorthos-c-cg5rq': Starting watch for job completion
....
22:14:28.873 | ERROR | prefect.infrastructure.kubernetes-job - Job 'kappa862-gorthos-c-cg5rq': Job did not complete within timeout of Nones.
22:14:28.899 | INFO | prefect.agent - Reported flow run 'e130ab1c-7e80-4eb6-b1d9-3f0ff5d6c75b' as crashed: Flow run infrastructure exited with non-zero status code -1.

I am also on Prefect 2.7.10 and running Python 3.9.

Bianca Hoch referred me to this issue on the Slack support channel.
