
Fix Kubernetes job watch exit when no timeout given #8350

Merged 5 commits into main on Feb 2, 2023

Conversation

zanieb (Contributor) commented Feb 1, 2023

Closes #8345

The issue is a good summary of the details here, but briefly:

  1. We should not pass `timeout_seconds` to the Kubernetes watch unless it is set; the client uses the presence of the kwarg to determine whether retries should be enabled.
  2. We should not exit early when the watch exits unless a timeout is given.
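The two fixes above can be sketched together. This is a minimal, self-contained illustration, not Prefect's actual implementation: `stream_events` stands in for `kubernetes.watch.Watch().stream(...)`, and all names here are hypothetical.

```python
def watch_until_complete(stream_events, timeout_seconds=None):
    """Watch a job until it completes, re-opening the watch as needed."""
    while True:
        # (1) Only include `timeout_seconds` when it is actually set: the
        # Kubernetes client keys retry behavior off the kwarg's *presence*,
        # so passing `timeout_seconds=None` is not the same as omitting it.
        kwargs = {}
        if timeout_seconds is not None:
            kwargs["timeout_seconds"] = timeout_seconds
        for event in stream_events(**kwargs):
            if event.get("completed"):
                return event
        # (2) The watch exited without the job completing (e.g. the
        # connection dropped). Treat that as a failure only when a timeout
        # was given; with no timeout, loop and open a fresh watch.
        if timeout_seconds is not None:
            raise TimeoutError("job watch timed out before completion")
```

With no timeout, a dropped watch simply starts a new one; with a timeout, the watch exiting without completion is surfaced as an error.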

Checklist

  • This pull request references any related issue by including "closes <link to issue>"
    • If no issue exists and your change is not a small fix, please create an issue first.
  • This pull request includes tests or only affects documentation.
  • This pull request includes a label categorizing the change e.g. fix, feature, enhancement

@zanieb zanieb requested a review from a team as a code owner February 1, 2023 16:32
netlify bot commented Feb 1, 2023

Deploy Preview for prefect-orion ready!

🔨 Latest commit: 7e18fc8
🔍 Latest deploy log: https://app.netlify.com/sites/prefect-orion/deploys/63dbf18753f5db0009b6ed74
😎 Deploy Preview: https://deploy-preview-8350--prefect-orion.netlify.app/api-ref/prefect/infrastructure

@zanieb zanieb requested a review from zzstoatzz February 1, 2023 16:33
@zanieb zanieb added the fix A fix for a bug in an existing feature label Feb 1, 2023
zzstoatzz (Collaborator) left a comment

Makes sense to me. My only other question is whether the k8s watch (with retries enabled) would make our `while` loop redundant.

EDIT: No, I don't actually think it does: if someone sets the timeout to a very long time, we could feasibly encounter the same issue as we do now, so I think it makes sense to keep the `while` loop.

@david-elliott-deliveroo

Thanks for this! 🌟 Just a heads up: I installed this branch (git+https://github.com/PrefectHQ/prefect.git@fix-kubernetes-watch) and deployed my k8s agent with it, and it immediately throws the error below and the pod restarts. Wanted to flag it. I'm pretty certain it's from other changes in the branch (vs 2.7.10) and not from this change specifically; presumably one of those changes.

Starting v2.7.10+35.g14c0f1b77 agent connected to
https://api.prefect.cloud/api/accounts/<redacted>/workspaces/<redacted>...

  ___ ___ ___ ___ ___ ___ _____     _   ___ ___ _  _ _____
 | _ \ _ \ __| __| __/ __|_   _|   /_\ / __| __| \| |_   _|
 |  _/   / _|| _|| _| (__  | |    / _ \ (_ | _|| .` | | |
 |_| |_|_\___|_| |___\___| |_|   /_/ \_\___|___|_|\_| |_|


Agent started! Looking for work from queue(s): bi-pipeline-main-local-dev-k8s,
bi-pipeline-other-local-dev-k8s...
An exception occurred.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/prefect/cli/_utilities.py", line 41, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/prefect/utilities/asyncutils.py", line 230, in coroutine_wrapper
    return run_async_in_new_loop(async_fn, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/prefect/utilities/asyncutils.py", line 181, in run_async_in_new_loop
    return anyio.run(partial(__fn, *args, **kwargs))
  File "/usr/local/lib/python3.8/site-packages/anyio/_core/_eventloop.py", line 70, in run
    return asynclib.run(func, *args, **backend_options)
  File "/usr/local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 292, in run
    return native_run(wrapper(), debug=debug)
  File "/usr/local/lib/python3.8/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/local/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 287, in wrapper
    return await func(*args)
  File "/usr/local/lib/python3.8/site-packages/prefect/cli/agent.py", line 192, in start
    tg.start_soon(
  File "/usr/local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 662, in __aexit__
    raise exceptions[0]
  File "/usr/local/lib/python3.8/site-packages/prefect/utilities/services.py", line 46, in critical_service_loop
    await workload()
  File "/usr/local/lib/python3.8/site-packages/prefect/agent.py", line 200, in get_and_submit_flow_runs
    async for work_queue in self.get_work_queues():
  File "/usr/local/lib/python3.8/site-packages/prefect/agent.py", line 143, in get_work_queues
    work_queue = await self.client.read_work_queue_by_name(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/orion.py", line 790, in read_work_queue_by_name
    return schemas.core.WorkQueue.parse_obj(response.json())
  File "pydantic/main.py", line 527, in pydantic.main.BaseModel.parse_obj
  File "pydantic/main.py", line 342, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for WorkQueue
priority
  field required (type=value_error.missing)
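The ValidationError above is pydantic's standard behavior when a response payload lacks a required field: here the agent's newer client code requires `priority` on `WorkQueue`, but the server response omits it. A minimal, hypothetical reproduction (the model below is a stand-in, not Prefect's actual schema):

```python
from pydantic import BaseModel, ValidationError


class WorkQueue(BaseModel):
    name: str
    priority: int  # newly required; an older API response may omit it


def parse_work_queue(payload: dict) -> WorkQueue:
    try:
        return WorkQueue.parse_obj(payload)
    except ValidationError as exc:
        # Each error names the missing field, e.g. loc=('priority',)
        missing = [err["loc"] for err in exc.errors()]
        raise ValueError(f"response missing required fields: {missing}") from exc
```

Parsing `{"name": "..."}` without a `priority` key fails exactly as in the traceback, which is why an agent built from this branch breaks against a server that does not yet return the field.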

@david-elliott-deliveroo

Following up on my comment above: I've now tested the change in this PR myself (branched off 2.7.10 and applied the above changes manually in my local copy of prefect) and I can confirm it fixes the original issue I raised 🎉 Ran a 1 hr SQL wait query with no problem 👍

So this PR fixes the reported issue; however, there is a bug (see the comment above) in the current main branch (included in this branch) which borks the k8s agent in some other way and will need resolving before releasing 2.7.11. CC @madkinsz FYI

zzstoatzz (Collaborator) left a comment

👍🏼

@zanieb zanieb merged commit 4766c2e into main Feb 2, 2023
@zanieb zanieb deleted the fix-kubernetes-watch branch February 2, 2023 17:53
@zanieb zanieb mentioned this pull request Feb 3, 2023
3 tasks
zanieb added a commit that referenced this pull request Feb 3, 2023
Co-authored-by: Nathan Nowack <thrast36@gmail.com>
zanieb added a commit that referenced this pull request Feb 3, 2023
Co-authored-by: Nathan Nowack <thrast36@gmail.com>
zanieb added a commit that referenced this pull request Feb 6, 2023
Co-authored-by: Nathan Nowack <thrast36@gmail.com>
zanieb added a commit that referenced this pull request Feb 6, 2023
Co-authored-by: Nathan Nowack <thrast36@gmail.com>
serinamarie pushed a commit that referenced this pull request Feb 6, 2023
Co-authored-by: Nathan Nowack <thrast36@gmail.com>
Labels
fix A fix for a bug in an existing feature
Development

Successfully merging this pull request may close these issues.

KubernetesJob fails due to timeout even when job_watch_timeout_secs is set to None
4 participants