-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Kubernetes job watch exit when no timeout given #8350
Conversation
✅ Deploy Preview for prefect-orion ready!
To edit notification comments on pull requests, go to your Netlify site settings. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense to me - my only other question is whether the k8s watch (with retries enabled) would make our while
redundant
EDIT: no I actually don't think it does, because if someone sets the timeout to a very long time then we could feasibly encounter the same issue as we are now, so it think it makes sense to keep the while loop
Thanks for this! 🌟 Just a heads up - I installed this branch (git+https://github.com/PrefectHQ/prefect.git@fix-kubernetes-watch) and deployed my k8s agent with it, and it immediately throws this below error and the pod restarts. Wanted to flag it - pretty certain it's from other changes in the branch (vs 2.7.10) and not from this change specifically - one of these changes presumably Starting v2.7.10+35.g14c0f1b77 agent connected to
https://api.prefect.cloud/api/accounts/<redacted>/work
spaces/<redacted>...
___ ___ ___ ___ ___ ___ _____ _ ___ ___ _ _ _____
| _ \ _ \ __| __| __/ __|_ _| /_\ / __| __| \| |_ _|
| _/ / _|| _|| _| (__ | | / _ \ (_ | _|| .` | | |
|_| |_|_\___|_| |___\___| |_| /_/ \_\___|___|_|\_| |_|
Agent started! Looking for work from queue(s): bi-pipeline-main-local-dev-k8s,
bi-pipeline-other-local-dev-k8s...
An exception occurred.
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/prefect/cli/_utilities.py", line 41, in wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/prefect/utilities/asyncutils.py", line 230, in coroutine_wrapper
return run_async_in_new_loop(async_fn, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/prefect/utilities/asyncutils.py", line 181, in run_async_in_new_loop
return anyio.run(partial(__fn, *args, **kwargs))
File "/usr/local/lib/python3.8/site-packages/anyio/_core/_eventloop.py", line 70, in run
return asynclib.run(func, *args, **backend_options)
File "/usr/local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 292, in run
return native_run(wrapper(), debug=debug)
File "/usr/local/lib/python3.8/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/usr/local/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
return future.result()
File "/usr/local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 287, in wrapper
return await func(*args)
File "/usr/local/lib/python3.8/site-packages/prefect/cli/agent.py", line 192, in start
tg.start_soon(
File "/usr/local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 662, in __aexit__
raise exceptions[0]
File "/usr/local/lib/python3.8/site-packages/prefect/utilities/services.py", line 46, in critical_service_loop
await workload()
File "/usr/local/lib/python3.8/site-packages/prefect/agent.py", line 200, in get_and_submit_flow_runs
async for work_queue in self.get_work_queues():
File "/usr/local/lib/python3.8/site-packages/prefect/agent.py", line 143, in get_work_queues
work_queue = await self.client.read_work_queue_by_name(
File "/usr/local/lib/python3.8/site-packages/prefect/client/orion.py", line 790, in read_work_queue_by_name
return schemas.core.WorkQueue.parse_obj(response.json())
File "pydantic/main.py", line 527, in pydantic.main.BaseModel.parse_obj
File "pydantic/main.py", line 342, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for WorkQueue
priority
field required (type=value_error.missing) |
Following up from my above comment - I've now tested the change in this PR myself (branched off 2.7.10 and made the above changes manually in my local copy of prefect) and I can confirm they fix the original issue I raised 🎉 Ran a 1hr sql wait query with no problem 👍 . So yeah, this PR fixes the reported issue, however there is a bug (see above comment) in the current |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍🏼
Co-authored-by: Nathan Nowack <thrast36@gmail.com>
Co-authored-by: Nathan Nowack <thrast36@gmail.com>
Co-authored-by: Nathan Nowack <thrast36@gmail.com>
Co-authored-by: Nathan Nowack <thrast36@gmail.com>
Co-authored-by: Nathan Nowack <thrast36@gmail.com>
Closes #8345
The issue is a good summary of the details here, but briefly:
timeout_seconds
to K8s unless it is set, they use the presence of the kwarg to determine if retries should be enabled.Checklist
<link to issue>
"fix
,feature
,enhancement