-
Notifications
You must be signed in to change notification settings - Fork 16.3k
Fix KubernetesJobOperator.get_or_create_pod() #52885
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix KubernetesJobOperator.get_or_create_pod() #52885
Conversation
|
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst)
|
|
|
||
| def get_or_create_pod(self, pod_request_obj, context): | ||
| pod = None | ||
| while pod is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this risk an infinite loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I was relying on the task timeout to interrupt this, which is a naive assumption. I can add a static timeout, or add a retry with backoff here or use the activeDeadlineSeconds from the job spec?
| @patch(JOB_OPERATORS_PATH.format("KubernetesJobOperator.build_job_request_obj")) | ||
| @patch(JOB_OPERATORS_PATH.format("KubernetesJobOperator.create_job")) | ||
| @patch(HOOK_CLASS) | ||
| def test_pod_creation_timeout(self, mock_hook, mock_create_job, mock_build_job_request_obj, mock_find_pod): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I am not wrong this tests waits the whole 10s you used as a timeout. Could you mock the behavior, such that the test does not run for 10s or set it to 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, I'll set it to 1
dominikhei
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don’t have a strong preference between introducing a new parameter or defaulting to a few retries with exponential backoff. cc: @hussein-awala
apache#52908) The file might be a left-over from earlier api generation and it is already .gitignored, so it does not appear in the source tarball, but sdist and wheel generation does not look at .gitignored files (for a good reason - we might want to include some generated files in those distributions) - so we should exclude it manually similarly as we already do with node_modules for example.
Co-authored-by: Amogh Desai <amoghrajesh1999@gmail.com>
…pache#52909) I noticed that we did not have `pre-commit-uv` in the image that was building the distribution packages. Adding it should slightly speed up building such an image for isolated package preparation. The `uv` has been upgraded along the way, also the comment that we should automate updating the versions in release_management_command.py has been removed - because we already have automation in place. The missing pieces were adding "hatch", "pyyaml", "gitpython" and "rich".
…he#52931) We are still waiting for release of the Lucas's pre-commit after the Lucas-C/pre-commit-hooks#103 has been merged - and for now we need to use `bleeding-edge` for the repo. Also upgrades zizmor and solves using "env." in shell commands that might lead to security issues.
|
argh I've messed up the diff for this PR - going to close and reopen as fresh |
Fix KubernetesJobOperator.get_or_create_pod() sometimes creating duplicate pods.
during
execute()the KubernetesJobOperator attempts to find the pod associated with the Job object usingself.get_or_create_pod(). If Kubernetes is being slow then the Job object will not create a pod before this method gets called, which will result in the underlyingfind_pod()method returningNone, and a duplicate headless pod being created for this task.This PR overrides the
get_or_create_pod()method inKubernetesJobOperatorto poll for the pod associated with the Job if one is not created immediately.^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named
{pr_number}.significant.rstor{issue_number}.significant.rst, in airflow-core/newsfragments.