Skip to content

Conversation

@stephen-bracken
Copy link
Contributor

Fix KubernetesJobOperator.get_or_create_pod() sometimes creating duplicate pods.

during execute() the KubernetesJobOperator attempts to find the pod associated with the Job object using self.get_or_create_pod(). If Kubernetes is being slow then the Job object will not create a pod before this method gets called, which will result in the underlying find_pod() method returning None, and a duplicate headless pod being created for this task.

This PR overrides the get_or_create_pod() method in KubernetesJobOperator to poll for the pod associated with the Job if one is not created immediately.


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in airflow-core/newsfragments.

@boring-cyborg boring-cyborg bot added area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues labels Jul 4, 2025
@boring-cyborg
Copy link

boring-cyborg bot commented Jul 4, 2025

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst)
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@stephen-bracken stephen-bracken marked this pull request as ready for review July 4, 2025 16:50

def get_or_create_pod(self, pod_request_obj, context):
pod = None
while pod is None:
Copy link
Contributor

@dominikhei dominikhei Jul 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this risk an infinite loop?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I was relying on the task timeout to interrupt this, which is a naive assumption. I can add a static timeout, or add a retry with backoff here or use the activeDeadlineSeconds from the job spec?

@patch(JOB_OPERATORS_PATH.format("KubernetesJobOperator.build_job_request_obj"))
@patch(JOB_OPERATORS_PATH.format("KubernetesJobOperator.create_job"))
@patch(HOOK_CLASS)
def test_pod_creation_timeout(self, mock_hook, mock_create_job, mock_build_job_request_obj, mock_find_pod):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I am not wrong this tests waits the whole 10s you used as a timeout. Could you mock the behavior, such that the test does not run for 10s or set it to 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, I'll set it to 1

Copy link
Contributor

@dominikhei dominikhei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t have a strong preference between introducing a new parameter or defaulting to a few retries with exponential backoff. cc: @hussein-awala

@eladkal eladkal requested a review from romsharon98 July 11, 2025 17:12
Bracken, Stephen (London) and others added 20 commits July 15, 2025 12:19
apache#52908)

The file might be a left-over from earlier api generation and it is
already .gitignored, so it does not appear in the source tarball,
but sdist and wheel generation does not look at .gitignored files
(for a good reason - we might want to include some generated files
in those distributions) - so we should exclude it manually similarly
as we already do with node_modules for example.
Co-authored-by: Amogh Desai <amoghrajesh1999@gmail.com>
…pache#52909)

I noticed that we did not have `pre-commit-uv` in the image
that was building the distribution packages. Adding it should
slightly speed up building such an image for isolated package
preparation.

The `uv` has been upgraded along the way, also the comment that
we should automate updating the versions in
release_management_command.py has been removed - because we
already have automation in place. The missing pieces were
adding "hatch", "pyyaml", "gitpython" and "rich".
…he#52931)

We are still waiting for release of the Lucas's pre-commit after
the Lucas-C/pre-commit-hooks#103 has
been merged - and for now we need to use `bleeding-edge` for
the repo.

Also upgrades zizmor and solves using "env." in shell commands
that might lead to security issues.
@stephen-bracken
Copy link
Contributor Author

argh I've messed up the diff for this PR - going to close and reopen as fresh

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.