Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add retry logic for KubernetesCreateResourceOperator and KubernetesJobOperator #39201

Merged
merged 1 commit into from
May 10, 2024

Conversation

MaksYermak
Copy link
Contributor

In this PR I have added retry logic for KubernetesCreateResourceOperator and KubernetesJobOperator.

This logic is needed for preventing 'No agent available' error. The error appears time to time when users try to create a Resource or Job. This issue is inside Kubernetes and in the current moment has no solution. Like a temporary solution we decided to retry Job or Resource creation request each time when this error appears.

Link for the same issue for cert-manager service: cert-manager/cert-manager#6457


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@raphaelauv
Copy link
Contributor

why don't you use the internal retry parameter of airflow ?

@MaksYermak
Copy link
Contributor Author

why don't you use the internal retry parameter of airflow ?

I use the same approach what we use for retry Pod creation: https://github.com/apache/airflow/blob/main/airflow/providers/cncf/kubernetes/utils/pod_manager.py#L347C1-L356C1
Also, I didn't see any other approaches in the cncf package for retry functionality in this case I decided to use the same. Could you please share with me the example of Airflow's code for this internal retry parameter logic?

@raphaelauv
Copy link
Contributor

raphaelauv commented Apr 23, 2024

I was thinking about the BaseOperator argument retries

    PythonOperator(
        task_id="aa",
        retries=3,
        python_callable=toto,
    )

@raphaelauv
Copy link
Contributor

related : #15137

@vincbeck
Copy link
Contributor

I was thinking about the BasOperator argument retries

    PythonOperator(
        task_id="aa",
        retries=3,
        python_callable=toto,
    )

This is an option for users. If a user wants to retry a specific task, they can use this parameter. Here, if I understand correctly, @MaksYermak wants to retry without the user being aware or needing to do something.

Copy link
Contributor

@vincbeck vincbeck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add tests to cover these retries?

@MaksYermak MaksYermak force-pushed the add-retry-k8s-create-resource branch from f265b2a to 798a60d Compare May 7, 2024 10:30
@MaksYermak
Copy link
Contributor Author

Could you add tests to cover these retries?

Sure, I have added a unit tests.

@MaksYermak MaksYermak force-pushed the add-retry-k8s-create-resource branch from 798a60d to 1941e94 Compare May 7, 2024 11:17
@MaksYermak MaksYermak force-pushed the add-retry-k8s-create-resource branch from 1941e94 to 8e07ee0 Compare May 7, 2024 12:33
@VladaZakharova
Copy link
Contributor

Hi @raphaelauv @dirrao @vincbeck !
Can you please check the changes here again? Thanks!

@potiuk potiuk merged commit 20265fe into apache:main May 10, 2024
49 checks passed
pateash pushed a commit to pateash/airflow that referenced this pull request May 13, 2024
romsharon98 pushed a commit to romsharon98/airflow that referenced this pull request Jul 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:providers provider:cncf-kubernetes Kubernetes provider related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants