Airflow Kubernetes Executor spams etcd when task fails because of exceeded quota error #35792

Closed
1 of 2 tasks
chenyair opened this issue Nov 22, 2023 · 2 comments · Fixed by #36882
Labels
area:providers · kind:bug (This is a clearly a bug) · provider:cncf-kubernetes (Kubernetes provider related issues)

Comments

chenyair commented Nov 22, 2023

Apache Airflow version

Other Airflow 2 version (please specify below)

What happened

I am using Airflow 2.5.2 but this issue applies to all versions of Airflow.
When a task is created and the namespace does not have enough quota for the Airflow executor to create the pod, the Kubernetes API returns an ApiException with status code 403 (Reason: Forbidden) and the message: Pods ... is forbidden: exceeded quota: .... The Kubernetes executor puts the task back in the queue because the status code is not 400 or 422. From kubernetes_executor.py:

# These codes indicate something is wrong with pod definition; otherwise we assume pod
# definition is ok, and that retrying may work
if e.status in (400, 422):

The problem is that the executor then tries to run the task again and again, spamming the Kubernetes API, which in turn makes Kyverno write a lot of objects to etcd.
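To illustrate the loop, here is a minimal, self-contained sketch. It is not the actual executor code: `FakeApiException`, `handle_create_failure`, and `simulate` are hypothetical names, and it assumes the quota stays exceeded and that the executor makes one pod-creation request per scheduler loop.

```python
from collections import deque


class FakeApiException(Exception):
    """Hypothetical stand-in for the Kubernetes client's ApiException."""

    def __init__(self, status, reason=""):
        super().__init__(f"({status}) Reason: {reason}")
        self.status = status
        self.reason = reason


def handle_create_failure(exc, task, queue):
    """Mirror the quoted check: fail on 400/422, otherwise re-queue the task."""
    if exc.status in (400, 422):
        return "failed"
    # Any other status (including 403 exceeded quota) is treated as retryable.
    queue.append(task)
    return "requeued"


def simulate(max_loops):
    """Count API calls when every pod creation is rejected with 403."""
    queue = deque(["task-1"])
    api_calls = 0
    for _ in range(max_loops):
        if not queue:
            break
        task = queue.popleft()
        api_calls += 1  # one pod-creation request per loop iteration
        exc = FakeApiException(403, "Forbidden")
        handle_create_failure(exc, task, queue)
    return api_calls
```

In this sketch the task never leaves the queue, so the number of API calls grows with the number of loops and nothing bounds it.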

What you think should happen instead

I want to be able to control the number of times the scheduler re-queues a task and the delay between each attempt to re-run a re-queued task.
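A rough sketch of the kind of control I have in mind, assuming two hypothetical settings (`max_attempts` and `delay_seconds`). Neither these names nor the `RequeuePolicy` / `next_action` helpers exist in Airflow; they are only here to illustrate the idea, and the eventual fix may look different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequeuePolicy:
    max_attempts: int      # hypothetical setting: give up after this many tries
    delay_seconds: float   # hypothetical setting: minimum wait between tries


def next_action(policy, attempts_so_far, last_attempt_at, now):
    """Decide what to do with a task whose pod creation failed with a retryable error."""
    if attempts_so_far >= policy.max_attempts:
        return "fail"   # stop re-queuing and mark the task as failed
    if last_attempt_at is not None and now - last_attempt_at < policy.delay_seconds:
        return "wait"   # keep the task queued, but do not call the API yet
    return "retry"      # enough time has passed, try creating the pod again


policy = RequeuePolicy(max_attempts=3, delay_seconds=30.0)
print(next_action(policy, 0, None, 0.0))   # first attempt
print(next_action(policy, 1, 0.0, 10.0))   # too soon after the last attempt
print(next_action(policy, 3, 0.0, 100.0))  # attempts exhausted
```

With a policy like this, a task blocked by quota would generate at most `max_attempts` API calls, spaced at least `delay_seconds` apart.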

How to reproduce

Run an Airflow task when the ACRQ has insufficient memory and CPU.

Operating System

Red Hat Enterprise Linux 8.5 (Ootpa)

Versions of Apache Airflow Providers

No response

Deployment

Other 3rd-party Helm chart

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

chenyair added the area:core, kind:bug, and needs-triage labels Nov 22, 2023

boring-cyborg bot commented Nov 22, 2023

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

Taragolis added the area:providers and provider:cncf-kubernetes labels and removed the area:core label Nov 22, 2023
josh-fell removed the needs-triage label Dec 14, 2023
dirrao (Contributor) commented Dec 28, 2023

Hi @chenyair,
I have seen the same behavior. I have prepared a PR to handle this issue, and it will be open for review soon.
#36403

4 participants