The task is stuck in a queued state forever in case of pod launch errors #36403

Closed
2 tasks done
dirrao opened this issue Dec 24, 2023 · 5 comments · Fixed by #36882

@dirrao
Collaborator

dirrao commented Dec 24, 2023

Apache Airflow Provider(s)

cncf-kubernetes

Versions of Apache Airflow Providers

apache-airflow-providers-cncf-kubernetes: 7.11.0

Apache Airflow version

2.8.0

Operating System

CentOS 7

Deployment

Other

Deployment details

Terraform based deployment

What happened

When the Kubernetes executor is unable to launch the worker pod due to permission issues or an invalid namespace, it keeps retrying the pod launch and the errors persist. As a result, the task stays in the queued state for a very long time, potentially forever.

What you think should happen instead

We shouldn't retry the worker pod launch continuously when the errors are persistent. Instead, the executor should mark the task as failed and let the scheduler honor the task's retries and retry delay (e.g. 5 minutes), eventually failing the task if the error persists.
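
To illustrate the expected behaviour, here is a minimal sketch (DAG and task names are hypothetical) of the task-level retries and retry_delay the scheduler could honor once the executor reports the launch failure instead of retrying it indefinitely:

```python
# Hypothetical example: once the executor marks the task as failed on a pod
# launch error, the scheduler can apply these per-task retry settings.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="pod_launch_retry_example",
    start_date=datetime(2023, 12, 1),
    schedule=None,
    catchup=False,
) as dag:
    BashOperator(
        task_id="my_task",
        bash_command="echo hello",
        retries=3,                         # scheduler-level task retries
        retry_delay=timedelta(minutes=5),  # the 5-minute retry delay mentioned above
    )
```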

How to reproduce

Any of the following scenarios reproduces the issue (see the sketch after this list):
a) Provide an incorrect namespace in the pod_override configuration
b) Use a role that doesn't have permission to launch pods in the target namespace
c) Exhaust the namespace resource quota
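
For scenario (a), a hedged sketch (DAG, task, and namespace names are made up) of a pod_override that points the worker pod at a namespace that doesn't exist:

```python
# Hypothetical reproduction of scenario (a): the KubernetesExecutor cannot
# create the worker pod in the non-existent namespace, and on the affected
# versions the task stays queued instead of being failed.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from kubernetes.client import models as k8s

with DAG(
    dag_id="reproduce_pod_launch_error",
    start_date=datetime(2023, 12, 1),
    schedule=None,
    catchup=False,
) as dag:
    BashOperator(
        task_id="stuck_in_queued",
        bash_command="echo never runs",
        executor_config={
            "pod_override": k8s.V1Pod(
                metadata=k8s.V1ObjectMeta(namespace="does-not-exist"),
            ),
        },
    )
```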

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

@dirrao added the area:providers, kind:bug, and needs-triage labels on Dec 24, 2023
@dirrao changed the title from "K8 executor retrying to launch the worker pods forever in case of persistent errors" to "The task is stuck in a queued state forever in case of pod launch errors" on Dec 24, 2023
@potiuk added the good first issue label and removed the needs-triage label on Dec 24, 2023
@potiuk
Member

potiuk commented Dec 24, 2023

feel free to attempt to handle it

@adihakimi

We’re facing the same issue on v2.6.3.
We thought that scheduler_task_queued_timeout might help, but it doesn't.
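
Assuming this refers to the [scheduler] task_queued_timeout option, here is a quick way to check the value the scheduler is actually using (a sketch only, not a fix for this issue, since the executor keeps retrying the launch regardless):

```python
# Print the effective [scheduler] task_queued_timeout value in seconds.
# Assumption: "scheduler_task_queued_timeout" refers to this option.
from airflow.configuration import conf

print(conf.getfloat("scheduler", "task_queued_timeout"))
```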

@dirrao
Collaborator Author

dirrao commented Dec 25, 2023

scheduler_task_queued_timeout

The existing options don't help here. I am going to raise an MR to fix this issue very soon.

@RNHTTR
Contributor

RNHTTR commented Jan 17, 2024

Hey @dirrao -- Wondering if you had a chance to open a PR for this?

@dirrao
Collaborator Author

dirrao commented Jan 18, 2024

@RNHTTR The MR is ready, and I will most likely raise it this week.
