dirrao changed the title from "K8 exectuor retring on launching the worker pods forver in case of persistant errors" to "The task is stuck in a queued state forever in case of pod launch errors" on Dec 24, 2023
Apache Airflow Provider(s)
cncf-kubernetes
Versions of Apache Airflow Providers
apache-airflow-providers-cncf-kubernetes: 7.11.0
Apache Airflow version
2.8.0
Operating System
CentOS 7
Deployment
Other
Deployment details
Terraform based deployment
What happened
When the Kubernetes executor is unable to launch a worker pod because of a permissions issue or an invalid namespace, it keeps retrying the launch and the same error recurs every time. As a result, the task stays in the queued state for a very long time, possibly forever.
What you think should happen instead
The executor shouldn't retry the worker pod launch indefinitely, whether the error is persistent or transient. Instead, the executor should mark the task as failed and let the scheduler honor the task's own retries with the retry delay (5 minutes). If the error persists after those retries, the task should eventually fail.
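To make the proposal concrete, here is a minimal, standalone sketch of the requested behavior. It is not Airflow's executor code: `PodLaunchError`, `BoundedLauncher`, `try_launch`, and the `max_attempts` limit of 3 are hypothetical names and values chosen only for illustration. The idea is that the executor counts failed launch attempts per task and, once a limit is reached, reports the task as failed so that the scheduler's task-level retries take over.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


class PodLaunchError(Exception):
    """Hypothetical error carrying the HTTP status returned by the API server."""

    def __init__(self, status: int, reason: str) -> None:
        super().__init__(f"{status}: {reason}")
        self.status = status


@dataclass
class BoundedLauncher:
    """Illustrative stand-in for the executor's pod launch loop."""

    launch: Callable[[str], None]  # hypothetical: creates the worker pod for a task key
    max_attempts: int = 3          # assumed limit, not an Airflow default
    attempts: Dict[str, int] = field(default_factory=dict)
    state: Dict[str, str] = field(default_factory=dict)

    def try_launch(self, task_key: str) -> str:
        """Attempt one launch and return the resulting state of the task."""
        try:
            self.launch(task_key)
        except PodLaunchError:
            self.attempts[task_key] = self.attempts.get(task_key, 0) + 1
            if self.attempts[task_key] >= self.max_attempts:
                # Give up: report failure so the scheduler's task-level
                # retries and retry delay decide what happens next.
                self.state[task_key] = "failed"
            else:
                # Leave the task queued; the next executor loop tries again.
                self.state[task_key] = "queued"
        else:
            self.state[task_key] = "running"
        return self.state[task_key]


def always_forbidden(task_key: str) -> None:
    """Simulates a persistent permissions error on every launch attempt."""
    raise PodLaunchError(403, "pods is forbidden")


launcher = BoundedLauncher(launch=always_forbidden, max_attempts=3)
states = [launcher.try_launch("example_task") for _ in range(3)]
print(states)  # ['queued', 'queued', 'failed']
```

With a launch function that always fails, the task leaves the queued state after the third attempt instead of staying there forever. Whether the limit should apply to all errors or only to some of them is a design choice this sketch does not settle.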
How to reproduce
Any of the following scenarios reproduces the problem:
a> Provide an incorrect namespace in the pod_override configuration.
b> Use a role that doesn't have permission to launch pods in the target namespace.
c> Exhaust the namespace's resource quota.
Anything else
No response
Are you willing to submit PR?
Code of Conduct