The task is stuck in a queued state forever in case of pod launch errors #36403

Closed
2 tasks done
dirrao opened this issue Dec 24, 2023 · 5 comments · Fixed by #36882

@dirrao
Collaborator

dirrao commented Dec 24, 2023

Apache Airflow Provider(s)

cncf-kubernetes

Versions of Apache Airflow Providers

apache-airflow-providers-cncf-kubernetes: 7.11.0

Apache Airflow version

2.8.0

Operating System

CentOS 7

Deployment

Other

Deployment details

Terraform based deployment

What happened

When the Kubernetes executor is unable to launch the worker pod due to permission issues or an invalid namespace, it keeps retrying the pod launch and the errors persist. As a result, the task stays in the queued state for a very long time, potentially forever.

What you think should happen instead

We shouldn't retry the worker pod launch continuously when the errors are persistent. Instead, the executor should mark the task as failed and let the scheduler honor the task's retries and retry delay (e.g. 5 minutes), eventually failing the task if the error persists.
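
To illustrate the expected behaviour, here is a minimal sketch (DAG and task names are hypothetical) of the task-level retries and retry_delay the scheduler could honor once the executor reports the launch failure instead of retrying it indefinitely:

```python
# Hypothetical example: once the executor marks the task as failed on a pod
# launch error, the scheduler can apply these per-task retry settings.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="pod_launch_retry_example",
    start_date=datetime(2023, 12, 1),
    schedule=None,
    catchup=False,
) as dag:
    BashOperator(
        task_id="my_task",
        bash_command="echo hello",
        retries=3,                         # scheduler-level task retries
        retry_delay=timedelta(minutes=5),  # the 5-minute retry delay mentioned above
    )
```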

How to reproduce

Any of the following scenarios reproduces the issue (see the sketch after this list):
a) Provide an incorrect namespace in the pod_override configuration
b) Use a role that doesn't have permission to launch pods in the target namespace
c) Exhaust the namespace resource quota
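
For scenario (a), a hedged sketch (DAG, task, and namespace names are made up) of a pod_override that points the worker pod at a namespace that doesn't exist:

```python
# Hypothetical reproduction of scenario (a): the KubernetesExecutor cannot
# create the worker pod in the non-existent namespace, and on the affected
# versions the task stays queued instead of being failed.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from kubernetes.client import models as k8s

with DAG(
    dag_id="reproduce_pod_launch_error",
    start_date=datetime(2023, 12, 1),
    schedule=None,
    catchup=False,
) as dag:
    BashOperator(
        task_id="stuck_in_queued",
        bash_command="echo never runs",
        executor_config={
            "pod_override": k8s.V1Pod(
                metadata=k8s.V1ObjectMeta(namespace="does-not-exist"),
            ),
        },
    )
```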

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

@dirrao added the area:providers, kind:bug, and needs-triage labels on Dec 24, 2023
@dirrao changed the title from "K8 executor retrying to launch the worker pods forever in case of persistent errors" to "The task is stuck in a queued state forever in case of pod launch errors" on Dec 24, 2023
@potiuk added the good first issue label and removed the needs-triage label on Dec 24, 2023
@potiuk
Member

potiuk commented Dec 24, 2023

feel free to attempt to handle it

@adihakimi

We’re facing the same issue on v2.6.3.
We thought that scheduler_task_queued_timeout might help, but it doesn't.
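
Assuming this refers to the [scheduler] task_queued_timeout option, here is a quick way to check the value the scheduler is actually using (a sketch only, not a fix for this issue, since the executor keeps retrying the launch regardless):

```python
# Print the effective [scheduler] task_queued_timeout value in seconds.
# Assumption: "scheduler_task_queued_timeout" refers to this option.
from airflow.configuration import conf

print(conf.getfloat("scheduler", "task_queued_timeout"))
```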

@dirrao
Collaborator Author

dirrao commented Dec 25, 2023

scheduler_task_queued_timeout

The existing options don't help here. I am going to raise an MR to fix this issue very soon.

@RNHTTR
Contributor

RNHTTR commented Jan 17, 2024

Hey @dirrao -- Wondering if you had a chance to open a PR for this?

@dirrao
Collaborator Author

dirrao commented Jan 18, 2024

@RNHTTR The MR is ready, and I will most likely raise it this week.
