
The task is stuck in a queued state forever in case of pod launch errors #36882

Merged · 9 commits · Feb 10, 2024

Conversation

@dirrao (Collaborator) commented Jan 19, 2024

What happened

When the Kubernetes executor is unable to launch the worker pod due to permission issues or an invalid namespace, it keeps trying to launch the pod and the errors persist. As a result, the task stays in a queued state for a very long time, or forever.

What you think should happen instead

We shouldn't retry the worker pod launch continuously in case of persistent or transient errors. Instead, the executor should mark the task as failed and let the scheduler honor the task retries with the retry delay (5 minutes by default), eventually failing the task if the error persists.

closes: #36403
closes: #35792
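
With this change, recovery from transient pod launch failures is handled by the normal task retry machinery. A minimal sketch of opting into that at the DAG level (the DAG id, retry count, and operator below are illustrative, not part of this PR):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# If the executor fails the task because the worker pod could not be launched,
# the scheduler re-runs it according to these retry settings.
with DAG(
    dag_id="retry_on_pod_launch_failure",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    default_args={
        "retries": 3,
        "retry_delay": timedelta(minutes=5),  # the default delay mentioned above
    },
) as dag:
    EmptyOperator(task_id="example_task")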

@eladkal (Contributor) commented Jan 19, 2024

Just to clarify, this also solves #35792?

@jedcunningham (Member) commented Jan 19, 2024

> Just to clarify, this also solves #35792?

Yes, it would in some cases (like quota being all used).

@jedcunningham (Member)

It might be worth adding a note in the changelog about this behavior change, so folks can reevaluate if they need to enable/increase retries.

@eladkal (Contributor) commented Jan 19, 2024

> It might be worth adding a note in the changelog about this behavior change, so folks can reevaluate if they need to enable/increase retries.

Agreed. @dirrao, can you please add a note at the top of the Kubernetes provider changelog (above the latest version number)? I will pick it up and move it to the right place during the release.

@dirrao dirrao force-pushed the 36403-pod_launch_errors_requeue_bug_fix branch from 84cf67b to afce969 on January 20, 2024 04:47
@dirrao (Collaborator, Author) commented Jan 20, 2024

> It might be worth adding a note in the changelog about this behavior change, so folks can reevaluate if they need to enable/increase retries.

> Agreed. @dirrao, can you please add a note at the top of the Kubernetes provider changelog (above the latest version number)? I will pick it up and move it to the right place during the release.

Updated the changelog.

@amoghrajesh (Contributor) left a comment

I am OK with the pull request, pending consensus on the ongoing discussion.

@dirrao dirrao force-pushed the 36403-pod_launch_errors_requeue_bug_fix branch from 3eea218 to f4fc6b1 on January 29, 2024 06:26
@dirrao dirrao force-pushed the 36403-pod_launch_errors_requeue_bug_fix branch from cb4b61a to 4b8998a on January 29, 2024 07:20
@chenyair

I see you added a retries counter. What do you think about also adding a custom delay between each retry when the quota is exceeded? My issue is the high rate of requests to the Kubernetes API, and this does not currently solve it.

@dirrao (Collaborator, Author) commented Jan 30, 2024

> I see you added a retries counter. What do you think about also adding a custom delay between each retry when the quota is exceeded? My issue is the high rate of requests to the Kubernetes API, and this does not currently solve it.

That would complicate the functionality. I would suggest using task retries, which honor the retry delay (5 minutes by default), to solve your use case.

@jedcunningham (Member)

> I see you added a retries counter. What do you think about also adding a custom delay between each retry when the quota is exceeded? My issue is the high rate of requests to the Kubernetes API, and this does not currently solve it.

> That would complicate the functionality. I would suggest using task retries, which honor the retry delay (5 minutes by default), to solve your use case.

It feels a little weird to use the task-level retry delay for the global quota failure retry, doesn't it? This is partly why I think using the normal retry makes sense: it avoids duplicating the whole retry concept.

@dirrao (Collaborator, Author) commented Jan 30, 2024

> I see you added a retries counter. What do you think about also adding a custom delay between each retry when the quota is exceeded? My issue is the high rate of requests to the Kubernetes API, and this does not currently solve it.

> That would complicate the functionality. I would suggest using task retries, which honor the retry delay (5 minutes by default), to solve your use case.

> It feels a little weird to use the task-level retry delay for the global quota failure retry, doesn't it? This is partly why I think using the normal retry makes sense: it avoids duplicating the whole retry concept.

In this scenario, I am referring to the normal retry instead of relying on the current implementation.

@devscheffer (Contributor)

I had similar problems and thought about something like this:

import time

from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator


class CustomKubernetesPodOperator(KubernetesPodOperator):
    def execute(self, context):
        # Wait until the cluster has enough free resources before launching the pod
        while not (self.check_cpu() and self.check_memory()):
            time.sleep(300)  # Wait for 5 minutes before checking again

        # Send the request to the Kubernetes cluster
        return super().execute(context)

    def check_cpu(self):
        # Check if there is enough CPU available on the cluster
        # Return True if there is enough CPU available, False otherwise
        raise NotImplementedError

    def check_memory(self):
        # Check if there is enough memory available on the cluster
        # Return True if there is enough memory available, False otherwise
        raise NotImplementedError
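
One possible way to fill in a check like check_cpu (an illustrative sketch, not something proposed in this thread) is to compare a namespace's ResourceQuota usage against its hard limits with the official kubernetes Python client; the namespace name and CPU threshold below are placeholders:

from kubernetes import client, config


def _parse_cpu(quantity: str) -> float:
    # Convert a Kubernetes CPU quantity such as "500m" or "2" into cores
    return float(quantity[:-1]) / 1000 if quantity.endswith("m") else float(quantity)


def cpu_quota_has_headroom(namespace: str = "default", needed_cores: float = 1.0) -> bool:
    # Return True if every ResourceQuota in the namespace still has at least
    # `needed_cores` of CPU requests available; memory could be checked the same way
    config.load_kube_config()  # use config.load_incluster_config() when running inside a pod
    core_v1 = client.CoreV1Api()
    for quota in core_v1.list_namespaced_resource_quota(namespace).items:
        hard = quota.status.hard or {}
        used = quota.status.used or {}
        if "requests.cpu" in hard:
            free = _parse_cpu(hard["requests.cpu"]) - _parse_cpu(used.get("requests.cpu", "0"))
            if free < needed_cores:
                return False
    return True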

@dirrao (Collaborator, Author) commented Feb 2, 2024

> I had similar problems and thought about something like this:

OK. I would suggest using a dedicated pool per namespace, with the pool slots reflecting the namespace's resources. That way, you can control the number of actively running tasks through the scheduler.
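
For reference, limiting concurrency this way means assigning the task to an Airflow pool sized to the namespace's capacity. A minimal sketch, where the pool name, slot count, namespace, and image are all illustrative:

from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

# The pool would be created in the UI or via the CLI, e.g.:
#   airflow pools set k8s-team-a 10 "Capacity of the team-a namespace"
with DAG(dag_id="pool_limited_pods", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    KubernetesPodOperator(
        task_id="run_in_team_a",
        namespace="team-a",
        image="busybox:1.36",
        cmds=["sh", "-c", "echo hello"],
        pool="k8s-team-a",  # the scheduler caps concurrent tasks at the pool's slot count
        pool_slots=1,
    )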

@dirrao (Collaborator, Author) commented Feb 7, 2024

@hussein-awala I have addressed all the comments. Can you re-review it?

@dirrao dirrao requested a review from potiuk on February 9, 2024 02:53
Labels: area:providers, provider:cncf-kubernetes