Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The task is stuck in a queued state forever in case of pod launch errors #36882

Merged
merged 9 commits into from
Feb 10, 2024

Conversation

dirrao
Copy link
Contributor

@dirrao dirrao commented Jan 19, 2024

What happened

When the K8 executor is unable to launch the worker pod due to permissions issues or an invalid namespace. The K8 executor keep trying to launch the worker pod and the errors remain persist. So, the task ends up in a queued state for so long/forever.

What you think should happen instead

We shouldn't retry the worker pods launch continuously in case of persistent/transient errors. Let the executor mark them as failed and let the scheduler honor the task retries with retry delay (5 mins by default) and then fail the task eventually if the error persists.

closes: #36403
closes: #35792

@eladkal
Copy link
Contributor

eladkal commented Jan 19, 2024

Just to clarify this also solves #35792 ?

@jedcunningham
Copy link
Member

jedcunningham commented Jan 19, 2024

Just to clarify this also solves #35792 ?

Yes, it would in some cases (like quota being all used).

@jedcunningham
Copy link
Member

It might be worth adding a note in the changelog about this behavior change, so folks can reevaluate if they need to enable/increase retries.

@eladkal
Copy link
Contributor

eladkal commented Jan 19, 2024

It might be worth adding a note in the changelog about this behavior change, so folks can reevaluate if they need to enable/increase retries.

Agree. @dirrao can you please add note at the top of kubernetes provider change log (on top of the lastest version number) I will pick it and rearrange it to the right place during release

@dirrao dirrao force-pushed the 36403-pod_launch_errors_requeue_bug_fix branch from 84cf67b to afce969 Compare January 20, 2024 04:47
@dirrao
Copy link
Contributor Author

dirrao commented Jan 20, 2024

It might be worth adding a note in the changelog about this behavior change, so folks can reevaluate if they need to enable/increase retries.

Agree. @dirrao can you please add note at the top of kubernetes provider change log (on top of the lastest version number) I will pick it and rearrange it to the right place during release

Updated change log

Copy link
Contributor

@amoghrajesh amoghrajesh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am ok with the pull request, pending the consensus on ongoing discussion

@dirrao dirrao force-pushed the 36403-pod_launch_errors_requeue_bug_fix branch from 3eea218 to f4fc6b1 Compare January 29, 2024 06:26
@dirrao dirrao force-pushed the 36403-pod_launch_errors_requeue_bug_fix branch from cb4b61a to 4b8998a Compare January 29, 2024 07:20
@chenyair
Copy link

I see you added retires counter. What do you think about custom delay between each retry of exceeded quota also? My issue is the high rate of requests to Kubernetes API and currently it does not solve it.

@dirrao
Copy link
Contributor Author

dirrao commented Jan 30, 2024

I see you added retires counter. What do you think about custom delay between each retry of exceeded quota also? My issue is the high rate of requests to Kubernetes API and currently it does not solve it.

We are complicating the functionality. I would suggest using task retries to honor the retry delay of 5 minutes to solve your use case.

@jedcunningham
Copy link
Member

I see you added retires counter. What do you think about custom delay between each retry of exceeded quota also? My issue is the high rate of requests to Kubernetes API and currently it does not solve it.

We are complicating the functionality. I would suggest using task retries to honor the retry delay of 5 minutes to solve your use case.

It feels a little weird to use the task level retry delay for the global quota failure retry, doesn't it? This is partially why I think using the normal retry makes sense - it avoids us duplicating the whole retry concept.

@dirrao
Copy link
Contributor Author

dirrao commented Jan 30, 2024

I see you added retires counter. What do you think about custom delay between each retry of exceeded quota also? My issue is the high rate of requests to Kubernetes API and currently it does not solve it.

We are complicating the functionality. I would suggest using task retries to honor the retry delay of 5 minutes to solve your use case.

It feels a little weird to use the task level retry delay for the global quota failure retry, doesn't it? This is partially why I think using the normal retry makes sense - it avoids us duplicating the whole retry concept.

In this scenario, I am referring to the normal retry instead of relying on the current implementation.

@devscheffer
Copy link
Contributor

I had similar problems and thought about something like that

from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
from airflow.utils.decorators import apply_defaults
from datetime import timedelta
import time

class CustomKubernetesPodOperator(KubernetesPodOperator):
    @apply_defaults
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def execute(self, context):
        # Check if there are enough resources on the cluster
        while True:
            cpu_available = check_cpu()
            memory_available = check_memory()
            if cpu_available and memory_available:
                break
            else:
                time.sleep(300)  # Wait for 5 minutes

        # Send the request to the Kubernetes cluster
        super().execute(context)

    def check_cpu(self):
        # Check if there is enough CPU available on the cluster
        # Return True if there is enough CPU available, False otherwise
        pass

    def check_memory(self):
        # Check if there is enough memory available on the cluster
        # Return True if there is enough memory available, False otherwise
        pass

@dirrao
Copy link
Contributor Author

dirrao commented Feb 2, 2024

I had similar problems and thought about something like that

Ok. I would suggest to use dedicated pool slots per namespace. pool slots should depicts the namespace resources. So, you can control number of active running tasks through scheduler.

@dirrao dirrao requested a review from hussein-awala February 4, 2024 04:06
@dirrao
Copy link
Contributor Author

dirrao commented Feb 7, 2024

@hussein-awala I have addressed all the comments. Can you re-review it?

@dirrao dirrao requested a review from potiuk February 9, 2024 02:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:providers provider:cncf-kubernetes Kubernetes provider related issues
Projects
None yet