The task is stuck in a queued state forever in case of pod launch errors #36882
Conversation
Just to clarify, this also solves #35792? |
Yes, it would in some cases (like the quota being fully used). |
It might be worth adding a note in the changelog about this behavior change, so folks can reevaluate if they need to enable/increase retries. |
Agree. @dirrao, can you please add a note at the top of the Kubernetes provider changelog (above the latest version number)? I will pick it up and move it to the right place during release. |
Force-pushed from 84cf67b to afce969
Updated the changelog. |
I am OK with the pull request, pending consensus on the ongoing discussion. |
Force-pushed from 3eea218 to f4fc6b1
Force-pushed from cb4b61a to 4b8998a
I see you added a retries counter. What do you think about also adding a custom delay between each retry when the quota is exceeded? My issue is the high rate of requests to the Kubernetes API, and the current change does not solve that. |
We are complicating the functionality. I would suggest using task retries, which honor the retry delay (5 minutes by default), to solve your use case. |
It feels a little weird to use the task-level retry delay for a global quota failure retry, doesn't it? This is partially why I think using the normal retry makes sense: it avoids duplicating the whole retry concept. |
In this scenario, I am referring to the normal retry instead of relying on the current implementation. |
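For reference, relying on the scheduler's normal retry mechanism as suggested above looks roughly like this. This is a minimal sketch: the DAG id, task id, and retry values are illustrative and not part of this PR.

```python
from datetime import datetime, timedelta

from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def quota_sensitive_workload():
    # If the executor fails the task when the pod cannot be launched,
    # the scheduler re-queues it according to these settings.
    @task(retries=3, retry_delay=timedelta(minutes=5))
    def run_in_k8s():
        ...

    run_in_k8s()


quota_sensitive_workload()
```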
I had similar problems and thought about something like that. |
OK. I would suggest using dedicated pools per namespace; the pool slots should reflect the namespace's resources. That way, you can control the number of actively running tasks through the scheduler. |
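A sketch of that pool-per-namespace idea, assuming a pool named `team-a-namespace` sized to the namespace's quota; the names and numbers here are made up for illustration.

```python
from datetime import datetime

from airflow.decorators import dag, task


# The pool itself can be created in the UI or with the CLI, e.g.:
#   airflow pools set team-a-namespace 10 "Sized to team-a's resource quota"
@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def namespaced_workload():
    # The scheduler only runs as many of these tasks concurrently as the
    # pool has free slots, capping pressure on the namespace.
    @task(pool="team-a-namespace", pool_slots=1)
    def run_in_team_a_namespace():
        ...

    run_in_team_a_namespace()


namespaced_workload()
```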
@hussein-awala I have addressed all the comments. Can you re-review it? |
What happened
When the Kubernetes executor is unable to launch the worker pod due to permission issues or an invalid namespace, it keeps trying to launch the pod and the errors persist, so the task stays in the queued state for a very long time, or forever.
What you think should happen instead
We shouldn't retry the worker pod launch continuously in case of persistent/transient errors. Instead, the executor should mark the task as failed and let the scheduler honor the task retries with the retry delay (5 minutes by default), eventually failing the task if the error persists. A minimal sketch of that behavior follows.
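The sketch below assumes a bounded retry counter in the executor; the function and constant names (`try_launch`, `launch_pod`, `mark_task_failed`, `requeue`, `MAX_POD_CREATION_RETRIES`) are hypothetical and not the provider's actual code.

```python
from kubernetes.client.rest import ApiException

MAX_POD_CREATION_RETRIES = 3  # assumed bound, not the provider's real setting


def try_launch(task_key, attempts, launch_pod, mark_task_failed, requeue):
    """Hypothetical helper: attempt a pod launch, give up after a few tries."""
    try:
        launch_pod(task_key)
    except ApiException as err:
        if attempts + 1 >= MAX_POD_CREATION_RETRIES:
            # Stop retrying inside the executor; the scheduler then honors the
            # task's own retries and retry_delay (5 minutes by default).
            mark_task_failed(task_key, reason=str(err))
        else:
            requeue(task_key, attempts + 1)
```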
closes: #36403
closes: #35792