Kubernetes executor running slots leak fix #36240

Merged

dirrao
Contributor

@dirrao dirrao commented Dec 15, 2023

What happened

Schedulers race to adopt pods when another scheduler's heartbeat is delayed. The delayed scheduler is still alive, not dead; its heartbeat is only late because of a network timeout, heavy processing, etc. This leaks slots in executor.running_tasks. Eventually, the schedulers can no longer launch pods because executor.running_tasks reaches the maximum parallelism.

What you think should happen instead

We should remove the entry from the Kubernetes executor's running queue when the worker pod is deleted or moved to another scheduler.
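
To illustrate the idea, here is a minimal toy sketch of the slot accounting. It is not the code from this PR or the real Kubernetes executor: `ToyExecutor`, `launch`, `release`, and `simulate` are hypothetical names invented for this example. It only shows why an entry that is never removed after a pod is handed over eventually exhausts the parallelism limit.

```python
class ToyExecutor:
    """Hypothetical stand-in for an executor that tracks running task slots."""

    def __init__(self, parallelism):
        self.parallelism = parallelism
        self.running = set()  # keys of tasks this scheduler believes it is running

    @property
    def open_slots(self):
        return self.parallelism - len(self.running)

    def launch(self, task_key):
        """Take a slot for the task; return False when no slot is free."""
        if self.open_slots <= 0:
            return False
        self.running.add(task_key)
        return True

    def release(self, task_key):
        """Free the slot; discard() is safe even if the key is already gone."""
        self.running.discard(task_key)


def simulate(release_on_handover):
    """Launch two tasks whose pods are then adopted by another scheduler."""
    executor = ToyExecutor(parallelism=2)
    for key in ("task_a", "task_b"):
        executor.launch(key)
        # Assumption: at this point the pod was deleted or adopted elsewhere.
        if release_on_handover:
            executor.release(key)
    return executor.open_slots


leaked = simulate(release_on_handover=False)  # entries never removed
fixed = simulate(release_on_handover=True)    # entries removed on handover
```

Without the release step, the toy executor ends with no open slots even though it no longer owns any pod, so every later `launch` call is refused.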

Closes: #35675
Fix: #32928
Fix: #35426
Fix: #36478

Member

@hussein-awala hussein-awala left a comment

Good work! LGTM

@dirrao dirrao mentioned this pull request Dec 20, 2023
2 tasks
@potiuk potiuk added this to the Airflow 2.8.1 milestone Dec 20, 2023
Member

@potiuk potiuk left a comment

Looks great!

@potiuk potiuk merged commit 49108e1 into apache:main Dec 20, 2023
57 checks passed
@dirrao dirrao deleted the 35675-leak_in_kubernetes_executor_running_task_slots branch December 20, 2023 15:26
ephraimbuddy pushed a commit that referenced this pull request Jan 11, 2024
---------

Co-authored-by: gopal <gopal_dirisala@apple.com>
(cherry picked from commit 49108e1)
@jedcunningham jedcunningham removed this from the Airflow 2.8.1 milestone Jan 23, 2024
@smhood smhood mentioned this pull request Jan 24, 2024
2 tasks
Labels
area:providers provider:cncf-kubernetes Kubernetes provider related issues
4 participants