Kubernetes Executor Task Leak #36998
Comments
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise a PR to address this issue, please do so; no need to wait for approval.
Looking over the logs, I get two different outcomes. When I restart the pods I get the following:
However, after that I never get the success event:
Are you seeing this issue when you run Airflow with a single scheduler? Can you share the details to reproduce it? This requires triaging. Meanwhile, you can bump up the parallelism configuration to a higher number to beat the leak, or restart the scheduler after a certain number of iterations to reset these values.
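For reference, the two mitigations mentioned above map to standard Airflow settings; the values below are purely illustrative and not a recommendation for this particular deployment:

```ini
# airflow.cfg (or the equivalent AIRFLOW__CORE__PARALLELISM /
# AIRFLOW__SCHEDULER__NUM_RUNS environment variables); example values only.

[core]
# Maximum number of task instances that can run concurrently per scheduler.
# Raising this delays the point at which leaked slots exhaust capacity.
parallelism = 128

[scheduler]
# Make the scheduler process exit after this many scheduling loops, so the
# container runtime restarts it and its in-memory executor state is rebuilt.
num_runs = 5000
```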
Yes, this is when running on a single scheduler. We are utilizing the Helm chart and only overriding the following values in values.yaml:
We have been seeing this issue basically ever since we upgraded from 2.7.3 to 2.8.0 (and we are now on 2.8.1).
Also, we are building our own Airflow image off the official one and importing our DAGs there:
requirements.txt
I think we are running into the same problem with a similar setup (using the official Helm chart, upgraded from Airflow 2.7.3 to 2.8.1). After a while I stop seeing logs from the kubernetes_executor, and tasks are stuck in queued after a deferred trigger event was fired, i.e. they go queued -> running -> deferred -> queued (and get stuck there). Restarting the scheduler helps and the stuck tasks complete as expected. I cannot yet identify the moment when the kubernetes_executor (presumably) stops working.
So we are actually, potentially, starting to see things work now. We were utilizing an old version of the Helm chart, and after upgrading from 1.10 to 1.11 we are seeing the executors just work. Keeping this open until confirmed, but that was the solution, at least for us.
After a brief sighting of things working again, we are now seeing the issue once more: a single scheduler running on the 1.11 Helm chart, Airflow 2.8.1.
Looking into the latest occurrence, what is weird is that we are seeing the following logged event: However, we also see it in the current running slots:
This would seem to indicate that the code here is not finding the event in our database, correct? We have a single scheduler and are setting row-level locking to false, which means the query should be completely unaffected by any additional arguments.
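For context, the row-locking behaviour referred to above is controlled by a scheduler setting; a minimal illustration (not the reporter's actual configuration):

```ini
# airflow.cfg — with a single scheduler, row-level locking of scheduling
# queries can be switched off; equivalent environment variable:
# AIRFLOW__SCHEDULER__USE_ROW_LEVEL_LOCKING=False
[scheduler]
use_row_level_locking = False
```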
@dirrao do I have to change labels in order to get a follow-up?
I am experiencing a similar problem on 2.7.3.
We are running two scheduler replicas and getting a lot of messages like the following. I manually killed one of the scheduler pods and it helped alleviate the issue. My schedulers had not been restarted for the last 5 days.
Airflow config:
After a couple of weeks with no responses on this post, we decided to revert back to 2.7.2 and the issue is gone. Down the road we will investigate being able to use 2.8.x on something like Astronomer, but we can definitely confirm that the change from 2.7 to 2.8 caused this issue.
It looks like the scheduler or the kubernetes_executor cannot recover from communication issues with Kubernetes. I've collected a few hours of logging after a restart of the scheduler, and the problems seem to occur after the following lines:
After that, tasks are stuck in queued and I don't see any more lines of the kind:
I can only recover from that state by clearing all scheduled and queued tasks and restarting the scheduler. I wasn't able to dig deeper into the kubernetes_executor yet, but there seem to be quite a few changes between 2.7.3 and 2.8.1. That would be my first guess for the origin of this.
Hi! I had the same error with 2.7.3, described in #36478
Update: I have the same error with a single scheduler on Airflow 2.8.4.
@smhood I have downgraded the Airflow version to 2.7.2, but the issue still exists...
Maybe the issue is inside the provider packages?
Found emails related to the same issue: https://www.mail-archive.com/commits@airflow.apache.org/msg309101.html
I found a workaround and some insights:
@crabio, were you able to find a solution? We are also facing the task leak issue in v2.6.3.
@paramjeet01 Not fully, but we found a workaround:
@crabio I have updated my comments here: #38968 (comment). I was able to improve the performance, and the tasks no longer have long queue durations.
I see a similarity between the issue we are facing and the one you describe. Airflow 2.8.4. Config:
At 3 AM more tasks were supposed to run, but nothing like that happened. In the morning we ended up having no running tasks and many in the queued state.
@crabio Could you please post steps to reproduce the issue? Then I could spend a little bit more time understanding it.
@aru-trackunit
Maybe the tasks in another namespace are not required, because we faced this issue before we started using multiple namespaces.
@crabio, yes, we run in a single namespace.
I am seeing the issue in a single namespace. The scheduler fails to remove pods in the Completed state, and after the running slots reach 32, newly scheduled tasks aren't getting queued.
It was working fine for a minute; it was reporting back the pod changes.
After this the watcher went silent: no more logs with PID 3740, while the KubernetesExecutor.running set kept increasing:
I confirmed that PID 3740 is still running.
airflow/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py Line 168 in cead3da
It was waiting for a stream from the Kubernetes Watch.stream() and urllib3 HTTPResponse.stream().
Also, those TaskInstances had been marked as completed and a new DagRun had already started (while the open slots were still > 0), but the KubernetesExecutor.running set was still keeping those TaskInstances.
Based on my finding above that the KubernetesJobWatcher was running but not getting back any pod changes, I have added a timeout of 5 minutes so that the watcher restarts itself. This has fixed the issue for me.
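To illustrate the idea (a hedged sketch using the kubernetes Python client directly, not the actual patch applied to KubernetesJobWatcher): bounding the watch with a server-side timeout_seconds and a client-side _request_timeout forces the stream to end periodically, so a silently dead connection cannot stall event processing forever.

```python
from kubernetes import client, config, watch

def watch_pods_forever(namespace, resource_version=None):
    # Sketch only: load in-cluster credentials and watch pod events,
    # re-establishing the stream every few minutes instead of trusting a
    # single long-lived connection.
    config.load_incluster_config()
    core_v1 = client.CoreV1Api()
    watcher = watch.Watch()
    while True:
        # timeout_seconds asks the API server to close the watch after ~5 min;
        # _request_timeout bounds the client socket so a half-open connection
        # also surfaces as an error instead of blocking in HTTPResponse.stream().
        for event in watcher.stream(
            core_v1.list_namespaced_pod,
            namespace=namespace,
            resource_version=resource_version,
            timeout_seconds=300,
            _request_timeout=330,
        ):
            pod = event["object"]
            resource_version = pod.metadata.resource_version
            print(event["type"], pod.metadata.name, pod.status.phase)
        # Stream ended (timeout or server-side close): loop and resume the
        # watch from the last seen resource_version.
```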
We are experiencing the problem with slots leaking. We have several schedulers running, and one of the root causes is that two different schedulers pick up the same task; after it succeeds, the task is removed from the running list on only a single scheduler.
Meanwhile, restart the scheduler after a certain number of iterations and increase the number of executor pool slots to a high number.
@jedcunningham @dstandish do you think it makes sense to implement this fix?
This issue is related to the watcher not being able to scale and process the events on time. This leads to many completed pods accumulating over time.
Closed by: #39551
Apache Airflow version
2.8.1
If "Other Airflow 2 version" selected, which one?
No response
What happened?
The scheduler stops processing DAGs and moving them to the queued status. When looking at the scheduler in debug mode, the following information is displayed.
We noticed that a fix was addressed here: #36240; however, we are still seeing the same issues.
We are utilizing the Airflow Helm chart version 1.10, and we have the same issue happening in multiple environments.
Two environments have parallelism set to 32 with 1 scheduler running.
The other has 3 schedulers, all with parallelism set to 32.
What you think should happen instead?
When a task is complete, it should release its slot.
How to reproduce
Currently it seems to just be a matter of time; after a certain period of running, the slots fill up with completed tasks.
Operating System
Debian GNU/Linux 12
Versions of Apache Airflow Providers
apache-airflow-providers-amazon==8.16.0
apache-airflow-providers-celery==3.5.1
apache-airflow-providers-cncf-kubernetes==7.13.0
apache-airflow-providers-common-io==1.2.0
apache-airflow-providers-common-sql==1.10.0
apache-airflow-providers-docker==3.9.1
apache-airflow-providers-elasticsearch==5.3.1
apache-airflow-providers-ftp==3.7.0
apache-airflow-providers-google==10.13.1
apache-airflow-providers-grpc==3.4.1
apache-airflow-providers-hashicorp==3.6.1
apache-airflow-providers-http==4.8.0
apache-airflow-providers-imap==3.5.0
apache-airflow-providers-microsoft-azure==8.5.1
apache-airflow-providers-mysql==5.5.1
apache-airflow-providers-odbc==4.4.0
apache-airflow-providers-openlineage==1.4.0
apache-airflow-providers-postgres==5.10.0
apache-airflow-providers-redis==3.6.0
apache-airflow-providers-sendgrid==3.4.0
apache-airflow-providers-sftp==4.8.1
apache-airflow-providers-slack==8.5.1
apache-airflow-providers-snowflake==5.2.1
apache-airflow-providers-sqlite==3.7.0
apache-airflow-providers-ssh==3.10.0
Deployment
Official Apache Airflow Helm Chart
Deployment details
Deployed via the Helm chart (1.10) to Azure AKS.
We deploy our own image with the required packages/DAGs copied in, built FROM apache/airflow:2.8.1-python3.11.
The process is synced with an ArgoCD deployment pipeline.
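A hypothetical sketch of such a custom image build (the requirements.txt contents and the dags/ path are placeholders, not the reporter's actual files):

```dockerfile
FROM apache/airflow:2.8.1-python3.11

# Extra Python dependencies / provider pins (placeholder file).
COPY requirements.txt /requirements.txt
RUN pip install --no-cache-dir -r /requirements.txt

# Bake the DAGs into the image (default AIRFLOW_HOME in the official image
# is /opt/airflow) so ArgoCD can roll them out together with the image tag.
COPY dags/ /opt/airflow/dags/
```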
Anything else?
This problem occurs daily, for the most part. We have a test instance with only 5 DAGs that run once every hour, and we are still seeing the issue.
Are you willing to submit PR?
Code of Conduct