MappedTasks: Very slow execution :: taskinstance.py seems to be a bottleneck. #35267
Comments
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval. |
Airflow is OSS and developed by different contributors (most of them also users of Airflow), so this requires someone spending their free time to figure out how this could be optimised on the 3 different DB backends: Postgres, MySQL and SQLite. |
I think you can limit it with config |
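For context, a sketch of the kind of configuration this likely refers to. The option names are real Airflow settings that come up later in this thread; the values are illustrative only, not recommendations:
```ini
# airflow.cfg -- illustrative values only
[scheduler]
# Run (or skip) the "mini scheduler" at the end of each task instance.
schedule_after_task_execution = False
# Row-level locking used when schedulers examine dag_run rows.
use_row_level_locking = True

[core]
# Upper bound on concurrently running task instances across the installation.
parallelism = 32

[database]
# Per-process SQLAlchemy pool size (each worker process keeps its own pool).
sql_alchemy_pool_size = 5
```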
I think indeed we need more details, investigation and answers (from your side @a-meledin).
Generally speaking, when you configure PGBouncer to allow 150 connections to your database, it will open all 150 connections, and Postgres creates a separate process for each connection. So if your Postgres does not have enough resources (for example memory, but it could also be CPU or other resources) to run all those processes, it will slow down to a crawl as they compete for those resources. You need to look at the resource usage and see what your bottlenecks are. Finally, the last question and request:
|
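A minimal sketch of how to check this on the Postgres side, using standard pg_stat_activity columns (nothing here is Airflow-specific):
```sql
-- How many backends each role holds and what they are currently waiting on.
SELECT usename,
       state,
       wait_event_type,
       wait_event,
       count(*) AS backends
FROM pg_stat_activity
GROUP BY 1, 2, 3, 4
ORDER BY backends DESC;
```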
That will not work: each task establishes its own connection and runs in a separate process, and the SQLAlchemy pool only works per process. |
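To illustrate the point, a standalone sketch (not Airflow code; the DSN is a placeholder): a SQLAlchemy engine and its pool live inside one process, so a pool configured in the parent does not cap connections opened by forked task processes.
```python
import multiprocessing as mp
import os

from sqlalchemy import create_engine, text

# Placeholder DSN for illustration only; point it at your own database.
DSN = "postgresql+psycopg2://airflow:airflow@localhost:5432/airflow"

def run_in_child() -> None:
    # Each process builds its own engine, hence its own pool and its own
    # database connections; nothing is shared with the parent's pool.
    engine = create_engine(DSN, pool_size=5)
    with engine.connect() as conn:
        conn.execute(text("SELECT 1"))
    print(f"pid={os.getpid()} pool status: {engine.pool.status()}")

if __name__ == "__main__":
    # Two child processes -> two independent pools -> two separate sets of
    # server-side connections, regardless of the parent's pool settings.
    children = [mp.Process(target=run_in_child) for _ in range(2)]
    for p in children:
        p.start()
    for p in children:
        p.join()
```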
I. First, I installed pgbouncer with a pool of 20, then increased it to 150. There were ~104 active connections waiting for a lock and no waiting processes in pgbouncer (so a 120-130 connection pool was enough).
These settings gave roughly a 2x speed-up, but it seems that the more new dynamically mapped task instances this DAG accumulates, the slower processing becomes.
As I mentioned above, when running with the MappedNumber=108 param, execution of 3460 task instances takes 20 minutes. Celery Flower shows a runtime between 1.7 and 4.5 sec per task instance. I used one Celery worker with 8 threads on a Core i5 11 (12 CPU) machine with 32 GB RAM, Ubuntu under WSL2. Memory is enough. CPU load is almost 100%.
I've used pgbouncer, and with schedule_after_task_execution = False Airflow doesn't use more than 30-35 connections, as I observed.
If I set lower pgbouncer pool limits (e.g. 20), then there are processes waiting for a connection (for the situation when schedule_after_task_execution = True and use_row_level_locking = True). Sure, Airflow and Postgres consumed fewer resources, but the DAG's execution time slowed down even more.
I observed this. Swap file usage was minimal. The problem was with locks. See the explanation above.
Tested only on 2.7.2 with LocalExecutor and CeleryExecutor+Redis+PG backend. A bit problematic to test under 2.7.1. |
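For reference, a sketch of the PgBouncer settings being varied in the comment above (pgbouncer.ini syntax; the pool size mirrors the comment, while the database name, host and remaining values are placeholders, not recommendations):
```ini
; pgbouncer.ini -- illustrative sketch only
[databases]
; Placeholder connection entry for the Airflow metadata database.
airflow = host=postgres port=5432 dbname=airflow

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction
; First tried at 20, then raised towards 150 as described above.
default_pool_size = 150
max_client_conn = 300
```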
It would be great if you could try this and compare. As mentioned, there was a change in 2.7.2 in that area (connections for mapped tasks), but it was supposed to improve the number of connections (essentially lowering them by half). Maybe there are some unforeseen side effects the change triggered, and it would be great to narrow down the investigation to whether this is a problem with the deployment or something caused by that change. |
Jarek,
Thank you for your answer. You could run the code above on 2.7.1 too.
Actually I don't have such a possibility now.
|
It's not the code, it's the scale. I am a contributor, not a user, so I do not have an Airflow installation to test it on. |
I'm having a similar issue with dynamically mapped tasks on 2.8.4. What is the best way to fix it? |
@uranusjr Any ideas, or other instances of such things you have noticed from other users? |
I don’t recall anything similar reported elsewhere. I can imagine the FOR UPDATE part causing this, though, since it can saturate the database with all workers doing it at the same time, especially if you don’t have a good combination of database connection limit and worker count. Is it possible to get the full SELECT ... FOR UPDATE query you mentioned in the top message? That would help a lot in pinning down the problem. |
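For anyone reproducing this, a sketch of how to see the pile-up described here, again using standard pg_stat_activity columns (the full query text asked for above still needs to come from the reporter):
```sql
-- Backends currently blocked on a lock, longest waiters first.
SELECT pid,
       wait_event_type,
       wait_event,
       now() - query_start AS waiting_for,
       query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
ORDER BY waiting_for DESC;
```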
This could be related to this PR, and it looks like it was released in 2.9.1: we observed that when you have a bunch of mapped tasks, running the "mini scheduler" can result in too many processes waiting for the same lock on the same table. With that change, it won't wait if it's already locked. |
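In plain SQL terms, the pattern described above is the difference between a blocking row lock and one that gives up immediately. The statements below illustrate that technique against the dag_run columns quoted in this issue; the dag_id/run_id values are placeholders and these are not the literal queries Airflow issues:
```sql
-- Blocking form: every caller queues behind whoever holds the row lock.
SELECT dag_run.state
FROM dag_run
WHERE dag_run.dag_id = 'example_dag' AND dag_run.run_id = 'example_run'
FOR UPDATE;

-- Non-blocking variants: error out, or skip the row, if it is already locked.
SELECT dag_run.state
FROM dag_run
WHERE dag_run.dag_id = 'example_dag' AND dag_run.run_id = 'example_run'
FOR UPDATE NOWAIT;

SELECT dag_run.state
FROM dag_run
WHERE dag_run.dag_id = 'example_dag' AND dag_run.run_id = 'example_run'
FOR UPDATE SKIP LOCKED;
```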
@dstandish, thanks a lot. I see that after upgrading to 2.9.1 the problem is gone. |
Great , glad to hear it |
Going to close it as resolved in 2.9.1 |
Cool. Thanks @dstandish for the pointer! |
I think I still see related entries in the DB logs after upgrading to 2.9.1. Should I open a new issue? |
Yes, please open a new issue and if possible please also add details of any impact on performance or stability |
Apache Airflow version
2.7.2
What happened
We have a DAG with 116 tasks, each having some number of mapped tasks. The overall number of task instances is ~6000. The problem is that we encounter very slow execution. Investigation and testing with the code below has shown that we have ~110 open connections running the SQL: SELECT dag_run.state AS dag_run_state, ... FROM dag_run WHERE dag_run.dag_id = ... AND dag_run.run_id = ... FOR UPDATE.
In the logs this shows up as delays attributed to taskinstance.py.
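The reproduction code referenced above is not included in this excerpt; a minimal hypothetical sketch of a DAG that produces a comparable number of dynamically mapped task instances could look like the following (names and the MappedNumber-style parameter are illustrative):
```python
from datetime import datetime

from airflow import DAG
from airflow.decorators import task

# Illustrative only: expands into 108 mapped task instances per run,
# similar in spirit to the MappedNumber=108 setting mentioned above.
MAPPED_NUMBER = 108

with DAG(
    dag_id="mapped_tasks_repro",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
):

    @task
    def make_items(n: int):
        # The returned list drives the expansion below.
        return list(range(n))

    @task
    def work(item: int):
        # Trivial payload: the overhead being reported is per-task-instance
        # bookkeeping (including the SELECT ... FOR UPDATE on dag_run).
        return item * 2

    work.expand(item=make_items(MAPPED_NUMBER))
```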
What you think should happen instead
Investigation required on the taskinstance.py code.
How to reproduce
See the code above. With pgbouncer, allow 150 connections. Then check query stats with:
```sql
SELECT query,
       wait_event_type,
       wait_event,
       count(*),
       min(now() - backend_start),
       max(now() - backend_start)
FROM pg_stat_activity
WHERE usename = 'airflow_user'::name
GROUP BY 1, 2, 3
ORDER BY 4 DESC;
```
Operating System
Ubuntu on Linux and under WSL2
Versions of Apache Airflow Providers
2.7.2
Deployment
Docker-Compose
Deployment details
Tested on Docker, Python 3.8-310, under Celery and LocalExecutor.
Anything else
No response
Are you willing to submit PR?
Code of Conduct