Remove state sync during celery task processing #41870
Conversation
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst)
o-nikolas left a comment:
lgtm
hussein-awala left a comment:
Makes sense, LGTM
Nice analysis. Though I think we need to check why this code is there to begin with; there may be a reason we are missing. Also, this fix is based on the assumption:
If we accept it, I think we need a test to make sure this stays that way and won't be changed by future PRs.
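A regression test along these lines could look roughly like the sketch below. The fixture names (`celery_executor`, `task_tuples`) and the method names (`_process_tasks`, `update_task_state`) are placeholders based on this discussion, not the actual Airflow test suite:

```python
from unittest import mock


# Hypothetical shape of the regression test discussed above: the pytest
# fixtures and method names are placeholders, not the real Airflow code.
def test_process_tasks_does_not_sync_state(celery_executor, task_tuples):
    with mock.patch.object(celery_executor, "update_task_state") as mocked_update:
        celery_executor._process_tasks(task_tuples)
    # The per-task state check should no longer happen during task processing;
    # state is expected to be resolved later, in the batched sync().
    mocked_update.assert_not_called()
```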
It's an interesting one. And yeah - nice analysis @Kytha ! Looks like the original code was supposed to handle "quick" tasks - but it did not take into account that - apparently - in order to check if a result is available, celery will indeed make a db call for EVERY async result - no matter what state the task is in - whether it is submitted, running, or completed. That's a very nasty side-effect if that's confirmed. But I'd be rather surprised to see such "huge" inefficiency in this check - I imagine it would have been found already if it were a "general" problem.

@Kytha - question - was your py-spy testing done with "real" tasks or ones that were doing nothing (i.e. pass-only tasks or similar)? One possible reason why you might see such bad numbers is that your tasks are very quick.

It would be a rather inefficient way of checking whether a task completed to call the DB for every single task. Intuitively I would imagine that celery should not call the DB if the task has not completed yet (but maybe this is how it is implemented and there is no other side-communication to check it). I'd imagine you need to retrieve the result from the DB when the task completed, because of persistency and recovery, but if the task is still "running" I hardly see the need to reach out to the result backend (I imagine celery could check in the queue whether the task was picked up and acknowledged way faster - without reaching out to the backend). One reason why you would see such a big "percentage" of time there is that you made the tests with "pass-only" celery tasks and quite a big number of those tasks manage to complete before the check happens (but this is just a hypothesis). Can you please elaborate on that?
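For context, a minimal sketch of the access pattern being discussed, assuming a celery app configured with a SQLAlchemy database result backend (the broker/backend URLs and task ids below are placeholders):

```python
from celery import Celery
from celery.result import AsyncResult

# Hypothetical app: broker and result backend URLs are placeholders.
app = Celery(
    "example",
    broker="redis://localhost:6379/0",
    backend="db+postgresql://user:pass@localhost/celery_results",
)

task_ids = ["id-1", "id-2", "id-3"]  # placeholder celery task ids

# Reading .state on each AsyncResult goes through the result backend. For the
# database backend that means a query per result - regardless of whether the
# task has finished yet - and, as discussed in this thread, without connection
# pooling it also means a fresh DB connection each time.
for task_id in task_ids:
    result = AsyncResult(task_id, app=app)
    print(task_id, result.state)
```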
Also another question: https://airflow.apache.org/docs/apache-airflow-providers-celery/stable/configurations-ref.html#result-backend-sqlalchemy-engine-options -> there is also a possible pool configuration, so the celery backend can use pooling. As you mentioned, NullPool is used only when the process is not forked. But... what does it mean that the process is not forked, and is that the case when tasks are submitted and their results checked? Again - I am not that deep into the celery interface - but I think the code you pointed at is called only on the worker side (i.e. when the worker wants to write status to the result backend), and it's not the same code that is used on the client side when tasks are submitted and queried for status. Maybe there is another reason why, in your case, pooled connections are not used.

Also - another question - are you using pgbouncer for your result backend? Because, I believe, this might be the actual root cause of the overhead. It's indeed quite slow to open and close a connection to a postgres database directly - because it needs to fork a new process and reserve some memory - but if you have pgbouncer on a local network, opening and closing a connection to the pgbouncer instance should be super-quick - because pgbouncer does "real" DB connection pooling in this case. And pgbouncer is absolutely recommended for all "serious" Airflow installations on Postgres. So even if NullPool is used in this case, that should be fine, because NullPool is in fact even recommended if you have pgbouncer - for single-threaded programs: https://dba.stackexchange.com/questions/36828/how-to-best-use-connection-pooling-in-sqlalchemy-for-pgbouncer-transaction-level/118105#118105 and many other places.

I am not against this change, however I think we need to understand better where the "huge" inefficiency you noticed comes from - I'd find it quite strange for it to have gone unnoticed after so many years of Airflow using celery in multiple - even huge - installations, so I suspect what you see is a result of some environmental setup that simply "boosts" the hotness of that code artificially.
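To illustrate the NullPool point in plain SQLAlchemy terms (a sketch with a placeholder database URI, independent of celery): NullPool opens and closes a real connection on every checkout, which is cheap behind a pooler like pgbouncer but expensive when each connect forks a new Postgres backend process.

```python
from sqlalchemy import create_engine, text
from sqlalchemy.pool import NullPool, QueuePool

DB_URI = "postgresql://user:pass@localhost/celery_results"  # placeholder

# NullPool: every checkout opens a brand-new DB connection and closes it on
# release - fine behind pgbouncer, costly against Postgres directly.
null_pool_engine = create_engine(DB_URI, poolclass=NullPool)

# QueuePool (SQLAlchemy's default): connections are reused across checkouts.
pooled_engine = create_engine(DB_URI, poolclass=QueuePool, pool_size=5)

for engine in (null_pool_engine, pooled_engine):
    with engine.connect() as conn:
        conn.execute(text("SELECT 1"))
```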
potiuk left a comment:
Requesting changes until we understand better where the "hotness" in that testing comes from - because IMHO it should not be "that" hot.
My search into why this was added didn't yield much insight. It was added in #10949, but never directly addressed or explained beyond the original comment that was left. The PR generally seems to be a set of separate changes related to task adoption. I also agree - I can add a test to prevent regression of this assumption.
@potiuk Thanks for the in-depth review. I'll try to clarify a few things here. For my test I was using 'dummy' tasks that would sleep for 1 minute each (4000 DAGs, 2 parallel tasks each, scheduled every 5 minutes). Also, the profiler results and stats attached are only a 1-minute sample from the runtime. I did take many samples and saw a similar trend; however, mileage will definitely vary based on the speed of the db connection, the number of concurrent tasks, etc.
I am not sure I follow the logic here. From my understanding, the results backend is the mechanism the executor uses to determine the state of a task, so querying the results backend to get the status is a requirement. I think the problem here is that this is not done in batch (like it is during the normal executor sync). It seems to me this was a simple oversight when adding this seemingly innocent line, which under the hood requires 2 db connections to the backend for each task.
Yeah, so unfortunately celery will override any pooling config provided here if the process is not forked, meaning that the python process needs to have made a call to os.fork() prior to calling get_engine (this is the celery callback that is executed once the process is forked). The scheduler is not a forked process; for workers this would be the case (with the default airflow config). This means that for the scheduler, celery creates a new engine every time get_engine is called.
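Roughly, the behaviour described here, paraphrased from celery's `SessionManager.get_engine` in `celery.backends.database.session` (a sketch, not the verbatim celery code):

```python
from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool


class SessionManagerSketch:
    """Paraphrase of celery's engine caching - not the real SessionManager."""

    def __init__(self):
        self._engines = {}
        self.forked = False  # celery flips this to True via a post-fork hook

    def get_engine(self, dburi, **kwargs):
        if self.forked:
            # Forked worker processes reuse a cached (normally pooled) engine.
            try:
                return self._engines[dburi]
            except KeyError:
                engine = self._engines[dburi] = create_engine(dburi, **kwargs)
                return engine
        # Non-forked processes (e.g. the scheduler-side executor): any pool*
        # options are dropped and NullPool is forced, so every query opens
        # and closes its own DB connection via a freshly created engine.
        kwargs = {k: v for k, v in kwargs.items() if not k.startswith("pool")}
        return create_engine(dburi, poolclass=NullPool, **kwargs)
```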
For my airflow setup, I am not using PgBouncer, but I am using RDS Proxy for database connection pooling. I can understand the logic behind using NullPool if there is a database proxy in place, but in this case celery is creating a new engine for every query, which the linked Stack Overflow answer advises against due to the increased overhead.
For additional clarity, here are some of the relevant airflow config options used during the testing.
@ashb Just another ping post Airflow Summit - do you want to weigh in, or shall we go ahead and merge this?
Ok. Let me merge it then - there does not seem to be a good reason for it, no answer from those who can remember, let's just merge it. |
Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions. |
An Airflow scheduler performance test highlighted a very hot piece of code in the celery executor when using a database results backend. This code seemed to be doing redundant work. Below is a flamegraph of the airflow scheduler process captured by a statistical profiler (py-spy) during a period of heavy load (4000 tasks required scheduling).
During this 1-minute profiler session, the scheduler spent 42% of its time nested within this line of code. The reason this code is so hot is that when using celery with a database results backend, celery will not pool database connections (unless the process is forked), and thus a new db connection must be established for each task in the loop. This is very expensive and scales with the number of tasks. We can see from the flame graph that most of the time is spent creating a database connection.
The solution put forth in this PR is to remove this operation entirely from the _process_tasks function. This is based on the justification that immediately after the celery executor processes tasks, the sync method of the celery executor will be called by its parent base executor to sync task state. In my view, this renders this line of code redundant.
When sync is called, the celery executor makes use of batch fetching and is thus far more efficient.
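A rough before/after sketch of the idea (illustrative only, not the actual CeleryExecutor code; `update_task_state` and `bulk_fetch_states` are stand-ins for the real executor machinery):

```python
from typing import Any, Dict


class CeleryExecutorSketch:
    """Illustrative stand-in for the real CeleryExecutor."""

    def __init__(self) -> None:
        self.tasks: Dict[str, Any] = {}  # task key -> celery AsyncResult

    # Before this PR (conceptually): right after sending each task, its state
    # was checked via the AsyncResult - one unpooled DB round trip per task.
    def _process_tasks_before(self, submitted: Dict[str, Any]) -> None:
        for key, async_result in submitted.items():
            self.tasks[key] = async_result
            self.update_task_state(key, async_result.state)  # the removed per-task check

    # After: _process_tasks only records the AsyncResults; sync(), which the
    # base executor calls right afterwards anyway, resolves states in one batch.
    def _process_tasks_after(self, submitted: Dict[str, Any]) -> None:
        self.tasks.update(submitted)

    def sync(self) -> None:
        for key, state in self.bulk_fetch_states(self.tasks).items():
            self.update_task_state(key, state)

    # Placeholders for the real batched fetch and state bookkeeping.
    def bulk_fetch_states(self, tasks: Dict[str, Any]) -> Dict[str, str]:
        raise NotImplementedError

    def update_task_state(self, key: str, state: str) -> None:
        raise NotImplementedError
```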
Some additional deployment details