Make Edge Worker using async loop #56457
Conversation
That will be a nice improvement @jscheffl. Looking forward to this!
I've just checked the code and I see a possible improvement regarding the state handling in the worker: we could encapsulate the state checks as properties in the WorkerInfo dataclass. The calling code would then be easier to read and to test, as the state checks could be covered by dedicated tests directly on the WorkerInfo dataclass. WDYT?
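A minimal sketch of what such properties might look like (the state names are assumptions for illustration, not necessarily the actual EdgeWorkerState values):

```python
from dataclasses import dataclass
from enum import Enum


class WorkerState(str, Enum):
    # Hypothetical states for illustration; the real enum in the
    # edge3 provider may differ.
    IDLE = "idle"
    RUNNING = "running"
    MAINTENANCE_REQUEST = "maintenance request"
    MAINTENANCE_MODE = "maintenance mode"


@dataclass
class WorkerInfo:
    state: WorkerState

    @property
    def in_maintenance(self) -> bool:
        """Worker is in, or about to enter, maintenance."""
        return self.state in (
            WorkerState.MAINTENANCE_REQUEST,
            WorkerState.MAINTENANCE_MODE,
        )

    @property
    def can_accept_tasks(self) -> bool:
        """Worker may fetch new work from the API."""
        return self.state in (WorkerState.IDLE, WorkerState.RUNNING)
```

Call sites would then read like `if worker.can_accept_tasks: ...` instead of comparing raw states inline, and each property gets its own unit test.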
Yes, the code is not perfect and the state management has grown over time. I am also not 100% happy with it. If you have a good idea and some time... as with everything: contributions are welcome.
Really like the refactorings being done here, looking forward to testing it once it's done ;-)
Almost ready for review; I still need to hunt down one bug that I saw (had no time, was distracted by family). I hope I can make it ready for review by EOB today.
@jscheffl I'm unable to log in to Airflow on this branch. I launched Airflow with
Aaah @dheerajturaga, now I am able to add you as a reviewer... seems the write permissions have landed! Cool!
Actually, I can confirm I had the same. I switched to FabAuthManager and then it worked. Somehow I was not able to reproduce this in main and therefore rebased my PR onto latest main; I was not able to get the error again. I am also not sure how this can be reproduced - for sure not via the EdgeExecutor, as this PR does not change anything on auth managers. What I saw is that the cookie
Yup, the rebase fixed this issue |
Feature is working: I was able to run tasks on the edge worker, as well as test whether the worker responds correctly to my requests to change states from the edge worker dashboard. I have not had much luck scale testing this, however; having 100 concurrent tasks running seems to overload my laptop (behavior is consistent without this PR as well).
Haha, yeah, running ~125 tasks took ~12GB RAM on top of everything else on my laptop (each task just a Python sleep()). But well, that is at least a factor of 10 more tasks than the previous implementation could handle.
How come the sudden increase? That’s huge |
The worker forks a Python process per supervisor, and the supervisor forks another Python process to separate the workload. 125 running tasks mean 1 + 125*2 == 251 processes: 125 sleeping and 125+1 sending heartbeats to the API. Subtracting the worker, I think this is actually very lean: 12GB / 125 workloads == ~98MB per workload == <50MB per Python interpreter (while benefitting from COW, as the process manager shows ~250MB/process in the task manager). I assume some memory can be saved if we also implement the gc-freeze like in #58365 - it did not make it into this PR but would apply the same here. A major saving would be if the supervisor gained async capabilities such that the supervision of multiple tasks runs in one process in async loops - but this would be less crash-resistant and a major rework of the supervisor. Maybe also in the future...
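For illustration, a rough sketch of the gc-freeze idea in a fork-based model like the one described above (this is an assumption of the approach, not the actual code from #58365):

```python
import gc
import os


def spawn_workload(run_task) -> int:
    """Fork a child for the workload while maximizing copy-on-write sharing."""
    # Collect first so freed pages are not inherited by the child, then
    # freeze all surviving objects into a permanent generation: GC passes
    # in the forked child skip frozen objects, so their memory pages stay
    # shared instead of being dirtied (and copied) when the collector
    # touches object headers.
    gc.collect()
    gc.freeze()
    pid = os.fork()
    if pid == 0:
        run_task()   # child: execute the workload
        os._exit(0)  # exit without running parent cleanup handlers
    gc.unfreeze()    # parent: resume normal garbage collection
    return pid
```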
Makes sense. |
@dheerajturaga Thanks for merging and congrats on your first merged PR!
* First draft of async edge worker
* Further async implementation
* Remove TODO
* Uuups, remove example
* Full rewrite in Python async
* Fix pytests after async re-write
* Revert async implementation to use multiprocessing for supervisor
* Ensure running processes are not killed on CTRL+C
* Fix signal handler, make it proper async
* Fix pytest
* Review feedback by dabla
@jscheffl, I was able to scale test apache-airflow-providers-edge3 3.0.0rc1 (this pull request) and here are my findings. I was able to load the edge worker with 100 concurrent tasks; memory utilization spikes to 5GB and slowly starts shrinking as tasks continue to finish. This is significantly better than Celery, which jumps to 20+GB at 100 concurrent tasks! Also, as the number of tasks queued on the worker increases, the memory consumption of the worker also increases: with 100 tasks running and 400 tasks queued on the same worker, the memory spikes to 13GB. Overall, this is a significant improvement - thanks for implementing! I was also able to push the boundaries with 400 parallel tasks and saw memory consumption of 18GB; however, some tasks start to fail due to socket limitations. I think we can safely claim a concurrency of 100 if the machine allows.
@dheerajturaga Interesting, thanks for sharing; this is very valuable information, as I haven't had time to test it yet.
Cool! Thanks for confirming my tests! Yes, this was my long-outstanding aim! But as the LocalExecutor was also optimized via gc.freeze, I assume memory can still be optimized down a bit. And as for the limits on sockets... yeah, at such a level of hundreds of tasks, I assume some kernel parameters demand tuning as well. But for 400 concurrent tasks, I assume we rather need to start thinking about supporting deferred mode on the edge as well, which would be a cool major next increment! Would it not be cool if the edge worker fetched the DeferredException and put the triggerer actions into the async loop of the worker... almost zero effort and memory needed, and work returns to the normal supervisor when coming back from the deferred state? Would need some changes in core or the SDK as well... But yeah, we need to have some dreams...
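Purely to sketch that dream (all hooks and names here are hypothetical; nothing like this exists in the worker today):

```python
import asyncio


async def run_with_inline_deferral(supervise, build_trigger):
    """Hypothetical: resume a deferred task inside the worker's own loop."""
    outcome = await supervise()               # normal supervised execution
    while outcome.deferred:                   # task raised a deferral
        trigger = build_trigger(outcome)      # run the trigger locally
        event = None
        async for event in trigger.run():     # triggers are async generators
            break                             # first event resumes the task
        outcome = await supervise(resume_event=event)  # back to supervisor
```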
That would just be awesome Jens!!! The edge executor is already a huge step forward, I’m sure one day that “dream” will become reality ;-) |
Good dreams :)
So far the Edge worker running remotely uses a single-threaded loop to fetch, process, and monitor tasks, upload result logs, and send heartbeats. This limits the parallelism achievable on an Edge Worker - how many tasks can be handled in parallel.
With this PR a major refactor is made to use Python AsyncIO for handling tasks, in order to improve concurrency with many tasks.
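In rough strokes, the pattern looks like the following sketch (all names are simplified assumptions, not the actual worker code):

```python
import asyncio


async def worker_main(api, max_slots: int = 100):
    """Sketch: fetch, supervise and heartbeat as concurrent coroutines."""
    slots = asyncio.Semaphore(max_slots)

    async def run_one(task):
        async with slots:
            # Supervise the task subprocess without blocking the event loop.
            proc = await asyncio.create_subprocess_exec(*task.command)
            await proc.wait()
            await api.report_result(task, proc.returncode)

    async def fetch_loop():
        while True:
            if task := await api.fetch_task():
                asyncio.create_task(run_one(task))
            await asyncio.sleep(1)

    async def heartbeat_loop():
        while True:
            await api.heartbeat()
            await asyncio.sleep(10)

    # A single event loop multiplexes fetching, supervision and heartbeats,
    # instead of one blocking loop doing everything in turn.
    await asyncio.gather(fetch_loop(), heartbeat_loop())
```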
Note: Before merging, this needs proper review and testing, as a lot of logic and libraries change, with the risk of degraded quality/stability due to implementation glitches. UPDATE: WIP status removed; did a lot of (local) testing and I think it is ready now.
Tested with 100 concurrent tasks (mostly sleeping) on my machine, and the worker used only 10-20% of CPU (compared to the ~10GB RAM all the tasks needed). So with the re-implementation in AsyncIO the worker scales much further than before, where I considered 10-15 tasks the limit of what could be handled.
FYI @dheerajturaga - as you are also a user (committer status still pending), I am looking forward to a review!
@dabla As you do a lot of AsyncIO on your side, looking forward to a review from you as well!
@AutomationDev85 - Would also like your feedback!