Description
Apache Airflow version
3.0.0
If "Other Airflow 2 version" selected, which one?
No response
What happened?
When the task supervisor's max wait time for monitoring the subprocess drops to 0 (i.e. the task process's last heartbeat happened a long time ago), CPU usage shoots to 100%. This might also be a side effect of #50500, which causes the supervisor to run indefinitely after the task process has finished (as a result, HEARTBEAT_TIMEOUT - last_heartbeat_ago * 0.75 becomes < 0 and the wait time gets set to 0).
When selector.select is called with a timeout of 0, it runs in non-blocking mode: it reports the file objects that are currently ready and returns immediately even if nothing is ready. Because selector.select is called in a tight while loop inside monitor_subprocess, a timeout of 0 turns that loop into a busy loop and CPU usage spikes to 100%. Reference: https://docs.python.org/3/library/selectors.html#selectors.BaseSelector.select
Code reference: https://github.com/apache/airflow/blob/main/task-sdk/src/airflow/sdk/execution_time/supervisor.py#L886-L895
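The non-blocking behaviour described above can be seen in isolation with a minimal sketch. This is not the supervisor code; `count_select_calls` is a hypothetical helper that only counts how often `select()` returns in a fixed time window when nothing is ever ready.

```python
import selectors
import socket
import time


def count_select_calls(timeout: float, duration: float = 0.3) -> int:
    """Count how many times select() returns within `duration` seconds.

    Hypothetical helper for illustration only; it is not part of Airflow.
    """
    sel = selectors.DefaultSelector()
    # Nothing is ever written to this socket pair, so no file object is ever ready.
    reader, writer = socket.socketpair()
    sel.register(reader, selectors.EVENT_READ)
    calls = 0
    deadline = time.monotonic() + duration
    try:
        while time.monotonic() < deadline:
            # timeout=0 returns immediately; a positive timeout blocks.
            sel.select(timeout=timeout)
            calls += 1
    finally:
        sel.unregister(reader)
        sel.close()
        reader.close()
        writer.close()
    return calls


busy = count_select_calls(timeout=0)    # spins: the loop never sleeps
idle = count_select_calls(timeout=0.1)  # blocks about 0.1s per iteration
print(f"timeout=0: {busy} calls, timeout=0.1: {idle} calls")
```

With a timeout of 0 the loop typically completes orders of magnitude more iterations in the same window, which is what shows up as a fully loaded CPU core.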
What you think should happen instead?
CPU usage should not spike to 100% in this edge case.
How to reproduce
Set task_instance_heartbeat_timeout to half of min_heartbeat_interval, so that the max wait time ends up being 0. Observe the CPU usage during task execution.
Operating System
Debian GNU/Linux 12
Versions of Apache Airflow Providers
No response
Deployment
Astronomer
Deployment details
No response
Anything else?
Setting the lower bound of max_wait_time to 0.1 instead of 0 (https://github.com/apache/airflow/blob/main/task-sdk/src/airflow/sdk/execution_time/supervisor.py#L889) seems to resolve the underlying issue.
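As a rough sketch of the proposed change, assuming the wait time is computed by clamping HEARTBEAT_TIMEOUT - last_heartbeat_ago * 0.75 between a lower bound and the minimum heartbeat interval: the function name `compute_max_wait_time`, its parameters, and the exact shape of the clamp are assumptions made for illustration and may not match the code in supervisor.py.

```python
def compute_max_wait_time(
    heartbeat_timeout: float,
    last_heartbeat_ago: float,
    min_heartbeat_interval: float,
    floor: float = 0.1,
) -> float:
    """Return the select() timeout, never below `floor`.

    Hypothetical helper: a simplified model of the calculation, not the
    actual Airflow implementation.
    """
    # Expression quoted in the issue; it goes negative once the last
    # heartbeat is old enough.
    remaining = heartbeat_timeout - last_heartbeat_ago * 0.75
    # Assumption: the value is capped at the minimum heartbeat interval.
    capped = min(remaining, min_heartbeat_interval)
    # With floor=0 a negative value collapses to 0 (non-blocking select);
    # with floor=0.1 the loop always sleeps for at least 100 ms.
    return max(floor, capped)


# Stale heartbeat: a floor of 0 yields a non-blocking select().
print(compute_max_wait_time(300, 500, 5, floor=0))
# Same inputs with the proposed floor keep the loop from spinning.
print(compute_max_wait_time(300, 500, 5, floor=0.1))
```

A floor of 0.1 seconds limits the loop to roughly ten iterations per second in the degenerate case while still letting the supervisor react promptly to ready file objects.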
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct