Task supervisor CPU spike with 0 timeout for socket selector events #50507

@neel-astro

Apache Airflow version

3.0.0

If "Other Airflow 2 version" selected, which one?

No response

What happened?

When the max wait time in the task supervisor's subprocess monitoring loop drops to 0 (i.e. the task process last sent a heartbeat a long time ago), CPU usage shoots to 100%. This might also be a side effect of #50500, which causes the supervisor to run indefinitely after the task process has finished (as a result, HEARTBEAT_TIMEOUT - last_heartbeat_ago * 0.75 would be < 0 and thus the wait time gets set to 0).

When selector.select is called with a timeout of 0, it runs in non-blocking mode: it reports the file objects that are currently ready and returns immediately even if nothing is ready. Because monitor_subprocess calls selector.select in a tight while loop, the loop spins without ever blocking and CPU usage spikes to 100%. Reference: https://docs.python.org/3/library/selectors.html#selectors.BaseSelector.select

Code reference: https://github.com/apache/airflow/blob/main/task-sdk/src/airflow/sdk/execution_time/supervisor.py#L886-L895
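
To illustrate the effect outside Airflow, here is a minimal standalone sketch (not the supervisor code). It registers a socket that never becomes readable and counts how often select() returns in a fixed time window. The helper name count_select_calls and the 0.3 second window are made up for this illustration.

```python
import selectors
import socket
import time


def count_select_calls(timeout: float, duration: float = 0.3) -> int:
    """Count how many times select() returns within `duration` seconds
    while no registered file object is ready."""
    sel = selectors.DefaultSelector()
    # Connected socket pair with nothing written to it: `reader` never becomes readable.
    reader, writer = socket.socketpair()
    sel.register(reader, selectors.EVENT_READ)
    calls = 0
    deadline = time.monotonic() + duration
    try:
        while time.monotonic() < deadline:
            # timeout=0 polls and returns at once; a positive timeout blocks.
            sel.select(timeout=timeout)
            calls += 1
    finally:
        sel.unregister(reader)
        sel.close()
        reader.close()
        writer.close()
    return calls


if __name__ == "__main__":
    print("select() calls with timeout=0  :", count_select_calls(0))
    print("select() calls with timeout=0.1:", count_select_calls(0.1))
```

With timeout=0 the loop should complete a very large number of iterations in the window, which is the busy-wait that shows up as 100% CPU; with timeout=0.1 it should complete only a handful.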

What you think should happen instead?

CPU usage should not spike to 100% in that edge case.

How to reproduce

Set task_instance_heartbeat_timeout to half of min_heartbeat_interval, so that the max wait time ends up being 0. Observe the CPU usage during task execution.

Operating System

Debian GNU/Linux 12

Versions of Apache Airflow Providers

No response

Deployment

Astronomer

Deployment details

No response

Anything else?

Setting the lower bound of max_wait_time to 0.1 instead of 0 (https://github.com/apache/airflow/blob/main/task-sdk/src/airflow/sdk/execution_time/supervisor.py#L889) seems to resolve the underlying issue.
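
A rough sketch of that change, assuming the wait time is computed from the expression quoted above and capped by the minimum heartbeat interval. The function name next_select_timeout, its parameters, and the cap are assumptions for illustration; the actual code at the linked lines may differ.

```python
def next_select_timeout(
    heartbeat_timeout: float,
    last_heartbeat_ago: float,
    min_heartbeat_interval: float,
    floor: float = 0.1,
) -> float:
    """Return the timeout to pass to selector.select().

    `floor` is the lower bound: with floor=0 the result can be 0 (non-blocking
    select, busy loop); with floor=0.1 each select() blocks for at least 0.1s.
    """
    return max(
        floor,
        min(
            # Expression quoted in the issue; goes negative once the last
            # heartbeat is old enough.
            heartbeat_timeout - last_heartbeat_ago * 0.75,
            # Assumed upper cap on the wait time (hypothetical parameter).
            min_heartbeat_interval,
        ),
    )


if __name__ == "__main__":
    # Stale heartbeat: the inner expression is negative, so the floor applies.
    print(next_select_timeout(300, 1000, 5, floor=0))
    print(next_select_timeout(300, 1000, 5))
```

The trade-off in this sketch is that the loop may react up to 0.1 seconds later to a timer-driven check, while readiness of a registered socket still wakes select() immediately.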

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Labels

  • affected_version:3.0 (Issues Reported for 3.0)
  • area:core
  • kind:bug (This is clearly a bug)
  • priority:high (High priority bug that should be patched quickly but does not require immediate new release)

Status

Done
