@kaxil kaxil commented May 23, 2025

closes #50507

When the task supervisor's heartbeat timeout is exceeded, the max_wait_time calculation can evaluate to 0 or a negative value. selector.select(timeout=0) then returns immediately on every call, so the monitoring loop spins without blocking and consumes 100% CPU, as described in the linked GitHub issue.

Add a minimum timeout of 0.01s in _service_subprocess to prevent this while keeping task monitoring responsive.
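
To illustrate why a small floor on the timeout matters, here is a minimal, self-contained sketch (not the Airflow code itself). It counts how often `select()` returns on an idle socket for `timeout=0` versus a 0.01s floor. The names `MIN_SELECT_TIMEOUT` and `count_wakeups` are hypothetical and exist only for this example; the 0.01s value is the one stated above.

```python
import selectors
import socket
import time

# Hypothetical name for illustration; 0.01s is the floor named in this PR.
MIN_SELECT_TIMEOUT = 0.01


def count_wakeups(timeout: float, duration: float = 0.2) -> int:
    """Count how many times select() returns in `duration` seconds on an idle socket."""
    sel = selectors.DefaultSelector()
    left, right = socket.socketpair()
    sel.register(left, selectors.EVENT_READ)
    wakeups = 0
    deadline = time.monotonic() + duration
    try:
        while time.monotonic() < deadline:
            # With timeout=0 this call never blocks, so the loop spins.
            sel.select(timeout=timeout)
            wakeups += 1
    finally:
        sel.close()
        left.close()
        right.close()
    return wakeups


busy = count_wakeups(timeout=0)
clamped = count_wakeups(timeout=MIN_SELECT_TIMEOUT)
print(f"timeout=0: {busy} wakeups, timeout={MIN_SELECT_TIMEOUT}: {clamped} wakeups")
```

With the floor in place the loop wakes at most roughly 100 times per second, which is still responsive for monitoring but no longer saturates a core.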

Root Cause:
The issue occurs in the _monitor_subprocess method when:

  1. last_heartbeat_ago becomes very large (e.g., 100+ seconds), for example because of network issues or bugs such as #50500 ("The task supervisor continues running indefinitely, even after the associated task process has completed")
  2. The calculation HEARTBEAT_TIMEOUT - last_heartbeat_ago * 0.75 becomes negative
  3. max(0, negative_value) results in 0
  4. selector.select(timeout=0) runs in a non-blocking tight loop
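
The steps above can be reproduced with a small arithmetic sketch. It assumes `HEARTBEAT_TIMEOUT = 60` seconds purely for illustration (the real value is configurable and not stated here), and the function names `wait_before_fix` and `wait_after_fix` are hypothetical. The actual supervisor code may apply further bounds around this expression; only the `max(0, ...)` clamp and the 0.01s floor are taken from the description.

```python
# Assumed value for illustration only; not taken from the PR.
HEARTBEAT_TIMEOUT = 60.0
# Floor introduced by this PR.
MIN_TIMEOUT = 0.01


def wait_before_fix(last_heartbeat_ago: float) -> float:
    """Old behaviour: a negative result is clamped to 0 (non-blocking select)."""
    return max(0, HEARTBEAT_TIMEOUT - last_heartbeat_ago * 0.75)


def wait_after_fix(last_heartbeat_ago: float) -> float:
    """New behaviour: the result never drops below MIN_TIMEOUT."""
    return max(MIN_TIMEOUT, HEARTBEAT_TIMEOUT - last_heartbeat_ago * 0.75)


# Healthy case: last heartbeat 20s ago, 60 - 15 = 45s of wait allowed.
print(wait_before_fix(20.0))   # 45.0
# Stale case: last heartbeat 100s ago, 60 - 75 = -15, clamped to 0.
print(wait_before_fix(100.0))  # 0
# Same stale case with the floor applied.
print(wait_after_fix(100.0))   # 0.01
```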


@kaxil kaxil requested review from amoghrajesh and ashb as code owners May 23, 2025 23:43
@kaxil kaxil added the backport-to-v3-1-test Mark PR with this label to backport to v3-1-test branch label May 23, 2025
@kaxil kaxil merged commit beb7b62 into apache:main May 25, 2025
72 checks passed
@kaxil kaxil deleted the stop-cpu-spike branch May 25, 2025 11:08
github-actions bot pushed a commit that referenced this pull request May 25, 2025
…ut exceeded (#51023)

(cherry picked from commit beb7b62)

Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
@github-actions

Backport successfully created: v3-0-test

Status Branch Result
v3-0-test PR Link

github-actions bot pushed a commit to aws-mwaa/upstream-to-airflow that referenced this pull request May 25, 2025
…ut exceeded (apache#51023)

(cherry picked from commit beb7b62)

Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
github-actions bot pushed a commit to guan404ming/airflow that referenced this pull request May 25, 2025
…ut exceeded (apache#51023)

(cherry picked from commit beb7b62)

Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>

@amoghrajesh amoghrajesh left a comment

Good find!

kaxil added a commit that referenced this pull request Jun 2, 2025
…ut exceeded (#51023) (#51047)

(cherry picked from commit beb7b62)

Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
kaxil added a commit that referenced this pull request Jun 3, 2025
…ut exceeded (#51023) (#51047)

(cherry picked from commit beb7b62)

Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
Labels

area:task-sdk backport-to-v3-1-test Mark PR with this label to backport to v3-1-test branch

Development

Successfully merging this pull request may close these issues.

Task supervisor CPU spike with 0 timeout for socket selector events

4 participants