[Bug]: Python pipelines running on Python 3.11 may experience periodic stuckness. #33966

tvalentyn · 2025-02-12T13:16:40Z

Summary

Python pipelines running on Python 3.11 may periodically get stuck. Beam Dataflow users might see this stuckness accompanied by errors like:

Unable to retrieve status info from SDK harness sdk_harness_id within allowed time

SDK worker appears to be permanently unresponsive. Aborting the SDK.

The issue may be more pronounced in pipelines that frequently trigger garbage collection.

Mitigation: Use Python 3.12, Python 3.10, or switch to Beam 2.64.0 once it is released.
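To check quickly whether an environment falls into the affected combination, a small helper like the one below can be used. This is only a sketch based on the mitigation stated above (Python 3.11 together with a Beam release older than 2.64.0); the function name `may_be_affected` and its parameters are hypothetical and not part of Beam.

```python
import sys


def _major_minor(version):
    """Return the first two numeric components of a dotted version string."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def may_be_affected(python_version_info=None, beam_version=None):
    """Return True if this issue may apply to the given versions.

    Assumption taken from the mitigation note: only Python 3.11 is
    affected, and Beam 2.64.0 or newer contains the fix.
    """
    py = tuple((python_version_info or sys.version_info)[:2])
    if py != (3, 11):
        return False
    if beam_version is None:
        # Beam version unknown: conservatively assume the fix is missing.
        return True
    return _major_minor(beam_version) < (2, 64)
```

For example, `may_be_affected((3, 11, 4), "2.63.0")` returns `True`, while the same call with `"2.64.0"` or with a Python 3.12 version tuple returns `False`. In a worker environment the Beam version string could be taken from `apache_beam.__version__`.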

Details

The Beam SDK has a mechanism to provide a status report to the runner that captures the ongoing work. The status report includes stacktraces of running threads.

To collect such stacktraces, we inspect the content of running Python frames via sys._current_frames().
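For illustration, a minimal sketch of how stacktraces for all threads can be collected with sys._current_frames() is shown here. It is not the code from Beam's worker_status.py; the report layout and the use of thread names are assumptions made for the example.

```python
import sys
import threading
import traceback


def thread_dump():
    """Return a text report with the current stack of every running thread."""
    # Map thread ids to names so the report is easier to read.
    names = {t.ident: t.name for t in threading.enumerate()}
    # sys._current_frames() maps each thread id to its topmost frame.
    frames = sys._current_frames()  # pylint: disable=protected-access
    sections = []
    for thread_id, frame in frames.items():
        name = names.get(thread_id, "unknown")
        stack = "".join(traceback.format_stack(frame))
        sections.append(f"--- Thread {thread_id} ({name}) ---\n{stack}")
    return "\n".join(sections)
```

Calling `print(thread_dump())` from any thread prints one section per live thread, including the calling thread itself.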

It appears that on Python 3.11, such an invocation can cause a deadlock if garbage collection triggers during the call to sys._current_frames(): python/cpython#106883. The issue is not reproducible on Python 3.10 or Python 3.12.
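One way to avoid this interaction is to pause automatic garbage collection for the duration of the call. The following is a hedged sketch of that idea, not the actual Beam change; the helper name `current_frames_gc_safe` is hypothetical.

```python
import gc
import sys


def current_frames_gc_safe():
    """Call sys._current_frames() with automatic GC paused on Python 3.11."""
    # Only guard on 3.11, and only if GC was enabled, so that a caller
    # who disabled GC on purpose does not get it switched back on.
    needs_guard = sys.version_info[:2] == (3, 11) and gc.isenabled()
    if needs_guard:
        gc.disable()
    try:
        return sys._current_frames()  # pylint: disable=protected-access
    finally:
        if needs_guard:
            gc.enable()
```

Note that gc.disable() only stops automatic collection; an explicit gc.collect() in another thread, or another thread toggling the GC state concurrently, is outside what this sketch handles.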

On Python 3.11, a Beam job might get stuck. A stuck job running on Dataflow might have errors like:

Unable to retrieve status info from SDK harness sdk_harness_id within allowed time

SDK worker appears to be permanently unresponsive. Aborting the SDK.

As noted in https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact , such errors can happen when a thread in the Python process permanently holds the GIL.

Inspecting the Dataflow workers with pystack, for example via an automated script like https://gist.github.com/tvalentyn/82fcee6b93253740d2ae50bd425916a5 , reveals a thread that holds the GIL with a stacktrace in frames = sys._current_frames(), and a thread doing garbage collection; sometimes these are the same thread:

Traceback for thread 107 (python) [Has the GIL,Garbage collecting] (most recent call last):
    (C) File "Python/thread_pthread.h", line 241, in pythread_wrapper (/usr/local/lib/libpython3.11.so.1.0)
    (C) File "./Modules/_threadmodule.c", line 1124, in thread_run (/usr/local/lib/libpython3.11.so.1.0)
    (Python) File "/usr/local/lib/python3.11/threading.py", line 1002, in _bootstrap
        self._bootstrap_inner()
    (Python) File "/usr/local/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
        self.run()
    (Python) File "/usr/local/lib/python3.11/threading.py", line 982, in run
        self._target(*self._args, **self._kwargs)
    (Python) File "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/worker_status.py", line 175, in <lambda>
        target=lambda: self._serve(), name='fn_api_status_handler')
    (Python) File "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/worker_status.py", line 200, in _serve
        id=request.id, status_info=self.generate_status_response()))
    (Python) File "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/worker_status.py", line 219, in generate_status_response
        all_status_sections.append(thread_dump())
    (Python) File "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/worker_status.py", line 60, in thread_dump
        frames = sys._current_frames()  # pylint: disable=protected-access
    (C) File "Modules/gcmodule.c", line 2290, in gc_alloc (inlined) (/usr/local/lib/libpython3.11.so.1.0)
    (C) File "Modules/gcmodule.c", line 1400, in gc_collect_with_callback (/usr/local/lib/libpython3.11.so.1.0)
    (C) File "Modules/gcmodule.c", line 1287, in gc_collect_main (/usr/local/lib/libpython3.11.so.1.0)
    (C) File "Modules/gcmodule.c", line 1013, in delete_garbage (inlined) (/usr/local/lib/libpython3.11.so.1.0)
    (C) File "Objects/typeobject.c", line 1279, in subtype_clear (/usr/local/lib/libpython3.11.so.1.0)
    (C) File "Objects/typeobject.c", line 1463, in subtype_dealloc (/usr/local/lib/libpython3.11.so.1.0)
    (C) File "./Modules/_threadmodule.c", line 904, in local_dealloc (/usr/local/lib/libpython3.11.so.1.0)
    (C) File "./Modules/_threadmodule.c", line 872, in local_clear (/usr/local/lib/libpython3.11.so.1.0)
    (C) File "Python/thread_pthread.h", line 497, in PyThread_acquire_lock_timed (/usr/local/lib/libpython3.11.so.1.0)

This failure mode matches the description of python/cpython#106883, which is known to affect CPython 3.11, has been fixed in CPython 3.12, and has not been reproduced in CPython 3.10.
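When triaging many workers, it can help to scan pystack output for this signature automatically. The sketch below assumes the thread header format seen in the trace above ("Traceback for thread N (name) [flags] ..."); the function name `find_suspect_threads` and the returned dictionary layout are hypothetical.

```python
import re

# Header format assumed from the pystack trace shown in this issue.
_HEADER = re.compile(r"^Traceback for thread (\d+) \((.*?)\) \[(.*?)\]")


def find_suspect_threads(pystack_output):
    """Return threads that hold the GIL while inside sys._current_frames()."""
    threads = []
    current = None
    for line in pystack_output.splitlines():
        match = _HEADER.match(line.strip())
        if match:
            flags = {f.strip() for f in match.group(3).split(",") if f.strip()}
            current = {
                "thread_id": int(match.group(1)),
                "flags": flags,
                "in_current_frames": False,
            }
            threads.append(current)
        elif current is not None and "sys._current_frames()" in line:
            current["in_current_frames"] = True
    return [
        t for t in threads
        if "Has the GIL" in t["flags"] and t["in_current_frames"]
    ]
```

A match whose flags also include "Garbage collecting" corresponds to the case where the same thread holds the GIL and runs the collector, as in the trace above.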

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components


tvalentyn added awaiting triage bug labels Feb 12, 2025

tvalentyn self-assigned this Feb 12, 2025

github-actions bot added python P2 and removed awaiting triage labels Feb 12, 2025

tvalentyn mentioned this issue Feb 12, 2025 in "Disable GC before collecting stack frames on Python 3.11" #33967 (merged)

tvalentyn closed this as completed in #33967 Feb 12, 2025

github-actions bot added this to the 2.64.0 Release milestone Feb 12, 2025
