Python pipelines running on Python 3.11 may intermittently get stuck. Beam Dataflow users might see this stuckness accompanied by errors like:
Unable to retrieve status info from SDK harness sdk_harness_id within allowed time
SDK worker appears to be permanently unresponsive. Aborting the SDK.
The issue may be more pronounced in pipelines that frequently trigger garbage collection.
Mitigation: Use Python 3.12, Python 3.10, or switch to Beam 2.64.0 once it is released.
Details
The Beam SDK has a mechanism to provide a status report to the runner that captures the ongoing work. The status report includes stack traces of the running threads.
To collect those stack traces, we inspect the content of the running Python frames via sys._current_frames().
It appears that on Python 3.11, this invocation can deadlock if garbage collection triggers during the call to sys._current_frames(): python/cpython#106883. The issue is not reproducible on Python 3.10 or Python 3.12.
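For illustration, a minimal thread-dump helper along these lines might look like the sketch below; the function name and formatting are simplified and are not the exact Beam worker_status.py implementation:

```python
import sys
import threading
import traceback


def thread_dump():
  """Returns a formatted stack trace for every running thread.

  A minimal sketch of the kind of thread-dump helper used for status
  reporting; names and output format here are illustrative only.
  """
  # Map thread idents to human-readable names where possible.
  thread_names = {t.ident: t.name for t in threading.enumerate()}
  # This is the call that can deadlock on Python 3.11 if garbage
  # collection runs during it (python/cpython#106883).
  frames = sys._current_frames()  # pylint: disable=protected-access
  lines = []
  for thread_id, frame in frames.items():
    name = thread_names.get(thread_id, 'unknown')
    lines.append('--- Thread #%s name: %s ---' % (thread_id, name))
    lines.extend(traceback.format_stack(frame))
  return '\n'.join(lines)
```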
On Python 3.11, a Beam job might get stuck. A stuck job running on Dataflow might report errors like:
Unable to retrieve status info from SDK harness sdk_harness_id within allowed time
SDK worker appears to be permanently unresponsive. Aborting the SDK.
As noted in https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact , such errors can happen when a thread in the Python process permanently holds the GIL.
Inspecting the Dataflow workers with pystack, for example via an automated script like https://gist.github.com/tvalentyn/82fcee6b93253740d2ae50bd425916a5 , reveals a thread with a stack trace in frames = sys._current_frames() holding the GIL and a thread performing garbage collection; sometimes these are the same thread:
Traceback for thread 107 (python) [Has the GIL,Garbage collecting] (most recent call last):
(C) File "Python/thread_pthread.h", line 241, in pythread_wrapper (/usr/local/lib/libpython3.11.so.1.0)
(C) File "./Modules/_threadmodule.c", line 1124, in thread_run (/usr/local/lib/libpython3.11.so.1.0)
(Python) File "/usr/local/lib/python3.11/threading.py", line 1002, in _bootstrap
self._bootstrap_inner()
(Python) File "/usr/local/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
self.run()
(Python) File "/usr/local/lib/python3.11/threading.py", line 982, in run
self._target(*self._args, **self._kwargs)
(Python) File "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/worker_status.py", line 175, in <lambda>
target=lambda: self._serve(), name='fn_api_status_handler')
(Python) File "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/worker_status.py", line 200, in _serve
id=request.id, status_info=self.generate_status_response()))
(Python) File "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/worker_status.py", line 219, in generate_status_response
all_status_sections.append(thread_dump())
(Python) File "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/worker_status.py", line 60, in thread_dump
frames = sys._current_frames() # pylint: disable=protected-access
(C) File "Modules/gcmodule.c", line 2290, in gc_alloc (inlined) (/usr/local/lib/libpython3.11.so.1.0)
(C) File "Modules/gcmodule.c", line 1400, in gc_collect_with_callback (/usr/local/lib/libpython3.11.so.1.0)
(C) File "Modules/gcmodule.c", line 1287, in gc_collect_main (/usr/local/lib/libpython3.11.so.1.0)
(C) File "Modules/gcmodule.c", line 1013, in delete_garbage (inlined) (/usr/local/lib/libpython3.11.so.1.0)
(C) File "Objects/typeobject.c", line 1279, in subtype_clear (/usr/local/lib/libpython3.11.so.1.0)
(C) File "Objects/typeobject.c", line 1463, in subtype_dealloc (/usr/local/lib/libpython3.11.so.1.0)
(C) File "./Modules/_threadmodule.c", line 904, in local_dealloc (/usr/local/lib/libpython3.11.so.1.0)
(C) File "./Modules/_threadmodule.c", line 872, in local_clear (/usr/local/lib/libpython3.11.so.1.0)
(C) File "Python/thread_pthread.h", line 497, in PyThread_acquire_lock_timed (/usr/local/lib/libpython3.11.so.1.0)
This failure mode matches the description of python/cpython#106883, which is known to affect CPython 3.11, has been fixed in CPython 3.12, and has not been reproduced in CPython 3.10.
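One way to shrink the window for this race, sketched below, is to pause automatic garbage collection around the sys._current_frames() call. This is only an illustrative workaround, assuming the collection is triggered by allocations made during the call; it is not necessarily the change that ships in Beam 2.64.0:

```python
import contextlib
import gc
import sys


@contextlib.contextmanager
def gc_paused():
  """Temporarily disables automatic garbage collection.

  Illustrative only: reduces the chance that a GC cycle starts while
  sys._current_frames() is running on Python 3.11.
  """
  was_enabled = gc.isenabled()
  gc.disable()
  try:
    yield
  finally:
    if was_enabled:
      gc.enable()


# Usage: collect the frames with automatic GC disabled for the duration.
with gc_paused():
  frames = sys._current_frames()  # pylint: disable=protected-access
```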
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components
Component: Python SDK
Component: Java SDK
Component: Go SDK
Component: Typescript SDK
Component: IO connector
Component: Beam YAML
Component: Beam examples
Component: Beam playground
Component: Beam katas
Component: Website
Component: Infrastructure
Component: Spark Runner
Component: Flink Runner
Component: Samza Runner
Component: Twister2 Runner
Component: Hazelcast Jet Runner
Component: Google Cloud Dataflow Runner