[Bug]: Python pipelines running on Python 3.11 may experience periodic stuckness. #33966

Closed
1 of 17 tasks
tvalentyn opened this issue Feb 12, 2025 · 0 comments · Fixed by #33967
tvalentyn commented Feb 12, 2025

Summary

Python pipelines running on Python 3.11 may periodically get stuck. Beam Dataflow users might see this stuckness accompanied by errors like:

Unable to retrieve status info from SDK harness sdk_harness_id within allowed time

SDK worker appears to be permanently unresponsive. Aborting the SDK.

The issue may be more pronounced in pipelines that frequently trigger garbage collection.

Mitigation: Use Python 3.12, Python 3.10, or switch to Beam 2.64.0 once it is released.

Details

The Beam SDK has a mechanism for providing a runner with a status report that captures the ongoing work. The status report includes stacktraces of running threads.

To collect such stacktraces, we inspect the content of running Python frames via sys._current_frames().
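As a rough illustration of this kind of stack collection, here is a minimal sketch. It is not the code in Beam's worker_status.py; the function name `dump_all_thread_stacks` and the output layout are assumptions made for this example only.

```python
import sys
import threading
import traceback


def dump_all_thread_stacks():
    """Return a text dump with one section per running thread (illustrative only)."""
    # Map thread ids to names so each section is easier to read.
    names = {t.ident: t.name for t in threading.enumerate()}
    # sys._current_frames() maps each thread id to that thread's topmost frame.
    frames = sys._current_frames()  # pylint: disable=protected-access
    sections = []
    for thread_id, frame in frames.items():
        header = f"--- Thread {thread_id} ({names.get(thread_id, 'unknown')}) ---"
        stack = "".join(traceback.format_stack(frame))
        sections.append(header + "\n" + stack)
    return "\n".join(sections)
```

Calling `print(dump_all_thread_stacks())` from any thread prints the stacks of all threads, including the caller's own.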

It appears that on Python 3.11, this call can deadlock if garbage collection triggers while sys._current_frames() is executing: python/cpython#106883. The issue is not reproducible on Python 3.10 or Python 3.12.
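One conceivable way to avoid the collector running during the call is to pause automatic garbage collection around it. The sketch below is a hypothetical workaround under that assumption, not necessarily what the fix in #33967 does; the helper names `gc_paused` and `current_frames_without_gc` are made up for this example.

```python
import gc
import sys
from contextlib import contextmanager


@contextmanager
def gc_paused():
    """Temporarily disable automatic garbage collection (hypothetical helper)."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        # Re-enable only if collection was on before we paused it.
        if was_enabled:
            gc.enable()


def current_frames_without_gc():
    """Call sys._current_frames(), pausing automatic GC on Python 3.11 only."""
    if sys.version_info[:2] == (3, 11):
        with gc_paused():
            return sys._current_frames()  # pylint: disable=protected-access
    return sys._current_frames()  # pylint: disable=protected-access
```

Note that gc.disable() applies to the whole process, so another thread that toggles the collector at the same time could interfere with this guard; treat it as an illustration of the idea rather than a vetted fix.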

On Python 3.11, a Beam job might get stuck. A stuck job running on Dataflow might have errors like:

Unable to retrieve status info from SDK harness sdk_harness_id within allowed time

SDK worker appears to be permanently unresponsive. Aborting the SDK.

As noted in https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact , such errors can happen when a thread in the Python process permanently holds the GIL.

Inspecting the Dataflow workers with pystack, for example via an automated script like https://gist.github.com/tvalentyn/82fcee6b93253740d2ae50bd425916a5 , reveals a thread that holds the GIL with a stacktrace at frames = sys._current_frames(), and a thread performing garbage collection; sometimes these are the same thread:

Traceback for thread 107 (python) [Has the GIL,Garbage collecting] (most recent call last):
    (C) File "Python/thread_pthread.h", line 241, in pythread_wrapper (/usr/local/lib/libpython3.11.so.1.0)
    (C) File "./Modules/_threadmodule.c", line 1124, in thread_run (/usr/local/lib/libpython3.11.so.1.0)
    (Python) File "/usr/local/lib/python3.11/threading.py", line 1002, in _bootstrap
        self._bootstrap_inner()
    (Python) File "/usr/local/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
        self.run()
    (Python) File "/usr/local/lib/python3.11/threading.py", line 982, in run
        self._target(*self._args, **self._kwargs)
    (Python) File "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/worker_status.py", line 175, in <lambda>
        target=lambda: self._serve(), name='fn_api_status_handler')
    (Python) File "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/worker_status.py", line 200, in _serve
        id=request.id, status_info=self.generate_status_response()))
    (Python) File "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/worker_status.py", line 219, in generate_status_response
        all_status_sections.append(thread_dump())
    (Python) File "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/worker_status.py", line 60, in thread_dump
        frames = sys._current_frames()  # pylint: disable=protected-access
    (C) File "Modules/gcmodule.c", line 2290, in gc_alloc (inlined) (/usr/local/lib/libpython3.11.so.1.0)
    (C) File "Modules/gcmodule.c", line 1400, in gc_collect_with_callback (/usr/local/lib/libpython3.11.so.1.0)
    (C) File "Modules/gcmodule.c", line 1287, in gc_collect_main (/usr/local/lib/libpython3.11.so.1.0)
    (C) File "Modules/gcmodule.c", line 1013, in delete_garbage (inlined) (/usr/local/lib/libpython3.11.so.1.0)
    (C) File "Objects/typeobject.c", line 1279, in subtype_clear (/usr/local/lib/libpython3.11.so.1.0)
    (C) File "Objects/typeobject.c", line 1463, in subtype_dealloc (/usr/local/lib/libpython3.11.so.1.0)
    (C) File "./Modules/_threadmodule.c", line 904, in local_dealloc (/usr/local/lib/libpython3.11.so.1.0)
    (C) File "./Modules/_threadmodule.c", line 872, in local_clear (/usr/local/lib/libpython3.11.so.1.0)
    (C) File "Python/thread_pthread.h", line 497, in PyThread_acquire_lock_timed (/usr/local/lib/libpython3.11.so.1.0)

This failure mode matches the description of python/cpython#106883, which is known to affect CPython 3.11, has been fixed in CPython 3.12, and has not been reproduced in CPython 3.10.

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner