Execute the garbage collector only on the eval breaker #97922

Closed
pablogsal opened this issue Oct 5, 2022 · 1 comment

Comments

@pablogsal
Member

Currently, the GC can run on any object allocation. Historically, this has been the source of many problems, because a collection can be triggered at points where the VM is in an inconsistent state. This includes critical points of the eval loop, but also complex object creation: the GC can run while the sub-elements of the final result are being created, before the object itself is fully initialized.

To improve the situation, object allocation can merely schedule a GC run, and the GC then actually runs only on the eval breaker, in the same way we currently run signal handlers, switch the GIL, and run pending callbacks.
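As a rough illustration of the idea, here is a toy model in Python. It is not CPython's implementation: `ToyVM`, its `threshold`, and the `eval_breaker` method are hypothetical names invented for this sketch. It only shows the shape of the proposal, where allocation records a request and a known-safe check point does the work.

```python
class ToyVM:
    """Toy model of deferred GC scheduling (hypothetical, not CPython code)."""

    def __init__(self, threshold):
        self.threshold = threshold   # allocations allowed before a GC is requested
        self.alloc_count = 0         # allocations since the last collection
        self.gc_scheduled = False    # stands in for a flag checked by the eval breaker
        self.collections = []        # records where each collection actually ran

    def allocate(self):
        # Allocation never collects; it only records a request.
        self.alloc_count += 1
        if self.alloc_count > self.threshold:
            self.gc_scheduled = True

    def eval_breaker(self, where):
        # A known-safe point, analogous to where signal handlers,
        # GIL switches and pending callbacks are handled.
        if self.gc_scheduled:
            self.gc_scheduled = False
            self.alloc_count = 0
            self.collections.append(where)

    def build_list(self, n):
        # "Complex object creation": many allocations, no safe point inside.
        for _ in range(n):
            self.allocate()


vm = ToyVM(threshold=3)
vm.build_list(10)                          # a GC is requested, but none runs here
requested_during_build = vm.gc_scheduled
ran_during_build = list(vm.collections)
vm.eval_breaker("after build_list")        # the collection happens at the safe point
```

In this model the half-built result of `build_list` is never visible to a collection, because the only place a collection can happen is `eval_breaker`.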

@pablogsal
Member Author

One small consideration:

We can opportunistically check in PyErr_CheckSignals whether a GC run has been requested and, if so, run it there. This benefits native code that has to call PyErr_CheckSignals to make sure signals are handled when it is going to run for some time without executing Python code. Checking for the GC here allows long-running native code to clean up cycles created through the C-API even if it never runs the evaluation loop.
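The effect can be sketched with another toy model in Python. Everything here is hypothetical: `ToyRuntime.check_signals` only stands in for PyErr_CheckSignals, and `long_running_native_work` stands in for an extension function that never re-enters the evaluation loop. Signal handling itself is omitted.

```python
class ToyRuntime:
    """Toy stand-in for the interpreter state (hypothetical, not CPython code)."""

    def __init__(self):
        self.gc_scheduled = False
        self.gc_runs = 0

    def request_gc(self):
        # What an allocation would do once the GC threshold is crossed.
        self.gc_scheduled = True

    def check_signals(self):
        # Stand-in for PyErr_CheckSignals(): besides handling signals
        # (not modeled), it also serves a pending GC request.
        if self.gc_scheduled:
            self.gc_scheduled = False
            self.gc_runs += 1


def long_running_native_work(runtime, steps, check_every):
    # Simulated native loop that never returns to the eval loop.
    for step in range(1, steps + 1):
        if step % 10 == 0:
            runtime.request_gc()       # pretend allocations crossed the threshold
        if step % check_every == 0:
            runtime.check_signals()    # periodic check, as polite native code does
    return runtime.gc_runs


checked = ToyRuntime()
runs_with_checks = long_running_native_work(checked, steps=100, check_every=25)

unchecked = ToyRuntime()
runs_without_checks = long_running_native_work(unchecked, steps=100, check_every=1000)
```

With periodic checks, each pending request is served at the next check; without them, the request simply stays pending until the code returns to a point where the eval breaker is consulted.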

carljm added a commit to carljm/cpython that referenced this issue Oct 8, 2022
* main:
  pythongh-68686: Retire eptag ptag scripts (python#98064)
  pythongh-97922: Run the GC only on eval breaker (python#97920)
  GitHub Workflows security hardening (python#96492)
  Add `@ezio-melotti` as codeowner for `.github/`. (python#98079)
  pythongh-97913 Docs: Add walrus operator to the index (python#97921)
  [doc] Fix broken links to C extensions accelerating stdlib modules (python#96914)
  pythongh-97822: Fix http.server documentation reference to test() function (python#98027)
  pythongh-91052: Add PyDict_Unwatch for unwatching a dictionary (python#98055)
  pythonGH-98023: Change default child watcher to PidfdChildWatcher on supported systems (python#98024)
  pythonGH-94182: Run the PidfdChildWatcher on the running loop (python#94184)
carljm added a commit to carljm/cpython that referenced this issue Oct 9, 2022
* main: (5519 commits)
  Minor edits to the Descriptor HowTo Guide (pythonGH-24901)
  Fix link to Lifecycle of a Pull Request in CONTRIBUTING (python#98102)
  pythonGH-94597: deprecate `SafeChildWatcher`, `FastChildWatcher` and `MultiLoopChildWatcher` child watchers  (python#98089)
  Auto-cancel old builds when new commit pushed to branch (python#98009)
  pythongh-95011: Migrate syslog module to Argument Clinic (pythonGH-95012)
  pythongh-68686: Retire eptag ptag scripts (python#98064)
  pythongh-97922: Run the GC only on eval breaker (python#97920)
  GitHub Workflows security hardening (python#96492)
  Add `@ezio-melotti` as codeowner for `.github/`. (python#98079)
  pythongh-97913 Docs: Add walrus operator to the index (python#97921)
  [doc] Fix broken links to C extensions accelerating stdlib modules (python#96914)
  pythongh-97822: Fix http.server documentation reference to test() function (python#98027)
  pythongh-91052: Add PyDict_Unwatch for unwatching a dictionary (python#98055)
  pythonGH-98023: Change default child watcher to PidfdChildWatcher on supported systems (python#98024)
  pythonGH-94182: Run the PidfdChildWatcher on the running loop (python#94184)
  pythongh-92886: make test_ast pass with -O (assertions off) (pythonGH-98058)
  pythongh-92886: make test_coroutines pass with -O (assertions off) (pythonGH-98060)
  pythongh-57179: Add note on symlinks for os.walk (python#94799)
  pythongh-94808: Fix regex on exotic platforms (python#98036)
  pythongh-90085: Remove vestigial -t and -c timeit options (python#94941)
  ...
mpage pushed a commit to mpage/cpython that referenced this issue Oct 11, 2022
vstinner added a commit to vstinner/cpython that referenced this issue Aug 22, 2023
* Rename Lib/test/crashers/ to Lib/test/test_crashers/.
* Move Lib/test/test_crashers.py to
  Lib/test/test_crashers/__init__.py.
* test_crashers is no longer skipped and makes sure that scripts do
  crash, and do not simply fail with a non-zero exit code.
* Update bogus_code_obj.py to use CodeType.replace().
* Remove Lib/test/crashers/ scripts which no longer crash:

  * recursive_call.py: fixed by pythongh-89419
  * mutation_inside_cyclegc.py: fixed by pythongh-97922
  * trace_at_recursion_limit.py: fixed by Python 3.7
vstinner added a commit to vstinner/cpython that referenced this issue Aug 22, 2023
* Rename Lib/test/crashers/ to Lib/test/test_crashers/.
* Move Lib/test/test_crashers.py to
  Lib/test/test_crashers/__init__.py.
* test_crashers is no longer skipped and makes sure that scripts do
  crash, and do not simply fail with a non-zero exit code.
* Update bogus_code_obj.py to use CodeType.replace().
* Scripts crashing Python now use SuppressCrashReport of
  test.support to not create coredump files.
* Remove Lib/test/crashers/ scripts which no longer crash:

  * recursive_call.py: fixed by pythongh-89419
  * mutation_inside_cyclegc.py: fixed by pythongh-97922
  * trace_at_recursion_limit.py: fixed by Python 3.7
nsrip-dd added a commit to DataDog/dd-trace-py that referenced this issue Apr 2, 2025
Add a regression test for races in the memory allocation profiler. The
test is marked skip for now, for a few reasons:

- It doesn't trigger the crash in a deterministic amount of time, so
  it's not really reasonable for CI/local dev loop as-is
- It probably benefits more from having the thread sanitizer enabled,
  which we don't currently do for the memalloc extension

I'm adding the test so that we have an actual reproducer of the problem
that we can easily run ourselves available to any dd-trace-py
developers, and have it actually committed somewhere people can find it.
It's currently only really useful for local development. I plan to
tweak/optimize some of the synchronization code to reduce memalloc
overhead, and we need a reliable reproducer of the crashes the
synchronization was meant to fix in order to be confident we don't
reintroduce them.

The test reproduces the crash fixed by #11460, as well as the exception
fixed by #12075. Both issues stem from the same problem: at one point,
memalloc had no synchronization beyond the GIL protecting its internal
state. It turns out that calling back into C Python APIs, as we do when
collecting tracebacks, can in some cases lead to the GIL being released.
So we need additional synchronization for state modification that
straddles C Python API calls. We previously only reliably saw this in a
demo program but weren't able to reproduce it locally. Now that I
understand the crash much better, I was able to create a standalone
reproducer. The key elements are: allocate a lot, trigger GC a lot
(including from memalloc traceback collection), and release the GIL
during GC.

Important note: this only reliably crashes on Python 3.11. The very
specific path to releasing the GIL that we hit was modified in 3.12 and
later (see python/cpython#97922). We will
probably support 3.11 for a while longer, so it's still worth having
this test.
chojomok pushed a commit to DataDog/dd-trace-py that referenced this issue Apr 7, 2025