Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pybind11::handle inc_ref() & dec_ref() PyGILState_Check() **excluding** nullptr #4246

Merged
merged 3 commits into from
Dec 9, 2022

Conversation

rwgk
Copy link
Collaborator

@rwgk rwgk commented Oct 17, 2022

Description

Background: https://docs.python.org/3/glossary.html#term-global-interpreter-lock

The GIL must be held when calling any Python C API functions. In multithreaded applications that use callbacks this requirement can easily be violated by accident. A general tool to ensure GIL health is not available, but the this PR integrates a very simple basic health check into pybind11::handle.

A comparison of test runtimes with and without this PR showed that the runtime overhead for the added checks is practically undetectable for non-optimized builds, but Google global testing with this PR uncovered a number (10+) of often very subtle GIL-not-held issues, including bugs in a few 3rd party projects (e.g. pikepdf).

The added GIL checks are guarded by PYBIND11_ASSERT_GIL_HELD_INCREF_DECREF, which is the default only if NDEBUG is not defined:

+#if !defined(NDEBUG) && !defined(PY_ASSERT_GIL_HELD_INCREF_DECREF)                                \
+    && !defined(PYBIND11_ASSERT_GIL_HELD_INCREF_DECREF)
+#    define PYBIND11_ASSERT_GIL_HELD_INCREF_DECREF
+#endif

The test for the PY_ASSERT_GIL_HELD_INCREF_DECREF macro is for cooperation with a more general patch in the Python sources, inserting the GIL checks directly in Py_INCREF & Py_DECREF, which makes the checks in pybind11 redundant.

Note that the GIL checks are skipped if py::handle ptr() == nullptr. This could mask bugs, in particular for classes with py::object members for which ptr() == nullptr in all unit tests, but not in production. Unfortunately Google global testing revealed that such members are much more common than one might expect. It is currently unclear if the cleanup cost for enforcing that the GIL is held even for nullptr is commensurate with the benefits of preventing a potentially only very small number of production bugs.

Suggested changelog entry:

``PyGILState_Check()``s were integrated to ``pybind11::handle`` ``inc_ref()`` & ``dec_ref()``. The added GIL checks are guarded by ``PYBIND11_ASSERT_GIL_HELD_INCREF_DECREF``, which is the default only if ``NDEBUG`` is not defined.

@rwgk
Copy link
Collaborator Author

rwgk commented Oct 17, 2022

Hi @mattip, in this PR I had to disable the new feature for Windows PyPy (b770824). This isn't a problem for my purposes, but I'm curious why Windows PyPy is hanging with the new code? Note that macOS and Linux PyPy are fine.

@mattip
Copy link

mattip commented Oct 17, 2022

why Windows PyPy is hanging with the new code

Is there a way to boil the test down to a minimal reproducer and open a PyPy issue?

@rwgk
Copy link
Collaborator Author

rwgk commented Oct 17, 2022

why Windows PyPy is hanging with the new code

Is there a way to boil the test down to a minimal reproducer and open a PyPy issue?

Ooo, that's hard, I'd have to reduce via the GitHub CI. I don't have access to a suitable Windows machine. And not really any clue where the problem could be, beyond observing that pytest didn't even start up.

@rwgk
Copy link
Collaborator Author

rwgk commented Oct 17, 2022

FYI — This PR was deployed to Google's production systems today (after a relatively small internal cleanup). I'll keep this in draft mode for a week or so, JIC there is some pushback.

@mattip
Copy link

mattip commented Oct 18, 2022

It seems the CI failed on this test

assert _run_in_process(_python_to_cpp_to_python_from_threads, 1) == 0

The code for that test is here.

#endif
#if defined(PYBIND11_ASSERT_GIL_HELD_INCREF_DECREF)
if (m_ptr != nullptr && !PyGILState_Check()) {
throw std::runtime_error("pybind11::handle::inc_ref() PyGILState_Check() failure.");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason no pybind11_fail? Because of the corresponding !PyErrOccurred() assertion?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.
But even if that was taken out in pybind11_fail(), I think a "raw" C++ exception is best here, we want the process to terminate as directly as possible. I was thinking of using std::terminate() directly, but then it is troublesome to also produce the message, so I decided simpler is better. In our environment at least, std::runtime_error() works out very nicely (useful traceback with message).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, functions that can raise exceptions do come with some subtle costs in C++. The compiler needs to keep track of the many ways in which might have to destroy objects if an exception arises at various different points of the program and generate code for each one. I don't think it's a performance issue, but this will likely make the debug binaries bigger (compared to pybind11_fail) given that such a central operation is affected.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Narrow focus:

Currently pybind11_fail() includes this:

assert(!PyErr_Occurred());

That needs the GIL, but here (handle) we just determined that we don't have it.

Bigger picture:

There is a general tradeoff: a faster or leaner binary today vs a future with more features.

Boundary condition: human time is ~constant.

So ultimately we always trade one for the other.

If you hire person A who likes to optimize, you may get something that runs in 1.0 time units with 1.0 resources, with 10 features.

If you hire person B, who's more like me, you get something that runs in 1.1 time units with 1.1 resources, with 20 features.

What's "better"?

It depends.

@rwgk
Copy link
Collaborator Author

rwgk commented Oct 18, 2022

It seems the CI failed on this test

assert _run_in_process(_python_to_cpp_to_python_from_threads, 1) == 0

The code for that test is here.

Oh, thanks for looking and pointing this out! I was misled by what I saw in the web view of the logs yesterday before I cancelled the jobs. There was just a spinning wheel for a long time without any output.

I'll try to get back here as soon as I get a chance.

@rwgk rwgk force-pushed the assert_gil_held_incref_decref branch from b770824 to 37c9cb7 Compare October 21, 2022 19:33
rwgk pushed a commit to Chekov2k/pybind11 that referenced this pull request Oct 26, 2022
rwgk pushed a commit to Chekov2k/pybind11 that referenced this pull request Oct 28, 2022
rwgk pushed a commit to Chekov2k/pybind11 that referenced this pull request Oct 30, 2022
rwgk pushed a commit to Chekov2k/pybind11 that referenced this pull request Oct 30, 2022
rwgk pushed a commit that referenced this pull request Oct 30, 2022
* Add option to force the use of the PYPY GIL scoped acquire/release logic to support nested gil access, see #1276 and pytorch/pytorch#83101

* Apply suggestions from code review

* Update CMakeLists.txt

* docs: update upgrade guide

* Update docs/upgrade.rst

* All bells & whistles.

* Add Reminder to common.h, so that we will not forget to purge `!WITH_THREAD` branches when dropping Python 3.6

* New sentence instead of semicolon.

* Temporarily pull in snapshot of PR #4246

* Add `test_release_acquire`

* Add more unit tests for nested gil locking

* Add test_report_builtins_internals_keys

* Very minor enhancement: sort list only after filtering.

* Revert change in docs/upgrade.rst

* Add test_multi_acquire_release_cross_module, while also forcing unique PYBIND11_INTERNALS_VERSION for cross_module_gil_utils.cpp

* Hopefully fix apparently new ICC error.

```
2022-10-28T07:57:54.5187728Z -- The CXX compiler identification is Intel 2021.7.0.20220726
...
2022-10-28T07:58:53.6758994Z icpc: remark #10441: The Intel(R) C++ Compiler Classic (ICC) is deprecated and will be removed from product release in the second half of 2023. The Intel(R) oneAPI DPC++/C++ Compiler (ICX) is the recommended compiler moving forward. Please transition to use this compiler. Use '-diag-disable=10441' to disable this message.
2022-10-28T07:58:54.5801597Z In file included from /home/runner/work/pybind11/pybind11/include/pybind11/detail/../detail/type_caster_base.h(15),
2022-10-28T07:58:54.5803794Z                  from /home/runner/work/pybind11/pybind11/include/pybind11/detail/../cast.h(15),
2022-10-28T07:58:54.5805740Z                  from /home/runner/work/pybind11/pybind11/include/pybind11/detail/../attr.h(14),
2022-10-28T07:58:54.5809556Z                  from /home/runner/work/pybind11/pybind11/include/pybind11/detail/class.h(12),
2022-10-28T07:58:54.5812154Z                  from /home/runner/work/pybind11/pybind11/include/pybind11/pybind11.h(13),
2022-10-28T07:58:54.5948523Z                  from /home/runner/work/pybind11/pybind11/tests/cross_module_gil_utils.cpp(13):
2022-10-28T07:58:54.5949009Z /home/runner/work/pybind11/pybind11/include/pybind11/detail/../detail/internals.h(177): error #2282: unrecognized GCC pragma
2022-10-28T07:58:54.5949374Z       PYBIND11_TLS_KEY_INIT(tstate)
2022-10-28T07:58:54.5949579Z       ^
2022-10-28T07:58:54.5949695Z
```

* clang-tidy fixes

* Workaround for PYPY WIN exitcode None

* Revert "Temporarily pull in snapshot of PR #4246"

This reverts commit 23ac16e.

* Another workaround for PYPY WIN exitcode None

* Clean up how the tests are run "run in process" Part 1: uniformity

* Clean up how the tests are run "run in process" Part 2: use `@pytest.mark.parametrize` and clean up the naming.

* Skip some tests `#if defined(THREAD_SANITIZER)` (tested with TSAN using the Google-internal toolchain).

* Run all tests again but ignore ThreadSanitizer exitcode 66 (this is less likely to mask unrelated ThreadSanitizer issues in the future).

* bug fix: missing common.h include before using `PYBIND11_SIMPLE_GIL_MANAGEMENT`

For the tests in the github CI this does not matter, because
`PYBIND11_SIMPLE_GIL_MANAGEMENT` is always defined from the command line,
but when monkey-patching common.h locally, it matters.

* if process.exitcode is None: assert t_delta > 9.9

* More sophisiticated `_run_in_process()` implementation, clearly reporting `DEADLOCK`, additionally exercised via added `intentional_deadlock()`

* Wrap m.intentional_deadlock in a Python function, for `ForkingPickler` compatibility.

```
>       ForkingPickler(file, protocol).dump(obj)
E       TypeError: cannot pickle 'PyCapsule' object
```

Observed with all Windows builds including mingw but not PyPy, and macos-latest with Python 3.9, 3.10, 3.11 but not 3.6.

* Add link to potential solution for WOULD-BE-NICE-TO-HAVE feature.

* Add `SKIP_IF_DEADLOCK = True` option, to not pollute the CI results with expected `DEADLOCK` failures while we figure out what to do about them.

* Add COPY-PASTE-THIS: gdb ... command (to be used for debugging the detected deadlock)

* style: pre-commit fixes

* Do better than automatic pre-commit fixes.

* Add `PYBIND11_SIMPLE_GIL_MANAGEMENT` to `pytest_report_header()` (so that we can easily know when harvesting deadlock information from the CI logs).

Co-authored-by: Arnim Balzer <arnim@seechange.ai>
Co-authored-by: Henry Schreiner <HenrySchreinerIII@gmail.com>
Co-authored-by: Ralf W. Grosse-Kunstleve <rwgk@google.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
henryiii added a commit that referenced this pull request Oct 31, 2022
* Add option to force the use of the PYPY GIL scoped acquire/release logic to support nested gil access, see #1276 and pytorch/pytorch#83101

* Apply suggestions from code review

* Update CMakeLists.txt

* docs: update upgrade guide

* Update docs/upgrade.rst

* All bells & whistles.

* Add Reminder to common.h, so that we will not forget to purge `!WITH_THREAD` branches when dropping Python 3.6

* New sentence instead of semicolon.

* Temporarily pull in snapshot of PR #4246

* Add `test_release_acquire`

* Add more unit tests for nested gil locking

* Add test_report_builtins_internals_keys

* Very minor enhancement: sort list only after filtering.

* Revert change in docs/upgrade.rst

* Add test_multi_acquire_release_cross_module, while also forcing unique PYBIND11_INTERNALS_VERSION for cross_module_gil_utils.cpp

* Hopefully fix apparently new ICC error.

```
2022-10-28T07:57:54.5187728Z -- The CXX compiler identification is Intel 2021.7.0.20220726
...
2022-10-28T07:58:53.6758994Z icpc: remark #10441: The Intel(R) C++ Compiler Classic (ICC) is deprecated and will be removed from product release in the second half of 2023. The Intel(R) oneAPI DPC++/C++ Compiler (ICX) is the recommended compiler moving forward. Please transition to use this compiler. Use '-diag-disable=10441' to disable this message.
2022-10-28T07:58:54.5801597Z In file included from /home/runner/work/pybind11/pybind11/include/pybind11/detail/../detail/type_caster_base.h(15),
2022-10-28T07:58:54.5803794Z                  from /home/runner/work/pybind11/pybind11/include/pybind11/detail/../cast.h(15),
2022-10-28T07:58:54.5805740Z                  from /home/runner/work/pybind11/pybind11/include/pybind11/detail/../attr.h(14),
2022-10-28T07:58:54.5809556Z                  from /home/runner/work/pybind11/pybind11/include/pybind11/detail/class.h(12),
2022-10-28T07:58:54.5812154Z                  from /home/runner/work/pybind11/pybind11/include/pybind11/pybind11.h(13),
2022-10-28T07:58:54.5948523Z                  from /home/runner/work/pybind11/pybind11/tests/cross_module_gil_utils.cpp(13):
2022-10-28T07:58:54.5949009Z /home/runner/work/pybind11/pybind11/include/pybind11/detail/../detail/internals.h(177): error #2282: unrecognized GCC pragma
2022-10-28T07:58:54.5949374Z       PYBIND11_TLS_KEY_INIT(tstate)
2022-10-28T07:58:54.5949579Z       ^
2022-10-28T07:58:54.5949695Z
```

* clang-tidy fixes

* Workaround for PYPY WIN exitcode None

* Revert "Temporarily pull in snapshot of PR #4246"

This reverts commit 23ac16e.

* Another workaround for PYPY WIN exitcode None

* Clean up how the tests are run "run in process" Part 1: uniformity

* Clean up how the tests are run "run in process" Part 2: use `@pytest.mark.parametrize` and clean up the naming.

* Skip some tests `#if defined(THREAD_SANITIZER)` (tested with TSAN using the Google-internal toolchain).

* Run all tests again but ignore ThreadSanitizer exitcode 66 (this is less likely to mask unrelated ThreadSanitizer issues in the future).

* bug fix: missing common.h include before using `PYBIND11_SIMPLE_GIL_MANAGEMENT`

For the tests in the github CI this does not matter, because
`PYBIND11_SIMPLE_GIL_MANAGEMENT` is always defined from the command line,
but when monkey-patching common.h locally, it matters.

* if process.exitcode is None: assert t_delta > 9.9

* More sophisiticated `_run_in_process()` implementation, clearly reporting `DEADLOCK`, additionally exercised via added `intentional_deadlock()`

* Wrap m.intentional_deadlock in a Python function, for `ForkingPickler` compatibility.

```
>       ForkingPickler(file, protocol).dump(obj)
E       TypeError: cannot pickle 'PyCapsule' object
```

Observed with all Windows builds including mingw but not PyPy, and macos-latest with Python 3.9, 3.10, 3.11 but not 3.6.

* Add link to potential solution for WOULD-BE-NICE-TO-HAVE feature.

* Add `SKIP_IF_DEADLOCK = True` option, to not pollute the CI results with expected `DEADLOCK` failures while we figure out what to do about them.

* Add COPY-PASTE-THIS: gdb ... command (to be used for debugging the detected deadlock)

* style: pre-commit fixes

* Do better than automatic pre-commit fixes.

* Add `PYBIND11_SIMPLE_GIL_MANAGEMENT` to `pytest_report_header()` (so that we can easily know when harvesting deadlock information from the CI logs).

Co-authored-by: Arnim Balzer <arnim@seechange.ai>
Co-authored-by: Henry Schreiner <HenrySchreinerIII@gmail.com>
Co-authored-by: Ralf W. Grosse-Kunstleve <rwgk@google.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@rwgk rwgk force-pushed the assert_gil_held_incref_decref branch from 37c9cb7 to 3115440 Compare November 1, 2022 18:11
@rwgk rwgk marked this pull request as ready for review November 1, 2022 18:18
@rwgk rwgk requested a review from henryiii November 1, 2022 18:19
@rwgk
Copy link
Collaborator Author

rwgk commented Nov 1, 2022

FYI — This PR was deployed to Google's production systems today (after a relatively small internal cleanup). I'll keep this in draft mode for a week or so, JIC there is some pushback.

15 days later: I only got two questions re problems that turned out to be completely unrelated, and one report of a bug uncovered by this PR and then fixed.

Also note that this PR would have immediately flagged the root cause of #4137, quite likely that issue would not even have been filed if we had made this change earlier.

Copy link
Collaborator

@henryiii henryiii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given you've checked runtime, and this helps with safety, I'm for it.

@rwgk rwgk merged commit 65374c8 into pybind:master Dec 9, 2022
@github-actions github-actions bot added the needs changelog Possibly needs a changelog entry label Dec 9, 2022
@rwgk rwgk deleted the assert_gil_held_incref_decref branch December 9, 2022 06:07
@henryiii henryiii removed the needs changelog Possibly needs a changelog entry label Dec 20, 2022
facebook-github-bot pushed a commit to pytorch/multipy that referenced this pull request Jul 17, 2023
Summary:
pybind11 recently added a feature to assert that the GIL is held during reference counting operations pybind/pybind11#4246

This feature has already uncovered several instances of undefined behavior in several Python extensions.

However, this code does not seem to pass these checks, which is preventing us from upgrading PyTorch to the latest pybind11. See pytorch/pytorch#105245

This PR disables those checks, which will in turn allow PyTorch to upgrade.

Note that this code already has known potential GIL issues:

https://github.com/pytorch/multipy/blob/bd1c76f294335695db8bfa66250781c1627f4eb7/multipy/runtime/deploy.cpp#L59

Pull Request resolved: #325

Reviewed By: PaliC

Differential Revision: D47518187

Pulled By: albanD

fbshipit-source-id: 5bf14d3633afe67fe0aa8fd74610403e0c494ea5
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants