Use multiprocessing `start_method` `"forkserver"` #4306

Conversation
Both failures here are unrelated: test_iostream flake, eigen download problem. Trying again.
…similar to pybind#4306, the only difference between the two PRs is the `"spawn"` vs `"forkserver"` argument).
Windows PyPy is still giving us DEADLOCK grief: https://github.com/pybind/pybind11/actions/runs/3379884767/jobs/5612023977

@mattip Do you have an interest and the tools for debugging these deadlocks? I think I've only ever seen this under Windows, but with different PyPy versions (3.7, 3.8, 3.9). test_gil_scoped.py on master (or in the 2.10.1 release) is already set up for easy deadlock debugging; all you'd need to do is change these lines:

```diff
-ALL_BASIC_TESTS_PLUS_INTENTIONAL_DEADLOCK = ALL_BASIC_TESTS + (_intentional_deadlock,)
+ALL_BASIC_TESTS_PLUS_INTENTIONAL_DEADLOCK = ALL_BASIC_TESTS  # (_intentional_deadlock,)

-SKIP_IF_DEADLOCK = True  # See PR #4216
+SKIP_IF_DEADLOCK = False  # See PR #4216

...

-    timeout = 0.1 if test_fn is _intentional_deadlock else 10
+    timeout = 0.1 if test_fn is _intentional_deadlock else 1000
```

Then run test_gil_scoped.py and wait until it hits a deadlock, then attach a debugger to inspect the stack traces for the two threads waiting on the GIL. I don't have a suitable Windows machine and don't know anything about debugging under Windows.
@Skylion007 Sorry, I still need to put back the
For completeness: There was another PyPy Windows deadlock under #4305, for PyPy 3.9 (the flake here was for 3.8): https://github.com/pybind/pybind11/actions/runs/3379830331/jobs/5611902870 For Windows there is no difference between that PR and this one. I'm not updating #4305 anymore.
The CI passed now without any issues. I'm not expecting any deadlocks, except for PyPy Windows, which are now reported as such. After this PR is merged, deadlocks for other platforms will show up as CI failures, rather than getting skipped. I hope not seeing such CI failures will confirm over time that we're good. I don't know of another way to get that confirmation quicker. See #4105 (comment).
@henryiii could you please help review this tiny PR? I think it will take less than 5 minutes. I'm really keen to restore more rigorous testing after the 2.10.1 release went out. I wasn't completely comfortable effectively silencing the deadlock detection; it was mainly to support the 2.10.1 release. Two days later I learned that multiprocessing "fork" is generally not safe in combination with threads, in other words we were using it wrong the whole time. The fix is super easy. Getting this merged soon means that we will easily see again when deadlocks are detected. I have high hopes that there will be none. Being sure about it will be very useful.
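The fix amounts to selecting a safer start method before any test subprocesses are created. A minimal sketch, assuming the `os.name != "nt"` guard shown in the review diff (the actual test code may differ):

```python
import multiprocessing
import os

if os.name != "nt":
    # "forkserver" is unavailable on Windows; elsewhere, avoid the Unix
    # default "fork", since forking a multithreaded process is unsafe.
    multiprocessing.set_start_method("forkserver")
```

Note that `set_start_method()` may only be called once per process, which is why it belongs at the top of the test module, before any `Process` objects are created.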
@henryiii I'd appreciate it if you could comment on this PR.
@eacousineau It would be great if you could help out here as well. It's a really tiny PR. As I wrote above, I'm still very keen to restore more rigorous testing after the 2.10.1 release went out. |
Sounds good! First pass, PTAL!
```
@@ -15,6 +17,10 @@
 # Early diagnostic for failed imports
 import pybind11_tests

 if os.name != "nt":
```
nit: Can you add a comment as to why we're doing this here? `git blame` is decent for more context, but I think two small sentences can help explain, e.g.:

```python
# We want to shy away from Python's default choice for Linux of "fork", as it can incur
# post-fork resource contention issues (such as thread ids, etc.).
```

I'm vaguely familiar with this, and with our codebase we hit it with either of these sledgehammers:

- "yes fork, but pin down threading for fast-but-reckless multiprocessing spawns"
- "no fork, please use spawn, and don't think about it" (I still need to read about `forkserver` and how it may interact with imports like `numpy` or `cv2` that may configure / spawn threads at import)

EricCousineau-TRI/repro@2cff386
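As an aside on the two "sledgehammers" above: instead of flipping the process-global default, a library can also carry the start-method choice locally via an explicit context object. A small sketch (not code from this PR):

```python
import multiprocessing

# A context object pins its own start method, so code can opt into
# "spawn" (or "forkserver" on Unix) without touching the global default
# that other code in the same process may rely on.
ctx = multiprocessing.get_context("spawn")
print(ctx.get_start_method())
```

`ctx.Process`, `ctx.Pool`, etc. then all use that start method, leaving `multiprocessing.set_start_method()` untouched for the rest of the process.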
> Can you add a comment as to why we're doing this here?

Done. I tried to keep it concise.
Hm... While this and the issue describe the symptoms, is there any way you can link to something that provides readers more context about the root cause of why threads + fork == bad / flaky?

(feel free to dismiss if you wanna add to the issue discussion itself, or punt)
Force-pushed from d1b1edf to 74bbc04: "Alternative to PR pybind#4305 … will resolve the deadlocks for good."
Thanks! Remaining comment, but feel free to dismiss / punt / what not.
> Hm... While this and the issue describe the symptoms, is there any way you can link to something that provides readers more context about the root cause of why threads + fork == bad / flaky? (feel free to dismiss if you wanna add to the issue discussion itself, or punt)
TBH I was going only by the Python documentation (link in PR description: "Note that safely forking a multithreaded process is problematic.") and what @jbms told me. But I just found this: https://stackoverflow.com/questions/6078712/is-it-safe-to-fork-from-within-a-thread I'll add that to the PR description here. Thanks for the review!
The one CI failure is a well-known PyPy flake: expected refcount mismatch.
Description
Please see #4105 (comment) for full background.
Alternative to PR #4305
Quoting from https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods:

> *fork*
>
> ... Note that safely forking a multithreaded process is problematic.
>
> Available on Unix only. The default on Unix.

The idea to try `"spawn"` or `"forkserver"` is the result of deadlock debugging. After reviewing tracebacks @jbms wrote:

This stackoverflow posting provides some details: https://stackoverflow.com/questions/6078712/is-it-safe-to-fork-from-within-a-thread
Excerpts:
Note that this PR only changes how test_gil_scoped.py is run, therefore a changelog entry is not needed.
Suggested changelog entry: