Race condition in subinterpreters during subinterpreter creation on Windows debug build #100711
I'll take a look tomorrow. Thanks for finding this!

What's the best way to reproduce the problem?

Sorry, I have not found a reliable/minimal reproducer. The only thing I found was that running `test__xxsubinterpreters` repeatedly sometimes triggers it.

Is the race in the main branch? Any thoughts on why you noticed it while working on gh-100492?

Yep, it's on main. I'm not sure why, but it requires fewer iterations to trigger on that PR. It requires many more iterations on main.
I took a quick look at ceval_gil.c and may have a sense of what's going wrong. FWIW, I expect that the race has been around at least since the "new GIL" was added in 2009, if not longer.

The main thing I noticed (in ceval_gil.c) is that both `take_gil()` and `drop_gil()` use the given thread state at several points after checking whether it has been deleted. If the given thread state (thread A) were used at any such point within `take_gil()` or `drop_gil()` after another thread had deleted it, we would be reading freed memory. Some of these spots might be safe due to the surrounding logic, but we need to verify that and fix the race on the rest. A race on any of these is unlikely, but possible if one thread is cleaning up interpreters/threads while another is creating them (as happens in `test__xxsubinterpreters`).

As to the relationship with Windows, it looks like the sleep granularity (while waiting for a lock) there is much coarser than it is with pthreads (e.g. on Linux). Anyone is welcome to investigate further or propose solutions. I'll be looking at this more later on today. Some observations (from ceval_gil.c): […]
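For readers unfamiliar with this class of bug, here is a minimal C sketch of the check-then-use window described in the comment above. This is not CPython's actual code; all names here are hypothetical, and the real functions involve locks and condition variables that are omitted:

```c
/* sketch.c: a thread state passes its liveness check, then is used
 * again later, by which point another thread may have freed it. */
#include <stdatomic.h>
#include <stdbool.h>

struct interp_state;                 /* freed during finalization */

struct thread_state_sketch {
    atomic_bool deleted;             /* set by the finalizing thread */
    struct interp_state *interp;
};

void drop_gil_sketch(struct thread_state_sketch *tstate)
{
    /* 1. The "has this thread state been deleted?" check passes. */
    if (atomic_load(&tstate->deleted)) {
        return;
    }

    /* 2. Window: another thread finalizes the interpreter and frees
     *    `tstate` (or `tstate->interp`) right here. */

    /* 3. This later use then reads freed memory. */
    struct interp_state *interp = tstate->interp;   /* use-after-free */
    (void)interp;
}
```

The general shape of a fix for this pattern is to make the check and the use atomic with respect to finalization, e.g. by holding a lock across both.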
Thanks for your insightful comment! My own suspicion was that the GILState struct is a stack-allocated variable (it's a variable in a function rather than malloc-ed). So it's possible for one subinterpreter to outlive the GILState struct and segfault.
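A minimal C sketch of the failure mode suspected here (hypothetical names, not CPython's code): a struct allocated on one function's stack escapes its frame, and anything that outlives the frame reads a dangling pointer.

```c
#include <stdio.h>

struct gil_state_sketch {
    int locked;
};

static struct gil_state_sketch *saved;  /* outlives any stack frame */

static void create_interp_sketch(void)
{
    struct gil_state_sketch gil = { .locked = 1 };  /* stack variable */
    saved = &gil;    /* pointer escapes the frame */
}                    /* `gil` is dead here; `saved` now dangles */

int main(void)
{
    create_interp_sketch();
    printf("%d\n", saved->locked);  /* undefined behavior: use after scope */
    return 0;
}
```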
Can you verify if this is fixed now? This may have been a case of gh-104341.

Friendly reminder @Fidget-Spinner
Windows reproducer on main branch: run

```
.\PCbuild\amd64\python_d.exe -m test test__xxsubinterpreters -F
```

and leave it running. If by run 100 nothing has crashed, cancel and re-run until you get a segfault.

Traceback: […]

Somehow, after calling `__xxsubinterpreters.create`: in `drop_gil`, `&ceval2->gil_drop_request` is non-NULL but invalid, so when we try to `_Py_atomic_load` from that address it segfaults. I have my suspicions it's due to swapping of GIL states, but this is incredibly hard to debug: the race condition sometimes triggers when I run the test 4 times in a row, and sometimes doesn't trigger after 2 hours of continuous running.
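To make the reported crash mode concrete, here is a minimal C sketch (hypothetical names; not CPython's code) of a non-NULL-but-invalid pointer: the pointer still holds its old address after the memory behind it is freed, so a NULL check passes but the atomic load touches freed memory.

```c
#include <stdatomic.h>
#include <stdlib.h>

struct ceval_sketch {
    atomic_int gil_drop_request;
};

int main(void)
{
    struct ceval_sketch *ceval2 = malloc(sizeof *ceval2);
    if (ceval2 == NULL) {
        return 1;
    }
    atomic_store(&ceval2->gil_drop_request, 0);

    free(ceval2);   /* e.g. another thread tears the interpreter down */

    /* This thread still holds the old pointer: non-NULL, but invalid. */
    if (ceval2 != NULL) {   /* check passes anyway */
        /* undefined behavior: atomic load from freed memory */
        int req = atomic_load(&ceval2->gil_drop_request);
        return req;
    }
    return 0;
}
```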