bpo-46210: Fix deadlock in print. #30310

Open
wants to merge 15 commits into main

Conversation

@notarealdeveloper (Contributor) commented Dec 30, 2021

Add _at_fork_reinit handler to buffer->lock of stdout and stderr, to fix a deadlock in print.

https://bugs.python.org/issue46210
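
For context on the mechanism (a hedged sketch of the idea, not this PR's exact code): fork() duplicates the whole process, including the lock inside sys.stdout.buffer and sys.stderr.buffer, but only the forking thread exists in the child. If another thread held one of those locks at fork time, the child's copy stays acquired forever and the child's first print() blocks on it. CPython's internal _PyThread_at_fork_reinit() exists for exactly this situation; the helper name below and its placement are illustrative assumptions.

/* Hedged sketch: give a possibly-held buffer lock a fresh, unlocked state in
 * the child right after fork().  Only one thread exists in the child at this
 * point, so swapping the lock out is safe.  Requires building as part of
 * CPython itself (internal API). */
static int
reinit_buffer_lock_in_child(PyThread_type_lock *lock)
{
    if (*lock == NULL) {
        return 0;                            /* no lock was ever allocated */
    }
    return _PyThread_at_fork_reinit(lock);   /* 0 on success, -1 on failure */
}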

@the-knights-who-say-ni

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept this contribution by verifying everyone involved has signed the PSF contributor agreement (CLA).

CLA Missing

Our records indicate the following people have not signed the CLA:

@notarealdeveloper

For legal reasons we need all the people listed to sign the CLA before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

If you have recently signed the CLA, please wait at least one business day for our records to be updated.

You can check yourself to see if the CLA has been received.

Thanks again for the contribution, we look forward to reviewing it!

@sweeneyde (Member) left a comment

Just some superficial things; I'm not that knowledgeable about the whole threading/forking situation.

Python/pystate.c Outdated
/* error */
ret = -1;
end:
return ret;
Member

I think there are a few more leaks here. Perhaps it would be simpler to initialize the variables to NULL, and then use Py_XDECREF in the end:/error: label.
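
For readers less familiar with CPython's C API, the pattern being suggested looks roughly like this (a hedged sketch with made-up names, not the PR's code): every owned reference starts as NULL, so a single end: label can unconditionally release whatever happened to be acquired, since Py_XDECREF() is a no-op on NULL.

/* Hedged sketch of the NULL-init + Py_XDECREF cleanup pattern. */
static int
example_cleanup_pattern(PyObject *stream)
{
    int ret = -1;
    PyObject *buffer = NULL;
    PyObject *result = NULL;

    buffer = PyObject_GetAttrString(stream, "buffer");
    if (buffer == NULL) {
        goto end;                 /* error: fall through to shared cleanup */
    }
    result = PyObject_CallMethod(buffer, "flush", NULL);
    if (result == NULL) {
        goto end;
    }
    ret = 0;                      /* success */
end:
    Py_XDECREF(result);           /* safe even if result is still NULL */
    Py_XDECREF(buffer);
    return ret;
}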

Python/pystate.c Outdated
Comment on lines 197 to 205
if (_PyObject_GetAttrId(stdio, &PyId_isatty) == NULL) {
PyErr_Clear();
goto end;
}

isatty = _PyObject_CallMethodIdNoArgs(stdio, &PyId_isatty);
if (Py_IsFalse(isatty)) {
goto end;
}
Member

I think this can be just one call to _PyObject_CallMethodIdNoArgs with a check for a NULL return value.
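
Concretely, the suggestion seems to be collapsing the attribute lookup and the call into one step; a hedged sketch (the helper name stream_isatty is an assumption, not code from the PR):

/* Hedged sketch: returns 1 if `stream` reports being a tty, 0 if it does not
 * or if the check fails for any reason (the error state is cleared). */
static int
stream_isatty(PyObject *stream)
{
    _Py_IDENTIFIER(isatty);
    /* One call instead of GetAttr followed by Call: a NULL result covers
       both a missing isatty attribute and a failing isatty() call. */
    PyObject *res = _PyObject_CallMethodIdNoArgs(stream, &PyId_isatty);
    if (res == NULL) {
        PyErr_Clear();
        return 0;
    }
    int ret = PyObject_IsTrue(res);
    Py_DECREF(res);
    return ret == 1;
}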

@sweeneyde requested a review from vstinner December 31, 2021 00:39
@sweeneyde (Member)

It would be nice to see a test case added for these methods. If a threading/forking case can be made to pass consistently, that would be ideal; but even if not, a simple exercise of the _at_fork_reinit method would still help. Then we can make sure there are no reference leaks (with ./python -m test test_whatever -R3:3).

@notarealdeveloper (Contributor Author)

Thanks for the tips! I know exactly what to learn about next. Will push some updates after I learn a bit more. I think the general "unlock the thing" idea probably makes sense, but I'm sure I tripped over myself in 17 different ways in the initial patch. Curious to hear from Victor about whether the overall goal is a sane one. Either way, more coming soon. Happy New Year. :)

@notarealdeveloper (Contributor Author) commented Dec 31, 2021

Ok, re: ./python -m test some_test -R3:3.

Just updated the pull request with what (I think?) is correct reference counting.

All the tests pass, but I wanted to check for reference leaks like you suggested.

So I ran several of the relevant tests with -R3:3 to check, but then I realized something.

Supposing there's a reference leak, would it be noticed by -R? From the docs, it seems -R uses sys.gettotalrefcount() to check for leaks, but this code is only being executed in the child process created by a fork, so if the parent is checking for updates to a variable in its address space I don't think it'll see them.

Is there a standard way to check for leaks in situations like that, or is it better to just use a brute-force strategy like I did in the comment above (i.e., calling every function twice and checking ob_refcnt manually with printf like a caveman)?

@notarealdeveloper (Contributor Author)

Hi there. :)

After spending some time thinking about how to exercise the lock re-initialization function in a test, I realized that the best way to overcome my noobish-ness here is to follow the example of _PyImport_ReInitLock, seeing as @vstinner has done a lot of the work on that, and his past fixes of various deadlocks are basically identical to what I'm trying to do here.

The "meat" of the implementation we've already discussed doesn't change (thanks again for the tips @sweeneyde). I've just moved things to different files, in order to maximize the similarity with what _PyImport_ReInitLock does.

Here's what a grep of the codebase for the two functions looks like:

$ grep -Ir _PyImport_ReInitLock
Modules/posixmodule.c:#include "pycore_import.h"        // _PyImport_ReInitLock()
Modules/posixmodule.c:    status = _PyImport_ReInitLock();
Include/internal/pycore_import.h:extern PyStatus _PyImport_ReInitLock(void);
Python/import.c:_PyImport_ReInitLock(void)

$ grep -Ir _PySys_ReInitStdio
Modules/posixmodule.c:#include "pycore_sysmodule.h"     // _PySys_ReInitStdio()
Modules/posixmodule.c:    status = _PySys_ReInitStdio();
Include/internal/pycore_sysmodule.h:extern PyStatus _PySys_ReInitStdio(void);
Python/sysmodule.c:_PySys_ReInitStdio(void)

And here's the git blame output showing the only location where the new function is called.

26881c8fae3 (Victor Stinner  2020-06-02 15:51:37  592)     status = _PyImport_ReInitLock();
26881c8fae3 (Victor Stinner  2020-06-02 15:51:37  593)     if (_PyStatus_EXCEPTION(status)) {
26881c8fae3 (Victor Stinner  2020-06-02 15:51:37  594)         goto fatal_error;
26881c8fae3 (Victor Stinner  2020-06-02 15:51:37  595)     }
26881c8fae3 (Victor Stinner  2020-06-02 15:51:37  596) 
8c8d9c99971 (Jason Wilkes    2022-01-05 11:45:01  597)     status = _PySys_ReInitStdio();
8c8d9c99971 (Jason Wilkes    2022-01-05 11:45:01  598)     if (_PyStatus_EXCEPTION(status)) {
8c8d9c99971 (Jason Wilkes    2022-01-05 11:45:01  599)         goto fatal_error;
8c8d9c99971 (Jason Wilkes    2022-01-05 11:45:01  600)     }

The goal was to get line-by-line correspondence with what @vstinner does with the import lock, and we seem to have that now.

Now, the only difference arises inside the implementation, because the import lock is a global variable inside Python/import.c, while the stdout lock lives on the stdout.buffer object (similarly with stderr). The addition of an _at_fork_reinit method to the buffered object in Modules/_io/bufferedio.c isn't a necessary part of the implementation; it's just the simplest way I knew of (as a noob) to get a pointer to those locks from outside that module. I'd be more than happy to change that detail if anyone thinks there's a better approach.

As for everything else: (1) the PEP 7 formatting should be good now, (2) @the-knights-who-say-ni seems to have given me the "CLA signed" badge just now, (3) in playing with all this, I stumbled on another small unrelated bug in the test suite and fixed it, so I'll submit that as a separate bpo & pull request.

Let me know if there's anything else you'd like me to do. Thanks again for all the help. :)

@JelleZijlstra (Member) left a comment

Agree that this needs tests. It also adds a new Python-accessible method. What happens if Python code calls it?

@notarealdeveloper (Contributor Author) commented Feb 1, 2022

Re: Helpful comments from @JelleZijlstra

Agree that this needs tests. It also adds a new Python-accessible method. What happens if Python code calls it?

100% agreed: the Python-accessible method wasn't necessary. I needed to learn a bit more before I could do this correctly.

Just pushed some changes to address this. The updated approach:

  1. Removes the Python-accessible method.
  2. Tests that the lock reinitialization is working as expected (See TestStdioAtForkReInit in Lib/test/test_thread.py).

The test's behavior is to cause the deadlock described in bpo-46210 if and only if the lock is not being reinitialized correctly.

Just like the other C-level locks in the codebase, this is something that should be reinitialized on every fork, so you (@JelleZijlstra) were absolutely right to point out that a Python-accessible interface was an awkward fit.

Further, I want to make it clear to other devs precisely what situation the bugfix is targeting, so the test contains a description of what the bug looks like without the fix, and which specific C function was added to fix it.

Summary: the test can be run with ./python -m test test_thread. Removing the single line status = _PySys_ReInitStdio(); from Modules/posixmodule.c should be enough to make that test fail with a timeout error due to the deadlock, which illustrates how this PR addresses bpo-46210. Let me know if there's anything else you'd like me to do. Thanks again for all the help. :)

@JelleZijlstra self-requested a review February 1, 2022 18:33
@notarealdeveloper (Contributor Author)

In addition to the above:

In the course of adding the test, I found a small bug in the wait_process function in Lib/test/support/__init__.py (the function ignores its timeout argument and uses the module-level global SHORT_TIMEOUT even when timeout is specified).

That won't be relevant to anyone who runs the test as is, since the PR fixes the deadlock.

However, if anyone wants to verify that the test fails without the bugfix (remove status = _PySys_ReInitStdio(); from Modules/posixmodule.c and re-run ./python -m test test_thread), you may notice an odd behavior: the line support.wait_process(main_pid, exitcode=0, timeout=5) in the new test takes 30 seconds to time out, not 5.

That's due to the above mentioned bug in Lib/test/support/__init__.py:wait_process.

Should I submit another bpo for this small bug, or just send a PR directly?

Should be a one-line fix. Just need to change if dt > SHORT_TIMEOUT in that function to if dt > timeout.

@JelleZijlstra (Member)

Thanks, can you open another issue for the test.support fix? (I don't have time right now to look at this PR in detail, but I'll review it again.)

@notarealdeveloper (Contributor Author)

Thanks, can you open another issue for the test.support fix? (I don't have time right now to look at this PR in detail, but I'll review it again.)

Will do! :)

Thanks for the help.

int
_PyIO_buffered_at_fork_reinit(PyObject *self)
{
Py_INCREF(self);
Member

Why not just do the INCREF after the if statement?

@notarealdeveloper (Contributor Author), Feb 7, 2022

Ah, of course. Still getting the hang of reference counting. Thanks for the tips! (This one should have been obvious, now that I look at it again.)

PyObject *buffer = NULL;
PyObject *result = NULL;

PyObject *stdio = _PySys_GetObjectId(key);
Member

Looks like this returns a borrowed reference, so there's no need to DECREF it: the code is correct.

Contributor Author

Cool, thanks. :)

Member

nitpick: I suggest the name stream, but stdio works too ;-)


PyObject *isatty = NULL;
PyObject *buffer = NULL;
PyObject *result = NULL;
Member

This is never used

Contributor Author

Removed. Thanks again.

@notarealdeveloper (Contributor Author)

Ok, all the comments above from @JelleZijlstra should be addressed (thanks again for the help!).

Also submitted PRs for two minor bugs I found in the test suite while working on this PR (those are #31204 and #31205).

Neither of those is a requirement for this PR to work. They're just trivial fixes to a couple edge cases in the test suite.

However, in a hypothetical scenario where:

  1. We merge the test that this PR added, but
  2. We remove or break the C code in this PR that fixes the deadlock.

then #31205 ensures that the timeout in our new test is respected when waiting for a deadlocked child process.

That's just a minor bit of polish, but I'm trying to think about possible future scenarios, not just the code as it exists now.

Let me know if there's anything else I can improve. :)

PyObject *stdio = _PySys_GetObjectId(key);

_Py_IDENTIFIER(isatty);
isatty = _PyObject_CallMethodIdNoArgs(stdio, &PyId_isatty);
Member

I don't see the purpose of testing isatty().

Contributor Author

Fixed. :)

PyObject *buffer = NULL;

PyObject *stdio = _PySys_GetObjectId(key);

Member

stdio can be NULL or Py_None. You should do nothing in this case.

Contributor Author

Thanks, done.

if (_PyObject_LookupAttrId(stdio, &PyId_buffer, &buffer) < 0) {
/* stdout.buffer and stderr.buffer are not part of the
* TextIOBase API and may not exist in some implementations.
* If not present, no need to reinitialize their locks. */
Member

The function can raise an AttributeError, no? If yes, you must clear it: PyErr_Clear().

Contributor Author

Fixed! (The function is MUCH simpler now. Thanks for the tips.)

int ret = 0;

PyObject *isatty = NULL;
PyObject *buffer = NULL;
Member

I prefer to move each variable declaration to where the variable is first set, rather than declaring most of them at the top.

Contributor Author

Changed to this style throughout.

PyObject *buffer = NULL;
PyObject *result = NULL;

PyObject *stdio = _PySys_GetObjectId(key);
Member

nitpick: I suggest the name stream, but stdio works too ;-)

if (!Py_IS_TYPE(self, &PyBufferedWriter_Type)) {
return 0;
}
Py_INCREF(self);
Member

Why do you hold a strong reference? _PyThread_at_fork_reinit() doesn't release the GIL; it only allocates memory on the heap. You can remove the INCREF/DECREF dance.

Contributor Author

  1. Re: stream over stdio. Changed.
  2. Re: INCREF/DECREF dance. Done. (Still getting a feel for refcounting, thanks for the tips.)
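
With the type check kept and the INCREF/DECREF pair dropped, the _io-side helper presumably ends up shaped roughly like this. This is a hedged sketch, not the PR's exact code: the lock and owner fields follow the buffered struct in Modules/_io/bufferedio.c, and covering PyBufferedReader_Type as well reflects the stdin discussion later in the review.

/* Hedged sketch of the exported _io helper that posixmodule/sysmodule can
 * call after fork(). */
int
_PyIO_buffered_at_fork_reinit(PyObject *self)
{
    /* Only buffered reader/writer objects carry the lock we care about. */
    if (!Py_IS_TYPE(self, &PyBufferedWriter_Type)
        && !Py_IS_TYPE(self, &PyBufferedReader_Type))
    {
        return 0;
    }
    buffered *b = (buffered *)self;
    if (b->lock == NULL) {
        return 0;
    }
    /* No INCREF/DECREF needed: _PyThread_at_fork_reinit() does not release
       the GIL, so `self` cannot go away while we use it. */
    if (_PyThread_at_fork_reinit(&b->lock) < 0) {
        return -1;
    }
    b->owner = 0;   /* the fresh lock has no owning thread */
    return 0;
}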

@@ -265,5 +267,107 @@ def tearDown(self):
pass


class TestStdioAtForkReInit(unittest.TestCase):
Member

I dislike this kind of functional test; it's fragile and may break tomorrow. I would prefer to remove it.

If you want to keep it, you should:

  • create a new script and run it from the test, so that you have better control over the Python state when the test runs: which modules are imported, signal handling, etc.
  • move it to test_io, since the tested code lives in the _io module

Member

Can you clarify why you think it's fragile? It seems useful as a regression test. I do agree that it would be more stable in a separate script.

@notarealdeveloper (Contributor Author), Feb 14, 2022

This is the only part I didn't change yet, because I wanted to check with you guys on the details.

@vstinner All the at_fork_reinit handlers in the codebase have your name on them, so I'll trust your judgment here. (Though I'm also curious for the same reasons @JelleZijlstra mentioned).

On the topic of the current test, you said "If you want to keep it, you should..."

What's the "else" condition for that sentence? I would prefer to do whatever you think makes the most sense. I don't have any strong opinions here, aside from a desire to have that one line in there that reinitializes the lock.

Test-wise, would you prefer (1) just moving the current test to a separate script? (2) removing the current test entirely? (3) a different test with certain properties that this one doesn't have?

I'm equally happy to do any of the above. :)

created child processes do not share locks with the parent. */
PyStatus
_PySys_ReInitStdio(void)
{
Member

sys.stdin also has a buffer with a lock that must be reinitialized at fork.

Contributor Author

Cool, now PyBufferedWriter_Type and PyBufferedReader_Type both get their locks reinitialized.

int reinit_stderr = stdio_at_fork_reinit(&PyId_stderr);

if (reinit_stdout < 0 || reinit_stderr < 0) {
return _PyStatus_ERR("Failed to reinitialize stdout and stderr");
Member

Suggested change:
-    return _PyStatus_ERR("Failed to reinitialize stdout and stderr");
+    return _PyStatus_ERR("Failed to reinitialize standard streams");

Contributor Author

standard streams it is! :)
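
Taken together, the feedback above (borrowed reference from _PySys_GetObjectId(), skipping NULL/None streams, tolerating a missing .buffer, covering stdin as well, and the "standard streams" wording) points toward something like the following overall shape. This is a hedged reconstruction, not the code as merged; _PyIO_buffered_at_fork_reinit() is the _io-side helper discussed earlier.

static int
stream_at_fork_reinit(_Py_Identifier *key)
{
    /* Borrowed reference: no DECREF needed.  The stream may have been
       replaced with None or removed entirely; nothing to do then. */
    PyObject *stream = _PySys_GetObjectId(key);
    if (stream == NULL || stream == Py_None) {
        return 0;
    }

    /* .buffer is not part of the TextIOBase API and may not exist in some
       implementations; a missing attribute is not an error here. */
    _Py_IDENTIFIER(buffer);
    PyObject *buffer;
    if (_PyObject_LookupAttrId(stream, &PyId_buffer, &buffer) < 0) {
        return -1;
    }
    if (buffer == NULL) {
        return 0;
    }

    int ret = _PyIO_buffered_at_fork_reinit(buffer);
    Py_DECREF(buffer);
    return ret;
}

PyStatus
_PySys_ReInitStdio(void)
{
    _Py_IDENTIFIER(stdin);
    _Py_IDENTIFIER(stdout);
    _Py_IDENTIFIER(stderr);

    if (stream_at_fork_reinit(&PyId_stdin) < 0
        || stream_at_fork_reinit(&PyId_stdout) < 0
        || stream_at_fork_reinit(&PyId_stderr) < 0)
    {
        return _PyStatus_ERR("Failed to reinitialize standard streams");
    }
    return _PyStatus_OK();
}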

@JelleZijlstra removed their request for review February 9, 2022 03:47
This commit:
* Properly identifies whether we need to reinit a stream's lock.
* Changes 'stdio' to 'stream' in several functions for clarity.
* Also reinitializes stdin's buffer->lock at fork.
@notarealdeveloper (Contributor Author) commented Feb 17, 2022

Mind if I rebase this on main and force push a small update? I fixed the PR to be consistent with #30928 (bpo-46541).

i.e., stream_at_fork_reinit now uses _Py_ID instead of _Py_IDENTIFIER.

(P.S. Am I imagining things, or do recent changes in main feel very Sam Gross-y? Trying not to get too excited, but I hope that's what's going on. 🎉)

@JelleZijlstra (Member)

Generally better to merge main, so that the review history gets preserved.

@notarealdeveloper (Contributor Author)

Cool, so this pattern?

  1. Merge main (Now "ahead of 'origin/print' by 523 commits.")
  2. Cherry pick (Now "ahead of 'origin/print' by 524 commits.")
  3. Ok to push, despite the other commits? (i.e., those won't pollute the review history here?)

Apologies for the dumb question. I've used git socially and GitHub personally, but I'm still fairly new to using GitHub socially. :)

@JelleZijlstra (Member)

Something like this:

  • git fetch upstream (assuming upstream is python/cpython)
  • git merge upstream/main
  • Fix conflicts
  • git commit
  • git push

GitHub also lets you do this in the web interface if there's a conflict (the "Resolve conflicts" button at the bottom of the thread).

@notarealdeveloper (Contributor Author)

Thanks!

@kumaraditya303 requested a review from gpshead April 16, 2022 13:01
@kumaraditya303 added the type-bug, needs backport to 3.10, interpreter-core (Objects, Python, Grammar, and Parser dirs), and needs backport to 3.11 labels Sep 4, 2022
@kumaraditya303 reopened this Sep 4, 2022
@hugovk removed the needs backport to 3.10 label Apr 7, 2023
@serhiy-storchaka added the needs backport to 3.12 and needs backport to 3.13 labels and removed the needs backport to 3.11 label May 9, 2024
@Yhg1s removed the needs backport to 3.12 label Apr 8, 2025
Labels
awaiting core review, interpreter-core (Objects, Python, Grammar, and Parser dirs), needs backport to 3.13, type-bug