gh-101766: Fix refleak for _BlockingOnManager resources #101942
Conversation
LGTM, though I had never seen this code before.
The current CI failure is not related to this PR.
@ericsnowcurrently Can you please take a look?
We can safely remove it, because it is reset here: self.blocked_on = _blocking_on.setdefault(self.thread_id, [])
Thanks for fixing this!
@gvanrossum @sobolevn
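(Aside, for readers of this thread: a minimal sketch of the pattern under discussion, assuming only the structure visible in the quoted line rather than the full Lib/importlib/_bootstrap.py implementation. The point is that __enter__ re-binds self.blocked_on on every entry, which is why deleting the attribute in __exit__ is safe.)

```python
# Simplified sketch, assuming only what is quoted in this thread.
_blocking_on = {}  # thread id -> list of locks that thread is blocked on

class _BlockingOnManager:
    def __init__(self, thread_id, lock):
        self.thread_id = thread_id
        self.lock = lock

    def __enter__(self):
        # Re-bound on every entry, so a prior `del self.blocked_on` is harmless.
        self.blocked_on = _blocking_on.setdefault(self.thread_id, [])
        self.blocked_on.append(self.lock)

    def __exit__(self, *args):
        self.blocked_on.remove(self.lock)
        del self.blocked_on  # reset by the next __enter__
```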
LGTM
I left one comment but feel free to ignore it if you don't think it matters.
Hi Eric
I'll take a look at the refleak.
I'm working on it via gh-101969.
Co-authored-by: Eric Snow <ericsnowcurrently@gmail.com>
Merging this PR since the refleak tests pass; the macOS CI failure is a separate issue.
```python
if len(self.blocked_on) == 0:
    # gh-101766: global cache should be cleaned up
    # if there is no more _blocking_on for this thread.
    del _blocking_on[self.thread_id]
    del self.blocked_on
```
@corona10 @gvanrossum @sobolevn
This change re-introduces the re-entrancy bug that _BlockingOnManager was introduced to fix.

Consider: a thread has executed __enter__ and begins executing __exit__. It executes line 87 and the length of self.blocked_on becomes 0. The check on the new line 88 passes and the thread enters the block starting with the new line 89. Call this "point x". Now the garbage collector is triggered and an object with a __del__ is collected. The __del__ runs and imports a module (as is very common now that there are ResourceWarnings). As part of satisfying the import, the thread runs __enter__. It executes line 82 and receives the list that is already in _blocking_on (it uses the same key because it is the same thread; this is a single-threaded case). It executes line 83 to add its lock to the list. The import eventually finishes and it executes __exit__. It removes its lock from the list (line 87) and removes the list from _blocking_on (new line 91). Garbage collection proceeds and we get back to the original import being resolved. Execution returns to "point x". At point x we have already decided there are no locks in self.blocked_on and we're going to clean up the global dictionary. New line 91 runs and finds no item at key self.thread_id. This raises a KeyError which goes unhandled all the way up to some application code that was doing import somemodule.
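To make the sequence concrete, here is a hypothetical, self-contained simulation (simplified from _bootstrap.py; `nested_import` stands in for a gc-triggered __del__ that performs an import on the same thread):

```python
# Hypothetical simulation of the sequence above, not the real module code.
_blocking_on = {}

class _BlockingOnManager:
    def __init__(self, thread_id, lock):
        self.thread_id = thread_id
        self.lock = lock

    def __enter__(self):
        # "line 82"/"line 83": reuse this thread's list and record our lock.
        self.blocked_on = _blocking_on.setdefault(self.thread_id, [])
        self.blocked_on.append(self.lock)

    def __exit__(self, *args):
        self.blocked_on.remove(self.lock)      # "line 87"
        if len(self.blocked_on) == 0:          # "line 88"
            nested_import(self.thread_id)      # "point x": gc fires here
            del _blocking_on[self.thread_id]   # "line 91": KeyError!
            del self.blocked_on

_ran = False
def nested_import(thread_id):
    # One-shot stand-in for gc collecting an object whose __del__ imports.
    global _ran
    if _ran:
        return
    _ran = True
    with _BlockingOnManager(thread_id, object()):
        pass  # the nested __exit__ empties the list and deletes the entry

with _BlockingOnManager(thread_id=1, lock=object()):
    pass  # KeyError is raised from the outer __exit__, as described above
```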
I recommend reverting this change until a re-entrant-safe version of the fix can be determined. The resource leak is not great, but if I understand correctly it is also more or less academic right now (it causes a very small amount of memory to be wasted while the test suite is running). The re-entrancy bug is real and breaks many real applications in extremely confusing ways (consider that the original bug reports were open for years before anyone tracked down the problem, at great expense, and it took more than half a year to get the fix reviewed and merged).
Thanks.
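For illustration only, one hypothetical direction a re-entrant-tolerant cleanup might take (not something proposed or merged in this PR) is to use dict.pop with a default, so the outer __exit__ no longer raises when a nested __exit__ has already removed the entry. Whether that would actually be sufficient is exactly the open question:

```python
# Hypothetical variant of the simplified sketch above; NOT the merged fix.
_blocking_on = {}

class _BlockingOnManager:
    def __init__(self, thread_id, lock):
        self.thread_id = thread_id
        self.lock = lock

    def __enter__(self):
        self.blocked_on = _blocking_on.setdefault(self.thread_id, [])
        self.blocked_on.append(self.lock)

    def __exit__(self, *args):
        self.blocked_on.remove(self.lock)
        if len(self.blocked_on) == 0:
            # A nested, gc-triggered import may already have removed this
            # thread's entry between the check above and the line below.
            _blocking_on.pop(self.thread_id, None)  # tolerate that case
            del self.blocked_on
```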
@exarkun cc @gvanrossum @sobolevn
Okay, I understand your intention.
I will try to submit a patch so that the resource can be cleaned up at the test-suite level, including reverting the current change.
Also, would you like to submit a test to prevent regression?
> I will try to submit a patch so that the resource can be cleaned up at the test-suite level, including reverting the current change.

Thank you very much.

> Also, would you like to submit a test to prevent regression?

That would be ideal, certainly. There is a reproducer on #94504, but it is CPU-intensive and non-deterministic without additional instrumentation inside _bootstrap.py itself. It's not clear to me exactly how to turn that into a regression test that can be included in the automated suite. I'd be happy to discuss it further, though.
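For reference, here is the rough shape of the #94504-style reproducer idea as a hypothetical sketch (the class name and module choice are illustrative; the hard part, as noted above, is getting a collection to land exactly at "point x", which is why it is CPU-intensive and non-deterministic):

```python
import gc
import importlib
import sys

class ImportsOnDel:
    def __del__(self):
        # Re-enter the import system from the garbage collector, which may
        # fire while this same thread is inside _BlockingOnManager.__exit__.
        sys.modules.pop("json", None)  # force a real re-import
        importlib.import_module("json")

def churn(iterations=100_000):
    for _ in range(iterations):
        obj = ImportsOnDel()
        obj.cycle = obj  # cycle: finalized later by the cyclic collector
        del obj
        # Keep the import machinery busy so an allocation-triggered gc can
        # land mid-import; where it lands is non-deterministic.
        sys.modules.pop("json", None)
        importlib.import_module("json")

if __name__ == "__main__":
    gc.enable()  # on by default; explicit for clarity
    churn()
```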
…pythongh-101942)"
This reverts commit 775f881.
Would it be possible to delete the cached empty list in a …
test_importlib on aarch64 RHEL8 #101766