Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Plug leaking function_records in cpp_function initialization in case of exceptions (found by Valgrind in #2746) #2756

Merged
merged 9 commits into from
Jan 14, 2021

Conversation

YannickJadoul
Copy link
Collaborator

@YannickJadoul YannickJadoul commented Dec 30, 2020

Description

Another leak fix taken out of #2746.

I am not entirely happy with this solution yet, mostly because we are not reusing destruct(function_record *). I tried adding this as the unique_ptr's deleter, but the issue is that the char *s only get copied along the way, and are not owned before those copies. (kind of fixed that)
Also, rec->data can still leak as rec->free_data is not called after an exception. (fixed that)

Suggested changelog entry:

Resolved memory leak in cpp_function initialization when exceptions occured.

@YannickJadoul YannickJadoul added this to the v2.6.2 milestone Jan 1, 2021
@YannickJadoul
Copy link
Collaborator Author

So, this should be ready to review:

The one question I now have, is whether this fix is worth it. As far as I know, all of these leaks are actually errors by the pybind11 user, and can't/shouldn't happen during actual use, if the library with pybind11 is correctly designed and programmed. So do we want to incur this overhead?
Then again, this overhead is only during cpp_function initialization, so it shouldn't be on any critical path, either. So I lean towards merging this PR. But open to alternative suggestions!

@rwgk
Copy link
Collaborator

rwgk commented Jan 4, 2021

Hi @YannickJadoul,

Is this sentence in the PR top comment still true?

Also, rec->data can still leak as rec->free_data is not called after an exception.

What do you think about the diff below, to make it easier for the compiler/optimizer to see that a runtime bool works here?

diff --git a/include/pybind11/pybind11.h b/include/pybind11/pybind11.h
index 114f72b..e57dccc 100644
--- a/include/pybind11/pybind11.h
+++ b/include/pybind11/pybind11.h
@@ -120,7 +120,7 @@ protected:
     PYBIND11_NOINLINE unique_function_record make_function_record() {
         // `destruct<true>(function_record)`: `initialize_generic` copies strings and
         // takes care of cleaning up in case of exceptions.
-        return unique_function_record(new detail::function_record(), destruct<false>);
+        return unique_function_record(new detail::function_record(), destruct_no_delete_strings);
     }
 
     /// Special internal constructor for functors, lambda functions, etc.
@@ -392,7 +392,7 @@ protected:
             rec->def->ml_flags = METH_VARARGS | METH_KEYWORDS;
 
             capsule rec_capsule(unique_rec.release(), [](void *ptr) {
-                destruct((detail::function_record *) ptr);
+                destruct_with_delete_strings((detail::function_record *) ptr);
             });
             guarded_strdup.release();
 
@@ -489,9 +489,16 @@ protected:
         }
     }
 
+    static void destruct_with_delete_strings(detail::function_record *rec) {
+        destruct_impl(rec, true);
+    }
+
+    static void destruct_no_delete_strings(detail::function_record *rec) {
+        destruct_impl(rec, false);
+    }
+
     /// When a cpp_function is GCed, release any memory allocated by pybind11
-    template <bool DeleteStrings = true>
-    static void destruct(detail::function_record *rec) {
+    static void destruct_impl(detail::function_record *rec, bool delete_strings) {
         // If on Python 3.9, check the interpreter "MICRO" (patch) version.
         // If this is running on 3.9.0, we have to work around a bug.
         #if !defined(PYPY_VERSION) && PY_MAJOR_VERSION == 3 && PY_MINOR_VERSION == 9
@@ -504,7 +511,7 @@ protected:
                 rec->free_data(rec);
             // During initialization, these strings might not have been copied yet,
             // so they cannot be freed. Once the function has been created, they can.
-            if (DeleteStrings) {
+            if (delete_strings) {
                 std::free((char *) rec->name);
                 std::free((char *) rec->doc);
                 std::free((char *) rec->signature);

@YannickJadoul
Copy link
Collaborator Author

Hi @YannickJadoul,

Is this sentence in the PR top comment still true?

Also, rec->data can still leak as rec->free_data is not called after an exception.

Nope that's what I fixed, just today :-) Thanks, I fixed the original message!

What do you think about the diff below, to make it easier for the compiler/optimizer to see that a runtime bool works here?

Hmm, not sure what you mean. In my version it is a compile-time bool, which should be easier to optimize, no?

That being said, your version should result in a smaller size overhead (at the cost of that extra runtime check), if we make sure that destruct_impl is not inlined (there's some pybind11 macro for that). So this is a valid trade-off to consider, that I hadn't yet thought of. Any thoughts?

@rwgk
Copy link
Collaborator

rwgk commented Jan 4, 2021

That being said, your version should result in a smaller size overhead (at the cost of that extra runtime check)

The overhead for the runtime check is almost certainly not measurable.
But the compiler or linker will have to work pretty hard to figure out how to avoid duplicating a sizable chunk of generated machine code.
Seems like a classical case of "template bloating" that's very easy to avoid, as suggested, and it makes the code clearer, too, at least for me.

@YannickJadoul
Copy link
Collaborator Author

Alright, yes. Given other decisions in pybind11, prioritizing space over a tiny bit of performance makes sense to me! I've pulled this to runtime. (I have kept the default argument = true, though, indicating that that should be the default/normal case.)

@YannickJadoul
Copy link
Collaborator Author

Alright, that commit passes Valgrind in #2746 (da8859d) :-)

include/pybind11/pybind11.h Outdated Show resolved Hide resolved
include/pybind11/pybind11.h Show resolved Hide resolved
include/pybind11/pybind11.h Outdated Show resolved Hide resolved
include/pybind11/pybind11.h Show resolved Hide resolved
include/pybind11/pybind11.h Show resolved Hide resolved
@YannickJadoul
Copy link
Collaborator Author

YannickJadoul commented Jan 13, 2021

@henryiii, @rwgk, @EricCousineau-TRI, can I gently ping? This is the last thing standing between us, and rebasing and merging #2746 and soon having Valgrind checks run on all PRs :-)

@rwgk
Copy link
Collaborator

rwgk commented Jan 13, 2021

Hi @YannickJadoul, a few days ago I offered to run this PR through Google's global testing system, but you discouraged it. I'd happily approve this PR after convincing myself that the tests come back clean.

Fully absorbing all the nuances of this very hairy change would cost me more time than I can give it. I trust that you and @bstaletic have done a great job. Given that we have comprehensive testing including sanitizers in place, getting more heads to look at the details doesn't seem like a productive division of labor.

Please let me know when it this is a good time to globally test this PR again.

@YannickJadoul
Copy link
Collaborator Author

Hi @YannickJadoul, a few days ago I offered to run this PR through Google's global testing system, but you discouraged it. I'd happily approve this PR after convincing myself that the tests come back clean.

Well, you asked that for #2746, and I informed you that this code was already tested (except for the last few additional commits that were not made because tests fail, but because reviewing the code found a logical corner case for leaks that wasn't covered by tests; not at Google either, since you said ASAN/LSAN came back clean?). Also, #2746 isn't supposed to merge this functionality; I'll rebase it on top of this, and #2746 will be a bare-bones change to CI and CMake.

It's the central part of our code, so the successful route is tested by all our tests. I don't believe more tests would tell us much more, since the nature of this PR is fixing leaks in exceptional cases (most often programming mistakes, actually). Though I'm of course not stopping you to run tests! (They're just not the right tool to fully assess this PR, I think.)

I was mainly asking for one or two more reviews and approval, so we can merge this and get full Valgrind runs in CI (like @bstaletic's, but he was involved in creating this, so I guess that only counts as half an approval?). I'm sure that before 2.6.2 is released, you'll still run Google's tests anyway?

@rwgk
Copy link
Collaborator

rwgk commented Jan 13, 2021

so I guess that only counts as half an approval?).

I look at it as a full approval: you still have two heads looking at the same thing.
Of course, more is always better, but also more expensive: diminishing returns.

I'm sure that before 2.6.2 is released, you'll still run Google's tests anyway?

Running tests for one thing at a time has a lot of value: easier to find root causes.
My company is spending a lot of resources to make retesting easy, for humans. Let the machines to the work! That frees up human energy.

@YannickJadoul
Copy link
Collaborator Author

Also: what the hell is going on with our CI/GitHub Actions?

/home/runner/work/_temp/9fcba6cd-dc0e-4f5e-a169-2da04a13086d.sh: line 1: /opt/hostedtoolcache/cmake/3.19.3/x64/cmake-3.19.3-Linux-aarch64/bin/cmake: cannot execute binary file: Exec format error

We're not running aarch64, you silly thing.

@YannickJadoul
Copy link
Collaborator Author

YannickJadoul commented Jan 13, 2021

Running tests for one thing at a time has a lot of value: easier to find root causes.
My company is spending a lot of resources to make retesting easy, for humans. Let the machines to the work! That frees up human energy.

That's a good argument, yes. But as argued, in this case, I'm not sure the machines can tell us anything more than we already know :-/ (unless there's specialized tests that try to stress test pybind11's exceptions?)

@YannickJadoul
Copy link
Collaborator Author

YannickJadoul commented Jan 13, 2021

Anyway, go ahead, @rwgk. But given that you already tested the logic (all that was added since is an extra cleanup), I'm not expecting anything there. But if it convinces you we can merge this, then why not.

@rwgk
Copy link
Collaborator

rwgk commented Jan 13, 2021

Also, #2746 isn't supposed to merge this functionality; I'll rebase it on top of this, and #2746 will be a bare-bones change to CI and CMake.

Thanks, good to know. I'll test this PR specifically. Our global testing doesn't use CMake at all.

@YannickJadoul
Copy link
Collaborator Author

Also: what the hell is going on with our CI/GitHub Actions?

/home/runner/work/_temp/9fcba6cd-dc0e-4f5e-a169-2da04a13086d.sh: line 1: /opt/hostedtoolcache/cmake/3.19.3/x64/cmake-3.19.3-Linux-aarch64/bin/cmake: cannot execute binary file: Exec format error

We're not running aarch64, you silly thing.

OK, fixed by rebasing onto #2790. The remaining failures are still #2774, so unrelated.

@rwgk
Copy link
Collaborator

rwgk commented Jan 13, 2021

Quick confirmation: this PR ~by itself*** on top of master makes the tests under pybind11/tests run ASAN & MSAN clean. The only sanitizer errors I'm still seeing are from TSAN, as reported under #2754.

*** I'm also using #2409 and two small local patches I've been carrying for a while.

@YannickJadoul
Copy link
Collaborator Author

Quick confirmation: this PR ~by itself*** on top of master makes the tests under pybind11/tests run ASAN & MSAN clean.

Huh, that's funny. So it's not flagging the one issue we couldn't really track down in #2746, and had to work around? (c183120) Maybe/hopefully it's a Valgrind fluke, then, but it was quite consistent, so not sure I'm believing that :-/

Copy link
Collaborator

@henryiii henryiii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to get this in.

@rwgk
Copy link
Collaborator

rwgk commented Jan 13, 2021

Huh, that's funny. So it's not flagging the one issue we couldn't really track down in #2746, and had to work around? (c183120)

Nope. I ran ASAN with pretty strict options, but MSAN only with defaults. I could play more with the options ... later! Let's get this in as-is asap IMO, pending only on the results of the global testing run.

@YannickJadoul
Copy link
Collaborator Author

Thanks, @henryiii :-)

Nope. I ran ASAN with pretty strict options, but MSAN only with defaults. I could play more with the options ... later!

No worries. It's the one thing @bstaletic and I couldn't figure out. So the plan is to get Valgrind in, with the workaround, then undo the workaround in another PR that can serve as bug report. So I propose we keep discussion of this one thing for that one undoing-the-workaround PR?

@henryiii henryiii closed this Jan 14, 2021
@henryiii henryiii reopened this Jan 14, 2021
@rwgk
Copy link
Collaborator

rwgk commented Jan 14, 2021

The global testing came back clean. @YannickJadoul, please merge!

This PR was already extremely useful for my work on #2672, to sanitize my new code (I found a couple bugs already).

@YannickJadoul YannickJadoul merged commit 0855146 into pybind:master Jan 14, 2021
@github-actions github-actions bot added the needs changelog Possibly needs a changelog entry label Jan 14, 2021
@rwgk rwgk mentioned this pull request Jan 14, 2021
@YannickJadoul YannickJadoul deleted the leak-function_record branch January 14, 2021 18:44
@rwgk
Copy link
Collaborator

rwgk commented Jan 17, 2021

FYI: After rolling out this PR Google-internal, our testing system discovered this (very minor) leak:
https://github.com/tensorflow/tensorflow/blob/10c06df9492c44062f3a6b0c04bec325c7738426/tensorflow/python/eager/pywrap_tfe_src.cc#L4335
I don't know how this PR unmasked that leak or how the test even succeeded before this PR. I mailed a fix. Let's see if the review sheds some light on this.

@YannickJadoul
Copy link
Collaborator Author

Interesting; if you do find out anything relevant or need to link back to these changes, I'm very interested.

@henryiii henryiii removed the needs changelog Possibly needs a changelog entry label Jan 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants