Plug leaking function_records in cpp_function initialization in case of exceptions (found by Valgrind in #2746) #2756

YannickJadoul · 2020-12-30T14:09:04Z

Description

Another leak fix taken out of #2746.

I am not entirely happy with this solution yet, mostly because we are not reusing destruct(function_record *). I tried adding this as the unique_ptr's deleter, but the issue is that the char *s only get copied along the way, and are not owned before those copies. (kind of fixed that)
~~Also, rec->data can still leak as rec->free_data is not called after an exception.~~ (fixed that)

Suggested changelog entry:

Resolved memory leak in cpp_function initialization when exceptions occured.

include/pybind11/pybind11.h

YannickJadoul · 2021-01-04T15:21:32Z

So, this should be ready to review:

1d56f73 in Add a Valgrind build on debug Python 3.9 #2746 demonstrates some extra leaks that weren't fixed yet
6c8e2f1 shows that the last commit does fix it, now

The one question I now have, is whether this fix is worth it. As far as I know, all of these leaks are actually errors by the pybind11 user, and can't/shouldn't happen during actual use, if the library with pybind11 is correctly designed and programmed. So do we want to incur this overhead?
Then again, this overhead is only during cpp_function initialization, so it shouldn't be on any critical path, either. So I lean towards merging this PR. But open to alternative suggestions!

rwgk · 2021-01-04T16:36:44Z

Hi @YannickJadoul,

Is this sentence in the PR top comment still true?

Also, rec->data can still leak as rec->free_data is not called after an exception.

What do you think about the diff below, to make it easier for the compiler/optimizer to see that a runtime bool works here?

diff --git a/include/pybind11/pybind11.h b/include/pybind11/pybind11.h
index 114f72b..e57dccc 100644
--- a/include/pybind11/pybind11.h
+++ b/include/pybind11/pybind11.h
@@ -120,7 +120,7 @@ protected:
     PYBIND11_NOINLINE unique_function_record make_function_record() {
         // `destruct<true>(function_record)`: `initialize_generic` copies strings and
         // takes care of cleaning up in case of exceptions.
-        return unique_function_record(new detail::function_record(), destruct<false>);
+        return unique_function_record(new detail::function_record(), destruct_no_delete_strings);
     }
 
     /// Special internal constructor for functors, lambda functions, etc.
@@ -392,7 +392,7 @@ protected:
             rec->def->ml_flags = METH_VARARGS | METH_KEYWORDS;
 
             capsule rec_capsule(unique_rec.release(), [](void *ptr) {
-                destruct((detail::function_record *) ptr);
+                destruct_with_delete_strings((detail::function_record *) ptr);
             });
             guarded_strdup.release();
 
@@ -489,9 +489,16 @@ protected:
         }
     }
 
+    static void destruct_with_delete_strings(detail::function_record *rec) {
+        destruct_impl(rec, true);
+    }
+
+    static void destruct_no_delete_strings(detail::function_record *rec) {
+        destruct_impl(rec, false);
+    }
+
     /// When a cpp_function is GCed, release any memory allocated by pybind11
-    template <bool DeleteStrings = true>
-    static void destruct(detail::function_record *rec) {
+    static void destruct_impl(detail::function_record *rec, bool delete_strings) {
         // If on Python 3.9, check the interpreter "MICRO" (patch) version.
         // If this is running on 3.9.0, we have to work around a bug.
         #if !defined(PYPY_VERSION) && PY_MAJOR_VERSION == 3 && PY_MINOR_VERSION == 9
@@ -504,7 +511,7 @@ protected:
                 rec->free_data(rec);
             // During initialization, these strings might not have been copied yet,
             // so they cannot be freed. Once the function has been created, they can.
-            if (DeleteStrings) {
+            if (delete_strings) {
                 std::free((char *) rec->name);
                 std::free((char *) rec->doc);
                 std::free((char *) rec->signature);

YannickJadoul · 2021-01-04T17:00:08Z

Hi @YannickJadoul,

Is this sentence in the PR top comment still true?

Also, rec->data can still leak as rec->free_data is not called after an exception.

Nope that's what I fixed, just today :-) Thanks, I fixed the original message!

What do you think about the diff below, to make it easier for the compiler/optimizer to see that a runtime bool works here?

Hmm, not sure what you mean. In my version it is a compile-time bool, which should be easier to optimize, no?

That being said, your version should result in a smaller size overhead (at the cost of that extra runtime check), if we make sure that destruct_impl is not inlined (there's some pybind11 macro for that). So this is a valid trade-off to consider, that I hadn't yet thought of. Any thoughts?

rwgk · 2021-01-04T17:06:34Z

That being said, your version should result in a smaller size overhead (at the cost of that extra runtime check)

The overhead for the runtime check is almost certainly not measurable.
But the compiler or linker will have to work pretty hard to figure out how to avoid duplicating a sizable chunk of generated machine code.
Seems like a classical case of "template bloating" that's very easy to avoid, as suggested, and it makes the code clearer, too, at least for me.

YannickJadoul · 2021-01-04T18:54:05Z

Alright, yes. Given other decisions in pybind11, prioritizing space over a tiny bit of performance makes sense to me! I've pulled this to runtime. (I have kept the default argument = true, though, indicating that that should be the default/normal case.)

YannickJadoul · 2021-01-04T19:17:04Z

Alright, that commit passes Valgrind in #2746 (da8859d) :-)

include/pybind11/pybind11.h

YannickJadoul · 2021-01-13T14:18:54Z

@henryiii, @rwgk, @EricCousineau-TRI, can I gently ping? This is the last thing standing between us, and rebasing and merging #2746 and soon having Valgrind checks run on all PRs :-)

rwgk · 2021-01-13T14:41:48Z

Hi @YannickJadoul, a few days ago I offered to run this PR through Google's global testing system, but you discouraged it. I'd happily approve this PR after convincing myself that the tests come back clean.

Fully absorbing all the nuances of this very hairy change would cost me more time than I can give it. I trust that you and @bstaletic have done a great job. Given that we have comprehensive testing including sanitizers in place, getting more heads to look at the details doesn't seem like a productive division of labor.

Please let me know when it this is a good time to globally test this PR again.

YannickJadoul · 2021-01-13T15:02:56Z

Hi @YannickJadoul, a few days ago I offered to run this PR through Google's global testing system, but you discouraged it. I'd happily approve this PR after convincing myself that the tests come back clean.

Well, you asked that for #2746, and I informed you that this code was already tested (except for the last few additional commits that were not made because tests fail, but because reviewing the code found a logical corner case for leaks that wasn't covered by tests; not at Google either, since you said ASAN/LSAN came back clean?). Also, #2746 isn't supposed to merge this functionality; I'll rebase it on top of this, and #2746 will be a bare-bones change to CI and CMake.

It's the central part of our code, so the successful route is tested by all our tests. I don't believe more tests would tell us much more, since the nature of this PR is fixing leaks in exceptional cases (most often programming mistakes, actually). Though I'm of course not stopping you to run tests! (They're just not the right tool to fully assess this PR, I think.)

I was mainly asking for one or two more reviews and approval, so we can merge this and get full Valgrind runs in CI (like @bstaletic's, but he was involved in creating this, so I guess that only counts as half an approval?). I'm sure that before 2.6.2 is released, you'll still run Google's tests anyway?

rwgk · 2021-01-13T15:08:43Z

so I guess that only counts as half an approval?).

I look at it as a full approval: you still have two heads looking at the same thing.
Of course, more is always better, but also more expensive: diminishing returns.

I'm sure that before 2.6.2 is released, you'll still run Google's tests anyway?

Running tests for one thing at a time has a lot of value: easier to find root causes.
My company is spending a lot of resources to make retesting easy, for humans. Let the machines to the work! That frees up human energy.

YannickJadoul · 2021-01-13T15:09:24Z

Also: what the hell is going on with our CI/GitHub Actions?

/home/runner/work/_temp/9fcba6cd-dc0e-4f5e-a169-2da04a13086d.sh: line 1: /opt/hostedtoolcache/cmake/3.19.3/x64/cmake-3.19.3-Linux-aarch64/bin/cmake: cannot execute binary file: Exec format error

We're not running aarch64, you silly thing.

YannickJadoul · 2021-01-13T15:12:36Z

Running tests for one thing at a time has a lot of value: easier to find root causes.
My company is spending a lot of resources to make retesting easy, for humans. Let the machines to the work! That frees up human energy.

That's a good argument, yes. But as argued, in this case, I'm not sure the machines can tell us anything more than we already know :-/ (unless there's specialized tests that try to stress test pybind11's exceptions?)

YannickJadoul · 2021-01-13T15:14:36Z

Anyway, go ahead, @rwgk. But given that you already tested the logic (all that was added since is an extra cleanup), I'm not expecting anything there. But if it convinces you we can merge this, then why not.

rwgk · 2021-01-13T15:23:05Z

Also, #2746 isn't supposed to merge this functionality; I'll rebase it on top of this, and #2746 will be a bare-bones change to CI and CMake.

Thanks, good to know. I'll test this PR specifically. Our global testing doesn't use CMake at all.

YannickJadoul · 2021-01-13T15:37:57Z

Also: what the hell is going on with our CI/GitHub Actions?
/home/runner/work/_temp/9fcba6cd-dc0e-4f5e-a169-2da04a13086d.sh: line 1: /opt/hostedtoolcache/cmake/3.19.3/x64/cmake-3.19.3-Linux-aarch64/bin/cmake: cannot execute binary file: Exec format error
We're not running aarch64, you silly thing.

OK, fixed by rebasing onto #2790. The remaining failures are still #2774, so unrelated.

rwgk · 2021-01-13T16:04:58Z

Quick confirmation: this PR ~by itself*** on top of master makes the tests under pybind11/tests run ASAN & MSAN clean. The only sanitizer errors I'm still seeing are from TSAN, as reported under #2754.

*** I'm also using #2409 and two small local patches I've been carrying for a while.

YannickJadoul · 2021-01-13T16:14:56Z

Quick confirmation: this PR ~by itself*** on top of master makes the tests under pybind11/tests run ASAN & MSAN clean.

Huh, that's funny. So it's not flagging the one issue we couldn't really track down in #2746, and had to work around? (c183120) Maybe/hopefully it's a Valgrind fluke, then, but it was quite consistent, so not sure I'm believing that :-/

henryiii

I'm happy to get this in.

rwgk · 2021-01-13T16:20:47Z

Huh, that's funny. So it's not flagging the one issue we couldn't really track down in #2746, and had to work around? (c183120)

Nope. I ran ASAN with pretty strict options, but MSAN only with defaults. I could play more with the options ... later! Let's get this in as-is asap IMO, pending only on the results of the global testing run.

YannickJadoul · 2021-01-13T16:24:38Z

Thanks, @henryiii :-)

Nope. I ran ASAN with pretty strict options, but MSAN only with defaults. I could play more with the options ... later!

No worries. It's the one thing @bstaletic and I couldn't figure out. So the plan is to get Valgrind in, with the workaround, then undo the workaround in another PR that can serve as bug report. So I propose we keep discussion of this one thing for that one undoing-the-workaround PR?

…nt in destruct(function_record *)

rwgk · 2021-01-14T15:21:54Z

The global testing came back clean. @YannickJadoul, please merge!

This PR was already extremely useful for my work on #2672, to sanitize my new code (I found a couple bugs already).

rwgk · 2021-01-17T00:46:35Z

FYI: After rolling out this PR Google-internal, our testing system discovered this (very minor) leak:
https://github.com/tensorflow/tensorflow/blob/10c06df9492c44062f3a6b0c04bec325c7738426/tensorflow/python/eager/pywrap_tfe_src.cc#L4335
I don't know how this PR unmasked that leak or how the test even succeeded before this PR. I mailed a fix. Let's see if the review sheds some light on this.

YannickJadoul · 2021-01-17T00:59:22Z

Interesting; if you do find out anything relevant or need to link back to these changes, I'm very interested.

YannickJadoul commented Dec 30, 2020

View reviewed changes

include/pybind11/pybind11.h Show resolved Hide resolved

YannickJadoul mentioned this pull request Dec 30, 2020

Add a Valgrind build on debug Python 3.9 #2746

Merged

10 tasks

YannickJadoul added this to the v2.6.2 milestone Jan 1, 2021

YannickJadoul force-pushed the leak-function_record branch from 76e7605 to f5ec1cc Compare January 4, 2021 13:14

YannickJadoul requested review from bstaletic, EricCousineau-TRI, rwgk and henryiii January 4, 2021 15:24

bstaletic reviewed Jan 4, 2021

View reviewed changes

YannickJadoul force-pushed the leak-function_record branch from 9942614 to 5febe3a Compare January 13, 2021 14:50

YannickJadoul force-pushed the leak-function_record branch from 5febe3a to a7ebdd5 Compare January 13, 2021 15:07

YannickJadoul force-pushed the leak-function_record branch from a7ebdd5 to be90910 Compare January 13, 2021 15:27

henryiii mentioned this pull request Jan 13, 2021

Arch directory issue for 3.19.3 jwlawson/actions-setup-cmake#18

Closed

henryiii approved these changes Jan 13, 2021

View reviewed changes

YannickJadoul added 9 commits January 13, 2021 20:36

Plug leaking function_record objects when exceptions are thrown

2e39aa4

Plug leak of strdup'ed strings in function_record

39cf3b8

Some extra comments about the function_record ownership dance

69b6cb5

Clean up the function_record better, in case of exceptions

780ee72

Demonstrate some extra function_record leaks

10aad45

Change DeleteStrings template argument to free_strings runtime argume…

dade4c9

…nt in destruct(function_record *)

Zero-state unique_function_record deleter object

5d5367a

Clarify rvalue reference to unique_ptr parameter in initialize_generic

6578453

Use push_back with const char * instead of emplace_back

1c20521

YannickJadoul force-pushed the leak-function_record branch from be90910 to 1c20521 Compare January 13, 2021 19:36

henryiii closed this Jan 14, 2021

henryiii reopened this Jan 14, 2021

rwgk approved these changes Jan 14, 2021

View reviewed changes

YannickJadoul merged commit 0855146 into pybind:master Jan 14, 2021

github-actions bot added the needs changelog Possibly needs a changelog entry label Jan 14, 2021

rwgk mentioned this pull request Jan 14, 2021

[FEAT] Rework casting #2646

Open

YannickJadoul deleted the leak-function_record branch January 14, 2021 18:44

henryiii removed the needs changelog Possibly needs a changelog entry label Jan 25, 2021

henryiii mentioned this pull request Jan 26, 2021

chore: prepare for the 2.6.2 release #2821

Merged

rwgk mentioned this pull request Feb 10, 2023

FWD pybind11 google/pybind11clif#2756

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Plug leaking function_records in cpp_function initialization in case of exceptions (found by Valgrind in #2746) #2756

Plug leaking function_records in cpp_function initialization in case of exceptions (found by Valgrind in #2746) #2756

YannickJadoul commented Dec 30, 2020 •

edited

Loading

YannickJadoul commented Jan 4, 2021

rwgk commented Jan 4, 2021

YannickJadoul commented Jan 4, 2021

rwgk commented Jan 4, 2021

YannickJadoul commented Jan 4, 2021

YannickJadoul commented Jan 4, 2021

YannickJadoul commented Jan 13, 2021 •

edited

Loading

rwgk commented Jan 13, 2021

YannickJadoul commented Jan 13, 2021

rwgk commented Jan 13, 2021

YannickJadoul commented Jan 13, 2021

YannickJadoul commented Jan 13, 2021 •

edited

Loading

YannickJadoul commented Jan 13, 2021 •

edited

Loading

rwgk commented Jan 13, 2021

YannickJadoul commented Jan 13, 2021

rwgk commented Jan 13, 2021

YannickJadoul commented Jan 13, 2021

henryiii left a comment

rwgk commented Jan 13, 2021 •

edited

Loading

YannickJadoul commented Jan 13, 2021

rwgk commented Jan 14, 2021 •

edited

Loading

rwgk commented Jan 17, 2021

YannickJadoul commented Jan 17, 2021

Plug leaking function_records in cpp_function initialization in case of exceptions (found by Valgrind in #2746) #2756

Plug leaking function_records in cpp_function initialization in case of exceptions (found by Valgrind in #2746) #2756

Conversation

YannickJadoul commented Dec 30, 2020 • edited Loading

Description

Suggested changelog entry:

YannickJadoul commented Jan 4, 2021

rwgk commented Jan 4, 2021

YannickJadoul commented Jan 4, 2021

rwgk commented Jan 4, 2021

YannickJadoul commented Jan 4, 2021

YannickJadoul commented Jan 4, 2021

YannickJadoul commented Jan 13, 2021 • edited Loading

rwgk commented Jan 13, 2021

YannickJadoul commented Jan 13, 2021

rwgk commented Jan 13, 2021

YannickJadoul commented Jan 13, 2021

YannickJadoul commented Jan 13, 2021 • edited Loading

YannickJadoul commented Jan 13, 2021 • edited Loading

rwgk commented Jan 13, 2021

YannickJadoul commented Jan 13, 2021

rwgk commented Jan 13, 2021

YannickJadoul commented Jan 13, 2021

henryiii left a comment

Choose a reason for hiding this comment

rwgk commented Jan 13, 2021 • edited Loading

YannickJadoul commented Jan 13, 2021

rwgk commented Jan 14, 2021 • edited Loading

rwgk commented Jan 17, 2021

YannickJadoul commented Jan 17, 2021

YannickJadoul commented Dec 30, 2020 •

edited

Loading

YannickJadoul commented Jan 13, 2021 •

edited

Loading

YannickJadoul commented Jan 13, 2021 •

edited

Loading

YannickJadoul commented Jan 13, 2021 •

edited

Loading

rwgk commented Jan 13, 2021 •

edited

Loading

rwgk commented Jan 14, 2021 •

edited

Loading