Skip to content

[BUG]: Copying from wrong address + segmentation fault #3514

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
3 tasks done
davidhamb95 opened this issue Nov 29, 2021 · 19 comments · May be fixed by #3701
Open
3 tasks done

[BUG]: Copying from wrong address + segmentation fault #3514

davidhamb95 opened this issue Nov 29, 2021 · 19 comments · May be fixed by #3701
Labels
bug casters Related to casters, and to be taken into account when analyzing/reworking casters help wanted

Comments

@davidhamb95
Copy link

Required prerequisites

Problem description

Hi,

Recently I have encountered with the situation (segmentation fault issue) which is difficult for me to debug.
I have a diamond-like hierarchy of classes in C++, which I want to use in python. I have written pybind wrapper but
had some problems after compiling and trying to call one of my methods.

Here is the C++ source code (animal.hpp):

#include <string>
#include <iostream>

namespace example {

class Creature {
public:
    virtual std::string typeName() const = 0;
    virtual ~Creature() = default;
};

class Animal : public Creature {
public:
    Animal() {
        std::cout << "Animal created: " << this << std::endl;
    }

    Animal(const Animal& animal) : Creature(animal) {
        std::cout << "Animal copied: from " << &animal << " to " << this << std::endl;
    }

    ~Animal() {
        std::cout << "Animal deleted: " << this << std::endl;
    }

     std::string typeName() const override {
        return _typeName;
     }

private:
    std::string _typeName = "animal";
};

class TerrestrialAnimal : public virtual Animal {
public:
    TerrestrialAnimal() {
        std::cout << "TerrestrialAnimal created: " << this << std::endl;
    }

    TerrestrialAnimal(const TerrestrialAnimal& animal) : Animal(animal) {
        std::cout << "TerrestrialAnimal copied: from " << &animal << " to " << this << std::endl;
    }

    ~TerrestrialAnimal() {
        std::cout << "TerrestrialAnimal deleted: " << this << std::endl;
    }

     std::string typeName() const override {
        return _typeName;
     }

private:
    std::string _typeName = "terrestrial";
};

class AquaticAnimal : public virtual Animal {
public:
    AquaticAnimal() {
        std::cout << "AquaticAnimal created: " << this << std::endl;
    }

    AquaticAnimal(const AquaticAnimal& animal) : Animal(animal) {
        std::cout << "AquaticAnimal copied: from " << &animal << " to " << this << std::endl;
    }

    ~AquaticAnimal() {
        std::cout << "AquaticAnimal deleted: " << this << std::endl;
    }

    std::string typeName() const override {
       return _typeName;
    }

private:
    std::string _typeName = "aquatic";
};

class Frog : public TerrestrialAnimal, public AquaticAnimal {
public:
    Frog() {
        std::cout << "Frog created: " << this << std::endl;
    }

    Frog(const Frog& animal) : Animal(animal), TerrestrialAnimal(animal), AquaticAnimal(animal) {
        std::cout << "Frog copied: from " << &animal << " to " << this << std::endl;
    }

    ~Frog() {
        std::cout << "Frog deleted: " << this << std::endl;
    }

    std::string typeName() const override {
       return _typeName;
    }

private:
    std::string _typeName = "frog";
};

class AnimalUsage {
public:
    AnimalUsage() {
        std::cout << "AnimalUsage created: " << this << std::endl;
    }
    
    AnimalUsage(const AnimalUsage& u) {
        std::cout << "AnimalUsage copied: from " << &u << " to " << this << std::endl;
    }

    ~AnimalUsage() {
        std::cout << "AnimalUsage deleted: " << this << std::endl;
    }

    const Animal& getAnimal() {
        return frog;
    }

    const AquaticAnimal& getAquaticAnimal() {
        return frog;
    }

    const Frog& getFrog() {
        return frog;
    }

private:
    Frog frog;
};

}

Corresponding pybind wrapper:

#include <pybind11/pybind11.h>
#include <pybind11/stl.h>
#include <pybind11/functional.h>

#include "animal.hpp"

namespace py = pybind11;

PYBIND11_MODULE(examples, m) {
    using namespace example;

    py::class_<example::Animal> animal(m, "Animal");
    
    animal.def(py::init<>());
    animal.def("type_name", &example::Animal::typeName);

    py::class_<example::TerrestrialAnimal, example::Animal> terrestrialAnimal(m, "TerrestrialAnimal");
    
    terrestrialAnimal.def(py::init<>());
    terrestrialAnimal.def("type_name", &example::TerrestrialAnimal::typeName);

    py::class_<example::AquaticAnimal, example::Animal> aquaticAnimal(m, "AquaticAnimal");
    
    aquaticAnimal.def(py::init<>());
    aquaticAnimal.def("type_name", &example::AquaticAnimal::typeName);

    py::class_<example::Frog, example::TerrestrialAnimal, example::AquaticAnimal> frog(m, "Frog");
    
    frog.def(py::init<>());
    frog.def("type_name", &example::Frog::typeName);

    py::class_<example::AnimalUsage> animalUsage(m, "AnimalUsage");
    
    animalUsage.def(py::init<>());
    animalUsage.def(py::init<const example::AnimalUsage &>(), py::arg("u"));
    animalUsage.def("get_animal", &example::AnimalUsage::getAnimal);
    animalUsage.def("get_aquatic_animal", &example::AnimalUsage::getAquaticAnimal);
    animalUsage.def("get_frog", &example::AnimalUsage::getFrog);
}

After compiling this file and generating the lib by the name examples, I imported AnimalUsage class in my test file and tried to call get_animal method:

from examples import AnimalUsage

usage = AnimalUsage()
usage.get_animal()

This call ends with segmentation fault:

Animal created: 0x600000c800d8
TerrestrialAnimal created: 0x600000c80080
AquaticAnimal created: 0x600000c800a0
Frog created: 0x600000c80080
AnimalUsage created: 0x600000c80080
Animal copied: from 0x600000c80080 to 0x6000022804a0
[1]    12401 segmentation fault  ./test.py

In my source C++ example I keep Frog instance in AnimalUsage class and return it from three different methods: getAnimal, getAquaticAnimal and getFrog. I know that I shouldn't expect polymorphism to work here since the default return value policy is copy policy for references (when I use reference_internal return value policy, everything is good), but I can't find out the cause of segmentation fault.

Except of segfault issue, from Animal copy constructor log I also noticed that object is being copied from address 0x600000c80080 which is not the address of Animal which has ben created. It corresponds to the Frog address of the initial object. It seems to me that the copy should have been done from 0x600000c800d8 address. It seems that pybind detects at the runtime that returned Animal& object type is Frog, goes to that object and uses it to call Animal copy constructor.

FYI: everything is ok when I return by value instead of returning by reference from getAnimal method.

Thanks,
Davit

Reproducible example code

No response

@davidhamb95 davidhamb95 added the triage New bug, unverified label Nov 29, 2021
@Skylion007
Copy link
Collaborator

Skylion007 commented Jan 17, 2022

Okay, so I did some digging into this and it's probably because the return_value_policy is usually move for returns by value so that's why the return by value is wroking.

static return_value_policy policy(return_value_policy p) { return p; }

@Skylion007
Copy link
Collaborator

Potentially related to: #3217

@NAThompson
Copy link

NAThompson commented Jan 27, 2022

I have built a python interpreter with AddressSanitizer, and rerun this example, compiling with -fsanitize=address -fsanitize=undefined -g:

(venv) ~/r/build ❯❯❯ python3
Python 3.9.9 (main, Jan  6 2022, 13:44:43)
[Clang 12.0.5 (clang-1205.0.22.11)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from examples import AnimalUsage
>>> usage = AnimalUsage()
Animal created: 0x106828ad8
TerrestrialAnimal created: 0x106828a80
AquaticAnimal created: 0x106828aa0
Frog created: 0x106828a80
AnimalUsage created: 0x106828a80
>>> usage.get_animal()
Animal copied: from 0x106828a80 to 0x106216390
/Users/nathompson7/reproduce_member_call/venv/include/pybind11/pybind11.h:1389:40: runtime error: cast to virtual base of address 0x000106216390 which does not point to an object of type 'example::TerrestrialAnimal'
0x000106216390: note: object is of type 'example::Animal'
 27 00 80 18  d8 43 87 08 01 00 00 00  61 6e 69 6d 61 6c 00 be  be be be be be be be be  be be be be
              ^~~~~~~~~~~~~~~~~~~~~~~
              vptr for 'example::Animal'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /Users/nathompson7/reproduce_member_call/venv/include/pybind11/pybind11.h:1389:40 in
/Users/nathompson7/reproduce_member_call/venv/include/pybind11/pybind11.h:1389:40: runtime error: cast to virtual base of address 0x0001062163b0 which does not point to an object of type 'example::AquaticAnimal'
0x0001062163b0: note: object has invalid vptr
 be be be 06  03 00 00 00 00 00 00 02  20 00 00 20 20 00 00 0e  26 00 00 6d dd dd dd dd  dd dd dd dd
              ^~~~~~~~~~~~~~~~~~~~~~~
              invalid vptr
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /Users/nathompson7/reproduce_member_call/venv/include/pybind11/pybind11.h:1389:40 in
=================================================================
==6412==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x0001062163b0 at pc 0x00010881ed30 bp 0x00016f462930 sp 0x00016f462928
READ of size 8 at 0x0001062163b0 thread T0
    #0 0x10881ed2c in void pybind11::class_<example::AquaticAnimal, example::Animal>::add_base<example::Animal, 0>(pybind11::detail::type_record&)::'lambda'(void*)::__invoke(void*) pybind11.h:1388
    #1 0x1087ebd7c in pybind11::detail::traverse_offset_bases(void*, pybind11::detail::type_info const*, pybind11::detail::instance*, bool (*)(void*, pybind11::detail::instance*)) class.h:285
    #2 0x1087ebba0 in pybind11::detail::traverse_offset_bases(void*, pybind11::detail::type_info const*, pybind11::detail::instance*, bool (*)(void*, pybind11::detail::instance*)) class.h:288
    #3 0x108821fb4 in pybind11::class_<example::Frog, example::TerrestrialAnimal, example::AquaticAnimal>::init_instance(pybind11::detail::instance*, void const*) pybind11.h:1635
    #4 0x10882dfc4 in pybind11::detail::type_caster_generic::cast(void const*, pybind11::return_value_policy, pybind11::handle, pybind11::detail::type_info const*, void* (*)(void const*), void* (*)(void const*), void const*) type_caster_base.h:613
    #5 0x10882d104 in void pybind11::cpp_function::initialize<pybind11::cpp_function::cpp_function<example::Animal const&, example::AnimalUsage, pybind11::name, pybind11::is_method, pybind11::sibling>(example::Animal const& (example::AnimalUsage::*)(), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::'lambda'(example::AnimalUsage*), example::Animal const&, example::AnimalUsage*, pybind11::name, pybind11::is_method, pybind11::sibling>(example::Animal const&, example::AnimalUsage (*)(), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::'lambda'(pybind11::detail::function_call&)::operator()(pybind11::detail::function_call&) const pybind11.h:232
    #6 0x1087bbc44 in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) pybind11.h:835
    #7 0x100cb2d00 in cfunction_call methodobject.c:543
    #8 0x100bad788 in _PyObject_MakeTpCall call.c:191
    #9 0x100bb8fb0 in _PyObject_VectorcallTstate abstract.h:116
    #10 0x100bb699c in method_vectorcall classobject.c:53
    #11 0x100f6543c in _PyObject_VectorcallTstate abstract.h:118
    #12 0x100f5de64 in PyObject_Vectorcall abstract.h:127
    #13 0x100f5e0c0 in call_function ceval.c:5075
    #14 0x100f50b9c in _PyEval_EvalFrameDefault ceval.c:3487
    #15 0x100f27f48 in _PyEval_EvalFrame pycore_ceval.h:40
    #16 0x100f610a0 in _PyEval_EvalCode ceval.c:4327
    #17 0x100f6247c in _PyEval_EvalCodeWithName ceval.c:4359
    #18 0x100f27e74 in PyEval_EvalCodeEx ceval.c:4375
    #19 0x100f27d78 in PyEval_EvalCode ceval.c:826
    #20 0x10104f62c in run_eval_code_obj pythonrun.c:1221
    #21 0x10104b7e8 in run_mod pythonrun.c:1242
    #22 0x101048a50 in PyRun_InteractiveOneObjectEx pythonrun.c:274
    #23 0x101048290 in PyRun_InteractiveLoopFlags pythonrun.c:127
    #24 0x101047fd4 in PyRun_AnyFileExFlags pythonrun.c:86
    #25 0x1010c3714 in pymain_run_stdin main.c:512
    #26 0x1010c1550 in pymain_run_python main.c:601
    #27 0x1010c0c34 in Py_RunMain main.c:677
    #28 0x1010c1c74 in pymain_main main.c:707
    #29 0x1010c1f40 in Py_BytesMain main.c:731
    #30 0x10099434c in main python.c:15
    #31 0x101f010f0 in start+0x204 (dyld:arm64e+0x50f0)

0x0001062163b0 is located 0 bytes to the right of 32-byte region [0x000106216390,0x0001062163b0)
allocated by thread T0 here:
    #0 0x102300afc in wrap__Znwm+0x74 (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x4cafc)
    #1 0x1088350dc in decltype((new example::Animal(std::declval<example::Animal const>())) , (void* (*)(void const*){})) pybind11::detail::type_caster_base<example::Animal>::make_copy_constructor<example::Animal, void>(example::Animal const*)::'lambda'(void const*)::__invoke(void const*) type_caster_base.h:968
    #2 0x10882df18 in pybind11::detail::type_caster_generic::cast(void const*, pybind11::return_value_policy, pybind11::handle, pybind11::detail::type_info const*, void* (*)(void const*), void* (*)(void const*), void const*) type_caster_base.h:568
    #3 0x10882d104 in void pybind11::cpp_function::initialize<pybind11::cpp_function::cpp_function<example::Animal const&, example::AnimalUsage, pybind11::name, pybind11::is_method, pybind11::sibling>(example::Animal const& (example::AnimalUsage::*)(), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::'lambda'(example::AnimalUsage*), example::Animal const&, example::AnimalUsage*, pybind11::name, pybind11::is_method, pybind11::sibling>(example::Animal const&, example::AnimalUsage (*)(), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::'lambda'(pybind11::detail::function_call&)::operator()(pybind11::detail::function_call&) const pybind11.h:232
    #4 0x1087bbc44 in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) pybind11.h:835
    #5 0x100cb2d00 in cfunction_call methodobject.c:543
    #6 0x100bad788 in _PyObject_MakeTpCall call.c:191
    #7 0x100bb8fb0 in _PyObject_VectorcallTstate abstract.h:116
    #8 0x100bb699c in method_vectorcall classobject.c:53
    #9 0x100f6543c in _PyObject_VectorcallTstate abstract.h:118
    #10 0x100f5de64 in PyObject_Vectorcall abstract.h:127
    #11 0x100f5e0c0 in call_function ceval.c:5075
    #12 0x100f50b9c in _PyEval_EvalFrameDefault ceval.c:3487
    #13 0x100f27f48 in _PyEval_EvalFrame pycore_ceval.h:40
    #14 0x100f610a0 in _PyEval_EvalCode ceval.c:4327
    #15 0x100f6247c in _PyEval_EvalCodeWithName ceval.c:4359
    #16 0x100f27e74 in PyEval_EvalCodeEx ceval.c:4375
    #17 0x100f27d78 in PyEval_EvalCode ceval.c:826
    #18 0x10104f62c in run_eval_code_obj pythonrun.c:1221
    #19 0x10104b7e8 in run_mod pythonrun.c:1242
    #20 0x101048a50 in PyRun_InteractiveOneObjectEx pythonrun.c:274
    #21 0x101048290 in PyRun_InteractiveLoopFlags pythonrun.c:127
    #22 0x101047fd4 in PyRun_AnyFileExFlags pythonrun.c:86
    #23 0x1010c3714 in pymain_run_stdin main.c:512
    #24 0x1010c1550 in pymain_run_python main.c:601
    #25 0x1010c0c34 in Py_RunMain main.c:677
    #26 0x1010c1c74 in pymain_main main.c:707
    #27 0x1010c1f40 in Py_BytesMain main.c:731
    #28 0x10099434c in main python.c:15
    #29 0x101f010f0 in start+0x204 (dyld:arm64e+0x50f0)

SUMMARY: AddressSanitizer: heap-buffer-overflow pybind11.h:1388 in void pybind11::class_<example::AquaticAnimal, example::Animal>::add_base<example::Animal, 0>(pybind11::detail::type_record&)::'lambda'(void*)::__invoke(void*)
Shadow bytes around the buggy address:
  0x007020c62c20: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x007020c62c30: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x007020c62c40: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x007020c62c50: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x007020c62c60: 00 00 00 00 fa fa 00 00 00 00 fa fa 00 00 00 00
=>0x007020c62c70: fa fa 00 00 00 00[fa]fa fd fd fd fd fa fa fd fd
  0x007020c62c80: fd fd fa fa fd fd fd fd fa fa fd fd fd fd fa fa
  0x007020c62c90: fd fd fd fd fa fa fd fd fd fd fa fa fd fd fd fd
  0x007020c62ca0: fa fa fd fd fd fd fa fa fd fd fd fd fa fa fd fd
  0x007020c62cb0: fd fd fa fa 00 00 03 fa fa fa fd fd fd fa fa fa
  0x007020c62cc0: fd fd fd fd fa fa fd fd fd fd fa fa 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==6412==ABORTING
fish: Job 1, 'python3' terminated by signal SIGABRT (Abort)

@davidhamb95 : Thanks for building this into a minimal reproducer; I was seeing this error over and over but couldn't figure out how to pare it down to something that would fit in a bug report. Hopefully the UBSan/Asan report helps crack this case.

@rwgk
Copy link
Collaborator

rwgk commented Jan 27, 2022

I don't have the free bandwidth to understand the very large reproducer. But if you put it in a PR, and it passes the valgrind GHA (all existing tests pass obviously), it'll only take a few minutes of my time to run it through the Google-internal sanitizers. Please tag me on the PR.

@Skylion007
Copy link
Collaborator

Okay, so this bug is actually pretty subtle. First of all, there is a bug here:

return static_cast<Base *>(reinterpret_cast<type *>(src));
. Static_Casting to a base class is fine UNLESS it's a virtual base class which is undefined behavior (as this example shows). However, changing it to a dynamic_cast doesn't seem to solve the problem.

None of our test suite seems to use virtual base classes except for the brand new multi-inheritance tests added two days ago.

pybind11_tests.cpython-38-darwin.so	0x00000001036d96f5 void pybind11::class_<test_submodule_class_(pybind11::module_&)::AquaticAnimal, test_submodule_class_(pybind11::module_&)::Animal>::add_base<test_submodule_class_(pybind11::module_&)::Animal, 0>(pybind11::detail::type_record&)::'lambda'(void*)::__invoke(void*) + 21 (pybind11.h:1414)
6   pybind11_tests.cpython-38-darwin.so	0x00000001034b1803 pybind11::detail::traverse_offset_bases(void*, pybind11::detail::type_info const*, pybind11::detail::instance*, bool (*)(void*, pybind11::detail::instance*)) + 451 (class.h:285)
7   pybind11_tests.cpython-38-darwin.so	0x00000001034b1861 pybind11::detail::traverse_offset_bases(void*, pybind11::detail::type_info const*, pybind11::detail::instance*, bool (*)(void*, pybind11::detail::instance*)) + 545 (class.h:288)
8   pybind11_tests.cpython-38-darwin.so	0x00000001034b1335 pybind11::detail::deregister_instance(pybind11::detail::instance*, void*, pybind11::detail::type_info const*) + 85 (class.h:321)
9   pybind11_tests.cpython-38-darwin.so	0x00000001034b11ac pybind11::detail::clear_instance(_object*) + 220 (class.h:395)
10  pybind11_tests.cpython-38-darwin.so	0x00000001034b0db5 pybind11_object_dealloc + 21 (class.h:419)

My current theory about what is going wrong:

  • We have two objects that inherit from the same virtual base class.
  • This is causing havok with our casters that reference base classes by their type id, the same two classes can have the same typeid and they might be overwriting each other's type_record, trying to free it twice, or some other badness.
  • When replacing static_cast with dynamic_cast, dynamic casting can and does FAIL and returns a nullptr. To avoid segfaults we may want to handle this non-gracefully.
  • This is an issue because it tries to cast Frog up to Animal and back down again and is messing up or losing the RTTI somewhere along the chain and then try to copy it with the wrong constructor.
  • This error does occurs when we do not copy the C++ type, not when we move it.
  • If I abort add_base with a try .. except, then get_animal works, but deallocating the object causes a segfault.

Stumped about how to proceed.

@Skylion007
Copy link
Collaborator

@virtuald Any thoughts on how to handle virtual classes properly?

@virtuald
Copy link
Contributor

I don't have time at the moment to fully grok this, can look tonight.

... but I have virtual base class thingies working correctly as far as I can tell. I had found this previously (#2071) that indicated that using py::multiple_inheritance was required when dealing with a virtual base class. I don't remember if my autowrapper forces py::multiple_inheritance if it finds a virtual base class or not... I think it does?

@rwgk
Copy link
Collaborator

rwgk commented Jan 29, 2022

If looked carefully at @Skylion007 's comment and I have some vague ideas that may or may not be relevant.
Could someone please create a reproducer draft PR that we can all play with? (@NAThompson or @Skylion007 I figure you must have something already.)
The most important of the vague ideas is:

  • Do we properly cast to void * further upstream? For polymorphic types that would be dynamic_cast<void *>, which is a very special beast.

In another context we handle this properly, with template specializations distinguishing between polymorphic and non-polymorphic types:

return dynamic_cast<const void*>(src);

@Skylion007
Copy link
Collaborator

Okay, looks like I am barking up the right tree: gnuradio/gnuradio#5292

@Skylion007 Skylion007 linked a pull request Feb 8, 2022 that will close this issue
@Skylion007 Skylion007 linked a pull request Feb 8, 2022 that will close this issue
@Skylion007
Copy link
Collaborator

Could be related to: #1257

@plfoley
Copy link

plfoley commented May 15, 2022

I encountered the same issue and I think your analysis is right about the cast issue.
I went digging a bit in pybind11 (v2.6.1 & v2.9.2) code to try to figure out what the issue could be.

One of the thing that surprised me was that the typeinfo->implicit_casts available to the try_implicit_casts method of copyable_holder_caster did not contain an entry for all the branches in my inheritance tree. Actually a whole part behind the virtually inherited interface I was trying to cast to was missing even tough that class is exported to python through pybind.

My class hierarchy looks something like this:

VirtualInterface1
AbstractBaseClass1 : VirtualInterface1

VirtualInterface2 : VirtualInterface1
AbstractBaseClass2 : AbstractBaseClass1, VirtualInterface2

VirtualInterface3 : VirtualInterface2
AbstractBaseClass3 : AbstractBaseClass2, VirtualInterface3

DerivedClass3A : AbstractBaseClass3
DerivedClass3B : DerivedClass3A
DerivedClass3C : DerivedClass3B

All virtual interfaces are pure virtual with trivial constructors and public virtual destructors.
All of these classes and interfaces are fully exported through pybind.

I have two other classes which have methods that use instances polymorphically:
Implementation1
Implementation2 : Implementation1

In Implementation1 have a method that takes an instance of VirtualInterface1 as parameter. When I call this from Python on an instance of Implementation2, I have the same issue as described here but this happens only when the C++ code is compiled with MSVC, not with GCC nor LLDB. If I overload that method in Implementation2 to take AbstractBaseClass2 or AbstractBaseClass3 as parameter there is no problems.

I saw that behavior before when I was implementing a polymorphic delegation system. The issue was that we needed to cast to void* for compatibility reasons and needed to restore the type when executing the delegate. The problem was that if the delegate was instantiated from a base class, there was no way of knowing the most derived type to execute the correct static_cast to the most derived type. Doing this caused exactly the same problem we see here: getting a corrupted pointer with MSVC and much less often (although I did observed this in some of my tests) with GCC and LLDB.

The problem with casting to void* is that you lose all RTTI. The way I managed to make this work was to implement some sort of reflexion where we would extract the most derived type information and map it to a cast routine that would always static_cast to the correct type when resolved. The only drawback of this is that you can't initialize this in your construction sequence. I am guessing a similar solution could be applied to pybind11 since it seems to store std::type_info along with the instance pointer.

I just started to look into pybind11 code and everything is still unfamiliar so don't expect too much quickly. Moreover, the workaround of using the linear hierarchy is workable for us at the moment so we will probably go this way for now and refactor this when a solution has been implemented.

Another test I did was to set the correct pointer address manually in the aliasing constructor of the smart pointer (I use my own type of smart pointer, nor shared_ptr for historical reasons) when the cast operation on the VirtualInterface took place. This made the operation succeed and the call completed as expected.

There is clearly an issue in the library when it comes to cast instances polymorphically. This is a rather complicated issue to understand and fix, but I felt like sharing my experience in the hope that this could bring ideas to someone with more intrinsic knowledge in the library. If I can be of any help I would gladly help in any capacity I can because I would really want to see this one fixed.

@Skylion007
Copy link
Collaborator

@rwgk @henryiii @wjakob ^ See above.

@Skylion007 Skylion007 added bug help wanted casters Related to casters, and to be taken into account when analyzing/reworking casters and removed triage New bug, unverified labels May 15, 2022
@rwgk
Copy link
Collaborator

rwgk commented May 15, 2022

There is clearly an issue in the library when it comes to cast instances polymorphically.

Could you please provide a reproducer?

Until we have a reproducer it is too early to derive any conclusions about problems in the library.

The problem was that if the delegate was instantiated from a base class, there was no way of knowing the most derived type to execute the correct static_cast to the most derived type.

That doesn't sound right: when virtual functions are involved, you need to use dynamic_cast.

@plfoley
Copy link

plfoley commented May 15, 2022

First of all, I would like to praise you guys for being so responsive. This shows PyBind11 is very alive and kicking and that confort my choice to commit to the library in my architecture.

dr;tl
1 - Virtual methods and virtual inheritance are two very different concepts.
1a - Always use dynamic_cast to cast to void*.
1b - Always static cast void* to the most derived type the void* represents then dynamic_cast to virtually inherited base class.
2 - I will craft you a code sample that reproduces the issue. C++ class hierarchy with python bindings and Python code that triggers the issue. I will be very busy in the coming weeks so this might take some time however.

I am sorry if I state stuff you already know but I am covering them in the odd chance this makes you see what could be causing this issue.

When inheriting virtually, the derived class memory layout will contain only one copy of the virtually inherited class. This is true for every virtually inherited classes in the hierarchy. The standard is clear on the guaranties that a compiler must offer but the implementation details are left to the compiler vendor. In this particular regard, MSVC implementation differs significantly from GCC implementation.

As I am sure you know, you need to use the RTTI when navigating the class hierarchy, that's why you mentioned using dynamic_cast and in this regard this is absolutely true. The thing with virtual inheritance is that the class memory layout is not guaranteed since you need to have one and only one copy of the virtually inherited class. As long as you still have the RTTI on the instance you are fine navigating the hierarchy through dynamic_cast.

If you want to get a void* to the top of the instance memory layout, the way to do this is to use dynamic_cast<void*> as mentioned in an earlier comment. The problem then is that all RTTI will be gone on that instance. The only way to restore the RTTI is to static_cast the void* to the most derived type the instance refers to. Once the RTTI is restored on the instance you are free to use dynamic_cast to navigate the hierarchy again. If you cast the void* to an intermediate class, the result will be undefined. That explains why it sometimes works and sometime doesn't. In my experience this will almost never work properly with MSVC and much more often in GCC.

I think that this cast from void* to an intermediate class might be what is happening. I have to admit I haven't dug enough in pybind code at this point to be completely sure that this is what is happening. I would like to understand more how the implicit cast list works in the library so I plan to dig more on this topic as I think this might be a point of interest to solve this issue. What makes me think this is cast related is that when I manually correct the pointer passed to the aliasing constructor of my smart pointer (2nd parameter) to make it point to the top of the instance, the python operation succeeds. This is the exact same behavior I observed when I was dealing with cast issues when designing my polymorphic delegation system which was receiving void* and casting them to the target class before executing the stored pointer-to-member function on it.

I wanted to communicate that information to you guys sooner than later on the chance one of you might see where the described problematic situation might occur. I agree that without a reproducible example it is hard to definitively state that that is an issue in the library and I will make sure to give one to you as soon as possible.

Of course I will remain available to answer any question you might have and I might ask some guidance on the library design and architecture if you are inclined when I start digging.

@rwgk
Copy link
Collaborator

rwgk commented May 15, 2022

I think that this cast from void* to an intermediate class might be what is happening.

I'm sorry to sound like a broken record, but the only thing I could act on is a reproducer.

We'd have to make guesses: there will be no certainty that we hit the right thing.

IIUC you have something that is broken: please reduce.

@plfoley
Copy link

plfoley commented May 15, 2022

Don't worry I totally agree with the fact you don't want to go on a wild goose chase on a limb.
I did not expected anyone to look into this before I came back with a sure-fire way of reproducing this.
This is why I will make sure the exact same behavior is exhibited by the sample code.
This is also partly why it might take some time before I am able to give it to you.

Until I am able to get you something concrete to work on.

@plfoley
Copy link

plfoley commented May 30, 2022

Hey guys!

I am working on a sample code to reproduce the issue. I should have this out tomorrow (I hope).

I have a deployment question for you all:
I usually always deploy my modules to site-packages and use them from there.
I'd like, if possible to avoid polluting yours.

Is there a way you are aware of that allows to load pybind11 generated modules in the Python environment without distributing them to site-packages?

@Skylion007
Copy link
Collaborator

Skylion007 commented May 30, 2022

@plfoley Modifying the PYTHONPATH environment variable usually works for local testing. Just running the pytest cmake target should run all the tests thoug without installing anything.

@bluescarni
Copy link
Contributor

@rwgk there is a small reproducer for an issue which may be the same as this, over at #4365 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug casters Related to casters, and to be taken into account when analyzing/reworking casters help wanted
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants