[BUG]: Copying from wrong address + segmentation fault #3514
Comments
Okay, so I did some digging into this, and it's probably because the return_value_policy is usually move for returns by value, which is why returning by value works. pybind11/include/pybind11/cast.h Line 853 in 5194855
|
Potentially related to: #3217 |
I have built a python interpreter with AddressSanitizer, and rerun this example, compiling with
@davidhamb95 : Thanks for building this into a minimal reproducer; I was seeing this error over and over but couldn't figure out how to pare it down to something that would fit in a bug report. Hopefully the UBSan/Asan report helps crack this case. |
I don't have the free bandwidth to understand the very large reproducer. But if you put it in a PR, and it passes the valgrind GHA (all existing tests pass obviously), it'll only take a few minutes of my time to run it through the Google-internal sanitizers. Please tag me on the PR. |
Okay, so this bug is actually pretty subtle. First of all, there is a bug here: pybind11/include/pybind11/pybind11.h Line 1399 in ec81e8e
None of our test suite seems to use virtual base classes except for the brand new multi-inheritance tests added two days ago.
My current theory about what is going wrong:
Stumped about how to proceed. |
@virtuald Any thoughts on how to handle virtual classes properly? |
I don't have time at the moment to fully grok this, can look tonight. ... but I have virtual base class thingies working correctly as far as I can tell. I had found this previously (#2071) that indicated that using py::multiple_inheritance was required when dealing with a virtual base class. I don't remember if my autowrapper forces py::multiple_inheritance if it finds a virtual base class or not... I think it does? |
I looked carefully at @Skylion007's comment and I have some vague ideas that may or may not be relevant.
In another context we handle this properly, with template specializations distinguishing between polymorphic and non-polymorphic types:
|
Okay, looks like I am barking up the right tree: gnuradio/gnuradio#5292 |
Could be related to: #1257 |
I encountered the same issue and I think your analysis of the cast issue is right. One of the things that surprised me was that the typeinfo->implicit_casts available to the try_implicit_casts method of copyable_holder_caster did not contain an entry for every branch of my inheritance tree. In fact, the whole part behind the virtually inherited interface I was trying to cast to was missing, even though that class is exported to Python through pybind.
My class hierarchy looks something like this:
VirtualInterface1
VirtualInterface2 : VirtualInterface1
VirtualInterface3 : VirtualInterface2
DerivedClass3A : AbstractBaseClass3
All virtual interfaces are pure virtual with trivial constructors and public virtual destructors. I have two other classes whose methods use instances polymorphically: in Implementation1 I have a method that takes an instance of VirtualInterface1 as a parameter. When I call this from Python on an instance of Implementation2, I get the same issue as described here, but only when the C++ code is compiled with MSVC, not with GCC or LLDB. If I overload that method in Implementation2 to take AbstractBaseClass2 or AbstractBaseClass3 as a parameter, there are no problems.
I have seen this behavior before, when I was implementing a polymorphic delegation system. We needed to cast to void* for compatibility reasons and had to restore the type when executing the delegate. If the delegate was instantiated from a base class, there was no way of knowing the most derived type, so we could not perform the correct static_cast to it. This caused exactly the same problem we see here: a corrupted pointer with MSVC and, much less often (although I did observe it in some of my tests), with GCC and LLDB. The problem with casting to void* is that you lose all RTTI.
The way I managed to make this work was to implement a form of reflection: we extract the most derived type information and map it to a cast routine that always static_casts to the correct type when resolved. The only drawback is that you can't initialize this during your construction sequence. I am guessing a similar solution could be applied to pybind11, since it seems to store std::type_info along with the instance pointer. I have only just started looking into the pybind11 code and everything is still unfamiliar, so don't expect too much quickly. Moreover, the workaround of using a linear hierarchy is workable for us at the moment, so we will probably go that way for now and refactor once a solution has been implemented.
Another test I did was to set the correct pointer address manually in the aliasing constructor of the smart pointer (I use my own type of smart pointer, not shared_ptr, for historical reasons) when the cast operation on the VirtualInterface took place. This made the operation succeed and the call completed as expected. There is clearly an issue in the library when it comes to casting instances polymorphically. This is a rather complicated issue to understand and fix, but I felt like sharing my experience in the hope that it could give ideas to someone with more intrinsic knowledge of the library. If I can be of any help, I would gladly help in any capacity I can, because I would really like to see this one fixed. |
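The type-to-cast-routine mapping described in this comment could look roughly like the sketch below. It is only an illustration in plain C++: Interface, Mid, ImplA, ImplB, Erased and CastRegistry are hypothetical names, and nothing here is taken from pybind11 or from the commenter's code base.

```cpp
#include <cassert>
#include <functional>
#include <typeindex>
#include <typeinfo>
#include <unordered_map>

// Hypothetical hierarchy with a virtually inherited interface.
struct Interface {
    virtual ~Interface() = default;
    virtual int id() const = 0;
};
struct Mid : virtual Interface {};
struct ImplA : Mid { int id() const override { return 1; } };
struct ImplB : Mid { int payload = 42; int id() const override { return 2; } };

// Type-erased handle: address of the most derived object plus the dynamic
// type that was recorded before the static type was thrown away.
struct Erased {
    void* ptr;
    std::type_index type;
};

class CastRegistry {
public:
    // Register a routine that knows how to get from void* back to Interface*
    // for one concrete (most derived) type.
    template <typename Derived>
    void register_type() {
        casts_[std::type_index(typeid(Derived))] = [](void* p) -> Interface* {
            // Valid because p is known to be the address of a Derived object;
            // the implicit upcast then applies the virtual-base adjustment.
            return static_cast<Derived*>(p);
        };
    }

    // dynamic_cast<void*> yields the most derived address; typeid(*p) yields
    // the most derived type (only once construction has finished).
    static Erased erase(Interface* p) {
        assert(p != nullptr);
        return {dynamic_cast<void*>(p), std::type_index(typeid(*p))};
    }

    // Returns nullptr when the dynamic type was never registered.
    Interface* restore(const Erased& e) const {
        auto it = casts_.find(e.type);
        return it == casts_.end() ? nullptr : it->second(e.ptr);
    }

private:
    std::unordered_map<std::type_index, std::function<Interface*(void*)>> casts_;
};
```

The construction-time drawback mentioned above shows up here too: inside a base-class constructor, typeid(*this) still names the base class, so erase would record the wrong type if called before the object is fully constructed.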
Could you please provide a reproducer? Until we have a reproducer it is too early to derive any conclusions about problems in the library.
That doesn't sound right: when virtual functions are involved, you need to use dynamic_cast. |
First of all, I would like to praise you guys for being so responsive. This shows PyBind11 is very much alive and kicking, and it reinforces my choice to commit to the library in my architecture.
tl;dr I am sorry if I state things you already know, but I am covering them on the off chance that this helps you see what could be causing the issue.
When inheriting virtually, the derived class memory layout will contain only one copy of the virtually inherited class. This is true for every virtually inherited class in the hierarchy. The standard is clear on the guarantees a compiler must offer, but the implementation details are left to the compiler vendor. In this particular regard, the MSVC implementation differs significantly from the GCC implementation.
As I am sure you know, you need RTTI when navigating the class hierarchy; that's why you mentioned dynamic_cast, and in this regard that is absolutely true. The thing with virtual inheritance is that the class memory layout is not guaranteed, since you need to have one and only one copy of the virtually inherited class. As long as you still have the RTTI on the instance, you are fine navigating the hierarchy through dynamic_cast. If you want a void* to the top of the instance memory layout, the way to get it is dynamic_cast<void*>, as mentioned in an earlier comment. The problem then is that all RTTI is gone for that instance. The only way to restore it is to static_cast the void* to the most derived type the instance refers to. Once the RTTI is restored, you are free to use dynamic_cast to navigate the hierarchy again. If you cast the void* to an intermediate class, the result is undefined. That explains why it sometimes works and sometimes doesn't. In my experience this will almost never work properly with MSVC and much more often with GCC. I think that this cast from void* to an intermediate class might be what is happening.
I have to admit I haven't dug into the pybind code enough at this point to be completely sure that this is what is happening. I would like to better understand how the implicit cast list works in the library, so I plan to dig further into this topic, as I think it might be a point of interest for solving this issue. What makes me think this is cast related is that when I manually correct the pointer passed to the aliasing constructor of my smart pointer (2nd parameter) to make it point to the top of the instance, the Python operation succeeds. This is exactly the behavior I observed when I was dealing with cast issues while designing my polymorphic delegation system, which received void* and cast them to the target class before executing the stored pointer-to-member function on it. I wanted to communicate this information to you sooner rather than later, on the chance that one of you might see where the described problematic situation could occur. I agree that without a reproducible example it is hard to state definitively that this is an issue in the library, and I will make sure to give you one as soon as possible. Of course I will remain available to answer any questions you might have, and I might ask for some guidance on the library design and architecture, if you are inclined, when I start digging. |
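To make the void* round trip described in this comment concrete, here is a minimal sketch in plain C++. The types Interface, Left, Right and Impl are hypothetical and not taken from the thread; the sketch only illustrates the language rules and says nothing about where pybind11 itself performs such a cast.

```cpp
#include <cassert>

// Hypothetical diamond with a virtually inherited interface.
struct Interface {
    int tag = 7;
    virtual ~Interface() = default;
};
struct Left : virtual Interface { int left = 1; };
struct Right : virtual Interface { int right = 2; };
struct Impl : Left, Right { int impl = 3; };

// Erase the type: for a polymorphic class, dynamic_cast<void*> yields the
// address of the most derived object, whichever subobject p points to.
void* erase(Interface* p) { return dynamic_cast<void*>(p); }

// Restore the type: the void* must first be static_cast to the most derived
// type (Impl here). The implicit upcast that follows lets the compiler apply
// the correct virtual-base adjustment.
Interface* restore_as_impl(void* erased) {
    Impl* impl = static_cast<Impl*>(erased);
    return impl;
}

// The broken variant, deliberately left as a comment: treating the erased
// address as an Interface (or any other intermediate class) and then using
// it is undefined behavior, because no pointer adjustment takes place.
//
//   Interface* wrong = static_cast<Interface*>(erased);
```

With an Impl object obj, restore_as_impl(erase(&obj)) returns the same Interface* that the implicit conversion from &obj produces, while the commented-out cast would in general point at the wrong bytes.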
I'm sorry to sound like a broken record, but the only thing I could act on is a reproducer. We'd have to make guesses: there will be no certainty that we hit the right thing. IIUC you have something that is broken: please reduce. |
Don't worry, I totally agree that you don't want to go on a wild goose chase until I am able to get you something concrete to work on. |
Hey guys! I am working on sample code to reproduce the issue. I should have this out tomorrow (I hope). I have a deployment question for you all: are you aware of a way to load pybind11-generated modules into the Python environment without distributing them to site-packages? |
@plfoley Modifying the PYTHONPATH environment variable usually works for local testing. Just running the |
Required prerequisites
Problem description
Hi,
I recently ran into a segmentation fault that I am finding difficult to debug.
I have a diamond-like hierarchy of classes in C++, which I want to use in Python. I have written a pybind wrapper, but
had some problems after compiling it and trying to call one of my methods.
Here is the C++ source code (animal.hpp):
Corresponding pybind wrapper:
After compiling this file and generating the lib named examples, I imported the AnimalUsage class in my test file and tried to call the get_animal method:
This call ends with a segmentation fault:
In my C++ source example I keep a Frog instance in the AnimalUsage class and return it from three different methods: getAnimal, getAquaticAnimal and getFrog. I know that I shouldn't expect polymorphism to work here, since the default return value policy for references is copy (when I use the reference_internal return value policy, everything is fine), but I can't find the cause of the segmentation fault.
Apart from the segfault, I also noticed from the Animal copy constructor log that the object is being copied from address 0x600000c80080, which is not the address of the Animal that was created. It corresponds to the Frog address of the initial object. It seems to me that the copy should have been made from address 0x600000c800d8. It seems that pybind detects at runtime that the type of the returned Animal& object is Frog, goes to that object, and uses it to call the Animal copy constructor.
FYI: everything is OK when I return by value instead of by reference from the getAnimal method.
Thanks,
Davit
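The address observation in this report can be illustrated without pybind11. The sketch below is not the animal.hpp from the report (that code is not shown here): LandAnimal, the name member, copied_from and copy_animal_part are all made up for illustration. It shows that, with virtual inheritance, a correct copy of the Animal part reads from the address of the Animal subobject, which on common ABIs is not the address of the Frog object itself.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-ins for the classes named in the report.
struct Animal {
    std::string name;
    const void* copied_from = nullptr;  // source address seen by the copy ctor
    explicit Animal(std::string n) : name(std::move(n)) {}
    Animal(const Animal& other) : name(other.name), copied_from(&other) {}
    virtual ~Animal() = default;
};
struct AquaticAnimal : virtual Animal {
    AquaticAnimal() : Animal("aquatic") {}
};
struct LandAnimal : virtual Animal {  // invented to complete the diamond
    LandAnimal() : Animal("land") {}
};
struct Frog : AquaticAnimal, LandAnimal {
    Frog() : Animal("frog") {}  // the most derived class initializes the virtual base
};

struct CopyReport {
    const void* frog_address;    // start of the complete Frog object
    const void* animal_address;  // address of its Animal subobject
    const void* copied_from;     // address the copy constructor actually read
    std::string copied_name;
};

// Copies the Animal part of a Frog the way the type system intends: the
// implicit upcast adjusts the address before the copy constructor runs.
CopyReport copy_animal_part(const Frog& frog) {
    const Animal& base = frog;
    Animal copy(base);  // slices, but reads from the Animal subobject
    assert(copy.copied_from == &base);
    return {&frog, &base, copy.copied_from, copy.name};
}
```

If frog_address were instead reinterpreted as an Animal, the copy constructor would read a std::string from the wrong offset, which would be consistent with both the unexpected address in the log and the crash.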
Reproducible example code
No response