Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regressions from PR #1044 and #1045 #1119

Closed
dsnopek opened this issue May 22, 2023 · 10 comments
Closed

Regressions from PR #1044 and #1045 #1119

dsnopek opened this issue May 22, 2023 · 10 comments
Labels
bug This has been identified as a bug regression
Milestone

Comments

@dsnopek
Copy link
Collaborator

dsnopek commented May 22, 2023

These two recently merged PRs have led to a serious regression:

I'm making this issue to try and consolidate information about the regressions, hopefully leading to a PR to fix them!

Overview

Both of those PRs aim to fix calling methods defined in GDExtension (godot-cpp) from GDScript. Before merging the PRs, calling any non-virtual method that took Object * or Ref<T> as a parameter would lead to a crash.

However, now calling any virtual method that takes an Object * or Ref<T> as a parameter will lead to a crash!

Both PRs change PtrToArg<>::convert() for their respective types, in the same way: where they previously assumed they were receiving an Object **, the PRs switched to the assumption that they are receiving an Object *.

So, it seems that calling a non-virtual method from GDScript passes an Object * and calling a virtual method passes Object **.

Where does the difference come from?

As @zhehangd points out, in the GDScript VM we have

const void **argptrs = call_args_ptr;
for (int i = 0; i < argc; i++) {
  GET_INSTRUCTION_ARG(v, i);
  argptrs[i] = VariantInternal::get_opaque_pointer((const Variant *)v);
}
GET_INSTRUCTION_ARG(ret, argc + 1);
VariantInternal::initialize(ret, Variant::m_type);
void *ret_opaque = VariantInternal::OP_GET_##m_type(ret);
method->ptrcall(base_obj, argptrs, ret_opaque);

For an object, VariantInternal::get_opaque_pointer() returns a plain Object *. Hence, we're encoding an Object * as Object *.

However, when we look at the macros in gdvirtual.gen.inc (which generate the code that calls virtual methods) we've got:

		PtrToArg<m_type1>::EncodeT argval1 = arg1;\
		GDExtensionConstTypePtr argptrs[1]={&argval1};\
\
		PtrToArg<m_ret>::EncodeT ret;\
		_gdvirtual_##m_name(_get_extension_instance(),reinterpret_cast<GDExtensionConstTypePtr*>(argptrs),&ret);\

This is a little trickier to decipher. PtrToArg<Object *>::EncodeT is Object *, so it's making a local variable with the Object *. And then it's putting a pointer to that local variable (so, an Object **) into the argptrs array. That appears to be how we're getting the Object **

Possible solutions

Well, first of all, since PtrToArg<> on the godot-cpp side is supposed to round-trip encode/convert the data from PtrToArg<> on the Godot side, they are now out-of-sync. So, no matter what solution we end up pursuing, we should make sure that PtrToArg<> is doing the same thing in both godot-cpp and Godot again.

Here's some possible solutions:

  1. We could change Godot to call virtual functions with the encoding that godot-cpp expects after the PR (ie. passing Object *), and changing Godot's PtrToArgs<> to match godot-cpp's, as well as altering the macros in gdvirtual.gen.inc. This would be a breaking change to GDExtensions ABI (in a way that we don't have a system to provide compatibility), and may affect how C# works? (I really have no idea how the C# binding works, so that's just a guess. If it doesn't call virtual methods through those same macros, then it wouldn't be affected, but I suspect it does?)
  2. We could change how the GDScript VM calls methods so that it encodes Object * as Object ** and then revert the two godot-cpp PRs. However, since GDScript was able to call all sorts of methods just fine (just not non-virtual ones defined in GDExtension), I'm assuming this would break things in other places?
  3. We could have godot-cpp somehow decode Object *'s differently depending on whether or not the method being called is a virtual method or not. This is the least disruptive change because it's all in godot-cpp, but I haven't tried to implement it yet, so I'm not sure how messy it would be.

What do you think?

We really need some solution to this before Godot 4.1 is released, because godot-cpp is pretty broken at the moment.

@dsnopek dsnopek added bug This has been identified as a bug regression labels May 22, 2023
@dsnopek dsnopek added this to the 4.1 milestone May 22, 2023
@Faless
Copy link
Contributor

Faless commented May 23, 2023

One thing that's missing from the above, and which was pointed out in godotengine/godot#69902 and #958:

Unlike "normal" functions, which use Variant as an intermediate which ensures refcounting is done, with virtual functions we send pointers to Ref<T> objects.

This means, when calling virtual functions with arguments or return value of Ref<T>, Godot does not encode them Object ** but as Ref<T> * (which "might seem" like Object ** but it's not).

https://github.com/godotengine/godot/blob/56fc631/core/object/ref_counted.h#L249-L254

This is (I think) to ensure that refcount does not reach 0 when crossing the boundaries causing the objects to be freed.
This is particularly important when returning a new refcounted object from an extension method.

This is also why godotengine/godot#69902 added two functions to the gdextension interface to allow getting/setting the object referenced by an opaque pointer to a godot Ref (the extension can't know the memory profile of godot Ref template objects so can't use them directly).

Well, first of all, since PtrToArg<> on the godot-cpp side is supposed to round-trip encode/convert the data from PtrToArg<> on the Godot side, they are now out-of-sync. So, no matter what solution we end up pursuing, we should make sure that PtrToArg<> is doing the same thing in both godot-cpp and Godot again.

Here's some possible solutions:

My feeling on this is:

  • option 1 might result in some necessary cases being unaddressed (i.e. returning refcounted objects)
  • option 2 has a huge impact surface, and while it's probably a valid solution, might not be doable in the short term since it has a huge potential of breaking things in native godot.
  • I would go for option 3, which would mean doing on the godot-cpp side what Godot is already doing (i.e. distinguishing the 2 cases). This sounds doable in theory, but not sure how complex it is in practice.

@dsnopek
Copy link
Collaborator Author

dsnopek commented May 23, 2023

Thanks, @Faless!

Probably one of the important next steps here is experimenting with ways to do solution nr 3, so we can see how complex it is to implement. I'll try and put some time into that!

@vnen
Copy link
Member

vnen commented May 23, 2023

There was actually one special case where GDScript failed, which was fixed by godotengine/godot#72654. This was also something that had back and forth changes fixing one case while breaking another.

To put it short, in some cases where Object * is returned but the actual value is a RefCounted *, the ptrcall wouldn't increment the reference while where it is a Ref<T> it would increment. GDScript could not tell which case it was at runtime before the PR so it would either leak by doing too much or crash by not doing enough when packing it into Variant.

IMO we should agree on a consistent behavior for the encoding of objects and use the same standard everywhere, otherwise we're always going to have this battle of "I think it works this way" without realizing it works in a different for other cases. It is the main source of these issues.


For an object, VariantInternal::get_opaque_pointer() returns a plain Object *. Hence, we're encoding an Object * as Object *.

This seems like a problem in this method itself, it should return an Object **. The thing is that this is meant to return a pointer to the value inside the Variant. If it's a basic type, like int it returns an int *. For objects, the base type is Object * which is what is stored in Variant, so the method should return a pointer to this value, which is an Object **.

I do take the blame for this since I implemented the VariantInternal::get_opaque_pointer() initially. It should just call VariantInternal::get_object(), which does return an Object ** (get_object() itself was added later, that's why it's not called there). But I probably did this way because it is what worked.

This type of stuff has caused many problems already in the GDScript VM. It does beg the question why this is working fine though. Perhaps it goes back to PtrToArg not being consistent.


Unlike "normal" functions, which use Variant as an intermediate which ensures refcounting is done, with virtual functions we send pointers to Ref<T> objects.

This means, when calling virtual functions with arguments or return value of Ref<T>, Godot does not encode them Object ** but as Ref<T> * (which "might seem" like Object ** but it's not).

This seems like a bug in Godot to me, it should never send a Ref<T> * across to the GDExtension since the Ref wrapper does not exist there (or is a different one provided by the bindings). Or it should always send a Ref<T> *, not only for virtual functions, and then the GDExtension side can handle it with those added functions.


Regarding the potential solutions.

  1. We could change Godot to call virtual functions with the encoding that godot-cpp expects after the PR (ie. passing Object *), and changing Godot's PtrToArgs<> to match godot-cpp's, as well as altering the macros in gdvirtual.gen.inc. This would be a breaking change to GDExtensions ABI (in a way that we don't have a system to provide compatibility), and may affect how C# works? (I really have no idea how the C# binding works, so that's just a guess. If it doesn't call virtual methods through those same macros, then it wouldn't be affected, but I suspect it does?)

If we change this, I would go with always using Object **, for the same reasoning as Variant::get_opaque_pointer() I mention above. Changing PtrToArg would indeed require changes in GDScript and C#, as this affect all calls, not only virtual ones.

This is definitely disruptive but probably the best solution if we can make PtrToArg consistent. Breaking ABI for GDExtension at this point should still be okay.

However, changing the virtual calls to use Object * might be enough to fix the issue and wouldn't require changes in PtrToArg AFAICT (maybe it does for the GDExtension wrapper, but this shouldn't affect other users of PtrToArg). Which makes this less disruptive than otherwise. There might be still issues with Ref<T> though, so this needs to be considered.

  1. We could change how the GDScript VM calls methods so that it encodes Object * as Object ** and then revert the two godot-cpp PRs. However, since GDScript was able to call all sorts of methods just fine (just not non-virtual ones defined in GDExtension), I'm assuming this would break things in other places?

This most likely requires changing PtrToArg as well, since it currently works for engine classes, so not much different than the first option.

  1. We could have godot-cpp somehow decode Object *'s differently depending on whether or not the method being called is a virtual method or not. This is the least disruptive change because it's all in godot-cpp, but I haven't tried to implement it yet, so I'm not sure how messy it would be.

For me this is the last resort if we can't agree on changes in Godot itself, or don't have the time to implement those.

@saki7
Copy link
Contributor

saki7 commented May 23, 2023

I would like to share my thoughts on this topic because I was the person who initially reported the regression on #1045.

Unlike "normal" functions, which use Variant as an intermediate which ensures refcounting is done, with virtual functions we send pointers to Ref<T> objects.

This means, when calling virtual functions with arguments or return value of Ref<T>, Godot does not encode them Object ** but as Ref<T> * (which "might seem" like Object ** but it's not).

This seems like a bug in Godot to me, it should never send a Ref<T> * across to the GDExtension since the Ref wrapper does not exist there (or is a different one provided by the bindings). Or it should always send a Ref<T> *, not only for virtual functions, and then the GDExtension side can handle it with those added functions.

First of all, I feel suspicious about the current implementation where Variant is doing the refcounting, because a Variant class in general should work merely like a wrapper storage for multiple alternative types. We really shouldn't expect Variant to do some extra work like refcounting.

If we really need to store refcounted objects to variant, the Variant type should accept Ref<RefCounted> as one of the alternative types. But Godot was not designed like that; instead it has some special workaround which does the refcounting. Even if this feature is being utilized for a good cause, I believe that we shouldn't overuse this functionality for making refcount cross the extension boundary.

Regarding the comment from @vnen, I think the latter choice makes more sense. We can't eliminate Ref<T> because refcounting must always be done. To deal with that we've also added some nice helper functions to get/set into that opaque object. Therefore, I think core should always send Ref<T> *, regardless of the method being virtual or non-virtual. This way Ref<T> will take the responsibility for refcounting, and the consistent protocol makes our life much easier.


We could have godot-cpp somehow decode Object *'s differently depending on whether or not the method being called is a virtual method or not. This is the least disruptive change because it's all in godot-cpp, but I haven't tried to implement it yet, so I'm not sure how messy it would be.

Option 3 essentially means that we're accepting the current inconsistency on ABI. I have a really bad feeling about that, because the inconsistency is the principal cause of this recurring headache.

FYI, this is not just the problem for Ref<T>, it's affecting Variant-to-Object conversion too: godotengine/godot#61967. It would be nice if @Bromeon can share their thoughts on this issue, since @vnen mentioned about the inconsistency above.

I completely agree vnen for this for this point:

IMO we should agree on a consistent behavior for the encoding of objects and use the same standard everywhere, otherwise we're always going to have this battle of "I think it works this way" without realizing it works in a different for other cases. It is the main source of these issues.

@dsnopek
Copy link
Collaborator Author

dsnopek commented May 23, 2023

@vnen Thanks!

If we change this, I would go with always using Object **, for the same reasoning as Variant::get_opaque_pointer() I mention above.

This is essentially option nr 2. I may have described them too verbosely, but basically:

Option nr 1 = standardize everything on Object * encoding
Option nr 2 = standardize everything on Object ** encoding

For an object, VariantInternal::get_opaque_pointer() returns a plain Object *. Hence, we're encoding an Object * as Object *.

This seems like a problem in this method itself, it should return an Object **. The thing is that this is meant to return a pointer to the value inside the Variant. If it's a basic type, like int it returns an int *. For objects, the base type is Object * which is what is stored in Variant, so the method should return a pointer to this value, which is an Object **.

I think it's also worth experimenting with changing VariantInternal::get_opaque_pointer() to see just how disruptive this would be. I'll check it out!

@lilizoey
Copy link

I recently stumbled upon this with the rust bindings, where i solved this using option 3: godot-rust/gdext#275

I wasn't sure if this was an actual issue godot-cpp also faced, but it's interesting to know that you have the same issue.

It should be pretty simple for us to adapt our code to fit whichever solution you go for.

@dsnopek
Copy link
Collaborator Author

dsnopek commented May 23, 2023

I attempted to implement option nr 2 in Godot PR godotengine/godot#77410

I'm sure I'm missing something (or, most probably a lot of somethings!) because the changes are really pretty minimal. I expect I'll have to update it a whole bunch once I have a better understanding of what this breaks. :-)

However, when paired with godot-cpp PR #1123 it seems to fix the regressions (and the original issues that #1044 and #1045 were attempting to fix), although, I didn't test super in-depth.

I'm really in disbelief at how easily these changes seemed to come together. I fully expect that I'm missing something big, or botched my tests somehow, because it can't be this easy. :-)

@dsnopek
Copy link
Collaborator Author

dsnopek commented Jun 6, 2023

I've just posted PR #1135 for godot-cpp that implements option nr 3: decoding the arguments differently depending on if it's a virtual or non-virtual method.

Given that we're about 1 week passed the feature freeze and Godot PR godotengine/godot#77410 hasn't gotten review from any of the necessary stakeholders, this may end up being the way we need to go for Godot 4.1.

@dsnopek
Copy link
Collaborator Author

dsnopek commented Jun 7, 2023

Godot PR godotengine/godot#77410 was merged! After #1123 is merged, we should be able to close this :-)

@akien-mga
Copy link
Member

Fixed by #1123.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug This has been identified as a bug regression
Projects
None yet
Development

No branches or pull requests

6 participants