I took a quick look and I think so. My "reproducer" is to read the code, though, rather than to run the code.
open reference.cpp on the master branch
notice that the unref() logic appears to be the same as when I did my original analysis.
a key problematic construct appears to be the unrefing followed by memdelete. Just imagine that two threads simultaneously execute this line, and one continues after that line while the other is temporarily suspended.
the thread which continues executing concludes there are no references, deletes the thing.
the suspended thread wakes up and calls get_script_instance(), which is now an invalid access because this has been deleted.
Really, all the work done in unref (decrement a counter, delete stuff) needs to be protected by a mutex.
The code executing memdelete also needs to be written so that it's guaranteed there can't be other threads alive holding references to the deletee when it comes to delete the object, for example here. The following algorithm needs to ensure that exactly one thing is accessing the reference count during that time, otherwise bad things will happen:
1. decrement reference count
2. test if refcount is zero
3. if zero, delete
It's necessary to ensure that between (1) and (3), no-one can re-increment the refcount, or subsequently operate on the object once it's deleted. A good way to do that is to protect the whole operation by a mutex, and the same for modifications or tests against the reference count itself.
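A minimal sketch of the decrement, test and delete sequence under a mutex follows. It is not Godot's code: Counted, try_reference, release and count_deleters are hypothetical names made up for illustration. The sketch assumes that a thread never touches the object again once its own unreference() call has returned.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a reference-counted object; not Godot's Reference.
class Counted {
    std::mutex mutex_;
    int refcount_ = 1;  // the creator holds the first reference

public:
    // Take another reference. Fails once the count has reached zero,
    // so an object that is about to be deleted cannot be revived.
    bool try_reference() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (refcount_ == 0) return false;
        ++refcount_;
        return true;
    }

    // Decrement and test under a single lock. Returns true for exactly one
    // caller: the one that dropped the last reference.
    bool unreference() {
        std::lock_guard<std::mutex> lock(mutex_);
        assert(refcount_ > 0);
        --refcount_;
        return refcount_ == 0;
    }
};

// Delete only if this caller dropped the last reference. The caller must not
// use obj after this returns, whatever the result.
inline bool release(Counted *obj) {
    if (obj->unreference()) {
        delete obj;
        return true;
    }
    return false;
}

// Demo: n_threads (>= 1) threads each drop one reference; returns how many
// of them performed the delete.
int count_deleters(int n_threads) {
    Counted *obj = new Counted();  // count is 1
    for (int i = 1; i < n_threads; ++i) {
        obj->try_reference();      // one reference per thread
    }
    std::vector<int> deleted(n_threads, 0);
    std::vector<std::thread> threads;
    for (int i = 0; i < n_threads; ++i) {
        threads.emplace_back([obj, &deleted, i] { deleted[i] = release(obj) ? 1 : 0; });
    }
    for (auto &t : threads) t.join();
    int total = 0;
    for (int d : deleted) total += d;
    return total;
}
```

The decision to delete is taken once, inside the lock, and handed back as a return value, so no thread needs to re-read the count afterwards.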
Godot version:
Tag:
3.1.2-stable
OS/device including version:
Any multi-threaded environment.
Issue description:
I assume the reader is familiar with issues of data races. The effect of races is that they may cause rare crashes. I was compelled to study this after playing a game I was enjoying, but found it crashed after 30m-1h of gameplay.
As in #32081 (Data races when running Godot), the thread sanitizer highlights a number of problematic constructs.
Consider the following sequence involving two threads on an object.
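Initial state: refcount is 2.
1. Thread A: Ref<>::unref()
2. Thread A: Reference::unreference()
3. Thread A: die = false and refcount-- == 1
4. (Thread A suspends, e.g. after executing line 87, as can happen for all sorts of reasons at any point in the code at the whim of the CPU or operating system)
5. Thread B: Ref<>::unref()
6. Thread B: die = true, refcount-- == 0, enter memdelete()
7. The Ref<> is deleted and its memory is replaced with other content.
8. Thread A resumes: refcount.get() <= 1 is true
9. Due to step 8, all subsequent uses of this are undefined behaviour and can cause arbitrary memory corruption.
For example, get_script_instance() accesses a member variable of the ref, but the ref has been deleted and replaced with other content in step 7.
This sequence of events is very unlikely in any given case, but it can happen, and is not valid.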
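The interleaving can be replayed by hand on a single thread to see where it goes wrong. The sketch below is an illustration only: FakeRef, decrement and replay_touches_deleted_object are hypothetical names, and a flag stands in for memdelete() so that the replay itself stays well-defined.

```cpp
#include <cassert>

// Hypothetical stand-in for the shared object. A flag records deletion
// instead of freeing memory, so the replay is safe to run.
struct FakeRef {
    int refcount = 2;
    bool deleted = false;
};

// One thread's decrement; returns that thread's "die" flag.
bool decrement(FakeRef &r) {
    assert(r.refcount > 0);
    return --r.refcount == 0;
}

// Replays the steps in order and reports whether thread A's late
// re-read of the count leads it to use an object that is already gone.
bool replay_touches_deleted_object() {
    FakeRef r;
    bool die_a = decrement(r);         // thread A: die = false, count is now 1
    (void)die_a;
    // thread A is suspended here
    bool die_b = decrement(r);         // thread B: die = true, count is now 0
    if (die_b) r.deleted = true;       // thread B: memdelete()
    // thread A resumes and re-reads the count
    bool a_takes_branch = r.refcount <= 1;
    return a_takes_branch && r.deleted;  // A would now call members on a deleted object
}
```

In the real code the re-read in the last step is itself an access to freed memory, so the outcome is undefined rather than a clean "true".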
If my interpretation and understanding above is correct, it could manifest in rare and difficult to reproduce crashes.
I also want to tip my hat towards this comment in the code, which I think is indicative that it is known that this construct may be responsible for crashes, but the reasons haven't yet been pinpointed. What I have described above could be one such cause for these crashes (but no doubt there may be other issues hiding around this construct).
godot/core/reference.h
Lines 262 to 267 in 0587df4
Steps to reproduce:
Study code, think, and be aware of issues relating to reference counts, threading and race conditions.
What might a fix look like?
Presumably this condition:
godot/core/reference.cpp
Line 89 in 0587df4
if (refcount.get() <= 1/* higher is not relevant */) {
... is here because scripts may hold one of the references, and it needs to be notified. The exact intent from the comments is opaque to me. Reading the code alone it is very difficult for me to convince myself that it works as intended, given that there may be multiple threads executing in parallel.
I think it would be useful to discuss the relationship between these objects, and what purpose notifying the script has. What follows are some thoughts on this side.
It seems that once unref() has been called, one should not do any more member accesses. One way to avoid this would be to do the member accesses before calling unref(). The API at the moment involves feeding the whole object into e.g. CSharpLanguage::refcount_incremented_instance_binding, which it uses to query the refcount (something that should be determined once, atomically, along with the unref, throughout the whole process); it is also used to obtain the associated script binding get_script_instance_binding. I think these things could be stored up-front, then do the unref, then communicate to the managed side that the unmanaged side has gone, but without referring to unmanaged objects.
Something which makes this hard to analyse, I think, is that the refcount is effectively being used to store two pieces of overlapping information. 1) Is there a managed side to deal with, 2) the refcount.
It might make sense to disentangle these things. But care must be taken to only update all of the pieces of state consistently and atomically (e.g. using a mutex).
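As a rough sketch of the "store up-front, then unref, then notify" idea, with the managed-side information kept separate from the count: Node, ReleaseResult and release below are hypothetical names, not Godot API. The sketch assumes the binding handle is set once at construction and never changes, so it can be read without a lock before the decrement.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical object: the count and the managed-side handle are separate fields.
struct Node {
    std::atomic<int> refcount{1};
    void *binding = nullptr;  // assumed immutable after construction; null means no managed side
};

// Everything the caller needs afterwards, captured without touching the Node again.
struct ReleaseResult {
    int remaining;   // count as observed by this thread's own decrement
    void *binding;   // read before the decrement
    bool destroyed;  // true only for the thread that dropped the last reference
};

ReleaseResult release(Node *n) {
    void *binding = n->binding;  // member accesses happen first
    // fetch_sub returns the previous value, so the result of this one
    // operation is the only place the count is ever read.
    int remaining = n->refcount.fetch_sub(1, std::memory_order_acq_rel) - 1;
    assert(remaining >= 0);
    bool destroyed = (remaining == 0);
    if (destroyed) {
        delete n;
    }
    // From here on only the captured values are used, never *n.
    return ReleaseResult{remaining, binding, destroyed};
}
```

A caller could pass result.binding and result.remaining to whatever notifies the managed side, instead of passing the object itself.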
One final thought, an "obvious" solution is just to shove a mutex member variable around Reference::unreference so that only one unreference operation can take place per object at once. This too would make things a bit easier to reason about. But I think there are deeper issues surrounding the way this is written, and shoving a mutex in feels like a band-aid, compared to fixing issues such as multiple accesses to refcount.get(), which may be inconsistent and racy in a multi-threaded environment.