Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Racy construct in Ref<>::unref could lead to invalid memory accesses #35046

Open
pwaller opened this issue Jan 12, 2020 · 2 comments
Open

Racy construct in Ref<>::unref could lead to invalid memory accesses #35046

pwaller opened this issue Jan 12, 2020 · 2 comments

Comments

@pwaller
Copy link

pwaller commented Jan 12, 2020

Godot version:

Tag: 3.1.2-stable

OS/device including version:

Any multi-threaded environment.

Issue description:

I assume the reader is familiar with issues of data races. The effect of races is that they may cause rare crashes. I was compelled to study this after playing a game I was enjoying, but found it crashed after 30m-1h of gameplay.

As in #32081 (Data races when running Godot), the thread sanitizer highlights a number of problematic constructs.

Consider the following sequence involving two threads on an object.

Initial state: refcount is 2.

Step Thread 1 Thread 2
1 Call Ref<>::unref() -
2 Enter Reference::unreference() -
3 Set die = false and refcount-- == 1 -
4 (thread suspends, e.g. after executing line 87, as can happen for all sorts of reasons at any point in the code at the whim of the CPU or operating system) (thread wakes up)
5 - Call Ref<>::unref()
6 - Progress all the way through the unreferencing
7 - die = true, refcount-- == 0, enter memdelete()
8 - Object is deleted, it's no longer valid to access members of the Ref<>.
9 - e.g. Thread goes on and allocates new stuff in the space of the old Ref<>.
10 (thread resumes) -
11 refcount.get() <= 1 is true -
12 Due to step 8, all subsequent uses of this are undefined behaviour and can cause arbitrary memory corruption -
13 For example, get_script_instance() accesses a member variable of the ref, but the ref has been deleted and replaced with other content in timestep step 7.

This sequence of events is very unlikely in any given case, but it can happen, and is not valid.

If my interpretation and understanding above is correct, it could manifest in rare and difficult to reproduce crashes.

I also want to tip my hat towards this comment in the code, which I think is indicative that it is known that this construct may be responsible for crashes, but the reasons haven't yet been pinpointed. What I have described above could be one such a cause for these crashes (but no doubt there maybe other issues hiding around this construct).

godot/core/reference.h

Lines 262 to 267 in 0587df4

void unref() {
//TODO this should be moved to mutexes, since this engine does not really
// do a lot of referencing on references and stuff
// mutexes will avoid more crashes?
if (reference && reference->unreference()) {

Steps to reproduce:

Study code, think, and be aware of issues relating to reference counts, threading and race conditions.

What might a fix look like?

Presumably this condition:

if (refcount.get() <= 1 /* higher is not relevant */) {

... is here because scripts may hold one of the references, and it needs to be notified. The exact intent from the comments is opaque to me. Reading the code alone it is very difficult for me to convince myself that it works as intended, given that there may be multiple threads executing in parallel.

I think some it would be useful to discuss the relationship between these objects, and what purpose notifying the script has. What follows are some thoughts on this side.

It seems that once unref() has been called, one should not do any more member accesses. One way to avoid this would be to do the member accesses before calling unref(). The API at the moment involves feeding the whole object into e.g. CSharpLanguage::refcount_incremented_instance_binding, which it uses to query the refcount (something that should be determined once, atomically, along with the unref, throughout the whole process); it is also used to obtain the associated script binding get_script_instance_binding. I think these things could be stored up-front, then do the unref, then communicate to the managed side that the unmanaged side has gone, but without referring to unmanaged objects.

Something which makes this hard to analyse, I think, is that the refcount is effectively being used to store two pieces of overlapping information. 1) Is there a managed side to deal with, 2) the refcount.

It might make sense to disentangle these things. But care must be taken to only update all of the pieces of state consistently and atomically (e.g. using a mutex).

One final thought, an "obvious" solution is just to shove a mutex member variable around Reference::unreference so that only one unreference operation can take place per object at once. This too would make things a bit easier to reason about. But I think there are deeper issues surrounding the way this is written, and shoving a mutex in feels like a band-aid, compared to fixing issues such as multiple accesses to refcount.get(), which may be inconsistent and racy in a multi-threaded environment.

@KoBeWi
Copy link
Member

KoBeWi commented Dec 18, 2020

Can anyone still reproduce this bug in Godot 3.2.3 or any later release?

@pwaller
Copy link
Author

pwaller commented Jan 22, 2021

I took a quick look and I think so. My "reproducer" is to read the code, though, rather than to run the code.

  1. open reference.cpp on the master branch
  2. notice that the unref() logic appears to be the same as when I did my original analysis.
  3. a key problematic construct appears to be the unrefing followed by memdelete. Just imagine that two threads simultaneously execute this line, and one suspends after that line but the other suspends temporarily.
  4. the thread which continues executing concludes there are no references, deletes the thing.
  5. the suspended thread wakes up and calls get_script_instance(), which is now an invalid access because this has been deleted.

Really, all the work done in unref (decrement a counter, delete stuff) needs to be protected by a mutex.

The code executing memdelete also needs to be written so that it's guaranteed there can't be other threads alive holding references to the deletee when it comes to delete the object, for example here. The following algorithm needs to ensure that exactly one thing is accessing the reference count during that time, otherwise bad things will happen:

  1. decrement reference count
  2. test if refcount is zero
  3. if zero, delete

It's necessary to ensure that between (1) and (3), no-one can re-increment the refcount, or subsequently operate on the object once it's deleted. A good way to do that is to protect the whole operation by a mutex, and the same for modifications or tests against the reference count itself.

Hope this helps.

@YuriSizov YuriSizov modified the milestones: 4.0, 4.x Feb 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants