Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenXR: Change timing of xrWaitFrame and fix XR multithreading issues #89734

Merged

Conversation

BastiaanOlij
Copy link
Contributor

@BastiaanOlij BastiaanOlij commented Mar 21, 2024

This PR changes the timing at which xrWaitFrame is called.

This also makes a start at ensuring the OpenXR code is thread safe if the renderer is run in a separate thread.
It is important to note that much of the OpenXR data such as the instance and session are initialised before rendering commences and access should be safe. However there is also much data that can be overwritten during frame processing while the previous frame is rendering and I've only just started making that safe.

The change to xrWaitFrame should improve timing issues regardless of whether rendering is run in a separate thread or not.

For testing OpenXR with rendering happening in a separate thread, set this project settings:
image

Todos:

  • Fix xrWaitFrame timing
  • Fix thread issues with OpenXR rendering
  • Fix thread issues with OpenXRAPI cleanup (possibly a non-issue due to application life cycle)
  • Fix thread issues with XRServer.world_origin
  • Fix thread issues with XRServer.world_scale

Note: This has only been tested on the Vulkan renderer, running multi-threaded in OpenGL may have issues unrelated to the work we're doing here.

@BastiaanOlij
Copy link
Contributor Author

BastiaanOlij commented Mar 21, 2024

One of the issues I've run into is that call_on_render_thread requires a proper Godot Object to be used and OpenXRAPI is not such an object.

I may decide to turn OpenXRAPI into a proper Godot object. This may eventually make OpenXRAPIExtension obsolete.

Also there is a potential minor breakage in get_next_frame_time now actually returning the predicted display time of the next frame, not the current frame, now that we're properly handling xrWaitFrame.
It will print out a warning on a developer version if this is the case. A new vendors plugin will likely be needed once this is released. The fix is simply calling the new get_predicted_display_time function.

@BastiaanOlij BastiaanOlij force-pushed the openxr_reorder_wait_frame branch 3 times, most recently from ba57f21 to f6ff398 Compare March 26, 2024 04:28
@BastiaanOlij BastiaanOlij force-pushed the openxr_reorder_wait_frame branch 4 times, most recently from 74d1a8e to bb190d7 Compare March 30, 2024 14:02
@BastiaanOlij BastiaanOlij force-pushed the openxr_reorder_wait_frame branch from bb190d7 to f1b07a8 Compare March 31, 2024 13:01

void XRServer::_bind_methods() {
ClassDB::bind_method(D_METHOD("get_world_scale"), &XRServer::get_world_scale);
ClassDB::bind_method(D_METHOD("set_world_scale", "scale"), &XRServer::set_world_scale);
ClassDB::bind_method(D_METHOD("get_world_origin"), &XRServer::get_world_origin);
ClassDB::bind_method(D_METHOD("set_world_origin", "world_origin"), &XRServer::set_world_origin);
ClassDB::bind_method(D_METHOD("get_reference_frame"), &XRServer::get_reference_frame);
ClassDB::bind_method(D_METHOD("clear_reference_frame"), &XRServer::get_reference_frame);
ClassDB::bind_method(D_METHOD("clear_reference_frame"), &XRServer::clear_reference_frame);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note this is a bug that we missed for ages, it was pointing to the wrong method. This obviously breaks ABI, I'm assuming it is fine to add this as an exception?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I think an exception for this would be fine, since it was broken previously

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly the CI isn't complaining about it..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's pretty weird that the CI isn't complaining indeed. The method hash should have changed.

@BastiaanOlij BastiaanOlij marked this pull request as ready for review March 31, 2024 13:04
@BastiaanOlij BastiaanOlij requested review from a team as code owners March 31, 2024 13:04
@BastiaanOlij
Copy link
Contributor Author

Set this to ready for review, I think I've got all issues resolved in core except for the cleanup part. Right now I don't think this is a problem as we'll stop the rendering thread before we hit this code, but potentially there needs to be some more done there.

I also tested this on my Quest 3 and it seems to be working fine.

Copy link
Contributor

@dsnopek dsnopek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rendering is not my area, but looking at the code, this all makes sense to me at high-level. It should definitely be reviewed by someone more knowledgeable about rendering, though.

I tested on my Quest 3 using the demo project from godot_openxr_vendors on its master branch.

Everything seems to work when the using single threaded rendering, but I get a segfault on start up when using multi-threaded rendering. For some reason, the debug symbols aren't showing up in the stack trace (even though they should be there!) so I don't have much more in the way of details to share. I'll try to get some more info...

modules/openxr/openxr_api.h Outdated Show resolved Hide resolved
modules/openxr/openxr_api.h Outdated Show resolved Hide resolved
@BastiaanOlij BastiaanOlij changed the title OpenXR: Change timing of xrWaitFrame OpenXR: Change timing of xrWaitFrame and fix XR multithreading issues Apr 4, 2024
@BastiaanOlij
Copy link
Contributor Author

Small update here:

  • Further testing has shown this resolves a number of issues including crashing of XR applications when multithreading is enabled
  • The artifact issue we've discovered on Quest 3 is not resolved, even is more clear when multithreading is used. This issue does not happen on PCVR, Quest Pro nor HTC Elite XR headsets and might even be unique to Quest 3. My conclusion at this point is that the cause of this lies elsewhere and should not stop us from merging this work which is important.
  • We should probably merge Add support for OpenXR composition layers #89880 first and then spend some time refactoring some of the swapchain logic as this clashes with changes here and in OpenXR: Cleanup swapchain logic (was Fix render target multiplier) #87466

@xwovr
Copy link

xwovr commented Apr 22, 2024

@dsnopek I am glad that you located the root cause of the flickering issue 💯 And yes, I believe the xrWaitSwapchainImage fix worths its own PR since it would fix other issues which would have different triggers.

@BastiaanOlij
Copy link
Contributor Author

@dsnopek no, that's not the correct fix. This is the whole reason this PR introduces the render_state struct. The rendering thread doesn't read from frame_state.shouldRender, it reads from render_state.should_render which is a copy of that value made before that frame gets rendered.

We do this for a lot of state where the main thread may change the value for the current frame being prepared, while the rendering thread should still be using the previous value while rendering the frame it's rendering (or even the value from before, the way this works is that the changes are added to a queue that is executed right before we render a frame).

So it should be completely safe to set frame_state.shouldRender to false. The code running in the render thread should never see that as we're currently rendering the previous frame. We don't start rendering this frame with the new values until we're finished with the current frame.

So this suggest we're either still using frame_state.shouldRender in a method executed on the rendering thread, or there is something wrong in the way the rendering thread determines whether it should start rendering a new frame.

I was going to have a meetup with @reduz to discuss more about the internal workings of the rendering thread setup, so I get a better understanding of that last bit.

@BastiaanOlij
Copy link
Contributor Author

@dsnopek so just to clarify, set_render_display_info(frame_state.predictedDisplayTime, frame_state.shouldRender); is the moment we copy the resulting values from the main thread to the render thread, and that call gets queued and won't actually be executed until the render thread is ready to render this frame.

So the code running on the render thread should remain blissfully unaware we set frame_state.shouldRender momentarily to false.

@dsnopek
Copy link
Contributor

dsnopek commented Apr 22, 2024

@BastiaanOlij:

So this suggest we're either still using frame_state.shouldRender in a method executed on the rendering thread, or there is something wrong in the way the rendering thread determines whether it should start rendering a new frame.

It is checking frame_state.shouldRender on the render thread.

In OpenXRAPI::pre_draw_viewport() (which runs on the render thread) there is:

	if (!can_render()) {
		return false;
	}

... and the definition of OpenXRAPI::can_render() is:

	_FORCE_INLINE_ bool can_render() { return instance != XR_NULL_HANDLE && session != XR_NULL_HANDLE && running && frame_state.shouldRender; }

What should we be doing here instead?

@dsnopek
Copy link
Contributor

dsnopek commented Apr 22, 2024

@BastiaanOlij What about this?

diff --git a/modules/openxr/openxr_api.cpp b/modules/openxr/openxr_api.cpp
index 4ced4196d1..150d7eb726 100644
--- a/modules/openxr/openxr_api.cpp
+++ b/modules/openxr/openxr_api.cpp
@@ -2156,7 +2156,7 @@ bool OpenXRAPI::pre_draw_viewport(RID p_render_target) {
        // We found an XR viewport!
        render_state.has_xr_viewport = true;
 
-       if (!can_render()) {
+       if (instance == XR_NULL_HANDLE || session == XR_NULL_HANDLE || !running || !render_state.should_render) {
                return false;
        }
 
diff --git a/modules/openxr/openxr_api.h b/modules/openxr/openxr_api.h
index 5e811879be..dc2e7feaf7 100644
--- a/modules/openxr/openxr_api.h
+++ b/modules/openxr/openxr_api.h
@@ -445,7 +445,6 @@ public:
        _FORCE_INLINE_ XrSpace get_play_space() const { return play_space; }
        _FORCE_INLINE_ XrTime get_predicted_display_time() { return frame_state.predictedDisplayTime; }
        _FORCE_INLINE_ XrTime get_next_frame_time() { return frame_state.predictedDisplayTime + frame_state.predictedDisplayPeriod; }
-       _FORCE_INLINE_ bool can_render() { return instance != XR_NULL_HANDLE && session != XR_NULL_HANDLE && running && frame_state.shouldRender; }
 
        XrHandTrackerEXT get_hand_tracker(int p_hand_index);
 

I haven't had a chance to test it, but it uses render_state.should_render instead of frame_state.shouldRender.

@BastiaanOlij
Copy link
Contributor Author

@xwovr , @dsnopek as to the xrWaitSwapchainImage fix, I think we should do that regardless. Note the use of XR_UNQUALIFIED_SUCCESS and then only fully erroring out on XR_FAILED and if so we skipped rendering the frame and just went to the next one. It makes far more sense to wait longer.

This rarely happens, the situation we very often ran into this causing issues was when you take a screenshot or start a recording, that is often where the Quest hangs onto the current image it's processing in the swapchain a little longer and we get a reason to have to wait before we can render to that image again.

But the 17000000 came from my misunderstanding many years ago about how this should work. Even just setting it to 1000000000 and then falling back to our current skip rendering logic would be enough. If we're waiting for more than a second for the next swapchain image to become available, we already have a big issue on our hands. But waiting for just 17ms is indeed waaaay too short.

Note also that this code is called from the rendering thread (if (!render_state.main_swapchains[i].acquire(render_state.should_render)) {) and is thus using render_state directly.

@BastiaanOlij
Copy link
Contributor Author

BastiaanOlij commented Apr 22, 2024

I haven't had a chance to test it, but it uses render_state.should_render instead of frame_state.shouldRender

Good catch, I had a problem around there where I lost a bunch of work during a rebase, I had removed all calls to can_render and instead put the check in there so its absolutely clear what is being checked., I just left can_render for backwards compatibility for any code outside of the module that may still call it (mainly GDExtension).

Obviously that change got lost and this is indeed wrong. pre_draw_viewport is called from the rendering thread and should never access frame_state.shouldRender

(same for running btw, which equally has a companion in render_state)

@BastiaanOlij BastiaanOlij force-pushed the openxr_reorder_wait_frame branch from 2c887b9 to beb6248 Compare April 23, 2024 00:16
@BastiaanOlij
Copy link
Contributor Author

Ok, rebased and applied both the swapchain fix and fixed up pre_draw_viewport. Also put a render thread guard onto can_render.

Haven't tested yet (still compiling) but just so @dsnopek can check this again as well.

@BastiaanOlij BastiaanOlij force-pushed the openxr_reorder_wait_frame branch from beb6248 to 03547c6 Compare April 23, 2024 00:55
@dsnopek
Copy link
Contributor

dsnopek commented Apr 23, 2024

@BastiaanOlij:

Ok, rebased and applied both the swapchain fix and fixed up pre_draw_viewport. Also put a render thread guard onto can_render.

Thanks! The latest is working great in my testing :-)

It looks like you still haven't addressed Clay's review above, but otherwise, this is a definite improvement, and I think worth moving forward with.

@BastiaanOlij
Copy link
Contributor Author

It looks like you still haven't addressed Clay's review above, but otherwise, this is a definite improvement, and I think worth moving forward with.

Seeing it looked like there was still too much going on for this to be merged into 4.3, I was giving some other things priority.
For clay I think the only outstanding thing is the physics thing, I need to find a moment to have a closer look at that.

There are also still some questions needing answering on cleanup, we can accept how things work right now and I think it will be safe but still. Though we can also treat that as a future thing to improve if it actually turns out to be an issue.

Also I have had no chance at all to look into any issues with the compatibility renderer.

@dsnopek
Copy link
Contributor

dsnopek commented Apr 23, 2024

Also I have had no chance at all to look into any issues with the compatibility renderer.

Personally, I don't think we necessarily need to fix multi-threaded rendering with the compatibility renderer in this PR: it's broken before this PR, and stays broken after. So, I think it'd be OK to continue working on that in follow-up PRs.

However, single-threaded rendering with the compatibility renderer still works with this PR, and I think there's improvements here even for single-threaded rendering.

@BastiaanOlij
Copy link
Contributor Author

Personally, I don't think we necessarily need to fix multi-threaded rendering with the compatibility renderer in this PR: it's broken before this PR, and stays broken after. So, I think it'd be OK to continue working on that in follow-up PRs.

However, single-threaded rendering with the compatibility renderer still works with this PR, and I think there's improvements here even for single-threaded rendering.

Yup, hey I'm all for including it in 4.3, and as long as you don't enable multithreading the impact is very low.
Also we're not really changing any of the outward APIs.
I do think we got all the important ground work done.

But at the same time I don't want to rush this.

@BastiaanOlij BastiaanOlij force-pushed the openxr_reorder_wait_frame branch from 03547c6 to 4d216b8 Compare April 30, 2024 01:21
@BastiaanOlij
Copy link
Contributor Author

@akien-mga did a rebase of this and a few final tests. Discussing this with @dsnopek and @clayjohn we believe this is worth merging for 4.3.

Right now rendering until multi-threading is broken and this is a step in the right direction, there is still plenty to do in the rendering engine itself, so multi-threading remains an experimental mode. We're not expecting any regressions in single threaded mode so we feel its a pretty safe merge.

@akien-mga akien-mga modified the milestones: 4.x, 4.3 Apr 30, 2024
doc/classes/RenderingServer.xml Outdated Show resolved Hide resolved
modules/openxr/openxr_api.h Outdated Show resolved Hide resolved
modules/openxr/openxr_api.cpp Outdated Show resolved Hide resolved
@BastiaanOlij BastiaanOlij force-pushed the openxr_reorder_wait_frame branch from 4d216b8 to cbab7dc Compare May 1, 2024 04:24
@akien-mga akien-mga merged commit 70247ad into godotengine:master May 1, 2024
16 checks passed
@akien-mga
Copy link
Member

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Godot OpenXR, improved handling of state
7 participants