-
-
Notifications
You must be signed in to change notification settings - Fork 21.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RenderingDevice
- texture
functions are no longer allowed in background threads due to thread guards.
#99750
Comments
cc @DarioSamo , you might have some insight to this? I haven't had time to fully dive into what is going on here but I imagine we plugged some holes here especially when rendering runs on a separate thread. |
Yes, the main change that has been made, for thread-safety and correctness reasons, is that a lot of RenderingDevice calls are no longer allowed. The main reason is that RenderingDevice used to do a ton of concurrent work by just doing a Mutex basically everywhere and hoping for the best. This also meant that even if you had what looked like it was asynchronous generation working, it was in fact not asynchronous at all and merely slotting in the recording of the current frame wherever it could and getting done in the current frame, making your frame actually take longer than it should. True asynchronous GPU work takes advantage of the fact that the hardware offers multiple queues to execute the work and has a much lesser impact on the work produced on the current frame. This is why 4.4 introduces the use of transfer queues that actually run in parallel to the game, which gave a huge improvement to loading speeds in V-Sync'd scenarios.
The problem here is this gives the false appearance that it was working reliably, while in fact it was doing it in a way that introduces a ton of possible validation errors that it is practically impossible for us to assess. GPU Validation is a bad area to risk having problems because the problems are not reliably reproduced and might show up differently on different hardware. A major change that 4.4 brings is the possibility to actually use different command queues that the driver offers, but RenderingDevice has no API to expose this at the moment. The closest we have is creating a local rendering device, but sadly that does not allow you to share resources with the main one as it does create an entirely new different driver in the background. I realize this might be perceived as a regression in the eyes of users, but rolling back to the old design would keep hiding away the problem instead of addressing it properly. It was an explicit request by @reduz to introduce thread guards, so I think he should weigh in suggesting the API we should use to provide some functionality like this again to do it properly. As for unblocking your approach in 4.4 in some way even if it's not the best solution, my suggestion would be to use the mechanism GDScript provides to defer calls to the main render thread. As I said, it was never actually processing the images in an asynchronous way anyway, as all it did was throw the processing that you requested into the main render queue for the current frame (which introduced several race conditions), so you can very much operate with a similar logic that just does deferred calls to the render thread and checks on the data on the next frame. This is what I mean by the fact that it wasn't truly asynchronous anyway, and it was pretty evident because assets loaded in background threads were never actually loaded and had to wait for V-Sync all the time as they'd just get thrown in whatever current frame was active and had to wait it out. The same problem was probably happening with the effect you mention. |
Thanks for looking into it so quickly 👏 Could you give me a quick example on how I would implement "defer calls to the main render thread" or a working approach using mechanisms you describe? I can also reach out in the godot chat if that is preferred, but maybe it would be better to have it here in case someone else comes across this problem? |
I think one big issue is that deferred calls might not actually help much if the use case is 'texture_get_data' because that function in particular has a bit of a design problem, where it has an extremely heavy amount of CPU work from copying the data back to RAM while the GPU itself has to do very little work. So if that function is now locked to being called in the render thread, it'll cause a big stall regardless of us moving to the more correct approach. One approach we discussed with @clayjohn is that functions such as get_data() would benefit from having asynchronous versions that you can call and query at a later point, allowing the data copy to happen independently from the rendering process once the GPU guarantees the texture or buffer is ready for it. But to give you an example of something that was even worse, by using this function on a background thread, calling any function like get_data() basically requires flushing all of the queued frames of the game, pretty much demolishing any performance whatsoever the render thread was doing in being able to queue frames and keep the GPU busy! These are the kind of problems that not being explicit about what should be kept on the render thread causes, and it's completely hidden away from the user in the older approach. Perhaps you can explain your problem space a bit better so we can avoid having to do this on the CPU and keep it entirely on the GPU instead. Is there a particular reason you need to download the image to the CPU and manipulate it there? Could it be perhaps achieved through compute shaders? And at what frequency do you need to fetch the viewport image. is it per frame or in an interval of time? |
Here is one example:
Godot_v4.4-dev5_win64_tLvo061xXo.mp4In another example, visibility splats are generated when the player walks around which uncover the map.
|
@RPicster I can see why you'd go the route of needing CPU manipulation if it's a gameplay mechanic. Is the entire reason you went with the asynchronous approach because the get_data() routines were causing stalls? In that case it seems like the discussion should be less centered around being able to do rendering commands in a separate thread but rather, being able to request the data asynchronously. As you're clearly able to hide the latency for gameplay reasons, it sounds like you'd be better served by a scheme where you can do this sort of work in the main thread.
The main benefit of this approach would be that you no longer need to stall the GPU and wait for frames to be flushed compared to 4.3, which would give you higher performance. |
Yes, you absolutely hit the nail on the head. In the end this was just a solution I found for the problem that is stalling.
|
Alright, I'm not super aware about the details on how to implement something like this, but the idea has floated around long enough in our heads that it'd be useful to extend these methods to support this type of workflow. It seems 4.4's changes warrant they're basically a necessity at this point. CC @clayjohn I don't think it'd take much work to get this done in time for 4.4 if it'd provide a solution to users who relied on this functionality to work around the lack of the function in the first place. |
That would be a amazing. Thanks for all the feedback, I guess until anything new is ready I will get back to use Should I close this issue then? |
No, I'd suggest however making the title a bit clearer and update it with the solution we've discussed for any other maintainers who might see it and be unaware. |
RenderingDevice
- texture_copy
not working anymoreRenderingDevice
- texture
functions are no longer allowed in background threads due to thread guards.
This would indeed need a new API. In OgreNext we have AsyncTextureTicket which is conceptually similar to a Javascript promise: the user asks the engine it wants to download GPU data, and the engine returns a ticket (i.e. a handle). The user now with that ticket can either chose to do a blocking download (which will stall), or get notified when the data is ready, receiving a callback, or poll the ticket until it returns ready. Edit: Probably the only issue is how to determine that the ticket is no longer needed and can be thrown away (reference counted? user calls a free() function? the ticket becomes invalid after data has been obtained?). Edit 2: In OgreNext tickets have 2 accuracy modes: inaccurate and accurate. Inaccurate means you have no problem always waiting a few frames (thus the engine can use the per frame fences to know if the download is ready). Accurate means we issue a fence to know when the download is ready. Useful when you need the data ASAP (but the user may still end up getting the data a few frames later if the GPU is too far behind the CPU, which increases the variance of waiting based on GPU load). |
@darksylinc I think we can make things quite simple for Godot because the current API just simply returns a PackedByteArray (which is problematic already because it requires two texture copies instead of one). Right now you call Instead we can add This implementation should be familiar to Godot users and very simple for us to implement. Later, we could even add an option to issue a fence or simple wait The downside of this approach is that it maintains our current approach of |
@DarioSamo @RPicster With respect to calling things on the main thread. We actually have a mechanism for this built in already For an example of how that looks in practice, you can see my cloud demo: https://github.com/clayjohn/godot-volumetric-cloud-demo-v2/blob/c65cb2b6e5aa05205852bcf30844a8e600d683ba/cloud_sky/cloud_sky.gd#L117 |
@clayjohn On the risk of going slightly off topic: Since Dario's change broke multiple projects*; is it worth naming next version 4.4? But rather 5.0? * The breaking change is not an API or ABI change but rather observed behavior in the details. But ultimately backwards compatibility is broken. As Dario mentioned, the way Godot allowed some users to do things in <= 4.3 would lead to crashes, device losts; and in a way that was very hard to track down. So it's not something that can easily be reverted. |
The amount of users who are direct users of RenderingDevice along with background threads is far too low to consider such a version change and we've done a lot worse breaking changes API-wise over the 4.x versions, so I'd say we're fine. Even on this particular case we've found the answer to the problem wasn't the fact that RenderingDevice has this new restriction but rather that the API was lacking a solution to download the image in the proper way. The OP's methodology has a ton of performance implications we wouldn't want to support in the long term anyway, as it effectively nullifies any amount of work that goes into optimizing our GPU throughput. |
@darksylinc We don't follow Semvar too strictly. We allow some amount of breakage between minor versions. Godot 5.0 will be when we want to majorly rework the architecture again and drop support for a lot of older and lower-end devices. Also, backwards compatibility is only broken for certain projects that were abusing workarounds that we had in place due to our unfinished architecture. The original design of the RD wasn't thread safe. As users tried to call it from different threads, we added a ton of mutexes so their game wouldn't crash. But that wasn't our original design plan, that was a workaround to avoid crashes. Then, users began to rely on the mutexes and do more stuff from separate threads. Projects that are written to use our intended design (like my cloud project listed above) don't require any changes. To be clear, this is totally our fault, but I think it is an acceptable change for a minor version |
There's now a work in progress PR to address this issue. |
I've tested the MRP on 2 devices now to figure out what the issues are. I have run into a number of problems. But before that, I should point out that the MRP doesn't use multiple threads. So it doesn't even engage much of the problems discussed in the OP. First, on all devices I get
This appears to be a regression in how TypedArrays are handled in GDScript. The PackedByte Array is getting converted into an array of integers instead of an Array of PackedByteArrays. The error comes from this line of code:
This may never have actually been valid GDScript. At any rate the correct way to call this is with an array like so:
Next, on MacOS I get an error that depth textures can't use the
Finally, on Linux in 4.4 I get the following:
This error goes away when removing the |
Are you sure that's the right PR? That PR was very harmless as it only added debug routines for diagnostics.
That sounds like #98390, which remains unfixed as of today. To check if it that's the problem, you can apply this patch (note: this is NOT a real fix. It's not a proper fix. But it will shut up validation): diff --git a/servers/rendering/renderer_rd/renderer_canvas_render_rd.cpp b/servers/rendering/renderer_rd/renderer_canvas_render_rd.cpp
index 8367f6e29d..f073bf6676 100644
--- a/servers/rendering/renderer_rd/renderer_canvas_render_rd.cpp
+++ b/servers/rendering/renderer_rd/renderer_canvas_render_rd.cpp
@@ -409,7 +409,7 @@ RID RendererCanvasRenderRD::_create_base_uniform_set(RID p_to_render_target, boo
RD::Uniform u;
u.uniform_type = RD::UNIFORM_TYPE_TEXTURE;
u.binding = 4;
- u.append_id(state.shadow_texture);
+ u.append_id(state.shadow_depth_texture);
uniforms.push_back(u);
}
@@ -1898,10 +1898,10 @@ RendererCanvasRenderRD::RendererCanvasRenderRD() {
tf.texture_type = RD::TEXTURE_TYPE_2D;
tf.width = 4;
tf.height = 4;
- tf.usage_bits = RD::TEXTURE_USAGE_SAMPLING_BIT;
- tf.format = RD::DATA_FORMAT_R32_SFLOAT;
+ tf.usage_bits = RD::TEXTURE_USAGE_SAMPLING_BIT|RD::TEXTURE_USAGE_DEPTH_STENCIL_ATTACHMENT_BIT;
+ tf.format = RD::DATA_FORMAT_D32_SFLOAT;
- state.shadow_texture = RD::get_singleton()->texture_create(tf, RD::TextureView());
+ state.shadow_depth_texture = RD::get_singleton()->texture_create(tf, RD::TextureView());
}
{ |
I don't think so. That validation layer error has been around for a long time. It is present in 4.3. The reason I think it is #90993 is because we switched from using Also, we switched |
An udpate from my side: I tested #100110 and it absolutely solves the challenge I had in my project in a very graceful way. |
We could probably split those other regressions I noticed into new bug reports so they can be tracked separately since the main reason for this report is now fixed. |
Tested versions
System information
Godot v4.4.dev5 - Windows 10.0.19045 - Single-window, 2 monitors - Vulkan (Forward+) - dedicated NVIDIA GeForce RTX 4080 (NVIDIA; 32.0.15.6094) - AMD Ryzen 7 3700X 8-Core Processor (16 threads)
Issue description
A small intro: I tried talking to @clayjohn and @BastiaanOlij about this behaviour, but so far I couldn't find out anything that enables me to move forward with it.
The setup in the MRP is what I essentially use in Engraving in different parts of the game to create an Image from a viewport texture in a asynchronous way, without stalling the game like a
viewport.get_texture().get_image()
would. For a lot of things it's not critical that the image is instantly available but it's more important to have a smooth game experience.With the release of 4.4 this approach does not work at all anymore and keeps generating error messages, it works reliable in 4.3 stable.
My game is heavily depening on having a smooth way of creating an image from the players drawings to further process this image in real-time, so this change keeps me from updating the version and doing further testing on 4.4 (or even release in 4.4+).
If it is not possible to continue with this approach it would be greatly appreciated to find an alternative way of achieving async image generation from a viewport - which is also useful in a lot of other scenarios (vfx using viewports as textures/buffers).
Steps to reproduce
Checking out the MRP would be the easiest :)
Minimal reproduction project (MRP)
rd_copy_texture.zip
EDIT
After the discussions in this issue, the main challenge appeared to be getting texture / image data from the GPU to the CPU without a stall. It's not a regression per se - how I used to do what I needed to seemed to be a very flawed approach.
The best solution would be a new function, like suggested here by Dario or here by myself.
The text was updated successfully, but these errors were encountered: