-
-
Notifications
You must be signed in to change notification settings - Fork 21.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement RD::buffer_get_data_async()
and RD::texture_get_data_async()
#100110
Conversation
Would it be possible to make this work with the await syntax in gdscript? |
Summoning others who might be interested and could provide feedback on whether this works for them: |
c1629a7
to
14bbb8c
Compare
Maybe, I'm not familiar enough to say. But our public API needs to work for all supported languages (GDScript, C#, GDExtension). So we need to bind it in a way that is language-agnostic. |
Will take a look later I was more looking at methods of linking buffers between different shaders to keep things from round tripping. Back burner now. This looks like a nice upgrade will peek though code tonight and get back to comment more 👍 |
At which point would the callback be called, and on which thread? For example, when using the main renderer's RenderingDevice, is it called somewhere on the main thread during the main loop? What about when using a custom RenderingDevice in a custom thread? In my case, I use a separate RenderingDevice on a custom thread, mainly for the following reasons:
Considering this use case, I wonder if I would still get benefits from using async methods? Another minor thing I'm wondering, is how to use this Callable in a GDExtension, in a context where none of my functions belong to a Godot object (i.e the code only has a reference to RenderingDevice and maybe a few resources, but everything else is "pure" C++ classes) |
Looks really good but I need to take some more time to dissect |
RD::buffer_get_data_async()
and RD::texture_get_data_async()
Render thread (whoever is responsible for the RenderingDevice), as it's the one that checks for the fences.
If you're using local RD, all it means as far as I know is it'll trigger the callbacks on sync(), which might still be pretty helpful! You can build a large batch of get_data calls and process them all at once instead of making the GPU actively wait on each one. |
As discussed on the Godot Dev Chat, this PR could help tremendously the implementation of WebGPU, due to the async nature of GPU accesses by WebGPU. |
So in the case of the main renderer, if it's in multithread mode the user has to still make stuff thread-safe if the downloaded data needs to be stored in say some node of the game?
I'm already creating batches, I don't schedule work one by one. But does that mean then that I don't need to build batches at all, just submit things individually and then get results using async? |
I'll run it in my terrain system to retrieve collision data (currently bake time only as there's a slight hitch due to retrieving data) and get back with the results. |
Results of my tests. First Visual examples, notice the 1% lows, as right now the issues caused by retrieving the texture the standard way do not impact averages very much, and instead they result in stutters, this is the best way to see the results in action, if you just move around in the world it definently feels a lot better with async, the stutters are greatly reduced, and this is still saving the binary data to disc, which if I am not doing reduces the stutters even further. Already a noticeable difference. Benchmarks are next, I ran it twice, once with the standard method, and once with a naïve approach to the async method, I say naïve because it was pretty much just slotted in right where the previous one was, so as to disturb the flow as little as possible, I believe you could probably get better performance with a cleaner implementation, but for the sake of testing it gets the job done. Take note of the 0.1% lows, the new method has an uplift of these by roughly 60% For validation I also tested the outputs using FC and seemingly the data is exactly the same, so no errors or anything as far as that goes: Now, of note, during all this my computer BSODed, I was unable to reproduce, and considering some of the other stuff in my project, I doubt it was caused by this commit. Besides all that, in general I would say this is a straight improvement, and besides the block size (which needs some proper documentation for sure), I don't see any issues with it, and on a more personal note, this might make runtime generation/modification of collisions and later foliage data viable for my terrain, so very nice indeed. Excellent work! |
@Bonkahe Thank you very much for the testing! Regarding block size, it's a bit of a double-edged sword if I go down the route of implementing it as the regular upload does, which is why the restriction exists. When the block size is too small, the GPU has to do the copies in a more inefficient manner in smaller chunks, and this is something that used to happen a lot before the transfer queues improvement recently. The solution I'm looking into is implementing this behavior so it solves it automatically. But you'll get an improvement from increasing the block size and the region size in the future if you fine tune it to your needs like you had to do right now. That's why I'm a bit hesitant to make it part of the documentation until I sort out the implementation first. |
cf63221
to
adc584b
Compare
I've lifted the restriction from the PR concerning the block size, now the function behaves exactly like update() does which it uses the copies by regions with the same logic. Please do test if it works correctly, at least my test for downloading a larger texture and double-checking a particular pixel has worked fine. func _init():
...
format = RDTextureFormat.new()
format.width = 180
format.height = 180
format.format = RenderingDevice.DATA_FORMAT_R8G8B8A8_UNORM
format.usage_bits = RenderingDevice.TEXTURE_USAGE_SAMPLING_BIT | RenderingDevice.TEXTURE_USAGE_CAN_COPY_FROM_BIT
pixel_offset = (175 * format.height * 4) + 137 * 4;
data = [ PackedByteArray() ]
data[0].resize(format.width * format.height * 4)
data[0].encode_u32(pixel_offset, 0x00FF00FF)
view = RDTextureView.new()
texture = rd.texture_create(format, view, data)
texture_value = 0
pass
func _texture_get_data_callback(array):
texture_value = array.decode_u32(pixel_offset)
pass Keep in mind the performance limitations I've described before will apply. For example, this script triggers 9 separate texture copies to buffers, which is generally pretty inefficient, but that's just what the defaults of the region size at 64x64 will do. I introduced a new separate setting to control the region size of downloads, but of course you may be required to increase the size of the blocks if the region is too big or the format is too big. |
d6b5144
to
b798097
Compare
Updated documentation as well. |
So I did another round of testing and (barring two standout benchmark, which I'll go over in a minute) there was negligible difference in min framerate, this is far from scientific for sure, but it seemingly is pointing to a situation where having your solution to block size works for the most part, with probably some slight performance to be gained by manually accounting for it if you know what your doing. Benchmarks: And upon verification of data output it was 1:1 again, so no regressions there. For the standout benchmarks, I removed one benchmark that was after I changed the block size, seemingly when you change the block size there's a slight hitch shortly after, I assume as the system handles the update, this resulted in 0.1% lows at like 50% of what a normal run was. All in all the update to block size is all around an excellent improvement, it will probably work for the most part, and the people who are wanting to optimize everything perfectly can go dig it up, I think the documentation should probably reflect this, with the tooltip noting that for the majority of use cases, leaving it be will work just fine. Hope this was some help. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks great to me and I tested it locally with the MRP from #99750 and confirmed that it works correctly!
It just needs a rebase before merging (which I will do since Dario is on vacation)
b798097
to
1f68972
Compare
1f68972
to
f7fa688
Compare
f7fa688
to
054891d
Compare
Thanks! |
I also tested this on my project and it works like a charm, even makes a lot of the old code unneccessary. |
Does this allow retrieving a viewport texture asynchronously? |
Yep! |
Background
This implements the feature proposed in the discussion originated by #99750.
As proposed by Clay here.
This PR does exactly that for
buffer_get_data
andtexture_get_data
. These now have asynchronous variants which will trigger the callback when the frame the data retrieval was requested is flushed. This opens a way for users to request data downloads back to the CPU without introducing heavy stalls on their game that affect performance.As seen on the issue, this also fixes a perceived regression caused by the introduction of thread guards for
RenderingDevice
in #90400, where users were allowed to previously access these functions from another thread and request the data without blocking the main thread, but being unaware they were causing the entire device to flush the frames out and reduce performance in the process.Here's an example script that showcases how to use these functions.
CC @RPicster who brought up the issue originally and may wish to test this out.
Notes
This implementation piggybacks off the fact we already have a structure for staging buffers for uploading data from the CPU to the GPU. This duplicates that concept for the exact opposite purpose, but there's no separate configuration setting for it.
There's also an implied restriction, which will currently error out, if the staging buffer's block size is too small to handle one image mipmap. This was mostly done to reduce the complexity of the implementation, but it's not impossible to fix it to allow the function to do it in image chunks instead that can fit inside of the staging buffer's block size. Handling it like this might be implemented at a later point if it's deemed necessary.This restriction has been lifted, but performance can degrade if the block size or region size is too small for the large downloads.TODO:
Contributed by W4 Games. 🍀