Implement `RD::buffer_get_data_async()` and `RD::texture_get_data_async()` #100110

DarioSamo · 2024-12-06T18:19:25Z

Background

This implements the feature proposed in the discussion that originated in #99750.

As proposed by Clay here.

> Instead we can add `texture_get_data_async(RID, callback)` which returns immediately, but calls the callback once the data is ready. The callback would always be a function with the structure `func some_name(PackedByteArray)` where the user can then do what they will with that array, the same as if they had called `texture_get_data()`.

This PR does exactly that for buffer_get_data and texture_get_data. These now have asynchronous variants which trigger the callback when the frame in which the data retrieval was requested is flushed. This gives users a way to request data downloads back to the CPU without introducing heavy stalls in their game that affect performance.

As seen on the issue, this also fixes a perceived regression caused by the introduction of thread guards for RenderingDevice in #90400. Previously, users could call these functions from another thread and request the data without blocking the main thread, unaware that doing so forced the entire device to flush its frames and reduced performance in the process.

Here's an example script that showcases how to use these functions.

extends Node2D

var rd
var in_counter = 0
# Starts at 0 so the label can be formatted before the first callback arrives.
var out_counter = 0
var buffer
var in_array
var out_array
var data
var format
var view
var texture
var texture_values

func _init():
	rd = RenderingServer.get_rendering_device()
	buffer = rd.storage_buffer_create(4)
	in_array = PackedByteArray()
	in_array.resize(4)

	# 4x4 single-channel texture: 16 bytes of initial data.
	data = [ PackedByteArray() ]
	format = RDTextureFormat.new()
	format.width = 4
	format.height = 4
	format.format = RenderingDevice.DATA_FORMAT_R8_UINT
	# CAN_COPY_FROM is set so the texture can be read back.
	format.usage_bits = RenderingDevice.TEXTURE_USAGE_SAMPLING_BIT | RenderingDevice.TEXTURE_USAGE_CAN_COPY_FROM_BIT

	view = RDTextureView.new()
	data[0].resize(16)
	data[0].encode_u32(0, 0xFF0000FF)
	data[0].encode_u32(4, 0xFF0000FF)
	data[0].encode_u32(8, 0x00FF00FF)
	data[0].encode_u32(12, 0x0000FFFF)
	texture = rd.texture_create(format, view, data)
	texture_values = [0, 0, 0, 0]

# Called with the downloaded bytes once the requesting frame has been flushed.
func _buffer_get_data_callback(array):
	out_counter = array.decode_u32(0)

func _texture_get_data_callback(array):
	texture_values[0] = array.decode_u32(0)
	texture_values[1] = array.decode_u32(4)
	texture_values[2] = array.decode_u32(8)
	texture_values[3] = array.decode_u32(12)

func _process(_delta):
	in_counter += 1
	in_array.encode_u32(0, in_counter)
	rd.buffer_update(buffer, 0, in_array.size(), in_array)
	# Both calls return immediately; the results arrive later through the callbacks.
	rd.buffer_get_data_async(buffer, _buffer_get_data_callback, 0, in_array.size())
	rd.texture_get_data_async(texture, 0, _texture_get_data_callback)

	$Label.text = "Counter: %X Buffer Value: %X\n(these shouldn't match if it's async)\nTexture Values: %X %X %X %X" % [in_counter, out_counter, texture_values[0], texture_values[1], texture_values[2], texture_values[3]]
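
The reason the label's two counters are expected to differ can be pictured with a small model. This is a sketch in plain Python, not engine code: `FRAMES_IN_FLIGHT` and `simulate` are hypothetical names, and the fixed two-frame delay before a frame counts as flushed is an assumption made only for illustration.

```python
from collections import deque

FRAMES_IN_FLIGHT = 2  # Assumed delay, in frames, before a request's frame is flushed.

def simulate(frame_count, frames_in_flight=FRAMES_IN_FLIGHT):
    """Return the (in_counter, out_counter) pair the label would show on each frame."""
    pending = deque()  # Entries are (frame_requested, value_at_request_time).
    in_counter = 0
    out_counter = 0
    shown = []
    for frame in range(frame_count):
        # Callbacks for frames that have been flushed run before this frame's work.
        while pending and frame - pending[0][0] >= frames_in_flight:
            _, out_counter = pending.popleft()
        in_counter += 1
        pending.append((frame, in_counter))  # The request itself returns immediately.
        shown.append((in_counter, out_counter))
    return shown

history = simulate(5)
print(history)
```

In this model the value delivered to the callback always trails the value just written, which is the mismatch the label text describes.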

CC @RPicster who brought up the issue originally and may wish to test this out.

Notes

This implementation piggybacks off the fact we already have a structure for staging buffers for uploading data from the CPU to the GPU. This duplicates that concept for the exact opposite purpose, but there's no separate configuration setting for it.

Originally there was also an implied restriction, which would error out, if the staging buffer's block size was too small to handle one image mipmap. This was mostly done to reduce the complexity of the implementation, with the option of later allowing the function to download the image in chunks that fit inside the staging buffer's block size. This restriction has since been lifted, but performance can degrade if the block size or region size is too small for large downloads.

TODO:

Documentation.
Test for validation errors or edge cases.

Contributed by W4 Games. 🍀

JoNax97 · 2024-12-06T18:29:41Z

Would it be possible to make this work with the await syntax in gdscript?

clayjohn · 2024-12-06T18:31:18Z

Summoning others who might be interested and could provide feedback on whether this works for them:
@Kimau
@granitrocky (author of #87850)
@Zylann
@TokisanGames
@Bonkahe

clayjohn · 2024-12-06T18:33:42Z

> Would it be possible to make this work with the await syntax in gdscript?

Maybe, I'm not familiar enough to say. But our public API needs to work for all supported languages (GDScript, C#, GDExtension). So we need to bind it in a way that is language-agnostic.
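
A callback-based binding does not rule out await-style use in languages that have it. As a language-neutral illustration (plain Python with asyncio, not Godot code), a callback-taking call can be wrapped in a future. Here `get_data_async` is a hypothetical stand-in that delivers its bytes a little later; it is not the RenderingDevice method, and nothing below says how GDScript's await would be wired up.

```python
import asyncio

def get_data_async(callback):
    """Hypothetical callback-based readback: returns at once, calls back later."""
    loop = asyncio.get_running_loop()
    loop.call_later(0.01, callback, bytes([1, 2, 3, 4]))

async def get_data_awaitable():
    """Adapter that turns the callback style into something that can be awaited."""
    loop = asyncio.get_running_loop()
    future = loop.create_future()
    get_data_async(future.set_result)  # The callback resolves the future.
    return await future

result = asyncio.run(get_data_awaitable())
print(result)
```

The adapter lives entirely on the caller's side, so the underlying API can stay a plain callback that every bound language understands.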

Kimau · 2024-12-06T19:12:07Z

Will take a look later. I was more looking at methods of linking buffers between different shaders to keep things from round-tripping. Back burner now.

This looks like a nice upgrade. Will peek through the code tonight and get back with more comments 👍

Zylann · 2024-12-06T19:12:56Z

At which point would the callback be called, and on which thread? For example, when using the main renderer's RenderingDevice, is it called somewhere on the main thread during the main loop? What about when using a custom RenderingDevice in a custom thread?

In my case, I use a separate RenderingDevice on a custom thread, mainly for the following reasons:

Deferring CPU work and stalls to not affect the main thread (setups, compute dispatches, uploads, downloads, maybe also shader compilation). The work is split in batches so that thread still has kind of a notion of "frame" but isn't related to "rendering frames".
Avoid slowing down the renderer by (hopefully) making these compute shaders run in parallel to rendering (the total amount dispatched during a "frame" could exceed the duration of a "rendering frame")

Considering this use case, I wonder if I would still get benefits from using async methods?

Another minor thing I'm wondering is how to use this Callable in a GDExtension, in a context where none of my functions belong to a Godot object (i.e. the code only has a reference to RenderingDevice and maybe a few resources, but everything else is "pure" C++ classes).

Kimau · 2024-12-06T19:24:23Z

Looks really good but I need to take some more time to dissect `void RenderingDevice::_stall_for_frame(uint32_t p_frame)`.

DarioSamo · 2024-12-06T19:27:57Z

> At which point would the callback be called, and on which thread?

Render thread (whoever is responsible for the RenderingDevice), as it's the one that checks for the fences.

> In my case, I use a separate RenderingDevice on a custom thread, mainly for the following reasons:

If you're using local RD, all it means as far as I know is it'll trigger the callbacks on sync(), which might still be pretty helpful! You can build a large batch of get_data calls and process them all at once instead of making the GPU actively wait on each one.
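
The batching benefit can be sketched with a toy model. This is plain Python, not the RenderingDevice API: `FakeDevice` and its methods are hypothetical, and the only thing it tracks is how many times the CPU has to wait, under the assumption that a blocking read waits once per call while async reads wait once at `sync()`.

```python
class FakeDevice:
    """Hypothetical stand-in for a local device; counts CPU waits on the GPU."""

    def __init__(self):
        self.stalls = 0
        self._pending = []

    def get_data(self, value):
        # Blocking read: the CPU waits every time this is called.
        self.stalls += 1
        return value

    def get_data_async(self, value, callback):
        # Non-blocking read: only records the request.
        self._pending.append((value, callback))

    def sync(self):
        # A single wait, after which every queued callback runs.
        self.stalls += 1
        pending, self._pending = self._pending, []
        for value, callback in pending:
            callback(value)

blocking = FakeDevice()
blocking_results = [blocking.get_data(i) for i in range(8)]

batched = FakeDevice()
batched_results = []
for i in range(8):
    batched.get_data_async(i, batched_results.append)
batched.sync()

print(blocking.stalls, batched.stalls)
```

Both paths end up with the same data; the difference in the model is eight waits versus one.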

adamscott · 2024-12-06T19:29:39Z

As discussed on the Godot Dev Chat, this PR could tremendously help the implementation of WebGPU, due to the async nature of GPU access in WebGPU.

Zylann · 2024-12-06T19:36:14Z

> Render thread, as it's the one that checks for the fences.

So in the case of the main renderer, if it's in multithread mode, the user still has to make stuff thread-safe if the downloaded data needs to be stored in, say, some node of the game?
And in the case of a custom RenderingDevice, I assume that means the same thread in which the RenderingDevice was created? (in which case no thread-safety handling is needed since the thread calling sync would be the creator thread)

> You can build a large batch of get_data calls and process them all at once instead of making the GPU actively wait on each one.

I'm already creating batches, I don't schedule work one by one. But does that mean that I don't need to build batches at all, and can just submit things individually and then get results using async?
Another reason I made batches is to have a chance to group the same shaders together (since the work I submit can be a few different types, but there may be many instances of each).

Bonkahe · 2024-12-06T19:37:18Z

I'll run it in my terrain system to retrieve collision data (currently bake time only as there's a slight hitch due to retrieving data) and get back with the results.

Bonkahe · 2024-12-06T22:09:05Z

Results of my tests.
The tests were all run within an actual game space, with foliage collisions etc. This was done in the editor, but I wanted it to reflect real-world performance as much as possible. The tests were done within a for loop in C#, executed on a 20x20 chunk region (400 textures, each one overwriting the previous). This was also done with the full load of my terrain compute shader, so it is not a small amount of work.
Each image was 512x512 in Rf format (for now I only tested TextureGetDataAsync; I need to test buffers later). I had to expand the block size in the settings multiple times to accommodate this size. I'm unsure what size the buffer needed to be to handle it, but for now it worked with 4098.

First, visual examples. Notice the 1% lows: right now the issues caused by retrieving the texture the standard way do not impact averages very much and instead result in stutters, so this is the best way to see the results in action. If you just move around in the world it definitely feels a lot better with async. The stutters are greatly reduced, and this is while still saving the binary data to disk; if I skip that, the stutters are reduced even further.
Base:

Async:

Already a noticeable difference.

Benchmarks are next. I ran it twice, once with the standard method and once with a naïve approach to the async method. I say naïve because it was pretty much just slotted in right where the previous one was, so as to disturb the flow as little as possible. I believe you could probably get better performance with a cleaner implementation, but for the sake of testing it gets the job done.
First row is the Async method, second is the base method:
Cleaned_FrameView_Summary.csv

Take note of the 0.1% lows: the new method improves them by roughly 60%.

For validation I also compared the outputs using FC, and the data appears to be exactly the same, so no errors as far as that goes:

Basic implementation:

Async implementation:

With the callable function:

Of note: during all this my computer BSODed. I was unable to reproduce it, and considering some of the other stuff in my project, I doubt it was caused by this commit.

Besides all that, in general I would say this is a straight improvement, and besides the block size (which needs some proper documentation for sure), I don't see any issues with it. On a more personal note, this might make runtime generation/modification of collisions, and later foliage data, viable for my terrain, so very nice indeed.

Excellent work!

DarioSamo · 2024-12-06T22:24:29Z

@Bonkahe Thank you very much for the testing!

Regarding block size, it's a bit of a double-edged sword if I go down the route of implementing it as the regular upload does, which is why the restriction exists. When the block size is too small, the GPU has to do the copies less efficiently, in smaller chunks, and this is something that used to happen a lot before the recent transfer queue improvements.

The solution I'm looking into is implementing this behavior so it is handled automatically. But you'll still get an improvement from increasing the block size and the region size in the future if you fine-tune them to your needs like you had to do right now. That's why I'm a bit hesitant to make it part of the documentation until I sort out the implementation first.

DarioSamo · 2024-12-10T15:29:47Z

I've lifted the restriction from the PR concerning the block size. The function now behaves exactly like update() does: it copies by regions using the same logic. Please do test if it works correctly; at least my test of downloading a larger texture and double-checking a particular pixel has worked fine.

func _init():
	...
	format = RDTextureFormat.new()
	format.width = 180
	format.height = 180
	format.format = RenderingDevice.DATA_FORMAT_R8G8B8A8_UNORM
	format.usage_bits = RenderingDevice.TEXTURE_USAGE_SAMPLING_BIT | RenderingDevice.TEXTURE_USAGE_CAN_COPY_FROM_BIT
	# Byte offset of pixel (x = 137, y = 175): each row is format.width pixels of 4 bytes.
	pixel_offset = (175 * format.width * 4) + 137 * 4

	data = [ PackedByteArray() ]
	data[0].resize(format.width * format.height * 4)
	data[0].encode_u32(pixel_offset, 0x00FF00FF)

	view = RDTextureView.new()
	texture = rd.texture_create(format, view, data)
	texture_value = 0

func _texture_get_data_callback(array):
	# Reads back the marker pixel written in _init().
	texture_value = array.decode_u32(pixel_offset)
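
The pixel offset in this test follows the usual row-major formula. A minimal sketch in plain Python, assuming the downloaded bytes are tightly packed rows of 4-byte pixels (the helper name `pixel_offset` and the constants are illustrative, mirroring the 180x180 test above):

```python
def pixel_offset(x, y, width, bytes_per_pixel):
    """Byte offset of pixel (x, y) in a tightly packed, row-major image."""
    return (y * width + x) * bytes_per_pixel

WIDTH = HEIGHT = 180
BYTES_PER_PIXEL = 4  # R8G8B8A8_UNORM stores four 8-bit channels per pixel.

offset = pixel_offset(137, 175, WIDTH, BYTES_PER_PIXEL)
total_size = WIDTH * HEIGHT * BYTES_PER_PIXEL
print(offset, total_size)
```

The row stride depends on the width, not the height; the two only coincide here because the test texture is square.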

Keep in mind the performance limitations I've described before will apply. For example, this script triggers 9 separate texture copies to buffers, which is generally pretty inefficient, but that's what the default region size of 64x64 will do. I introduced a new separate setting to control the region size of downloads, but of course you may be required to increase the size of the blocks if the region is too big or the format is too big.
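
The count of 9 copies matches a simple tiling estimate. The sketch below is plain Python and assumes one copy per square region tiling a single mip level, with partial regions at the edges still costing a full copy; `region_copies` is a hypothetical helper, and the real implementation may split work differently (for example when the block size is the limiting factor).

```python
def region_copies(width, height, region_size=64):
    """Estimated number of region copies needed to cover a width x height texture."""
    columns = (width + region_size - 1) // region_size  # Ceiling division.
    rows = (height + region_size - 1) // region_size
    return columns * rows

copies_180 = region_copies(180, 180)       # The 180x180 test texture.
copies_512 = region_copies(512, 512)       # A 512x512 texture at the same region size.
copies_512_big = region_copies(512, 512, region_size=256)
print(copies_180, copies_512, copies_512_big)
```

Under the same assumption, raising the region size is what cuts the number of copies for larger textures.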

DarioSamo · 2024-12-10T17:28:39Z

Updated documentation as well.

Bonkahe · 2024-12-10T18:13:39Z

So I did another round of testing and (barring two standout benchmarks, which I'll go over in a minute) there was a negligible difference in min framerate. This is far from scientific, but it seems to indicate that your solution to the block size works for the most part, with probably some slight performance to be gained by manually accounting for it if you know what you're doing.

Benchmarks:
Cleaned2_FrameView_Summary.csv

And upon verification of data output it was 1:1 again, so no regressions there.

For the standout benchmarks: I removed one benchmark that was taken right after I changed the block size. It seems that when you change the block size there's a slight hitch shortly after, I assume as the system handles the update, and this resulted in 0.1% lows at around 50% of a normal run.
The second issue was on the last benchmark, where I alt-tabbed to something else for a moment, and as a result the 0.1% lows got destroyed. I left it as is because A: I'm no scientist lol, and B: the min-fps is the real measure of hitches, and it wasn't really impacted by this error.

All in all, the update to block size is an excellent improvement. It will probably work for the most part, and people who want to optimize everything perfectly can go dig it up. I think the documentation should reflect this, with the tooltip noting that for the majority of use cases, leaving it be will work just fine.

Hope this was some help.

clayjohn

The code looks great to me and I tested it locally with the MRP from #99750 and confirmed that it works correctly!

It just needs a rebase before merging (which I will do since Dario is on vacation)

Repiteo · 2024-12-11T23:41:27Z

Thanks!

RPicster · 2024-12-13T08:24:29Z

I also tested this on my project and it works like a charm. It even makes a lot of the old code unnecessary.
Fantastic work and a super useful addition!

yosimba2000 · 2024-12-21T01:15:16Z

Does this allow retrieving a viewport texture asynchronously?
#75877

clayjohn · 2024-12-21T03:01:50Z

> Does this allow retrieving a viewport texture asynchronously? #75877

Yep!

DarioSamo added feature proposal topic:rendering labels Dec 6, 2024

DarioSamo mentioned this pull request Dec 6, 2024

RenderingDevice - texture functions are no longer allowed in background threads due to thread guards. #99750

Closed

DarioSamo force-pushed the rd-get-data-async branch from c1629a7 to 14bbb8c on December 6, 2024 18:32

clayjohn added this to the 4.4 milestone Dec 6, 2024

clayjohn added the regression label Dec 6, 2024

adamscott changed the title ~~Implement buffer_get_data_async and texture_get_data_async.~~ Implement RD::buffer_get_data_async() and RD::texture_get_data_async() Dec 6, 2024

DarioSamo force-pushed the rd-get-data-async branch 2 times, most recently from cf63221 to adc584b on December 10, 2024 15:26

AThousandShips reviewed Dec 10, 2024

DarioSamo force-pushed the rd-get-data-async branch 2 times, most recently from d6b5144 to b798097 on December 10, 2024 17:07

DarioSamo marked this pull request as ready for review December 10, 2024 17:30

DarioSamo requested review from a team as code owners December 10, 2024 17:30

clayjohn approved these changes Dec 11, 2024

clayjohn force-pushed the rd-get-data-async branch from b798097 to 1f68972 on December 11, 2024 06:23

AThousandShips approved these changes Dec 11, 2024

clayjohn force-pushed the rd-get-data-async branch from 1f68972 to f7fa688 on December 11, 2024 19:41

Implement buffer_get_data_async and texture_get_data_async.

054891d

clayjohn force-pushed the rd-get-data-async branch from f7fa688 to 054891d on December 11, 2024 19:55

Repiteo merged commit c5e6f62 into godotengine:master Dec 11, 2024
20 checks passed

Calinou mentioned this pull request Dec 21, 2024

Add support for asynchronous RenderingDevice.buffer_get_data() (hardware readbacks) godotengine/godot-proposals#7886

Closed
