Implement RD::buffer_get_data_async() and RD::texture_get_data_async() #100110

Merged on Dec 11, 2024 (1 commit)

@DarioSamo
Contributor

DarioSamo commented Dec 6, 2024

Background

This implements the feature proposed in the discussion that originated from #99750.

As proposed by Clay here.

Instead we can add texture_get_data_async(RID, callback) which returns immediately, but calls the callback once the data is ready. The callback would always be a function with the structure func some_name(PackedByteArray) where the user can then do what they will with that array, the same as if they had called texture_get_data().

This PR does exactly that for buffer_get_data and texture_get_data. These now have asynchronous variants which trigger the callback when the frame in which the data retrieval was requested is flushed. This gives users a way to request data downloads back to the CPU without introducing heavy stalls that hurt their game's performance.

As seen in the issue, this also fixes a perceived regression caused by the introduction of thread guards for RenderingDevice in #90400. Previously, users could call these functions from another thread and request the data without blocking the main thread, unaware that doing so forced the entire device to flush its frames and reduced performance in the process.

Here's an example script that showcases how to use these functions.

extends Node2D

var rd
var in_counter
var out_counter
var buffer
var in_array
var out_array
var data
var format
var view
var texture
var texture_values

func _init():
	rd = RenderingServer.get_rendering_device()
	buffer = rd.storage_buffer_create(4)
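	in_counter = 0
	# Initialize so the "%X" format in _process() has a number before the first callback fires.
	out_counter = 0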
	in_array = PackedByteArray()
	in_array.resize(4)
	
	data = [ PackedByteArray() ]
	format = RDTextureFormat.new()
	format.width = 4
	format.height = 4
	format.format = RenderingDevice.DATA_FORMAT_R8_UINT
	format.usage_bits = RenderingDevice.TEXTURE_USAGE_SAMPLING_BIT | RenderingDevice.TEXTURE_USAGE_CAN_COPY_FROM_BIT
	
	view = RDTextureView.new()
	data[0].resize(16)
	data[0].encode_u32(0, 0xFF0000FF)
	data[0].encode_u32(4, 0xFF0000FF)
	data[0].encode_u32(8, 0x00FF00FF)
	data[0].encode_u32(12, 0x0000FFFF)
	texture = rd.texture_create(format, view, data)
	texture_values = [0, 0, 0, 0]
	pass

func _buffer_get_data_callback(array):
	out_counter = array.decode_u32(0)
	pass
	
func _texture_get_data_callback(array):
	texture_values[0] = array.decode_u32(0)
	texture_values[1] = array.decode_u32(4)
	texture_values[2] = array.decode_u32(8)
	texture_values[3] = array.decode_u32(12)
	pass

func _process(_delta):
	in_counter += 1
	in_array.encode_u32(0, in_counter)
	rd.buffer_update(buffer, 0, in_array.size(), in_array)
	rd.buffer_get_data_async(buffer, _buffer_get_data_callback, 0, in_array.size())
	rd.texture_get_data_async(texture, 0, _texture_get_data_callback)
	
	$Label.text = "Counter: %X Buffer Value: %X\n(these shouldn't match if it's async)\nTexture Values: %X %X %X %X" % [in_counter, out_counter, texture_values[0], texture_values[1], texture_values[2], texture_values[3]]
	
	pass

CC @RPicster who brought up the issue originally and may wish to test this out.

Notes

This implementation piggybacks off the fact we already have a structure for staging buffers for uploading data from the CPU to the GPU. This duplicates that concept for the exact opposite purpose, but there's no separate configuration setting for it.

Originally there was an implied restriction: the call would error out if the staging buffer's block size was too small to hold one image mipmap. This was mostly done to reduce the complexity of the implementation, with the option of later copying in image chunks that fit inside the staging buffer's block size. This restriction has since been lifted, but performance can degrade if the block size or region size is too small for large downloads.

TODO:

  • Documentation.
  • Test for validation errors or edge cases.

Contributed by W4 Games. 🍀

@JoNax97
Contributor

JoNax97 commented Dec 6, 2024

Would it be possible to make this work with the await syntax in gdscript?

@clayjohn
Member

clayjohn commented Dec 6, 2024

Summoning others who might be interested and could provide feedback on whether this works for them:
@Kimau
@granitrocky (author of #87850)
@Zylann
@TokisanGames
@Bonkahe

@clayjohn
Member

clayjohn commented Dec 6, 2024

Would it be possible to make this work with the await syntax in gdscript?

Maybe, I'm not familiar enough to say. But our public API needs to work for all supported languages (GDScript, C#, GDExtension). So we need to bind it in a way that is language-agnostic.
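
On the GDScript side, the callback can be bridged to await with a signal. This is only a minimal sketch and not part of the PR: `buffer_data_ready`, `_emit_buffer_data` and `read_buffer` are hypothetical names, and it assumes a single request in flight at a time (several overlapping requests would need one signal or object per request). It also assumes that resuming the coroutine on whichever thread invokes the callback is acceptable.

```gdscript
extends Node

# Hypothetical signal used to turn the callback into something awaitable.
signal buffer_data_ready(data: PackedByteArray)

# Callback handed to the RenderingDevice; it simply forwards the bytes.
func _emit_buffer_data(data: PackedByteArray) -> void:
	buffer_data_ready.emit(data)

# Requests `size` bytes from `buffer` and suspends until the callback fires.
func read_buffer(rd: RenderingDevice, buffer: RID, size: int) -> PackedByteArray:
	var err := rd.buffer_get_data_async(buffer, _emit_buffer_data, 0, size)
	if err != OK:
		# The request was rejected, so there is nothing to wait for.
		return PackedByteArray()
	# Awaiting a one-argument signal yields that argument.
	return await buffer_data_ready
```

A caller would then write `var bytes = await read_buffer(rd, buffer, 4)`. The engine API itself stays callback-based, so this wrapper only helps GDScript users.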

@clayjohn clayjohn added this to the 4.4 milestone Dec 6, 2024
@Kimau
Contributor

Kimau commented Dec 6, 2024

Will take a look later. I was more looking at methods of linking buffers between different shaders to keep things from round-tripping. Back burner now.

This looks like a nice upgrade. Will peek through the code tonight and get back to comment more 👍

@Zylann
Contributor

Zylann commented Dec 6, 2024

At which point would the callback be called, and on which thread? For example, when using the main renderer's RenderingDevice, is it called somewhere on the main thread during the main loop? What about when using a custom RenderingDevice in a custom thread?

In my case, I use a separate RenderingDevice on a custom thread, mainly for the following reasons:

  • Deferring CPU work and stalls to not affect the main thread (setups, compute dispatches, uploads, downloads, maybe also shader compilation). The work is split in batches so that thread still has kind of a notion of "frame" but isn't related to "rendering frames".
  • Avoid slowing down the renderer by (hopefully) making these compute shaders run in parallel to rendering (the total amount dispatched during a "frame" could exceed the duration of a "rendering frame")

Considering this use case, I wonder if I would still get benefits from using async methods?

Another minor thing I'm wondering is how to use this Callable in a GDExtension, in a context where none of my functions belong to a Godot object (i.e. the code only has a reference to RenderingDevice and maybe a few resources, but everything else is "pure" C++ classes).

@Kimau
Contributor

Kimau commented Dec 6, 2024

Looks really good, but I need to take some more time to dissect void RenderingDevice::_stall_for_frame(uint32_t p_frame)

@adamscott adamscott changed the title from "Implement buffer_get_data_async and texture_get_data_async." to "Implement RD::buffer_get_data_async() and RD::texture_get_data_async()" Dec 6, 2024
@DarioSamo
Contributor Author

DarioSamo commented Dec 6, 2024

At which point would the callback be called, and on which thread?

Render thread (whoever is responsible for the RenderingDevice), as it's the one that checks for the fences.

In my case, I use a separate RenderingDevice on a custom thread, mainly for the following reasons:

If you're using local RD, all it means as far as I know is it'll trigger the callbacks on sync(), which might still be pretty helpful! You can build a large batch of get_data calls and process them all at once instead of making the GPU actively wait on each one.
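
A rough sketch of that batching idea on a local RenderingDevice follows. It assumes, as stated above with "as far as I know", that the callbacks run during sync(); the buffer count, the values and the names `results` and `_on_buffer_data` are made up for illustration.

```gdscript
extends Node

# Hypothetical storage for the downloaded values, one slot per buffer.
var results := []

# `index` is appended by Callable.bind(), so it arrives after the data.
func _on_buffer_data(data: PackedByteArray, index: int) -> void:
	results[index] = data.decode_u32(0)

func _ready() -> void:
	var rd := RenderingServer.create_local_rendering_device()
	if rd == null:
		return  # No RenderingDevice available (e.g. Compatibility renderer).

	# Create a few small buffers with known contents.
	var buffers: Array[RID] = []
	results.resize(4)
	for i in 4:
		var bytes := PackedByteArray()
		bytes.resize(4)
		bytes.encode_u32(0, i * 10)
		buffers.append(rd.storage_buffer_create(4, bytes))

	# Queue every download first instead of waiting on each one.
	for i in buffers.size():
		rd.buffer_get_data_async(buffers[i], _on_buffer_data.bind(i), 0, 4)

	# One submit/sync pair for the whole batch.
	rd.submit()
	rd.sync()
	print(results)

	# Local devices and their resources are freed manually.
	for buffer in buffers:
		rd.free_rid(buffer)
	rd.free()
```

The point of the design is that the device is only waited on once for the whole batch rather than once per get_data call.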

@adamscott
Member

As discussed on the Godot Dev Chat, this PR could tremendously help the implementation of WebGPU, due to the async nature of GPU access in WebGPU.

@Zylann
Contributor

Zylann commented Dec 6, 2024

Render thread, as it's the one that checks for the fences.

So in the case of the main renderer, if it's in multithread mode the user has to still make stuff thread-safe if the downloaded data needs to be stored in say some node of the game?
And in the case of a custom RenderingDevice, I assume that means the same thread in which the RenderingDevice was created? (in which case no thread-safety handling is needed since the thread calling sync would be the creator thread)
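
One way to sidestep the thread-safety concern for the main renderer is to hand the data over to the main thread before touching any node. This is a minimal sketch under the assumption that the callback may arrive on a thread other than the main one; `latest_data` and `_apply_downloaded_data` are hypothetical names.

```gdscript
extends Node

# Hypothetical field that game code on the main thread reads from.
var latest_data := PackedByteArray()

# Passed to texture_get_data_async() or buffer_get_data_async().
# It may run on the render thread, so it does no scene work itself.
func _on_texture_data(data: PackedByteArray) -> void:
	# Callable.call_deferred() queues the call for the main thread's idle time.
	_apply_downloaded_data.call_deferred(data)

# Runs on the main thread, where storing into nodes is safe.
func _apply_downloaded_data(data: PackedByteArray) -> void:
	latest_data = data
```

The cost is that the data becomes visible to game code slightly later than the moment the callback fires.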

You can build a large batch of get_data calls and process them all at once instead of making the GPU actively wait on each one.

I'm already creating batches, I don't schedule work one by one. But does that mean then that I don't need to build batches at all, just submit things individually and then get results using async?
Another reason I made batches is to have a chance to group the same shaders together (since the work I submit can be a few different types, but there may be many instances of each).

@Bonkahe
Contributor

Bonkahe commented Dec 6, 2024

I'll run it in my terrain system to retrieve collision data (currently bake time only as there's a slight hitch due to retrieving data) and get back with the results.

@Bonkahe
Contributor

Bonkahe commented Dec 6, 2024

Results of my tests.
The tests were all run within an actual game space, with foliage collisions etc. This was done in the editor, but I wanted it to reflect real-world performance as much as possible. The tests were done within a for loop in C#, executed on a 20x20 chunk region (400 textures, each one overwriting the previous). This was also done with the full load of my terrain compute shader, so it is not a small amount of work.
Each image retrieved was 512x512 in Rf format (for now I only tested TextureGetDataAsync; I need to test buffers later). I had to expand the block size in the settings multiple times to accommodate this size. I'm unsure what size the buffer needed to be to handle it, but for now it worked with 4098.

First, visual examples. Notice the 1% lows: right now the issues caused by retrieving the texture the standard way do not impact averages very much and instead result in stutters, so this is the best way to see the results in action. If you just move around in the world it definitely feels a lot better with async. The stutters are greatly reduced, and this is while still saving the binary data to disk; without that, the stutters are reduced even further.
Base:
debuggingcollisionbuildbase

Async:
debuggingcollisionbuildasync

Already a noticeable difference.

Benchmarks are next. I ran it twice, once with the standard method and once with a naïve approach to the async method. I say naïve because it was pretty much just slotted in right where the previous one was, so as to disturb the flow as little as possible. You could probably get better performance with a cleaner implementation, but for the sake of testing it gets the job done.
First row is the Async method, second is the base method:
Cleaned_FrameView_Summary.csv

Take note of the 0.1% lows: the new method improves these by roughly 60%.

For validation I also tested the outputs using FC, and the data appears to be exactly the same, so no errors as far as that goes:
image

Basic implementation:
image

Async implementation:
image

With the callable function:
image

Of note, my computer BSODed during all this. I was unable to reproduce it, and considering some of the other stuff in my project, I doubt it was caused by this commit.

Besides all that, I would say this is a straight improvement. Apart from the block size (which needs some proper documentation for sure), I don't see any issues with it. On a more personal note, this might make runtime generation/modification of collisions, and later foliage data, viable for my terrain, so very nice indeed.

Excellent work!

@DarioSamo
Contributor Author

DarioSamo commented Dec 6, 2024

@Bonkahe Thank you very much for the testing!

Regarding block size, implementing it the way the regular upload does is a bit of a double-edged sword, which is why the restriction exists. When the block size is too small, the GPU has to do the copies less efficiently in smaller chunks, which used to happen a lot before the recent transfer queue improvements.

The solution I'm looking into is implementing this behavior so the problem is handled automatically. You'll still get an improvement from increasing the block size and the region size in the future if you fine-tune them to your needs, like you had to do right now. That's why I'm a bit hesitant to make it part of the documentation until I sort out the implementation first.

@DarioSamo DarioSamo force-pushed the rd-get-data-async branch 2 times, most recently from cf63221 to adc584b Compare December 10, 2024 15:26
@DarioSamo
Contributor Author

DarioSamo commented Dec 10, 2024

I've lifted the restriction from the PR concerning the block size. The function now behaves exactly like update() does, copying by regions with the same logic. Please do test if it works correctly; at least my test of downloading a larger texture and double-checking a particular pixel has worked fine.

func _init():
	...
	format = RDTextureFormat.new()
	format.width = 180
	format.height = 180
	format.format = RenderingDevice.DATA_FORMAT_R8G8B8A8_UNORM
	format.usage_bits = RenderingDevice.TEXTURE_USAGE_SAMPLING_BIT | RenderingDevice.TEXTURE_USAGE_CAN_COPY_FROM_BIT
	# Byte offset of pixel (x = 137, y = 175); a row is format.width * 4 bytes for RGBA8.
	pixel_offset = (175 * format.width * 4) + 137 * 4
	
	data = [ PackedByteArray() ]
	data[0].resize(format.width * format.height * 4)
	data[0].encode_u32(pixel_offset, 0x00FF00FF)

	view = RDTextureView.new()
	texture = rd.texture_create(format, view, data)
	texture_value = 0
	pass

func _texture_get_data_callback(array):
	texture_value = array.decode_u32(pixel_offset)
	pass

Keep in mind the performance limitations I've described before will apply. For example, this script triggers 9 separate texture copies to buffers, which is generally pretty inefficient, but that's just what the defaults of the region size at 64x64 will do. I introduced a new separate setting to control the region size of downloads, but of course you may be required to increase the size of the blocks if the region is too big or the format is too big.
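
The figure of 9 copies follows from tiling the 180x180 texture with 64x64 regions. The sketch below is a simplified model of that arithmetic only, not the engine's actual code; `region_copy_count` is a hypothetical helper, and it ignores mipmaps, layers and the block size limit.

```gdscript
extends Node

# Number of square regions of `region_size` needed to cover a width x height texture.
func region_copy_count(width: int, height: int, region_size: int) -> int:
	# Partial regions at the right and bottom edges still need their own copy.
	var columns := ceili(float(width) / region_size)
	var rows := ceili(float(height) / region_size)
	return columns * rows

func _ready() -> void:
	# 180 / 64 rounds up to 3 per axis, so 3 * 3 copies.
	print(region_copy_count(180, 180, 64))  # prints 9
```

Under the same model, a region size of 180 or more would cover this texture with a single copy, provided the block size can hold it.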

@DarioSamo DarioSamo force-pushed the rd-get-data-async branch 2 times, most recently from d6b5144 to b798097 Compare December 10, 2024 17:07
@DarioSamo
Contributor Author

Updated documentation as well.

@DarioSamo DarioSamo marked this pull request as ready for review December 10, 2024 17:30
@DarioSamo DarioSamo requested review from a team as code owners December 10, 2024 17:30
@Bonkahe
Contributor

Bonkahe commented Dec 10, 2024

So I did another round of testing and (barring two standout benchmarks, which I'll go over in a minute) there was negligible difference in min framerate. This is far from scientific, but it seems to indicate that your solution to block size works for the most part, with probably some slight performance to be gained by manually accounting for it if you know what you're doing.

Benchmarks:
Cleaned2_FrameView_Summary.csv

And upon verification of data output it was 1:1 again, so no regressions there.
image

For the standout benchmarks: I removed one benchmark that was taken right after I changed the block size. It seems that when you change the block size there's a slight hitch shortly after, I assume as the system handles the update, which resulted in 0.1% lows at about 50% of a normal run.
The second issue was on the last benchmark, where I alt-tabbed to something else for a moment, and as a result the 0.1% lows got destroyed. I left it as is because A: I'm no scientist lol, and B: the min FPS is the real measure of hitches, and it wasn't really impacted by this error.

All in all, the update to block size is an excellent improvement. It will probably work for the most part, and people who want to optimize everything perfectly can go dig it up. I think the documentation should reflect this, with the tooltip noting that for the majority of use cases, leaving it be will work just fine.

Hope this was some help.

Member

@clayjohn clayjohn left a comment

The code looks great to me and I tested it locally with the MRP from #99750 and confirmed that it works correctly!

It just needs a rebase before merging (which I will do since Dario is on vacation)

servers/rendering/rendering_device.h (outdated, resolved)
@Repiteo Repiteo merged commit c5e6f62 into godotengine:master Dec 11, 2024
20 checks passed
@Repiteo
Contributor

Repiteo commented Dec 11, 2024

Thanks!

@RPicster
Contributor

I also tested this on my project and it works like a charm. It even makes a lot of the old code unnecessary.
Fantastic work and a super useful addition!

@yosimba2000

Does this allow retrieving a viewport texture asynchronously?
#75877

@clayjohn
Member

Does this allow retrieving a viewport texture asynchronously? #75877

Yep!
