Bump allocation for Uniform Buffers on WebGPU #5438
Merged
Before this PR, each UniformBuffer would allocate its own internal GPUBuffer storage and, every frame, copy its CPU storage content to it using writeBuffer. This required many writeBuffer calls, which are expensive in both CPU and GPU time.
This PR implements a more performant approach. Under the hood, one or more large (1MB) gpu buffers are allocated, together with a pool of staging buffers of the same size. Individual uniform buffers allocate their storage from the staging buffers using a bump allocator. Then, just before the command buffers are submitted, an additional command buffer is added to execute first, which copies the used staging buffers to the gpu buffers.
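To illustrate the bump allocation described above, here is a minimal sketch, not the code from this PR. The class and field names are hypothetical, and the 256 byte offset alignment is an assumption based on WebGPU's default `minUniformBufferOffsetAlignment` limit. The sketch only tracks offsets, so it runs without a GPU device.

```typescript
// Hypothetical names for illustration; not the engine's actual classes.
const BUFFER_SIZE = 1024 * 1024; // 1MB, matching the buffer size described above
const OFFSET_ALIGNMENT = 256;    // assumption: WebGPU default minUniformBufferOffsetAlignment

interface Allocation {
    bufferIndex: number; // which staging buffer (and matching gpu buffer) is used
    offset: number;      // byte offset of the allocation inside that buffer
    size: number;        // requested size in bytes
}

class BumpAllocator {
    private bufferSize: number;
    private alignment: number;
    private bufferIndex = 0; // buffer currently being filled
    private offset = 0;      // first free byte in the current buffer

    constructor(bufferSize: number = BUFFER_SIZE, alignment: number = OFFSET_ALIGNMENT) {
        this.bufferSize = bufferSize;
        this.alignment = alignment;
    }

    alloc(size: number): Allocation {
        if (size > this.bufferSize) {
            throw new Error(`allocation of ${size} bytes exceeds the buffer size`);
        }

        // round the start offset up to the required alignment
        let start = Math.ceil(this.offset / this.alignment) * this.alignment;

        // if the request does not fit, move on to the start of the next buffer
        if (start + size > this.bufferSize) {
            this.bufferIndex++;
            start = 0;
        }

        // "bump" the free pointer past the new allocation
        this.offset = start + size;
        return { bufferIndex: this.bufferIndex, offset: start, size };
    }

    // number of buffers that need to be copied before rendering
    get usedBufferCount(): number {
        return this.bufferIndex + 1;
    }
}
```

Allocation is just an add and a compare, which is why it is cheap to run for every uniform buffer update. In a real renderer each returned `offset` would be used as the dynamic offset into the shared gpu buffer, and only the first `usedBufferCount` staging buffers would need to be copied.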
Here's an example of the buffers in use. Note that the number of staging buffers grows each time command buffers are submitted, as the staging buffers that already exist can no longer be used.
This PR also cleans up some temporary solutions introduced in #5423 to limit the number of expensive submit commands per frame. Before, the command buffer of each render pass was submitted separately, while now the command buffers are batched into a very small number of submits.
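The batching idea can be sketched as follows. This is an illustration under stated assumptions rather than the PR's implementation: `SubmitBatcher` and its methods are hypothetical names, and the class is generic over the command buffer type so it runs without a GPU. With WebGPU, `T` would be `GPUCommandBuffer` and the callback would call `device.queue.submit`, which accepts an array of command buffers.

```typescript
// Hypothetical helper for illustration; not the engine's actual class.
class SubmitBatcher<T> {
    private pending: T[] = [];
    private submitFn: (commandBuffers: T[]) => void;

    constructor(submitFn: (commandBuffers: T[]) => void) {
        this.submitFn = submitFn;
    }

    // Queue a command buffer instead of submitting it immediately.
    // With front = true it is scheduled to execute before all others,
    // which is what the staging-to-gpu buffer copy needs.
    add(commandBuffer: T, front: boolean = false): void {
        if (front) {
            this.pending.unshift(commandBuffer);
        } else {
            this.pending.push(commandBuffer);
        }
    }

    // Submit everything queued so far with a single submit call.
    flush(): void {
        if (this.pending.length > 0) {
            this.submitFn(this.pending);
            this.pending = [];
        }
    }
}
```

A frame would then add the command buffer of each render pass as it is recorded, add the copy command buffer at the front once the used staging buffers are known, and call `flush()` once.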
As an example, the shadow cascades example uses a single submit: it first copies the staging buffers to the gpu buffers, followed by a single command buffer that renders all shadow cascade render passes, followed by the forward pass of the scene:
The multi-view example similarly renders the whole scene using a single submit for all command buffers:
If texture uploads are done in a frame (typically in a very small number of places), we end up with two submits. In this case the uploads are the bone texture used by skinning and the clustered lights updated on the CPU:
All rendering issued from the update functions of scripts is submitted separately for now (it could be a single submit as well). For example, the reflection-cubemap example renders the scene using a single submit and does multiple texture reprojections using drawQuadWithShader within the scripts:
Performance
CPU frame time for the hierarchy example with 5000 or so meshes:
GPU times, based only on the GPU duration reported by the Chrome profiler. I am not sure what else this duration captures, and I do not think these numbers are reliable.