Generate `MeshUniform`s on the GPU via compute shader where available. #12773
Conversation
Currently, `MeshUniform`s are rather large: 160 bytes. They're also somewhat expensive to compute, because they involve taking the inverse of a 3x4 matrix. Finally, if a mesh is present in multiple views, that mesh will have a separate `MeshUniform` for each and every view, which is wasteful.

This commit fixes these issues by introducing the concept of a *mesh input uniform* and adding a *mesh uniform building* compute shader pass. The `MeshInputUniform` is simply the minimum amount of data needed for the GPU to compute the full `MeshUniform`. Most of this data is simply the transform and is therefore only 64 bytes. `MeshInputUniform`s are computed during the *extraction* phase, much like skins are today, in order to avoid needlessly copying transforms around on the CPU. (In fact, the render app has been changed to only store the translation of each mesh; it no longer cares about any other part of the transform, which is stored only on the GPU and in the main world.) Before rendering, the `build_mesh_uniforms` pass runs to expand the `MeshInputUniform`s to the full `MeshUniform`s.

The mesh uniform building pass does the following, all on the GPU:

1. Copy the appropriate fields of the `MeshInputUniform` to the `MeshUniform` slot. If a single mesh is present in multiple views, this effectively duplicates it into each view.
2. Compute the inverse transpose of the model transform, used for transforming normals.
3. If applicable, copy the mesh's transform from the previous frame for TAA. To support this, we double-buffer the `MeshInputUniform`s over two frames and swap the buffers each frame. The `MeshInputUniform`s for the current frame contain the index of that mesh's `MeshInputUniform` for the previous frame.

This commit produces wins in virtually every CPU part of the pipeline: `extract_meshes`, `queue_material_meshes`, `batch_and_prepare_render_phase`, and especially `write_batched_instance_buffer` are all faster. Shrinking the amount of CPU data that has to be shuffled around speeds up the entire rendering process.

| Benchmark              | This branch | `main`  | Speedup |
|------------------------|-------------|---------|---------|
| `many_cubes -nfc`      | 21.878      | 30.117  | 37.65%  |
| `many_cubes -nfc -vpi` | 302.116     | 312.123 | 3.31%   |
| `many_foxes`           | 3.227       | 3.515   | 8.92%   |

Because mesh uniform building requires compute shaders, and WebGL 2 has no compute shaders, the existing CPU mesh uniform building code has been left as-is. Many types now have both CPU mesh uniform building and GPU mesh uniform building modes. Developers can opt into the old CPU mesh uniform building by setting the `use_gpu_uniform_builder` option on `PbrPlugin` to `false`.
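To make step 2 concrete, here is a minimal, self-contained Rust sketch of the inverse-transpose computation the pass performs per mesh. This is illustrative only, not Bevy's actual WGSL shader or its real types; the matrix layout and function names are made up for the example.

```rust
/// Row-major 3x3 matrix, standing in for the rotation/scale part of a mesh's
/// affine transform as carried in the compact input uniform.
type Mat3 = [[f32; 3]; 3];

/// Inverse transpose of a 3x3 matrix, i.e. the matrix used to transform
/// normals. Because inverse(M) = adjugate(M) / det(M) and the adjugate is the
/// transpose of the cofactor matrix, the inverse transpose is simply the
/// cofactor matrix divided by the determinant.
fn inverse_transpose(m: Mat3) -> Option<Mat3> {
    let c = [
        [
            m[1][1] * m[2][2] - m[1][2] * m[2][1],
            -(m[1][0] * m[2][2] - m[1][2] * m[2][0]),
            m[1][0] * m[2][1] - m[1][1] * m[2][0],
        ],
        [
            -(m[0][1] * m[2][2] - m[0][2] * m[2][1]),
            m[0][0] * m[2][2] - m[0][2] * m[2][0],
            -(m[0][0] * m[2][1] - m[0][1] * m[2][0]),
        ],
        [
            m[0][1] * m[1][2] - m[0][2] * m[1][1],
            -(m[0][0] * m[1][2] - m[0][2] * m[1][0]),
            m[0][0] * m[1][1] - m[0][1] * m[1][0],
        ],
    ];
    // Expanding the determinant along the first row reuses the first row of cofactors.
    let det = m[0][0] * c[0][0] + m[0][1] * c[0][1] + m[0][2] * c[0][2];
    if det.abs() < f32::EPSILON {
        return None; // singular transform; no valid normal matrix
    }
    Some(c.map(|row| row.map(|x| x / det)))
}

fn main() {
    // Non-uniform scale by (2, 1, 1): normals must be scaled by (0.5, 1, 1),
    // which is exactly what the inverse transpose produces.
    let scale = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]];
    println!("{:?}", inverse_transpose(scale));
}
```

Moving this work into a compute pass means the CPU only ever touches the 64-byte input, while the 160-byte expanded uniform is produced and consumed entirely on the GPU.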
It looks like your PR is a breaking change, but you didn't provide a migration guide. Could you add some context on what users should update when this change gets released in a new version of Bevy?
Only a partial review right now. Will do a full scan through soon.
Unfortunately, I'm marking this as a draft. I found a bug and fixing it requires rewriting most of the patch.
edit: Whoops, didn't notice that it had been changed to a draft, and didn't read the message that's 2 above this one. I'm getting crashes with this on
Reported flickering problems in the lighting example. Marking as draft until I figure it out.
Just a few nitpicks, nothing major.
I'd like to see some migration guide entry, but I'm not entirely sure what it should be yet; we can figure something out closer to release.
This generally LGTM, but as mentioned on Discord there's a flickering issue in the lighting example, so I won't approve it yet. I confirmed that it works in WebGL and WebGPU, though.
),
);
} else {
let render_device = render_app.world().resource::<RenderDevice>();
`render_device` should already be in scope.
It won't borrow check if I use the existing `render_device` variable.
);
};

let render_device = render_app.world().resource::<RenderDevice>();
Same here; `render_device` is already in scope.
It's also needed to satisfy the borrow check.
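The exchange above is about the borrow checker rather than style. As a hypothetical, self-contained illustration (mock types below, not Bevy's real `App`/`RenderDevice` API) of why the resource is fetched again instead of reusing the earlier binding: an immutable borrow taken from the app can't stay alive across a later `&mut` use of the same app, so a fresh, short-lived borrow is taken afterwards.

```rust
// Mock stand-ins for illustration only.
struct RenderDevice;

struct RenderApp {
    device: RenderDevice,
    systems: Vec<&'static str>,
}

impl RenderApp {
    /// Immutably borrows the app to hand out a resource reference.
    fn world(&self) -> &RenderDevice {
        &self.device
    }

    /// Mutably borrows the app.
    fn add_systems(&mut self, name: &'static str) {
        self.systems.push(name);
    }
}

fn configure(_device: &RenderDevice) {}

fn main() {
    let mut render_app = RenderApp { device: RenderDevice, systems: Vec::new() };

    // First use: an immutable borrow of `render_app`.
    let render_device = render_app.world();
    configure(render_device);

    // A later step needs `&mut render_app`, so the borrow above must already be dead.
    render_app.add_systems("build_mesh_uniforms");

    // Reusing the original `render_device` binding here would not compile;
    // taking a fresh borrow after the mutable use does.
    let render_device = render_app.world();
    configure(render_device);
}
```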
Generating MeshUniforms on the GPU crashes on Android. Introduced by bevyengine#12773
This commit implements opt-in GPU frustum culling, built on top of the infrastructure in #12773. To enable it on a camera, add the `GpuCulling` component to it. To additionally disable CPU frustum culling, add the `NoCpuCulling` component. Note that adding `GpuCulling` without `NoCpuCulling` *currently* does nothing useful. The reason why `GpuCulling` doesn't automatically imply `NoCpuCulling` is that I intend to follow this patch up with GPU two-phase occlusion culling, and CPU frustum culling plus GPU occlusion culling seems like a very commonly-desired mode.

Adding the `GpuCulling` component to a view puts that view into *indirect mode*. This mode makes all drawcalls indirect, relying on the mesh preprocessing shader to allocate instances dynamically. In indirect mode, the `PreprocessWorkItem` `output_index` points not to a `MeshUniform` instance slot but instead to a set of `wgpu` `IndirectParameters`, from which it allocates an instance slot dynamically if frustum culling succeeds. Batch building has been updated to allocate and track indirect parameter slots, and the AABBs are now supplied to the GPU as `MeshCullingData`.

A small amount of code relating to the frustum culling has been borrowed from meshlets and moved into `maths.wgsl`. Note that standard Bevy frustum culling uses AABBs, while meshlets use bounding spheres; this means that not as much code can be shared as one might think.

This patch doesn't provide any way to perform GPU culling on shadow maps, to avoid making this patch bigger than it already is. That can be a followup.

## Changelog

### Added

* Frustum culling can now optionally be done on the GPU. To enable it, add the `GpuCulling` component to a camera.
* To disable CPU frustum culling, add `NoCpuCulling` to a camera. Note that `GpuCulling` doesn't automatically imply `NoCpuCulling`.
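The culling data supplied per mesh is an AABB. As a rough, self-contained sketch of the per-plane test such a culling pass performs (plain Rust rather than the actual WGSL in `maths.wgsl`; the type and function names are invented for the example): for each frustum plane, take the AABB corner farthest along the plane normal, and if even that corner is behind the plane, the whole box is outside and the instance can be skipped.

```rust
/// Plane in the form dot(normal, p) + d >= 0 for points on the "inside".
struct Plane {
    normal: [f32; 3],
    d: f32,
}

/// Axis-aligned bounding box.
struct Aabb {
    min: [f32; 3],
    max: [f32; 3],
}

/// Returns false only when the AABB is entirely outside at least one plane.
fn aabb_visible(aabb: &Aabb, frustum: &[Plane]) -> bool {
    frustum.iter().all(|plane| {
        // The "positive vertex": the corner of the box farthest along the normal.
        let p: [f32; 3] = std::array::from_fn(|i| {
            if plane.normal[i] >= 0.0 { aabb.max[i] } else { aabb.min[i] }
        });
        plane.normal[0] * p[0] + plane.normal[1] * p[1] + plane.normal[2] * p[2] + plane.d >= 0.0
    })
}

fn main() {
    // A single plane facing +x through the origin keeps everything with x >= 0.
    let frustum = [Plane { normal: [1.0, 0.0, 0.0], d: 0.0 }];
    let visible = Aabb { min: [-1.0, -1.0, -1.0], max: [1.0, 1.0, 1.0] };
    let culled = Aabb { min: [-3.0, -1.0, -1.0], max: [-2.0, 1.0, 1.0] };
    assert!(aabb_visible(&visible, &frustum));
    assert!(!aabb_visible(&culled, &frustum));
    println!("culling test passed");
}
```

In indirect mode, a mesh that passes this test gets an instance slot allocated from its batch's `IndirectParameters`; one that fails is never written, so it is never drawn.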
Below are graphs of the CPU portions of `many-cubes --no-frustum-culling`. Yellow is this branch, red is `main`.

`extract_meshes`: It's notable that we get a small win even though we're now writing to a GPU buffer.

`queue_material_meshes`: There's a bit of a regression here; not sure what's causing it. In any case it's very outweighed by the other gains.

`batch_and_prepare_render_phase`: There's a huge win here, enough to make batching basically drop off the profile.

`write_batched_instance_buffer`: There's a massive improvement here, as expected. Note that a lot of it simply comes from the fact that `MeshInputUniform` is `Pod`. (This isn't a maintainability problem in my view because `MeshInputUniform` is so simple: just 16 tightly-packed words.)
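A large part of the `write_batched_instance_buffer` win is that a `Pod` struct can be copied into the GPU-bound buffer as raw bytes instead of being serialized field by field. Below is a minimal sketch of that idea; the field layout is hypothetical, not Bevy's actual `MeshInputUniform`, and it assumes the `bytemuck` crate with its `derive` feature. It is merely sized to the same 64 bytes / 16 words mentioned above.

```rust
use bytemuck::{Pod, Zeroable};

/// Hypothetical compact per-mesh input: 16 tightly-packed 4-byte words.
#[repr(C)]
#[derive(Clone, Copy, Pod, Zeroable)]
struct MeshInput {
    transform: [f32; 12],      // affine transform stored as a 3x4 matrix
    previous_input_index: u32, // where last frame's data lives (for TAA)
    flags: u32,
    _pad: [u32; 2],            // keeps the struct at exactly 64 bytes
}

fn main() {
    let inputs = vec![
        MeshInput {
            transform: [0.0; 12],
            previous_input_index: 0,
            flags: 0,
            _pad: [0; 2],
        };
        3
    ];

    // Because `MeshInput` is `Pod`, the whole Vec can be reinterpreted as bytes
    // and copied to the GPU in one shot, with no per-field serialization.
    let bytes: &[u8] = bytemuck::cast_slice(&inputs);
    assert_eq!(bytes.len(), inputs.len() * std::mem::size_of::<MeshInput>());
    println!("{} bytes ready for upload", bytes.len());
}
```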
Changelog

Added

Migration guide

* `batch_and_prepare_render_phase`: Code that was previously creating custom render phases should now add a `BinnedRenderPhasePlugin` or `SortedRenderPhasePlugin` as appropriate instead of directly adding `batch_and_prepare_render_phase`.