Reduce the size of MeshUniform to improve performance #9416

superdump · 2023-08-11T00:43:56Z

Objective

Significantly reduce the size of MeshUniform by only including necessary data.

Solution

Local to world, model transforms are affine. This means they only need a 4x3 matrix to represent them.

MeshUniform stores the current, and previous model transforms, and the inverse transpose of the current model transform, all as 4x4 matrices. Instead we can store the current, and previous model transforms as 4x3 matrices, and we only need the upper-left 3x3 part of the inverse transpose of the current model transform. This change allows us to reduce the serialized MeshUniform size from 208 bytes to 144 bytes, which is over a 30% saving in data to serialize, and VRAM bandwidth and space.

Benchmarks

On an M1 Max, running many_cubes -- sphere, main is in yellow, this PR is in red:

A reduction in frame time of ~14%.

Changelog

Changed: Redefined MeshUniform to improve performance by using 4x3 affine transforms and reconstructing 4x4 matrices in the shader. Helper functions were added to bevy_pbr::mesh_functions to unpack the data. affine_to_square converts the packed 4x3 in 3x4 matrix data to a 4x4 matrix. mat2x4_f32_to_mat3x3 converts the 3x3 in mat2x4 + f32 matrix data back into a 3x3.

Migration Guide

Shader code before:

var model = mesh[instance_index].model;

Shader code after:

#import bevy_pbr::mesh_functions affine_to_square

var model = affine_to_square(mesh[instance_index].model);

Before: mat4x4 x3 u32 = 196 bytes, but in an array, practically 208 bytes for 16-byte alignment. After: mat3x4 x2 mat2x4 f32 u32 = 136 bytes, but in an array, practically 144 bytes for 16-byte alignment. That is a reduction of over 30% VRAM space and bandwidth usage, plus less data to serialize.

JMS55 · 2023-08-11T06:17:30Z

For further saving, when motion vectors are not enabled, we should remove the previous model transform altogether.

james7132

Real quick scan, not a full review. Great to see the memory reduction and frame time gains though.

james7132 · 2023-08-11T14:44:08Z

crates/bevy_pbr/src/render/mesh.rs

-#[derive(Component, ShaderType, Clone)]
+#[derive(Component)]
+pub struct MeshTransforms {
+    pub transform: Affine3A,


If we're now doing nontrivial transformation into MeshUniform already, can we try using Affine3 over Affine3A here? We could save more CPU side memory and reduce command time costs.

I can for sure try it to see if it performs better.

There is no glam::Affine3. It looks like we only use the translation from the transform for view z calculations in the render app. I could then change them out for our own implementation of Affine3 that just has conversion to Affine3A implemented in case someone needs to do calculations with it.

I did that and retested (M1 Max, on AC power and fully charged, using my regular DPI external monitor, many_cubes -- sphere for 1500 frames).

main (yellow) vs PR (red):
frame time:

extract_meshes system:

extract_meshes system commands:

main (yellow) vs PR with MeshTransforms using custom Affine3:
frame time:

extract_meshes system:

extract_meshes system commands:

Summary:

frame times vs main in median frame time for this test run:

PR gives 6% reduction

PR + Affine3 gives 14% reduction

So it does indeed seem like adding struct Affine3 { matrix3: Mat3, translation: Vec3 } is worth it. Done.

In rendering code there is a significant performance benefit from using only Mat3 + Vec3 instead of Mat3A + Vec3A for mesh transforms as their storage size dominates performance.

crates/bevy_math/src/affine3.rs

robtfm · 2023-08-12T22:46:21Z

assets/shaders/custom_vertex_attribute.wgsl

@@ -23,7 +23,7 @@ struct VertexOutput {
 fn vertex(vertex: Vertex) -> VertexOutput {
    var out: VertexOutput;
    out.clip_position = mesh_position_local_to_clip(
-        mesh[bevy_render::instance_index::get_instance_index(vertex.instance_index)].model,
+        affine_to_square(mesh[bevy_render::instance_index::get_instance_index(vertex.instance_index)].model),


i think perhaps we should embrace the affinity and make the 3x4 the arg to the mesh_xxx and skin_normal functions, and the return of the skin_model function. maybe not very important since its only used per vertex but it would be cleaner.

then this would be clearer/easier to understand if the mesh.model (and the mesh_xxx arg) was a mat4x3 instead of a transpose ... is there a good reason to use 3x4 instead? the packing logic in mesh.rs is already quite complex so it probably wouldn't be worse. i guess MeshUniform would contain a [f32;12] instead of a [Vec4;3].

the downside would be if a caller has a 4x4 and wants to use the 4x3 functions they would have to use mesh_xxx(mat4x3(my4x4[0].xyz, my4x4[1].xyz, my4x4[2].xyz, my4x4[3].xyz)) - i guess we could add a utility conversion function for that if there isn't a better builtin way (i don't think there is).

i think perhaps we should embrace the affinity and make the 3x4 the arg to the mesh_xxx and skin_normal functions, and the return of the skin_model function. maybe not very important since its only used per vertex but it would be cleaner.

Then each of those would have to do the conversions or be less-readable code, and if anyone wanted to use the transforms for something else then they would have to figure out how.

I am rather feeling like we may want to add a get_model_matrix(instance_index) -> mat4x4<f32> function.

then this would be clearer/easier to understand if the mesh.model (and the mesh_xxx arg) was a mat4x3 instead of a transpose ... is there a good reason to use 3x4 instead? the packing logic in mesh.rs is already quite complex so it probably wouldn't be worse. i guess MeshUniform would contain a [f32;12] instead of a [Vec4;3].

4x3 takes the same space in uniform/storage buffers as 4x4 because it's practically 3 vec3s, and vec3s have an alignment of 16 so each vec3 starts at a 16-byte alignment. The shader would then error or misinterpret the data in the binding.

the downside would be if a caller has a 4x4 and wants to use the 4x3 functions they would have to use mesh_xxx(mat4x3(my4x4[0].xyz, my4x4[1].xyz, my4x4[2].xyz, my4x4[3].xyz)) - i guess we could add a utility conversion function for that if there isn't a better builtin way (i don't think there is).

Projection matrices are necessarily 4x4. so there we have to have a 4x4 matrix. One thing to note is that I have been told that the shader compiler (the GPU vendor one I guess?) will optimise away the multiplications by 0.0 and 1.0, so adding them back to make it a 4x4 does not hurt performance, it only makes code more readable and the matrix more usable.

I have added a get_model_matrix(instance_index: u32) -> mat4x4<f32> helper function and similar for previous model matrix.

everything makes sense given the size of mat4x3 is 64 bytes, i didn't realise that. then there's no purpose to 4x3 as input to the functions, and we would lose the size benefit if it was the unfiorm type. 👍

crates/bevy_pbr/src/render/mesh.rs

crates/bevy_pbr/src/render/mesh_functions.wgsl

superdump · 2023-08-13T11:03:14Z

For further saving, when motion vectors are not enabled, we should remove the previous model transform altogether.

I tried looking into this but it adds a lot of code complexity to optionally have the field or not, so I'm choosing to just always have it.

…module This is a bug/limitation in naga_oil.

This hides the ugliness and details that are unnecessary to know.

robtfm · 2023-08-13T20:07:06Z

last thing to change on the exported function in the resolved thread then all good

# Objective - Wireframe currently don't display since #9416 - There is an error ``` 2023-08-20T10:06:54.190347Z ERROR bevy_render::render_resource::pipeline_cache: failed to process shader: error: no definition in scope for identifier: 'vertex_no_morph' ┌─ crates/bevy_pbr/src/render/wireframe.wgsl:26:94 │ 26 │ let model = bevy_pbr::mesh_functions::get_model_matrix(vertex_no_morph.instance_index); │ ^^^^^^^^^^^^^^^ unknown identifier │ = no definition in scope for identifier: 'vertex_no_morph' ``` ## Solution - Use the correct identifier

# Objective - Since #9416, example shader_material_glsl is broken <img width="1280" alt="Screenshot 2023-08-20 at 21 08 41" src="https://github.com/bevyengine/bevy/assets/8672791/16bc15ee-a58c-4ec6-87a8-c3799a6df74a"> ## Solution - Apply the same function as other examples, but in glsl

superdump requested a review from robtfm August 11, 2023 00:44

superdump force-pushed the affine-model-matrices branch from 1e6b2b3 to b7845f4 Compare August 11, 2023 01:04

alice-i-cecile added A-Rendering Drawing game state to the screen C-Performance A change motivated by improving speed, memory usage or compile times labels Aug 11, 2023

james7132 reviewed Aug 11, 2023

View reviewed changes

superdump added 2 commits August 12, 2023 13:09

bevy_math: Add Affine3 and conversions to/from Affine3A

100ac76

In rendering code there is a significant performance benefit from using only Mat3 + Vec3 instead of Mat3A + Vec3A for mesh transforms as their storage size dominates performance.

Use Affine3 for MeshTransforms

71827d3

robtfm reviewed Aug 12, 2023

View reviewed changes

superdump added 3 commits August 13, 2023 13:17

Add underscore to mat2x4_f32_to_mat3x3 to make it usable outside the …

45b38d6

…module This is a bug/limitation in naga_oil.

Implement From<&Affine3> for Affine3A and use it

dcb785c

Add get_(previous_)model_matrix shader helper functions

34ef93c

This hides the ugliness and details that are unnecessary to know.

superdump requested review from james7132 and robtfm August 13, 2023 11:41

Rename mat2x4_f32_to_mat3x3_ to mat2x4_f32_to_mat3x3_unpack

e91458c

robtfm approved these changes Aug 14, 2023

View reviewed changes

superdump added this pull request to the merge queue Aug 14, 2023

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 14, 2023

superdump added this pull request to the merge queue Aug 14, 2023

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 14, 2023

superdump added this pull request to the merge queue Aug 15, 2023

Merged via the queue into bevyengine:main with commit 0a11af9 Aug 15, 2023

This was referenced Aug 20, 2023

fix wireframe after MeshUniform size reduction #9505

Merged

fix shader_material_glsl example #9513

Merged

cart mentioned this pull request Oct 13, 2023

News: Release 0.12 bevyengine/bevy-website#754

Merged

43 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce the size of MeshUniform to improve performance #9416

Reduce the size of MeshUniform to improve performance #9416

superdump commented Aug 11, 2023

JMS55 commented Aug 11, 2023

james7132 left a comment

james7132 Aug 11, 2023

superdump Aug 11, 2023

superdump Aug 12, 2023

superdump Aug 12, 2023 •

edited

Loading

robtfm Aug 12, 2023

superdump Aug 13, 2023

superdump Aug 13, 2023

robtfm Aug 13, 2023

superdump commented Aug 13, 2023

robtfm commented Aug 13, 2023

Reduce the size of MeshUniform to improve performance #9416

Reduce the size of MeshUniform to improve performance #9416

Conversation

superdump commented Aug 11, 2023

Objective

Solution

Benchmarks

Changelog

Migration Guide

JMS55 commented Aug 11, 2023

james7132 left a comment

Choose a reason for hiding this comment

james7132 Aug 11, 2023

Choose a reason for hiding this comment

superdump Aug 11, 2023

Choose a reason for hiding this comment

superdump Aug 12, 2023

Choose a reason for hiding this comment

superdump Aug 12, 2023 • edited Loading

Choose a reason for hiding this comment

robtfm Aug 12, 2023

Choose a reason for hiding this comment

superdump Aug 13, 2023

Choose a reason for hiding this comment

superdump Aug 13, 2023

Choose a reason for hiding this comment

robtfm Aug 13, 2023

Choose a reason for hiding this comment

superdump commented Aug 13, 2023

robtfm commented Aug 13, 2023

superdump Aug 12, 2023 •

edited

Loading