-
-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce the size of MeshUniform to improve performance #9416
Conversation
Before: mat4x4 x3 u32 = 196 bytes, but in an array, practically 208 bytes for 16-byte alignment. After: mat3x4 x2 mat2x4 f32 u32 = 136 bytes, but in an array, practically 144 bytes for 16-byte alignment. That is a reduction of over 30% VRAM space and bandwidth usage, plus less data to serialize.
1e6b2b3
to
b7845f4
Compare
For further saving, when motion vectors are not enabled, we should remove the previous model transform altogether. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Real quick scan, not a full review. Great to see the memory reduction and frame time gains though.
crates/bevy_pbr/src/render/mesh.rs
Outdated
#[derive(Component, ShaderType, Clone)] | ||
#[derive(Component)] | ||
pub struct MeshTransforms { | ||
pub transform: Affine3A, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're now doing nontrivial transformation into MeshUniform already, can we try using Affine3 over Affine3A here? We could save more CPU side memory and reduce command time costs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can for sure try it to see if it performs better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no glam::Affine3
. It looks like we only use the translation from the transform for view z calculations in the render app. I could then change them out for our own implementation of Affine3 that just has conversion to Affine3A implemented in case someone needs to do calculations with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did that and retested (M1 Max, on AC power and fully charged, using my regular DPI external monitor, many_cubes -- sphere
for 1500 frames).
main (yellow) vs PR (red):
frame time:
extract_meshes system:
extract_meshes system commands:
main (yellow) vs PR with MeshTransforms
using custom Affine3
:
frame time:
extract_meshes system:
extract_meshes system commands:
Summary:
- frame times vs main in median frame time for this test run:
- PR gives 6% reduction
- PR + Affine3 gives 14% reduction
So it does indeed seem like adding struct Affine3 { matrix3: Mat3, translation: Vec3 }
is worth it. Done.
In rendering code there is a significant performance benefit from using only Mat3 + Vec3 instead of Mat3A + Vec3A for mesh transforms as their storage size dominates performance.
@@ -23,7 +23,7 @@ struct VertexOutput { | |||
fn vertex(vertex: Vertex) -> VertexOutput { | |||
var out: VertexOutput; | |||
out.clip_position = mesh_position_local_to_clip( | |||
mesh[bevy_render::instance_index::get_instance_index(vertex.instance_index)].model, | |||
affine_to_square(mesh[bevy_render::instance_index::get_instance_index(vertex.instance_index)].model), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think perhaps we should embrace the affinity and make the 3x4 the arg to the mesh_xxx
and skin_normal
functions, and the return of the skin_model
function. maybe not very important since its only used per vertex but it would be cleaner.
then this would be clearer/easier to understand if the mesh.model (and the mesh_xxx
arg) was a mat4x3 instead of a transpose ... is there a good reason to use 3x4 instead? the packing logic in mesh.rs is already quite complex so it probably wouldn't be worse. i guess MeshUniform
would contain a [f32;12] instead of a [Vec4;3].
the downside would be if a caller has a 4x4 and wants to use the 4x3 functions they would have to use mesh_xxx(mat4x3(my4x4[0].xyz, my4x4[1].xyz, my4x4[2].xyz, my4x4[3].xyz))
- i guess we could add a utility conversion function for that if there isn't a better builtin way (i don't think there is).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think perhaps we should embrace the affinity and make the 3x4 the arg to the
mesh_xxx
andskin_normal
functions, and the return of theskin_model
function. maybe not very important since its only used per vertex but it would be cleaner.
Then each of those would have to do the conversions or be less-readable code, and if anyone wanted to use the transforms for something else then they would have to figure out how.
I am rather feeling like we may want to add a get_model_matrix(instance_index) -> mat4x4<f32>
function.
then this would be clearer/easier to understand if the mesh.model (and the
mesh_xxx
arg) was a mat4x3 instead of a transpose ... is there a good reason to use 3x4 instead? the packing logic in mesh.rs is already quite complex so it probably wouldn't be worse. i guessMeshUniform
would contain a [f32;12] instead of a [Vec4;3].
4x3 takes the same space in uniform/storage buffers as 4x4 because it's practically 3 vec3s, and vec3s have an alignment of 16 so each vec3 starts at a 16-byte alignment. The shader would then error or misinterpret the data in the binding.
the downside would be if a caller has a 4x4 and wants to use the 4x3 functions they would have to use
mesh_xxx(mat4x3(my4x4[0].xyz, my4x4[1].xyz, my4x4[2].xyz, my4x4[3].xyz))
- i guess we could add a utility conversion function for that if there isn't a better builtin way (i don't think there is).
Projection matrices are necessarily 4x4. so there we have to have a 4x4 matrix. One thing to note is that I have been told that the shader compiler (the GPU vendor one I guess?) will optimise away the multiplications by 0.0 and 1.0, so adding them back to make it a 4x4 does not hurt performance, it only makes code more readable and the matrix more usable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added a get_model_matrix(instance_index: u32) -> mat4x4<f32>
helper function and similar for previous model matrix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
everything makes sense given the size of mat4x3 is 64 bytes, i didn't realise that. then there's no purpose to 4x3 as input to the functions, and we would lose the size benefit if it was the unfiorm type. 👍
I tried looking into this but it adds a lot of code complexity to optionally have the field or not, so I'm choosing to just always have it. |
…module This is a bug/limitation in naga_oil.
This hides the ugliness and details that are unnecessary to know.
last thing to change on the exported function in the resolved thread then all good |
# Objective - Wireframe currently don't display since #9416 - There is an error ``` 2023-08-20T10:06:54.190347Z ERROR bevy_render::render_resource::pipeline_cache: failed to process shader: error: no definition in scope for identifier: 'vertex_no_morph' ┌─ crates/bevy_pbr/src/render/wireframe.wgsl:26:94 │ 26 │ let model = bevy_pbr::mesh_functions::get_model_matrix(vertex_no_morph.instance_index); │ ^^^^^^^^^^^^^^^ unknown identifier │ = no definition in scope for identifier: 'vertex_no_morph' ``` ## Solution - Use the correct identifier
# Objective - Since #9416, example shader_material_glsl is broken <img width="1280" alt="Screenshot 2023-08-20 at 21 08 41" src="https://github.com/bevyengine/bevy/assets/8672791/16bc15ee-a58c-4ec6-87a8-c3799a6df74a"> ## Solution - Apply the same function as other examples, but in glsl
Objective
Solution
Local to world, model transforms are affine. This means they only need a 4x3 matrix to represent them.
MeshUniform
stores the current, and previous model transforms, and the inverse transpose of the current model transform, all as 4x4 matrices. Instead we can store the current, and previous model transforms as 4x3 matrices, and we only need the upper-left 3x3 part of the inverse transpose of the current model transform. This change allows us to reduce the serialized MeshUniform size from 208 bytes to 144 bytes, which is over a 30% saving in data to serialize, and VRAM bandwidth and space.Benchmarks
On an M1 Max, running
many_cubes -- sphere
, main is in yellow, this PR is in red:A reduction in frame time of ~14%.
Changelog
MeshUniform
to improve performance by using 4x3 affine transforms and reconstructing 4x4 matrices in the shader. Helper functions were added tobevy_pbr::mesh_functions
to unpack the data.affine_to_square
converts the packed 4x3 in 3x4 matrix data to a 4x4 matrix.mat2x4_f32_to_mat3x3
converts the 3x3 in mat2x4 + f32 matrix data back into a 3x3.Migration Guide
Shader code before:
Shader code after: