
Use GpuArrayBuffer for MeshUniform #9254

Merged (15 commits) on Jul 30, 2023

Conversation

@superdump (Contributor) commented Jul 23, 2023

Objective

  • Reduce the number of rebindings to enable batching of draw commands

Solution

  • Use the new GpuArrayBuffer for MeshUniform data to store all MeshUniform data in arrays within fewer bindings
  • Also sort the opaque/alpha-mask prepass, opaque/alpha-mask main, and shadow phases by the per-object data binding's dynamic offset, to improve performance on WebGL2.

Changelog

  • Changed: Per-object MeshUniform data is now managed by GpuArrayBuffer as arrays in buffers that need to be indexed into.

Migration Guide

The old way of accessing the model member of an individual mesh object's shader Mesh struct, where each MeshUniform was stored at its own dynamic offset:

struct Vertex {
    @location(0) position: vec3<f32>,
};

fn vertex(vertex: Vertex) -> VertexOutput {
    var out: VertexOutput;
    out.clip_position = mesh_position_local_to_clip(
        mesh.model,
        vec4<f32>(vertex.position, 1.0)
    );
    return out;
}

The new way, where one needs to index into the array of Meshes for the batch using the instance index:

struct Vertex {
    @builtin(instance_index) instance_index: u32,
    @location(0) position: vec3<f32>,
};

fn vertex(vertex: Vertex) -> VertexOutput {
    var out: VertexOutput;
    out.clip_position = mesh_position_local_to_clip(
        mesh[vertex.instance_index].model,
        vec4<f32>(vertex.position, 1.0)
    );
    return out;
}

Note that using the instance_index is the default way to pass the per-object index into the shader, but if you wish to do custom rendering approaches you can pass it in however you like.
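
To build intuition for the batching win described in the Objective, here is a small CPU-side sketch (plain Rust, not Bevy code; the Draw struct and the numbers are made up) of why sorting draws by the per-object binding's dynamic offset before sorting by view z reduces rebinds:

```rust
// Toy illustration (not Bevy code) of why sorting draws by the per-object binding's
// dynamic offset before sorting by view z reduces bind group rebinds on WebGL2.
#[derive(Clone, Copy, Debug)]
struct Draw {
    dynamic_offset: u32, // dynamic offset of the uniform-buffer batch this draw reads from
    view_z: f32,         // values increase towards the camera, so opaque wants a descending sort
}

fn count_rebinds(draws: &[Draw]) -> usize {
    draws
        .windows(2)
        .filter(|w| w[0].dynamic_offset != w[1].dynamic_offset)
        .count()
}

fn main() {
    let mut draws = vec![
        Draw { dynamic_offset: 0, view_z: 5.0 },
        Draw { dynamic_offset: 256, view_z: 9.0 },
        Draw { dynamic_offset: 0, view_z: 7.0 },
        Draw { dynamic_offset: 256, view_z: 3.0 },
    ];

    // Old order: front-to-back by view z only.
    draws.sort_by(|a, b| b.view_z.total_cmp(&a.view_z));
    println!("view z only:         {} rebinds", count_rebinds(&draws));

    // New order: group by dynamic offset first, then front-to-back within each batch.
    draws.sort_by(|a, b| {
        a.dynamic_offset
            .cmp(&b.dynamic_offset)
            .then(b.view_z.total_cmp(&a.view_z))
    });
    println!("offset, then view z: {} rebinds", count_rebinds(&draws));
}
```

With this example data, sorting by view z alone interleaves the two batches and needs two rebinds, while sorting by dynamic offset first needs only one.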

@superdump force-pushed the use-gpu-list-for-per-object-data branch from 574516f to 382737d on July 23, 2023 14:45
@github-actions (bot)

Example no_renderer failed to run, please try running it locally and check the result.


@superdump (Contributor, Author) commented:

Benchmarks

All benchmarks are running many_cubes -- sphere on an M1 Max MacBook Pro 16". Images show main in yellow and the PR in red.

(2560x1440) Storage buffer with Opaque3d sorted by view z

Not having to re-bind the per-object data per draw command gives a 15% reduction in frame time:

Screenshot 2023-07-23 at 17 18 26

(2560x1440) Storage buffer with Opaque3d sorted by dynamic offset and view z

Sorting by dynamic offset is unnecessary here, as there is no dynamic offset in the storage buffer case, but it has no negative impact:

Screenshot 2023-07-23 at 17 18 55

(WebGL2 at 1280x720) Uniform buffer with Opaque3d sorted by view z

This is basically the same as main: we ignore the dynamic offsets and sort only by view z, so objects can end up in different batches and the rebinds due to different dynamic offsets cannot be avoided:

Screenshot 2023-07-23 at 17 19 39

(WebGL2 at 1280x720) Uniform buffer with Opaque3d sorted by dynamic offset and view z

Sorting by dynamic offset and then by view z avoids the need to re-bind the per-object uniform buffer at different dynamic offsets for each draw. This brings an 18.6% reduction in frame time:

Screenshot 2023-07-23 at 17 19 59

Comment on lines +40 to +45
pub(super) fn model(render_device: &RenderDevice, binding: u32) -> BindGroupLayoutEntry {
    GpuArrayBuffer::<MeshUniform>::binding_layout(
        binding,
        ShaderStages::VERTEX_FRAGMENT,
        render_device,
    )
}
Contributor:

I guess skinning and weights can get the same treatment in a future PR, right? They could use a storage buffer in lieu of a uniform buffer when available.

Contributor:

Let me reformulate: we already use an index and a uniform for skins and morphs, but is there any benefit to using a storage buffer when available?

Contributor:

We can pack all the skinning data into one storage buffer, instead of multiple uniform buffers. Fewer buffers = fewer rebinds = more performance :)

superdump (Author):

Right, it applies to all buffers of struct data, basically. One needs to plumb in the indices, so perhaps a per-object index has an index or offset into a skinning array or similar. I don't know enough about how skins work to know how best to structure the data. Another aspect of having a single buffer is that if the data itself repeats with varying counts of things (vertices in vertex buffers, or maybe the number of transforms in a skin?), then some kind of allocator is usually used to manage loading data in and out, handling items of varying sizes without having to reallocate buffers and rewrite everything all the time.
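
As a rough sketch of the allocator idea mentioned above (toy code, not Bevy's implementation; all names are made up), one can hand out (offset, size) ranges in one large buffer and recycle freed ranges instead of reallocating and rewriting everything:

```rust
// Toy range allocator over one large GPU buffer (not Bevy code): hand out
// (offset, size) ranges for varying-sized items and recycle freed ranges,
// instead of reallocating and rewriting the whole buffer on every change.
#[derive(Clone, Copy, Debug)]
struct Range {
    offset: u64,
    size: u64,
}

struct RangeAllocator {
    capacity: u64,
    cursor: u64,      // bump pointer for fresh allocations
    free: Vec<Range>, // recycled ranges, reused first-fit
}

impl RangeAllocator {
    fn new(capacity: u64) -> Self {
        Self { capacity, cursor: 0, free: Vec::new() }
    }

    fn allocate(&mut self, size: u64) -> Option<Range> {
        // Reuse the first freed range that is large enough (a real allocator
        // would split it and keep the remainder; this toy just drops it).
        if let Some(i) = self.free.iter().position(|r| r.size >= size) {
            let reused = self.free.swap_remove(i);
            return Some(Range { offset: reused.offset, size });
        }
        // Otherwise bump-allocate from the end, if capacity allows. Returning
        // None is where a real allocator would grow the buffer and copy.
        if self.cursor + size <= self.capacity {
            let range = Range { offset: self.cursor, size };
            self.cursor += size;
            return Some(range);
        }
        None
    }

    fn free(&mut self, range: Range) {
        self.free.push(range);
    }
}

fn main() {
    let mut alloc = RangeAllocator::new(1024);
    let skin_a = alloc.allocate(256).unwrap(); // e.g. a skin with many joints
    let skin_b = alloc.allocate(64).unwrap();  // e.g. a small skin
    alloc.free(skin_a);
    let skin_c = alloc.allocate(128).unwrap(); // reuses skin_a's slot, no buffer rewrite
    println!("b = {:?}, c = {:?}", skin_b, skin_c);
}
```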

    GpuArrayBuffer::<MeshUniform>::binding_layout(
        binding,
        ShaderStages::VERTEX_FRAGMENT,
        render_device,
Contributor:

In a future PR, it might be worthwhile to newtype device.limits().max_storage_buffers_per_shader_stage, and accept that newtype instead of RenderDevice as argument to GpuArrayBuffer so that it's clear what data it is using.
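
A minimal sketch of what such a newtype might look like (hypothetical name and method, not the actual Bevy API):

```rust
// Hypothetical newtype wrapping the single limit GpuArrayBuffer cares about, so that
// call sites make the dependency explicit instead of passing the whole RenderDevice.
#[derive(Clone, Copy, Debug)]
pub struct MaxStorageBuffersPerShaderStage(pub u32);

impl MaxStorageBuffersPerShaderStage {
    // Storage buffers can only be used for per-object data if the backend exposes any.
    pub fn storage_buffers_supported(self) -> bool {
        self.0 > 0
    }
}

fn main() {
    // On WebGL2 this limit is 0, so a batched-uniform-buffer fallback would be chosen.
    let limit = MaxStorageBuffersPerShaderStage(0);
    println!("use storage buffer: {}", limit.storage_buffers_supported());
}
```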

superdump (Author):

Mmm. I thought about passing the values explicitly, but then I also thought that other features or limits may impact the logic at some point. If it had been more difficult to plumb it in, then I would have changed it.

@nicopap added the A-Rendering (Drawing game state to the screen) and C-Performance (A change motivated by improving speed, memory usage or compile times) labels on Jul 23, 2023
@nicopap added this to the 0.12 milestone on Jul 23, 2023
@nicopap (Contributor) commented Jul 23, 2023

Looks good. (I'll do some testing on my end before dropping an approval) What is the "TODO" item supposed to mean? Do you intend to iterate in future PRs or within this one?

@superdump (Contributor, Author) replied:

> Looks good. (I'll do some testing on my end before dropping an approval) What is the "TODO" item supposed to mean? Do you intend to iterate in future PRs or within this one?

I'll probably do it in this one.

@JMS55 (Contributor) commented Jul 23, 2023

Reviewed the PR in its current state; minus the existing feedback, LGTM :)

Also good catch on GpuComponentArrayBufferPlugin not using finish().

…mic offset

When using a uniform buffer with batches of per-object data, this provides a
~18% frame time reduction on the many_cubes -- sphere example in the Opaque3d
phase and should benefit alpha mask, prepass, and shadow phases in a similar
way.
@github-actions (bot)

Example no_renderer failed to run, please try running it locally and check the result.

@superdump (Contributor, Author) commented:

Ideally we'd have a good way of benchmarking the impact on prepass and shadows. I may modify the many_cubes example to also support some modes to test the performance of these.

The RenderApp sub app does not exist when the renderer is disabled, so assuming
it does is not a good idea.
@github-actions (bot)

Example no_renderer failed to run, please try running it locally and check the result.

@robtfm robtfm self-requested a review July 24, 2023 10:17
@@ -138,7 +138,7 @@ fn fragment(
     pbr_input.V = V;
     pbr_input.occlusion = occlusion;

-    pbr_input.flags = mesh.flags;
+    pbr_input.flags = mesh[in.instance_index].flags;
Contributor:

this is using in.instance_index unconditionally, but its existence depends on VERTEX_OUTPUT_INSTANCE_INDEX. i think that's probably fine since this frag shader is (currently) specific to standard materials, but it doesn't feel great.

as far as i can tell this is the only use of the instance index in any fragment stage. would it make more sense to push the mesh flags through the MeshVertexOutput struct directly (and unconditionally)? then we could get rid of the VERTEX_OUTPUT_INSTANCE_INDEX entirely.

this would not make sense if there are other potential uses for mesh data in the fragment stage, but i think there shouldn't be, generally.

superdump (Author):

I think that would add cost to the vertex stage / interpolators but I don't know if it's worth the tradeoff.

After this change, people will need to be aware of per-object data and how to access it. Long-term, various indices will be added into the per-object data for materials, animations, and probably more. At least material stuff I expect to be used in fragment stage. The instance index, or some other way of getting the per-object index into the shader, would be used to index into the per-object data array, and then within that type there would be a material index to index into a material data array.
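
As a rough sketch of the indirection being described (hypothetical field names, not Bevy's actual MeshUniform layout), the instance index would select a per-object record, and a material index inside it would then select an entry in a material data array:

```rust
// Hypothetical per-object record (illustrative field names, not Bevy's MeshUniform):
// the instance index selects one record, and material_index would then select an
// entry in a separate material data array.
struct PerObjectData {
    model: [[f32; 4]; 4], // object-to-world transform
    flags: u32,
    material_index: u32, // hypothetical future index into a material data array
}

fn main() {
    let per_object = vec![
        PerObjectData { model: [[0.0; 4]; 4], flags: 0, material_index: 2 },
        PerObjectData { model: [[0.0; 4]; 4], flags: 1, material_index: 0 },
    ];
    let materials = ["matte", "metal", "glass"];

    // What a shader would conceptually do with the forwarded instance index:
    let instance_index = 0usize;
    let object = &per_object[instance_index];
    println!(
        "flags = {}, material = {}, last model row = {:?}",
        object.flags,
        materials[object.material_index as usize],
        object.model[3]
    );
}
```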

crates/bevy_pbr/src/render/mesh.rs (outdated review thread, resolved)
     // NOTE: Values increase towards the camera. Front-to-back ordering for opaque means we need a descending sort.
-    type SortKey = Reverse<FloatOrd>;
+    type SortKey = (u32, Reverse<FloatOrd>);
Contributor:

probably not for this pr but just wondering, is there any possibility for / value in some kind of spatial sorting before putting the mesh data in the gpu array? i think currently they are entity iterator sorted, but it seems like (at least for static meshes) there'd be some benefit to having a batch be as proximally local as possible and then to sort batches based on the nearest member, or something like that.

superdump (Author):

That will happen when the render set reordering and batching is implemented. The reason for reordering the render sets to extract, prepare assets, queue, sort, prepare + batch, render is so that the order of draws is known when preparing the data, and can therefore be taken into account.

crates/bevy_core_pipeline/src/core_3d/mod.rs (review thread, resolved)
crates/bevy_pbr/src/prepass/mod.rs (outdated review thread, resolved)
@superdump force-pushed the use-gpu-list-for-per-object-data branch from 392408d to 4c8fba0 on July 25, 2023 18:02
@superdump force-pushed the use-gpu-list-for-per-object-data branch from 4c8fba0 to dae52b7 on July 25, 2023 18:11
@Elabajaba (Contributor) commented Jul 26, 2023

This seems to be ignoring object transforms when rendering: everything renders in the same place. Tested 3d_scene (the cube is lower than it should be), 3d_shapes (everything is mashed together instead of being separate), and scene_viewer with a large custom scene (everything is mashed together, and the object clump jumps around weirdly when moving the camera).

edit: Works properly on Vulkan, broken on DX12.

@nicopap (Contributor) commented Jul 26, 2023

I'm getting a 45% speedup on vulkan/linux with your latest set of changes for many_cubes (no additional flags). A 1% speedup on many_foxes. Looking at the sub app=RenderApp and schedule=Render traces.

I'm unable to test the OpenGL backend as I get a panic with WGPU_BACKEND=gl (but that's also true of main).

@superdump (Contributor, Author) commented:

@nicopap nice!

@Elabajaba - oof, ok, I guess this isn't mergeable if it breaks DX12. Could anyone test DX12 on NVIDIA, as I think @Elabajaba uses AMD? I will test on a mobile RTX 3080 as soon as I can.

@robtfm (Contributor) commented Jul 26, 2023

> Could anyone test DX12 on NVIDIA

looks like it's always index 0 on NVIDIA/DX12 as well

@Elabajaba (Contributor) commented Jul 26, 2023

The Vulkan gains seem big enough that it might be worth forcing DX12 to always use draw instead of draw_indexed for now, and trying to get it implemented in wgpu for their next release?

edit: Not sure how many people are actually using bevy dx12, or what any potential performance losses might be (or if forcing draw instead of draw_indexed would break stuff).

@superdump (Contributor, Author) commented Jul 26, 2023

All of these benchmarks are running many_cubes on an M1 Max, comparing main (yellow), which uses a uniform buffer with a dynamic offset per object, against the PR (red), which uses GpuArrayBuffer (on WebGL2, batches of per-object data at dynamic offsets in a uniform buffer; otherwise one large storage buffer).

Storage buffers

Storage buffers with main pass only

Screenshot 2023-07-26 at 22 12 07

Storage buffer with depth prepass and main pass

Screenshot 2023-07-26 at 22 12 43

Storage buffer with cascaded shadow mapping, and main pass

Screenshot 2023-07-26 at 22 13 18

Storage buffer with depth prepass, cascaded shadow mapping, and main pass

Screenshot 2023-07-26 at 22 13 43

WebGL2

WebGL2 with main pass only

Screenshot 2023-07-26 at 22 14 24

WebGL2 with depth prepass, and main pass

Screenshot 2023-07-26 at 22 14 43

WebGL2 with cascaded shadow mapping, and main pass

Screenshot 2023-07-26 at 22 15 36

WebGL2 with depth prepass, cascaded shadow mapping, and main pass

Screenshot 2023-07-26 at 22 15 54

Summary

Looking only at median frame times:

  • Storage buffers
    • main pass: -13.3%
    • prepass + main pass: -18.6%
    • CSM + main pass: -11.1%
    • prepass + CSM + main pass: -17.8%
  • WebGL2
    • main pass: -15.7%
    • prepass + main pass: -23.5%
    • CSM + main pass: -23.3%
    • prepass + CSM + main pass: -18.0%

So, big gains all-round from looking at the CPU side. I would need to run the GPU timestamp query stuff on Windows to check what's happening on the GPU side.

@superdump (Contributor, Author) commented Jul 27, 2023

On a 5900HS and mobile RTX 3080 (on the power adapter) using Vulkan - main (yellow) vs PR (red) at 1280x720:

Storage buffer

Storage buffer with main pass only

Screenshot 2023-07-27 160637

Storage buffer with prepass, and main pass

Screenshot 2023-07-27 160706

Storage buffer with cascaded shadow mapping, and main pass

Screenshot 2023-07-27 160732

Storage buffer with prepass, cascaded shadow mapping, and main pass

Screenshot 2023-07-27 160751

WebGL2

WebGL2 with main pass only

Screenshot 2023-07-27 163355

WebGL2 with prepass, and main pass

Screenshot 2023-07-27 163423

WebGL2 with cascaded shadow mapping, and main pass

Screenshot 2023-07-27 163449

WebGL2 with prepass, cascaded shadow mapping, and main pass

Screenshot 2023-07-27 163516

Summary

  • Storage buffer
    • main pass: -7.4%
    • prepass + main pass: -12.7%
    • cascaded shadow mapping + main pass: -9.0%
    • prepass + cascaded shadow mapping + main pass: -10.2%
  • WebGL2
    • main pass: -12.8%
    • prepass + main pass: -18.2%
    • cascaded shadow mapping + main pass: -13.1%
    • prepass + cascaded shadow mapping + main pass: -13.0%

@superdump (Contributor, Author) commented:

I'm satisfied with the performance benefit. The blocking issue is DX12 instance_index being broken.

@superdump superdump mentioned this pull request Jul 29, 2023
@superdump (Contributor, Author) commented Jul 30, 2023

The DX12 issue was fixed by @Elabajaba - thanks for figuring that one out!

@superdump superdump added this pull request to the merge queue Jul 30, 2023
Merged via the queue into bevyengine:main with commit e6405bb Jul 30, 2023
superdump added a commit to superdump/bevy that referenced this pull request Jul 30, 2023
superdump added a commit to superdump/bevy that referenced this pull request Jul 31, 2023
github-merge-queue bot pushed a commit that referenced this pull request Jul 31, 2023
# Objective

- Fix shader_material_glsl example

## Solution

- Expose the `PER_OBJECT_BUFFER_BATCH_SIZE` shader def through the
default `MeshPipeline` specialization.
- Make use of it in the `custom_material.vert` shader to access the mesh
binding.

---

## Changelog

- Added: Exposed the `PER_OBJECT_BUFFER_BATCH_SIZE` shader def through
the default `MeshPipeline` specialization to use in custom shaders not
using bevy_pbr::mesh_bindings that still want to use the mesh binding in
some way.
@cart cart mentioned this pull request Oct 13, 2023