Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inefficient use of Vec::push in prepare_uniform_components function #8284

Closed
CrazyRoka opened this issue Apr 1, 2023 · 4 comments · Fixed by #8299
Closed

Inefficient use of Vec::push in prepare_uniform_components function #8284

CrazyRoka opened this issue Apr 1, 2023 · 4 comments · Fixed by #8299
Labels
A-Rendering Drawing game state to the screen C-Performance A change motivated by improving speed, memory usage or compile times

Comments

@CrazyRoka
Copy link
Contributor

I created a stress test scenario with the Bevy engine (link) where I created 100K entities. I noticed that the prepare_uniform_components function was using 4.23% of the CPU time, and upon investigating the code, I found that the Vec3::push method was being called for each of the entities, which is not efficient.

To improve performance, I suggest copying all the objects in batches instead of pushing them one by one. Apart from that, DynamicUniformBuffer::write can be optimized as well. I am attaching my Flamegraph to this issue.

fn prepare_uniform_components<C: Component>(
mut commands: Commands,
render_device: Res<RenderDevice>,
render_queue: Res<RenderQueue>,
mut component_uniforms: ResMut<ComponentUniforms<C>>,
components: Query<(Entity, &C)>,
) where
C: ShaderType + WriteInto + Clone,
{
component_uniforms.uniforms.clear();
let entities = components
.iter()
.map(|(entity, component)| {
(
entity,
DynamicUniformIndex::<C> {
index: component_uniforms.uniforms.push(component.clone()),
marker: PhantomData,
},
)
})
.collect::<Vec<_>>();
commands.insert_or_spawn_batch(entities);
component_uniforms
.uniforms
.write_buffer(&render_device, &render_queue);
}

Screenshot from 2023-04-01 17-06-22

@JMS55
Copy link
Contributor

JMS55 commented Apr 1, 2023

You may be interested in weighing in on #8204

@james7132 james7132 added A-Rendering Drawing game state to the screen C-Performance A change motivated by improving speed, memory usage or compile times labels Apr 4, 2023
@james7132
Copy link
Member

james7132 commented Apr 4, 2023

@CrazyRoka turns out the values Vec is basically unused and was causing unnecessary copies and allocations. #8299 removes it. As for the performance issues with encase, I strongly suggest making a PR upstream to address it. The flamegraph you provided definitely shows there's some non-trivial overhead when using it.

@CrazyRoka
Copy link
Contributor Author

@james7132 Thank you for your PR to improve Bevy performance. I have 2 more questions:

  1. How did you measure performance improvements in your PR?
  2. Can we replace component_uniforms.uniforms.push(component.clone()) with batch write? I was diving deeper into the implementation and it looks like each push call creates Writer Cursor, writes and closes this complex object. I think it will make a difference to open WriteCursor once and write all the components together.

@james7132
Copy link
Member

  1. How did you measure performance improvements in your PR?

I used Tracy and the trace_tracy feature. You can find the docs on how to use it under docs/profiling.md in the repo.

  1. Can we replace component_uniforms.uniforms.push(component.clone()) with batch write? I was diving deeper into the implementation and it looks like each push call creates Writer Cursor, writes and closes this complex object. I think it will make a difference to open WriteCursor once and write all the components together.

Batch insertion probably is for the best here. We need to construct both the buffer and the Vec for Commands, and preallocating then skipping the capacity checks likely will have a significant speedup. Though with that said, with dynamic uniform offsets being pessimistically aligned to 256 bytes, we might not see much gain there.

github-merge-queue bot pushed a commit that referenced this issue Sep 25, 2023
# Objective
This is a minimally disruptive version of #8340. I attempted to update
it, but failed due to the scope of the changes added in #8204.

Fixes #8307. Partially addresses #4642. As seen in
#8284, we're actually copying
data twice in Prepare stage systems. Once into a CPU-side intermediate
scratch buffer, and once again into a mapped buffer. This is inefficient
and effectively doubles the time spent and memory allocated to run these
systems.

## Solution
Skip the scratch buffer entirely and use
`wgpu::Queue::write_buffer_with` to directly write data into mapped
buffers.

Separately, this also directly uses
`wgpu::Limits::min_uniform_buffer_offset_alignment` to set up the
alignment when writing to the buffers. Partially addressing the issue
raised in #4642.

Storage buffers and the abstractions built on top of
`DynamicUniformBuffer` will need to come in followup PRs.

This may not have a noticeable performance difference in this PR, as the
only first-party systems affected by this are view related, and likely
are not going to be particularly heavy.

---

## Changelog
Added: `DynamicUniformBuffer::get_writer`.
Added: `DynamicUniformBufferWriter`.
rdrpenguin04 pushed a commit to rdrpenguin04/bevy that referenced this issue Jan 9, 2024
# Objective
This is a minimally disruptive version of bevyengine#8340. I attempted to update
it, but failed due to the scope of the changes added in bevyengine#8204.

Fixes bevyengine#8307. Partially addresses bevyengine#4642. As seen in
bevyengine#8284, we're actually copying
data twice in Prepare stage systems. Once into a CPU-side intermediate
scratch buffer, and once again into a mapped buffer. This is inefficient
and effectively doubles the time spent and memory allocated to run these
systems.

## Solution
Skip the scratch buffer entirely and use
`wgpu::Queue::write_buffer_with` to directly write data into mapped
buffers.

Separately, this also directly uses
`wgpu::Limits::min_uniform_buffer_offset_alignment` to set up the
alignment when writing to the buffers. Partially addressing the issue
raised in bevyengine#4642.

Storage buffers and the abstractions built on top of
`DynamicUniformBuffer` will need to come in followup PRs.

This may not have a noticeable performance difference in this PR, as the
only first-party systems affected by this are view related, and likely
are not going to be particularly heavy.

---

## Changelog
Added: `DynamicUniformBuffer::get_writer`.
Added: `DynamicUniformBufferWriter`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Rendering Drawing game state to the screen C-Performance A change motivated by improving speed, memory usage or compile times
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants