Inefficient use of Vec::push in prepare_uniform_components function #8284

CrazyRoka · 2023-04-01T16:15:23Z

I created a stress test scenario with the Bevy engine (link) that spawns 100K entities. I noticed that the prepare_uniform_components function was using 4.23% of the CPU time. Upon investigating the code, I found that Vec::push was being called once for each entity, which is not efficient.

To improve performance, I suggest copying all the objects in batches instead of pushing them one by one. Apart from that, DynamicUniformBuffer::write can be optimized as well. I am attaching my Flamegraph to this issue.

bevy/crates/bevy_render/src/extract_component.rs

Lines 126 to 153 in aefe1f0

    
    fn prepare_uniform_components<C: Component>(
        mut commands: Commands,
        render_device: Res<RenderDevice>,
        render_queue: Res<RenderQueue>,
        mut component_uniforms: ResMut<ComponentUniforms<C>>,
        components: Query<(Entity, &C)>,
    ) where
        C: ShaderType + WriteInto + Clone,
    {
        component_uniforms.uniforms.clear();
        let entities = components
            .iter()
            .map(|(entity, component)| {
                (
                    entity,
                    DynamicUniformIndex::<C> {
                        index: component_uniforms.uniforms.push(component.clone()),
                        marker: PhantomData,
                    },
                )
            })
            .collect::<Vec<_>>();
        commands.insert_or_spawn_batch(entities);
        component_uniforms
            .uniforms
            .write_buffer(&render_device, &render_queue);
    }


JMS55 · 2023-04-01T18:22:53Z

You may be interested in weighing in on #8204

james7132 · 2023-04-04T06:03:53Z

@CrazyRoka turns out the values Vec is basically unused and was causing unnecessary copies and allocations. #8299 removes it. As for the performance issues with encase, I strongly suggest making a PR upstream to address it. The flamegraph you provided definitely shows there's some non-trivial overhead when using it.

CrazyRoka · 2023-04-04T10:40:34Z

@james7132 Thank you for your PR to improve Bevy performance. I have 2 more questions:

1. How did you measure performance improvements in your PR?
2. Can we replace component_uniforms.uniforms.push(component.clone()) with a batch write? I was diving deeper into the implementation, and it looks like each push call creates a Writer Cursor, writes, and then closes this complex object. I think it will make a difference to open the WriteCursor once and write all the components together.
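To make the second question concrete, here is a minimal, std-only sketch of the two write patterns. It is an analogy rather than encase's or Bevy's actual code: `Item`, `write_one_by_one`, and `write_batched` are hypothetical names, and `std::io::Cursor` merely stands in for whatever writer object each push constructs.

```rust
use std::io::{Cursor, Write};

/// Hypothetical stand-in for a component; not a Bevy type.
#[derive(Clone, Copy)]
pub struct Item {
    pub value: u32,
}

/// Per-item pattern: a fresh cursor is built, positioned, and dropped for
/// every element, and the buffer grows on demand.
pub fn write_one_by_one(items: &[Item]) -> Vec<u8> {
    let mut buffer = Vec::new();
    for item in items {
        let offset = buffer.len() as u64;
        let mut cursor = Cursor::new(&mut buffer);
        cursor.set_position(offset);
        cursor
            .write_all(&item.value.to_le_bytes())
            .expect("writing to a Vec cannot fail");
    }
    buffer
}

/// Batched pattern: reserve the full size up front, open one cursor, and
/// write every element through it.
pub fn write_batched(items: &[Item]) -> Vec<u8> {
    let mut buffer = Vec::with_capacity(items.len() * std::mem::size_of::<u32>());
    let mut cursor = Cursor::new(&mut buffer);
    for item in items {
        cursor
            .write_all(&item.value.to_le_bytes())
            .expect("writing to a Vec cannot fail");
    }
    buffer
}
```

Both functions produce identical bytes; the difference is only in how often the writer is set up and how often the buffer may reallocate. Whether that difference is measurable in the real system would need profiling.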

james7132 · 2023-04-04T20:48:04Z

> How did you measure performance improvements in your PR?

I used Tracy and the trace_tracy feature. You can find the docs on how to use it under docs/profiling.md in the repo.

> Can we replace component_uniforms.uniforms.push(component.clone()) with batch write? I was diving deeper into the implementation and it looks like each push call creates Writer Cursor, writes and closes this complex object. I think it will make a difference to open WriteCursor once and write all the components together.

Batch insertion is probably for the best here. We need to construct both the buffer and the Vec for Commands, and preallocating and then skipping the capacity checks will likely give a significant speedup. That said, with dynamic uniform offsets being pessimistically aligned to 256 bytes, we might not see much gain there.
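The cost of the 256-byte alignment can be estimated with a little arithmetic. The sketch below is illustrative only: `align_up` and `buffer_bytes` are hypothetical helpers, and the 64-byte item size (for example, one 4x4 matrix of `f32`) is an assumption, not a figure from this thread.

```rust
/// Round `size` up to the next multiple of `alignment`.
/// `alignment` must be a power of two.
pub const fn align_up(size: u64, alignment: u64) -> u64 {
    (size + alignment - 1) & !(alignment - 1)
}

/// Total bytes occupied by `count` items when each item starts at an
/// offset that is a multiple of `alignment`.
pub const fn buffer_bytes(count: u64, item_size: u64, alignment: u64) -> u64 {
    count * align_up(item_size, alignment)
}
```

Under these assumptions, 100,000 items of 64 bytes occupy `buffer_bytes(100_000, 64, 256)` = 25,600,000 bytes, four times the 6,400,000 bytes of actual payload. Each item lands in its own 256-byte slot, so writes are strided rather than contiguous, which may limit how much a batched write can gain.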

# Objective

This is a minimally disruptive version of #8340. I attempted to update it, but failed due to the scope of the changes added in #8204.

Fixes #8307. Partially addresses #4642. As seen in #8284, we're actually copying data twice in Prepare stage systems. Once into a CPU-side intermediate scratch buffer, and once again into a mapped buffer. This is inefficient and effectively doubles the time spent and memory allocated to run these systems.

## Solution

Skip the scratch buffer entirely and use `wgpu::Queue::write_buffer_with` to directly write data into mapped buffers.

Separately, this also directly uses `wgpu::Limits::min_uniform_buffer_offset_alignment` to set up the alignment when writing to the buffers. Partially addressing the issue raised in #4642.

Storage buffers and the abstractions built on top of `DynamicUniformBuffer` will need to come in followup PRs.

This may not have a noticeable performance difference in this PR, as the only first-party systems affected by this are view related, and likely are not going to be particularly heavy.

---

## Changelog

Added: `DynamicUniformBuffer::get_writer`.
Added: `DynamicUniformBufferWriter`.
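The double copy described above can be illustrated with a small std-only sketch. It does not call wgpu: `Destination` is a hypothetical stand-in for a mapped buffer backed by ordinary memory, and both functions return the number of bytes they copied so the two paths can be compared.

```rust
/// Hypothetical stand-in for a mapped GPU buffer; plain memory here.
pub struct Destination {
    pub bytes: Vec<u8>,
}

/// Two-copy path: serialize into a scratch buffer, then copy the scratch
/// buffer into the destination. Returns the total bytes copied.
pub fn write_via_scratch(values: &[u32], dst: &mut Destination) -> usize {
    assert!(dst.bytes.len() >= values.len() * 4, "destination too small");
    let mut scratch = Vec::with_capacity(values.len() * 4);
    for value in values {
        scratch.extend_from_slice(&value.to_le_bytes()); // first copy
    }
    dst.bytes[..scratch.len()].copy_from_slice(&scratch); // second copy
    scratch.len() * 2
}

/// One-copy path: serialize each value straight into the destination.
/// Returns the total bytes copied.
pub fn write_direct(values: &[u32], dst: &mut Destination) -> usize {
    assert!(dst.bytes.len() >= values.len() * 4, "destination too small");
    for (chunk, value) in dst.bytes.chunks_exact_mut(4).zip(values) {
        chunk.copy_from_slice(&value.to_le_bytes()); // only copy
    }
    values.len() * 4
}
```

For the same input, both paths leave identical bytes in the destination, but the scratch path moves twice as many bytes and needs an extra allocation. This mirrors the reasoning in the description, not the actual implementation of the PR.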


james7132 added the A-Rendering (Drawing game state to the screen) and C-Performance (A change motivated by improving speed, memory usage or compile times) labels Apr 4, 2023

james7132 mentioned this issue Apr 4, 2023

Remove unnecesssary values Vec from DynamicUniformBuffer and DynamicStorageBuffer #8299

Merged

superdump closed this as completed in 63d89d3 Apr 4, 2023

superdump closed this as completed in #8299 Apr 4, 2023

This was referenced Apr 5, 2023

Direct copy API for Buffer wrappers #8307

Closed

Directly copy data into uniform buffers #8340

Closed

james7132 mentioned this issue Sep 20, 2023

Directly copy data into uniform buffers #9865

Merged
