Parallelize extract_meshes #9966

james7132 · 2023-09-29T12:15:47Z

Objective

extract_meshes can easily be one of the most expensive operations in the blocking extract schedule for 3D apps. It also has no fundamentally serialized parts and can easily be run across multiple threads. Let's speed it up by parallelizing it!

Solution

Use the ThreadLocal<Cell<Vec<T>>> approach utilized by #7348 in conjunction with Query::par_iter to build a set of thread-local queues, and collect them after going wide.
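For readers unfamiliar with the pattern, here is a minimal std-only sketch of "fill per-thread queues, then collect after going wide". It is not the code from this PR: `Extracted` and `extract_parallel` are hypothetical names, `std::thread::scope` with manual chunking stands in for `Query::par_iter`, and a `Cell<Vec<T>>` owned by each worker stands in for the `ThreadLocal<Cell<Vec<T>>>` container.

```rust
use std::cell::Cell;
use std::thread;

/// Hypothetical stand-in for the per-entity data produced by extraction.
#[derive(Debug, Clone, PartialEq)]
struct Extracted {
    entity: u32,
    value: u32,
}

/// Splits `entities` into chunks, lets each worker fill its own queue,
/// then drains every queue into a single Vec once all workers are done.
fn extract_parallel(entities: &[u32], workers: usize) -> Vec<Extracted> {
    let workers = workers.max(1);
    // Ceiling division; never zero, because `chunks(0)` would panic.
    let chunk_size = ((entities.len() + workers - 1) / workers).max(1);

    let queues: Vec<Vec<Extracted>> = thread::scope(|scope| {
        let handles: Vec<_> = entities
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || {
                    // Assumption: one queue per worker, accessed through the
                    // same take/push/set dance a Cell<Vec<T>> requires.
                    let queue: Cell<Vec<Extracted>> = Cell::new(Vec::new());
                    for &entity in chunk {
                        let mut local = queue.take();
                        local.push(Extracted {
                            entity,
                            value: entity.wrapping_mul(2),
                        });
                        queue.set(local);
                    }
                    queue.into_inner()
                })
            })
            .collect();
        // Joining here is the "after going wide" synchronization point.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker panicked"))
            .collect()
    });

    // Collection step: append each queue onto one output Vec.
    let mut output = Vec::with_capacity(entities.len());
    for mut queue in queues {
        output.append(&mut queue);
    }
    output
}
```

Calling `extract_parallel(&ids, 4)` returns one `Extracted` per input id. In this sketch the output order matches the input order only because chunks are joined in sequence; a real thread-local container would not guarantee that.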

Performance

Measured using cargo run --profile stress-test --features trace_tracy --example many_cubes. Yellow is this PR. Red is main.

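extract_meshes (Tracy comparison image: https://github.com/bevyengine/bevy/assets/3137680/9d45aa2e-3cfa-4fad-9c08-53498b51a73b):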

An average reduction from 1.2ms to 770us is seen, a 41.6% improvement.

Note: this is still not including #9950's changes, so this may actually result in even faster speedups once that's merged in.

superdump

I guess copying the data around the heap is cheaper than the parallelism. Well, nice. :)

james7132 · 2023-09-29T13:26:50Z

Ideally the HashMap would be closer to a Vec, so we could just append to the end of a larger Vec: batch memcpys are exceptionally fast because they are heavily vectorized. In fact, it may be better to send a Vec to the render world and construct the HashMap from it there, to avoid blocking on extraction.

Once we remove the need for entities as the FIXME states, it should be even faster since we only need to construct the HashMap.
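A rough sketch of that idea, under stated assumptions: `EntityId`, `MeshInstance`, `merge_queues`, and `build_lookup` are hypothetical names, not Bevy APIs. The merge step uses only `Vec::append`, which moves each queue's contents in bulk, and the HashMap is built in a separate function that could run later, outside the blocking extract step.

```rust
use std::collections::HashMap;

/// Hypothetical entity identifier; the real engine type is richer.
type EntityId = u32;

/// Hypothetical stand-in for the per-mesh data sent to the render world.
#[derive(Debug, Clone, PartialEq)]
struct MeshInstance {
    transform_index: usize,
}

/// Cheap step for the blocking extract phase: concatenate the per-thread
/// queues into one Vec. Each `append` is a bulk move that leaves the
/// source queue empty while keeping its allocation for reuse.
fn merge_queues(queues: &mut [Vec<(EntityId, MeshInstance)>]) -> Vec<(EntityId, MeshInstance)> {
    let total: usize = queues.iter().map(Vec::len).sum();
    let mut merged = Vec::with_capacity(total);
    for queue in queues.iter_mut() {
        merged.append(queue);
    }
    merged
}

/// More expensive step that could be deferred to the render world:
/// hash every entry to build the lookup table.
fn build_lookup(merged: Vec<(EntityId, MeshInstance)>) -> HashMap<EntityId, MeshInstance> {
    merged.into_iter().collect()
}
```

Splitting the work this way keeps the per-entry hashing out of the section that blocks the main world. Whether that is a net win would have to be measured.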

james7132 · 2023-09-29T13:31:57Z

> I guess copying the data around the heap is cheaper than the parallelism.

Now that the tracing overhead is almost all gone due to our span caching, I've noted that the overhead from parallelism is very low so long as we don't need to repeatedly park and unpark the task pool threads. As we parallelize more of the engine, there will be significantly less downtime as we increase CPU utilization. It may very well be worth it to copy more if we can go wide more readily.

james7132 · 2023-09-30T09:40:28Z

> Note: this is still not including #9950's changes, so this may actually result in even faster speedups once that's merged in.

#9950 has been merged, and the gap has grown even larger: it is now closer to a 51.6% decrease in time spent. Again, though, the bulk of the remaining time is still spent collecting into the final output Vec and HashMap. If we can remove either of those, it should be even faster.

superdump · 2023-09-30T10:06:14Z

This should be done for 2D meshes as well. And maybe also UI and sprites if possible.

The only downside might be if the parallelisation adds overhead that hurts power consumption, that is, if it takes more energy to extract or do something in parallel than in serial on one core.

crates/bevy_pbr/src/render/mesh.rs

Parallelize extract_meshes

9a319fc

james7132 requested a review from superdump September 29, 2023 12:15

james7132 added the A-Rendering (Drawing game state to the screen) and C-Performance (A change motivated by improving speed, memory usage or compile times) labels Sep 29, 2023

superdump approved these changes Sep 29, 2023

james7132 added 2 commits September 29, 2023 19:48

Move FIXME to correct location

3f7d576

Merge branch 'main' into parallelize-extract-meshes

43dce92

hymm approved these changes Sep 30, 2023

JoJoJet reviewed Oct 1, 2023

crates/bevy_pbr/src/render/mesh.rs (review thread resolved)

JoJoJet approved these changes Oct 1, 2023

james7132 added the S-Ready-For-Final-Review label (This PR has been approved by the community. It's ready for a maintainer to consider merging it) Oct 1, 2023

superdump approved these changes Oct 1, 2023

superdump added this pull request to the merge queue Oct 1, 2023

Merged via the queue into bevyengine:main with commit a1a81e5 Oct 1, 2023
25 of 26 checks passed

cart mentioned this pull request Oct 13, 2023

News: Release 0.12 bevyengine/bevy-website#754

Merged

43 tasks
