Parallelize extract_meshes #9966
I guess copying the data around the heap is cheaper than the parallelism. Well, nice. :)
Ideally the HashMap would be closer to a Vec, so we could just append to the end of a larger Vec, since batch memcpys are exceptionally fast due to being heavily vectorized. In fact, it may be better to send a Vec to the render world and construct the HashMap from it there to avoid blocking on extraction. Once we remove the need for entities, as the FIXME states, it should be even faster since we would only need to construct the HashMap.
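To illustrate the idea in the comment above, here is a minimal std-only sketch, not Bevy's actual code. The names `merge_batches` and `build_lookup` and the `(u32, u32)` entity/mesh pairs are hypothetical stand-ins: per-thread batches are appended into one large `Vec` (a bulk copy per batch), and the `HashMap` is built from that `Vec` in a separate, later step.

```rust
use std::collections::HashMap;

/// Appends every per-thread batch onto one large Vec.
/// `Vec::append` moves each batch's elements in bulk instead of pushing one at a time.
/// The (u32, u32) pairs are hypothetical (entity id, mesh id) stand-ins.
pub fn merge_batches(batches: Vec<Vec<(u32, u32)>>) -> Vec<(u32, u32)> {
    let total: usize = batches.iter().map(Vec::len).sum();
    let mut merged = Vec::with_capacity(total);
    for mut batch in batches {
        merged.append(&mut batch);
    }
    merged
}

/// Builds the lookup table from the merged Vec. In the scheme described above,
/// this step would run on the receiving side, outside the blocking extract step.
pub fn build_lookup(merged: Vec<(u32, u32)>) -> HashMap<u32, u32> {
    merged.into_iter().collect()
}
```

Calling `build_lookup(merge_batches(batches))` gives the same map as inserting each pair directly, but the work that has to happen while extraction blocks is reduced to the bulk appends.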
Now that the tracing overhead is almost all gone due to our span caching, I've noted that the overhead from parallelism is very low so long as we don't need to repeatedly park and unpark the task pool threads. As we parallelize more of the engine, there will be significantly less downtime as we increase CPU utilization. It may very well be worth it to copy more if we can go wide more readily.
#9950 has been merged, and the difference grows even larger. The difference is closer to a 51.6% decrease in time spent. Though again, the bulk of it is still in collection into the final output Vec and HashMap. If we can remove either of those, it should be even faster.
This should be done for 2D meshes as well, and maybe also for UI and sprites if possible. The only downside might be power consumption, if extracting in parallel takes more total energy than doing the same work serially on one core.
# Objective

`extract_meshes` can easily be one of the most expensive operations in the blocking extract schedule for 3D apps. It also has no fundamentally serialized parts and can easily be run across multiple threads. Let's speed it up by parallelizing it!

## Solution

Use the `ThreadLocal<Cell<Vec<T>>>` approach utilized by bevyengine#7348 in conjunction with `Query::par_iter` to build a set of thread-local queues, and collect them after going wide.

## Performance

Using `cargo run --profile stress-test --features trace_tracy --example many_cubes`. Yellow is this PR. Red is main.

`extract_meshes`:

![image](https://github.com/bevyengine/bevy/assets/3137680/9d45aa2e-3cfa-4fad-9c08-53498b51a73b)

An average reduction from 1.2ms to 770us is seen, a 41.6% improvement.

Note: this is still not including bevyengine#9950's changes, so this may actually result in even faster speedups once that's merged in.
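The "fill per-thread queues, then collect after going wide" pattern from the Solution section can be sketched with the standard library alone. This is an analogue under stated assumptions, not the PR's implementation: it uses `std::thread::scope` and one `Vec` per worker in place of `ThreadLocal<Cell<Vec<T>>>` and `Query::par_iter`, and `ExtractedItem`, `extract_parallel`, and the `entity % 4` mesh assignment are hypothetical.

```rust
use std::thread;

/// Hypothetical stand-in for the per-entity data gathered during extraction.
#[derive(Debug, Clone, PartialEq)]
pub struct ExtractedItem {
    pub entity: u32,
    pub mesh_id: u32,
}

/// Splits `entities` into chunks, lets each worker fill its own private queue
/// without any locking, then merges the queues once every worker has finished.
pub fn extract_parallel(entities: &[u32], workers: usize) -> Vec<ExtractedItem> {
    let workers = workers.max(1);
    // Round up so every entity lands in a chunk; `chunks(0)` would panic.
    let chunk_size = ((entities.len() + workers - 1) / workers).max(1);

    let queues: Vec<Vec<ExtractedItem>> = thread::scope(|scope| {
        let handles: Vec<_> = entities
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || {
                    // Each worker owns this queue exclusively (the "thread-local" part).
                    let mut queue = Vec::with_capacity(chunk.len());
                    for &entity in chunk {
                        // Placeholder work: a made-up mesh assignment.
                        queue.push(ExtractedItem { entity, mesh_id: entity % 4 });
                    }
                    queue
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker panicked"))
            .collect()
    });

    // Collection step after going wide: append each queue onto the output.
    let mut out = Vec::with_capacity(entities.len());
    for mut queue in queues {
        out.append(&mut queue);
    }
    out
}
```

The parallel phase needs no synchronization because no two workers touch the same queue; the only serial cost is the final collection, which matches the observation in the comments that most of the remaining time is spent building the output `Vec` and `HashMap`.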