Parallelize extract_meshes #9966

Merged
merged 3 commits into from
Oct 1, 2023


james7132
Member

Objective

`extract_meshes` can easily be one of the most expensive operations in the blocking extract schedule for 3D apps. It also has no fundamentally serialized parts and can easily be run across multiple threads. Let's speed it up by parallelizing it!

Solution

Use the `ThreadLocal<Cell<Vec<T>>>` approach utilized by #7348 in conjunction with `Query::par_iter` to build a set of thread-local queues, and collect them after going wide.
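
A minimal sketch of the "fill per-thread queues, then collect after going wide" pattern described above. It is not the PR's code: it swaps Bevy's task pool, `Query::par_iter`, and `ThreadLocal<Cell<Vec<T>>>` for plain `std::thread::scope`, with each worker returning its own local `Vec`. The names `ExtractedItem` and `extract_parallel` are hypothetical and exist only for illustration.

```rust
use std::thread;

/// Hypothetical stand-in for the per-mesh data that extraction produces.
#[derive(Debug, Clone, PartialEq)]
pub struct ExtractedItem {
    pub entity_id: u32,
    pub transform_x: f32,
}

/// Extracts `source` in parallel: each worker fills its own local queue,
/// and the queues are merged once every worker has finished.
pub fn extract_parallel(source: &[(u32, f32)], workers: usize) -> Vec<ExtractedItem> {
    let workers = workers.max(1);
    // Ceiling division so every element lands in exactly one chunk.
    let chunk_len = ((source.len() + workers - 1) / workers).max(1);

    // "Go wide": one scoped thread per chunk, each with a private Vec,
    // so no locking is needed while pushing.
    let queues: Vec<Vec<ExtractedItem>> = thread::scope(|scope| {
        // Spawn all workers first (eager collect), then join them.
        let handles: Vec<_> = source
            .chunks(chunk_len)
            .map(|chunk| {
                scope.spawn(move || {
                    let mut local = Vec::with_capacity(chunk.len());
                    for &(entity_id, transform_x) in chunk {
                        local.push(ExtractedItem { entity_id, transform_x });
                    }
                    local
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker panicked"))
            .collect()
    });

    // Collect: move each queue into the output with a bulk append.
    let mut out = Vec::with_capacity(source.len());
    for mut queue in queues {
        out.append(&mut queue);
    }
    out
}
```

Calling `extract_parallel(&items, 4)` on a slice of `(id, x)` pairs returns one `Vec` containing every item. Because chunks are joined in spawn order, this sketch happens to preserve input order; a thread-local queue approach would not necessarily do so.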

Performance

Using `cargo run --profile stress-test --features trace_tracy --example many_cubes`. Yellow is this PR. Red is main.

`extract_meshes`:

![image](https://github.com/bevyengine/bevy/assets/3137680/9d45aa2e-3cfa-4fad-9c08-53498b51a73b)

An average reduction from 1.2ms to 770us is seen, a 41.6% improvement.

Note: this still does not include #9950's changes, so the speedup may be even larger once that's merged in.

@james7132 james7132 requested a review from superdump September 29, 2023 12:15
@james7132 james7132 added A-Rendering Drawing game state to the screen C-Performance A change motivated by improving speed, memory usage or compile times labels Sep 29, 2023
Contributor

@superdump superdump left a comment

I guess copying the data around the heap is cheaper than the parallelism. Well, nice. :)

@james7132
Member Author

Ideally the HashMap would be closer to a Vec, so we could just append to the end of a larger Vec; batch memcpys are exceptionally fast because they are heavily vectorized. In fact, it may be better to send a Vec to the render world and construct the HashMap from it there, to avoid blocking on extraction.

Once we remove the need for entities as the FIXME states, it should be even faster since we only need to construct the HashMap.
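
To illustrate the idea in the comment above, here is a rough sketch under stated assumptions: the extraction side only concatenates per-thread batches into a flat `Vec`, and the `HashMap` is built from that `Vec` in a later step. The types and function names (`MeshInstance`, `flatten_batches`, `build_lookup`) and the `u64` key are hypothetical, not Bevy's actual types.

```rust
use std::collections::HashMap;

/// Hypothetical per-mesh payload; the real extracted data is not shown in this thread.
#[derive(Debug, Clone, PartialEq)]
pub struct MeshInstance {
    pub mesh_id: u32,
}

/// Extraction side: concatenate per-thread batches into one flat Vec.
pub fn flatten_batches(batches: Vec<Vec<(u64, MeshInstance)>>) -> Vec<(u64, MeshInstance)> {
    let total: usize = batches.iter().map(Vec::len).sum();
    let mut flat = Vec::with_capacity(total);
    for mut batch in batches {
        // `Vec::append` moves the whole batch over in one bulk copy.
        flat.append(&mut batch);
    }
    flat
}

/// Later step (for example, in the render world): build the lookup table
/// from the flat Vec, outside the blocking extract step.
pub fn build_lookup(flat: Vec<(u64, MeshInstance)>) -> HashMap<u64, MeshInstance> {
    flat.into_iter().collect()
}
```

The split keeps the blocking portion down to bulk appends, while the per-element hashing cost of building the map is paid wherever `build_lookup` runs.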

@james7132
Member Author

I guess copying the data around the heap is cheaper than the parallelism.

Now that the tracing overhead is almost all gone due to our span caching, I've noted that the overhead from parallelism is very low so long as we don't need to repeatedly park and unpark the task pool threads. As we parallelize more of the engine, there will be significantly less downtime as we increase CPU utilization. It may very well be worth it to copy more if we can go wide more readily.

@james7132
Member Author

james7132 commented Sep 30, 2023

Note: this still does not include #9950's changes, so the speedup may be even larger once that's merged in.

#9950 has been merged, and the gap has grown: the difference is now closer to a 51.6% decrease in time spent. Again, though, the bulk of the remaining time is still spent collecting into the final output Vec and HashMap. If we can remove either of those, it should be even faster.

image

@superdump
Contributor

This should be done for 2D meshes as well. And maybe also UI and sprites if possible.

The only downside might be if the parallelisation adds overhead that hurts power consumption, that is, if it takes more energy to extract or do something in parallel than to do it serially on one core.

@james7132 james7132 added the S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it label Oct 1, 2023
@superdump superdump added this pull request to the merge queue Oct 1, 2023
Merged via the queue into bevyengine:main with commit a1a81e5 Oct 1, 2023
25 of 26 checks passed
@cart cart mentioned this pull request Oct 13, 2023
43 tasks
regnarock pushed a commit to regnarock/bevy that referenced this pull request Oct 13, 2023
rdrpenguin04 pushed a commit to rdrpenguin04/bevy that referenced this pull request Jan 9, 2024