Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multithreaded render command encoding #9172

Merged
merged 64 commits into from
Feb 9, 2024

Conversation

JMS55
Copy link
Contributor

@JMS55 JMS55 commented Jul 16, 2023

Objective

  • Encoding many GPU commands (such as in a renderpass with many draws, such as the main opaque pass) onto a wgpu::CommandEncoder is very expensive, and takes a long time.
  • To improve performance, we want to perform the command encoding for these heavy passes in parallel.

Solution

  • RenderContext can now queue up "command buffer generation tasks" which are closures that will generate a command buffer when called.
  • When finalizing the render context to produce the final list of command buffers, these tasks are run in parallel on the ComputeTaskPool to produce their corresponding command buffers.
  • The general idea is that the node graph will run in serial, but in a node, instead of doing rendering work, you can add tasks to do render work in parallel with other node's tasks that get ran at the end of the graph execution.

Nodes Parallelized

  • MainOpaquePass3dNode
  • PrepassNode
  • DeferredGBufferPrepassNode
  • ShadowPassNode (One task per view)

Future Work

  • For large number of draws calls, might be worth further subdividing passes into 2+ tasks.
  • Extend this to UI, 2d, transparent, and transmissive nodes?
  • Needs testing - small command buffers are inefficient - it may be worth reverting to the serial command encoder usage for render phases with few items.
  • All "serial" (traditional) rendering work must finish before parallel rendering tasks (the new stuff) can start to run.
  • There is still only one submission to the graphics queue at the end of the graph execution. There is still no ability to submit work earlier.

Performance Improvement

Thanks to @Elabajaba for testing on Bistro.

image

TLDR: Without shadow mapping, this PR has no impact. With shadow mapping, this PR gives ~40 more fps than main.


Changelog

  • MainOpaquePass3dNode, PrepassNode, DeferredGBufferPrepassNode, and each shadow map within ShadowPassNode are now encoded in parallel, giving greatly increased CPU performance, mainly when shadow mapping is enabled.
    • Does not work on WASM or AMD+Windows+Vulkan.
  • Added RenderContext::add_command_buffer_generation_task().
  • RenderContext::new() now takes adapter info
  • Some render graph and Node related types and methods now have additional lifetime constraints.

Migration Guide

RenderContext::new() now takes adapter info

  • Some render graph and Node related types and methods now have additional lifetime constraints.

@JMS55 JMS55 added A-Rendering Drawing game state to the screen C-Performance A change motivated by improving speed, memory usage or compile times labels Jul 16, 2023
@james7132 james7132 requested review from james7132 and superdump July 16, 2023 07:46
@dmyyy
Copy link
Contributor

dmyyy commented Jul 16, 2023

Base implementation looks good to me - interested in seeing what benchmarks will show.

@JMS55
Copy link
Contributor Author

JMS55 commented Jul 16, 2023

Not sure if I should mark this as a breaking change or not. Technically there's some extra lifetime constraints now, but 99.999% of users won't need to change their code. I think the only thing you would need to change is if you're using bevy_render's RenderContext for your own stuff, independent from the RenderGraph.

@JMS55 JMS55 marked this pull request as ready for review July 16, 2023 20:01
@JMS55
Copy link
Contributor Author

JMS55 commented Jul 16, 2023

Note that this PR currently depends on gfx-rs/wgpu#3626

@JMS55 JMS55 added this to the 0.12 milestone Jul 16, 2023
@james7132 james7132 added the S-Blocked This cannot move forward until something else changes label Jul 17, 2023
@james7132
Copy link
Member

Marking this as blocked on a potential wgpu release, since this otherwise would add complexity without any gains.

@cart
Copy link
Member

cart commented Jul 20, 2023

Does the benchmark use arcanized wgpu for both single and multi threaded impls?
Seems like arcanized wgpu on its own might give us a perf boost.

tjamaan pushed a commit to tjamaan/bevy that referenced this pull request Feb 6, 2024
# Objective

Keep core dependencies up to date.

## Solution

Update the dependencies.

wgpu 0.19 only supports raw-window-handle (rwh) 0.6, so bumping that was
included in this.

The rwh 0.6 version bump is just the simplest way of doing it. There
might be a way we can take advantage of wgpu's new safe surface creation
api, but I'm not familiar enough with bevy's window management to
untangle it and my attempt ended up being a mess of lifetimes and rustc
complaining about missing trait impls (that were implemented). Thanks to
@MiniaczQ for the (much simpler) rwh 0.6 version bump code.

Unblocks bevyengine#9172 and
bevyengine#10812

~~This might be blocked on cpal and oboe updating their ndk versions to
0.8, as they both currently target ndk 0.7 which uses rwh 0.5.2~~ Tested
on android, and everything seems to work correctly (audio properly stops
when minimized, and plays when re-focusing the app).

---

## Changelog

- `wgpu` has been updated to 0.19! The long awaited arcanization has
been merged (for more info, see
https://gfx-rs.github.io/2023/11/24/arcanization.html), and Vulkan
should now be working again on Intel GPUs.
- Targeting WebGPU now requires that you add the new `webgpu` feature
(setting the `RUSTFLAGS` environment variable to
`--cfg=web_sys_unstable_apis` is still required). This feature currently
overrides the `webgl2` feature if you have both enabled (the `webgl2`
feature is enabled by default), so it is not recommended to add it as a
default feature to libraries without putting it behind a flag that
allows library users to opt out of it! In the future we plan on
supporting wasm binaries that can target both webgl2 and webgpu now that
wgpu added support for doing so (see
bevyengine#11505).
- `raw-window-handle` has been updated to version 0.6.

## Migration Guide

- `bevy_render::instance_index::get_instance_index()` has been removed
as the webgl2 workaround is no longer required as it was fixed upstream
in wgpu. The `BASE_INSTANCE_WORKAROUND` shaderdef has also been removed.
- WebGPU now requires the new `webgpu` feature to be enabled. The
`webgpu` feature currently overrides the `webgl2` feature so you no
longer need to disable all default features and re-add them all when
targeting `webgpu`, but binaries built with both the `webgpu` and
`webgl2` features will only target the webgpu backend, and will only
work on browsers that support WebGPU.
- Places where you conditionally compiled things for webgl2 need to be
updated because of this change, eg:
- `#[cfg(any(not(feature = "webgl"), not(target_arch = "wasm32")))]`
becomes `#[cfg(any(not(feature = "webgl") ,not(target_arch = "wasm32"),
feature = "webgpu"))]`
- `#[cfg(all(feature = "webgl", target_arch = "wasm32"))]` becomes
`#[cfg(all(feature = "webgl", target_arch = "wasm32", not(feature =
"webgpu")))]`
- `if cfg!(all(feature = "webgl", target_arch = "wasm32"))` becomes `if
cfg!(all(feature = "webgl", target_arch = "wasm32", not(feature =
"webgpu")))`
- `create_texture_with_data` now also takes a `TextureDataOrder`. You
can probably just set this to `TextureDataOrder::default()`
- `TextureFormat`'s `block_size` has been renamed to `block_copy_size`
- See the `wgpu` changelog for anything I might've missed:
https://github.com/gfx-rs/wgpu/blob/trunk/CHANGELOG.md

---------

Co-authored-by: François <mockersf@gmail.com>
// Opaque draws
if !opaque_phase.items.is_empty() {
#[cfg(feature = "trace")]
let _opaque_main_pass_3d_span = info_span!("opaque_main_pass_3d").entered();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Smallest of all nits: I can see that this span has a much narrower scope than the one on line 56, but they have super similar names. Is there a good reason for them to be this similar? I assume it's to maintain compatibility with the previous version?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I just copy pasted what the previous version did, yeah

Copy link
Contributor

@NthTensor NthTensor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Noted performance increase. No flickering shadows.

@JMS55 JMS55 added the S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it label Feb 7, 2024
QueuedCommandBuffer::Ready(command_buffer) => {
command_buffers.push((i, command_buffer));
}
QueuedCommandBuffer::Task(command_buffer_generation_task) => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why we can't immediately launch the task on queuing it, and just collect the results in finish? As this stands, if we have a large blocking part of the graph, the actual parallel tasks won't get started until that part of the graph is complete.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tasks need to be scoped, which isn't easy to do unless we start them all at the end. For now, this should be fine. I don't foresee it really being a problem.

self.command_buffers

if self.command_buffer_queue.is_empty() {
self.command_buffer_queue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is being done, even though the queue is empty? Shouldn't we just return Vec::new()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know why the code was like that. I removed all the redundant stuff.

Copy link
Contributor

@NthTensor NthTensor Feb 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually looked at the generated assembly of this when I did my review, and for whatever reason the original compiled into fewer operations. I should try it again, maybe I did something wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't have any actual impact, but I imagine it was because it early-outs if the queue is empty.

@james7132 james7132 added the M-Needs-Migration-Guide A breaking change to Bevy's public API that needs to be noted in a migration guide label Feb 7, 2024
Copy link
Contributor

github-actions bot commented Feb 7, 2024

It looks like your PR is a breaking change, but you didn't provide a migration guide.

Could you add some context on what users should update when this change get released in a new version of Bevy?
It will be used to help writing the migration guide for the version. Putting it after a ## Migration Guide will help it get automatically picked up by our tooling.

@james7132
Copy link
Member

The changes to the APIs are technically breaking changes. Could you document them in a Migration Guide section?

@JMS55
Copy link
Contributor Author

JMS55 commented Feb 7, 2024

It's not a breaking change I don't think, as the lifetimes should be inferred for most use cases.

@james7132 james7132 added M-Needs-Migration-Guide A breaking change to Bevy's public API that needs to be noted in a migration guide and removed M-Needs-Migration-Guide A breaking change to Bevy's public API that needs to be noted in a migration guide labels Feb 8, 2024
Copy link
Contributor

github-actions bot commented Feb 8, 2024

It looks like your PR is a breaking change, but you didn't provide a migration guide.

Could you add some context on what users should update when this change get released in a new version of Bevy?
It will be used to help writing the migration guide for the version. Putting it after a ## Migration Guide will help it get automatically picked up by our tooling.

@james7132
Copy link
Member

It's not a breaking change I don't think, as the lifetimes should be inferred for most use cases.

This adds an argument to RenderGraphRunner::run. I don't expect many people to be running their own render graph, but it's worth documenting.

Copy link
Member

@james7132 james7132 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested this on a few of the stress tests and this does seem to deliver on the performance improvements. many_foxes goes from 6.33ms (157 FPS) to 4.68ms (213 FPS).

Overall this looks good to me. Just a few final points that need to be addressed before it can be merged.

crates/bevy_pbr/src/render/light.rs Show resolved Hide resolved
/// Append a function that will generate a [`CommandBuffer`] to the
/// command buffer queue, to be ran later.
///
/// If present, this will flush the currently unflushed [`CommandEncoder`]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this still need addressing?

github-merge-queue bot pushed a commit that referenced this pull request Feb 9, 2024
# Objective
While profiling around to validate the results of #9172, I noticed that
`present_frames` can take a significant amount of time. Digging into the
cause, it seems like we're creating a new `QueryState` from scratch
every frame. This involves scanning the entire World's metadata instead
of just updating its view of the world.

## Solution
Use a `SystemState` argument to cache the `QueryState` to avoid this
construction cost.

## Performance
Against `many_foxes`, this seems to cut the time spent in
`present_frames` by nearly almost 2x. Yellow is this PR, red is main.


![image](https://github.com/bevyengine/bevy/assets/3137680/2b02bbe0-6219-4255-958d-b690e37e7fba)
@JMS55
Copy link
Contributor Author

JMS55 commented Feb 9, 2024

This adds an argument to RenderGraphRunner::run. I don't expect many people to be running their own render graph, but it's worth documenting.

No changes were made to the argument count, just adding some lifetimes that should be inferred for all existing use cases.

@JMS55
Copy link
Contributor Author

JMS55 commented Feb 9, 2024

Does this still need addressing?

Github won't let me respond to it, but idk what this comment is in reference to.

@james7132
Copy link
Member

Ship it!

@james7132 james7132 added this pull request to the merge queue Feb 9, 2024
Merged via the queue into bevyengine:main with commit f4dab8a Feb 9, 2024
23 checks passed
Subserial pushed a commit to Subserial/bevy_winit_hook that referenced this pull request Feb 20, 2024
# Objective

Keep core dependencies up to date.

## Solution

Update the dependencies.

wgpu 0.19 only supports raw-window-handle (rwh) 0.6, so bumping that was
included in this.

The rwh 0.6 version bump is just the simplest way of doing it. There
might be a way we can take advantage of wgpu's new safe surface creation
api, but I'm not familiar enough with bevy's window management to
untangle it and my attempt ended up being a mess of lifetimes and rustc
complaining about missing trait impls (that were implemented). Thanks to
@MiniaczQ for the (much simpler) rwh 0.6 version bump code.

Unblocks bevyengine/bevy#9172 and
bevyengine/bevy#10812

~~This might be blocked on cpal and oboe updating their ndk versions to
0.8, as they both currently target ndk 0.7 which uses rwh 0.5.2~~ Tested
on android, and everything seems to work correctly (audio properly stops
when minimized, and plays when re-focusing the app).

---

## Changelog

- `wgpu` has been updated to 0.19! The long awaited arcanization has
been merged (for more info, see
https://gfx-rs.github.io/2023/11/24/arcanization.html), and Vulkan
should now be working again on Intel GPUs.
- Targeting WebGPU now requires that you add the new `webgpu` feature
(setting the `RUSTFLAGS` environment variable to
`--cfg=web_sys_unstable_apis` is still required). This feature currently
overrides the `webgl2` feature if you have both enabled (the `webgl2`
feature is enabled by default), so it is not recommended to add it as a
default feature to libraries without putting it behind a flag that
allows library users to opt out of it! In the future we plan on
supporting wasm binaries that can target both webgl2 and webgpu now that
wgpu added support for doing so (see
bevyengine/bevy#11505).
- `raw-window-handle` has been updated to version 0.6.

## Migration Guide

- `bevy_render::instance_index::get_instance_index()` has been removed
as the webgl2 workaround is no longer required as it was fixed upstream
in wgpu. The `BASE_INSTANCE_WORKAROUND` shaderdef has also been removed.
- WebGPU now requires the new `webgpu` feature to be enabled. The
`webgpu` feature currently overrides the `webgl2` feature so you no
longer need to disable all default features and re-add them all when
targeting `webgpu`, but binaries built with both the `webgpu` and
`webgl2` features will only target the webgpu backend, and will only
work on browsers that support WebGPU.
- Places where you conditionally compiled things for webgl2 need to be
updated because of this change, eg:
- `#[cfg(any(not(feature = "webgl"), not(target_arch = "wasm32")))]`
becomes `#[cfg(any(not(feature = "webgl") ,not(target_arch = "wasm32"),
feature = "webgpu"))]`
- `#[cfg(all(feature = "webgl", target_arch = "wasm32"))]` becomes
`#[cfg(all(feature = "webgl", target_arch = "wasm32", not(feature =
"webgpu")))]`
- `if cfg!(all(feature = "webgl", target_arch = "wasm32"))` becomes `if
cfg!(all(feature = "webgl", target_arch = "wasm32", not(feature =
"webgpu")))`
- `create_texture_with_data` now also takes a `TextureDataOrder`. You
can probably just set this to `TextureDataOrder::default()`
- `TextureFormat`'s `block_size` has been renamed to `block_copy_size`
- See the `wgpu` changelog for anything I might've missed:
https://github.com/gfx-rs/wgpu/blob/trunk/CHANGELOG.md

---------

Co-authored-by: François <mockersf@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Rendering Drawing game state to the screen C-Performance A change motivated by improving speed, memory usage or compile times M-Needs-Migration-Guide A breaking change to Bevy's public API that needs to be noted in a migration guide S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it
Projects
None yet
Development

Successfully merging this pull request may close these issues.