
Pool tracker vecs #5414

Merged
merged 10 commits into from
Mar 23, 2024

Conversation


@robtfm robtfm commented Mar 18, 2024

Description

When a large number of buffers/textures have ever been allocated, the cost of allocating vecs in UsageScopes becomes significant.

With 30-40k peak buffers, I see a ~75% reduction (200us -> 50us) in the time for CommandEncoder::begin_render_pass by using pools to keep vecs allocated instead of reallocating each time.

This implementation is bad; in particular, it is unsound if multiple HalApis are used in a single execution. There is probably a cleaner way to do this, maybe with a custom allocator?

It might also be worthwhile to have a threshold below which we don't bother pooling, to preserve micro-benchmarks/low-usage apps (this change is likely to be negative for a small peak number of buffers/textures).

I'm happy to make changes to clean it up, or for someone else to take the observation and redo it from scratch.
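A hypothetical sketch of the pooling idea (VecPool and its methods are illustrative names, not wgpu's actual types): recycle cleared Vecs so their heap capacity survives across UsageScope instantiations, turning most per-pass allocations into a pop from the pool.

```rust
use std::sync::Mutex;

/// Illustrative pool: spent Vecs are parked here instead of being dropped,
/// so the next acquire() reuses their capacity without reallocating.
struct VecPool<T> {
    pool: Mutex<Vec<Vec<T>>>,
}

impl<T> VecPool<T> {
    fn new() -> Self {
        Self { pool: Mutex::new(Vec::new()) }
    }

    /// Take a previously allocated Vec (capacity intact) or a fresh empty one.
    fn acquire(&self) -> Vec<T> {
        self.pool.lock().unwrap().pop().unwrap_or_default()
    }

    /// Drop the elements but keep the allocation, then park the Vec.
    fn release(&self, mut v: Vec<T>) {
        v.clear();
        self.pool.lock().unwrap().push(v);
    }
}

fn main() {
    let pool: VecPool<u32> = VecPool::new();
    let mut v = pool.acquire();
    v.extend(0..1024);
    let cap = v.capacity();
    pool.release(v);
    // The recycled Vec keeps its capacity, so refilling it needs no realloc.
    let reused = pool.acquire();
    assert!(reused.is_empty() && reused.capacity() >= cap);
}
```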

Some xtask test failures currently, which also fail on trunk (although trunk doesn't have the 1 slow):

     Summary [  75.356s] 669 tests run: 665 passed (1 slow), 4 failed, 0 skipped
  TRY 3 FAIL [   4.419s] wgpu-examples [Executed Failure: BACKEND] [Vulkan/NVIDIA GeForce RTX 3070/0] water
  TRY 3 FAIL [   3.933s] wgpu-examples [Executed] [Gl/NVIDIA GeForce RTX 3070/PCIe/SSE2/3] wgpu_examples::hello_compute::tests::multithreaded_compute
  TRY 3 FAIL [   3.509s] wgpu-examples [Executed] [Vulkan/NVIDIA GeForce RTX 3070/0] wgpu_examples::hello_synchronization::tests::sync
  TRY 3 FAIL [   0.522s] wgpu-test::wgpu-test [Executed] [Vulkan/NVIDIA GeForce RTX 3070/0] wgpu_test::shader::zero_init_workgroup_mem::zero_init_workgroup_memory

Testing

I've been running on a large app. A small test case should generate a large number of buffers, discard them all, and then measure the cost of CommandEncoder::begin_render_pass.

Checklist

  • Run cargo fmt.
  • Run cargo clippy. If applicable, add:
    • --target wasm32-unknown-unknown
    • --target wasm32-unknown-emscripten
  • Run cargo xtask test to run tests.
  • Add change to CHANGELOG.md. See simple instructions inside file.

@robtfm robtfm requested a review from a team as a code owner March 18, 2024 12:25

@Wumpf Wumpf left a comment


Having these as global variables seems pretty bad and also very unnecessary here. It's unsafe and the pools can never be deallocated. It would already be a lot better to have these on the device or instance.

The implementation of those two pools themselves is likely not sound: transmuting a vec as done here is both scary and unnecessary - the pool could simply have been generic over the element type.

Have you tried how performance plays out when you install a custom allocator in your application, e.g. https://docs.rs/mimalloc/latest/mimalloc/?

Given that you modified only very few locations to use these pools, we should first explore other ways of doing local pooling, or of avoiding the recreation of these vecs altogether. Also, a breakdown of the effect of each of those sites would be helpful, to better understand which optimization is worth how much.


Wumpf commented Mar 18, 2024

But thanks for trying to address this problem. I wonder if, in this state, this shouldn't rather be an extensive issue.


robtfm commented Mar 19, 2024

It would already be a lot better to have these on the device or instance

That makes sense; I put them in the device. I had to thread some lifetimes around to make it work, but it is cleaner. (Just to note: I had to do the transmuting before because the HalApi was part of the type, so I couldn't create a static pool which included the type.)

If you're happy with the approach, I can do the same for render bundles as well.

Have you tried how performance plays out when you install a custom allocator in your application, e.g. https://docs.rs/mimalloc/latest/mimalloc/

I haven't. I'm on Windows, and I remember there being a general impression that non-standard allocators were always worse, but that could well be out of date now. I can run tests, but not immediately (my project is now on a fork of 19.3 and backporting this is not completely trivial).
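For context, switching the global allocator is a small, crate-level change; the sketch below shows the usual pattern with the mimalloc crate (the dependency version is illustrative):

```rust
// Cargo.toml (version illustrative):
//   [dependencies]
//   mimalloc = "0.1"

use mimalloc::MiMalloc;

// Route every heap allocation in the process through mimalloc,
// including the Vec churn inside wgpu's usage-scope tracking.
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;

fn main() {
    // Any allocation below now goes through mimalloc.
    let v: Vec<u8> = Vec::with_capacity(1 << 20);
    drop(v);
}
```

Since #[global_allocator] applies process-wide, this is something the application (not wgpu itself) opts into.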


robtfm commented Mar 19, 2024

mimalloc actually helps a lot. Pooling is still better, but only by 10-20%, vs 150-300% previously.
Up to you if you think it's worth the complexity for these minor savings.

Below is a span with 16 render passes (bloom down+upscaling), still just UsageScopes as per the PR, backported to 19.3.

with ~10k peak buffers, default allocator (red=pooled)
[trace: 10k buffers, red=pool, yellow=no_pool]

with ~10k peak buffers, mimalloc (yellow=pooled, sorry)
[trace: mimalloc, 10k buffers, yellow=pool, red=no_pool]

with ~35k peak buffers, mimalloc (red=pooled)
[trace: mimalloc, 30k buffers, red=pool, yellow=no_pool]


Wumpf commented Mar 19, 2024

thanks for following up with detailed traces <3

mimalloc actually helps a lot. pooling is still better, but only 10-20% vs 150-300% previously.

wow 😮! I knew mimalloc is faster, but it's crazy that it makes that much of a difference. We should strongly consider putting that in the README then.

Despite the small savings (with mimalloc), I'm inclined to still put it in, now that this iteration already looks so much better.
However, that extra lifetime is indeed super annoying, and I'm kind of surprised it doesn't break anything else. I'm wondering if we could pass an Arc<UsageScopePool> around (where UsageScopePool is a very dumb type, or even just a type alias); it might be worth the overhead. But I haven't fully thought it through. What do you think?

The fact that putting elements in and out of the usage scope pool requires taking a lock also irks me a lot, but it's held so briefly that it's unlikely to contend (🤞..). Not sure if we already have some lock-free queue/stack in place somewhere that we could use instead 🤔. Let's not regard that as blocking this PR.
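A minimal sketch of the Arc<UsageScopePool> variant floated above (all names and the u64 element type are illustrative stand-ins, not wgpu's actual definitions): the Arc keeps the pool alive without threading lifetimes through the pass state, and the Mutex is held only for a single push or pop.

```rust
use std::sync::{Arc, Mutex};

// A "very dumb type, or even just a type alias": cleared Vecs behind a lock.
type UsageScopePool = Mutex<Vec<Vec<u64>>>;

struct Device {
    usage_scope_pool: Arc<UsageScopePool>,
}

struct UsageScope {
    pool: Arc<UsageScopePool>, // no lifetime needed: the Arc keeps the pool alive
    buffers: Vec<u64>,
}

impl Device {
    /// Hand out a scope backed by a recycled Vec when one is available.
    fn new_usage_scope(&self) -> UsageScope {
        let buffers = self.usage_scope_pool.lock().unwrap().pop().unwrap_or_default();
        UsageScope { pool: Arc::clone(&self.usage_scope_pool), buffers }
    }
}

impl Drop for UsageScope {
    /// Return the Vec (capacity intact, elements cleared) to the pool.
    fn drop(&mut self) {
        let mut v = std::mem::take(&mut self.buffers);
        v.clear();
        self.pool.lock().unwrap().push(v);
    }
}

fn main() {
    let device = Device { usage_scope_pool: Arc::new(Mutex::new(Vec::new())) };
    {
        let mut scope = device.new_usage_scope();
        scope.buffers.extend(0..4096);
    } // dropping the scope parks its Vec for reuse
    assert_eq!(device.usage_scope_pool.lock().unwrap().len(), 1);
}
```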


robtfm commented Mar 19, 2024

Using an Arc has a visible impact when very few buffers are allocated, but is negligible with a significant number.

10k buffers on the right of the graph, a couple of hundred on the left, +mimalloc; yellow=Arc<pool>, red=&device (== yellow from the 2nd graph above)
[trace image]

I could put the pool on the Hub to at least avoid the 2 lifetimes in RenderPassInfo, but Hub seems more user-facing, so I wasn't sure that would be well received.


Wumpf commented Mar 20, 2024

Ah, too bad. A compelling case for the lifetime variant, I reckon! Agreed, putting it on Hub feels wrong as well. You've convinced me: let's have those lifetimes; it doesn't look like they're spreading that far anyway :). Given that this is a fairly straightforward optimization now, and that the savings are that big (when not using mimalloc), it's totally worthwhile I'd say.

One more thing I'd like to ask you to try: reference not the entire device but just the UsageScopePool (or similar, as mentioned in my previous comment), to make the intention clearer (i.e. to avoid "why does this State object need the entire device?"). Unless, of course, I'm missing some good reason against it.
But as said, I think that in principle we can (and should!) move forward with this 🙂. Make sure to leave some comments in the final version that explain to future-us why we're doing this caching dance.

if you're happy with the approach i can do the same for render bundles as well.

yes, go ahead (:


robtfm commented Mar 21, 2024

not reference the entire device

done

do the same for render bundles

It looks like this would require an Arc (Resource requires 'static) and probably isn't useful - RenderBundles seem to be long-lived and repeatedly executed, so allocating on creation is probably fine.

Otoh, CommandBuffer uses Tracker, which also allocates buffer/texture-sized vecs on instantiation, and would also need an Arc to pool. I think the impact would be smaller - I guess RenderPasses are typically much higher volume ... but maybe this is an argument for Arcs, and if we want Arcs here we might as well do them straight away to avoid churning too much.

@Wumpf Wumpf self-requested a review March 21, 2024 12:37

@Wumpf Wumpf left a comment


Looking really good now. I did a more in-depth review; there are a couple of small things to clean up, but then we can land.


RenderBundles seem to be long-lived repeatable executions, so allocating on creation is probably fine.

Ah yes, if a render bundle isn't reused for several frames, we can well regard that as a user error, so I think we can leave it as is, imho.

otoh CommandBuffer uses Tracker which also allocates buffer/texture-sized vecs on instantiation, and would also need an Arc to pool

In any case, I'd say we take that in a separate iteration. Not having even more Arc'ing would be nice 🤷, but yes, there shouldn't be all that many CommandBuffers per frame. Again, let's keep this to a separate iteration :)

Resolved review threads: wgpu-core/src/track/mod.rs, wgpu-core/src/device/resource.rs (×2)
@Wumpf Wumpf left a comment


thanks! 🚢
