
Pool tracker vecs #5414

Merged
merged 10 commits into from
Mar 23, 2024

Conversation


@robtfm robtfm commented Mar 18, 2024

Description

When a large number of buffers/textures have ever been allocated, the cost of allocating vecs in UsageScopes becomes significant.

With 30-40k peak buffers, I see a ~75% reduction (200us -> 50us) in the time for CommandEncoder::begin_render_pass by using pools to keep vecs allocated instead of reallocating each time.

This implementation is bad; in particular, it is unsound if multiple HalApis are used in a single execution. There is probably a cleaner way to do this, maybe with a custom allocator?

It might also be worthwhile to have a threshold below which we don't bother pooling, to preserve micro-benchmarks/low-usage apps (this change is likely to be negative for a small peak number of buffers/textures).

I'm happy to make changes to clean it up, or for someone else to take the observation and redo it from scratch.
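A hypothetical sketch of the pooling idea (VecPool and its methods are illustrative names, not wgpu's actual types): recycle cleared Vecs so their heap capacity survives across UsageScope instantiations, turning most per-pass allocations into a pop from the pool.

```rust
use std::sync::Mutex;

/// Illustrative pool: spent Vecs are parked here instead of being dropped,
/// so the next acquire() reuses their capacity without reallocating.
struct VecPool<T> {
    pool: Mutex<Vec<Vec<T>>>,
}

impl<T> VecPool<T> {
    fn new() -> Self {
        Self { pool: Mutex::new(Vec::new()) }
    }

    /// Take a previously allocated Vec (capacity intact) or a fresh empty one.
    fn acquire(&self) -> Vec<T> {
        self.pool.lock().unwrap().pop().unwrap_or_default()
    }

    /// Drop the elements but keep the allocation, then park the Vec.
    fn release(&self, mut v: Vec<T>) {
        v.clear();
        self.pool.lock().unwrap().push(v);
    }
}

fn main() {
    let pool: VecPool<u32> = VecPool::new();
    let mut v = pool.acquire();
    v.extend(0..1024);
    let cap = v.capacity();
    pool.release(v);
    // The recycled Vec keeps its capacity, so refilling it needs no realloc.
    let reused = pool.acquire();
    assert!(reused.is_empty() && reused.capacity() >= cap);
}
```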

Some xtask test failures currently, which also fail on trunk (although trunk doesn't have the 1 slow):

     Summary [  75.356s] 669 tests run: 665 passed (1 slow), 4 failed, 0 skipped
  TRY 3 FAIL [   4.419s] wgpu-examples [Executed Failure: BACKEND] [Vulkan/NVIDIA GeForce RTX 3070/0] water
  TRY 3 FAIL [   3.933s] wgpu-examples [Executed] [Gl/NVIDIA GeForce RTX 3070/PCIe/SSE2/3] wgpu_examples::hello_compute::tests::multithreaded_compute
  TRY 3 FAIL [   3.509s] wgpu-examples [Executed] [Vulkan/NVIDIA GeForce RTX 3070/0] wgpu_examples::hello_synchronization::tests::sync
  TRY 3 FAIL [   0.522s] wgpu-test::wgpu-test [Executed] [Vulkan/NVIDIA GeForce RTX 3070/0] wgpu_test::shader::zero_init_workgroup_mem::zero_init_workgroup_memory

Testing

I've been running on a large app. A small test case should generate a large number of buffers, discard them all, and then measure the cost of CommandEncoder::begin_render_pass.

Checklist

  • Run cargo fmt.
  • Run cargo clippy. If applicable, add:
    • --target wasm32-unknown-unknown
    • --target wasm32-unknown-emscripten
  • Run cargo xtask test to run tests.
  • Add change to CHANGELOG.md. See simple instructions inside file.

@robtfm robtfm requested a review from a team as a code owner March 18, 2024 12:25

@Wumpf Wumpf left a comment


Having these as global variables seems pretty bad and also very unnecessary here. It's unsafe and the pools can never be deallocated. It would already be a lot better to have these on the device or instance.

The implementation of those two pools themselves is likely not sound: transmuting a vec as done here is both scary and unnecessary - the pool could simply have been generic over the element type.

Have you tried how performance plays out when you install a custom allocator in your application, e.g. https://docs.rs/mimalloc/latest/mimalloc/?

Given that you modified only very few locations to use these pools, we should first explore other ways of doing local pooling, or of avoiding the recreation of these vecs altogether. Also, a breakdown of the effect of each of those sites would be helpful, to better understand which optimization is worth how much.


Wumpf commented Mar 18, 2024

But thanks for trying to address this problem. I wonder if, in this state, this shouldn't rather be an extensive issue.


robtfm commented Mar 19, 2024

It would already be a lot better to have these on the device or instance

That makes sense; I put them in the device. I had to thread some lifetimes around to make it work, but it is cleaner. (Just to note: I had to do the transmuting before because the HalApi was part of the type, so I couldn't create a static pool which included the type.)

If you're happy with the approach, I can do the same for render bundles as well.

Have you tried how performance plays out when you install a custom allocator in your application, e.g. https://docs.rs/mimalloc/latest/mimalloc/

I haven't. I'm on Windows, and I remember there being a general impression that non-standard allocators were always worse, but that could well be out of date now. I can run tests, but not immediately (my project is now on a fork of 19.3 and backporting this is not completely trivial).
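For context, switching the global allocator is a small, crate-level change; the sketch below shows the usual pattern with the mimalloc crate (the dependency version is illustrative):

```rust
// Cargo.toml (version illustrative):
//   [dependencies]
//   mimalloc = "0.1"

use mimalloc::MiMalloc;

// Route every heap allocation in the process through mimalloc,
// including the Vec churn inside wgpu's usage-scope tracking.
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;

fn main() {
    // Any allocation below now goes through mimalloc.
    let v: Vec<u8> = Vec::with_capacity(1 << 20);
    drop(v);
}
```

Since #[global_allocator] applies process-wide, this is something the application (not wgpu itself) opts into.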


robtfm commented Mar 19, 2024

mimalloc actually helps a lot. Pooling is still better, but only by 10-20%, vs 150-300% previously.
Up to you if you think it's worth the complexity for these minor savings.

Below is a span with 16 render passes (bloom down+upscaling), still just UsageScopes as per the PR, backported to 19.3.

with ~10k peak buffers, default allocator (red=pooled)
[trace: 10k buffers, red=pool, yellow=no_pool]

with ~10k peak buffers, mimalloc (yellow=pooled, sorry)
[trace: mimalloc, 10k buffers, yellow=pool, red=no_pool]

with ~35k peak buffers, mimalloc (red=pooled)
[trace: mimalloc, 30k buffers, red=pool, yellow=no_pool]


Wumpf commented Mar 19, 2024

thanks for following up with detailed traces <3

mimalloc actually helps a lot. pooling is still better, but only 10-20% vs 150-300% previously.

wow 😮! I knew mimalloc is faster, but it's crazy that it makes that much of a difference. We should strongly consider putting that in the README then.

Despite the small savings (with mimalloc), I'm inclined to still put it in, now that this iteration already looks so much better.
However, that extra lifetime is indeed super annoying, and I'm kind of surprised it doesn't break anything else. I'm wondering if we could pass an Arc<UsageScopePool> around (where UsageScopePool is a very dumb type, or even just a type alias); it might be worth the overhead. But I haven't fully thought it through. What do you think?

The fact that putting elements in and out of the usage scope pool requires taking a lock also irks me a lot, but it's held so briefly that it's unlikely to contend (🤞..). Not sure if we already have some lock-free queue/stack in place somewhere that we could use instead 🤔. Let's not regard that as blocking this PR.
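A minimal sketch of the Arc<UsageScopePool> variant floated above (all names and the u64 element type are illustrative stand-ins, not wgpu's actual definitions): the Arc keeps the pool alive without threading lifetimes through the pass state, and the Mutex is held only for a single push or pop.

```rust
use std::sync::{Arc, Mutex};

// A "very dumb type, or even just a type alias": cleared Vecs behind a lock.
type UsageScopePool = Mutex<Vec<Vec<u64>>>;

struct Device {
    usage_scope_pool: Arc<UsageScopePool>,
}

struct UsageScope {
    pool: Arc<UsageScopePool>, // no lifetime needed: the Arc keeps the pool alive
    buffers: Vec<u64>,
}

impl Device {
    /// Hand out a scope backed by a recycled Vec when one is available.
    fn new_usage_scope(&self) -> UsageScope {
        let buffers = self.usage_scope_pool.lock().unwrap().pop().unwrap_or_default();
        UsageScope { pool: Arc::clone(&self.usage_scope_pool), buffers }
    }
}

impl Drop for UsageScope {
    /// Return the Vec (capacity intact, elements cleared) to the pool.
    fn drop(&mut self) {
        let mut v = std::mem::take(&mut self.buffers);
        v.clear();
        self.pool.lock().unwrap().push(v);
    }
}

fn main() {
    let device = Device { usage_scope_pool: Arc::new(Mutex::new(Vec::new())) };
    {
        let mut scope = device.new_usage_scope();
        scope.buffers.extend(0..4096);
    } // dropping the scope parks its Vec for reuse
    assert_eq!(device.usage_scope_pool.lock().unwrap().len(), 1);
}
```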


robtfm commented Mar 19, 2024

Using an Arc has a visible impact when very few buffers are allocated, but is negligible with a significant number.

10k buffers on the right of the graph, a couple of hundred on the left, +mimalloc; yellow=Arc<pool>, red=&device (== yellow from the 2nd graph above)
[trace image]

I could put the pool on the Hub to at least avoid the 2 lifetimes in RenderPassInfo, but Hub seems more user-facing, so I wasn't sure that would be well received.


Wumpf commented Mar 20, 2024

Ah, too bad. A compelling case for the lifetime variant, I reckon! Agreed, putting it on Hub feels wrong as well. You've convinced me: let's have those lifetimes; it doesn't look like they're spreading that far anyway :). Given that this is a fairly straightforward optimization now, and that the savings are that big (when not using mimalloc), it's totally worthwhile I'd say.

One more thing I'd like to ask you to try: reference not the entire device but just the UsageScopePool (or similar, as mentioned in my previous comment), to make the intention clearer (i.e. to avoid "why does this State object need the entire device?"). Unless, of course, I'm missing some good reason against it.
But as said, I think that in principle we can (and should!) move forward with this 🙂. Make sure to leave some comments in the final version that explain to future-us why we're doing this caching dance.

if you're happy with the approach i can do the same for render bundles as well.

yes, go ahead (:


robtfm commented Mar 21, 2024

not reference the entire device

done

do the same for render bundles

It looks like this would require an Arc (Resource requires 'static) and probably isn't useful - RenderBundles seem to be long-lived and repeatedly executed, so allocating on creation is probably fine.

Otoh, CommandBuffer uses Tracker, which also allocates buffer/texture-sized vecs on instantiation, and would also need an Arc to pool. I think the impact would be smaller - I guess RenderPasses are typically much higher volume ... but maybe this is an argument for Arcs, and if we want Arcs here we might as well do them straight away to avoid churning too much.

@Wumpf Wumpf self-requested a review March 21, 2024 12:37

@Wumpf Wumpf left a comment


Looking really good now. I did a more in-depth review; there are a couple of small things to clean up, but then we can land.


RenderBundles seem to be long-lived repeatable executions, so allocating on creation is probably fine.

Ah yes, if a render bundle isn't reused for several frames, we can well regard that as a user error, so I think we can leave it as is, imho.

otoh CommandBuffer uses Tracker which also allocates buffer/texture-sized vecs on instantiation, and would also need an Arc to pool

In any case, I'd say we take that in a separate iteration. Not having even more Arc'ing would be nice 🤷, but yes, there shouldn't be all that many CommandBuffers per frame. Again, let's keep this to a separate iteration :)

Resolved review threads: wgpu-core/src/track/mod.rs, wgpu-core/src/device/resource.rs (×2)
@Wumpf Wumpf left a comment


thanks! 🚢
