extensions: Inline all trivial functions #638
Conversation
Yes, my feeling is that this represents enough code in aggregate that the benefit is nonobvious.
I think this would be much more reliable data, yeah.
I can try against
With WGPU at gfx-rs/wgpu@c20c86a:

Note that I made sure that

EDIT: But we should not forget that this change brought us ±60ms in the first place, and we're only paying back ±3ms in end-user crates. Here's the same test again, with:

```console
$ hyperfine -w1 -p 'cargo clean -p ash -p wgpu-hal' 'cargo b -p wgpu-hal --features ash'
Benchmark 1: cargo b -p wgpu-hal --features ash
  Time (mean ± σ):      6.717 s ±  0.072 s    [User: 8.129 s, System: 0.475 s]
  Range (min … max):    6.624 s …  6.876 s    10 runs
```

And purely this PR (so with):

```console
$ hyperfine -w1 -p 'cargo clean -p ash -p wgpu-hal' 'cargo b -p wgpu-hal --features ash'
Benchmark 1: cargo b -p wgpu-hal --features ash
  Time (mean ± σ):      6.673 s ±  0.040 s    [User: 7.902 s, System: 0.462 s]
  Range (min … max):    6.572 s …  6.710 s    10 runs
```

That's still an improvement, if only ±40ms in total. Finally, for good measure, back on:

```console
$ hyperfine -w1 -p 'cargo clean -p ash -p wgpu-hal' 'cargo b -p wgpu-hal --features ash'
Benchmark 1: cargo b -p wgpu-hal --features ash
  Time (mean ± σ):      6.739 s ±  0.045 s    [User: 8.321 s, System: 0.494 s]
  Range (min … max):    6.681 s …  6.799 s    10 runs
```

This PR is the clear winner as it stands, with
I need to do some clockfixing on my PC; after a suspend-resume this PR in wgpu is much faster again:

```console
$ hyperfine -w1 -p 'cargo clean -p ash -p wgpu-hal' 'cargo b -p wgpu-hal --features ash'
Benchmark 1: cargo b -p wgpu-hal --features ash
  Time (mean ± σ):      6.608 s ±  0.053 s    [User: 7.869 s, System: 0.434 s]
  Range (min … max):    6.552 s …  6.697 s    10 runs
```

The new

I suspect this is caused by the extremely low single-core load incurred by the build: it probably jumps from core to core and has constantly changing frequencies. That doesn't go down well on a ThreadRipper with its distinct dies/caches either...
Nice analysis!
* extensions: Inline simple getter functions
* extensions: Inline all `unsafe fn` helper functions
* instance: Inline "skipped" `read_into_uninitialized_vector()` functions
* enums: Inline `from_raw`/`as_raw` functions
As with earlier PRs inlining public functions, the `#[inline]` attribute is desired to provide function bodies to the linker and allow end users to get ever so slightly higher performance: these trivial implementations are inlined directly into target code instead of ending up as an extra indirection through a call whose arguments have to be set up according to a calling convention.

At the same time this seems to skip or postpone certain codegen/optimization passes, resulting in ever so slightly reduced compilation times for the pure `ash` crate (which was part of the reason these attributes have already been applied to other parts of the `ash` codebase). This hit is likely only postponed to the linker, but it should still result in a total compile-time reduction (even if it's so small that it is considered negligible), since a user can't possibly use each and every function, and the compiler/linker won't waste any time on the unused ones, I hope/suppose :)

For actual performance numbers, on the `master` branch at 71d45e4:

With this PR directly on top we get an ever so slight increase of about ±60ms:
@Ralith it seems you have intentionally omitted the (`Instance`) functions that call into `read_into_uninitialized_vector()` in #606, presumably because their bodies are too complex? Note that those `read_into_*_vector()` functions have generics and would be publicized/inlined when the `#[inline]` attribute is applied here, as far as I know/understand.

Undoing that `#[inline]` change for those and every other extension function that calls `read_into_uninitialized_vector()`/`read_into_defaulted_vector()` brings the timings right back to where we started, undoing our ±60ms improvement:

Should we profile with an end-user application that actively uses some of these functions, and see how the timings fare?