extensions: Inline all trivial functions #638

Merged 4 commits into master from extensions-inline on Jul 7, 2022

Conversation

MarijnS95
Collaborator

@MarijnS95 MarijnS95 commented Jul 1, 2022

As with earlier PRs, marking public functions with the #[inline] attribute is desired to provide function bodies to the linker and let end users get ever so slightly higher performance: trivial implementations are inlined directly into target code, instead of ending up as an extra indirection through a call whose arguments have to be set up according to a calling convention.
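
To illustrate (a minimal sketch with made-up names and simplified types, not the actual ash code), the kind of trivial wrapper this PR annotates looks roughly like this:

```rust
// Hedged sketch of a trivial extension wrapper; the struct, field, and
// function names are hypothetical and the types are simplified stand-ins
// for the real ash/vk definitions.
pub struct DebugUtilsFn {
    // Function pointer loaded at runtime, e.g. via vkGetInstanceProcAddr.
    pub queue_begin_debug_utils_label_ext:
        unsafe extern "system" fn(queue: u64, label_info: *const u8),
}

pub struct DebugUtils {
    fp: DebugUtilsFn,
}

impl DebugUtils {
    // The body is a single indirect call. #[inline] makes it available
    // across crate boundaries, so a caller can jump straight through the
    // function pointer instead of paying for an extra wrapper call whose
    // arguments must first be marshalled per the calling convention.
    #[inline]
    pub unsafe fn queue_begin_debug_utils_label(&self, queue: u64, label_info: *const u8) {
        (self.fp.queue_begin_debug_utils_label_ext)(queue, label_info)
    }
}
```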

At the same time this seems to skip or postpone certain codegen/optimization passes, resulting in ever so slightly reduced compilation times for the pure ash crate (which was part of the reason these attributes have already been applied to other parts of the ash codebase). That cost is likely only deferred to the linker, but it should still yield a net compile-time reduction (even if small enough to be negligible), since a user can't possibly use each and every function, and the compiler/linker then won't waste any time on the unused ones, I hope/suppose :)

For actual performance numbers, on the master branch at 71d45e4:

$ hyperfine -p 'cargo clean -p ash' 'cargo b -p ash'
Benchmark 1: cargo b -p ash
  Time (mean ± σ):      6.749 s ±  0.076 s    [User: 7.723 s, System: 0.456 s]
  Range (min … max):    6.670 s …  6.872 s    10 runs

With this PR directly on top we get an ever so slight improvement of ±60ms:

$ hyperfine -p 'cargo clean -p ash' 'cargo b -p ash'
Benchmark 1: cargo b -p ash
  Time (mean ± σ):      6.690 s ±  0.051 s    [User: 7.349 s, System: 0.398 s]
  Range (min … max):    6.618 s …  6.768 s    10 runs

@Ralith it seems you have intentionally omitted (Instance) functions that call into read_into_uninitialized_vector() in #606, presumably because their bodies are too complex? Note that those read_into_*_vector() functions are generic and would be publicized/inlined when the #[inline] attribute is applied here, as far as I know/understand.
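
For context, the pattern behind those helpers is the classic Vulkan two-call enumeration loop; a rough sketch follows, with a simplified signature that differs from the real ash helper:

```rust
// Hedged, simplified sketch of the read_into_uninitialized_vector()
// pattern; the real ash helper has a different signature. `i32` stands
// in for vk::Result (VK_SUCCESS = 0, VK_INCOMPLETE = 5).
unsafe fn read_into_uninitialized_vector<T>(
    f: impl Fn(&mut u32, *mut T) -> i32,
) -> Result<Vec<T>, i32> {
    loop {
        // First call: ask the driver how many elements there are.
        let mut count = 0;
        match f(&mut count, std::ptr::null_mut()) {
            0 => {}
            err => return Err(err),
        }
        // Second call: fill a freshly allocated, uninitialized buffer.
        let mut data = Vec::with_capacity(count as usize);
        match f(&mut count, data.as_mut_ptr()) {
            0 => {
                data.set_len(count as usize);
                return Ok(data);
            }
            5 => continue, // VK_INCOMPLETE: element count changed, retry
            err => return Err(err),
        }
    }
}
```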

Removing that #[inline] change for those and every other extension function that calls read_into_uninitialized_vector()/read_into_defaulted_vector() brings the timings right back to where we started, undoing our ±60ms improvement:

$ hyperfine -p 'cargo clean -p ash' 'cargo b -p ash'
Benchmark 1: cargo b -p ash
  Time (mean ± σ):      6.770 s ±  0.056 s    [User: 7.563 s, System: 0.464 s]
  Range (min … max):    6.710 s …  6.865 s    10 runs

Should we profile with an end-user application that actively uses some of these functions, and see how the timings fare?

@Ralith
Collaborator

Ralith commented Jul 2, 2022

Note that those read_into_*_vector() functions are generic and would be publicized/inlined when the #[inline] attribute is applied here, as far as I know/understand.

Yes, my feeling is that this represents enough code in aggregate that the benefit is nonobvious.

Should we profile with an end-user application that actively uses some of these functions, and see how the timings fare?

I think this would be much more reliable data, yeah. wgpu would be a good candidate, but a nontrivial direct consumer of ash might also be distinctly interesting. In particular, I don't think a 60ms difference in compile time is actionable on its own. I'm happy to see this merged either way, but better data never hurts.

@MarijnS95
Collaborator Author

I can try against wgpu, not sure how soon though. I'd like to try:

  1. Without any #[inline] at all;
  2. The current state of ash on master;
  3. With this PR;
  4. With this PR, with #[inline] removed on those non-trivial read_into_*_vector() again.

@MarijnS95
Collaborator Author

MarijnS95 commented Jul 2, 2022

With WGPU at gfx-rs/wgpu@c20c86a:

Note that I made sure that ash (and all other deps) are up-to-date/compiled first with a -w1 warmup run, and didn't clean ash through an extra -p ash. We'd need both measurements to understand the full picture.

  1. Without any #[inline] at all, at 71d45e4:
    $ hyperfine -w1 -p 'cargo clean -p wgpu-hal' 'cargo b -p wgpu-hal --features ash'
    Benchmark 1: cargo b -p wgpu-hal --features ash
      Time (mean ± σ):     576.8 ms ±   7.0 ms    [User: 648.0 ms, System: 87.0 ms]
      Range (min … max):   571.5 ms … 595.8 ms    10 runs
Reminder: 71d45e4 compiles in ±6.737s. Removing all #[inline] brings that back up to ±7.059s.
  2. The current state of ash on master at 71d45e4:
    $ hyperfine -w1 -p 'cargo clean -p wgpu-hal' 'cargo b -p wgpu-hal --features ash'
    Benchmark 1: cargo b -p wgpu-hal --features ash
      Time (mean ± σ):     567.2 ms ±   3.1 ms    [User: 650.0 ms, System: 73.7 ms]
      Range (min … max):   562.9 ms … 573.8 ms    10 runs
That's clearly a win on wgpu-hal alone, even if it isn't much. In other words, we definitely aren't paying for the wins of #[inline] on ash later during linking.
  3. With this PR at e0f5cad:
    $ hyperfine -w1 -p 'cargo clean -p wgpu-hal' 'cargo b -p wgpu-hal --features ash'
    Benchmark 1: cargo b -p wgpu-hal --features ash
      Time (mean ± σ):     569.4 ms ±   3.3 ms    [User: 649.2 ms, System: 76.9 ms]
      Range (min … max):   564.6 ms … 576.9 ms    10 runs
This fluctuates a bit between 569 and 571ms on re-runs, but it's definitely ±2ms slower than `master`.
  4. With this PR at e0f5cad, with #[inline] removed on those non-trivial read_into_*_vector() again:
    $ hyperfine -w1 -p 'cargo clean -p wgpu-hal' 'cargo b -p wgpu-hal --features ash'
    Benchmark 1: cargo b -p wgpu-hal --features ash
      Time (mean ± σ):     567.8 ms ±   2.7 ms    [User: 647.5 ms, System: 78.2 ms]
      Range (min … max):   563.9 ms … 571.8 ms    10 runs
    And we're back at master, showing that it is indeed beneficial to omit #[inline] on functions that call read_into_*_vector(); if only by a very, very tiny margin.

EDIT: But we should not forget that this change brought us ±60ms in the first place, and we're only paying back ±3ms in end-user crates.

Here's the same test again, with ash cleaned (hence rebuilt) every run, and #[inline] removed on read_into_*_vector() calls:

$ hyperfine -w1 -p 'cargo clean -p ash -p wgpu-hal' 'cargo b -p wgpu-hal --features ash'
Benchmark 1: cargo b -p wgpu-hal --features ash
  Time (mean ± σ):      6.717 s ±  0.072 s    [User: 8.129 s, System: 0.475 s]
  Range (min … max):    6.624 s …  6.876 s    10 runs

And purely this PR (so with #[inline] on all functions):

$ hyperfine -w1 -p 'cargo clean -p ash -p wgpu-hal' 'cargo b -p wgpu-hal --features ash'
Benchmark 1: cargo b -p wgpu-hal --features ash
  Time (mean ± σ):      6.673 s ±  0.040 s    [User: 7.902 s, System: 0.462 s]
  Range (min … max):    6.572 s …  6.710 s    10 runs

That's still an improvement, if only ±40ms in total.

Finally, for good measure, back on master at 71d45e4:

$ hyperfine -w1 -p 'cargo clean -p ash -p wgpu-hal' 'cargo b -p wgpu-hal --features ash'
Benchmark 1: cargo b -p wgpu-hal --features ash
  Time (mean ± σ):      6.739 s ±  0.045 s    [User: 8.321 s, System: 0.494 s]
  Range (min … max):    6.681 s …  6.799 s    10 runs

This PR is the clear winner as it stands, with #[inline] added on all functions including read_into_*_vector() calls.

@MarijnS95
Collaborator Author

MarijnS95 commented Jul 2, 2022

I need to do some clock-fixing on my PC; after a suspend-resume, this PR in wgpu is much faster again:

$ hyperfine -w1 -p 'cargo clean -p ash -p wgpu-hal' 'cargo b -p wgpu-hal --features ash'
Benchmark 1: cargo b -p wgpu-hal --features ash
  Time (mean ± σ):      6.608 s ±  0.053 s    [User: 7.869 s, System: 0.434 s]
  Range (min … max):    6.552 s …  6.697 s    10 runs

The newly pushed enum changes seem to fluctuate around that.

I suspect this is caused by the extremely low single-core load incurred by the build; it probably jumps from core to core with constantly changing frequencies. That doesn't go down well on a Threadripper with its distinct dies/caches either...

@MarijnS95 MarijnS95 requested a review from Ralith July 6, 2022 10:44
@MarijnS95 MarijnS95 force-pushed the extensions-inline branch from 14cd94a to 45f0107 on July 6, 2022 11:22
@MarijnS95 MarijnS95 force-pushed the extensions-inline branch from 45f0107 to 7647327 on July 6, 2022 11:27
Collaborator

@Ralith Ralith left a comment

Nice analysis!

@MarijnS95 MarijnS95 merged commit 79f63e1 into master Jul 7, 2022
@MarijnS95 MarijnS95 deleted the extensions-inline branch July 7, 2022 15:25
MarijnS95 added a commit that referenced this pull request Jul 7, 2022
* extensions: Inline simple getter functions

* extensions: Inline all `unsafe fn` helper functions

* instance: Inline "skipped" `read_into_uninitialized_vector()` functions

* enums: Inline `from_raw`/`as_raw` functions