extensions: Inline all trivial functions #638

Merged 4 commits into master from extensions-inline on Jul 7, 2022

Conversation

MarijnS95
Collaborator

@MarijnS95 MarijnS95 commented Jul 1, 2022

As with earlier PRs, marking public functions with the #[inline] attribute is desired to provide function bodies to the linker and let end users get ever so slightly higher performance: trivial implementations are inlined directly into target code, instead of ending up as an extra indirection through a call whose arguments have to be set up according to a calling convention.
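
To illustrate (a minimal sketch with made-up names and simplified types, not the actual ash code), the kind of trivial wrapper this PR annotates looks roughly like this:

```rust
// Hedged sketch of a trivial extension wrapper; the struct, field, and
// function names are hypothetical and the types are simplified stand-ins
// for the real ash/vk definitions.
pub struct DebugUtilsFn {
    // Function pointer loaded at runtime, e.g. via vkGetInstanceProcAddr.
    pub queue_begin_debug_utils_label_ext:
        unsafe extern "system" fn(queue: u64, label_info: *const u8),
}

pub struct DebugUtils {
    fp: DebugUtilsFn,
}

impl DebugUtils {
    // The body is a single indirect call. #[inline] makes it available
    // across crate boundaries, so a caller can jump straight through the
    // function pointer instead of paying for an extra wrapper call whose
    // arguments must first be marshalled per the calling convention.
    #[inline]
    pub unsafe fn queue_begin_debug_utils_label(&self, queue: u64, label_info: *const u8) {
        (self.fp.queue_begin_debug_utils_label_ext)(queue, label_info)
    }
}
```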

At the same time this seems to skip or postpone certain codegen/optimization passes, resulting in ever so slightly reduced compilation times for the pure ash crate (which was part of the reason these attributes have already been applied to other parts of the ash codebase). That cost is likely only deferred to the linker, but it should still yield a net compile-time reduction (even if small enough to be negligible), since a user can't possibly use each and every function, and the compiler/linker then won't waste any time on the unused ones, I hope/suppose :)

For actual performance numbers, on the master branch at 71d45e4:

$ hyperfine -p 'cargo clean -p ash' 'cargo b -p ash'
Benchmark 1: cargo b -p ash
  Time (mean ± σ):      6.749 s ±  0.076 s    [User: 7.723 s, System: 0.456 s]
  Range (min … max):    6.670 s …  6.872 s    10 runs

With this PR directly on top we get an ever so slight improvement of ±60ms:

$ hyperfine -p 'cargo clean -p ash' 'cargo b -p ash'
Benchmark 1: cargo b -p ash
  Time (mean ± σ):      6.690 s ±  0.051 s    [User: 7.349 s, System: 0.398 s]
  Range (min … max):    6.618 s …  6.768 s    10 runs

@Ralith it seems you have intentionally omitted (Instance) functions that call into read_into_uninitialized_vector() in #606, presumably because their bodies are too complex? Note that those read_into_*_vector() functions are generic and would be publicized/inlined when the #[inline] attribute is applied here, as far as I know/understand.
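
For context, the pattern behind those helpers is the classic Vulkan two-call enumeration loop; a rough sketch follows, with a simplified signature that differs from the real ash helper:

```rust
// Hedged, simplified sketch of the read_into_uninitialized_vector()
// pattern; the real ash helper has a different signature. `i32` stands
// in for vk::Result (VK_SUCCESS = 0, VK_INCOMPLETE = 5).
unsafe fn read_into_uninitialized_vector<T>(
    f: impl Fn(&mut u32, *mut T) -> i32,
) -> Result<Vec<T>, i32> {
    loop {
        // First call: ask the driver how many elements there are.
        let mut count = 0;
        match f(&mut count, std::ptr::null_mut()) {
            0 => {}
            err => return Err(err),
        }
        // Second call: fill a freshly allocated, uninitialized buffer.
        let mut data = Vec::with_capacity(count as usize);
        match f(&mut count, data.as_mut_ptr()) {
            0 => {
                data.set_len(count as usize);
                return Ok(data);
            }
            5 => continue, // VK_INCOMPLETE: element count changed, retry
            err => return Err(err),
        }
    }
}
```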

Removing that #[inline] change for those and every other extension function that calls read_into_uninitialized_vector()/read_into_defaulted_vector() brings the timings right back to where we started, undoing our ±60ms improvement:

$ hyperfine -p 'cargo clean -p ash' 'cargo b -p ash'
Benchmark 1: cargo b -p ash
  Time (mean ± σ):      6.770 s ±  0.056 s    [User: 7.563 s, System: 0.464 s]
  Range (min … max):    6.710 s …  6.865 s    10 runs

Should we profile with an end-user application that actively uses some of these functions, and see how the timings fare?

@Ralith
Collaborator

Ralith commented Jul 2, 2022

Note that those read_into_*_vector() functions are generic and would be publicized/inlined when the #[inline] attribute is applied here, as far as I know/understand.

Yes, my feeling is that this represents enough code in aggregate that the benefit is nonobvious.

Should we profile with an end-user application that actively uses some of these functions, and see how the timings fare?

I think this would be much more reliable data, yeah. wgpu would be a good candidate, but a nontrivial direct consumer of ash might also be distinctly interesting. In particular, I don't think a 60ms difference in compile time is actionable on its own. I'm happy to see this merged either way, but better data never hurts.

@MarijnS95
Collaborator Author

I can try against wgpu, not sure how soon though. I'd like to try:

  1. Without any #[inline] at all;
  2. The current state of ash on master;
  3. With this PR;
  4. With this PR, with #[inline] removed on those non-trivial read_into_*_vector() again.

@MarijnS95
Collaborator Author

MarijnS95 commented Jul 2, 2022

With WGPU at gfx-rs/wgpu@c20c86a:

Note that I made sure that ash (and all other deps) are up-to-date/compiled first with a -w1 warmup run, and didn't clean ash through an extra -p ash. We'd need both measurements to understand the full picture.

  1. Without any #[inline] at all, at 71d45e4:
    $ hyperfine -w1 -p 'cargo clean -p wgpu-hal' 'cargo b -p wgpu-hal --features ash'
    Benchmark 1: cargo b -p wgpu-hal --features ash
      Time (mean ± σ):     576.8 ms ±   7.0 ms    [User: 648.0 ms, System: 87.0 ms]
      Range (min … max):   571.5 ms … 595.8 ms    10 runs
Reminder: 71d45e4 compiles in ±6.737s. Removing all #[inline] brings that back up to ±7.059s.
  2. The current state of ash on master at 71d45e4:
    $ hyperfine -w1 -p 'cargo clean -p wgpu-hal' 'cargo b -p wgpu-hal --features ash'
    Benchmark 1: cargo b -p wgpu-hal --features ash
      Time (mean ± σ):     567.2 ms ±   3.1 ms    [User: 650.0 ms, System: 73.7 ms]
      Range (min … max):   562.9 ms … 573.8 ms    10 runs
That's clearly a win on wgpu-hal alone, even if it isn't much. In other words, we definitely aren't paying for the wins of #[inline] on ash later during linking.
  3. With this PR at e0f5cad:
    $ hyperfine -w1 -p 'cargo clean -p wgpu-hal' 'cargo b -p wgpu-hal --features ash'
    Benchmark 1: cargo b -p wgpu-hal --features ash
      Time (mean ± σ):     569.4 ms ±   3.3 ms    [User: 649.2 ms, System: 76.9 ms]
      Range (min … max):   564.6 ms … 576.9 ms    10 runs
This fluctuates a bit between 569 and 571ms on re-runs, but it's definitely ±2ms slower than `master`.
  4. With this PR at e0f5cad, with #[inline] removed on those non-trivial read_into_*_vector() again:
    $ hyperfine -w1 -p 'cargo clean -p wgpu-hal' 'cargo b -p wgpu-hal --features ash'
    Benchmark 1: cargo b -p wgpu-hal --features ash
      Time (mean ± σ):     567.8 ms ±   2.7 ms    [User: 647.5 ms, System: 78.2 ms]
      Range (min … max):   563.9 ms … 571.8 ms    10 runs
    And we're back at master, showing that it is indeed beneficial to omit #[inline] on functions that call read_into_*_vector(); if only by a very, very tiny margin.

EDIT: But we should not forget that this change brought us ±60ms in the first place, and we're only paying back ±3ms in end-user crates.

Here's the same test again, with ash cleaned (hence rebuilt) every run, and #[inline] removed on read_into_*_vector() calls:

$ hyperfine -w1 -p 'cargo clean -p ash -p wgpu-hal' 'cargo b -p wgpu-hal --features ash'
Benchmark 1: cargo b -p wgpu-hal --features ash
  Time (mean ± σ):      6.717 s ±  0.072 s    [User: 8.129 s, System: 0.475 s]
  Range (min … max):    6.624 s …  6.876 s    10 runs

And purely this PR (so with #[inline] on all functions):

$ hyperfine -w1 -p 'cargo clean -p ash -p wgpu-hal' 'cargo b -p wgpu-hal --features ash'
Benchmark 1: cargo b -p wgpu-hal --features ash
  Time (mean ± σ):      6.673 s ±  0.040 s    [User: 7.902 s, System: 0.462 s]
  Range (min … max):    6.572 s …  6.710 s    10 runs

That's still an improvement, if only ±40ms in total.

Finally, for good measure, back on master at 71d45e4:

$ hyperfine -w1 -p 'cargo clean -p ash -p wgpu-hal' 'cargo b -p wgpu-hal --features ash'
Benchmark 1: cargo b -p wgpu-hal --features ash
  Time (mean ± σ):      6.739 s ±  0.045 s    [User: 8.321 s, System: 0.494 s]
  Range (min … max):    6.681 s …  6.799 s    10 runs

This PR is the clear winner as it stands, with #[inline] added on all functions including read_into_*_vector() calls.

@MarijnS95
Collaborator Author

MarijnS95 commented Jul 2, 2022

I need to do some clock-fixing on my PC; after a suspend-resume, this PR in wgpu is much faster again:

$ hyperfine -w1 -p 'cargo clean -p ash -p wgpu-hal' 'cargo b -p wgpu-hal --features ash'
Benchmark 1: cargo b -p wgpu-hal --features ash
  Time (mean ± σ):      6.608 s ±  0.053 s    [User: 7.869 s, System: 0.434 s]
  Range (min … max):    6.552 s …  6.697 s    10 runs

The newly pushed enum changes seem to fluctuate around that.

I suspect this is caused by the extremely low single-core load incurred by the build; it probably jumps from core to core with constantly changing frequencies. That doesn't go down well on a Threadripper with its distinct dies/caches either...

@MarijnS95 MarijnS95 requested a review from Ralith July 6, 2022 10:44
@MarijnS95 MarijnS95 force-pushed the extensions-inline branch from 14cd94a to 45f0107 on July 6, 2022 11:22
@MarijnS95 MarijnS95 force-pushed the extensions-inline branch from 45f0107 to 7647327 on July 6, 2022 11:27
Collaborator

@Ralith Ralith left a comment

Nice analysis!

@MarijnS95 MarijnS95 merged commit 79f63e1 into master Jul 7, 2022
@MarijnS95 MarijnS95 deleted the extensions-inline branch July 7, 2022 15:25
MarijnS95 added a commit that referenced this pull request Jul 7, 2022
* extensions: Inline simple getter functions

* extensions: Inline all `unsafe fn` helper functions

* instance: Inline "skipped" `read_into_uninitialized_vector()` functions

* enums: Inline `from_raw`/`as_raw` functions