Perf - use ArrayVec instead of Vec for internal AudioParam buffer #363

Merged: 4 commits, Oct 3, 2023

b-ma
Collaborator

@b-ma b-ma commented Sep 22, 2023

cf. #359

@b-ma
Collaborator Author

b-ma commented Sep 22, 2023

/bench

@github-actions

Benchmark result:


bench_ctor
  Instructions:             4533261 (-0.570421%)
  L1 Accesses:              6769687 (-0.571688%)
  L2 Accesses:                54312 (-0.038651%)
  RAM Accesses:               61621 (+0.108848%)
  Estimated Cycles:         9197982 (-0.397242%)

bench_sine
  Instructions:            70946851 (-0.159726%)
  L1 Accesses:            103533921 (-0.125667%)
  L2 Accesses:               263290 (-1.509030%)
  RAM Accesses:               62493 (+0.107327%)
  Estimated Cycles:       107037626 (-0.138169%)

bench_sine_gain
  Instructions:            75976565 (-0.244167%)
  L1 Accesses:            111134791 (-0.197463%)
  L2 Accesses:               268087 (-1.294900%)
  RAM Accesses:               62589 (-0.094177%)
  Estimated Cycles:       114665841 (-0.208462%)

bench_sine_gain_delay
  Instructions:           150978642 (-0.202723%)
  L1 Accesses:            213650631 (-0.173822%)
  L2 Accesses:               566736 (-2.148217%)
  RAM Accesses:               64193 (+0.077951%)
  Estimated Cycles:       218731066 (-0.197331%)

bench_buffer_src
  Instructions:            17508909 (-0.742848%)
  L1 Accesses:             25451477 (-0.627479%)
  L2 Accesses:                87862 (+0.702587%)
  RAM Accesses:              100775 (+0.057587%)
  Estimated Cycles:        29417912 (-0.526200%)

bench_buffer_src_delay
  Instructions:            91164983 (-0.304087%)
  L1 Accesses:            126148642 (-0.282404%)
  L2 Accesses:               163146 (-2.265062%)
  RAM Accesses:              100922 (+0.033701%)
  Estimated Cycles:       130496642 (-0.286520%)

bench_buffer_src_iir
  Instructions:            41930928 (+0.298410%)
  L1 Accesses:             60575017 (-0.403218%)
  L2 Accesses:                87029 (-0.810349%)
  RAM Accesses:              100756 (-0.045634%)
  Estimated Cycles:        64536622 (-0.386502%)

bench_buffer_src_biquad
  Instructions:            37529097 (-0.846618%)
  L1 Accesses:             52768075 (-0.667674%)
  L2 Accesses:               117794 (-1.586559%)
  RAM Accesses:              100972 (+0.022784%)
  Estimated Cycles:        56891065 (-0.634670%)

bench_stereo_positional
  Instructions:            44850469 (-1.726766%)
  L1 Accesses:             67166221 (-1.231903%)
  L2 Accesses:               290895 (+2.244209%)
  RAM Accesses:              100958 (-0.077200%)
  Estimated Cycles:        72154226 (-1.108165%)

bench_stereo_panning_automation
  Instructions:            32358310 (+0.399506%)
  L1 Accesses:             48579620 (+0.950843%)
  L2 Accesses:               134788 (-3.874598%)
  RAM Accesses:              100891 (+0.061490%)
  Estimated Cycles:        52784745 (+0.826269%)

bench_analyser_node
  Instructions:            39636423 (-0.368031%)
  L1 Accesses:             55489272 (-0.331125%)
  L2 Accesses:               185066 (+1.474418%)
  RAM Accesses:              101311 (+0.070130%)
  Estimated Cycles:        59960487 (-0.280097%)


src/param.rs Outdated
@@ -1088,7 +1088,11 @@ impl AudioParamProcessor {
match some_event {
None => {
if is_a_rate {
self.buffer.resize(count, self.intrinsic_value);
let buffer = [self.intrinsic_value; RENDER_QUANTUM_SIZE];

Owner

Maybe it's more succinct to write as:

for _ in self.buffer.len()..count {
    // try_push appends at the end; ArrayVec's try_insert also expects an index
    self.buffer.try_push(self.intrinsic_value).unwrap();
}

Or did you benchmark this to be the fastest way?
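
For reference, the padding loop suggested above can be sketched with a small fixed-capacity buffer written in plain Rust. This is only an illustration: `FixedBuf`, `fill_to`, and the value 128 for `RENDER_QUANTUM_SIZE` are assumptions, and the type is a stand-in rather than the arrayvec crate that the PR actually uses.

```rust
// Assumed value for illustration; the real constant lives in the crate.
const RENDER_QUANTUM_SIZE: usize = 128;

/// Hypothetical stand-in for a fixed-capacity vector (not the arrayvec crate).
#[derive(Debug)]
struct FixedBuf {
    data: [f32; RENDER_QUANTUM_SIZE],
    len: usize,
}

impl FixedBuf {
    fn new() -> Self {
        Self { data: [0.0; RENDER_QUANTUM_SIZE], len: 0 }
    }

    fn len(&self) -> usize {
        self.len
    }

    /// Append a value; hand it back as an error if the buffer is full.
    fn try_push(&mut self, value: f32) -> Result<(), f32> {
        if self.len == RENDER_QUANTUM_SIZE {
            return Err(value);
        }
        self.data[self.len] = value;
        self.len += 1;
        Ok(())
    }

    /// View of the items pushed so far.
    fn as_slice(&self) -> &[f32] {
        &self.data[..self.len]
    }
}

/// Pad the buffer with `value` until it holds `count` items.
fn fill_to(buf: &mut FixedBuf, count: usize, value: f32) {
    for _ in buf.len()..count {
        buf.try_push(value).unwrap();
    }
}

fn main() {
    let mut buf = FixedBuf::new();
    buf.try_push(0.25).unwrap();
    fill_to(&mut buf, RENDER_QUANTUM_SIZE, 0.5);
    println!("len = {}", buf.len());
}
```

Because the storage is an array inside the struct, padding never allocates on the heap, which is the point of moving away from `Vec` on the render thread.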

Collaborator Author

OK, I just replaced it with a simple push.

I didn't actually run any particular benchmark, but I don't think this is worth the hassle for now. I don't think this is a really hot path, and in my opinion more important issues should be addressed before focusing on such details. Let's just prefer simplicity and readability (I left a comment to keep the idea around, though).

Note: there is this weird L2 Accesses: 307843 (+12.67221%) in bench_stereo_positional (I have the impression this particular bench is often very unstable, but maybe that's just in my head...). That's a bit confusing, but all the other numbers are very similar between the two versions.
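
To put that percentage in absolute terms, here is a rough sketch that recovers the baseline count from the reported value. It assumes the bot reports the change relative to the baseline run, i.e. `(current - baseline) / baseline * 100`; the helper name `baseline_from` is made up for this example.

```rust
/// Recover the baseline count from a reported value and its relative change.
/// Assumes percent = (current - baseline) / baseline * 100.
fn baseline_from(current: u64, percent_change: f64) -> f64 {
    current as f64 / (1.0 + percent_change / 100.0)
}

fn main() {
    // Numbers taken from the bench_stereo_positional output in this thread.
    let l2_now = 307_843u64;
    let instructions = 44_858_702f64;

    let l2_base = baseline_from(l2_now, 12.67221);
    let extra = l2_now as f64 - l2_base;

    println!("baseline L2 accesses: {:.0}", l2_base);
    println!("extra L2 accesses:    {:.0}", extra);
    println!("extra per instruction: {:.5}", extra / instructions);
}
```

Under that assumption the jump amounts to a few tens of thousands of extra L2 accesses in a run of roughly 45 million instructions.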

Owner

Yeah, the L2 and RAM accesses are quite unstable. But since these numbers are so low I don't think it is very important (and that's probably also the reason for the large deviations). I think instruction count is the main metric to track.
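
One way to see how little the L2 noise moves the headline number: the "Estimated Cycles" values in the bot output are consistent with a weighted sum of the three access counts, L1 + 5 × L2 + 35 × RAM. The weights are an assumption here (the thread does not state the formula), but they reproduce the reported totals. The struct and function names below are hypothetical.

```rust
/// Access counts reported for one benchmark run.
struct CacheStats {
    l1: u64,
    l2: u64,
    ram: u64,
}

/// Weighted sum of accesses. The weights 1, 5 and 35 are an assumption
/// that happens to match the totals printed in this thread.
fn estimated_cycles(s: &CacheStats) -> u64 {
    s.l1 + 5 * s.l2 + 35 * s.ram
}

fn main() {
    // bench_stereo_positional, second benchmark run in this thread.
    let stats = CacheStats { l1: 67_159_790, l2: 307_843, ram: 101_064 };
    let total = estimated_cycles(&stats);
    let l2_share = (5 * stats.l2) as f64 / total as f64 * 100.0;

    println!("estimated cycles = {}", total); // 72236245, as reported by the bot
    println!("L2 share of estimate = {:.2}%", l2_share);
}
```

With these weights the L2 term is only a small slice of the estimate, so even a double-digit swing in L2 accesses shifts the cycle estimate by a fraction of a percent.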

I recently read https://kobzol.github.io/rust/rustc/2023/09/23/rustc-runtime-benchmarks.html, which has some tips worth looking into further:

After the initial refactoring was completed, I needed to decide how will we actually define the benchmarks and what tool we should use to gather the execution metrics. Both cargo bench and criterion are not a bad choice for running benchmarks, but they only measure wall-time, while I also wanted to measure hardware counters. I was considering to use iai for a while. However, it uses Cachegrind for the measurements, while I wanted the benchmarks to be executed natively, without simulation. Also, using Cachegrind wouldn’t produce realistic wall-time results.

In the end, I decided to write a small library called benchlib, so that we would have ultimate control of defining, executing and measuring the benchmarks, instead of relying on external crates. benchlib uses Linux perf events to gather hardware metrics, using the perf-event crate. I also took bits and pieces from other mentioned tools, like the black_box function from iai.
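
As a small illustration of the `black_box` idea mentioned in the quote, here is a wall-time sketch using only the standard library (`std::hint::black_box` and `std::time::Instant`). It is not benchlib and does not read hardware counters; the workload `sum_buffer` and the iteration count are made up for the example.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Hypothetical workload: sum a buffer of samples.
fn sum_buffer(buf: &[f32]) -> f32 {
    buf.iter().sum()
}

fn main() {
    let buf = [0.5f32; 128];
    let iterations = 100_000;

    let start = Instant::now();
    for _ in 0..iterations {
        // black_box hides the input and the result from the optimizer,
        // so the loop body is not folded away as dead code.
        black_box(sum_buffer(black_box(&buf[..])));
    }
    let elapsed = start.elapsed();

    println!("{} iterations in {:?}", iterations, elapsed);
}
```

Wall-time numbers from a loop like this vary from run to run, which is one reason instruction counts are attractive for CI comparisons.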

@orottier
Owner

orottier commented Oct 2, 2023

Thanks, looks good to me except the mentioned nitpick!

@b-ma
Collaborator Author

b-ma commented Oct 3, 2023

/bench

@github-actions

github-actions bot commented Oct 3, 2023

Benchmark result:


bench_ctor
  Instructions:             4533283 (-0.569938%)
  L1 Accesses:              6769740 (-0.570924%)
  L2 Accesses:                54303 (-0.038657%)
  RAM Accesses:               61603 (+0.066600%)
  Estimated Cycles:         9197360 (-0.406523%)

bench_sine
  Instructions:            70948434 (-0.157499%)
  L1 Accesses:            103535456 (-0.123113%)
  L2 Accesses:               263720 (-1.755013%)
  RAM Accesses:               62479 (+0.073679%)
  Estimated Cycles:       107040821 (-0.139535%)

bench_sine_gain
  Instructions:            75978926 (-0.235109%)
  L1 Accesses:            111130496 (-0.197582%)
  L2 Accesses:               275308 (+2.122135%)
  RAM Accesses:               62575 (+0.065564%)
  Estimated Cycles:       114697161 (-0.165352%)

bench_sine_gain_delay
  Instructions:           150981779 (-0.200649%)
  L1 Accesses:            213619925 (-0.188222%)
  L2 Accesses:               601328 (+3.846439%)
  RAM Accesses:               64178 (+0.040529%)
  Estimated Cycles:       218872795 (-0.132578%)

bench_buffer_src
  Instructions:            17508933 (-0.742734%)
  L1 Accesses:             25451728 (-0.626669%)
  L2 Accesses:                87646 (+0.505705%)
  RAM Accesses:              100769 (+0.045670%)
  Estimated Cycles:        29416873 (-0.529828%)

bench_buffer_src_delay
  Instructions:            91167238 (-0.300114%)
  L1 Accesses:            126152775 (-0.280860%)
  L2 Accesses:               162184 (-0.309795%)
  RAM Accesses:              100916 (+0.031720%)
  Estimated Cycles:       130495755 (-0.272605%)

bench_buffer_src_iir
  Instructions:            41934521 (+0.307074%)
  L1 Accesses:             60578354 (-0.398699%)
  L2 Accesses:                88469 (+1.563594%)
  RAM Accesses:              100868 (+0.055549%)
  Estimated Cycles:        64551079 (-0.360767%)

bench_buffer_src_biquad
  Instructions:            37537474 (-0.824486%)
  L1 Accesses:             52785985 (-0.640675%)
  L2 Accesses:               115575 (-0.449624%)
  RAM Accesses:              100959 (+0.003962%)
  Estimated Cycles:        56897425 (-0.598944%)

bench_stereo_positional
  Instructions:            44858702 (-1.698429%)
  L1 Accesses:             67159790 (-1.247776%)
  L2 Accesses:               307843 (+12.67221%)
  RAM Accesses:              101064 (+0.037614%)
  Estimated Cycles:        72236245 (-0.924624%)

bench_stereo_panning_automation
  Instructions:            32358700 (+0.400629%)
  L1 Accesses:             48578465 (+0.943108%)
  L2 Accesses:               136422 (-0.929544%)
  RAM Accesses:              100877 (+0.039668%)
  Estimated Cycles:        52791270 (+0.857559%)

bench_analyser_node
  Instructions:            39640027 (-0.368147%)
  L1 Accesses:             55496034 (-0.328541%)
  L2 Accesses:               183103 (+0.664673%)
  RAM Accesses:              101414 (+0.048340%)
  Estimated Cycles:        59961039 (-0.291285%)


@orottier orottier merged commit 910e238 into orottier:main Oct 3, 2023
3 checks passed
@orottier orottier mentioned this pull request Oct 3, 2023
11 tasks
@b-ma b-ma deleted the perf/param-array-vec branch November 4, 2023 06:42