
Micro-optimize the heck out of LEB128 reading and writing. #69050

Merged
merged 1 commit into rust-lang:master from micro-optimize-leb128 on Feb 13, 2020

Conversation

nnethercote
Contributor

This commit makes the following writing improvements:

  • Removes the unnecessary `write_to_vec` function.
  • Reduces the number of conditions per loop from 2 to 1.
  • Avoids a mask and a shift on the final byte.

And the following reading improvements:

  • Removes an unnecessary type annotation.
  • Fixes a dangerous unchecked slice access. Imagine a slice `[0x80]` --
    the current code will read past the end of the slice by some number of
    bytes. The bounds check at the end will subsequently trigger, unless
    something bad (like a crash) happens first. The cost of doing the bounds
    check in the loop body is negligible.
  • Avoids a mask on the final byte.

And the following improvements for both reading and writing:

  • Changes `for` to `loop` for the loops, avoiding an unnecessary
    condition on each iteration. This also removes the need for
    `leb128_size`.

All of these changes give significant perf wins, up to 5%.

r? @michaelwoerister
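
For illustration, here is a minimal sketch of the loop shapes described above (not the exact code that landed), written out for u32 and assuming well-formed input; the actual implementation generates one such pair of functions per integer type via macros, as seen in the annotated code later in the thread:

pub fn write_u32_leb128(out: &mut Vec<u8>, mut value: u32) {
    loop {
        if value < 0x80 {
            // Final byte: no mask, no shift, no continuation bit.
            out.push(value as u8);
            return;
        }
        // One condition per iteration; truncating to u8 makes an explicit
        // `& 0x7f` mask unnecessary once the continuation bit is set.
        out.push((value | 0x80) as u8);
        value >>= 7;
    }
}

pub fn read_u32_leb128(slice: &[u8]) -> (u32, usize) {
    let mut result = 0;
    let mut shift = 0;
    let mut position = 0;
    loop {
        // Ordinary indexing keeps the bounds check inside the loop, so a
        // truncated input such as `[0x80]` panics instead of reading past
        // the end of the slice.
        let byte = slice[position];
        position += 1;
        if byte < 0x80 {
            // Final byte: its high bit is already clear, so no `& 0x7f`.
            return (result | ((byte as u32) << shift), position);
        }
        result |= ((byte & 0x7f) as u32) << shift;
        shift += 7;
    }
}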

@rust-highfive added the S-waiting-on-review label (Status: Awaiting review from the assignee but also interested parties.) on Feb 11, 2020
@nnethercote
Contributor Author

@bors try @rust-timer queue

@rust-timer
Collaborator

Awaiting bors try build completion

@bors
Contributor

bors commented Feb 11, 2020

⌛ Trying commit ad7802f with merge d902ca046d0a8cc72dd69a16627fa5da540030f1...

@nnethercote
Contributor Author

Local check results:

clap-rs-check
        avg: -2.7%      min: -5.6%      max: -0.0%
ucd-check
        avg: -1.3%      min: -2.8%      max: -0.4%
coercions-check
        avg: -1.0%?     min: -2.2%?     max: -0.0%?
tuple-stress-check
        avg: -0.7%      min: -1.6%      max: -0.0%
wg-grammar-check
        avg: -0.6%      min: -1.6%      max: -0.0%
html5ever-check
        avg: -0.9%      min: -1.4%      max: -0.2%
script-servo-check
        avg: -0.8%      min: -1.1%      max: -0.1%
cranelift-codegen-check
        avg: -0.5%      min: -1.0%      max: -0.1%
unused-warnings-check
        avg: -0.4%      min: -1.0%      max: -0.0%
webrender-check
        avg: -0.6%      min: -1.0%      max: -0.1%
regression-31157-check
        avg: -0.6%      min: -1.0%      max: -0.2%
regex-check
        avg: -0.7%      min: -1.0%      max: -0.1%
piston-image-check
        avg: -0.6%      min: -0.9%      max: -0.1%
cargo-check
        avg: -0.5%      min: -0.9%      max: -0.0%
webrender-wrench-check
        avg: -0.6%      min: -0.8%      max: -0.1%
hyper-2-check
        avg: -0.4%      min: -0.8%      max: -0.1%
keccak-check
        avg: -0.3%      min: -0.8%      max: -0.0%
futures-check
        avg: -0.5%      min: -0.8%      max: -0.1%
syn-check
        avg: -0.5%      min: -0.8%      max: -0.1%
packed-simd-check
        avg: -0.4%      min: -0.8%      max: -0.0%
ripgrep-check
        avg: -0.5%      min: -0.8%      max: -0.1%
serde-check
        avg: -0.3%      min: -0.8%      max: -0.0%
encoding-check
        avg: -0.5%      min: -0.8%      max: -0.1%
serde-serde_derive-check
        avg: -0.4%      min: -0.7%      max: -0.0%
style-servo-check
        avg: -0.4%      min: -0.7%      max: -0.0%
tokio-webpush-simple-check
        avg: -0.5%      min: -0.7%      max: -0.2%
inflate-check
        avg: -0.2%      min: -0.7%      max: -0.0%
await-call-tree-check
        avg: -0.6%      min: -0.7%      max: -0.4%
issue-46449-check
        avg: -0.5%      min: -0.7%      max: -0.4%
wf-projection-stress-65510-che...
        avg: -0.2%      min: -0.6%      max: 0.0%
unicode_normalization-check
        avg: -0.2%      min: -0.6%      max: -0.0%
helloworld-check
        avg: -0.3%      min: -0.5%      max: -0.1%
ctfe-stress-4-check
        avg: -0.2%?     min: -0.5%?     max: 0.2%?
unify-linearly-check
        avg: -0.3%      min: -0.4%      max: -0.2%
deeply-nested-check
        avg: -0.3%      min: -0.4%      max: -0.2%
deep-vector-check
        avg: -0.1%      min: -0.3%      max: -0.0%
token-stream-stress-check
        avg: -0.1%      min: -0.1%      max: -0.0%

The biggest improvements are on "clean incremental" runs, followed by "patched incremental".

@bors
Contributor

bors commented Feb 11, 2020

☀️ Try build successful - checks-azure
Build commit: d902ca046d0a8cc72dd69a16627fa5da540030f1 (d902ca046d0a8cc72dd69a16627fa5da540030f1)

@rust-timer
Collaborator

Queued d902ca046d0a8cc72dd69a16627fa5da540030f1 with parent dc4242d, future comparison URL.

@michaelwoerister
Member

That's interesting. I remember that switching the code from `loop` to `for` sped it up considerably a couple of years ago. My theory now is that that past speedup came from duplicating the machine code for each integer type, allowing the branch predictor to do a better job, and that it was big enough to outweigh the extra overhead the `for` loop introduced.

Anyway, I'm happy to get any kind of improvement here. And it's even safer than before 🎉

(In case someone is interested in the history of this implementation: https://github.com/michaelwoerister/encoding-bench contains a number of different versions that I tried out. It's rather messy, as it's essentially a private repo, but an interesting aspect is the test data files, which are generated from actual rustc invocations.)

@rust-timer
Collaborator

Finished benchmarking try commit d902ca046d0a8cc72dd69a16627fa5da540030f1, comparison URL.

@michaelwoerister
Member

@bors r+

Thanks, @nnethercote!

@bors
Contributor

bors commented Feb 12, 2020

📌 Commit ad7802f has been approved by michaelwoerister

@bors added the S-waiting-on-bors label (Status: Waiting on bors to run and complete tests. Bors will change the label on completion.) and removed the S-waiting-on-review label on Feb 12, 2020
@nnethercote
Contributor Author

@bors r- until I have tried out @ranma42's suggestion.

@bors added the S-waiting-on-author label (Status: This is awaiting some action (such as code changes or more information) from the author.) and removed the S-waiting-on-bors label on Feb 12, 2020
@ranma42
Contributor

ranma42 commented Feb 12, 2020

I was just finding it strange that the most significant bit was cleared out (`_ & 0x7f`) just before it was being set (`_ | 0x80`).
I do not think it should make any difference in the timing (or even in the generated code, as I believe LLVM will optimize it out).
If this is a performance-sensitive part of the compiler, I will try to have a deeper look :)
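
As a quick standalone check of that observation (a hypothetical snippet, not from the PR): once the continuation bit is ORed in and the value is truncated to a byte, the earlier mask can never change the result.

fn main() {
    for value in 0u32..=0xFFFF {
        // Masking with 0x7f before setting the continuation bit is redundant:
        // the cast to u8 already discards the high bits, and `| 0x80` then
        // sets bit 7 regardless of whether the mask had cleared it.
        assert_eq!(((value & 0x7f) as u8) | 0x80, (value as u8) | 0x80);
    }
    println!("masking before `| 0x80` never changes the emitted byte");
}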

@eddyb
Member

eddyb commented Feb 12, 2020

@nnethercote If you're bored, I wonder how this implementation compares to the pre-#59820 one in libproc_macro (which I implemented from scratch in safe code).

It definitely feels like your new version here is close to mine, but without checking I can't tell which one LLVM will prefer (or if they compile all the same).

EDIT: also, has anyone considered using SIMD here, like @BurntSushi and others have employed for handling UTF-8/regexes etc.? I'm asking because UTF-8 is like a more complex LEB128.

@bjorn3
Member

bjorn3 commented Feb 12, 2020

UTF-8 validation handles a lot of codepoints every call, while these read and write methods only handle a single LEB128 int per call, so SIMD is likely not useful.

@eddyb
Member

eddyb commented Feb 12, 2020

> while these read and write methods only handle a single LEB128 int per call

May not be relevant, but the serialized data is basically a sequence of LEB128s (perhaps intermixed with strings); they just semantically represent more hierarchical values than a UTF-8 stream.

@ranma42
Contributor

ranma42 commented Feb 12, 2020

If you are willing to do processor-specific tuning, PDEP/PEXT (available on modern x86 processors) might be better suited than generic SIMD for this task.

@gereeter
Contributor

> also, has anyone considered using SIMD here

See also Masked VByte [arXiv].

@nnethercote
Contributor Author

> @nnethercote If you're bored, I wonder how this implementation compares to the pre-#59820 one in libproc_macro (which I implemented from scratch in safe code).

I tried the read and write implementations from libproc_macro individually; both were slower than the code in this PR.

@nnethercote
Contributor Author

> also, has anyone considered using SIMD here

> See also Masked VByte [arXiv].

Thanks for the link, I will take a look... but not in this PR :)

@nnethercote
Contributor Author

@bors r=michaelwoerister

@bors
Contributor

bors commented Feb 13, 2020

📌 Commit ad7802f has been approved by michaelwoerister

@bors added the S-waiting-on-bors label and removed the S-waiting-on-author label on Feb 13, 2020
@nnethercote
Contributor Author

BTW, in case anyone is curious, here's how I approached this bug. From profiling with Callgrind I saw that clap-rs-Check-CleanIncr was the benchmark+run+build combination most affected by LEB128 encoding. Its text output has entries like this:

265,344,872 ( 2.97%)  /home/njn/moz/rust0/src/libserialize/leb128.rs:rustc::ty::query::on_disk_cache::__ty_decoder_impl::<impl serialize::serialize::Decoder for rustc::ty::query::on_disk_cache::CacheDecoder>::read_usize
236,097,015 ( 2.64%)  /home/njn/moz/rust0/src/libserialize/leb128.rs:<rustc::ty::query::on_disk_cache::CacheEncoder<E> as serialize::serialize::Encoder>::emit_u32
213,551,888 ( 2.39%)  /home/njn/moz/rust0/src/libserialize/leb128.rs:rustc::ty::codec::encode_with_shorthand
165,042,682 ( 1.85%)  /home/njn/moz/rust0/src/libserialize/leb128.rs:<rustc_target::abi::VariantIdx as serialize::serialize::Decodable>::decode
 40,540,500 ( 0.45%)  /home/njn/moz/rust0/src/libserialize/leb128.rs:<u32 as serialize::serialize::Encodable>::encode
 24,026,292 ( 0.27%)  /home/njn/moz/rust0/src/libserialize/leb128.rs:serialize::serialize::Encoder::emit_seq
 20,160,540 ( 0.23%)  /home/njn/moz/rust0/src/libserialize/leb128.rs:<rustc::dep_graph::serialized::SerializedDepNodeIndex as serialize::serialize::Decodable>::decode
  9,661,323 ( 0.11%)  /home/njn/moz/rust0/src/libserialize/leb128.rs:serialize::serialize::Decoder::read_tuple
  4,898,927 ( 0.05%)  /home/njn/moz/rust0/src/libserialize/leb128.rs:<rustc::ty::query::on_disk_cache::CacheEncoder<E> as serialize::serialize::Encoder>::emit_usize
  3,384,018 ( 0.04%)  /home/njn/moz/rust0/src/libserialize/leb128.rs:<rustc_metadata::rmeta::encoder::EncodeContext as serialize::serialize::Encoder>::emit_u32
  2,296,440 ( 0.03%)  /home/njn/moz/rust0/src/libserialize/leb128.rs:<rustc::ty::UniverseIndex as serialize::serialize::Decodable>::decode

These are instruction counts, and the percentages sum to about 11%. Lots of different functions are involved because the LEB128 functions are inlined, but the file is leb128.rs in all of them, so I could tell where the relevant code lives. And the annotated code in that file looks like this:

          .           macro_rules! impl_write_unsigned_leb128 {
          .               ($fn_name:ident, $int_ty:ident) => {
          .                   #[inline]
          .                   pub fn $fn_name(out: &mut Vec<u8>, mut value: $int_ty) {
          .                       for _ in 0..leb128_size!($int_ty) {
143,877,210 ( 1.61%)                  let mut byte = (value & 0x7F) as u8;
 48,003,612 ( 0.54%)                  value >>= 7;
239,884,434 ( 2.69%)                  if value != 0 {
 47,959,070 ( 0.54%)                      byte |= 0x80;
          .                           }
          .
          .                           write_to_vec(out, byte);
          .
 47,959,070 ( 0.54%)                  if value == 0 {
          .                               break;
          .                           }
          .                       }
          .                   }
          .               };
          .           }
          .
          .           impl_write_unsigned_leb128!(write_u16_leb128, u16);
-- line 50 ----------------------------------------
-- line 57 ----------------------------------------
          .               ($fn_name:ident, $int_ty:ident) => {
          .                   #[inline]
          .                   pub fn $fn_name(slice: &[u8]) -> ($int_ty, usize) {
          .                       let mut result: $int_ty = 0;
          .                       let mut shift = 0;
          .                       let mut position = 0;
          .
          .                       for _ in 0..leb128_size!($int_ty) {
 59,507,824 ( 0.67%)                  let byte = unsafe { *slice.get_unchecked(position) };
          .                           position += 1;
204,126,888 ( 2.29%)                  result |= ((byte & 0x7F) as $int_ty) << shift;
119,023,350 ( 1.33%)                  if (byte & 0x80) == 0 {
          .                               break;
          .                           }
          .                           shift += 7;
          .                       }
          .
          .                       // Do a single bounds check at the end instead of for every byte.
 67,805,748 ( 0.76%)              assert!(position <= slice.len());
          .
          .                       (result, position)
          .                   }
          .               };
          .           }

Those percentages also add up to about 11%. Plus I poked around a bit at call sites and found this in a different file (libserialize/opaque.rs):

         .           macro_rules! read_uleb128 {
          .               ($dec:expr, $fun:ident) => {{
100,680,777 ( 1.13%)          let (value, bytes_read) = leb128::$fun(&$dec.data[$dec.position..]);
 67,858,196 ( 0.76%)          $dec.position += bytes_read;
 43,378,625 ( 0.49%)          Ok(value)
          .               }};
          .           }

which is another 2.38%. So it was clear that LEB128 reading/writing was hot.

I then tried gradually improving the code. I ended up measuring 18 different changes: 10 were improvements (which I kept) and 8 were regressions (which I discarded). The following table shows the notes I took. The descriptions of the changes are a bit cryptic, but the basic technique should be clear.

IMPROVEMENTS
            clap-rs-Check-CleanIncr
feb10/Leb0  8,992M        $RUSTC0
feb10/Leb1  8,927M/99.3%  First attempt
feb11/Leb4  8,996M        $RUSTC0 but with bounds checking
feb11/Leb5  8,983M        `loop` for reading
feb11/Leb6  8,928M/99.3%  `loop` for writing, `write_to_vec` removed
feb11/Leb8  8,829M/98.1%  avoid mask on final byte in read loop
feb11/Leb9  8,529M/94.8%  in write loop, avoid a condition
feb11/Leb10 8,488M/94.4%  in write loop, mask/shift on final byte
feb13/Leb13 8,488M/94.4%  in write loop, push `(value | 0x80) as u8`
feb13/Leb15 8,488M/94.4%  in read loop, do `as` before `&`
feb13/Leb18 8,492M/94.4%  Landed (not sure about the extra 4M, oh well)

REGRESSIONS
feb11/Leb2  8,927M/99.3%  add slice0, slice1, slice2 vars
feb11/Leb3  9,127M        move the slow loop into a separate no-inline function
feb11/Leb7  8,930M        `< 128` in read loop
feb11/Leb11 8,492M        use `byte < 0x80` in read loop
feb12/Leb12 8,721M        unsafe pushing in write
feb13/Leb14 8,494M/94.4%  in write loop, push `(value as u8) | 0x80`
feb13/Leb16 8,831M        eddyb's write loop
feb13/Leb17 8,578M        eddyb's read loop

Every iteration took about 6.5 minutes to recompile, and about 2 minutes to measure with Cachegrind. I interleaved these steps with other work, so in practice each iteration took anywhere from 10-30 minutes, depending on context-switching delays.

The measurements in the notes are close to those from the CI run, which indicate the following for clap-rs-Check-CleanIncr:

  • instructions: -5.3%
  • cycles: -4.4%
  • wall-time: -3.9%

Instruction counts are almost deterministic and highly reliable. Cycle counts are more variable but still reasonable. Wall-time is highly variable and barely trustworthy. But they're all pointing in the same direction, which is encouraging.

Looking at the instruction counts: LEB128 operations originally accounted for about 11-13% of all instructions, and the total instruction count went down by about 5%, which suggests that the LEB128 operations are now a bit less than twice as fast as they were. Pretty good.
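
As a back-of-the-envelope check of that estimate (numbers taken from the text above and treated as approximate):

fn main() {
    // LEB128 code was roughly 12% of all instructions before the change, and
    // the total instruction count dropped by roughly 5 percentage points, so
    // the LEB128 share fell from ~12% to ~7% of the original total: roughly a
    // 1.7x speedup for the LEB128 code itself.
    let before = 0.12_f64;
    let drop = 0.05_f64;
    println!("estimated LEB128 speedup: {:.1}x", before / (before - drop));
}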

Dylan-DPC-zz pushed a commit to Dylan-DPC-zz/rust that referenced this pull request Feb 13, 2020
…r=michaelwoerister

Micro-optimize the heck out of LEB128 reading and writing.

bors added a commit that referenced this pull request Feb 13, 2020
Rollup of 9 pull requests

Successful merges:

 - #67642 (Relax bounds on HashMap/HashSet)
 - #68848 (Hasten macro parsing)
 - #69008 (Properly use parent generics for opaque types)
 - #69048 (Suggestion when encountering assoc types from hrtb)
 - #69049 (Optimize image sizes)
 - #69050 (Micro-optimize the heck out of LEB128 reading and writing.)
 - #69068 (Make the SGX arg cleanup implementation a NOP)
 - #69082 (When expecting `BoxFuture` and using `async {}`, suggest `Box::pin`)
 - #69104 (bootstrap: Configure cmake when building sanitizer runtimes)

Failed merges:

r? @ghost
@bors merged commit ad7802f into rust-lang:master on Feb 13, 2020
@bors
Contributor

bors commented Feb 13, 2020

☔ The latest upstream changes (presumably #69118) made this pull request unmergeable. Please resolve the merge conflicts.

@bors added the S-waiting-on-author label and removed the S-waiting-on-bors label on Feb 13, 2020
@nnethercote deleted the micro-optimize-leb128 branch on February 13, 2020 at 08:29
@Veedrac
Contributor

Veedrac commented Feb 13, 2020

In response to earlier comments, PDEP can be used to encode with something like (untested)

#[cfg(target_arch = "x86_64")]
fn leb128enc(value: u32) -> [u8; 8] {
    use std::arch::x86_64::_pdep_u64;
    // Continuation-bit positions: the high bit of each byte.
    let hi = 0x8080_8080_8080_8080u64;
    // Deposit each 7-bit group of `value` into the low 7 bits of a byte.
    let split = unsafe { _pdep_u64(value as u64, !hi) };
    // Set the continuation bit on every byte before the last occupied one.
    let tags = (!0u64 >> (split | 1).leading_zeros()) & hi;
    (split | tags).to_le_bytes()
}

You can do a similar thing with PEXT for decoding. Encoding larger integers is probably best off just using a branch to handle full chunks of 56 bits (with `let tags = hi`) before finishing with the above.
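
For completeness, a decoding sketch along the same lines (untested and hypothetical, like the snippet above; it assumes at least 8 readable bytes and a well-formed encoding, and the name leb128dec is made up here):

#[cfg(target_arch = "x86_64")]
fn leb128dec(bytes: [u8; 8]) -> (u32, usize) {
    use std::arch::x86_64::_pext_u64;
    let word = u64::from_le_bytes(bytes);
    let hi = 0x8080_8080_8080_8080u64;
    // Bit position of the first clear continuation bit (7, 15, 23, ...).
    let terminator = (!word & hi).trailing_zeros();
    let len = (terminator / 8 + 1) as usize;
    // Keep only the bytes belonging to this integer...
    let keep = !0u64 >> (63 - terminator);
    // ...and gather their low 7 bits into a contiguous value.
    let value = unsafe { _pext_u64(word & keep, !hi) } as u32;
    (value, len)
}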

@nnethercote
Contributor Author

@fitzgen tried using PEXT a while back in a different project. For the common case (small integers that fit in 1 byte) it was a slight slowdown:
https://twitter.com/fitzgen/status/1138784734417432576

@fitzgen
Member

fitzgen commented Feb 13, 2020

Also, on Intel chips, PEXT is implemented in hardware and super fast (one or two cycles IIRC), but on AMD it is implemented in microcode and is much slower (150-300 cycles). Would have to be careful with it.

@Veedrac
Contributor

Veedrac commented Feb 13, 2020

@nnethercote The thing I would worry about with PEXT is the copy; if you do that byte-at-a-time (or with memcpy) you probably eat a lot of the gains. The key for a fast variable-length copy is to always add the maximum size and then bump the pointer by the length instead (or truncate the vector, in the Rust case). Being able to avoid the >10% mispredict rate probably pays for the few extra instructions in the common cases, but you need to specifically design for that.
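
A hypothetical sketch of that "copy the maximum, then shrink" idea for the Vec case (names made up here): always append a fixed 8-byte chunk so the copy itself is branch-free on the encoded length, then trim the vector back by the unused tail.

fn push_leb128_chunk(out: &mut Vec<u8>, chunk: [u8; 8], len: usize) {
    // Fixed-size copy: no data-dependent branch on the encoded length.
    out.extend_from_slice(&chunk);
    // Give back the bytes that were not part of the encoding.
    out.truncate(out.len() - (8 - len));
}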

nnethercote added a commit to nnethercote/rust that referenced this pull request Feb 16, 2020
bors added a commit that referenced this pull request Feb 16, 2020
Tweak LEB128 reading some more.

PR #69050 changed LEB128 reading and writing. After it landed I did some
double-checking and found that the writing changes were universally a
speed-up, but the reading changes were not. I'm not exactly sure why,
perhaps there was a quirk of inlining in the particular revision I was
originally working from.

This commit reverts some of the reading changes, while still avoiding
`unsafe` code. I have checked it on multiple revisions and the speed-ups
seem to be robust.

r? @michaelwoerister
@nnethercote
Contributor Author

#92604 is a successor to this PR, for those who like LEB128 micro-optimizations.

d-e-s-o added a commit to d-e-s-o/blazesym that referenced this pull request Jun 5, 2024
d-e-s-o added a commit to d-e-s-o/blazesym that referenced this pull request Jun 5, 2024
d-e-s-o added a commit to libbpf/blazesym that referenced this pull request Jun 5, 2024
As it turns out, the Rust compiler uses variable length LEB128 encoded
integers internally. It so happens that they spent a fair amount of
effort micro-optimizing the decoding functionality [0] [1], as it's in
the hot path.
With this change we replace our decoding routines with these optimized
ones. To make that happen more easily (and to gain some base line speed
up), also remove the "shift" return from the respective methods. As a
result of these changes, we see a respectable speed up:

Before:
  test util::tests::bench_u64_leb128_reading  ... bench:  128 ns/iter (+/- 10)

After:
  test util::tests::bench_u64_leb128_reading  ... bench:  103 ns/iter (+/- 5)

Gsym decoding, which uses these routines, improved as follows:
  main/symbolize_gsym_multi_no_setup
    time:   [146.26 µs 146.69 µs 147.18 µs]
    change: [−7.2075% −5.7106% −4.4870%] (p = 0.00 < 0.02)
    Performance has improved.

[0] rust-lang/rust#69050
[1] rust-lang/rust#69157

Signed-off-by: Daniel Müller <deso@posteo.net>