Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster encoding for lengths in BufferBackend. #51

Closed
wants to merge 2 commits into from

Conversation

reinerp
Copy link

@reinerp reinerp commented Feb 5, 2023

Instead of varlen encoding, which requires up to 8 conditional branches in a loop, we use a simpler encoding which is just 4 instructions without branching.

New encoding:

  • length<=254: <1-byte length>
  • otherwise: <0xFF> <usize length> <0xFF>

Improves throughputs by 10-80%, with the biggest improvement in resolve().

get_or_intern/fill-empty/new/BufferBackend
                        time:   [6.4562 ms 6.5068 ms 6.5684 ms]
                        thrpt:  [15.224 Melem/s 15.369 Melem/s 15.489 Melem/s]
                 change:
                        time:   [-8.4289% -7.3761% -6.1844%] (p = 0.00 < 0.05)
                        thrpt:  [+6.5920% +7.9635% +9.2048%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

get_or_intern/fill-empty/with_capacity/BufferBackend
                        time:   [3.8182 ms 3.8467 ms 3.8796 ms]
                        thrpt:  [25.776 Melem/s 25.996 Melem/s 26.190 Melem/s]
                 change:
                        time:   [-18.750% -18.024% -17.246%] (p = 0.00 < 0.05)
                        thrpt:  [+20.840% +21.986% +23.077%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

get_or_intern/already-filled/BufferBackend
                        time:   [2.7638 ms 2.8026 ms 2.8477 ms]
                        thrpt:  [35.116 Melem/s 35.681 Melem/s 36.182 Melem/s]
                 change:
                        time:   [-7.9910% -6.4121% -4.8432%] (p = 0.00 < 0.05)
                        thrpt:  [+5.0897% +6.8514% +8.6851%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe

get_or_intern_static/BufferBackend/get_or_intern
                        time:   [2.8467 µs 2.8632 µs 2.8807 µs]
                        thrpt:  [18.051 Melem/s 18.162 Melem/s 18.267 Melem/s]
                 change:
                        time:   [-1.7629% +0.8260% +4.7452%] (p = 0.71 > 0.05)
                        thrpt:  [-4.5302% -0.8193% +1.7945%]
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe
get_or_intern_static/BufferBackend/get_or_intern_static
                        time:   [2.8259 µs 2.8456 µs 2.8634 µs]
                        thrpt:  [18.160 Melem/s 18.274 Melem/s 18.401 Melem/s]
                 change:
                        time:   [-6.7386% -3.1051% -0.1965%] (p = 0.06 > 0.05)
                        thrpt:  [+0.1969% +3.2046% +7.2255%]
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

resolve/already-filled/BufferBackend
                        time:   [239.71 µs 242.17 µs 245.19 µs]
                        thrpt:  [407.85 Melem/s 412.93 Melem/s 417.16 Melem/s]
                 change:
                        time:   [-46.443% -45.470% -44.561%] (p = 0.00 < 0.05)
                        thrpt:  [+80.377% +83.386% +86.719%]
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe

get/already-filled/BufferBackend
                        time:   [2.3035 ms 2.3222 ms 2.3427 ms]
                        thrpt:  [42.686 Melem/s 43.062 Melem/s 43.411 Melem/s]
                 change:
                        time:   [-2.6436% -1.3829% -0.0703%] (p = 0.03 < 0.05)
                        thrpt:  [+0.0703% +1.4023% +2.7154%]
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

iter/already-filled/BufferBackend
                        time:   [278.13 µs 279.55 µs 281.19 µs]
                        thrpt:  [355.64 Melem/s 357.72 Melem/s 359.54 Melem/s]
                 change:
                        time:   [-46.368% -42.429% -39.796%] (p = 0.00 < 0.05)
                        thrpt:  [+66.102% +73.699% +86.454%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

Instead of varlen encoding, which requires up to 8 conditional branches
in a loop, we use a simpler encoding which is just 4 instructions
without branching.

New encoding:
* length<=254: <1-byte length>
* otherwise: <0xFF> <usize length> <0xFF>

Improves throughputs by 10-80%, with the biggest improvement in
resolve().

get_or_intern/fill-empty/new/BufferBackend
                        time:   [6.4562 ms 6.5068 ms 6.5684 ms]
                        thrpt:  [15.224 Melem/s 15.369 Melem/s 15.489 Melem/s]
                 change:
                        time:   [-8.4289% -7.3761% -6.1844%] (p = 0.00 < 0.05)
                        thrpt:  [+6.5920% +7.9635% +9.2048%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

get_or_intern/fill-empty/with_capacity/BufferBackend
                        time:   [3.8182 ms 3.8467 ms 3.8796 ms]
                        thrpt:  [25.776 Melem/s 25.996 Melem/s 26.190 Melem/s]
                 change:
                        time:   [-18.750% -18.024% -17.246%] (p = 0.00 < 0.05)
                        thrpt:  [+20.840% +21.986% +23.077%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

get_or_intern/already-filled/BufferBackend
                        time:   [2.7638 ms 2.8026 ms 2.8477 ms]
                        thrpt:  [35.116 Melem/s 35.681 Melem/s 36.182 Melem/s]
                 change:
                        time:   [-7.9910% -6.4121% -4.8432%] (p = 0.00 < 0.05)
                        thrpt:  [+5.0897% +6.8514% +8.6851%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe

get_or_intern_static/BufferBackend/get_or_intern
                        time:   [2.8467 µs 2.8632 µs 2.8807 µs]
                        thrpt:  [18.051 Melem/s 18.162 Melem/s 18.267 Melem/s]
                 change:
                        time:   [-1.7629% +0.8260% +4.7452%] (p = 0.71 > 0.05)
                        thrpt:  [-4.5302% -0.8193% +1.7945%]
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe
get_or_intern_static/BufferBackend/get_or_intern_static
                        time:   [2.8259 µs 2.8456 µs 2.8634 µs]
                        thrpt:  [18.160 Melem/s 18.274 Melem/s 18.401 Melem/s]
                 change:
                        time:   [-6.7386% -3.1051% -0.1965%] (p = 0.06 > 0.05)
                        thrpt:  [+0.1969% +3.2046% +7.2255%]
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

resolve/already-filled/BufferBackend
                        time:   [239.71 µs 242.17 µs 245.19 µs]
                        thrpt:  [407.85 Melem/s 412.93 Melem/s 417.16 Melem/s]
                 change:
                        time:   [-46.443% -45.470% -44.561%] (p = 0.00 < 0.05)
                        thrpt:  [+80.377% +83.386% +86.719%]
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe

get/already-filled/BufferBackend
                        time:   [2.3035 ms 2.3222 ms 2.3427 ms]
                        thrpt:  [42.686 Melem/s 43.062 Melem/s 43.411 Melem/s]
                 change:
                        time:   [-2.6436% -1.3829% -0.0703%] (p = 0.03 < 0.05)
                        thrpt:  [+0.0703% +1.4023% +2.7154%]
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

iter/already-filled/BufferBackend
                        time:   [278.13 µs 279.55 µs 281.19 µs]
                        thrpt:  [355.64 Melem/s 357.72 Melem/s 359.54 Melem/s]
                 change:
                        time:   [-46.368% -42.429% -39.796%] (p = 0.00 < 0.05)
                        thrpt:  [+66.102% +73.699% +86.454%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe
Copy link
Owner

@Robbepop Robbepop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the PR. Benchmarks look good. Gonna give it a proper review after the CI has passed.

Comment on lines +158 to +162
let len_bytes =
unsafe { self.buffer.get_unchecked(index - LEN_LOOKBEHIND..index) };
let len_bytes = <&[u8; LEN_LOOKBEHIND]>::try_from(len_bytes).unwrap();
let len = decode_len_backwards(len_bytes);
let str_bytes = unsafe { self.buffer.get_unchecked(index..index + len) };
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this fix #49 ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No; the usize value can take any set of 8 bytes, some of which may fall outside the utf8 encoding.

One possibility is to make a more complex encoding for the usize, with 7 bits per byte, leaving the 8th bit unset. This would keep it within the ASCII range. This requires some encoding/decoding. The problem is that this encoding/decoding requires a shift and an or, per byte, which would defeat most of the performance gains in this PR.

On x86 machines with fast pext/pdep instructions (Intel Haswell and later; AMD Zen 3 and later) we can use pext rax, 0x7f7f7f7f7f7f7f7f for this decoding and pdep rax, 0x7f7f7f7f7f7f7f7f for the encoding. Sadly, older machines either don't have these instruction (Intel) or run them very slowly (Zen 1 and Zen 2).

The "bit-table indicating string start" approach would allow a fast (single-instruction) check, but at the cost of 12.5% space overhead. If we required strings to be aligned at 2 bytes, we could have a bit per 2 bytes (6.25% overhead) or a bit per 4 bytes (3.125% overhead), etc.

I'm not sure which approach is best. I'd probably go with a bit-table with 2 byte alignment as the least-worst compromise.

@Robbepop
Copy link
Owner

Robbepop commented Feb 6, 2024

@reinerp Sorry for taking so long. (a whole year ...) I want to release a new version including your improvements soon. Could you resolve the conflicts and rebase on the current master? That would be of great help. Given the benchmark results I am even confident that we could make the BufferBackend the new default for the string-interner crate.

@Robbepop Robbepop mentioned this pull request Feb 7, 2024
@Robbepop
Copy link
Owner

Robbepop commented May 1, 2024

On my Macbook Pro M2 these are the speed-ups I see:

get_or_intern/fill-empty/new/BufferBackend
                        time:   [2.2334 ms 2.2388 ms 2.2447 ms]
                        thrpt:  [44.549 Melem/s 44.666 Melem/s 44.775 Melem/s]
                 change:
                        time:   [+0.7617% +1.2424% +1.7823%] (p = 0.00 < 0.05)
                        thrpt:  [-1.7510% -1.2271% -0.7559%]
                        Change within noise threshold.

get_or_intern/fill-empty/with_capacity/BufferBackend
                        time:   [1.2190 ms 1.2209 ms 1.2225 ms]
                        thrpt:  [81.801 Melem/s 81.905 Melem/s 82.031 Melem/s]
                 change:
                        time:   [+2.1111% +3.3640% +4.6335%] (p = 0.00 < 0.05)
                        thrpt:  [-4.4283% -3.2545% -2.0675%]
                        Performance has regressed.

get_or_intern/already-filled/BufferBackend
                        time:   [731.74 µs 733.87 µs 735.69 µs]
                        thrpt:  [135.93 Melem/s 136.26 Melem/s 136.66 Melem/s]
                 change:
                        time:   [-1.4299% -0.5519% +0.4034%] (p = 0.25 > 0.05)
                        thrpt:  [-0.4018% +0.5550% +1.4506%]
                        No change in performance detected.

get_or_intern_static/BufferBackend/get_or_intern
                        time:   [1.1811 µs 1.1842 µs 1.1879 µs]
                        thrpt:  [43.775 Melem/s 43.913 Melem/s 44.027 Melem/s]
                 change:
                        time:   [+1.5763% +1.9736% +2.3511%] (p = 0.00 < 0.05)
                        thrpt:  [-2.2971% -1.9354% -1.5519%]
                        Performance has regressed.

get_or_intern_static/BufferBackend/get_or_intern_static
                        time:   [1.1873 µs 1.1909 µs 1.1948 µs]
                        thrpt:  [43.521 Melem/s 43.665 Melem/s 43.798 Melem/s]
                 change:
                        time:   [+1.8508% +2.2119% +2.5798%] (p = 0.00 < 0.05)
                        thrpt:  [-2.5149% -2.1640% -1.8172%]
                        Performance has regressed.

resolve/already-filled/BufferBackend
                        time:   [129.64 µs 129.76 µs 129.90 µs]
                        thrpt:  [769.85 Melem/s 770.66 Melem/s 771.39 Melem/s]
                 change:
                        time:   [-18.586% -18.438% -18.296%] (p = 0.00 < 0.05)
                        thrpt:  [+22.393% +22.606% +22.828%]
                        Performance has improved.

get/already-filled/BufferBackend
                        time:   [622.42 µs 623.86 µs 625.20 µs]
                        thrpt:  [159.95 Melem/s 160.29 Melem/s 160.66 Melem/s]
                 change:
                        time:   [-6.4554% -6.0793% -5.7250%] (p = 0.00 < 0.05)
                        thrpt:  [+6.0727% +6.4728% +6.9008%]
                        Performance has improved.

iter/already-filled/BufferBackend
                        time:   [97.685 µs 99.065 µs 100.63 µs]
                        thrpt:  [993.74 Melem/s 1.0094 Gelem/s 1.0237 Gelem/s]
                 change:
                        time:   [-86.868% -86.658% -86.434%] (p = 0.00 < 0.05)
                        thrpt:  [+637.13% +649.50% +661.49%]
                        Performance has improved.

So no speed-ups for me for get_or_intern but visible speed-ups for get, resolve and in particular iter which is nice.

@Robbepop
Copy link
Owner

Robbepop commented May 1, 2024

With #64 we already have some heavy gains while keeping the old var-len encoding and I am not entirely sure if this PR is still adding a lot of improvements on top. @reinerp if you have time please feel free to rebase and re-evaluate. Otherwise I'd like to close this.

@Robbepop
Copy link
Owner

Closed due to inactivity. @reinerp feel free to re-open if you want to work on this.

@Robbepop Robbepop closed this May 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants