Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve encode_varint performance by bounding its loop #940

Merged
merged 1 commit into from
Mar 15, 2024

Conversation

mumbleskates
Copy link
Contributor

I believe this is probably allowing the compiler to unroll the loop where it could not before by giving it a more easily provable bound. This has significant performance improvement for me in benchmarks (on two different processors: a 7-series AMD and an old intel mobile processor; break-even on 3-series threadripper); if we have a process for validating these on more machines we should go through it and get more info.

@mumbleskates
Copy link
Contributor Author

here are the relevant benchmark outputs for my available machines, diffed with critcmp:

amd 7950x:

group                   bounded                                master
-----                   -------                                -----
varint/large/encode     1.00    411.1±1.95ns        ? ?/sec    1.18    485.8±0.56ns        ? ?/sec
varint/medium/encode    1.00    223.2±0.28ns        ? ?/sec    1.06    237.4±1.66ns        ? ?/sec
varint/mixed/encode     1.00    219.3±0.22ns        ? ?/sec    1.20   263.5±11.65ns        ? ?/sec
varint/small/encode     1.00     43.5±0.65ns        ? ?/sec    1.44     62.8±0.56ns        ? ?/sec

intel i7-6700:

group                   bounded                                master
-----                   -------                                ------
varint/large/encode     1.00    562.4±5.57ns        ? ?/sec    1.31    739.4±4.93ns        ? ?/sec
varint/medium/encode    1.00    318.3±2.96ns        ? ?/sec    1.17    371.3±3.83ns        ? ?/sec
varint/mixed/encode     1.00    324.2±1.95ns        ? ?/sec    1.26    407.1±5.61ns        ? ?/sec
varint/small/encode     1.00     63.3±0.53ns        ? ?/sec    1.25     79.2±0.94ns        ? ?/sec

intel i7-2760QM (this one's results are finicky):

group                   bounded                                master
-----                   -------                                ------
varint/large/encode     1.00   941.0±12.56ns        ? ?/sec    1.11  1042.5±13.34ns        ? ?/sec
varint/medium/encode    1.06    561.1±7.53ns        ? ?/sec    1.00   528.7±21.71ns        ? ?/sec
varint/mixed/encode     1.00    536.9±8.93ns        ? ?/sec    1.04    557.1±6.92ns        ? ?/sec
varint/small/encode     1.00    107.3±1.73ns        ? ?/sec    1.46    157.2±3.28ns        ? ?/sec

amd 3975WX:

group                   bounded                                master
-----                   -------                                -----
varint/large/encode     1.00   994.6±10.58ns        ? ?/sec    1.00   998.8±16.10ns        ? ?/sec
varint/medium/encode    1.00    495.1±4.99ns        ? ?/sec    1.01    501.7±6.08ns        ? ?/sec
varint/mixed/encode     1.00    547.4±7.17ns        ? ?/sec    1.00    545.9±5.56ns        ? ?/sec
varint/small/encode     1.00    101.9±1.06ns        ? ?/sec    1.00    102.2±2.21ns        ? ?/sec

virtualized on EPYC 7713 (Linode):

group                   bounded                                 master
-----                   -------                                 ------
varint/large/encode     1.00      3.1±0.23µs        ? ?/sec     1.10      3.4±0.43µs        ? ?/sec
varint/medium/encode    1.00  1592.3±155.09ns        ? ?/sec    1.02  1626.2±163.09ns        ? ?/sec
varint/mixed/encode     1.00  1707.0±38.18ns        ? ?/sec     1.05  1786.0±48.74ns        ? ?/sec
varint/small/encode     1.00   322.5±54.06ns        ? ?/sec     1.09   352.7±61.40ns        ? ?/sec

group                   bounded                                 master
-----                   -------                                 ------
varint/large/encode     1.00      3.2±0.10µs        ? ?/sec     1.22      3.9±1.17µs        ? ?/sec
varint/medium/encode    1.00  1624.0±153.63ns        ? ?/sec    1.03  1670.4±300.05ns        ? ?/sec
varint/mixed/encode     1.00  1761.4±55.99ns        ? ?/sec     1.16      2.1±0.70µs        ? ?/sec
varint/small/encode     1.00   343.1±78.44ns        ? ?/sec     1.11  381.8±140.00ns        ? ?/sec

...and contributed by others:
intel i7-13700HX:

group                        bounded                                master
-----                        -------                                ------
varint/large/encode          1.00    530.0±0.82ns        ? ?/sec    1.01    537.8±2.76ns        ? ?/sec
varint/medium/encode         1.00    265.4±0.38ns        ? ?/sec    1.05    278.4±2.84ns        ? ?/sec
varint/mixed/encode          1.00    292.0±1.44ns        ? ?/sec    1.31    381.9±1.59ns        ? ?/sec
varint/small/encode          1.00     54.0±0.90ns        ? ?/sec    1.05     56.7±2.38ns        ? ?/sec

@mumbleskates
Copy link
Contributor Author

i have a macbook sitting around, with an intel i7-9750H:

group                   bounded                                master
-----                   -------                                ------
varint/large/encode     1.15   1191.2±0.71ns        ? ?/sec    1.00   1038.3±0.31ns        ? ?/sec
varint/medium/encode    1.42    725.6±4.22ns        ? ?/sec    1.00    509.9±0.95ns        ? ?/sec
varint/mixed/encode     1.27    720.9±1.61ns        ? ?/sec    1.00    568.7±0.16ns        ? ?/sec
varint/small/encode     1.30    185.7±0.51ns        ? ?/sec    1.00    142.5±1.96ns        ? ?/sec

the thermals on this laptop are a bit of a mess but i'm seeing a consistent regression, very interesting

Copy link
Collaborator

@caspermeijn caspermeijn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rebase to latest master to fix CI

@@ -27,7 +27,7 @@ pub fn encode_varint<B>(mut value: u64, buf: &mut B)
where
B: BufMut,
{
loop {
for _ in 0..10 {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain the constant 10? Maybe a static assert, formula or documentation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it not sufficiently covered by the comment already on this function? i can add it, just wondering

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you are probably right. I saw that comment and started looking into why 10 bytes is correct. I took me a moment to understand.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

protobuf's varints are fairly thoroughly documented to be 10 bytes maximum, yes. i mean they could be 9 but they decided not to do that. i believe every minimal length protobuf varint that is 10 bytes always has 0x01 as its last byte value, which is rather a waste considering how many situations cause them to become maximum length this way. the burden of legacy i guess

@mumbleskates mumbleskates force-pushed the encode-varint-bounded branch from 0790edf to 561fa9c Compare March 8, 2024 10:15
Copy link
Collaborator

@caspermeijn caspermeijn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@caspermeijn caspermeijn added this pull request to the merge queue Mar 15, 2024
@caspermeijn
Copy link
Collaborator

Thank you for your contribution

Merged via the queue into tokio-rs:master with commit 145fd47 Mar 15, 2024
12 checks passed
caspermeijn added a commit to caspermeijn/prost that referenced this pull request Apr 2, 2024
_PROST!_ is a [Protocol Buffers](https://developers.google.com/protocol-buffers/) implementation for the [Rust Language](https://www.rust-lang.org/). `prost` generates simple, idiomatic Rust code from `proto2` and `proto3` files.

This patch updates brings a few new features and fixes:

- Bump MSRV to 1.70 (minimum supported Rust version)
- Rename cargo feature `prost-derive` to `derive` (tokio-rs#992)
- Add @generated comment on top of generated files (tokio-rs#935)
- Optimize implementation of prost::Name when generated by prost-build (tokio-rs#956)

## Dependencies
- build(deps): Allow itertools 0.12 (tokio-rs#948)
- build(deps): Allow heck 0.5 (tokio-rs#1012)
- build(deps): Allow multimap 0.10 (tokio-rs#1013)

## Documentation
- Improve protoc not found error message (tokio-rs#937)
- build: Add development container config (tokio-rs#949)
- docs: Fixed README typos (tokio-rs#952 / tokio-rs#967 / tokio-rs#970)

## Internal
- chore: Fix minimal versions (tokio-rs#920)
- fix: fq_message_name should begin with one dot (tokio-rs#981)
- improve encode_varint performance by bounding its loop (tokio-rs#940)
- style: Remove duplicate function call (tokio-rs#989)
- test: Improve test decode_varint_slow (tokio-rs#977)
- chore: Add dep: prefix to feature dependencies (tokio-rs#919)
- Minor clippy lint fixes. (tokio-rs#1006)
- chore: Use taiki-e/install-action to setup cargo-machete (tokio-rs#909)
- chore: Remove which dependency. (tokio-rs#962)
- chore: Update to actions/checkout@v4 (tokio-rs#910)

```
$ cargo semver-checks --workspace
     Parsing prost v0.12.4 (current)
      Parsed [   4.348s] (current)
     Parsing prost v0.12.3 (baseline, cached)
      Parsed [   0.099s] (baseline)
    Checking prost v0.12.3 -> v0.12.4 (minor change)
     Checked [   0.029s] 61 checks; 61 passed, 6 unnecessary
    Finished [   4.481s] prost
     Parsing prost-build v0.12.4 (current)
      Parsed [   5.488s] (current)
     Parsing prost-build v0.12.3 (baseline)
      Parsed [   3.819s] (baseline)
    Checking prost-build v0.12.3 -> v0.12.4 (minor change)
     Checked [   0.005s] 61 checks; 61 passed, 6 unnecessary
    Finished [   9.319s] prost-build
     Parsing prost-types v0.12.4 (current)
      Parsed [   3.150s] (current)
     Parsing prost-types v0.12.3 (baseline)
      Parsed [   3.192s] (baseline)
    Checking prost-types v0.12.3 -> v0.12.4 (minor change)
     Checked [   0.075s] 61 checks; 61 passed, 6 unnecessary
    Finished [   6.434s] prost-types
```
caspermeijn added a commit to caspermeijn/prost that referenced this pull request Apr 5, 2024
_PROST!_ is a [Protocol Buffers](https://developers.google.com/protocol-buffers/) implementation for the [Rust Language](https://www.rust-lang.org/). `prost` generates simple, idiomatic Rust code from `proto2` and `proto3` files.

This patch update brings new features and fixes:

- Bump MSRV to 1.70 (minimum supported Rust version)
- Rename cargo feature `prost-derive` to `derive` (tokio-rs#992)
- Add @generated comment on top of generated files (tokio-rs#935)
- Optimize implementation of prost::Name when generated by prost-build (tokio-rs#956)

## Dependencies
- build(deps): Allow itertools 0.12 (tokio-rs#948)
- build(deps): Allow heck 0.5 (tokio-rs#1012)
- build(deps): Allow multimap 0.10 (tokio-rs#1013)

## Documentation
- Improve protoc not found error message (tokio-rs#937)
- build: Add development container config (tokio-rs#949)
- docs: Fixed README typos (tokio-rs#952 / tokio-rs#967 / tokio-rs#970)

## Internal
- chore: Fix minimal versions (tokio-rs#920)
- fix: fq_message_name should begin with one dot (tokio-rs#981)
- improve encode_varint performance by bounding its loop (tokio-rs#940)
- style: Remove duplicate function call (tokio-rs#989)
- test: Improve test decode_varint_slow (tokio-rs#977)
- chore: Add dep: prefix to feature dependencies (tokio-rs#919)
- Minor clippy lint fixes. (tokio-rs#1006)
- chore: Use taiki-e/install-action to setup cargo-machete (tokio-rs#909)
- chore: Remove which dependency. (tokio-rs#962)
- chore: Update to actions/checkout@v4 (tokio-rs#910)
caspermeijn added a commit to caspermeijn/prost that referenced this pull request Apr 5, 2024
_PROST!_ is a [Protocol Buffers](https://developers.google.com/protocol-buffers/) implementation for the [Rust Language](https://www.rust-lang.org/). `prost` generates simple, idiomatic Rust code from `proto2` and `proto3` files.

This patch update brings new features and fixes:

- Bump MSRV to 1.70 (minimum supported Rust version)
- Rename cargo feature `prost-derive` to `derive` (tokio-rs#992)
- Add @generated comment on top of generated files (tokio-rs#935)
- Optimize implementation of prost::Name when generated by prost-build (tokio-rs#956)

## Dependencies
- build(deps): Allow itertools 0.12 (tokio-rs#948)
- build(deps): Allow heck 0.5 (tokio-rs#1012)
- build(deps): Allow multimap 0.10 (tokio-rs#1013)

## Documentation
- Improve protoc not found error message (tokio-rs#937)
- build: Add development container config (tokio-rs#949)
- docs: Fixed README typos (tokio-rs#952 / tokio-rs#967 / tokio-rs#970)

## Internal
- chore: Fix minimal versions (tokio-rs#920)
- fix: fq_message_name should begin with one dot (tokio-rs#981)
- improve encode_varint performance by bounding its loop (tokio-rs#940)
- style: Remove duplicate function call (tokio-rs#989)
- test: Improve test decode_varint_slow (tokio-rs#977)
- chore: Add dep: prefix to feature dependencies (tokio-rs#919)
- Minor clippy lint fixes. (tokio-rs#1006)
- chore: Use taiki-e/install-action to setup cargo-machete (tokio-rs#909)
- chore: Remove which dependency. (tokio-rs#962)
- chore: Update to actions/checkout@v4 (tokio-rs#910)
github-merge-queue bot pushed a commit that referenced this pull request Apr 5, 2024
_PROST!_ is a [Protocol Buffers](https://developers.google.com/protocol-buffers/) implementation for the [Rust Language](https://www.rust-lang.org/). `prost` generates simple, idiomatic Rust code from `proto2` and `proto3` files.

This patch update brings new features and fixes:

- Bump MSRV to 1.70 (minimum supported Rust version)
- Rename cargo feature `prost-derive` to `derive` (#992)
- Add @generated comment on top of generated files (#935)
- Optimize implementation of prost::Name when generated by prost-build (#956)

## Dependencies
- build(deps): Allow itertools 0.12 (#948)
- build(deps): Allow heck 0.5 (#1012)
- build(deps): Allow multimap 0.10 (#1013)

## Documentation
- Improve protoc not found error message (#937)
- build: Add development container config (#949)
- docs: Fixed README typos (#952 / #967 / #970)

## Internal
- chore: Fix minimal versions (#920)
- fix: fq_message_name should begin with one dot (#981)
- improve encode_varint performance by bounding its loop (#940)
- style: Remove duplicate function call (#989)
- test: Improve test decode_varint_slow (#977)
- chore: Add dep: prefix to feature dependencies (#919)
- Minor clippy lint fixes. (#1006)
- chore: Use taiki-e/install-action to setup cargo-machete (#909)
- chore: Remove which dependency. (#962)
- chore: Update to actions/checkout@v4 (#910)
gibbz00 pushed a commit to gibbz00/prost that referenced this pull request Apr 6, 2024
_PROST!_ is a [Protocol Buffers](https://developers.google.com/protocol-buffers/) implementation for the [Rust Language](https://www.rust-lang.org/). `prost` generates simple, idiomatic Rust code from `proto2` and `proto3` files.

This patch update brings new features and fixes:

- Bump MSRV to 1.70 (minimum supported Rust version)
- Rename cargo feature `prost-derive` to `derive` (tokio-rs#992)
- Add @generated comment on top of generated files (tokio-rs#935)
- Optimize implementation of prost::Name when generated by prost-build (tokio-rs#956)

## Dependencies
- build(deps): Allow itertools 0.12 (tokio-rs#948)
- build(deps): Allow heck 0.5 (tokio-rs#1012)
- build(deps): Allow multimap 0.10 (tokio-rs#1013)

## Documentation
- Improve protoc not found error message (tokio-rs#937)
- build: Add development container config (tokio-rs#949)
- docs: Fixed README typos (tokio-rs#952 / tokio-rs#967 / tokio-rs#970)

## Internal
- chore: Fix minimal versions (tokio-rs#920)
- fix: fq_message_name should begin with one dot (tokio-rs#981)
- improve encode_varint performance by bounding its loop (tokio-rs#940)
- style: Remove duplicate function call (tokio-rs#989)
- test: Improve test decode_varint_slow (tokio-rs#977)
- chore: Add dep: prefix to feature dependencies (tokio-rs#919)
- Minor clippy lint fixes. (tokio-rs#1006)
- chore: Use taiki-e/install-action to setup cargo-machete (tokio-rs#909)
- chore: Remove which dependency. (tokio-rs#962)
- chore: Update to actions/checkout@v4 (tokio-rs#910)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants