Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: Zero-copy String Types #6

Draft
wants to merge 3 commits into
base: master
Choose a base branch
from
Draft

Conversation

ParkMyCar
Copy link
Member

work in progress

@ParkMyCar ParkMyCar force-pushed the str/zero-copy-string-type branch from c6c0d9b to 7b1bb93 Compare February 2, 2024 21:42
@ParkMyCar ParkMyCar force-pushed the str/zero-copy-string-type branch from 7b1bb93 to 080397e Compare February 2, 2024 22:38
danhhz added a commit to danhhz/materialize that referenced this pull request Feb 21, 2024
In the (presumably) common case of Rows made entirely of stack-allocated
`Datum` (`Bool`, `Int64`, etc) this reduces the allocations per Row from
2 to 0. `Bytes` could be the same with careful use of prost's
`bytes::Bytes` feature. `String` would need something like Parker's
MaterializeInc/prost#6. For more complex types
(`List`, `AclItem`, etc) this PR is probably a rounding error.

This is accomplished by adding a new `decode_into` variant of the
"legacy" `Codec::decode` method which enables amortizing allocations
across calls in two ways. First, instead of returning `Self`, it takes
`&mut Self` as a parameter, allowing for reuse of the internal
`Row/SourceData` allocations. Second, it adds a `type Storage` to
`Codec` and also passes it in as `&mut Option<Self::Storage>`, allowing
for reuse for ProtoRow/ProtoSourceData allocations. `decode_into` gets a
default implementation for `Codec`s that can't take advantage of this.

This has no effect on the "structured" `Codec` path that we'll hopefully
soon be moving to. I'm generally hesitant to spend too much time
optimizing the "legacy" path, but this is simple enough and a big enough
win that it felt worth it to me.

We then take advantage of this in the `Codec` impls for `Row` (entirely
unused and could be remove) and `SourceData` (the main one used in
production).

To reuse the `Vec<ProtoDatum>` in the `ProtoRow` in the `Ok` variant of
`ProtoSourceData`, we take advantage of prost's `Message::merge`, which
decodes an encoded proto into an existing message using proto's [merge]
semantics. We do this because `Message::merge` seems to be careful and
smart about reusing existing allocations when it can. To make this work
(we don't actually want "merge" semantics), we first clear the message,
so we're merging into an empty one. Unfortunately, the generated code
for `ProtoSourceData::clear` is not able to keep the allocations around,
so we have to manually "clear" it to the equivalent of an `Ok` with an
empty `Row`.

[merge]: https://protobuf.dev/programming-guides/encoding/#last-one-wins

To reuse the internal allocs of the `Row` inside the `SourceData`, we
optimize for the (definitely) common case of `Ok`. This is done manually
with some match statements and by exposing a new `Row::decode_from`
method.

The benchmarks should be no-ops but are not. I have no explanation for
this, but it seems to reproduce.

SourceData

    roundtrip_int64_decode_legacy
                            time:   [1.7787 ms 1.7798 ms 1.7810 ms]
                            change: [-2.2133% -1.8816% -1.5910%] (p = 0.00 < 0.05)
                            Performance has improved.

    roundtrip_bytes_decode_legacy
                            time:   [2.7741 ms 2.7763 ms 2.7800 ms]
                            change: [-0.3923% -0.2916% -0.1545%] (p = 0.00 < 0.05)
                            Change within noise threshold.

    roundtrip_string_decode_legacy
                            time:   [2.4841 ms 2.4855 ms 2.4872 ms]
                            change: [+2.0101% +2.1176% +2.2234%] (p = 0.00 < 0.05)
                            Performance has regressed.

Row

    roundtrip_decode_legacy time:   [1.8818 ms 1.8834 ms 1.8851 ms]
                            change: [-6.4160% -5.9947% -5.6271%] (p = 0.00 < 0.05)
                            Performance has improved.
danhhz added a commit to danhhz/materialize that referenced this pull request Feb 21, 2024
In the (presumably) common case of Rows made entirely of stack-allocated
`Datum` (`Bool`, `Int64`, etc) this reduces the allocations per Row from
2 to 0. `Bytes` could be the same with careful use of prost's
`bytes::Bytes` feature. `String` would need something like Parker's
MaterializeInc/prost#6. For more complex types
(`List`, `AclItem`, etc) this PR is probably a rounding error.

This is accomplished by adding a new `decode_into` variant of the
"legacy" `Codec::decode` method which enables amortizing allocations
across calls in two ways. First, instead of returning `Self`, it takes
`&mut Self` as a parameter, allowing for reuse of the internal
`Row/SourceData` allocations. Second, it adds a `type Storage` to
`Codec` and also passes it in as `&mut Option<Self::Storage>`, allowing
for reuse for ProtoRow/ProtoSourceData allocations. `decode_into` gets a
default implementation for `Codec`s that can't take advantage of this.

This has no effect on the "structured" `Codec` path that we'll hopefully
soon be moving to. I'm generally hesitant to spend too much time
optimizing the "legacy" path, but this is simple enough and a big enough
win that it felt worth it to me.

We then take advantage of this in the `Codec` impls for `Row` (entirely
unused and could be remove) and `SourceData` (the main one used in
production).

To reuse the `Vec<ProtoDatum>` in the `ProtoRow` in the `Ok` variant of
`ProtoSourceData`, we take advantage of prost's `Message::merge`, which
decodes an encoded proto into an existing message using proto's [merge]
semantics. We do this because `Message::merge` seems to be careful and
smart about reusing existing allocations when it can. To make this work
(we don't actually want "merge" semantics), we first clear the message,
so we're merging into an empty one. Unfortunately, the generated code
for `ProtoSourceData::clear` is not able to keep the allocations around,
so we have to manually "clear" it to the equivalent of an `Ok` with an
empty `Row`.

[merge]: https://protobuf.dev/programming-guides/encoding/#last-one-wins

To reuse the internal allocs of the `Row` inside the `SourceData`, we
optimize for the (definitely) common case of `Ok`. This is done manually
with some match statements and by exposing a new `Row::decode_from`
method.

The benchmarks should be no-ops but are not. I have no explanation for
this, but it seems to reproduce.

SourceData

    roundtrip_int64_decode_legacy
                            time:   [1.7787 ms 1.7798 ms 1.7810 ms]
                            change: [-2.2133% -1.8816% -1.5910%] (p = 0.00 < 0.05)
                            Performance has improved.

    roundtrip_bytes_decode_legacy
                            time:   [2.7741 ms 2.7763 ms 2.7800 ms]
                            change: [-0.3923% -0.2916% -0.1545%] (p = 0.00 < 0.05)
                            Change within noise threshold.

    roundtrip_string_decode_legacy
                            time:   [2.4841 ms 2.4855 ms 2.4872 ms]
                            change: [+2.0101% +2.1176% +2.2234%] (p = 0.00 < 0.05)
                            Performance has regressed.

Row

    roundtrip_decode_legacy time:   [1.8818 ms 1.8834 ms 1.8851 ms]
                            change: [-6.4160% -5.9947% -5.6271%] (p = 0.00 < 0.05)
                            Performance has improved.
ParkMyCar pushed a commit to MaterializeInc/materialize that referenced this pull request Feb 22, 2024
In the (presumably) common case of Rows made entirely of stack-allocated
`Datum` (`Bool`, `Int64`, etc) this reduces the allocations per Row from
2 to 0. `Bytes` could be the same with careful use of prost's
`bytes::Bytes` feature. `String` would need something like Parker's
MaterializeInc/prost#6. For more complex types
(`List`, `AclItem`, etc) this PR is probably a rounding error.

This is accomplished by adding a new `decode_into` variant of the
"legacy" `Codec::decode` method which enables amortizing allocations
across calls in two ways. First, instead of returning `Self`, it takes
`&mut Self` as a parameter, allowing for reuse of the internal
`Row/SourceData` allocations. Second, it adds a `type Storage` to
`Codec` and also passes it in as `&mut Option<Self::Storage>`, allowing
for reuse of ProtoRow/ProtoSourceData allocations. `decode_into` gets a
default implementation for `Codec`s that can't take advantage of this.

This has no effect on the "structured" `Codec` path that we'll hopefully
soon be moving to. I'm generally hesitant to spend too much time
optimizing the "legacy" path, but this is simple enough and a big enough
win that it felt worth it to me.

We then take advantage of this in the `Codec` impls for `Row` (entirely
unused and could be remove) and `SourceData` (the main one used in
production).

To reuse the `Vec<ProtoDatum>` in the `ProtoRow` in the `Ok` variant of
`ProtoSourceData`, we take advantage of prost's `Message::merge`, which
decodes an encoded proto into an existing message using proto's [merge]
semantics. We do this because `Message::merge` seems to be careful and
smart about reusing existing allocations when it can. To make this work
(we don't actually want "merge" semantics), we first clear the message,
so we're merging into an empty one. Unfortunately, the generated code
for `ProtoSourceData::clear` is not able to keep the allocations around,
so we have to manually "clear" it to the equivalent of an `Ok` with an
empty `Row`.

[merge]: https://protobuf.dev/programming-guides/encoding/#last-one-wins

To reuse the internal allocs of the `Row` inside the `SourceData`, we
optimize for the (definitely) common case of `Ok`. This is done manually
with some match statements and by exposing a new `Row::decode_from`
method.

Allocs before (for Row of all Int64s):

<img width="855" alt="Screenshot 2024-02-21 at 9 40 30 AM"
src="https://github.com/MaterializeInc/materialize/assets/52528/406af726-3237-4a5e-8d18-b71b8bac3f47">

Allocs after:

<img width="858" alt="Screenshot 2024-02-21 at 9 40 45 AM"
src="https://github.com/MaterializeInc/materialize/assets/52528/6cb16bf3-bede-45e2-954b-a06956182c29">

Benchmarks:

    roundtrip_int64_decode_legacy
                            time:   [1.4475 ms 1.4484 ms 1.4493 ms]
change: [-20.305% -20.018% -19.760%] (p = 0.00 < 0.05)
                            Performance has improved.

    roundtrip_bytes_decode_legacy
                            time:   [2.2823 ms 2.2851 ms 2.2888 ms]
change: [-18.031% -17.935% -17.793%] (p = 0.00 < 0.05)
                            Performance has improved.

    roundtrip_string_decode_legacy
                            time:   [1.9071 ms 1.9086 ms 1.9104 ms]
change: [-21.660% -21.548% -21.397%] (p = 0.00 < 0.05)
                            Performance has improved.

### Motivation

  * This PR adds a feature that has not yet been specified.

### Tips for reviewer

I haven't yet actually hooked this up in the main prod codepath. Will do
that now and push it as a commit if it's easy, or a separate PR if it's
not.

### Checklist

- [ ] This PR has adequate test coverage / QA involvement has been duly
considered.
- [ ] This PR has an associated up-to-date [design
doc](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/README.md),
is a design doc
([template](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/00000000_template.md)),
or is sufficiently small to not require a design.
  <!-- Reference the design in the description. -->
- [ ] If this PR evolves [an existing `$T ⇔ Proto$T`
mapping](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/command-and-response-binary-encoding.md)
(possibly in a backwards-incompatible way), then it is tagged with a
`T-proto` label.
- [ ] If this PR will require changes to cloud orchestration or tests,
there is a companion cloud PR to account for those changes that is
tagged with the release-blocker label
([example](MaterializeInc/cloud#5021)).
<!-- Ask in #team-cloud on Slack if you need help preparing the cloud
PR. -->
- [ ] This PR includes the following [user-facing behavior
changes](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/guide-changes.md#what-changes-require-a-release-note):
- <!-- Add release notes here or explicitly state that there are no
user-facing behavior changes. -->
chuck-alt-delete pushed a commit to MaterializeInc/materialize that referenced this pull request Mar 1, 2024
In the (presumably) common case of Rows made entirely of stack-allocated
`Datum` (`Bool`, `Int64`, etc) this reduces the allocations per Row from
2 to 0. `Bytes` could be the same with careful use of prost's
`bytes::Bytes` feature. `String` would need something like Parker's
MaterializeInc/prost#6. For more complex types
(`List`, `AclItem`, etc) this PR is probably a rounding error.

This is accomplished by adding a new `decode_into` variant of the
"legacy" `Codec::decode` method which enables amortizing allocations
across calls in two ways. First, instead of returning `Self`, it takes
`&mut Self` as a parameter, allowing for reuse of the internal
`Row/SourceData` allocations. Second, it adds a `type Storage` to
`Codec` and also passes it in as `&mut Option<Self::Storage>`, allowing
for reuse of ProtoRow/ProtoSourceData allocations. `decode_into` gets a
default implementation for `Codec`s that can't take advantage of this.

This has no effect on the "structured" `Codec` path that we'll hopefully
soon be moving to. I'm generally hesitant to spend too much time
optimizing the "legacy" path, but this is simple enough and a big enough
win that it felt worth it to me.

We then take advantage of this in the `Codec` impls for `Row` (entirely
unused and could be remove) and `SourceData` (the main one used in
production).

To reuse the `Vec<ProtoDatum>` in the `ProtoRow` in the `Ok` variant of
`ProtoSourceData`, we take advantage of prost's `Message::merge`, which
decodes an encoded proto into an existing message using proto's [merge]
semantics. We do this because `Message::merge` seems to be careful and
smart about reusing existing allocations when it can. To make this work
(we don't actually want "merge" semantics), we first clear the message,
so we're merging into an empty one. Unfortunately, the generated code
for `ProtoSourceData::clear` is not able to keep the allocations around,
so we have to manually "clear" it to the equivalent of an `Ok` with an
empty `Row`.

[merge]: https://protobuf.dev/programming-guides/encoding/#last-one-wins

To reuse the internal allocs of the `Row` inside the `SourceData`, we
optimize for the (definitely) common case of `Ok`. This is done manually
with some match statements and by exposing a new `Row::decode_from`
method.

Allocs before (for Row of all Int64s):

<img width="855" alt="Screenshot 2024-02-21 at 9 40 30 AM"
src="https://github.com/MaterializeInc/materialize/assets/52528/406af726-3237-4a5e-8d18-b71b8bac3f47">

Allocs after:

<img width="858" alt="Screenshot 2024-02-21 at 9 40 45 AM"
src="https://github.com/MaterializeInc/materialize/assets/52528/6cb16bf3-bede-45e2-954b-a06956182c29">

Benchmarks:

    roundtrip_int64_decode_legacy
                            time:   [1.4475 ms 1.4484 ms 1.4493 ms]
change: [-20.305% -20.018% -19.760%] (p = 0.00 < 0.05)
                            Performance has improved.

    roundtrip_bytes_decode_legacy
                            time:   [2.2823 ms 2.2851 ms 2.2888 ms]
change: [-18.031% -17.935% -17.793%] (p = 0.00 < 0.05)
                            Performance has improved.

    roundtrip_string_decode_legacy
                            time:   [1.9071 ms 1.9086 ms 1.9104 ms]
change: [-21.660% -21.548% -21.397%] (p = 0.00 < 0.05)
                            Performance has improved.

### Motivation

  * This PR adds a feature that has not yet been specified.

### Tips for reviewer

I haven't yet actually hooked this up in the main prod codepath. Will do
that now and push it as a commit if it's easy, or a separate PR if it's
not.

### Checklist

- [ ] This PR has adequate test coverage / QA involvement has been duly
considered.
- [ ] This PR has an associated up-to-date [design
doc](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/README.md),
is a design doc
([template](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/00000000_template.md)),
or is sufficiently small to not require a design.
  <!-- Reference the design in the description. -->
- [ ] If this PR evolves [an existing `$T ⇔ Proto$T`
mapping](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/command-and-response-binary-encoding.md)
(possibly in a backwards-incompatible way), then it is tagged with a
`T-proto` label.
- [ ] If this PR will require changes to cloud orchestration or tests,
there is a companion cloud PR to account for those changes that is
tagged with the release-blocker label
([example](https://github.com/MaterializeInc/cloud/pull/5021)).
<!-- Ask in #team-cloud on Slack if you need help preparing the cloud
PR. -->
- [ ] This PR includes the following [user-facing behavior
changes](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/guide-changes.md#what-changes-require-a-release-note):
- <!-- Add release notes here or explicitly state that there are no
user-facing behavior changes. -->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant