ARROW-11045: [Rust] Fix performance issues of allocator #9044
Conversation
@nevi-me, there is a failing test in parquet that I am unable to fix. I think that we may be creating bitmap buffers that are too large. My current hypothesis is that the buffer
Codecov Report

```
@@            Coverage Diff             @@
##           master    #9044      +/-   ##
==========================================
- Coverage   82.57%   82.56%   -0.01%
==========================================
  Files         203      203
  Lines       50036    49875     -161
==========================================
- Hits        41316    41180     -136
+ Misses       8720     8695      -25
```

Continue to review full report at Codecov.
```rust
if capacity > self.capacity {
    let new_capacity = bit_util::round_upto_multiple_of_64(capacity);
    let new_capacity = cmp::max(new_capacity, self.capacity * 2);

pub fn reserve(&mut self, additional: usize) {
```
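For context, a minimal standalone sketch of the growth heuristic visible in the excerpt above (not the exact Arrow implementation): the requested capacity is rounded up to a multiple of 64 bytes, and the buffer grows by at least a factor of two.

```rust
// Sketch of the growth policy: round the request up to a multiple of 64
// and never grow by less than 2x the current capacity.
fn round_upto_multiple_of_64(n: usize) -> usize {
    (n + 63) & !63
}

fn grown_capacity(current_capacity: usize, requested: usize) -> usize {
    let new_capacity = round_upto_multiple_of_64(requested);
    std::cmp::max(new_capacity, current_capacity * 2)
}

fn main() {
    assert_eq!(grown_capacity(0, 1), 64);     // first allocation is one 64-byte chunk
    assert_eq!(grown_capacity(128, 129), 256); // doubling wins over the rounded request
}
```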
Is there any reason to change the API in this PR? I don't see how it could be related to the performance issue.
Also, it would be better to split it out to make the PR easier to review. We also need to update the docs.
I agree that it makes it more difficult. The signature is related to performance, though: it was difficult to reason with `reserve(new_len)` when `Vec::reserve` and `RawVec::reserve` use `additional` (and I was basing this PR on those). This PR fixes a bug in one `Builder` that was calling `reserve(1)` in `append` when it should have been calling `reserve(self.len + 1)`. I also spent 4 hours tracking down a bug caused by this difference (I was using `reserve(additional)` when its signature was `reserve(new_len)`).
I can try to split that out into another PR, though. IMO this PR will still need to be based on that PR.
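To make the two conventions concrete, here is a hedged toy sketch (not the actual `MutableBuffer` code) of how an `append` must call each variant:

```rust
// Toy builder illustrating the two calling conventions for `reserve`.
struct ToyBuilder {
    data: Vec<u8>,
}

impl ToyBuilder {
    // Old convention: the argument is the desired *total* length.
    fn reserve_new_len(&mut self, new_len: usize) {
        if new_len > self.data.capacity() {
            self.data.reserve(new_len - self.data.len());
        }
    }

    // New convention, matching `Vec::reserve`: the argument is the number
    // of *additional* bytes beyond the current length.
    fn reserve_additional(&mut self, additional: usize) {
        self.data.reserve(additional);
    }

    fn append(&mut self, byte: u8) {
        // Under the old convention this must be `len + 1`; passing `1` here
        // (the bug described above) only guarantees a total capacity of 1.
        self.reserve_new_len(self.data.len() + 1);
        // Under the new convention the equivalent call is simply:
        // self.reserve_additional(1);
        self.data.push(byte);
    }
}

fn main() {
    let mut b = ToyBuilder { data: Vec::new() };
    b.append(1);
    b.reserve_additional(8);
    assert!(b.data.capacity() >= b.data.len() + 8);
}
```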
I think we originally referred to the C++ implementation for the signature design, to keep it consistent. Since the `buffer` module is public, we'll need to be careful when making any breaking change to the API.
Yep, and I think it was a good decision back then, when the goal was to port C++ code to Rust. I understand your concern.
However, IMO we should not keep using C++-motivated APIs just because it was historically like that. `additional` is the standard in virtually every container in Rust (`HashMap`, `Vec`, `tokio::bytes::Bytes`). IMO this change will need to be made at some point, and delaying it will only cause more pain in the future.
In this particular instance, it was motivated by an already existing confusion in our own code base, which IMO is evidence that Rust developers expect `additional`. Most people that use this API are probably passing `additional` to it already.
Regardless, we are bumping the major version, so there is already no expectation of backward compatibility as far as cargo and semver are concerned. My goal here is to make `MutableBuffer` as close as possible to `Vec` so that people can use it without even thinking that they are using a special allocation.
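For reference, a small example of the `reserve(additional)` convention in Rust's standard containers (std behavior only, no Arrow code):

```rust
use std::collections::HashMap;

fn main() {
    // In std, `reserve` always takes the number of *additional* elements,
    // never a target length.
    let mut v: Vec<u8> = vec![1, 2, 3];
    v.reserve(10); // room for at least 10 more elements beyond v.len()
    assert!(v.capacity() >= v.len() + 10);

    let mut m: HashMap<u32, u32> = HashMap::new();
    m.insert(1, 1);
    m.reserve(10); // room for at least 10 more entries
}
```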
Thanks, if we're going to bump the major version then I'm fine with the compatibility issue. I don't have a strong opinion on this, so either is fine with me. IMO good documentation should suffice for developers to understand what the method does.
```diff
@@ -774,16 +756,16 @@ impl MutableBuffer {
     /// `new_len` will be zeroed out.
     ///
     /// If `new_len` is less than `len`, the buffer will be truncated.
-    pub fn resize(&mut self, new_len: usize) {
+    pub fn resize(&mut self, new_len: usize, value: u8) {
```
We could have another method with `value = 0`, since this seems to be the dominant use case (it might enable future optimizations for the special case as well).
I based this signature on `Vec::resize`. I.e. I am assuming here that we use `MutableBuffer` as a special `Vec<u8>` that uses our custom allocator, and offer users an API that they are used to (`Vec`).
The main use case for `value: 255u8` is when we want to allocate and grow a set bitmap to later unset bits. We did not offer that before, which is the reason `MutableArrayData` does not do that (even though it is most likely more efficient for many use cases).
Note that even though Rust's `std::RawVec` offers a `with_capacity_zeroed`, `std::Vec` does not expose that as a public API: `resize` and `resize_with` are the public APIs to resize a vec, and neither of them uses `std::alloc::alloc_zeroed`, i.e. they also just call `std::ptr::write_bytes`.
I agree with you that `resize(len, 0)` can potentially be optimized, when the mutable buffer has zero capacity, by calling `std::alloc::alloc_zeroed` instead of `std::ptr::write_bytes`.
This PR refactors `MutableBuffer::extend_from_slice` to remove the need to use `to_byte_slice` on every call, thereby removing a level of indirection that prevents the compiler from optimizing out some code. This is the second performance improvement originally presented in #8796 and, together with #9027, brings the performance of `MutableBuffer` to the same level as `Vec<u8>`, in particular for building buffers on the fly.

Basically, when converting to a byte slice `&[u8]`, the compiler loses the type size information and thus needs to perform extra checks and can't just optimize out the code.

This PR adopts the same API as `Vec<T>::extend_from_slice`, but since our buffers are in `u8` (i.e. a la `Vec<u8>`), I made the signature

```
pub fn extend_from_slice<T: ToByteSlice>(&mut self, items: &[T])
pub fn push<T: ToByteSlice>(&mut self, item: &T)
```

i.e. it consumes something that can be converted to a byte slice, but internally makes the conversion to bytes (as `to_byte_slice` was doing).

Credits for the root cause analysis that led to this PR go to @Dandandan, [originally fielded here](#9016 (comment)).

> [...] current conversion to a byte slice may add some overhead? - @Dandandan

Benches (against master, so, both this PR and #9044):

```
Switched to branch 'perf_buffer'
Your branch and 'origin/perf_buffer' have diverged,
and have 6 and 1 different commits each, respectively.
  (use "git pull" to merge the remote branch into yours)
   Compiling arrow v3.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow)
    Finished bench [optimized] target(s) in 1m 00s
     Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/buffer_create-915da5f1abaf0471
Gnuplot not found, using plotters backend
mutable                 time:   [463.11 us 463.57 us 464.07 us]
                        change: [-19.508% -18.571% -17.526%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) high mild
  9 (9.00%) high severe

mutable prepared        time:   [527.84 us 528.46 us 529.14 us]
                        change: [-13.356% -12.522% -11.790%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  5 (5.00%) high mild
  7 (7.00%) high severe

Benchmarking from_slice: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.1s, enable flat sampling, or reduce sample count to 60.
from_slice              time:   [1.1968 ms 1.1979 ms 1.1991 ms]
                        change: [-6.8697% -6.2029% -5.5812%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe

from_slice prepared     time:   [917.49 us 918.89 us 920.60 us]
                        change: [-6.5111% -5.9102% -5.3038%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe
```

Closes #9076 from jorgecarleitao/perf_buffer

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
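A hedged, self-contained sketch of the technique in the commit message above (a toy buffer, not Arrow's `MutableBuffer`; the `Copy` bound stands in for the crate's `ToByteSlice`): the generic method receives a typed slice, so the byte length is computed from `size_of::<T>()` inside the method rather than being erased by a cast at every call site.

```rust
use std::mem::size_of;

struct ToyByteBuffer {
    data: Vec<u8>,
}

impl ToyByteBuffer {
    // Accept any plain-old-data slice and convert it to bytes internally,
    // keeping the element size visible to the compiler at the call site.
    fn extend_from_slice<T: Copy>(&mut self, items: &[T]) {
        let byte_len = items.len() * size_of::<T>();
        let bytes =
            unsafe { std::slice::from_raw_parts(items.as_ptr() as *const u8, byte_len) };
        self.data.extend_from_slice(bytes);
    }
}

fn main() {
    let mut buffer = ToyByteBuffer { data: Vec::new() };
    // Typed slices go in directly; no `to_byte_slice` at the call site.
    buffer.extend_from_slice(&[1u32, 2, 3]);
    buffer.extend_from_slice(&[4.0f64, 5.0]);
    assert_eq!(buffer.data.len(), 3 * 4 + 2 * 8);
}
```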
@jorgecarleitao I think this PR broke master:
Likely. It was kind of expected, as this PR made some backward-incompatible changes. I was trying to merge it first to avoid the breakage, but I guess I was not fast enough for the speed at which PRs are merged into master after the green light on the mailing list :P
This PR addresses a performance issue in how we allocate and reallocate the `MutableBuffer`.

Problem

See #9032

This PR

This PR changes `MutableBuffer::reserve` to call `std::alloc::alloc` instead of `std::alloc::alloc_zeroed`, which improves performance when building buffers with unknown sizes (such as strings and nested types). This required changing some calls of `MutableBuffer` that assumed a zero-initialized buffer even when `reserve` was used.

It also changed `reserve`'s signature to `reserve(additional)` instead of `reserve(new_len)`, which is the notation used throughout Rust's std library.

The `take` is slower because it is using `Buffer::from`, which is always worse than using a `MutableBuffer`.
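A hedged sketch of the core change described above (illustrative names, not Arrow's internals): `reserve` only needs uninitialized memory, because the bytes between `len` and the new capacity are always written before being read, so the eager zeroing can be dropped.

```rust
use std::alloc::{alloc, alloc_zeroed, Layout};

// Before: reserved memory was zero-initialized up front, paying a memset
// for bytes that builders overwrite anyway.
unsafe fn allocate_zeroed(capacity: usize) -> *mut u8 {
    let layout = Layout::from_size_align(capacity, 64).expect("invalid layout");
    alloc_zeroed(layout)
}

// After: reserved memory is left uninitialized; callers must write the
// bytes in `len..new_len` before reading them (which builders already do).
unsafe fn allocate_uninitialized(capacity: usize) -> *mut u8 {
    let layout = Layout::from_size_align(capacity, 64).expect("invalid layout");
    alloc(layout)
}
```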