Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize behavior of vec.split_off(0) (take all) #76682

Merged
merged 1 commit into from
Sep 15, 2020

Conversation

richkadel
Copy link
Contributor

@richkadel richkadel commented Sep 13, 2020

Optimization improvement to split_off() so the performance meets the
intuitively expected behavior when at == 0, avoiding the current behavior
of copying the entire vector.

The change honors documented behavior that the original vector's
"previous capacity unchanged".

This improvement better supports the pattern for building and flushing a
buffer of elements, such as the following:

    let mut vec = Vec::new();
    loop {
        vec.push(something);
        if condition_is_met {
            process(vec.split_off(0));
        }
    }

Option wrapping is the first alternative I thought of, but is much
less obvious and more verbose:

    let mut capacity = 1;
    let mut vec: Option<Vec<Stuff>> = None;
    loop {
        vec.get_or_insert_with(|| Vec::with_capacity(capacity)).push(something);
        if condition_is_met {
            capacity = vec.capacity();
            process(vec.take().unwrap());
        }
    }

Directly using mem::replace() (instead of callingsplit_off()) could work,
but mem::replace() is a more advanced tool for Rust developers, and in
this case, I believe developers would assume the standard library should
be sufficient for the purpose described here.

The benefit of the approach to this change is it does not change the
existing API contract, but improves the peformance of split_off(0) for
Vec, String (which delegates split_off() to Vec), and any other
existing use cases.

This change adds tests to validate the behavior of split_off() with
regard to capacity, as originally documented, and confirm that behavior
still holds, when at == 0.

The change is an implementation detail, and does not require a
documentation change, but documenting the new behavior as part of its
API contract may benefit future users.

(Let me know if I should make that documentation update.)

Note, for future consideration:

I think it would be helpful to introduce an additional method to Vec
(if not also to String):

    pub fn take_all(&mut self) -> Self {
        self.split_off(0)
    }

This would make it more clear how Vec supports the pattern, and make
it easier to find, since the behavior is similar to other take()
methods in the Rust standard library.

r? @wesleywiser
FYI: @tmandry

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Sep 13, 2020
@jyn514 jyn514 added I-slow Issue: Problems and improvements with respect to performance of generated code. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Sep 13, 2020
Optimization improvement to `split_off()` so the performance meets the
intuitively expected behavior when `at == 0`, avoiding the current
behavior of copying the entire vector.

The change honors documented behavior that the method leaves the
original vector's "previous capacity unchanged".

This improvement better supports the pattern for building and flushing a
buffer of elements, such as the following:

```rust
    let mut vec = Vec::new();
    loop {
        vec.push(something);
        if condition_is_met {
            process(vec.split_off(0));
        }
    }
```

`Option` wrapping is the first alternative I thought of, but is much
less obvious and more verbose:

```rust
    let mut capacity = 1;
    let mut vec: Option<Vec<Stuff>> = None;
    loop {
        vec.get_or_insert_with(|| Vec::with_capacity(capacity)).push(something);
        if condition_is_met {
            capacity = vec.capacity();
            process(vec.take().unwrap());
        }
    }
```

Directly applying `mem::replace()` could work, but `mem::` functions are
typically a last resort, when a developer is actively seeking better
performance than the standard library provides, for example.

The benefit of the approach to this change is it does not change the
existing API contract, but improves the peformance of `split_off(0)` for
`Vec`, `String` (which delegates `split_off()` to `Vec`), and any other
existing use cases.

This change adds tests to validate the behavior of `split_off()` with
regard to capacity, as originally documented, and confirm that behavior
still holds, when `at == 0`.

The change is an implementation detail, and does not require a
documentation change, but documenting the new behavior as part of its
API contract may benefit future users.

(Let me know if I should make that documentation update.)

Note, for future consideration:

I think it would be helpful to introduce an additional method to `Vec`
(if not also to `String`):

```
    pub fn take_all(&mut self) -> Self {
        self.split_off(0)
    }
```

This would make it more clear how `Vec` supports the pattern, and make
it easier to find, since the behavior is similar to other `take()`
methods in the Rust standard library.
@Mark-Simulacrum
Copy link
Member

Do we have some numbers on how much this helps split_off with a constant argument of zero? (That's the best case for this optimization). I'd also be interested in benchmarks showing how much of a regression this is for non-constant non-zero argument split off, i.e., how much the branch costs (I imagine not much, given that we're almost guaranteed to call into the allocator).

@richkadel
Copy link
Contributor Author

Do we have some numbers on how much this helps split_off with a constant argument of zero?

Here's an ad hoc benchmark test with results below, testing with rustc 1.47.0-nightly (e5e33ebd2 2020-08-11) (without my change). For comparison, I'm just calling mem::replace() as a direct alternative to split_off().

I could adjust this to capture the elapsed time only when calling the flush method, but I didn't do that here.

I also include, for comparison, the performance of ensuring the capacity is retained, vs. allocating a new Vec with default capacity. (If it made a big difference, a future method could be implemented that way... but I don't think it helped or hurt that much either way.)

Since this uses an unmodified rustc and library, I can't test the branch cost, but I'm not convinced there's a lot of value in testing that. (You kind of make that argument yourself. The cost of calling Vec::with_capacity() with anything greater than zero, and then calling copy_nonoverlapping() has to be way more than the cost of adding a branch if zero. Right?

time_split_off_0.rs
use std::time::Instant;

fn main() {
    benchmark_suite(/*bigger_elems=*/1.0, /*buffer_more_before_flush=*/1.0, /*more_elems=*/1.0);
    for growth in 2..=5 {
        let factor = growth as f64;
        benchmark_suite(/*bigger_elems=*/1.0, /*buffer_more_before_flush=*/factor, /*more_elems=*/1.0);
        benchmark_suite(/*bigger_elems=*/1.0, /*buffer_more_before_flush=*/1.0, /*more_elems=*/factor);
        benchmark_suite(/*bigger_elems=*/factor, /*buffer_more_before_flush=*/1.0, /*more_elems=*/1.0);
    }
}

fn benchmark_suite(bigger_elems: f64, buffer_more_before_flush: f64, more_elems: f64, ) {
    let mut elem_size = (10.0 * bigger_elems) as usize;
    let mut max_buffer = (1000.0 * buffer_more_before_flush) as usize;
    let mut loops = max_buffer * (100.0 * more_elems) as usize;

    for _ in 1..=5 {
        println!("========================== BENCHMARK TEST ==========================");
        println!(
            "pushing {} elements of size {} bytes into a Vec buffer flushed at max {} elements",
            loops,
            elem_size,
            max_buffer,
        );
        println!("--------------------------------------------------------------------");
        benchmark(elem_size, loops, max_buffer, false, false);
        benchmark(elem_size, loops, max_buffer, /*replace_with_capacity=*/true, false);
        benchmark(elem_size, loops, max_buffer, false, /*replace_with_new=*/true);
        elem_size *= 2;
        loops *= 2;
        max_buffer *= 2;
    }
}

fn benchmark(elem_size: usize, loops: usize, max_buffer: usize, replace_with_capacity: bool, replace_with_new: bool) {
    let now = Instant::now();
    let mut buffer = Vec::new();
    let mut flushed = 0;
    for _ in 0..loops {
        if buffer.len() > max_buffer {
            if replace_with_capacity {
                let capacity = buffer.capacity();
                flush(std::mem::replace(&mut buffer, Vec::with_capacity(capacity)));
            } else if replace_with_new {
                flush(std::mem::replace(&mut buffer, Vec::new()));
            } else {
                flush(buffer.split_off(0));
            }
            flushed += 1;
        }
        buffer.push(vec![b'\0'; elem_size]);
    }
    if buffer.len() > 0 {
        flush(buffer);
        flushed += 1;
    }

    let elapsed_usec = now.elapsed().as_micros();
    println!(
        "{} microseconds {}, called flush {} times",
        elapsed_usec,
        if replace_with_capacity {
            "replaced with `with_capacity()`"
        } else if replace_with_new {
            "replaced with `new()`"
        } else {
            "WITH EXISTING split_off()"
        },
        flushed,
    );
}

fn flush(buffer: Vec<Vec<u8>>) {
    let _ = buffer;
    // drops and frees this `Vec`
}

results.txt

@richkadel
Copy link
Contributor Author

a few more results, this time including values for "microseconds flushing and dropping". This is the total of elapsed time calling only flush() (with the various expression alternatives for replacing the buffer). Since flush() will drop the original buffer, the cost of the free is included in this total.

Looking at either result sets, you can see the optimization has some benefit in almost all cases. (Since I'm running this on Linux and the results are in clock time, there are some nondeterministic factors, but on average the results are better with the optimization.)

benchmark2.txt

@richkadel
Copy link
Contributor Author

Sorry, one more. I thought I should try timing ONLY the method call, and it turns out the method call by itself is a much smaller percentage of the total time spent calling the flush() method (and dropping the old buffer). I had to change the timer resolution to nanoseconds.

The performance difference is more pronounced in this version.

So is the difference between allocating the replacement buffer Vec with with_capacity() vs. just with new() (which I originally suspected).

At first I didn't know if it would be worth changing the behavior (say, in a new, alternative method, like take_all()), but actually there may be a reason to consider this.

One might assume the use cases that would benefit from higher performance here probably use a similar capacity most of the time, re-using the same capacity seems to make sense.

But what if there is an outlier?

If a buffer Vec is re-used like this 1000s or 100,000s of times and 99.999% of the time the number of elements averages, say, 100, but one time the number was 5000 (just as an example), the capacity will never go below 5000 again! We would always allocate the large buffer, 50 times larger than needed, most of the time.

Maybe it would be good to have a take_all() method that allocates the replacement with Vec::new().

The downside there is, we'd lose an opportunity to replace a buffer with a vector with a specified default capacity.

vec.take_all_with_capacity() (I'm sure there's a better name) is one possibility.

Or another possibility is, take a different approach that's almost as concise, but more flexible. Essentially a slightly more convenient wrapper around mem::replace():

vec.replace(Vec::with_capacity(100)) returning the original vec.

benchmark3.txt

@Mark-Simulacrum
Copy link
Member

Okay, I think it would be helpful to have a summary of the benchmark results -- there's a lot of data in those text files :)

I am inclined to say that we should land this though, since it avoids the copy, and when compared to the allocation and copy the additional branch is essentially free. I thought about whether there could be room to copy less in other cases, but I think the answer is no -- even split_off(1), for example, still needs to copy the whole buffer because we can't shift our existing allocation.

@richkadel
Copy link
Contributor Author

Sounds great.

Here is a summary of the savings* after running the benchmarks:

[*By "savings" I mean, if the old way took 100 microseconds and the new way takes 90 microseconds, the savings is 10%.]

Savings are between 17.43% and 99.42%, average 66.99%, when tests flushed bigger buffers (flushed less often)
Savings are between 07.59% and 80.37%, average 41.50%, when tests created more elements
Savings are between 08.78% and 98.04%, average 48.68%, when tests created bigger elements

@richkadel
Copy link
Contributor Author

(The summary is just comparing the version using "with_capacity()" to the original implementation.)

@Mark-Simulacrum
Copy link
Member

@bors r+

@bors
Copy link
Contributor

bors commented Sep 14, 2020

📌 Commit 79aa9b1 has been approved by Mark-Simulacrum

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Sep 14, 2020
@bors
Copy link
Contributor

bors commented Sep 15, 2020

⌛ Testing commit 79aa9b1 with merge 6cae281...

@bors
Copy link
Contributor

bors commented Sep 15, 2020

☀️ Test successful - checks-actions, checks-azure
Approved by: Mark-Simulacrum
Pushing 6cae281 to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Sep 15, 2020
@bors bors merged commit 6cae281 into rust-lang:master Sep 15, 2020
@rustbot rustbot added this to the 1.48.0 milestone Sep 15, 2020
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this pull request Jan 26, 2024
Remove special-case handling of `vec.split_off(0)`

rust-lang#76682 added special handling to `Vec::split_off` for the case where `at == 0`. Instead of copying the vector's contents into a freshly-allocated vector and returning it, the special-case code steals the old vector's allocation, and replaces it with a new (empty) buffer with the same capacity.

That eliminates the need to copy the existing elements, but comes at a surprising cost, as seen in rust-lang#119913. The returned vector's capacity is no longer determined by the size of its contents (as would be expected for a freshly-allocated vector), and instead uses the full capacity of the old vector.

In cases where the capacity is large but the size is small, that results in a much larger capacity than would be expected from reading the documentation of `split_off`. This is especially bad when `split_off` is called in a loop (to recycle a buffer), and the returned vectors have a wide variety of lengths.

I believe it's better to remove the special-case code, and treat `at == 0` just like any other value:
- The current documentation states that `split_off` returns a “newly allocated vector”, which is not actually true in the current implementation when `at == 0`.
- If the value of `at` could be non-zero at runtime, then the caller has already agreed to the cost of a full memcpy of the taken elements in the general case. Avoiding that copy would be nice if it were close to free, but the different handling of capacity means that it is not.
- If the caller specifically wants to avoid copying in the case where `at == 0`, they can easily implement that behaviour themselves using `mem::replace`.

Fixes rust-lang#119913.
rust-timer added a commit to rust-lang-ci/rust that referenced this pull request Jan 26, 2024
Rollup merge of rust-lang#119917 - Zalathar:split-off, r=cuviper

Remove special-case handling of `vec.split_off(0)`

rust-lang#76682 added special handling to `Vec::split_off` for the case where `at == 0`. Instead of copying the vector's contents into a freshly-allocated vector and returning it, the special-case code steals the old vector's allocation, and replaces it with a new (empty) buffer with the same capacity.

That eliminates the need to copy the existing elements, but comes at a surprising cost, as seen in rust-lang#119913. The returned vector's capacity is no longer determined by the size of its contents (as would be expected for a freshly-allocated vector), and instead uses the full capacity of the old vector.

In cases where the capacity is large but the size is small, that results in a much larger capacity than would be expected from reading the documentation of `split_off`. This is especially bad when `split_off` is called in a loop (to recycle a buffer), and the returned vectors have a wide variety of lengths.

I believe it's better to remove the special-case code, and treat `at == 0` just like any other value:
- The current documentation states that `split_off` returns a “newly allocated vector”, which is not actually true in the current implementation when `at == 0`.
- If the value of `at` could be non-zero at runtime, then the caller has already agreed to the cost of a full memcpy of the taken elements in the general case. Avoiding that copy would be nice if it were close to free, but the different handling of capacity means that it is not.
- If the caller specifically wants to avoid copying in the case where `at == 0`, they can easily implement that behaviour themselves using `mem::replace`.

Fixes rust-lang#119913.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I-slow Issue: Problems and improvements with respect to performance of generated code. merged-by-bors This PR was explicitly merged by bors. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants