Optimize drain_filter #44355

Xaeroxe · 2017-09-05T21:38:59Z

This PR cuts out two copies from each iteration of drain_filter by exchanging the swap operation for a copy_nonoverlapping function call instead. Since the data being swapped is not needed anymore we can just overwrite it instead.

rust-highfive · 2017-09-05T21:39:06Z

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @aturon (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

sfackler · 2017-09-05T21:48:41Z

Do you have any benchmark numbers to indicate what kind of performance impact this has?

Xaeroxe · 2017-09-05T21:49:12Z

No, but I'm happy to make some.

Xaeroxe · 2017-09-05T22:09:50Z

Note: A similar optimization could have been made to Vec::retain but at this point that would technically be a breaking change, as it would change the execution order from

Execute predicate on all elements -> Drop elements it was false for

into

Execute predicate on element -> Drop element immediately if false -> Execute predicate on next element -> ...

sfackler · 2017-09-05T22:14:21Z

Drop order isn't something I'd be particularly worried about changing IMO.

vitalyd · 2017-09-05T22:18:56Z

@sfackler, wouldn't that be visible to users though if they catch panics? Suppose the predicate panics on the last element. Or is that not a typical concern for things like this? It's probably unspecified how this behaves, but I think changing behavior nevertheless means the perf increase has to be more significant.

Xaeroxe · 2017-09-05T22:20:32Z

It could technically be a breaking change if somehow the predicate had different behavior based on whether or not elements in the vector had been dropped yet or not. I'm having a hard time thinking of a use case for such a construct though, not to mention it's kind of bad design in the first place as it relies on side effects that aren't guaranteed by any written documentation.

sfackler · 2017-09-05T23:11:51Z

@vitalyd "breaking change" is explicitly smaller in scope than "any change that is physically possible to observe". Otherwise we couldn't do reasonable seeming things like changing the size of a type or adding more information to a Debug implementation. With behavior changes like this, it generally depends on if anyone would "reasonably" be depending on it, which I don't think is the case here.

sfackler · 2017-09-05T23:12:12Z

(Obligatory https://xkcd.com/1172/)

vitalyd · 2017-09-05T23:25:43Z

@sfackler, FWIW I agree - it's a healthy approach that will make Rust better over the long haul (for an example of a language/ecosystem that's somewhat petrified of making changes, we can look at Java :)).

At the same time, it's important to not alienate/frustrate users with "willy nilly" changes that break something but don't add any real value otherwise. Not saying that's proposed here, just a general comment. So a change that may break code at runtime, such as different drop order in a rare (and possibly untested) scenario, just means the benefit bar has to go up for that change.

Anyway, sorry for the distraction - carry on :).

sfackler · 2017-09-05T23:32:35Z

Yeah for sure; it's always a bit of a judgement call. It's one of the reasons we have things like Crater and nightlies that let us look at the actual downstream impact to see if it's worse than we'd assumed.

vitalyd · 2017-09-06T01:17:42Z

Yeah, I think crater/cargobomb are awesome tools for this space in general. But those are mostly (entirely?) about detecting compile-time breakage?

Xaeroxe · 2017-09-06T01:46:33Z

Well at least we have the nightly channel, so that way users can report breakage they find themselves.

Xaeroxe · 2017-09-06T05:15:56Z

Initial benchmarks weren't very promising, so I updated the pull request to grab the initial pointers more like the swap function. I've got a build and benchmark operation running overnight on my PC and I'll report back in a little less than 24 hours.

Xaeroxe · 2017-09-06T05:50:30Z

Ok, so after the most recent commit these benchmarks look slightly better:

# Current nightly as of 2017-09-05
Average time for removing one at beginning: Duration { secs: 0, nanos: 62956957 }
Average time for removing one at ending:    Duration { secs: 0, nanos: 42066468 }
Average time for removing all elements:     Duration { secs: 0, nanos: 66166225 }
Average time for removing no elements:      Duration { secs: 0, nanos: 41859578 }
info: default toolchain set to 'optimization'
# This PR
Average time for removing one at beginning: Duration { secs: 0, nanos: 52273292 }
Average time for removing one at ending:    Duration { secs: 0, nanos: 41589495 }
Average time for removing all elements:     Duration { secs: 0, nanos: 65309229 }
Average time for removing no elements:      Duration { secs: 0, nanos: 42410923 }

The time I would expect this to impact the most is Average time for removing one at beginning, which was reflected here. This is the program used to get these benchmarks:

#![feature(drain_filter)]

use std::time::{Duration, Instant};

fn main() {
    let mut total = Duration::new(0, 0);
    for _ in 0..100 {
        let mut v = Vec::new();
        for i in 0..9999999 {
            v.push(i);
        }
        let start = Instant::now();
        assert_eq!(v.drain_filter(|e| *e == 0).count(), 1);
        total += start.elapsed() / 1000;
    }
    println!("Average time for removing one at beginning: {:?}", total);
    let mut total = Duration::new(0, 0);
    for _ in 0..100 {
        let mut v = Vec::new();
        for i in 0..9999999 {
            v.push(i);
        }
        let start = Instant::now();
        assert_eq!(v.drain_filter(|e| *e == 9999998).count(), 1);
        total += start.elapsed() / 1000;
    }
    println!("Average time for removing one at ending:    {:?}", total);
    let mut total = Duration::new(0, 0);
    for _ in 0..100 {
        let mut v = Vec::new();
        for _ in 0..9999999 {
            v.push(0);
        }
        let start = Instant::now();
        assert_eq!(v.drain_filter(|e| *e == 0).count(), 9999999);
        total += start.elapsed() / 1000;
    }
    println!("Average time for removing all elements:     {:?}", total);
    let mut total = Duration::new(0, 0);
    for _ in 0..100 {
        let mut v = Vec::new();
        for _ in 0..9999999 {
            v.push(0);
        }
        let start = Instant::now();
        assert_eq!(v.drain_filter(|e| *e != 0).count(), 0);
        total += start.elapsed() / 1000;
    }
    println!("Average time for removing no elements:      {:?}", total);
}

Overall assessment: It's a very slight improvement, not really worth investigating for Vec::retain but eh, it's already written for this so there's no harm in merging it.

Xaeroxe · 2017-09-06T06:17:23Z

I suppose that for the scenario of removing a single element at the front of a very long vector this did boost performance by about 20%. So maybe it would be worth it for Vec::retain?

carols10cents · 2017-09-11T15:16:08Z

review ping @aturon !

bluss · 2017-09-11T19:04:59Z

src/liballoc/vec.rs

I feel like this requires a comment to explain why this is safe (for example in case of unwinding). The crutch is quite literally not local, which is that the Vec had its length set to 0.

Do the bounds checks optimize out here? Is there anything to gain from writing this entirely in raw pointers instead of using indices? We've seen before that the compiler doesn't optimize out indices and lengths into slices (or into slice iterators, not relevant here) entirely fluently.

I'll add that comment. The first implementation did use raw pointers but they did not perform favorably in benchmarks. I believe this was because pointer::offset required me to convert the usize to an isize. In order to prevent this I could just store isize values and instead make all usages of this use raw pointers. I'll try that and see how it goes in benchmarking.

bluss · 2017-09-11T19:13:00Z

This is a bit anti-optimization, but I really want us to make the best effort on unwinding and not just drop the ball and leak things. This issue is pre-existing and not due to this PR!

Since this is sort of a successor to retain, I see the unwinding issue to be of higher importance here than with the regular drain; here we have a closure that the user might cause a panic in.

bluss · 2017-09-11T19:29:34Z

Discussing with @frankmcsherry he pointed out that it does more or less handle panic ok -- if futher calls to the closure don't panic. The concerning thing is that if the filter closure panics, it will still be attempted to be called again in DrainFilter's drop.

frankmcsherry · 2017-09-11T19:35:52Z

Some thoughts (from convo with @bluss):

It looks like DrainFilter's drop method drains the iterator. This seems like it might be prone to finding more panics, leading to an abort. Might it be better to check if panicking (std::thread::panicking) and just drop the remaining region in that case. This would deviate from the intended behavior of drain_filter, but I'm not sure what the intent of drop in a panic is (e.g. get out as cleanly as possible; don't worry about maintaining application level invariants?).
I'm not sure if you have one at hand, but the performance could plausibly be more significant on a multicore set-up. It can be hard to saturate your memory bandwidth with just one core, and one of the big changes is removing the need to write back the swapped memory, which eases memory bandwidth.
I can't tell if the "remove first" case got faster because it is better, or because LLVM can see that this happens only once and optimizes lots of small memory moves into larger ones. If you really wanted to dork around with the performance, you could wait until the predicate succeeds to move, allowing you to shift blocks of ignored memory rather than single elements (but preventing copy_nonoverlapping).

Xaeroxe · 2017-09-12T04:27:59Z

So I added code that just used ptr::offset with isize values and despite this removing the bounds check the bench mark actually performed significantly worst than even the implementation that uses swap, so I removed that commit from this PR. I'll play around with @frankmcsherry's suggestion number 3 tomorrow. These benchmarks are being executed on the following CPU:

CPU:       Quad core Intel Core i7-6700HQ (-HT-MCP-) cache: 6144 KB
           flags: (lm nx sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx) bmips: 20735
           clock speeds: min/max: 800/3500 MHz 1: 844 MHz 2: 825 MHz 3: 799 MHz 4: 862 MHz 5: 800 MHz
6: 800 MHz 7: 800 MHz 8: 822 MHz

Xaeroxe · 2017-09-12T04:29:57Z

Side note: I'm not really sure what correct behavior during a panic is. If someone can just tell me what should be dropped when then I'm happy to write it.

alexcrichton · 2017-09-14T22:51:34Z

ping @aturon for a review!

Also @Xaeroxe want to add the comment that @bluss mentioned?

carols10cents · 2017-09-18T16:02:16Z

ping @aturon @rust-lang/libs! this needs a review please!

Xaeroxe · 2017-09-18T16:12:11Z

Due to being unable to use nonoverlapping suggestion 3 has significant performance loss. I'll add the comment @bluss mentioned now.

alexcrichton · 2017-09-19T16:05:42Z

Ok this PR itself looks good to me, there's a benchmark showing a nice improvement in some cases, and a concern about panics/drops isn't exacerbated as it's not changing the behavior from what I can tell. Sounds like dealing with panics/drops is best done in a separate issue/PR, so I'm going to...

@bors: r+

Thanks for being patient @Xaeroxe!

bors · 2017-09-19T16:05:43Z

📌 Commit 10384ab has been approved by alexcrichton

bors · 2017-09-19T16:07:35Z

⌛ Testing commit 10384ab with merge 8bc544a...

Optimize drain_filter This PR cuts out two copies from each iteration of `drain_filter` by exchanging the swap operation for a copy_nonoverlapping function call instead. Since the data being swapped is not needed anymore we can just overwrite it instead.

bors · 2017-09-19T22:43:55Z

💔 Test failed - status-travis

Xaeroxe · 2017-09-19T22:46:23Z

Looks like an unrelated Travis failure? Retry?

alexcrichton · 2017-09-20T01:26:39Z

@bors: retry

yeah weird travis scheduling...

bors · 2017-09-20T06:29:48Z

⌛ Testing commit 10384ab with merge 94a82ad...

Optimize drain_filter This PR cuts out two copies from each iteration of `drain_filter` by exchanging the swap operation for a copy_nonoverlapping function call instead. Since the data being swapped is not needed anymore we can just overwrite it instead.

kennytm · 2017-09-20T10:00:14Z

dist-mips-linux (3 restarts) and x86_64-gnu-incremental (2 restarts) hit travis-ci/travis-ci#8429, expect ~~one more hour~~ before merging 🤦‍.

Edit: 🤦‍

bors · 2017-09-20T10:04:27Z

☀️ Test successful - status-appveyor, status-travis
Approved by: alexcrichton
Pushing 94a82ad to master...

arthurprs · 2017-09-27T12:04:42Z

Can we also land this for retain?

Xaeroxe · 2018-02-13T16:02:34Z

@arthurprs this has just been r+ed for retain in #48065

Apply optimization from rust-lang#44355 to retain As discussed in rust-lang#44355 this PR applies a similar optimization to `Vec::retain`. For `drain_filter`, a very similar function, this improved performance by up to 20%.

Optimize Vec::retain Use `copy_non_overlapping` instead of `swap` to reduce memory writes, like what we've done in rust-lang#44355 and `String::retain`. rust-lang#48065 already tried to do this optimization but it is reverted in rust-lang#67300 due to bad codegen of `DrainFilter::drop`. This PR re-implement the drop-then-move approach. I did a [benchmark](https://gist.github.com/oxalica/3360eec9376f22533fcecff02798b698) on small-no-drop, small-need-drop, large-no-drop elements with different predicate functions. It turns out that the new implementation is >20% faster in average for almost all cases. Only 2/24 cases are slower by 3% and 5%. See the link above for more detail. I think regression in may-panic cases is due to drop-guard preventing some optimization. If it's permitted to leak elements when predicate function of element's `drop` panic, the new implementation should be almost always faster than current one. I'm not sure if we should leak on panic, since there is indeed an issue (rust-lang#52267) complains about it before.

Xaeroxe added 2 commits September 5, 2017 15:32

Remove unneeded copy

0e2427c

Fix ownership issues

4817754

rust-highfive assigned aturon Sep 5, 2017

aidanhs added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Sep 7, 2017

bluss reviewed Sep 11, 2017

View reviewed changes

Try using ref to raw conversion

4de0cf1

Xaeroxe force-pushed the optimize_drain_filter branch from f62d618 to 4de0cf1 Compare September 12, 2017 04:24

Add requested comment

10384ab

bors mentioned this pull request Sep 19, 2017

rework the README.md for rustc and add other readmes #44505

Merged

bors merged commit 10384ab into rust-lang:master Sep 20, 2017

Xaeroxe deleted the optimize_drain_filter branch February 7, 2018 19:44

Xaeroxe mentioned this pull request Feb 8, 2018

Apply optimization from #44355 to retain #48065

Merged

Xaeroxe added a commit to Xaeroxe/rust that referenced this pull request Feb 8, 2018

Apply optimization from rust-lang#44355 to retain

2a4c018

hanna-kruppe mentioned this pull request Dec 14, 2019

Restore original implementation of Vec::retain #67300

Merged

oxalica mentioned this pull request Jan 17, 2021

Optimize Vec::retain #81126

Merged

Optimize drain_filter #44355

Optimize drain_filter #44355

Uh oh!

Conversation

Xaeroxe commented Sep 5, 2017

Uh oh!

rust-highfive commented Sep 5, 2017

Uh oh!

sfackler commented Sep 5, 2017

Uh oh!

Xaeroxe commented Sep 5, 2017

Uh oh!

Xaeroxe commented Sep 5, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sfackler commented Sep 5, 2017

Uh oh!

vitalyd commented Sep 5, 2017

Uh oh!

Xaeroxe commented Sep 5, 2017

Uh oh!

sfackler commented Sep 5, 2017

Uh oh!

sfackler commented Sep 5, 2017

Uh oh!

vitalyd commented Sep 5, 2017

Uh oh!

sfackler commented Sep 5, 2017

Uh oh!

vitalyd commented Sep 6, 2017

Uh oh!

Xaeroxe commented Sep 6, 2017

Uh oh!

Xaeroxe commented Sep 6, 2017

Uh oh!

Xaeroxe commented Sep 6, 2017

Uh oh!

Xaeroxe commented Sep 6, 2017

Uh oh!

carols10cents commented Sep 11, 2017

Uh oh!

bluss Sep 11, 2017

Choose a reason for hiding this comment

Uh oh!

Xaeroxe Sep 11, 2017

Choose a reason for hiding this comment

Uh oh!

bluss commented Sep 11, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bluss commented Sep 11, 2017

Uh oh!

frankmcsherry commented Sep 11, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Xaeroxe commented Sep 12, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Xaeroxe commented Sep 12, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alexcrichton commented Sep 14, 2017

Uh oh!

carols10cents commented Sep 18, 2017

Uh oh!

Xaeroxe commented Sep 18, 2017

Uh oh!

alexcrichton commented Sep 19, 2017

Uh oh!

bors commented Sep 19, 2017

Uh oh!

bors commented Sep 19, 2017

Uh oh!

bors commented Sep 19, 2017

Uh oh!

Xaeroxe commented Sep 19, 2017

Uh oh!

alexcrichton commented Sep 20, 2017

Uh oh!

bors commented Sep 20, 2017

Xaeroxe commented Sep 5, 2017 •

edited

Loading

bluss commented Sep 11, 2017 •

edited

Loading

frankmcsherry commented Sep 11, 2017 •

edited

Loading

Xaeroxe commented Sep 12, 2017 •

edited

Loading

Xaeroxe commented Sep 12, 2017 •

edited

Loading

kennytm commented Sep 20, 2017 •

edited

Loading