-
Notifications
You must be signed in to change notification settings - Fork 309
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement custom fold for ZipLongest #764
Implement custom fold for ZipLongest #764
Conversation
I'm not very familiar with benchmarks but I don't think we can expect it to be two times faster. PS: As you can see here, you needed to run EDIT (after response): Ah 1024 and 768 are just mirrored, ok. I did not expect such difference (22% vs 5%) for such change. |
Sorry - by "proposed benchmark", I mean the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much for this! I've left a comment on how to further optimize fold
. Could you also add a test for the specialization in tests/specializations.rs
? Something like this should do the trick:
quickcheck! {
fn intersperse(v: Vec<u8>) -> () {
test_specializations(&v.into_iter().intersperse(0));
}
+
+ fn zip_longest(a: Vec<u8>, b: Vec<u8>) -> () {
+ test_specializations(&a.into_iter().zip_longest(b))
+ }
}
src/zip_longest.rs
Outdated
#[inline] | ||
fn fold<B, F>(self, mut acc: B, mut f: F) -> B | ||
where | ||
Self: Sized, | ||
F: FnMut(B, Self::Item) -> B, | ||
{ | ||
let ZipLongest { mut a, mut b } = self; | ||
|
||
loop { | ||
match (a.next(), b.next()) { | ||
(Some(x), Some(y)) => acc = f(acc, EitherOrBoth::Both(x, y)), | ||
(Some(x), None) => { | ||
acc = f(acc, EitherOrBoth::Left(x)); | ||
// b is exhausted, so we can drain a. | ||
return a.fold(acc, |acc, x| f(acc, EitherOrBoth::Left(x))); | ||
} | ||
(None, Some(y)) => { | ||
acc = f(acc, EitherOrBoth::Right(y)); | ||
// a is exhausted, so we can drain b. | ||
return b.fold(acc, |acc, y| f(acc, EitherOrBoth::Right(y))); | ||
} | ||
(None, None) => return acc, // Both iterators are exhausted. | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When specializing fold
, try to avoid for
loops and calls to next()
on underlying iterators. If those underlying iterators also specialize fold
, then for
looping and calling next()
bypasses their optimizations.
It's possible to implement this method in terms of fold:
#[inline] | |
fn fold<B, F>(self, mut acc: B, mut f: F) -> B | |
where | |
Self: Sized, | |
F: FnMut(B, Self::Item) -> B, | |
{ | |
let ZipLongest { mut a, mut b } = self; | |
loop { | |
match (a.next(), b.next()) { | |
(Some(x), Some(y)) => acc = f(acc, EitherOrBoth::Both(x, y)), | |
(Some(x), None) => { | |
acc = f(acc, EitherOrBoth::Left(x)); | |
// b is exhausted, so we can drain a. | |
return a.fold(acc, |acc, x| f(acc, EitherOrBoth::Left(x))); | |
} | |
(None, Some(y)) => { | |
acc = f(acc, EitherOrBoth::Right(y)); | |
// a is exhausted, so we can drain b. | |
return b.fold(acc, |acc, y| f(acc, EitherOrBoth::Right(y))); | |
} | |
(None, None) => return acc, // Both iterators are exhausted. | |
} | |
} | |
} | |
#[inline] | |
fn fold<B, F>(self, mut init: B, mut f: F) -> B | |
where | |
Self: Sized, | |
F: FnMut(B, Self::Item) -> B, | |
{ | |
let ZipLongest { a, mut b } = self; | |
init = a.fold(init, |init, a| match b.next() { | |
Some(b) => f(init, EitherOrBoth::Both(a, b)), | |
None => f(init, EitherOrBoth::Left(a)), | |
}); | |
b.fold(init, |init, b| f(init, EitherOrBoth::Right(b))) | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jswrenn for your feedback. I tried to run the benchmark against your recommendation but the result is regressing
time: [594.12 ns 594.85 ns 595.65 ns]
change: [+20.625% +21.035% +21.434%] (p = 0.00 < 0.05)
Performance has regressed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was gonna suggest to try to rely a bit on zip
from the std with init = a.by_ref().zip(b.ref()).fold(init, &mut f); ...
but Zip
do not have a fold
specialization yet. It would allow to discard the loop (and the Some, Some
case).
What about f(init, match b.next() {...})
(in the first fold)? (I guess it's the same?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes unfortunately. I tried :
{
let ZipLongest { a, mut b } = self;
init = a.fold(init, |init, a| f(init, match b.next() {
Some(b) => EitherOrBoth::Both(a, b),
None => EitherOrBoth::Left(a),
}));
b.fold(init, |init, b| f(init, EitherOrBoth::Right(b)))
}
and the result is same: time: [593.41 ns 594.82 ns 596.39 ns]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In #765 I considered folding a range k+1..=n
but did not because it was slower, which surprised me. It seems weird to even suggest but maybe it's because of Range*::fold
?!
Since fold
s in the benchmark rely on the same thing here, maybe you can try with different iterators.
Like let v1 = black_box((0..***).collect_vec()); v2...
(before bench_function
) and then v1.iter().zip_longest(v2.iter())
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jswrenn Thanks a lot! Yes after looking at the 2 files you shared, I can understand why for loop should always be avoided (and use the underlying fold api which has been optimized). That's really inspiring to me. I've incorporated your suggestions just now. Thanks a lot!
@Philippe-Cholet Yes the issue is on Range*::fold
. I adopted your suggestions and rerun the benchmark. Using the default fold
(which calls next
) the time is time: [2.0522 µs 2.0558 µs 2.0599 µs]
, and using the new fold
logic, I can see a huge improvement time: [414.34 ns 415.44 ns 416.75 ns]
Thanks both!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Diving... Apparently, Range (a..b
) does not specialize fold (so maybe that's why there was bad perf), but RangeInclusive (a..=b
) does relying on try_fold
(implementation seems tricky to me). std::vec::IntoIter
does not either, but std::slice::Iter
which I redirected you to (without much insight) does.
So I guess we should continue with some_vec.iter().fold(...)
for benchmarks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. In future PR that optimizes fold
, we should stick with some_vec.iter().fold(...)
for benchmarking
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jswrenn Do you have any further feedback for my PR ? If no I think we can merge this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usual link for how things should prefer forwarding internal iteration to internal iteration where possible: https://medium.com/@veedrac/rust-is-slow-and-i-am-the-cure-32facc0fdcb.
chain
is the usual example where internal iteration tends to be measurably faster than external. For things like Range<usize>
or slice::Iter
, there's typically no difference.
(So this is my +1 to forwarding fold
to fold
where possible, and when we finally make it possible to implement try_fold
, having that forward to try_fold
.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! Thanks.
3e5395d
to
575819a
Compare
575819a
to
c4358ab
Compare
Sigh. While attempting to fix a spurious merge conflict error, I reset the head back slightly too far. Github autoclosed the PR, and now I don't have edit access. @Owen-CH-Leung Can you just run |
As per discussion in #755 , this PR implements custom fold logic for ZipLongest<T, U> and create benchmark test to measure the performance gain compared with the default fold.
Running the proposed benchmark using the default
fold
from iterator, the time wastime: [592.01 ns 593.24 ns 594.70 ns]
Running the proposed benchmark using the proposed
fold
logic, the time wastime: [486.75 ns 487.88 ns 489.29 ns]
But the performance gain is not so significant if the setup changes to
(0..1024).zip_longest(0..768)
using the default
fold
logic:time: [509.25 ns 510.21 ns 511.35 ns]
using the proposed
fold
logic:time: [485.58 ns 486.24 ns 486.97 ns]