
Implement custom fold for ZipLongest #764

Conversation

Owen-CH-Leung
Contributor

@Owen-CH-Leung Owen-CH-Leung commented Sep 28, 2023

As per the discussion in #755, this PR implements custom fold logic for ZipLongest<T, U> and adds a benchmark to measure the performance gain compared with the default fold.

Running the proposed benchmark with the default fold from Iterator: time: [592.01 ns 593.24 ns 594.70 ns]
Running it with the proposed fold logic: time: [486.75 ns 487.88 ns 489.29 ns]

The performance gain is less significant, however, if the setup changes to (0..1024).zip_longest(0..768):

using the default fold logic: time: [509.25 ns 510.21 ns 511.35 ns]
using the proposed fold logic: time: [485.58 ns 486.24 ns 486.97 ns]
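For readers unfamiliar with the adapter: zip_longest pairs items until the shorter iterator is exhausted, then yields the remainder of the longer one. A minimal std-only sketch of those semantics (the EitherOrBoth and zip_longest names mirror itertools, but this is an illustration, not the crate's code):

```rust
// Std-only illustration of zip_longest's semantics; not itertools' implementation.
#[derive(Debug, PartialEq)]
enum EitherOrBoth<A, B> {
    Both(A, B),
    Left(A),
    Right(B),
}

fn zip_longest<A, B>(mut a: A, mut b: B) -> Vec<EitherOrBoth<A::Item, B::Item>>
where
    A: Iterator,
    B: Iterator,
{
    let mut out = Vec::new();
    loop {
        match (a.next(), b.next()) {
            (Some(x), Some(y)) => out.push(EitherOrBoth::Both(x, y)),
            (Some(x), None) => out.push(EitherOrBoth::Left(x)),
            (None, Some(y)) => out.push(EitherOrBoth::Right(y)),
            (None, None) => return out,
        }
    }
}
```

For example, zip_longest(0..3, 0..2) yields Both(0, 0), Both(1, 1), Left(2).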

@Philippe-Cholet
Member

Philippe-Cholet commented Sep 28, 2023

I'm not very familiar with benchmarks but I don't think we can expect it to be two times faster.
It seems to be respectively 22% and 5% faster. It seems reasonably good to me.
What was the "proposed benchmark" you replaced?

PS: As you can see here, you needed to run cargo fmt in order to pass CI tests.

EDIT (after response): Ah, 1024 and 768 are just mirrored, ok. I did not expect such a difference (22% vs 5%) from such a change.
Side note: After a CI failure yesterday, I'm trying a custom .git/hooks/pre-commit (gist) to prevent most CI failures.

@Owen-CH-Leung
Contributor Author

I'm not very familiar with benchmarks but I don't think we can expect it to be two times faster. It seems to be respectively 22% and 5% faster. It seems reasonably good to me. What was the "proposed benchmark" you replaced?

PS: As you can see here, you needed to run cargo fmt in order to pass CI tests.

Sorry, by "proposed benchmark" I meant the zip_longest benchmark function that I added in this PR. Yes, the performance is around 22% and 5% faster.

Member

@jswrenn jswrenn left a comment


Thank you so much for this! I've left a comment on how to further optimize fold. Could you also add a test for the specialization in tests/specializations.rs? Something like this should do the trick:

quickcheck! {
     fn intersperse(v: Vec<u8>) -> () {
         test_specializations(&v.into_iter().intersperse(0));
     }
+
+    fn zip_longest(a: Vec<u8>, b: Vec<u8>) -> () {
+        test_specializations(&a.into_iter().zip_longest(b))
+    }
 }

Comment on lines 56 to 80
#[inline]
fn fold<B, F>(self, mut acc: B, mut f: F) -> B
where
    Self: Sized,
    F: FnMut(B, Self::Item) -> B,
{
    let ZipLongest { mut a, mut b } = self;

    loop {
        match (a.next(), b.next()) {
            (Some(x), Some(y)) => acc = f(acc, EitherOrBoth::Both(x, y)),
            (Some(x), None) => {
                acc = f(acc, EitherOrBoth::Left(x));
                // b is exhausted, so we can drain a.
                return a.fold(acc, |acc, x| f(acc, EitherOrBoth::Left(x)));
            }
            (None, Some(y)) => {
                acc = f(acc, EitherOrBoth::Right(y));
                // a is exhausted, so we can drain b.
                return b.fold(acc, |acc, y| f(acc, EitherOrBoth::Right(y)));
            }
            (None, None) => return acc, // Both iterators are exhausted.
        }
    }
}
Member


When specializing fold, try to avoid for loops and calls to next() on underlying iterators. If those underlying iterators also specialize fold, then for looping and calling next() bypasses their optimizations.

It's possible to implement this method in terms of fold:

Suggested change
#[inline]
fn fold<B, F>(self, mut acc: B, mut f: F) -> B
where
    Self: Sized,
    F: FnMut(B, Self::Item) -> B,
{
    let ZipLongest { mut a, mut b } = self;
    loop {
        match (a.next(), b.next()) {
            (Some(x), Some(y)) => acc = f(acc, EitherOrBoth::Both(x, y)),
            (Some(x), None) => {
                acc = f(acc, EitherOrBoth::Left(x));
                // b is exhausted, so we can drain a.
                return a.fold(acc, |acc, x| f(acc, EitherOrBoth::Left(x)));
            }
            (None, Some(y)) => {
                acc = f(acc, EitherOrBoth::Right(y));
                // a is exhausted, so we can drain b.
                return b.fold(acc, |acc, y| f(acc, EitherOrBoth::Right(y)));
            }
            (None, None) => return acc, // Both iterators are exhausted.
        }
    }
}
#[inline]
fn fold<B, F>(self, mut init: B, mut f: F) -> B
where
    Self: Sized,
    F: FnMut(B, Self::Item) -> B,
{
    let ZipLongest { a, mut b } = self;
    init = a.fold(init, |init, a| match b.next() {
        Some(b) => f(init, EitherOrBoth::Both(a, b)),
        None => f(init, EitherOrBoth::Left(a)),
    });
    b.fold(init, |init, b| f(init, EitherOrBoth::Right(b)))
}

Contributor Author


Thanks @jswrenn for your feedback. I ran the benchmark against your recommendation, but the result regressed:

time:   [594.12 ns 594.85 ns 595.65 ns]
change: [+20.625% +21.035% +21.434%] (p = 0.00 < 0.05)
Performance has regressed.

Member


I was gonna suggest relying a bit on zip from the std with init = a.by_ref().zip(b.by_ref()).fold(init, &mut f); ... but Zip does not have a fold specialization yet. It would allow discarding the loop (and the Some, Some case).

What about f(init, match b.next() {...}) (in the first fold)? (I guess it's the same?)

Contributor Author


Yes, unfortunately. I tried:

{
    let ZipLongest { a, mut b } = self;
    init = a.fold(init, |init, a| {
        f(init, match b.next() {
            Some(b) => EitherOrBoth::Both(a, b),
            None => EitherOrBoth::Left(a),
        })
    });
    b.fold(init, |init, b| f(init, EitherOrBoth::Right(b)))
}

and the result is the same: time: [593.41 ns 594.82 ns 596.39 ns]

Member


In #765 I considered folding a range k+1..=n but did not, because it was slower, which surprised me. It seems weird to even suggest, but maybe it's because of Range*::fold?!
Since the folds in the benchmark here rely on the same thing, maybe you can try with different iterators.
Like let v1 = black_box((0..***).collect_vec()); v2... (before bench_function) and then v1.iter().zip_longest(v2.iter()).

Contributor Author


@jswrenn Thanks a lot! Yes, after looking at the two files you shared, I can understand why a for loop should always be avoided (and the underlying fold API, which has been optimized, used instead). That's really inspiring to me. I've incorporated your suggestions just now. Thanks a lot!

@Philippe-Cholet Yes, the issue is with Range*::fold. I adopted your suggestions and reran the benchmark. Using the default fold (which calls next) the time is [2.0522 µs 2.0558 µs 2.0599 µs], and using the new fold logic I see a huge improvement: time: [414.34 ns 415.44 ns 416.75 ns]

Thanks both!

Member


Diving in... Apparently, Range (a..b) does not specialize fold (so maybe that's why the perf was bad), but RangeInclusive (a..=b) does, relying on try_fold (the implementation seems tricky to me). std::vec::IntoIter does not either, but std::slice::Iter, which I redirected you to (without much insight), does.
So I guess we should continue with some_vec.iter().fold(...) for benchmarks.
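To illustrate the two setups being compared, here is a hypothetical std-only sketch (no criterion, names are my own): one function folds a Range, which goes through Iterator's default fold, while the other folds a slice iterator, which hits slice::Iter's specialized fold. In a real benchmark the Vec would be collected once before bench_function, as suggested above, so only the fold itself is measured.

```rust
use std::hint::black_box;

// Hypothetical sketch of the two benchmark setups discussed above.
// Summing through a Range uses Iterator's default fold (next-based),
// while summing through a slice iterator uses slice::Iter's specialized fold.
fn sum_via_range(n: u64) -> u64 {
    black_box(0..n).fold(0, |acc, x| acc + x)
}

fn sum_via_slice(v: &[u64]) -> u64 {
    black_box(v.iter()).fold(0, |acc, &x| acc + x)
}
```

black_box keeps the optimizer from constant-folding the whole computation away, which is what criterion's black_box is for as well.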

Contributor Author


Indeed. In future PRs that optimize fold, we should stick with some_vec.iter().fold(...) for benchmarking.

Contributor Author


@jswrenn Do you have any further feedback on my PR? If not, I think we can merge this.

Contributor

@scottmcm scottmcm Oct 3, 2023


The usual link on why adapters should prefer forwarding internal iteration to internal iteration where possible: https://medium.com/@veedrac/rust-is-slow-and-i-am-the-cure-32facc0fdcb.

chain is the usual example where internal iteration tends to be measurably faster than external. For things like Range<usize> or slice::Iter, there's typically no difference.

(So this is my +1 to forwarding fold to fold where possible, and when we finally make it possible to implement try_fold, having that forward to try_fold.)
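As a concrete illustration of "forwarding fold to fold", here is a hypothetical chain-like adapter (my own sketch, not std's Chain): its next branches on which half is active on every call, while its fold hands the closure to each inner iterator's own fold, preserving any specialization they have.

```rust
// Hypothetical chain-like adapter; illustration only, not std's Chain.
struct MyChain<A, B> {
    a: Option<A>,
    b: Option<B>,
}

impl<A, B> Iterator for MyChain<A, B>
where
    A: Iterator,
    B: Iterator<Item = A::Item>,
{
    type Item = A::Item;

    // External iteration: every call pays a branch on which half is active.
    fn next(&mut self) -> Option<Self::Item> {
        if let Some(mut it) = self.a.take() {
            if let Some(x) = it.next() {
                self.a = Some(it);
                return Some(x);
            }
            // First half exhausted; self.a stays None.
        }
        self.b.as_mut()?.next()
    }

    // Internal iteration: forward to the inner iterators' own folds,
    // preserving their specializations and avoiding the per-item branch.
    fn fold<Acc, G>(self, mut acc: Acc, mut g: G) -> Acc
    where
        Self: Sized,
        G: FnMut(Acc, Self::Item) -> Acc,
    {
        if let Some(a) = self.a {
            acc = a.fold(acc, &mut g);
        }
        if let Some(b) = self.b {
            acc = b.fold(acc, g);
        }
        acc
    }
}
```

Consuming the adapter via fold runs each half's loop back to back, with no per-item check of which half we are in.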

@jswrenn jswrenn added this to the next milestone Sep 28, 2023
Member

@jswrenn jswrenn left a comment


This looks great! Thanks.

@jswrenn jswrenn added this pull request to the merge queue Oct 3, 2023
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to a conflict with the base branch Oct 3, 2023
@jswrenn jswrenn added this pull request to the merge queue Oct 3, 2023
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to a conflict with the base branch Oct 3, 2023
@jswrenn jswrenn force-pushed the Implement-custom-fold-for-zip-longest branch from 3e5395d to 575819a Compare October 3, 2023 16:54
@jswrenn jswrenn closed this Oct 3, 2023
@jswrenn jswrenn force-pushed the Implement-custom-fold-for-zip-longest branch from 575819a to c4358ab Compare October 3, 2023 16:55
@jswrenn
Member

jswrenn commented Oct 3, 2023

Sigh. While attempting to fix a spurious merge conflict error, I reset the head back slightly too far. GitHub autoclosed the PR, and now I don't have edit access.

@Owen-CH-Leung Can you just run git push -f on your end (without first pulling)? That ought to do the trick.

@Owen-CH-Leung
Contributor Author

Hi @jswrenn, I ran git push -f without pulling, and it created this new PR:

#774
