Specialize .zip() for efficient slice and slice iteration #33090
Conversation
r? @aturon (rust_highfive has picked a reviewer for you, use r? to override)
Why?
I haven't tried that. The counted loop motivation explains why I used this approach though.
I see. (Now to practice reading things from start to finish before asking questions...)
Nice wins! The design here seems pretty clean, although I agree with @Stebalien that if we could get away with something like:

```rust
unsafe fn next_unchecked(&mut self) -> Self::Item;
```

then that'd be pretty slick as well (not sure if it optimizes the same way though). It's a little unfortunate that the extra struct fields are always present; perhaps something like this could hide them behind an associated type:

```rust
struct Zip<A, B> {
    inner: <(A, B) as ZipImpl<A, B>>::Repr,
}

trait ZipImpl<A, B> {
    type Repr;
}

impl<A, B> ZipImpl<A, B> for (A, B) {
    type Repr = (A, B);
}
```

I'm not sure if that'll actually compile just yet, but perhaps worth a shot? Also, could this add some benchmarks in-tree so eventually if we start tracking them we can track performance over time?
@Stebalien @alexcrichton I have tried things similar to .next_unchecked, but not with the rigor or the scale of this PR. In general it does not optimize the same way the indexed version does, and does not reduce to memcpy.

Specialization with associated types seems controversial as it is. I'm not worried about the extra struct fields. In the absolute majority of use cases, they are locals in a function and can be optimized out if they are unused.

I will attempt to salvage Map by adding back the side effect case for uneven length iterators, and see if the performance of the test cases is still preserved.

I'll happily add the examples as benchmark cases, but for me they are more like codegen test cases. Do we have a facility for testing optimizations?
Ah yeah if LLVM can't connect the dots to realize everything uses the same iteration counter then seems fine to leave as is.
Could you elaborate a bit? Is it not implemented currently, or do you think it's likely to be removed? It's true that this would only be a minor win; I'm mostly just interested to see if it works :)
We kinda do in |
Here's a playground snippet where you can compare the counted strategy with the next_unchecked strategy. https://play.rust-lang.org/?gist=fbb8dece16c32195eb60219b9d769871&version=nightly&backtrace=0 The implementation is of course not so beautiful, since we need to access private fields of the iterators from outside their module. next_unchecked vectorizes, but it does not recognize as memcpy, and you can compare the addition loops' code against each other too. The kind of code generated for the addition loop by next_unchecked is generally a bit worse in benchmarks than the indexed approach. |
The AssumeNonNull code is gone, because it regressed some of the simplest testcases (even if it improved many others). This optimization is pretty brittle itself! I've added benchmarks, and they optimize as they should, but only after I moved the code out into separate functions. The codegen test also passes. I think Map will not participate in this specialization. I planned to reproduce the needed side effects, but then I realized the |
It's fragile and it depends on something that's outside the zip iterator's control. Example: https://gist.github.com/bluss/5a088d3f420d12406689439e2940d731 Same input iterators but different sources (slice vs Vec). |
Regarding the nonnull/noalias/optimization failures: after looking through this yesterday, I think it's a very interesting issue, but it's orthogonal to, and independent of, this improvement of zip, even though the missing nonnull/noalias metadata readily destroys the optimization in many cases. A reliable workaround is to use an intermediate function with slice arguments (works well for ndarray, for example). In the long run we must ensure that slices / slice iterators created directly from Vec variables or Vec function arguments are treated the same way slice function arguments are.

@alexcrichton I did one attempt at associated types and found some interesting effects. The implementation diff on top of the current state of the PR is here https://gist.github.com/bluss/ee1396d6b845f8eefa04ead1fa940533 and the compiler error that resulted is also in there. It simply looks like the compiler has to treat the default case as if it has a dropck obligation. This is all pretty new to me, and I don't know what's correct or how it can be worked around.

I think the approach using Default is the way to go without the lattice rule. There's no way we can have it know the associated data is always () in the default implementation case.
To clarify the overall status: it seems pretty clear that we're going to want specializable associated types in some form (and today's implementation has them). We're already starting to collect some pretty significant use cases. The main question is precisely who gets to assume what about these types -- there are a few ways we can take the design, and we want to get feedback on the use cases they enable. |
I found a use of associated types that compiles and works as before (optimizes the same). It's in the latest commit. It uses the trait bounds `'static + Default + Clone + Debug` on the associated type. The `'static` was the ingredient to avoid the dropck issue, `Default` allows creating the field in the default impl, and `Clone`, `Debug` support the derives on Zip. The trait owning the associated type is a separate trait implemented for every type, so that no new trait bounds are needed on the Zip struct.
Nice! I'm definitely in favor of merging :) (but will defer to libs triage at least) |
It's in good shape for merging now. Avenues for expansion:
The dropck thing is probably a bug, though. |
This still feels like a scary type-system feature to me - and lo and behold, the current implementation has tons of bugs. If you ignore the orphan rules suddenly becoming Very Important, method specialization is basically equivalent to matching on types - which you can basically do (unreliably) even before specialization with sufficient unsafe code tricks. Specialized associated types, however, add that kind of matching to the type system itself. We don't even have associated types working in all cases even without specialization - I doubt that adding it will make the situation better. |
We still don't know how to do type equality with HRTB and associated types (at least, I don't, and rustc doesn't do it correctly). Selecting types based on that sounds... bogus. |
Is this related to specializing to take advantage of a trusted size_hint? (I didn't see any issues open for such a thing.) |
@bstrie It's not formally connected to it, but it could be. The specialization here has a much stricter demand (that you can access elements unchecked, by index). |
The libs team discussed this during triage today, and conclusion was positive all around. @aturon would like to scrutinize before landing, so I'll defer the final review to him. |
☔ The latest upstream changes (presumably #33694) made this pull request unmergeable. Please resolve the merge conflicts. |
The associated type must be 'static to avoid dropck related errors.
Thank you. Sorry that I was away for some weeks. Let's give this a go (I'm not 100% sure that the memcpy test will pass the buildbots). @bors r=aturon
📌 Commit 5df05c6 has been approved by |
⌛ Testing commit 5df05c6 with merge 0a35504... |
@bors: retry force clean
Specialize .zip() for efficient slice and slice iteration

The idea is to introduce a private trait TrustedRandomAccess and specialize .zip() for random access iterators into a counted loop. The implementation in the PR is internal and has no visible effect in the API.

Why a counted loop? To have each slice iterator compile to just a pointer, with both pointers indexed by the same loop counter value in the generated code. When this succeeds, copying loops are readily recognized and replaced with memcpy, and addition loops autovectorize well.

TrustedRandomAccess

The TrustedRandomAccess approach works very well on the surface. Microbenchmarks optimize well, following the ideas above, and that is a dramatic improvement of .zip()'s codegen.

```rust
// old zip before this PR: bad, byte-for-byte loop
// with specialized zip: memcpy
pub fn copy_zip(xs: &[u8], ys: &mut [u8]) {
    for (a, b) in ys.iter_mut().zip(xs) {
        *a = *b;
    }
}

// old zip before this PR: single addition per iteration
// with specialized zip: vectorized
pub fn add_zip(xs: &[f32], ys: &mut [f32]) {
    for (a, b) in ys.iter_mut().zip(xs) {
        *a += *b;
    }
}

// old zip before this PR: single addition per iteration
// with specialized zip: vectorized (!!)
pub fn add_zip3(xs: &[f32], ys: &[f32], zs: &mut [f32]) {
    for ((a, b), c) in zs.iter_mut().zip(xs).zip(ys) {
        *a += *b * *c;
    }
}
```

Non-null issues

Yet in more complex situations, the .zip() loop can still fall back to its old behavior, where phantom null checks throw in fake premature end-of-loop conditionals. Remember that a NULL inside `Option<(&T, &T)>` makes it a `None` value and a premature (in this case) end of the loop. So even if we have 1) an explicit `Some` in the code and 2) pointer types `&T` or `&mut T`, which are nonnull, we can still get a phantom null check at that point.

One example that illustrates the difference is `copy_zip` with slice versus Vec arguments. The involved iterator types are exactly the same, but the Vec version doesn't compile down to memcpy.
Investigating this, the function argument metadata emitted to LLVM plays the biggest role. As eddyb summarized, we need `nonnull` for the loop to autovectorize and `noalias` for it to be replaced with memcpy.

There was an experiment to use `assume` to add a non-null assumption on each of the two elements in the specialized zip iterator, but this only helped in some of the test cases and regressed others. Instead, I think the nonnull/noalias metadata issue is something we need to solve separately anyway.

Adaptors

These have conditionally implemented TrustedRandomAccess

- Enumerate
- Zip

These have not implemented it

- Map is side-effectful. The forward case would be workable, but the double-ended case is complicated.
- Chain: exact length semantics unclear.
- Filter, FilterMap, FlatMap and many others don't offer random access and/or exact length.
Unless I'm mistaken, this should also have implementations for |
Could this have made |
It is. |
@bluss Could you elaborate on why exactly implementing this for |
@krdln why not here, I don't mind. I didn't know how to faithfully reproduce the side effects. As an example, if you have two mapped iterators of length 4 and 3, the unspecialized zip calls the first iterator's closure a fourth time before it discovers that the second iterator is exhausted. The quirky part is that this extra closure call is an observable side effect. The specialized version computes a length of 3 and only attempts to extract 3 elements from each iterator, as it is implemented now. I remember I tried to replicate this in the specialized case. Maybe we can try again.
Oh, I totally didn't think about this edge case. So, by implementing this "naively" (by that I mean without replicating the original behaviour), you would introduce an inconsistency between the non-specialized and specialized versions. So my question now is: why exactly is that a problem? Is it because the specialized version would have different semantics? If that's really needed for correctness, then the naive implementation is indeed a no-go. Or is it because it may potentially break somebody's code which relies on one more call to the closure? I would be really surprised if somebody was relying on this for correctness, especially as this behaviour is not documented anywhere. I would consider that change as breaking as changing the order of comparisons in a sort implementation. Anyway, I've mostly asked the question because of
so I guess it would be fine? |
Cloned is much simpler than Map, I don't see much problem with saying that .clone() might not be called. Even simpler if the type is Copy. I think that when we specialize we need to pay very close attention to the details and always try to make it seamless. Otherwise the code that builds on top of for example zip gets very hard to understand all of a sudden. |
…ition, r=alexcrichton

Remove data structure specialization for .zip() iterator

Go back on half the specialization: the part that changed the Zip struct's fields themselves depending on the types of the iterators. Previous PR: rust-lang#33090

This means that the Zip iterator will always carry two usize fields, which are sometimes unused. If a whole for loop using a .zip() iterator is inlined, these are simply removed and have no effect.

The same improvement for Zip of for example slice iterators remains, and they still optimize well. However, as when the specialization of zip was merged, the compiler is still very sensitive to the exact context. For example this code only autovectorizes if the function is used, not if the code in zip_sum_i32 is inserted inline where it was called:

```rust
fn zip_sum_i32(xs: &[i32], ys: &[i32]) -> i32 {
    let mut s = 0;
    for (&x, &y) in xs.iter().zip(ys) {
        s += x * y;
    }
    s
}

fn zipdot_i32_default_zip(b: &mut test::Bencher) {
    let xs = vec![1; 1024];
    let ys = vec![1; 1024];
    b.iter(|| {
        zip_sum_i32(&xs, &ys)
    })
}
```

Include a test that checks that `Zip<T, U>` is covariant w.r.t. T and U.

Fixes rust-lang#35727