Critical issues before stabilization #364

Open
calebzulawski opened this issue Sep 10, 2023 · 43 comments

@calebzulawski
Member

calebzulawski commented Sep 10, 2023

I wanted to put together a more technical list of issues to be solved before we can stabilize (see rust-lang/rust#86656).

There are many other important issues, like performance on some targets or important missing functions, but IMO those don't prevent stabilization. Here I've trimmed the list down to the major API-related issues that can't be changed, or are difficult to change, after stabilization.

Issues to be solved

  • Restricting the number of lanes. In my opinion, the LaneCount<N>: SupportedLaneCount bound makes the API exceptionally cumbersome. I've also had some trouble writing generics around it, particularly when making functions that change the number of lanes (the trait solver sometimes simply fails). Adding the bound often feels like boilerplate, and I found myself looking for ways to launder the bound, like adding unnecessary const parameters. Making this a post-monomorphization error (I found Design meeting: Non-local errors (aka post-monomorphization errors) lang-team#195 helpful) might be the way to go, or perhaps there's a way to make a const fn panic. Cons: the trait bound is very explicit and hiding the error states could possibly do more harm than good when identifying the source of a build failure.
  • Mask element type. I'm not confident that the mask type for Simd<f32, N> should be Simd<i32, N> (see Change Mask element to match Simd element #322 for discussion). I think it would be much more straightforward if the types simply matched. Cons: this would require an extra cast when using i32 masks for f32 vectors or similar, and makes implementing From<Mask> for Mask impossible (since pointer masks must be generic Mask<*const T, N>, maybe all pointers could use the same mask element?).
  • Swizzle functions. They're difficult to use and not a very good API, but Rust doesn't currently allow for anything much better. We could hold off stabilizing arbitrary swizzles, but that's a big limitation.

Non-issues, but things that should be done

  • Are we happy with the traits the way they are? e.g. SimdPartialEq, SimdInt.
  • The API should be partitioned into multiple unstable features that can be stabilized independently.

Updates (after filing this issue)

Lane count

  • I tried condensing LaneCount and SimdElement into a single bound Simd<T, N>: Supported. This doesn't work well for a variety of reasons. One example: scatter/gather use Simd<T, N>, Simd<*const T, N>, and Simd<usize, N>. Each of these would need its own bound, rather than sharing one LaneCount bound, since they all share N.
  • I recommend either keeping the bounds as they are now (LaneCount and SimdElement) or turning LaneCount into a non-local error.
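
To make the ergonomic cost concrete, here is a toy model of the current shape on plain arrays (hypothetical stand-ins for LaneCount, SupportedLaneCount, and Simd; not the real implementation). Every generic function over N must repeat the bound, and a function that changes the lane count needs it twice:

```rust
// Toy stand-ins for the std::simd types (not the real implementation).
pub struct LaneCount<const N: usize>;

pub trait SupportedLaneCount {}
impl SupportedLaneCount for LaneCount<1> {}
impl SupportedLaneCount for LaneCount<2> {}
impl SupportedLaneCount for LaneCount<4> {}
impl SupportedLaneCount for LaneCount<8> {}

pub struct Simd<T, const N: usize>([T; N])
where
    LaneCount<N>: SupportedLaneCount;

impl<T: Copy, const N: usize> Simd<T, N>
where
    LaneCount<N>: SupportedLaneCount,
{
    pub fn splat(v: T) -> Self {
        Simd([v; N])
    }
}

// Any helper generic over N must repeat the bound, and a function that
// changes the lane count (N -> M) needs it for both counts.
fn first_half<T: Copy, const N: usize, const M: usize>(v: Simd<T, N>) -> Simd<T, M>
where
    LaneCount<N>: SupportedLaneCount,
    LaneCount<M>: SupportedLaneCount,
{
    let mut out = [v.0[0]; M];
    out.copy_from_slice(&v.0[..M]);
    Simd(out)
}

fn main() {
    let v = Simd::<u32, 8>::splat(3);
    let h: Simd<u32, 4> = first_half(v);
    assert_eq!(h.0, [3, 3, 3, 3]);
}
```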

Swizzles

I tried the following swizzle code. It requires the incomplete adt_const_params feature. Even with this feature enabled, it's impossible to implement functions like reverse, because const generics can't access the generic const parameters.

    pub fn swizzle<const M: usize, const INDEX: &'static [usize]>(self) -> Simd<T, M>
    where
        LaneCount<M>: SupportedLaneCount,
    {
        // SAFETY: `self` is a vector and the swizzle index is a const array of u32
        unsafe {
            intrinsics::simd_shuffle(
                self,
                self,
                const {
                    assert!(M == INDEX.len(), "`M` must equal the length of `INDEX`");
                    let mut r = [0; M];
                    let mut i = 0;
                    while i < M {
                        assert!(
                            INDEX[i] < N,
                            "indices must be less than the input vector length"
                        );
                        r[i] = INDEX[i] as u32;
                        i += 1;
                    }
                    r
                },
            )
        }
    }
@BurntSushi
Member

w00t! Very excited to see this. :D :D :D

Are we happy with the traits the way they are? e.g. SimdPartialEq, SimdInt.

Are there high-level docs anywhere describing how the API is laid out? Like, a guided tour of everything? I realize that's a tall order to ask for before the API is set in stone because it takes work to stabilize. But I'm asking because as of right now, I find the API to be very overwhelming. There are a lot of traits and upon seeing them, I immediately wonder how they all fit together. And in particular, why they exist. That is, if the API were less trait heavy, what would we lose?

Looking at the names of the traits, I can make some guesses as to what some of them are for. And it looks plausible that they could all mostly be explained (at a high level) without too much exposition.

I have opinions on the names of the traits too, but I think that fun can be saved for later. :P

I find the number of concrete types to be large but not overwhelming. At a quick glance, I can immediately understand what they mean and what connects them. While I do have some light SIMD background, I'm confident that the large number of concrete types could be explained by a ~paragraph in the crate docs.

The API should be partitioned into multiple unstable features that can be stabilized independently.

Do you have a 50,000-foot view of what the partitioning might look like? I don't mean to put you on the spot and ask for a detailed proposal, but just a general feeling of what you think might make sense.

Swizzle functions. They're difficult to use and not a very good API, but Rust doesn't currently allow for anything much better. We could hold off stabilizing arbitrary swizzles, but that's a big limitation.

I agree that missing swizzles is a rather large limitation. For me personally, unless there is a language feature that you're pretty confident is right at the horizon, or if there's a small language feature that you can get the lang team on board with quickly, I would go to war with the army you have when everything else is ready to stabilize. But this is just a general sense of things personally, and I'm sure there is a deeper analysis that could be done here. e.g., "If Rust had language feature Foo, then the swizzle API could be written like this which would make things simpler and/or unlock new use cases."

@programmerjake
Member

e.g., "If Rust had language feature Foo, then the swizzle API could be written like this which would make things simpler and/or unlock new use cases."

that would be: if LLVM had dynamic swizzles (using the shuffle LLVM IR instruction), then we could just have the swizzle pattern be a function argument and rely on inlining and const-propagation, like we do with atomic orderings (technically we could do that to some extent in rustc, but it really should be in LLVM).

@calebzulawski
Member Author

Are there high-level docs anywhere describing how the API is laid out?

Nope, that's definitely something we should have. Personally, I think we could use more submodules to make the division a little clearer. Each of these could be documented independently, with a bit of guidance in the main std::simd module:

  • Simd, Mask, and the various aliases in std::simd
  • num containing SimdInt, SimdFloat, etc
  • ptr containing SimdConstPtr and SimdMutPtr
  • cmp containing SimdPartialEq etc
  • swizzle containing Swizzle, Which, etc

That is, if the API were less trait heavy, what would we lose?

The original implementation had no traits. This results in two related issues. To implement recip, you need impl Simd<f32, N> and impl Simd<f64, N> and end up with two identical (but separate in docs) recip functions. For integers this is even more verbose and distracting. The second issue is that you now have overlapping functions, e.g. recip (for floats), count_ones (for ints) and wrapping_add (for pointers) all implemented on the Simd type. With traits, we were able to put generic vector functions on Simd and leave anything element-type-specific to a trait.
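
A toy sketch of that split (hypothetical types; not the real std::simd code): container-like operations sit directly on the generic vector type, while element-specific operations like recip hang off a trait, so they are defined once per element family instead of duplicated per concrete type:

```rust
// Toy model of the design (hypothetical names, plain arrays underneath).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Vector<T, const N: usize>([T; N]);

// Container-like ops: indifferent to the element type.
impl<T: Copy, const N: usize> Vector<T, N> {
    fn splat(v: T) -> Self {
        Vector([v; N])
    }
    fn to_array(self) -> [T; N] {
        self.0
    }
}

// Element-specific ops live on a trait, so `recip` appears once in the
// docs instead of separately on every float vector type.
trait SimdFloat {
    fn recip(self) -> Self;
}

impl<const N: usize> SimdFloat for Vector<f32, N> {
    fn recip(self) -> Self {
        Vector(self.0.map(|x| 1.0 / x))
    }
}

impl<const N: usize> SimdFloat for Vector<f64, N> {
    fn recip(self) -> Self {
        Vector(self.0.map(|x| 1.0 / x))
    }
}

fn main() {
    let v = Vector::<f32, 4>::splat(2.0);
    assert_eq!(v.recip().to_array(), [0.5, 0.5, 0.5, 0.5]);
}
```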

The API should be partitioned into multiple unstable features that can be stabilized independently.

Do you have a 50,000-foot view of what the partitioning might look like?

Offhand, I would say something like:

  • portable_simd_types: just Simd, Mask, and a handful of generic functions (these types implement Add, Sub, etc)
  • portable_simd_swizzle if we want to split off swizzling for later
  • portable_simd_scatter_gather: scatter/gather is another fairly complicated API we might want to hold back
  • portable_simd_traits: for all of the traits above, once we're comfortable with how they're divided up etc

I agree that missing swizzles is a rather large limitation. For me personally, unless there is a language feature that you're pretty confident is right at the horizon, or if there's a small language feature that you can get the lang team on board with quickly, I would go to war with the army you have when everything else is ready to stabilize.

One option might be to implement a set of swizzle2<I0, I1>, swizzle4<I0, I1, I2, I3>, etc. functions that are not quite as powerful as we'd like (impossible to make generic over N) and wait to stabilize the completely generic interface until we can do something like fn swizzle<const N: usize, const I: [usize; N]>() or perhaps something even better.
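
A scalar model of what such a fixed-arity function might look like (hypothetical signature, operating on plain arrays rather than vectors; note that in this sketch an out-of-range index is still a runtime panic, where the real API would want a compile-time check):

```rust
// Fixed-arity swizzle sketch: the indices are const generics, so each
// call site fixes them at compile time, but every width needs its own
// function (swizzle2, swizzle4, swizzle8, ...).
fn swizzle4<T: Copy, const I0: usize, const I1: usize, const I2: usize, const I3: usize>(
    v: [T; 4],
) -> [T; 4] {
    [v[I0], v[I1], v[I2], v[I3]]
}

fn main() {
    let reversed = swizzle4::<_, 3, 2, 1, 0>([1, 2, 3, 4]);
    assert_eq!(reversed, [4, 3, 2, 1]);
}
```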

@RazrFalcon

I hope it will be released soon. I've tried using it in one of my crates (linebender/tiny-skia#88) and it mostly works. There are some glitches which I haven't debugged yet, but I don't think they are critical.
Performance is all over the place on AArch64 (haven't tested x86 yet). Some things became faster, some slower. Maybe a porting bug, maybe imperfect codegen. But I'm fine with small performance regressions as long as I can ditch manual, unsafe SIMD code.

I personally do not use swizzle, so those functions are not important to me.

@thomcc
Member

thomcc commented Sep 18, 2023

That is, if the API were less trait heavy, what would we lose?

Realistically we'd probably gain things. As it is, there are a number of APIs that have been hard to add due to the need to make them fit into the trait-heavy design.

That said, my stance here (that we should not have gone with the traits, as it has led to a cumbersome and inflexible design) is not a popular one, so YMMV.

@calebzulawski
Member Author

As far as restricting the number of lanes goes, I tried combining the bound into something like:

pub struct Simd<T, const N: usize>(...) where Self: Supported;

In theory, this seemed like a good compromise between the current verbose bounds and using a non-local error. In practice, it wasn't really possible to implement this. It turns out we use the separation between the element bound (SimdElement) and the lane count bound (SupportedLaneCount) in a fair number of places. For example, functions that use multiple vector types, such as scatter/gather, require the bound for each type, which actually ends up more verbose. (Currently, they all share a lane count bound). Even worse, functions that only use another vector type internally (like Simd<usize, ...> in functions that take pointer vectors) need the extra bounds and leak implementation details.

After this experiment, I think the best option is to add two new compiler attributes like #[portable_simd_max_elements(N)] and #[portable_simd_require_power_of_2_elements], which error when unsupported lane counts are used.

@programmerjake
Member

could we just use:

impl<T: ..., const N: usize> Simd<T, N> {
    const VALID: () = {
        let mut valid = N > 0 && N < 64;
        if !cfg!(feature = "all_lane_counts") {
            valid &= N.is_power_of_two();
        }
        assert!(valid, "invalid lane count");
    };
    pub fn each_method(self, ...) -> ... {
        Self::VALID; // use so it triggers a post-mono error
        ...
    }
}

actually, no, we can't, since just copying a Simd type requires a supported lane count.

@calebzulawski
Member Author

Also, it would be easy to accidentally forget

@scottmcm
Member

since just copying a Simd type requires a supported lane count.

Any chance we'd be willing to say that we're ok allowing the type to exist for non-PoT lane counts, but then just not offering anything other than Copy and the array conversions with them? Then it's fine that we don't have a good way to do + or shuffle for Simd<f32, 21>, but we don't force it to be bound everywhere just to exist...

(But I guess that exposes layout questions that we don't want to decide just yet, even if we left them explicitly unspecified.)

it would be easy to accidentally forget

We could maybe have a -Z flag to ICE on non-power-of-two vectors in the codegen, or something, to help catch it.

@programmerjake
Member

since just copying a Simd type requires a supported lane count.

Any chance we'd be willing to say that we're ok allowing the type to exist for non-PoT lane counts, but then just not offering anything other than Copy and the array conversions with them?

we already have a feature flag to enable non-pot lane counts, the main issue imo is that someone could try to use Simd<T, 1234567> which should error for being too big. or Simd<T, 0> which is a pain therefore not allowed (though I'm ok with adding it for the consistency benefits).

@calebzulawski
Member Author

Also, removing the bounds from Simd but not impl Simd doesn't help us much, I think. In my experience, the bounds are no problem when writing code with "concrete" SIMD types, but painful when trying to do anything generic. If functions still require them, those generics will still be difficult.

we already have a feature flag to enable non-pot lane counts, the main issue imo is that someone could try to use Simd<T, 1234567> which should error for being too big. or Simd<T, 0> which is a pain therefore not allowed (though I'm ok with adding it for the consistency benefits).

There is precedent with [T; HUGE] erroring, so I think we may be able to do the same if we can remove the non-PoT bound as well.

@programmerjake
Member

we already have a feature flag to enable non-pot lane counts, the main issue imo is that someone could try to use Simd<T, 1234567> which should error for being too big. or Simd<T, 0> which is a pain therefore not allowed (though I'm ok with adding it for the consistency benefits).

the other issue I forgot is that imo non-pot lane counts have the wrong layout and that should change: #319

@scottmcm
Member

scottmcm commented Sep 24, 2023

the main issue imo is that someone could try to use Simd<T, 1234567> which should error for being too big

This one I feel strongly isn't a problem, since [T; isize::MAX as usize] has exactly the same issue, and I don't think that being a non-local error has really ever been a problem. (I elaborate more on that train of thought in rust-lang/rust#104087 (comment) )

So long as using Simd like this gives the same kind of error as we get for arrays in https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=a8752e42c5601f574d948ff3677a7e0e, that seems entirely acceptable to me. (Saying that with my lang hat on, but not speaking for the team, since I don't think we've discussed Simd in particular.)

or Simd<T, 0> which is a pain therefore not allowed

I would love to have even a hacky solution to this. as_chunks is blocked on exactly the same thing (rust-lang/rust#74985).

lcnr had some thoughts on doing something like this in https://rust-lang.zulipchat.com/#narrow/stream/219381-t-libs/topic/How.20essential.20is.20the.20compile-time.20check.20for.20empty.20arrays.3F/near/248387965. Maybe someone could pick that up to deal with the zero problem for a while? If it was easier, Simd<T, 0> feels like another of those places where we might do "the type is well-formed but doesn't implement most of the traits".

but painful when trying to do anything generic

I suppose that's another possibility here -- explicitly decide to have v1 of stabilizing this not allow writing generic code over them, but still let people use specific versions when implementing stuff.

That seems like it'd still be pretty useful, since using f32x4 in the implementations of stuff is still much easier even if I can't be generic over stuff yet.

@programmerjake
Member

Also, removing the bounds from Simd but not impl Simd doesn't help us much, I think.

My proposal of using Simd::CHECK would only have type bounds, all lane count bounds would be post-mono errors only.

@sammysheep
Contributor

sammysheep commented Sep 25, 2023

I love the current state of this project and am grateful for the hard work being done and the general friendliness/helpfulness of the team.

However, I will be a bit of a contrarian and say that swizzles are important to me, particularly dynamic ones. I've also used the APIs with the const swizzles at length, and it's workable: you can use const fn to generate special index patterns provided you are willing to write some boilerplate to make it ergonomic.

I'm not saying swizzles should stop stabilization per se, I'm a very happy nightly user.

I'll also be contrary and say that it's okay to abstract more functionality for very common use cases. I think some of this is already done, but I wouldn't be sad, for example, to see a function to map bytes using dynamic swizzles. The newbie user (like me) may not realize the hardware limitations (the usual size of look-up tables, for example) of certain operations and so creating a well-designed function for a common use case is helpful to people like me.
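
The const fn approach mentioned above can be sketched on stable Rust (hypothetical helper; the point is that an index pattern is ordinary const code that a const-taking swizzle could consume):

```rust
// Build a swizzle index pattern at compile time with a const fn.
// Produces [0, N/2, 1, N/2+1, ...]: interleave the two halves of a vector.
const fn interleave_indices<const N: usize>() -> [usize; N] {
    let mut idx = [0; N];
    let mut i = 0;
    while i < N / 2 {
        idx[2 * i] = i;
        idx[2 * i + 1] = N / 2 + i;
        i += 1;
    }
    idx
}

fn main() {
    // evaluated entirely at compile time
    const IDX: [usize; 8] = interleave_indices::<8>();
    assert_eq!(IDX, [0, 4, 1, 5, 2, 6, 3, 7]);
}
```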

@workingjubilee
Member

workingjubilee commented Oct 2, 2023

I agree that a satisfying story for swizzles is important.

Regarding traits, I initially proposed it as a way to reduce the "multiplication of entities" problem so that it's easier to follow things. I have begun to think a different approach may be appropriate (increased experience in API design changes your opinions about API design, who would've thought?), but I believe we should probably fork further discussion about that into its own issues/threads.

@jdahlstrom

jdahlstrom commented Oct 11, 2023

My 2c is that if Rust's primitive types cannot currently be abstracted over with traits (other than Add etc), it's not necessary to be able to do it for SIMD types either at first. Indeed it would be strange (and slightly amusing) if a workaround for the lack of numeric traits was to wrap your numbers in x1 SIMD and use SIMD traits instead…

@calebzulawski
Member Author

My 2c is that if Rust's primitive types cannot currently be abstracted over with traits (other than Add etc), it's not necessary to be able to do it for SIMD types either at first. Indeed it would be strange (and slightly amusing) if a workaround for the lack of numeric traits was to wrap your numbers in x1 SIMD and use SIMD traits instead…

Personally, I don't see a way around using traits. This was one of the first issues we hashed out; I'm sure there is more discussion available on Zulip. SIMD types are not just numbers and not just arrays; they act like both simultaneously. Like arrays, SIMD vectors have container-like operations that are only concerned with their memory layout and are indifferent to their element type. This includes splatting, indexing, swizzles, scatter, gather, and conversions to and from arrays and slices.

To undo this means that you will have 14 vector types that each implement these functions separately, and there will be no way in the future to ever treat a vector like an element-agnostic container. It would be as if [i32; 4] and [f64; 4] were totally independent types that couldn't both be used as [T; 4].

Using traits allowed us to treat vectors as containers (which they are) while also giving them element-specific behavior. The element-specific functions can't be implemented directly on Simd for each element type (they end up having name collisions and the docs are impossible to read), so any alternative to traits would need to account for this.

@RalfJung
Member

RalfJung commented Dec 11, 2023

Can we add "documentation and review of the intrinsics' safety requirements" to the list? Currently every time I want to implement one of these in Miri I have to dig through PRs and then usually still go ask people. Intrinsics are language extensions so they should be reviewed by t-opsem before stabilization (and ideally, a draft documentation is created when the intrinsic is implemented, so that one has some starting point for "intended behavior" vs "actual behavior" and checking that they are the same).

I also think the intrinsics' declarations should be moved into libcore so that they can be equipped with doc comments that live in the same repository as the codegen and Miri implementations that turn those doc comments into reality.

@calebzulawski
Member Author

Yes, I intend on doing just that.

@calebzulawski
Member Author

I opened #381 to track that

@anderspapitto

I strongly support checking lane count post-monomorphization. It's quite a hassle to carry around bounds proving that not only N but also N * 2, N / 4, etc. are all valid lane counts (and I hit some ICEs along the way: rust-lang/rust#126443)

@programmerjake
Member

It's quite a hassle to carry around bounds proving that not only N but also N * 2, N / 4, etc are all valid lane counts

if you want to use a function fn f<const N: usize>(a: Simd<u8, N>) -> Simd<u8, { 2 * N }> you'll need some bounds regardless of if we handle element count limiting post-monomorphization or not, since those bounds are needed for any const-generic type, e.g. for arrays:
https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=6320749ac371c3aef86ad1c1316d4cdd

@programmerjake
Member

programmerjake commented Jul 3, 2024

related conversation about min_generic_const_args maybe gaining support for N.div_ceil(8) which would allow us to calculate bit vector sizes from lane counts: https://rust-lang.zulipchat.com/#narrow/stream/260443-project-const-generics/topic/Layout.20in.20const.20parameters/near/448743588

@kytans

kytans commented Jul 15, 2024

Why not support any number of lanes, by using arrays of native SIMD vectors as large as needed, and masking for operations that need it?

More in detail:

  • Simd<T, N> is valid for any T and N and is equivalent to [T; N] except that it has a platform-specific alignment (that is of course >= the alignment of T). EDIT: maybe it's better to implement it as an array of the native SIMD type, although they are semantically the same thing
  • Simd<T, N> where T is a primitive type gets add/sub/mul/div/etc. operations that use SIMD intrinsics where available, using the largest vector size on the platform <= N (which will result in the compiler usually allocating it in SIMD registers)
  • [T; N] where T is a primitive type also gets add/sub/mul/div/etc. operations that use unaligned SIMD operations where available; ideally, the compiler or LLVM should use aligned SIMD operations instead if it can prove alignment or if it is a local variable whose alignment can be increased
  • Operations between Simd<T, N> and [T; N] are also supported
  • Ideally all operations that any SIMD instruction set has are supported and are provided portably; if any operation is not natively supported, it will use the best sequence of SIMD operations or will be performed on scalars one by one
  • A bunch of #[cfg] APIs are provided to query the potential arch SIMD capabilities, and an API is provided to query the actual capabilities at run-time

@calebzulawski
Member Author

The simple answer is that the size limitations are not related to native SIMD register sizes, but codegen backend encoding limitations. We further restrict to even smaller sizes that we are confident operate correctly and optimize well. Our design allows for increasing the maximum in the future.

@kytans

kytans commented Jul 15, 2024

The codegen backend encoding limitations can be easily solved by having the std::simd implementation code-generate Simd<T, N> as an array of either T (with higher alignment) or a backend SIMD type and implement operations as loops that operate on parts of it with appropriate size for current SIMD ISA.

Given that this transformation is very simple and that not doing it is very burdensome for the API user (due to the SupportedLaneCount bounds and potentially having to use an extra crate that does that transformation for some applications), I don't see any reason to provide an API that doesn't perform it.

If the user uses a proper SIMD size, then the transformation will be a no-op and performance will be optimal; otherwise, the code will still work albeit less efficiently in some cases.

@kytans

kytans commented Jul 15, 2024

More precisely, doing something vaguely like this:

struct Simd<T: SimdArray<N>, N>([<T as SimdArray<N>>::Type; <T as SimdArray<N>>::Length]);

default impl<T, N> SimdArray<N> for T {
  type Type = T;
  const Length: usize = N;
}

// for AVX
impl<N> SimdArray<N> for f32
  where N <= 4 {
  type Type = __m128;
  const Length: usize = 1;
}

impl<N> SimdArray<N> for f32
  where N > 4 {
  type Type = __m256;
  const Length: usize = (N + 7) / 8;
}

impl AddAssign<Simd<f32, N>> for Simd<f32, N>
where T: SimdArray<N, Type = __m128>
{
   fn add_assign(&mut self, b: Self) {
     for i in 0..T::Length {
        self[i] = addps(self[i], b[i])
     }
   }
}

impl AddAssign<Simd<f32, N>> for Simd<f32, N>
where T: SimdArray<N, Type = __m256>
{
   fn add_assign(&mut self, b: Self) {
     for i in 0..T::Length {
        self[i] = vaddps(self[i], b[i])
     }
   }
}

Or something like this:

#[align(simd)] // where the compiler chooses the alignment
struct Simd<T, N>([T; N]);

impl AddAssign<Simd<f32, N>> for Simd<f32, N>
where N <= 4
{
   fn add_assign(&mut self, b: Self) {
        transmute::<_, &mut __m128>(&mut self)[0] = addps(transmute::<_, &mut __m128>(&self)[0], transmute::<_, &__m128>(&b)[0])
   }
}

impl AddAssign<Simd<f32, N>> for Simd<f32, N>
where N > 4
{
   fn add_assign(&mut self, b: Self) {
     for i in 0..(N + 7)/8 {
        transmute::<_, &mut __m256>(&mut self)[i] = vaddps(transmute::<_, &mut __m256>(&self)[i], transmute::<_, &__m256>(&b)[i])
     }
   }
}

Or alternatively the compiler itself could do all this, i.e. you could fix the codegen backend to support arbitrary sizes.

@programmerjake
Member

i.e. you could fix the codegen backend to support arbitrary sizes.

Simd is translated directly to LLVM IR vector types, theoretically LLVM supports arbitrary sizes, but really only expects you to use native vector sizes and generates terrible code for very large vector types (e.g. generating 10000 vector add instructions instead of a loop). LLVM also is quite a bit more likely to encounter bugs for weird vector sizes. the general idea is that people using Simd should use a smallish size and if they need to process more data, write an explicit loop into their algorithm.
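
That suggested pattern, a small fixed width plus an explicit loop, looks roughly like this in scalar form (plain slices standing in for a small Simd type; the chunk width of 8 is an arbitrary choice):

```rust
// Keep the "vector" width small and loop over the data, rather than
// asking the backend for one enormous vector type.
fn add_slices(a: &[f32], b: &[f32], out: &mut [f32]) {
    const LANES: usize = 8; // a size the backend handles well
    let n = a.len();
    let mut i = 0;
    // main loop over full 8-element chunks (what a Simd<f32, 8> would cover)
    while i + LANES <= n {
        for j in 0..LANES {
            out[i + j] = a[i + j] + b[i + j];
        }
        i += LANES;
    }
    // scalar tail for the remainder
    while i < n {
        out[i] = a[i] + b[i];
        i += 1;
    }
}

fn main() {
    let a = vec![1.0_f32; 10];
    let b = vec![2.0_f32; 10];
    let mut out = vec![0.0_f32; 10];
    add_slices(&a, &b, &mut out);
    assert!(out.iter().all(|&x| x == 3.0));
}
```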

@calebzulawski
Member Author

The codegen limitations are something like 2^16 max elements, which is large enough that I doubt it's necessary to support anything larger. Something like this would make it possible to exceed that limit but I'm not sure it's worth the added complexity. We will likely be able to lift the current limit to something much larger if we can get reliable non-power-of-two-length codegen, but that's proving problematic for now.

@programmerjake
Member

generates terrible code for very large vector types (e.g. generating 10000 vector add instructions instead of a loop)

e.g. terrible code LLVM generates for a Simd<f32, 400> addition: https://llvm.godbolt.org/z/6r6EfjjfE

@kytans

kytans commented Jul 16, 2024

I guess the downside of my proposal is that it allows the user to write "a + b + c" which will result in two loops, and unless the backend optimizer manages to fuse the loops, that code will be much worse than a single loop since it unnecessarily writes and reads the intermediate result to memory.

So maybe another option to solve the LaneCount<N>: SupportedLaneCount issue is to leave the design as-is, but specify that users should generally never use Simd directly, and instead use an API providing a "SIMD array", "SIMD Vec" and "SIMD iterator" on top of it, which could either be the standard library or a 3rd party crate.

The iterator-based design could still implement arithmetic on iterators (a + b would be syntax sugar for a.zip(b).map(|(x, y)| x + y)), so you could write something like (a.iter() + b.iter() + c.iter()).collect(), and in principle one could even have it be (a + b + c).collect() (where a + b would be syntax sugar for a.iter() + b.iter()).
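
A minimal model of that iterator idea (hypothetical wrapper type, not a concrete proposal): overloading + on an iterator wrapper fuses a + b + c into a single pass with no intermediate buffers:

```rust
use std::ops::Add;

// Wrapper giving `+` an elementwise meaning over f32 iterators.
struct Elementwise<I>(I);

impl<I, J> Add<Elementwise<J>> for Elementwise<I>
where
    I: Iterator<Item = f32>,
    J: Iterator<Item = f32>,
{
    type Output = Elementwise<std::iter::Map<std::iter::Zip<I, J>, fn((f32, f32)) -> f32>>;

    fn add(self, rhs: Elementwise<J>) -> Self::Output {
        // cast the closure to a fn pointer so the output type can be named
        Elementwise(self.0.zip(rhs.0).map((|(x, y)| x + y) as fn((f32, f32)) -> f32))
    }
}

fn main() {
    let a = vec![1.0_f32, 2.0, 3.0];
    let b = vec![10.0_f32, 20.0, 30.0];
    let c = vec![100.0_f32, 200.0, 300.0];
    // one fused pass over all three inputs, no temporaries materialized
    let sum: Vec<f32> =
        (Elementwise(a.into_iter()) + Elementwise(b.into_iter()) + Elementwise(c.into_iter()))
            .0
            .collect();
    assert_eq!(sum, vec![111.0, 222.0, 333.0]);
}
```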

Ideally, the compiler would gain support for generic closures, which would make it possible to pass a polymorphic closure to the iterator combinator and thereby choose the vector size and instruction set at runtime.

This would suggest a different change to the design, though, which is to have a Simd<T, N, ISA>. That would allow, for example, different types that use either AVX2 or AVX10/256 (otherwise the polymorphic closure scheme can't differentiate between them, since the vector length is the same), and would also allow using either instructions from ISA extensions or a replacement sequence.

@andy-thomason

generates terrible code for very large vector types (e.g. generating 10000 vector add instructions instead of a loop)

e.g. terrible code LLVM generates for a Simd<f32, 400> addition: https://llvm.godbolt.org/z/6r6EfjjfE

As always, this depends on your choice of CPU. If Rust used target-cpu=native by default...

https://llvm.godbolt.org/z/xjsndxc1b

I agree that we are constantly fighting with dreadful LLVM codegen, regardless of the target.
My libm library was plagued by the need to generate vroundps, which LLVM is incapable of emitting.

@programmerjake
Member

As always, this depends on your choice of CPU. If Rust used target-cpu=native by default...

all that does is increase the upper limit for somewhat reasonable code, if using a 4000 element vector you go back to terrible code: https://llvm.godbolt.org/z/Wf9o5jT67

@anderspapitto

I'm using a personal arbitrary-length wrapper layer. A couple of notes:

  • making this API is strictly harder than making an API for fixed-length native arrays (without SIMD), which is already a massive pain. E.g. writing functions like these requires propagating N * 2, N / 4, etc. bounds through your entire codebase, and SIMD can only make it harder.
fn repeat_twice<const N: usize>(x: [u32; N]) -> [u32; N * 2] { mem::transmute([x, x]) }
fn reinterpret_u16_as_u64<const N: usize>(x: [u16; N]) -> [u64; N / 4] { mem::transmute(x) }
  • "bitmask" apis need overhaul because u64 is no longer guaranteed to fit all the bits - you want something like [u8; N / 8] which again requires carrying around the N / 8 bound

  • some instructions inherently cannot be scaled up to arbitrary lengths, e.g. _mm512_conflict_epi32 , which does a quadratic number of comparisons (roughly, every lane compared with every other lane). I guess such operations could have additional bounds on them without affecting the rest of the api?

  • if a vector is size 48, should it be processed as a chunk of size 32 and then a chunk of size 16, or three chunks of size 16? Should the user be able to control this choice? If the user can choose, then maybe the representation should instead be e.g. [[u32; CHUNK_SIZE]; N / CHUNK_SIZE] with user-controlled CHUNK_SIZE where LaneCount<CHUNK_SIZE>: SupportedLaneCount
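The last question can be sketched on plain arrays in stable Rust (no std::simd; `add_assign_chunked` and `CHUNK` are illustrative names): the chunk size becomes a const parameter the user picks, and each fixed-size chunk is what would map to a native vector operation.

```rust
// Process a buffer in user-chosen fixed-size chunks; a size-48 input with
// CHUNK = 16 is handled as three chunks of 16, matching the question above.
fn add_assign_chunked<const CHUNK: usize>(a: &mut [u32], b: &[u32]) {
    assert_eq!(a.len(), b.len());
    assert_eq!(a.len() % CHUNK, 0, "length must be a multiple of the chunk size");
    for (ca, cb) in a.chunks_exact_mut(CHUNK).zip(b.chunks_exact(CHUNK)) {
        // Each fixed-size chunk is a candidate for a single native vector op.
        for i in 0..CHUNK {
            ca[i] = ca[i].wrapping_add(cb[i]);
        }
    }
}
```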

@hsivonen
Member

They're difficult to use and not a very good API, but Rust doesn't currently allow for anything much better. We could hold off stabilizing arbitrary swizzles, but that's a big limitation.

Considering that shuffles/swizzles with compile-time-constant lane indices have worked in Rust since at least 2015 without substantial change to the API shape, I think it's well past time to stabilize the API shape that exists.

So I think it makes sense to stabilize the simd_swizzle! macro; if an API that's perceived to be better becomes possible later, another API can then be introduced alongside it.

@calebzulawski
Member Author

Portable SIMD has taken a substantially different approach to swizzles than e.g. the x86 shuffle intrinsics. We really wanted to avoid a magic const argument that predates const generics, and it's important to be able to generate the control mask with a const fn. This has resulted in a rather clunky but functional Swizzle trait.
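As a rough illustration of that shape on plain arrays (names mirror but are not the actual std::simd API): the control indices live in an associated const, so they can be computed by a const fn rather than passed as a magic argument. In this sketch an out-of-range index simply panics at runtime.

```rust
// Sketch of the Swizzle-trait idea on plain arrays.
trait Swizzle<const N: usize> {
    const INDEX: [usize; N];

    fn swizzle<T: Copy>(v: [T; N]) -> [T; N] {
        core::array::from_fn(|i| v[Self::INDEX[i]])
    }
}

// The control mask can come from a const fn rather than a literal.
const fn reversed<const N: usize>() -> [usize; N] {
    let mut idx = [0; N];
    let mut i = 0;
    while i < N {
        idx[i] = N - 1 - i;
        i += 1;
    }
    idx
}

struct Reverse;
impl Swizzle<4> for Reverse {
    const INDEX: [usize; 4] = reversed::<4>();
}
```

The clunkiness shows up in the boilerplate: every distinct swizzle needs its own marker type implementing the trait.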

It should be safe to stabilize simd_swizzle! (at least once other things are stabilized), but my biggest concern is that we may need to then keep two swizzle implementations around, because the macro is fairly sensitive to the type deduction of its implementation.

@RalfJung
Member

Why is this even using a trait? If that's just to work around making the final argument const, then I think this can be expressed much better with a const { ... } block.

However, that would still expose the underlying simd_shuffle intrinsic fairly directly, so -- we'd have to be very careful here with stabilization.

@calebzulawski
Member Author

We can't expose the intrinsic directly because we do some manipulation of the const indices before passing them into the intrinsic; passing them via an associated const allows that to be done in regular Rust, since it isn't bypassing the usual type system.

@RalfJung
Member

You should be able to do the same manipulation inside the const { ... } block.
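A sketch of what that could look like with an inline const block (stable since Rust 1.79; `rotate_one` is an illustrative name, not a std::simd function): the index manipulation happens inside the const { ... } body, which is evaluated at compile time even though it uses the generic N.

```rust
fn rotate_one<T: Copy, const N: usize>(v: [T; N]) -> [T; N] {
    // Indices computed at compile time inside the const block.
    let idx: [usize; N] = const {
        let mut idx = [0usize; N];
        let mut i = 0;
        while i < N {
            idx[i] = (i + 1) % N;
            i += 1;
        }
        idx
    };
    core::array::from_fn(|i| v[idx[i]])
}
```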

@calebzulawski
Member Author

I think I'm following... but do we want more const args? I think I even saw an issue somewhere encouraging const args to be removed in an edition change or something; I'm not sure where they stand right now.

@RalfJung
Member

Oh I see, the problem is that the macro shouldn't call simd_shuffle directly but wrap it in a function, and then the question is how does the function receive the index list as a constant. Ideally it would be a const generic but those are quite limited...

@calebzulawski
Member Author

calebzulawski commented Sep 14, 2024

I think it's possible to write a macro something like the following (I have a prototype):

fn some_swizzle<T, const LEN: usize, const PARAM1: usize, const PARAM2: usize>(x: Simd<T, LEN>) -> Simd<T, LEN> {
    simd_swizzle!(x,
        const { /* some expression using LEN, PARAM1, PARAM2 */ },
        len = LEN,
        const_parameters = (PARAM1, PARAM2),
    )
}

But is something like this any better than implementing the Swizzle trait? I'm having trouble imagining an idiomatic interface for this macro, but it does remove the const generics type inference magic you otherwise need to understand to implement the trait. Notably, const_parameters can be any outer parameter, not just one from the function the macro is called in.
