
Generic SIMD types and operations are not a substitute for intrinsics #7702

Open
lemaitre opened this issue Jan 6, 2021 · 20 comments

@lemaitre

lemaitre commented Jan 6, 2021

First, I think having generic SIMD types like @Vector(T, N) (#903 or any other syntax) with most arithmetic operations defined on them is really nice and is useful to many people.

However, this will never give all the power of intrinsics, because a generic interface will never be able to cover all instructions from all SIMD ISAs (even with tradeoffs). Vendors will always be creative and invent new instructions to cover specialized use cases that don't necessarily match what other vendors do.
Plus, even when an operation exists in multiple vendor ISAs and is emulatable in the others, the exact semantics might differ, leading to tradeoffs that penalize people who want maximal performance (the main goal of SIMD).

Examples (far from exhaustive):

  • incompatibilities of 1/sqrt(x):

    ISA       intrinsic            precision
    SSE       _mm_rsqrt_ps         1.5p-12
    KNCNI     _mm512_rsqrt23_ps    1p-23
    AVX512F   _mm512_rsqrt14_ps    1p-14
    AVX512ER  _mm512_rsqrt28_ps    1p-24
    Neon      vrsqrteq_f32         1p-8
    Altivec   vec_rsqrte           1p-12
    VSX       vec_rsqrte           1p-14
  • incompatibilities of fused c - a*b (all compatible with IEEE 754, but they can give different results with NaNs and infinities):

    ISA      intrinsic                  meaning
    AVX2     _mm256_fnmadd_ps(a, b, c)  (-(a * b)) + c
    Neon     vfmsq_f32(c, a, b)         ((-a) * b) + c
    Altivec  vec_nmsub(a, b, c)         -((a * b) + (-c))
  • unique operations, hard/slow to emulate:

    ISA             intrinsic              meaning
    AVX512F, KNCNI  _mm512_add_round_ps    addition with a specified rounding mode (without overhead)
    AVX512CD        _mm512_conflict_epi32  compute a vector of "masks" indicating lanes having the same values
    SVE2            svbdep                 expand the bits of the first input to the positions indicated by the bits of the second input, lane-wise
  • any VLA SIMD ISA, like SVE or the RISC-V Vector extension.

Those are only a few examples of problematic interfaces that cannot be abstracted easily/efficiently. There are many, many more problems, and listing them all would be futile.
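To make the rsqrt row concrete: a portable 1/sqrt(x) operation has to refine the hardware estimate with Newton-Raphson steps, and how many steps are needed depends on the precision of the target's estimate instruction, so a single generic semantic pessimizes somebody. A minimal Zig sketch of the refinement (illustrative only; the function name and the fixed 4-lane width are assumptions):

fn refineRsqrt(x: @Vector(4, f32), estimate: @Vector(4, f32), comptime steps: u32) @Vector(4, f32) {
    // each Newton-Raphson step roughly doubles the number of correct bits:
    // a 1p-8 Neon estimate needs two steps to reach ~1p-24, while a 1p-14
    // AVX512F estimate needs only one.
    var y = estimate;
    var i: u32 = 0;
    while (i < steps) : (i += 1) {
        const half: @Vector(4, f32) = @splat(0.5);
        const three_halves: @Vector(4, f32) = @splat(1.5);
        y = y * (three_halves - half * x * y * y);
    }
    return y;
}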

To me, the only way to solve all those problems is to provide intrinsics. Of course, using intrinsics leads to less portable code, but it gives the most control to the user.

@rohlem
Contributor

rohlem commented Jan 6, 2021

A question as someone who's not as familiar with this topic: what is the "correct" / a suitable interface level for these "intrinsics"? Or rather, do you have a concrete proposal in mind?

If a developer wants to reliably produce machine instructions, I think inline assembly is the most straightforward tool for implementing this, which is already supported (and afaik compatible with variables declared to be of type @Vector?).
Wrapping this in a function (in most cases inlined) also makes sense. This sounds like a userspace task to me; I don't think there's a lot the language can do in this regard (other than reduce friction, if problems arise).
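For illustration, a minimal sketch of such a wrapper (not from this thread; the function name is made up), wrapping SSE's rsqrtps over a @Vector operand with inline assembly:

fn mm_rsqrt_ps(v: @Vector(4, f32)) @Vector(4, f32) {
    // "x" constrains the operand to an SSE register (AT&T operand order)
    return asm ("rsqrtps %[v], %[ret]"
        : [ret] "=x" (-> @Vector(4, f32)),
        : [v] "x" (v),
    );
}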

There is the option of builtin functions, like we have @clz or @sqrt, but as you've pointed out, "the exact semantics might differ", making me think language-level builtins aren't the right facility for this.
If I've used some specialized "fused multiply add", and find a builtin @fmadd to replace it with, I would personally expect its behaviour to be standardized across platforms supported by Zig, which would require an abstraction layer on top. That doesn't sound compatible with what you have in mind.

@Snektron
Collaborator

Snektron commented Jan 6, 2021

One thing to consider is that LLVM provides many architecture-dependent intrinsics at the IR level, which might help optimize them. With asm statements this is typically not easily possible.

I think the most proper way of implementing something like this is providing a std module akin to intrin.h, which provides inline functions based on the current backend. If the backend is LLVM, an intrinsic can simply be imported using extern fn @"llvm.x86.mmx.pand"(...) ...;, and otherwise (for stage 2) an asm statement can be provided.
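A rough sketch of that idea, assuming the LLVM backend (llvm.x86.sse.rsqrt.ps is an existing LLVM intrinsic; the public wrapper name is only illustrative):

// binds directly to an LLVM IR intrinsic; only meaningful with the LLVM backend
extern fn @"llvm.x86.sse.rsqrt.ps"(@Vector(4, f32)) @Vector(4, f32);

pub inline fn _mm_rsqrt_ps(a: @Vector(4, f32)) @Vector(4, f32) {
    return @"llvm.x86.sse.rsqrt.ps"(a);
}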

@lemaitre
Author

lemaitre commented Jan 6, 2021

First, inline assembly cannot introduce new types, so you still need compiler support for intrinsic types. For instance, SVE vector types cannot be implemented with a regular @Vector(T, N) because no such comptime N exists.
So you would still need compiler support at some point.

The problem with inline assembly is that the compiler may lose many optimization opportunities. For example: constant folding, common subexpression elimination, dead code elimination, instruction merging... Of course, this comment is valid only when the backend is aware of such operations, which is the case for LLVM.

In fact, LLVM already has all the machinery required for all the intrinsics of at least x86/ARM/PowerPC, because those are implemented for clang. Thus, it should be "only" a matter of accessing this through Zig. @"llvm.x86.mmx.pand"() might be good enough for the implementation of the system header, but it does not abstract away the backend and thus is not really meant to be used by users.

As a reminder, intrinsics are defined/specified by CPU vendors, not compiler devs, and thus intrinsics code should work on any compiler without modification. So I would stick with plain old _mm_and_si128() defined in its own module (mimicking the C headers?). This would at least match other languages.
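For operations that map cleanly onto @Vector, such a module entry could be as thin as the following sketch (the two-lane i64 representation of __m128i mirrors clang's definition; the module layout is an assumption):

// e.g. in a hypothetical emmintrin.zig
pub const __m128i = @Vector(2, i64);

pub inline fn _mm_and_si128(a: __m128i, b: __m128i) __m128i {
    return a & b; // lowers to a single pand/vpand
}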

@jedisct1
Contributor

jedisct1 commented Jan 6, 2021

The problem with inline assembly is that the compiler may lose many optimization opportunities.

And the main thing it loses is information about latency and throughput, leading to very poor scheduling.

The lack of intrinsics is also an issue for instructions providing crypto acceleration.

@rohlem
Contributor

rohlem commented Jan 6, 2021

The arguments sound reasonable, but I still see a conflict between these two points:

  • We don't want the compiler to lose as many optimization opportunities as with inline assembly.
  • "[I]ntrinsics code should work on any compilers without any modification".

(I still assume that an intrinsic can be treated, at some point, as "a block of machine instructions"? Please correct me on this.)
Every time a CPU vendor publishes a new intrinsic, we need a way to tell the compiler both

  • (A) how to compile to it, i.e. the machine code representation that should end up in the binary
  • (B) its effects on the used registers etc., so it "understands" it enough to still employ the optimizations you mentioned, that we expect from Zig code.

So to me it appears we should either enhance our current facilities (only inline assembly comes to mind) to support this, or introduce a new language feature that can be used to supply this information without requiring us to alter the language for every new intrinsic.

Specifically, the point about introducing new types sounds difficult to me, though. There might need to be some way to implement them in userspace (f.e. a standard library module - maybe using a new language construct, if packed structs + logic aren't flexible enough?), otherwise they might never be transparent enough for the compiler to optimize the code as you suggest.

((Crazy off-the-wall idea, the Zig compiler could totally build comptime-known / build-time code as a "compiler module", and load that dynamically to use in the build process. But at that point we're slipping into designing our own compiler framework.))

@S0urc3C0de

I'd like to add that one advantage of having a Vector type is that it allows writing SIMD code in a readable way - much like clang's and gcc's vector extensions do. You can (and, for performance, probably must) still adjust the code to the architecture you want to use, but you don't have to learn a new set of intrinsics for each one, which simply adds unreadable and hard-to-maintain code to your codebase.
I recently implemented the Mandelbrot set in C, both using intrinsics directly (AVX2) and using clang's vector extensions - and clang/LLVM emitted pretty much the same instructions as the intrinsics version.
I don't see too many reasons to use intrinsics other than when the vector extensions lack some very specific functionality.

@lemaitre
Author

lemaitre commented Jan 7, 2021

@rohlem

  • We don't want the compiler to lose as many optimization opportunities as with inline assembly.
  • "[I]ntrinsics code should work on any compilers without any modification".

(I still assume that an intrinsic can be treated, at some point, as "a block of machine instructions"? Please correct me on this.)

In fact, no. An intrinsic is not just a block of machine instructions, in the exact same way that a scalar addition isn't.
That view could be a quick and dirty way to implement new instructions while waiting for the backend to catch up for the target architecture.

For the portability aspect across compilers, just look at C:

#include <immintrin.h>

void add(float* A, const float* B, int n) {
  // assume n is a multiple of 16 for simplicity
  for (int i = 0; i < n; i += 16) {
    __m512 a = _mm512_load_ps(&A[i]);
    __m512 b = _mm512_load_ps(&B[i]);
    // static rounding also requires suppressing exceptions (SAE)
    a = _mm512_add_round_ps(a, b, _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC);
    _mm512_store_ps(&A[i], a);
  }
}

You can take this C code and compile it as-is with any compiler supporting AVX512: gcc, clang, icc, msvc.
There is no compiler-specific code in here. Of course, compilers need to be adapted to support new ISAs, and this is quite a daunting task. But the heavy work here has already been done inside LLVM (mostly by CPU vendors, btw), and it should be "just" a matter of making the link between Zig intrinsics and LLVM builtins.

Every time a CPU vendor publishes a new intrinsic, we need a way to tell the compiler both

  • (A) how to compile to it, i.e. the machine code representation that should end up in the binary
  • (B) its effects on the used registers etc., so it "understands" it enough to still employ the optimizations you mentioned, that we expect from Zig code.

So to me it appears we should either enhance our current facilities (only inline assembly comes to mind) to support this, or introduce a new language feature that can be used to supply this information without requiring us to alter the language for every new intrinsic.

This is true, but as I said, this work is mostly done by the LLVM backend. The Zig frontend should forward intrinsics pretty much as-is to the backend. So it would just be a matter of registering new intrinsics.

Specifically, the point about introducing new types sounds difficult to me, though. There might need to be some way to implement them in userspace (f.e. a standard library module - maybe using a new language construct, if packed structs + logic aren't flexible enough?), otherwise they might never be transparent enough for the compiler to optimize the code as you suggest.

As far as I understand, @Vector(T, N) already introduces new types. This is not much different from that. But some intrinsics require more types, like mask registers for AVX512. If the backend supports those types, then it is quite easy to do.
SIMD types have nothing special compared to scalar types (except that they are much more specific to the target architecture).

One ISA that might be complex to integrate is SVE's SIMD registers (e.g. svfloat32_t), because they don't have a comptime size. But Zig might already have what is needed to handle them, because it already has a concept of pointers to types of unknown size.

@S0urc3C0de

I'd like to add that one advantage of having a Vector type is that it allows writing SIMD code in a readable way - much like clang's and gcc's vector extensions do. You can (and, for performance, probably must) still adjust the code to the architecture you want to use, but you don't have to learn a new set of intrinsics for each one, which simply adds unreadable and hard-to-maintain code to your codebase.
I recently implemented the Mandelbrot set in C, both using intrinsics directly (AVX2) and using clang's vector extensions - and clang/LLVM emitted pretty much the same instructions as the intrinsics version.
I don't see too many reasons to use intrinsics other than when the vector extensions lack some very specific functionality.

You seem to have missed the point of this issue. Generic SIMD types and operations are really useful and fit many algorithms, and thus should be kept.
My point is that for some code, the generic syntax will not be as efficient as it could be, for the reasons I highlighted in this thread. For that code, intrinsics will be helpful.

Mandelbrot is really simple code that does not require complex instructions, so abstract compiler builtins are enough. But it would be really difficult to implement video codecs or SIMD JSON parsing with only those, without using any intrinsics or inline assembly.

@S0urc3C0de

@lemaitre It seems I have indeed missed it. I apologise!

@rohlem
Contributor

rohlem commented Jan 8, 2021

@lemaitre Thank you for the detailed follow-up! I think I understand the idea a little better now...

From the position that Zig builds on top of LLVM (as a frontend to its backends) - which it currently always does - this sounds like a worthwhile use case to support in the language.
However, the future plans for Zig include a compiler with a fully-self-hosted mode (so it would only make use of LLVM if explicitly requested, for fully-optimized builds), and an LLVM-independent language specification.
While I'm sure the discussion for implementation via specific backends (like LLVM) is worthwhile, and appreciated, we should also consider if we want to "burden" the core language specification (or rather, everyone in the future planning to implement it) with these options.


(C code example)
You can take this C code and compile it as-is with any compiler supporting AVX512: gcc, clang, icc, msvc.

To me this still looks implementable as a library module/package. For example, a package avx512f could provide a type mm512, and functions mm512_load_ps, mm512_add_round_ps, mm512_store_ps, etc.
Then, within that module, comptime code would need to identify the compiler implementation/version/backend, and provide fitting implementations. These can use non-standard extensions provided only by that compiler (like facilities forwarded to the LLVM backend).

I guess it could also be done via a new builtin function, f.e. @intrinsic(comptime name: []const u8), that returns whatever kind of entity you're referring to. So @intrinsic("__mm512") would return a type, a function name would return that function, etc.
The question is if this gives us any more flexibility when implementing it, in comparison to writing a userspace library (where any backend's implementation is free to be arbitrarily complex).

If the compiler is not matched with a suitable implementation, the library approach could instead emulate the semantics via a different data structure that is not guaranteed to be hardware-backed in the same way. To avoid this happening unknowingly, there could be a comptime flag to @compileError out instead of falling back to the emulation code.
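A sketch of that policy (the allow_emulation switch and the load wrapper are hypothetical; feature detection via std.Target is assumed to be available):

const std = @import("std");
const builtin = @import("builtin");

const allow_emulation = false; // could come from a build option

pub inline fn _mm512_load_ps(ptr: *align(64) const [16]f32) @Vector(16, f32) {
    if (comptime std.Target.x86.featureSetHas(builtin.cpu.features, .avx512f)) {
        return ptr.*; // a single aligned 512-bit load on AVX512 targets
    } else if (comptime allow_emulation) {
        return ptr.*; // backend is free to emulate with narrower vectors
    } else {
        @compileError("AVX512F not available on this target and emulation is disabled");
    }
}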

Going off of this understanding, we would need to propose/design the following facilities:

  • Sufficient compiler detection (f.e. via additional fields in std.builtin).
    I think you can already query the compiler version, and whether LLVM is enabled in stage 2, but this would include identifying if you're using the "standard/mainline" implementation, or some third-party Zig compiler.
  • For the LLVM backend, a facility to address their implementations of these intrinsics.
  • A coordinated effort for intrinsics modules, like sse, avx512f, and all the other ones you mentioned.
    I personally think these can live outside of the standard library, but I'm not sure I fully understand the implications other than discoverability vs maintenance effort.
  • EDIT (in response to the comment below): Maybe a way to check for the presence of modules, as in @hasImport?
    Although I assume the current plan would be for this to be handled via the build system. Both options might have their benefits, I think this also warrants discussion. Maybe even on a more general scope, "what are the intended capabilities of Zig WITHOUT the build system".

Not sure if I just reiterated the obvious here (sans misunderstandings), I guess I wanted to reduce the idea to some concrete actionable set of decisions/proposals.

@lemaitre
Author

lemaitre commented Jan 9, 2021

However, the future plans for Zig include a compiler with a fully-self-hosted mode (so it would only make use of LLVM if explicitly requested, for fully-optimized builds), and an LLVM-independent language specification.
While I'm sure the discussion for implementation via specific backends (like LLVM) is worthwhile, and appreciated, we should also consider if we want to "burden" the core language specification (or rather, everyone in the future planning to implement it) with these options.

Well, intrinsics are not part of the language itself, but are tied to the target architecture. It makes no sense to provide the __m128 type on ARM. So a valid Zig compiler would not need to implement any of these intrinsics as long as it does not claim to support those architectures. SSE might be problematic, as I think it is a mandatory part of x86_64, but AVX is definitely not mandatory.
Even if a Zig compiler does not use the LLVM backend (or even a GCC backend), it can always implement intrinsics with inline assembly for simplicity. It's just that it will miss some extra optimizations that LLVM and GCC provide.

To me this still looks implementable as a library module/package. For example, a package avx512f could provide a type mm512, and functions mm512_load_ps, mm512_add_round_ps, mm512_store_ps, etc.

I think that's the way to go, yes. Except for the name of the package: I would advise keeping the name of the C header (*mmintrin for x86, arm_neon for ARM...). Likewise, keep the __ and _ in front of intrinsic names. Just to have the same names in all languages. BTW, only x86 intrinsics have _ in front of their names.

I guess it could also be done via a new builtin function, f.e. @intrinsic(comptime name: []const u8), that returns whatever kind of entity you're referring to. So @intrinsic("__mm512") would return a type, a function name would return that function, etc.

For me, this could be a way for internal use, but it would be a bit strange for end-users compared to plain functions and types.

About discoverability: in C, vendors usually specify macros to detect the presence of intrinsics, e.g. __SSE2__, __AVX__, __ARM_NEON...
I don't know if there is a way to have something similar in Zig, but I would guess so. The most important point is that you should not need to know in advance which flags can ever exist in order to implement this. In C, you detect them with #ifdef, which "evaluates" to false if the token is unknown. No error here.
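A sketch of how this could look in Zig through builtin.cpu (assuming current std.Target APIs; the constant names simply mirror the C macros):

const std = @import("std");
const builtin = @import("builtin");

// comptime analogues of C's __SSE2__ / __AVX__ / __ARM_NEON checks
pub const has_sse2 = builtin.cpu.arch.isX86() and
    std.Target.x86.featureSetHas(builtin.cpu.features, .sse2);
pub const has_avx = builtin.cpu.arch.isX86() and
    std.Target.x86.featureSetHas(builtin.cpu.features, .avx);
pub const has_neon = builtin.cpu.arch.isAARCH64() and
    std.Target.aarch64.featureSetHas(builtin.cpu.features, .neon);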

I hope this makes it clearer.

@ghost

ghost commented Jan 10, 2021

@lemaitre, what do you think of a hybrid approach:

  • SIMD data could be represented using portable types (@Vector for packed data, ordinary slice for VLA, bit vectors for mask and predicate registers, etc).
  • SIMD operations would be provided by a library of platform-specific intrinsics.

This way we could retain fine-grained control for optimal performance / precision / timing, while still leaving the door open for more portable and easy to use abstractions that don't touch any intrinsics directly.

@lemaitre
Author

  • SIMD data could be represented using portable types
    @Vector for packed data

Yeah, no problem with that.

ordinary slice for VLA

No, that cannot work: a VLA register is not a view of an array of unknown bound, it actually is the array. So copying has a different meaning.
BTW, here VLA stands for Vector Length Agnostic (see SVE for reference), not Variable Length Array.

bit vectors for mask and predicate registers

Different ISAs implement masks and predicates differently. So again, not really possible.

You could have a @VectorMask(T, N) that is implemented differently depending on the target architecture, though.
For instance, on SSE, @VectorMask(f32, 4) would be __m128 (or at least implemented like it), while on AVX512 it would be __mmask8.
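A userspace sketch of that idea (the VectorMask name and the single AVX512F test are illustrative; real mask-type selection would be more involved):

const std = @import("std");
const builtin = @import("builtin");

fn VectorMask(comptime T: type, comptime n: u16) type {
    if (std.Target.x86.featureSetHas(builtin.cpu.features, .avx512f)) {
        // AVX512 style: one bit per lane, like __mmask8/__mmask16
        return std.meta.Int(.unsigned, n);
    } else {
        // SSE/Neon style: an all-ones or all-zeros lane as wide as T
        return @Vector(n, std.meta.Int(.signed, @bitSizeOf(T)));
    }
}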

  • SIMD operations would be provided by a library of platform-specific intrinsics.

That is the idea.

One key point that I think I've not explained well yet is that intrinsics are tied to the target architecture. This means that a lightweight compiler would be allowed to not implement them. This is especially important because it means that Zig (as a language) would not really have to care about complex intrinsic types like masks and VLA registers, as those are only required for high-end SIMD ISAs.
Only compilers targeting high performance should care about AVX* and SVE, for example. Those high-performance compilers are likely to use either the LLVM or GCC backend, which already have all the machinery to implement those intrinsics efficiently.

@ghost

ghost commented Jan 11, 2021

No, that cannot work: a VLA register is not a view of an array of unknown bound, it actually is the array. So copying has a different meaning.

Could you explain this point? From my (admittedly limited) understanding of SVE, the idea is that vectors of unknown length can be loaded directly from memory using a base pointer and a length. To me, this sounds exactly like a view/slice.

Different ISAs implement masks and predicates differently. So again, not really possible. You could have a @VectorMask(T, N) that is implemented differently depending on the target architecture, though.

I want to like that idea, but I don't think it would work. Presumably the SIMD data types need to be concrete, so they can be instantiated and passed around. This means @VectorMask would need an extra ISA parameter, like @VectorMask(.SSE, T, N), which would negate the benefit. I'm still hoping that there is at least some kind of useful commonality to exploit, but maybe that's a fool's errand 😄.

Edit: The SIMD-ISA parameter could also be made a project-wide build variable, so maybe that's a non-issue.

@lemaitre
Author

Could you explain this point? From my (admittedly limited) understanding of SVE, the idea is that vectors of unknown length can be loaded directly from memory using a base pointer and a length. To me, this sounds exactly like a view/slice.

SVE is much more akin to regular SIMD ISAs than that. SVE registers (like svfloat32_t) are value types that can be copied, passed by value, and returned. In assembly, they can be pushed to and popped from the stack like any other register values.
The difference with regular SIMD is that their length is not known at compile time, but it is still a runtime constant (all SIMD registers have the same length for at least the entire lifetime of the process).
Semantically, they are close to std::vector from C++, but where the size is defined by a global constant. However, they need no memory to exist, as they are bound to registers.

Here is an SVE example in C:

#include <arm_sve.h>

int reduce_add(const int* A, int n) {
  // assume for the sake of simplicity that n is a multiple of the vector length
  svbool_t pg = svptrue_b32(); // in actual code, the predicate is updated at every iteration
  svint32_t sum = svdup_s32(0); // sum is a SIMD register and is not backed by memory
  for (int i = 0; i < n; i += svcntw()) { // i is incremented by the SIMD length
    svint32_t a = svld1(pg, &A[i]); // no length specified here
    sum = svadd_m(pg, sum, a);
  }
  return svaddv(svptrue_b32(), sum); // reduce the content of sum
}

Once compiled, there are no memory accesses apart from the loads from A. Both sum and pg stay in registers, and no memory is allocated for them (neither from the heap, nor on the stack). I would really encourage you to have a look at the SVE documentation: https://developer.arm.com/documentation/100987/latest

Anyway, as such support would be optional, I think there is no need to specify how the language would accommodate such a weird ISA. This question is only useful for the implementation of the compiler, where we can assume LLVM. In that case, everything is already in place in LLVM.

I want to like that idea, but I don't think it would work. Presumably the SIMD data types need to be concrete, so they can be instantiated and passed around. This means @VectorMask would need an extra ISA parameter, like @VectorMask(.SSE, T, N), which would negate the benefit. I'm still hoping that there is at least some kind of useful commonality to exploit, but maybe that's a fool's errand 😄.

Edit: The SIMD-ISA parameter could also be made a project-wide build variable, so maybe that's a non-issue.

I'm not a super fan of this solution, but I don't think it is necessary for implementing intrinsics anyway. If such a generic vector mask type is implemented, a SIMD-ISA parameter might be a good idea in order to have multiple ISAs in the same binary (and maybe for ABI compatibility?).

The key point of intrinsics is that they are not meant to be generic across ISAs. They are meant to match the target ISA as closely as possible. ISAs do not necessarily share the same concepts, so it would be hard to abstract (efficiently) every ISA in its entirety in one language.


To me, intrinsics are mandatory for high performance code in order to get the really low-level operations provided by the target architecture.
But for lightweight compilers, it would be ok to just omit them: if your compiler is lightweight, you don't necessarily expect it to support all the latest architectures, so not having the intrinsics for those (like AVX512) would be more than understandable.

Of course, this would mean that even a lightweight compiler able to target x86_64 would most likely need the SSE2 intrinsics, as SSE2 is a mandatory part of x86_64. But such a compiler could omit AVX and onwards, because those would be a different target.
I would not even mind if such a compiler omitted the SSE intrinsics, as code using intrinsics should still have a scalar fallback.

@ghost

ghost commented Jan 11, 2021

Thanks for the detailed explanation!

I guess SVE is not quite as straightforward as I imagined 😄.

@Vexu Vexu added the proposal This issue suggests modifications. If it also has the "accepted" label then it is planned. label Jan 26, 2021
@Vexu Vexu added this to the 0.8.0 milestone Jan 26, 2021
@momumi
Contributor

momumi commented Apr 15, 2021

I agree, there should be a way to access instruction-specific intrinsics, as is common in C/C++.

For reference, here's Intel's guide to all the x86 intrinsics for use in C/C++; there are literally hundreds (thousands?!) of them. Note that the way you generally access them in C is through header files, e.g. #include <nmmintrin.h>.

Rust has core::arch for architecture specific intrinsics and std::intrinsics for higher level compiler abstractions.

I also wrote this proposal which suggests moving intrinsics out of the default namespace and into the standard libraries.

@dmbfm

dmbfm commented Dec 24, 2021

I have been experimenting a little with @Vector vs. inline assembly SIMD/NEON, and the performance with @Vector has not been great, sometimes worse than non-vector alternatives. I plan to write up the results I got in more detail, but having a way to use SIMD intrinsics in Zig would definitely be invaluable.

@sharpobject
Contributor

sharpobject commented Dec 26, 2022

I've been working on a project that uses simdjzon, and what I've found is that the approach of declaring functions in C that call C intrinsics taking an immediate argument, or declaring C intrinsics as extern functions directly, results in a binary in which these functions are not inlined. This is pretty suboptimal.

For example, simdjzon defines some C functions like this:

__m256i _prev1(__m256i a, __m256i b) {
    return _mm256_alignr_epi8(a, _mm256_permute2x128_si256(b, a, 0x21), 16 - 1);
}

and the resulting binary contains this

00000000002d6d10 <_prev1>:
  2d6d10:       55                      push   %rbp
  2d6d11:       48 89 e5                mov    %rsp,%rbp
  2d6d14:       c4 e3 75 46 c8 21       vperm2i128 $0x21,%ymm0,%ymm1,%ymm1
  2d6d1a:       c4 e3 7d 0f c1 0f       vpalignr $0xf,%ymm1,%ymm0,%ymm0
  2d6d20:       5d                      pop    %rbp
  2d6d21:       c3                      ret
  2d6d22:       66 66 66 66 66 2e 0f    data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
  2d6d29:       1f 84 00 00 00 00 00
...
0000000000b53de0 <simdjzon.dom.Utf8Checker.prev>:
  b53de0:       55                      push   %rbp
  b53de1:       48 89 e5                mov    %rsp,%rbp
  b53de4:       48 83 e4 e0             and    $0xffffffffffffffe0,%rsp
  b53de8:       48 81 ec 80 00 00 00    sub    $0x80,%rsp
  b53def:       c5 fd 7f 44 24 20       vmovdqa %ymm0,0x20(%rsp)
  b53df5:       c5 fd 7f 0c 24          vmovdqa %ymm1,(%rsp)
  b53dfa:       c5 fd 6f 44 24 20       vmovdqa 0x20(%rsp),%ymm0
  b53e00:       c5 fd 6f 0c 24          vmovdqa (%rsp),%ymm1
  b53e05:       e8 06 2f 78 ff          call   2d6d10 <_prev1>
  b53e0a:       c5 fd 7f 44 24 40       vmovdqa %ymm0,0x40(%rsp)
  b53e10:       c5 fd 6f 44 24 40       vmovdqa 0x40(%rsp),%ymm0
  b53e16:       48 89 ec                mov    %rbp,%rsp
  b53e19:       5d                      pop    %rbp
  b53e1a:       c3                      ret
  b53e1b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

which is not great. LLVM intrinsics cover many (most?) of the vector instructions available in the core-avx2 target, but not vpalignr. The Zig code below happens to generate vpalignr, but I was not able to make this approach work for vpshufb, so it may not be a good approach in general. vpshufb in particular is covered by an LLVM intrinsic, though.

fn vpalignr_please(a: u8x32, b: u8x32, comptime imm8: comptime_int) u8x32 {
    var ret: u8x32 = undefined;
    var i: usize = 0;
    while (i + imm8 < 16) : (i += 1) {
        ret[i] = b[i + imm8];
    }
    while (i < 16) : (i += 1) {
        ret[i] = a[i + imm8 - 16];
    }
    while (i + imm8 < 32) : (i += 1) {
        ret[i] = b[i + imm8];
    }
    while (i < 32) : (i += 1) {
        ret[i] = a[i + imm8 - 16];
    }
    return ret;
}

It would be nice if there was a single interface through which to access these instructions, instead of a mix of LLVM intrinsics and code that attempts to convince LLVM to generate the proper instruction. Thanks.
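For the vpshufb case mentioned above, binding its LLVM intrinsic directly should work, along the lines of this sketch (LLVM backend only; the wrapper name is made up):

// llvm.x86.avx2.pshuf.b is the LLVM intrinsic behind vpshufb
extern fn @"llvm.x86.avx2.pshuf.b"(@Vector(32, u8), @Vector(32, u8)) @Vector(32, u8);

inline fn vpshufb(a: @Vector(32, u8), b: @Vector(32, u8)) @Vector(32, u8) {
    return @"llvm.x86.avx2.pshuf.b"(a, b);
}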

@nano-bot

Is there any update on this issue? What will be the future of direct SIMD instructions given the departure from LLVM?

@sharpobject
Contributor

sharpobject commented Aug 28, 2024

@nano-bot I think there is no update, but you can use something like https://github.com/aqrit/sse2zig, which implements many familiar intrinsics using asm when they cannot be implemented in terms of the primitives provided by the language.
