Generic SIMD types and operations are not a substitute for intrinsics #7702
Question as someone who's not as familiar with this topic: what is the "correct" / a suitable interface level for these "intrinsics"? Or rather, do you have a concrete proposal in mind? If a developer wants to reliably produce machine instructions, I think inline assembly is the most straightforward tool for implementing this, which is already supported (and afaik compatible with vector-typed variables). There is also the option of builtin functions, like the ones we already have.
One thing to consider is that LLVM provides many architecture-dependent intrinsics at the IR level, which might help optimize them. With asm statements this is typically not easily possible. I think the most proper way of implementing something like this is providing a std module akin to intrin.h, which provides inline functions based on the current backend. If the backend is LLVM, an intrinsic can simply be imported.
First, inline assembly cannot introduce new types, so you still need compiler support for intrinsic types. For instance, SVE vector types cannot be implemented with a regular vector type. The problem with inline assembly is that the compiler may lose many optimization opportunities, for example: constant folding, common subexpression elimination, dead code elimination, instruction merging... Of course, this comment is valid only when the backend is aware of such operations, which is the case for LLVM. In fact, LLVM already has all the gears required for all intrinsics for at least x86/ARM/PowerPC, because those are implemented for clang. Thus, it should be "only" a matter of accessing this through Zig. As a reminder, intrinsics are defined/specified by CPU vendors, not compiler devs, and thus intrinsics code should work on any compiler without any modification. So I would stick with plain old intrinsics.
And the main thing it loses is information about latency and throughput, leading to very poor scheduling. The lack of intrinsics is also an issue for instructions providing crypto acceleration.
The arguments sound reasonable, but I still see a conflict between these two points:
(I still assume that an intrinsic can be treated, at some point, as "a block of machine instructions"? Please correct me on this.)
So to me it appears we should either enhance our current facilities (only inline assembly comes to mind) to support this, or introduce a new language feature that can be used to supply this information without requiring us to alter the language for every new intrinsic. Specifically, the point about introducing new types sounds difficult to me. There might need to be some way to implement them in userspace (f.e. a standard library module - maybe using a new language construct, if packed structs + logic aren't flexible enough?), otherwise they might never be transparent enough for the compiler to optimize the code as you suggest. ((Crazy off-the-wall idea: the Zig compiler could totally build comptime-known / build-time code as a "compiler module", and load that dynamically to use in the build process. But at that point we're slipping into designing our own compiler framework.))
I'd like to add that one advantage of having a Vector type is that it allows writing SIMD code in a readable way - much like clang's and gcc's vector extensions do. You can (and probably must, for performance) still adjust the code to the architecture you want to use, but you don't have to learn a new set of intrinsics for each architecture, which simply adds unreadable and hard-to-maintain code to your codebase.
In fact, no. Intrinsics are not just a block of machine instructions; the compiler understands them in the exact same way as a scalar addition. For the portability aspect across compilers, just look at C:

```c
void add(float* A, const float* B, int n) {
  for (int i = 0; i < n; i += 16) {
    __m512 a = _mm512_load_ps(&A[i]);
    __m512 b = _mm512_load_ps(&B[i]);
    a = _mm512_add_round_ps(a, b, _MM_FROUND_TO_ZERO);
    _mm512_store_ps(&A[i], a);
  }
}
```

You can take this C code and compile it as-is on any compiler supporting AVX-512: gcc, clang, icc, msvc.
This is true, but as I said, it is mostly done by the LLVM backend. The Zig frontend should forward intrinsics pretty much as-is to the backend. So it would be just a matter of registering new intrinsics.
As far as I understand, that is the case. One ISA that might be complex to integrate is SVE, with its scalable SIMD registers.
You seem to have missed the point of this issue. Generic SIMD types and operations are really useful and fit many algorithms, thus should be kept. Mandelbrot is really simple code that does not require complex instructions, so abstract compiler builtins are enough. But it would be really difficult to implement video codecs or SIMD JSON parsing with only those, without using any intrinsics or inline assembly.
@lemaitre It seems I have indeed missed it. I apologise!
@lemaitre Thank you for the detailed follow-up! I think I understand the idea a little better now... From the position that Zig builds on top of LLVM (as a frontend to its backends) - which it currently always does - this sounds like a worthwhile use case to support in the language.
To me this still looks implementable as a library module/package. For example, a package could expose the intrinsics as plain functions and types. I guess it could also be done via a new builtin function. If the compiler does not provide a suitable implementation, the library approach could instead emulate the semantics via a different data structure, one that is not guaranteed to be hardware-backed in the same way. To avoid this happening unknowingly, it could expose a way to detect which path is in use. Going off of this understanding, we would need to propose/design a concrete set of facilities.
Not sure if I just reiterated the obvious here (sans misunderstandings); I guess I wanted to reduce the idea to a concrete, actionable set of decisions/proposals.
Well, intrinsics are not part of the language itself, but are tied to the target architecture. It makes no sense to provide intrinsics for an architecture you are not targeting.
I think that's the way to go, yes. Except for the name of the package: I would advise keeping the name of the corresponding C header, so that intrinsics stay easy to find for people coming from C.
For me, this could be a way for internal uses, but it would be a bit strange for end-users compared to plain functions and types. About discoverability: in C, vendors usually specify some macros to detect the presence of intrinsics. I hope this makes it clearer.
@lemaitre, what do you think of a hybrid approach?
This way we could retain fine-grained control for optimal performance / precision / timing, while still leaving the door open for more portable and easy-to-use abstractions that don't touch any intrinsics directly.
Yeah, no problem with that.
No, that cannot work, as a VLA register is not a view of an unknown-bound array, but is actually the array. So copying has a different meaning.
Different ISAs implement masks and predicates differently. So again, not really possible. You could have a generic mask type, though.
That is the idea. One key point that I think I've not explained well yet is that intrinsics are tied to the target architecture. This means that a lightweight compiler would be allowed to not implement them. This is especially important because it means that Zig (as a language) should not really care about complex intrinsic types like masks and VLA registers, as those are only required for high-end SIMD ISAs.
Could you explain this point? From my (admittedly limited) understanding of SVE, the idea is that vectors of unknown length can be loaded directly from memory using a base pointer and a length. To me, this sounds exactly like a view/slice.
I want to like that idea, but I don't think it would work. Presumably the SIMD data types need to be concrete, so they can be instantiated and passed around. This means they would have to be parameterized by the SIMD ISA.

Edit: The SIMD-ISA parameter could also be made a project-wide build variable, so maybe that's a non-issue.
SVE is much more akin to regular SIMD ISAs than that. SVE registers live in actual hardware registers, not in memory. Here is some SVE example in C:

```c
int reduce_add(const int* A, int n) {
  // assume for the sake of simplicity that n is a multiple of the vector length
  svbool_t pg = svptrue_b32();  // in actual code, the predicate is updated at every iteration
  svint32_t sum = svdup_s32(0); // sum is a SIMD register and is not backed by memory
  for (int i = 0; i < n; i += svcntw()) { // i is incremented by the SIMD length
    svint32_t a = svld1(pg, &A[i]);       // no length specified here
    sum = svadd_m(pg, sum, a);
  }
  return svaddv(svptrue_b32(), sum); // reduce the content of sum
}
```

Once compiled, there are no memory accesses apart from the loads from A. Both pg and sum stay in registers.

Anyway, as such support would be optional, I think there is no need to specify how the language would accommodate such a weird ISA. This question is only useful in the implementation of the compiler, where we can assume LLVM. In that case, everything is already in place in LLVM.
I'm not a super fan of this solution, but I don't think it is necessary for implementing intrinsics anyway. If such a generic vector mask type is implemented, a SIMD-ISA parameter might be a good idea in order to have multiple ISAs in the same binary (and maybe for ABI compatibility?). The key point of intrinsics is that they are not meant to be generic across ISAs. They are meant to match the target ISA as closely as possible. ISAs do not necessarily share the same concepts, so it would be hard to abstract (efficiently) all the ISAs, in their entirety, in one language. To me, intrinsics are mandatory for high-performance code, in order to access the really low-level operations provided by the target architecture. Of course, this means that even a lightweight compiler able to target x86_64 would most likely need SSE2 intrinsics, as SSE2 is a requirement for x86_64. But such a compiler could omit AVX and onwards, because those would be a different target.
Thanks for the detailed explanation! I guess SVE is not quite as straightforward as I imagined 😄. |
I agree, there should be a way to access instruction-specific intrinsics, like is common in C/C++. For reference, here's Intel's guide to all the x86 intrinsics for use in C/C++; there are literally hundreds (thousands?!) of them. Note that the way you generally access them in C is through header files. Rust has core::arch for this. I also wrote this proposal, which suggests moving intrinsics out of the default namespace and into the standard libraries.
I have been experimenting a little with this.
I've been working on a project that uses simdjzon, and what I've found is that declaring functions in C that call intrinsics taking an immediate argument, or declaring C intrinsics as extern functions directly, results in a binary in which these functions are not inlined. This is pretty suboptimal. For example, simdjzon defines some C functions like this:

```c
__m256i _prev1(__m256i a, __m256i b) {
    return _mm256_alignr_epi8(a, _mm256_permute2x128_si256(b, a, 0x21), 16 - 1);
}
```

and the resulting binary contains an out-of-line call to this function, which is not great.

LLVM intrinsics cover many (most?) of the vector instructions available in the core-avx2 target, but not vpalignr. The below Zig code happens to generate vpalignr, but I was not able to make this approach work for vpshufb, so it may not be a good approach in general. vpshufb in particular is covered by an LLVM intrinsic, though.

```zig
fn vpalignr_please(a: u8x32, b: u8x32, comptime imm8: comptime_int) u8x32 {
    var ret: u8x32 = undefined;
    var i: usize = 0;
    while (i + imm8 < 16) : (i += 1) {
        ret[i] = b[i + imm8];
    }
    while (i < 16) : (i += 1) {
        ret[i] = a[i + imm8 - 16];
    }
    while (i + imm8 < 32) : (i += 1) {
        ret[i] = b[i + imm8];
    }
    while (i < 32) : (i += 1) {
        ret[i] = a[i + imm8 - 16];
    }
    return ret;
}
```

It would be nice if there were a single interface through which to access these instructions, instead of a mix of LLVM intrinsics and code that attempts to convince LLVM to generate the proper instruction. Thanks.
Is there any update on this issue? What will be the future of direct SIMD instructions given the departure from LLVM? |
@nano-bot I think there is no update, but you can use something like https://github.com/aqrit/sse2zig which implements many familiar intrinsics using asm when they cannot be implemented in terms of the primitives provided by the language.
First, I think having generic SIMD types like `@Vector(T, N)` (#903 or any other syntax) with most arithmetic operations defined on them is really nice and useful to many people. However, this will never give all the power of intrinsics, because the generic interface will never be able to cover all instructions from all SIMD ISAs (even with tradeoffs). Vendors will always be creative and invent new instructions to cover specialized work cases that don't necessarily match what other vendors do.

Plus, even when an operation exists in multiple vendor ISAs and is emulatable in the others, the exact semantics might differ, leading to tradeoffs that would penalize people who want maximal performance (the main goal of SIMD).

Examples (really far from exhaustive):

- `_mm_rsqrt_ps`
- `_mm512_rsqrt23_ps`
- `_mm512_rsqrt14_ps`
- `_mm512_rsqrt28_ps`
- `vrsqrteq_f32`
- `vec_rsqrte`
- `vec_rsqrte`
- `_mm256_fnmadd_ps(a, b, c)`
- `vfmsq_f32(c, a, b)`
- `vec_nmsub(a, b, c)`
- `_mm512_add_round_ps`
- `_mm512_conflict_epi32`
- `svbdep`

Those are only a few examples of problematic interfaces that cannot be abstracted easily/efficiently. There are many, many more problems, and listing them all would be futile.

To me, the only way to solve all those problems is by providing intrinsics. Of course, the use of intrinsics leads to less portable code, but it provides the most control to the user.