-
Notifications
You must be signed in to change notification settings - Fork 17.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
proposal: add "simd" package to standard library #67520
Comments
This comment was marked as spam.
This comment was marked as spam.
This comment was marked as spam.
This comment was marked as spam.
simd
package to standard library
ISTM, this could be a regular build tag This might also help with WASI, where the solution was to make to make wasip1 the GOOS, but as new features get added to WASM itself, we might run into the same issue of needing to differentiate classic WASM from WASM++ or whatever. |
In the POC this is implemented with a simple build tag. However, I was presenting the idea for backward compatibility reasons. If anyone already defined such build tags, we wouldn't want to interfere with their code. Furthermore, these new build tags might lead to different build logic or state than the original ones. I haven't look in details so I cannot describe a different use case though... |
I guess that's a risk with any new build tag, but yes, it seems more likely to affect real code here. Maybe |
This seems a little bit terse to me and I'm still not comfortable with impacting others code. That's also why I didn't propose file names having a certain ending like But overall, I'm open to the idea. |
Couldn't you have a T32 type that provides a function N that returns 32?
This seems limiting. For example, dot products are quite important, but platforms differ a bit in what they provide. Also, Scatter isn't universally supported, but important in some fields. Would you simply skip it?
This cannot be guaranteed with LLVM, which can and does transform intrinsics to something else. |
If I understand correctly, instead of passing
I believe you are seeing that from the Highway point of view. This proposal tries to provide direct intrinsics. Functions wrapping multiple intrinsics could be then created on top of them. These functions would include dot products, scatters, ...
Fortunately, we are not dealing with LLVM. Go has its own assembly. |
Yes, that's right. Though I'd say VectorN instead of Vector16. 128-bit vectors have their place for historical reasons, but users asking for Vector1024 is a recipe for poor codegen on most ISAs.
Note that some platforms have dot product instructions e.g. vdpbf16ps, which are too good to pass up on. Of course a dot product kernel would call them in a loop, but you still want these instructions reflected in the abstraction layer. |
A sticking point for past proposals has been whether the package should be high level or low level. I think it probably makes sense to make a syscall vs. os style split and have a high level package that calls into a low level package. In terms of implementation, I think starting with the low level package is clearly better. So maybe start with golang.org/x/exp/simd being experimentally special cased by the compiler, if it's successful, move it to runtime/simd and then if everyone is happy with runtime/simd, a separate proposal could add math/vector as a high level library. |
Thank you for clarification @jan-wassenberg @earthboundkid this is the idea behind this proposal. Let's do the low level intrinsics that are not length agnostic and building on that we will be able to do the fancy length agnostic ones. |
It's not clear to me that VL agnostic can be built on top of known-length types, unless for each vector you set aside up to 256 bytes (SVE) or even 64 KiB (RISC-V V)? |
I'm not even sure the compiler can generate code for SVE and RISC-V V to be honest. So I don't think this is the priority. But yes for variable length vectors we will need either to set aside a larger amount of bytes, or come up with higher-level vector types. |
What you are proposing is already covered with build tags. With that said build tags are fundamentally a wrong approach for intrinsics. You seem to imply that intrinsics should be decided at compile time. I disagree. There are maybe 30+ CPU features in x86-64. Add combinations and you have an infinite build matrix already. If intrinsics should make sense, they can already be platform guarded and cpuid detection should make the compiled binary dynamically either use intrinsics or not. We already have Your proposal for a generic "simd" package is too simplistic. While abstracting certain operations would be possible and fine, I stand by my opinion that this can be done either internally or externally if Go provides the intrinsics to begin with. TBH I don't quite understand how this package would work with the build tags. Would it fail with
Distinct types is an abstraction over "register with x bits". Take "xor". It doesn't operate on any specific type, but only on a fixed register size. Do you want an I see all of that as a secondary layer to the main complexity of adding vector registers to the compiler. That seems like the primary problem to solve. To me the approach would be to: A) Find a reasonable way to represent vector registers in Go code, which can be used across platforms. It could be a It seems reasonable to provide B) Provide intrinsics per platform. Intrinsics are "close" to the emitted instructions. There is no scalar fallback, and cpuid checks will be up to the user. I still think dividing intrinsics into (primary) cpuid makes sense. Importing a package doesn't mean you use it, merely that you may use it if the cpu supports it. C) Find a reasonable way for imported packages to "inject" intrinsics into the compiler. That means importing D) Constants A lot of instructions take immediates which must be constant at compile time. For example While it could just result in a compile-time error it would be nice if the function definition by itself could give that information. Not a show-stopper, though - and could be decided later. There are a lot of "minor" decisions that would need to be made. Should instruction with shared source/destination registers abstract this? How should functions be named? Close to the instruction or slightly abstracted (like C, rust) or fully renamed? While these things would need a resolution at some point, they don't really seem like the primary things to focus on too early. If the above works out, third parties can implement what they like. The Go stdlib can then look at whether a generic vector package would make sense if that is deemed important. TLDR; I think the approach of compile-time cpu feature detection is a dead end. I think intrinsics should be low level to allow for the biggest amount of features and leave higher level abstractions to competing third parties. Go will provide the compiler that can make this great. PS. I am genuinely very sorry I never found the time to reply to your emails. |
So, I believe there are misunderstandings here:
The build tag is optional and when not provided, the compiler, through cpuid, would detect the all the intrinsics available for a given platform. For example, if you had a machine with SSE and AVX available, the compiler will detect all these features. The build tags let the user restrict to a set of features if needed.
For that, we could drop to SWAR-like operations or scalar code if a feature is not detected. I believe the currently available intrinsics already have this.
No, for xor under the hood, there is only one intrinsic covering multiple types.
These two points are covered in my proposal. I talk about having to handle constants and making sure that the compiler promote the types to vector registers. Rest
For importing through different import paths, it just seem to be another way of doing what I proposed with the optional build tag but in reverse. If you don't provide a tag, you get all possible intrinsics. Otherwise, you get the intrinsics only for a set of ISAs.
Once again, a lot of intrinsics are shared across multiple types. This means we will have lot of "functions" calling to the same underlying intrinsic.
Totally agree with this. The minor points that I made are purely based on the experience with the POC. They are not the main focus of the proposal. |
I think the better solution is to steal the types from #64634: type ISimd128 ISimd128 Each SIMD vector become it's own native type which are unbreakable. This also discourage programmers to treat them as arrays (which is often slow). // random example
for i, v := range someSIMDVector {
if v % 2 == 0 && i % 2 == 1 { // do some logic
continue
}
someSIMDVector[i] = 42 // create inter-loop dependency if lifting to registers
} however performance for code like this is gonna be way worst than it look it should be on amd64 (and maybe other architectures) due to the high cost to move data between the SIMD and GP complexes. |
Totally agree with you on that. This is why I said:
And having a native type is definitely the way to go. I just didn't know how to properly formulate that. |
SIMD instructions are crucial for computation-intensive workloads like machine learning and deep learning. This proposal really provides a solid initial plan.
However, Golang works at a higher level than C, there should be some more considerations. Essential SIMD operations like vector addition, reduction, and broadcasting should be implemented across |
While I agree on the fact that Go is higher level, the abstraction of SIMD intrinsics is a very complicated thing to get right. On top of that, we might find ourselves having performance differences that are not negligible across platforms/ISAs. I believe that we should let the developer the choice and explicitly make the trade-offs. |
If a SIMD abstraction only provides the strict intersection of ops, then what exactly should developers trade off? When they want more ops, aren't they forced to implement the op in assembly, plus some kind of fallback for other platforms? It seems much more useful if the abstraction takes care of that. It would have to be judicious about what exactly to offer, and avoid ops that really aren't efficient on other platforms. I think we did a decent job of that in Highway :) |
I definitely would like judicious abstraction , I'm just saying that this is hard. To make that a little bit easier, we should definitely learn from Highway on which ops to provide. Note that I was trying to be conservative on this proposal and get a minimal scope that we can evolve from. I believe that if we have the basic ops for each platforms, we can later create higher abstractions as part of the simd package. |
@jan-wassenberg I don't think the comparison applies. The HH api is good for what it offers. However I don't see that as a language feature, but as a third party package API. IMO Go should allow for implementing abstractions and people can choose their preferred abstraction or write directly in intrinsics. Go should provide a compiler that is capable of sensibly allocate registers and handle code generation. Users of intrinsics would gain the benefits of not having the function call overhead of assembly; The benefits of memory safety (maybe with "unsafe" exceptions for things like scatter/gather). Plus the simplicity of register management compared to assembly. Abstractions should be implemented by third parties - they will likely be more powerful and different approaches can compete for the best APIs. |
See #68188. |
Background
After doing a little bit of research on previous issues mentioning adding SIMD to Go, I found the following proposals:
#58610: This proposal relies on the
ispmd
keyword which does not follow the compatibility guarantees for Go 1.#53171: This proposal is closer to what I am proposing, however, there are multiple packages per CPU architecture, and I'm providing a more complete explanation on how to achieve adding SIMD to the Go standard library.
#64634: This proposal is more specific to the crypto package but in essence this is close to what I'm proposing here.
Goals
The main goal of this proposal is to provide an alternative approach to designing a
simd
package for the Go standard library. As of right now, there is no consensus on what the API should look like and this proposal intends to drive the discussion further.This proposal mainly includes two things:
Build Tag
For the first point, I was thinking about optional build tags like the following:
//go:simd sse2
//go:simd neon
//go:simd avx512
etc.
As mentioned, these build tags are optional. If not specified, we would resolve to the appropriate SIMD ISA available on the current OS and ARCH. However, I still think that these build tags are needed for deeper optimization and platform-specific operations. If we know that some instruction is more performant or only available on a certain architecture, we should be able to enforce using it manually.
Finally, having the optional build tag would let the developer choose, at compile time, which SIMD ISA to target and thus cross compile. We could write something similar to:
With this, we could take advantage of platform-specific features and know at compile time the size of the vector registers (e.g., 128, 256, or 512 bits). This would help us make better decisions for optimizations on the compiler side.
Compiler Intrinsics
The next crucial step would be to create a portable SIMD package that would rely on the compiler to generate SIMD instructions through compiler intrinsics. I demonstrated that this is feasible with a POC. As of right now, it looks like the following (you can see more examples here, including UTF8 string validation):
And here, the
AddU8x16
gets lowered down to avector add
instruction after SSA lowering.Notes
We can provide functions like
AddU8x16
,AddU8x32
, etc. without changing the generics implementation. Other implementations like Highway, Rust std::simd, and Zig @Vector rely on generics for the API. In Go, we do not have non-type parameters in generics; thus we cannot have something likeSimd[uint8, 16]
.Also, we do not have a compile time
Sizeof
which could help us have:Philosophy
It is important to understand that this proposal is not describing an abstraction of SIMD features. Such an abstraction could create noticeable performance difference between ISAs. That's why we are trying to avoid it. This means that if an operation is not available on an ISA, we simply don't provide it. Each intrinsic should only have 1 underlying instruction, not a sequence of instructions.
Challenges to Overcome
I believe we would need some kind of type aliases like the following:
These types should not be indexable and only instantiable through init functions like
Splat8x16
,Load8x16
, etc. The compiler would then promote these special types to vector registers. This would remove all the LD/ST dance and memory allocation that I have in my POC and thus make everything a lot faster.In the end the previous code snippet could look like this:
WORD $0x...
) to encode these.I believe we should avoid having to use constants when defining intrinsics. We could implement the missing instructions along the way with the implementation of intrinsics.
VMIN
andUMINV
. The former returns a vector of min elements, and the latter reduce to the minimum in a given vector. As we don't have function overloads, we will need to find a way to name them appropriately.I believe we should try to make the horizontal operations more verbose (e.g.
ReduceMin8x16
) and promote the vertical ones (e.g.Min8x16
). For the example ofVMIN
andUMINV
, the latter does not even seem to exist in SSE2 whereas the first one does.There are other operations that have "conflicting" names. For example
shift right
andshift left
have bothlogical
andarithmetic
shifts. For these cases, I believe we could just call themLogicalShiftRight
,ArithmeticShiftRight
, ... I agree that this is verbose but it makes it clear what is happening.The current POC did not implement the concept of Masks. This is an important concept but also a tricky one to implement without proper compiler support. After discussion with Jan Wassenberg (author of Highway), I realized that some platforms do not treat masks in the same way. Here is a summary:
We could have other types aliases, like the one made for vectors. These types would have different shapes and be loaded differently depending on the platform.
SSHR
(shift right) on NEON takes an immediaten
that need to be restricted between 1 and 8 (see vshr_n_s8). I ran into some problems where during the build of the compiler, then
would resolve to 0 (default value of int passed as param) and it would crash the program.I believe we could have some way to check at compile time these values are within a certain range. Like a static_assert in C++ or checks on the AST.
Why is it Important?
Without SIMD, we are missing on a lot of potential optimizations. Here is a non-exhaustive list of concrete things that could improve performance in daily life scenarios:
Furthermore, it would make these currently existing packages more portable and maintainable:
There are obviously way more applications of SIMD but I am just trying to say that this is useful in practical scenarios.
The text was updated successfully, but these errors were encountered: