Support non-power-of-two vector lengths. #63
Comments
TBH I worry that supporting this will cause way more pain, by complicating the semantics, than the value we'd get from supporting the use cases.
That's all to say: I'm kind of... pretty strongly opposed. If we do this, it should be a decision made carefully, weighing all these things (and perhaps more) against the benefits.
None of these apply to unstable features, which I don't have strong feelings on. |
In my opinion, if you are doing things that depend on the layout of a type, you need to ensure that you are properly handling the layout. The vector types already impose architecture-specific alignment implications; I don't think padding implications are particularly onerous. If you want to determine the size of a type, you should use `size_of`. I also think it should be pointed out that LLVM entirely supports non-power-of-two vectors (ignoring floating-point exceptions). The limitation here really just comes from cranelift. I do think this is a non-critical issue though, and it can definitely be ignored for the time being. |
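As a sketch of that point in plain stable Rust: the `PaddedVec3` type below is a hypothetical stand-in for a vector type whose size gets rounded up to its alignment, and the asserts show why layout-dependent code should query `size_of`/`align_of` rather than assume `3 * 4 == 12`:

```rust
use std::mem::{align_of, size_of};

// Hypothetical stand-in for a 3-lane f32 vector that a backend pads out to
// 16-byte size and alignment; not an actual std or portable-simd type.
#[allow(dead_code)]
#[repr(C, align(16))]
struct PaddedVec3([f32; 3]);

fn main() {
    assert_eq!(size_of::<[f32; 3]>(), 12);   // a plain array is tightly packed
    assert_eq!(align_of::<PaddedVec3>(), 16);
    assert_eq!(size_of::<PaddedVec3>(), 16); // size is rounded up to the alignment
}
```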
Both RISC-V V and SimpleV natively support non-power-of-2 vector lengths; for SimpleV, the in-memory layout is the same as an array of the same length. |
This is a far more compelling reason to support them IMO. |
Graphics code usually defers to the GPU, which has subtle but important differences from a typical SIMD register, but it is still useful to consider for our purposes, if only as a comparison point. Observationally, GPUs handle Vec3 inputs by simply using a Vec4 in execution, with a dummy value set in the 4th lane and only the first 3 lanes used otherwise. The savings are basically in space on disk and in RAM, at the cost of complicating marshalling of the data into the stream processors. The fact that quite a lot of programmers are basically okay with this quirk, and manage just fine without anything going particularly badly (though there's some debate over whether they should just shove something in the spare element slot and get the advantage of fully aligned accesses when the data is on the CPU), seems to suggest we'll be okay. Arm SVE is even more imminent as variable-lane silicon that will be available on servers sometime Soon™ (it's probably already available in HPC land, but I mean for mere mortals who "just" have a big server), and it can also land on non-power-of-2 vector widths, though admittedly always a multiple of 128 bits. |
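To put rough numbers on that space trade-off, here is a small stand-alone Rust sketch (both type names are made up for illustration) comparing a tightly packed Vec3 with one padded out to a Vec4-sized slot:

```rust
use std::mem::size_of;

#[allow(dead_code)]
#[repr(C)]
struct Vec3 { x: f32, y: f32, z: f32 }                   // stride 12 in an array

#[allow(dead_code)]
#[repr(C)]
struct Vec3Padded { x: f32, y: f32, z: f32, _pad: f32 }  // stride 16, a Vec4-sized slot

fn main() {
    const N: usize = 1_000_000;
    println!("packed: {} bytes", N * size_of::<Vec3>());       // 12_000_000
    println!("padded: {} bytes", N * size_of::<Vec3Padded>()); // 16_000_000
    // The padded form costs ~33% more memory, but every element starts at a
    // 16-byte multiple once the buffer itself is 16-byte aligned.
}
```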
Filed https://github.com/bjorn3/rustc_codegen_cranelift/issues/1136 for now. Also, filing that reminded me that https://github.com/WebAssembly/flexible-vectors/ exists. |
Glam needs non-power-of-two vectors on SPIR-V. |
Hey, just wanted to note that the restriction introduced in rust-lang/rust#80652 breaks a lot of code for the rust-gpu project. The stated reason for the restriction is the lack of support in Cranelift; however, this isn't a concern for us, as our backend does support non-power-of-two SIMD types. Would it be possible to re-enable them just for the SPIR-V target? |
Perhaps we should revert the power-of-two restriction (not the entire patch as it fixes an ICE) and just allow cg_clif to error if it encounters one? There doesn't seem to be an obvious solution other than keeping #[repr(simd)] perma-unstable (and a little hacky?) pending cranelift support. @workingjubilee suggested GPU targets as a potential user of non-power-of-two vectors from the start and it didn't seem to take long to run into that... |
I agree. Is cranelift only used for wasm targeting right now? |
Cranelift doesn't have a wasm backend. It does have a wasm frontend, which is used by for example Wasmtime. cg_clif, which is another Cranelift frontend, targets x86_64 and in the future AArch64. |
This is going to be tricky, since rust requires size >= alignment... There's some discussion about it here https://rust-lang.zulipchat.com/#narrow/stream/122651-general/topic/Data.20layout.20.2F.20ABI.20expert.20needed, but it's possible another thread should be created if we need to support this. |
@thomcc Checking up on that thread, but where does Rust require size >= alignment? |
It's required by array layout. Each element is required to be properly aligned, and elements are laid out contiguously at a stride equal to the element's size, so the size has to be a multiple of the alignment. |
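A small self-contained example of that constraint in action: array elements sit at a stride of exactly `size_of::<T>()`, and each one must still be properly aligned, which is only possible in general when the size is a multiple of the alignment:

```rust
use std::mem::{align_of, size_of};

fn main() {
    let xs = [0u64, 1, 2, 3];
    let base = xs.as_ptr() as usize;

    for (i, x) in xs.iter().enumerate() {
        let addr = x as *const u64 as usize;
        // The stride between consecutive elements is exactly size_of::<u64>()...
        assert_eq!(addr, base + i * size_of::<u64>());
        // ...and every element is still aligned, which requires
        // size_of::<u64>() % align_of::<u64>() == 0.
        assert_eq!(addr % align_of::<u64>(), 0);
    }
}
```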
Hmm. I know that GPUs will execute over Vec3s with some multiple of 4 lanes anyways, but does code shuffling around such data en route to actual execution regularly assume the stride is 12 and not 16? |
Apparently this isn't actually guaranteed, but I think it would probably require an RFC to change (and plausibly would break a lot of unsafe code in the wild — or rather, turn currently-sound interfaces to unsound-if-stride-and-size-dont-match ones) |
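For instance, a very common unsafe pattern (a generic sketch, not taken from any particular crate) indexes elements with `ptr::add`, which is sound only because the stride of a slice element is guaranteed to equal `size_of::<T>()`; if size and stride could differ, code like this would quietly become unsound for the affected types:

```rust
fn sum(xs: &[u32]) -> u32 {
    let base = xs.as_ptr();
    let mut total = 0;
    for i in 0..xs.len() {
        // `add(i)` advances by i * size_of::<u32>() bytes; that lands on the i-th
        // element only because slice stride is defined to be the element size.
        total += unsafe { *base.add(i) };
    }
    total
}

fn main() {
    assert_eq!(sum(&[1, 2, 3, 4]), 10);
}
```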
That's often not true anymore (though it was usually true in the distant past, for OpenGL 2.x hardware); AMD's GPUs convert all vector operations to scalar operations on each SIMT thread, iirc. So a vec3 add is really just three scalar adds per thread. |
Hmm, my understanding is that it wasn't fully true here either, but it's been a while. But yes, what you're saying is definitely accurate for modern stuff. |
Oh wow! That's absolutely wild. |
So, my understanding is that for modern AMD GPUs, vectors in SPIR-V translate like this. An add of two vec3s:

```glsl
vec3 a, b, r;
r = a + b;
```

Assembly (made-up mnemonics):

```
add.f32.vec64 r0, a0, b0
add.f32.vec64 r1, a1, b1
add.f32.vec64 r2, a2, b2
```
|
To add to the above, modern shader compilers (and other graphics workloads, e.g. raytracers) tend to use their lanes as pseudo "threads", masking for control flow, since most work done for graphics is exactly the same for several pixels/vertices/etc., so you get much better use out of the lanes even for scalar calculations. It also allows you to abstract away the width of the hardware SIMD unit much more cleanly while keeping your code path looking mostly like plain scalar code. |
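A scalar emulation of that "lanes as pseudo-threads" style, sketched with plain arrays so it runs on stable Rust (real code would use hardware select/blend instructions or a SIMD library): the branch becomes a per-lane mask plus a select.

```rust
const LANES: usize = 4;

// Every "lane" runs the same program; a branch is replaced by computing both
// sides for all lanes and then selecting per lane with a mask.
fn abs_lanes(x: [f32; LANES]) -> [f32; LANES] {
    let mask: [bool; LANES] = std::array::from_fn(|i| x[i] < 0.0);
    let negated: [f32; LANES] = std::array::from_fn(|i| -x[i]);
    // select(mask, negated, x)
    std::array::from_fn(|i| if mask[i] { negated[i] } else { x[i] })
}

fn main() {
    assert_eq!(abs_lanes([-1.0, 2.0, -3.0, 4.0]), [1.0, 2.0, 3.0, 4.0]);
}
```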
We determined at the meeting yesterday that breaking the SPIRV target that others were using is an anticipated but not desirable side effect so we will figure out how to revert this in rustc and handle the problem where necessary on the cg_clif side. |
…limit, r=nagisa: Revert non-power-of-two vector restriction. Removes the power-of-two restriction from rustc, as discussed in rust-lang/portable-simd#63. r? `@calebzulawski` cc `@workingjubilee` `@thomcc`
llvm is gaining code allowing vectors to be aligned to the element's alignment, rather than to the rounded-up size of the vector: https://lists.llvm.org/pipermail/llvm-dev/2021-December/154192.html |
Doesn't seem to have much traction yet. This thread has a higher speculation-to-information ratio than I would like. How does LLVM currently lower length-3 vectors for targets like x86? |
On x86, LLVM rounds up to a supported vector length (which is what it did for length 3), though sometimes it rounds down and handles the leftover separately (which is what it does for length 5 without AVX). Note that in all cases the memory write is split into some set of valid writes, because LLVM isn't allowed to write to bytes that are out of range (since there could be an atomic, unmapped memory, or some memory-mapped hardware there).

```asm
example::add3:
        mov       rax, rdi
        movaps    xmm0, xmmword ptr [rsi]
        addps     xmm0, xmmword ptr [rdx]
        extractps dword ptr [rdi + 8], xmm0, 2
        movlps    qword ptr [rdi], xmm0
        ret

example::add5:
        mov       rax, rdi
        movss     xmm0, dword ptr [rsi + 16]
        movaps    xmm1, xmmword ptr [rsi]
        addps     xmm1, xmmword ptr [rdx]
        addss     xmm0, dword ptr [rdx + 16]
        movss     dword ptr [rdi + 16], xmm0
        movaps    xmmword ptr [rdi], xmm1
        ret
```
|
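For reference, this is roughly the shape of source that produces lowerings like the above; a sketch assuming a nightly toolchain on which `std::simd` accepts non-power-of-two lane counts (the API is unstable and has changed over time):

```rust
#![feature(portable_simd)]
use std::simd::Simd;

pub fn add3(a: Simd<f32, 3>, b: Simd<f32, 3>) -> Simd<f32, 3> {
    // Per the lowering above: one 4-wide add, with the 12-byte store split into 8 + 4 bytes.
    a + b
}

pub fn add5(a: Simd<f32, 5>, b: Simd<f32, 5>) -> Simd<f32, 5> {
    // Without AVX: one 4-wide add plus a scalar add for the leftover lane.
    a + b
}
```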
re: some older objections:

- Alas, people are already writing code that exhibits this problem via use of crates like `vek`.
- It seems LLVM already has an answer for this. It may not be the answer we want, but it is an answer.
- Sure, the choice may be a mismatch for some obscure hardware architecture, but Rust already rejects many of those. It's not clear we can get away with ignoring non-pow2 vectors, as both SVE2 and RVV imply them, and SVE2 has begun shipping in consumer devices.
- It's not, but it's also not actually clear that we shouldn't, in fact, commit to there being a single "SIMD container type" in the language which defines a behavior for how all vectors are stored, read, written, and passed, as an underlying device, like Layout or Box. Yes, vek can choose to wrap it somehow.
- "Sound, but only by depending on implementation details that are not promised" is unsound, in point of fact. The question is whether it should be made sound or not. |
In rust-lang/rust#80652 I disabled non-power-of-two vector lengths to deconflict `stdsimd` and `cg_clif` development. Power-of-two vectors are typically sufficient, but there was at least one crate using length-3 vectors (yoanlcq/vek#66). It would be nice to re-enable these at some point, which would require support from cranelift. Perhaps a feature request should be filed with cranelift? cc @bjorn3