Vector register endianness causing ABI incompatibility #4566
Thanks for writing this up and for the discussion in today's Cranelift meeting. I'm not sure yet which approach I favor, but I thought it might be useful to write out a few principles that I want to try to stick to:
I had mentioned the idea of an encoding in the vector types in the meeting earlier today, and there was some issue that you had raised but I don't remember exactly what it was; so perhaps it's worth going through it again here. The idea is to have lane-order variants of the vector types. That does imply the existence of two different ops for the affected operations.

Perhaps another approach is to eliminate the notion of bitcasting being a no-op on SIMD vectors. That seems like the other root of the issue here: the actual bit positions change if we have a forward-lane-order, reverse-byte-order-within-lane representation of a vector.

Maybe, combining some of the above ideas, including the types, we can capture the actual format of the data in the type and require conversions like:
So this sketch includes (i) a new set of lane-order-aware types together with explicit conversions between them.

Or a slight variation on the above could also work. Then finally, at ABI boundaries and at loads/stores, we would only allow one canonical representation.

Anyway, hopefully one of the ideas above contains a kernel of usefulness -- eager to hear what you think @uweigand.
I was probably momentarily confused there. Let me try to systematically go through the various representation options to see what makes sense:
This would result in 8 different cases to be considered. However, I believe half of those can be quickly eliminated since they serve no useful purpose:
As a result, we can conclude:
The next question is how to characterize the two "lane order" options for values in registers. First, we notice that if the vector register is interpreted as one large integer, the two options can be described by where lane 0 resides: with little-endian lane order, lane 0 occupies the least-significant bits; with big-endian lane order, lane 0 occupies the most-significant bits.
This definition ensures in particular that we can match WebAssembly vector conversion semantics by always using vector types in little-endian lane order (when in registers). In order to match the semantics required by native vector code (e.g. for the Rust front end), we need to use whatever lane order is considered the "native" default for the platform. It appears that this matches the native byte order for all platforms I'm aware of (i.e. all platforms with native little-endian byte order seem to be using native little-endian lane order, and all platforms with native big-endian byte order seem to be using native big-endian lane order), but that's actually not necessarily a requirement.

So far, we have not looked at ISA instructions at all. First, we note that most operations are in fact independent of the lane order: an element-wise operation like an add computes the same result in every lane no matter how the lanes are numbered.
For the memory operations, we have four versions each, depending on the memory byte-order and the register lane-order. These match the four versions described in the table in the original issue text. For the explicit lane number operations, we'd have two versions each, depending on the lane order.

Finally, we have to look at ABI considerations. Obviously, if a value of vector type crosses an ABI boundary, there are ABI concerns: caller and callee must agree on the representation, which means they must agree on byte order if the value is passed in memory, and they must agree on lane order if the value is passed in a register. For in-memory byte order, that's actually just business as usual, since that applies to all other types as well. [ As an aside, the situation is a bit more complex than one might first think: for values explicitly passed as parameters, we always use native byte order (even with the Wasmtime ABI, all values on the stack are always stored in native byte order currently), but for values used more indirectly (e.g. from a global variable or via a pointer), platform assumptions are in play (Wasmtime uses LE for such values, native code uses native byte order). ]

The in-register lane order for vector types at ABI boundaries is a new question to be asked. For native code, we must follow existing precedent, which requires native lane order (whatever that is on any given platform). For Wasmtime, if we're lucky the issue does not occur, since Wasmtime only supports a single vector type (V128) at ABI boundaries.

All that said, we still have to decide how to represent lane order information.

Option 1: Global or per-function setting

I understand you don't like this option, but it would certainly be feasible. In particular, it seems to me now that tying the lane order to the function ABI might be a good option, given that:
Option 2: Types

It appears it would indeed also be possible to use types to encode the lane order information. We'd have e.g. separate little-endian and big-endian lane-order variants of each vector type. This option might have some less-desirable consequences:
Option 3: Instructions

Another option would be to use different instructions. This basically means we'd have either two copies of the affected instructions (memory ops, lane number ops), or else they'd all get some flags argument. Either way, the lane order to be used would be hard-coded into the instruction itself. Some disadvantages of this option:
Option 4: Explicit conversion
Sure, if the conversion is defined that way, that could work. However, we'd still have to know whether the user intended little-endian or big-endian lane-order semantics in the first place.

But I really don't like that option, since it would significantly degrade performance of Wasmtime on BE machines. Some instances of the conversion would no longer be no-ops and would instead require actual permutations.

As to your code example, I don't see how that would solve the full problem. You're doing a conversion in one place only. However, that doesn't solve the problem because we also need two flavors of extractlane. Consider adding two i32x4 values, reinterpreting the result as i8x16, and extracting lane 0. In Wasmtime semantics, this is supposed to return the LSB of the sum of lanes 0 of the two inputs. In native semantics, this is supposed to return the MSB of the sum of lanes 0 of the two inputs. The code sequences I want to generate to efficiently implement both operations would be, for the native case:
and for the Wasmtime case:
This only works if both the load and the extractlane are affected by the lane order choice.
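For concreteness, here is a minimal CLIF sketch of the IR-level sequence this example describes (my illustration of the discussion, not the original snippet; shown in the Wasmtime flavor with little-endian loads):

```clif
function %lsb_of_lane0_sum(i64, i64) -> i8 {
block0(v0: i64, v1: i64):
    v2 = load.i32x4 little v0   ; little-endian load, per Wasm semantics
    v3 = load.i32x4 little v1
    v4 = iadd v2, v3            ; lane-wise i32 addition
    v5 = raw_bitcast.i8x16 v4   ; reinterpret the same 128 bits as i8x16
    v6 = extractlane v5, 0      ; Wasm expects the LSB of the lane-0 sum
    return v6
}
```

Whether lane 0 here yields the LSB or the MSB of the lane-0 sum hinges on the lane order being applied consistently to the load, the reinterpretation, and the extractlane.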
Thank you again @uweigand for the detailed space-exploration here; I need to think more about this. Perhaps we can schedule some more time to talk about it in a Cranelift meeting?

I will say now that one concern I have with a function-level or global-level mode is composability: if the meaning of instructions depends on the environment (func or larger scope) and that environment has only one setting, then we can no longer write IR that uses both behaviors, and we can no longer copy instructions between functions. That may not be a common operation now, but it will become a first-order concern when we build an inlining pass. I could imagine a (far) future where we have native runtime code compiled from e.g. a Rust frontend, and Wasm code compiled from the Wasm frontend, and we want to perhaps inline across them.

So from a semantics and clean-design point of view, I'd prefer for instructions to encapsulate their entire meaning, without influence from environment. That leaves either the types or some update to the opcodes; I'm not sure which is the better option at this point. More discussion and thought needed!
Thanks for looking at this, @cfallin! Happy to have a discussion in the meeting - I'd be available next Monday (Aug 8).
- Discussion of vector endianness (bytecodealliance/wasmtime#4566)
- Notification of plans to factor out part of Cranelift's base layer to share with Winch, a baseline compiler (bytecodealliance/rfcs#28)
As I mentioned during the last Cranelift meeting, I think that when data endianness is set to big-endian, the 64-bit Arm architecture does not fit into either of those categories. Consider an indexed vector multiplication, i.e. a variant of the multiply instruction in which one operand is a single lane selected by an immediate index.

Now, we do not have any plans to implement big-endian support inside Cranelift's AArch64 backend in the near future (big-endian platforms are somewhat exotic and in fact big-endian support is optional in the architecture), but I think that this information might inform whatever design we come up with.
The option I mentioned during the meeting was to introduce an ISA-specific setting in Cranelift's flags.
Thanks for the pointer! From my reading of the document, this simply means that AArch64 always uses a native vector lane ordering (according to my definition above) of little-endian, independent of whether the default byte order is little-endian or big-endian.

Now that I think about it, Power is actually similar: Power supports both big-endian and little-endian default byte order, but the native vector lane ordering is always big-endian. So my statement quoted above that native lane order seems to always match default byte order is definitely not true :-) But as I said, it's not actually a requirement. This simply means that we need to provide both default byte order and native vector lane order as independent parameters of an ISA.
Thinking about this some more, I have some further comments on this option:
Ignoring efficiency aspects for a moment, this option may actually be the cleanest, semantically, after all. To be specific, the option would consist of replacing the current bitcast with two new instructions that explicitly specify the lane order to be used. The semantics of the new instructions would be:
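To make that concrete, here is a hedged model of what such a lane-order conversion does at the bit level (my sketch in Rust, not text from the proposal):

```rust
/// Model of a lane-order swap on a 128-bit vector: reverse the order of
/// the lanes while keeping the bytes within each lane intact. `lane_bytes`
/// is the width of one lane in bytes (e.g. 4 for I32X4).
fn swap_lane_order(bytes: [u8; 16], lane_bytes: usize) -> [u8; 16] {
    assert_eq!(16 % lane_bytes, 0);
    let lanes = 16 / lane_bytes;
    let mut out = [0u8; 16];
    for lane in 0..lanes {
        let src = lane * lane_bytes;
        let dst = (lanes - 1 - lane) * lane_bytes;
        out[dst..dst + lane_bytes].copy_from_slice(&bytes[src..src + lane_bytes]);
    }
    out
}
```

Note that for a single 16-byte lane (`lane_bytes == 16`) this is the identity, which is consistent with V128 values at ABI boundaries being unaffected.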
The semantics of the two instructions would then actually be fully well-defined, and not dependent on any implicit platform properties (unlike the current raw_bitcast). If the requested lane order matches the lane order the implementation chose to hold vector values in vector registers, the operation remains a no-op; otherwise, the implementation is a permutation, specifically a lane swap.

Now, let's look back at implementation efficiency. As I noted above, a naive direct implementation of the semantics as just defined would be very inefficient, certainly in the case of Wasmtime on BE lane order systems. However, there may be options to provide optimizations to make the implementation more efficient - without any change to the semantics! (And therefore, without affecting the properties you were concerned about, @cfallin.) Specifically, I can think of these options to optimize the implementation:

Explicit swap_lanes instructions

We could, as an optimization pass, do the following:
That is something that could possibly be done within the new egraph-based framework. But maybe all that compile-time overhead isn't actually necessary either, because we can just do this instead:

Implementation choice of lane order (semantics preserving)

With the above model, Cranelift IR has well-defined semantics that does not depend on global parameters. However, the code generation back-end is free to emit whatever instruction sequence it wants - assuming it in the end implements that well-defined semantics. In particular, the implementation is actually always free to choose whether to represent vector values in vector registers in little-endian or big-endian lane order! For example, on a platform like s390x, I could hold vector values in registers in big-endian lane order:
However, I can also implement the very same Cranelift IR semantics by doing the following instead:
This just results in a different instruction sequence for the current function, but implementing the exact same semantics. Now, since this is just an implementation choice, I'm free to use whatever heuristics make most sense to decide which way to go. For example, if a function contains many operations requesting little-endian lane order, it makes sense to also hold vector values in registers in little-endian lane order, so that those operations become no-ops.

In fact, I could even just use the current function's ABI as input to that heuristic, and simply use little-endian lane order for functions using the Wasmtime ABI, and big-endian lane order otherwise. Combined with the fact that in the Wasmtime ABI, we only have the single V128 type at ABI boundaries and therefore those swaps are no-ops, I would end up with exactly the same generated code, and basically no compile time overhead either, as I would have in the option described above as "Option 1: Global or per-function setting". But at the same time, there actually is no global state and Cranelift IR semantics is locally well-defined. While I do look at the ABI to make a decision, the ABI only influences implementation choice heuristics, not semantics.
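A hypothetical sketch of that ABI-driven heuristic (identifiers invented for illustration, not the actual Cranelift API):

```rust
/// In-register lane order chosen per function; a pure implementation
/// choice that is invisible in the IR semantics.
#[derive(Clone, Copy, PartialEq, Eq)]
enum LaneOrder {
    LittleEndian,
    BigEndian,
}

/// Pick the lane order from the function's calling convention: Wasmtime
/// functions get little-endian lane order (so Wasm-style accesses become
/// no-ops), everything else keeps the native big-endian lane order.
fn lane_order_for_function(uses_wasmtime_abi: bool) -> LaneOrder {
    if uses_wasmtime_abi {
        LaneOrder::LittleEndian
    } else {
        LaneOrder::BigEndian
    }
}
```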
Hi @cfallin, for reference here's the set of charts I shared in today's meeting: Vector-Endian.pdf
This implements the s390x back-end portion of the solution for #4566.

We now support both big- and little-endian vector lane order in code generation. The order used for a function is determined by the function's ABI: if it uses a Wasmtime ABI, it will use little-endian lane order, and big-endian lane order otherwise. (This ensures that all raw_bitcast instructions generated by both wasmtime and other cranelift frontends can always be implemented as a no-op.)

Lane order affects the implementation of a number of operations:
- Vector immediates
- Vector memory load / store (in big- and little-endian variants)
- Operations explicitly using lane numbers (insertlane, extractlane, shuffle, swizzle)
- Operations implicitly using lane numbers (iadd_pairwise, narrow/widen, promote/demote, fcvt_low, vhigh_bits)

In addition, when calling a function using a different lane order, we need to lane-swap all vector values passed or returned in registers.

A small number of changes to common code were also needed:
- Ensure we always select a Wasmtime calling convention on s390x in crates/cranelift (func_signature).
- Fix vector immediates for filetests/runtests. In PR #4427, I attempted to fix this by byte-swapping the V128 value, but with the new scheme, we'd instead need to perform a per-lane byte swap. Since we do not know the actual type in write_to_slice and read_from_slice, this isn't easily possible. Revert this part of PR #4427 again, and instead just mark the memory buffer as little-endian when emitting the trampoline; the back-end will then emit correct code to load the constant.
- Change a runtest in simd-bitselect-to-vselect.clif to no longer make little-endian lane order assumptions.
- Remove runtests in simd-swizzle.clif that make little-endian lane order assumptions by relying on implicit type conversion when using a non-i16x8 swizzle result type (this feature should probably be removed anyway).

Tested with both wasmtime and cg_clif.
The remaining open part of this issue is the question of how to define the new IR instruction(s) to implement bitcasts with explicitly specified lane order. In preparation for this, I've been reviewing the existing bitcast instructions.

First, we have `bitcast`, which the documentation specifies as reinterpreting the bits of a value as a different type.
However, the implementation doesn't really match the specification in various aspects:
Then, separately, we have `raw_bitcast`.
Reading the text, it is somewhat unclear what exactly, on the specification level, the difference between the two instructions is supposed to be. Looking at the actual implementation, we see:
In addition, there are two places in cranelift middle-end common code that match and/or generate these bitcast instructions:
It seems to me that both of these aren't really necessary: I think the NaN canonicalization can just be done without bitcasts, and the other use could be avoided as well.

In summary, it appears that the distinction between `bitcast` and `raw_bitcast` isn't really necessary. The specification in terms of "store in the old type, load in the new type" would cover both. However, the one place this breaks down is that we cannot handle boolean types, since those cannot be stored to or loaded from memory. But that in itself seems a bit of an inconsistency: I understand the reason for this restriction is that the actual bit pattern used to implement boolean types is not fully specified. So I think we either should make this an actual requirement, and then allow loading/storing those types, or else, we disallow bitcasting boolean types entirely.

Any comments or suggestions welcome!
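For reference, the "store in the old type, reload in the new type" definition can be written directly in CLIF; a minimal sketch (illustrative only):

```clif
function %bitcast_via_memory(i32x4) -> i8x16 {
    ss0 = explicit_slot 16
block0(v0: i32x4):
    stack_store v0, ss0          ; store the value in the old type...
    v1 = stack_load.i8x16 ss0    ; ...and reload it in the new type
    return v1
}
```

This formulation is exactly what fails for boolean types, since they cannot be stored or loaded.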
@uweigand Are you aware of PR #4820? IMHO the only difference between the two instructions is the set of types they accept.
It looks like cranelift-fuzzgen doesn't generate either variant of bitcast.
I wasn't - thanks for the pointer!
Well, that, and of course the boolean type issue I mentioned.
Ah, I see, my memory is fuzzy, and I feel that this was discussed before, but is the boolean thing basically the only real reason for the distinction between the two instructions?

P.S. IMHO the actual implementations of both instructions could be shared.

P.P.S. I just noticed the difference with respect to dynamic vector type support - @sparker-arm, was this just an accidental omission?
I never thought bitcasts could be so complicated until I read the clif docs :) It sounds like the standardization of booleans, hopefully enabling storage, is the only thing preventing one cast to rule them all?

RE: dynamic vectors, the fact that only one of the two instructions handles them wasn't a deliberate choice. I don't believe any casts should even apply to dynamic vectors as, at the IR level, we can't tell whether two dynamic types have equivalent sizes.
@uweigand In case you aren't aware, there was some discussion on Zulip about removing boolean types.
This makes sense to me.
I would actually be in favor of investigating whether we can completely remove boolean types. They are always the special, odd case. I'm not convinced that we couldn't just get by with integer types instead.

I would also be in favor of investigating whether we can remove i8x16 et al and just have a single 128-bit vector type.
The types are used for determining what kind of operation (lane size, lane count) to perform when you have e.g. an iadd instruction. Also, cg_clif doesn't need bitcasting for vectors nearly as much, as rust uses unique types for each vector lane count + width.
Weighing in a bit late here, sorry (and thank you @uweigand for the detailed analysis as always!) -- I think I'm converging toward the following thoughts:
If this seems reasonable, then I think the order of operations is (i) remove bools, (ii) remove `raw_bitcast` in favor of `bitcast`, and (iii) add the explicit lane-order specification to `bitcast`. Thoughts?
- Allow bitcast for vectors with differing lane widths
- Remove raw_bitcast IR instruction
- Change all users of raw_bitcast to bitcast
- Implement support for no-op bitcast cases across backends

This implements the second step of the plan outlined here: #4566 (comment)
Add a MemFlags operand to the bitcast instruction, where only the `big` and `little` flags are accepted. These define the lane order to be used when casting between types of different lane counts.

Update all users to pass an appropriate MemFlags argument.

Implement lane swaps where necessary in the s390x back-end.

This is the final part necessary to fix bytecodealliance#4566.
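If I read the final design correctly, the resulting IR spells the lane order directly on the cast; a small sketch (syntax as I understand the PR):

```clif
function %reinterpret(i32x4) -> i8x16 {
block0(v0: i32x4):
    ; Reinterpret the same 128 bits as i8x16 using little-endian lane
    ; order; a backend holding vectors in big-endian lane order lowers
    ; this to a lane swap, elsewhere it is a no-op.
    v1 = bitcast.i8x16 little v0
    return v1
}
```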
Currently, `s390x` is the only big-endian platform supported by cranelift, which has already caused problems related to memory accesses in the past, since wasmtime defines all data in its memory to be in little-endian byte order.

I had assumed that register contents are not affected by byte order, and while this is true for scalar types, it turns out this is not actually the case for vector registers. This is because of a combination of two issues.
The combination of these two issues makes the ISA's byte order properties of vector registers visible to IR code, just as byte order in memory becomes visible when (parts of) a value are accessed in memory in a different type than the one it was stored in.
Specifically, in the Intel (or ARM) little-endian ISA, if you load e.g. an `I32X4` into a vector register, use `raw_bitcast` to re-interpret that register as `I8X16`, and retrieve lane 0 of the resulting value via `extractlane`, the result will be the least-significant byte of the `I32` in lane 0 of the original value. And in fact, wasmtime requires this to be the case.

On the other hand, on `s390x` the content of a vector register in `I32X4` mode will be a sequence of four `I32`, each in big-endian byte order (or else the arithmetic operations on the register will give wrong results). Therefore, the same sequence of operations as above will return the most-significant byte of the `I32` in lane 0. This actually caused many wasmtime tests to fail.

To fix this, my current implementation of SIMD instructions on `s390x` uses a trick combining two different aspects:
- When loading an `I32X4` from little-endian memory into a vector register, I'm byte-reversing all 16 bytes of the loaded value. This not only fixes each `I32` value to be in the correct big-endian order in the register so subsequent arithmetic will work, it also implicitly swaps the order of elements, i.e. the element in slot 0 in memory will end up in what the ISA considers slot 3 of the register etc.
- An `extractlane` for lane 0 of an `I32X4` will actually at the ISA level extract lane 3 of the register.

The combination of these two aspects makes accessing SIMD registers work correctly for wasmtime. For example, in the above case, accessing lane 0 of an `I8X16` is converted to lane 15 of the register, which holds the least-significant byte of the `I32` in lane 3 of the register, which was loaded from lane 0 in memory -- so in the end we return the least-significant byte of the `I32` in lane 0 of the original value, as expected by wasmtime.
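The implied lane renumbering is simple to state; a small Rust sketch of the rule just described (illustrative, not actual backend code):

```rust
/// Under the byte-reversed-load scheme described above, IR lane `i` of an
/// `N`-lane vector lives in ISA lane `N - 1 - i` of the register.
fn isa_lane(ir_lane: usize, num_lanes: usize) -> usize {
    num_lanes - 1 - ir_lane
}

fn main() {
    assert_eq!(isa_lane(0, 4), 3);   // I32X4: IR lane 0 -> ISA lane 3
    assert_eq!(isa_lane(0, 16), 15); // I8X16: IR lane 0 -> ISA lane 15
}
```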
However, in implementing support for `rustc_codegen_cranelift`, I noticed that the current implementation actually breaks when handling big-endian vectors in memory - this will be the case for rustc, since that uses native platform byte order everywhere. Specifically, when loading a big-endian vector from memory, I'm just loading the bytes unchanged. This means that e.g. lane 0 of an `I32X4` in memory ends up in its original byte order (which is OK since this is already big-endian) in lane 0 of the register - but the latter is a problem if subsequent code wants to extract that lane, since an `extractlane 0` will actually access lane 3 in the register as described above!

To work around that problem, I've implemented a patch that will perform big-endian vector loads by swapping the order of the elements, but not the byte order within any single element. This will cause lane 0 from memory to end up in lane 3 in the register, and makes `extractlane` work as expected again.

With that fix, now both wasmtime and rustc_codegen_cranelift pass all their existing SIMD tests. Unfortunately, I think this is still not quite a complete solution.
Specifically, we can now run into two additional problems with big-endian code, which apparently are just never triggered by the existing rustc_codegen_cranelift tests.
First, I guess it would be possible to re-interpret contents in a vector register in another type even in rustc. Now, as opposed to wasmtime, rustc uses native endianness, presumably also w.r.t. vector contents. Therefore, the semantics of such a re-interpretation would be platform-defined and differ between big- and little-endian platforms (which is probably why it's not frequently used). However, users would expect this platform-specific behavior to be the same between the LLVM and cranelift back ends to rustc - which in the current implementation it would not be.
Even more problematic, carrying vector elements in reverse order in vector registers actually affects the ABI, as vector types are passed in vector registers. Code compiled by rustc using the LLVM back end would expect those to be in the "normal" platform order, while code compiled by rustc using the cranelift back end would expect them to be in the "reverse" order.
One option I'm thinking of would be to actually implement both methods in the cranelift back end. Specifically, the back end could support both a "vector big-endian" and "vector little-endian" mode, where the "big-endian" mode would use lane numbers directly as defined by our ISA, while the "little-endian" mode would use the reverse ordering implemented by the current back end code.
There's a slight complication in that we might have to support both big- and little-endian vector modes in load and store operations accessing any combination of big- and little-endian memory locations. But that should be possible:
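Spelling out the four combinations as I understand them from the descriptions above (a hedged Rust sketch, not actual backend code):

```rust
/// Byte order of the vector in memory, or lane order kept in registers.
#[derive(Clone, Copy)]
enum Order { Big, Little }

/// What must happen to the 16 loaded bytes for each combination of
/// memory byte order and in-register lane order (stores are symmetric).
fn vector_load_fixup(memory: Order, lanes: Order) -> &'static str {
    match (memory, lanes) {
        (Order::Big, Order::Big) => "use the bytes as loaded",
        (Order::Little, Order::Little) => "reverse all 16 bytes",
        (Order::Little, Order::Big) => "reverse the bytes within each element",
        (Order::Big, Order::Little) => "reverse the order of the elements",
    }
}
```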
(Starting with z15 we actually can implement each of these operations using a single instruction, so that should also be efficient.)

The one remaining question is how to select between the "vector big-endian" and "vector little-endian" modes. There are no attributes (like `MemFlags`) on the `extractlane` etc. operations, and that wouldn't even make sense: this is a global property; if you used big-endian mode to load the vector, you must also use big-endian mode on all subsequent operations.

So I guess this would have to be some sort of global flag, which wasmtime would set to always little-endian, and rustc would leave at native byte order. Of course, this flag is actually ABI-changing, so the same setting must be used for code to remain link-compatible. But I think given the above use cases, that should actually be fine.
FYI @cfallin - this is another problem I ran into while enabling `rustc_codegen_cranelift` - I'd appreciate any comments or suggestions!