s390x: Support both big- and little-endian vector lane order #4682

uweigand · 2022-08-10T19:35:15Z

This implements the s390x back-end portion of the solution for
#4566

We now support both big- and little-endian vector lane order
in code generation. The order used for a function is determined
by the function's ABI: if it uses a Wasmtime ABI, it will use
little-endian lane order, and big-endian lane order otherwise.
(This ensures that all raw_bitcast instructions generated by
both wasmtime and other cranelift frontends can always be
implemented as a no-op.)

Lane order affects the implementation of a number of operations:

Vector immediates
Vector memory load / store (in big- and little-endian variants)
Operations explicitly using lane numbers
(insertlane, extractlane, shuffle, swizzle)
Operations implicitly using lane numbers
(iadd_pairwise, narrow/widen, promote/demote, fcvt_low, vhigh_bits)

In addition, when calling a function using a different lane order,
we need to lane-swap all vector values passed or returned in registers.

A small number of changes to common code were also needed:

Ensure we always select a Wasmtime calling convention on s390x
in crates/cranelift (func_signature).
Fix vector immediates for filetests/runtests. In PR s390x: Implement full SIMD support #4427,
I attempted to fix this by byte-swapping the V128 value, but
with the new scheme, we'd instead need to perform a per-lane
byte swap. Since we do not know the actual type in write_to_slice
and read_from_slice, this isn't easily possible.

Revert this part of PR s390x: Implement full SIMD support #4427 again, and instead just mark the
memory buffer as little-endian when emitting the trampoline;
the back-end will then emit correct code to load the constant.
Change a runtest in simd-bitselect-to-vselect.clif to no longer
make little-endian lane order assumptions.
Remove runtests in simd-swizzle.clif that make little-endian
lane order assumptions by relying on implicit type conversion
when using a non-i16x8 swizzle result type (this feature should
probably be removed anyway).

Tested with both wasmtime and cg_clif.

FYI - @cfallin

This implements the s390x back-end portion of the solution for bytecodealliance#4566 We now support both big- and little-endian vector lane order in code generation. The order used for a function is determined by the function's ABI: if it uses a Wasmtime ABI, it will use little-endian lane order, and big-endian lane order otherwise. (This ensures that all raw_bitcast instructions generated by both wasmtime and other cranelift frontends can always be implemented as a no-op.) Lane order affects the implementation of a number of operations: - Vector immediates - Vector memory load / store (in big- and little-endian variants) - Operations explicitly using lane numbers (insertlane, extractlane, shuffle, swizzle) - Operations implicitly using lane numbers (iadd_pairwise, narrow/widen, promote/demote, fcvt_low, vhigh_bits) In addition, when calling a function using a different lane order, we need to lane-swap all vector values passed or returned in registers. A small number of changes to common code were also needed: - Ensure we always select a Wasmtime calling convention on s390x in crates/cranelift (func_signature). - Fix vector immediates for filetests/runtests. In PR bytecodealliance#4427, I attempted to fix this by byte-swapping the V128 value, but with the new scheme, we'd instead need to perform a per-lane byte swap. Since we do not know the actual type in write_to_slice and read_from_slice, this isn't easily possible. Revert this part of PR bytecodealliance#4427 again, and instead just mark the memory buffer as little-endian when emitting the trampoline; the back-end will then emit correct code to load the constant. - Change a runtest in simd-bitselect-to-vselect.clif to no longer make little-endian lane order assumptions. - Remove runtests in simd-swizzle.clif that make little-endian lane order assumptions by relying on implicit type conversion when using a non-i16x8 swizzle result type (this feature should probably be removed anyway). Tested with both wasmtime and cg_clif.

cfallin

This looks great to me -- thanks for all of the careful work here!

The introduction of the BE/LE variants of vector-bitcast (as per discussion in the most recent Cranelift meeting) will come later, I assume? Totally fine to follow up with it; just want to make sure I understand the plans.

As usual I don't fully grok the s390x instruction-lowering details but I will lean on the tests (in addition to your careful attention to detail) to ensure correctness here :-)

uweigand · 2022-08-12T10:37:28Z

This looks great to me -- thanks for all of the careful work here!

Thanks!

The introduction of the BE/LE variants of vector-bitcast (as per discussion in the most recent Cranelift meeting) will come later, I assume? Totally fine to follow up with it; just want to make sure I understand the plans.

Yes, I was planning to continue working on this as a follow up. Given that with this PR merged, that follow-up is no longer expected to have any impact on code generation / correctness, but is solely about cleaning up the IR, I think we can take some time to make sure we get it right.

The details will probably require a bit more discussion, e.g. the exact CLIF IR changes (names of the new instructions etc.), and possibly a transition plan: do we need to continue to support the old raw_bitcast for a while to avoid breaking existing users, are there any existing raw_bitcast users that don't readily fit into the new scheme (e.g. bitcasts between types of different sizes), etc.

I'll go ahead and try to investigate these issues as far as I can, and then come back with findings or a proposal for further discussion in issue #4566.

uweigand force-pushed the s390x-lane-order branch from 658da3e to 161cdfd Compare August 10, 2022 20:19

uweigand force-pushed the s390x-lane-order branch from 161cdfd to b37862f Compare August 10, 2022 21:58

github-actions bot added the cranelift Issues related to the Cranelift code generator label Aug 10, 2022

cfallin approved these changes Aug 11, 2022

View reviewed changes

cfallin merged commit 67870d1 into bytecodealliance:main Aug 11, 2022

uweigand deleted the s390x-lane-order branch August 12, 2022 09:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

s390x: Support both big- and little-endian vector lane order #4682

s390x: Support both big- and little-endian vector lane order #4682

uweigand commented Aug 10, 2022

cfallin left a comment

uweigand commented Aug 12, 2022

s390x: Support both big- and little-endian vector lane order #4682

s390x: Support both big- and little-endian vector lane order #4682

Conversation

uweigand commented Aug 10, 2022

cfallin left a comment

Choose a reason for hiding this comment

uweigand commented Aug 12, 2022