Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

s390x: Support both big- and little-endian vector lane order #4682

Merged
merged 1 commit into from
Aug 11, 2022

Conversation

uweigand
Copy link
Member

This implements the s390x back-end portion of the solution for
#4566

We now support both big- and little-endian vector lane order
in code generation. The order used for a function is determined
by the function's ABI: if it uses a Wasmtime ABI, it will use
little-endian lane order, and big-endian lane order otherwise.
(This ensures that all raw_bitcast instructions generated by
both wasmtime and other cranelift frontends can always be
implemented as a no-op.)

Lane order affects the implementation of a number of operations:

  • Vector immediates
  • Vector memory load / store (in big- and little-endian variants)
  • Operations explicitly using lane numbers
    (insertlane, extractlane, shuffle, swizzle)
  • Operations implicitly using lane numbers
    (iadd_pairwise, narrow/widen, promote/demote, fcvt_low, vhigh_bits)

In addition, when calling a function using a different lane order,
we need to lane-swap all vector values passed or returned in registers.

A small number of changes to common code were also needed:

  • Ensure we always select a Wasmtime calling convention on s390x
    in crates/cranelift (func_signature).

  • Fix vector immediates for filetests/runtests. In PR s390x: Implement full SIMD support #4427,
    I attempted to fix this by byte-swapping the V128 value, but
    with the new scheme, we'd instead need to perform a per-lane
    byte swap. Since we do not know the actual type in write_to_slice
    and read_from_slice, this isn't easily possible.

    Revert this part of PR s390x: Implement full SIMD support #4427 again, and instead just mark the
    memory buffer as little-endian when emitting the trampoline;
    the back-end will then emit correct code to load the constant.

  • Change a runtest in simd-bitselect-to-vselect.clif to no longer
    make little-endian lane order assumptions.

  • Remove runtests in simd-swizzle.clif that make little-endian
    lane order assumptions by relying on implicit type conversion
    when using a non-i16x8 swizzle result type (this feature should
    probably be removed anyway).

Tested with both wasmtime and cg_clif.

FYI - @cfallin

This implements the s390x back-end portion of the solution for
bytecodealliance#4566

We now support both big- and little-endian vector lane order
in code generation.  The order used for a function is determined
by the function's ABI: if it uses a Wasmtime ABI, it will use
little-endian lane order, and big-endian lane order otherwise.
(This ensures that all raw_bitcast instructions generated by
both wasmtime and other cranelift frontends can always be
implemented as a no-op.)

Lane order affects the implementation of a number of operations:
- Vector immediates
- Vector memory load / store (in big- and little-endian variants)
- Operations explicitly using lane numbers
  (insertlane, extractlane, shuffle, swizzle)
- Operations implicitly using lane numbers
  (iadd_pairwise, narrow/widen, promote/demote, fcvt_low, vhigh_bits)

In addition, when calling a function using a different lane order,
we need to lane-swap all vector values passed or returned in registers.

A small number of changes to common code were also needed:

- Ensure we always select a Wasmtime calling convention on s390x
  in crates/cranelift (func_signature).

- Fix vector immediates for filetests/runtests.  In PR bytecodealliance#4427,
  I attempted to fix this by byte-swapping the V128 value, but
  with the new scheme, we'd instead need to perform a per-lane
  byte swap.  Since we do not know the actual type in write_to_slice
  and read_from_slice, this isn't easily possible.

  Revert this part of PR bytecodealliance#4427 again, and instead just mark the
  memory buffer as little-endian when emitting the trampoline;
  the back-end will then emit correct code to load the constant.

- Change a runtest in simd-bitselect-to-vselect.clif to no longer
  make little-endian lane order assumptions.

- Remove runtests in simd-swizzle.clif that make little-endian
  lane order assumptions by relying on implicit type conversion
  when using a non-i16x8 swizzle result type (this feature should
  probably be removed anyway).

Tested with both wasmtime and cg_clif.
@github-actions github-actions bot added the cranelift Issues related to the Cranelift code generator label Aug 10, 2022
Copy link
Member

@cfallin cfallin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great to me -- thanks for all of the careful work here!

The introduction of the BE/LE variants of vector-bitcast (as per discussion in the most recent Cranelift meeting) will come later, I assume? Totally fine to follow up with it; just want to make sure I understand the plans.

As usual I don't fully grok the s390x instruction-lowering details but I will lean on the tests (in addition to your careful attention to detail) to ensure correctness here :-)

@cfallin cfallin merged commit 67870d1 into bytecodealliance:main Aug 11, 2022
@uweigand uweigand deleted the s390x-lane-order branch August 12, 2022 09:32
@uweigand
Copy link
Member Author

This looks great to me -- thanks for all of the careful work here!

Thanks!

The introduction of the BE/LE variants of vector-bitcast (as per discussion in the most recent Cranelift meeting) will come later, I assume? Totally fine to follow up with it; just want to make sure I understand the plans.

Yes, I was planning to continue working on this as a follow up. Given that with this PR merged, that follow-up is no longer expected to have any impact on code generation / correctness, but is solely about cleaning up the IR, I think we can take some time to make sure we get it right.

The details will probably require a bit more discussion, e.g. the exact CLIF IR changes (names of the new instructions etc.), and possibly a transition plan: do we need to continue to support the old raw_bitcast for a while to avoid breaking existing users, are there any existing raw_bitcast users that don't readily fit into the new scheme (e.g. bitcasts between types of different sizes), etc.

I'll go ahead and try to investigate these issues as far as I can, and then come back with findings or a proposal for further discussion in issue #4566.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cranelift Issues related to the Cranelift code generator
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants