Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

s390x: Implement full SIMD support #4427

Merged
merged 1 commit into from
Jul 18, 2022

Conversation

uweigand
Copy link
Member

@uweigand uweigand commented Jul 9, 2022

This adds full support for all Cranelift SIMD instructions
to the s390x target. Everything is matched fully via ISLE.

In addition to adding support for many new instructions,
and the lower.isle code to match all SIMD IR patterns,
this patch also adds ABI support for vector types.
In particular, we now need to handle the fact that
vector registers 8 .. 15 are partially callee-saved,
i.e. the high parts of those registers (which correspond
to the old floating-poing registers) are callee-saved,
but the low parts are not. This is the exact same situation
that we already have on AArch64, and so this patch uses the
same solution (the is_included_in_clobbers callback).

The bulk of the changes are platform-specific, but there are
a few exceptions:

  • Added ISLE extractors for the Immediate and Constant types,
    to enable matching the vconst and swizzle instructions.

  • Added a missing accessor for call_conv to ABISig.

  • Fixed endian conversion for vector types in data_value.rs
    to enable their use in runtests on the big-endian platforms.

  • Enabled (nearly) all SIMD runtests on s390x. [ Two test cases
    remain disabled due to vector shift count semantics, see below. ]

  • Enabled all Wasmtime SIMD tests on s390x.

There are three minor issues, called out via FIXMEs below,
which should be addressed in the future, but should not be
blockers to getting this patch merged. I've opened the
following issues to track them:

All tests, including all newly enabled SIMD tests, pass
on both z14 and z15 architectures.

FYI @cfallin

@github-actions github-actions bot added cranelift Issues related to the Cranelift code generator cranelift:area:machinst Issues related to instruction selection and the new MachInst backend. labels Jul 9, 2022
@uweigand uweigand force-pushed the s390x-simd branch 2 times, most recently from caef959 to 79f9127 Compare July 9, 2022 18:51
@uweigand
Copy link
Member Author

uweigand commented Jul 9, 2022

Unfortunately there seems to be another QEMU problem. I'm now using a new s390x instruction to implement the "pseudo" min/max semantics, and this seems to be incorrectly implemented on QEMU. This causes the pmin/pmax tests (both the cranelift runtests and the wasm spec tests) to fail under QEMU, even though the all pass on native hardware (both z14 and z15).

I've now disabled those test to make the CI pass. Once we've fixed QEMU they can be re-enabled again.

@alexcrichton
Copy link
Member

If you're interested in @uweigand one thing that might be good to double-check this PR is to run some of the fuzzers for awhile locally on real hardware since QEMU-based performance on our end probably won't end up fuzzing much. For SIMD fuzzing the v8 fuzzer probably won't work since I suspect that the v8 Rust crate doesn't have prebuilt binaries for s390x (or actually I'm not sure v8 even has support for s390x...) but the differential_spec fuzzer might work? If OCaml works on s390x then that should be enough to get the differential_spec fuzzer running and if you run that for a few hours it might be good to weed out possible issues with simd-related (or other perhaps) instructions.

@uweigand
Copy link
Member Author

Hmm, unfortunately cargo fuzz run differential_spec fails immediately with:

error: address sanitizer is not supported for this target

which is a bit strange since LLVM on SystemZ does support the address sanitizer. But I guess there might be some bits of support in the Rust compiler itself that are missing? Do you have any pointers what we'd need to do here?

OCaml does work SystemZ, but I don't think it's even getting that far yet.

@alexcrichton
Copy link
Member

If you pass -s none to cargo fuzz does it make more progress? (that disables asan)

@alexcrichton
Copy link
Member

Ah sorry, and for fuzzing, the supported_sanitizers option in the upstream rust-lang/rust target specification I think is probably what's gating this on the s390x target. There may also be distribution bits as well, although I'm less certain about that.

@uweigand
Copy link
Member Author

With -s none the build completed successfully, and the fuzzer seems to be running now. Hasn't found anything yet - I'll just leave it running for a while. Thanks!

This adds full support for all Cranelift SIMD instructions
to the s390x target.  Everything is matched fully via ISLE.

In addition to adding support for many new instructions,
and the lower.isle code to match all SIMD IR patterns,
this patch also adds ABI support for vector types.
In particular, we now need to handle the fact that
vector registers 8 .. 15 are partially callee-saved,
i.e. the high parts of those registers (which correspond
to the old floating-poing registers) are callee-saved,
but the low parts are not.  This is the exact same situation
that we already have on AArch64, and so this patch uses the
same solution (the is_included_in_clobbers callback).

The bulk of the changes are platform-specific, but there are
a few exceptions:

- Added ISLE extractors for the Immediate and Constant types,
  to enable matching the vconst and swizzle instructions.

- Added a missing accessor for call_conv to ABISig.

- Fixed endian conversion for vector types in data_value.rs
  to enable their use in runtests on the big-endian platforms.

- Enabled (nearly) all SIMD runtests on s390x.  [ Two test cases
  remain disabled due to vector shift count semantics, see below. ]

- Enabled all Wasmtime SIMD tests on s390x.

There are three minor issues, called out via FIXMEs below,
which should be addressed in the future, but should not be
blockers to getting this patch merged.  I've opened the
following issues to track them:

- Vector shift count semantics
  bytecodealliance#4424

- is_included_in_clobbers vs. link register
  bytecodealliance#4425

- gen_constant callback
  bytecodealliance#4426

All tests, including all newly enabled SIMD tests, pass
on both z14 and z15 architectures.
@uweigand
Copy link
Member Author

The fuzzer did indeed find two problems, both were detected after only a few minutes of running:

  • a problem with random high bits in sub-word immediates that were not properly zero-extended
  • a regalloc problem caused by inaccurate tracking of use of the %r1 temp register

I've now updated this PR to included fixes for both of those. With this fixed, the fuzzer has now been running for over 12 hours without finding anything, currently coming up to:

=== Execution rate (2717815 executed modules / 4171000 tried modules): 65.15979381443299% ===

(not sure if that is a lot or not ... or how long it makes sense to keep it running)

@alexcrichton
Copy link
Member

It's at least some progress! In #4315 two bugs were found in the x86_64 implementation with a program in the wild, so our fuzzing as-is right now is only but so comprehensive. There's various efforts to improve things, but 12 hours of no more bugs is better than 0 hours at least!

Copy link
Member

@cfallin cfallin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great overall -- thank you for the work on this implementation @uweigand!

Given the volume of code and my non-familiarity with s390x's vector instructions, I mostly spot-checked the usual things like regalloc metadata, reasonable ISLE definitions, and the like.

I did notice in some cases in lower.isle special cases -- actually the way in which shuffle uses pattern-matching is quite beautiful -- and was wondering if there were compile-tests that hit these special-cases in particular (vs relying on the Wasm testsuite and existing runtests). The reason I ask is that without rule priorities, these lean on ISLE implicit prioritization to pick the more specific match. With the fallback being correct for all cases, there's no bug here, and the more-specific-first heuristic should usually do the right thing, but it might be good to assert.

Otherwise, I'm happy to see this merged!

@uweigand
Copy link
Member Author

I did notice in some cases in lower.isle special cases -- actually the way in which shuffle uses pattern-matching is quite beautiful -- and was wondering if there were compile-tests that hit these special-cases in particular (vs relying on the Wasm testsuite and existing runtests).

Yes, cranelift/filetests/filetests/isa/s390x/vec-permute.clif has a compile test for each of the special cases.

Otherwise, I'm happy to see this merged!

Thanks for the review!

@cfallin cfallin merged commit 638dc4e into bytecodealliance:main Jul 18, 2022
@alexcrichton alexcrichton mentioned this pull request Jul 20, 2022
3 tasks
alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Jul 20, 2022
This carries over a narrow fix from bytecodealliance#4427 to prior release branches. The
patch here was created by `@uweigand` during the investigation for bytecodealliance#4487.
alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Jul 20, 2022
This carries over a narrow fix from bytecodealliance#4427 to prior release branches. The
patch here was created by `@uweigand` during the investigation for bytecodealliance#4487.
alexcrichton added a commit that referenced this pull request Jul 20, 2022
* Fix panics in s390x codegen related to aliases

This commit fixes an issue introduced as part of the fix for
GHSA-5fhj-g3p3-pq9g. The `reftyped_vregs` list given to `regalloc2` is
not allowed to have duplicates in it and while the list originally
doesn't have duplicates once aliases are applied the list may have
duplicates. The fix here is to perform another pass to remove duplicates
after the aliases have been processed.

* Fix a miscompile for s390x with constants

This carries over a narrow fix from #4427 to prior release branches. The
patch here was created by `@uweigand` during the investigation for #4487.

* Add release notes
alexcrichton added a commit that referenced this pull request Jul 20, 2022
* Fix panics in s390x codegen related to aliases

This commit fixes an issue introduced as part of the fix for
GHSA-5fhj-g3p3-pq9g. The `reftyped_vregs` list given to `regalloc2` is
not allowed to have duplicates in it and while the list originally
doesn't have duplicates once aliases are applied the list may have
duplicates. The fix here is to perform another pass to remove duplicates
after the aliases have been processed.

* Fix a miscompile for s390x with constants

This carries over a narrow fix from #4427 to prior release branches. The
patch here was created by `@uweigand` during the investigation for #4487.

* Add release notes
uweigand added a commit to uweigand/wasmtime that referenced this pull request Aug 10, 2022
This implements the s390x back-end portion of the solution for
bytecodealliance#4566

We now support both big- and little-endian vector lane order
in code generation.  The order used for a function is determined
by the function's ABI: if it uses a Wasmtime ABI, it will use
little-endian lane order, and big-endian lane order otherwise.
(This ensures that all raw_bitcast instructions generated by
both wasmtime and other cranelift frontends can always be
implemented as a no-op.)

Lane order affects the implementation of a number of operations:
- Vector immediates
- Vector memory load / store (in big- and little-endian variants)
- Operations explicitly using lane numbers
  (insertlane, extractlane, shuffle, swizzle)
- Operations implicitly using lane numbers
  (iadd_pairwise, narrow/widen, promote/demote, fcvt_low, vhigh_bits)

In addition, when calling a function using a different lane order,
we need to lane-swap all vector values passed or returned in registers.

A small number of changes to common code were also needed:

- Ensure we always select a Wasmtime calling convention on s390x
  in crates/cranelift (func_signature).

- Fix vector immediates for filetests/runtests.  In PR bytecodealliance#4427,
  I attempted to fix this by byte-swapping the V128 value, but
  with the new scheme, we'd instead need to perform a per-lane
  byte swap.  Since we do not know the actual type in write_to_slice
  and read_from_slice, this isn't easily possible.

  Revert this part of PR bytecodealliance#4427 again, and instead just mark the
  memory buffer as little-endian when emitting the trampoline;
  the back-end will then emit correct code to load the constant.

- Change a runtest in simd-bitselect-to-vselect.clif to no longer
  make little-endian lane order assumptions.

- Remove runtests in simd-swizzle.clif that make little-endian
  lane order assumptions by relying on implicit type conversion
  when using a non-i16x8 swizzle result type (this feature should
  probably be removed anyway).

Tested with both wasmtime and cg_clif.
uweigand added a commit to uweigand/wasmtime that referenced this pull request Aug 10, 2022
This implements the s390x back-end portion of the solution for
bytecodealliance#4566

We now support both big- and little-endian vector lane order
in code generation.  The order used for a function is determined
by the function's ABI: if it uses a Wasmtime ABI, it will use
little-endian lane order, and big-endian lane order otherwise.
(This ensures that all raw_bitcast instructions generated by
both wasmtime and other cranelift frontends can always be
implemented as a no-op.)

Lane order affects the implementation of a number of operations:
- Vector immediates
- Vector memory load / store (in big- and little-endian variants)
- Operations explicitly using lane numbers
  (insertlane, extractlane, shuffle, swizzle)
- Operations implicitly using lane numbers
  (iadd_pairwise, narrow/widen, promote/demote, fcvt_low, vhigh_bits)

In addition, when calling a function using a different lane order,
we need to lane-swap all vector values passed or returned in registers.

A small number of changes to common code were also needed:

- Ensure we always select a Wasmtime calling convention on s390x
  in crates/cranelift (func_signature).

- Fix vector immediates for filetests/runtests.  In PR bytecodealliance#4427,
  I attempted to fix this by byte-swapping the V128 value, but
  with the new scheme, we'd instead need to perform a per-lane
  byte swap.  Since we do not know the actual type in write_to_slice
  and read_from_slice, this isn't easily possible.

  Revert this part of PR bytecodealliance#4427 again, and instead just mark the
  memory buffer as little-endian when emitting the trampoline;
  the back-end will then emit correct code to load the constant.

- Change a runtest in simd-bitselect-to-vselect.clif to no longer
  make little-endian lane order assumptions.

- Remove runtests in simd-swizzle.clif that make little-endian
  lane order assumptions by relying on implicit type conversion
  when using a non-i16x8 swizzle result type (this feature should
  probably be removed anyway).

Tested with both wasmtime and cg_clif.
uweigand added a commit to uweigand/wasmtime that referenced this pull request Aug 10, 2022
This implements the s390x back-end portion of the solution for
bytecodealliance#4566

We now support both big- and little-endian vector lane order
in code generation.  The order used for a function is determined
by the function's ABI: if it uses a Wasmtime ABI, it will use
little-endian lane order, and big-endian lane order otherwise.
(This ensures that all raw_bitcast instructions generated by
both wasmtime and other cranelift frontends can always be
implemented as a no-op.)

Lane order affects the implementation of a number of operations:
- Vector immediates
- Vector memory load / store (in big- and little-endian variants)
- Operations explicitly using lane numbers
  (insertlane, extractlane, shuffle, swizzle)
- Operations implicitly using lane numbers
  (iadd_pairwise, narrow/widen, promote/demote, fcvt_low, vhigh_bits)

In addition, when calling a function using a different lane order,
we need to lane-swap all vector values passed or returned in registers.

A small number of changes to common code were also needed:

- Ensure we always select a Wasmtime calling convention on s390x
  in crates/cranelift (func_signature).

- Fix vector immediates for filetests/runtests.  In PR bytecodealliance#4427,
  I attempted to fix this by byte-swapping the V128 value, but
  with the new scheme, we'd instead need to perform a per-lane
  byte swap.  Since we do not know the actual type in write_to_slice
  and read_from_slice, this isn't easily possible.

  Revert this part of PR bytecodealliance#4427 again, and instead just mark the
  memory buffer as little-endian when emitting the trampoline;
  the back-end will then emit correct code to load the constant.

- Change a runtest in simd-bitselect-to-vselect.clif to no longer
  make little-endian lane order assumptions.

- Remove runtests in simd-swizzle.clif that make little-endian
  lane order assumptions by relying on implicit type conversion
  when using a non-i16x8 swizzle result type (this feature should
  probably be removed anyway).

Tested with both wasmtime and cg_clif.
cfallin pushed a commit that referenced this pull request Aug 11, 2022
This implements the s390x back-end portion of the solution for
#4566

We now support both big- and little-endian vector lane order
in code generation.  The order used for a function is determined
by the function's ABI: if it uses a Wasmtime ABI, it will use
little-endian lane order, and big-endian lane order otherwise.
(This ensures that all raw_bitcast instructions generated by
both wasmtime and other cranelift frontends can always be
implemented as a no-op.)

Lane order affects the implementation of a number of operations:
- Vector immediates
- Vector memory load / store (in big- and little-endian variants)
- Operations explicitly using lane numbers
  (insertlane, extractlane, shuffle, swizzle)
- Operations implicitly using lane numbers
  (iadd_pairwise, narrow/widen, promote/demote, fcvt_low, vhigh_bits)

In addition, when calling a function using a different lane order,
we need to lane-swap all vector values passed or returned in registers.

A small number of changes to common code were also needed:

- Ensure we always select a Wasmtime calling convention on s390x
  in crates/cranelift (func_signature).

- Fix vector immediates for filetests/runtests.  In PR #4427,
  I attempted to fix this by byte-swapping the V128 value, but
  with the new scheme, we'd instead need to perform a per-lane
  byte swap.  Since we do not know the actual type in write_to_slice
  and read_from_slice, this isn't easily possible.

  Revert this part of PR #4427 again, and instead just mark the
  memory buffer as little-endian when emitting the trampoline;
  the back-end will then emit correct code to load the constant.

- Change a runtest in simd-bitselect-to-vselect.clif to no longer
  make little-endian lane order assumptions.

- Remove runtests in simd-swizzle.clif that make little-endian
  lane order assumptions by relying on implicit type conversion
  when using a non-i16x8 swizzle result type (this feature should
  probably be removed anyway).

Tested with both wasmtime and cg_clif.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cranelift:area:machinst Issues related to instruction selection and the new MachInst backend. cranelift Issues related to the Cranelift code generator
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants