s390x: Implement full SIMD support #4427

uweigand · 2022-07-09T17:42:51Z

This adds full support for all Cranelift SIMD instructions
to the s390x target. Everything is matched fully via ISLE.

In addition to adding support for many new instructions,
and the lower.isle code to match all SIMD IR patterns,
this patch also adds ABI support for vector types.
In particular, we now need to handle the fact that
vector registers 8 .. 15 are partially callee-saved,
i.e. the high parts of those registers (which correspond
to the old floating-poing registers) are callee-saved,
but the low parts are not. This is the exact same situation
that we already have on AArch64, and so this patch uses the
same solution (the is_included_in_clobbers callback).

The bulk of the changes are platform-specific, but there are
a few exceptions:

Added ISLE extractors for the Immediate and Constant types,
to enable matching the vconst and swizzle instructions.
Added a missing accessor for call_conv to ABISig.
Fixed endian conversion for vector types in data_value.rs
to enable their use in runtests on the big-endian platforms.
Enabled (nearly) all SIMD runtests on s390x. [ Two test cases
remain disabled due to vector shift count semantics, see below. ]
Enabled all Wasmtime SIMD tests on s390x.

There are three minor issues, called out via FIXMEs below,
which should be addressed in the future, but should not be
blockers to getting this patch merged. I've opened the
following issues to track them:

Vector shift count semantics
Craneline vector shift count semantics does not match docs #4424
is_included_in_clobbers vs. link register
Semantics of is_included_in_clobbers not ideal #4425
gen_constant callback
MachInst::gen_constant should no longer be needed #4426

All tests, including all newly enabled SIMD tests, pass
on both z14 and z15 architectures.

FYI @cfallin

uweigand · 2022-07-09T19:22:34Z

Unfortunately there seems to be another QEMU problem. I'm now using a new s390x instruction to implement the "pseudo" min/max semantics, and this seems to be incorrectly implemented on QEMU. This causes the pmin/pmax tests (both the cranelift runtests and the wasm spec tests) to fail under QEMU, even though the all pass on native hardware (both z14 and z15).

I've now disabled those test to make the CI pass. Once we've fixed QEMU they can be re-enabled again.

alexcrichton · 2022-07-11T15:50:59Z

If you're interested in @uweigand one thing that might be good to double-check this PR is to run some of the fuzzers for awhile locally on real hardware since QEMU-based performance on our end probably won't end up fuzzing much. For SIMD fuzzing the v8 fuzzer probably won't work since I suspect that the v8 Rust crate doesn't have prebuilt binaries for s390x (or actually I'm not sure v8 even has support for s390x...) but the differential_spec fuzzer might work? If OCaml works on s390x then that should be enough to get the differential_spec fuzzer running and if you run that for a few hours it might be good to weed out possible issues with simd-related (or other perhaps) instructions.

uweigand · 2022-07-11T16:38:25Z

Hmm, unfortunately cargo fuzz run differential_spec fails immediately with:

error: address sanitizer is not supported for this target

which is a bit strange since LLVM on SystemZ does support the address sanitizer. But I guess there might be some bits of support in the Rust compiler itself that are missing? Do you have any pointers what we'd need to do here?

OCaml does work SystemZ, but I don't think it's even getting that far yet.

alexcrichton · 2022-07-11T16:40:15Z

If you pass -s none to cargo fuzz does it make more progress? (that disables asan)

alexcrichton · 2022-07-11T16:41:39Z

Ah sorry, and for fuzzing, the supported_sanitizers option in the upstream rust-lang/rust target specification I think is probably what's gating this on the s390x target. There may also be distribution bits as well, although I'm less certain about that.

uweigand · 2022-07-11T16:58:57Z

With -s none the build completed successfully, and the fuzzer seems to be running now. Hasn't found anything yet - I'll just leave it running for a while. Thanks!

This adds full support for all Cranelift SIMD instructions to the s390x target. Everything is matched fully via ISLE. In addition to adding support for many new instructions, and the lower.isle code to match all SIMD IR patterns, this patch also adds ABI support for vector types. In particular, we now need to handle the fact that vector registers 8 .. 15 are partially callee-saved, i.e. the high parts of those registers (which correspond to the old floating-poing registers) are callee-saved, but the low parts are not. This is the exact same situation that we already have on AArch64, and so this patch uses the same solution (the is_included_in_clobbers callback). The bulk of the changes are platform-specific, but there are a few exceptions: - Added ISLE extractors for the Immediate and Constant types, to enable matching the vconst and swizzle instructions. - Added a missing accessor for call_conv to ABISig. - Fixed endian conversion for vector types in data_value.rs to enable their use in runtests on the big-endian platforms. - Enabled (nearly) all SIMD runtests on s390x. [ Two test cases remain disabled due to vector shift count semantics, see below. ] - Enabled all Wasmtime SIMD tests on s390x. There are three minor issues, called out via FIXMEs below, which should be addressed in the future, but should not be blockers to getting this patch merged. I've opened the following issues to track them: - Vector shift count semantics bytecodealliance#4424 - is_included_in_clobbers vs. link register bytecodealliance#4425 - gen_constant callback bytecodealliance#4426 All tests, including all newly enabled SIMD tests, pass on both z14 and z15 architectures.

uweigand · 2022-07-12T15:23:02Z

The fuzzer did indeed find two problems, both were detected after only a few minutes of running:

a problem with random high bits in sub-word immediates that were not properly zero-extended
a regalloc problem caused by inaccurate tracking of use of the %r1 temp register

I've now updated this PR to included fixes for both of those. With this fixed, the fuzzer has now been running for over 12 hours without finding anything, currently coming up to:

=== Execution rate (2717815 executed modules / 4171000 tried modules): 65.15979381443299% ===

(not sure if that is a lot or not ... or how long it makes sense to keep it running)

alexcrichton · 2022-07-12T15:28:16Z

It's at least some progress! In #4315 two bugs were found in the x86_64 implementation with a program in the wild, so our fuzzing as-is right now is only but so comprehensive. There's various efforts to improve things, but 12 hours of no more bugs is better than 0 hours at least!

cfallin

This looks great overall -- thank you for the work on this implementation @uweigand!

Given the volume of code and my non-familiarity with s390x's vector instructions, I mostly spot-checked the usual things like regalloc metadata, reasonable ISLE definitions, and the like.

I did notice in some cases in lower.isle special cases -- actually the way in which shuffle uses pattern-matching is quite beautiful -- and was wondering if there were compile-tests that hit these special-cases in particular (vs relying on the Wasm testsuite and existing runtests). The reason I ask is that without rule priorities, these lean on ISLE implicit prioritization to pick the more specific match. With the fallback being correct for all cases, there's no bug here, and the more-specific-first heuristic should usually do the right thing, but it might be good to assert.

Otherwise, I'm happy to see this merged!

uweigand · 2022-07-18T20:43:22Z

I did notice in some cases in lower.isle special cases -- actually the way in which shuffle uses pattern-matching is quite beautiful -- and was wondering if there were compile-tests that hit these special-cases in particular (vs relying on the Wasm testsuite and existing runtests).

Yes, cranelift/filetests/filetests/isa/s390x/vec-permute.clif has a compile test for each of the special cases.

Otherwise, I'm happy to see this merged!

Thanks for the review!

This carries over a narrow fix from bytecodealliance#4427 to prior release branches. The patch here was created by `@uweigand` during the investigation for bytecodealliance#4487.

* Fix panics in s390x codegen related to aliases This commit fixes an issue introduced as part of the fix for GHSA-5fhj-g3p3-pq9g. The `reftyped_vregs` list given to `regalloc2` is not allowed to have duplicates in it and while the list originally doesn't have duplicates once aliases are applied the list may have duplicates. The fix here is to perform another pass to remove duplicates after the aliases have been processed. * Fix a miscompile for s390x with constants This carries over a narrow fix from #4427 to prior release branches. The patch here was created by `@uweigand` during the investigation for #4487. * Add release notes

This implements the s390x back-end portion of the solution for bytecodealliance#4566 We now support both big- and little-endian vector lane order in code generation. The order used for a function is determined by the function's ABI: if it uses a Wasmtime ABI, it will use little-endian lane order, and big-endian lane order otherwise. (This ensures that all raw_bitcast instructions generated by both wasmtime and other cranelift frontends can always be implemented as a no-op.) Lane order affects the implementation of a number of operations: - Vector immediates - Vector memory load / store (in big- and little-endian variants) - Operations explicitly using lane numbers (insertlane, extractlane, shuffle, swizzle) - Operations implicitly using lane numbers (iadd_pairwise, narrow/widen, promote/demote, fcvt_low, vhigh_bits) In addition, when calling a function using a different lane order, we need to lane-swap all vector values passed or returned in registers. A small number of changes to common code were also needed: - Ensure we always select a Wasmtime calling convention on s390x in crates/cranelift (func_signature). - Fix vector immediates for filetests/runtests. In PR bytecodealliance#4427, I attempted to fix this by byte-swapping the V128 value, but with the new scheme, we'd instead need to perform a per-lane byte swap. Since we do not know the actual type in write_to_slice and read_from_slice, this isn't easily possible. Revert this part of PR bytecodealliance#4427 again, and instead just mark the memory buffer as little-endian when emitting the trampoline; the back-end will then emit correct code to load the constant. - Change a runtest in simd-bitselect-to-vselect.clif to no longer make little-endian lane order assumptions. - Remove runtests in simd-swizzle.clif that make little-endian lane order assumptions by relying on implicit type conversion when using a non-i16x8 swizzle result type (this feature should probably be removed anyway). Tested with both wasmtime and cg_clif.

This implements the s390x back-end portion of the solution for #4566 We now support both big- and little-endian vector lane order in code generation. The order used for a function is determined by the function's ABI: if it uses a Wasmtime ABI, it will use little-endian lane order, and big-endian lane order otherwise. (This ensures that all raw_bitcast instructions generated by both wasmtime and other cranelift frontends can always be implemented as a no-op.) Lane order affects the implementation of a number of operations: - Vector immediates - Vector memory load / store (in big- and little-endian variants) - Operations explicitly using lane numbers (insertlane, extractlane, shuffle, swizzle) - Operations implicitly using lane numbers (iadd_pairwise, narrow/widen, promote/demote, fcvt_low, vhigh_bits) In addition, when calling a function using a different lane order, we need to lane-swap all vector values passed or returned in registers. A small number of changes to common code were also needed: - Ensure we always select a Wasmtime calling convention on s390x in crates/cranelift (func_signature). - Fix vector immediates for filetests/runtests. In PR #4427, I attempted to fix this by byte-swapping the V128 value, but with the new scheme, we'd instead need to perform a per-lane byte swap. Since we do not know the actual type in write_to_slice and read_from_slice, this isn't easily possible. Revert this part of PR #4427 again, and instead just mark the memory buffer as little-endian when emitting the trampoline; the back-end will then emit correct code to load the constant. - Change a runtest in simd-bitselect-to-vselect.clif to no longer make little-endian lane order assumptions. - Remove runtests in simd-swizzle.clif that make little-endian lane order assumptions by relying on implicit type conversion when using a non-i16x8 swizzle result type (this feature should probably be removed anyway). Tested with both wasmtime and cg_clif.

github-actions bot added cranelift Issues related to the Cranelift code generator cranelift:area:machinst Issues related to instruction selection and the new MachInst backend. labels Jul 9, 2022

uweigand force-pushed the s390x-simd branch 2 times, most recently from caef959 to 79f9127 Compare July 9, 2022 18:51

uweigand force-pushed the s390x-simd branch from 79f9127 to fe6ddcb Compare July 12, 2022 15:22

cfallin approved these changes Jul 18, 2022

View reviewed changes

cfallin merged commit 638dc4e into bytecodealliance:main Jul 18, 2022

alexcrichton mentioned this pull request Jul 20, 2022

CI for main is broken #4487

Closed

3 tasks

uweigand mentioned this pull request Aug 10, 2022

s390x: Support both big- and little-endian vector lane order #4682

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

s390x: Implement full SIMD support #4427

s390x: Implement full SIMD support #4427

uweigand commented Jul 9, 2022

uweigand commented Jul 9, 2022

alexcrichton commented Jul 11, 2022

uweigand commented Jul 11, 2022

alexcrichton commented Jul 11, 2022

alexcrichton commented Jul 11, 2022

uweigand commented Jul 11, 2022

uweigand commented Jul 12, 2022

alexcrichton commented Jul 12, 2022

cfallin left a comment

uweigand commented Jul 18, 2022

s390x: Implement full SIMD support #4427

s390x: Implement full SIMD support #4427

Conversation

uweigand commented Jul 9, 2022

uweigand commented Jul 9, 2022

alexcrichton commented Jul 11, 2022

uweigand commented Jul 11, 2022

alexcrichton commented Jul 11, 2022

alexcrichton commented Jul 11, 2022

uweigand commented Jul 11, 2022

uweigand commented Jul 12, 2022

alexcrichton commented Jul 12, 2022

cfallin left a comment

Choose a reason for hiding this comment

uweigand commented Jul 18, 2022