
Enable Arm SVE2 for 128 bits vector target #6781

Open
wants to merge 26 commits into main
Conversation

stevesuzuki-arm
Contributor

@stevesuzuki-arm commented May 25, 2022

TL;DR

This commit enables Halide to compile a pipeline into LLVM-IR using the Scalable Vector Extension version 2 (SVE2) of the Armv9-A architecture, instead of Neon. LLVM 14 or later is required, and only a 128-bit vector length is supported.

For Halide Users

What is this for

In a nutshell, SVE2 is a newer SIMD instruction set for Arm CPUs and a superset of SVE and Neon (more details in the link above). Depending on the characteristics of the pipeline you compile with Halide, you may be able to leverage it to boost performance.

Performance uplift might be possible if the pipeline has:

  • Gather loads
  • Scatter stores
  • Predicated loads/stores
  • 16-bit integer dot-products, in the same patterns where the Neon 8-bit dot-product works well
  • 64-bit or float16 operations, which are not well supported in Neon

For example, some improvement was observed with apps/local_laplacian, apps/bilateral_grid, apps/camera_pipe, and apps/nl_means in the Halide repository.
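As a toy illustration of the dot-product point above (plain Python, not Halide or PR code; the function name and shapes are illustrative), this is the 4-way reduction shape that peep-hole code-gen can map to a dot-product instruction; Neon provides it for 8-bit inputs, while SVE2 adds a 16-bit variant accumulating into wider lanes:

```python
# Toy model of the 4-way dot-product reduction shape (illustrative only,
# not Halide code). Neon's dot-product handles 8-bit inputs; SVE2 adds a
# 16-bit variant, so pipelines with this shape over 16-bit data benefit.
def dot_product_pattern(a, b, lanes=4):
    # a, b: flat sequences of narrow (e.g. 16-bit) ints, length lanes * 4
    out = []
    for x in range(lanes):
        acc = 0  # widened accumulator (e.g. 64-bit for 16-bit inputs)
        for r in range(4):
            acc += a[x * 4 + r] * b[x * 4 + r]
        out.append(acc)
    return out
```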

On the other hand, it could result in worse performance than Neon if:

  • Widening or narrowing operations are dominant
  • The vectorization factor is not a multiple of the natural vector size (e.g. 3, 5, 6, 7, 9, etc.)

Usage

To enable this feature, just add the sve2 (Target::SVE2) feature and vector_bits_128 to the Halide target (e.g. arm-64-android-sve2-vector_bits_128). In terms of Halide scheduling, no API is updated by this PR, so just schedule the same way as for Neon. Other usage notes are as follows.

  • Only 64-bit OSes on Arm architectures with SVE2 capability are supported.
  • Only a 128-bit vector length is supported at the moment, which means the vscale value of LLVM's scalable vector types is assumed to be 1 at compile time. A runtime error is generated if the program executes on a device whose vector length is not 128 bits.
  • LLVM 14 or later is required for compilation. As of this PR, SHA1 43f8a6b74931 is used for verification.
  • The SVE feature (i.e. without the "2" suffix) is not supported.
  • The NoNeon feature disables SVE2 as well.
  • The ARMDotProd and ARMFp16 features are enabled implicitly by SVE2.
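To make the target-string rules above concrete, here is a toy validator (plain Python; a sketch of the rules listed above, not Halide's actual target parser, which is far more thorough):

```python
# Toy validator for the SVE2 target rules described above (illustrative
# sketch only; Halide's real target parsing is not structured like this).
def check_sve2_target(target):
    parts = target.split("-")           # e.g. arm-64-android-sve2-...
    features = set(parts[2:])           # everything after arch and bits
    if "sve2" not in features:
        return "sve2 feature not enabled"
    if "64" not in parts:
        return "SVE2 requires a 64-bit Arm target"
    if "no_neon" in features:
        return "no_neon disables SVE2 as well"
    if "vector_bits_128" not in features:
        return "vector_bits_128 must be set (only 128 bits is supported)"
    return "ok"
```

For example, check_sve2_target("arm-64-android-sve2-vector_bits_128") returns "ok", while omitting vector_bits_128 yields an error string.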

For Halide Maintainers

Some of the key points of this commit are captured below.

Vector Length Agnostic concept of SVE

This PR works as the initial step to enable SVE2 and, in the future, SME. The target vector length is assumed to be 128 bits at compile time, aiming to give a performance uplift on the latest smartphone SoCs with SVE2 capability. The reasoning behind this approach is as follows.
SVE is designed as an embodiment of the Vector Length Agnostic (VLA) concept, where the exact vector length (VL) is unknown at compile time and obtained at run time. In the LLVM-IR context, this is called a "scalable" vector. However, the Halide compiler assumes VL is a compile-time constant, and that assumption exists in many places across Halide's large software stack. I think it would be a non-trivial technical challenge to incorporate the VLA concept into Halide. On the other hand, from the user's perspective, when scheduling to explore better performance or memory bandwidth, we usually have a specific target processor in mind (i.e. we know the exact VL). Therefore, even if a fixed vector length is assumed at compile time, I would argue that enabling SVE2 features in the Halide backend provides substantial value, and it presumably keeps the complexity and effort small.

Shuffle Vector

By design, LLVM's shufflevector doesn't accept scalable vectors except with a zero mask. Instead, the llvm.experimental.vector.xxx intrinsics support scalable vectors. However, as of LLVM 14, there are a few non-trivial issues.

  • The supported operation patterns are limited (e.g. there is no intrinsic for interleaving).
  • The AArch64 backend doesn't seem mature enough to compile arbitrary cases (e.g. an LLVM error often occurs when dealing with vector lane counts that are not powers of two).

Therefore, lots of tricky workarounds are implemented to process scalable vectors and avoid LLVM errors, some of which use Arm SVE2 intrinsics.

Unsupported peep-hole patterns in SVE2

Some Arm intrinsics have the same name in Neon and SVE2 but different behavior. The notable ones are widening, narrowing, and pair-wise operations, which are performed on an even (bottom) and odd (top) lane basis in SVE, but on the high and low halves in Neon. Therefore, peep-hole code-gen of those patterns into SVE2 intrinsics is not enabled for now, because additional interleaving/deinterleaving would be required to restore the element order in a vector.
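A toy model of that lane-ordering difference (plain Python, not PR code; `widen` stands in for any widening operation):

```python
# Toy model (illustrative, not PR code) of why SVE2 bottom/top widening
# intrinsics need extra (de)interleaving: Neon widens the low and high
# halves of a vector, while SVE2 widens the even (bottom) and odd (top)
# lanes, so the result element order differs between the two.
def widen(x):
    return x * x  # stand-in for any widening operation

def neon_widen(v):
    half = len(v) // 2
    low = [widen(x) for x in v[:half]]     # e.g. SMULL (low half)
    high = [widen(x) for x in v[half:]]    # e.g. SMULL2 (high half)
    return low + high                      # element order is preserved

def sve2_widen(v):
    bottom = [widen(x) for x in v[0::2]]   # e.g. SMULLB (even lanes)
    top = [widen(x) for x in v[1::2]]      # e.g. SMULLT (odd lanes)
    return bottom + top                    # order differs from Neon
```

For v = [0, 1, 2, 3], neon_widen gives [0, 1, 4, 9] while sve2_widen gives [0, 4, 1, 9]; restoring the original element order would cost an extra interleave, which is why those peep-hole patterns are left disabled.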

Workaround for LLVM issues

As of LLVM 14.0.3, an LLVM error often occurs with vanilla code-gen for scalable vector types with "unnatural" lane counts. This commit has lots of workarounds to avoid that by performing code-gen on a natural-lanes basis, where total_lanes is divided into slices, code-gen is performed for each slice, and the results are concatenated back into total_lanes.
The list of LLVM issues is captured in the appendix.
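The slicing approach can be sketched as follows (plain Python; NATURAL_LANES and codegen_op are illustrative stand-ins, not the PR's C++):

```python
# Sketch of the workaround described above: split an "unnatural" lane
# count into natural-width slices, run codegen per slice, and concatenate.
NATURAL_LANES = 4  # e.g. 4 x i32 fills a 128-bit vector

def codegen_op(slice_vals):
    # stand-in for emitting one natural-width vector operation
    return [x + 1 for x in slice_vals]

def codegen_sliced(vals):
    out = []
    for i in range(0, len(vals), NATURAL_LANES):      # one slice at a time
        out.extend(codegen_op(vals[i:i + NATURAL_LANES]))
    return out  # results concatenated back to the full lane count
```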

Refactoring of unit tests for Arm SIMD

  • simd_op_check_arm.cpp is created to merge the Neon test cases from simd_op_check.cpp and float16_t_neon_op_check.cpp.
  • Improved instruction verification so that operands are checked against data bit width and number of lanes
  • Added SVE2 test cases

CMake Tests on emulator

To run test executables on an emulator, the TARGET_EMULATOR CMake variable is added, whose value is passed through to the add_test() CMake function. For example, the value is the path to a wrapper script like:

#!/bin/bash
armie -msve-vector-bits=128 -i libinscount_emulated.so -- "$@"
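For reference, a minimal sketch of how such a wrapper might be consumed on the CMake side (hypothetical; the variable usage mirrors the description above, but the test name is illustrative and Halide's actual test wiring may differ):

```cmake
# Hypothetical sketch: launch the test binary through the emulator
# wrapper when TARGET_EMULATOR is set; the test name is illustrative.
if(TARGET_EMULATOR)
    add_test(NAME correctness_example
             COMMAND ${TARGET_EMULATOR} $<TARGET_FILE:correctness_example>)
else()
    add_test(NAME correctness_example
             COMMAND correctness_example)
endif()
```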

Future work

The following is the list of remaining items and next steps going forward.

  1. Minor improvements, which may be incorporated after looking into more concrete pipelines in detail
  2. Removing the workarounds for LLVM issues once they are fixed
  3. Supporting vscale values other than 1 as a target feature (e.g. vector bits of 256, 512, etc.)
  4. Supporting more extensions such as SVE (not SVE2), MMLA, and future extensions

None of the above is committed to be delivered. I'd be happy to hear what others think or want.

Appendix 1) Test results

Setup

Host machine: arm-64-linux, Ubuntu 20.04 on AWS Graviton2
Emulator for SVE: Arm Instruction Emulator
LLVM: SHA1 43f8a6b74931, Release build
Halide build configuration: CMAKE_BUILD_TYPE:Debug, Halide_TARGET:arm-64-linux-sve2
Environment variables in test: HL_NUM_THREADS=1, due to a limitation of the emulator; otherwise .parallel() raises SIGSEGV
Command to run tests: ctest -V -C Debug -j14

Result

Target with SVE2

97% tests passed, 16 tests failed out of 584

In summary, most of the failures are due to limitations of the emulator environment. From a practical usage perspective, the issue that might affect the end-user experience the most is the one found in correctness_rfactor.

Detail of failed cases:

correctness_async: Emulator limitation when scheduled with .async()
correctness_async_copy_chain: Emulator limitation when scheduled with .async()
correctness_atomics: Emulator limitation when scheduled with .async()
correctness_parallel_fork: Emulator limitation when scheduled with .async()
correctness_rfactor: LLVM error in compilation; the corresponding issue is #55405. The issue happens only in tuple_specialize_rdom_predicate_rfactor_test()
correctness_interleave: LLVM crashes with a weird error after a very long compilation time with huge memory usage. The issue happens only with the last test case of weird size 27, where a number of values with unrealistic lane counts (<vscale x 729 x i16>) are emitted
performance_fan_in: Measurement on the emulator doesn't make sense
performance_fast_inverse: Measurement on the emulator doesn't make sense
performance_inner_loop_parallel: Measurement on the emulator doesn't make sense
performance_memcpy: Emulator crashes
performance_memory_profiler: Emulator crashes
performance_parallel_performance: Measurement on the emulator doesn't make sense
performance_vectorize: Measurement on the emulator doesn't make sense
generator_aot_async_parallel: Emulator limitation when scheduled with .async()
generator_aot_memory_profiler_mandelbrot: Emulator crashes
generator_aot_variable_num_threads: Emulator limitation when scheduled with .parallel()

Target without SVE2

The target is set as Halide_TARGET: arm-64-linux-arm_dot_prod-arm_fp16. Test execution is performed without the emulator.

99% tests passed, 1 tests failed out of 584
...
The following tests FAILED:
	497 - performance_fast_inverse (Failed)

In my setup, the result is the same with or without this PR.

Appendix 2) LLVM Issues

  • [AArch64][SVE] Invalid size request on a scalable vector #54424
  • [AArch64][SVE] Unable to widen vector store #54423
  • [AArch64][SVE] Don't know how to widen the operands for INSERT_SUBVECTOR #54982
  • [AArch64][SVE] Error in insert_v4f32_v2f32 #55037
  • [AArch64][SVE] Error in bitcast on scalable vector #55114
  • [AArch64][SVE] Error in llvm.experimental.stepvector: Do not know how to widen the result of this operator #55165
  • [AArch64][SVE] Error in vector_reverse: Do not know how to widen the result of this operator #55166
  • [AArch64][SVE] Assertion fails in Casting #55348
  • [AArch64][SVE] Assertion fails in inserting sub vector of i1 #55405
  • [SVE] Assertion fails in PtrToInt with scalable vector #55410
  • [AArch64][SVE] No way to convert scalable vector into fixed sized vector #55412

@zvookin
Member

zvookin commented May 25, 2022

I'll start looking at this shortly.

Have you looked at the https://github.com/halide/Halide/tree/fixed_length_vectors branch? I've mostly been looking at RISC-V recently, but the branch does support SVE2 with longer than 128-bit vectors. I'll need to look this over, but hopefully there isn't much overlap, and the mechanism I am using to set vscale to values other than 1 (the vector_bits_* target flag) can just be hooked up in this PR to get support for longer vectors.

@zvookin
Member

zvookin commented May 25, 2022

Does this PR allow generating code for an asserted fixed hardware vscale?

@stevesuzuki-arm
Contributor Author

Have you looked at the https://github.com/halide/Halide/tree/fixed_length_vectors branch?

Thank you @zvookin. No for the commits after the branch was renamed a while ago, and yes for those before that. I'll have a look, but I'd guess there is probably a lot of intersection once we try to support vscale > 1 in the next step.

Does this PR allow generating code for an asserted fixed hardware vscale?

This PR supports only vscale=1. Runtime assertion fails if the program is executed on vscale > 1 hardware.

@zvookin
Member

zvookin commented May 25, 2022

Have you looked at the https://github.com/halide/Halide/tree/fixed_length_vectors branch?

Thank you @zvookin. No for the commits after the branch was renamed a while ago, and yes for those before that. I'll have a look, but I'd guess there is probably a lot of intersection once we try to support vscale > 1 in the next step.

Does this PR allow generating code for an asserted fixed hardware vscale?

This PR supports only vscale=1. Runtime assertion fails if the program is executed on vscale > 1 hardware.

Yes, that is why I am asking. My PR supports generating code at a specific asserted vscale, which is what one wants for best optimized code targeted at known hardware. The question is whether this is just an assert to worry about or if there are other issues targeting vscale other than 1 but still fixed.

I assume this handles SVE2 implementations with larger vector widths than 128 via predication, which my PR currently does not as it asserts both min and max vscale. Not asserting the max would be pretty easy, but the intended use case is targeting exact sizes.

When you talk about supporting vscale greater than 1, is your idea that this will need to support arbitrary vscale? Our impression was that doing that in general is a great deal of work, or a significant performance hit, but I would be interested in your thoughts.

@stevesuzuki-arm
Contributor Author

@zvookin
This PR works as the initial step, where the target vector length is currently assumed to be 128 bits at compile time. Targeting a VL other than 128 bits will break somewhere in complex vector operations such as shuffles. Actually, it has not been tested on targets other than a 128-bit VL. The runtime assertion is there to guard the user from such troubles.

I think it would be a non-trivial technical challenge to incorporate the complete VLA concept into Halide. So the viable approach would be to assume vscale is a compile-time fixed value, even for vscale values other than 1.

@zvookin
Member

zvookin commented May 26, 2022

At a high level, I want to have a vscale PR separated out from both the SVE and RISC-V vector support. Then we can have separate PRs for each side. Probably the easiest way to do that is for me to put up a PR combining what's in fixed_length_vectors and what is here. I don't think the support here is quite right, as it seems to assume a vscale of 1 is 128 bits and does not appear to assert the vscale range on functions. But perhaps I am wrong.

On RISC-V, I deal with shuffles by coercing to fixed vectors, but this may require setting the fixed-vector-size options, which are global and architecture specific. I'm pretty sure that method worked on SVE2 at one time, but I'd have only verified that it compiled to something, not run it under a simulator.

@stevesuzuki-arm
Contributor Author

stevesuzuki-arm commented May 27, 2022

I agree with introducing vector bits in Target as the initial step.
My thought was to start by enabling full functionality for 128 bits on commercialized SoCs as the first step, and to add other vscale values as a next step over time. But since you are working on 256 bits or more, I realize the 128-bit assumption might cause a conflict somewhere in commonly used code. One mitigation I can think of is to modify the diff of CodeGen_LLVM so that, wherever the 128-bit assumption could cause a problem, we do one of the following:

  1. Move it to CodeGen_ARM
  2. Guard it with a target.vector_bits condition
  3. Make it work for vector widths other than 128 bits

Does it make sense?

I previously explored some different approaches. One was to use the aarch64-sve-vector-bits-max/min options, which turned out to be aimed at 256-bit and 512-bit vectors only and doesn't work for 128 bits. The other was using fixed-sized vectors and converting to/from scalable vectors, which turned out not to be at a level of maturity we can rely on, as I frequently hit "Extracting a fixed-length vector from an illegal scalable vector is not yet supported" #55412

@stevesuzuki-arm
Contributor Author

@zvookin Maybe I should rebase this PR on #6786. Other than that, do you have any updates on your views of this change? I would appreciate it if you could give me your thoughts on the question in my previous comment.

@zvookin
Member

zvookin commented Jun 6, 2022

See #6802. It may be useful to put the concat_vectors and reverse_vectors support in this PR, though I believe what those routines do works here by converting to fixed vectors and back for shuffle_vectors. Whether that generates good code is an open question.

I did not put the code that sets the backend specific fixed length vector flags in that PR as I hope it will not be necessary. The PR does assert the vscale range on functions.

Let me know if there are any issues or if this doesn't look workable. Idea is both to factor the pull requests into more reviewable chunks and to make sure the SVE and RISC V Vector support line up together.

@steven-johnson
Contributor

(Just catching up on existing PRs...) where does this stand? IIRC there were pieces that we wanted to break out and land separately.

@stevesuzuki-arm
Contributor Author

My idea for the next step is to rebase this PR on top of the other PRs for vscale vectors (#6786 and #6802) and modify the code to align with them (i.e. CodeGen_LLVM/Internal).
Sorry for not updating; I had been away from this work and was also waiting for #6802.

@steven-johnson
Contributor

No worries, I also have been away :-)

The APIs of CodeGen_Internal require an explicit effective_vscale argument,
while the APIs of CodeGen_LLVM don't have that argument
and use the cached value in a member variable to keep caller code simple.
- Error if the LLVM version is less than 14
- ARMDotProd and ARMFp16 are enabled implicitly by SVE2
- All the vector values emitted by CodeGen become
  ScalableVectorType LLVM-IR in case SVE2 is enabled
- vector_bits in Target must be set for SVE2;
  only 128 is supported for now
A runtime assertion is injected at the start of the function in order to
check that the vscale value of the scalable vectors matches
between compile time and runtime.
By design, LLVM's shufflevector doesn't accept scalable vectors.
Instead, the llvm.experimental.vector.xx intrinsics support scalable vectors.
However, as of LLVM 14, there are a few non-trivial issues.
- The supported operation patterns are limited
  (e.g. no intrinsic for interleaving)
- The AArch64 backend doesn't seem mature enough for those intrinsics
  (e.g. an LLVM error often occurs with vector lane counts that are not powers of two)

Another approach is to perform the shuffle operation on fixed-sized vectors
by adding conversion between scalable and fixed vectors.
However, that conversion results in an LLVM error in most cases
except for natural-size vectors. Even if that error is fixed at some point,
it would only be possible via load/store through memory,
which would presumably perform poorly.

In this commit, lots of workarounds are implemented to avoid LLVM errors,
some of which use Arm SVE2 intrinsics.
As of LLVM 14, an LLVM error often occurs with vanilla codegen for
scalable vector types with "unnatural" lane counts.
This commit is a workaround to avoid that by performing codegen
on a natural-lanes basis, where total_lanes is divided into slices,
codegen is performed for each slice,
and the results are concatenated back into total_lanes.
- Structured load/store (LDN/STN)
- Predicated load/store
- Gather load
- Scatter store
- Add support for 16-bit integer dot_product in SVE2
- Exclude pair-wise reduction in SVE2
- Use Arm SVE2 intrinsics for across-vector reduction
- Predicate-related arguments required by LLVM SVE2 intrinsics
  are added via an LLVM wrapper function
- Widening, narrowing, and pair-wise operations are not used for SVE2
- Refined consistency of float operation intrinsics
- Refined consistency about float operation intrinsics
This prevents absd(fmul_vector, fmul_scalar) from being compiled into "fmsub"
(Fused Multiply-Subtract) by the LLVM backend optimization on Arm,
which generated a larger error when testing Arm Neon "fmul" with float16.
- Consolidated float16_t_neon_op_check and the Arm tests in simd_op_check
- Improved instruction verification so that operands are checked
  against data bit width and number of lanes
- A helper functor is introduced to reduce the code needed to add test cases
- Adjusted the test conditions based on LLVM v14.0.3
- Added NaN checking in output comparison
- Added HL_DEBUG_SIMDOPCHECK for debug logging
@stevesuzuki-arm
Contributor Author

This PR is ready for review. The commits have been rebased onto the main branch as of 5th July. The LLVM used for this work is updated to SHA1:43f8a6b74931.
I've reorganized the commits so that a single topic is implemented in a single commit, so reviewers don't need to look across commits.

@zvookin, there are some updates on the topics we have discussed, which are captured below, and I'm happy to incorporate your feedback.

Supported Vector length

This PR supports SVE2 with only a 128-bit vector length (vscale=1). Other cases will be added over time. That said, the changes in CodeGen_LLVM.cpp/h should also work with vscale values other than 1, aiming not to break the ongoing work for 256 bits etc.

Shuffle vectors

As mentioned in this commit, I didn't take the approach of performing the shuffle operation on fixed-sized vectors by adding conversion between scalable and fixed-sized vectors. The reason is that the conversion results in an LLVM error in most cases except for natural-size vectors. Even if that error is fixed at some point, it would only be possible via load/store through memory, which would presumably perform poorly.

Helper APIs for scalable vector code-gen

As mentioned in this commit, the APIs from #6802 are modified so that effective_vscale is taken into account implicitly in the APIs of CodeGen_LLVM. More specifically, the APIs of CodeGen_Internal require an explicit effective_vscale argument, while the APIs of CodeGen_LLVM don't have that argument and use the cached value in a member variable, aiming to keep caller code simple.

@zvookin
Member

zvookin commented Jul 7, 2022

llvm/llvm-project#56431

@steven-johnson
Contributor

Monday Morning Review Ping -- where does this PR stand?

@stevesuzuki-arm
Contributor Author

@steven-johnson I'm waiting for review feedback and approval, if I understand the situation right. If there is anything on my side that could accelerate the review, I would appreciate hearing it.

Since bitcast between ScalableVector and scalar types is not allowed in LLVM,
conversion to/from FixedVectorType is added.
@stevesuzuki-arm
Contributor Author

Recent changes in main branch have been incorporated. Test results were the same as before.

@zvookin mentioned this pull request Jan 29, 2024
@stevesuzuki-arm
Contributor Author

I missed the updates for a while, but now I'm more than happy to see this finally land! Many thanks for all the efforts to make this happen 🎉
