Enable Arm SVE2 for 128 bits vector target #6781
Conversation
I'll start looking at this shortly. Have you looked at the https://github.com/halide/Halide/tree/fixed_length_vectors branch? I've mostly been looking at RISC-V recently, but the branch does support SVE2 with longer than 128-bit vectors. I'll need to look this over, but hopefully there isn't much overlap, and the mechanism I am using to set vscale to values other than 1 (the vector_bits_* target flag) can just be hooked up in this PR to get support for longer vectors.
Does this PR allow generating code for an asserted fixed hardware vscale?
Thank you @zvookin. No for the commits after the branch name was changed long ago, and yes before that. I'll have a look, but I guess there will probably be a fair amount of intersection when we try to support vscale > 1 in the next step.
This PR supports only vscale=1. A runtime assertion fails if the program is executed on vscale > 1 hardware.
Yes, that is why I am asking. My PR supports generating code at a specific asserted vscale, which is what one wants for best optimized code targeted at known hardware. The question is whether this is just an assert to worry about, or whether there are other issues when targeting a vscale other than 1 but still fixed. I assume this handles SVE2 implementations with vector widths larger than 128 bits via predication, which my PR currently does not, as it asserts both min and max vscale. Not asserting the max would be pretty easy, but the intended use case is targeting exact sizes. When you say increasing vscale greater than 1, is your idea that this will need to support arbitrary vscale? Our impression was that doing this in general is either a great deal of work or a significant performance hit, but I would be interested in your thoughts.
@zvookin I think it would be a non-trivial technical challenge if we tried to incorporate the complete VLA (Vector Length Agnostic) concept into Halide. So the viable approach would be to assume vscale is a compile-time fixed value, even for vscale other than 1.
High level, I want to have a vscale PR separated out from either SVE or RISC-V vector support. Then we can have separate PRs for each side. Probably the easiest way to do that is for me to put up a PR combining what's in fixed_length_vectors and what is here. I don't think the support here is quite right, as it seems to assume vscale of 1 is 128 bits and does not appear to assert the vscale range on functions. But perhaps I am wrong. On RISC-V, I deal with shuffles by coercing to fixed vectors, but this may require setting the fixed-vector-size options, which are global and architecture specific. I'm pretty sure that method worked on SVE2 at one time, but I'd have only verified that it compiled to something, not run it under a simulator.
I agree with introducing vector bits in Target as the initial step.
Does it make sense? I previously explored some different approaches. One was to use
See #6802. It may be useful to put the concat_vectors and reverse_vectors support in this PR, though I believe what those routines do works here by converting to fixed vectors and back for shuffle_vectors. Whether that generates good code is an open question. I did not put the code that sets the backend-specific fixed-length vector flags in that PR, as I hope it will not be necessary. The PR does assert the vscale range on functions. Let me know if there are any issues or if this doesn't look workable. The idea is both to factor the pull requests into more reviewable chunks and to make sure the SVE and RISC-V Vector support line up together.
(Just catching up on existing PRs...) where does this stand? IIRC there were pieces that we wanted to break out and land separately.
No worries, I also have been away :-)
The APIs of CodeGen_Internal require an explicit effective_vscale argument, while the APIs of CodeGen_LLVM don't have that argument and use the value cached in a member variable, to keep caller code simple.
- Error if LLVM version is less than 14
- ARMDotProd and ARMFp16 are enabled implicitly by SVE2
- All the vector values emitted by CodeGen become ScalableVectorType LLVM-IR in case SVE2 is enabled
- vector_bits in Target must be set for SVE2; only 128 is supported for now
A runtime assertion is injected at the start of the function in order to check that the vscale value of the scalable vector matches between compile time and runtime.
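The check itself is simple arithmetic; here is a minimal standalone sketch of the comparison (the names are illustrative, not the actual runtime hooks):

```cpp
#include <cassert>

// Illustrative sketch only: the injected assertion conceptually compares the
// vscale assumed at compile time with the one observed at runtime. vscale is
// the hardware vector length divided by the 128-bit SVE granule, so for this
// PR (vector_bits_128) the compiled vscale is always 1.
bool vscale_matches(int compiled_vector_bits, int runtime_vector_bits) {
    int compiled_vscale = compiled_vector_bits / 128;
    // On real hardware, the runtime width would come from the SVE registers.
    int runtime_vscale = runtime_vector_bits / 128;
    return compiled_vscale == runtime_vscale;
}
```

If the comparison fails, the generated pipeline raises a runtime error instead of silently computing with the wrong lane count.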
By design, LLVM shufflevector doesn't accept scalable vectors. Instead, the llvm.experimental.vector.* intrinsics support scalable vectors. However, as of LLVM 14, there are a few non-trivial issues:

- The supported operation patterns are limited (e.g. no intrinsic for interleaving).
- The AArch64 backend doesn't seem to be mature for those intrinsics (e.g. LLVM errors often occur with vector lanes that are not a power of two).

Another approach is to perform the shuffle operation on fixed-sized vectors by adding conversions between scalable and fixed vectors. However, that conversion results in an LLVM error in most cases, except for natural-size vectors. Even if that error is fixed at some point, it would only be possible via load/store through memory, which would presumably perform poorly. In this commit, lots of workarounds are implemented to avoid LLVM errors, some of them using Arm SVE2 intrinsics.
As of LLVM 14, LLVM errors often occur with vanilla codegen for scalable vector types with "unnatural" lanes. This commit is a workaround to avoid that by performing codegen on a natural-lanes basis, where total_lanes is divided into slices, codegen is performed for each slice, and the results are concatenated into total_lanes.
- Structured load/store (LDN/STN)
- Predicated load/store
- Gather load
- Scatter store
- Add support for 16-bit integer dot_product in SVE2
- Exclude pair-wise reduction in SVE2
- Use Arm SVE2 intrinsics for across-vector reduction
- Predicate-related arguments required by LLVM SVE2 intrinsics are added via an LLVM wrapper function
- Widening, narrowing and pair-wise intrinsics are not used for SVE2
- Refined the consistency of the float operation intrinsics
To prevent absd(fmul_vector, fmul_scalar) from being compiled into "fmsub" (Fused Multiply-Subtract) by LLVM backend optimization on Arm, which produced a larger error when testing Arm NEON "fmul" with float16.
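The underlying numeric issue can be reproduced in plain C++: a fused multiply-subtract rounds once, while a separate multiply and subtract round twice, so the two forms can disagree in the last bit. This is a hypothetical standalone illustration, not the Halide test code:

```cpp
#include <cmath>

// Separate multiply then subtract: the product is rounded before subtracting.
// 'volatile' prevents the compiler from contracting this into an fma itself.
float separate_msub(float a, float b, float c) {
    volatile float p = a * b;
    return p - c;
}

// Fused multiply-subtract: a*b - c with a single rounding, as "fmsub" does.
float fused_msub(float a, float b, float c) {
    return std::fma(a, b, -c);
}
```

With a = 1 + 2^-12 and c = a*a (rounded to float), separate_msub(a, a, c) is exactly 0, while fused_msub(a, a, c) recovers the nonzero rounding residue of the product, which is why the backend's contraction changed test results.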
- Consolidated float16_t_neon_op_check and the Arm tests in simd_op_check
- Improved the verification of instructions so that operands are checked for data bit width and number of lanes
- Introduced a helper functor to reduce the code needed to add test cases
- Adjusted the test conditions based on LLVM v14.0.3
- Added a check for "NaN" in output comparison
- Added HL_DEBUG_SIMDOPCHECK for debug logging
This PR is ready for review. Commits have been rebased onto the main branch as of 5th July. The LLVM used for this work is updated to SHA1:

@zvookin, there are some updates on the topics we have discussed, which are captured below, and I'm happy to incorporate your feedback.

Supported vector length

This PR supports SVE2 with only a 128-bit vector length (vscale=1). Other cases will be added over time. That said, the changes in CodeGen_LLVM.cpp/h should also work with vscale values other than 1, aiming not to break the ongoing work for 256 bits etc.

Shuffle vectors

As mentioned in this commit, I didn't take the approach of performing shuffle operations on fixed-sized vectors by adding conversions between scalable and fixed-sized vectors. The reason is that the conversion results in an LLVM error in most cases, except for natural-size vectors. Even if that error is fixed at some point, it would only be possible via load/store through memory, which would presumably perform poorly.

Helper APIs for scalable vector code-gen

As mentioned in this commit, the APIs in #6802 are modified so that effective_vscale is taken into account implicitly in the APIs of CodeGen_LLVM. More specifically, the APIs of CodeGen_Internal require an explicit effective_vscale argument, while the APIs of CodeGen_LLVM don't have that argument and use the value cached in a member variable, aiming to keep caller code simple.
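The split between the two API layers can be sketched roughly like this (hypothetical names, not the real CodeGen_Internal/CodeGen_LLVM helpers):

```cpp
#include <cassert>

// Hypothetical sketch of the API layering described above. A free function
// (CodeGen_Internal style) takes effective_vscale explicitly...
int scalable_lanes(int fixed_lanes, int effective_vscale) {
    return fixed_lanes / effective_vscale;
}

// ...while a member function (CodeGen_LLVM style) uses the cached value,
// so call sites don't have to thread effective_vscale through everywhere.
struct CodeGenSketch {
    int effective_vscale = 1;  // cached once per compilation target
    int scalable_lanes(int fixed_lanes) const {
        return ::scalable_lanes(fixed_lanes, effective_vscale);
    }
};
```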
Monday Morning Review Ping -- where does this PR stand? |
@steven-johnson I'm waiting for review feedback and approval, if I understand the situation right. If there is anything on my side that could accelerate the review, I would appreciate it.
Since a bitcast between a scalable vector and a scalar is not allowed in LLVM, conversion to/from FixedVectorType is added.
Recent changes in the main branch have been incorporated. Test results were the same as before.
I missed the update for a while, but now I'm more than happy to see this finally landed! Many thanks for all the efforts to make this happen 🎉
TL;DR
This commit enables Halide to compile a pipeline into LLVM-IR with Scalable Vector Extension version two (SVE2) of the Armv9-A architecture, instead of Neon. LLVM version 14+ is required and the supported vector length is 128 bits only.
For Halide Users
What is this for
In a nutshell, SVE2 is a new SIMD instruction set for Arm CPUs and a superset of SVE and Neon (more details in the above link). Depending on the characteristics of the pipeline you compile with Halide, you may be able to leverage it to boost performance.
Performance uplift might be possible if the pipeline has:

For example, some improvement was observed with `apps/local_laplacian`, `apps/bilateral_grid`, `apps/camera_pipe` and `apps/nl_means` in the Halide repository. On the other hand, it could result in worse performance than Neon if:
Usage
To enable this feature, just add the `sve2` (`Target::SVE2`) feature and `vector_bits_128` to the Halide Target (e.g. `arm-64-android-sve2-vector_bits_128`). In terms of Halide scheduling, no API is updated by this PR, so just schedule in the same way as for Neon. Other knowledge about the usage is as follows.

- A 64-bit OS on an Arm architecture with SVE2 capability is supported.
- The supported vector length is 128 bits at the moment, which means the `vscale` value of the scalable vector type in LLVM is assumed to be `1` at compilation time. A runtime error is generated if executed on a device where the vector length is other than 128 bits.
- LLVM `14` or later is required for compilation. As of the issuing of this PR, SHA1 `43f8a6b74931` is used for verification.
- `SVE` (i.e. without the suffix `2`) is not supported.
- `NoNeon` disables `SVE2` as well.
- `ARMDotProd` and `ARMFp16` are enabled implicitly by the feature `SVE2`.

For Halide Maintainers
Some of the key points of this commit are captured below.
Vector Length Agnostic concept of SVE
This PR works as the initial step to enable SVE2 and, in the future, SME. The target vector length is assumed to be 128 bits at compilation time, aiming to give us a performance uplift on the latest smartphone SoCs with SVE2 capability. The reasons for this approach are as follows.
SVE is designed as an embodiment of the Vector Length Agnostic (VLA) concept, where the exact vector length (VL) is unknown at compilation time and obtained at runtime. In the LLVM-IR context, this is called a "scalable" vector. However, the Halide compiler assumes VL is a compile-time fixed value, and that assumption exists in many places across Halide's large software stack. I think it would be a non-trivial technical challenge to incorporate the VLA concept into Halide. On the other hand, from a user perspective, when scheduling to explore better performance/memory bandwidth, we usually have a specific target processor in mind (i.e. we know the exact VL). Therefore, even if a fixed vector length is assumed at compilation time, I would argue that enabling SVE2 features in the Halide backend provides substantial value while keeping the complexity/effort small.
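Under this fixed-vscale assumption, the mapping from a Halide fixed-length vector to an LLVM scalable type is a simple division. An illustrative sketch (the function names are not the actual Halide helpers):

```cpp
#include <cassert>

// With vector_bits fixed in the Target, vscale is pinned at compile time:
// the SVE vector length is a multiple of the 128-bit granule.
int pinned_vscale(int vector_bits) {
    return vector_bits / 128;
}

// A Halide vector of `halide_lanes` elements then maps to the LLVM scalable
// type <vscale x n x ty> with n = halide_lanes / vscale. This PR supports
// only vector_bits = 128, i.e. vscale = 1, so n == halide_lanes.
int scalable_min_lanes(int halide_lanes, int vector_bits) {
    return halide_lanes / pinned_vscale(vector_bits);
}
```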
Shuffle Vector
By design, LLVM `shufflevector` doesn't accept scalable vectors except for a zero mask. Instead, the `llvm.experimental.vector.*` intrinsics support scalable vectors. However, as of LLVM 14, there are a few non-trivial issues.

Therefore, lots of tricky workarounds are implemented to process scalable vectors and to avoid LLVM errors, some of them using Arm SVE2 intrinsics.
Unsupported peep-hole patterns in SVE2
Some of the Arm intrinsics have the same name in Neon and SVE2 but different behavior. The notable ones are widening, narrowing and pair-wise operations, which are performed on an even (bottom) and odd (top) lane basis in SVE, but on the high and low halves of a vector in Neon. Therefore, peep-hole code-gen of those patterns into SVE2 intrinsics is not enabled for now, because additional interleaving/deinterleaving would be required to restore the element order in a vector.
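The ordering difference is easy to see by modeling which source lanes each flavor reads (illustrative C++ only, not the intrinsics themselves):

```cpp
#include <cassert>
#include <vector>

// Neon widening "low" variants (e.g. SADDL) read the low half of the
// source vector: lanes 0 .. n/2-1, in order.
std::vector<int> neon_low_half_lanes(int lanes) {
    std::vector<int> idx;
    for (int i = 0; i < lanes / 2; i++) idx.push_back(i);
    return idx;
}

// SVE2 widening "bottom" variants (e.g. SADDLB) read the even-numbered
// lanes instead, so the results land in a different element order and a
// deinterleave/interleave would be needed to match Neon's layout.
std::vector<int> sve_bottom_lanes(int lanes) {
    std::vector<int> idx;
    for (int i = 0; i < lanes; i += 2) idx.push_back(i);
    return idx;
}
```

For an 8-lane source, the Neon flavor reads lanes {0, 1, 2, 3} while the SVE2 flavor reads lanes {0, 2, 4, 6}: same operation name, different element order.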
Workaround for LLVM issues
As of LLVM 14.0.3, LLVM errors often occur with vanilla code-gen for scalable vector types with "unnatural" lanes. This commit has lots of workarounds to avoid that by performing code-gen on a natural-lanes basis, where total_lanes is divided into slices, code-gen is performed for each slice, and the results are concatenated into total_lanes.
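The slicing strategy can be sketched as follows (a hypothetical helper, not the actual Halide code):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Divide an "unnatural" total lane count into slices of at most
// natural_lanes each; code-gen runs per slice and the resulting
// vectors are concatenated back to total_lanes.
std::vector<int> natural_slices(int total_lanes, int natural_lanes) {
    std::vector<int> slices;
    for (int remaining = total_lanes; remaining > 0; remaining -= natural_lanes) {
        slices.push_back(std::min(remaining, natural_lanes));
    }
    return slices;
}
```

For example, 10 lanes with a natural width of 4 are processed as slices of 4, 4 and 2.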
The list of LLVM issues is captured in the appendix.
Refactoring of unit tests for Arm SIMD
`simd_op_check_arm.cpp` is created to merge the Neon test cases in `simd_op_check.cpp` and `float16_t_neon_op_check.cpp`.
.CMake Tests on emulator
To run test executables on an emulator, the `TARGET_EMULATOR` CMake variable is added, which is set as the argument of the `add_test()` CMake function. For example, the value is the path to a wrapper script like:

Future work
The following is the list of remaining items and next steps going forward.

- `vscale` value other than `1` as a target feature (e.g. vector bits of 256, 512, etc.)

Nothing above is committed to be delivered. I'd be happy to hear what others think or want.
Appendix 1) Test results
Setup
- `arm-64-linux`, Ubuntu 20.04 on AWS Graviton2
- LLVM SHA1 `43f8a6b74931`, `Release` build
- `CMAKE_BUILD_TYPE`: `Debug`, `Halide_TARGET`: `arm-64-linux-sve2`
- `HL_NUM_THREADS=1`, due to a limitation of the emulator; otherwise `.parallel()` raises SIGSEGV
- `ctest -V -C Debug -j14`
Result
Target with SVE2
In summary, most of the failures are due to limitations of the emulator environment. From a practical usage perspective, what might affect the end-user experience the most is the issue found in `correctness_rfactor`.

Detail of failed cases:
- `.async()`
- `.async()`
- `.async()`
- `.async()`
- `tuple_specialize_rdom_predicate_rfactor_test()`
- `27`, where a number of values with unrealistic lanes `<vscale x 729 x i16>` are emitted
- `.async()`
- `.parallel()`
Target without SVE2
The target is set as `Halide_TARGET`: `arm-64-linux-arm_dot_prod-arm_fp16`. Test execution is performed without the emulator.

In my setup, the result is the same regardless of this PR.
Appendix 2) LLVM Issues