
Enable Arm SVE2 for 128 bits vector target #6781

Open
wants to merge 26 commits into main
Conversation

stevesuzuki-arm
Contributor

@stevesuzuki-arm commented May 25, 2022

TL;DR

This commit enables Halide to compile a pipeline into LLVM-IR using the Scalable Vector Extension version 2 (SVE2) of the Armv9-A architecture, instead of Neon. LLVM 14 or later is required, and only a 128-bit vector length is supported.

For Halide Users

What is this for

In a nutshell, SVE2 is a newer SIMD instruction set for Arm CPUs and a superset of SVE and Neon (more details in the link above). Depending on the characteristics of the pipeline you compile with Halide, you may be able to leverage it to boost performance.

Performance uplift might be possible if the pipeline has:

  • Gather loads
  • Scatter stores
  • Predicated loads/stores
  • 16-bit integer dot-products, in the same patterns where the Neon 8-bit dot-product works well
  • 64-bit or float16 operations, which are not well supported in Neon

For example, some improvement was observed with apps/local_laplacian, apps/bilateral_grid, apps/camera_pipe, and apps/nl_means in the Halide repository.
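As a toy illustration of the dot-product point above (plain Python, not Halide or PR code; the function name and shapes are illustrative), this is the 4-way reduction shape that peep-hole code-gen can map to a dot-product instruction; Neon provides it for 8-bit inputs, while SVE2 adds a 16-bit variant accumulating into wider lanes:

```python
# Toy model of the 4-way dot-product reduction shape (illustrative only,
# not Halide code). Neon's dot-product handles 8-bit inputs; SVE2 adds a
# 16-bit variant, so pipelines with this shape over 16-bit data benefit.
def dot_product_pattern(a, b, lanes=4):
    # a, b: flat sequences of narrow (e.g. 16-bit) ints, length lanes * 4
    out = []
    for x in range(lanes):
        acc = 0  # widened accumulator (e.g. 64-bit for 16-bit inputs)
        for r in range(4):
            acc += a[x * 4 + r] * b[x * 4 + r]
        out.append(acc)
    return out
```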

On the other hand, it could result in worse performance than Neon if:

  • Widening or narrowing operations are dominant
  • The vectorization factor is not a multiple of the natural vector size (e.g. 3, 5, 6, 7, 9, etc.)

Usage

To enable this feature, just add the sve2 (Target::SVE2) feature and vector_bits_128 to the Halide target (e.g. arm-64-android-sve2-vector_bits_128). In terms of Halide scheduling, no API is updated by this PR, so just schedule the same way as for Neon. Other usage notes are as follows.

  • Only 64-bit OSes on Arm architectures with SVE2 capability are supported.
  • Only a 128-bit vector length is supported at the moment, which means the vscale value of LLVM's scalable vector types is assumed to be 1 at compile time. A runtime error is generated if the program executes on a device whose vector length is not 128 bits.
  • LLVM 14 or later is required for compilation. As of this PR, SHA1 43f8a6b74931 is used for verification.
  • The SVE feature (i.e. without the "2" suffix) is not supported.
  • The NoNeon feature disables SVE2 as well.
  • The ARMDotProd and ARMFp16 features are enabled implicitly by SVE2.
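To make the target-string rules above concrete, here is a toy validator (plain Python; a sketch of the rules listed above, not Halide's actual target parser, which is far more thorough):

```python
# Toy validator for the SVE2 target rules described above (illustrative
# sketch only; Halide's real target parsing is not structured like this).
def check_sve2_target(target):
    parts = target.split("-")           # e.g. arm-64-android-sve2-...
    features = set(parts[2:])           # everything after arch and bits
    if "sve2" not in features:
        return "sve2 feature not enabled"
    if "64" not in parts:
        return "SVE2 requires a 64-bit Arm target"
    if "no_neon" in features:
        return "no_neon disables SVE2 as well"
    if "vector_bits_128" not in features:
        return "vector_bits_128 must be set (only 128 bits is supported)"
    return "ok"
```

For example, check_sve2_target("arm-64-android-sve2-vector_bits_128") returns "ok", while omitting vector_bits_128 yields an error string.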

For Halide Maintainers

Some of the key points of this commit are captured below.

Vector Length Agnostic concept of SVE

This PR works as the initial step to enable SVE2 and, in the future, SME. The target vector length is assumed to be 128 bits at compile time, aiming to give a performance uplift on the latest smartphone SoCs with SVE2 capability. The reasoning behind this approach is as follows.
SVE is designed as an embodiment of the Vector Length Agnostic (VLA) concept, where the exact vector length (VL) is unknown at compile time and obtained at run time. In the LLVM-IR context, this is called a "scalable" vector. However, the Halide compiler assumes VL is a compile-time constant, and that assumption exists in many places across Halide's large software stack. I think it would be a non-trivial technical challenge to incorporate the VLA concept into Halide. On the other hand, from the user's perspective, when scheduling to explore better performance or memory bandwidth, we usually have a specific target processor in mind (i.e. we know the exact VL). Therefore, even if a fixed vector length is assumed at compile time, I would argue that enabling SVE2 features in the Halide backend provides substantial value, and it presumably keeps the complexity and effort small.

Shuffle Vector

By design, LLVM's shufflevector doesn't accept scalable vectors except with a zero mask. Instead, the llvm.experimental.vector.xxx intrinsics support scalable vectors. However, as of LLVM 14, there are a few non-trivial issues.

  • The supported operation patterns are limited (e.g. there is no intrinsic for interleaving).
  • The AArch64 backend doesn't seem mature enough to compile arbitrary cases (e.g. an LLVM error often occurs when dealing with vector lane counts that are not powers of two).

Therefore, lots of tricky workarounds are implemented to process scalable vectors and avoid LLVM errors, some of which use Arm SVE2 intrinsics.

Unsupported peep-hole patterns in SVE2

Some Arm intrinsics have the same name in Neon and SVE2 but different behavior. The notable ones are widening, narrowing, and pair-wise operations, which are performed on an even (bottom) and odd (top) lane basis in SVE, but on the high and low halves in Neon. Therefore, peep-hole code-gen of those patterns into SVE2 intrinsics is not enabled for now, because additional interleaving/deinterleaving would be required to restore the element order in a vector.
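A toy model of that lane-ordering difference (plain Python, not PR code; `widen` stands in for any widening operation):

```python
# Toy model (illustrative, not PR code) of why SVE2 bottom/top widening
# intrinsics need extra (de)interleaving: Neon widens the low and high
# halves of a vector, while SVE2 widens the even (bottom) and odd (top)
# lanes, so the result element order differs between the two.
def widen(x):
    return x * x  # stand-in for any widening operation

def neon_widen(v):
    half = len(v) // 2
    low = [widen(x) for x in v[:half]]     # e.g. SMULL (low half)
    high = [widen(x) for x in v[half:]]    # e.g. SMULL2 (high half)
    return low + high                      # element order is preserved

def sve2_widen(v):
    bottom = [widen(x) for x in v[0::2]]   # e.g. SMULLB (even lanes)
    top = [widen(x) for x in v[1::2]]      # e.g. SMULLT (odd lanes)
    return bottom + top                    # order differs from Neon
```

For v = [0, 1, 2, 3], neon_widen gives [0, 1, 4, 9] while sve2_widen gives [0, 4, 1, 9]; restoring the original element order would cost an extra interleave, which is why those peep-hole patterns are left disabled.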

Workaround for LLVM issues

As of LLVM 14.0.3, an LLVM error often occurs with vanilla code-gen for scalable vector types with "unnatural" lane counts. This commit has lots of workarounds to avoid that by performing code-gen on a natural-lanes basis, where total_lanes is divided into slices, code-gen is performed for each slice, and the results are concatenated back into total_lanes.
The list of LLVM issues is captured in the appendix.
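The slicing approach can be sketched as follows (plain Python; NATURAL_LANES and codegen_op are illustrative stand-ins, not the PR's C++):

```python
# Sketch of the workaround described above: split an "unnatural" lane
# count into natural-width slices, run codegen per slice, and concatenate.
NATURAL_LANES = 4  # e.g. 4 x i32 fills a 128-bit vector

def codegen_op(slice_vals):
    # stand-in for emitting one natural-width vector operation
    return [x + 1 for x in slice_vals]

def codegen_sliced(vals):
    out = []
    for i in range(0, len(vals), NATURAL_LANES):      # one slice at a time
        out.extend(codegen_op(vals[i:i + NATURAL_LANES]))
    return out  # results concatenated back to the full lane count
```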

Refactoring of unit tests for Arm SIMD

  • simd_op_check_arm.cpp is created to merge the Neon test cases from simd_op_check.cpp and float16_t_neon_op_check.cpp.
  • Improved instruction verification so that operands are checked against data bit width and number of lanes
  • Added SVE2 test cases

CMake Tests on emulator

To run test executables on an emulator, the TARGET_EMULATOR CMake variable is added, whose value is passed through to the add_test() CMake function. For example, the value is the path to a wrapper script like:

#!/bin/bash
armie -msve-vector-bits=128 -i libinscount_emulated.so -- "$@"
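For reference, a minimal sketch of how such a wrapper might be consumed on the CMake side (hypothetical; the variable usage mirrors the description above, but the test name is illustrative and Halide's actual test wiring may differ):

```cmake
# Hypothetical sketch: launch the test binary through the emulator
# wrapper when TARGET_EMULATOR is set; the test name is illustrative.
if(TARGET_EMULATOR)
    add_test(NAME correctness_example
             COMMAND ${TARGET_EMULATOR} $<TARGET_FILE:correctness_example>)
else()
    add_test(NAME correctness_example
             COMMAND correctness_example)
endif()
```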

Future work

The following is the list of remaining items and next steps going forward.

  1. Minor improvements, which may be incorporated after looking into more concrete pipelines in detail
  2. Removing the workarounds for LLVM issues once they are fixed
  3. Supporting vscale values other than 1 as a target feature (e.g. vector bits of 256, 512, etc.)
  4. Supporting more extensions such as SVE (not SVE2), MMLA, and future extensions

None of the above is committed to be delivered. I'd be happy to hear what others think or want.

Appendix 1) Test results

Setup

Host machine: arm-64-linux, Ubuntu 20.04 on AWS Graviton2
Emulator for SVE: Arm Instruction Emulator
LLVM: SHA1 43f8a6b74931, Release build
Halide build configuration: CMAKE_BUILD_TYPE:Debug, Halide_TARGET:arm-64-linux-sve2
Environment variables in test: HL_NUM_THREADS=1, due to a limitation of the emulator; otherwise .parallel() raises SIGSEGV
Command to run tests: ctest -V -C Debug -j14

Result

Target with SVE2

97% tests passed, 16 tests failed out of 584

In summary, most of the failures are due to limitations of the emulator environment. From a practical usage perspective, the issue that might affect the end-user experience the most is the one found in correctness_rfactor.

Detail of failed cases:

correctness_async: Emulator limitation when scheduled with .async()
correctness_async_copy_chain: Emulator limitation when scheduled with .async()
correctness_atomics: Emulator limitation when scheduled with .async()
correctness_parallel_fork: Emulator limitation when scheduled with .async()
correctness_rfactor: LLVM error in compilation; the corresponding issue is #55405. The issue happens only in tuple_specialize_rdom_predicate_rfactor_test()
correctness_interleave: LLVM crashes with a weird error after a very long compilation time with huge memory usage. The issue happens only with the last test case of weird size 27, where a number of values with unrealistic lane counts (<vscale x 729 x i16>) are emitted
performance_fan_in: Measurement on the emulator doesn't make sense
performance_fast_inverse: Measurement on the emulator doesn't make sense
performance_inner_loop_parallel: Measurement on the emulator doesn't make sense
performance_memcpy: Emulator crashes
performance_memory_profiler: Emulator crashes
performance_parallel_performance: Measurement on the emulator doesn't make sense
performance_vectorize: Measurement on the emulator doesn't make sense
generator_aot_async_parallel: Emulator limitation when scheduled with .async()
generator_aot_memory_profiler_mandelbrot: Emulator crashes
generator_aot_variable_num_threads: Emulator limitation when scheduled with .parallel()

Target without SVE2

The target is set as Halide_TARGET: arm-64-linux-arm_dot_prod-arm_fp16. Test execution is performed without the emulator.

99% tests passed, 1 tests failed out of 584
...
The following tests FAILED:
	497 - performance_fast_inverse (Failed)

In my setup, the result is the same with or without this PR.

Appendix 2) LLVM Issues

  • [AArch64][SVE] Invalid size request on a scalable vector #54424
  • [AArch64][SVE] Unable to widen vector store #54423
  • [AArch64][SVE] Don't know how to widen the operands for INSERT_SUBVECTOR #54982
  • [AArch64][SVE] Error in insert_v4f32_v2f32 #55037
  • [AArch64][SVE] Error in bitcast on scalable vector #55114
  • [AArch64][SVE] Error in llvm.experimental.stepvector: Do not know how to widen the result of this operator #55165
  • [AArch64][SVE] Error in vector_reverse: Do not know how to widen the result of this operator #55166
  • [AArch64][SVE] Assertion fails in Casting #55348
  • [AArch64][SVE] Assertion fails in inserting sub vector of i1 #55405
  • [SVE] Assertion fails in PtrToInt with scalable vector #55410
  • [AArch64][SVE] No way to convert scalable vector into fixed sized vector #55412

@zvookin
Member

zvookin commented May 25, 2022

I'll start looking at this shortly.

Have you looked at the https://github.com/halide/Halide/tree/fixed_length_vectors branch? I've mostly been looking at RISC-V recently, but the branch does support SVE2 with longer than 128-bit vectors. I'll need to look this over, but hopefully there isn't much overlap, and the mechanism I am using to set vscale to values other than 1 (the vector_bits_* target flag) can just be hooked up in this PR to get support for longer vectors.

@zvookin
Member

zvookin commented May 25, 2022

Does this PR allow generating code for an asserted fixed hardware vscale?

@stevesuzuki-arm
Contributor Author

Have you looked at the https://github.com/halide/Halide/tree/fixed_length_vectors branch?

Thank you @zvookin. No for the commits after the branch was renamed a while ago, and yes for those before that. I'll have a look, but I'd guess there is probably a lot of intersection once we try to support vscale > 1 in the next step.

Does this PR allow generating code for an asserted fixed hardware vscale?

This PR supports only vscale=1. Runtime assertion fails if the program is executed on vscale > 1 hardware.

@zvookin
Member

zvookin commented May 25, 2022

Have you looked at the https://github.com/halide/Halide/tree/fixed_length_vectors branch?

Thank you @zvookin. No for the commits after the branch was renamed a while ago, and yes for those before that. I'll have a look, but I'd guess there is probably a lot of intersection once we try to support vscale > 1 in the next step.

Does this PR allow generating code for an asserted fixed hardware vscale?

This PR supports only vscale=1. Runtime assertion fails if the program is executed on vscale > 1 hardware.

Yes, that is why I am asking. My PR supports generating code at a specific asserted vscale, which is what one wants for best optimized code targeted at known hardware. The question is whether this is just an assert to worry about or if there are other issues targeting vscale other than 1 but still fixed.

I assume this handles SVE2 implementations with larger vector widths than 128 via predication, which my PR currently does not as it asserts both min and max vscale. Not asserting the max would be pretty easy, but the intended use case is targeting exact sizes.

When you talk about supporting vscale greater than 1, is your idea that this will need to support arbitrary vscale? Our impression was that doing that in general is a great deal of work, or a significant performance hit, but I would be interested in your thoughts.

@stevesuzuki-arm
Contributor Author

@zvookin
This PR works as the initial step, where the target vector length is currently assumed to be 128 bits at compile time. Targeting a VL other than 128 bits will break somewhere in complex vector operations such as shuffles. Actually, it has not been tested on targets other than a 128-bit VL. The runtime assertion is there to guard the user from such troubles.

I think it would be a non-trivial technical challenge to incorporate the complete VLA concept into Halide. So the viable approach would be to assume vscale is a compile-time fixed value, even for vscale values other than 1.

@zvookin
Member

zvookin commented May 26, 2022

At a high level, I want to have a vscale PR separated out from both the SVE and RISC-V vector support. Then we can have separate PRs for each side. Probably the easiest way to do that is for me to put up a PR combining what's in fixed_length_vectors and what is here. I don't think the support here is quite right, as it seems to assume a vscale of 1 is 128 bits and does not appear to assert the vscale range on functions. But perhaps I am wrong.

On RISC-V, I deal with shuffles by coercing to fixed vectors, but this may require setting the fixed-vector-size options, which are global and architecture specific. I'm pretty sure that method worked on SVE2 at one time, but I'd have only verified that it compiled to something, not run it under a simulator.

@stevesuzuki-arm
Contributor Author

stevesuzuki-arm commented May 27, 2022

I agree with introducing vector bits in Target as the initial step.
My thought was to start by enabling full functionality for 128 bits on commercialized SoCs as the first step, and to add other vscale values as a next step over time. But since you are working on 256 bits or more, I realize the 128-bit assumption might cause a conflict somewhere in commonly used code. One mitigation I can think of is to modify the diff of CodeGen_LLVM so that, wherever the 128-bit assumption could cause a problem, we do one of the following:

  1. Move it to CodeGen_ARM
  2. Guard it with a target.vector_bits condition
  3. Make it work for vector widths other than 128 bits

Does it make sense?

I previously explored some different approaches. One was to use the aarch64-sve-vector-bits-max/min options, which turned out to be aimed at 256-bit and 512-bit vectors only and doesn't work for 128 bits. The other was using fixed-sized vectors and converting to/from scalable vectors, which turned out not to be at a level of maturity we can rely on, as I frequently hit "Extracting a fixed-length vector from an illegal scalable vector is not yet supported" #55412

@stevesuzuki-arm
Contributor Author

@zvookin Maybe I should rebase this PR on #6786. Other than that, do you have any updates on your views of this change? I would appreciate it if you could give me your thoughts on the question in my previous comment.

@zvookin
Member

zvookin commented Jun 6, 2022

See #6802. It may be useful to put the concat_vectors and reverse_vectors support in this PR, though I believe what those routines do works here by converting to fixed vectors and back for shuffle_vectors. Whether that generates good code is an open question.

I did not put the code that sets the backend specific fixed length vector flags in that PR as I hope it will not be necessary. The PR does assert the vscale range on functions.

Let me know if there are any issues or if this doesn't look workable. Idea is both to factor the pull requests into more reviewable chunks and to make sure the SVE and RISC V Vector support line up together.

@steven-johnson
Contributor

(Just catching up on existing PRs...) where does this stand? IIRC there were pieces that we wanted to break out and land separately.

@stevesuzuki-arm
Contributor Author

My idea for the next step is to rebase this PR on top of the other PRs for vscale vectors (#6786 and #6802) and modify the code to align with them (i.e. CodeGen_LLVM/Internal).
Sorry for not updating; I had been away from this work and was also waiting for #6802.

@steven-johnson
Contributor

No worries, I also have been away :-)

The APIs of CodeGen_Internal require an explicit effective_vscale argument,
while the APIs of CodeGen_LLVM don't have that argument
and use the cached value in a member variable to keep caller code simple.
- Error if the LLVM version is less than 14
- ARMDotProd and ARMFp16 are enabled implicitly by SVE2
- All the vector values emitted by CodeGen become
  ScalableVectorType LLVM-IR in case SVE2 is enabled
- vector_bits in Target must be set for SVE2;
  only 128 is supported for now
A runtime assertion is injected at the start of the function in order to
check that the vscale value of the scalable vectors matches
between compile time and runtime.
By design, LLVM's shufflevector doesn't accept scalable vectors.
Instead, the llvm.experimental.vector.xx intrinsics support scalable vectors.
However, as of LLVM 14, there are a few non-trivial issues.
- The supported operation patterns are limited
  (e.g. no intrinsic for interleaving)
- The AArch64 backend doesn't seem mature enough for those intrinsics
  (e.g. an LLVM error often occurs with vector lane counts that are not powers of two)

Another approach is to perform the shuffle operation on fixed-sized vectors
by adding conversion between scalable and fixed vectors.
However, that conversion results in an LLVM error in most cases
except for natural-size vectors. Even if that error is fixed at some point,
it would only be possible via load/store through memory,
which would presumably perform poorly.

In this commit, lots of workarounds are implemented to avoid LLVM errors,
some of which use Arm SVE2 intrinsics.
As of LLVM 14, an LLVM error often occurs with vanilla codegen for
scalable vector types with "unnatural" lane counts.
This commit is a workaround to avoid that by performing codegen
on a natural-lanes basis, where total_lanes is divided into slices,
codegen is performed for each slice,
and the results are concatenated back into total_lanes.
- Structured load/store (LDN/STN)
- Predicated load/store
- Gather load
- Scatter store
- Add support for 16-bit integer dot_product in SVE2
- Exclude pair-wise reduction in SVE2
- Use Arm SVE2 intrinsics for across-vector reduction
- Predicate-related arguments required by LLVM SVE2 intrinsics
  are added via an LLVM wrapper function
- Widening, narrowing, and pair-wise operations are not used for SVE2
- Refined consistency of float operation intrinsics
- Refined consistency about float operation intrinsics
This prevents absd(fmul_vector, fmul_scalar) from being compiled into "fmsub"
(Fused Multiply-Subtract) by the LLVM backend optimization on Arm,
which generated a larger error when testing Arm Neon "fmul" with float16.
- Consolidated float16_t_neon_op_check and the Arm tests in simd_op_check
- Improved instruction verification so that operands are checked
  against data bit width and number of lanes
- A helper functor is introduced to reduce the code needed to add test cases
- Adjusted the test conditions based on LLVM v14.0.3
- Added NaN checking in output comparison
- Added HL_DEBUG_SIMDOPCHECK for debug logging
@stevesuzuki-arm
Contributor Author

This PR is ready for review. The commits have been rebased onto the main branch as of 5th July. The LLVM used for this work is updated to SHA1:43f8a6b74931.
I've reorganized the commits so that a single topic is implemented in a single commit, so reviewers don't need to look across commits.

@zvookin, there are some updates on the topics we have discussed, which are captured below, and I'm happy to incorporate your feedback.

Supported Vector length

This PR supports SVE2 with only a 128-bit vector length (vscale=1). Other cases will be added over time. That said, the changes in CodeGen_LLVM.cpp/h should also work with vscale values other than 1, aiming not to break the ongoing work for 256 bits etc.

Shuffle vectors

As mentioned in this commit, I didn't take the approach of performing the shuffle operation on fixed-sized vectors by adding conversion between scalable and fixed-sized vectors. The reason is that the conversion results in an LLVM error in most cases except for natural-size vectors. Even if that error is fixed at some point, it would only be possible via load/store through memory, which would presumably perform poorly.

Helper APIs for scalable vector code-gen

As mentioned in this commit, the APIs from #6802 are modified so that effective_vscale is taken into account implicitly in the APIs of CodeGen_LLVM. More specifically, the APIs of CodeGen_Internal require an explicit effective_vscale argument, while the APIs of CodeGen_LLVM don't have that argument and use the cached value in a member variable, aiming to keep caller code simple.

@zvookin
Member

zvookin commented Jul 7, 2022

llvm/llvm-project#56431

@steven-johnson
Contributor

Monday Morning Review Ping -- where does this PR stand?

@stevesuzuki-arm
Contributor Author

@steven-johnson I'm waiting for review feedback and approval, if I understand the situation right. If there is anything on my side that could accelerate the review, I would appreciate hearing it.

Since bitcast between ScalableVector and scalar types is not allowed in LLVM,
conversion to/from FixedVectorType is added.
@stevesuzuki-arm
Contributor Author

Recent changes in main branch have been incorporated. Test results were the same as before.

@zvookin mentioned this pull request Jan 29, 2024
@stevesuzuki-arm
Contributor Author

I missed the updates for a while, but now I'm more than happy to see this finally land! Many thanks for all the efforts to make this happen 🎉
