sve gather scatter offset sinking #66932

paulwalker-arm · 2023-09-20T17:40:02Z

[flang] Add comdats to functions with linkonce linkage (#66516)
[clang][TSA] Thread safety cleanup functions
[SPIRV] Test basic float and int types (#66282)
[mlgo] Fix tests post PR #66334
[libunwind][AIX] Fix up TOC register if unw_getcontext is called from a different module (#66549)
[RISCV] Recognize veyron-v1 processor in clang driver. (#66703)
[RISCV] Add a combine to form masked.store from unit strided store
[SROA] Remove unnecessary IsStorePastEnd handling (NFCI)
In ExprRequirement building, treat OverloadExpr as dependent (#66683)
[mlir][SCF] ForOp: Remove getIterArgNumberForOpOperand (#66629)
[mlir][Interfaces] LoopLikeOpInterface: Support ops with multiple regions (#66754)
[DAGCombiner] Combine vp.strided.load with unit stride to vp.load (#66766)
[DAGCombiner] Combine vp.strided.store with unit stride to vp.store (#66774)
[TwoAddressInstruction] Use isPlainlyKilled in processTiedPairs (#65976)
[RISCV] Fix bad isel predicate handling for Ztso. (#66739)
[libc][math] Extract non-MPFR math tests into libc-math-smoke-tests.
[lit] Drop "Script:", make -v and -a imply -vv
[lit] Improve test output from lit's internal shell
[lit] Echo full RUN lines in case of external shells (#66408)
[RISCV] Add a pass to rewrite rd to x0 for non-computational instrs whose return values are unused
[mlir][spirv][gpu] Convert remaining wmma ops to KHR coop matrix (#66455)
[mlir][sparse] More allocate -> empty tensor migration (#66720)
[gn build] Port 93fde2e
[RISCV] Add more instructions for the short forward branch optimization. (#66789)
[SSP] Accessing __stack_chk_guard when using LTO (#66535)
[RISCV] Expand test coverage for widening gather and strided load idioms
[lldb][NFCI] Remove unneeded ConstString from intel-pt plugin (#66721)
[lldb][NFCI] Remove unneccessary allocation in ScriptInterpreterPythonImpl::GetSyntheticTypeName (#66724)
[Profile] Delete coverage-debug-info-correlate.cpp test on mac as debug info correlation not working on mac for unkown reasons.
[lldb] Fix build after d5a62b7
[flang][hlfir] Fixed assignment/finalization order for user-defined assignments. (#66736)
[RISCV] Require alignment when forming gather with larger element type
Addressed review comments to use ThreadSafe instead of !ThreadSafe
[flang] Follow memory source through more operations (#66713)
[X86] Use RIP-relative addressing for data under large data threshold for medium code model
Fix a bug with cancelling "attach -w" after you have run a process previously (#65822)
Let the c(xx)_status pages reflect that clang 17 is released
Revert "[flang][hlfir] Fixed assignment/finalization order for user-defined assignments. (#66736)"
Revert "Revert "[flang][hlfir] Fixed assignment/finalization order for user-defined assignments. (#66736)""
[ORC] Add writePointers to ExecutorProcessControl's MemoryAccess
[Coverage] Skip visiting ctor member initializers with invalid source locations.
[SLP]Fix PR66795: Check correct deps for vectorized inst with multiple vectorized node uses.
[github] Make branch workflow more robust (#66781)
[flang] Correct handling of assumed-rank allocatables in ALLOCATE (#66718)
[BOLT][runtime] Test for outline-atomics support
[mlir][spirv] Add conversions for Arith's maxnumf and minnumf (#66696)
[libc++][NFC] Clean up std::__call_once
[libc][cmake] Tidy compiler includes (#66783)
[OpenMP][Docs][NFC] Update documentation
[RISCV] Match strided load via DAG combine (#66800)
[llvm-nm] Add --line-numbers flag
Revert "[libc][cmake] Tidy compiler includes (#66783)" (#66822)
[-Wunsafe-bugger-usage] Clean tests: remove nondeterministic ordering
[mlir][sparse][gpu] free all buffers allocated for spGEMM (#66813)
[llvm][docs] Update active CoC Commitee members (#66814)
Explicitly set triple on line-numbers.test
[AsmPrint] Dump raw frequencies in -mbb-profile-dump (#66818)
[Clang] Static member initializers are not immediate escalating context. (#66021)
[mlir][spirv] Suffix NV cooperative matrix props with _nv (#66820)
[mlir][spirv] Define KHR cooperative matrix properties (#66823)
[lit] Fix a test fail under windows
[InstrProf][compiler-rt] Enable MC/DC Support in LLVM Source-based Code Coverage (1/3)
[AMDGPU] Use inreg for hint to preload kernel arguments
[EarlyCSE] Compare GEP instructions based on offset (#65875)
[libc++] Fix __threading_support when used with C11 threading (#66780)
[clang] Improve CI output when trailing whitespace is found (#66649)
[libc] Fix printf config not working (#66834)
[lit] Apply aa71680's fix to an additional test
[AMDGPU] Add ASM and MC updates for preloading kernargs
[bazel] Port c649f29 (llvm-nm --line-numbers)
Fix test added in D150987 to account for different path separators which was causing the test to fail on Windows.
[SimplifyCFG] Pre-commit test for extending HoistThenElseCodeToIf.
[SimplifyCFG] Hoist common instructions on Switch.
[IR] Add "Large Data Threshold" module metadata (#66797)
A test was changing directory and then incorrectly restoring the directory to the "testdir" which is the build directory for that test, not the original source directory. That caused subsequent tests to fail.
[mlir][sparse] unifies sparse_tensor.sort_coo/sort into one operation. (#66722)
[Docs] Fix table after previous document update
[Sparc] Remove LEA instructions (NFCI) (#65850)
[lldb][NFCI] Remove unused struct ConstString::StringIsEqual
[builtins][NFC] Avoid using CRT_LDBL_128BIT in tests (#66832)
[RISCV] Prefer Zcmp push/pop instead of save-restore calls. (#66046)
[DependencyScanningFilesystem] Make sure the local/shared cache filename lookups use only absolute paths (#66122)
[NFC][hwasan] Make ShowHeapOrGlobalCandidate a method (#66682)
[NFC][hwasan] Find overflow candidate early (#66682)
[NFC][hwasan] Clang-format c557621
[NFC][hwasan] Extract a few BaseReport::Copy methods (#66682)
[NFC][hwasan] Extract announce_by_id (#66682)
[NFC][hwasan] Collect heap allocations early (#66682)
[libc++] Warn if an unsupported compiler is used
[ELF][test] Improve tests about non-SHF_ALLOC sections relocated by non-ABS relocations
[ELF] Remove a R_ARM_PCA special case from relocateNonAlloc
[clang][dataflow] Reorder checks to protect against a null pointer dereference. (#66764)
[MC,X86] Property report error for modifiers with incorrect size
[RISCV] Install sifive_vector.h to riscv-resource-headers (#66330)
[InferAlignment] Create tests for InferAlignment pass
[InferAlignment] Implement InferAlignmentPass
[InstCombine] Use a cl::opt to control calls to getOrEnforceKnownAlignment in LoadInst and StoreInst
[InferAlignment] Enable InferAlignment pass by default
[ELF][test] Improve -r tests for local symbols
[mlir][IR] Trigger notifyOperationRemoved callback for nested ops (#66771)
[Workflow] Add new code format helper. (#66684)
[gn build] Port 0f152a5
[RISCV] Fix bugs about register list of Zcmp push/pop. (#66073)
[AMDGPU] Run twoaddr tests with -early-live-intervals (#66775)
[TableGen][GlobalISel] Use GIM_SwitchOpcode in Combiners (#66864)
[NFC][InferAlignment] Swap extern declaration and definition of EnableInferAlignmentPass
[flang] Prevent IR name clashes between BIND(C) and external procedures (#66777)
Revert "[Workflow] Add new code format helper. (#66684)"
[lldb][Docs] Fix typo in style docs
[clang-format][NFC] Clean up signatures of some parser functions (#66569)
Revert "Fix a bug with cancelling "attach -w" after you have run a process previously (#65822)"
[OpenMP][VE] Limit the number of threads to create (#66729)
[SimpleLoopUnswitch] Fix reversed branch during condition injection
[mlir][vector] Make ReorderElementwiseOpsOnBroadcast support vector.splat (#66596)
[lldb][AArch64] Add SME's streaming vector control register
[reland][libc][cmake] Tidy compiler includes (#66783) (#66878)
[GuardUtils] Revert llvm::isWidenableBranch change (#66411)
[LLVM] convergence verifier should visit all instructions (#66200)
[lldb][API] Remove debug print in TestRunLocker.py
[clang] [C23] Fix crash with _BitInt running clang-tidy (#65889)
[Flang][OpenMP] Move FIR lowering tests to a separate directory (#66779)
[RISCV] Add missing V extensions for zvk-invalid-features.c (#66875)
[mlir][gpu][bufferization] Implement BufferDeallocationOpInterface for gpu.terminator (#66880)
[analyzer] Fix crash analyzing _BitInt() in evalIntegralCast (#66782)
[IR] Fix a memory leak if Function::dropAllReferences() is followed by setHungoffOperand
[X86] vector-interleaved tests - add AVX512-SLOW/AVX512-FAST common prefixes to reduce duplication
[X86] combineINSERT_SUBVECTOR - attempt to combine concatenated shuffles
[X86] Add test cases for gnux32 large constants Issue #55061
[NFC][Clang] Address reviews about overrideFunctionFeaturesWithTargetFeatures (#65938)
[analyzer] Fix StackAddrEscapeChecker crash on temporary object fields (#66493)
[VE] Add unittest for intrinsics (#66730)
[NFC][AMDGPU] Perform a single lookup in map in SIInsertWaitcnts::isPreheaderToFlush
[NFC][AMDGPU] Remove redundant hasSideEffects=1
[SROA] Don't shrink volatile load past end
[mlir][bufferization][scf] Implement BufferDeallocationOpInterface for scf.reduce.return (#66886)
[RISCV] Add tests where bin ops of splats could be scalarized. NFC (#65747)
[clang][Interp][NFC] Small code refactoring
[Docs] Update ExceptionHandling example (NFC)
[mlir][bufferization][NFC] Move memref specific implementation of AllocationOpInterface to memref dialect directory (#66637)
[X86] Align other variants to use void * as 512 variants. (#66310)
[X86] Fix an assembler bug of CMPCCXADD. (#66748)
[clang][dataflow] Identify post-visit state changes in the HTML logger. (#66746)
[MLIR][Presburger] Template Matrix to allow MPInt and Fraction; use IntMatrix for integer matrices (#66897)
[SPIR-V] Fix 64-bit integer literal printing (#66686)
[libc++] Simplify how the global stream tests are written (#66842)
[AArch64][SME] Enable TPIDR2 lazy-save for za_preserved
[X86] X86DAGToDAGISel::matchIndexRecursively - replace hard coded recursion limit with SelectionDAG::MaxRecursionDepth. NFCI.
[libc++] Sort available features before printing them
[mlir][VectorOps] Extend vector.constant_mask to support 'all true' scalable dims (#66638)
Warn on align directive with non-zero fill value in virtual sections (#66792)
[VE] Add TargetParser to CMakeLists.txt for VE unittest
[lldb-vscode] Use auto summaries whenever variables don't have a summary (#66551)
Revert "[clang] Don't inherit dllimport/dllexport to exclude_from_explicit_instantiation members during explicit instantiation (#65961)"
[AMDGPU] Convert tests rotr.ll and rotl.ll to be auto-generated (#66828)
[NFC] Fix spelling 'constanst' -> 'constants'
[mlir][Vector] Add fastmath flags to vector.reduction (#66905)
[lldb][AArch64] Invalidate cached VG value before reconfiguring SVE registers
[gn] Add dummy build file for VETests
[SPIRV] Fix OpConstant float and double printing
[flang][hlfir] Fixed cleanup code placement indeterminism in OrderedAssignments. (#66811)
[AMDGPU] Regenerate always-uniform.ll
[X86] Regenerate pr39098.ll
[ELF][test] Add a test to demonstrate #66836
[NFC][AsmPrinter] Refactor FrameIndexExprs as a std::set (#66433)
[ELF] Postpone "unable to move location counter backward" error (#66854)
[clang][CodeGen] The eh_typeid_for intrinsic needs special care too (#65699)
[AArch64][GlobalISel] Adopt dup(load) -> LD1R patterns from SelectionDAG
Cleanup fallback NOT checks
[AArch64] Add some tests for setcc known bits fold. NFC
[SelectionDAG] [NFC] Add pre-commit test for PR66701. (#66796)
[Driver] Some improvements for path handling on NetBSD (#66863)
[mlir][sparse] remove most bufferization.alloc_tensor ops from sparse (#66847)
[mlir] Bazel fixes for 1b8b556 (#66929)
[mlir] introduce transform.loop.forall_to_for (#65474)
[mlir] regenerate linalg named ops yaml (#65475)
[SLP]Fix a crash when trying to find operand with re-vectorized main instruction.
[libc][Obvious] Fix incorrect RPC opcode for clearerr
[SVE][CodeGenPrepare] Sink address calculations that match SVE gather/scatter addressing modes.

SVE supports scalar+vector and scalar+extw(vector) addressing modes. However, the masked gather/scatter intrinsics take a vector of addresses, which means address computations can be hoisted out of loops. This is especially true for things like offsets, where the true size of the offsets is lost by the time you get to code generation. This is problematic because it forces the code generator to legalise towards `<vscale x 2 x ty>` vectors that will not maximise bandwidth if the main block datatype is in fact i32 or smaller.

This patch sinks GEPs and extends for cases where one of the above addressing modes can be used.

NOTE: There are cases where it would be better to split the extend in two, with one half hoisted out of the loop and the other kept within it. Whilst true, I think this switch of default is still better than before because the extra extends are an improvement over being forced to split a gather/scatter.
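To make the bandwidth point concrete, here is a rough back-of-the-envelope sketch in plain C++ (no LLVM dependency). It assumes one 128-bit SVE granule (vscale == 1) and assumes each gather lane occupies a container as wide as the wider of the data element and the offset element. The names `GranuleBits`, `lanesPerGather` and `gathersPerResultVector` are made up for illustration and are not part of LLVM.

```cpp
#include <algorithm>
#include <cassert>

// One 128-bit SVE granule, i.e. vscale == 1. Real hardware multiplies every
// lane count below by vscale, so only the ratios matter.
const unsigned GranuleBits = 128;

// Lanes a single gather/scatter can handle per granule. Assumption: each lane
// needs a container as wide as the wider of the data and offset elements.
unsigned lanesPerGather(unsigned DataBits, unsigned OffsetBits) {
  assert(DataBits != 0 && OffsetBits != 0 && "element widths must be non-zero");
  return GranuleBits / std::max(DataBits, OffsetBits);
}

// Gathers needed to produce one full granule of DataBits-wide results when
// the offsets are OffsetBits wide.
unsigned gathersPerResultVector(unsigned DataBits, unsigned OffsetBits) {
  unsigned Wanted = GranuleBits / DataBits; // e.g. 4 lanes of i32
  unsigned Lanes = lanesPerGather(DataBits, OffsetBits);
  return (Wanted + Lanes - 1) / Lanes;      // round up
}
```

Under these assumptions, `gathersPerResultVector(32, 64)` is 2 while `gathersPerResultVector(32, 32)` is 1: an i32 gather whose offsets have already been extended to i64 has to be split, whereas keeping the extend next to the gather lets a single 32-bit-offset gather cover the whole vector.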

llvmbot · 2023-09-20T17:41:11Z

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-aarch64

Changes

[SVE][CodeGenPrepare] Sink address calculations that match SVE gather/scatter addressing modes.

Full diff: https://github.com/llvm/llvm-project/pull/66932.diff

2 Files Affected:

(modified) llvm/lib/Target/AArch64/AArch64ISelLowering.cpp (+35)
(added) llvm/test/Transforms/CodeGenPrepare/AArch64/sink-gather-scatter-addressing.ll (+231)
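
As a reading aid for the diff that follows, the decision made by `shouldSinkVectorOfPtrs` can be modelled without LLVM roughly as below. This is only a toy sketch: `ToyType`, `ToyOffsets`, `SinkDecision` and `decideSink` are hypothetical stand-ins invented here, and the model assumes the address is already known to be a two-operand GEP.

```cpp
#include <cassert>

// Hypothetical stand-in for an LLVM type: scalar element width plus a
// vector/scalar flag. Not an LLVM class.
struct ToyType {
  unsigned ScalarBits;
  bool IsVector;
};

// Hypothetical stand-in for the offsets operand of a two-operand GEP.
struct ToyOffsets {
  ToyType Ty;          // type of the offsets as seen by the GEP
  bool IsExtend;       // true if the offsets come from a sext/zext
  ToyType ExtendSrcTy; // source type of that extend (ignored otherwise)
};

struct SinkDecision {
  bool SinkGEP;    // sink the GEP next to the gather/scatter
  bool SinkExtend; // additionally sink the extend feeding the GEP
};

SinkDecision decideSink(ToyType Base, const ToyOffsets &Offsets) {
  // Only scalar_base + vector_offsets is of interest.
  if (Base.IsVector || !Offsets.Ty.IsVector)
    return {false, false};

  // Sink the extend only when doing so exposes offsets of 32 bits or fewer.
  bool SinkExtend = Offsets.IsExtend && Offsets.Ty.ScalarBits > 32 &&
                    Offsets.ExtendSrcTy.ScalarBits <= 32;
  return {true, SinkExtend};
}
```

In this model a scalar base with sign- or zero-extended i32 vector offsets yields both flags set, a scalar base with plain i64 vector offsets sinks only the GEP, and a vector base sinks nothing.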

diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index ad01a206c93fb39..f80ce9239458730 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -14380,6 +14380,31 @@ static bool areOperandsOfVmullHighP64(Value *Op1, Value *Op2) {
   return isOperandOfVmullHighP64(Op1) && isOperandOfVmullHighP64(Op2);
 }
 
+static bool shouldSinkVectorOfPtrs(Value *Ptrs, SmallVectorImpl<Use *> &Ops) {
+  // Restrict ourselves to the form CodeGenPrepare typically constructs.
+  auto *GEP = dyn_cast<GetElementPtrInst>(Ptrs);
+  if (!GEP || GEP->getNumOperands() != 2)
+    return false;
+
+  Value *Base = GEP->getOperand(0);
+  Value *Offsets = GEP->getOperand(1);
+
+  // We only care about scalar_base+vector_offsets.
+  if (Base->getType()->isVectorTy() || !Offsets->getType()->isVectorTy())
+    return false;
+
+  // Sink extends that would allow us to use 32-bit offset vectors.
+  if (isa<SExtInst>(Offsets) || isa<ZExtInst>(Offsets)) {
+    auto *OffsetsInst = cast<Instruction>(Offsets);
+    if (OffsetsInst->getType()->getScalarSizeInBits() > 32 &&
+        OffsetsInst->getOperand(0)->getType()->getScalarSizeInBits() <= 32)
+      Ops.push_back(&GEP->getOperandUse(1));
+  }
+
+  // Sink the GEP.
+  return true;
+}
+
 /// Check if sinking \p I's operands to I's basic block is profitable, because
 /// the operands can be folded into a target instruction, e.g.
 /// shufflevectors extracts and/or sext/zext can be folded into (u,s)subl(2).
@@ -14481,6 +14506,16 @@ bool AArch64TargetLowering::shouldSinkOperands(
       Ops.push_back(&II->getArgOperandUse(0));
       Ops.push_back(&II->getArgOperandUse(1));
       return true;
+    case Intrinsic::masked_gather:
+      if (!shouldSinkVectorOfPtrs(II->getArgOperand(0), Ops))
+        return false;
+      Ops.push_back(&II->getArgOperandUse(0));
+      return true;
+    case Intrinsic::masked_scatter:
+      if (!shouldSinkVectorOfPtrs(II->getArgOperand(1), Ops))
+        return false;
+      Ops.push_back(&II->getArgOperandUse(1));
+      return true;
     default:
       return false;
     }
diff --git a/llvm/test/Transforms/CodeGenPrepare/AArch64/sink-gather-scatter-addressing.ll b/llvm/test/Transforms/CodeGenPrepare/AArch64/sink-gather-scatter-addressing.ll
new file mode 100644
index 000000000000000..73322836d1b84a7
--- /dev/null
+++ b/llvm/test/Transforms/CodeGenPrepare/AArch64/sink-gather-scatter-addressing.ll
@@ -0,0 +1,231 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 3
+; RUN: opt -S --codegenprepare < %s | FileCheck %s
+
+target triple = "aarch64-unknown-linux-gnu"
+
+; Sink the GEP to make use of scalar+vector addressing modes.
+define <vscale x 4 x float> @gather_offsets_sink_gep(ptr %base, <vscale x 4 x i32> %indices, <vscale x 4 x i1> %mask, i1 %cond) {
+; CHECK-LABEL: define <vscale x 4 x float> @gather_offsets_sink_gep(
+; CHECK-SAME: ptr [[BASE:%.*]], <vscale x 4 x i32> [[INDICES:%.*]], <vscale x 4 x i1> [[MASK:%.*]], i1 [[COND:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    br i1 [[COND]], label [[COND_BLOCK:%.*]], label [[EXIT:%.*]]
+; CHECK:       cond.block:
+; CHECK-NEXT:    [[TMP0:%.*]] = getelementptr float, ptr [[BASE]], <vscale x 4 x i32> [[INDICES]]
+; CHECK-NEXT:    [[LOAD:%.*]] = tail call <vscale x 4 x float> @llvm.masked.gather.nxv4f32.nxv4p0(<vscale x 4 x ptr> [[TMP0]], i32 4, <vscale x 4 x i1> [[MASK]], <vscale x 4 x float> poison)
+; CHECK-NEXT:    ret <vscale x 4 x float> [[LOAD]]
+; CHECK:       exit:
+; CHECK-NEXT:    ret <vscale x 4 x float> zeroinitializer
+;
+entry:
+  %ptrs = getelementptr float, ptr %base, <vscale x 4 x i32> %indices
+  br i1 %cond, label %cond.block, label %exit
+
+cond.block:
+  %load = tail call <vscale x 4 x float> @llvm.masked.gather.nxv4f32(<vscale x 4 x ptr> %ptrs, i32 4, <vscale x 4 x i1> %mask, <vscale x 4 x float> poison)
+  br label %exit
+
+exit:
+  %ret = phi <vscale x 4 x float> [ zeroinitializer, %entry ], [ %load, %cond.block ]
+  ret <vscale x 4 x float> %ret
+}
+
+; Sink sext to make use of scalar+sxtw(vector) addressing modes.
+define <vscale x 4 x float> @gather_offsets_sink_sext(ptr %base, <vscale x 4 x i32> %indices, <vscale x 4 x i1> %mask, i1 %cond) {
+; CHECK-LABEL: define <vscale x 4 x float> @gather_offsets_sink_sext(
+; CHECK-SAME: ptr [[BASE:%.*]], <vscale x 4 x i32> [[INDICES:%.*]], <vscale x 4 x i1> [[MASK:%.*]], i1 [[COND:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    br i1 [[COND]], label [[COND_BLOCK:%.*]], label [[EXIT:%.*]]
+; CHECK:       cond.block:
+; CHECK-NEXT:    [[TMP0:%.*]] = sext <vscale x 4 x i32> [[INDICES]] to <vscale x 4 x i64>
+; CHECK-NEXT:    [[PTRS:%.*]] = getelementptr float, ptr [[BASE]], <vscale x 4 x i64> [[TMP0]]
+; CHECK-NEXT:    [[LOAD:%.*]] = tail call <vscale x 4 x float> @llvm.masked.gather.nxv4f32.nxv4p0(<vscale x 4 x ptr> [[PTRS]], i32 4, <vscale x 4 x i1> [[MASK]], <vscale x 4 x float> poison)
+; CHECK-NEXT:    ret <vscale x 4 x float> [[LOAD]]
+; CHECK:       exit:
+; CHECK-NEXT:    ret <vscale x 4 x float> zeroinitializer
+;
+entry:
+  %indices.sext = sext <vscale x 4 x i32> %indices to <vscale x 4 x i64>
+  br i1 %cond, label %cond.block, label %exit
+
+cond.block:
+  %ptrs = getelementptr float, ptr %base, <vscale x 4 x i64> %indices.sext
+  %load = tail call <vscale x 4 x float> @llvm.masked.gather.nxv4f32(<vscale x 4 x ptr> %ptrs, i32 4, <vscale x 4 x i1> %mask, <vscale x 4 x float> poison)
+  br label %exit
+
+exit:
+  %ret = phi <vscale x 4 x float> [ zeroinitializer, %entry ], [ %load, %cond.block ]
+  ret <vscale x 4 x float> %ret
+}
+
+; As above but ensure both the GEP and sext are sunk.
+define <vscale x 4 x float> @gather_offsets_sink_sext_get(ptr %base, <vscale x 4 x i32> %indices, <vscale x 4 x i1> %mask, i1 %cond) {
+; CHECK-LABEL: define <vscale x 4 x float> @gather_offsets_sink_sext_get(
+; CHECK-SAME: ptr [[BASE:%.*]], <vscale x 4 x i32> [[INDICES:%.*]], <vscale x 4 x i1> [[MASK:%.*]], i1 [[COND:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    br i1 [[COND]], label [[COND_BLOCK:%.*]], label [[EXIT:%.*]]
+; CHECK:       cond.block:
+; CHECK-NEXT:    [[TMP0:%.*]] = sext <vscale x 4 x i32> [[INDICES]] to <vscale x 4 x i64>
+; CHECK-NEXT:    [[TMP1:%.*]] = getelementptr float, ptr [[BASE]], <vscale x 4 x i64> [[TMP0]]
+; CHECK-NEXT:    [[LOAD:%.*]] = tail call <vscale x 4 x float> @llvm.masked.gather.nxv4f32.nxv4p0(<vscale x 4 x ptr> [[TMP1]], i32 4, <vscale x 4 x i1> [[MASK]], <vscale x 4 x float> poison)
+; CHECK-NEXT:    ret <vscale x 4 x float> [[LOAD]]
+; CHECK:       exit:
+; CHECK-NEXT:    ret <vscale x 4 x float> zeroinitializer
+;
+entry:
+  %indices.sext = sext <vscale x 4 x i32> %indices to <vscale x 4 x i64>
+  %ptrs = getelementptr float, ptr %base, <vscale x 4 x i64> %indices.sext
+  br i1 %cond, label %cond.block, label %exit
+
+cond.block:
+  %load = tail call <vscale x 4 x float> @llvm.masked.gather.nxv4f32(<vscale x 4 x ptr> %ptrs, i32 4, <vscale x 4 x i1> %mask, <vscale x 4 x float> poison)
+  br label %exit
+
+exit:
+  %ret = phi <vscale x 4 x float> [ zeroinitializer, %entry ], [ %load, %cond.block ]
+  ret <vscale x 4 x float> %ret
+}
+
+; Don't sink GEPs that cannot benefit from SVE's scalar+vector addressing modes.
+define <vscale x 4 x float> @gather_no_scalar_base(<vscale x 4 x ptr> %bases, <vscale x 4 x i32> %indices, <vscale x 4 x i1> %mask, i1 %cond) {
+; CHECK-LABEL: define <vscale x 4 x float> @gather_no_scalar_base(
+; CHECK-SAME: <vscale x 4 x ptr> [[BASES:%.*]], <vscale x 4 x i32> [[INDICES:%.*]], <vscale x 4 x i1> [[MASK:%.*]], i1 [[COND:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[PTRS:%.*]] = getelementptr float, <vscale x 4 x ptr> [[BASES]], <vscale x 4 x i32> [[INDICES]]
+; CHECK-NEXT:    br i1 [[COND]], label [[COND_BLOCK:%.*]], label [[EXIT:%.*]]
+; CHECK:       cond.block:
+; CHECK-NEXT:    [[LOAD:%.*]] = tail call <vscale x 4 x float> @llvm.masked.gather.nxv4f32.nxv4p0(<vscale x 4 x ptr> [[PTRS]], i32 4, <vscale x 4 x i1> [[MASK]], <vscale x 4 x float> poison)
+; CHECK-NEXT:    ret <vscale x 4 x float> [[LOAD]]
+; CHECK:       exit:
+; CHECK-NEXT:    ret <vscale x 4 x float> zeroinitializer
+;
+entry:
+  %ptrs = getelementptr float, <vscale x 4 x ptr> %bases, <vscale x 4 x i32> %indices
+  br i1 %cond, label %cond.block, label %exit
+
+cond.block:
+  %load = tail call <vscale x 4 x float> @llvm.masked.gather.nxv4f32(<vscale x 4 x ptr> %ptrs, i32 4, <vscale x 4 x i1> %mask, <vscale x 4 x float> poison)
+  br label %exit
+
+exit:
+  %ret = phi <vscale x 4 x float> [ zeroinitializer, %entry ], [ %load, %cond.block ]
+  ret <vscale x 4 x float> %ret
+}
+
+; Don't sink extends whose result type is already favourable for SVE's sxtw/uxtw addressing modes.
+; NOTE: We still want to sink the GEP.
+define <vscale x 4 x float> @gather_offset_type_too_small(ptr %base, <vscale x 4 x i8> %indices, <vscale x 4 x i1> %mask, i1 %cond) {
+; CHECK-LABEL: define <vscale x 4 x float> @gather_offset_type_too_small(
+; CHECK-SAME: ptr [[BASE:%.*]], <vscale x 4 x i8> [[INDICES:%.*]], <vscale x 4 x i1> [[MASK:%.*]], i1 [[COND:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[INDICES_SEXT:%.*]] = sext <vscale x 4 x i8> [[INDICES]] to <vscale x 4 x i32>
+; CHECK-NEXT:    br i1 [[COND]], label [[COND_BLOCK:%.*]], label [[EXIT:%.*]]
+; CHECK:       cond.block:
+; CHECK-NEXT:    [[TMP0:%.*]] = getelementptr float, ptr [[BASE]], <vscale x 4 x i32> [[INDICES_SEXT]]
+; CHECK-NEXT:    [[LOAD:%.*]] = tail call <vscale x 4 x float> @llvm.masked.gather.nxv4f32.nxv4p0(<vscale x 4 x ptr> [[TMP0]], i32 4, <vscale x 4 x i1> [[MASK]], <vscale x 4 x float> poison)
+; CHECK-NEXT:    ret <vscale x 4 x float> [[LOAD]]
+; CHECK:       exit:
+; CHECK-NEXT:    ret <vscale x 4 x float> zeroinitializer
+;
+entry:
+  %indices.sext = sext <vscale x 4 x i8> %indices to <vscale x 4 x i32>
+  %ptrs = getelementptr float, ptr %base, <vscale x 4 x i32> %indices.sext
+  br i1 %cond, label %cond.block, label %exit
+
+cond.block:
+  %load = tail call <vscale x 4 x float> @llvm.masked.gather.nxv4f32(<vscale x 4 x ptr> %ptrs, i32 4, <vscale x 4 x i1> %mask, <vscale x 4 x float> poison)
+  br label %exit
+
+exit:
+  %ret = phi <vscale x 4 x float> [ zeroinitializer, %entry ], [ %load, %cond.block ]
+  ret <vscale x 4 x float> %ret
+}
+
+; Don't sink extends that cannot benefit from SVE's sxtw/uxtw addressing modes.
+; NOTE: We still want to sink the GEP.
+define <vscale x 4 x float> @gather_offset_type_too_big(ptr %base, <vscale x 4 x i48> %indices, <vscale x 4 x i1> %mask, i1 %cond) {
+; CHECK-LABEL: define <vscale x 4 x float> @gather_offset_type_too_big(
+; CHECK-SAME: ptr [[BASE:%.*]], <vscale x 4 x i48> [[INDICES:%.*]], <vscale x 4 x i1> [[MASK:%.*]], i1 [[COND:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[INDICES_SEXT:%.*]] = sext <vscale x 4 x i48> [[INDICES]] to <vscale x 4 x i64>
+; CHECK-NEXT:    br i1 [[COND]], label [[COND_BLOCK:%.*]], label [[EXIT:%.*]]
+; CHECK:       cond.block:
+; CHECK-NEXT:    [[TMP0:%.*]] = getelementptr float, ptr [[BASE]], <vscale x 4 x i64> [[INDICES_SEXT]]
+; CHECK-NEXT:    [[LOAD:%.*]] = tail call <vscale x 4 x float> @llvm.masked.gather.nxv4f32.nxv4p0(<vscale x 4 x ptr> [[TMP0]], i32 4, <vscale x 4 x i1> [[MASK]], <vscale x 4 x float> poison)
+; CHECK-NEXT:    ret <vscale x 4 x float> [[LOAD]]
+; CHECK:       exit:
+; CHECK-NEXT:    ret <vscale x 4 x float> zeroinitializer
+;
+entry:
+  %indices.sext = sext <vscale x 4 x i48> %indices to <vscale x 4 x i64>
+  %ptrs = getelementptr float, ptr %base, <vscale x 4 x i64> %indices.sext
+  br i1 %cond, label %cond.block, label %exit
+
+cond.block:
+  %load = tail call <vscale x 4 x float> @llvm.masked.gather.nxv4f32(<vscale x 4 x ptr> %ptrs, i32 4, <vscale x 4 x i1> %mask, <vscale x 4 x float> poison)
+  br label %exit
+
+exit:
+  %ret = phi <vscale x 4 x float> [ zeroinitializer, %entry ], [ %load, %cond.block ]
+  ret <vscale x 4 x float> %ret
+}
+
+; Sink zext to make use of scalar+uxtw(vector) addressing modes.
+; TODO: There's an argument here to split the extend into i8->i32 and i32->i64,
+; which would be especially useful if the i8s are the result of a load because
+; it would maintain the use of sign-extending loads.
+define <vscale x 4 x float> @gather_offset_sink_zext(ptr %base, <vscale x 4 x i8> %indices, <vscale x 4 x i1> %mask, i1 %cond) {
+; CHECK-LABEL: define <vscale x 4 x float> @gather_offset_sink_zext(
+; CHECK-SAME: ptr [[BASE:%.*]], <vscale x 4 x i8> [[INDICES:%.*]], <vscale x 4 x i1> [[MASK:%.*]], i1 [[COND:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    br i1 [[COND]], label [[COND_BLOCK:%.*]], label [[EXIT:%.*]]
+; CHECK:       cond.block:
+; CHECK-NEXT:    [[TMP0:%.*]] = zext <vscale x 4 x i8> [[INDICES]] to <vscale x 4 x i64>
+; CHECK-NEXT:    [[PTRS:%.*]] = getelementptr float, ptr [[BASE]], <vscale x 4 x i64> [[TMP0]]
+; CHECK-NEXT:    [[LOAD:%.*]] = tail call <vscale x 4 x float> @llvm.masked.gather.nxv4f32.nxv4p0(<vscale x 4 x ptr> [[PTRS]], i32 4, <vscale x 4 x i1> [[MASK]], <vscale x 4 x float> poison)
+; CHECK-NEXT:    ret <vscale x 4 x float> [[LOAD]]
+; CHECK:       exit:
+; CHECK-NEXT:    ret <vscale x 4 x float> zeroinitializer
+;
+entry:
+  %indices.zext = zext <vscale x 4 x i8> %indices to <vscale x 4 x i64>
+  br i1 %cond, label %cond.block, label %exit
+
+cond.block:
+  %ptrs = getelementptr float, ptr %base, <vscale x 4 x i64> %indices.zext
+  %load = tail call <vscale x 4 x float> @llvm.masked.gather.nxv4f32(<vscale x 4 x ptr> %ptrs, i32 4, <vscale x 4 x i1> %mask, <vscale x 4 x float> poison)
+  br label %exit
+
+exit:
+  %ret = phi <vscale x 4 x float> [ zeroinitializer, %entry ], [ %load, %cond.block ]
+  ret <vscale x 4 x float> %ret
+}
+
+; Ensure we support scatters as well as gathers.
+define void @scatter_offsets_sink_sext_get(<vscale x 4 x float> %data, ptr %base, <vscale x 4 x i32> %indices, <vscale x 4 x i1> %mask, i1 %cond) {
+; CHECK-LABEL: define void @scatter_offsets_sink_sext_get(
+; CHECK-SAME: <vscale x 4 x float> [[DATA:%.*]], ptr [[BASE:%.*]], <vscale x 4 x i32> [[INDICES:%.*]], <vscale x 4 x i1> [[MASK:%.*]], i1 [[COND:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    br i1 [[COND]], label [[COND_BLOCK:%.*]], label [[EXIT:%.*]]
+; CHECK:       cond.block:
+; CHECK-NEXT:    [[TMP0:%.*]] = sext <vscale x 4 x i32> [[INDICES]] to <vscale x 4 x i64>
+; CHECK-NEXT:    [[TMP1:%.*]] = getelementptr float, ptr [[BASE]], <vscale x 4 x i64> [[TMP0]]
+; CHECK-NEXT:    tail call void @llvm.masked.scatter.nxv4f32.nxv4p0(<vscale x 4 x float> [[DATA]], <vscale x 4 x ptr> [[TMP1]], i32 4, <vscale x 4 x i1> [[MASK]])
+; CHECK-NEXT:    ret void
+; CHECK:       exit:
+; CHECK-NEXT:    ret void
+;
+entry:
+  %indices.sext = sext <vscale x 4 x i32> %indices to <vscale x 4 x i64>
+  %ptrs = getelementptr float, ptr %base, <vscale x 4 x i64> %indices.sext
+  br i1 %cond, label %cond.block, label %exit
+
+cond.block:
+  tail call void @llvm.masked.scatter.nxv4f32(<vscale x 4 x float> %data, <vscale x 4 x ptr> %ptrs, i32 4, <vscale x 4 x i1> %mask)
+  br label %exit
+
+exit:
+  ret void
+}
+
+declare <vscale x 4 x float> @llvm.masked.gather.nxv4f32(<vscale x 4 x ptr>, i32, <vscale x 4 x i1>, <vscale x 4 x float>)
+declare void @llvm.masked.scatter.nxv4f32(<vscale x 4 x float>, <vscale x 4 x ptr>, i32, <vscale x 4 x i1>)
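
For readers skimming the diff, the decision made by shouldSinkVectorOfPtrs can be restated as a small standalone model. This is only a sketch of the logic as it appears in the patch, not LLVM code: GepShape, OffsetKind, SinkDecision, decideSink and checkAgainstTests are hypothetical names invented for illustration, and the widths are plain integers standing in for getScalarSizeInBits(). The cases in checkAgainstTests mirror the gather tests in sink-gather-scatter-addressing.ll.

```cpp
#include <cassert>

// Hypothetical stand-ins for the IR facts the patch inspects. None of these
// types exist in LLVM; they only model the decision logic.
enum class OffsetKind { Plain, SExt, ZExt };

struct GepShape {
  bool IsTwoOperandGep;  // GEP with exactly a base and one index operand
  bool BaseIsVector;     // base is a vector of pointers
  bool OffsetsAreVector; // index operand is a vector
  OffsetKind Kind;       // how the index operand is produced
  unsigned SrcBits;      // element width feeding the extend (unused for Plain)
  unsigned DstBits;      // element width of the index operand seen by the GEP
};

struct SinkDecision {
  bool SinkGep;    // sink the GEP next to the gather/scatter
  bool SinkExtend; // additionally sink the sext/zext feeding the GEP
};

SinkDecision decideSink(const GepShape &G) {
  // Only the simple base + single-index form is handled.
  if (!G.IsTwoOperandGep)
    return {false, false};
  // Only scalar_base + vector_offsets is of interest.
  if (G.BaseIsVector || !G.OffsetsAreVector)
    return {false, false};
  // An extend is sunk only when it widens from <= 32 bits to > 32 bits,
  // which is the shape that allows a 32-bit offset vector to be used.
  bool IsExtend = G.Kind != OffsetKind::Plain;
  bool SinkExtend = IsExtend && G.DstBits > 32 && G.SrcBits <= 32;
  return {true, SinkExtend};
}

// Mirrors the expectations of the gather tests in the new .ll file.
void checkAgainstTests() {
  SinkDecision D = decideSink({true, false, true, OffsetKind::Plain, 32, 32});
  assert(D.SinkGep && !D.SinkExtend); // gather_offsets_sink_gep
  D = decideSink({true, false, true, OffsetKind::SExt, 32, 64});
  assert(D.SinkGep && D.SinkExtend); // gather_offsets_sink_sext
  D = decideSink({true, true, true, OffsetKind::Plain, 32, 32});
  assert(!D.SinkGep && !D.SinkExtend); // gather_no_scalar_base
  D = decideSink({true, false, true, OffsetKind::SExt, 8, 32});
  assert(D.SinkGep && !D.SinkExtend); // gather_offset_type_too_small
  D = decideSink({true, false, true, OffsetKind::SExt, 48, 64});
  assert(D.SinkGep && !D.SinkExtend); // gather_offset_type_too_big
  D = decideSink({true, false, true, OffsetKind::ZExt, 8, 64});
  assert(D.SinkGep && D.SinkExtend); // gather_offset_sink_zext
}
```

In the patch itself the GEP is always the operand pushed by the caller (argument 0 for masked_gather, argument 1 for masked_scatter), while the helper only pushes the extend. The model folds both into one result purely for readability.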

llvmbot added backend:AArch64 llvm:transforms labels Sep 20, 2023

paulwalker-arm closed this Sep 20, 2023

paulwalker-arm deleted the sve-gather-scatter-offset-sinking branch September 20, 2023 17:40
