Define `triton_intel_gpu.simd_reduce` and use in optimized transposed reduction #2890
Define an operation adhering to the following description:
The `triton_intel_gpu.simd_reduce` operation performs a SIMD reduction. Contrary to `tt.reduce`, when performing a warp reduction, the result is non-uniform.

The reduction axis must be such that only a warp reduction is performed, i.e., `sizePerThread[axis]`, `warpsPerCTA[axis]` and `CTAsPerCGA[axis]` must be 1; and `shape[axis]` and `threadsPerWarp[axis]` must be equal to the sub-group size.

The output type must be compatible with the performed reduction by reducing the size per thread.
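
The axis constraints above can be sketched as a small standalone check. This is only an illustration in Python, not the verifier of the actual operation: `BlockedLayout` and `is_valid_simd_reduce` are hypothetical names, and the layout is assumed to expose exactly the four per-dimension fields named in the description.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class BlockedLayout:
    """Hypothetical stand-in for the layout fields named in the description."""
    size_per_thread: Sequence[int]
    threads_per_warp: Sequence[int]
    warps_per_cta: Sequence[int]
    ctas_per_cga: Sequence[int]


def is_valid_simd_reduce(shape, layout, axis, sub_group_size):
    """Return True when reducing along `axis` is a pure warp reduction."""
    return (
        layout.size_per_thread[axis] == 1
        and layout.warps_per_cta[axis] == 1
        and layout.ctas_per_cga[axis] == 1
        and shape[axis] == sub_group_size
        and layout.threads_per_warp[axis] == sub_group_size
    )


# Assumed example: a 16x16 tensor and a sub-group size of 16, where each
# thread holds 16 elements along axis 0 and the 16 lanes span axis 1.
layout = BlockedLayout(
    size_per_thread=[16, 1],
    threads_per_warp=[1, 16],
    warps_per_cta=[1, 1],
    ctas_per_cga=[1, 1],
)
```

With this layout, reducing along axis 1 satisfies every condition, while axis 0 is rejected because `sizePerThread[0]` is 16 rather than 1.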
Example:
Note this is in essence an optimized `N*sub_group_size x sub_group_size -> N*sub_group_size` reduction that involves a transpose, and is a good candidate to generate better code in the optimized reduction pass without going through SLM.
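
To make the uniform versus non-uniform distinction concrete, here is a minimal Python model of one sub-group. It is a sketch under stated assumptions, not the lowering of the operation: the sub-group is modelled as a list of lanes, the reduction is a sum, and the rule deciding which lane keeps which result (lane `l` keeps rows `r` with `r % S == l`) is an assumption made for illustration only. The issue does not specify that mapping.

```python
def uniform_reduce(lanes):
    """tt.reduce-style warp reduction: every lane ends up with all row sums."""
    rows = len(lanes[0])
    sums = [sum(lane[r] for lane in lanes) for r in range(rows)]
    return [list(sums) for _ in lanes]


def simd_reduce(lanes):
    """Non-uniform variant: the row sums are spread across the lanes.

    Assumption (not from the issue): lane l keeps the sums of the rows r
    with r % S == l, where S is the sub-group size.
    """
    s = len(lanes)
    rows = len(lanes[0])
    assert rows % s == 0, "row count must be a multiple of the sub-group size"
    sums = [sum(lane[r] for lane in lanes) for r in range(rows)]
    return [[sums[r] for r in range(lane_id, rows, s)] for lane_id in range(s)]


S, N = 4, 2  # assumed sub-group size and number of S-row tiles
# lanes[l][r] holds element (row r, column l) of the (N*S) x S input.
lanes = [[r * S + l for r in range(N * S)] for l in range(S)]

uniform = uniform_reduce(lanes)    # each lane holds N*S values
distributed = simd_reduce(lanes)   # each lane holds only N values
```

In this model the uniform result costs every lane `N*S` values, while the non-uniform result leaves each lane with `N` values, which is the reduced size per thread that the output type has to reflect.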