Define triton_intel_gpu.simd_reduce and use in optimized transposed reduction #2890

Open · victor-eds opened this issue Dec 2, 2024 · 3 comments · May be fixed by #2907
victor-eds commented Dec 2, 2024

Define an operation adhering to the following description:

The triton_intel_gpu.simd_reduce operation performs a SIMD reduction. Contrary to tt.reduce, when performing a warp reduction, the result is non-uniform.

The reduction axis must be such that only a warp reduction is performed, i.e., sizePerThread[axis], warpsPerCTA[axis] and CTAsPerCGA[axis] must be 1; and shape[axis] and threadsPerWarp[axis] must be equal to the sub-group size.

The output type must be compatible with the performed reduction: the reduction axis is dropped and the size per thread shrinks accordingly, as the result elements are distributed across the lanes of the sub-group.

Example:

#blocked = #ttg.blocked<{sizePerThread = [1, 16], threadsPerWarp = [16, 1], warpsPerCTA = [1, 1], order = [0, 1]}>
#blocked1 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [16], warpsPerCTA = [1], order = [0]}>
triton_intel_gpu.simd_reduce { add } %0 axis = 0 : tensor<16x16xf32, #blocked> -> tensor<16xf32, #blocked1>
// 3D reduction:
#blocked = #ttg.blocked<{sizePerThread = [1, 16, 1], threadsPerWarp = [16, 1, 1], warpsPerCTA = [1, 1, 2], order = [0, 1, 2]}>
#blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 1], warpsPerCTA = [1, 2], order = [0, 1]}>
triton_intel_gpu.simd_reduce { add } %0 axis = 0 : tensor<16x16x2xf32, #blocked> -> tensor<16x2xf32, #blocked1>
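
As a rough, self-contained sketch of the layout constraints above, a verifier-style check might look like the following C++; BlockedLayout, isValidSimdReduce, and the implicit ctasPerCGA = [1, 1] are hypothetical illustrations, not the Triton C++ API:

#include <cstdint>
#include <vector>

// Hypothetical stand-in for the blocked layout attribute; field names mirror
// the constraint list above, not the actual Triton C++ classes.
struct BlockedLayout {
  std::vector<int64_t> sizePerThread, threadsPerWarp, warpsPerCTA, ctasPerCGA;
};

// True iff reducing `shape` along `axis` is a pure warp reduction, as
// triton_intel_gpu.simd_reduce requires.
bool isValidSimdReduce(const std::vector<int64_t> &shape,
                       const BlockedLayout &layout, unsigned axis,
                       int64_t subGroupSize) {
  return layout.sizePerThread[axis] == 1 && layout.warpsPerCTA[axis] == 1 &&
         layout.ctasPerCGA[axis] == 1 && shape[axis] == subGroupSize &&
         layout.threadsPerWarp[axis] == subGroupSize;
}

int main() {
  // Layout and shape from the first (2D) example above, assuming the
  // defaulted CTAsPerCGA is [1, 1].
  BlockedLayout blocked{{1, 16}, {16, 1}, {1, 1}, {1, 1}};
  return isValidSimdReduce({16, 16}, blocked, /*axis=*/0,
                           /*subGroupSize=*/16) ? 0 : 1;
}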

Note this is, in essence, an optimized N*sub_group_size x sub_group_size -> N*sub_group_size reduction that involves a transpose, and is a good candidate for generating better code in the optimized reduction pass without going through SLM (shared local memory).
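
For illustration, here is a minimal host-side C++ emulation of that idea, assuming a sub-group size of 16 and add as the combiner: lane t starts with row t of the 16x16 tile, and after log2(16) butterfly exchange steps lane j holds the sum of column j, i.e. the non-uniform per-lane result. Plain array indexing stands in for the sub-group shuffles real hardware would use:

#include <array>
#include <cassert>
#include <cstdio>

constexpr int SG = 16; // assumed sub-group size

int main() {
  // tile[t][j]: element j held by lane t (row t of the 16x16 tile, matching
  // sizePerThread = [1, 16], threadsPerWarp = [16, 1] in the example above).
  std::array<std::array<float, SG>, SG> tile;
  for (int t = 0; t < SG; ++t)
    for (int j = 0; j < SG; ++j)
      tile[t][j] = static_cast<float>(t * SG + j);

  // Reference result: reduce along axis 0, ref[j] = sum over lanes t.
  std::array<float, SG> ref{};
  for (int j = 0; j < SG; ++j)
    for (int t = 0; t < SG; ++t)
      ref[j] += tile[t][j];

  // Butterfly: at each stage, every lane accumulates into the slots whose
  // index agrees with its lane id on the current bit the matching slots of
  // lane ^ mask. Reading another lane's row emulates a sub-group shuffle.
  auto vals = tile;
  for (int mask = 1; mask < SG; mask <<= 1) {
    auto next = vals;
    for (int t = 0; t < SG; ++t)
      for (int j = 0; j < SG; ++j)
        if ((j & mask) == (t & mask))
          next[t][j] = vals[t][j] + vals[t ^ mask][j];
    vals = next;
  }

  // After log2(SG) stages, lane j's surviving slot is slot j: the column sum,
  // i.e. a non-uniform per-lane result.
  for (int j = 0; j < SG; ++j)
    assert(vals[j][j] == ref[j]);
  std::puts("fused transpose-reduce matches the plain reduction");
}

Each stage halves the number of live slots per lane, so the whole transposed reduction takes log2(sub_group_size) shuffle-and-add steps and never touches SLM.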

victor-eds (Author) commented:

Working on this. Top priority. Will push something today or tomorrow to get performance numbers.

@vlad-penkin added this to the 4.0 [Performance] Core milestone Dec 2, 2024
@vlad-penkin added the performance and enhancement labels Dec 2, 2024
victor-eds (Author) commented:

Blocked by a backend bug.

victor-eds (Author) commented:

Still blocked by the backend bug.
