Define triton_intel_gpu.simd_reduce and use in optimized transposed reduction #2890

Open · victor-eds opened this issue Dec 2, 2024 · 3 comments · May be fixed by #2907
victor-eds commented Dec 2, 2024

Define an operation adhering to the following description:

The triton_intel_gpu.simd_reduce operation performs a SIMD reduction. Contrary to tt.reduce, when performing a warp reduction, the result is non-uniform.

The reduction axis must be such that only a warp reduction is performed, i.e., sizePerThread[axis], warpsPerCTA[axis] and CTAsPerCGA[axis] must be 1; and shape[axis] and threadsPerWarp[axis] must be equal to the sub-group size.

The output type must be compatible with the performed reduction: the reduction axis is dropped and the size per thread shrinks accordingly, as the result elements are distributed across the lanes of the sub-group.

Example:

#blocked = #ttg.blocked<{sizePerThread = [1, 16], threadsPerWarp = [16, 1], warpsPerCTA = [1, 1], order = [0, 1]}>
#blocked1 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [16], warpsPerCTA = [1], order = [0]}>
triton_intel_gpu.simd_reduce { add } %0 axis = 0 : tensor<16x16xf32, #blocked> -> tensor<16xf32, #blocked1>
// 3D reduction:
#blocked = #ttg.blocked<{sizePerThread = [1, 16, 1], threadsPerWarp = [16, 1, 1], warpsPerCTA = [1, 1, 2], order = [0, 1, 2]}>
#blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 1], warpsPerCTA = [1, 2], order = [0, 1]}>
triton_intel_gpu.simd_reduce { add } %0 axis = 0 : tensor<16x16x2xf32, #blocked> -> tensor<16x2xf32, #blocked1>
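
As a rough, self-contained sketch of the layout constraints above, a verifier-style check might look like the following C++; BlockedLayout, isValidSimdReduce, and the implicit ctasPerCGA = [1, 1] are hypothetical illustrations, not the Triton C++ API:

#include <cstdint>
#include <vector>

// Hypothetical stand-in for the blocked layout attribute; field names mirror
// the constraint list above, not the actual Triton C++ classes.
struct BlockedLayout {
  std::vector<int64_t> sizePerThread, threadsPerWarp, warpsPerCTA, ctasPerCGA;
};

// True iff reducing `shape` along `axis` is a pure warp reduction, as
// triton_intel_gpu.simd_reduce requires.
bool isValidSimdReduce(const std::vector<int64_t> &shape,
                       const BlockedLayout &layout, unsigned axis,
                       int64_t subGroupSize) {
  return layout.sizePerThread[axis] == 1 && layout.warpsPerCTA[axis] == 1 &&
         layout.ctasPerCGA[axis] == 1 && shape[axis] == subGroupSize &&
         layout.threadsPerWarp[axis] == subGroupSize;
}

int main() {
  // Layout and shape from the first (2D) example above, assuming the
  // defaulted CTAsPerCGA is [1, 1].
  BlockedLayout blocked{{1, 16}, {16, 1}, {1, 1}, {1, 1}};
  return isValidSimdReduce({16, 16}, blocked, /*axis=*/0,
                           /*subGroupSize=*/16) ? 0 : 1;
}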

Note this is, in essence, an optimized N*sub_group_size x sub_group_size -> N*sub_group_size reduction that involves a transpose, and is a good candidate for generating better code in the optimized reduction pass without going through SLM (shared local memory).
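
For illustration, here is a minimal host-side C++ emulation of that idea, assuming a sub-group size of 16 and add as the combiner: lane t starts with row t of the 16x16 tile, and after log2(16) butterfly exchange steps lane j holds the sum of column j, i.e. the non-uniform per-lane result. Plain array indexing stands in for the sub-group shuffles real hardware would use:

#include <array>
#include <cassert>
#include <cstdio>

constexpr int SG = 16; // assumed sub-group size

int main() {
  // tile[t][j]: element j held by lane t (row t of the 16x16 tile, matching
  // sizePerThread = [1, 16], threadsPerWarp = [16, 1] in the example above).
  std::array<std::array<float, SG>, SG> tile;
  for (int t = 0; t < SG; ++t)
    for (int j = 0; j < SG; ++j)
      tile[t][j] = static_cast<float>(t * SG + j);

  // Reference result: reduce along axis 0, ref[j] = sum over lanes t.
  std::array<float, SG> ref{};
  for (int j = 0; j < SG; ++j)
    for (int t = 0; t < SG; ++t)
      ref[j] += tile[t][j];

  // Butterfly: at each stage, every lane accumulates into the slots whose
  // index agrees with its lane id on the current bit the matching slots of
  // lane ^ mask. Reading another lane's row emulates a sub-group shuffle.
  auto vals = tile;
  for (int mask = 1; mask < SG; mask <<= 1) {
    auto next = vals;
    for (int t = 0; t < SG; ++t)
      for (int j = 0; j < SG; ++j)
        if ((j & mask) == (t & mask))
          next[t][j] = vals[t][j] + vals[t ^ mask][j];
    vals = next;
  }

  // After log2(SG) stages, lane j's surviving slot is slot j: the column sum,
  // i.e. a non-uniform per-lane result.
  for (int j = 0; j < SG; ++j)
    assert(vals[j][j] == ref[j]);
  std::puts("fused transpose-reduce matches the plain reduction");
}

Each stage halves the number of live slots per lane, so the whole transposed reduction takes log2(sub_group_size) shuffle-and-add steps and never touches SLM.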

victor-eds (Author) commented:

Working on this. Top priority. Will push something today or tomorrow to get performance numbers.

@vlad-penkin added this to the 4.0 [Performance] Core milestone Dec 2, 2024
@vlad-penkin added the performance and enhancement labels Dec 2, 2024
victor-eds (Author) commented:

Blocked by a backend bug.

victor-eds (Author) commented:

Still blocked by the backend bug.
