Allow sub-group transpose and shuffles with more than one contiguous row per thread #2749

Closed
victor-eds opened this issue Nov 19, 2024 · 0 comments

Currently, the sub-group transpose and shuffle layout conversions assume that each work-item owns a single contiguous row of the matrix at a time, i.e., the layout looks something like:

t0 t0 t0 ...
t1 t1 t1 ...
...
t0 t0 t0 ...
t1 t1 t1 ...

However, allowing more than one element per work-item in the Y dimension enables further optimizations, so we want to support layouts like:

t0 t0 t0 ...
t0 t0 t0 ...
t1 t1 t1 ...
t1 t1 t1 ...
...

This will allow more advanced approaches in -tritonintelgpu-optimize-reduction-locality.
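
To make the two layouts above concrete, here is a minimal Python sketch of the row-to-work-item mapping they describe. It is only an illustration, not the conversion code in the backend: the names `row_owner` and `render`, the `rows_per_thread` parameter, and the assumption that ownership wraps around after the last work-item in the sub-group are all hypothetical.

```python
def row_owner(row: int, sub_group_size: int, rows_per_thread: int = 1) -> int:
    """Return the index of the work-item assumed to own `row`.

    Hypothetical helper: groups of `rows_per_thread` contiguous rows are
    handed out to work-items in order, wrapping around after the last
    work-item in the sub-group.
    """
    return (row // rows_per_thread) % sub_group_size


def render(n_rows: int, sub_group_size: int, rows_per_thread: int = 1,
           n_cols: int = 3) -> str:
    """Draw the layout in the same `tN tN tN` style used in this issue."""
    lines = []
    for row in range(n_rows):
        owner = row_owner(row, sub_group_size, rows_per_thread)
        lines.append(" ".join(f"t{owner}" for _ in range(n_cols)))
    return "\n".join(lines)


if __name__ == "__main__":
    # Layout assumed today: one contiguous row per work-item.
    print(render(n_rows=4, sub_group_size=2, rows_per_thread=1))
    print()
    # Requested layout: two contiguous rows per work-item.
    print(render(n_rows=4, sub_group_size=2, rows_per_thread=2))
```

With `rows_per_thread=1` the sketch reproduces the first diagram; with `rows_per_thread=2` it reproduces the second.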

@victor-eds victor-eds self-assigned this Nov 19, 2024
@victor-eds victor-eds changed the title Allow sub-group transpose and shuffles with more than one contiguos row per thread Allow sub-group transpose and shuffles with more than one contiguous row per thread Nov 19, 2024
@vlad-penkin vlad-penkin added the enhancement New feature or request label Nov 19, 2024
etiotto pushed a commit that referenced this issue Nov 26, 2024
Add support for layout conversion shuffles in which rows managed by a
single thread are contiguous in the output matrix.

Step 2/2 to
#2749

---------

Signed-off-by: victor-eds <victor.perez@codeplay.com>