Update shared_32x16_to_ldmatrix_32x16_layout to be injective
Previous version mapped the 512 input indices in a `(32,16)` array to
only 128 output indices.  This wasn't caught before, because the
bijectivity assertion was only triggered for TE schedules.
Lunderberg committed Sep 13, 2022
1 parent d532610 commit efb25ac
1 changed file with 1 addition and 1 deletion: python/tvm/tir/tensor_intrin/cuda.py
@@ -36,7 +36,7 @@ def shared_16x32_to_ldmatrix_32x16_layout(i, j):


 def shared_32x16_to_ldmatrix_32x16_layout(i, j):
-    thread_id = (i % 4) + 4 * (j % 8)
+    thread_id = (i % 16) // 4 + 4 * (j % 8)
     return thread_id, 8 * (j // 8) + (i // 16) * 4 + i % 4

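For reference, a minimal sketch (not part of the commit) that reproduces the counts from the commit message: the old index map sends the 512 (i, j) pairs of a (32, 16) array to only 128 distinct (thread_id, local_id) outputs, while the updated map reaches all 512 and is therefore bijective. The names old_layout and new_layout are illustrative stand-ins for the two versions of shared_32x16_to_ldmatrix_32x16_layout shown in the diff above.

# Illustrative check, not part of the commit: compare the old and new
# index maps from the diff above over the full (32, 16) input domain.

def old_layout(i, j):
    # Pre-fix version: thread_id never uses (i % 16) // 4, so that
    # information is dropped and distinct inputs collide.
    thread_id = (i % 4) + 4 * (j % 8)
    return thread_id, 8 * (j // 8) + (i // 16) * 4 + i % 4

def new_layout(i, j):
    # Post-fix version from the commit.
    thread_id = (i % 16) // 4 + 4 * (j % 8)
    return thread_id, 8 * (j // 8) + (i // 16) * 4 + i % 4

inputs = [(i, j) for i in range(32) for j in range(16)]
print(len({old_layout(i, j) for i, j in inputs}))  # 128: four inputs share each output, not injective
print(len({new_layout(i, j) for i, j in inputs}))  # 512: one output per input, bijective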
