Update shared_32x16_to_ldmatrix_32x16_layout to be injective
Previous version mapped the 512 input indices in a `(32,16)` array to
only 128 output indices.  This wasn't caught before, because the
bijectivity assertion was only triggered for TE schedules.
Lunderberg committed Sep 13, 2022
1 parent d532610 commit efb25ac
1 changed file with 1 addition and 1 deletion: python/tvm/tir/tensor_intrin/cuda.py
@@ -36,7 +36,7 @@ def shared_16x32_to_ldmatrix_32x16_layout(i, j):


 def shared_32x16_to_ldmatrix_32x16_layout(i, j):
-    thread_id = (i % 4) + 4 * (j % 8)
+    thread_id = (i % 16) // 4 + 4 * (j % 8)
     return thread_id, 8 * (j // 8) + (i // 16) * 4 + i % 4

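For reference, a minimal sketch (not part of the commit) that reproduces the counts from the commit message: the old index map sends the 512 (i, j) pairs of a (32, 16) array to only 128 distinct (thread_id, local_id) outputs, while the updated map reaches all 512 and is therefore bijective. The names old_layout and new_layout are illustrative stand-ins for the two versions of shared_32x16_to_ldmatrix_32x16_layout shown in the diff above.

# Illustrative check, not part of the commit: compare the old and new
# index maps from the diff above over the full (32, 16) input domain.

def old_layout(i, j):
    # Pre-fix version: thread_id never uses (i % 16) // 4, so that
    # information is dropped and distinct inputs collide.
    thread_id = (i % 4) + 4 * (j % 8)
    return thread_id, 8 * (j // 8) + (i // 16) * 4 + i % 4

def new_layout(i, j):
    # Post-fix version from the commit.
    thread_id = (i % 16) // 4 + 4 * (j % 8)
    return thread_id, 8 * (j // 8) + (i // 16) * 4 + i % 4

inputs = [(i, j) for i in range(32) for j in range(16)]
print(len({old_layout(i, j) for i, j in inputs}))  # 128: four inputs share each output, not injective
print(len({new_layout(i, j) for i, j in inputs}))  # 512: one output per input, bijective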
