
[BACKEND] Support Hopper MMA to MMA convert_layout ops #4492

Merged
merged 15 commits into main from keren/mma-to-mma on Aug 12, 2024

Conversation

Jokeren (Contributor) commented Aug 8, 2024

This PR enables MMA to MMA layout conversion on the Hopper architecture.
We also replace the previous isMmaToMmaShortcut check with cvtReordersRegisters in several places.
Note that MMA to MMA conversion via shared memory still goes through the legacy ConvertLayoutOpConversion function; we will deprecate it in a follow-up PR.

@Jokeren Jokeren linked an issue Aug 9, 2024 that may be closed by this pull request
@Jokeren Jokeren marked this pull request as ready for review August 9, 2024 16:31
@Jokeren Jokeren requested a review from ptillet as a code owner August 9, 2024 16:31
@Jokeren Jokeren changed the title from "[BACKEND][DRAFT] Hopper mma to mma" to "[BACKEND] Hopper mma to mma" on Aug 9, 2024
jlebar (Collaborator) commented Aug 9, 2024

[BACKEND] Hopper mma to mma

Could we make the PR description have a verb in it? Like, "Support Hopper MMA to MMA convert_layout ops", something like that?

@@ -289,12 +290,37 @@ struct ConvertLayoutOpUsingLinearLayoutsConversion
// stronger than this, checking also that the choice of lane/warp/block does
// not affect the permutation of registers. If we allow different
// lane/warp/blocks to have different permutations, we can generalize this.
if (std::optional<LinearLayout> c = conversion.divideRight(

// There are two three possible cases
Collaborator

typo

// 2. The `src_layout` has fewer registers than the `dst_layout`.
// 3. The `src_layout` has more registers than the `dst_layout`.
// In the second case, we may generate a conversion that is not surjective
// because not all lanes are covered. Instead, we could use the inverse of
Collaborator

Are you saying that we may *have generated* a conversion, that is, are you saying that the function "conversion" is non-surjective?

// 2. The `src_layout` has fewer registers than the `dst_layout`.
// 3. The `src_layout` has more registers than the `dst_layout`.
// In the second case, we may generate a conversion that is not surjective
// because not all lanes are covered. Instead, we could use the inverse of
Collaborator

Instead, we can use

// 2. The `src_layout` has fewer registers than the `dst_layout`.
// 3. The `src_layout` has more registers than the `dst_layout`.
// In the second case, we may generate a conversion that is not surjective
// because not all lanes are covered. Instead, we could use the inverse of
Collaborator

not all lanes are covered

I think you mean "not all destination registers" are covered, you don't mean to talk about lanes?

// because not all lanes are covered. Instead, we could use the inverse of
// the conversion, mapping from `dst_layout` to `src_layout`, which is
// surjective. This inverse layout indicates that multiple destination
// registers may come from the same source register.
Collaborator

If I understand this comment correctly, I think it can be rewritten more concisely.

If src_layout has fewer registers than dst_layout, then conversion, which is src_layout . dst_layout^-1, will necessarily be non-surjective. But the whole point is to cover all of the destination registers, so in this case we use the inverse layout, namely dst_layout . src_layout^-1 instead.

Additionally:

  1. I think this comment should be moved inside the if statement?
  2. Instead of using conversion sometimes and inverseConversion other times, why don't we simply always use inverseConversion? (See the sketch below.)
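
A minimal sketch of the inverse-conversion idea above, assuming the LinearLayout API visible elsewhere in this PR (invertAndCompose, getInDimSize, apply) and assuming the kRegister/kLane/kWarp/kBlock StringAttrs and the unpacked input values (inVals) are in scope as in the surrounding diff; this is an illustration, not the code as merged:

  // conversion        maps a src (reg, lane, warp, block) to a dst location.
  // inverseConversion maps a dst (reg, lane, warp, block) to a src location.
  LinearLayout inverseConversion = dstLayout.invertAndCompose(srcLayout);
  // Enumerating dst registers guarantees every output register is written,
  // even when several of them read the same src register (duplication).
  SmallVector<Value> outVals;
  for (int32_t i = 0; i < inverseConversion.getInDimSize(kRegister); ++i) {
    // Within a thread, lane/warp/block are identity, so pass zeros and read
    // back only the register component of the result.
    int32_t srcIdx = inverseConversion
                         .apply({{kRegister, i}, {kLane, 0}, {kWarp, 0}, {kBlock, 0}})[0]
                         .second;
    outVals.push_back(inVals[srcIdx]);
  }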

// <inDimName, baseIdx, outValue>
std::vector<std::tuple<StringAttr, int, int>> sortedBases;
for (auto [inDimName, basis] : bases) {
for (size_t baseIdx = 0; baseIdx < basis.size(); baseIdx++) {
Collaborator

basisIdx, I think? There's no such thing as a base here.

@@ -673,12 +673,13 @@ Operation *cloneWithInferType(mlir::OpBuilder &rewriter, Operation *op,
return newOp;
}

// Check if the convert will be a no-op in codegen.
// Check if the convert will be performed without shared memory.
static bool isFreeConvert(Operation *op) {
Collaborator

Maybe change the function name too?

Collaborator

Also, are you sure that "convert is performed without shared memory" is what you want to check? Maybe you actually want to check "convert is performed by reordering registers, i.e. it's a nop"? For example, a convert that does a warp shuffle maybe should not be "free"? I don't know how it's used though.

Contributor Author

It's unknown whether a warp shuffle would be considered "free" or merely inexpensive, but we agree that it's better to consider only register->register conversions for now.
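
For reference, a minimal sketch of a register-reordering check along the lines agreed on here, assuming ConvertLayoutOp's usual accessors; an illustration, not necessarily the code as merged:

  // "Free" = the conversion only permutes registers within each thread;
  // warp shuffles and shared-memory round trips are deliberately excluded.
  static bool isFreeConvert(Operation *op) {
    auto convertOp = dyn_cast<triton::gpu::ConvertLayoutOp>(op);
    if (!convertOp)
      return false;
    return cvtReordersRegisters(
        cast<RankedTensorType>(convertOp.getSrc().getType()),
        cast<RankedTensorType>(convertOp.getType()));
  }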

@@ -1936,7 +1936,8 @@ tt.func public @yield_outside_loop2(%arg0: i32, %arg1: i32) -> (i32, i32) {

// -----

// Check that we handle corner cases when hoisting convert on top of extf. For complex slices we may hoist convert on top of extf while the source of extf has multiple uses in the slice.
// Check that we handle corner cases when hoisting convert on top of extf because convert ops on a smaller type which is faster.
Collaborator

Typo in comment

LinearLayout({{S("register"), {{4}}},
{S("lane"), {{0}, {0}, {1}, {2}, {0}}},
LinearLayout({{S("register"), {{0}}},
{S("lane"), {{0}, {0}, {1}, {2}, {4}}},
Collaborator

Was this a bug before? If so, did this affect production code? If so, could we explain the bug in the PR description? (I don't fully understand why the new thing is right...)

Contributor Author

Yes, it was a bug

Contributor Author

We didn't find problems in real mmav2 test cases, but we did find wrong mappings in real mmav3 test cases.
By real cases, I mean those used in test_core.py, as opposed to the C++ unit tests.

Contributor Author

For this unit test case, the slice is a column from the parent. Because the output dimension is 8, it's fully covered by the first registers of t0, t4, ..., t28. The second registers of t0, t4, ..., t28 just store duplicated values of the first registers.

@Jokeren Jokeren changed the title from "[BACKEND] Hopper mma to mma" to "[BACKEND] Support Hopper MMA to MMA convert_layout ops" on Aug 9, 2024
Jokeren (Contributor Author) commented Aug 9, 2024

Hi @jlebar, feel free to take a second pass now. Thanks for the suggestions. I hope my comments are better now :)

@@ -189,18 +189,16 @@ bool supportMMA(triton::DotOp op, int version);

bool supportMMA(Value value, int version);

bool cvtReordersRegisters(RankedTensorType srcTy, RankedTensorType dstTy);
Collaborator

Add a brief comment to this?
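
For instance, something along these lines (wording assumed; not necessarily the comment that actually landed):

  // Returns true iff converting a tensor from srcTy to dstTy only permutes
  // registers within each thread, i.e. requires no cross-lane, cross-warp,
  // or cross-block data movement (and hence no shared memory).
  bool cvtReordersRegisters(RankedTensorType srcTy, RankedTensorType dstTy);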

Collaborator

(In particular does the MmaToDot shortcut also just reorder registers? If so maybe the name for this should have "LinearLayout" or "LL" in the name?)

Contributor Author

What do you mean by "LL" in the name?

Jokeren (Contributor Author) commented Aug 10, 2024

In particular does the MmaToDot shortcut also just reorder registers?

There are some hacks. If matchMmaV3AndDotOperandLayout returns true, we actually already use a warp shuffle, but that's a very special case.

Collaborator

"LL" meant LinearLayout. Anyway this is good, thanks. :)

@@ -279,6 +281,7 @@ struct ConvertLayoutOpUsingLinearLayoutsConversion
int numWarps = conversion.getInDimSize(str_attr("warp"));
int numBlocks = conversion.getInDimSize(str_attr("block"));

StringAttr kRegister = str_attr("register");
Collaborator

unused now?

auto dstToSrc = inverseConversion.divideRight(
LinearLayout::identity1D(numLanes, kLane, kLane) *
LinearLayout::identity1D(numWarps, kWarp, kWarp) *
LinearLayout::identity1D(numBlocks, kBlock, kBlock));
Collaborator

I think it would be clearer to move this into the transferWithinThread function.

Fundamentally the current function creates a layout called conversion, which it uses to decide what kind of conversion to do -- reg-to-reg, shfl, shmem, whatever. Then it calls transferWithinX, which figures out how to do the conversion.

If you moved this into the transferWithinThread function, you wouldn't need much of a comment at all, because you wouldn't be tempted to use conversion! That is, it would be (more) obvious to do dstLayout->invertAndCompose(*srcLayout) rather than vice versa.

Collaborator

(It's also confusing as-is that we take a variable called inverseConversion and pass it into a function where it's used in a variable called plain conversion. Again moving the code into the function may clarify this...)
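
A rough sketch of the suggested restructuring, reusing names from this diff; the signature and body are assumptions about the idea, not the final code:

  // The caller only classifies the conversion and dispatches; the callee
  // derives the map it needs directly from the src/dst layouts, so no
  // `conversion`/`inverseConversion` variable ever crosses the boundary.
  void transferWithinThread(ConvertLayoutOp op, const LinearLayout &srcLayout,
                            const LinearLayout &dstLayout,
                            ConversionPatternRewriter &rewriter) {
    // Written here, it is (more) obvious that dst -> src is the map we want.
    LinearLayout dstToSrcFull = dstLayout.invertAndCompose(srcLayout);
    // ... divide out lane/warp/block and emit the register moves ...
  }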

auto dstToSrc = inverseConversion.divideRight(
LinearLayout::identity1D(numLanes, kLane, kLane) *
LinearLayout::identity1D(numWarps, kWarp, kWarp) *
LinearLayout::identity1D(numBlocks, kBlock, kBlock));
Collaborator

As written here, we're implicitly asserting that if cvtReordersRegisters is true, then you can divide the layout like this.

I don't love this because we're duplicating the logic (i.e. the division) in two places. Could we make a function which tries to divide and returns an Optional instead?
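
A sketch of such a helper (name and placement assumed): divideRight already returns std::optional, so the helper just centralizes the identity1D product that is currently duplicated:

  std::optional<LinearLayout>
  tryGetRegToRegMap(const LinearLayout &conversion, StringAttr kLane,
                    StringAttr kWarp, StringAttr kBlock) {
    // Succeeds iff the conversion is the identity on lane/warp/block,
    // leaving only a permutation (with possible duplication) of registers.
    return conversion.divideRight(
        LinearLayout::identity1D(conversion.getInDimSize(kLane), kLane, kLane) *
        LinearLayout::identity1D(conversion.getInDimSize(kWarp), kWarp, kWarp) *
        LinearLayout::identity1D(conversion.getInDimSize(kBlock), kBlock,
                                 kBlock));
  }

cvtReordersRegisters would then reduce to checking tryGetRegToRegMap(...).has_value(), and the lowering would unwrap the same optional instead of re-deriving the division.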

// because bases that map to a location larger the shape[d]
// effectively duplicate along that dimension. For example, consider a layout
// with an output dimension size of 32, and we call ensureLayoutNotLargerThan to
// shrink the output dimension size to 8:
Collaborator

I notice some typos in this. Here's what ChatGPT suggests, I think this is a good start. https://chatgpt.com/share/e/0290a621-c78b-49e3-97cf-8a95b6b83030

// L(register=4) = 1
// L(lane=1) = 2
// L(lane=2) = 16.
// We achieve this by setting the largest value in each output dimension d to 0
Collaborator

Hm, I don't know what the "largest value in [an] output dimension" is. The old comment makes sense to me, but the new comment seems to have lost information, making it difficult for me to understand.

// L(register=2) = 4
// L(register=4) = 1
// L(lane=1) = 2
// L(lane=2) = 16
Collaborator

We seem to have lost the indentation in the example here. Was that intentional? Indentation helps readers scan the example.

// dimensions.
// Now the output dimension of this layout has a size of 8, which is the desired
// size. Note that this method works only because the bases are powers of two.
// It is unclear what to do when they are not.
LinearLayout ensureLayoutNotLargerThan(
Collaborator

I see, I think I understand what you're doing here now. Thank you for improving the comments.

I am not convinced this new behavior matches the behavior of the old legacy layouts, though. (Maybe we don't need to match them because we got rid of emitIndices based on legacy layouts? But changing the behavior of the layouts is very subtle and can affect lots of other code.)

What problem are we solving that requires us to make this change?

Contributor Author

The problem we are trying to solve is that the old logic is wrong for wgmma (mma v3).

Contributor Author

"triton_gpu.convert_layout"(%1912) {allocation.offset = 0 : i32} : (tensor<64x64xf16, #triton_gpu.nvidia_mma<{versionMajor = 3, versionMinor = 0, warpsPerCTA = [4, 1], instrShape = [16, 128, 16]}>>) -> tensor<64x64xf16, #triton_gpu.nvidia_mma<{versionMajor = 3, versionMinor = 0, warpsPerCTA = [4, 1], instrShape = [16, 64, 16]}>>

Contributor Author

Old logic:

 - register=1 -> (0, 1)
   register=2 -> (8, 0)
   register=4 -> (0, 4)
   register=8 -> (0, 8)
   register=16 -> (0, 16)
   register=32 -> (0, 32)
 - lane=1 -> (0, 2)
   lane=2 -> (0, 0)
   lane=4 -> (1, 0)
   lane=8 -> (2, 0)
   lane=16 -> (4, 0)
 - warp=1 -> (16, 0)
   warp=2 -> (32, 0)
 - block is a size 1 dimension
where out dims are: [dim0 (size 64), dim1 (size 64)]

Contributor Author

lane=2 should be mapped to (0, 4) instead of (0, 0).

Collaborator

Thanks
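
For readers following along, one plausible reading of the shrinking rule discussed in this thread, mirroring the bases iteration visible earlier in the diff; illustrative only, and the real implementation may differ:

  // To shrink output dimension d of a layout to `size` (a power of two):
  // zero every basis component in d that is >= size. Such bases only
  // duplicate data along d, so dropping them loses no information, and
  // because all bases are powers of two the survivors span [0, size).
  for (auto &[inDimName, basesForDim] : bases)
    for (auto &basis : basesForDim)
      if (basis[d] >= size)
        basis[d] = 0;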

Jokeren (Contributor Author) commented Aug 11, 2024

Hi @jlebar , please feel free to take another pass now.


@Jokeren Jokeren enabled auto-merge (squash) August 12, 2024 00:20
@Jokeren Jokeren merged commit 7d89248 into main Aug 12, 2024
6 checks passed
@Jokeren Jokeren deleted the keren/mma-to-mma branch August 12, 2024 00:29
vwbaker added a commit to openxla/triton that referenced this pull request Sep 25, 2024
triton-lang#4492 started causing an
issue where chained MMAs on hopper would segfault with 8 warps. It seems
that previously this was checked, but the check got removed in this PR
and it's still unsupported.

Adding back this check means these MMAs will have to go back to shared
memory, but it's better than segfaulting until it's actually supported.

Resolves openxla/xla#17356
Jokeren pushed a commit that referenced this pull request Sep 25, 2024
…4803)

#4492 started causing an
issue where chained MMAs on hopper would segfault with 8 warps. It seems
that previously this was checked, but the check got removed in this PR
and it's still unsupported.

Adding back this check means these MMAs will have to go back to shared
memory, but it's better than segfaulting until it's actually supported.

Resolves openxla/xla#17356

Co-authored-by: Tori <vwbaker@google.com>

Successfully merging this pull request may close these issues.

Layout conversion error on H100