[LinalgExt] Add online_attention op #17536

Merged
merged 31 commits into iree-org:main from new-decomposition-attention
Jun 12, 2024

Conversation

Groverkss
Contributor

@Groverkss Groverkss commented May 31, 2024

This patch adds a new online_attention op. This op represents a partially reduced attention op that can be tiled along its k2 reduction dimension. The op also carries indexing maps, supports tiling on all dimensions other than the k1 dimension, and can be decomposed based on any given indexing maps.

This patch also switches the CPU backend to use online attention for decomposition and reduction tiling, allowing attention to be tiled along the N and batch dimensions and tiled via LLVMCPUTile.
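For readers unfamiliar with the recurrence behind the op, here is a minimal scalar sketch (not code from this patch; the names, shapes, and single-query/head-size-1 simplification are all illustrative) of the partial reduction that makes tiling along k2 legal: each k2 block folds into a running max, running sum, and rescaled accumulator.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Folds one k2 block of scores (q*k) and values into the carried state.
// Start with runningMax = -inf, runningSum = 0, acc = 0.
void updateBlock(const std::vector<float> &scores,
                 const std::vector<float> &values,
                 float &runningMax, float &runningSum, float &acc) {
  float newMax = runningMax;
  for (float s : scores) newMax = std::max(newMax, s);
  // Rescale previously accumulated partials to the new running max.
  float rescale = std::exp(runningMax - newMax);
  runningSum *= rescale;
  acc *= rescale;
  for (std::size_t i = 0; i < scores.size(); ++i) {
    float p = std::exp(scores[i] - newMax);
    runningSum += p;
    acc += p * values[i];
  }
  runningMax = newMax;
}
// After the last block, the attention output for the row is acc / runningSum.

Because each block only needs the carried (max, sum, acc) triple, the k2 reduction can be split into tiles and resumed tile by tile, which is what the new op exposes through its tiling support.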

@Groverkss Groverkss force-pushed the new-decomposition-attention branch from 1918c97 to cd0db37 Compare June 3, 2024 14:56
@Groverkss Groverkss marked this pull request as ready for review June 4, 2024 18:20
@Groverkss Groverkss force-pushed the new-decomposition-attention branch from 6a947fe to 8ea2ff6 Compare June 5, 2024 14:29
Contributor

@MaheshRavishankar MaheshRavishankar left a comment


Overall this looks OK to me. I didn't look too much into the details of the attention op implementation itself. I am happy to stamp if needed.

Meta comment: please add more comments on methods (more for future you than anything else).

@@ -366,6 +366,16 @@ void DecomposeAttentionPass::runOnOperation() {
SmallVector<Operation *> ops;
decomposeTiledAttention(attnOp, ops, rewriter, optionalTileSize);
Contributor

Can we remove the "decomposeTiledAttention" part now? They are both doing the same thing, right?

Contributor Author

I'd like to do that in a separate patch. There are a number of transform scripts for attention (the CUDA attention transform scripts) that I need to take into account before doing this.

Contributor

@hanhanW hanhanW left a comment


First round of comments, more about asking questions.

The PR is very big. It's fine for this one, but please break it into small PRs in the future. If it were me, I'd split it into:

  1. Introduce OnlineAttention op
  2. Implement TilingInterface methods for the op
  3. Implement AggregatedOpInterface methods for the op
  4. Implement convertToOnlineAttention
  5. The rest of changes on CPU side.

Comment on lines +600 to +601
funcPassManager.addPass(
IREE::LinalgExt::createConvertAttentionToOnlineAttentionPass());
Contributor

Can this happen before tiling parallel dims? Is it a requirement for tiling reduction loops?

Contributor Author

@Groverkss Groverkss Jun 7, 2024

You can only tile reduction loops on the online_attention op. We could do this before tiling parallel dims, but we would then need to propagate lowering_config info in the createConvertAttentionToOnlineAttention pass. For more context, the conversion rewrites:

attention { lowering_config }

to

acc = acc_fill
max = max_fill
sum = sum_fill
out:3 = online_attention acc, max, sum {lowering_config}
elementwise out#0, out#2

The lowering config gets preserved on the online_attention op and is used for reduction tiling. Until we have consumer fusion (and greedy fusion for multiple operands/results) fixed, I don't think we can do it.

As a side note, this doesn't allow us to do further levels of parallel tiling on the elementwise and fill operations (which is not ideal).
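To make the rewrite above concrete, here is a tiny hedged sketch (illustrative names only, assuming the usual online-softmax formulation; not code from this patch) of what the three fills initialize and what the trailing elementwise op computes:

#include <limits>

struct OnlineAttentionState {
  float acc = 0.0f;                                     // acc_fill
  float max = -std::numeric_limits<float>::infinity();  // max_fill
  float sum = 0.0f;                                     // sum_fill
};

// "elementwise out#0, out#2": normalize the accumulator by the running sum.
inline float finalizeOnlineAttention(const OnlineAttentionState &s) {
  return s.acc / s.sum;
}

Since only the online_attention op (not the trailing elementwise) carries the partial state, reduction tiling has to happen on it, which is why the lowering_config needs to survive the conversion.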

Contributor Author

@Groverkss Groverkss Jun 7, 2024

Ideally, I would like there to be a way to propagate the lowering_config attribute when I do a conversion like this (which would mean putting the tiling information on the type, or somewhere more persistent).

Contributor

We could do this before tiling parallel dims, but we would then need to propagate lowering_config info in createConvertAttentionToOnlineAttention pass.

This is more a question than a requirement to address in this PR. I'm trying to see the whole picture of how it could be done in the CPU backend.

So it seems that we can convert the op to an online_attention op before lowering strategy selection, like what we've done for the softmax op. Do you think we want to keep it in attention form when we're tiling the parallel loops? Or does it not matter if we "tile the online_attention op and fuse its producers/consumers into the for loop"?

Contributor Author

Ah, I understand what you mean now. I can try. I'm thinking there might be problems with fusion because the online_attention op has multiple results. Let me try and see if I can do it.

Contributor

No need to try it and land it in this PR, because the PR is already big and this is fairly new to the CPU backend. I can pull in others to help with the CPU changes later. Are there other pending changes for attention ops?

@Groverkss
Contributor Author

First round of comments, more about asking questions.

The PR is very big. It's fine for this one, but please break it into small PRs in the future. If it were me, I'd split it into:

  1. Introduce OnlineAttention op
  2. Implement TilingInterface methods for the op
  3. Implement AggregatedOpInterface methods for the op
  4. Implement convertToOnlineAttention
  5. The rest of changes on CPU side.

Yeah, ideally this patch should have been split up. I just sent my entire experimentation branch as a patch for now (because we need this patch soon).

@Groverkss Groverkss force-pushed the new-decomposition-attention branch from 8ea2ff6 to 997815b Compare June 7, 2024 14:13
@Groverkss Groverkss requested a review from hanhanW June 7, 2024 14:13
Member

@kuhar kuhar left a comment

Left some cosmetic comments

@Groverkss Groverkss requested a review from kuhar June 7, 2024 15:11
Contributor

@hanhanW hanhanW left a comment

LGTM for CPU changes and code structure. I'll review other implementation details later.


@Groverkss Groverkss force-pushed the new-decomposition-attention branch from bf8cfce to 48470bf Compare June 10, 2024 12:11
@Groverkss Groverkss force-pushed the new-decomposition-attention branch from 48470bf to cdc9f6b Compare June 10, 2024 16:45
@Groverkss Groverkss force-pushed the new-decomposition-attention branch from cdc9f6b to dc651fa Compare June 11, 2024 17:11
@Groverkss Groverkss enabled auto-merge (squash) June 11, 2024 17:23
@Groverkss Groverkss force-pushed the new-decomposition-attention branch from 53a65bb to 713c95b Compare June 12, 2024 13:43
@Groverkss Groverkss merged commit abf0087 into iree-org:main Jun 12, 2024
50 of 51 checks passed
@ScottTodd
Member

Did this regress CPU performance?

Presubmit test results on this PR are suspicious, and postsubmit started failing after the merge.

FAILED SHARK-TestSuite/iree_tests/pytorch/models/sdxl-scheduled-unet-3-tank/model.mlirbc::cpu_llvm_task_real_weights - Failed: Timeout >1200.0s

https://github.com/iree-org/iree/actions/runs/9484305572/job/26134004282


@ScottTodd
Member

Looking at the CI logs, this may have timed out during compilation. That makes more sense than timing out at runtime, but it should still be investigated.

Compile command from the logs:

INFO     root:conftest.py:393 Launching compile command:
cd /home/nod/actions-runner/_work/iree/iree/SHARK-TestSuite/iree_tests/pytorch/models/sdxl-scheduled-unet-3-tank && iree-compile model.mlirbc --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-cpu-features=host --iree-input-demote-f64-to-f32 -o model_cpu_llvm_task_real_weights.vmfb

Model file: https://github.com/nod-ai/SHARK-TestSuite/blob/main/iree_tests/pytorch/models/sdxl-scheduled-unet-3-tank/model.mlirbc

ScottTodd added a commit that referenced this pull request Jun 12, 2024
ScottTodd added a commit that referenced this pull request Jun 12, 2024
@hanhanW
Contributor

hanhanW commented Jun 13, 2024

@Groverkss I'm not able to take a look today, but I can do it tomorrow. Let me know if you want me to take a look.

LLITCHEV pushed a commit to LLITCHEV/iree that referenced this pull request Jul 30, 2024
This patch adds a new online_attention op. This op represents a partially reduced attention op that can be tiled along its k2 reduction dimension. The op also carries indexing maps, supports tiling on all dimensions other than the k1 dimension, and can be decomposed based on any given indexing maps.

This patch also switches the CPU backend to use online attention for decomposition and reduction tiling, allowing attention to be tiled along the N and batch dimensions and tiled via LLVMCPUTile.

Signed-off-by: Lubo Litchev <lubol@google.com>
LLITCHEV pushed a commit to LLITCHEV/iree that referenced this pull request Jul 30, 2024
Reverts iree-org#17536

This caused `sdxl-scheduled-unet-3-tank` to hit timeouts when compiling for CPU:
https://github.com/iree-org/iree/actions/runs/9484305572/job/26134004282

Signed-off-by: Lubo Litchev <lubol@google.com>