[Docs][Tuner] Add initial tuning spec docs #19462

Merged (5 commits, Dec 11, 2024)
1 change: 1 addition & 0 deletions docs/website/docs/reference/index.md
@@ -18,4 +18,5 @@ repository.

* [Glossary](./glossary.md)
* [Optimization options](./optimization-options.md)
* [Tuning](./tuning.md)
* [Extensions](./extensions.md)
136 changes: 136 additions & 0 deletions docs/website/docs/reference/tuning.md
@@ -0,0 +1,136 @@
---
icon: octicons/meter-16
---

# Tuning

This page documents support for IREE dispatch tuning. The compiler supports
both default and user-provided tuning specs (specifications) that override the
compiler heuristics guiding dispatch code generation. In our experience, tuning
specs can provide a meaningful speedup in model execution; for example, we
achieved a ~10% improvement on the Stable Diffusion XL (SDXL) model on the
MI300X GPU.

## Tuning specs

The default specs are shipped with the IREE compiler and are target-specific.
We aim to provide default tuning specs that cover the most in-demand hardware
and dispatches from the most popular ML models, although we do not guarantee
completeness.

User-provided tuning specs are a mechanism that allows users to get the best
performance on custom models and hardware targets without having to modify the
compiler source code or needlessly special-case compiler heuristics.

Currently, the dispatch tuner that generates tuning specs is still experimental
and hosted
[in an external repo](https://github.com/nod-ai/shark-ai/tree/main/tuner). This
document describes how to work with tuning specs generated by the SHARK Tuner
or produced manually, but it does not go into detail on how to generate these
specs.

## Flags

The use of tuning specs in `iree-compile` is controlled with the following
flags:

* `--iree-codegen-enable-default-tuning-specs` -- enables or disables the
default tuning specs shipped with the compiler.
* `--iree-codegen-tuning-spec-path` -- loads a user-specified tuning spec.
* `--iree-codegen-dump-tuning-specs-to` -- dumps final tuning specs to a
directory or standard output.

Note that both default and user-provided specs can be enabled at the same time.
The compiler will link them together and invoke the user-provided spec before
attempting the default one.
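
For example, assuming an input program `model.mlir` and a hand-written spec
file `my_spec.mlir` (both file names are placeholders, and target-selection
flags are omitted), a compilation that combines the default and user-provided
specs might look like this:

```shell
iree-compile model.mlir -o model.vmfb \
    --iree-codegen-enable-default-tuning-specs=true \
    --iree-codegen-tuning-spec-path=my_spec.mlir
```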

## Anatomy of a tuning spec

### Example

```mlir
module @my_spec attributes { transform.with_named_sequence } {
  transform.named_sequence @apply_op_config(%op: !transform.any_op {transform.readonly},
                                            %config: !transform.any_param {transform.readonly}) {
    transform.annotate %op "compilation_info" = %config : !transform.any_op, !transform.any_param
    transform.yield
  }

  transform.named_sequence
  @match_mmt_f16_f16_f32(%root: !transform.any_op {transform.readonly}) -> !transform.any_op {
    transform.match.operation_name %root ["linalg.generic"] : !transform.any_op
    %ins, %outs = transform.iree.match.cast_compatible_dag_from_root %root {
    ^bb0(%lhs: tensor<?x?xf16>, %rhs: tensor<?x?xf16>, %out: tensor<?x?xf32>):
      %7 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>,
                                            affine_map<(d0, d1, d2) -> (d1, d2)>,
                                            affine_map<(d0, d1, d2) -> (d0, d1)>],
                           iterator_types = ["parallel", "parallel", "reduction"]}
        ins(%lhs, %rhs : tensor<?x?xf16>, tensor<?x?xf16>) outs(%out : tensor<?x?xf32>) {
      ^bb0(%in: f16, %in_0: f16, %acc: f32):
        %8 = arith.extf %in : f16 to f32
        %9 = arith.extf %in_0 : f16 to f32
        %10 = arith.mulf %8, %9 : f32
        %11 = arith.addf %acc, %10 : f32
        linalg.yield %11 : f32
      } -> tensor<?x?xf32>
    } : (!transform.any_op) -> (!transform.any_value, !transform.any_value)
    transform.yield %root : !transform.any_op
  }

  transform.named_sequence
  @match_mmt_2048x1280x5120_f16_f16_f32(%matmul: !transform.any_op {transform.readonly})
      -> (!transform.any_op, !transform.any_param) {
    %mmt = transform.include @match_mmt_f16_f16_f32 failures(propagate) (%matmul)
      : (!transform.any_op) -> !transform.any_op
    %lhs = transform.get_operand %matmul[0] : (!transform.any_op) -> !transform.any_value
    %rhs = transform.get_operand %matmul[1] : (!transform.any_op) -> !transform.any_value
    transform.iree.match.cast_compatible_type %lhs = tensor<2048x5120xf16> : !transform.any_value
    transform.iree.match.cast_compatible_type %rhs = tensor<1280x5120xf16> : !transform.any_value
    %config = transform.param.constant #iree_codegen.compilation_info<
      lowering_config = #iree_gpu.lowering_config<{promote_operands = [0, 1],
                                                   mma_kind = #iree_gpu.mma_layout<MFMA_F32_16x16x16_F16>,
                                                   subgroup_m_count = 2, subgroup_n_count = 2,
                                                   reduction = [0, 0, 64],
                                                   workgroup = [64, 128, 0]}>,
      translation_info = #iree_codegen.translation_info<pipeline = LLVMGPUVectorDistribute
                          workgroup_size = [256, 1, 1] subgroup_size = 64,
                          {gpu_pipeline_options = #iree_gpu.pipeline_options<prefetch_shared_memory = true>}>
      > -> !transform.any_param
    transform.yield %matmul, %config : !transform.any_op, !transform.any_param
  }

  transform.named_sequence
  @__kernel_config(%variant_op: !transform.any_op {transform.consumed}) -> !transform.any_op
      attributes { iree_codegen.tuning_spec_entrypoint } {
    %res = transform.foreach_match in %variant_op
      @match_mmt_2048x1280x5120_f16_f16_f32 -> @apply_op_config
      : (!transform.any_op) -> !transform.any_op
    transform.yield %res : !transform.any_op
  }
}
```

### Explanation

Tuning specs are
[transform dialect](https://mlir.llvm.org/docs/Dialects/Transform/) libraries
that conform to the following format:

* All tuning spec entry points (named sequence ops) are marked with the
`iree_codegen.tuning_spec_entrypoint` attribute. They have a single argument
of type `!transform.any_op` and return a single value of type
`!transform.any_op`.
* All entry points in the final tuning specs must either read
(`transform.readonly`) or consume (`transform.consumed`) the argument.
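
As a minimal sketch, a spec that satisfies these requirements but matches
nothing and applies no configuration could look like the following (the module
and sequence names are illustrative):

```mlir
module @minimal_spec attributes { transform.with_named_sequence } {
  // The entry point takes the variant op and returns it unchanged.
  transform.named_sequence
  @__kernel_config(%variant_op: !transform.any_op {transform.readonly}) -> !transform.any_op
      attributes { iree_codegen.tuning_spec_entrypoint } {
    transform.yield %variant_op : !transform.any_op
  }
}
```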

The full example spec above attempts to match `linalg.generic` ops that
correspond to a matmul with the RHS operand transposed (a.k.a. mmt) of shape
`2048x1280x5120`, with `f16` operand element types and an `f32` result element
type.

If the match succeeds, the tuning spec applies the `compilation_info` attribute
that drives code generation. This attribute is considered a compiler
implementation detail; in general, each codegen pipeline has its own
requirements for what constitutes a valid compilation info and how to
interpret it.

Tuning specs are executed by the `Materialize User Configs` pass.
1 change: 1 addition & 0 deletions docs/website/mkdocs.yml
@@ -188,6 +188,7 @@ nav:
- "Other topics":
- Glossary: "reference/glossary.md"
- Optimization options: "reference/optimization-options.md"
- Tuning: "reference/tuning.md"
- Extensions: "reference/extensions.md"
- "Developers":
- "developers/index.md"