From 10e5bbbed2d4ce95c4a213cb393bc7d00d24f2cd Mon Sep 17 00:00:00 2001 From: Scott Todd Date: Wed, 11 Dec 2024 20:57:11 -0800 Subject: [PATCH 01/64] Refresh links and install instructions for PyTorch/iree-turbine. (#19470) * Relax the warning about support status (we no longer have pending bug fixes in flight or depend on nightly releases) * Swap PyPI URL for GitHub URL as project homepage * Mention nightly releases (similar to https://github.com/iree-org/iree-turbine?tab=readme-ov-file#quick-start-for-users) * Adjust code sample URLs, including removing a 404'd link and adding a new link to https://github.com/nod-ai/shark-ai/tree/main/sharktank --- .../docs/guides/ml-frameworks/pytorch.md | 54 ++++++++++++------- 1 file changed, 35 insertions(+), 19 deletions(-) diff --git a/docs/website/docs/guides/ml-frameworks/pytorch.md b/docs/website/docs/guides/ml-frameworks/pytorch.md index c94194b2826c..20e5d2971ed8 100644 --- a/docs/website/docs/guides/ml-frameworks/pytorch.md +++ b/docs/website/docs/guides/ml-frameworks/pytorch.md @@ -12,17 +12,15 @@ status: new !!! caution "Caution - under development" - We are still validating and fixing specific models. Between bug fixes in - flight and releases running behind, we don't expect that you will be able - to do a lot of advanced things without using nightly releases or working - with us. + We are still validating and fixing specific models. We don't expect that + you will be able to do a lot of advanced things without working with us. Stay tuned and join the discussion in our [Discord server](https://discord.gg/wEWh6Z9nMU)'s `#pytorch` channel. ## :octicons-book-16: Overview -[iree-turbine](https://pypi.org/project/iree-turbine/) offers a tight +[iree-turbine](https://github.com/iree-org/iree-turbine) offers a tight integration between compatible versions of IREE, [torch-mlir](https://github.com/llvm/torch-mlir), and [PyTorch](https://pytorch.org/). 
@@ -64,22 +62,37 @@ graph LR ## :octicons-download-16: Prerequisites -Install a recent version of PyTorch -(`2.4.1`, latest stable release as of September 2024): +We recommend first installing a recent version of PyTorch for CPU by following +the [official instructions](https://pytorch.org/get-started/locally/). ``` shell python -m pip install \ - --index-url https://download.pytorch.org/whl/test/cpu torch==2.4.1 + --index-url https://download.pytorch.org/whl/test/cpu torch>=2.3.0 ``` - - Install iree-turbine: -``` shell -python -m pip install iree-turbine -``` +=== ":octicons-package-16: Stable releases" + + Stable release packages are + [published to PyPI](https://pypi.org/project/iree-turbine/). + + ``` shell + python -m pip install iree-turbine + ``` + +=== ":octicons-beaker-16: Nightly pre-releases" + + Nightly pre-releases are published on + [GitHub releases](https://github.com/iree-org/iree-turbine/releases/tag/dev-wheels). + + ``` shell hl_lines="2-4" + python -m pip install \ + --find-links https://iree.dev/pip-release-links.html \ + --pre \ + --upgrade \ + iree-turbine + ``` ## :octicons-flame-16: Just-in-time (JIT) execution @@ -158,7 +171,7 @@ turbine_output = opt_linear_module(args) | Code samples | | | -- | -- | JIT compilation notebook | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/iree-org/iree/blob/main/samples/colab/pytorch_jit.ipynb) -Simple MLP eager | [`core/examples/eager_mlp/mlp_eager_simple.py`](https://github.com/iree-org/iree-turbine/tree/main/examples/eager_mlp/mlp_eager_simple.py) +Simple MLP eager | [iree-turbine `core/examples/eager_mlp/mlp_eager_simple.py`](https://github.com/iree-org/iree-turbine/tree/main/examples/eager_mlp/mlp_eager_simple.py) ## :octicons-package-dependents-16: Ahead-of-time (AOT) export @@ -235,7 +248,7 @@ print(result.to_host()) | -- | -- | Simple AOT export notebook | [![Open in 
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/iree-org/iree/blob/main/samples/colab/pytorch_aot_simple.ipynb) Import [Whisper](https://huggingface.co/openai/whisper-small) from [:hugging: Hugging Face](https://huggingface.co/) notebook | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/iree-org/iree/blob/main/samples/colab/pytorch_huggingface_whisper.ipynb) -Simple MLP export | [`core/examples/aot_mlp/mlp_export_simple.py`](https://github.com/iree-org/iree-turbine/tree/main/examples/aot_mlp/mlp_export_simple.py) +Simple MLP export | [iree-turbine `core/examples/aot_mlp/mlp_export_simple.py`](https://github.com/iree-org/iree-turbine/tree/main/examples/aot_mlp/mlp_export_simple.py) ### :octicons-tools-16: Advanced API @@ -451,6 +464,9 @@ np.save("input.npy", input_np) | -- | -- | Advanced AOT export notebook | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/iree-org/iree/blob/main/samples/colab/pytorch_aot_advanced.ipynb) PyTorch dynamic shapes notebook | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/iree-org/iree/blob/main/samples/dynamic_shapes/pytorch_dynamic_shapes.ipynb) -AOT unit tests | [`tests/aot/`](https://github.com/iree-org/iree-turbine/tree/main/tests/aot) -Dynamic MLP export | [`core/examples/aot_mlp/mlp_export_dynamic.py`](https://github.com/iree-org/iree-turbine/tree/main/examples/aot_mlp/mlp_export_dynamic.py) -stateless llama2 | [`models/turbine_models/custom_models/stateless_llama.py`](https://github.com/nod-ai/SHARK-ModelDev/blob/main/models/turbine_models/custom_models/stateless_llama.py) +AOT unit tests | [iree-turbine `tests/aot/`](https://github.com/iree-org/iree-turbine/tree/main/tests/aot) + +The sharktank project hosted at + also uses +`iree-turbine` heavily to provide 
inference-optimized ops, layers, and models +for popular gen-ai applications. From d68cd28a2220b41079180091d4fb9c0051ecce8f Mon Sep 17 00:00:00 2001 From: Marius Brehler Date: Thu, 12 Dec 2024 10:38:26 +0100 Subject: [PATCH 02/64] Limit scheduling jobs to iree-org (#19224) Limits the execution of cron-scheduled jobs to iree-org, while still allowing them to be triggered via `workflow_dispatch` from forks. --- .github/workflows/ci_linux_arm64_clang.yml | 1 + .github/workflows/ci_linux_x64_clang_debug.yml | 3 ++- .github/workflows/ci_linux_x64_clang_tsan.yml | 7 ++----- .github/workflows/ci_windows_x64_msvc.yml | 1 + .github/workflows/samples.yml | 3 +++ .github/workflows/schedule_candidate_release.yml | 1 + 6 files changed, 10 insertions(+), 6 deletions(-) diff --git a/.github/workflows/ci_linux_arm64_clang.yml b/.github/workflows/ci_linux_arm64_clang.yml index 65bc19b19d03..00fb531ff528 100644 --- a/.github/workflows/ci_linux_arm64_clang.yml +++ b/.github/workflows/ci_linux_arm64_clang.yml @@ -25,6 +25,7 @@ concurrency: jobs: linux_arm64_clang: + if: ${{ github.repository_owner == 'iree-org' }} # See https://gitlab.arm.com/tooling/gha-runner-docs runs-on: ah-ubuntu_22_04-c7g_4x-50 container: diff --git a/.github/workflows/ci_linux_x64_clang_debug.yml b/.github/workflows/ci_linux_x64_clang_debug.yml index 9f2e629b8656..c57319a38b5d 100644 --- a/.github/workflows/ci_linux_x64_clang_debug.yml +++ b/.github/workflows/ci_linux_x64_clang_debug.yml @@ -27,7 +27,8 @@ jobs: # This may run out of memory / disk space on standard GitHub-hosted runners, # so run on self-hosted CPU build runners instead. 
linux_x64_clang_debug: - runs-on: azure-linux-scale + if: ${{ github.repository_owner == 'iree-org' || github.event_name != 'schedule' }} + runs-on: ${{ github.repository_owner == 'iree-org' && 'azure-linux-scale' || 'ubuntu-24.04' }} container: ghcr.io/iree-org/cpubuilder_ubuntu_jammy@sha256:78a558b999b230f7e1da376639e14b44f095f30f1777d6a272ba48c0bbdd4ccb defaults: run: diff --git a/.github/workflows/ci_linux_x64_clang_tsan.yml b/.github/workflows/ci_linux_x64_clang_tsan.yml index 7ababb08c4fa..6d6562c888f7 100644 --- a/.github/workflows/ci_linux_x64_clang_tsan.yml +++ b/.github/workflows/ci_linux_x64_clang_tsan.yml @@ -24,12 +24,9 @@ concurrency: cancel-in-progress: true jobs: - setup: - uses: ./.github/workflows/setup.yml - linux_x64_clang_tsan: - needs: setup - runs-on: azure-linux-scale + if: ${{ github.repository_owner == 'iree-org' || github.event_name != 'schedule' }} + runs-on: ${{ github.repository_owner == 'iree-org' && 'azure-linux-scale' || 'ubuntu-24.04' }} container: image: ghcr.io/iree-org/cpubuilder_ubuntu_jammy@sha256:78a558b999b230f7e1da376639e14b44f095f30f1777d6a272ba48c0bbdd4ccb # TSan in particular needs some settings that this option includes: diff --git a/.github/workflows/ci_windows_x64_msvc.yml b/.github/workflows/ci_windows_x64_msvc.yml index d309732bb272..09ebea36a215 100644 --- a/.github/workflows/ci_windows_x64_msvc.yml +++ b/.github/workflows/ci_windows_x64_msvc.yml @@ -21,6 +21,7 @@ concurrency: jobs: windows_x64_msvc: + if: ${{ github.repository_owner == 'iree-org' || github.event_name != 'schedule' }} runs-on: azure-windows-scale env: BASE_BUILD_DIR_POWERSHELL: C:\mnt\azure\b diff --git a/.github/workflows/samples.yml b/.github/workflows/samples.yml index 576ebe623fbf..d13221705197 100644 --- a/.github/workflows/samples.yml +++ b/.github/workflows/samples.yml @@ -28,6 +28,7 @@ concurrency: jobs: colab: + if: ${{ github.repository_owner == 'iree-org' || github.event_name != 'schedule' }} runs-on: ubuntu-24.04 steps: - name: 
"Checking out repository" @@ -40,6 +41,7 @@ jobs: run: ./samples/colab/test_notebooks.py samples: + if: ${{ github.repository_owner == 'iree-org' || github.event_name != 'schedule' }} runs-on: ubuntu-24.04 env: CC: clang @@ -59,6 +61,7 @@ jobs: run: ./build_tools/testing/test_samples.sh web: + if: ${{ github.repository_owner == 'iree-org' || github.event_name != 'schedule' }} runs-on: ubuntu-24.04 env: VENV_DIR: ${{ github.workspace }}/.venv diff --git a/.github/workflows/schedule_candidate_release.yml b/.github/workflows/schedule_candidate_release.yml index fb2b5facdc8d..d9b247350dcd 100644 --- a/.github/workflows/schedule_candidate_release.yml +++ b/.github/workflows/schedule_candidate_release.yml @@ -14,6 +14,7 @@ on: jobs: tag_release: + if: ${{ github.repository_owner == 'iree-org' || github.event_name != 'schedule' }} name: "Tag candidate release" runs-on: ubuntu-24.04 steps: From 27742f6e559742ed7918229a3be3ddc45f8bc3ed Mon Sep 17 00:00:00 2001 From: Scott Todd Date: Thu, 12 Dec 2024 07:56:46 -0800 Subject: [PATCH 03/64] Deflake some pkgci jobs. 
(#19472) * Increase real weight test timeouts from 4 minutes to 10 minutes to work around https://github.com/iree-org/iree/actions/runs/12281522213/job/34271200734#step:9:1461 ``` ============================== slowest durations =============================== 240.00s call SHARK-TestSuite/iree_tests/sharktank/punet/int8/test_cases.json::sdxl_unet_int8_export.mlir::gpu_rocm::real_weights 31.44s call SHARK-TestSuite/iree_tests/sharktank/punet/fp16/test_cases.json::sdxl_unet_fp16_export.mlir::gpu_rocm::real_weights 11.22s call SHARK-TestSuite/iree_tests/sharktank/llama/open-llama-3b-v2-f16/test_cases.json::open-llama-3b-v2-f16.mlirbc::gpu_rocm::real_weights_prefill 0.08s call SHARK-TestSuite/iree_tests/pytorch/models/resnet50/test_cases.json::resnet50.mlirbc::gpu_rocm::real_weights 0.07s call SHARK-TestSuite/iree_tests/pytorch/models/opt-125M/test_cases.json::opt-125M.mlirbc::gpu_rocm::real_weights (10 durations < 0.005s hidden. Use -vv to show these durations.) =========================== short test summary info ============================ PASSED SHARK-TestSuite/iree_tests/sharktank/llama/open-llama-3b-v2-f16/test_cases.json::open-llama-3b-v2-f16.mlirbc::gpu_rocm::real_weights_prefill PASSED SHARK-TestSuite/iree_tests/sharktank/punet/fp16/test_cases.json::sdxl_unet_fp16_export.mlir::gpu_rocm::real_weights XFAIL SHARK-TestSuite/iree_tests/pytorch/models/opt-125M/test_cases.json::opt-125M.mlirbc::gpu_rocm::real_weights - Expected compilation to fail (included in 'expected_compile_failures') XFAIL SHARK-TestSuite/iree_tests/pytorch/models/resnet50/test_cases.json::resnet50.mlirbc::gpu_rocm::real_weights - Expected compilation to fail (included in 'expected_compile_failures') FAILED SHARK-TestSuite/iree_tests/sharktank/punet/int8/test_cases.json::sdxl_unet_int8_export.mlir::gpu_rocm::real_weights - Failed: Timeout >240.0s ======= 1 failed, 2 passed, 2 deselected, 2 xfailed in 282.99s (0:04:42) ======= ``` * Skip flaky test_gridsample_zeros_padding op test to work around 
https://github.com/iree-org/iree/actions/runs/12286576807/job/34287344921#step:8:59 ``` _ IREE compile and run: test_gridsample_zeros_padding::model.mlir::model.mlir::cpu_llvm_sync _ [gw3] linux -- Python 3.11.10 /home/runner/work/iree/iree/venv/bin/python Error invoking iree-run-module Error code: 1 Stderr diagnostics: Stdout diagnostics: EXEC @test_gridsample_zeros_padding [FAILED] result[0]: element at index 3 (2.80544E+13) does not match the expected (0); expected that the view is equal to contents of a view of 1x1x2x4xf32 expected: 1x1x2x4xf32=[[[0 0 1.7 0][0 1.7 0 0]]] actual: 1x1x2x4xf32=[[[0 0 1.7 2.80544E+13][2.80544E+13 1.7 0 2.80544E+13]]] ``` and https://github.com/iree-org/iree/actions/runs/12285879922/job/34285283119#step:8:51 ``` _ IREE compile and run: test_gridsample_zeros_padding::model.mlir::model.mlir::cpu_llvm_sync _ [gw3] linux -- Python 3.11.11 /home/runner/work/iree/iree/venv/bin/python Error invoking iree-run-module Error code: 1 Stderr diagnostics: Stdout diagnostics: EXEC @test_gridsample_zeros_padding [FAILED] result[0]: element at index 3 (39529.7) does not match the expected (0); expected that the view is equal to contents of a view of 1x1x2x4xf32 expected: 1x1x2x4xf32=[[[0 0 1.7 0][0 1.7 0 0]]] actual: 1x1x2x4xf32=[[[0 0 1.7 39529.7][39529.7 1.7 0 39529.7]]] ``` (This test seems to be failing consistently as of https://github.com/iree-org/iree/commit/ea9176ab6f299d5d0fb01b887bc7b4478fad9c4b, but with differing outputs, we could mark it as failing or skip) --- .github/workflows/pkgci_regression_test.yml | 2 +- .../iree-test-suites/onnx_ops/onnx_ops_cpu_llvm_sync.json | 4 +++- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/.github/workflows/pkgci_regression_test.yml b/.github/workflows/pkgci_regression_test.yml index 86d6169672b7..1b22e6ee8950 100644 --- a/.github/workflows/pkgci_regression_test.yml +++ b/.github/workflows/pkgci_regression_test.yml @@ -112,7 +112,7 @@ jobs: --no-skip-tests-missing-files \ --capture=no \ 
--log-cli-level=info \ - --timeout=240 \ + --timeout=600 \ --durations=0 \ --config-files=${MODELS_CONFIG_FILE_PATH} diff --git a/tests/external/iree-test-suites/onnx_ops/onnx_ops_cpu_llvm_sync.json b/tests/external/iree-test-suites/onnx_ops/onnx_ops_cpu_llvm_sync.json index abbacba76305..351c3420407c 100644 --- a/tests/external/iree-test-suites/onnx_ops/onnx_ops_cpu_llvm_sync.json +++ b/tests/external/iree-test-suites/onnx_ops/onnx_ops_cpu_llvm_sync.json @@ -13,7 +13,9 @@ "onnx/node/generated/test_group_normalization_epsilon_expanded", "onnx/node/generated/test_group_normalization_example_expanded" ], - "skip_run_tests": [], + "skip_run_tests": [ + "onnx/node/generated/test_gridsample_zeros_padding" + ], "expected_compile_failures": [ "onnx/node/generated/test_affine_grid_2d", "onnx/node/generated/test_affine_grid_2d_align_corners", From 9b8595d53d4b8c85d2d118ff7e3eea7848673e56 Mon Sep 17 00:00:00 2001 From: Andrew Woloszyn Date: Thu, 12 Dec 2024 09:08:51 -0800 Subject: [PATCH 04/64] Revert llvm submodule change that was accidentally added in #18790 (#19476) Signed-off-by: Andrew Woloszyn --- third_party/llvm-project | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/third_party/llvm-project b/third_party/llvm-project index 6038573ce5f7..65099e8406e8 160000 --- a/third_party/llvm-project +++ b/third_party/llvm-project @@ -1 +1 @@ -Subproject commit 6038573ce5f70b6c62db858950ae6040aa182fb9 +Subproject commit 65099e8406e8b7003b64bb9f929511d25358a521 From e562559cce413afe8afe2e918e23a443aefb71e6 Mon Sep 17 00:00:00 2001 From: Scott Todd Date: Thu, 12 Dec 2024 09:24:49 -0800 Subject: [PATCH 05/64] Increase all timeouts in pkgci_regression_test.yml. (#19477) Follow-up to https://github.com/iree-org/iree/pull/19472. 
CI is still showing timeouts: https://github.com/iree-org/iree/actions/runs/12300081495/job/34328004297#step:6:390 ci-exactly: build_packages, regression_test --- .github/workflows/pkgci_regression_test.yml | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/.github/workflows/pkgci_regression_test.yml b/.github/workflows/pkgci_regression_test.yml index 1b22e6ee8950..04ae87f37105 100644 --- a/.github/workflows/pkgci_regression_test.yml +++ b/.github/workflows/pkgci_regression_test.yml @@ -189,7 +189,7 @@ jobs: -rpfE \ --capture=no \ --log-cli-level=info \ - --timeout=240 \ + --timeout=600 \ --durations=0 env: ROCM_CHIP: ${{ matrix.rocm-chip }} @@ -203,7 +203,7 @@ jobs: -rpfE \ --capture=no \ --log-cli-level=info \ - --timeout=240 \ + --timeout=600 \ --durations=0 env: ROCM_CHIP: ${{ matrix.rocm-chip }} @@ -227,7 +227,7 @@ jobs: --goldensize-rocm-clip-bytes 860000 \ --goldensize-rocm-vae-bytes 840000 \ --rocm-chip gfx90a \ - --timeout=240 \ + --timeout=600 \ --log-cli-level=info \ --retries 7 echo "$(> $GITHUB_STEP_SUMMARY @@ -256,6 +256,6 @@ jobs: --goldensize-rocm-punet-int8-fp8-bytes 2800000 \ --rocm-chip gfx942 \ --log-cli-level=info \ - --timeout=240 \ + --timeout=600 \ --retries 7 echo "$(> $GITHUB_STEP_SUMMARY From 7e1804bcc94438aa36768bedc0e4045034cb81e9 Mon Sep 17 00:00:00 2001 From: MaheshRavishankar <1663364+MaheshRavishankar@users.noreply.github.com> Date: Thu, 12 Dec 2024 11:32:59 -0800 Subject: [PATCH 06/64] Update LLVM to llvm/llvm-project@3f136f7 (#19479) Carrying the following reverts - https://github.com/llvm/llvm-project/pull/116470 - https://github.com/llvm/llvm-project/pull/117424 - https://github.com/llvm/llvm-project/pull/119671 The first two are carried over from the previous integrate and are being fixed in https://github.com/iree-org/iree/pull/19451. The last one is a new failure. 
--------- Signed-off-by: MaheshRavishankar --- third_party/llvm-project | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/third_party/llvm-project b/third_party/llvm-project index 65099e8406e8..1cfbe1f2e035 160000 --- a/third_party/llvm-project +++ b/third_party/llvm-project @@ -1 +1 @@ -Subproject commit 65099e8406e8b7003b64bb9f929511d25358a521 +Subproject commit 1cfbe1f2e035fce940fef0dd6a0568a05d989d11 From 900ef1dda1a16d1d8f2c404bfd0dbb006f81eced Mon Sep 17 00:00:00 2001 From: Twice Date: Fri, 13 Dec 2024 03:58:06 +0800 Subject: [PATCH 07/64] [PJRT] Update README to align with the current status (#19457) Some content in the README of PJRT plugin is outdated, so in this PR I try to update them. Since it is difficult for me to decide some of the content directly, there will be some comments below to solicit people's suggestions : ) --------- Signed-off-by: PragmaTwice --- integrations/pjrt/README.md | 24 +++++++++++++----------- 1 file changed, 13 insertions(+), 11 deletions(-) diff --git a/integrations/pjrt/README.md b/integrations/pjrt/README.md index 04f8aaaa4c2a..9c018be70f58 100644 --- a/integrations/pjrt/README.md +++ b/integrations/pjrt/README.md @@ -13,12 +13,11 @@ most powerful). ## Install a compatible version of Jax and the IREE compiler -``` +```shell pip install -r requirements.txt -# Assume that you have the Jax repo checked out at JAX_REPO from -# https://github.com/google/jax (must be paired with nightly jaxlib). -pip install -e $JAX_REPO +# a higher version of jax is highly recommended, e.g. 0.4.36 +pip install jax==0.4.36 ``` Verify that your Jax install is functional like: @@ -78,14 +77,17 @@ The plugin `openxla_pjrt_artifacts` is in the `ctstools` directory and performs additional manipulation of the environment in order to save compilation artifacts, reproducers, etc. 
-## Contacts +## Communication channels -* [GitHub issues](https://github.com/openxla/openxla-pjrt-plugin/issues): - Feature requests, bugs, and other work tracking -* [OpenXLA discord](https://discord.gg/pvuUmVQa): Daily development discussions - with the core team and collaborators +* Please submit feature requests and bug reports about the plugin in [GitHub Issues](https://github.com/iree-org/iree/issues). +* Discuss the development of the plugin at `#jax` or `#pjrt-plugin` channel of [IREE Discord server](https://discord.gg/wEWh6Z9nMU). +* Check the [OpenXLA/XLA](https://github.com/openxla/xla) repo and [its communication channels](https://github.com/openxla/community?tab=readme-ov-file#communication-channels) for PJRT APIs and clients. ## License -OpenXLA PJRT plugin is licensed under the terms of the Apache 2.0 License with -LLVM Exceptions. See [LICENSE](LICENSE) for more information. +IREE PJRT plugin is licensed under the terms of the Apache 2.0 License with +LLVM Exceptions. See [LICENSE](../../LICENSE) for more information. + +[PJRT C API](./third_party/pjrt_c_api) comes from +[OpenXLA/XLA](https://github.com/openxla/xla) and is licensed under +the Apache 2.0 License. See its own [LICENSE](./third_party/pjrt_c_api/LICENSE) for more information. From c618134f5d077a19b97add87f268e722a86b78df Mon Sep 17 00:00:00 2001 From: Han-Chung Wang Date: Thu, 12 Dec 2024 19:27:18 -0800 Subject: [PATCH 08/64] Calculate storage bytes through interface method for encoding types. (#19413) The revision moves the implementation of storage bytes calculation to `EncodingAttr::calculateStorageSizeInBytes`. If the encoding attribute implements the interface, the implementation has higher priority. Because it knows all the details, including whether packing the data back-to-back or not. The change is not NFC because it also fixes a bug for dynamic cases. The `dynamicDims` value range is not a mixed value range. It is only for dynamic cases. 
To make the logic correct, we need to use `getDynamicDimIndex()` to get the corresponding dimension index before the update. The revision duplicates two methods from Util to Encoding dialect because we do not want the dependency (i.e., encoding -> util): - getTypeBitWidth - getRoundedElementByteWidth The function argument of `calculateStorageSizeInBytes` method is changed because of the needs. --------- Signed-off-by: hanhanW --- .../Dialect/Encoding/IR/EncodingAttrs.cpp | 107 +++++++++++++++++- .../Dialect/Encoding/IR/EncodingAttrs.td | 15 ++- .../Dialect/Encoding/IR/EncodingInterfaces.td | 3 +- .../Transforms/test/encode_host_tensors.mlir | 18 +++ .../compiler/Utils/ElementPackingUtils.cpp | 54 +++------ 5 files changed, 147 insertions(+), 50 deletions(-) diff --git a/compiler/src/iree/compiler/Dialect/Encoding/IR/EncodingAttrs.cpp b/compiler/src/iree/compiler/Dialect/Encoding/IR/EncodingAttrs.cpp index cd023d0ec92b..593d9b8fc5c6 100644 --- a/compiler/src/iree/compiler/Dialect/Encoding/IR/EncodingAttrs.cpp +++ b/compiler/src/iree/compiler/Dialect/Encoding/IR/EncodingAttrs.cpp @@ -10,6 +10,7 @@ #include "llvm/ADT/SmallVector.h" #include "llvm/ADT/TypeSwitch.h" #include "mlir/Dialect/Affine/Utils.h" +#include "mlir/Dialect/Arith/Utils/Utils.h" #include "mlir/Dialect/Linalg/Utils/Utils.h" #include "mlir/Dialect/Utils/StructuredOpsUtils.h" #include "mlir/IR/Attributes.h" @@ -41,7 +42,7 @@ EncodingAttr EncodingAttr::get(MLIRContext *ctx, int64_t operandIndex, bcastMapAttr, roundDimsToAttr, layoutsAttr); } -AffineMap EncodingAttr::getMapForOperandIndex() { +AffineMap EncodingAttr::getMapForOperandIndex() const { auto index = getOperandIndex().getValue().getZExtValue(); switch (index) { case MATMUL_LHS: @@ -59,7 +60,8 @@ AffineMap EncodingAttr::getMapForOperandIndex() { } } -std::optional EncodingAttr::mapDimToOperandIndex(int64_t dimPos) { +std::optional +EncodingAttr::mapDimToOperandIndex(int64_t dimPos) const { return getMapForOperandIndex().getResultPosition( 
getAffineDimExpr(dimPos, getContext())); } @@ -91,7 +93,7 @@ MatmulNarrowDim getMatmulNarrowDim(linalg::LinalgOp linalgOp, return (narrowM && (!narrowN || mSize <= nSize)) ? narrowM : narrowN; } -ArrayRef EncodingAttr::getRoundDimsToArray() { +ArrayRef EncodingAttr::getRoundDimsToArray() const { auto roundDimsTo = getRoundDimsTo(); if (!roundDimsTo) { return {}; @@ -111,6 +113,105 @@ EncodingAttr EncodingAttr::clone(AffineMap bcastMap) { AffineMapAttr::get(bcastMap), getRoundDimsTo(), getLayouts()); } +/// Returns the bit-width of the scalar type. If the type is complex, it returns +/// the type of individual elements * 2 (1 for real and 1 for complex). +static unsigned getTypeBitWidth(Type type) { + if (auto complexType = dyn_cast(type)) { + return 2 * complexType.getElementType().getIntOrFloatBitWidth(); + } + return type.getIntOrFloatBitWidth(); +} + +/// Returns the number of bytes an element of the given type occupies in memory. +/// This is in the default dense conversion to machine words where sizes must be +/// powers of two aligned to bytes. +/// +/// Examples: +/// getRoundedElementByteWidth(i1) = 1 +/// getRoundedElementByteWidth(i23) = 4 +/// getRoundedElementByteWidth(i32) = 4 +/// getRoundedElementByteWidth(bf16) = 2 +/// getRoundedElementByteWidth(i33) = 8 +/// getRoundedElementByteWidth(complex) = 8 +static int32_t getRoundedElementByteWidth(Type type) { + unsigned bitsUnaligned = getTypeBitWidth(type); + assert(bitsUnaligned > 0 && "0-width types unsupported"); + // Round up to 8-bit aligned bytes. + unsigned byteAligned = (bitsUnaligned + 8 - 1) / 8; + // Round up to the next power of two (unless already a power of two). 
+ return llvm::PowerOf2Ceil(byteAligned); +} + +Value EncodingAttr::calculateStorageSizeInBytes(Location loc, + OpBuilder &builder, + RankedTensorType type, + ValueRange dynamicDims) const { + SmallVector paddedShape(type.getShape()); + SmallVector paddedDynamicDims(dynamicDims.begin(), dynamicDims.end()); + ArrayRef roundDimsTo = getRoundDimsToArray(); + FailureOr cDims = + getEncodingContractionDims(*this); + auto pad = [&](int dim, int value) { + std::optional maybeMappedDim = mapDimToOperandIndex(dim); + if (!maybeMappedDim) { + return; + } + unsigned mappedDim = maybeMappedDim.value(); + if (type.isDynamicDim(mappedDim)) { + mappedDim = type.getDynamicDimIndex(mappedDim); + auto alignment = builder.create(loc, value); + paddedDynamicDims[mappedDim] = builder.create( + loc, paddedDynamicDims[mappedDim], alignment); + paddedDynamicDims[mappedDim] = builder.create( + loc, paddedDynamicDims[mappedDim], alignment); + } else { + paddedShape[mappedDim] = llvm::alignTo(paddedShape[mappedDim], value); + } + }; + for (auto m : cDims->m) { + pad(m, roundDimsTo[0]); + } + for (auto n : cDims->n) { + pad(n, roundDimsTo[1]); + } + for (auto k : cDims->k) { + pad(k, roundDimsTo[2]); + } + + constexpr int64_t kNumBitsInByte = 8; + unsigned elementBits = getTypeBitWidth(type.getElementType()); + int64_t numBytesPerElem = 1; + if (elementBits > kNumBitsInByte) { + numBytesPerElem *= getRoundedElementByteWidth(type.getElementType()); + } + + int64_t staticCount = numBytesPerElem; + for (unsigned i = 0, e = type.getRank(); i < e; ++i) { + if (!type.isDynamicDim(i)) { + staticCount *= paddedShape[i]; + } + } + + Value result = + builder.create(loc, staticCount).getResult(); + for (auto dim : paddedDynamicDims) { + result = builder.create(loc, result, dim); + } + + // Always pack the elements back-to-back for subtypes. 
+ if (elementBits < kNumBitsInByte) { + if (kNumBitsInByte % elementBits) { + assert(false && "unsupported subtype"); + return Value(); + } + Value divisor = builder.create( + loc, kNumBitsInByte / elementBits); + result = builder.create(loc, result, divisor); + } + + return result; +} + MatmulNarrowDim getMatmulNarrowDim(EncodingAttr encoding) { if (encoding.getOpType().getValue() != EncodingOpType::matmul) { return {}; diff --git a/compiler/src/iree/compiler/Dialect/Encoding/IR/EncodingAttrs.td b/compiler/src/iree/compiler/Dialect/Encoding/IR/EncodingAttrs.td index 9086c10b20b3..54829b68e2cf 100644 --- a/compiler/src/iree/compiler/Dialect/Encoding/IR/EncodingAttrs.td +++ b/compiler/src/iree/compiler/Dialect/Encoding/IR/EncodingAttrs.td @@ -8,6 +8,7 @@ #define IREE_DIALECT_ENCODING_ATTRS include "iree/compiler/Dialect/Encoding/IR/EncodingBase.td" +include "iree/compiler/Dialect/Encoding/IR/EncodingInterfaces.td" include "mlir/IR/AttrTypeBase.td" include "mlir/IR/EnumAttr.td" @@ -41,10 +42,14 @@ def EncodingOpTypeAttr: IREEEncoding_EnumAttr; def EncodingAttr : - IREEEncoding_Attr<"Encoding"> { + IREEEncoding_Attr<"Encoding", [ + DeclareAttrInterfaceMethods + ]> { let mnemonic = "encoding"; let summary = [{information to decide how to data-tile a tensor}]; - let description = [{ + let description = [{ This attribute describes the change in the layout for a given tensor to execute subsequent operations on the tiled layout. The encoding serves as a way to @@ -93,15 +98,15 @@ def EncodingAttr : /// operand_index. The dimensions of the returned map are those of the /// data-tiled op's iteration space, and the results of the map are in /// the domain of the encoded tensor type. - AffineMap getMapForOperandIndex(); + AffineMap getMapForOperandIndex() const; /// Given the dim position of the encoding `user_indexing_maps`, returns the /// matching index of the given encoding's tensor, using getMapForOperandIndex /// bcast_map and user_indexing_map. 
- std::optional mapDimToOperandIndex(int64_t dimPos); + std::optional mapDimToOperandIndex(int64_t dimPos) const; /// Returns an integer array with values in `round_dims_to`. - ArrayRef getRoundDimsToArray(); + ArrayRef getRoundDimsToArray() const; /// Returns a vector with values in `element_types`. SmallVector getElementTypesArray(); diff --git a/compiler/src/iree/compiler/Dialect/Encoding/IR/EncodingInterfaces.td b/compiler/src/iree/compiler/Dialect/Encoding/IR/EncodingInterfaces.td index 12ca8fe382d5..4dfa4b5c8531 100644 --- a/compiler/src/iree/compiler/Dialect/Encoding/IR/EncodingInterfaces.td +++ b/compiler/src/iree/compiler/Dialect/Encoding/IR/EncodingInterfaces.td @@ -30,9 +30,10 @@ def IREEEncoding_EncodingLayoutAttrInterface : Returns the storage size (in bytes) for the tensor types with an optional encoding. }], - /*retTy=*/"::mlir::OpFoldResult", + /*retTy=*/"::mlir::Value", /*methodName=*/"calculateStorageSizeInBytes", /*args=*/(ins + "::mlir::Location":$loc, "::mlir::OpBuilder &":$builder, "RankedTensorType":$type, "ValueRange":$dynamicDims diff --git a/compiler/src/iree/compiler/Dialect/Stream/Transforms/test/encode_host_tensors.mlir b/compiler/src/iree/compiler/Dialect/Stream/Transforms/test/encode_host_tensors.mlir index 377c2a31b4ff..42764fe48cd0 100644 --- a/compiler/src/iree/compiler/Dialect/Stream/Transforms/test/encode_host_tensors.mlir +++ b/compiler/src/iree/compiler/Dialect/Stream/Transforms/test/encode_host_tensors.mlir @@ -42,6 +42,24 @@ util.func public @sizeof_lhs_encoding_dynamic(%arg0: index, %arg1: index) -> ind // ----- +#map = affine_map<(d0, d1, d2) -> (d0, d2)> +#map1 = affine_map<(d0, d1, d2) -> (d2, d1)> +#map2 = affine_map<(d0, d1, d2) -> (d0, d1)> +#encoding = #iree_encoding.encoding> +util.func public @sizeof_lhs_encoding_partially_dynamic(%arg0: index) -> index { + %0 = stream.tensor.sizeof tensor<10x?xf32, #encoding>{%arg0} : index + util.return %0 : index +} +// CHECK-LABEL: @sizeof_lhs_encoding_partially_dynamic +// 
CHECK-DAG: %[[C48:.+]] = arith.constant 48 : index +// CHECK-DAG: %[[C16:.+]] = arith.constant 16 : index +// CHECK: %[[CEIL_DIV_D1:.+]] = arith.ceildivui %arg0, %[[C16]] +// CHECK: %[[PAD_D1:.+]] = arith.muli %[[CEIL_DIV_D1]], %[[C16]] +// CHECK: %[[T0:.+]] = arith.muli %[[PAD_D1]], %[[C48]] +// CHECK: return %[[T0]] + +// ----- + #map = affine_map<(d0, d1, d2) -> (d0, d2)> #map1 = affine_map<(d0, d1, d2) -> (d2, d1)> #map2 = affine_map<(d0, d1, d2) -> (d0, d1)> diff --git a/compiler/src/iree/compiler/Utils/ElementPackingUtils.cpp b/compiler/src/iree/compiler/Utils/ElementPackingUtils.cpp index b00969ef4988..2f92ffa62913 100644 --- a/compiler/src/iree/compiler/Utils/ElementPackingUtils.cpp +++ b/compiler/src/iree/compiler/Utils/ElementPackingUtils.cpp @@ -7,7 +7,9 @@ #include "iree/compiler/Utils/ElementPackingUtils.h" #include "iree/compiler/Dialect/Encoding/IR/EncodingOps.h" +#include "iree/compiler/Dialect/Encoding/IR/EncodingTypes.h" #include "iree/compiler/Dialect/Util/IR/UtilTypes.h" +#include "llvm/Support/Casting.h" #include "llvm/Support/CommandLine.h" #include "llvm/Support/MathExtras.h" #include "mlir/Dialect/Arith/IR/Arith.h" @@ -62,6 +64,14 @@ Value calculateStorageElementCountInBytes(Location loc, RankedTensorType shapedType, ValueRange dynamicDims, OpBuilder &builder) { + Attribute encoding = shapedType.getEncoding(); + if (auto encodingLayoutAttr = + dyn_cast_or_null( + encoding)) { + return encodingLayoutAttr.calculateStorageSizeInBytes( + loc, builder, shapedType, dynamicDims); + } + Type alignedElementType = legalizeStorageElementType(shapedType.getElementType()); unsigned elementBits = IREE::Util::getTypeBitWidth(alignedElementType); @@ -72,52 +82,14 @@ Value calculateStorageElementCountInBytes(Location loc, staticCount *= IREE::Util::getRoundedElementByteWidth(alignedElementType); } - // TODO: Do we use makeComposedFoldedAffineApply here, so the index - // computation an be much simpler. 
- SmallVector paddedShape(shapedType.getShape()); - SmallVector paddedDynamicDims(dynamicDims.begin(), dynamicDims.end()); - auto encoding = IREE::Encoding::getEncodingAttr(shapedType); - if (encoding && !encoding.getRoundDimsToArray().empty()) { - auto roundDimsTo = encoding.getRoundDimsToArray(); - FailureOr cDims = - IREE::Encoding::getEncodingContractionDims(encoding); - auto pad = [&](int dim, int value) { - std::optional maybeMappedDim = - encoding.mapDimToOperandIndex(dim); - if (!maybeMappedDim) { - return; - } - unsigned mappedDim = maybeMappedDim.value(); - if (shapedType.isDynamicDim(mappedDim)) { - auto alignment = builder.create(loc, value); - paddedDynamicDims[mappedDim] = builder.create( - loc, paddedDynamicDims[mappedDim], alignment); - paddedDynamicDims[mappedDim] = builder.create( - loc, paddedDynamicDims[mappedDim], alignment); - } else { - paddedShape[mappedDim] = llvm::alignTo(paddedShape[mappedDim], value); - } - }; - for (auto m : cDims->m) { - pad(m, roundDimsTo[0]); - } - for (auto n : cDims->n) { - pad(n, roundDimsTo[1]); - } - for (auto k : cDims->k) { - pad(k, roundDimsTo[2]); - } - } - for (unsigned i = 0; i < shapedType.getRank(); ++i) { if (!shapedType.isDynamicDim(i)) - staticCount *= paddedShape[i]; + staticCount *= shapedType.getDimSize(i); } - // Scale by dynamic dims, if present. auto value = builder.create(loc, staticCount).getResult(); - for (auto dim : paddedDynamicDims) { + for (auto dim : dynamicDims) { value = builder.createOrFold(loc, value, dim); } // Sub-byte packing requires putting multiple elements in the same byte. @@ -127,7 +99,7 @@ Value calculateStorageElementCountInBytes(Location loc, // TODO(antiagainst): We may want to emit runtime check to make sure this is // divisible. 
auto divisor = builder.create(loc, byteElements); - if (!clEnableI1Support && paddedDynamicDims.empty() && + if (!clEnableI1Support && dynamicDims.empty() && (staticCount * elementBits) % 8 != 0) { return nullptr; } From ad938ae6e36d458904cc24b93648234ebe62ace0 Mon Sep 17 00:00:00 2001 From: Han-Chung Wang Date: Thu, 12 Dec 2024 20:04:00 -0800 Subject: [PATCH 09/64] [DT][NFC] Localize CPU specific encoding materialization logic. (#19452) The revision moves the CPU materialization logic from Dialect/Codegen/Utils/Utils.[h|cpp] to CPUEncodingExternalModels. They were public methods during transition states. After all the CPU layout attributes are implemented, we no longer need to expose them to the public. Additionally, it removes the outdated logic from MaterializeContractionOp pattern. And it removes the `transposeNarrowN` input argument from lowerContractionOpWithEncoding method because all the CPU backends enable the transposeNarrowN feature. Signed-off-by: hanhanW --- .../MaterializeEncodingIntoPackUnPack.cpp | 26 +- .../Codegen/Dialect/Codegen/Utils/Utils.cpp | 270 ---------------- .../Codegen/Dialect/Codegen/Utils/Utils.h | 23 -- .../CPUEncodingExternalModels.cpp | 293 +++++++++++++++++- 4 files changed, 294 insertions(+), 318 deletions(-) diff --git a/compiler/src/iree/compiler/Codegen/Common/MaterializeEncodingIntoPackUnPack.cpp b/compiler/src/iree/compiler/Codegen/Common/MaterializeEncodingIntoPackUnPack.cpp index 84b854026545..ad2ce7c48c05 100644 --- a/compiler/src/iree/compiler/Codegen/Common/MaterializeEncodingIntoPackUnPack.cpp +++ b/compiler/src/iree/compiler/Codegen/Common/MaterializeEncodingIntoPackUnPack.cpp @@ -10,6 +10,7 @@ #include "iree/compiler/Codegen/Common/EncodingUtils.h" #include "iree/compiler/Codegen/Common/Passes.h" +#include "iree/compiler/Codegen/Dialect/Codegen/IR/IREECodegenInterfaces.h" #include "iree/compiler/Codegen/Dialect/Codegen/Utils/Utils.h" #include "iree/compiler/Codegen/Utils/Utils.h" #include 
"iree/compiler/Dialect/Encoding/IR/EncodingOps.h" @@ -740,25 +741,14 @@ class MaterializeContractionOp auto converter = static_cast( this->getTypeConverter()); - if (auto layoutAttr = converter->getLayoutAttr()) { - SmallVector convertedResTypes; - for (auto init : op.getDpsInits()) { - convertedResTypes.push_back(converter->convertType(init.getType())); - } - Operation *newOp = - layoutAttr.lowerOp(rewriter, op, convertedResTypes, operands); - rewriter.replaceOp(op, newOp->getResults()); - return success(); - } - - FailureOr convertedOp = - IREE::Codegen::lowerContractionOpWithEncoding( - rewriter, op, operands, converter->getTransposeNarrowN(), - converter->getLayoutAttr()); - if (failed(convertedOp)) { - return failure(); + IREE::Codegen::LayoutAttrInterface layoutAttr = converter->getLayoutAttr(); + SmallVector convertedResTypes; + for (auto init : op.getDpsInits()) { + convertedResTypes.push_back(converter->convertType(init.getType())); } - rewriter.replaceOp(op.getOperation(), convertedOp.value()->getResult(0)); + Operation *newOp = + layoutAttr.lowerOp(rewriter, op, convertedResTypes, operands); + rewriter.replaceOp(op, newOp->getResults()); return success(); } diff --git a/compiler/src/iree/compiler/Codegen/Dialect/Codegen/Utils/Utils.cpp b/compiler/src/iree/compiler/Codegen/Dialect/Codegen/Utils/Utils.cpp index 32dbc46563b2..c515766e396f 100644 --- a/compiler/src/iree/compiler/Codegen/Dialect/Codegen/Utils/Utils.cpp +++ b/compiler/src/iree/compiler/Codegen/Dialect/Codegen/Utils/Utils.cpp @@ -305,274 +305,4 @@ getEncodingInfoForMatmul(Encoding::EncodingAttr encoding, TileMxNxK tileMxNxK) { return encodingInfo; } -static RankedTensorType dropEncoding(RankedTensorType type) { - return RankedTensorType::get(type.getShape(), type.getElementType()); -} - -static Operation *dropEncodingAndCloneOp(OpBuilder &builder, Operation *op, - ValueRange convertedInputOperands, - ValueRange convertedOutputOperands) { - SmallVector operands; - 
operands.append(convertedInputOperands.begin(), convertedInputOperands.end()); - operands.append(convertedOutputOperands.begin(), - convertedOutputOperands.end()); - return mlir::clone(builder, op, - {dropEncoding(cast( - convertedOutputOperands[0].getType()))}, - operands); -} - -static RankedTensorType -getExpandedType(RankedTensorType type, bool isBatched, bool isTransposed, - SmallVectorImpl &ri) { - if (!isBatched) { - ri.assign({{0, 1}, {2, 3}}); - if (!isTransposed) { - return RankedTensorType::get( - {1, type.getDimSize(0), 1, type.getDimSize(1)}, - type.getElementType()); - } - return RankedTensorType::get({type.getDimSize(0), 1, type.getDimSize(1), 1}, - type.getElementType()); - } - - ri.assign({{0}, {1, 2}, {3, 4}}); - if (!isTransposed) { - return RankedTensorType::get( - {type.getDimSize(0), 1, type.getDimSize(1), 1, type.getDimSize(2)}, - type.getElementType()); - } - return RankedTensorType::get( - {type.getDimSize(0), type.getDimSize(1), 1, type.getDimSize(2), 1}, - type.getElementType()); -} - -/// Given an input Value and a desired output element type, create and return -/// an element-wise linalg::GenericOp that extends the input Value to the -/// output element type. 
-static Value createElementWiseExtUIOp(OpBuilder &builder, Value input, - Location loc, Type outElemType) { - auto inputType = cast(input.getType()); - SmallVector maps( - 2, builder.getMultiDimIdentityMap(inputType.getRank())); - SmallVector iteratorTypes(inputType.getRank(), - utils::IteratorType::parallel); - auto castedType = inputType.clone(outElemType); - SmallVector inputMixedSizes = - tensor::getMixedSizes(builder, loc, input); - Value init = - builder.create(loc, inputMixedSizes, outElemType); - return builder - .create( - loc, castedType, input, init, maps, iteratorTypes, - [&](OpBuilder &b, Location nestedLoc, ValueRange args) { - Value castRes = - b.create(nestedLoc, outElemType, args[0]) - ->getResult(0); - b.create(nestedLoc, castRes); - }) - .getResult(0); -} - -/// If needed, expand and the input Value, and return the resulting input with -/// the canonical mmt4d input shape. If the input element type is unsigned, -/// create a producer Linalg::GenericOp on the input that unsigned extends the -/// input to the output element type. This extension is required to keep the -/// unsignedness information on the input for ukernels. If `transpose` is true, -/// the `linalgOp`'s indexing maps are transposed. 
-static Value getMmt4dOperand(Value value, linalg::LinalgOp linalgOp, - bool transpose, OpBuilder &builder, - SmallVectorImpl &ri, - ArrayRef elemTypes, int operandIdx) { - assert(linalgOp.getNumDpsInputs() == 2); - assert(linalgOp.getNumDpsInits() == 1); - auto cDims = linalg::inferContractionDims(linalgOp); - Location loc = linalgOp->getLoc(); - Value expandedValue = value; - // If vecmat with non-rhs operandIdx or matvec with non-lhs operandIdx, the - // operand is a vector and must be extended - if ((cDims->m.empty() && operandIdx != 1) || - (cDims->n.empty() && operandIdx != 0)) { - auto type = cast(value.getType()); - RankedTensorType newType = getExpandedType( - type, /*isBatched=*/!cDims->batch.empty(), - /*isTransposed=*/operandIdx == 2 && (transpose ^ cDims->n.empty()), ri); - expandedValue = - builder.create(loc, newType, value, ri); - } - if (elemTypes[operandIdx].isUnsignedInteger()) { - return createElementWiseExtUIOp(builder, expandedValue, loc, - elemTypes.back()); - } - return expandedValue; -} - -TileMxNxK chooseMatmulTile(ArrayRef enumeratedTiles, - IREE::Encoding::MatmulNarrowDim narrowDim, - ArrayRef hostDefinedUpperBound) { - assert((hostDefinedUpperBound.empty() || hostDefinedUpperBound.size() >= 3) && - "expected hostDefinedUpperBound is empty or has upper bound for {M, " - "N, K}"); - // Handle narrow-N by transposing to reduce to narrow-M. Note: the - // enumeratedTiles currently only enumerate narrow-M cases. 
- if (narrowDim.isN()) { - SmallVector newHostDefinedUpperBound(hostDefinedUpperBound); - std::swap(newHostDefinedUpperBound[0], newHostDefinedUpperBound[1]); - narrowDim.dim = IREE::Encoding::MatmulNarrowDim::Dim::M; - TileMxNxK tile = - chooseMatmulTile(enumeratedTiles, narrowDim, newHostDefinedUpperBound); - std::swap(tile.M, tile.N); - return tile; - } - // Handle kDynamic: currently this is only used with VMVX, where there is only - // one enumerated tile and it has all three M/N/K dimensions dynamic, so for - // now we only support that. Generalize that as needed when more dynamic tile - // sizes are used outside of VMVX, e.g. perhaps some day with Arm SVE. Decide - // how to incorporate the handling of kDynamic in the cost-model evaluation - // below to decide when to prefer a dynamic vs a static tile shape. - for (auto tile : enumeratedTiles) { - if (ShapedType::isDynamic(tile.M) || ShapedType::isDynamic(tile.N) || - ShapedType::isDynamic(tile.K)) { - assert(enumeratedTiles.size() == 1); - assert(ShapedType::isDynamic(tile.M) && ShapedType::isDynamic(tile.N) && - ShapedType::isDynamic(tile.K)); - return tile; - } - } - // We're going to "rate" the enumerated tiles. - struct RatedTileMxNxK : TileMxNxK { - RatedTileMxNxK() {} - RatedTileMxNxK(TileMxNxK tile) : TileMxNxK(tile) {} - // Penalize tiles that are wider in the M dimension than matmulNarrowM. - int64_t paddingPenalty = 0; - // Favor larger tiles, as long as they still minimize paddingPenalty. 
- int64_t productMxNxK = 0; - }; - SmallVector ratedTiles; - ratedTiles.reserve(enumeratedTiles.size()); - int64_t bestPaddingPenalty = INT64_MAX; - int64_t mUB = INT64_MAX; - int64_t nUB = INT64_MAX; - int64_t kUB = INT64_MAX; - if (!hostDefinedUpperBound.empty()) { - mUB = hostDefinedUpperBound[0]; - nUB = hostDefinedUpperBound[1]; - kUB = hostDefinedUpperBound[2]; - } - for (auto tile : enumeratedTiles) { - if (tile.M > mUB || tile.N > nUB || tile.K > kUB) { - LLVM_DEBUG(llvm::dbgs() << "[" << DEBUG_TYPE << "]: tile ("; - llvm::interleaveComma( - ArrayRef{tile.M, tile.N, tile.K}, llvm::dbgs()); - llvm::dbgs() - << ") is skipped because it is not valid for upper_bound ("; - llvm::interleaveComma(ArrayRef{mUB, nUB, kUB}, - llvm::dbgs()); - llvm::dbgs() << ")\n"); - continue; - } - RatedTileMxNxK ratedTile(tile); - ratedTile.paddingPenalty = 0; - // If we are choosing a tile for a narrow-M case, we want to minimize - // padding along the M dimension. - // The PowerOf2Ceil is so that we are OK with padding up to the next - // power of two, we just try to avoid padding beyond that. For example, - // if matmulNarrowM==7 and we have enumerated tiles with M=8,4,2,1, we - // are OK with the tile that has M==8 even though it requires some padding. - // Otherwise, we would be penalizing the tiles with M==8,4,2 and we would - // end up selecting the vecmat tile (M==1) for that case! - if (narrowDim) { - ratedTile.paddingPenalty = - std::max(tile.M - llvm::PowerOf2Ceil(narrowDim.size), 0); - } - ratedTile.productMxNxK = tile.M * tile.N * tile.K; - ratedTiles.push_back(ratedTile); - LLVM_DEBUG(llvm::dbgs() << "candidate: "; llvm::interleaveComma( - ArrayRef{tile.M, tile.N, tile.K}, llvm::dbgs()); - llvm::dbgs() << " penalty:" << ratedTile.paddingPenalty << "\n"); - bestPaddingPenalty = std::min(bestPaddingPenalty, ratedTile.paddingPenalty); - } - RatedTileMxNxK bestRatedTile; - for (auto ratedTile : ratedTiles) { - // Choose only among tiles that minimize paddingPenalty. 
Among those, - // maximize productMxNxK. - if (ratedTile.paddingPenalty == bestPaddingPenalty && - bestRatedTile.productMxNxK < ratedTile.productMxNxK) { - bestRatedTile = ratedTile; - } - } - // Sanity check. This assert can only fail if there's a programming mistake - // locally here. - assert(bestRatedTile.paddingPenalty == bestPaddingPenalty); - return bestRatedTile; -} - -FailureOr -lowerContractionOpWithEncoding(OpBuilder &builder, linalg::LinalgOp linalgOp, - ValueRange operands, bool transposeNarrowN, - LayoutAttrInterface layoutAttr) { - if (!linalgOp.hasPureTensorSemantics()) { - return failure(); - } - - auto inputs = linalgOp.getDpsInputOperands(); - auto outputs = linalgOp.getDpsInits(); - - auto lhsType = cast(inputs[0]->get().getType()); - auto rhsType = cast(inputs[1]->get().getType()); - auto resultType = cast(outputs[0].getType()); - auto lhsEncoding = IREE::Encoding::getEncodingAttr(lhsType); - auto rhsEncoding = IREE::Encoding::getEncodingAttr(rhsType); - auto resultEncoding = IREE::Encoding::getEncodingAttr(resultType); - if (!lhsEncoding || !rhsEncoding || !resultEncoding) { - return failure(); - } - - if (lhsEncoding.getOperandIndex().getValue() != IREE::Encoding::MATMUL_LHS || - rhsEncoding.getOperandIndex().getValue() != IREE::Encoding::MATMUL_RHS || - resultEncoding.getOperandIndex().getValue() != - IREE::Encoding::MATMUL_RESULT) { - return failure(); - } - - MaterializeEncodingInfo encodingInfo = layoutAttr.getEncodingInfo( - cast(linalgOp->getResultTypes()[0])); - - if (isIdentityLayout(encodingInfo)) { - return dropEncodingAndCloneOp(builder, linalgOp, - operands.take_front(inputs.size()), - operands.drop_front(inputs.size())); - } - - bool transpose = transposeNarrowN && isNarrowNResult(resultEncoding); - SmallVector elemTypes = lhsEncoding.getElementTypesArray(); - SmallVector ri; - Value newLhs = getMmt4dOperand(operands[0], linalgOp, transpose, builder, ri, - elemTypes, /*operandIdx=*/0); - Value newRhs = 
getMmt4dOperand(operands[1], linalgOp, transpose, builder, ri, - elemTypes, /*operandIdx=*/1); - Value newResult = getMmt4dOperand(operands[2], linalgOp, transpose, builder, - ri, elemTypes, /*operandIdx=*/2); - if (transpose) { - std::swap(newLhs, newRhs); - } - Type newResultType = newResult.getType(); - auto cDims = IREE::Encoding::getEncodingContractionDims(lhsEncoding); - Operation *result; - if (cDims->batch.empty()) { - result = builder.create(linalgOp.getLoc(), newResultType, - ValueRange{newLhs, newRhs}, - ValueRange{newResult}); - } else { - result = builder.create( - linalgOp.getLoc(), newResultType, ValueRange{newLhs, newRhs}, - ValueRange{newResult}); - } - if (!ri.empty()) { - result = builder.create( - linalgOp->getLoc(), operands[2].getType(), result->getResult(0), ri); - } - return result; -} - } // namespace mlir::iree_compiler::IREE::Codegen diff --git a/compiler/src/iree/compiler/Codegen/Dialect/Codegen/Utils/Utils.h b/compiler/src/iree/compiler/Codegen/Dialect/Codegen/Utils/Utils.h index 1bee3ec74032..8a8c309d4ba5 100644 --- a/compiler/src/iree/compiler/Codegen/Dialect/Codegen/Utils/Utils.h +++ b/compiler/src/iree/compiler/Codegen/Dialect/Codegen/Utils/Utils.h @@ -75,29 +75,6 @@ struct TileMxNxK { MaterializeEncodingInfo getEncodingInfoForMatmul(Encoding::EncodingAttr encoding, TileMxNxK tileMxNxK); -//===----------------------------------------------------------------------===// -// Operation Lowering Utilities. -//===----------------------------------------------------------------------===// - -// TODO(hanchung): The below methods are exposed to public because they are -// shared between MaterializeEncodingIntoPackUnPack.cpp.cpp and -// CPUEncodingExternalModels.cpp. They will be moved to other places after all -// the CPU backends implement their layout attributes. - -/// Returns the best TileMxNxK from `enumeratedTiles` pool. If the -/// `hostDefinedUpperBound` is not empty, the chosen tile sizes can not be -/// greater than the values. 
-/// TODO(#16933): Remove `hostDefinedUpperBound` once we can propagate such -/// information to host. For now, they are defined by host. -TileMxNxK chooseMatmulTile(ArrayRef enumeratedTiles, - IREE::Encoding::MatmulNarrowDim narrowDim, - ArrayRef hostDefinedUpperBound = {}); - -FailureOr -lowerContractionOpWithEncoding(OpBuilder &builder, linalg::LinalgOp linalgOp, - ValueRange operands, bool transposeNarrowN, - LayoutAttrInterface layoutAttr); - } // namespace mlir::iree_compiler::IREE::Codegen #endif // IREE_COMPILER_CODEGEN_DIALECT_CODEGEN_UTILS_H_ diff --git a/compiler/src/iree/compiler/Codegen/ExternalInterfaces/CPUEncodingExternalModels.cpp b/compiler/src/iree/compiler/Codegen/ExternalInterfaces/CPUEncodingExternalModels.cpp index a847abd5a1b2..89de2e6dcc16 100644 --- a/compiler/src/iree/compiler/Codegen/ExternalInterfaces/CPUEncodingExternalModels.cpp +++ b/compiler/src/iree/compiler/Codegen/ExternalInterfaces/CPUEncodingExternalModels.cpp @@ -24,13 +24,292 @@ using Codegen::TileMxNxK; namespace { +//===----------------------------------------------------------------------===// +// Utilities. 
+//===----------------------------------------------------------------------===// + +static RankedTensorType dropEncoding(RankedTensorType type) { + return RankedTensorType::get(type.getShape(), type.getElementType()); +} + +static Operation *dropEncodingAndCloneOp(OpBuilder &builder, Operation *op, + ValueRange convertedInputOperands, + ValueRange convertedOutputOperands) { + SmallVector operands; + operands.append(convertedInputOperands.begin(), convertedInputOperands.end()); + operands.append(convertedOutputOperands.begin(), + convertedOutputOperands.end()); + return mlir::clone(builder, op, + {dropEncoding(cast( + convertedOutputOperands[0].getType()))}, + operands); +} + +static RankedTensorType +getExpandedType(RankedTensorType type, bool isBatched, bool isTransposed, + SmallVectorImpl &ri) { + if (!isBatched) { + ri.assign({{0, 1}, {2, 3}}); + if (!isTransposed) { + return RankedTensorType::get( + {1, type.getDimSize(0), 1, type.getDimSize(1)}, + type.getElementType()); + } + return RankedTensorType::get({type.getDimSize(0), 1, type.getDimSize(1), 1}, + type.getElementType()); + } + + ri.assign({{0}, {1, 2}, {3, 4}}); + if (!isTransposed) { + return RankedTensorType::get( + {type.getDimSize(0), 1, type.getDimSize(1), 1, type.getDimSize(2)}, + type.getElementType()); + } + return RankedTensorType::get( + {type.getDimSize(0), type.getDimSize(1), 1, type.getDimSize(2), 1}, + type.getElementType()); +} + +/// Given an input Value and a desired output element type, create and return +/// an element-wise linalg::GenericOp that extends the input Value to the +/// output element type. 
+static Value createElementWiseExtUIOp(OpBuilder &builder, Value input, + Location loc, Type outElemType) { + auto inputType = cast(input.getType()); + SmallVector maps( + 2, builder.getMultiDimIdentityMap(inputType.getRank())); + SmallVector iteratorTypes(inputType.getRank(), + utils::IteratorType::parallel); + auto castedType = inputType.clone(outElemType); + SmallVector inputMixedSizes = + tensor::getMixedSizes(builder, loc, input); + Value init = + builder.create(loc, inputMixedSizes, outElemType); + return builder + .create( + loc, castedType, input, init, maps, iteratorTypes, + [&](OpBuilder &b, Location nestedLoc, ValueRange args) { + Value castRes = + b.create(nestedLoc, outElemType, args[0]) + ->getResult(0); + b.create(nestedLoc, castRes); + }) + .getResult(0); +} + +/// If needed, expand and the input Value, and return the resulting input with +/// the canonical mmt4d input shape. If the input element type is unsigned, +/// create a producer Linalg::GenericOp on the input that unsigned extends the +/// input to the output element type. This extension is required to keep the +/// unsignedness information on the input for ukernels. If `transpose` is true, +/// the `linalgOp`'s indexing maps are transposed. 
+static Value getMmt4dOperand(Value value, linalg::LinalgOp linalgOp, + bool transpose, OpBuilder &builder, + SmallVectorImpl &ri, + ArrayRef elemTypes, int operandIdx) { + assert(linalgOp.getNumDpsInputs() == 2); + assert(linalgOp.getNumDpsInits() == 1); + auto cDims = linalg::inferContractionDims(linalgOp); + Location loc = linalgOp->getLoc(); + Value expandedValue = value; + // If vecmat with non-rhs operandIdx or matvec with non-lhs operandIdx, the + // operand is a vector and must be extended + if ((cDims->m.empty() && operandIdx != 1) || + (cDims->n.empty() && operandIdx != 0)) { + auto type = cast(value.getType()); + RankedTensorType newType = getExpandedType( + type, /*isBatched=*/!cDims->batch.empty(), + /*isTransposed=*/operandIdx == 2 && (transpose ^ cDims->n.empty()), ri); + expandedValue = + builder.create(loc, newType, value, ri); + } + if (elemTypes[operandIdx].isUnsignedInteger()) { + return createElementWiseExtUIOp(builder, expandedValue, loc, + elemTypes.back()); + } + return expandedValue; +} + +/// Returns the best TileMxNxK from `enumeratedTiles` pool. If the +/// `hostDefinedUpperBound` is not empty, the chosen tile sizes can not be +/// greater than the values. +/// TODO(#16933): Remove `hostDefinedUpperBound` once we can propagate such +/// information to host. For now, they are defined by host. +TileMxNxK chooseMatmulTile(ArrayRef enumeratedTiles, + IREE::Encoding::MatmulNarrowDim narrowDim, + ArrayRef hostDefinedUpperBound = {}) { + assert((hostDefinedUpperBound.empty() || hostDefinedUpperBound.size() >= 3) && + "expected hostDefinedUpperBound is empty or has upper bound for {M, " + "N, K}"); + // Handle narrow-N by transposing to reduce to narrow-M. Note: the + // enumeratedTiles currently only enumerate narrow-M cases. 
+ if (narrowDim.isN()) { + SmallVector newHostDefinedUpperBound(hostDefinedUpperBound); + std::swap(newHostDefinedUpperBound[0], newHostDefinedUpperBound[1]); + narrowDim.dim = IREE::Encoding::MatmulNarrowDim::Dim::M; + TileMxNxK tile = + chooseMatmulTile(enumeratedTiles, narrowDim, newHostDefinedUpperBound); + std::swap(tile.M, tile.N); + return tile; + } + // Handle kDynamic: currently this is only used with VMVX, where there is only + // one enumerated tile and it has all three M/N/K dimensions dynamic, so for + // now we only support that. Generalize that as needed when more dynamic tile + // sizes are used outside of VMVX, e.g. perhaps some day with Arm SVE. Decide + // how to incorporate the handling of kDynamic in the cost-model evaluation + // below to decide when to prefer a dynamic vs a static tile shape. + for (auto tile : enumeratedTiles) { + if (ShapedType::isDynamic(tile.M) || ShapedType::isDynamic(tile.N) || + ShapedType::isDynamic(tile.K)) { + assert(enumeratedTiles.size() == 1); + assert(ShapedType::isDynamic(tile.M) && ShapedType::isDynamic(tile.N) && + ShapedType::isDynamic(tile.K)); + return tile; + } + } + // We're going to "rate" the enumerated tiles. + struct RatedTileMxNxK : TileMxNxK { + RatedTileMxNxK() {} + RatedTileMxNxK(TileMxNxK tile) : TileMxNxK(tile) {} + // Penalize tiles that are wider in the M dimension than matmulNarrowM. + int64_t paddingPenalty = 0; + // Favor larger tiles, as long as they still minimize paddingPenalty. 
+ int64_t productMxNxK = 0; + }; + SmallVector ratedTiles; + ratedTiles.reserve(enumeratedTiles.size()); + int64_t bestPaddingPenalty = INT64_MAX; + int64_t mUB = INT64_MAX; + int64_t nUB = INT64_MAX; + int64_t kUB = INT64_MAX; + if (!hostDefinedUpperBound.empty()) { + mUB = hostDefinedUpperBound[0]; + nUB = hostDefinedUpperBound[1]; + kUB = hostDefinedUpperBound[2]; + } + for (auto tile : enumeratedTiles) { + if (tile.M > mUB || tile.N > nUB || tile.K > kUB) { + LLVM_DEBUG(llvm::dbgs() << "[" << DEBUG_TYPE << "]: tile ("; + llvm::interleaveComma( + ArrayRef{tile.M, tile.N, tile.K}, llvm::dbgs()); + llvm::dbgs() + << ") is skipped because it is not valid for upper_bound ("; + llvm::interleaveComma(ArrayRef{mUB, nUB, kUB}, + llvm::dbgs()); + llvm::dbgs() << ")\n"); + continue; + } + RatedTileMxNxK ratedTile(tile); + ratedTile.paddingPenalty = 0; + // If we are choosing a tile for a narrow-M case, we want to minimize + // padding along the M dimension. + // The PowerOf2Ceil is so that we are OK with padding up to the next + // power of two, we just try to avoid padding beyond that. For example, + // if matmulNarrowM==7 and we have enumerated tiles with M=8,4,2,1, we + // are OK with the tile that has M==8 even though it requires some padding. + // Otherwise, we would be penalizing the tiles with M==8,4,2 and we would + // end up selecting the vecmat tile (M==1) for that case! + if (narrowDim) { + ratedTile.paddingPenalty = + std::max(tile.M - llvm::PowerOf2Ceil(narrowDim.size), 0); + } + ratedTile.productMxNxK = tile.M * tile.N * tile.K; + ratedTiles.push_back(ratedTile); + LLVM_DEBUG(llvm::dbgs() << "candidate: "; llvm::interleaveComma( + ArrayRef{tile.M, tile.N, tile.K}, llvm::dbgs()); + llvm::dbgs() << " penalty:" << ratedTile.paddingPenalty << "\n"); + bestPaddingPenalty = std::min(bestPaddingPenalty, ratedTile.paddingPenalty); + } + RatedTileMxNxK bestRatedTile; + for (auto ratedTile : ratedTiles) { + // Choose only among tiles that minimize paddingPenalty. 
Among those, + // maximize productMxNxK. + if (ratedTile.paddingPenalty == bestPaddingPenalty && + bestRatedTile.productMxNxK < ratedTile.productMxNxK) { + bestRatedTile = ratedTile; + } + } + // Sanity check. This assert can only fail if there's a programming mistake + // locally here. + assert(bestRatedTile.paddingPenalty == bestPaddingPenalty); + return bestRatedTile; +} + +FailureOr +lowerContractionOpWithEncoding(OpBuilder &builder, linalg::LinalgOp linalgOp, + ValueRange operands, + IREE::Codegen::LayoutAttrInterface layoutAttr) { + if (!linalgOp.hasPureTensorSemantics()) { + return failure(); + } + + auto inputs = linalgOp.getDpsInputOperands(); + auto outputs = linalgOp.getDpsInits(); + + auto lhsType = cast(inputs[0]->get().getType()); + auto rhsType = cast(inputs[1]->get().getType()); + auto resultType = cast(outputs[0].getType()); + auto lhsEncoding = IREE::Encoding::getEncodingAttr(lhsType); + auto rhsEncoding = IREE::Encoding::getEncodingAttr(rhsType); + auto resultEncoding = IREE::Encoding::getEncodingAttr(resultType); + if (!lhsEncoding || !rhsEncoding || !resultEncoding) { + return failure(); + } + + if (lhsEncoding.getOperandIndex().getValue() != IREE::Encoding::MATMUL_LHS || + rhsEncoding.getOperandIndex().getValue() != IREE::Encoding::MATMUL_RHS || + resultEncoding.getOperandIndex().getValue() != + IREE::Encoding::MATMUL_RESULT) { + return failure(); + } + + MaterializeEncodingInfo encodingInfo = layoutAttr.getEncodingInfo( + cast(linalgOp->getResultTypes()[0])); + + if (isIdentityLayout(encodingInfo)) { + return dropEncodingAndCloneOp(builder, linalgOp, + operands.take_front(inputs.size()), + operands.drop_front(inputs.size())); + } + + bool transpose = isNarrowNResult(resultEncoding); + SmallVector elemTypes = lhsEncoding.getElementTypesArray(); + SmallVector ri; + Value newLhs = getMmt4dOperand(operands[0], linalgOp, transpose, builder, ri, + elemTypes, /*operandIdx=*/0); + Value newRhs = getMmt4dOperand(operands[1], linalgOp, transpose, 
builder, ri, + elemTypes, /*operandIdx=*/1); + Value newResult = getMmt4dOperand(operands[2], linalgOp, transpose, builder, + ri, elemTypes, /*operandIdx=*/2); + if (transpose) { + std::swap(newLhs, newRhs); + } + Type newResultType = newResult.getType(); + auto cDims = IREE::Encoding::getEncodingContractionDims(lhsEncoding); + Operation *result; + if (cDims->batch.empty()) { + result = builder.create(linalgOp.getLoc(), newResultType, + ValueRange{newLhs, newRhs}, + ValueRange{newResult}); + } else { + result = builder.create( + linalgOp.getLoc(), newResultType, ValueRange{newLhs, newRhs}, + ValueRange{newResult}); + } + if (!ri.empty()) { + result = builder.create( + linalgOp->getLoc(), operands[2].getType(), result->getResult(0), ri); + } + return result; +} + //===----------------------------------------------------------------------===// // Interface methods implementaion for iree_cpu.cpu_encoding_layout. //===----------------------------------------------------------------------===// // Enumerate tile sizes to choose from on riscv32. // For narrow-{M,N} cases, this only enumerates on narrow M. The narrow-N cases -// are handled by transposition in IREE::CPU::chooseMatmulTile. +// are handled by transposition in chooseMatmulTile. static SmallVector enumerateMatmulTileRiscv32(DictionaryAttr config) { if (hasUkernel(config)) { @@ -47,7 +326,7 @@ enumerateMatmulTileRiscv32(DictionaryAttr config) { // Enumerate tile sizes to choose from on arm64. // For narrow-{M,N} cases, this only enumerates on narrow M. The narrow-N cases -// are handled by transposition in IREE::CPU::chooseMatmulTile. +// are handled by transposition in chooseMatmulTile. static SmallVector enumerateMatmulTileArm64(TypeRange elementTypes, DictionaryAttr config) { // Data-tiling for SVE is not implemented yet. @@ -137,7 +416,7 @@ static SmallVector enumerateMatmulTileArm64(TypeRange elementTypes, // Enumerate tile sizes to choose from on x86-64. 
// For narrow-{M,N} cases, this only enumerates on narrow M. The narrow-N cases -// are handled by transposition in IREE::CPU::chooseMatmulTile. +// are handled by transposition in chooseMatmulTile. static SmallVector enumerateMatmulTileX86_64(TypeRange elementTypes, DictionaryAttr config) { assert(elementTypes.size() == 3); @@ -309,8 +588,8 @@ struct CPUDeviceEncodingLayoutAttrInterface return nullptr; } - FailureOr newOp = Codegen::lowerContractionOpWithEncoding( - b, linalgOp, convertedOperands, /*transposeNarrowN=*/true, + FailureOr newOp = lowerContractionOpWithEncoding( + b, linalgOp, convertedOperands, cast(layoutAttr)); return newOp.value_or(nullptr); } @@ -393,8 +672,8 @@ struct VMVXDeviceEncodingLayoutAttrInterface return nullptr; } - FailureOr newOp = Codegen::lowerContractionOpWithEncoding( - b, linalgOp, convertedOperands, /*transposeNarrowN=*/true, + FailureOr newOp = lowerContractionOpWithEncoding( + b, linalgOp, convertedOperands, cast(layoutAttr)); return newOp.value_or(nullptr); } From 0cafee98706cf5683bb6fb5cd5ea1a7816c3dad7 Mon Sep 17 00:00:00 2001 From: Vinayak Dev <104419489+vinayakdsci@users.noreply.github.com> Date: Fri, 13 Dec 2024 12:51:26 +0530 Subject: [PATCH 10/64] [vm] Add support for SI64 to F32 casts (#19455) Adds support to the VM for casting from `si64` type to `f32` type. Enables the lowering of `arith.sitofp %arg0 : i64 to f32` after demotion. 
--- .../VM/Conversion/ArithToVM/Patterns.cpp | 100 ++++++++++++++---- .../ArithToVM/test/conversion_ops.mlir | 12 +++ .../Conversion/VMToEmitC/ConvertVMToEmitC.cpp | 1 + .../compiler/Dialect/VM/IR/VMOpFolders.cpp | 10 ++ .../compiler/Dialect/VM/IR/VMOpcodesF32.td | 3 + .../src/iree/compiler/Dialect/VM/IR/VMOps.td | 7 ++ runtime/src/iree/vm/bytecode/disassembler.c | 10 ++ runtime/src/iree/vm/bytecode/dispatch.c | 5 + .../vm/bytecode/utils/generated/op_table.h | 12 +-- runtime/src/iree/vm/bytecode/verifier.c | 4 + runtime/src/iree/vm/ops.h | 1 + .../onnx_ops/onnx_ops_gpu_vulkan.json | 3 - 12 files changed, 139 insertions(+), 29 deletions(-) diff --git a/compiler/src/iree/compiler/Dialect/VM/Conversion/ArithToVM/Patterns.cpp b/compiler/src/iree/compiler/Dialect/VM/Conversion/ArithToVM/Patterns.cpp index 8e5f96f65f81..6d902e549611 100644 --- a/compiler/src/iree/compiler/Dialect/VM/Conversion/ArithToVM/Patterns.cpp +++ b/compiler/src/iree/compiler/Dialect/VM/Conversion/ArithToVM/Patterns.cpp @@ -526,30 +526,93 @@ struct TruncateIOpConversion : public OpConversionPattern { } }; -template -struct IntToFPOpConversion : public OpConversionPattern { - using OpConversionPattern::OpConversionPattern; +struct SIToFPOpConversion : public OpConversionPattern { + using OpConversionPattern::OpConversionPattern; LogicalResult - matchAndRewrite(OpTy srcOp, typename OpTy::Adaptor adaptor, + matchAndRewrite(arith::SIToFPOp srcOp, OpAdaptor adaptor, ConversionPatternRewriter &rewriter) const override { - auto srcType = srcOp.getIn().getType(); + auto input = srcOp.getIn(); + auto srcType = input.getType(); auto dstType = srcOp.getResult().getType(); - if (!dstType.isF32() || - !(srcType.isSignedInteger() || srcType.isSignlessInteger())) { + auto resultType = getTypeConverter()->convertType(dstType); + + if (!(dstType.isF32() || dstType.isF64())) { return rewriter.notifyMatchFailure(srcOp, "unsupported type"); } - Value input = srcOp.getIn(); - if (!(srcType.isSignlessInteger(32) || 
srcType.isSignedInteger(32))) { - if (srcType.getIntOrFloatBitWidth() < 32) { - input = rewriter.create( - srcOp.getLoc(), IntegerType::get(this->getContext(), 32), input); + + if (srcType.isSignedInteger(32) || srcType.isSignlessInteger(32)) { + if (dstType.isF32()) { + rewriter.replaceOpWithNewOp(srcOp, resultType, + input); + return success(); + } + if (dstType.isF64()) { + return rewriter.notifyMatchFailure(srcOp, "unsupported type"); + } + } + if (srcType.isSignedInteger(64) || srcType.isSignlessInteger(64)) { + if (dstType.isF32()) { + rewriter.replaceOpWithNewOp(srcOp, resultType, + input); } else { + rewriter.replaceOpWithNewOp(srcOp, resultType, + input); + } + return success(); + } + + if (srcType.getIntOrFloatBitWidth() < 32) { + input = rewriter.create( + srcOp.getLoc(), IntegerType::get(this->getContext(), 32), input); + } + + rewriter.replaceOpWithNewOp(srcOp, resultType, + input); + return success(); + } +}; + +struct UIToFPOpConversion : public OpConversionPattern { + using OpConversionPattern::OpConversionPattern; + LogicalResult + matchAndRewrite(arith::UIToFPOp srcOp, OpAdaptor adaptor, + ConversionPatternRewriter &rewriter) const override { + auto input = srcOp.getIn(); + auto srcType = input.getType(); + auto dstType = srcOp.getResult().getType(); + + if (!(dstType.isF32() || dstType.isF64())) { + return rewriter.notifyMatchFailure(srcOp, "unsupported type"); + } + + auto resultType = getTypeConverter()->convertType(dstType); + if (srcType.isUnsignedInteger(32) || srcType.isSignlessInteger(32)) { + if (dstType.isF32()) { + rewriter.replaceOpWithNewOp(srcOp, resultType, + input); + return success(); + } + if (dstType.isF64()) { + return rewriter.notifyMatchFailure(srcOp, "unsupported type"); + } + } + if (srcType.isUnsignedInteger(64) || srcType.isSignlessInteger(64)) { + if (dstType.isF32()) { return rewriter.notifyMatchFailure(srcOp, "unsupported type"); } + + rewriter.replaceOpWithNewOp(srcOp, resultType, + input); + return success(); + } + + 
if (srcType.getIntOrFloatBitWidth() < 32) { + input = rewriter.create( + srcOp.getLoc(), IntegerType::get(this->getContext(), 32), input); } - auto resultType = this->getTypeConverter()->convertType(dstType); - rewriter.replaceOpWithNewOp(srcOp, resultType, input); + rewriter.replaceOpWithNewOp(srcOp, resultType, + input); return success(); } }; @@ -742,12 +805,9 @@ void populateArithToVMPatterns(MLIRContext *context, IREE::VM::MaxF64Op>>(typeConverter, context); // Floating-point conversion ops. - patterns.insert, - IntToFPOpConversion, - FPToSIOpConversion, FPToUIOpConversion, BitcastOpConversion>( - typeConverter, context); + patterns.insert(typeConverter, + context); // Shift ops. patterns diff --git a/compiler/src/iree/compiler/Dialect/VM/Conversion/ArithToVM/test/conversion_ops.mlir b/compiler/src/iree/compiler/Dialect/VM/Conversion/ArithToVM/test/conversion_ops.mlir index be4ec1f83b87..5b8da0ba9f03 100644 --- a/compiler/src/iree/compiler/Dialect/VM/Conversion/ArithToVM/test/conversion_ops.mlir +++ b/compiler/src/iree/compiler/Dialect/VM/Conversion/ArithToVM/test/conversion_ops.mlir @@ -275,6 +275,18 @@ module @sitofp_i8_f32 { // ----- +// CHECK-LABEL: @sitofp_i64_f32 +module @sitofp_i64_f32 { + // CHECK: vm.func private @fn(%[[ARG0:.+]]: i64) + func.func @fn(%arg0: i64) -> f32 { + // CHECK: vm.cast.si64.f32 %[[ARG0]] : i64 -> f32 + %0 = arith.sitofp %arg0 : i64 to f32 + return %0 : f32 + } +} + +// ----- + // CHECK-LABEL: @uitofp_i8_f32 module @uitofp_i8_f32 { // CHECK: vm.func private @fn(%[[ARG0:.+]]: i32) diff --git a/compiler/src/iree/compiler/Dialect/VM/Conversion/VMToEmitC/ConvertVMToEmitC.cpp b/compiler/src/iree/compiler/Dialect/VM/Conversion/VMToEmitC/ConvertVMToEmitC.cpp index 71eaea620cb2..845c3e3ace4b 100644 --- a/compiler/src/iree/compiler/Dialect/VM/Conversion/VMToEmitC/ConvertVMToEmitC.cpp +++ b/compiler/src/iree/compiler/Dialect/VM/Conversion/VMToEmitC/ConvertVMToEmitC.cpp @@ -4521,6 +4521,7 @@ void populateVMToEmitCPatterns(ConversionTarget 
&conversionTarget, ADD_GENERIC_PATTERN(IREE::VM::CastF32UI32Op, "vm_cast_f32ui32"); ADD_GENERIC_PATTERN(IREE::VM::CastF32UI64Op, "vm_cast_f32ui64"); ADD_GENERIC_PATTERN(IREE::VM::CastSI32F32Op, "vm_cast_si32f32"); + ADD_GENERIC_PATTERN(IREE::VM::CastSI64F32Op, "vm_cast_si64f32"); ADD_GENERIC_PATTERN(IREE::VM::CastUI32F32Op, "vm_cast_ui32f32"); ADD_GENERIC_PATTERN(IREE::VM::CeilF32Op, "vm_ceil_f32"); ADD_GENERIC_PATTERN(IREE::VM::CmpEQF32OOp, "vm_cmp_eq_f32o"); diff --git a/compiler/src/iree/compiler/Dialect/VM/IR/VMOpFolders.cpp b/compiler/src/iree/compiler/Dialect/VM/IR/VMOpFolders.cpp index bec93d4c43b5..f227292e3e4a 100644 --- a/compiler/src/iree/compiler/Dialect/VM/IR/VMOpFolders.cpp +++ b/compiler/src/iree/compiler/Dialect/VM/IR/VMOpFolders.cpp @@ -1683,6 +1683,16 @@ OpFoldResult CastSI32F32Op::fold(FoldAdaptor operands) { }); } +OpFoldResult CastSI64F32Op::fold(FoldAdaptor operands) { + return constFoldCastOp( + Float32Type::get(getContext()), operands.getOperand(), + [&](const APInt &a) { + APFloat b = APFloat(0.0f); + b.convertFromAPInt(a, /*IsSigned=*/true, APFloat::rmNearestTiesToAway); + return b; + }); +} + OpFoldResult CastUI32F32Op::fold(FoldAdaptor operands) { return constFoldCastOp( Float32Type::get(getContext()), operands.getOperand(), diff --git a/compiler/src/iree/compiler/Dialect/VM/IR/VMOpcodesF32.td b/compiler/src/iree/compiler/Dialect/VM/IR/VMOpcodesF32.td index af9295f165f4..1ce37ae032f0 100644 --- a/compiler/src/iree/compiler/Dialect/VM/IR/VMOpcodesF32.td +++ b/compiler/src/iree/compiler/Dialect/VM/IR/VMOpcodesF32.td @@ -45,6 +45,7 @@ def VM_OPC_MinF32 : VM_OPC<0x37, "MinF32">; def VM_OPC_MaxF32 : VM_OPC<0x38, "MaxF32">; def VM_OPC_CastSI32F32 : VM_OPC<0x14, "CastSI32F32">; +def VM_OPC_CastSI64F32 : VM_OPC<0x3C, "CastSI64F32">; def VM_OPC_CastUI32F32 : VM_OPC<0x15, "CastUI32F32">; def VM_OPC_CastF32SI32 : VM_OPC<0x16, "CastF32SI32">; def VM_OPC_CastF32SI64 : VM_OPC<0x3A, "CastF32SI64">; @@ -116,10 +117,12 @@ def VM_ExtF32OpcodeAttr : 
VM_OPC_CeilF32, VM_OPC_FloorF32, VM_OPC_RoundF32, + VM_OPC_RoundF32Even, VM_OPC_MinF32, VM_OPC_MaxF32, VM_OPC_CastSI32F32, + VM_OPC_CastSI64F32, VM_OPC_CastUI32F32, VM_OPC_CastF32SI32, VM_OPC_CastF32SI64, diff --git a/compiler/src/iree/compiler/Dialect/VM/IR/VMOps.td b/compiler/src/iree/compiler/Dialect/VM/IR/VMOps.td index c23e687a8c6d..f7e59449211b 100644 --- a/compiler/src/iree/compiler/Dialect/VM/IR/VMOps.td +++ b/compiler/src/iree/compiler/Dialect/VM/IR/VMOps.td @@ -3167,6 +3167,13 @@ def VM_CastSI64F64Op : let hasFolder = 1; } +def VM_CastSI64F32Op : + VM_ConversionOp { + let summary = [{cast from a signed integer to a float-point value}]; + let hasFolder = 1; +} + def VM_CastUI64F64Op : VM_ConversionOp { diff --git a/runtime/src/iree/vm/bytecode/disassembler.c b/runtime/src/iree/vm/bytecode/disassembler.c index 02d93e9070e6..4b24843a4a07 100644 --- a/runtime/src/iree/vm/bytecode/disassembler.c +++ b/runtime/src/iree/vm/bytecode/disassembler.c @@ -2111,6 +2111,16 @@ iree_status_t iree_vm_bytecode_disassemble_op( EMIT_OPTIONAL_VALUE_I32(regs->i32[operand_reg]); break; } + DISASM_OP(EXT_F32, CastSI64F32) { + uint16_t operand_reg = VM_ParseOperandRegI64("operand"); + uint16_t result_reg = VM_ParseResultRegF32("result"); + EMIT_F32_REG_NAME(result_reg); + IREE_RETURN_IF_ERROR( + iree_string_builder_append_cstring(b, " = vm.cast.si64.f32 ")); + EMIT_I64_REG_NAME(operand_reg); + EMIT_OPTIONAL_VALUE_I64(regs->i32[operand_reg]); + break; + } DISASM_OP(EXT_F32, CastUI32F32) { uint16_t operand_reg = VM_ParseOperandRegI32("operand"); uint16_t result_reg = VM_ParseResultRegF32("result"); diff --git a/runtime/src/iree/vm/bytecode/dispatch.c b/runtime/src/iree/vm/bytecode/dispatch.c index 40ae195b660d..ba48f3228477 100644 --- a/runtime/src/iree/vm/bytecode/dispatch.c +++ b/runtime/src/iree/vm/bytecode/dispatch.c @@ -2046,6 +2046,11 @@ static iree_status_t iree_vm_bytecode_dispatch( float* result = VM_DecResultRegF32("result"); *result = vm_cast_si32f32(operand); }); + 
DISPATCH_OP(EXT_F32, CastSI64F32, { + int64_t operand = (int64_t)VM_DecOperandRegI64("operand"); + float* result = VM_DecResultRegF32("result"); + *result = vm_cast_si64f32(operand); + }); DISPATCH_OP(EXT_F32, CastUI32F32, { int32_t operand = (int32_t)VM_DecOperandRegI32("operand"); float* result = VM_DecResultRegF32("result"); diff --git a/runtime/src/iree/vm/bytecode/utils/generated/op_table.h b/runtime/src/iree/vm/bytecode/utils/generated/op_table.h index 2a5a76c0d7ab..2c760731a023 100644 --- a/runtime/src/iree/vm/bytecode/utils/generated/op_table.h +++ b/runtime/src/iree/vm/bytecode/utils/generated/op_table.h @@ -388,10 +388,10 @@ typedef enum { OPC(0x77, AbsI32) \ OPC(0x78, AbsI64) \ OPC(0x79, Block) \ - OPC(0x7A, MinI64S) \ - OPC(0x7B, MinI64U) \ - OPC(0x7C, MaxI64S) \ - OPC(0x7D, MaxI64U) \ + OPC(0x7A, MinI32S) \ + OPC(0x7B, MinI32U) \ + OPC(0x7C, MaxI32S) \ + OPC(0x7D, MaxI32U) \ OPC(0x7E, MinI64S) \ OPC(0x7F, MinI64U) \ OPC(0x80, MaxI64S) \ @@ -584,7 +584,7 @@ typedef enum { IREE_VM_OP_EXT_F32_RoundF32Even = 0x39, IREE_VM_OP_EXT_F32_CastF32SI64 = 0x3A, IREE_VM_OP_EXT_F32_CastF32UI64 = 0x3B, - IREE_VM_OP_EXT_F32_RSV_0x3C, + IREE_VM_OP_EXT_F32_CastSI64F32 = 0x3C, IREE_VM_OP_EXT_F32_RSV_0x3D, IREE_VM_OP_EXT_F32_RSV_0x3E, IREE_VM_OP_EXT_F32_RSV_0x3F, @@ -843,7 +843,7 @@ typedef enum { OPC(0x39, RoundF32Even) \ OPC(0x3A, CastF32SI64) \ OPC(0x3B, CastF32UI64) \ - RSV(0x3C) \ + OPC(0x3C, CastSI64F32) \ RSV(0x3D) \ RSV(0x3E) \ RSV(0x3F) \ diff --git a/runtime/src/iree/vm/bytecode/verifier.c b/runtime/src/iree/vm/bytecode/verifier.c index c5b9d635f220..5c726db74e8f 100644 --- a/runtime/src/iree/vm/bytecode/verifier.c +++ b/runtime/src/iree/vm/bytecode/verifier.c @@ -1823,6 +1823,10 @@ static iree_status_t iree_vm_bytecode_function_verify_bytecode_op( VM_VerifyOperandRegI32(operand); VM_VerifyResultRegF32(result); }); + VERIFY_OP(EXT_F32, CastSI64F32, { + VM_VerifyOperandRegI64(operand); + VM_VerifyResultRegF32(result); + }); VERIFY_OP(EXT_F32, CastUI32F32, { 
VM_VerifyOperandRegI32(operand); VM_VerifyResultRegF32(result); diff --git a/runtime/src/iree/vm/ops.h b/runtime/src/iree/vm/ops.h index b9ffd70122da..68c939e62350 100644 --- a/runtime/src/iree/vm/ops.h +++ b/runtime/src/iree/vm/ops.h @@ -599,6 +599,7 @@ static inline float vm_erf_f32(float operand) { return erff(operand); } //===------------------------------------------------------------------===// static inline float vm_cast_si32f32(int32_t operand) { return (float)operand; } +static inline float vm_cast_si64f32(int64_t operand) { return (float)operand; } static inline float vm_cast_ui32f32(int32_t operand) { return (float)(uint32_t)operand; } diff --git a/tests/external/iree-test-suites/onnx_ops/onnx_ops_gpu_vulkan.json b/tests/external/iree-test-suites/onnx_ops/onnx_ops_gpu_vulkan.json index 721c34c4ad82..eb8e94e5aa36 100644 --- a/tests/external/iree-test-suites/onnx_ops/onnx_ops_gpu_vulkan.json +++ b/tests/external/iree-test-suites/onnx_ops/onnx_ops_gpu_vulkan.json @@ -373,9 +373,6 @@ "onnx/node/generated/test_slice_start_out_of_bounds", "onnx/node/generated/test_stft", "onnx/node/generated/test_stft_with_window", - "onnx/node/generated/test_tfidfvectorizer_tf_batch_onlybigrams_skip0", - "onnx/node/generated/test_tfidfvectorizer_tf_batch_onlybigrams_skip5", - "onnx/node/generated/test_tfidfvectorizer_tf_batch_uniandbigrams_skip5", "onnx/node/generated/test_tfidfvectorizer_tf_only_bigrams_skip0", "onnx/node/generated/test_tfidfvectorizer_tf_onlybigrams_levelempty", "onnx/node/generated/test_tfidfvectorizer_tf_onlybigrams_skip5", From ffa0f4248d549b9e886431124a0b037d831b9e56 Mon Sep 17 00:00:00 2001 From: Twice Date: Sat, 14 Dec 2024 00:05:44 +0800 Subject: [PATCH 11/64] [PJRT] Allow to pass extra compile options via env variables (#19418) Sometimes it's useful to pass extra IREE compiler options to the PJRT plugin through environment variables, for debugging, experiments, or performance tuning without recompilation.
This rewrites the following code, which had been commented out as a TODO: https://github.com/iree-org/iree/blob/b68c535ece28e139492606f391493f3e95242420/integrations/pjrt/src/iree_pjrt/common/iree_compiler.cc#L231-L245 ci-exactly: build_packages, test_pjrt --------- Signed-off-by: PragmaTwice Co-authored-by: Scott Todd --- integrations/pjrt/README.md | 13 +++++ .../pjrt/src/iree_pjrt/common/CMakeLists.txt | 13 +++++ .../iree_pjrt/common/command_line_utils.cc | 54 +++++++++++++++++++ .../src/iree_pjrt/common/command_line_utils.h | 26 +++++++++ .../common/command_line_utils_test.cc | 24 +++++++++ .../pjrt/src/iree_pjrt/common/compiler.h | 4 ++ .../src/iree_pjrt/common/dylib_platform.cc | 11 +++- .../src/iree_pjrt/common/iree_compiler.cc | 20 +++---- 8 files changed, 150 insertions(+), 15 deletions(-) create mode 100644 integrations/pjrt/src/iree_pjrt/common/command_line_utils.cc create mode 100644 integrations/pjrt/src/iree_pjrt/common/command_line_utils.h create mode 100644 integrations/pjrt/src/iree_pjrt/common/command_line_utils_test.cc diff --git a/integrations/pjrt/README.md b/integrations/pjrt/README.md index 9c018be70f58..f4f154538681 100644 --- a/integrations/pjrt/README.md +++ b/integrations/pjrt/README.md @@ -36,6 +36,19 @@ pip install -v --no-deps -e python_packages/iree_cpu_plugin JAX_PLATFORMS=iree_cpu python -c "import jax; a = jax.numpy.asarray([1, 2, 3, 4, 5, 6, 7, 8, 9]); print(a + a);" ``` +## Advanced settings + +To pass additional compile options to IREE during JIT compilation, you can use +the `IREE_PJRT_IREE_COMPILER_OPTIONS` environment variable. This variable can +be set to a space-delimited list of flags that would be passed to the +`iree-compile` command-line tool.
+ +For example: +```shell +export IREE_PJRT_IREE_COMPILER_OPTIONS=--iree-scheduling-dump-statistics-format=csv +JAX_PLATFORMS=iree_cpu python -c "import jax; a = jax.numpy.asarray([1, 2, 3, 4, 5, 6, 7, 8, 9]); print(a + a);" +``` + ## Incrementally developing If you did an editable install (`-e`) above, then you should be able to incrementally diff --git a/integrations/pjrt/src/iree_pjrt/common/CMakeLists.txt b/integrations/pjrt/src/iree_pjrt/common/CMakeLists.txt index c5855fc76196..712709141d3e 100644 --- a/integrations/pjrt/src/iree_pjrt/common/CMakeLists.txt +++ b/integrations/pjrt/src/iree_pjrt/common/CMakeLists.txt @@ -9,6 +9,7 @@ iree_cc_library( common HDRS "api_impl.h" + "command_line_utils.h" "dylib_entry_point.cc.inc" "iree_helpers.h" "layout_utils.h" @@ -16,6 +17,7 @@ iree_cc_library( "tensor_utils.h" SRCS "api_impl.cc" + "command_line_utils.cc" "layout_utils.cc" "platform.cc" "tensor_utils.cc" @@ -60,6 +62,17 @@ iree_cc_library( PUBLIC ) +iree_cc_test( + NAME + command_line_utils_test + SRCS + "command_line_utils_test.cc" + DEPS + ::common + iree::testing::gtest + iree::testing::gtest_main +) + iree_cc_library( NAME debugging diff --git a/integrations/pjrt/src/iree_pjrt/common/command_line_utils.cc b/integrations/pjrt/src/iree_pjrt/common/command_line_utils.cc new file mode 100644 index 000000000000..31f6af075b91 --- /dev/null +++ b/integrations/pjrt/src/iree_pjrt/common/command_line_utils.cc @@ -0,0 +1,54 @@ +// Copyright 2024 The IREE Authors +// +// Licensed under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception + +#include "command_line_utils.h" + +namespace iree { +namespace pjrt { + +// TODO: currently this function doesn't handle escape sequences, +// it just ensures that single/double quotes are interpreted correctly.
+std::optional> ParseOptionsFromCommandLine( + std::string_view options_str) { + std::vector options; + std::string current; + + enum { NORMAL, SINGLE_QUOTE, DOUBLE_QUOTE } state = NORMAL; + for (auto it = options_str.begin(); it != options_str.end(); ++it) { + if (std::isspace(*it) && state == NORMAL) { + if (!current.empty()) { + options.push_back(std::move(current)); + current.clear(); + } + } else if (*it == '"' && state != SINGLE_QUOTE) { + if (state == NORMAL) + state = DOUBLE_QUOTE; + else if (state == DOUBLE_QUOTE) + state = NORMAL; + } else if (*it == '\'' && state != DOUBLE_QUOTE) { + if (state == NORMAL) + state = SINGLE_QUOTE; + else if (state == SINGLE_QUOTE) + state = NORMAL; + } else { + current.push_back(*it); + } + } + + if (!current.empty()) { + options.push_back(std::move(current)); + } + + // if it's still in a quote, then return nullopt + if (state != NORMAL) { + return std::nullopt; + } + + return options; +} + +} // namespace pjrt +} // namespace iree diff --git a/integrations/pjrt/src/iree_pjrt/common/command_line_utils.h b/integrations/pjrt/src/iree_pjrt/common/command_line_utils.h new file mode 100644 index 000000000000..b8df54d8988a --- /dev/null +++ b/integrations/pjrt/src/iree_pjrt/common/command_line_utils.h @@ -0,0 +1,26 @@ +// Copyright 2024 The IREE Authors +// +// Licensed under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception + +#ifndef IREE_PJRT_PLUGIN_PJRT_COMMON_COMMAND_LINE_UTILS_H_ +#define IREE_PJRT_PLUGIN_PJRT_COMMON_COMMAND_LINE_UTILS_H_ + +#include +#include +#include +#include + +namespace iree { +namespace pjrt { + +// parse command line options (maybe with quotes) to an array of options +// e.g. 
`a b "c d"` -> {"a", "b", "c d"} +std::optional> ParseOptionsFromCommandLine( + std::string_view options_str); + +} // namespace pjrt +} // namespace iree + +#endif diff --git a/integrations/pjrt/src/iree_pjrt/common/command_line_utils_test.cc b/integrations/pjrt/src/iree_pjrt/common/command_line_utils_test.cc new file mode 100644 index 000000000000..ffd785216297 --- /dev/null +++ b/integrations/pjrt/src/iree_pjrt/common/command_line_utils_test.cc @@ -0,0 +1,24 @@ +// Copyright 2024 The IREE Authors +// +// Licensed under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception + +#include "iree_pjrt/common/command_line_utils.h" + +#include + +using namespace iree::pjrt; + +TEST(CommandLineUtils, ParseOptionsFromCommandLine) { + EXPECT_EQ(ParseOptionsFromCommandLine("--help --verbose"), + (std::vector{"--help", "--verbose"})); + EXPECT_EQ(ParseOptionsFromCommandLine("-a='x y' -b \"n m\""), + (std::vector{"-a=x y", "-b", "n m"})); + EXPECT_EQ(ParseOptionsFromCommandLine("'\"' \"'\""), + (std::vector{"\"", "'"})); + EXPECT_EQ(ParseOptionsFromCommandLine("ab abc d 'e f g' h "), + (std::vector{"ab", "abc", "d", "e f g", "h"})); + EXPECT_EQ(ParseOptionsFromCommandLine("a 'b"), std::nullopt); + EXPECT_EQ(ParseOptionsFromCommandLine("x\"y"), std::nullopt); +} diff --git a/integrations/pjrt/src/iree_pjrt/common/compiler.h b/integrations/pjrt/src/iree_pjrt/common/compiler.h index 9969a00db8dc..e7dcdc515621 100644 --- a/integrations/pjrt/src/iree_pjrt/common/compiler.h +++ b/integrations/pjrt/src/iree_pjrt/common/compiler.h @@ -76,12 +76,16 @@ class AbstractCompiler { // An AbstractCompiler based on IREE. 
class IREECompiler : public AbstractCompiler { public: + IREECompiler(std::vector extra_options = {}) + : extra_options_(std::move(extra_options)) {} + std::unique_ptr StartJob() override; std::string GetRevision() override; std::string GetErrorMessage() override { return error_message_; } private: std::string error_message_; + std::vector extra_options_; }; // An AbstractCompiler based on the HLO partitioner. diff --git a/integrations/pjrt/src/iree_pjrt/common/dylib_platform.cc b/integrations/pjrt/src/iree_pjrt/common/dylib_platform.cc index f42d5a5cedb6..6feb5c837ffe 100644 --- a/integrations/pjrt/src/iree_pjrt/common/dylib_platform.cc +++ b/integrations/pjrt/src/iree_pjrt/common/dylib_platform.cc @@ -14,6 +14,7 @@ #include "iree/base/internal/path.h" #include "iree/compiler/embedding_api.h" #include "iree/compiler/loader.h" +#include "iree_pjrt/common/command_line_utils.h" #include "iree_pjrt/partitioner_api/embedding_api.h" #include "iree_pjrt/partitioner_api/loader.h" @@ -98,7 +99,15 @@ iree_status_t DylibPlatform::SubclassInitialize() { message.append(*loaded_compiler); logger().debug(message); } - compiler_ = std::make_unique(); + + std::vector extra_compiler_options; + if (auto options_str = config_vars().Lookup("IREE_COMPILER_OPTIONS")) { + if (auto options = ParseOptionsFromCommandLine(*options_str)) { + extra_compiler_options = std::move(*options); + logger().debug("Extra compile options: " + *options_str); + } + } + compiler_ = std::make_unique(std::move(extra_compiler_options)); { std::string message("Compiler Version: "); message.append(compiler_->GetRevision()); diff --git a/integrations/pjrt/src/iree_pjrt/common/iree_compiler.cc b/integrations/pjrt/src/iree_pjrt/common/iree_compiler.cc index 1cddbc3b37b6..87bc3b8af0c2 100644 --- a/integrations/pjrt/src/iree_pjrt/common/iree_compiler.cc +++ b/integrations/pjrt/src/iree_pjrt/common/iree_compiler.cc @@ -228,20 +228,12 @@ std::unique_ptr IREECompiler::StartJob() { } // Propagate all options set via 
environment variable. - // TODO: Excise/translate to something that doesn't rely on LLVM. - // if (std::optional env_value = llvm::sys::Process::GetEnv( - // llvm::StringRef("IREE_COMPILER_OPTIONS"))) { - // llvm::SmallVector new_argv; - // llvm::BumpPtrAllocator a; - // llvm::StringSaver saver(a); - - // llvm::cl::TokenizeGNUCommandLine(*env_value, saver, new_argv); - // for (auto arg : new_argv) - // if (!job->SetFlag(arg)) { - // error_message_ = job->GetErrorMessage(); - // return nullptr; - // } - // } + for (auto arg : extra_options_) { + if (!job->SetFlag(arg.c_str())) { + error_message_ = job->GetErrorMessage(); + return nullptr; + } + } return job; } From 8a7b75417cbf491508166d80ef90305b02d5a21e Mon Sep 17 00:00:00 2001 From: Scott Todd Date: Fri, 13 Dec 2024 08:36:57 -0800 Subject: [PATCH 12/64] Update website copyright text per Linux Foundation guidance. (#19480) See https://lfprojects.org/. Signed-off-by: Scott Todd --- docs/website/mkdocs.yml | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/docs/website/mkdocs.yml b/docs/website/mkdocs.yml index e43683075333..6edef7ae8a3e 100644 --- a/docs/website/mkdocs.yml +++ b/docs/website/mkdocs.yml @@ -82,7 +82,10 @@ exclude_docs: | assets/images/README.md **/snippets/*.md -copyright: Copyright © 2024 The IREE Authors +copyright: | + Copyright © 2024 IREE a Series of LF Projects, LLC. + For web site terms of use, trademark policy and other project policies please + see https://lfprojects.org. markdown_extensions: - abbr From 99b600f729c32c566ed5a819b85de252d8c64bbb Mon Sep 17 00:00:00 2001 From: Max191 <44243577+Max191@users.noreply.github.com> Date: Fri, 13 Dec 2024 12:19:59 -0500 Subject: [PATCH 13/64] [Codegen] Allow padding of dynamic allocas (#19399) This PR adds support for padding allocas in the PadDynamicAllocPass. The padding works the same way for alloca as for alloc.
--------- Signed-off-by: Max Dawkins --- .../Codegen/Common/PadDynamicAlloc.cpp | 21 +++++++++++++------ .../Common/test/pad_dynamic_alloc.mlir | 12 ++++++++++- 2 files changed, 26 insertions(+), 7 deletions(-) diff --git a/compiler/src/iree/compiler/Codegen/Common/PadDynamicAlloc.cpp b/compiler/src/iree/compiler/Codegen/Common/PadDynamicAlloc.cpp index db1819b017e6..2a9d2d39bd75 100644 --- a/compiler/src/iree/compiler/Codegen/Common/PadDynamicAlloc.cpp +++ b/compiler/src/iree/compiler/Codegen/Common/PadDynamicAlloc.cpp @@ -65,7 +65,8 @@ static FailureOr getUpperBound(Value dim, return failure(); } -static LogicalResult padAlloc(MLIRContext *context, memref::AllocOp allocOp, +template +static LogicalResult padAlloc(MLIRContext *context, AllocLikeOp allocOp, const DataFlowSolver &solver) { IRRewriter rewriter(context); rewriter.setInsertionPoint(allocOp); @@ -94,7 +95,7 @@ static LogicalResult padAlloc(MLIRContext *context, memref::AllocOp allocOp, MemRefType allocType = MemRefType::get(shape, elType, AffineMap(), allocOp.getType().getMemorySpace()); Location loc = allocOp.getLoc(); - Value paddedAlloc = rewriter.create(loc, allocType); + Value paddedAlloc = rewriter.create(loc, allocType); SmallVector offsets(shape.size(), rewriter.getIndexAttr(0)); SmallVector strides(shape.size(), rewriter.getIndexAttr(1)); Value subview = rewriter.create(loc, paddedAlloc, offsets, @@ -111,7 +112,6 @@ struct PadDynamicAllocPass final void runOnOperation() override { auto funcOp = getOperation(); MLIRContext *context = &getContext(); - SmallVector sharedMemAllocs; DataFlowSolver solver; solver.load(); @@ -122,12 +122,21 @@ struct PadDynamicAllocPass final } // Collect all the alloc operations. 
- funcOp.walk( - [&](memref::AllocOp allocOp) { sharedMemAllocs.push_back(allocOp); }); - for (memref::AllocOp alloc : sharedMemAllocs) { + SmallVector allocs; + funcOp.walk([&](memref::AllocOp allocOp) { allocs.push_back(allocOp); }); + for (memref::AllocOp alloc : allocs) { if (failed(padAlloc(context, alloc, solver))) return signalPassFailure(); } + + // Collect all the alloca operations. + SmallVector allocas; + funcOp.walk( + [&](memref::AllocaOp allocaOp) { allocas.push_back(allocaOp); }); + for (memref::AllocaOp alloca : allocas) { + if (failed(padAlloc(context, alloca, solver))) + return signalPassFailure(); + } } }; } // namespace diff --git a/compiler/src/iree/compiler/Codegen/Common/test/pad_dynamic_alloc.mlir b/compiler/src/iree/compiler/Codegen/Common/test/pad_dynamic_alloc.mlir index 0b56bd2cb963..e9d4d7b82181 100644 --- a/compiler/src/iree/compiler/Codegen/Common/test/pad_dynamic_alloc.mlir +++ b/compiler/src/iree/compiler/Codegen/Common/test/pad_dynamic_alloc.mlir @@ -37,4 +37,14 @@ func.func @dynamic_bound_alloc(%id : index) { return } // CHECK-LABEL: func @dynamic_bound_alloc( -// CHECK: %alloc = memref.alloc() : memref<4088xf32, 3> +// CHECK: memref.alloc() : memref<4088xf32, 3> + +// ----- + +func.func @dynamic_bound_alloca(%id : index) { + %0 = util.assume.int %id : index + %1 = memref.alloca(%0) : memref + return +} +// CHECK-LABEL: func @dynamic_bound_alloca( +// CHECK: memref.alloca() : memref<4088xf32, 3> From 442956c85c66028600bcfb83065c0d5b77165675 Mon Sep 17 00:00:00 2001 From: Nirvedh Meshram <96096277+nirvedhmeshram@users.noreply.github.com> Date: Fri, 13 Dec 2024 13:02:18 -0600 Subject: [PATCH 14/64] Use LLVMGPUTileandFuse instead of LLVMGPUVectorize for convolutions (#19469) With this PR, convolutions that are not picked up by VectorDistribute or by TileAndFuse via IGEMM are lowered with TileAndFuse by default instead of the Vectorize pipeline.
There doesn't seem to be a major performance impact in testing with iree-kernel-benchmark, as shown [here](https://docs.google.com/spreadsheets/d/1WaJ1ELhwdo1wFvNiKbdoddSncSt2_UsbvrTdSObNaAo/edit?gid=0#gid=0), and we can always look into improving the heuristics if performance becomes a problem. Fixes https://github.com/iree-org/iree/issues/19478 --------- Signed-off-by: Nirvedh --- .../compiler/Codegen/LLVMGPU/KernelConfig.cpp | 26 ++++++++++++++----- .../LLVMGPU/test/conv_pipeline_test_cuda.mlir | 2 +- .../LLVMGPU/test/gpu_set_num_workgroups.mlir | 7 +++-- 3 files changed, 23 insertions(+), 12 deletions(-) diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/KernelConfig.cpp b/compiler/src/iree/compiler/Codegen/LLVMGPU/KernelConfig.cpp index af9eb75a0288..cb22b598a94b 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/KernelConfig.cpp +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/KernelConfig.cpp @@ -77,6 +77,12 @@ llvm::cl::opt clGPUUnalignedGEMMVectorDistribution( "unaligned GEMMs when supported"), llvm::cl::init(false)); +llvm::cl::opt clGPUUseTileAndFuseConvolution( + "iree-codegen-llvmgpu-use-tile-and-fuse-convolution", + llvm::cl::desc( + "enable the tile and fuse pipeline for supported convolutions"), + llvm::cl::init(true)); + /// Flag to force using WMMA tensorcore operations.
llvm::cl::opt clGPUUseWMMA("iree-codegen-llvmgpu-use-wmma", @@ -2196,12 +2202,19 @@ static bool distributeToSquare(const int64_t oh, const int64_t ow, // Convolution Pipeline Configuration //====---------------------------------------------------------------------===// -static LogicalResult setConvolutionConfig(IREE::GPU::TargetAttr target, - linalg::LinalgOp linalgOp, - const int64_t bestTilingFactor) { +static LogicalResult setConvolutionConfig( + IREE::GPU::TargetAttr target, mlir::FunctionOpInterface entryPointFn, + linalg::LinalgOp linalgOp, const int64_t bestTilingFactor) { if (!isa(linalgOp)) { return failure(); } + if (clGPUUseTileAndFuseConvolution) { + if (succeeded(IREE::GPU::setTileAndFuseLoweringConfig(target, entryPointFn, + linalgOp))) { + LDBG("Tile and fuse convolution config"); + return success(); + } + } const bool isNCHW = isa(*linalgOp); const bool isNHWC = isa(*linalgOp); @@ -2284,9 +2297,8 @@ static LogicalResult setConvolutionConfig(IREE::GPU::TargetAttr target, SmallVector windowTileSizes(4, 0); windowTileSizes[ohIndex] = 1; tileSizes.push_back(windowTileSizes); - auto funcOp = linalgOp->getParentOfType(); - return setOpConfigAndEntryPointFnTranslation(funcOp, linalgOp, tileSizes, - pipeline, workgroupSize); + return setOpConfigAndEntryPointFnTranslation( + entryPointFn, linalgOp, tileSizes, pipeline, workgroupSize); } //====---------------------------------------------------------------------===// @@ -2340,7 +2352,7 @@ static LogicalResult setRootConfig(IREE::GPU::TargetAttr target, LDBG("Warp Reduction Config"); return success(); } - if (succeeded(setConvolutionConfig(target, linalgOp, 16))) { + if (succeeded(setConvolutionConfig(target, entryPointFn, linalgOp, 16))) { LDBG("Convolution Config"); return success(); } diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/conv_pipeline_test_cuda.mlir b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/conv_pipeline_test_cuda.mlir index d129117741e3..af33828f9135 100644 --- 
a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/conv_pipeline_test_cuda.mlir +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/conv_pipeline_test_cuda.mlir @@ -1,4 +1,4 @@ -// RUN: iree-opt --split-input-file --iree-gpu-test-target=sm_60 \ +// RUN: iree-opt --split-input-file --iree-gpu-test-target=sm_60 --iree-codegen-llvmgpu-use-tile-and-fuse-convolution=false \ // RUN: --pass-pipeline='builtin.module(hal.executable(hal.executable.variant(builtin.module(iree-llvmgpu-select-lowering-strategy, func.func(iree-llvmgpu-lower-executable-target,canonicalize)))))' \ // RUN: %s | FileCheck %s diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/gpu_set_num_workgroups.mlir b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/gpu_set_num_workgroups.mlir index feb0e2766303..66fc62f2e482 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/gpu_set_num_workgroups.mlir +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/gpu_set_num_workgroups.mlir @@ -612,12 +612,11 @@ func.func @forward_dispatch_1_conv_2d_nhwc_hwcf_256x112x112x64x7x7x3_f32() { return } -// CHECK-DAG: #[[CONFIG:.+]] = #iree_codegen.lowering_config +// CHECK-DAG: #[[TRANSLATION:.+]] = #iree_codegen.translation_info // ----- From 63cdc7d2e3d3bf6f657bbeb6f38e234da128f7fc Mon Sep 17 00:00:00 2001 From: Krzysztof Drewniak Date: Fri, 13 Dec 2024 13:06:39 -0600 Subject: [PATCH 15/64] Reapply "[Codegen][GPU] Add range information to GPU dispatch IDs" (#19361) (#19372) This reverts commit cb5be1dbd3560f692578c137eadbb413b41e44c7. Compared to the previous revision, this one works around a correctness bug in dataflow analysis that's being fixed by removing the analysis after SCF->CF. --- First, this patch implements InferIntRangeInterface for hal.interface.workgroup.{size,id,count} using a local upper_bound attribute.
Then, it adds a -iree-codegen-gpu-propagate-dispatch-size-bounds pass that adds these upper_bounds identifiers to the interface.workgroup operations and to gpu.thread_id based on static information available late in the codegen pipeline. Then, it uses -optimize-int-arithmetic to optimize indexing after -lower-affine, getting rid of a bunch of "if the input's negative" logic that isn't actually needed in many of our kernels. It also ensures that these upper_bound values propagate to LLVM. --- .../compiler/Codegen/Common/GPU/BUILD.bazel | 1 + .../Codegen/Common/GPU/CMakeLists.txt | 1 + .../GPU/GPUPropagateDispatchSizeBounds.cpp | 103 +++++++++++++++ .../compiler/Codegen/Common/GPU/Passes.td | 5 + .../Codegen/Common/GPU/test/BUILD.bazel | 1 + .../Codegen/Common/GPU/test/CMakeLists.txt | 1 + .../gpu_propagate_dispatch_size_bounds.mlir | 122 ++++++++++++++++++ .../Codegen/LLVMGPU/ConvertToLLVM.cpp | 5 +- .../iree/compiler/Codegen/LLVMGPU/Passes.cpp | 16 ++- .../nvvm_extract_address_computation.mlir | 2 +- .../iree/compiler/Codegen/SPIRV/Passes.cpp | 2 + .../iree/compiler/Dialect/HAL/IR/BUILD.bazel | 2 + .../compiler/Dialect/HAL/IR/CMakeLists.txt | 1 + .../iree/compiler/Dialect/HAL/IR/HALOps.cpp | 36 ++++++ .../iree/compiler/Dialect/HAL/IR/HALOps.td | 74 ++++------- .../HAL/Transforms/MaterializeInterfaces.cpp | 3 +- 16 files changed, 317 insertions(+), 58 deletions(-) create mode 100644 compiler/src/iree/compiler/Codegen/Common/GPU/GPUPropagateDispatchSizeBounds.cpp create mode 100644 compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_propagate_dispatch_size_bounds.mlir diff --git a/compiler/src/iree/compiler/Codegen/Common/GPU/BUILD.bazel b/compiler/src/iree/compiler/Codegen/Common/GPU/BUILD.bazel index c9e23636142f..128ffa9fc46e 100644 --- a/compiler/src/iree/compiler/Codegen/Common/GPU/BUILD.bazel +++ b/compiler/src/iree/compiler/Codegen/Common/GPU/BUILD.bazel @@ -73,6 +73,7 @@ iree_compiler_cc_library( "GPUPatterns.cpp", "GPUPipelining.cpp", 
"GPUPromoteMatmulOperands.cpp", + "GPUPropagateDispatchSizeBounds.cpp", "GPUReduceBankConflicts.cpp", "GPUReuseSharedMemoryAllocs.cpp", "GPUTensorAlloc.cpp", diff --git a/compiler/src/iree/compiler/Codegen/Common/GPU/CMakeLists.txt b/compiler/src/iree/compiler/Codegen/Common/GPU/CMakeLists.txt index 2aeb9add5f0c..97d324042e2c 100644 --- a/compiler/src/iree/compiler/Codegen/Common/GPU/CMakeLists.txt +++ b/compiler/src/iree/compiler/Codegen/Common/GPU/CMakeLists.txt @@ -71,6 +71,7 @@ iree_cc_library( "GPUPatterns.cpp" "GPUPipelining.cpp" "GPUPromoteMatmulOperands.cpp" + "GPUPropagateDispatchSizeBounds.cpp" "GPUReduceBankConflicts.cpp" "GPUReuseSharedMemoryAllocs.cpp" "GPUTensorAlloc.cpp" diff --git a/compiler/src/iree/compiler/Codegen/Common/GPU/GPUPropagateDispatchSizeBounds.cpp b/compiler/src/iree/compiler/Codegen/Common/GPU/GPUPropagateDispatchSizeBounds.cpp new file mode 100644 index 000000000000..43aa70be6919 --- /dev/null +++ b/compiler/src/iree/compiler/Codegen/Common/GPU/GPUPropagateDispatchSizeBounds.cpp @@ -0,0 +1,103 @@ +// Copyright 2024 The IREE Authors +// +// Licensed under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. 
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception + +#include "iree/compiler/Codegen/Common/GPU/Passes.h" +#include "iree/compiler/Codegen/Dialect/Codegen/IR/IREECodegenAttrs.h" +#include "iree/compiler/Codegen/Utils/GPUUtils.h" +#include "iree/compiler/Codegen/Utils/Utils.h" +#include "iree/compiler/Dialect/HAL/IR/HALOps.h" +#include "mlir/Dialect/GPU/IR/GPUDialect.h" +#include "mlir/Interfaces/FunctionInterfaces.h" +#include "mlir/Transforms/Passes.h" + +namespace mlir::iree_compiler { + +#define GEN_PASS_DEF_GPUPROPAGATEDISPATCHSIZEBOUNDSPASS +#include "iree/compiler/Codegen/Common/GPU/Passes.h.inc" + +namespace { + +static void applyBounds(FunctionOpInterface funcOp, + ArrayRef workgroupSizes, + ArrayRef workgroupCounts) { + Builder b(funcOp->getContext()); + funcOp->walk([&](Operation *op) { + TypeSwitch(op) + .Case([&](gpu::ThreadIdOp tidOp) { + tidOp.setUpperBoundAttr(b.getIndexAttr( + workgroupSizes[static_cast(tidOp.getDimension())])); + }) + .Case([&](IREE::HAL::InterfaceWorkgroupSizeOp wgSizeOp) { + wgSizeOp.setUpperBoundAttr(b.getIndexAttr( + workgroupSizes[wgSizeOp.getDimension().getZExtValue()])); + }) + .Case([&](IREE::HAL::InterfaceWorkgroupIDOp wgIdOp) { + wgIdOp.setUpperBoundAttr(b.getIndexAttr( + workgroupCounts[wgIdOp.getDimension().getZExtValue()])); + }) + .Case([&](IREE::HAL::InterfaceWorkgroupCountOp wgCountOp) { + wgCountOp.setUpperBoundAttr(b.getIndexAttr( + workgroupCounts[wgCountOp.getDimension().getZExtValue()])); + }) + .Default([](Operation *) {}); + }); +} + +struct GPUPropagateDispatchSizeBoundsPass final + : impl::GPUPropagateDispatchSizeBoundsPassBase< + GPUPropagateDispatchSizeBoundsPass> { + using Base::Base; + + void runOnOperation() override { + FunctionOpInterface funcOp = getOperation(); + IREE::GPU::TargetAttr target = getGPUTargetAttr(funcOp); + if (!target) { + funcOp.emitWarning("no known target attribute late in GPU codegen"); + return; + } + SmallVector workgroupSizes( + 
target.getWgp().getMaxWorkgroupSizes().asArrayRef()); + SmallVector workgroupCounts( + target.getWgp().getMaxWorkgroupCounts().asArrayRef()); + + std::optional> staticWorkgroupSize = + getWorkgroupSize(funcOp); + + // Late in codegen, we've reconciled the workgroup size onto the export op. + if (std::optional exportOp = + getEntryPoint(funcOp)) { + if (std::optional exportWorkgroupSize = + exportOp->getWorkgroupSize()) { + staticWorkgroupSize = + llvm::map_to_vector(exportWorkgroupSize->getAsRange(), + [](IntegerAttr a) { return a.getInt(); }); + } + } + + if (staticWorkgroupSize) { + // Target info with no workgroup sizes gives a 0-length array, hence no + // zip_equal. + for (auto [size, staticSize] : + llvm::zip(workgroupSizes, *staticWorkgroupSize)) { + size = staticSize; + } + } + SmallVector staticWorkgroupCounts = getStaticNumWorkgroups(funcOp); + assert(staticWorkgroupCounts.size() <= 3 && + "workgroup counts are 3D at most"); + for (auto [count, staticCount] : + llvm::zip(workgroupCounts, staticWorkgroupCounts)) { + if (staticCount != ShapedType::kDynamic) { + count = staticCount; + } + } + + applyBounds(funcOp, workgroupSizes, workgroupCounts); + } +}; +} // namespace + +} // namespace mlir::iree_compiler diff --git a/compiler/src/iree/compiler/Codegen/Common/GPU/Passes.td b/compiler/src/iree/compiler/Codegen/Common/GPU/Passes.td index 789130940477..b3fdd50d4d46 100644 --- a/compiler/src/iree/compiler/Codegen/Common/GPU/Passes.td +++ b/compiler/src/iree/compiler/Codegen/Common/GPU/Passes.td @@ -178,6 +178,11 @@ def GPUPromoteMatmulOperandsPass : ]; } +def GPUPropagateDispatchSizeBoundsPass : + InterfacePass<"iree-codegen-gpu-propagate-dispatch-size-bounds", "mlir::FunctionOpInterface"> { + let summary = "Pass to annotate workitem and workgroup IDs with known bounds"; +} + def GPUReduceBankConflictsPass : InterfacePass<"iree-codegen-gpu-reduce-bank-conflicts", "mlir::FunctionOpInterface"> { let summary = "Pass to try to reduce the number of bank conflicts 
by padding memref.alloc ops."; diff --git a/compiler/src/iree/compiler/Codegen/Common/GPU/test/BUILD.bazel b/compiler/src/iree/compiler/Codegen/Common/GPU/test/BUILD.bazel index 41afbb6559f3..dc8e6a181ccf 100644 --- a/compiler/src/iree/compiler/Codegen/Common/GPU/test/BUILD.bazel +++ b/compiler/src/iree/compiler/Codegen/Common/GPU/test/BUILD.bazel @@ -41,6 +41,7 @@ iree_lit_test_suite( "gpu_pad_operands.mlir", "gpu_pipeline.mlir", "gpu_promote_matmul_operands.mlir", + "gpu_propagate_dispatch_size_bounds.mlir", "gpu_reorder_workgroups_static.mlir", "gpu_reorder_workgroups.mlir", "gpu_reuse_shared_memory_allocs.mlir", diff --git a/compiler/src/iree/compiler/Codegen/Common/GPU/test/CMakeLists.txt b/compiler/src/iree/compiler/Codegen/Common/GPU/test/CMakeLists.txt index ad86649ada78..4dc0f289d3d5 100644 --- a/compiler/src/iree/compiler/Codegen/Common/GPU/test/CMakeLists.txt +++ b/compiler/src/iree/compiler/Codegen/Common/GPU/test/CMakeLists.txt @@ -37,6 +37,7 @@ iree_lit_test_suite( "gpu_pad_operands.mlir" "gpu_pipeline.mlir" "gpu_promote_matmul_operands.mlir" + "gpu_propagate_dispatch_size_bounds.mlir" "gpu_reorder_workgroups.mlir" "gpu_reorder_workgroups_static.mlir" "gpu_reuse_shared_memory_allocs.mlir" diff --git a/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_propagate_dispatch_size_bounds.mlir b/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_propagate_dispatch_size_bounds.mlir new file mode 100644 index 000000000000..f26f2c5dfe52 --- /dev/null +++ b/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_propagate_dispatch_size_bounds.mlir @@ -0,0 +1,122 @@ +// RUN: iree-opt %s --split-input-file \ +// RUN: --pass-pipeline="builtin.module(hal.executable(hal.executable.variant(builtin.module(func.func(iree-codegen-gpu-propagate-dispatch-size-bounds)))))" \ +// RUN: | FileCheck %s + +// Note: not the real target definition, missing types +#executable_target = #hal.executable.target<"rocm", "rocm-hsaco-fb", {iree.gpu.target = #iree_gpu.target>}> 
+#pipeline_layout = #hal.pipeline.layout]> + +hal.executable private @static { + hal.executable.variant public @rocm_hsaco_fb target(#executable_target) { + hal.executable.export public @static ordinal(0) layout(#pipeline_layout) attributes {workgroup_size = [64 : index, 2 : index, 1 : index]} { + ^bb0(%arg0: !hal.device): + %c32 = arith.constant 32 : index + %c8 = arith.constant 8 : index + %c1 = arith.constant 1 : index + hal.return %c32, %c8, %c1 : index, index, index + } + builtin.module { +// CHECK-LABEL: func.func @static + func.func @static() { +// CHECK: gpu.thread_id x upper_bound 64 +// CHECK: gpu.thread_id y upper_bound 2 +// CHECK: gpu.thread_id z upper_bound 1 + %thread_id_x = gpu.thread_id x + %thread_id_y = gpu.thread_id y + %thread_id_z = gpu.thread_id z + +// CHECK: hal.interface.workgroup.size[0] upper_bound 64 +// CHECK: hal.interface.workgroup.size[1] upper_bound 2 +// CHECK: hal.interface.workgroup.size[2] upper_bound 1 + %workgroup_size_x = hal.interface.workgroup.size[0] : index + %workgroup_size_y = hal.interface.workgroup.size[1] : index + %workgroup_size_z = hal.interface.workgroup.size[2] : index + +// CHECK: hal.interface.workgroup.id[0] upper_bound 32 +// CHECK: hal.interface.workgroup.id[1] upper_bound 8 +// CHECK: hal.interface.workgroup.id[2] upper_bound 1 + %workgroup_id_x = hal.interface.workgroup.id[0] : index + %workgroup_id_y = hal.interface.workgroup.id[1] : index + %workgroup_id_z = hal.interface.workgroup.id[2] : index + +// CHECK: hal.interface.workgroup.count[0] upper_bound 32 +// CHECK: hal.interface.workgroup.count[1] upper_bound 8 +// CHECK: hal.interface.workgroup.count[2] upper_bound 1 + %workgroup_conut_x = hal.interface.workgroup.count[0] : index + %workgroup_count_y = hal.interface.workgroup.count[1] : index + %workgroup_count_z = hal.interface.workgroup.count[2] : index + + return + } + } + } +} + +// ----- + +#executable_target = #hal.executable.target<"rocm", "rocm-hsaco-fb", + {iree.gpu.target = 
#iree_gpu.target>}> +#pipeline_layout = #hal.pipeline.layout]> + +hal.executable private @dynamic { + hal.executable.variant public @rocm_hsaco_fb target(#executable_target) { + hal.executable.export public @dynamic ordinal(0) layout(#pipeline_layout) { + ^bb0(%arg0: !hal.device, %arg1: index, %arg2: index): + %count_x = affine.apply affine_map<()[s0] -> (s0 ceildiv 32)>()[%arg1] + %count_y = affine.apply affine_map<()[s0] -> (s0 ceildiv 8)>()[%arg2] + %count_z = arith.constant 1 : index + hal.return %count_x, %count_y, %count_z : index, index, index + } + builtin.module { + func.func @dynamic() { +// CHECK: gpu.thread_id x upper_bound 1024 +// CHECK: gpu.thread_id y upper_bound 1024 +// CHECK: gpu.thread_id z upper_bound 1024 + %thread_id_x = gpu.thread_id x + %thread_id_y = gpu.thread_id y + %thread_id_z = gpu.thread_id z + +// CHECK: hal.interface.workgroup.size[0] upper_bound 1024 +// CHECK: hal.interface.workgroup.size[1] upper_bound 1024 +// CHECK: hal.interface.workgroup.size[2] upper_bound 1024 + %workgroup_size_x = hal.interface.workgroup.size[0] : index + %workgroup_size_y = hal.interface.workgroup.size[1] : index + %workgroup_size_z = hal.interface.workgroup.size[2] : index + +// CHECK: hal.interface.workgroup.id[0] upper_bound 2147483647 +// CHECK: hal.interface.workgroup.id[1] upper_bound 2147483647 +// CHECK: hal.interface.workgroup.id[2] upper_bound 1 + %workgroup_id_x = hal.interface.workgroup.id[0] : index + %workgroup_id_y = hal.interface.workgroup.id[1] : index + %workgroup_id_z = hal.interface.workgroup.id[2] : index + +// CHECK: hal.interface.workgroup.count[0] upper_bound 2147483647 +// CHECK: hal.interface.workgroup.count[1] upper_bound 2147483647 +// CHECK: hal.interface.workgroup.count[2] upper_bound 1 + %workgroup_conut_x = hal.interface.workgroup.count[0] : index + %workgroup_count_y = hal.interface.workgroup.count[1] : index + %workgroup_count_z = hal.interface.workgroup.count[2] : index + + return + } + } + } +} diff --git 
a/compiler/src/iree/compiler/Codegen/LLVMGPU/ConvertToLLVM.cpp b/compiler/src/iree/compiler/Codegen/LLVMGPU/ConvertToLLVM.cpp index c056d44538bb..1441f959b0bb 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/ConvertToLLVM.cpp +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/ConvertToLLVM.cpp @@ -505,7 +505,10 @@ struct HALInterfaceWorkgroupOpsConverter final int32_t index = static_cast(op.getDimension().getSExtValue()); std::array dimAttr{gpu::Dimension::x, gpu::Dimension::y, gpu::Dimension::z}; - rewriter.replaceOpWithNewOp(op, op.getType(), dimAttr[index]); + NewOpTy newOp = + rewriter.replaceOpWithNewOp(op, op.getType(), dimAttr[index]); + if (IntegerAttr bound = op.getUpperBoundAttr()) + newOp.setUpperBoundAttr(bound); return success(); } }; diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.cpp b/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.cpp index 53e49efbf66a..f8ebe1cc0069 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.cpp +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.cpp @@ -1067,7 +1067,13 @@ addLowerAndOptimizeAddressComputationPasses(FunctionLikeNest &funcPassManager) { .addPass(createCSEPass) // Hoist the resulting decompositions. 
.addPass(createIREELoopInvariantCodeMotionPass) - .addPass(createLowerAffinePass); + .addPass(affine::createAffineExpandIndexOpsPass) + .addPass(createLowerAffinePass) + .addPass(IREE::Util::createOptimizeIntArithmeticPass) + // Do another round of LICM now that we've lowered and optimized + // arithmetic + .addPass(createCSEPass) + .addPass(createIREELoopInvariantCodeMotionPass); } static void addLowerToLLVMGPUPasses(OpPassManager &modulePassManager, @@ -1103,7 +1109,9 @@ static void addLowerToLLVMGPUPasses(OpPassManager &modulePassManager, FunctionLikeNest funcPassManager(modulePassManager); funcPassManager.addPass(createFoldTensorExtractOpPass) .addPass(createLLVMGPUVectorLoweringPass) - .addPass(createExpandGPUOpsPass); + .addPass(createExpandGPUOpsPass) + // Expose workitem and workgroup counts to range inference later. + .addPass(createGPUPropagateDispatchSizeBoundsPass); // This pass needs to run before SCF -> CF. addLowerAndOptimizeAddressComputationPasses(funcPassManager); @@ -1130,9 +1138,7 @@ static void addLowerToLLVMGPUPasses(OpPassManager &modulePassManager, .addPass(memref::createExpandStridedMetadataPass) .addPass(createEmulateNarrowTypePass) .addPass(affine::createAffineExpandIndexOpsPass) - .addPass(createLowerAffinePass) - .addPass(createCanonicalizerPass) - .addPass(createCSEPass); + .addPass(createLowerAffinePass); // Strip out the debug info for the kernel. 
modulePassManager.addPass(createStripDebugInfoPass()); diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/nvvm_extract_address_computation.mlir b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/nvvm_extract_address_computation.mlir index 6c1c5e117016..ba6b5da7f1fa 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/nvvm_extract_address_computation.mlir +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/nvvm_extract_address_computation.mlir @@ -40,7 +40,7 @@ // CHECK-DAG: %[[C8192:.*]] = llvm.mlir.constant(8192 : index) : i64 // // Match the interesting special registers. -// CHECK-DAG: %[[TID_Y:.*]] = nvvm.read.ptx.sreg.tid.y : i32 +// CHECK-DAG: %[[TID_Y:.*]] = nvvm.read.ptx.sreg.tid.y range : i32 // CHECK-DAG: %[[TID_Y_EXT:.*]] = llvm.sext %[[TID_Y]] : i32 to i64 // CHECK-DAG: %[[LANEID:.*]] = nvvm.read.ptx.sreg.laneid range : i32 // CHECK-DAG: %[[LANEID_EXT:.*]] = llvm.sext %[[LANEID]] : i32 to i64 diff --git a/compiler/src/iree/compiler/Codegen/SPIRV/Passes.cpp b/compiler/src/iree/compiler/Codegen/SPIRV/Passes.cpp index ea0aa9f45116..511dbe785300 100644 --- a/compiler/src/iree/compiler/Codegen/SPIRV/Passes.cpp +++ b/compiler/src/iree/compiler/Codegen/SPIRV/Passes.cpp @@ -227,9 +227,11 @@ static void addMemRefLoweringPasses(OpPassManager &modulePassManager) { /// Adds passes to perform the final SPIR-V conversion. static void addSPIRVLoweringPasses(OpPassManager &modulePassManager) { FunctionLikeNest(modulePassManager) + .addPass(createGPUPropagateDispatchSizeBoundsPass) .addPass(createCanonicalizerPass) .addPass(createCSEPass) .addPass(createLowerAffinePass) + .addPass(IREE::Util::createOptimizeIntArithmeticPass) // Lower ApplyScale before the i64 Emulation Pass so that new 64-bit ops // are also emulated if not supported by the target. 
diff --git a/compiler/src/iree/compiler/Dialect/HAL/IR/BUILD.bazel b/compiler/src/iree/compiler/Dialect/HAL/IR/BUILD.bazel index d9d6a92ef71c..3f80245bfc8c 100644 --- a/compiler/src/iree/compiler/Dialect/HAL/IR/BUILD.bazel +++ b/compiler/src/iree/compiler/Dialect/HAL/IR/BUILD.bazel @@ -35,6 +35,7 @@ iree_td_library( "//compiler/src/iree/compiler/Dialect/Util/IR:td_files", "@llvm-project//mlir:BuiltinDialectTdFiles", "@llvm-project//mlir:FuncTdFiles", + "@llvm-project//mlir:InferIntRangeInterfaceTdFiles", "@llvm-project//mlir:InferTypeOpInterfaceTdFiles", "@llvm-project//mlir:OpBaseTdFiles", "@llvm-project//mlir:ViewLikeInterfaceTdFiles", @@ -81,6 +82,7 @@ iree_compiler_cc_library( "@llvm-project//mlir:FuncDialect", "@llvm-project//mlir:FunctionInterfaces", "@llvm-project//mlir:IR", + "@llvm-project//mlir:InferIntRangeInterface", "@llvm-project//mlir:InferTypeOpInterface", "@llvm-project//mlir:MemRefDialect", "@llvm-project//mlir:Parser", diff --git a/compiler/src/iree/compiler/Dialect/HAL/IR/CMakeLists.txt b/compiler/src/iree/compiler/Dialect/HAL/IR/CMakeLists.txt index 837855157e90..846bcf0d38a2 100644 --- a/compiler/src/iree/compiler/Dialect/HAL/IR/CMakeLists.txt +++ b/compiler/src/iree/compiler/Dialect/HAL/IR/CMakeLists.txt @@ -45,6 +45,7 @@ iree_cc_library( MLIRFuncDialect MLIRFunctionInterfaces MLIRIR + MLIRInferIntRangeInterface MLIRInferTypeOpInterface MLIRMemRefDialect MLIRParser diff --git a/compiler/src/iree/compiler/Dialect/HAL/IR/HALOps.cpp b/compiler/src/iree/compiler/Dialect/HAL/IR/HALOps.cpp index 7210d402598d..cb5bb411810a 100644 --- a/compiler/src/iree/compiler/Dialect/HAL/IR/HALOps.cpp +++ b/compiler/src/iree/compiler/Dialect/HAL/IR/HALOps.cpp @@ -19,6 +19,7 @@ #include "mlir/IR/SymbolTable.h" #include "mlir/IR/TypeUtilities.h" #include "mlir/Interfaces/FunctionImplementation.h" +#include "mlir/Interfaces/InferIntRangeInterface.h" #include "mlir/Interfaces/InferTypeOpInterface.h" namespace mlir::iree_compiler::IREE::HAL { @@ -2084,24 +2085,59 @@ 
static void getAsmResultNamesForInterfaceWorkgroupOp( } } +// Minimum is the smallest possible result we could get. It's 0 for ID-like +// operations and 1 for count-like ones. +static void setResultRangesForInterfaceWorkgroupOp( + Value result, const std::optional &upperBound, + SetIntRangeFn setResultRanges, int64_t minimum) { + unsigned width = ConstantIntRanges::getStorageBitwidth(result.getType()); + if (!upperBound.has_value()) { + setResultRanges( + result, ConstantIntRanges::fromSigned(APInt(width, minimum), + APInt::getSignedMaxValue(width))); + return; + } + setResultRanges(result, + ConstantIntRanges::fromUnsigned(APInt(width, minimum), + *upperBound + minimum - 1)); +} + void InterfaceWorkgroupIDOp::getAsmResultNames( function_ref setNameFn) { getAsmResultNamesForInterfaceWorkgroupOp("workgroup_id_", getDimension(), getResult(), setNameFn); } +void InterfaceWorkgroupIDOp::inferResultRanges( + ArrayRef argRanges, SetIntRangeFn setResultRanges) { + setResultRangesForInterfaceWorkgroupOp(getResult(), getUpperBound(), + setResultRanges, /*minimum=*/0); +} + void InterfaceWorkgroupCountOp::getAsmResultNames( function_ref setNameFn) { getAsmResultNamesForInterfaceWorkgroupOp("workgroup_count_", getDimension(), getResult(), setNameFn); } +void InterfaceWorkgroupCountOp::inferResultRanges( + ArrayRef argRanges, SetIntRangeFn setResultRanges) { + setResultRangesForInterfaceWorkgroupOp(getResult(), getUpperBound(), + setResultRanges, /*minimum=*/1); +} + void InterfaceWorkgroupSizeOp::getAsmResultNames( function_ref setNameFn) { getAsmResultNamesForInterfaceWorkgroupOp("workgroup_size_", getDimension(), getResult(), setNameFn); } +void InterfaceWorkgroupSizeOp::inferResultRanges( + ArrayRef argRanges, SetIntRangeFn setResultRanges) { + setResultRangesForInterfaceWorkgroupOp(getResult(), getUpperBound(), + setResultRanges, /*minimum=*/1); +} + //===----------------------------------------------------------------------===// // hal.fence.* 
//===----------------------------------------------------------------------===// diff --git a/compiler/src/iree/compiler/Dialect/HAL/IR/HALOps.td b/compiler/src/iree/compiler/Dialect/HAL/IR/HALOps.td index 16f1eadfdffd..d51e430b57c7 100644 --- a/compiler/src/iree/compiler/Dialect/HAL/IR/HALOps.td +++ b/compiler/src/iree/compiler/Dialect/HAL/IR/HALOps.td @@ -3029,9 +3029,28 @@ def OpGroupInterfaceOps : OpDocGroup { let opDocGroup = OpGroupInterfaceOps in { -def HAL_InterfaceWorkgroupIDOp : HAL_PureOp<"interface.workgroup.id", [ - DeclareOpInterfaceMethods, -]> { +class HAL_InterfaceWorkgroupOp traits = []> + : HAL_PureOp, + DeclareOpInterfaceMethods])> { + let arguments = (ins + IndexAttr:$dimension, + OptionalAttr:$upper_bound); + let results = (outs HAL_Dim:$result); + + let builders = [ + OpBuilder<(ins "unsigned":$dim), + [{ + build($_builder, $_state, $_builder.getIndexType(), $_builder.getIndexAttr(dim), ::mlir::IntegerAttr{}); + }]>, + ]; + + let assemblyFormat = [{ + `[` $dimension `]` (`upper_bound` $upper_bound^)? 
attr-dict `:` type($result) + }]; +} + +def HAL_InterfaceWorkgroupIDOp : HAL_InterfaceWorkgroupOp<"interface.workgroup.id"> { let summary = [{returns the index of the current workgroup in the grid}]; let description = [{ The global workgroup ID of the current tile in the range of @@ -3046,25 +3065,9 @@ def HAL_InterfaceWorkgroupIDOp : HAL_PureOp<"interface.workgroup.id", [ %z = hal.interface.workgroup.id[2] : index ``` }]; - - let arguments = (ins IndexAttr:$dimension); - let results = (outs HAL_Dim:$result); - - let builders = [ - OpBuilder<(ins "unsigned":$dim), - [{ - build($_builder, $_state, $_builder.getIndexType(), $_builder.getIndexAttr(dim)); - }]>, - ]; - - let assemblyFormat = [{ - `[` $dimension `]` attr-dict `:` type($result) - }]; } -def HAL_InterfaceWorkgroupCountOp : HAL_PureOp<"interface.workgroup.count", [ - DeclareOpInterfaceMethods, -]> { +def HAL_InterfaceWorkgroupCountOp : HAL_InterfaceWorkgroupOp<"interface.workgroup.count"> { let summary = [{returns the total workgroup count of the grid}]; let description = [{ The total number of workgroups along each dimension in the dispatch grid. 
@@ -3081,24 +3084,9 @@ def HAL_InterfaceWorkgroupCountOp : HAL_PureOp<"interface.workgroup.count", [ ``` }]; - let arguments = (ins IndexAttr:$dimension); - let results = (outs HAL_Dim:$result); - - let builders = [ - OpBuilder<(ins "unsigned":$dim), - [{ - build($_builder, $_state, $_builder.getIndexType(), $_builder.getIndexAttr(dim)); - }]>, - ]; - - let assemblyFormat = [{ - `[` $dimension `]` attr-dict `:` type($result) - }]; } -def HAL_InterfaceWorkgroupSizeOp : HAL_PureOp<"interface.workgroup.size", [ - DeclareOpInterfaceMethods, -]> { +def HAL_InterfaceWorkgroupSizeOp : HAL_InterfaceWorkgroupOp<"interface.workgroup.size"> { let summary = [{returns the size of each workgroup in invocations}]; let description = [{ The number of local invocations within the current workgroup along each @@ -3114,20 +3102,6 @@ def HAL_InterfaceWorkgroupSizeOp : HAL_PureOp<"interface.workgroup.size", [ %z = hal.interface.workgroup.size[2] : index ``` }]; - - let arguments = (ins IndexAttr:$dimension); - let results = (outs HAL_Dim:$result); - - let builders = [ - OpBuilder<(ins "unsigned":$dim), - [{ - build($_builder, $_state, $_builder.getIndexType(), $_builder.getIndexAttr(dim)); - }]>, - ]; - - let assemblyFormat = [{ - `[` $dimension `]` attr-dict `:` type($result) - }]; } def HAL_InterfaceConstantLoadOp : HAL_PureOp<"interface.constant.load"> { diff --git a/compiler/src/iree/compiler/Dialect/HAL/Transforms/MaterializeInterfaces.cpp b/compiler/src/iree/compiler/Dialect/HAL/Transforms/MaterializeInterfaces.cpp index 9f3bee7d529a..d830c078b4bb 100644 --- a/compiler/src/iree/compiler/Dialect/HAL/Transforms/MaterializeInterfaces.cpp +++ b/compiler/src/iree/compiler/Dialect/HAL/Transforms/MaterializeInterfaces.cpp @@ -514,7 +514,8 @@ struct ConvertDispatchWorkgroupInfoPattern final LogicalResult matchAndRewrite(SrcOp op, PatternRewriter &rewriter) const override { rewriter.replaceOpWithNewOp(op, op.getResult().getType(), - op.getDimensionAttr()); + op.getDimensionAttr(), + 
/*upper_bound=*/nullptr); return success(); } }; From dc29ee7d1bcfcec5a58d42e29125bbda937bbbbc Mon Sep 17 00:00:00 2001 From: Benoit Jacob Date: Sat, 14 Dec 2024 21:13:16 -0500 Subject: [PATCH 16/64] Move GPU ukernel selection to KernelConfig (#19440) This moves the logic deciding whether an op should be a ukernel out of the GPULowerToUKernels pass, into KernelConfig. So KernelConfig decides whether the op should be a ukernel, and encodes that into the resulting `lowering_config`, in a new parameter, that is a new attribute, UKernelSpecAttr. That attribute is directly modeled after the equivalent C++ data structure that we have had in LowerToUKernels passes, `FnNameAndDefAttrs`, which it replaces. If the attribute is present, it means that the op was selected for ukernel lowering, with the fields telling the ukernel name and some function definition attributes (to import any dependencies, such as the `rocm` module for runtime support symbols). All the details about supplying the ukernel bitcode in a `hal.executable.object` are also moved there, becoming a side effect of `KernelConfig`. The GPULowerToUKernels becomes much simpler, since all the decision-making was already done for it. It just looks at the `LoweringConfigAttr` and if it's there, it performs the requested lowering. The motivation for this split is that we need to know in KernelConfig whether it's going to be a ukernel, because ops that will get lowered to a ukernel require a different configuration. The important example for us is `multi_mma`, which in the ukernel case needs to avoid reduction-dimension tiling to 1 so that the ukernel gets to see the reduction loop. 
A few simplifications arise already in the current argmax ukernel logic, confirming that this was the right design choice: the old ukernel's matching logic was checking that the distribution tile sizes matched what the ukernel could handle; now that is turned upside down: the ukernel matching happens as a helper within KernelConfig where we know we are setting the appropriate tile sizes on purpose. Another nice improvement is that this puts just enough distance between ukernel selection (which creates the `hal.executable.object`) and ukernel lowering, that we are able to insert `HoistExecutableObjectsPass` in between, simplifying the ukernel lowering as it doesn't need to worry anymore about preserving the `hal.executable.object`. --------- Signed-off-by: Benoit Jacob --- compiler/plugins/target/ROCM/test/BUILD.bazel | 3 +- .../plugins/target/ROCM/test/CMakeLists.txt | 3 +- .../test/config_ukernel_argmax_gfx908.mlir | 30 +++ ...mlir => config_ukernel_argmax_gfx942.mlir} | 177 +++++------------- .../ROCM/test/ukernel_pipeline_transform.mlir | 4 +- .../Codegen/Common/GPU/GPULowerToUKernels.cpp | 154 ++------------- .../compiler/Codegen/Common/GPU/Passes.td | 2 +- .../Codegen/Common/GPU/test/BUILD.bazel | 1 + .../Codegen/Common/GPU/test/CMakeLists.txt | 1 + .../GPU/test/gpu_lower_to_ukernels.mlir | 72 +++++++ .../Dialect/GPU/IR/GPULoweringConfigUtils.cpp | 5 + .../Dialect/GPU/IR/GPULoweringConfigUtils.h | 2 + .../Codegen/Dialect/GPU/IR/IREEGPUAttrs.td | 19 ++ .../iree/compiler/Codegen/LLVMGPU/BUILD.bazel | 1 + .../compiler/Codegen/LLVMGPU/CMakeLists.txt | 1 + .../compiler/Codegen/LLVMGPU/KernelConfig.cpp | 41 ++-- .../iree/compiler/Codegen/LLVMGPU/Passes.cpp | 5 + .../Codegen/LLVMGPU/Utils/BUILD.bazel | 4 + .../Codegen/LLVMGPU/Utils/CMakeLists.txt | 4 + .../LLVMGPU/Utils/LLVMGPUSelectUKernels.cpp | 152 +++++++++++++++ .../LLVMGPU/Utils/LLVMGPUSelectUKernels.h | 15 ++ 21 files changed, 392 insertions(+), 304 deletions(-) create mode 100644 
compiler/plugins/target/ROCM/test/config_ukernel_argmax_gfx908.mlir rename compiler/plugins/target/ROCM/test/{gpu_lower_to_ukernels.mlir => config_ukernel_argmax_gfx942.mlir} (58%) create mode 100644 compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_lower_to_ukernels.mlir create mode 100644 compiler/src/iree/compiler/Codegen/LLVMGPU/Utils/LLVMGPUSelectUKernels.cpp create mode 100644 compiler/src/iree/compiler/Codegen/LLVMGPU/Utils/LLVMGPUSelectUKernels.h diff --git a/compiler/plugins/target/ROCM/test/BUILD.bazel b/compiler/plugins/target/ROCM/test/BUILD.bazel index f0521a0e8c50..2a71f590c6e3 100644 --- a/compiler/plugins/target/ROCM/test/BUILD.bazel +++ b/compiler/plugins/target/ROCM/test/BUILD.bazel @@ -15,8 +15,9 @@ package( iree_lit_test_suite( name = "lit", srcs = [ + "config_ukernel_argmax_gfx908.mlir", + "config_ukernel_argmax_gfx942.mlir", "default_tuning_specs_amdgpu.mlir", - "gpu_lower_to_ukernels.mlir", "lowering_strategy_from_tuning_spec.mlir", "ukernel_pipeline_transform.mlir", ], diff --git a/compiler/plugins/target/ROCM/test/CMakeLists.txt b/compiler/plugins/target/ROCM/test/CMakeLists.txt index 36d9ba6db31d..bab88582a8b0 100644 --- a/compiler/plugins/target/ROCM/test/CMakeLists.txt +++ b/compiler/plugins/target/ROCM/test/CMakeLists.txt @@ -14,8 +14,9 @@ iree_lit_test_suite( NAME lit SRCS + "config_ukernel_argmax_gfx908.mlir" + "config_ukernel_argmax_gfx942.mlir" "default_tuning_specs_amdgpu.mlir" - "gpu_lower_to_ukernels.mlir" "lowering_strategy_from_tuning_spec.mlir" "ukernel_pipeline_transform.mlir" TOOLS diff --git a/compiler/plugins/target/ROCM/test/config_ukernel_argmax_gfx908.mlir b/compiler/plugins/target/ROCM/test/config_ukernel_argmax_gfx908.mlir new file mode 100644 index 000000000000..ba12bf5e10f6 --- /dev/null +++ b/compiler/plugins/target/ROCM/test/config_ukernel_argmax_gfx908.mlir @@ -0,0 +1,30 @@ +// RUN: iree-opt --split-input-file --iree-gpu-test-target=gfx908 
--pass-pipeline='builtin.module(iree-llvmgpu-select-lowering-strategy)' %s | FileCheck %s + +// gfx908 a.k.a. CDNA1 is used here as an example of a GPU target that we don't have ukernels for. +// No need to add many ukernels here, just a quick check that we correctly do not select a ukernel. + +func.func @argmax_2d_f32i64(%arg0 : tensor<1x?xf32>) -> tensor<1xi64> attributes { + hal.executable.target = #hal.executable.target<"rocm", "rocm-hsaco-fb", {ukernels = "all"}> +} { + %c0_i64 = arith.constant 0 : i64 + %cst = arith.constant 0xFF800000 : f32 + %0 = tensor.empty() : tensor<1xi64> + %1 = linalg.fill ins(%c0_i64 : i64) outs(%0 : tensor<1xi64>) -> tensor<1xi64> + %2 = tensor.empty() : tensor<1xf32> + %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<1xf32>) -> tensor<1xf32> + %4:2 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0)>, affine_map<(d0, d1) -> (d0)>], iterator_types = ["parallel", "reduction"]} ins(%arg0 : tensor<1x?xf32>) outs(%3, %1 : tensor<1xf32>, tensor<1xi64>) { + ^bb0(%in: f32, %out: f32, %out_0: i64): + %5 = linalg.index 1 : index + %6 = arith.index_cast %5 : index to i64 + %7 = arith.maximumf %in, %out : f32 + %8 = arith.cmpf ogt, %in, %out : f32 + %9 = arith.select %8, %6, %out_0 : i64 + linalg.yield %7, %9 : f32, i64 + } -> (tensor<1xf32>, tensor<1xi64>) + return %4#1 : tensor<1xi64> +} + +// CHECK-NOT: lowering_config<{{.*}}ukernel +// CHECK-LABEL: func @argmax_2d_f32i64( +// CHECK: linalg.generic +// CHECK-NOT: hal.executable.objects diff --git a/compiler/plugins/target/ROCM/test/gpu_lower_to_ukernels.mlir b/compiler/plugins/target/ROCM/test/config_ukernel_argmax_gfx942.mlir similarity index 58% rename from compiler/plugins/target/ROCM/test/gpu_lower_to_ukernels.mlir rename to compiler/plugins/target/ROCM/test/config_ukernel_argmax_gfx942.mlir index 177bd0b36f7c..4a7da4befadd 100644 --- a/compiler/plugins/target/ROCM/test/gpu_lower_to_ukernels.mlir +++ 
b/compiler/plugins/target/ROCM/test/config_ukernel_argmax_gfx942.mlir @@ -1,5 +1,4 @@ -// RUN: iree-opt --split-input-file --iree-gpu-test-target=gfx942 --pass-pipeline="builtin.module(func.func(iree-codegen-gpu-lower-to-ukernels,cse,canonicalize))" %s | FileCheck %s -// RUN: iree-opt --split-input-file --iree-gpu-test-target=gfx908 --pass-pipeline="builtin.module(func.func(iree-codegen-gpu-lower-to-ukernels,cse,canonicalize))" %s | FileCheck %s --check-prefix=CDNA1 +// RUN: iree-opt --split-input-file --iree-gpu-test-target=gfx942 --pass-pipeline='builtin.module(iree-llvmgpu-select-lowering-strategy)' %s | FileCheck %s func.func @argmax_2d_f32i64(%arg0 : tensor<1x?xf32>) -> tensor<1xi64> attributes { hal.executable.target = #hal.executable.target<"rocm", "rocm-hsaco-fb", {ukernels = "all"}> @@ -22,15 +21,11 @@ func.func @argmax_2d_f32i64(%arg0 : tensor<1x?xf32>) -> tensor<1xi64> attributes return %4#1 : tensor<1xi64> } -//CHECK-LABEL: func @argmax_2d_f32i64( -// CHECK-SAME: %[[ARG0:[a-zA-Z0-9]+]]: tensor<1x?xf32> -// CHECK-DAG: %[[C1_index:.+]] = arith.constant 1 : index -// CHECK-DAG: %[[C0_i64:.+]] = arith.constant 0 -// CHECK-DAG: %[[FILL:.+]] = linalg.fill ins(%[[C0_i64]] -// CHECK: %[[MICRO_KERNEL:.+]] = iree_codegen.ukernel.generic {hal.executable.objects = [{{.*}}]} "iree_uk_amdgpu_argmax_f32i64" -// CHECK-SAME: ins(%[[ARG0]] : -// CHECK-SAME: outs(%[[FILL]] : -// CHECK: return %[[MICRO_KERNEL]] +// CHECK-LABEL: func @argmax_2d_f32i64( +// CHECK: linalg.generic +// CHECK-SAME: hal.executable.objects = [ +// CEHCK-SAME: #hal.executable.object<{path = "iree_uk_amdgpu_argmax_f32i64.gfx942.bc", data = dense_resource : vector<{{[0-9]+}}xi8>}>] +// CHECK-SAME: #iree_gpu.lowering_config<{{.*}}ukernel = #iree_gpu.ukernel_spec // ----- @@ -55,65 +50,11 @@ func.func @argmax_4d_unit_parallel_f32i64(%arg0 : tensor<1x1x1x?xf32>) -> tensor return %4#1 : tensor<1x1x1xi64> } -// CHECK-LABEL: func @argmax_4d_unit_parallel_f32i64( -// CHECK: iree_codegen.ukernel.generic -// 
CHECK-NOT: linalg.generic - -// ----- - -func.func @argmax_2d_non_unit_parallel_f32i64(%arg0 : tensor<4x?xf32>) -> tensor<4xi64> attributes { - hal.executable.target = #hal.executable.target<"rocm", "rocm-hsaco-fb", {ukernels = "all"}> -} { - %c0_i64 = arith.constant 0 : i64 - %cst = arith.constant 0xFF800000 : f32 - %0 = tensor.empty() : tensor<4xi64> - %1 = linalg.fill ins(%c0_i64 : i64) outs(%0 : tensor<4xi64>) -> tensor<4xi64> - %2 = tensor.empty() : tensor<4xf32> - %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<4xf32>) -> tensor<4xf32> - %4:2 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0)>, affine_map<(d0, d1) -> (d0)>], iterator_types = ["parallel", "reduction"]} ins(%arg0 : tensor<4x?xf32>) outs(%3, %1 : tensor<4xf32>, tensor<4xi64>) { - ^bb0(%in: f32, %out: f32, %out_0: i64): - %5 = linalg.index 1 : index - %6 = arith.index_cast %5 : index to i64 - %7 = arith.maximumf %in, %out : f32 - %8 = arith.cmpf ogt, %in, %out : f32 - %9 = arith.select %8, %6, %out_0 : i64 - linalg.yield %7, %9 : f32, i64 - } -> (tensor<4xf32>, tensor<4xi64>) - return %4#1 : tensor<4xi64> -} - -// CHECK-LABEL: func @argmax_2d_non_unit_parallel_f32i64( -// CHECK-NOT: iree_codegen.ukernel.generic -// CHECK: linalg.generic - -// ----- - -func.func @argmax_2d_dyn_parallel_f32i64(%arg0 : tensor) -> tensor attributes { - hal.executable.target = #hal.executable.target<"rocm", "rocm-hsaco-fb", {ukernels = "all"}> -} { - %c0 = arith.constant 0 : index - %c0_i64 = arith.constant 0 : i64 - %cst = arith.constant 0xFF800000 : f32 - %dim = tensor.dim %arg0, %c0 : tensor - %0 = tensor.empty(%dim) : tensor - %1 = linalg.fill ins(%c0_i64 : i64) outs(%0 : tensor) -> tensor - %2 = tensor.empty(%dim) : tensor - %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor) -> tensor - %4:2 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0)>, affine_map<(d0, d1) -> (d0)>], iterator_types = ["parallel", "reduction"]} 
ins(%arg0 : tensor) outs(%3, %1 : tensor, tensor) { - ^bb0(%in: f32, %out: f32, %out_0: i64): - %5 = linalg.index 1 : index - %6 = arith.index_cast %5 : index to i64 - %7 = arith.maximumf %in, %out : f32 - %8 = arith.cmpf ogt, %in, %out : f32 - %9 = arith.select %8, %6, %out_0 : i64 - linalg.yield %7, %9 : f32, i64 - } -> (tensor, tensor) - return %4#1 : tensor -} - -// CHECK-LABEL: func @argmax_2d_dyn_parallel_f32i64( -// CHECK-NOT: iree_codegen.ukernel.generic -// CHECK: linalg.generic +// CHECK-LABEL: func @argmax_4d_unit_parallel_f32i64( +// CHECK: linalg.generic +// CHECK-SAME: hal.executable.objects = [ +// CEHCK-SAME: #hal.executable.object<{path = "iree_uk_amdgpu_argmax_f32i64.gfx942.bc", data = dense_resource : vector<{{[0-9]+}}xi8>}>] +// CHECK-SAME: #iree_gpu.lowering_config<{{.*}}ukernel = #iree_gpu.ukernel_spec // ----- @@ -138,9 +79,10 @@ func.func @argmax_none_ukernel_enabled(%arg0 : tensor<1x?xf32>) -> tensor<1xi64> return %4#1 : tensor<1xi64> } -// CHECK-LABEL: func @argmax_none_ukernel_enabled( -// CHECK-NOT: iree_codegen.ukernel.generic -// CHECK: linalg.generic +// CHECK-LABEL: func @argmax_none_ukernel_enabled( +// CHECK: linalg.generic +// CHECK-NOT: hal.executable.objects +// CHECK-NOT: iree_gpu.ukernel_spec // ----- @@ -165,9 +107,11 @@ func.func @argmax_only_argmax_ukernel_enabled(%arg0 : tensor<1x?xf32>) -> tensor return %4#1 : tensor<1xi64> } -// CDNA2-LABEL: func @argmax_only_argmax_ukernel_enabled( -// CDNA2: iree_codegen.ukernel.generic -// CDNA2-NOT: linalg.generic +// CHECK-LABEL: func @argmax_only_argmax_ukernel_enabled( +// CHECK: linalg.generic +// CHECK-SAME: hal.executable.objects = [ +// CHECK-SAME: #hal.executable.object<{path = "iree_uk_amdgpu_argmax_f32i64.gfx942.bc", data = dense_resource : vector<{{[0-9]+}}xi8>}>] +// CHECK-SAME: #iree_gpu.lowering_config<{{.*}}ukernel = #iree_gpu.ukernel_spec // ----- @@ -192,11 +136,11 @@ func.func @argmax_only_foo_argmax_bar_ukernel_enabled(%arg0 : tensor<1x?xf32>) - return %4#1 : 
tensor<1xi64> } -// CHECK-LABEL: func @argmax_only_foo_argmax_bar_ukernel_enabled( -// CHECK: iree_codegen.ukernel.generic -// CHECK-NOT: linalg.generic - -// CDNA2-LABEL: func @argmax_only_foo_argmax_bar_ukernel_enabled( +// CHECK-LABEL: func @argmax_only_foo_argmax_bar_ukernel_enabled( +// CHECK: linalg.generic +// CHECK-SAME: hal.executable.objects = [ +// CHECK-SAME: #hal.executable.object<{path = "iree_uk_amdgpu_argmax_f32i64.gfx942.bc", data = dense_resource : vector<{{[0-9]+}}xi8>}>] +// CHECK-SAME: #iree_gpu.lowering_config<{{.*}}ukernel = #iree_gpu.ukernel_spec // ----- @@ -221,9 +165,10 @@ func.func @argmax_only_foo_ukernel_enabled(%arg0 : tensor<1x?xf32>) -> tensor<1x return %4#1 : tensor<1xi64> } -// CHECK-LABEL: func @argmax_only_foo_ukernel_enabled( -// CHECK-NOT: iree_codegen.ukernel.generic -// CHECK: linalg.generic +// CHECK-LABEL: func @argmax_only_foo_ukernel_enabled( +// CHECK: linalg.generic +// CHECK-NOT: hal.executable.objects +// CHECK-NOT: iree_gpu.ukernel_spec // ----- @@ -249,46 +194,16 @@ func.func @argmax_2d_f32i64_not_neg_inf_init(%arg0 : tensor<1x?xf32>) -> tensor< return %4#1 : tensor<1xi64> } -// CHECK-LABEL: func @argmax_2d_f32i64_not_neg_inf_init( -// CHECK-NOT: iree_codegen.ukernel.generic -// CHECK: linalg.generic - -// ----- - -// TODO: No technical reason this architecture is not supported. -// Currently just picking out popular chips to support, -// to minimize compile time and space. 
- -func.func @argmax_ukernel_unsupported_arch(%arg0 : tensor<1x?xf32>) -> tensor<1xi64> attributes { - hal.executable.target = #hal.executable.target<"rocm", "rocm-hsaco-fb", {ukernels = "all"}> -} { - %c0_i64 = arith.constant 0 : i64 - %cst = arith.constant 0xFF800000 : f32 - %0 = tensor.empty() : tensor<1xi64> - %1 = linalg.fill ins(%c0_i64 : i64) outs(%0 : tensor<1xi64>) -> tensor<1xi64> - %2 = tensor.empty() : tensor<1xf32> - %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<1xf32>) -> tensor<1xf32> - %4:2 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0)>, affine_map<(d0, d1) -> (d0)>], iterator_types = ["parallel", "reduction"]} ins(%arg0 : tensor<1x?xf32>) outs(%3, %1 : tensor<1xf32>, tensor<1xi64>) { - ^bb0(%in: f32, %out: f32, %out_0: i64): - %5 = linalg.index 1 : index - %6 = arith.index_cast %5 : index to i64 - %7 = arith.maximumf %in, %out : f32 - %8 = arith.cmpf ogt, %in, %out : f32 - %9 = arith.select %8, %6, %out_0 : i64 - linalg.yield %7, %9 : f32, i64 - } -> (tensor<1xf32>, tensor<1xi64>) - return %4#1 : tensor<1xi64> -} - -// CDNA1-LABEL: func @argmax_ukernel_unsupported_arch( -// CDNA1-NOT: iree_codegen.ukernel.generic -// CDNA1: linalg.generic +// CHECK-NOT: lowering_config<{{.*}}ukernel +// CHECK-LABEL: func @argmax_2d_f32i64_not_neg_inf_init( +// CHECK: linalg.generic +// CHECK-NOT: hal.executable.objects // ----- // Test user-provided bitcode in the source IR. -func.func @argmax_2d_f32i64(%arg0 : tensor<1x?xf32>) -> tensor<1xi64> attributes { +func.func @argmax_2d_f32i64_custom_bitcode(%arg0 : tensor<1x?xf32>) -> tensor<1xi64> attributes { hal.executable.target = #hal.executable.target<"rocm", "rocm-hsaco-fb", {ukernels = "all"}>, // Dummy bitcode with an unusual length of 12. The first 4 bytes are the .bc file format signature. 
hal.executable.objects = [ @@ -316,18 +231,12 @@ func.func @argmax_2d_f32i64(%arg0 : tensor<1x?xf32>) -> tensor<1xi64> attributes return %4#1 : tensor<1xi64> } -//CHECK-LABEL: func @argmax_2d_f32i64( -// CHECK-SAME: %[[ARG0:[a-zA-Z0-9]+]]: tensor<1x?xf32> -// CHECK-DAG: %[[C1_index:.+]] = arith.constant 1 : index -// CHECK-DAG: %[[C0_i64:.+]] = arith.constant 0 -// CHECK-DAG: %[[FILL:.+]] = linalg.fill ins(%[[C0_i64]] -// CHECK: %[[MICRO_KERNEL:.+]] = iree_codegen.ukernel.generic { -// CHECK-SAME: hal.executable.objects = [ -// CHECK-SAME: #hal.executable.object<{ -// CHECK-SAME: path = "iree_uk_amdgpu_argmax_f32i64.gfx942.bc", -// CHECK-SAME: data = dense<[66, 67, -64, -34, 1, 35, 69, 103, -119, -85, -51, -17]> : tensor<12xi8> -// CHECK-SAME: }> -// CHECK-SAME: ]} "iree_uk_amdgpu_argmax_f32i64" -// CHECK-SAME: ins(%[[ARG0]] : -// CHECK-SAME: outs(%[[FILL]] : -// CHECK: return %[[MICRO_KERNEL]] +// CHECK-LABEL: func @argmax_2d_f32i64_custom_bitcode( +// CHECK: linalg.generic +// CHECK-SAME: hal.executable.objects = [ +// CHECK-SAME: #hal.executable.object<{ +// CHECK-SAME: path = "iree_uk_amdgpu_argmax_f32i64.gfx942.bc", +// CHECK-SAME: data = dense<[66, 67, -64, -34, 1, 35, 69, 103, -119, -85, -51, -17]> : tensor<12xi8> +// CHECK-SAME: }> +// CHECK-SAME: ] +// CHECK-SAME: #iree_gpu.lowering_config<{{.*}}ukernel = #iree_gpu.ukernel_spec diff --git a/compiler/plugins/target/ROCM/test/ukernel_pipeline_transform.mlir b/compiler/plugins/target/ROCM/test/ukernel_pipeline_transform.mlir index 26ce4c8959f4..15e5169e37b5 100644 --- a/compiler/plugins/target/ROCM/test/ukernel_pipeline_transform.mlir +++ b/compiler/plugins/target/ROCM/test/ukernel_pipeline_transform.mlir @@ -44,7 +44,7 @@ func.func @argmax_1d_f16i64() attributes { // CHECK: #[[$TRANSLATION:.+]] = #iree_codegen.translation_info // CHECK: func.func @argmax_1d_f16i64() // CHECK-SAME: translation_info = #[[$TRANSLATION]] -// CHECK: iree_codegen.ukernel.generic {hal.executable.objects = [{{.*}}]} 
"iree_uk_amdgpu_argmax_f16i64" +// CHECK: iree_codegen.ukernel.generic "iree_uk_amdgpu_argmax_f16i64" // ----- @@ -94,7 +94,7 @@ func.func @argmax_2d_f32i64() attributes { // CHECK-SAME: translation_info = #[[$TRANSLATION]] // CHECK: %[[SUBVIEW:.*]] = memref.subview{{.*}} memref<16x?xf32 // CHECK-SAME: to memref<1x?xf32 -// CHECK: iree_codegen.ukernel.generic {hal.executable.objects = [{{.*}}]} "iree_uk_amdgpu_argmax_f32i64" ins(%[[SUBVIEW]] +// CHECK: iree_codegen.ukernel.generic "iree_uk_amdgpu_argmax_f32i64" ins(%[[SUBVIEW]] // ----- diff --git a/compiler/src/iree/compiler/Codegen/Common/GPU/GPULowerToUKernels.cpp b/compiler/src/iree/compiler/Codegen/Common/GPU/GPULowerToUKernels.cpp index c9ff4b8ed96c..796138d55e3f 100644 --- a/compiler/src/iree/compiler/Codegen/Common/GPU/GPULowerToUKernels.cpp +++ b/compiler/src/iree/compiler/Codegen/Common/GPU/GPULowerToUKernels.cpp @@ -5,12 +5,11 @@ // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception #include "iree/compiler/Codegen/Common/GPU/Passes.h" +#include "iree/compiler/Codegen/Dialect/Codegen/IR/IREECodegenAttrs.h" #include "iree/compiler/Codegen/Dialect/Codegen/IR/IREECodegenDialect.h" #include "iree/compiler/Codegen/Dialect/Codegen/IR/UKernelOps.h" -#include "iree/compiler/Codegen/Utils/GPUUtils.h" +#include "iree/compiler/Codegen/Dialect/GPU/IR/GPULoweringConfigUtils.h" #include "iree/compiler/Codegen/Utils/Utils.h" -#include "iree/compiler/Utils/EmbeddedDataDirectory.h" -#include "llvm/Support/FormatVariadic.h" #include "mlir/Dialect/Linalg/IR/Linalg.h" #include "mlir/Dialect/Linalg/Utils/Utils.h" #include "mlir/Dialect/Tensor/IR/Tensor.h" @@ -27,114 +26,12 @@ namespace mlir::iree_compiler { namespace { -// Returns a ExecutableObjectAttr carrying the bitcode for the given ukernel. 
-// -// First tries finding the bitcode in the input `sourceExecutableObjects`, which -// must be an array of ExecutableObjectAttr's and is typically coming from a -// hal.executable.objects array attribute in the source IR, which is the -// mechanism by which source programs may provide their own ukernel bitcode. -// -// If no matching bitcode was found in `sourceExecutableObjects`, this function -// will then search in bitcode files that we have embedded as static data. -static IREE::HAL::ExecutableObjectAttr -getUKernelBitcode(OpBuilder &builder, - IREE::HAL::ExecutableTargetAttr execTarget, - ArrayAttr sourceExecutableObjects, StringRef ukernelName) { - IREE::GPU::TargetAttr gpuTarget = getGPUTargetAttr(execTarget); - if (!gpuTarget) { - return {}; - } - StringRef gpuArch = gpuTarget.getArch(); - std::string bitcodeFilename = llvm::formatv("{}.{}.bc", ukernelName, gpuArch); - - // Early-return if the source executable.objects already contain an object - // with the expected file name. This happens with user-provided bitcode in the - // source IR. - if (sourceExecutableObjects) { - for (Attribute a : sourceExecutableObjects) { - if (auto object = dyn_cast(a)) { - if (object.getPath() == bitcodeFilename) { - return object; - } - } - } - } - - // No user-provided bitcode, so we search our embedded bitcode files in the - // EmbeddedDataDirectory singleton. 
- std::optional bitcode; - EmbeddedDataDirectory::withGlobal([&](EmbeddedDataDirectory &dir) { - bitcode = dir.getFile(bitcodeFilename); - }); - if (!bitcode) { - return {}; - } - MLIRContext *context = builder.getContext(); - auto blob = HeapAsmResourceBlob::allocateAndCopyInferAlign( - ArrayRef(bitcode->data(), bitcode->size())); - auto bitcodeDenseAttr = DenseI8ResourceElementsAttr::get( - VectorType::get({static_cast(bitcode->size())}, - builder.getI8Type()), - bitcodeFilename, std::move(blob)); - return IREE::HAL::ExecutableObjectAttr::get( - context, StringAttr::get(context, bitcodeFilename), - cast(bitcodeDenseAttr)); -} - -// Walks parents ops from `op` to return the nearest hal.executable.objects -// array attribute. If the parent hal.executable.variant is reached, its objects -// attribute is returned. -// Adapted from ExecutableTargetAttr::lookup. -static ArrayAttr lookUpExecutableObjects(Operation *op) { - MLIRContext *context = op->getContext(); - auto attrId = StringAttr::get(context, "hal.executable.objects"); - while (op) { - // Take directly from the enclosing variant. - if (auto variantOp = dyn_cast(op)) { - if (std::optional objects = variantOp.getObjects()) { - return *objects; - } - } - // Take from op attributes. - if (auto attr = op->getAttrOfType(attrId)) { - return attr; - } - // Continue walk. - op = op->getParentOp(); - } - return {}; -} - -/// Holds a function name and attributes. -struct FnNameAndDefAttrs { - std::string name; - SmallVector defAttrs; - explicit operator bool() const { return !name.empty(); } -}; - -/// Returns the function name and attributes to use for a ukernel with given -/// `name` and `suffix` on the target described by `targetAttr`. 
-static FnNameAndDefAttrs -getFnNameAndDefAttrs(const char *name, std::string &suffix, - RewriterBase &rewriter, - IREE::HAL::ExecutableTargetAttr targetAttr) { - FnNameAndDefAttrs result; - if (isROCMBackend(targetAttr)) { - result.name = llvm::formatv("iree_uk_amdgpu_{}_{}", name, suffix); - result.defAttrs.emplace_back(rewriter.getStringAttr("vm.import.module"), - rewriter.getStringAttr("rocm")); - } - return result; -} - /// Matches generic that represent argmax and check if /// we have the ukernel that matches it shape constraint, and types. /// If we do, then we convert into iree_codegen.ukernel.argmax operation, /// that is later lowered into a call to the microkernel. static FailureOr matchArgmaxDAGForUKernel(RewriterBase &rewriter, linalg::GenericOp op) { - auto targetAttr = IREE::HAL::ExecutableTargetAttr::lookup(op); - const char ukernelName[] = "argmax"; Value input = op.getDpsInputOperand(0)->get(); auto inputType = cast(input.getType()); Value index = op.getDpsInitOperand(1)->get(); @@ -142,41 +39,16 @@ matchArgmaxDAGForUKernel(RewriterBase &rewriter, linalg::GenericOp op) { std::string suffix; llvm::raw_string_ostream(suffix) << inputType.getElementType() << indexType.getElementType(); - FnNameAndDefAttrs fn = - getFnNameAndDefAttrs(ukernelName, suffix, rewriter, targetAttr); - if (!fn) { - return rewriter.notifyMatchFailure(op, "no ukernels on this backend"); + auto loweringConfig = getLoweringConfig(op); + if (!loweringConfig) { + return rewriter.notifyMatchFailure(op, "no lowering_config on this op"); } - - if (!hasUkernel(targetAttr, ukernelName)) { - return rewriter.notifyMatchFailure(op, "ukernel not enabled"); + IREE::GPU::UKernelSpecAttr ukernelAttr = + IREE::GPU::getUkernelSpec(loweringConfig); + if (!ukernelAttr) { + return rewriter.notifyMatchFailure(op, "no ukernel selected for this op"); } - // Currently only support argmax where parallel dims are 1. 
- // Tiling pipeline is also set to tile all parallel dims to 1, and - // reduction dim to be size of whole reduction problem. Which allow - // this constraint to be true for a lot of argmax variances. - // TODO: Support multi-row or grid-strided argmax ukernel. - SmallVector bounds = op.getStaticLoopRanges(); - SmallVector parallelDims; - op.getParallelDims(parallelDims); - int64_t parallelSize = 1; - for (int64_t dim : parallelDims) { - if (ShapedType::isDynamic(bounds[dim])) { - return failure(); - } - parallelSize *= bounds[dim]; - } - if (parallelSize != 1) { - return failure(); - } - auto execTarget = IREE::HAL::ExecutableTargetAttr::lookup(op); - ArrayAttr sourceExecutableObjects = lookUpExecutableObjects(op); - IREE::HAL::ExecutableObjectAttr bitcodeObject = - getUKernelBitcode(rewriter, execTarget, sourceExecutableObjects, fn.name); - if (!bitcodeObject) { - return rewriter.notifyMatchFailure(op, "no ukernel bitcode for this op"); - } Location loc = op.getLoc(); // Currently only support 1D reduction, where reduc is on fastest dim. // Tiling argmax ukernel is also set to enforce this structure. 
@@ -184,13 +56,9 @@ matchArgmaxDAGForUKernel(RewriterBase &rewriter, linalg::GenericOp op) { Value reductionDimSize = rewriter.create(loc, input, kReductionDim); auto genericMicroKernelOp = rewriter.create( - loc, indexType, fn.name, ValueRange{input}, index, - ValueRange{reductionDimSize}, - /*fn_def_attrs=*/rewriter.getDictionaryAttr(fn.defAttrs), + loc, indexType, ukernelAttr.getName(), ValueRange{input}, index, + ValueRange{reductionDimSize}, ukernelAttr.getDefAttrs(), /*strided_outer_dims=*/rewriter.getIndexAttr(0)); - genericMicroKernelOp->setAttr( - "hal.executable.objects", - ArrayAttr::get(rewriter.getContext(), bitcodeObject)); return cast( genericMicroKernelOp.getOperation()); } diff --git a/compiler/src/iree/compiler/Codegen/Common/GPU/Passes.td b/compiler/src/iree/compiler/Codegen/Common/GPU/Passes.td index b3fdd50d4d46..2c25e02852f4 100644 --- a/compiler/src/iree/compiler/Codegen/Common/GPU/Passes.td +++ b/compiler/src/iree/compiler/Codegen/Common/GPU/Passes.td @@ -107,7 +107,7 @@ def GPUInferMemorySpacePass : def GPULowerToUKernelsPass : Pass<"iree-codegen-gpu-lower-to-ukernels", ""> { - let summary = "Lower suitable ops to microkernels."; + let summary = "Lower suitable ops to previously-selected microkernels"; let dependentDialects = [ "::mlir::iree_compiler::IREE::Codegen::IREECodegenDialect", "::mlir::iree_compiler::IREE::GPU::IREEGPUDialect", diff --git a/compiler/src/iree/compiler/Codegen/Common/GPU/test/BUILD.bazel b/compiler/src/iree/compiler/Codegen/Common/GPU/test/BUILD.bazel index dc8e6a181ccf..030e6f4de497 100644 --- a/compiler/src/iree/compiler/Codegen/Common/GPU/test/BUILD.bazel +++ b/compiler/src/iree/compiler/Codegen/Common/GPU/test/BUILD.bazel @@ -31,6 +31,7 @@ iree_lit_test_suite( "gpu_greedily_distribute_to_threads.mlir", "gpu_infer_memory_space.mlir", "gpu_combine_value_barriers.mlir", + "gpu_lower_to_ukernels.mlir", "gpu_materialize_encoding_gfx908.mlir", "gpu_materialize_encoding_gfx90a.mlir", 
"gpu_materialize_encoding_gfx942.mlir", diff --git a/compiler/src/iree/compiler/Codegen/Common/GPU/test/CMakeLists.txt b/compiler/src/iree/compiler/Codegen/Common/GPU/test/CMakeLists.txt index 4dc0f289d3d5..6d1f540f420a 100644 --- a/compiler/src/iree/compiler/Codegen/Common/GPU/test/CMakeLists.txt +++ b/compiler/src/iree/compiler/Codegen/Common/GPU/test/CMakeLists.txt @@ -26,6 +26,7 @@ iree_lit_test_suite( "gpu_generalize_named_ops.mlir" "gpu_greedily_distribute_to_threads.mlir" "gpu_infer_memory_space.mlir" + "gpu_lower_to_ukernels.mlir" "gpu_materialize_encoding_gfx1100.mlir" "gpu_materialize_encoding_gfx908.mlir" "gpu_materialize_encoding_gfx90a.mlir" diff --git a/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_lower_to_ukernels.mlir b/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_lower_to_ukernels.mlir new file mode 100644 index 000000000000..6a13468a1d29 --- /dev/null +++ b/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_lower_to_ukernels.mlir @@ -0,0 +1,72 @@ +// RUN: iree-opt --split-input-file --pass-pipeline="builtin.module(func.func(iree-codegen-gpu-lower-to-ukernels,cse,canonicalize))" %s | FileCheck %s + +#config = #iree_gpu.lowering_config<{ukernel = #iree_gpu.ukernel_spec}> +func.func @argmax_f32i64_with_selected_ukernel(%arg0 : tensor<1x?xf32>) -> tensor<1xi64> attributes { + hal.executable.target = #hal.executable.target<"rocm", "rocm-hsaco-fb", {ukernels = "all"}> +} { + %c0_i64 = arith.constant 0 : i64 + %cst = arith.constant 0xFF800000 : f32 + %0 = tensor.empty() : tensor<1xi64> + %1 = linalg.fill ins(%c0_i64 : i64) outs(%0 : tensor<1xi64>) -> tensor<1xi64> + %2 = tensor.empty() : tensor<1xf32> + %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<1xf32>) -> tensor<1xf32> + %4:2 = linalg.generic { + indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0)>, affine_map<(d0, d1) -> (d0)>], + iterator_types = ["parallel", "reduction"] + } + ins(%arg0 : tensor<1x?xf32>) outs(%3, %1 : tensor<1xf32>, 
tensor<1xi64>) + attrs = { + // The lowering_config.ukernel is what is essential to the lowering. + lowering_config = #config} { + ^bb0(%in: f32, %out: f32, %out_0: i64): + %5 = linalg.index 1 : index + %6 = arith.index_cast %5 : index to i64 + %7 = arith.maximumf %in, %out : f32 + %8 = arith.cmpf ogt, %in, %out : f32 + %9 = arith.select %8, %6, %out_0 : i64 + linalg.yield %7, %9 : f32, i64 + } -> (tensor<1xf32>, tensor<1xi64>) + return %4#1 : tensor<1xi64> +} + +//CHECK-LABEL: func @argmax_f32i64_with_selected_ukernel( +// CHECK-SAME: %[[ARG0:[a-zA-Z0-9]+]]: tensor<1x?xf32> +// CHECK-DAG: %[[C1_index:.+]] = arith.constant 1 : index +// CHECK-DAG: %[[C0_i64:.+]] = arith.constant 0 +// CHECK-DAG: %[[FILL:.+]] = linalg.fill ins(%[[C0_i64]] +// CHECK: %[[MICRO_KERNEL:.+]] = iree_codegen.ukernel.generic +// CHECK-SAME: "some_ukernel" +// CHECK-SAME: ins(%[[ARG0]] : +// CHECK-SAME: outs(%[[FILL]] : +// CHECK: return %[[MICRO_KERNEL]] + +// ----- + +func.func @argmax_f32i64_without_selected_ukernel(%arg0 : tensor<1x?xf32>) -> tensor<1xi64> attributes { + hal.executable.target = #hal.executable.target<"rocm", "rocm-hsaco-fb", {ukernels = "all"}> +} { + %c0_i64 = arith.constant 0 : i64 + %cst = arith.constant 0xFF800000 : f32 + %0 = tensor.empty() : tensor<1xi64> + %1 = linalg.fill ins(%c0_i64 : i64) outs(%0 : tensor<1xi64>) -> tensor<1xi64> + %2 = tensor.empty() : tensor<1xf32> + %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<1xf32>) -> tensor<1xf32> + %4:2 = linalg.generic { + indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0)>, affine_map<(d0, d1) -> (d0)>], + iterator_types = ["parallel", "reduction"] + } + ins(%arg0 : tensor<1x?xf32>) outs(%3, %1 : tensor<1xf32>, tensor<1xi64>) { + ^bb0(%in: f32, %out: f32, %out_0: i64): + %5 = linalg.index 1 : index + %6 = arith.index_cast %5 : index to i64 + %7 = arith.maximumf %in, %out : f32 + %8 = arith.cmpf ogt, %in, %out : f32 + %9 = arith.select %8, %6, %out_0 : i64 + linalg.yield %7, %9 : f32, 
i64 + } -> (tensor<1xf32>, tensor<1xi64>) + return %4#1 : tensor<1xi64> +} + +//CHECK-LABEL: func @argmax_f32i64_without_selected_ukernel( +// CHECK-NOT: iree_codegen.ukernel.generic +// CHECK: linalg.generic diff --git a/compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/GPULoweringConfigUtils.cpp b/compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/GPULoweringConfigUtils.cpp index 6957caf981fc..8ebfba912442 100644 --- a/compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/GPULoweringConfigUtils.cpp +++ b/compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/GPULoweringConfigUtils.cpp @@ -145,4 +145,9 @@ std::optional> getPaddingList(LoweringConfigAttr config) { return getIntegerVector(array); } +IREE::GPU::UKernelSpecAttr +getUkernelSpec(IREE::GPU::LoweringConfigAttr config) { + return config.getAttributes().getAs("ukernel"); +} + } // namespace mlir::iree_compiler::IREE::GPU diff --git a/compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/GPULoweringConfigUtils.h b/compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/GPULoweringConfigUtils.h index c1188b75c1eb..5bebb64a1b05 100644 --- a/compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/GPULoweringConfigUtils.h +++ b/compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/GPULoweringConfigUtils.h @@ -59,6 +59,8 @@ void setPromotedOperandList(MLIRContext *context, /// Helper to retrieve list of operand to pad. 
std::optional> getPaddingList(LoweringConfigAttr config); +IREE::GPU::UKernelSpecAttr getUkernelSpec(IREE::GPU::LoweringConfigAttr config); + } // namespace mlir::iree_compiler::IREE::GPU #endif // IREE_COMPILER_CODEGEN_DIALECT_GPU_IR_GPULOWERINGCONFIGUTILS_H_ diff --git a/compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/IREEGPUAttrs.td b/compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/IREEGPUAttrs.td index a239af395d29..0b1e32fdc362 100644 --- a/compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/IREEGPUAttrs.td +++ b/compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/IREEGPUAttrs.td @@ -520,6 +520,25 @@ def IREEGPU_LaneIdAttr : AttrDef { + let mnemonic = "ukernel_spec"; + let summary = "An attribute specifying a ukernel that an op can lower to."; + let description = [{ + An attribute that can be applied to any operation to specify that it has + been matched with a ukernel that is a legal lowering for it. + }]; + let assemblyFormat = "`<` struct(params) `>`"; + let parameters = (ins + "StringAttr":$name, + "DictionaryAttr":$def_attrs + ); +} + //===----------------------------------------------------------------------===// // GPU Pipeline Options //===----------------------------------------------------------------------===// diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/BUILD.bazel b/compiler/src/iree/compiler/Codegen/LLVMGPU/BUILD.bazel index 73e039798deb..a5c1bce4beda 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/BUILD.bazel +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/BUILD.bazel @@ -147,6 +147,7 @@ iree_compiler_cc_library( "//compiler/src/iree/compiler/Dialect/Flow/IR", "//compiler/src/iree/compiler/Dialect/Flow/Transforms", "//compiler/src/iree/compiler/Dialect/HAL/IR", + "//compiler/src/iree/compiler/Dialect/HAL/Transforms", "//compiler/src/iree/compiler/Dialect/LinalgExt/IR", "//compiler/src/iree/compiler/Dialect/LinalgExt/Transforms", "//compiler/src/iree/compiler/Dialect/LinalgExt/Utils", diff --git
a/compiler/src/iree/compiler/Codegen/LLVMGPU/CMakeLists.txt b/compiler/src/iree/compiler/Codegen/LLVMGPU/CMakeLists.txt index b33641bda92e..5c206210ab30 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/CMakeLists.txt +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/CMakeLists.txt @@ -190,6 +190,7 @@ iree_cc_library( iree::compiler::Dialect::Flow::IR iree::compiler::Dialect::Flow::Transforms iree::compiler::Dialect::HAL::IR + iree::compiler::Dialect::HAL::Transforms iree::compiler::Dialect::LinalgExt::IR iree::compiler::Dialect::LinalgExt::Transforms iree::compiler::Dialect::LinalgExt::Utils diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/KernelConfig.cpp b/compiler/src/iree/compiler/Codegen/LLVMGPU/KernelConfig.cpp index cb22b598a94b..ee4614d7bb05 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/KernelConfig.cpp +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/KernelConfig.cpp @@ -10,6 +10,7 @@ #include #include +#include "compiler/src/iree/compiler/Codegen/LLVMGPU/Utils/LLVMGPUSelectUKernels.h" #include "iree/compiler/Codegen/Common/GPU/GPUHeuristics.h" #include "iree/compiler/Codegen/Dialect/Codegen/IR/IREECodegenAttrs.h" #include "iree/compiler/Codegen/Dialect/GPU/IR/GPULoweringConfigUtils.h" @@ -2042,28 +2043,15 @@ static LogicalResult setTransposeConfig(mlir::FunctionOpInterface entryPoint, /// Set the configuration for argmax when ukernels are enabled. /// Distribute all parallel dim across different workgroups, and only use single /// subgroup per workgroup. -/// -/// TODO(bjacob): This is fragile, as we can't know yet if this argmax will be -/// lowered to a ukernel. We need instead a config that works regardless of -/// ukernels. For now, we use the looser condition that the argmax ukernel is -/// enabled, a necessary but not sufficient condition for this particular op to -/// lower to the ukernel. This is good enough for now for a couple of reasons: -/// 1. 
Even if a argmax does not actually lower to a ukernel, this config should -/// still work. -/// 2. Ukernels are not enabled by default. static LogicalResult setArgmaxUkernelConfig(IREE::GPU::TargetAttr target, mlir::FunctionOpInterface entryPoint, linalg::GenericOp op) { // Checks if UKernels are enabled. - if (auto target = IREE::HAL::ExecutableTargetAttr::lookup(entryPoint)) { - if (!hasUkernel(target, "argmax")) { - return failure(); - } - } - - if (!target.supportsSubgroupShuffle()) + IREE::GPU::UKernelSpecAttr ukernelSpec = selectUKernelForArgmax(op); + if (!ukernelSpec) { return failure(); + } if (failed(isArgmaxOp(op))) return failure(); @@ -2094,26 +2082,35 @@ setArgmaxUkernelConfig(IREE::GPU::TargetAttr target, return failure(); } - // Tile all the parallel dimension to 1. + // Tile all the parallel dimension to 1. This is a requirement of the ukernel. SmallVector partitionedLoops = cast(op.getOperation()) .getPartitionableLoops(kNumMaxParallelDims); size_t numLoops = partitionedLoops.empty() ? 0 : partitionedLoops.back() + 1; SmallVector workgroupTileSizes(numLoops, 1); - // Currently Argmax Ukernel let's every thread reduce reductionDim/WarpSize + // Currently Argmax Ukernel lets every thread reduce reductionDim/WarpSize // number of elements, and then it does a single step butterfly warp reduce. // Hence it expects workgroupSize to be warpSize(subgroupSize), and // reductionTileSize to be size of the reduction dim. 
SmallVector reductionTileSizes(op.getNumLoops(), 0); int64_t preferredSubgroupSize = target.getPreferredSubgroupSize(); reductionTileSizes[reductionDims[0]] = preferredSubgroupSize; - TileSizesListType tileSizes; - tileSizes.emplace_back(std::move(workgroupTileSizes)); // Workgroup level - tileSizes.emplace_back(std::move(reductionTileSizes)); // Reduction level std::array workgroupSize = {preferredSubgroupSize, 1, 1}; + + MLIRContext *context = op->getContext(); + Builder b(context); + SmallVector attrs; + attrs.emplace_back(StringAttr::get(context, "workgroup"), + b.getI64ArrayAttr(workgroupTileSizes)); + attrs.emplace_back(StringAttr::get(context, "reduction"), + b.getI64ArrayAttr(reductionTileSizes)); + attrs.emplace_back(StringAttr::get(context, "ukernel"), ukernelSpec); + IREE::GPU::setPromotedOperandList(context, attrs, {0, 1}); + auto configDict = DictionaryAttr::get(context, attrs); + auto loweringConfig = IREE::GPU::LoweringConfigAttr::get(context, configDict); if (failed(setOpConfigAndEntryPointFnTranslation( - entryPoint, op, tileSizes, CodeGenPipeline::LLVMGPUDefault, + entryPoint, op, loweringConfig, CodeGenPipeline::LLVMGPUDefault, workgroupSize))) { return failure(); } diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.cpp b/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.cpp index f8ebe1cc0069..b6414e1b6a47 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.cpp +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.cpp @@ -21,6 +21,7 @@ #include "iree/compiler/Codegen/Utils/Utils.h" #include "iree/compiler/Dialect/Flow/IR/FlowOps.h" #include "iree/compiler/Dialect/HAL/IR/HALTypes.h" +#include "iree/compiler/Dialect/HAL/Transforms/Passes.h" #include "iree/compiler/Dialect/Util/Transforms/Passes.h" #include "iree/compiler/Utils/PassUtils.h" #include "llvm/ADT/STLForwardCompat.h" @@ -1197,6 +1198,10 @@ void buildLLVMGPUCodegenConfigurationPassPipeline( void buildLLVMGPUCodegenPassPipeline(OpPassManager &variantPassManager, 
bool useROCM) { + // LLVMGPUSelectLoweringStrategyPass may have created ExecutableObjectAttr. + // Hoisting them now deduplicates them and ensures that rewrite patterns don't + // need to think about explicitly copying them over to new ops. + variantPassManager.addPass(IREE::HAL::createHoistExecutableObjectsPass()); { OpPassManager &modulePassManager = variantPassManager.nest(); modulePassManager.addPass(createLowerExecutableUsingTransformDialectPass()); diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/Utils/BUILD.bazel b/compiler/src/iree/compiler/Codegen/LLVMGPU/Utils/BUILD.bazel index 113c6d56598f..66bd982ffa89 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/Utils/BUILD.bazel +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/Utils/BUILD.bazel @@ -17,10 +17,12 @@ package( iree_compiler_cc_library( name = "Utils", srcs = [ + "LLVMGPUSelectUKernels.cpp", "LLVMGPUUtils.cpp", "PrefetchSharedMemoryCopy.cpp", ], hdrs = [ + "LLVMGPUSelectUKernels.h", "LLVMGPUUtils.h", ], deps = [ @@ -34,6 +36,7 @@ iree_compiler_cc_library( "//compiler/src/iree/compiler/Codegen/Utils:VectorOpUtils", "//compiler/src/iree/compiler/Dialect/HAL/IR", "//compiler/src/iree/compiler/Dialect/LinalgExt/Utils", + "//compiler/src/iree/compiler/Utils", "@llvm-project//llvm:Support", "@llvm-project//mlir:AMDGPUDialect", "@llvm-project//mlir:AffineDialect", @@ -42,6 +45,7 @@ iree_compiler_cc_library( "@llvm-project//mlir:FunctionInterfaces", "@llvm-project//mlir:GPUDialect", "@llvm-project//mlir:IR", + "@llvm-project//mlir:LinalgDialect", "@llvm-project//mlir:MathDialect", "@llvm-project//mlir:MemRefDialect", "@llvm-project//mlir:NVGPUDialect", diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/Utils/CMakeLists.txt b/compiler/src/iree/compiler/Codegen/LLVMGPU/Utils/CMakeLists.txt index 6b66e96ded1f..98ee9404ff61 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/Utils/CMakeLists.txt +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/Utils/CMakeLists.txt @@ -14,8 +14,10 @@ 
iree_cc_library( NAME Utils HDRS + "LLVMGPUSelectUKernels.h" "LLVMGPUUtils.h" SRCS + "LLVMGPUSelectUKernels.cpp" "LLVMGPUUtils.cpp" "PrefetchSharedMemoryCopy.cpp" DEPS @@ -27,6 +29,7 @@ iree_cc_library( MLIRFunctionInterfaces MLIRGPUDialect MLIRIR + MLIRLinalgDialect MLIRMathDialect MLIRMemRefDialect MLIRNVGPUDialect @@ -45,6 +48,7 @@ iree_cc_library( iree::compiler::Codegen::Utils::VectorOpUtils iree::compiler::Dialect::HAL::IR iree::compiler::Dialect::LinalgExt::Utils + iree::compiler::Utils PUBLIC ) diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/Utils/LLVMGPUSelectUKernels.cpp b/compiler/src/iree/compiler/Codegen/LLVMGPU/Utils/LLVMGPUSelectUKernels.cpp new file mode 100644 index 000000000000..1940e8f0b102 --- /dev/null +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/Utils/LLVMGPUSelectUKernels.cpp @@ -0,0 +1,152 @@ +// Copyright 2024 The IREE Authors +// +// Licensed under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception + +#include "iree/compiler/Codegen/LLVMGPU/Utils/LLVMGPUSelectUKernels.h" +#include "iree/compiler/Codegen/Utils/GPUUtils.h" +#include "iree/compiler/Codegen/Utils/Utils.h" +#include "iree/compiler/Utils/EmbeddedDataDirectory.h" +#include "llvm/Support/FormatVariadic.h" +#include "mlir/Dialect/Linalg/IR/Linalg.h" +#include "mlir/IR/AsmState.h" +#include "mlir/IR/Attributes.h" +#include "mlir/IR/BuiltinAttributes.h" + +namespace mlir::iree_compiler { + +namespace { + +constexpr StringLiteral executableObjectsAttrName = "hal.executable.objects"; + +// Returns a ExecutableObjectAttr carrying the bitcode for the given ukernel. 
+// +// First tries finding the bitcode in the input `sourceExecutableObjects`, which +// must be an array of ExecutableObjectAttr's and is typically coming from a +// hal.executable.objects array attribute in the source IR, which is the +// mechanism by which source programs may provide their own ukernel bitcode. +// +// If no matching bitcode was found in `sourceExecutableObjects`, this function +// will then search in bitcode files that we have embedded as static data. +static IREE::HAL::ExecutableObjectAttr +getUKernelBitcode(MLIRContext *context, + IREE::HAL::ExecutableTargetAttr execTarget, + ArrayAttr sourceExecutableObjects, StringRef ukernelName) { + IREE::GPU::TargetAttr gpuTarget = getGPUTargetAttr(execTarget); + if (!gpuTarget) { + return {}; + } + StringRef gpuArch = gpuTarget.getArch(); + std::string bitcodeFilename = llvm::formatv("{}.{}.bc", ukernelName, gpuArch); + + // Early-return if the source executable.objects already contain an object + // with the expected file name. This happens with user-provided bitcode in the + // source IR. + if (sourceExecutableObjects) { + for (Attribute a : sourceExecutableObjects) { + if (auto object = dyn_cast(a)) { + if (object.getPath() == bitcodeFilename) { + return object; + } + } + } + } + + // No user-provided bitcode, so we search our embedded bitcode files in the + // EmbeddedDataDirectory singleton. 
+ std::optional bitcode; + EmbeddedDataDirectory::withGlobal([&](EmbeddedDataDirectory &dir) { + bitcode = dir.getFile(bitcodeFilename); + }); + if (!bitcode) { + return {}; + } + auto blob = HeapAsmResourceBlob::allocateAndCopyInferAlign( + ArrayRef(bitcode->data(), bitcode->size())); + auto bitcodeDenseAttr = DenseI8ResourceElementsAttr::get( + VectorType::get({static_cast(bitcode->size())}, + IntegerType::get(context, 8)), + bitcodeFilename, std::move(blob)); + return IREE::HAL::ExecutableObjectAttr::get( + context, StringAttr::get(context, bitcodeFilename), + cast(bitcodeDenseAttr)); +} + +// Walks parents ops from `op` to return the nearest hal.executable.objects +// array attribute. If the parent hal.executable.variant is reached, its objects +// attribute is returned. +// Adapted from ExecutableTargetAttr::lookup. +static ArrayAttr lookUpExecutableObjects(Operation *op) { + MLIRContext *context = op->getContext(); + auto attrId = StringAttr::get(context, executableObjectsAttrName); + while (op) { + // Take directly from the enclosing variant. + if (auto variantOp = dyn_cast(op)) { + if (std::optional objects = variantOp.getObjects()) { + return *objects; + } + } + // Take from op attributes. + if (auto attr = op->getAttrOfType(attrId)) { + return attr; + } + // Continue walk. + op = op->getParentOp(); + } + return {}; +} + +/// Returns the function name and attributes to use for a ukernel with given +/// `name` and `suffix` on the target described by `targetAttr`. 
+static IREE::GPU::UKernelSpecAttr +getUKernelSpec(StringRef name, StringRef suffix, MLIRContext *context, + IREE::HAL::ExecutableTargetAttr targetAttr) { + if (isROCMBackend(targetAttr)) { + auto nameAttr = StringAttr::get( + context, llvm::formatv("iree_uk_amdgpu_{}_{}", name, suffix)); + auto defsAttr = DictionaryAttr::get( + context, {{StringAttr::get(context, "vm.import.module"), + StringAttr::get(context, "rocm")}}); + return IREE::GPU::UKernelSpecAttr::get(context, nameAttr, defsAttr); + } + return {}; +} + +} // namespace + +IREE::GPU::UKernelSpecAttr selectUKernelForArgmax(linalg::GenericOp op) { + if (failed(isArgmaxOp(op))) { + return {}; + } + auto targetAttr = IREE::HAL::ExecutableTargetAttr::lookup(op); + const char ukernelName[] = "argmax"; + if (!hasUkernel(targetAttr, ukernelName)) { + return {}; + } + Value input = op.getDpsInputOperand(0)->get(); + auto inputType = cast(input.getType()); + Value index = op.getDpsInitOperand(1)->get(); + auto indexType = cast(index.getType()); + std::string suffix; + llvm::raw_string_ostream(suffix) + << inputType.getElementType() << indexType.getElementType(); + MLIRContext *context = op->getContext(); + IREE::GPU::UKernelSpecAttr ukernelSpec = + getUKernelSpec(ukernelName, suffix, context, targetAttr); + if (!ukernelSpec) { + return {}; + } + auto execTarget = IREE::HAL::ExecutableTargetAttr::lookup(op); + ArrayAttr sourceExecutableObjects = lookUpExecutableObjects(op); + IREE::HAL::ExecutableObjectAttr bitcodeObject = getUKernelBitcode( + context, execTarget, sourceExecutableObjects, ukernelSpec.getName()); + if (!bitcodeObject) { + return {}; + } + op->setAttr(executableObjectsAttrName, + ArrayAttr::get(context, bitcodeObject)); + return ukernelSpec; +} + +} // namespace mlir::iree_compiler diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/Utils/LLVMGPUSelectUKernels.h b/compiler/src/iree/compiler/Codegen/LLVMGPU/Utils/LLVMGPUSelectUKernels.h new file mode 100644 index 000000000000..4ed251b36070 --- 
/dev/null +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/Utils/LLVMGPUSelectUKernels.h @@ -0,0 +1,15 @@ +// Copyright 2024 The IREE Authors +// +// Licensed under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception + +#include "iree/compiler/Codegen/Dialect/GPU/IR/IREEGPUAttrs.h" +#include "iree/compiler/Dialect/HAL/IR/HALTypes.h" +#include "mlir/Dialect/Linalg/IR/Linalg.h" + +namespace mlir::iree_compiler { + +IREE::GPU::UKernelSpecAttr selectUKernelForArgmax(linalg::GenericOp op); + +} // namespace mlir::iree_compiler From 67a05a45aec34d779bc7ff8968bd1c93133a037c Mon Sep 17 00:00:00 2001 From: Han-Chung Wang Date: Sun, 15 Dec 2024 19:57:04 -0800 Subject: [PATCH 17/64] [DT][NFC] Internalize transposeNarrowN logic to LayoutAttrInterface Impl (#19453) Whether to transpose narrow-N into narrow-M is a backend implementation detail, and we do not need to expose it to the type converter. The encoding itself has enough information, like indexing maps, narrow dimensions, etc., to infer the shapes and encoding info. Instead of updating the RankedTensorType and the attached encoding in the type converter, we can put the logic in the `getEncodingInfo` methods. From the encoding, we know whether it is the narrow-N case, and we can update the MaterializeEncodingInfo accordingly. The type converter can infer the transposed tensor type from it. Thus, we can simplify the logic in the type conversion. The documentation of `transposeNarrowN` is moved to `[CPU|GPU]EncodingExternalModels.cpp` because all of the implementation lives in those files. 
Signed-off-by: hanhanW --- .../Common/CPU/CPUMaterializeEncodings.cpp | 10 +-- .../compiler/Codegen/Common/EncodingUtils.cpp | 77 +------------------ .../compiler/Codegen/Common/EncodingUtils.h | 7 +- .../Common/GPU/GPUMaterializeEncoding.cpp | 9 --- .../Common/MaterializeEncodingIntoNop.cpp | 1 - .../MaterializeEncodingIntoPackUnPack.cpp | 32 -------- .../CPUEncodingExternalModels.cpp | 53 ++++++++++++- .../GPUEncodingExternalModels.cpp | 15 ++++ 8 files changed, 72 insertions(+), 132 deletions(-) diff --git a/compiler/src/iree/compiler/Codegen/Common/CPU/CPUMaterializeEncodings.cpp b/compiler/src/iree/compiler/Codegen/Common/CPU/CPUMaterializeEncodings.cpp index d1cb2fab6733..d182649f64ac 100644 --- a/compiler/src/iree/compiler/Codegen/Common/CPU/CPUMaterializeEncodings.cpp +++ b/compiler/src/iree/compiler/Codegen/Common/CPU/CPUMaterializeEncodings.cpp @@ -67,13 +67,6 @@ materializeFuncOpEncodings(FunctionOpInterface funcOp, IREE::HAL::ExecutableTargetAttr targetAttr) { MLIRContext *ctx = funcOp.getContext(); RewritePatternSet materializeEncodingPattern(ctx); - // On CPU, we use transposeNarrowN=true for a combination of reasons: - // 1. As linalg.matmul materializes into linalg.mmt4d, which has a transposed - // RHS and therefore LHS<->RHS symmetry, transposeNarrowN is easy to - // implement at that level. - // 2. We use ukernels, and this allows writing 2x fewer narrow ukernels. - // 3. Heuristics for cache-friendly dispatch tiling can get complex on CPU, - // so it is nice that they have fewer narrow cases to consider. 
DictionaryAttr targetConfig = targetAttr.getConfiguration(); IREE::Codegen::LayoutAttrInterface layoutAttr; if (isVMVXBackend(targetAttr)) { @@ -85,8 +78,7 @@ materializeFuncOpEncodings(FunctionOpInterface funcOp, layoutAttr = cast( IREE::CPU::CPUEncodingLayoutAttr::get(ctx, targetConfig)); } - MaterializeEncodingTypeConverter typeConverter( - /*transposeNarrowN=*/true, layoutAttr); + MaterializeEncodingTypeConverter typeConverter(layoutAttr); MaterializeEncodingConversionTarget target(*ctx); auto materializeEncodingValueFn = getMaterializeEncodingValueFn(targetAttr); populateMaterializeEncodingIntoPackUnPackPatterns( diff --git a/compiler/src/iree/compiler/Codegen/Common/EncodingUtils.cpp b/compiler/src/iree/compiler/Codegen/Common/EncodingUtils.cpp index 05caca808e4a..e3ca734a964b 100644 --- a/compiler/src/iree/compiler/Codegen/Common/EncodingUtils.cpp +++ b/compiler/src/iree/compiler/Codegen/Common/EncodingUtils.cpp @@ -20,76 +20,9 @@ using IREE::Encoding::EncodingAttr; using IREE::Encoding::getEncodingAttr; using IREE::Encoding::getEncodingContractionDims; -// If tensorType has the encoding of a matmul RESULT with narrow N, returns -// the transposed type. Otherwise, just returns tensorType. -static RankedTensorType transposeIfNarrowNResult(RankedTensorType tensorType) { - auto encoding = - llvm::dyn_cast_or_null(tensorType.getEncoding()); - if (!encoding) { - return tensorType; - } - if (!isNarrowNResult(encoding)) { - return tensorType; - } - SmallVector newOriginalShape(tensorType.getShape()); - auto userIndexingMaps = encoding.getUserIndexingMaps(); - SmallVector maps; - for (auto a : userIndexingMaps) { - maps.push_back(cast(a).getAffineMap()); - } - auto cDims = linalg::inferContractionDims(maps); - SmallVector newShape(tensorType.getShape()); - SmallVector permIndices(maps[0].getNumDims()); - std::iota(std::begin(permIndices), std::end(permIndices), 0); - // Matrix case: there are both M and N dimensions. Transposing means swapping - // them. 
- if (cDims->m.size() == 1 && cDims->n.size() == 1) { - int m = cDims->m[0]; - int n = cDims->n[0]; - std::swap(permIndices[m], permIndices[n]); - std::optional mDim = encoding.mapDimToOperandIndex(m); - std::optional nDim = encoding.mapDimToOperandIndex(n); - if (mDim.has_value() && nDim.has_value()) { - std::swap(newShape[mDim.value()], newShape[nDim.value()]); - std::swap(newOriginalShape[mDim.value()], newOriginalShape[nDim.value()]); - } - } - // Vector case: there is no N dimension to swap the M dimension with. We - // swap the maps themselves. - if (cDims->n.empty()) { - std::swap(maps[0], maps[1]); - } - - SmallVector newRoundDimsTo(encoding.getRoundDimsToArray()); - assert(newRoundDimsTo.size() == 0 || newRoundDimsTo.size() >= 3); - if (newRoundDimsTo.size() != 0) { - std::swap(newRoundDimsTo[newRoundDimsTo.size() - 3], - newRoundDimsTo[newRoundDimsTo.size() - 2]); - } - auto context = tensorType.getContext(); - AffineMap permutation = AffineMap::getPermutationMap(permIndices, context); - for (auto &map : maps) { - map = map.compose(permutation); - } - auto elemType = tensorType.getElementType(); - auto operandIndex = encoding.getOperandIndex().getInt(); - - // TODO(#17718): Handle the broadcast map for transpose cases. It is on the - // experimental path, so it is not clear what needs to be done here. For now - // just use the original map for the new encoding. 
- std::optional newBcastMap; - if (encoding.getBcastMap()) { - newBcastMap = encoding.getBcastMap().getValue(); - } - auto newEncoding = IREE::Encoding::EncodingAttr::get( - context, operandIndex, encoding.getOpType().getValue(), - encoding.getElementTypesArray(), maps, newBcastMap, newRoundDimsTo); - return RankedTensorType::get(newShape, elemType, newEncoding); -} - MaterializeEncodingTypeConverter::MaterializeEncodingTypeConverter( - bool transposeNarrowN, IREE::Codegen::LayoutAttrInterface layoutAttr) - : transposeNarrowN(transposeNarrowN), layoutAttr(layoutAttr) { + IREE::Codegen::LayoutAttrInterface layoutAttr) + : layoutAttr(layoutAttr) { addConversion([](IntegerType intType) { return intType; }); addConversion([](IndexType indexType) { return indexType; }); addConversion([](FloatType floatType) { return floatType; }); @@ -98,14 +31,12 @@ MaterializeEncodingTypeConverter::MaterializeEncodingTypeConverter( // For a given tensor type with an encoding, return the materialized // type to use for it. If no encoding is set, then return the tensor type // itself. - RankedTensorType tensorType = - transposeNarrowN ? transposeIfNarrowNResult(type) : type; - MaterializeEncodingInfo encodingInfo = getEncodingInfo(tensorType); + MaterializeEncodingInfo encodingInfo = getEncodingInfo(type); if (IREE::Codegen::isIdentityLayout(encodingInfo)) { return dropEncoding(type); } auto packedType = cast(tensor::PackOp::inferPackedType( - tensorType, encodingInfo.innerTileSizes, encodingInfo.innerDimsPos, + type, encodingInfo.innerTileSizes, encodingInfo.innerDimsPos, encodingInfo.outerDimsPerm)); // There is no swizzle, we are already done. Typically the case on CPU. 
diff --git a/compiler/src/iree/compiler/Codegen/Common/EncodingUtils.h b/compiler/src/iree/compiler/Codegen/Common/EncodingUtils.h index 3b59cbd63173..da9aa8a41731 100644 --- a/compiler/src/iree/compiler/Codegen/Common/EncodingUtils.h +++ b/compiler/src/iree/compiler/Codegen/Common/EncodingUtils.h @@ -36,7 +36,7 @@ using MaterializeEncodingValueFn = class MaterializeEncodingTypeConverter : public TypeConverter { public: MaterializeEncodingTypeConverter( - bool transposeNarrowN, IREE::Codegen::LayoutAttrInterface layoutAttr); + IREE::Codegen::LayoutAttrInterface layoutAttr); const IREE::Codegen::LayoutAttrInterface &getLayoutAttr() const { return layoutAttr; @@ -47,12 +47,7 @@ class MaterializeEncodingTypeConverter : public TypeConverter { return layoutAttr.getEncodingInfo(type); } - bool getTransposeNarrowN() const { return transposeNarrowN; } - private: - bool transposeNarrowN = false; - // TODO(hanchung): Move the logic that takes `transposeNarrowN` into account - // to their own attribute implementation. const IREE::Codegen::LayoutAttrInterface layoutAttr; }; diff --git a/compiler/src/iree/compiler/Codegen/Common/GPU/GPUMaterializeEncoding.cpp b/compiler/src/iree/compiler/Codegen/Common/GPU/GPUMaterializeEncoding.cpp index 92b8ac4df0e8..32536085576b 100644 --- a/compiler/src/iree/compiler/Codegen/Common/GPU/GPUMaterializeEncoding.cpp +++ b/compiler/src/iree/compiler/Codegen/Common/GPU/GPUMaterializeEncoding.cpp @@ -271,14 +271,6 @@ materializeFuncOpEncodings(FunctionOpInterface funcOp, MLIRContext *ctx = funcOp.getContext(); { RewritePatternSet patterns(ctx); - // On GPU, we use transposeNarrowN=false for a combination of reasons: - // 1. As linalg.matmul materializes into iree_gpu.multi_mma, which inherits - // its semantics from the wrapped intrinsic, we can't rely on any kind of - // LHS<->RHS symmetry. - // 2. We do not currently use ukernels, which would be one of the main areas - // to benefit from transposeNarrowN. - // 3. 
Heuristics for cache-friendly dispatch tiling are internal to the GPU - // runtime, so we don't need a simplification at that level either. IREE::GPU::TargetAttr gpuTargetAttr; if (targetAttr) { gpuTargetAttr = getGPUTargetAttr(targetAttr); @@ -286,7 +278,6 @@ materializeFuncOpEncodings(FunctionOpInterface funcOp, gpuTargetAttr = getCLGPUTarget(ctx); } MaterializeEncodingTypeConverter typeConverter( - /*transposeNarrowN=*/false, cast( IREE::GPU::GPUEncodingLayoutAttr::get(ctx, gpuTargetAttr))); MaterializeEncodingConversionTarget target(*ctx); diff --git a/compiler/src/iree/compiler/Codegen/Common/MaterializeEncodingIntoNop.cpp b/compiler/src/iree/compiler/Codegen/Common/MaterializeEncodingIntoNop.cpp index 35355e843723..4de4b454478a 100644 --- a/compiler/src/iree/compiler/Codegen/Common/MaterializeEncodingIntoNop.cpp +++ b/compiler/src/iree/compiler/Codegen/Common/MaterializeEncodingIntoNop.cpp @@ -46,7 +46,6 @@ struct MaterializeEncodingIntoNopPass final RewritePatternSet materializeEncodingPattern(context); MaterializeEncodingTypeConverter typeConverter( - /*transposeNarrowN=*/false, IREE::Codegen::EncodingNopLayoutAttr::get(context)); MaterializeEncodingConversionTarget target(*context); populateMaterializeEncodingIntoPackUnPackPatterns( diff --git a/compiler/src/iree/compiler/Codegen/Common/MaterializeEncodingIntoPackUnPack.cpp b/compiler/src/iree/compiler/Codegen/Common/MaterializeEncodingIntoPackUnPack.cpp index ad2ce7c48c05..087d91dccf41 100644 --- a/compiler/src/iree/compiler/Codegen/Common/MaterializeEncodingIntoPackUnPack.cpp +++ b/compiler/src/iree/compiler/Codegen/Common/MaterializeEncodingIntoPackUnPack.cpp @@ -101,22 +101,6 @@ getInnerTileSizesOfr(OpBuilder &rewriter, Location loc, return result; } -static void transposeInPlace(MaterializeEncodingInfo &info) { - // Vector cases: nothing to do. 
- if (info.innerTileSizes.size() < 2) { - return; - } - // Not a vector case, so all three arrays in `info` have size at least 2, - // outerDimsPerm may have size 3 if there is a batch dimension, but in all - // cases, the last 2 entries of each array are M and N, not batch. - auto transpose = [](SmallVector &a) { - std::swap(a[a.size() - 2], a[a.size() - 1]); - }; - transpose(info.innerDimsPos); - transpose(info.innerTileSizes); - transpose(info.outerDimsPerm); -} - //===---------------------------------------------------------------------===// // Methods to convert `set_encoding` and `unset_encoding` operations // to `pack` and `unpack` operations respectively. @@ -139,9 +123,6 @@ FailureOr lowerSetEncodingOpToPackOp( if (!encoding) { return failure(); } - if (typeConverter.getTransposeNarrowN() && isNarrowNResult(encoding)) { - transposeInPlace(encodingInfo); - } // Create `tensor.empty` operation for the result of the pack operation. Location loc = encodingOp.getLoc(); @@ -180,10 +161,6 @@ FailureOr lowerUnsetEncodingToUnpackOp( return packedValue; } - auto encoding = IREE::Encoding::getEncodingAttr(sourceType); - if (typeConverter.getTransposeNarrowN() && isNarrowNResult(encoding)) { - transposeInPlace(encodingInfo); - } // Create an `tensor.empty` for the result of the unpack operation. 
Location loc = encodingOp.getLoc(); SmallVector resultDims = @@ -222,11 +199,6 @@ lowerOpWithEncoding(RewriterBase &rewriter, tensor::EmptyOp emptyOp, .getOperation(); } - if (typeConverter.getTransposeNarrowN() && - isNarrowNResult(IREE::Encoding::getEncodingAttr(emptyType))) { - transposeInPlace(encodingInfo); - } - FailureOr> innerTileSizesOfr = getInnerTileSizesOfr( rewriter, loc, emptyType, encodingInfo, materializeEncodingValueFn); if (failed(innerTileSizesOfr)) { @@ -389,10 +361,6 @@ static FailureOr> getPackedDimsForDispatchTensor( if (IREE::Codegen::isIdentityLayout(encodingInfo)) { return failure(); } - if (typeConverter.getTransposeNarrowN() && - isNarrowNResult(IREE::Encoding::getEncodingAttr(boundTensorType))) { - transposeInPlace(encodingInfo); - } SmallVector targetShape = getMixedValues(boundTensorType.getShape(), dynamicDims, builder); diff --git a/compiler/src/iree/compiler/Codegen/ExternalInterfaces/CPUEncodingExternalModels.cpp b/compiler/src/iree/compiler/Codegen/ExternalInterfaces/CPUEncodingExternalModels.cpp index 89de2e6dcc16..3461a29c6b63 100644 --- a/compiler/src/iree/compiler/Codegen/ExternalInterfaces/CPUEncodingExternalModels.cpp +++ b/compiler/src/iree/compiler/Codegen/ExternalInterfaces/CPUEncodingExternalModels.cpp @@ -3,6 +3,30 @@ // Licensed under the Apache License v2.0 with LLVM Exceptions. // See https://llvm.org/LICENSE.txt for license information. // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +//===- CPUEncodingExternalModels.cpp --------------------------------------===// +// +// This file implements the IREE::Codegen::LayoutAttrInterface for CPU backends +// and the VMVX backend. In these backends, we transpose narrow-N into narrow-M +// for a combination of reasons: +// +// 1. As linalg.matmul materializes into linalg.mmt4d, which has a transposed +// RHS and therefore LHS<->RHS symmetry, transposeNarrowN is easy to +// implement at that level. +// 2. 
We use ukernels, and this allows writing 2x fewer narrow ukernels. +// 3. Heuristics for cache-friendly dispatch tiling can get complex on CPU, +// so it is nice that they have fewer narrow cases to consider. +// +// This transposition is made easier by (and was all along part of the idea in) +// the RHS-transposition in mmt4d (the t in mmt4d), as generally with matrix +// multiplication +// +// B * Transpose(A) == Transpose( A * Transpose(B) ) +// +// so in mmt4d terms +// +// mmt4d(B, A) == Transpose(mmt4d(A, B)) +// +//===---------------------------------------------------------------------===// #include "iree/compiler/Codegen/ExternalInterfaces/CPUEncodingExternalModels.h" @@ -12,6 +36,7 @@ #include "iree/compiler/Codegen/Dialect/Codegen/Utils/Utils.h" #include "iree/compiler/Codegen/Utils/Utils.h" #include "iree/compiler/Dialect/Encoding/IR/EncodingOps.h" +#include "iree/compiler/Dialect/Encoding/IR/EncodingTypes.h" #include "llvm/Support/Debug.h" #include "mlir/Dialect/Linalg/IR/LinalgInterfaces.h" @@ -28,6 +53,22 @@ namespace { // Utilities. //===----------------------------------------------------------------------===// +static void transposeInPlace(MaterializeEncodingInfo &info) { + // Vector cases: nothing to do. + if (info.innerTileSizes.size() < 2) { + return; + } + // Not a vector case, so all three arrays in `info` have size at least 2, + // outerDimsPerm may have size 3 if there is a batch dimension, but in all + // cases, the last 2 entries of each array are M and N, not batch. + auto transpose = [](SmallVector &a) { + std::swap(a[a.size() - 2], a[a.size() - 1]); + }; + transpose(info.innerDimsPos); + transpose(info.innerTileSizes); + transpose(info.outerDimsPerm); +} + static RankedTensorType dropEncoding(RankedTensorType type) { return RankedTensorType::get(type.getShape(), type.getElementType()); } @@ -576,7 +617,11 @@ struct CPUDeviceEncodingLayoutAttrInterface // taking narrow dimensions into account. 
TileMxNxK chosenTileMxNxK = chooseMatmulTile( enumeratedTileMxNxK, narrowDim, encoding.getRoundDimsToArray()); - return getEncodingInfoForMatmul(encoding, chosenTileMxNxK); + info = getEncodingInfoForMatmul(encoding, chosenTileMxNxK); + if (Encoding::isNarrowNResult(encoding)) { + transposeInPlace(info); + } + return info; } Operation *lowerOp(Attribute attr, OpBuilder &b, Operation *op, @@ -660,7 +705,11 @@ struct VMVXDeviceEncodingLayoutAttrInterface // taking narrow dimensions into account. TileMxNxK chosenTileMxNxK = chooseMatmulTile( enumeratedTileMxNxK, narrowDim, encoding.getRoundDimsToArray()); - return getEncodingInfoForMatmul(encoding, chosenTileMxNxK); + info = getEncodingInfoForMatmul(encoding, chosenTileMxNxK); + if (Encoding::isNarrowNResult(encoding)) { + transposeInPlace(info); + } + return info; } Operation *lowerOp(Attribute attr, OpBuilder &b, Operation *op, diff --git a/compiler/src/iree/compiler/Codegen/ExternalInterfaces/GPUEncodingExternalModels.cpp b/compiler/src/iree/compiler/Codegen/ExternalInterfaces/GPUEncodingExternalModels.cpp index 031c158fed8e..b3bd093e52fb 100644 --- a/compiler/src/iree/compiler/Codegen/ExternalInterfaces/GPUEncodingExternalModels.cpp +++ b/compiler/src/iree/compiler/Codegen/ExternalInterfaces/GPUEncodingExternalModels.cpp @@ -3,6 +3,21 @@ // Licensed under the Apache License v2.0 with LLVM Exceptions. // See https://llvm.org/LICENSE.txt for license information. // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +//===- GPUEncodingExternalModels.cpp --------------------------------------===// +// +// This file implements the IREE::Codegen::LayoutAttrInterface for GPU backends. +// Different from CPU backends, we do not transpose narrow-N to narrow-M for a +// combination of reasons: +// +// 1. As linalg.matmul materializes into iree_gpu.multi_mma, which inherits +// its semantics from the wrapped intrinsic, we can't rely on any kind of +// LHS<->RHS symmetry. +// 2. 
We do not currently use ukernels, which would be one of the main areas +// to benefit from transposeNarrowN. +// 3. Heuristics for cache-friendly dispatch tiling are internal to the GPU +// runtime, so we don't need a simplification at that level either. +// +//===---------------------------------------------------------------------===// #include "iree/compiler/Codegen/ExternalInterfaces/GPUEncodingExternalModels.h" From 05ce39f3fba4b5cc6eee18a431f8f8e16fa9b5d2 Mon Sep 17 00:00:00 2001 From: Han-Chung Wang Date: Mon, 16 Dec 2024 02:12:26 -0800 Subject: [PATCH 18/64] [DT] Unify encoding materialization pass into a single pass. (#19454) This revision creates a generic materialization pass and uses it for backends that implement data-tiling. After months of development, we have found that the needs of GPU are a superset of the needs of CPU. To be more specific, the GPU layout has an additional "swizzle" field. This means that the GPU set_encoding/unset_encoding lowering patterns cover the needs of the CPU path. The lowering of contraction ops is different: CPU lowers them to the mmt4d op, while GPU lowers them to the multi_mma op. However, the lowering of contraction ops is implemented through an attribute interface. Thus, we can have a generic pattern to lower contraction ops. To make the review process easier, the revision is split into 5 commits. 1. The first commit creates the MaterializeEncoding pass and copy-pastes the GPU patterns: SetEncodingOpLoweringConversion, UnSetEncodingOpLoweringConversion, and MaterializeContractionOp. It also updates the GPU tests to use the new pass. 2. GPU data-tiling does not currently support element-wise generic op lowering. The second commit moves the pattern to the shared pattern set and bails out when a swizzle is present. This is an NFC for both pipelines. 3. The third commit replaces the existing materialization pass with the generic pass, and deletes all the legacy passes. 4. 
The fourth commit moves the lit tests from `Common/[CPU|GPU]/test` to `Common/test`. 5. Now there are duplicate patterns for set_encoding, unset_encoding, and contraction ops lowering. The last commit deletes the legacy patterns, and moves the patterns from MaterializeEncoding.cpp to where the legacy patterns were located. Furthermore, it renames the file to `MaterializeEncodingPatterns.cpp`. The revision retains the MaterializeEncodingIntoNop pass and adds a TODO item, because it is still used by the MaterializeHomogeneousEncoding pass. It can be deleted once we deprecate the early materialization path. --------- Signed-off-by: hanhanW --- .../iree/compiler/Codegen/Common/BUILD.bazel | 7 +- .../compiler/Codegen/Common/CMakeLists.txt | 7 +- .../compiler/Codegen/Common/CPU/BUILD.bazel | 4 - .../Codegen/Common/CPU/CMakeLists.txt | 4 - .../compiler/Codegen/Common/CPU/Passes.td | 20 - .../Codegen/Common/CPU/test/BUILD.bazel | 2 - .../Codegen/Common/CPU/test/CMakeLists.txt | 2 - .../compiler/Codegen/Common/EncodingUtils.h | 14 +- .../compiler/Codegen/Common/GPU/BUILD.bazel | 4 - .../Codegen/Common/GPU/CMakeLists.txt | 4 - .../Common/GPU/GPUMaterializeEncoding.cpp | 398 ------------------ .../compiler/Codegen/Common/GPU/Passes.td | 10 - .../Codegen/Common/GPU/test/BUILD.bazel | 4 - .../Codegen/Common/GPU/test/CMakeLists.txt | 4 - ...eEncodings.cpp => MaterializeEncoding.cpp} | 131 +++--- .../Common/MaterializeEncodingIntoNop.cpp | 8 +- ...ck.cpp => MaterializeEncodingPatterns.cpp} | 238 ++++++++--- .../iree/compiler/Codegen/Common/Passes.td | 15 + .../compiler/Codegen/Common/test/BUILD.bazel | 10 +- .../Codegen/Common/test/CMakeLists.txt | 6 + .../gpu_materialize_encoding_gfx1100.mlir | 2 +- .../test/gpu_materialize_encoding_gfx908.mlir | 2 +- .../test/gpu_materialize_encoding_gfx90a.mlir | 2 +- .../test/gpu_materialize_encoding_gfx942.mlir | 2 +- .../test/llvmcpu_materialize_encoding.mlir | 106 ++--- .../test/vmvx_materialize_encoding.mlir | 2 +- 
.../iree/compiler/Codegen/LLVMCPU/Passes.cpp | 2 +- .../src/iree/compiler/Codegen/Utils/Utils.cpp | 4 + .../src/iree/compiler/Codegen/Utils/Utils.h | 3 +- .../Dialect/VMVX/Transforms/Passes.cpp | 2 +- .../compiler/GlobalOptimization/BUILD.bazel | 2 - .../GlobalOptimization/CMakeLists.txt | 2 - .../MaterializeHomogeneousEncodings.cpp | 6 +- 33 files changed, 357 insertions(+), 672 deletions(-) delete mode 100644 compiler/src/iree/compiler/Codegen/Common/GPU/GPUMaterializeEncoding.cpp rename compiler/src/iree/compiler/Codegen/Common/{CPU/CPUMaterializeEncodings.cpp => MaterializeEncoding.cpp} (64%) rename compiler/src/iree/compiler/Codegen/Common/{MaterializeEncodingIntoPackUnPack.cpp => MaterializeEncodingPatterns.cpp} (85%) rename compiler/src/iree/compiler/Codegen/Common/{GPU => }/test/gpu_materialize_encoding_gfx1100.mlir (98%) rename compiler/src/iree/compiler/Codegen/Common/{GPU => }/test/gpu_materialize_encoding_gfx908.mlir (98%) rename compiler/src/iree/compiler/Codegen/Common/{GPU => }/test/gpu_materialize_encoding_gfx90a.mlir (99%) rename compiler/src/iree/compiler/Codegen/Common/{GPU => }/test/gpu_materialize_encoding_gfx942.mlir (99%) rename compiler/src/iree/compiler/Codegen/Common/{CPU => }/test/llvmcpu_materialize_encoding.mlir (97%) rename compiler/src/iree/compiler/Codegen/Common/{CPU => }/test/vmvx_materialize_encoding.mlir (99%) diff --git a/compiler/src/iree/compiler/Codegen/Common/BUILD.bazel b/compiler/src/iree/compiler/Codegen/Common/BUILD.bazel index e3513ba69d29..f95b0fa81551 100644 --- a/compiler/src/iree/compiler/Codegen/Common/BUILD.bazel +++ b/compiler/src/iree/compiler/Codegen/Common/BUILD.bazel @@ -125,8 +125,9 @@ iree_compiler_cc_library( "LinkTuningSpecsPass.cpp", "LowerExecutableUsingTransformDialect.cpp", "LowerUKernelsToCalls.cpp", + "MaterializeEncoding.cpp", "MaterializeEncodingIntoNop.cpp", - "MaterializeEncodingIntoPackUnPack.cpp", + "MaterializeEncodingPatterns.cpp", "MaterializeTuningSpecsPass.cpp", "MemrefCopyToLinalg.cpp", 
"NormalizeLoopBounds.cpp", @@ -173,8 +174,10 @@ iree_compiler_cc_library( ":PassHeaders", ":PassesIncGen", "//compiler/src/iree/compiler/Codegen/Common:FoldTensorExtractOpIncGen", + "//compiler/src/iree/compiler/Codegen/Dialect/CPU/IR:IREECPUDialect", "//compiler/src/iree/compiler/Codegen/Dialect/Codegen/IR:IREECodegenDialect", "//compiler/src/iree/compiler/Codegen/Dialect/Codegen/Utils", + "//compiler/src/iree/compiler/Codegen/Dialect/GPU/IR:IREEGPUDialect", "//compiler/src/iree/compiler/Codegen/Dialect/VectorExt/IR:IREEVectorExtDialect", "//compiler/src/iree/compiler/Codegen/Interfaces:BufferizationInterfaces", "//compiler/src/iree/compiler/Codegen/Interfaces:PartitionableLoopsInterface", @@ -183,9 +186,11 @@ iree_compiler_cc_library( "//compiler/src/iree/compiler/Codegen/Utils", "//compiler/src/iree/compiler/Dialect/Encoding/IR", "//compiler/src/iree/compiler/Dialect/Flow/IR", + "//compiler/src/iree/compiler/Dialect/HAL/Analysis", "//compiler/src/iree/compiler/Dialect/HAL/IR", "//compiler/src/iree/compiler/Dialect/LinalgExt/IR", "//compiler/src/iree/compiler/Dialect/LinalgExt/Transforms", + "//compiler/src/iree/compiler/Dialect/Stream/Analysis", "//compiler/src/iree/compiler/Dialect/Util/Analysis", "//compiler/src/iree/compiler/Dialect/Util/IR", "//compiler/src/iree/compiler/Utils", diff --git a/compiler/src/iree/compiler/Codegen/Common/CMakeLists.txt b/compiler/src/iree/compiler/Codegen/Common/CMakeLists.txt index adec8aad7583..af3c55725838 100644 --- a/compiler/src/iree/compiler/Codegen/Common/CMakeLists.txt +++ b/compiler/src/iree/compiler/Codegen/Common/CMakeLists.txt @@ -117,8 +117,9 @@ iree_cc_library( "LinkTuningSpecsPass.cpp" "LowerExecutableUsingTransformDialect.cpp" "LowerUKernelsToCalls.cpp" + "MaterializeEncoding.cpp" "MaterializeEncodingIntoNop.cpp" - "MaterializeEncodingIntoPackUnPack.cpp" + "MaterializeEncodingPatterns.cpp" "MaterializeTuningSpecsPass.cpp" "MemrefCopyToLinalg.cpp" "NormalizeLoopBounds.cpp" @@ -203,8 +204,10 @@ iree_cc_library( 
MLIRVectorTransforms MLIRViewLikeInterface iree::compiler::Codegen::Common::FoldTensorExtractOpIncGen + iree::compiler::Codegen::Dialect::CPU::IR::IREECPUDialect iree::compiler::Codegen::Dialect::Codegen::IR::IREECodegenDialect iree::compiler::Codegen::Dialect::Codegen::Utils + iree::compiler::Codegen::Dialect::GPU::IR::IREEGPUDialect iree::compiler::Codegen::Dialect::VectorExt::IR::IREEVectorExtDialect iree::compiler::Codegen::Interfaces::BufferizationInterfaces iree::compiler::Codegen::Interfaces::PartitionableLoopsInterface @@ -213,9 +216,11 @@ iree_cc_library( iree::compiler::Codegen::Utils iree::compiler::Dialect::Encoding::IR iree::compiler::Dialect::Flow::IR + iree::compiler::Dialect::HAL::Analysis iree::compiler::Dialect::HAL::IR iree::compiler::Dialect::LinalgExt::IR iree::compiler::Dialect::LinalgExt::Transforms + iree::compiler::Dialect::Stream::Analysis iree::compiler::Dialect::Util::Analysis iree::compiler::Dialect::Util::IR iree::compiler::Utils diff --git a/compiler/src/iree/compiler/Codegen/Common/CPU/BUILD.bazel b/compiler/src/iree/compiler/Codegen/Common/CPU/BUILD.bazel index f1053da29240..05fb9bed4203 100644 --- a/compiler/src/iree/compiler/Codegen/Common/CPU/BUILD.bazel +++ b/compiler/src/iree/compiler/Codegen/Common/CPU/BUILD.bazel @@ -45,7 +45,6 @@ iree_compiler_cc_library( name = "CommonCPUPasses", srcs = [ "CPULowerToUKernels.cpp", - "CPUMaterializeEncodings.cpp", "CPUPrepareUkernels.cpp", "Passes.cpp", ], @@ -56,16 +55,13 @@ iree_compiler_cc_library( ":PassHeaders", ":PassesIncGen", "//compiler/src/iree/compiler/Codegen/Common", - "//compiler/src/iree/compiler/Codegen/Dialect/CPU/IR:IREECPUDialect", "//compiler/src/iree/compiler/Codegen/Dialect/Codegen/IR:IREECodegenDialect", "//compiler/src/iree/compiler/Codegen/Dialect/Codegen/Utils", "//compiler/src/iree/compiler/Codegen/Interfaces:UKernelOpInterface", "//compiler/src/iree/compiler/Codegen/Transforms", "//compiler/src/iree/compiler/Codegen/Utils", 
"//compiler/src/iree/compiler/Dialect/Encoding/IR", - "//compiler/src/iree/compiler/Dialect/HAL/Analysis", "//compiler/src/iree/compiler/Dialect/HAL/IR", - "//compiler/src/iree/compiler/Dialect/Stream/Analysis", "//runtime/src/iree/builtins/ukernel:exported_bits", "@llvm-project//llvm:Support", "@llvm-project//mlir:AffineDialect", diff --git a/compiler/src/iree/compiler/Codegen/Common/CPU/CMakeLists.txt b/compiler/src/iree/compiler/Codegen/Common/CPU/CMakeLists.txt index 75db95e43291..419c4b0878c9 100644 --- a/compiler/src/iree/compiler/Codegen/Common/CPU/CMakeLists.txt +++ b/compiler/src/iree/compiler/Codegen/Common/CPU/CMakeLists.txt @@ -42,7 +42,6 @@ iree_cc_library( "Passes.h" SRCS "CPULowerToUKernels.cpp" - "CPUMaterializeEncodings.cpp" "CPUPrepareUkernels.cpp" "Passes.cpp" DEPS @@ -78,16 +77,13 @@ iree_cc_library( MLIRVectorTransforms iree::builtins::ukernel::exported_bits iree::compiler::Codegen::Common - iree::compiler::Codegen::Dialect::CPU::IR::IREECPUDialect iree::compiler::Codegen::Dialect::Codegen::IR::IREECodegenDialect iree::compiler::Codegen::Dialect::Codegen::Utils iree::compiler::Codegen::Interfaces::UKernelOpInterface iree::compiler::Codegen::Transforms iree::compiler::Codegen::Utils iree::compiler::Dialect::Encoding::IR - iree::compiler::Dialect::HAL::Analysis iree::compiler::Dialect::HAL::IR - iree::compiler::Dialect::Stream::Analysis PUBLIC ) diff --git a/compiler/src/iree/compiler/Codegen/Common/CPU/Passes.td b/compiler/src/iree/compiler/Codegen/Common/CPU/Passes.td index 8c73c5bca4a9..394de5414ea1 100644 --- a/compiler/src/iree/compiler/Codegen/Common/CPU/Passes.td +++ b/compiler/src/iree/compiler/Codegen/Common/CPU/Passes.td @@ -13,26 +13,6 @@ include "mlir/Pass/PassBase.td" // Common Passes used for CPU-like backends (keep alphabetical) //===---------------------------------------------------------------------===// -def CPUMaterializeHostEncodingPass : - Pass<"iree-codegen-cpu-materialize-host-encoding", "mlir::ModuleOp"> { - let summary = 
"Convert encoding-specific operations based on target attributes."; - let description = [{ - Examples: - encoding.set_encoding -> tensor.pack - encoding.unset_encoding -> tensor.unpack - linalg.matmul -> linalg.mmt4d "}]; -} - -def CPUMaterializeDeviceEncodingPass : - InterfacePass<"iree-codegen-cpu-materialize-device-encoding", "mlir::FunctionOpInterface"> { - let summary = "Convert encoding-specific operations based on target attributes."; - let description = [{ - Examples: - encoding.set_encoding -> tensor.pack - encoding.unset_encoding -> tensor.unpack - linalg.matmul -> linalg.mmt4d "}]; -} - def CPULowerToUKernelsPass : Pass<"iree-codegen-cpu-lower-to-ukernels", ""> { let summary = diff --git a/compiler/src/iree/compiler/Codegen/Common/CPU/test/BUILD.bazel b/compiler/src/iree/compiler/Codegen/Common/CPU/test/BUILD.bazel index b2d6b916f713..fe5caa3434e2 100644 --- a/compiler/src/iree/compiler/Codegen/Common/CPU/test/BUILD.bazel +++ b/compiler/src/iree/compiler/Codegen/Common/CPU/test/BUILD.bazel @@ -19,10 +19,8 @@ iree_lit_test_suite( srcs = enforce_glob( # keep sorted [ - "llvmcpu_materialize_encoding.mlir", "lower_to_ukernel_ops.mlir", "prepare_ukernels.mlir", - "vmvx_materialize_encoding.mlir", ], include = ["*.mlir"], ), diff --git a/compiler/src/iree/compiler/Codegen/Common/CPU/test/CMakeLists.txt b/compiler/src/iree/compiler/Codegen/Common/CPU/test/CMakeLists.txt index 3dd9de7f98cc..100058fea35a 100644 --- a/compiler/src/iree/compiler/Codegen/Common/CPU/test/CMakeLists.txt +++ b/compiler/src/iree/compiler/Codegen/Common/CPU/test/CMakeLists.txt @@ -14,10 +14,8 @@ iree_lit_test_suite( NAME lit SRCS - "llvmcpu_materialize_encoding.mlir" "lower_to_ukernel_ops.mlir" "prepare_ukernels.mlir" - "vmvx_materialize_encoding.mlir" TOOLS FileCheck iree-opt diff --git a/compiler/src/iree/compiler/Codegen/Common/EncodingUtils.h b/compiler/src/iree/compiler/Codegen/Common/EncodingUtils.h index da9aa8a41731..bf188b66cf54 100644 --- 
a/compiler/src/iree/compiler/Codegen/Common/EncodingUtils.h +++ b/compiler/src/iree/compiler/Codegen/Common/EncodingUtils.h @@ -93,17 +93,9 @@ FailureOr lowerUnsetEncodingToUnpackOp( Value packedValue, const MaterializeEncodingTypeConverter &typeConverter, MaterializeEncodingValueFn materializeEncodingValueFn); -/// Pouplates the set of patterns that lowers set_encoding, unset_encoding, and -/// upstream dialect ops with encoding types to pack/unpack ops. -void populateMaterializeEncodingIntoPackUnPackPatterns( - RewritePatternSet &patterns, - MaterializeEncodingTypeConverter &typeConverter, - MaterializeEncodingValueFn materializeEncodingValueFn); - -/// Pouplates the set of patterns that lowers shape-like operations (e.g., Flow -/// ops, Hal ops, tensor.empty, linalg.fill, etc) with encoding types to the -/// same op with materialized shapes. -void populateShapeIndependentMaterializeEncodingPatterns( +/// Pouplates the set of patterns that lowers operations with encoding types to +/// operations without encodings. 
+void populateMaterializeEncodingPatterns( RewritePatternSet &patterns, MaterializeEncodingConversionTarget &target, MaterializeEncodingTypeConverter &typeConverter, MaterializeEncodingValueFn materializeEncodingValueFn); diff --git a/compiler/src/iree/compiler/Codegen/Common/GPU/BUILD.bazel b/compiler/src/iree/compiler/Codegen/Common/GPU/BUILD.bazel index 128ffa9fc46e..66177778f683 100644 --- a/compiler/src/iree/compiler/Codegen/Common/GPU/BUILD.bazel +++ b/compiler/src/iree/compiler/Codegen/Common/GPU/BUILD.bazel @@ -65,7 +65,6 @@ iree_compiler_cc_library( "GPUGreedilyDistributeToThreads.cpp", "GPUInferMemorySpace.cpp", "GPULowerToUKernels.cpp", - "GPUMaterializeEncoding.cpp", "GPUMultiBuffering.cpp", "GPUNestedLayoutDistributionPatterns.cpp", "GPUPackToIntrinsics.cpp", @@ -107,10 +106,7 @@ iree_compiler_cc_library( "//compiler/src/iree/compiler/Codegen/Transforms", "//compiler/src/iree/compiler/Codegen/Utils", "//compiler/src/iree/compiler/Codegen/Utils:VectorOpUtils", - "//compiler/src/iree/compiler/Dialect/Encoding/IR", - "//compiler/src/iree/compiler/Dialect/HAL/Analysis", "//compiler/src/iree/compiler/Dialect/HAL/IR", - "//compiler/src/iree/compiler/Dialect/Stream/Analysis", "//compiler/src/iree/compiler/Utils", "@llvm-project//llvm:Support", "@llvm-project//mlir:AMDGPUDialect", diff --git a/compiler/src/iree/compiler/Codegen/Common/GPU/CMakeLists.txt b/compiler/src/iree/compiler/Codegen/Common/GPU/CMakeLists.txt index 97d324042e2c..2f065df2bb52 100644 --- a/compiler/src/iree/compiler/Codegen/Common/GPU/CMakeLists.txt +++ b/compiler/src/iree/compiler/Codegen/Common/GPU/CMakeLists.txt @@ -63,7 +63,6 @@ iree_cc_library( "GPUGreedilyDistributeToThreads.cpp" "GPUInferMemorySpace.cpp" "GPULowerToUKernels.cpp" - "GPUMaterializeEncoding.cpp" "GPUMultiBuffering.cpp" "GPUNestedLayoutDistributionPatterns.cpp" "GPUPackToIntrinsics.cpp" @@ -140,10 +139,7 @@ iree_cc_library( iree::compiler::Codegen::Transforms iree::compiler::Codegen::Utils 
iree::compiler::Codegen::Utils::VectorOpUtils - iree::compiler::Dialect::Encoding::IR - iree::compiler::Dialect::HAL::Analysis iree::compiler::Dialect::HAL::IR - iree::compiler::Dialect::Stream::Analysis iree::compiler::Utils PUBLIC ) diff --git a/compiler/src/iree/compiler/Codegen/Common/GPU/GPUMaterializeEncoding.cpp b/compiler/src/iree/compiler/Codegen/Common/GPU/GPUMaterializeEncoding.cpp deleted file mode 100644 index 32536085576b..000000000000 --- a/compiler/src/iree/compiler/Codegen/Common/GPU/GPUMaterializeEncoding.cpp +++ /dev/null @@ -1,398 +0,0 @@ -// Copyright 2024 The IREE Authors -// -// Licensed under the Apache License v2.0 with LLVM Exceptions. -// See https://llvm.org/LICENSE.txt for license information. -// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception - -#include "iree/compiler/Codegen/Common/EncodingUtils.h" -#include "iree/compiler/Codegen/Common/GPU/Passes.h" -#include "iree/compiler/Codegen/Dialect/Codegen/Utils/Utils.h" -#include "iree/compiler/Codegen/Dialect/GPU/IR/IREEGPUAttrs.h" -#include "iree/compiler/Codegen/Dialect/GPU/IR/IREEGPUDialect.h" -#include "iree/compiler/Codegen/Utils/GPUUtils.h" -#include "iree/compiler/Dialect/Encoding/IR/EncodingDialect.h" -#include "iree/compiler/Dialect/Encoding/IR/EncodingOps.h" -#include "iree/compiler/Dialect/HAL/Analysis/DeviceAnalysis.h" -#include "iree/compiler/Dialect/HAL/IR/HALTypes.h" -#include "iree/compiler/Dialect/Stream/Analysis/Affinity.h" -#include "llvm/ADT/SmallVector.h" -#include "mlir/Dialect/Linalg/IR/Linalg.h" -#include "mlir/Dialect/MemRef/Transforms/Transforms.h" -#include "mlir/Dialect/Tensor/IR/Tensor.h" -#include "mlir/Dialect/Tensor/Transforms/Transforms.h" -#include "mlir/Dialect/Utils/IndexingUtils.h" -#include "mlir/Dialect/Utils/ReshapeOpsUtils.h" -#include "mlir/IR/BuiltinTypes.h" -#include "mlir/IR/MLIRContext.h" -#include "mlir/Transforms/GreedyPatternRewriteDriver.h" - -#define DEBUG_TYPE "iree-codegen-gpu-materialize-encoding" - -namespace 
mlir::iree_compiler { - -#define GEN_PASS_DEF_GPUMATERIALIZEDEVICEENCODINGPASS -#define GEN_PASS_DEF_GPUMATERIALIZEHOSTENCODINGPASS -#include "iree/compiler/Codegen/Common/GPU/Passes.h.inc" - -using IREE::Codegen::MaterializeEncodingInfo; -using IREE::Codegen::TileSwizzle; - -namespace { - -// TODO(hanchung): Delete this pass and rely on tensor-based analysis to -// materialize encodings based on where tensors are used. This pass is not able -// to handle that. -struct GPUMaterializeHostEncodingPass - : public impl::GPUMaterializeHostEncodingPassBase< - GPUMaterializeHostEncodingPass> { - void getDependentDialects(DialectRegistry &registry) const override { - registry.insert(); - } - - void runOnOperation() override; -}; - -struct GPUMaterializeDeviceEncodingPass final - : impl::GPUMaterializeDeviceEncodingPassBase< - GPUMaterializeDeviceEncodingPass> { - using GPUMaterializeDeviceEncodingPassBase:: - GPUMaterializeDeviceEncodingPassBase; - void getDependentDialects(DialectRegistry &registry) const override { - registry.insert(); - } - void runOnOperation() override; -}; - -SmallVector -getReassociationIndices(int outerDims, - const TileSwizzle::ExpandShapeType &expandShape) { - SmallVector result; - int expandedIdx = 0; - for (int i = 0; i < outerDims; ++i) { - result.push_back({expandedIdx++}); - } - for (auto expandShapeDim : expandShape) { - result.push_back({}); - for (int i = 0, e = expandShapeDim.size(); i < e; ++i) { - result.back().push_back(expandedIdx++); - } - } - return result; -} - -/// Convert iree_linalg_ext.set_encoding op to pack + tile swizzling ops. We use -/// expand_shape + linalg.transpose to represent a tile swizzling op. 
-struct GPUSetEncodingOpLoweringConversion - : public OpMaterializeEncodingPattern { - using OpMaterializeEncodingPattern< - IREE::Encoding::SetEncodingOp>::OpMaterializeEncodingPattern; - - LogicalResult - matchAndRewrite(IREE::Encoding::SetEncodingOp encodingOp, OpAdaptor adaptor, - ConversionPatternRewriter &rewriter) const override { - auto converter = static_cast( - getTypeConverter()); - auto packedValue = lowerSetEncodingOpToPackOp( - rewriter, encodingOp, adaptor.getSource(), *converter, - this->materializeEncodingValueFn); - if (failed(packedValue)) { - Type targetType = - getTypeConverter()->convertType(encodingOp.getResultType()); - Value result = rewriter.createOrFold( - encodingOp.getLoc(), targetType, adaptor.getSource()); - rewriter.replaceOp(encodingOp, result); - return success(); - } - - MaterializeEncodingInfo encodingInfo = - converter->getEncodingInfo(encodingOp.getResultType()); - if (!encodingInfo.swizzle) { - rewriter.replaceOp(encodingOp, packedValue.value()); - return success(); - } - - Location loc = encodingOp.getLoc(); - - // Create expand_shape op to tile the innermost two dimensions. 
- int origRank = encodingOp.getSourceType().getRank(); - SmallVector expandShapeShape( - cast(packedValue->getType()) - .getShape() - .take_front(origRank)); - expandShapeShape.append( - getExpandedTileShape(encodingInfo.swizzle->expandShape)); - RankedTensorType expandShapeType = - encodingOp.getSourceType().clone(expandShapeShape); - - SmallVector reassociation = - getReassociationIndices(origRank, encodingInfo.swizzle->expandShape); - auto expandShapeOp = rewriter.create( - loc, expandShapeType, packedValue.value(), reassociation); - - SmallVector transposePerm = - llvm::to_vector(llvm::seq(0, origRank)); - for (auto perm : encodingInfo.swizzle->permutation) { - transposePerm.push_back(origRank + perm); - } - SmallVector transposeResultDims = - tensor::getMixedSizes(rewriter, loc, expandShapeOp.getResult()); - applyPermutationToVector(transposeResultDims, transposePerm); - - auto emptyTensor = rewriter.create( - loc, transposeResultDims, encodingOp.getSourceType().getElementType()); - auto transposeOp = rewriter.create( - loc, expandShapeOp, emptyTensor, transposePerm); - rewriter.replaceOp(encodingOp, transposeOp->getResult(0)); - - return success(); - } -}; - -struct GPUUnsetEncodingOpLoweringConversion - : public OpMaterializeEncodingPattern { - using OpMaterializeEncodingPattern< - IREE::Encoding::UnsetEncodingOp>::OpMaterializeEncodingPattern; - - LogicalResult - matchAndRewrite(IREE::Encoding::UnsetEncodingOp unsetEncodingOp, - OpAdaptor adaptor, - ConversionPatternRewriter &rewriter) const override { - auto converter = static_cast( - getTypeConverter()); - - MaterializeEncodingInfo encodingInfo = - converter->getEncodingInfo(unsetEncodingOp.getSource().getType()); - if (IREE::Codegen::isIdentityLayout(encodingInfo)) { - Type targetType = - getTypeConverter()->convertType(unsetEncodingOp.getSourceType()); - Value result = rewriter.createOrFold( - unsetEncodingOp.getLoc(), targetType, adaptor.getSource()); - rewriter.replaceOp(unsetEncodingOp, result); - 
return success(); - } - - Location loc = unsetEncodingOp.getLoc(); - Value unpackSrc = adaptor.getSource(); - if (encodingInfo.swizzle) { - int targetRank = unsetEncodingOp.getResultType().getRank(); - auto srcConvertedType = - cast(adaptor.getSource().getType()); - SmallVector emptyShape = - tensor::getMixedSizes(rewriter, loc, adaptor.getSource()); - emptyShape.resize(targetRank); - for (auto i : getExpandedTileShape(encodingInfo.swizzle->expandShape)) { - emptyShape.push_back(rewriter.getIndexAttr(i)); - } - auto emptyTensor = rewriter.create( - loc, emptyShape, unsetEncodingOp.getSourceType().getElementType()); - - SmallVector transposePerm = - llvm::to_vector(llvm::seq(0, targetRank)); - for (auto perm : encodingInfo.swizzle->permutation) { - transposePerm.push_back(targetRank + perm); - } - auto invertedTransposePerm = invertPermutationVector(transposePerm); - auto transposeOp = rewriter.create( - loc, adaptor.getSource(), emptyTensor, invertedTransposePerm); - - SmallVector reassociation = getReassociationIndices( - targetRank, encodingInfo.swizzle->expandShape); - SmallVector unpackSrcShape( - srcConvertedType.getShape().take_front(targetRank)); - unpackSrcShape.append(encodingInfo.innerTileSizes.begin(), - encodingInfo.innerTileSizes.end()); - RankedTensorType unpackSrcType = - unsetEncodingOp.getResultType().clone(unpackSrcShape); - unpackSrc = rewriter.create( - loc, unpackSrcType, transposeOp->getResult(0), reassociation); - } - - auto unpackedValue = lowerUnsetEncodingToUnpackOp( - rewriter, unsetEncodingOp, unpackSrc, *converter, - this->materializeEncodingValueFn); - if (failed(unpackedValue)) { - Type targetType = - getTypeConverter()->convertType(unsetEncodingOp.getResultType()); - Value result = rewriter.createOrFold(loc, targetType, - adaptor.getSource()); - rewriter.replaceOp(unsetEncodingOp, result); - return success(); - } - rewriter.replaceOp(unsetEncodingOp, unpackedValue.value()); - return success(); - } -}; - -class GPUConvertToMultiMma 
final - : public OpInterfaceConversionPattern { -public: - using OpInterfaceConversionPattern< - linalg::ContractionOpInterface>::OpInterfaceConversionPattern; - - GPUConvertToMultiMma( - MLIRContext *context, - const MaterializeEncodingTypeConverter &typeConverter, - MaterializeEncodingValueFn materializeEncodingValueFn = {}, - PatternBenefit benefit = 1) - : OpInterfaceConversionPattern( - typeConverter, context, benefit), - materializeEncodingValueFn(materializeEncodingValueFn) {} - - LogicalResult - matchAndRewrite(linalg::ContractionOpInterface op, ArrayRef operands, - ConversionPatternRewriter &rewriter) const override { - auto converter = static_cast( - this->getTypeConverter()); - auto layoutAttr = converter->getLayoutAttr(); - assert(layoutAttr && "layoutAttr is not set, which is not expected. Are " - "you adding new arch support?"); - SmallVector convertedResTypes; - auto linalgOp = cast(op.getOperation()); - for (auto init : linalgOp.getDpsInits()) { - convertedResTypes.push_back(converter->convertType(init.getType())); - } - Operation *newOp = - layoutAttr.lowerOp(rewriter, op, convertedResTypes, operands); - rewriter.replaceOp(op, newOp->getResults()); - return success(); - } - -protected: - const MaterializeEncodingValueFn materializeEncodingValueFn; -}; - -static LogicalResult -materializeFuncOpEncodings(FunctionOpInterface funcOp, - IREE::HAL::ExecutableTargetAttr targetAttr) { - MLIRContext *ctx = funcOp.getContext(); - { - RewritePatternSet patterns(ctx); - IREE::GPU::TargetAttr gpuTargetAttr; - if (targetAttr) { - gpuTargetAttr = getGPUTargetAttr(targetAttr); - } else { - gpuTargetAttr = getCLGPUTarget(ctx); - } - MaterializeEncodingTypeConverter typeConverter( - cast( - IREE::GPU::GPUEncodingLayoutAttr::get(ctx, gpuTargetAttr))); - MaterializeEncodingConversionTarget target(*ctx); - MaterializeEncodingValueFn materializeEncodingValueFn = - [](RankedTensorType, OpBuilder, - Location) -> FailureOr { return {}; }; - 
populateShapeIndependentMaterializeEncodingPatterns( - patterns, target, typeConverter, materializeEncodingValueFn); - - patterns.insert( - ctx, typeConverter, materializeEncodingValueFn); - - memref::populateResolveRankedShapedTypeResultDimsPatterns(patterns); - if (failed(applyPartialConversion(funcOp, target, std::move(patterns)))) { - funcOp.emitOpError("materialization failed"); - return failure(); - } - } - - // Add patterns to fold pack/unpack ops with pad/extract_slice ops and - // resolve dims ops. - { - RewritePatternSet patterns(ctx); - tensor::CastOp::getCanonicalizationPatterns(patterns, ctx); - tensor::populateFoldIntoPackAndUnpackPatterns(patterns); - memref::populateResolveRankedShapedTypeResultDimsPatterns(patterns); - if (failed(applyPatternsAndFoldGreedily(funcOp, std::move(patterns)))) { - funcOp.emitOpError("folding patterns failed"); - return failure(); - } - } - - return success(); -} - -static std::optional> -getFuncExecutableTargetAttrs(FunctionOpInterface funcOp, - IREE::Stream::AffinityAnalysis &affinityAnalysis, - IREE::HAL::DeviceAnalysis &deviceAnalysis) { - // Get a set of all unique affinities used by resources within the function. - SetVector uniqueAffinityAttrs; - SmallVector lookupAffinityAttrs; - funcOp.walk([&](Operation *op) { - if (affinityAnalysis.tryLookupExecutionAffinity(op, lookupAffinityAttrs)) { - uniqueAffinityAttrs.insert(lookupAffinityAttrs.begin(), - lookupAffinityAttrs.end()); - } - lookupAffinityAttrs.clear(); - }); - - // Resolve affinities to executable targets. - SetVector executableTargetAttrs; - for (auto affinityAttr : uniqueAffinityAttrs) { - deviceAnalysis.gatherRequiredExecutableTargets(affinityAttr, funcOp, - executableTargetAttrs); - } - return executableTargetAttrs; -} - -} // namespace - -void GPUMaterializeHostEncodingPass::runOnOperation() { - auto moduleOp = getOperation(); - - // Run required analysis passes. 
- IREE::Stream::AffinityAnalysis affinityAnalysis(moduleOp); - if (failed(affinityAnalysis.run())) { - return signalPassFailure(); - } - IREE::HAL::DeviceAnalysis deviceAnalysis(moduleOp); - if (failed(deviceAnalysis.run())) { - return signalPassFailure(); - } - - for (auto funcOp : moduleOp.getOps()) { - // Gather the required executable targets for the function. Note that it's - // possible there are more required for ops nested within the function but - // this pass is a hack and can't handle that :shrug:. - auto executableTargets = - getFuncExecutableTargetAttrs(funcOp, affinityAnalysis, deviceAnalysis); - if (!executableTargets) { - funcOp.emitOpError() - << "could not determine executable targets for the function"; - return signalPassFailure(); - } else if (executableTargets->empty()) { - // Probably no tensors. - continue; - } - - // HACK: this pass is run on the host _but shouldn't be_. Because it's - // run on the host and IREE is a compiler capable of multi-targeting there - // may be multiple executable targets at any point in the host program. - // This pass can't handle that and assumes it's been checked earlier by - // spooky action at a distance. This needs to be fixed. - if (executableTargets->size() != 1) { - funcOp.emitOpError() << "has multiple executable targets and CPU data " - "tiling isn't built to support that"; - return signalPassFailure(); - } - - // Materialize encodings within the function. 
- if (failed( - materializeFuncOpEncodings(funcOp, executableTargets->front()))) { - return signalPassFailure(); - } - } -} - -void GPUMaterializeDeviceEncodingPass::runOnOperation() { - FunctionOpInterface funcOp = getOperation(); - auto targetAttr = IREE::HAL::ExecutableTargetAttr::lookup(funcOp); - if (failed(materializeFuncOpEncodings(funcOp, targetAttr))) { - return signalPassFailure(); - } -} - -} // namespace mlir::iree_compiler diff --git a/compiler/src/iree/compiler/Codegen/Common/GPU/Passes.td b/compiler/src/iree/compiler/Codegen/Common/GPU/Passes.td index 2c25e02852f4..ff2b2b94f9b2 100644 --- a/compiler/src/iree/compiler/Codegen/Common/GPU/Passes.td +++ b/compiler/src/iree/compiler/Codegen/Common/GPU/Passes.td @@ -247,16 +247,6 @@ def GPUApplyTilingLevelPass : ]; } -def GPUMaterializeHostEncodingPass : - Pass<"iree-codegen-gpu-materialize-host-encoding", "mlir::ModuleOp"> { - let summary = "Materialize the encoding for tensor as specified by the backend."; -} - -def GPUMaterializeDeviceEncodingPass : - InterfacePass<"iree-codegen-gpu-materialize-device-encoding", "mlir::FunctionOpInterface"> { - let summary = "Materialize the encoding for tensor as specified by the backend."; -} - def GPUTensorTileToSerialLoopsPass : InterfacePass<"iree-codegen-gpu-tensor-tile-to-serial-loops", "mlir::FunctionOpInterface"> { let summary = "Pass to tile reduction dimensions for certain GPU ops"; diff --git a/compiler/src/iree/compiler/Codegen/Common/GPU/test/BUILD.bazel b/compiler/src/iree/compiler/Codegen/Common/GPU/test/BUILD.bazel index 030e6f4de497..2f3b092d5676 100644 --- a/compiler/src/iree/compiler/Codegen/Common/GPU/test/BUILD.bazel +++ b/compiler/src/iree/compiler/Codegen/Common/GPU/test/BUILD.bazel @@ -32,10 +32,6 @@ iree_lit_test_suite( "gpu_infer_memory_space.mlir", "gpu_combine_value_barriers.mlir", "gpu_lower_to_ukernels.mlir", - "gpu_materialize_encoding_gfx908.mlir", - "gpu_materialize_encoding_gfx90a.mlir", - "gpu_materialize_encoding_gfx942.mlir", - 
"gpu_materialize_encoding_gfx1100.mlir", "gpu_nested_layout_contract_amdgpu.mlir", "gpu_nested_layout_vector_distribution.mlir", "gpu_nested_layout_vector_distribution_step.mlir", diff --git a/compiler/src/iree/compiler/Codegen/Common/GPU/test/CMakeLists.txt b/compiler/src/iree/compiler/Codegen/Common/GPU/test/CMakeLists.txt index 6d1f540f420a..50be391693cc 100644 --- a/compiler/src/iree/compiler/Codegen/Common/GPU/test/CMakeLists.txt +++ b/compiler/src/iree/compiler/Codegen/Common/GPU/test/CMakeLists.txt @@ -27,10 +27,6 @@ iree_lit_test_suite( "gpu_greedily_distribute_to_threads.mlir" "gpu_infer_memory_space.mlir" "gpu_lower_to_ukernels.mlir" - "gpu_materialize_encoding_gfx1100.mlir" - "gpu_materialize_encoding_gfx908.mlir" - "gpu_materialize_encoding_gfx90a.mlir" - "gpu_materialize_encoding_gfx942.mlir" "gpu_nested_layout_contract_amdgpu.mlir" "gpu_nested_layout_vector_distribution.mlir" "gpu_nested_layout_vector_distribution_step.mlir" diff --git a/compiler/src/iree/compiler/Codegen/Common/CPU/CPUMaterializeEncodings.cpp b/compiler/src/iree/compiler/Codegen/Common/MaterializeEncoding.cpp similarity index 64% rename from compiler/src/iree/compiler/Codegen/Common/CPU/CPUMaterializeEncodings.cpp rename to compiler/src/iree/compiler/Codegen/Common/MaterializeEncoding.cpp index d182649f64ac..f1776b90f74e 100644 --- a/compiler/src/iree/compiler/Codegen/Common/CPU/CPUMaterializeEncodings.cpp +++ b/compiler/src/iree/compiler/Codegen/Common/MaterializeEncoding.cpp @@ -1,47 +1,46 @@ -// Copyright 2023 The IREE Authors +// Copyright 2024 The IREE Authors // // Licensed under the Apache License v2.0 with LLVM Exceptions. // See https://llvm.org/LICENSE.txt for license information. 
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception -#include "iree/compiler/Codegen/Common/CPU/Passes.h" #include "iree/compiler/Codegen/Common/EncodingUtils.h" +#include "iree/compiler/Codegen/Common/PassUtils.h" +#include "iree/compiler/Codegen/Common/Passes.h" #include "iree/compiler/Codegen/Dialect/CPU/IR/IREECPUDialect.h" #include "iree/compiler/Codegen/Dialect/CPU/IR/IREECPUTypes.h" #include "iree/compiler/Codegen/Dialect/Codegen/IR/IREECodegenAttrs.h" #include "iree/compiler/Codegen/Dialect/Codegen/IR/IREECodegenDialect.h" #include "iree/compiler/Codegen/Dialect/Codegen/IR/IREECodegenOps.h" -#include "iree/compiler/Codegen/Dialect/Codegen/Utils/Utils.h" +#include "iree/compiler/Codegen/Dialect/Codegen/IR/IREECodegenTypes.h" +#include "iree/compiler/Codegen/Dialect/GPU/IR/IREEGPUAttrs.h" +#include "iree/compiler/Codegen/Dialect/GPU/IR/IREEGPUDialect.h" +#include "iree/compiler/Codegen/Utils/GPUUtils.h" #include "iree/compiler/Codegen/Utils/Utils.h" #include "iree/compiler/Dialect/Encoding/IR/EncodingOps.h" #include "iree/compiler/Dialect/HAL/Analysis/DeviceAnalysis.h" #include "iree/compiler/Dialect/HAL/IR/HALTypes.h" #include "iree/compiler/Dialect/Stream/Analysis/Affinity.h" -#include "llvm/ADT/STLExtras.h" -#include "llvm/ADT/SmallVector.h" -#include "llvm/Support/MathExtras.h" -#include "mlir/Dialect/Arith/IR/Arith.h" -#include "mlir/Dialect/Linalg/IR/LinalgInterfaces.h" #include "mlir/Dialect/MemRef/Transforms/Transforms.h" -#include "mlir/Dialect/Tensor/Transforms/Transforms.h" -#include "mlir/IR/BuiltinAttributes.h" -#include "mlir/IR/BuiltinTypeInterfaces.h" -#include "mlir/IR/BuiltinTypes.h" +#include "mlir/Dialect/Tensor/IR/Tensor.h" +#include "mlir/Pass/PassManager.h" #include "mlir/Transforms/DialectConversion.h" #include "mlir/Transforms/GreedyPatternRewriteDriver.h" +#include "mlir/Transforms/Passes.h" -#define DEBUG_TYPE "cpu-materialize-encoding" +#define DEBUG_TYPE "iree-codegen--materialize-encoding" #define DBGS() (llvm::dbgs() << 
"[" DEBUG_TYPE "]: ") #define LDBG(X) LLVM_DEBUG(DBGS() << X << "\n") namespace mlir::iree_compiler { -using IREE::Codegen::MaterializeEncodingInfo; -using IREE::Codegen::TileMxNxK; +#define GEN_PASS_DEF_MATERIALIZEDEVICEENCODINGPASS +#define GEN_PASS_DEF_MATERIALIZEHOSTENCODINGPASS +#include "iree/compiler/Codegen/Common/Passes.h.inc" -#define GEN_PASS_DEF_CPUMATERIALIZEDEVICEENCODINGPASS -#define GEN_PASS_DEF_CPUMATERIALIZEHOSTENCODINGPASS -#include "iree/compiler/Codegen/Common/CPU/Passes.h.inc" +using namespace IREE::Encoding; + +namespace { static FailureOr chooseDynamicEncodingInfoVMVXMicrokernels(RankedTensorType tensorType, @@ -64,33 +63,46 @@ getMaterializeEncodingValueFn(IREE::HAL::ExecutableTargetAttr targetAttr) { static LogicalResult materializeFuncOpEncodings(FunctionOpInterface funcOp, - IREE::HAL::ExecutableTargetAttr targetAttr) { + IREE::HAL::ExecutableTargetAttr targetAttr, + bool testCLGPUTarget = false) { MLIRContext *ctx = funcOp.getContext(); - RewritePatternSet materializeEncodingPattern(ctx); - DictionaryAttr targetConfig = targetAttr.getConfiguration(); - IREE::Codegen::LayoutAttrInterface layoutAttr; - if (isVMVXBackend(targetAttr)) { - LDBG("Select VMVXEncodingLayoutAttr attribute as the layout attribute."); - layoutAttr = cast( - IREE::CPU::VMVXEncodingLayoutAttr::get(ctx, targetConfig)); - } else { - LDBG("Select CPUEncodingLayoutAttr attribute as the layout attribute."); - layoutAttr = cast( - IREE::CPU::CPUEncodingLayoutAttr::get(ctx, targetConfig)); - } - MaterializeEncodingTypeConverter typeConverter(layoutAttr); - MaterializeEncodingConversionTarget target(*ctx); - auto materializeEncodingValueFn = getMaterializeEncodingValueFn(targetAttr); - populateMaterializeEncodingIntoPackUnPackPatterns( - materializeEncodingPattern, typeConverter, materializeEncodingValueFn); - populateShapeIndependentMaterializeEncodingPatterns( - materializeEncodingPattern, target, typeConverter, - materializeEncodingValueFn); - - if 
(failed(applyPartialConversion(funcOp, target, - std::move(materializeEncodingPattern)))) { - funcOp.emitOpError("materialization failed"); - return failure(); + { + RewritePatternSet patterns(ctx); + IREE::Codegen::LayoutAttrInterface layoutAttr; + if (isVMVXBackend(targetAttr)) { + LDBG("Select VMVXEncodingLayoutAttr attribute as the layout attribute."); + layoutAttr = cast( + IREE::CPU::VMVXEncodingLayoutAttr::get( + ctx, targetAttr.getConfiguration())); + } else if (isLLVMCPUBackend(targetAttr)) { + LDBG("Select CPUEncodingLayoutAttr attribute as the layout attribute."); + layoutAttr = cast( + IREE::CPU::CPUEncodingLayoutAttr::get(ctx, + targetAttr.getConfiguration())); + } else if (isROCMBackend(targetAttr)) { + LDBG("Select GPUEncodingLayoutAttr attribute as the layout attribute."); + layoutAttr = cast( + IREE::GPU::GPUEncodingLayoutAttr::get(ctx, + getGPUTargetAttr(targetAttr))); + } else if (testCLGPUTarget) { + LDBG("Select GPUEncodingLayoutAttr attribute as the layout attribute. 
" + "(testCLGPUTarget)"); + layoutAttr = cast( + IREE::GPU::GPUEncodingLayoutAttr::get(ctx, getCLGPUTarget(ctx))); + } else { + LDBG("Select EncodingNopLayoutAttr attribute as the layout attribute."); + layoutAttr = IREE::Codegen::EncodingNopLayoutAttr::get(ctx); + } + MaterializeEncodingTypeConverter typeConverter(layoutAttr); + MaterializeEncodingConversionTarget target(*ctx); + auto materializeEncodingValueFn = getMaterializeEncodingValueFn(targetAttr); + populateMaterializeEncodingPatterns(patterns, target, typeConverter, + materializeEncodingValueFn); + + if (failed(applyPartialConversion(funcOp, target, std::move(patterns)))) { + funcOp.emitOpError("materialization failed"); + return failure(); + } } // Add patterns to fold pack/unpack ops with pad/extract_slice ops and @@ -138,13 +150,13 @@ getFuncExecutableTargetAttrs(FunctionOpInterface funcOp, return executableTargetAttrs; } -struct CPUMaterializeHostEncodingPass - : public impl::CPUMaterializeHostEncodingPassBase< - CPUMaterializeHostEncodingPass> { +struct MaterializeHostEncodingPass + : public impl::MaterializeHostEncodingPassBase< + MaterializeHostEncodingPass> { void getDependentDialects(DialectRegistry &registry) const override { - registry - .insert(); + registry.insert(); } void runOnOperation() override { @@ -199,22 +211,27 @@ struct CPUMaterializeHostEncodingPass // that. It should _not_ be running on both - target-specific codegen passes // are not allowed on host programs and it's a big violation of layering that // this exists. 
-struct CPUMaterializeDeviceEncodingPass - : public impl::CPUMaterializeDeviceEncodingPassBase< - CPUMaterializeDeviceEncodingPass> { +struct MaterializeDeviceEncodingPass + : public impl::MaterializeDeviceEncodingPassBase< + MaterializeDeviceEncodingPass> { + using impl::MaterializeDeviceEncodingPassBase< + MaterializeDeviceEncodingPass>::MaterializeDeviceEncodingPassBase; + void getDependentDialects(DialectRegistry &registry) const override { - registry - .insert(); + registry.insert(); } void runOnOperation() override { auto funcOp = getOperation(); auto executableTargetAttr = IREE::HAL::ExecutableTargetAttr::lookup(funcOp); - if (failed(materializeFuncOpEncodings(funcOp, executableTargetAttr))) { + if (failed(materializeFuncOpEncodings(funcOp, executableTargetAttr, + testCLGPUTarget))) { return signalPassFailure(); } } }; +} // namespace } // namespace mlir::iree_compiler diff --git a/compiler/src/iree/compiler/Codegen/Common/MaterializeEncodingIntoNop.cpp b/compiler/src/iree/compiler/Codegen/Common/MaterializeEncodingIntoNop.cpp index 4de4b454478a..d93cb98014de 100644 --- a/compiler/src/iree/compiler/Codegen/Common/MaterializeEncodingIntoNop.cpp +++ b/compiler/src/iree/compiler/Codegen/Common/MaterializeEncodingIntoNop.cpp @@ -48,11 +48,9 @@ struct MaterializeEncodingIntoNopPass final MaterializeEncodingTypeConverter typeConverter( IREE::Codegen::EncodingNopLayoutAttr::get(context)); MaterializeEncodingConversionTarget target(*context); - populateMaterializeEncodingIntoPackUnPackPatterns( - materializeEncodingPattern, typeConverter, materializeEncodingValueFn); - populateShapeIndependentMaterializeEncodingPatterns( - materializeEncodingPattern, target, typeConverter, - materializeEncodingValueFn); + populateMaterializeEncodingPatterns(materializeEncodingPattern, target, + typeConverter, + materializeEncodingValueFn); if (failed(applyPartialConversion(operation, target, std::move(materializeEncodingPattern)))) { diff --git 
a/compiler/src/iree/compiler/Codegen/Common/MaterializeEncodingIntoPackUnPack.cpp b/compiler/src/iree/compiler/Codegen/Common/MaterializeEncodingPatterns.cpp similarity index 85% rename from compiler/src/iree/compiler/Codegen/Common/MaterializeEncodingIntoPackUnPack.cpp rename to compiler/src/iree/compiler/Codegen/Common/MaterializeEncodingPatterns.cpp index 087d91dccf41..cd3d27e5c7f9 100644 --- a/compiler/src/iree/compiler/Codegen/Common/MaterializeEncodingIntoPackUnPack.cpp +++ b/compiler/src/iree/compiler/Codegen/Common/MaterializeEncodingPatterns.cpp @@ -32,6 +32,7 @@ namespace mlir::iree_compiler { using IREE::Codegen::MaterializeEncodingInfo; +using IREE::Codegen::TileSwizzle; //===---------------------------------------------------------------------===// // Utility methods @@ -237,6 +238,10 @@ static FailureOr lowerGenericOpWithEncoding( return rewriter.notifyMatchFailure( genericOp, "MaterializeEncodingInfo failed for output"); } + if (outMaterializeEncodingInfo.swizzle) { + return rewriter.notifyMatchFailure( + genericOp, "generic op lowering does not support swizzle yet"); + } auto convertedResultType = cast(convertedOutputOperands[0].getType()); @@ -561,60 +566,6 @@ struct MaterializeFlowDispatchTensorStoreOp // the core conversion utilities. //===---------------------------------------------------------------------===// -/// Convert `set_encoding` op to `pack` op. 
-struct SetEncodingOpToPackOpConversion - : public OpMaterializeEncodingPattern { - using OpMaterializeEncodingPattern< - IREE::Encoding::SetEncodingOp>::OpMaterializeEncodingPattern; - - LogicalResult - matchAndRewrite(IREE::Encoding::SetEncodingOp encodingOp, OpAdaptor adaptor, - ConversionPatternRewriter &rewriter) const override { - auto converter = static_cast( - getTypeConverter()); - auto packOp = lowerSetEncodingOpToPackOp(rewriter, encodingOp, - adaptor.getSource(), *converter, - this->materializeEncodingValueFn); - if (failed(packOp)) { - Type targetType = - getTypeConverter()->convertType(encodingOp.getResultType()); - Value result = rewriter.createOrFold( - encodingOp.getLoc(), targetType, adaptor.getSource()); - rewriter.replaceOp(encodingOp, result); - return success(); - } - rewriter.replaceOp(encodingOp, packOp.value()); - return success(); - } -}; - -/// Convert `unset_encoding` op to `unpack` op. -struct UnsetEncodingOpToUnPackOpConversion - : public OpMaterializeEncodingPattern { - using OpMaterializeEncodingPattern< - IREE::Encoding::UnsetEncodingOp>::OpMaterializeEncodingPattern; - - LogicalResult - matchAndRewrite(IREE::Encoding::UnsetEncodingOp encodingOp, OpAdaptor adaptor, - ConversionPatternRewriter &rewriter) const override { - auto converter = static_cast( - this->getTypeConverter()); - auto unpackedValue = lowerUnsetEncodingToUnpackOp( - rewriter, encodingOp, adaptor.getSource(), *converter, - this->materializeEncodingValueFn); - if (failed(unpackedValue)) { - Type targetType = - getTypeConverter()->convertType(encodingOp.getResultType()); - Value result = rewriter.createOrFold( - encodingOp.getLoc(), targetType, adaptor.getSource()); - rewriter.replaceOp(encodingOp, result); - return success(); - } - rewriter.replaceOp(encodingOp, unpackedValue.value()); - return success(); - } -}; - /// Generic pattern to convert operation that is in Destination Passing Style. 
template struct MaterializeDPSOperation : public OpMaterializeEncodingPattern { @@ -685,6 +636,166 @@ struct MaterializeOptimizationBarrierOp } }; +static SmallVector +getReassociationIndices(int outerDims, + const TileSwizzle::ExpandShapeType &expandShape) { + SmallVector result; + int expandedIdx = 0; + for (int i = 0; i < outerDims; ++i) { + result.push_back({expandedIdx++}); + } + for (auto expandShapeDim : expandShape) { + result.push_back({}); + for (int i = 0, e = expandShapeDim.size(); i < e; ++i) { + result.back().push_back(expandedIdx++); + } + } + return result; +} + +/// Convert iree_linalg_ext.set_encoding op to pack + tile swizzling ops. We use +/// expand_shape + linalg.transpose to represent a tile swizzling op. +struct SetEncodingOpLoweringConversion + : public OpMaterializeEncodingPattern { + using OpMaterializeEncodingPattern< + IREE::Encoding::SetEncodingOp>::OpMaterializeEncodingPattern; + + LogicalResult + matchAndRewrite(IREE::Encoding::SetEncodingOp encodingOp, OpAdaptor adaptor, + ConversionPatternRewriter &rewriter) const override { + auto converter = static_cast( + getTypeConverter()); + auto packedValue = lowerSetEncodingOpToPackOp( + rewriter, encodingOp, adaptor.getSource(), *converter, + this->materializeEncodingValueFn); + if (failed(packedValue)) { + Type targetType = + getTypeConverter()->convertType(encodingOp.getResultType()); + Value result = rewriter.createOrFold( + encodingOp.getLoc(), targetType, adaptor.getSource()); + rewriter.replaceOp(encodingOp, result); + return success(); + } + + MaterializeEncodingInfo encodingInfo = + converter->getEncodingInfo(encodingOp.getResultType()); + if (!encodingInfo.swizzle) { + rewriter.replaceOp(encodingOp, packedValue.value()); + return success(); + } + + Location loc = encodingOp.getLoc(); + + // Create expand_shape op to tile the innermost two dimensions. 
+ int origRank = encodingOp.getSourceType().getRank(); + SmallVector expandShapeShape( + cast(packedValue->getType()) + .getShape() + .take_front(origRank)); + expandShapeShape.append( + getExpandedTileShape(encodingInfo.swizzle->expandShape)); + RankedTensorType expandShapeType = + encodingOp.getSourceType().clone(expandShapeShape); + + SmallVector reassociation = + getReassociationIndices(origRank, encodingInfo.swizzle->expandShape); + auto expandShapeOp = rewriter.create( + loc, expandShapeType, packedValue.value(), reassociation); + + SmallVector transposePerm = + llvm::to_vector(llvm::seq(0, origRank)); + for (auto perm : encodingInfo.swizzle->permutation) { + transposePerm.push_back(origRank + perm); + } + SmallVector transposeResultDims = + tensor::getMixedSizes(rewriter, loc, expandShapeOp.getResult()); + applyPermutationToVector(transposeResultDims, transposePerm); + + auto emptyTensor = rewriter.create( + loc, transposeResultDims, encodingOp.getSourceType().getElementType()); + auto transposeOp = rewriter.create( + loc, expandShapeOp, emptyTensor, transposePerm); + rewriter.replaceOp(encodingOp, transposeOp->getResult(0)); + + return success(); + } +}; + +struct UnsetEncodingOpLoweringConversion + : public OpMaterializeEncodingPattern { + using OpMaterializeEncodingPattern< + IREE::Encoding::UnsetEncodingOp>::OpMaterializeEncodingPattern; + + LogicalResult + matchAndRewrite(IREE::Encoding::UnsetEncodingOp unsetEncodingOp, + OpAdaptor adaptor, + ConversionPatternRewriter &rewriter) const override { + auto converter = static_cast( + getTypeConverter()); + + MaterializeEncodingInfo encodingInfo = + converter->getEncodingInfo(unsetEncodingOp.getSource().getType()); + if (IREE::Codegen::isIdentityLayout(encodingInfo)) { + Type targetType = + getTypeConverter()->convertType(unsetEncodingOp.getSourceType()); + Value result = rewriter.createOrFold( + unsetEncodingOp.getLoc(), targetType, adaptor.getSource()); + rewriter.replaceOp(unsetEncodingOp, result); + 
return success(); + } + + Location loc = unsetEncodingOp.getLoc(); + Value unpackSrc = adaptor.getSource(); + if (encodingInfo.swizzle) { + int targetRank = unsetEncodingOp.getResultType().getRank(); + auto srcConvertedType = + cast(adaptor.getSource().getType()); + SmallVector emptyShape = + tensor::getMixedSizes(rewriter, loc, adaptor.getSource()); + emptyShape.resize(targetRank); + for (auto i : getExpandedTileShape(encodingInfo.swizzle->expandShape)) { + emptyShape.push_back(rewriter.getIndexAttr(i)); + } + auto emptyTensor = rewriter.create( + loc, emptyShape, unsetEncodingOp.getSourceType().getElementType()); + + SmallVector transposePerm = + llvm::to_vector(llvm::seq(0, targetRank)); + for (auto perm : encodingInfo.swizzle->permutation) { + transposePerm.push_back(targetRank + perm); + } + auto invertedTransposePerm = invertPermutationVector(transposePerm); + auto transposeOp = rewriter.create( + loc, adaptor.getSource(), emptyTensor, invertedTransposePerm); + + SmallVector reassociation = getReassociationIndices( + targetRank, encodingInfo.swizzle->expandShape); + SmallVector unpackSrcShape( + srcConvertedType.getShape().take_front(targetRank)); + unpackSrcShape.append(encodingInfo.innerTileSizes.begin(), + encodingInfo.innerTileSizes.end()); + RankedTensorType unpackSrcType = + unsetEncodingOp.getResultType().clone(unpackSrcShape); + unpackSrc = rewriter.create( + loc, unpackSrcType, transposeOp->getResult(0), reassociation); + } + + auto unpackedValue = lowerUnsetEncodingToUnpackOp( + rewriter, unsetEncodingOp, unpackSrc, *converter, + this->materializeEncodingValueFn); + if (failed(unpackedValue)) { + Type targetType = + getTypeConverter()->convertType(unsetEncodingOp.getResultType()); + Value result = rewriter.createOrFold(loc, targetType, + adaptor.getSource()); + rewriter.replaceOp(unsetEncodingOp, result); + return success(); + } + rewriter.replaceOp(unsetEncodingOp, unpackedValue.value()); + return success(); + } +}; + /// Pattern to convert 
contraction operations. class MaterializeContractionOp : public OpInterfaceConversionPattern { @@ -726,21 +837,7 @@ class MaterializeContractionOp } // namespace -void populateMaterializeEncodingIntoPackUnPackPatterns( - RewritePatternSet &patterns, - MaterializeEncodingTypeConverter &typeConverter, - MaterializeEncodingValueFn materializeEncodingValueFn) { - MLIRContext *context = patterns.getContext(); - // TODO(hanchung): Move the generic op pattern to ShapeIndependent category - // after we add the support for tile swizzling variants. - patterns.insert, - MaterializeContractionOp, SetEncodingOpToPackOpConversion, - UnsetEncodingOpToUnPackOpConversion>( - context, typeConverter, materializeEncodingValueFn); - memref::populateResolveRankedShapedTypeResultDimsPatterns(patterns); -} - -void populateShapeIndependentMaterializeEncodingPatterns( +void populateMaterializeEncodingPatterns( RewritePatternSet &patterns, MaterializeEncodingConversionTarget &target, MaterializeEncodingTypeConverter &typeConverter, MaterializeEncodingValueFn materializeEncodingValueFn) { @@ -767,7 +864,10 @@ void populateShapeIndependentMaterializeEncodingPatterns( }); patterns.insert< + MaterializeContractionOp, SetEncodingOpLoweringConversion, + UnsetEncodingOpLoweringConversion, MaterializeDPSOperation, + MaterializeDPSOperation, MaterializeOperation, MaterializeOptimizationBarrierOp, MaterializeFlowDispatchTensorLoadOp, MaterializeFlowDispatchTensorStoreOp, MaterializeInterfaceBindingEncoding>(context, typeConverter, diff --git a/compiler/src/iree/compiler/Codegen/Common/Passes.td b/compiler/src/iree/compiler/Codegen/Common/Passes.td index 5571aba9b1e4..5cc0d555ec24 100644 --- a/compiler/src/iree/compiler/Codegen/Common/Passes.td +++ b/compiler/src/iree/compiler/Codegen/Common/Passes.td @@ -431,6 +431,21 @@ def LowerUKernelOpsToCallsPass : let summary = "Lower micro-kernel wrapper ops into function calls"; } +def MaterializeHostEncodingPass : + 
Pass<"iree-codegen-materialize-host-encoding", "mlir::ModuleOp"> { + let summary = "Materialize the encoding for tensor as specified by the backend."; +} + +def MaterializeDeviceEncodingPass : + InterfacePass<"iree-codegen-materialize-device-encoding", "mlir::FunctionOpInterface"> { + let summary = "Materialize the encoding for tensor as specified by the backend."; + let options = [ + Option<"testCLGPUTarget", "test-cl-gpu-target", "bool", /*default=*/"false", + "Flag used for lit-testing GPU target only. Not for general usage">, + ]; +} + +// TODO(hanchung): Remove the pass after we deprecate MaterializeHomogeneousEncodingsPass. def MaterializeEncodingIntoNopPass : InterfacePass<"iree-codegen-materialize-encoding-into-nop", "mlir::FunctionOpInterface"> { let summary = "Drop the encodings from tensor types with encodings."; diff --git a/compiler/src/iree/compiler/Codegen/Common/test/BUILD.bazel b/compiler/src/iree/compiler/Codegen/Common/test/BUILD.bazel index 5de2e3d6b95e..f0652d2c3636 100644 --- a/compiler/src/iree/compiler/Codegen/Common/test/BUILD.bazel +++ b/compiler/src/iree/compiler/Codegen/Common/test/BUILD.bazel @@ -47,12 +47,17 @@ iree_lit_test_suite( "fold_tensor_extract_op.mlir", "forop_canonicalization.mlir", "generic_vectorization.mlir", + "gpu_materialize_encoding_gfx1100.mlir", + "gpu_materialize_encoding_gfx908.mlir", + "gpu_materialize_encoding_gfx90a.mlir", + "gpu_materialize_encoding_gfx942.mlir", "hoist_statically_bound_allocations.mlir", "hoist_unrolled_vector_extract_insert_slice.mlir", "iree_comprehensive_bufferize.mlir", "iree_expand_strided_metadata.mlir", "iree_loop_invariant_code_motion.mlir", "link_tuning_specs.mlir", + "llvmcpu_materialize_encoding.mlir", "lower_ukernel_to_calls.mlir", "materialize_encoding_into_nop.mlir", "materialize_tuning_specs.mlir", @@ -74,8 +79,8 @@ iree_lit_test_suite( "replace_slow_min_max_ops.mlir", "strip_compilation_info.mlir", "test_partitionable_loops_interface.mlir", - 
"tile_and_distribute_to_workgroups_func_scope.mlir", "tile_and_distribute_to_workgroups.mlir", + "tile_and_distribute_to_workgroups_func_scope.mlir", "tile_and_distribute_workgroups_using_forall.mlir", "tile_large_tensors.mlir", "transform_buffer_opt.mlir", @@ -88,10 +93,11 @@ iree_lit_test_suite( "type_propagation.mlir", "type_propagation_packing.mlir", "unroll_annotated_loops.mlir", + "vector_layout_analysis.mlir", "vectorize_memref_copy.mlir", "vectorize_tensor_pad.mlir", - "vector_layout_analysis.mlir", "verify_workgroup_distribution.mlir", + "vmvx_materialize_encoding.mlir", ], include = ["*.mlir"], exclude = [ diff --git a/compiler/src/iree/compiler/Codegen/Common/test/CMakeLists.txt b/compiler/src/iree/compiler/Codegen/Common/test/CMakeLists.txt index 4dc774caa54a..2d707f68c3aa 100644 --- a/compiler/src/iree/compiler/Codegen/Common/test/CMakeLists.txt +++ b/compiler/src/iree/compiler/Codegen/Common/test/CMakeLists.txt @@ -43,12 +43,17 @@ iree_lit_test_suite( "fold_tensor_extract_op.mlir" "forop_canonicalization.mlir" "generic_vectorization.mlir" + "gpu_materialize_encoding_gfx1100.mlir" + "gpu_materialize_encoding_gfx908.mlir" + "gpu_materialize_encoding_gfx90a.mlir" + "gpu_materialize_encoding_gfx942.mlir" "hoist_statically_bound_allocations.mlir" "hoist_unrolled_vector_extract_insert_slice.mlir" "iree_comprehensive_bufferize.mlir" "iree_expand_strided_metadata.mlir" "iree_loop_invariant_code_motion.mlir" "link_tuning_specs.mlir" + "llvmcpu_materialize_encoding.mlir" "lower_ukernel_to_calls.mlir" "materialize_encoding_into_nop.mlir" "materialize_tuning_specs.mlir" @@ -88,6 +93,7 @@ iree_lit_test_suite( "vectorize_memref_copy.mlir" "vectorize_tensor_pad.mlir" "verify_workgroup_distribution.mlir" + "vmvx_materialize_encoding.mlir" TOOLS FileCheck iree-opt diff --git a/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_materialize_encoding_gfx1100.mlir b/compiler/src/iree/compiler/Codegen/Common/test/gpu_materialize_encoding_gfx1100.mlir similarity index 
98% rename from compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_materialize_encoding_gfx1100.mlir rename to compiler/src/iree/compiler/Codegen/Common/test/gpu_materialize_encoding_gfx1100.mlir index bb0c61072bd3..645fd712442a 100644 --- a/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_materialize_encoding_gfx1100.mlir +++ b/compiler/src/iree/compiler/Codegen/Common/test/gpu_materialize_encoding_gfx1100.mlir @@ -1,4 +1,4 @@ -// RUN: iree-opt --pass-pipeline="builtin.module(func.func(iree-codegen-gpu-materialize-device-encoding))" \ +// RUN: iree-opt --pass-pipeline="builtin.module(func.func(iree-codegen-materialize-device-encoding{test-cl-gpu-target}))" \ // RUN: --iree-gpu-test-target=gfx1100 \ // RUN: --split-input-file %s | FileCheck %s diff --git a/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_materialize_encoding_gfx908.mlir b/compiler/src/iree/compiler/Codegen/Common/test/gpu_materialize_encoding_gfx908.mlir similarity index 98% rename from compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_materialize_encoding_gfx908.mlir rename to compiler/src/iree/compiler/Codegen/Common/test/gpu_materialize_encoding_gfx908.mlir index 4fca56365659..a9fc2bc66f62 100644 --- a/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_materialize_encoding_gfx908.mlir +++ b/compiler/src/iree/compiler/Codegen/Common/test/gpu_materialize_encoding_gfx908.mlir @@ -1,4 +1,4 @@ -// RUN: iree-opt --pass-pipeline="builtin.module(func.func(iree-codegen-gpu-materialize-device-encoding))" \ +// RUN: iree-opt --pass-pipeline="builtin.module(func.func(iree-codegen-materialize-device-encoding{test-cl-gpu-target}))" \ // RUN: --iree-gpu-test-target=gfx908 \ // RUN: --split-input-file %s | FileCheck %s diff --git a/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_materialize_encoding_gfx90a.mlir b/compiler/src/iree/compiler/Codegen/Common/test/gpu_materialize_encoding_gfx90a.mlir similarity index 99% rename from 
compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_materialize_encoding_gfx90a.mlir rename to compiler/src/iree/compiler/Codegen/Common/test/gpu_materialize_encoding_gfx90a.mlir index cc9cd9d30dbe..89fe357ba33b 100644 --- a/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_materialize_encoding_gfx90a.mlir +++ b/compiler/src/iree/compiler/Codegen/Common/test/gpu_materialize_encoding_gfx90a.mlir @@ -1,4 +1,4 @@ -// RUN: iree-opt --pass-pipeline="builtin.module(func.func(iree-codegen-gpu-materialize-device-encoding))" \ +// RUN: iree-opt --pass-pipeline="builtin.module(func.func(iree-codegen-materialize-device-encoding{test-cl-gpu-target}))" \ // RUN: --iree-gpu-test-target=gfx90a \ // RUN: --split-input-file %s | FileCheck %s diff --git a/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_materialize_encoding_gfx942.mlir b/compiler/src/iree/compiler/Codegen/Common/test/gpu_materialize_encoding_gfx942.mlir similarity index 99% rename from compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_materialize_encoding_gfx942.mlir rename to compiler/src/iree/compiler/Codegen/Common/test/gpu_materialize_encoding_gfx942.mlir index 3338de98ebbf..2544fc127f89 100644 --- a/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_materialize_encoding_gfx942.mlir +++ b/compiler/src/iree/compiler/Codegen/Common/test/gpu_materialize_encoding_gfx942.mlir @@ -1,4 +1,4 @@ -// RUN: iree-opt --pass-pipeline="builtin.module(func.func(iree-codegen-gpu-materialize-device-encoding))" \ +// RUN: iree-opt --pass-pipeline="builtin.module(func.func(iree-codegen-materialize-device-encoding{test-cl-gpu-target}))" \ // RUN: --iree-gpu-test-target=gfx942 \ // RUN: --split-input-file %s | FileCheck %s diff --git a/compiler/src/iree/compiler/Codegen/Common/CPU/test/llvmcpu_materialize_encoding.mlir b/compiler/src/iree/compiler/Codegen/Common/test/llvmcpu_materialize_encoding.mlir similarity index 97% rename from 
compiler/src/iree/compiler/Codegen/Common/CPU/test/llvmcpu_materialize_encoding.mlir rename to compiler/src/iree/compiler/Codegen/Common/test/llvmcpu_materialize_encoding.mlir index 553c134b9f78..25b69a7e31e2 100644 --- a/compiler/src/iree/compiler/Codegen/Common/CPU/test/llvmcpu_materialize_encoding.mlir +++ b/compiler/src/iree/compiler/Codegen/Common/test/llvmcpu_materialize_encoding.mlir @@ -1,4 +1,4 @@ -// RUN: iree-opt --pass-pipeline="builtin.module(func.func(iree-codegen-cpu-materialize-device-encoding),canonicalize,cse)" --split-input-file %s | FileCheck %s +// RUN: iree-opt --pass-pipeline="builtin.module(func.func(iree-codegen-materialize-device-encoding),canonicalize,cse)" --split-input-file %s | FileCheck %s #pipeline_layout = #hal.pipeline.layout, @@ -6,7 +6,7 @@ ]> #encoding = #iree_encoding.encoding (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>], round_dims_to = array> func.func @set_encoding_with_padding_semantics_bf16_x86_64_avx512f() attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512f"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512f"}> }{ %c0 = arith.constant 0 : index %0 = hal.interface.binding.subspan layout(#pipeline_layout) binding(0) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor> @@ -44,7 +44,7 @@ func.func @set_encoding_with_padding_semantics_bf16_x86_64_avx512f() attributes #map2 = affine_map<(d0, d1, d2) -> (d0, d1)> #encoding = #iree_encoding.encoding> func.func @set_encoding_7x7x7_matmul_LHS() attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx,+avx2,+fma"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx,+avx2,+fma"}> } { %c0 = arith.constant 0 : index %8 = 
hal.interface.binding.subspan layout(#pipeline_layout) binding(0) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor> @@ -74,7 +74,7 @@ func.func @set_encoding_7x7x7_matmul_LHS() attributes { #map2 = affine_map<(d0, d1, d2, d3) -> (d0, d1, d2)> #encoding = #iree_encoding.encoding> func.func @set_encoding_128x80x32_batch_matmul_LHS() attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx,+avx2,+fma"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx,+avx2,+fma"}> } { %c0 = arith.constant 0 : index %8 = hal.interface.binding.subspan layout(#pipeline_layout) binding(0) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor> @@ -105,7 +105,7 @@ func.func @set_encoding_128x80x32_batch_matmul_LHS() attributes { #map2 = affine_map<(d0, d1, d2, d3) -> (d0, d1, d2)> #encoding = #iree_encoding.encoding> func.func @set_encoding_128x32x320_batch_matmul_RHS() attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx,+avx2,+fma"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx,+avx2,+fma"}> } { %c0 = arith.constant 0 : index %0 = hal.interface.constant.load layout(#pipeline_layout) ordinal(0) : i32 @@ -138,7 +138,7 @@ func.func @set_encoding_128x32x320_batch_matmul_RHS() attributes { #map2 = affine_map<(d0, d1, d2, d3) -> (d0, d1, d2)> #encoding = #iree_encoding.encoding> func.func @unset_encoding_128x80x320_batch_matmul_RESULT() attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx,+avx2,+fma"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx,+avx2,+fma"}> } { %c0 = arith.constant 0 : index %0 = hal.interface.constant.load 
layout(#pipeline_layout) ordinal(0) : i32 @@ -176,7 +176,7 @@ func.func @unset_encoding_128x80x320_batch_matmul_RESULT() attributes { #encoding_rhs = #iree_encoding.encoding> #encoding_result = #iree_encoding.encoding> func.func @pack_gemm_fill_dynamic(%arg0 : tensor, %arg1 : tensor) -> tensor attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx,+avx2,+fma"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx,+avx2,+fma"}> } { %c0 = arith.constant 0 : index %c1 = arith.constant 1 : index @@ -224,7 +224,7 @@ func.func @pack_gemm_fill_dynamic(%arg0 : tensor, %arg1 : tensor (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>], round_dims_to = array> #encoding_result = #iree_encoding.encoding (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>], round_dims_to = array> func.func @matvec_shaped_matmul_lowering_f32f32f32_aarch64(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view, %arg2: !hal.buffer_view) -> !hal.buffer_view attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="aarch64-xyz-xyz"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="aarch64-xyz-xyz"}> } { %c0 = arith.constant 0 : index %0 = hal.tensor.import %arg0 "input0" : !hal.buffer_view -> tensor<16x16xf32> @@ -257,7 +257,7 @@ func.func @matvec_shaped_matmul_lowering_f32f32f32_aarch64(%arg0: !hal.buffer_vi #encoding_rhs = #iree_encoding.encoding> #encoding_result = #iree_encoding.encoding> func.func @matmul_lowering_f32f32f32_aarch64() attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="aarch64-xyz-xyz", ukernels = "all"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="aarch64-xyz-xyz", ukernels = "all"}> } { %c0 = arith.constant 0 : index %M = 
hal.interface.constant.load layout(#pipeline_layout) ordinal(0) : index @@ -323,7 +323,7 @@ func.func @matmul_lowering_f32f32f32_aarch64() attributes { #encoding_rhs = #iree_encoding.encoding (d0, d1)>, affine_map<(d0, d1) -> (d1)>, affine_map<(d0, d1) -> (d0)>], round_dims_to = array> #encoding_result = #iree_encoding.encoding (d0, d1)>, affine_map<(d0, d1) -> (d1)>, affine_map<(d0, d1) -> (d0)>], round_dims_to = array> func.func @matvec_lowering_f32f32f32_aarch64(%arg0: tensor<16x16xf32>, %arg1: tensor<16xf32>, %arg2: tensor<16xf32>) -> tensor<16xf32> attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="aarch64-xyz-xyz"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="aarch64-xyz-xyz"}> } { %c0 = arith.constant 0 : index %3 = iree_encoding.set_encoding %arg0 : tensor<16x16xf32> -> tensor<16x16xf32, #encoding_lhs> @@ -352,7 +352,7 @@ func.func @matvec_lowering_f32f32f32_aarch64(%arg0: tensor<16x16xf32>, %arg1: te #encoding_rhs = #iree_encoding.encoding> #encoding_result = #iree_encoding.encoding> func.func @matvec_lowering_f32f32f32_aarch64() attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="aarch64-xyz-xyz"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="aarch64-xyz-xyz"}> } { %c0 = arith.constant 0 : index %0 = hal.interface.binding.subspan layout(#pipeline_layout) binding(0) alignment(64) offset(%c0) @@ -414,7 +414,7 @@ func.func @matvec_lowering_f32f32f32_aarch64() attributes { #encoding_rhs = #iree_encoding.encoding> #encoding_result = #iree_encoding.encoding> func.func @matmul_lowering_f16f16f16_aarch64() attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="aarch64-xyz-xyz", ukernels = "all"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="aarch64-xyz-xyz", ukernels = "all"}> } { %c0 = arith.constant 0 : index %M = 
hal.interface.constant.load layout(#pipeline_layout) ordinal(0) : index @@ -485,7 +485,7 @@ func.func @matmul_lowering_f16f16f16_aarch64() attributes { #encoding_rhs = #iree_encoding.encoding> #encoding_result = #iree_encoding.encoding> func.func @matmul_lowering_f32f32f32_x86_64() attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="x86_64-xyz-xyz"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="x86_64-xyz-xyz"}> } { %c0 = arith.constant 0 : index %M = hal.interface.constant.load layout(#pipeline_layout) ordinal(0) : index @@ -557,7 +557,7 @@ func.func @matmul_lowering_f32f32f32_x86_64() attributes { #encoding_rhs = #iree_encoding.encoding> #encoding_result = #iree_encoding.encoding> func.func @matmul_lowering_f32f32f32_x86_64_avx2() attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx"}> } { %c0 = arith.constant 0 : index %M = hal.interface.constant.load layout(#pipeline_layout) ordinal(0) : index @@ -628,7 +628,7 @@ func.func @matmul_lowering_f32f32f32_x86_64_avx2() attributes { #encoding_rhs = #iree_encoding.encoding> #encoding_result = #iree_encoding.encoding> func.func @matmul_lowering_f32f32f32_x86_64_avx512f() attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512f"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512f"}> } { %c0 = arith.constant 0 : index %M = hal.interface.constant.load layout(#pipeline_layout) ordinal(0) : index @@ -699,7 +699,7 @@ func.func @matmul_lowering_f32f32f32_x86_64_avx512f() attributes { #encoding_rhs = #iree_encoding.encoding> #encoding_result = #iree_encoding.encoding> func.func 
@matmul_lowering_f16f16f32_x86_64_avx512f() attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512f"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512f"}> } { %c0 = arith.constant 0 : index %M = hal.interface.constant.load layout(#pipeline_layout) ordinal(0) : index @@ -770,7 +770,7 @@ func.func @matmul_lowering_f16f16f32_x86_64_avx512f() attributes { #encoding_rhs = #iree_encoding.encoding> #encoding_result = #iree_encoding.encoding> func.func @matmul_lowering_f16f16f16_x86_64_avx512f() attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512f"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512f"}> } { %c0 = arith.constant 0 : index %M = hal.interface.constant.load layout(#pipeline_layout) ordinal(0) : index @@ -841,7 +841,7 @@ func.func @matmul_lowering_f16f16f16_x86_64_avx512f() attributes { #encoding_rhs = #iree_encoding.encoding> #encoding_result = #iree_encoding.encoding> func.func @matmul_lowering_bf16bf16f32_x86_64_avx512f() attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512f"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512f"}> } { %c0 = arith.constant 0 : index %M = hal.interface.constant.load layout(#pipeline_layout) ordinal(0) : index @@ -912,7 +912,7 @@ func.func @matmul_lowering_bf16bf16f32_x86_64_avx512f() attributes { #encoding_rhs = #iree_encoding.encoding> #encoding_result = #iree_encoding.encoding> func.func @matmul_lowering_bf16bf16bf16_x86_64_avx512f() attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512f"}> + 
hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512f"}> } { %c0 = arith.constant 0 : index %M = hal.interface.constant.load layout(#pipeline_layout) ordinal(0) : index @@ -983,7 +983,7 @@ func.func @matmul_lowering_bf16bf16bf16_x86_64_avx512f() attributes { #encoding_rhs = #iree_encoding.encoding> #encoding_result = #iree_encoding.encoding> func.func @matmul_lowering_bf16bf16f32_x86_64_avx512bf16() attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512f,+avx512bf16"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512f,+avx512bf16"}> } { %c0 = arith.constant 0 : index %M = hal.interface.constant.load layout(#pipeline_layout) ordinal(0) : index @@ -1056,7 +1056,7 @@ func.func @matmul_lowering_bf16bf16f32_x86_64_avx512bf16() attributes { #encoding_rhs = #iree_encoding.encoding> #encoding_result = #iree_encoding.encoding> func.func @matmul_lowering_bf16bf16bf16_x86_64_avx512bf16() attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512f,+avx512bf16"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512f,+avx512bf16"}> } { %c0 = arith.constant 0 : index %M = hal.interface.constant.load layout(#pipeline_layout) ordinal(0) : index @@ -1129,7 +1129,7 @@ func.func @matmul_lowering_bf16bf16bf16_x86_64_avx512bf16() attributes { #encoding_rhs = #iree_encoding.encoding> #encoding_result = #iree_encoding.encoding> func.func @matmul_lowering_f32f16f16_aarch64() attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="aarch64-xyz-xyz", ukernels = "all"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="aarch64-xyz-xyz", ukernels = "all"}> } { %c0 = 
arith.constant 0 : index %M = hal.interface.constant.load layout(#pipeline_layout) ordinal(0) : index @@ -1202,7 +1202,7 @@ func.func @matmul_lowering_f32f16f16_aarch64() attributes { #encoding_rhs = #iree_encoding.encoding> #encoding_result = #iree_encoding.encoding> func.func @matmul_lowering_f32f16f16_x86_64_avx512f() attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512f,+avx512bf16"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512f,+avx512bf16"}> } { %c0 = arith.constant 0 : index %M = hal.interface.constant.load layout(#pipeline_layout) ordinal(0) : index @@ -1276,7 +1276,7 @@ func.func @matmul_lowering_f32f16f16_x86_64_avx512f() attributes { #encoding_rhs = #iree_encoding.encoding> #encoding_result = #iree_encoding.encoding> func.func @matmul_lowering_i8i8i32_aarch64() attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="aarch64-xyz-xyz"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="aarch64-xyz-xyz"}> } { %c0 = arith.constant 0 : index %M = hal.interface.constant.load layout(#pipeline_layout) ordinal(0) : index @@ -1344,7 +1344,7 @@ func.func @matmul_lowering_i8i8i32_aarch64() attributes { #encoding_rhs = #iree_encoding.encoding> #encoding_result = #iree_encoding.encoding> func.func @matmul_lowering_i8i8i32_aarch64_dotprod() attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="aarch64-xyz-xyz", cpu_features="+dotprod", ukernels = "all"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="aarch64-xyz-xyz", cpu_features="+dotprod", ukernels = "all"}> } { %c0 = arith.constant 0 : index %M = hal.interface.constant.load layout(#pipeline_layout) ordinal(0) : index @@ -1417,7 +1417,7 @@ func.func @matmul_lowering_i8i8i32_aarch64_dotprod() attributes { #encoding_rhs = 
#iree_encoding.encoding> #encoding_result = #iree_encoding.encoding> func.func @matmul_lowering_i8i8i32_aarch64_i8mm() attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="aarch64-xyz-xyz", cpu_features="+dotprod,+i8mm", ukernels = "all"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="aarch64-xyz-xyz", cpu_features="+dotprod,+i8mm", ukernels = "all"}> } { %c0 = arith.constant 0 : index %M = hal.interface.constant.load layout(#pipeline_layout) ordinal(0) : index @@ -1489,7 +1489,7 @@ func.func @matmul_lowering_i8i8i32_aarch64_i8mm() attributes { #encoding_rhs = #iree_encoding.encoding> #encoding_result = #iree_encoding.encoding> func.func @matmul_lowering_i8i4i32_aarch64() attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="aarch64-xyz-xyz"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="aarch64-xyz-xyz"}> } { %c0 = arith.constant 0 : index %M = hal.interface.constant.load layout(#pipeline_layout) ordinal(0) : index @@ -1563,7 +1563,7 @@ func.func @matmul_lowering_i8i4i32_aarch64() attributes { #encoding_rhs = #iree_encoding.encoding> #encoding_result = #iree_encoding.encoding> func.func @matmul_lowering_i8i4i32_aarch64_dotprod() attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="aarch64-xyz-xyz", cpu_features="+dotprod", ukernels = "all"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="aarch64-xyz-xyz", cpu_features="+dotprod", ukernels = "all"}> } { %c0 = arith.constant 0 : index %M = hal.interface.constant.load layout(#pipeline_layout) ordinal(0) : index @@ -1635,7 +1635,7 @@ func.func @matmul_lowering_i8i4i32_aarch64_dotprod() attributes { #encoding_rhs = #iree_encoding.encoding> #encoding_result = #iree_encoding.encoding> func.func @matmul_lowering_i8i4i32_aarch64_i8mm() attributes { - hal.executable.target = 
#hal.executable.target<"xyz", "xyz", {target_triple="aarch64-xyz-xyz", cpu_features="+dotprod,+i8mm", ukernels = "all"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="aarch64-xyz-xyz", cpu_features="+dotprod,+i8mm", ukernels = "all"}> } { %c0 = arith.constant 0 : index %M = hal.interface.constant.load layout(#pipeline_layout) ordinal(0) : index @@ -1704,7 +1704,7 @@ func.func @matmul_lowering_i8i4i32_aarch64_i8mm() attributes { #encoding_rhs = #iree_encoding.encoding> #encoding_result = #iree_encoding.encoding> func.func @matmul_lowering_f32f32f32_aarch64_sve(%lhs: tensor, %rhs: tensor, %acc: tensor) -> tensor attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {cpu_features = "+sve", target_triple="aarch64-xyz-xyz"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {cpu_features = "+sve", target_triple="aarch64-xyz-xyz"}> } { %c0 = arith.constant 0 : index %c1 = arith.constant 1 : index @@ -1736,7 +1736,7 @@ func.func @matmul_lowering_f32f32f32_aarch64_sve(%lhs: tensor, %rhs: te #encoding_rhs = #iree_encoding.encoding> #encoding_result = #iree_encoding.encoding> func.func @matmul_lowering_f32f32f32_riscv(%lhs: tensor, %rhs: tensor, %acc: tensor) -> tensor attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="riscv32-xyz-xyz"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="riscv32-xyz-xyz"}> } { %c0 = arith.constant 0 : index %c1 = arith.constant 1 : index @@ -1772,7 +1772,7 @@ func.func @matmul_lowering_f32f32f32_riscv(%lhs: tensor, %rhs: tensor> #encoding_result = #iree_encoding.encoding> func.func @matmul_lowering_i8i8i32_riscv32_ukernel() attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="riscv32-xyz-xyz", ukernels = "all"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="riscv32-xyz-xyz", ukernels = "all"}> } { %c0 = 
arith.constant 0 : index %M = hal.interface.constant.load layout(#pipeline_layout) ordinal(0) : index @@ -1845,7 +1845,7 @@ func.func @matmul_lowering_i8i8i32_riscv32_ukernel() attributes { #encoding_rhs = #iree_encoding.encoding> #encoding_result = #iree_encoding.encoding> func.func @matmul_lowering_i8i8i32_x86_64_avx2() attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx2"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx2"}> } { %c0 = arith.constant 0 : index %M = hal.interface.constant.load layout(#pipeline_layout) ordinal(0) : index @@ -1918,7 +1918,7 @@ func.func @matmul_lowering_i8i8i32_x86_64_avx2() attributes { #encoding_rhs = #iree_encoding.encoding> #encoding_result = #iree_encoding.encoding> func.func @matmul_lowering_i8i8i32_x86_64_avx512bw() attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512bw"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512bw"}> } { %c0 = arith.constant 0 : index %M = hal.interface.constant.load layout(#pipeline_layout) ordinal(0) : index @@ -1991,7 +1991,7 @@ func.func @matmul_lowering_i8i8i32_x86_64_avx512bw() attributes { #encoding_rhs = #iree_encoding.encoding> #encoding_result = #iree_encoding.encoding> func.func @matmul_lowering_i8i8i32_x86_64_avx512vnni() attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512vnni"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512vnni"}> } { %c0 = arith.constant 0 : index %M = hal.interface.constant.load layout(#pipeline_layout) ordinal(0) : index @@ -2059,7 +2059,7 @@ func.func @matmul_lowering_i8i8i32_x86_64_avx512vnni() attributes { 
#encoding_rhs = #iree_encoding.encoding> #encoding_result = #iree_encoding.encoding> func.func @extend_batch_vecmat_explicit_unit_dim(%arg0: tensor<32x1x128xi8>, %arg1: tensor<32x128x11008xi8>) -> tensor<32x1x11008xi32> attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512vnni"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512vnni"}> } { %c0_i32 = arith.constant 0 : i32 %4 = iree_encoding.set_encoding %arg0 : tensor<32x1x128xi8> -> tensor<32x1x128xi8, #encoding_lhs> @@ -2122,7 +2122,7 @@ func.func @extend_batch_vecmat_explicit_unit_dim(%arg0: tensor<32x1x128xi8>, %ar #encoding_rhs = #iree_encoding.encoding> #encoding_result = #iree_encoding.encoding> func.func @matmul_lowering_i16i16i32_x86_64_avx2() attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx2"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx2"}> } { %c0 = arith.constant 0 : index %M = hal.interface.constant.load layout(#pipeline_layout) ordinal(0) : index @@ -2195,7 +2195,7 @@ func.func @matmul_lowering_i16i16i32_x86_64_avx2() attributes { #encoding_rhs = #iree_encoding.encoding> #encoding_result = #iree_encoding.encoding> func.func @matmul_lowering_i16ui4i32_x86_64_avx512vnni() attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512vnni"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512vnni"}> } { %c0 = arith.constant 0 : index %M = hal.interface.constant.load layout(#pipeline_layout) ordinal(0) : index @@ -2263,7 +2263,7 @@ func.func @matmul_lowering_i16ui4i32_x86_64_avx512vnni() attributes { #encoding_rhs = #iree_encoding.encoding> #encoding_result = 
#iree_encoding.encoding> func.func @vecmat(%arg0: tensor<128xi8>, %arg1: tensor<128x11008xi8>) -> tensor<11008xi32> attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512vnni"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512vnni"}> } { %c0_i32 = arith.constant 0 : i32 %4 = iree_encoding.set_encoding %arg0 : tensor<128xi8> -> tensor<128xi8, #encoding_lhs> @@ -2325,7 +2325,7 @@ func.func @vecmat(%arg0: tensor<128xi8>, %arg1: tensor<128x11008xi8>) -> tensor< #encoding_rhs = #iree_encoding.encoding> #encoding_result = #iree_encoding.encoding> func.func @matvec(%arg0: tensor<11008x128xi8>, %arg1: tensor<128xi8>) -> tensor<11008xi32> attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512vnni"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512vnni"}> } { %c0_i32 = arith.constant 0 : i32 %4 = iree_encoding.set_encoding %arg0 : tensor<11008x128xi8> -> tensor<11008x128xi8, #encoding_lhs> @@ -2387,7 +2387,7 @@ func.func @matvec(%arg0: tensor<11008x128xi8>, %arg1: tensor<128xi8>) -> tensor< #encoding_rhs = #iree_encoding.encoding> #encoding_result = #iree_encoding.encoding> func.func @matvec_with_narrow_M(%arg0: tensor<15x128xi8>, %arg1: tensor<128xi8>) -> tensor<15xi32> attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512vnni"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512vnni"}> } { %c0_i32 = arith.constant 0 : i32 %4 = iree_encoding.set_encoding %arg0 : tensor<15x128xi8> -> tensor<15x128xi8, #encoding_lhs> @@ -2450,7 +2450,7 @@ func.func @matvec_with_narrow_M(%arg0: tensor<15x128xi8>, %arg1: tensor<128xi8>) #encoding_rhs = 
#iree_encoding.encoding> #encoding_result = #iree_encoding.encoding> func.func @batch_vecmat(%arg0: tensor<32x128xi8>, %arg1: tensor<32x128x11008xi8>) -> tensor<32x11008xi32> attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512vnni"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512vnni"}> } { %c0_i32 = arith.constant 0 : i32 %4 = iree_encoding.set_encoding %arg0 : tensor<32x128xi8> -> tensor<32x128xi8, #encoding_lhs> @@ -2509,7 +2509,7 @@ func.func @batch_vecmat(%arg0: tensor<32x128xi8>, %arg1: tensor<32x128x11008xi8> #encoding_rhs = #iree_encoding.encoding (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>], round_dims_to = array> #encoding_result = #iree_encoding.encoding (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>], round_dims_to = array> func.func @batch_matvec(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view, %arg2: !hal.buffer_view) -> !hal.buffer_view attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512vnni"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512vnni"}> } { %0 = hal.tensor.import %arg0 "input0" : !hal.buffer_view -> tensor<32x11008x128xi8> %1 = hal.tensor.import %arg1 "input1" : !hal.buffer_view -> tensor<32x128xi8> @@ -2535,7 +2535,7 @@ func.func @batch_matvec(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view, %arg2: #encoding_rhs = #iree_encoding.encoding> #encoding_result = #iree_encoding.encoding> func.func @matmul_transpose_a_f32f32f32(%arg0: tensor<256x128xf32>, %arg1: tensor<256x512xf32>, %arg2: tensor<128x512xf32>) -> tensor<128x512xf32> attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="x86_64-xyz-xyz", 
cpu_features="+avx512vnni"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512vnni"}> } { %c256 = arith.constant 256 : index %c128 = arith.constant 128 : index @@ -2574,7 +2574,7 @@ func.func @matmul_transpose_a_f32f32f32(%arg0: tensor<256x128xf32>, %arg1: tenso #encoding_rhs = #iree_encoding.encoding> #encoding_result = #iree_encoding.encoding> func.func @matmul_transpose_b_f32f32f32(%arg0: tensor<128x256xf32>, %arg1: tensor<512x256xf32>, %arg2: tensor<128x512xf32>) -> tensor<128x512xf32> attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512vnni"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512vnni"}> } { %c128 = arith.constant 128 : index %c256 = arith.constant 256 : index @@ -2612,7 +2612,7 @@ func.func @matmul_transpose_b_f32f32f32(%arg0: tensor<128x256xf32>, %arg1: tenso #encoding_rhs = #iree_encoding.encoding> #encoding_result = #iree_encoding.encoding> func.func @batch_matmul_transpose_a_f32f32f32(%arg0: tensor<2x256x128xf32>, %arg1: tensor<2x256x512xf32>, %arg2: tensor<2x128x512xf32>) -> tensor<2x128x512xf32> attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512vnni"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512vnni"}> } { %c2 = arith.constant 2 : index %c256 = arith.constant 256 : index @@ -2651,7 +2651,7 @@ func.func @batch_matmul_transpose_a_f32f32f32(%arg0: tensor<2x256x128xf32>, %arg #encoding_rhs = #iree_encoding.encoding> #encoding_result = #iree_encoding.encoding> func.func @batch_matmul_transpose_b_f32f32f32(%arg0: tensor<2x128x256xf32>, %arg1: tensor<2x512x256xf32>, %arg2: tensor<2x128x512xf32>) -> tensor<2x128x512xf32> attributes { - hal.executable.target = 
#hal.executable.target<"xyz", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512vnni"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512vnni"}> } { %c2 = arith.constant 2 : index %c128 = arith.constant 128 : index @@ -2690,7 +2690,7 @@ func.func @batch_matmul_transpose_b_f32f32f32(%arg0: tensor<2x128x256xf32>, %arg #encoding_rhs = #iree_encoding.encoding> #encoding_result = #iree_encoding.encoding> func.func @generic_batch_vecmat_transposed_i16u4i32(%arg0: tensor<32x128xi16>, %arg1: tensor<4096x32x128xi4>, %arg2: tensor<4096x32xi32>) -> tensor<4096x32xi32> attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512vnni"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512vnni"}> } { %c0_i32 = arith.constant 0 : i32 %c0_i4 = arith.constant 0 : i4 @@ -2747,7 +2747,7 @@ func.func @generic_batch_vecmat_transposed_i16u4i32(%arg0: tensor<32x128xi16>, % #encoding = #iree_encoding.encoding (d0, d3, d2)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2)>], bcast_map = affine_map<(d0, d1, d2) -> (d0, d1, d2)>, round_dims_to = array> #encoding_bcast = #iree_encoding.encoding (d0, d3, d2)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2)>], bcast_map = affine_map<(d0, d1, d2) -> (d0, d2)>, round_dims_to = array> func.func @dequantization() attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512f"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512f"}> } { %c0 = arith.constant 0 : index %cst = arith.constant 0.000000e+00 : f32 @@ -2802,7 +2802,7 @@ func.func @dequantization() attributes { #encoding = #iree_encoding.encoding 
(d0, d3, d2)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2)>], bcast_map = affine_map<(d0, d1, d2) -> (d0, d1, d2)>, round_dims_to = array> #encoding_bcast = #iree_encoding.encoding (d0, d3, d2)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2)>], bcast_map = affine_map<(d0, d1, d2) -> (d1, d2)>, round_dims_to = array> func.func @broadcast_batch() attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512f"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512f"}> } { %c0 = arith.constant 0 : index %cst = arith.constant 0.000000e+00 : f32 @@ -2841,7 +2841,7 @@ func.func @broadcast_batch() attributes { #encoding = #iree_encoding.encoding (d0, d3, d2)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2)>], bcast_map = affine_map<(d0, d1, d2) -> (d0, d1, d2)>, round_dims_to = array> #encoding_bcast = #iree_encoding.encoding (d0, d3, d2)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2)>], bcast_map = affine_map<(d0, d1, d2) -> (d0, d1)>, round_dims_to = array> func.func @broadcast_M() attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512f"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512f"}> } { %c0 = arith.constant 0 : index %cst = arith.constant 0.000000e+00 : f32 @@ -2880,7 +2880,7 @@ func.func @broadcast_M() attributes { #encoding = #iree_encoding.encoding (d0, d3, d2)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2)>], bcast_map = affine_map<(d0, d1, d2) -> (d0, d1, d2)>, round_dims_to = array> #encoding_bcast = #iree_encoding.encoding (d0, d3, d2)>, affine_map<(d0, 
d1, d2, d3) -> (d0, d1, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2)>], bcast_map = affine_map<(d0, d1, d2) -> (d0, d2)>, round_dims_to = array> func.func @broadcast_N() attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512f"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512f"}> } { %c0 = arith.constant 0 : index %cst = arith.constant 0.000000e+00 : f32 @@ -2919,7 +2919,7 @@ func.func @broadcast_N() attributes { #encoding = #iree_encoding.encoding (d0, d3, d2)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2)>], bcast_map = affine_map<(d0, d1, d2) -> (d0, d1, d2)>, round_dims_to = array> #encoding_bcast = #iree_encoding.encoding (d0, d3, d2)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2)>], bcast_map = affine_map<(d0, d1, d2) -> (d0, d2)>, round_dims_to = array> func.func @broadcast_K() attributes { - hal.executable.target = #hal.executable.target<"xyz", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512f"}> + hal.executable.target = #hal.executable.target<"llvm-cpu", "xyz", {target_triple="x86_64-xyz-xyz", cpu_features="+avx512f"}> } { %c0 = arith.constant 0 : index %cst = arith.constant 0.000000e+00 : f32 diff --git a/compiler/src/iree/compiler/Codegen/Common/CPU/test/vmvx_materialize_encoding.mlir b/compiler/src/iree/compiler/Codegen/Common/test/vmvx_materialize_encoding.mlir similarity index 99% rename from compiler/src/iree/compiler/Codegen/Common/CPU/test/vmvx_materialize_encoding.mlir rename to compiler/src/iree/compiler/Codegen/Common/test/vmvx_materialize_encoding.mlir index 85dd416a8153..2f3b91ff7255 100644 --- a/compiler/src/iree/compiler/Codegen/Common/CPU/test/vmvx_materialize_encoding.mlir +++ b/compiler/src/iree/compiler/Codegen/Common/test/vmvx_materialize_encoding.mlir @@ -1,4 +1,4 @@ -// RUN: 
iree-opt --pass-pipeline="builtin.module(func.func(iree-codegen-cpu-materialize-device-encoding),canonicalize,cse)" --split-input-file %s | FileCheck %s +// RUN: iree-opt --pass-pipeline="builtin.module(func.func(iree-codegen-materialize-device-encoding),canonicalize,cse)" --split-input-file %s | FileCheck %s #pipeline_layout = #hal.pipeline.layout, diff --git a/compiler/src/iree/compiler/Codegen/LLVMCPU/Passes.cpp b/compiler/src/iree/compiler/Codegen/LLVMCPU/Passes.cpp index 76b2745dbc45..1d2b66ee634e 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMCPU/Passes.cpp +++ b/compiler/src/iree/compiler/Codegen/LLVMCPU/Passes.cpp @@ -788,7 +788,7 @@ void buildLLVMCPUCodegenConfigurationPassPipelineImpl( // TODO(#13888): This(createExpandF16OpToF32Pass()) pass is being added // way to late and should insted be be done during lowering to LLVM. .addPass(createExpandF16OpToF32Pass) - .addPass(createCPUMaterializeDeviceEncodingPass) + .addPass(createMaterializeDeviceEncodingPass) // TODO: Remove the following pass the plumb support for // #hal.descriptor_type memory space through the stack. 
.addPass(createEraseHALDescriptorTypeFromMemRefPass); diff --git a/compiler/src/iree/compiler/Codegen/Utils/Utils.cpp b/compiler/src/iree/compiler/Codegen/Utils/Utils.cpp index f17a353afcc2..812bc9bc2f5e 100644 --- a/compiler/src/iree/compiler/Codegen/Utils/Utils.cpp +++ b/compiler/src/iree/compiler/Codegen/Utils/Utils.cpp @@ -161,6 +161,10 @@ const char *getIreeArchNameForTargetTriple(llvm::Triple triple) { return "unknown"; } +bool isLLVMCPUBackend(IREE::HAL::ExecutableTargetAttr targetAttr) { + return targetAttr && targetAttr.getBackend().getValue() == "llvm-cpu"; +} + bool isVMVXBackend(IREE::HAL::ExecutableTargetAttr targetAttr) { return targetAttr && targetAttr.getBackend().getValue().starts_with("vmvx"); } diff --git a/compiler/src/iree/compiler/Codegen/Utils/Utils.h b/compiler/src/iree/compiler/Codegen/Utils/Utils.h index d8f96de94213..ea3d06956a27 100644 --- a/compiler/src/iree/compiler/Codegen/Utils/Utils.h +++ b/compiler/src/iree/compiler/Codegen/Utils/Utils.h @@ -61,9 +61,8 @@ std::optional getTargetTriple(Attribute attr); const char *getIreeArchNameForTargetTriple(llvm::Triple triple); /// Methods to get target information. +bool isLLVMCPUBackend(IREE::HAL::ExecutableTargetAttr targetAttr); bool isVMVXBackend(IREE::HAL::ExecutableTargetAttr targetAttr); - -/// Methods to get target information. bool isROCMBackend(IREE::HAL::ExecutableTargetAttr targetAttr); // Returns true if the ukernel with given `ukernelName` is enabled. 
diff --git a/compiler/src/iree/compiler/Dialect/VMVX/Transforms/Passes.cpp b/compiler/src/iree/compiler/Dialect/VMVX/Transforms/Passes.cpp index 00c5c9f9637b..a196e3121894 100644 --- a/compiler/src/iree/compiler/Dialect/VMVX/Transforms/Passes.cpp +++ b/compiler/src/iree/compiler/Dialect/VMVX/Transforms/Passes.cpp @@ -44,7 +44,7 @@ void buildVMVXConfigurationPassPipeline(OpPassManager &variantPassManager) { } modulePassManager.addPass(createMaterializeUserConfigsPass()); FunctionLikeNest(modulePassManager) - .addPass(createCPUMaterializeDeviceEncodingPass) + .addPass(createMaterializeDeviceEncodingPass) // TODO: Remove the following pass the plumb support for // #hal.descriptor_type memory space through the stack. .addPass(createEraseHALDescriptorTypeFromMemRefPass); diff --git a/compiler/src/iree/compiler/GlobalOptimization/BUILD.bazel b/compiler/src/iree/compiler/GlobalOptimization/BUILD.bazel index d85310e8dfe4..50ff8a6fad2b 100644 --- a/compiler/src/iree/compiler/GlobalOptimization/BUILD.bazel +++ b/compiler/src/iree/compiler/GlobalOptimization/BUILD.bazel @@ -76,8 +76,6 @@ iree_compiler_cc_library( ":PassHeaders", ":PassesIncGen", "//compiler/src/iree/compiler/Codegen/Common", - "//compiler/src/iree/compiler/Codegen/Common/CPU:CommonCPUPasses", - "//compiler/src/iree/compiler/Codegen/Common/GPU:CommonGPUPasses", "//compiler/src/iree/compiler/Codegen/Dialect/Codegen/IR:IREECodegenDialect", "//compiler/src/iree/compiler/Dialect/Encoding/IR", "//compiler/src/iree/compiler/Dialect/Flow/Conversion/TensorToFlow", diff --git a/compiler/src/iree/compiler/GlobalOptimization/CMakeLists.txt b/compiler/src/iree/compiler/GlobalOptimization/CMakeLists.txt index 9ca16eed433d..6650602f8c98 100644 --- a/compiler/src/iree/compiler/GlobalOptimization/CMakeLists.txt +++ b/compiler/src/iree/compiler/GlobalOptimization/CMakeLists.txt @@ -91,8 +91,6 @@ iree_cc_library( MLIRTransformUtils MLIRTransforms iree::compiler::Codegen::Common - 
iree::compiler::Codegen::Common::CPU::CommonCPUPasses - iree::compiler::Codegen::Common::GPU::CommonGPUPasses iree::compiler::Codegen::Dialect::Codegen::IR::IREECodegenDialect iree::compiler::Dialect::Encoding::IR iree::compiler::Dialect::Flow::Conversion::TensorToFlow diff --git a/compiler/src/iree/compiler/GlobalOptimization/MaterializeHomogeneousEncodings.cpp b/compiler/src/iree/compiler/GlobalOptimization/MaterializeHomogeneousEncodings.cpp index adcc12977bad..f7aeb8225d0b 100644 --- a/compiler/src/iree/compiler/GlobalOptimization/MaterializeHomogeneousEncodings.cpp +++ b/compiler/src/iree/compiler/GlobalOptimization/MaterializeHomogeneousEncodings.cpp @@ -4,8 +4,6 @@ // See https://llvm.org/LICENSE.txt for license information. // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception -#include "iree/compiler/Codegen/Common/CPU/Passes.h" -#include "iree/compiler/Codegen/Common/GPU/Passes.h" #include "iree/compiler/Codegen/Common/Passes.h" #include "iree/compiler/Codegen/Dialect/Codegen/IR/IREECodegenDialect.h" #include "iree/compiler/Dialect/HAL/Analysis/DeviceAnalysis.h" @@ -82,10 +80,10 @@ class MaterializeHomogeneousEncodingsPass // Only llvm-cpu and rocm backends handle encodings for now, others just go // with nop. if (executableTarget.getBackend() == "llvm-cpu") { - passManager.addPass(createCPUMaterializeHostEncodingPass()); + passManager.addPass(createMaterializeHostEncodingPass()); } else if (clEnableExperimentalRocmDataTiling && executableTarget.getBackend() == "rocm") { - passManager.addPass(createGPUMaterializeHostEncodingPass()); + passManager.addPass(createMaterializeHostEncodingPass()); FunctionLikeNest(passManager).addPass([&]() { return createDecomposePackUnPackOpsPass( DecomposePackUnPackOpsPassOptions{/*tileOuterToOne=*/false, From 25de54960ea4c537833d85f471169c43ad19e87b Mon Sep 17 00:00:00 2001 From: Han-Chung Wang Date: Mon, 16 Dec 2024 06:45:28 -0800 Subject: [PATCH 19/64] [NFC] Delete outdated e2e encoding tests. 
(#19487)

It was introduced for GPU matmul data-tiling when the multi_mma
code-generation was not ready yet. There is an e2e path in IREE (see
tests/e2e/matmul/) now, so we no longer need the test suite.

Signed-off-by: hanhanW
---
 tests/e2e/rocm_specific/BUILD.bazel    |  24 ---
 tests/e2e/rocm_specific/CMakeLists.txt |  26 ---
 tests/e2e/rocm_specific/encoding.mlir  | 232 -------------------------
 3 files changed, 282 deletions(-)
 delete mode 100644 tests/e2e/rocm_specific/BUILD.bazel
 delete mode 100644 tests/e2e/rocm_specific/CMakeLists.txt
 delete mode 100644 tests/e2e/rocm_specific/encoding.mlir

diff --git a/tests/e2e/rocm_specific/BUILD.bazel b/tests/e2e/rocm_specific/BUILD.bazel
deleted file mode 100644
index 438fb0728d67..000000000000
--- a/tests/e2e/rocm_specific/BUILD.bazel
+++ /dev/null
@@ -1,24 +0,0 @@
-# Copyright 2024 The IREE Authors
-#
-# Licensed under the Apache License v2.0 with LLVM Exceptions.
-# See https://llvm.org/LICENSE.txt for license information.
-# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-
-# Tests for end-to-end IREE support specific to the vulkan-spirv lowering.
-
-load("//build_tools/bazel:iree_check_test.bzl", "iree_check_single_backend_test_suite")
-
-package(
-    features = ["layering_check"],
-    licenses = ["notice"],  # Apache 2.0
-)
-
-iree_check_single_backend_test_suite(
-    name = "check_rocm_hip",
-    srcs = ["encoding.mlir"],
-    compiler_flags = [
-        "--iree-global-opt-experimental-rocm-data-tiling",
-    ],
-    driver = "hip",
-    target_backend = "rocm",
-)
diff --git a/tests/e2e/rocm_specific/CMakeLists.txt b/tests/e2e/rocm_specific/CMakeLists.txt
deleted file mode 100644
index c428b12fc7f7..000000000000
--- a/tests/e2e/rocm_specific/CMakeLists.txt
+++ /dev/null
@@ -1,26 +0,0 @@
-################################################################################
-# Autogenerated by build_tools/bazel_to_cmake/bazel_to_cmake.py from           #
-# tests/e2e/rocm_specific/BUILD.bazel                                          #
-#                                                                              #
-# Use iree_cmake_extra_content from iree/build_defs.oss.bzl to add arbitrary   #
-# CMake-only content.                                                          #
-#                                                                              #
-# To disable autogeneration for this file entirely, delete this header.        #
-################################################################################
-
-iree_add_all_subdirs()
-
-iree_check_single_backend_test_suite(
-  NAME
-    check_rocm_hip
-  SRCS
-    "encoding.mlir"
-  TARGET_BACKEND
-    "rocm"
-  DRIVER
-    "hip"
-  COMPILER_FLAGS
-    "--iree-global-opt-experimental-rocm-data-tiling"
-)
-
-### BAZEL_TO_CMAKE_PRESERVES_ALL_CONTENT_BELOW_THIS_LINE ###
diff --git a/tests/e2e/rocm_specific/encoding.mlir b/tests/e2e/rocm_specific/encoding.mlir
deleted file mode 100644
index 2718f9d00805..000000000000
--- a/tests/e2e/rocm_specific/encoding.mlir
+++ /dev/null
@@ -1,232 +0,0 @@
-//===----------------------------------------------------------------------===//
-// Utility Methods
-//===----------------------------------------------------------------------===//
-
-func.func private @generate_2D_source_f16(%height : index, %width : index) -> tensor {
- %0 = util.optimization_barrier %source : tensor - %1 = flow.tensor.tie_shape %0 : tensor{%height, %width} - return %1 : tensor -} - -func.func private @generate_2D_source_f32(%height : index, %width : index) -> tensor { - %init_source = tensor.empty(%height, %width) : tensor - %source = linalg.generic { - indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>], - iterator_types = ["parallel", "parallel"]} - outs(%init_source : tensor) { - ^bb0(%b0 : f32): - %outer = linalg.index 0 : index - %inner = linalg.index 1 : index - %strided = arith.muli %outer, %width : index - %linearized = arith.addi %inner, %strided : index - %linearized_i32 = arith.index_cast %linearized : index to i32 - %linearized_f32 = arith.sitofp %linearized_i32 : i32 to f32 - linalg.yield %linearized_f32 : f32 - } -> tensor - // This blocks the fusion for inputs and testing ops. - %0 = util.optimization_barrier %source : tensor - %1 = flow.tensor.tie_shape %0 : tensor{%height, %width} - return %1 : tensor -} - -func.func private @generate_2D_source_i8(%height : index, %width : index) -> tensor { - %c255 = arith.constant 255 : index - %init_source = tensor.empty(%height, %width) : tensor - %source = linalg.generic { - indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>], - iterator_types = ["parallel", "parallel"]} - outs(%init_source : tensor) { - ^bb0(%b0 : i8): - %outer = linalg.index 0 : index - %inner = linalg.index 1 : index - %strided = arith.muli %outer, %width : index - %linearized = arith.addi %inner, %strided : index - %linearized_rem = arith.remsi %linearized, %c255 : index - %linearized_i8 = arith.index_cast %linearized_rem : index to i8 - linalg.yield %linearized_i8 : i8 - } -> tensor - // This blocks the fusion for inputs and testing ops. 
- %0 = util.optimization_barrier %source : tensor - %1 = flow.tensor.tie_shape %0 : tensor{%height, %width} - return %1 : tensor -} - -func.func private @generate_2D_source_i32(%height : index, %width : index) -> tensor { - %init_source = tensor.empty(%height, %width) : tensor - %source = linalg.generic { - indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>], - iterator_types = ["parallel", "parallel"]} - outs(%init_source : tensor) { - ^bb0(%b0 : i32): - %outer = linalg.index 0 : index - %inner = linalg.index 1 : index - %strided = arith.muli %outer, %width : index - %linearized = arith.addi %inner, %strided : index - %linearized_i32 = arith.index_cast %linearized : index to i32 - linalg.yield %linearized_i32 : i32 - } -> tensor - // This blocks the fusion for inputs and testing ops. - %0 = util.optimization_barrier %source : tensor - %1 = flow.tensor.tie_shape %0 : tensor{%height, %width} - return %1 : tensor -} - -//===----------------------------------------------------------------------===// -// f32.f32.f32 variants -//===----------------------------------------------------------------------===// - -#map = affine_map<(d0, d1, d2) -> (d0, d2)> -#map1 = affine_map<(d0, d1, d2) -> (d2, d1)> -#map2 = affine_map<(d0, d1, d2) -> (d0, d1)> -#encoding_f32f32f32_lhs = #iree_encoding.encoding> -#encoding_f32f32f32_rhs = #iree_encoding.encoding> -#encoding_f32f32f32_acc = #iree_encoding.encoding> - -func.func @set_encoding_f32f32f32_lhs() { - %height = arith.constant 129 : index - %width = arith.constant 255 : index - %0 = call @generate_2D_source_f32(%height, %width) : (index, index) -> tensor - %source = tensor.cast %0 : tensor to tensor<129x255xf32> - - %1 = iree_encoding.set_encoding %source : tensor<129x255xf32> -> tensor<129x255xf32, #encoding_f32f32f32_lhs> - %barrire = util.optimization_barrier %1 : tensor<129x255xf32, #encoding_f32f32f32_lhs> - %2 = iree_encoding.unset_encoding %1 : tensor<129x255xf32, #encoding_f32f32f32_lhs> -> tensor<129x255xf32> - 
check.expect_almost_eq(%2, %source) : tensor<129x255xf32> - return -} - -func.func @set_encoding_f32f32f32_rhs() { - %height = arith.constant 129 : index - %width = arith.constant 255 : index - %0 = call @generate_2D_source_f32(%height, %width) : (index, index) -> tensor - %source = tensor.cast %0 : tensor to tensor<129x255xf32> - - %1 = iree_encoding.set_encoding %source : tensor<129x255xf32> -> tensor<129x255xf32, #encoding_f32f32f32_rhs> - %barrire = util.optimization_barrier %1 : tensor<129x255xf32, #encoding_f32f32f32_rhs> - %2 = iree_encoding.unset_encoding %1 : tensor<129x255xf32, #encoding_f32f32f32_rhs> -> tensor<129x255xf32> - check.expect_almost_eq(%2, %source) : tensor<129x255xf32> - return -} - -func.func @set_encoding_f32f32f32_acc() { - %height = arith.constant 129 : index - %width = arith.constant 255 : index - %0 = call @generate_2D_source_f32(%height, %width) : (index, index) -> tensor - %source = tensor.cast %0 : tensor to tensor<129x255xf32> - - %1 = iree_encoding.set_encoding %source : tensor<129x255xf32> -> tensor<129x255xf32, #encoding_f32f32f32_acc> - %barrire = util.optimization_barrier %1 : tensor<129x255xf32, #encoding_f32f32f32_acc> - %2 = iree_encoding.unset_encoding %1 : tensor<129x255xf32, #encoding_f32f32f32_acc> -> tensor<129x255xf32> - check.expect_almost_eq(%2, %source) : tensor<129x255xf32> - return -} - -//===----------------------------------------------------------------------===// -// i8.i8.i32 variants -//===----------------------------------------------------------------------===// - -#encoding_i8i8i32_lhs = #iree_encoding.encoding> -#encoding_i8i8i32_rhs = #iree_encoding.encoding> -#encoding_i8i8i32_acc = #iree_encoding.encoding> - -func.func @set_encoding_i8i8i32_lhs() { - %height = arith.constant 129 : index - %width = arith.constant 255 : index - %0 = call @generate_2D_source_i8(%height, %width) : (index, index) -> tensor - %source = tensor.cast %0 : tensor to tensor<129x255xi8> - - %1 = iree_encoding.set_encoding 
%source : tensor<129x255xi8> -> tensor<129x255xi8, #encoding_i8i8i32_lhs> - %barrire = util.optimization_barrier %1 : tensor<129x255xi8, #encoding_i8i8i32_lhs> - %2 = iree_encoding.unset_encoding %1 : tensor<129x255xi8, #encoding_i8i8i32_lhs> -> tensor<129x255xi8> - check.expect_eq(%2, %source) : tensor<129x255xi8> - return -} - -func.func @set_encoding_i8i8i32_rhs() { - %height = arith.constant 129 : index - %width = arith.constant 255 : index - %0 = call @generate_2D_source_i8(%height, %width) : (index, index) -> tensor - %source = tensor.cast %0 : tensor to tensor<129x255xi8> - - %1 = iree_encoding.set_encoding %source : tensor<129x255xi8> -> tensor<129x255xi8, #encoding_i8i8i32_rhs> - %barrire = util.optimization_barrier %1 : tensor<129x255xi8, #encoding_i8i8i32_rhs> - %2 = iree_encoding.unset_encoding %1 : tensor<129x255xi8, #encoding_i8i8i32_rhs> -> tensor<129x255xi8> - check.expect_eq(%2, %source) : tensor<129x255xi8> - return -} - -func.func @set_encoding_i8i8i32_acc() { - %height = arith.constant 129 : index - %width = arith.constant 255 : index - %0 = call @generate_2D_source_i32(%height, %width) : (index, index) -> tensor - %source = tensor.cast %0 : tensor to tensor<129x255xi32> - - %1 = iree_encoding.set_encoding %source : tensor<129x255xi32> -> tensor<129x255xi32, #encoding_i8i8i32_acc> - %barrire = util.optimization_barrier %1 : tensor<129x255xi32, #encoding_i8i8i32_acc> - %2 = iree_encoding.unset_encoding %1 : tensor<129x255xi32, #encoding_i8i8i32_acc> -> tensor<129x255xi32> - check.expect_eq(%2, %source) : tensor<129x255xi32> - return -} - - -//===----------------------------------------------------------------------===// -// f16.f16.f32 variants -//===----------------------------------------------------------------------===// - -#encoding_f16f16f32_lhs = #iree_encoding.encoding> -#encoding_f16f16f32_rhs = #iree_encoding.encoding> -#encoding_f16f16f32_acc = #iree_encoding.encoding> - -func.func @set_encoding_f16f16f32_lhs() { - %height = 
arith.constant 129 : index - %width = arith.constant 255 : index - %0 = call @generate_2D_source_f16(%height, %width) : (index, index) -> tensor - %source = tensor.cast %0 : tensor to tensor<129x255xf16> - - %1 = iree_encoding.set_encoding %source : tensor<129x255xf16> -> tensor<129x255xf16, #encoding_f16f16f32_lhs> - %barrire = util.optimization_barrier %1 : tensor<129x255xf16, #encoding_f16f16f32_lhs> - %2 = iree_encoding.unset_encoding %1 : tensor<129x255xf16, #encoding_f16f16f32_lhs> -> tensor<129x255xf16> - check.expect_eq(%2, %source) : tensor<129x255xf16> - return -} - -func.func @set_encoding_f16f16f32_rhs() { - %height = arith.constant 129 : index - %width = arith.constant 255 : index - %0 = call @generate_2D_source_f16(%height, %width) : (index, index) -> tensor - %source = tensor.cast %0 : tensor to tensor<129x255xf16> - - %1 = iree_encoding.set_encoding %source : tensor<129x255xf16> -> tensor<129x255xf16, #encoding_f16f16f32_rhs> - %barrire = util.optimization_barrier %1 : tensor<129x255xf16, #encoding_f16f16f32_rhs> - %2 = iree_encoding.unset_encoding %1 : tensor<129x255xf16, #encoding_f16f16f32_rhs> -> tensor<129x255xf16> - check.expect_eq(%2, %source) : tensor<129x255xf16> - return -} - -func.func @set_encoding_f16f16f32_acc() { - %height = arith.constant 129 : index - %width = arith.constant 255 : index - %0 = call @generate_2D_source_f32(%height, %width) : (index, index) -> tensor - %source = tensor.cast %0 : tensor to tensor<129x255xf32> - - %1 = iree_encoding.set_encoding %source : tensor<129x255xf32> -> tensor<129x255xf32, #encoding_f16f16f32_acc> - %barrire = util.optimization_barrier %1 : tensor<129x255xf32, #encoding_f16f16f32_acc> - %2 = iree_encoding.unset_encoding %1 : tensor<129x255xf32, #encoding_f16f16f32_acc> -> tensor<129x255xf32> - check.expect_eq(%2, %source) : tensor<129x255xf32> - return -} From 7013101715084a29255f66ac5f99bc4960e489a7 Mon Sep 17 00:00:00 2001 From: Twice Date: Tue, 17 Dec 2024 01:04:49 +0800 Subject: [PATCH 
20/64] [PJRT] Fix compile error while tracing is enabled (#19485) When tracing is enabled in the PJRT plugin (i.e. the CMake option `-DIREE_ENABLE_RUNTIME_TRACING=ON` is passed), compilation fails because `IREE_TRACE_SCOPE_NAMED` is used incorrectly: an undefined identifier is passed as the name. The name should be a string literal instead, as in other places. ci-exactly: build_packages, test_pjrt Signed-off-by: PragmaTwice --- integrations/pjrt/src/iree_pjrt/common/api_impl.cc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/integrations/pjrt/src/iree_pjrt/common/api_impl.cc b/integrations/pjrt/src/iree_pjrt/common/api_impl.cc index 6f2d787ea1d2..b0cf64ca67e0 100644 --- a/integrations/pjrt/src/iree_pjrt/common/api_impl.cc +++ b/integrations/pjrt/src/iree_pjrt/common/api_impl.cc @@ -1777,7 +1777,7 @@ void ExecutableImage::BindApi(PJRT_Api* api) { }; api->PJRT_Executable_Name = +[](PJRT_Executable_Name_Args* args) -> PJRT_Error* { - IREE_TRACE_SCOPE_NAMED(PJRT_Executable_Name); + IREE_TRACE_SCOPE_NAMED("PJRT_Executable_Name"); const char* dummy_name = "iree_vmfb"; args->executable_name = dummy_name; args->executable_name_size = strlen(dummy_name); From fdf4ae6a41c736526699afe17bc1b085904315fc Mon Sep 17 00:00:00 2001 From: Scott Todd Date: Mon, 16 Dec 2024 09:06:07 -0800 Subject: [PATCH 21/64] Update emsdk in samples workflow. (#19490) Fixes https://github.com/iree-org/iree/issues/19425. The newer emsdk is needed to link when we produce executables with newer LLVM cpu features.
Tested here: https://github.com/ScottTodd/iree/actions/runs/12356828162 --- .github/workflows/samples.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/samples.yml b/.github/workflows/samples.yml index d13221705197..0fce80dee882 100644 --- a/.github/workflows/samples.yml +++ b/.github/workflows/samples.yml @@ -95,7 +95,7 @@ jobs: - name: "Setup emsdk" uses: mymindstorm/setup-emsdk@6ab9eb1bda2574c4ddb79809fc9247783eaf9021 # v14 with: - version: 3.1.44 + version: 3.1.74 - name: "Test experimental web samples" env: HOST_TOOLS_BINARY_DIR: ${{ env.VENV_DIR }}/bin From 6ff00a8a008d06b604d4ca4e0ae6e601ae810b4f Mon Sep 17 00:00:00 2001 From: Prashant Kumar Date: Mon, 16 Dec 2024 22:40:29 +0530 Subject: [PATCH 22/64] [LLVMGPU] Deprecate the matmul simt pipeline (#19335) This patch deprecates the matmul simt pipeline and replaces all its uses with the newer TileAndFuse pipeline. --- .../test/gpu_reorder_workgroups_static.mlir | 2 +- .../Dialect/Codegen/IR/IREECodegenAttrs.td | 22 +++--- .../compiler/Codegen/LLVMGPU/KernelConfig.cpp | 70 +++++++++++++++++-- .../LLVMGPU/LLVMGPULowerExecutableTarget.cpp | 3 - .../iree/compiler/Codegen/LLVMGPU/Passes.cpp | 66 ----------------- .../iree/compiler/Codegen/LLVMGPU/Passes.h | 4 -- .../compiler/Codegen/LLVMGPU/Verifiers.cpp | 11 +-- .../Codegen/LLVMGPU/test/config_matvec.mlir | 5 +- .../test/config_root_op_attribute.mlir | 2 +- .../LLVMGPU/test/distribute_to_thread.mlir | 8 +-- .../LLVMGPU/test/gpu_set_num_workgroups.mlir | 28 +++----- .../LLVMGPU/test/illegal_configuration.mlir | 38 ---------- .../LLVMGPU/test/nvvm_pipeline_test.mlir | 31 ++++---- .../LLVMGPU/test/rocdl_pipeline_test.mlir | 13 ++-- tests/e2e/matmul/BUILD.bazel | 24 ------- tests/e2e/matmul/CMakeLists.txt | 26 ------- tests/e2e/matmul/generate_e2e_matmul_tests.py | 14 +--- 17 files changed, 108 insertions(+), 259 deletions(-) diff --git a/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_reorder_workgroups_static.mlir 
b/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_reorder_workgroups_static.mlir index 1b7a99184dcb..992dc8ec4435 100644 --- a/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_reorder_workgroups_static.mlir +++ b/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_reorder_workgroups_static.mlir @@ -25,7 +25,7 @@ ]> hal.executable private @main_dispatch_0 { hal.executable.variant public @rocm_hsaco_fb target(<"rocm", "rocm-hsaco-fb">) { - hal.executable.export public @main_dispatch_0_matmul_transpose_b_32000x32000x4096_f16 ordinal(0) layout(#pipeline_layout) attributes {subgroup_size = 64 : index, translation_info = #iree_codegen.translation_info, workgroup_size = [64 : index, 16 : index, 1 : index]} { + hal.executable.export public @main_dispatch_0_matmul_transpose_b_32000x32000x4096_f16 ordinal(0) layout(#pipeline_layout) attributes {subgroup_size = 64 : index, translation_info = #iree_codegen.translation_info, workgroup_size = [64 : index, 16 : index, 1 : index]} { ^bb0(%arg0: !hal.device): %c250 = arith.constant 250 : index %c500 = arith.constant 500 : index diff --git a/compiler/src/iree/compiler/Codegen/Dialect/Codegen/IR/IREECodegenAttrs.td b/compiler/src/iree/compiler/Codegen/Dialect/Codegen/IR/IREECodegenAttrs.td index 26b37dd07e24..e5c6f6f649cd 100644 --- a/compiler/src/iree/compiler/Codegen/Dialect/Codegen/IR/IREECodegenAttrs.td +++ b/compiler/src/iree/compiler/Codegen/Dialect/Codegen/IR/IREECodegenAttrs.td @@ -40,26 +40,24 @@ def LLVMGPU_SimpleDistribute : I32EnumAttrCase<"LLVMGPUDistribute", 102>; def LLVMGPU_Vectorize : I32EnumAttrCase<"LLVMGPUVectorize", 103>; -def LLVMGPU_MatmulSimt - : I32EnumAttrCase<"LLVMGPUMatmulSimt", 104>; def LLVMGPU_MatmulTensorCore - : I32EnumAttrCase<"LLVMGPUMatmulTensorCore", 105>; + : I32EnumAttrCase<"LLVMGPUMatmulTensorCore", 104>; def LLVMGPU_TransposeSharedMem - : I32EnumAttrCase<"LLVMGPUTransposeSharedMem", 106>; + : I32EnumAttrCase<"LLVMGPUTransposeSharedMem", 105>; def LLVMGPU_WarpReduction - : 
I32EnumAttrCase<"LLVMGPUWarpReduction", 107>; + : I32EnumAttrCase<"LLVMGPUWarpReduction", 106>; def LLVMGPU_PackUnPack - : I32EnumAttrCase<"LLVMGPUPackUnPack", 108>; + : I32EnumAttrCase<"LLVMGPUPackUnPack", 107>; def LLVMGPU_MatmulTensorCoreMmaSync - : I32EnumAttrCase<"LLVMGPUMatmulTensorCoreMmaSync", 109>; + : I32EnumAttrCase<"LLVMGPUMatmulTensorCoreMmaSync", 108>; def LLVMGPU_VectorDistribute - : I32EnumAttrCase<"LLVMGPUVectorDistribute", 110>; + : I32EnumAttrCase<"LLVMGPUVectorDistribute", 109>; def LLVMGPU_PadAndVectorDistribute - : I32EnumAttrCase<"LLVMGPUPadAndVectorDistribute", 111>; + : I32EnumAttrCase<"LLVMGPUPadAndVectorDistribute", 110>; def LLVMGPU_WinogradVectorize - : I32EnumAttrCase<"LLVMGPUWinogradVectorize", 112>; + : I32EnumAttrCase<"LLVMGPUWinogradVectorize", 111>; def LLVMGPU_TileAndFuse - : I32EnumAttrCase<"LLVMGPUTileAndFuse", 113>; + : I32EnumAttrCase<"LLVMGPUTileAndFuse", 112>; def SPIRV_BaseLowering : I32EnumAttrCase<"SPIRVBaseLowering", 200>; @@ -98,7 +96,7 @@ def DispatchLoweringPassPipelineEnum : I32EnumAttr< // LLVMGPU CodeGen pipelines LLVMGPU_Default, LLVMGPU_BaseLowering, LLVMGPU_SimpleDistribute, - LLVMGPU_Vectorize, LLVMGPU_MatmulSimt, LLVMGPU_MatmulTensorCore, + LLVMGPU_Vectorize, LLVMGPU_MatmulTensorCore, LLVMGPU_TransposeSharedMem, LLVMGPU_WarpReduction, LLVMGPU_PackUnPack, LLVMGPU_MatmulTensorCoreMmaSync, LLVMGPU_VectorDistribute, LLVMGPU_PadAndVectorDistribute, LLVMGPU_WinogradVectorize, diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/KernelConfig.cpp b/compiler/src/iree/compiler/Codegen/LLVMGPU/KernelConfig.cpp index ee4614d7bb05..1f44bf693a55 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/KernelConfig.cpp +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/KernelConfig.cpp @@ -1295,9 +1295,11 @@ static LogicalResult setContractConfig(IREE::GPU::TargetAttr target, CodeGenPipeline pipeline) { TileSizesListType tileSizes; unsigned numParallelLoops = op.getNumParallelLoops(); - SmallVector 
workgroupTileSizes(numParallelLoops - 2, 1); - workgroupTileSizes.append({tileX, tileY}); - workgroupTileSizes.append(op.getNumReductionLoops(), tileK); + unsigned numReductionLoops = op.getNumReductionLoops(); + SmallVector workgroupTileSizes( + numParallelLoops + numReductionLoops, 1); + workgroupTileSizes[numParallelLoops - 2] = tileX; + workgroupTileSizes[numParallelLoops - 1] = tileY; SmallVector partitionedLoops = cast(op.getOperation()) @@ -1311,11 +1313,65 @@ static LogicalResult setContractConfig(IREE::GPU::TargetAttr target, } } - tileSizes.emplace_back(std::move(workgroupTileSizes)); // Workgroup level. std::optional subgroupSize = std::nullopt; if (!subgroupSizes.empty()) subgroupSize = subgroupSizes.front(); + // For the LLVMGPUTileAndFuse pipeline, we need to split tile sizes + // for workgroup, thread, and reduction. + if (pipeline == CodeGenPipeline::LLVMGPUTileAndFuse) { + + auto context = op.getContext(); + Builder b(context); + SmallVector attrs; + + SmallVector threadTileSizes(numParallelLoops + numReductionLoops, + 0); + std::fill(threadTileSizes.begin(), + threadTileSizes.begin() + numParallelLoops, 1); + + threadTileSizes[numParallelLoops - 2] = + (tileX / workgroupSize[0]) < 1 ? 1 : (tileX / workgroupSize[0]); + threadTileSizes[numParallelLoops - 1] = + (tileY / workgroupSize[1]) < 1 ? 1 : (tileY / workgroupSize[1]); + + SmallVector reductionTileSizes( + numParallelLoops + numReductionLoops, 0); + reductionTileSizes[numParallelLoops + numReductionLoops - 1] = tileK; + + attrs.emplace_back(b.getStringAttr("workgroup"), + b.getI64ArrayAttr(workgroupTileSizes)); + attrs.emplace_back(b.getStringAttr("thread"), + b.getI64ArrayAttr(threadTileSizes)); + attrs.emplace_back(b.getStringAttr("reduction"), + b.getI64ArrayAttr(reductionTileSizes)); + + // Promote operands to use shared memory for LHS and RHS. 
+ IREE::GPU::setPromotedOperandList(context, attrs, {0, 1}); + auto configDict = b.getDictionaryAttr(attrs); + auto loweringConfig = + IREE::GPU::LoweringConfigAttr::get(context, configDict); + SmallVector pipelineAttrs; + auto pipelineOptions = IREE::GPU::GPUPipelineOptionsAttr::get( + context, /*prefetchSharedMemory=*/false, + /*no_reduce_shared_memory_bank_conflicts=*/true, + /*use_igemm_convolution=*/false, + /*reorder_workgroups_strategy=*/std::nullopt); + pipelineAttrs.emplace_back( + b.getStringAttr(IREE::GPU::GPUPipelineOptionsAttr::getDictKeyName()), + pipelineOptions); + auto pipelineConfig = b.getDictionaryAttr(pipelineAttrs); + + return setOpConfigAndEntryPointFnTranslation( + entryPoint, op, loweringConfig, pipeline, workgroupSize, subgroupSize, + pipelineConfig); + } + + // Other pipeline (MatmulTensorCore) expect the reduction tile size to be in + // the same list. + workgroupTileSizes[numParallelLoops + numReductionLoops - 1] = tileK; + tileSizes.emplace_back(std::move(workgroupTileSizes)); + return setOpConfigAndEntryPointFnTranslation( entryPoint, op, tileSizes, pipeline, workgroupSize, subgroupSize, getSoftwarePipeliningAttrDict(op->getContext(), softwarePipelineDepth, @@ -1390,7 +1446,7 @@ static LogicalResult setContractConfig(IREE::GPU::TargetAttr target, return setMatmulConfig( sizeN, sizeM, 4, {sizeM, sizeN, 1}, target.getWgp().getSubgroupSizeChoices().asArrayRef(), - softwarePipelineDepthSimt, CodeGenPipeline::LLVMGPUMatmulSimt); + softwarePipelineDepthSimt, CodeGenPipeline::LLVMGPUTileAndFuse); } // SIMT matmul case. Query the best configuration. 
@@ -1404,7 +1460,7 @@ static LogicalResult setContractConfig(IREE::GPU::TargetAttr target, config.tileSize[0], config.tileSize[1], config.tileSize[2], config.workgroupSize, target.getWgp().getSubgroupSizeChoices().asArrayRef(), - softwarePipelineDepthSimt, CodeGenPipeline::LLVMGPUMatmulSimt); + softwarePipelineDepthSimt, CodeGenPipeline::LLVMGPUTileAndFuse); } } } @@ -1429,7 +1485,7 @@ static LogicalResult setContractConfig(IREE::GPU::TargetAttr target, return setMatmulConfig(tileX, tileY, tileK, workgroupSize, target.getWgp().getSubgroupSizeChoices().asArrayRef(), softwarePipelineDepthSimt, - CodeGenPipeline::LLVMGPUMatmulSimt); + CodeGenPipeline::LLVMGPUTileAndFuse); } //====---------------------------------------------------------------------===// diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/LLVMGPULowerExecutableTarget.cpp b/compiler/src/iree/compiler/Codegen/LLVMGPU/LLVMGPULowerExecutableTarget.cpp index 73688d2b92d5..1773e229c284 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/LLVMGPULowerExecutableTarget.cpp +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/LLVMGPULowerExecutableTarget.cpp @@ -114,9 +114,6 @@ void LLVMGPULowerExecutableTargetPass::runOnOperation() { case IREE::Codegen::DispatchLoweringPassPipeline::LLVMGPUWinogradVectorize: addGPUWinogradVectorizePassPipeline(pipeline); break; - case IREE::Codegen::DispatchLoweringPassPipeline::LLVMGPUMatmulSimt: - addGPUMatmulSimtPassPipeline(pipeline, pipelineOptions); - break; case IREE::Codegen::DispatchLoweringPassPipeline::LLVMGPUMatmulTensorCore: { FailureOr maybeDepth = getSoftwarePipelineDepth(translationInfo.getConfiguration()); diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.cpp b/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.cpp index b6414e1b6a47..18aa3b41f64e 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.cpp +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.cpp @@ -526,72 +526,6 @@ void addGPUWinogradVectorizePassPipeline(OpPassManager 
&funcPassManager) { funcPassManager.addPass(createOptimizeTensorInsertExtractSlicesPass()); } -//===---------------------------------------------------------------------===// -// MatmulSIMT -//===---------------------------------------------------------------------===// - -void addGPUMatmulSimtPassPipeline(OpPassManager &funcPassManager, - const GPUPipelineOptions &options) { - tileAndDistributeToWorkgroup(funcPassManager, /*useForall=*/false); - - funcPassManager.addPass(createConfigTrackingCanonicalizerPass()); - funcPassManager.addPass(createConfigTrackingCanonicalizerPass()); - funcPassManager.addPass(createCSEPass()); - - funcPassManager.addPass(createGPUTensorTileToSerialLoopsPass()); - funcPassManager.addPass(createGPUTensorAlloc()); - funcPassManager.addPass(createGPUTensorTilePass()); - - // Linalg -> vector - addGPUVectorizationPasses(funcPassManager); - - // tensor to memref - addBufferizePasses(funcPassManager); - - // distribute foreach threads - funcPassManager.addPass(createGPUDistributePass()); - - funcPassManager.addPass(createMemrefCopyToLinalgPass()); - funcPassManager.addPass(createGPUDistributeSharedMemoryCopyPass()); - funcPassManager.addPass(createCanonicalizerPass()); - funcPassManager.addPass(createCSEPass()); - - if (options.enableReduceSharedMemoryBankConflicts) { - funcPassManager.addPass(createGPUReduceBankConflictsPass()); - } - - ReorderWorkgroupsStrategy reorderStrategy = - getReorderWorkgroupsStrategy(options.reorderStrategy); - funcPassManager.addPass( - createReorderWorkgroups(reorderStrategy, canReorderWorkgroups)); - - funcPassManager.addPass(createCanonicalizerPass()); - funcPassManager.addPass(createCSEPass()); - - funcPassManager.addPass(memref::createFoldMemRefAliasOpsPass()); - funcPassManager.addPass(createCSEPass()); - funcPassManager.addPass(createCanonicalizerPass()); - funcPassManager.addPass(createCSEPass()); - - // Even though we vectorize before bufferization we are not able to hoist - // accumulator load/store out 
of the K loop until distribution. This is - // because we materialize the fill and the matmul in two different scf.forall - // regions, when they should be in the same scf.forall. Newer pipelines - // like TileAndFuse don't have this problem, because they coalesce these - // scf.forall regions into a single scf.forall. - // - // Therefore we still rely on buffer level transformations for transfer ops - // hoisting and store to load forwarding. This relies on shacky alias - // analysis and we need to move this to tensor level once we have better - // abstractions. - funcPassManager.addPass(createOptimizeVectorTransferPass()); - - // Hoist loop invariant code to avoid pipelining it. - funcPassManager.addPass(createIREELoopInvariantCodeMotionPass()); - // Pipeline memory operations. - funcPassManager.addPass(createGPUPipeliningPass()); -} - //===---------------------------------------------------------------------===// // Matmul Tensor Core //===---------------------------------------------------------------------===// diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.h b/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.h index caacfb2656e3..17b7b866be11 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.h +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.h @@ -28,10 +28,6 @@ using IREE::GPU::GPUPipelineOptions; // LLVMGPU Backend Pass Pipelines //----------------------------------------------------------------------------// -/// Lowering using SIMT CUDA core operations. -void addGPUMatmulSimtPassPipeline(OpPassManager &funcPassManager, - const GPUPipelineOptions &options); - /// Lowering using mma.sync Tensor Core operations. 
void addGPUMatmulTensorCoreMmaSyncPassPipeline( OpPassManager &funcPassManager, const GPUPipelineOptions &options, diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/Verifiers.cpp b/compiler/src/iree/compiler/Codegen/LLVMGPU/Verifiers.cpp index f2e3e2da4e3f..bab5de877eb3 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/Verifiers.cpp +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/Verifiers.cpp @@ -38,10 +38,6 @@ getInstructionShape(Operation *op, CodeGenPipeline pipeline, Type inputElementType, SmallVector &instructionShape) { switch (pipeline) { - case CodeGenPipeline::LLVMGPUMatmulSimt: - // SIMT Pipeline / CUDA Cores - instructionShape = {1, 1, 1}; - break; case CodeGenPipeline::LLVMGPUMatmulTensorCore: // Tensor Core Pipeline / WMMA API if (inputElementType.isF16() || inputElementType.isBF16()) { @@ -81,8 +77,7 @@ verifyGPUMatmulPipeline(Operation *op, ArrayRef workgroupSize) { // This verifier only applies to matmul. CodeGenPipeline pipeline = translationInfo.getDispatchLoweringPassPipeline(); - if (pipeline != CodeGenPipeline::LLVMGPUMatmulSimt && - pipeline != CodeGenPipeline::LLVMGPUMatmulTensorCore && + if (pipeline != CodeGenPipeline::LLVMGPUMatmulTensorCore && pipeline != CodeGenPipeline::LLVMGPUMatmulTensorCoreMmaSync) { return success(); } @@ -180,10 +175,6 @@ verifyGPUMatmulPipeline(Operation *op, << pipelineName; } - // Return success for SIMT/CUDA cores. - if (pipeline == CodeGenPipeline::LLVMGPUMatmulSimt) - return success(); - // // Additional verification Tensor Core pipelines. 
// diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/config_matvec.mlir b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/config_matvec.mlir index 1e5dbf63f2f9..059761c1ae19 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/config_matvec.mlir +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/config_matvec.mlir @@ -267,12 +267,11 @@ func.func @not_vmt() { return } -// CHECK-DAG: #[[$CONFIG:.+]] = #iree_codegen.lowering_config -// CHECK: #[[$TRANSLATION:.+]] = #iree_codegen.translation_info +// CHECK-DAG: #[[$TRANSLATION:.+]] = #iree_codegen.translation_info}> // CHECK: func.func @not_vmt() // CHECK-SAME: translation_info = #[[$TRANSLATION]] // CHECK: linalg.generic -// CHECK-SAME: lowering_config = #[[$CONFIG]] +// CHECK-SAME: lowering_config = #iree_gpu.lowering_config<{promote_operands = [0, 1], reduction = [0, 0, 8], thread = [1, 128, 0], workgroup = [1, 128, 1]}> // ----- diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/config_root_op_attribute.mlir b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/config_root_op_attribute.mlir index f3e0d81fb961..3c7e52aa475a 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/config_root_op_attribute.mlir +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/config_root_op_attribute.mlir @@ -9,4 +9,4 @@ func.func @matmul(%lhs: tensor<4x4xf32>, %rhs: tensor<4x4xf32>) -> tensor<4x4xf3 return %result : tensor<4x4xf32> } -// CHECK: %2 = linalg.matmul {lowering_config = #config, root_op} ins(%arg0, %arg1 : tensor<4x4xf32>, tensor<4x4xf32>) outs(%1 : tensor<4x4xf32>) -> tensor<4x4xf32> +// CHECK: %2 = linalg.matmul {lowering_config = #{{.*}}, root_op} ins(%arg0, %arg1 : tensor<4x4xf32>, tensor<4x4xf32>) outs(%1 : tensor<4x4xf32>) -> tensor<4x4xf32> diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/distribute_to_thread.mlir b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/distribute_to_thread.mlir index cd69906aec13..cec55cdaf0a5 100644 --- 
a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/distribute_to_thread.mlir +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/distribute_to_thread.mlir @@ -9,7 +9,7 @@ #map = affine_map<()[s0] -> (s0 * 2)> #map1 = affine_map<()[s0] -> (s0 * 256)> #map2 = affine_map<(d0, d1)[s0] -> (d0 * 1024 + s0 + d1)> -#translation = #iree_codegen.translation_info +#translation = #iree_codegen.translation_info func.func @dot_dispatch_0() attributes {translation_info = #translation} { %cst = arith.constant 0.000000e+00 : f32 %c0 = arith.constant 0 : index @@ -79,7 +79,7 @@ func.func @dot_dispatch_0() attributes {translation_info = #translation} { #map2 = affine_map<(d0, d1, d2)[s0] -> (d0 * 32768 + s0 + d1 * 1024 + d2)> #map3 = affine_map<(d0, d1, d2)[s0] -> (d0 * 65536 + s0 + d1 * 64 + d2)> #map4 = affine_map<(d0, d1, d2)[s0] -> (d0 * 2048 + s0 + d1 * 64 + d2)> -#translation = #iree_codegen.translation_info +#translation = #iree_codegen.translation_info func.func @batch_matmul_func() attributes {translation_info = #translation} { %c0 = arith.constant 0 : index %cst = arith.constant 0.000000e+00 : f32 @@ -148,7 +148,7 @@ func.func @batch_matmul_func() attributes {translation_info = #translation} { #map = affine_map<()[s0] -> (s0 * 2)> #map1 = affine_map<()[s0] -> (s0 * 32)> #map2 = affine_map<(d0, d1)[s0] -> (d0 * 1024 + s0 + d1)> -#translation = #iree_codegen.translation_info +#translation = #iree_codegen.translation_info func.func @dot_dispatch_0() attributes {translation_info = #translation} { %cst = arith.constant 0.000000e+00 : f32 %c0 = arith.constant 0 : index @@ -312,7 +312,7 @@ module { #hal.pipeline.binding ]> #config = #iree_codegen.lowering_config -#translation = #iree_codegen.translation_info +#translation = #iree_codegen.translation_info #map = affine_map<()[s0] -> (s0 * 2)> #map1 = affine_map<()[s0] -> (s0 * 256)> #map2 = affine_map<(d0)[s0] -> (-d0 + s0, 2)> diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/gpu_set_num_workgroups.mlir 
b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/gpu_set_num_workgroups.mlir index 66fc62f2e482..b407aa64e864 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/gpu_set_num_workgroups.mlir +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/gpu_set_num_workgroups.mlir @@ -54,14 +54,12 @@ func.func @dot_dispatch_1() { return } -// CHECK-DAG: #[[CONFIG:.+]] = #iree_codegen.lowering_config -// CHECK-DAG: #[[TRANSLATION:.+]] = #iree_codegen.translation_info +// CHECK: #[[TRANSLATION:.+]] = #iree_codegen.translation_info}> // CHECK: func.func @dot_dispatch_1 // CHECK-SAME: translation_info = #[[TRANSLATION]] // CHECK: linalg.fill -// CHECK-SAME: lowering_config = #[[CONFIG]] // CHECK: linalg.matmul -// CHECK-SAME: lowering_config = #[[CONFIG]] +// CHECK-SAME: lowering_config = #iree_gpu.lowering_config<{promote_operands = [0, 1], reduction = [0, 0, 4], thread = [2, 1, 0], workgroup = [4, 2, 1]}> // ----- @@ -83,14 +81,12 @@ func.func @unaligned_k() { return } -// CHECK-DAG: #[[CONFIG:.+]] = #iree_codegen.lowering_config -// CHECK-DAG: #[[TRANSLATION:.+]] = #iree_codegen.translation_info +// CHECK: #[[TRANSLATION:.+]] = #iree_codegen.translation_info}> // CHECK: func.func @unaligned_k // CHECK-SAME: translation_info = #[[TRANSLATION]] // CHECK: linalg.fill -// CHECK-SAME: lowering_config = #[[CONFIG]] // CHECK: linalg.matmul -// CHECK-SAME: lowering_config = #[[CONFIG]] +// CHECK-SAME: lowering_config = #iree_gpu.lowering_config<{promote_operands = [0, 1], reduction = [0, 0, 2], thread = [1, 16, 0], workgroup = [32, 128, 1]}> // ----- @@ -123,7 +119,6 @@ func.func @predict_dispatch_153() { // CHECK: func.func @predict_dispatch_153() // CHECK-SAME: translation_info = #[[TRANSLATION]] // CHECK: linalg.fill -// CHECK-SAME: lowering_config = #[[CONFIG]] // CHECK: linalg.generic // CHECK-SAME: lowering_config = #[[CONFIG]] @@ -254,7 +249,7 @@ func.func @static_3d_fft_stage3() { #hal.pipeline.binding ]> #config = #iree_codegen.lowering_config -#translation = 
#iree_codegen.translation_info +#translation = #iree_codegen.translation_info #compilation = #iree_codegen.compilation_info func.func @_lowering_config_test_dispatch_1() { %cst = arith.constant 0.000000e+00 : f32 @@ -274,11 +269,10 @@ func.func @_lowering_config_test_dispatch_1() { } // CHECK-DAG: #[[CONFIG:.+]] = #iree_codegen.lowering_config +// CHECK-DAG: #[[TRANSLATION:.+]] = #iree_codegen.translation_info // CHECK: func.func @_lowering_config_test_dispatch_1() // CHECK-SAME: translation_info = #[[TRANSLATION]] // CHECK: linalg.fill -// CHECK-SAME: lowering_config = #[[CONFIG]] // CHECK: linalg.matmul // CHECK-SAME: lowering_config = #[[CONFIG]] @@ -341,7 +335,7 @@ func.func @matmul_config_sm35() { return } -// CHECK-DAG: #[[TRANSLATION:.+]] = #iree_codegen.translation_info +// CHECK-DAG: #[[TRANSLATION:.+]] = #iree_codegen.translation_info}> // CHECK: func.func @matmul_config_sm35() // CHECK-SAME: translation_info = #[[TRANSLATION]] @@ -501,7 +495,6 @@ func.func @large_matmul_f16() { // SM80: func.func @large_matmul_f16() // SM80-SAME: translation_info = #[[TRANSLATION]] // SM80: linalg.fill -// SM80-SAME: lowering_config = #[[CONFIG]] // SM80: linalg.matmul // SM80-SAME: lowering_config = #[[CONFIG]] @@ -534,7 +527,6 @@ func.func @large_matmul_f32() { // SM80: func.func @large_matmul_f32() // SM80-SAME: translation_info = #[[TRANSLATION]] // SM80: linalg.fill -// SM80-SAME: lowering_config = #[[CONFIG]] // SM80: linalg.matmul // SM80-SAME: lowering_config = #[[CONFIG]] @@ -659,14 +651,12 @@ func.func @_main_dispatch_15_generic_512x4x42x42x64_f32() { return } -// CHECK-DAG: #[[CONFIG:.+]] = #iree_codegen.lowering_config +// CHECK: #[[TRANSLATION:.+]] = #iree_codegen.translation_info}> // CHECK: func.func @_main_dispatch_15_generic_512x4x42x42x64_f32() // CHECK-SAME: translation_info = #[[TRANSLATION]] // CHECK: linalg.fill -// CHECK-SAME: lowering_config = #[[CONFIG]] // CHECK: linalg.generic -// CHECK-SAME: lowering_config = #[[CONFIG]] +// CHECK-SAME: 
lowering_config = #iree_gpu.lowering_config<{promote_operands = [0, 1], reduction = [0, 0, 0, 0, 32], thread = [1, 1, 1, 16, 0], workgroup = [1, 1, 32, 128, 1]}> // ----- diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/illegal_configuration.mlir b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/illegal_configuration.mlir index 2c3df44b325b..8dccac1fb4a6 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/illegal_configuration.mlir +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/illegal_configuration.mlir @@ -1,43 +1,5 @@ // RUN: iree-opt --iree-gpu-test-target=sm_60 --pass-pipeline="builtin.module(iree-llvmgpu-select-lowering-strategy)" --verify-diagnostics --split-input-file %s -#pipeline_layout = #hal.pipeline.layout, - #hal.pipeline.binding, - #hal.pipeline.binding -]> -#config = #iree_codegen.lowering_config -#translation = #iree_codegen.translation_info -func.func @illegal() attributes {translation_info = #translation} { - %c0 = arith.constant 0 : index - %0 = hal.interface.binding.subspan layout(#pipeline_layout) binding(0) : memref<4x8xf32> - %1 = hal.interface.binding.subspan layout(#pipeline_layout) binding(1) : memref<8x16xf32> - %2 = hal.interface.binding.subspan layout(#pipeline_layout) binding(2) : memref<4x16xf32> - // expected-error @+1 {{Total number of threads in a thread block 2048 exceeds the limit of 1024 with compilation pipeline LLVMGPUMatmulSimt}} - linalg.matmul {lowering_config = #config} ins(%0, %1 : memref<4x8xf32>, memref<8x16xf32>) outs(%2 : memref<4x16xf32>) - return -} - -// ----- - -#pipeline_layout = #hal.pipeline.layout, - #hal.pipeline.binding, - #hal.pipeline.binding -]> -#config = #iree_codegen.lowering_config -#translation = #iree_codegen.translation_info -func.func @illegal() attributes {translation_info = #translation} { - %c0 = arith.constant 0 : index - %0 = hal.interface.binding.subspan layout(#pipeline_layout) binding(0) : memref<4x8xf32> - %1 = hal.interface.binding.subspan 
layout(#pipeline_layout) binding(1) : memref<8x16xf32> - %2 = hal.interface.binding.subspan layout(#pipeline_layout) binding(2) : memref<4x16xf32> - // expected-error @+1 {{Expected workgroup size in z-dim = 1, but got 2 with compilation pipeline LLVMGPUMatmulSimt}} - linalg.matmul {lowering_config = #config} ins(%0, %1 : memref<4x8xf32>, memref<8x16xf32>) outs(%2 : memref<4x16xf32>) - return -} - -// ----- - #pipeline_layout = #hal.pipeline.layout, #hal.pipeline.binding, diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/nvvm_pipeline_test.mlir b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/nvvm_pipeline_test.mlir index 9cb3fed6254c..ee2578c3e9e9 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/nvvm_pipeline_test.mlir +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/nvvm_pipeline_test.mlir @@ -83,20 +83,14 @@ hal.executable @dot_dispatch_0 { } } -// CHECK-LABEL: hal.executable public @dot_dispatch_0 -// CHECK: hal.executable.variant public @cuda -// CHECK-NOT: llvm.store -// CHECK-COUNT-3: llvm.load {{.*}} : !llvm.ptr<1> -> vector<4xf32> -// CHECK: llvm.br -// CHECK-COUNT-3: llvm.store {{.*}} : vector<4xf32>, !llvm.ptr<3> -// CHECK-COUNT-32: llvm.load {{.*}} : !llvm.ptr<3> -> vector<4xf32> -// CHECK-COUNT-128: llvm.intr.fmuladd({{.*}}) : (vector<4xf32>, vector<4xf32>, vector<4xf32>) -> vector<4xf32> -// CHECK-COUNT-3: llvm.load {{.*}} : !llvm.ptr<1> -> vector<4xf32> -// CHECK: llvm.br -// CHECK-COUNT-3: llvm.store {{.*}} : vector<4xf32>, !llvm.ptr<3> -// CHECK-COUNT-32: llvm.load {{.*}} : !llvm.ptr<3> -> vector<4xf32> -// CHECK-COUNT-128: llvm.intr.fmuladd({{.*}}) : (vector<4xf32>, vector<4xf32>, vector<4xf32>) -> vector<4xf32> -// CHECK-COUNT-4: llvm.store {{.*}} : vector<4xf32>, !llvm.ptr<1> +// CHECK-LABEL: hal.executable public @dot_dispatch_0 +// CHECK: hal.executable.variant public @cuda +// CHECK-NOT: llvm.store +// CHECK: llvm.br +// CHECK: llvm.load {{.*}} : !llvm.ptr<3> -> vector<32xf32> +// CHECK-COUNT-32: llvm.load {{.*}} 
: !llvm.ptr<3> -> vector<16xf32> +// CHECK-COUNT-32: llvm.intr.fmuladd({{.*}}) : (vector<16xf32>, vector<16xf32>, vector<16xf32>) -> vector<16xf32> +// CHECK: llvm.store {{.*}} : vector<16xf32>, !llvm.ptr<1> // ----- @@ -158,11 +152,10 @@ hal.executable @dot_dispatch_0 { } // CHECK-LABEL: hal.executable public @dot_dispatch_0 -// CHECK: hal.executable.variant public @cuda -// CHECK: llvm.br -// CHECK-COUNT-8: llvm.intr.fmuladd({{.*}}) : (vector<4xf32>, vector<4xf32>, vector<4xf32>) -> vector<4xf32> -// CHECK: llvm.br -// CHECK-COUNT-2: llvm.store {{.*}} : vector<4xf32>, !llvm.ptr<1> +// CHECK: hal.executable.variant public @cuda +// CHECK: llvm.br +// CHECK-COUNT-32: llvm.intr.fmuladd({{.*}}) : (vector<16xf32>, vector<16xf32>, vector<16xf32>) -> vector<16xf32> +// CHECK: llvm.store {{.*}} : vector<16xf32>, !llvm.ptr<1> // ----- diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/rocdl_pipeline_test.mlir b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/rocdl_pipeline_test.mlir index 578d28b027b5..8055b9e8c412 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/rocdl_pipeline_test.mlir +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/rocdl_pipeline_test.mlir @@ -87,17 +87,12 @@ hal.executable @dot_dispatch_0 { // RDNA3-LABEL: hal.executable public @dot_dispatch_0 // RDNA3: hal.executable.variant public @rocm // RDNA3-NOT: llvm.store -// RDNA3-COUNT-3: llvm.load {{.*}} : !llvm.ptr<1> -> vector<4xf32> // RDNA3: llvm.br -// RDNA3-COUNT-3: llvm.store {{.*}} : vector<4xf32>, !llvm.ptr<3> -// RDNA3-COUNT-32: llvm.load {{.*}} : !llvm.ptr<3> -> vector<4xf32> -// RDNA3-COUNT-128: llvm.intr.fmuladd({{.*}}) : (vector<4xf32>, vector<4xf32>, vector<4xf32>) -> vector<4xf32> -// RDNA3-COUNT-3: llvm.load {{.*}} : !llvm.ptr<1> -> vector<4xf32> +// RDNA3-COUNT-1: llvm.load {{.*}} : !llvm.ptr<3> -> vector<32xf32> +// RDNA3-COUNT-32: llvm.load {{.*}} : !llvm.ptr<3> -> vector<16xf32> +// RDNA3-COUNT-32: llvm.intr.fmuladd({{.*}}) : (vector<16xf32>, vector<16xf32>, 
vector<16xf32>) -> vector<16xf32> +// RDNA3-COUNT-1: llvm.store {{.*}} : vector<16xf32>, !llvm.ptr<1> // RDNA3: llvm.br -// RDNA3-COUNT-3: llvm.store {{.*}} : vector<4xf32>, !llvm.ptr<3> -// RDNA3-COUNT-32: llvm.load {{.*}} : !llvm.ptr<3> -> vector<4xf32> -// RDNA3-COUNT-128: llvm.intr.fmuladd({{.*}}) : (vector<4xf32>, vector<4xf32>, vector<4xf32>) -> vector<4xf32> -// RDNA3-COUNT-4: llvm.store {{.*}} : vector<4xf32>, !llvm.ptr<1> // ----- diff --git a/tests/e2e/matmul/BUILD.bazel b/tests/e2e/matmul/BUILD.bazel index 0bad5e06eef7..8ffe93c0ffac 100644 --- a/tests/e2e/matmul/BUILD.bazel +++ b/tests/e2e/matmul/BUILD.bazel @@ -385,30 +385,6 @@ X86_64_AVX512_BF16 = X86_64_AVX512 + [ ## ########################################################################### -iree_generated_e2e_runner_test( - name = "e2e_matmul_cuda_f32_large_simt", - generator = ":generate_e2e_matmul_tests", - generator_args = [ - "--lhs_rhs_type=f32", - "--acc_type=f32", - "--shapes=easy_large_static", - "--compilation_info=LLVMGPUMatmulSimt", - ], - tags = [ - # CUDA cuInit fails with sanitizer on. - "noasan", - "nomsan", - "notsan", - "noubsan", - "requires-gpu-nvidia", - ], - target_backends_and_drivers = [ - ("cuda", "cuda"), - ], - test_runner = "//tools/testing/e2e:iree-e2e-matmul-test", - test_type = "matmul", -) - # Testing Ampere + TensorCore path. 
# WMMA TensorCore(F32): wmma.161616.f32.tf32 iree_generated_e2e_runner_test( diff --git a/tests/e2e/matmul/CMakeLists.txt b/tests/e2e/matmul/CMakeLists.txt index 9e7ec415b564..b744d346ebef 100644 --- a/tests/e2e/matmul/CMakeLists.txt +++ b/tests/e2e/matmul/CMakeLists.txt @@ -1016,32 +1016,6 @@ iree_generated_e2e_runner_test( "--iree-opt-data-tiling" ) -iree_generated_e2e_runner_test( - NAME - e2e_matmul_cuda_f32_large_simt - TEST_TYPE - matmul - GENERATOR - "generate_e2e_matmul_tests.py" - GENERATOR_ARGS - "--lhs_rhs_type=f32" - "--acc_type=f32" - "--shapes=easy_large_static" - "--compilation_info=LLVMGPUMatmulSimt" - TEST_RUNNER - iree_tools_testing_e2e_iree-e2e-matmul-test - TARGET_BACKENDS - "cuda" - DRIVERS - "cuda" - LABELS - "noasan" - "nomsan" - "notsan" - "noubsan" - "requires-gpu-nvidia" -) - iree_generated_e2e_runner_test( NAME e2e_matmul_cuda_f32_large_tensorcore diff --git a/tests/e2e/matmul/generate_e2e_matmul_tests.py b/tests/e2e/matmul/generate_e2e_matmul_tests.py index a97b5626c069..3061fb620af0 100644 --- a/tests/e2e/matmul/generate_e2e_matmul_tests.py +++ b/tests/e2e/matmul/generate_e2e_matmul_tests.py @@ -50,7 +50,6 @@ class ShapesId(enum.Enum): @enum.unique class CompilationInfoId(enum.Enum): NONE = "" - LLVMGPUMatmulSimt = "LLVMGPUMatmulSimt" LLVMGPUMatmulTensorCore = "LLVMGPUMatmulTensorCore" LLVMGPUMatmulTensorCoreMmaSync = "LLVMGPUMatmulTensorCoreMmaSync" LLVMGPUVectorDistributeMFMA = "LLVMGPUVectorDistributeMFMA" @@ -461,18 +460,7 @@ def get_test_compilation_infos( software_pipeline_depth = 0 tile_workgroup_size_pairs = [] - if compilation_info_id == CompilationInfoId.LLVMGPUMatmulSimt: - tile_workgroup_size_pairs = [ - TileWorkgroupSizePair([[32, 128, 32]], [32, 8, 1]), - TileWorkgroupSizePair([[128, 64, 8]], [16, 8, 1]), - TileWorkgroupSizePair([[16, 256, 32]], [64, 2, 1]), - TileWorkgroupSizePair([[8, 32, 32]], [8, 8, 1]), - TileWorkgroupSizePair([[8, 128, 4]], [32, 1, 1]), - TileWorkgroupSizePair([[16, 64, 4]], [16, 2, 1]), - 
TileWorkgroupSizePair([[1, 128, 8]], [32, 1, 1]), - ] - software_pipeline_depth = 3 - elif compilation_info_id == CompilationInfoId.SPIRVCooperativeMatrixVectorize: + if compilation_info_id == CompilationInfoId.SPIRVCooperativeMatrixVectorize: tile_workgroup_size_pairs = [ TileWorkgroupSizePair( [[64, 128], [32, 64], [0, 0, 32], [16, 16, 16]], [64, 2, 1] From 3ab9d4be1c491fdd455a91b255a288986550c786 Mon Sep 17 00:00:00 2001 From: Scott Todd Date: Mon, 16 Dec 2024 10:14:35 -0800 Subject: [PATCH 23/64] Revert "Skip test_sharktank job until quota issues are fixed." (#19491) Reverts iree-org/iree#19458. Re-enable the tests now that we have some quota. ci-exactly: build_packages, test_sharktank --- .github/workflows/pkgci.yml | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/.github/workflows/pkgci.yml b/.github/workflows/pkgci.yml index d55f91534d5f..e3c2f4c7824b 100644 --- a/.github/workflows/pkgci.yml +++ b/.github/workflows/pkgci.yml @@ -104,11 +104,10 @@ jobs: if: contains(fromJson(needs.setup.outputs.enabled-jobs), 'test_onnx') uses: ./.github/workflows/pkgci_test_onnx.yml - # TODO(https://github.com/iree-org/iree-test-suites/issues/56): re-enable when git LFS quota is available test_sharktank: name: Test Sharktank needs: [setup, build_packages] - if: false && contains(fromJson(needs.setup.outputs.enabled-jobs), 'test_sharktank') + if: contains(fromJson(needs.setup.outputs.enabled-jobs), 'test_sharktank') uses: ./.github/workflows/pkgci_test_sharktank.yml test_tensorflow: From f90771d4f7d835a25f8dbe5feeedf037509a25ce Mon Sep 17 00:00:00 2001 From: Scott Todd Date: Mon, 16 Dec 2024 11:20:00 -0800 Subject: [PATCH 24/64] Refresh project status in README and website homepage. (#19482) * Replace "IREE is still in its early phase." language with information about IREE being a sandbox project at the LF AI & Data Foundation. 
* Link to release notes via the non-prerelease releases page since the standard https://github.com/iree-org/iree/releases page is littered with nightly releases. Grouping prereleases under a single release like in iree-turbine and other projects would help here too. * Format "Presentations and talks" section with a table. We've had a few talks that aren't yet included here and the community meetings are now split between YouTube and Zoom (LF AI) ... not sure how best to organize them. * Add more icons to https://iree.dev/ homepage and expand on sections: ![image](https://github.com/user-attachments/assets/ecfc2fbb-46af-4d7c-bb19-f11d4d20f229) --------- Co-authored-by: Marius Brehler --- README.md | 43 ++++++++--------- .../developers/general/release-management.md | 3 ++ docs/website/docs/index.md | 47 ++++++++++++++----- docs/website/requirements.txt | 2 +- 4 files changed, 59 insertions(+), 36 deletions(-) diff --git a/README.md b/README.md index 188a5ea08d52..dcd58e65f385 100644 --- a/README.md +++ b/README.md @@ -15,15 +15,17 @@ guides, and instructions on building from source. [![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit)](https://github.com/pre-commit/pre-commit) [![OpenSSF Best Practices](https://www.bestpractices.dev/projects/8738/badge)](https://www.bestpractices.dev/projects/8738) -#### Project Status +## Project news -IREE is still in its early phase. We have settled down on the overarching -infrastructure and are actively improving various software components as well as -project logistics. It is still quite far from ready for everyday use and is made -available without any support at the moment. 
With that said, we welcome any kind -of feedback on any [communication channels](#communication-channels) +* 2024-05-23: [IREE joins the LF AI & Data Foundation as a sandbox-stage project](https://lfaidata.foundation/blog/2024/05/23/announcing-iree-a-new-initiative-for-machine-learning-deployment/) + +## Project status + +### Release status + +Releases notes are +[published on GitHub releases](https://github.com/iree-org/iree/releases?q=prerelease%3Afalse). -#### Release status | Package | Release status | | -- | -- | @@ -32,7 +34,7 @@ GitHub release (nightly) | [![GitHub Release](https://img.shields.io/github/v/re Python iree-base-compiler | [![PyPI version](https://badge.fury.io/py/iree-base-compiler.svg)](https://badge.fury.io/py/iree-base-compiler) Python iree-base-runtime | [![PyPI version](https://badge.fury.io/py/iree-base-runtime.svg)](https://badge.fury.io/py/iree-base-runtime) -#### Build status +### Build status [![CI](https://github.com/iree-org/iree/actions/workflows/ci.yml/badge.svg?query=branch%3Amain+event%3Apush)](https://github.com/iree-org/iree/actions/workflows/ci.yml?query=branch%3Amain+event%3Apush) [![PkgCI](https://github.com/iree-org/iree/actions/workflows/pkgci.yml/badge.svg?query=branch%3Amain+event%3Apush)](https://github.com/iree-org/iree/actions/workflows/pkgci.yml?query=branch%3Amain+event%3Apush) @@ -46,7 +48,7 @@ Windows | [![CI - Windows x64 MSVC](https://github.com/iree-org/iree/actions/wor For the full list of workflows see https://iree.dev/developers/general/github-actions/. -## Communication Channels +## Communication channels * [GitHub issues](https://github.com/iree-org/iree/issues): Feature requests, bugs, and other work tracking @@ -59,14 +61,14 @@ https://iree.dev/developers/general/github-actions/. 
* (Legacy) [iree-discuss email list](https://groups.google.com/forum/#!forum/iree-discuss): Announcements, general and low-priority discussion -#### Related Project Channels +### Related project channels * [MLIR topic within LLVM Discourse](https://llvm.discourse.group/c/llvm-project/mlir/31): IREE is enabled by and heavily relies on [MLIR](https://mlir.llvm.org). IREE sometimes is referred to in certain MLIR discussions. Useful if you are also interested in MLIR evolution. -## Architecture Overview +## Architecture overview ![IREE Architecture](docs/website/docs/assets/images/iree_architecture_dark.svg#gh-dark-mode-only) @@ -74,21 +76,16 @@ https://iree.dev/developers/general/github-actions/. See [our website](https://iree.dev/) for more information. -## Presentations and Talks +## Presentations and talks Community meeting recordings: [IREE YouTube channel](https://www.youtube.com/@iree4356) -* 2021-06-09: IREE Runtime Design Tech Talk ([recording](https://drive.google.com/file/d/1p0DcysaIg8rC7ErKYEgutQkOJGPFCU3s/view) and [slides](https://drive.google.com/file/d/1ikgOdZxnMz1ExqwrAiuTY9exbe3yMWbB/view?usp=sharing)) -* 2020-08-20: IREE CodeGen: MLIR Open Design Meeting Presentation - ([recording](https://drive.google.com/file/d/1325zKXnNIXGw3cdWrDWJ1-bp952wvC6W/view?usp=sharing) - and - [slides](https://docs.google.com/presentation/d/1NetHjKAOYg49KixY5tELqFp6Zr2v8_ujGzWZ_3xvqC8/edit)) -* 2020-03-18: Interactive HAL IR Walkthrough - ([recording](https://drive.google.com/file/d/1_sWDgAPDfrGQZdxAapSA90AD1jVfhp-f/view?usp=sharing)) -* 2020-01-31: End-to-end MLIR Workflow in IREE: MLIR Open Design Meeting Presentation - ([recording](https://drive.google.com/open?id=1os9FaPodPI59uj7JJI3aXnTzkuttuVkR) - and - [slides](https://drive.google.com/open?id=1RCQ4ZPQFK9cVgu3IH1e5xbrBcqy7d_cEZ578j84OvYI)) +Date | Title | Recording | Slides +---- | ----- | --------- | ------ +2021-06-09 | IREE Runtime Design Tech Talk | 
[recording](https://drive.google.com/file/d/1p0DcysaIg8rC7ErKYEgutQkOJGPFCU3s/view) | [slides](https://drive.google.com/file/d/1ikgOdZxnMz1ExqwrAiuTY9exbe3yMWbB/view?usp=sharing) +2020-08-20 | IREE CodeGen (MLIR Open Design Meeting) | [recording](https://drive.google.com/file/d/1325zKXnNIXGw3cdWrDWJ1-bp952wvC6W/view?usp=sharing) | [slides](https://docs.google.com/presentation/d/1NetHjKAOYg49KixY5tELqFp6Zr2v8_ujGzWZ_3xvqC8/edit) +2020-03-18 | Interactive HAL IR Walkthrough | [recording](https://drive.google.com/file/d/1_sWDgAPDfrGQZdxAapSA90AD1jVfhp-f/view?usp=sharing) | +2020-01-31 | End-to-end MLIR Workflow in IREE (MLIR Open Design Meeting) | [recording](https://drive.google.com/open?id=1os9FaPodPI59uj7JJI3aXnTzkuttuVkR) | [slides](https://drive.google.com/open?id=1RCQ4ZPQFK9cVgu3IH1e5xbrBcqy7d_cEZ578j84OvYI) ## License diff --git a/docs/website/docs/developers/general/release-management.md b/docs/website/docs/developers/general/release-management.md index e568ad903042..31aa79ca9455 100644 --- a/docs/website/docs/developers/general/release-management.md +++ b/docs/website/docs/developers/general/release-management.md @@ -15,6 +15,9 @@ We periodically promote one of these candidates to a "stable" release by removing the "pre-release" status. This makes it show up as a "latest" release on GitHub. We also push the Python packages for this release to PyPI. +All stable (non-prerelease) releases can be viewed at +. + ## Release status | Package | Release status | diff --git a/docs/website/docs/index.md b/docs/website/docs/index.md index fe96af71c89a..15af0efa0139 100644 --- a/docs/website/docs/index.md +++ b/docs/website/docs/index.md @@ -11,7 +11,7 @@ lowers Machine Learning (ML) models to a unified IR that scales up to meet the needs of the datacenter and down to satisfy the constraints and special considerations of mobile and edge deployments. -## Key features +## :octicons-sparkles-fill-16: Key features
@@ -61,7 +61,7 @@ considerations of mobile and edge deployments.
-## Support matrix +## :material-table-star: Support matrix IREE supports importing from a variety of ML frameworks: @@ -97,7 +97,7 @@ Support for hardware accelerators and APIs is also included: - [ ] AMD AIE (experimental) - [ ] WebGPU (experimental) -## Project architecture +## :octicons-telescope-fill-24: Project architecture IREE adopts a _holistic_ approach towards ML model compilation: the IR produced contains both the _scheduling_ logic, required to communicate data dependencies @@ -109,7 +109,7 @@ like [SPIR-V](https://www.khronos.org/spir/). ![IREE Architecture](./assets/images/iree_architecture_dark.svg#gh-dark-mode-only) ![IREE Architecture](./assets/images/iree_architecture.svg#gh-light-mode-only) -## Workflow overview +## :octicons-book-16: Workflow overview Using IREE involves the following general steps: @@ -132,7 +132,7 @@ Using IREE involves the following general steps: Use IREE's runtime components to execute your compiled model -### Importing models from ML frameworks +### :octicons-package-dependents-16: Importing models from ML frameworks IREE supports importing models from a growing list of [ML frameworks](./guides/ml-frameworks/index.md) and model formats: @@ -143,7 +143,7 @@ IREE supports importing models from a growing list of * [:simple-tensorflow: TensorFlow](./guides/ml-frameworks/tensorflow.md) and [:simple-tensorflow: TensorFlow Lite](./guides/ml-frameworks/tflite.md) -### Selecting deployment configurations +### :octicons-rocket-24: Selecting deployment configurations IREE provides a flexible set of tools for various [deployment scenarios](./guides/deployment-configurations/index.md). Fully @@ -159,7 +159,7 @@ runtime entirely or interface with custom accelerators. IREE supports the full set of these configurations using the same underlying technology. -### Compiling models +### :octicons-file-code-24: Compiling models Model compilation is performed ahead-of-time on a _host_ machine for any combination of _targets_. 
The compilation process converts from layers and @@ -172,13 +172,27 @@ SPIR-V kernels and Vulkan API calls. For [CPU execution](./guides/deployment-configurations/cpu.md), native code with static or dynamic linkage and the associated function calls are generated. -### Running models +### :octicons-terminal-24: Running models IREE offers a low level C API, as well as several sets of [API bindings](./reference/bindings/index.md) for compiling and running programs using various languages. -## Communication channels +## :octicons-people-24: Community + +IREE is a [sandbox-stage project](https://lfaidata.foundation/projects/iree/) +of [LF AI & Data Foundation](https://lfaidata.foundation/) made possible thanks +to a growing community of developers. + +See how IREE is used: + +[:octicons-arrow-right-24: Community](./community/index.md) + +### :material-newspaper: Project news + +* 2024-05-23: [IREE joins the LF AI & Data Foundation as a sandbox-stage project](https://lfaidata.foundation/blog/2024/05/23/announcing-iree-a-new-initiative-for-machine-learning-deployment/) + +### :octicons-broadcast-24: Communication channels * :fontawesome-brands-github: [GitHub issues](https://github.com/iree-org/iree/issues): Feature requests, @@ -193,10 +207,19 @@ using various languages. * :fontawesome-solid-envelope: (Legacy) [iree-discuss email list](https://groups.google.com/forum/#!forum/iree-discuss): Announcements, general and low-priority discussion -## Roadmap +## :octicons-project-24: Project operations + +### :octicons-book-24: Developer documentation + +Interested in contributing to IREE? Check out our developer documentation: + +[:octicons-arrow-right-24: Developers](./developers/index.md) + +### :octicons-project-roadmap-24: Roadmap -IREE is in the early stages of development and is not yet ready for broad -adoption. We use both +IREE uses +[GitHub Issues](https://github.com/iree-org/iree/issues) for most work +planning. 
Some subprojects use both [GitHub Projects](https://github.com/iree-org/iree/projects) and [GitHub Milestones](https://github.com/iree-org/iree/milestones) to track progress. diff --git a/docs/website/requirements.txt b/docs/website/requirements.txt index 365c7ef9cfa2..9f26226aeae5 100644 --- a/docs/website/requirements.txt +++ b/docs/website/requirements.txt @@ -1,2 +1,2 @@ -mkdocs-material==9.5.19 +mkdocs-material==9.5.48 mkdocs-redirects==1.2.1 From 8e86bcf8bee75e456b985211812ba2bd5a400735 Mon Sep 17 00:00:00 2001 From: Scott Todd Date: Mon, 16 Dec 2024 13:36:23 -0800 Subject: [PATCH 25/64] Refresh architecture diagram with ONNX and LiteRT inputs. (#19494) Progress on https://github.com/iree-org/iree/issues/18174. I suggest the rich diff "swipe" view to see what changed. Updates in text form: * Added ONNX to input frameworks in the top left and swapped "TOSA" for "LiteRT". * Reordered the input frameworks (roughly ordered by momentum and grouped TF with LiteRT, but could also sort alphabetically). * Exporting somehow changed the font to from [Virgil, Cascadia, Assistant] sourced from e.g. `https://excalidraw.com/Cascadia.woff2` to just Cascadia, sourced from an inlined base64-encoded blob (so +1.87KB file size). I don't mind the changes. * Removed _extremely_ small "Stacked Column" text from the "Profiling Feedback" chart on the right that looked like an artifact of the diagram generation. 
--- docs/website/docs/assets/images/.gitignore | 2 ++ .../images/iree_architecture.excalidraw.gz | Bin 32752 -> 34579 bytes .../docs/assets/images/iree_architecture.svg | 15 ++------------- .../assets/images/iree_architecture_dark.svg | 15 ++------------- 4 files changed, 6 insertions(+), 26 deletions(-) create mode 100644 docs/website/docs/assets/images/.gitignore diff --git a/docs/website/docs/assets/images/.gitignore b/docs/website/docs/assets/images/.gitignore new file mode 100644 index 000000000000..cae9608b1475 --- /dev/null +++ b/docs/website/docs/assets/images/.gitignore @@ -0,0 +1,2 @@ +# Commit as iree_architecture.excalidraw.gz instead (~10x compression ratio). +iree_architecture.excalidraw diff --git a/docs/website/docs/assets/images/iree_architecture.excalidraw.gz b/docs/website/docs/assets/images/iree_architecture.excalidraw.gz index 7f82aa23bb8b119e4e24cf372bb1ed227298fdbb..62270942ccd4750839935357c95f1b093334d75e 100644 GIT binary patch literal 34579 zcmV(?K-a$?iwFp~psr;E0BLe%WnW=(V`yn~Wn*-8a%C=Mcw=E~X=HL?cL3~NSy$>- z6n^io=xSd}dgh^Tf}rArqN284t^h$$CK;T%`rrE2?!Jym! 
z?blz&pYST3)|XEO&7c*u2YsCP`wu_j_1EanZ?4k|Dg(cL-3;6Z@S407S7iF_gHC&N zU23Q{Mhm51&S>;Ealt`Q#o?OY><8l`?yaAWcY6KdRr#WIc97l;@I}}AgTwK4YmH{J zICyG)`Bi?UKJ0z7s{WwYxeZE<>Y(nfKK1^~=lUJogYl<&o#A!89rXJXpXqe{N@MWk z-Xr55IJymZZ~V@qJ40#eS4NId==KZl;*3U=Hl+QVk^}I$3>y7qKEnDubyX-Q#F0yOzIl`o}{}yi#jst>ym{*%d5Wb_v1VlvKnAUcX;yHLi+*&eNb%$XsnU zZ_0MfWmieMF0xCR(MB4p|1-in2p$Jva!Wa}(ax?VN=ZrMV{UDukt-E0ibK|T&_{da z^c6|nBwlWbR333?Wr$GUnwZh&8MU5!7+P4Sw45Wi941DPxpF(``R#t!??C`&WtVZ{ z5vwD_RA1Jw82i$mp`@jQR8%f`=r!68xpa(dA#5M=6k>XZJBVkxGqzx zX_PK2t~le8$Jgaf;bJK7h<|szb#cVbQrYL=h9q8>xr8oT9pN#u*l8x5RIBLu9wTF> z9HNcY~bS zf!#Y-`sDcR_PmySu~ov;!Wb!G0Sq4el-ye7agLDB0Q(siVkL5Gr04uUu0wo9H6x%@ z8BTG!T`LIns)rK77AF+Ix+1&+k1&_=^=RJm{^!W1PBtw9_bn9(w#|nA0fr z;)aG179Qh`j%lU2Ff6{x7B|ZBIDeeFdcH2F`$6jdac7YH7*7Dg$s9{#*T&LF?&3r8 z-zjKBS{fyFWNt!oGBP&t<(9XPNwd3kc-}hPK7YL2+bP@~t%2NH;u(wN1SjF9Mt^{* z!5WxL2C;aih7tI$IDf5jzwb0Pbp(pl?`LX8mrsJJ5z#73tksZW89$wvKYDn`21Sa* zT)JKB?X_;7o>byh7Siu=xQ7bqLSm4Pn6@=KrmZ1mjR!SfWvTT2axKKx2t8?Q7BDqN zTPU)WYo*5vni?vjOOHw@hp{}i+@72j55P%z%ATFxlcUSZW7n^>lP|a0cvfh^xzMQP zSUqiOWK2`@F1e)&p_QN_uC+0EFc1;4R<)Y<8Jn~HKicW1K~U*bgO9c}-(LGWni->_ zBN`{aoZ9$iMg&)bVd^4#Z0qIPeWl%_y3|d)x-t-1$Oq-lL`4NPxuS@MqiTXD%Ktkw zmq;_CnTSmJ5=KyseM9+%6bfhOo$g>rQBt>lX>R~~Xv!WB7k_QIpCx+Ajv_0WRZY#@(xf3=T9kj;Y)zF) zg1E&lgYf&W(2T`~eSPQe4_4s*!0!#xjdr!szW!!mW3v0`hx}yxqr&+y-#DqY=`g46 z)46=_3t8bsrI}# zn;Q-Q>VZF&fbp*yADcyrF1~T9s6PJNk0Y5K|If?c|NQ0kvu9*`{OKRQ{`Va&Gh{D1 z!i6Q&Z}L5I7wG{p+564n?cMO6Xur2teMmgA393B_d;>KHqC<_DKsKwbrx|X80npE7 zkAP(({PJSCRLa>B2#WAs0KaVY)UO4Z7`e=5wsGFrzt~Be<1N4Qd~=cOEP>%su4jwI z0^iDCpXTXJx>wxJ^^OzE@Bpqx`Vy2w=M3a9w$2?7pXk}(v^D7N(i^g!-+A6DoveY* z84npWRGt-iDOQi23Bo)eods)4L!_-5yFo!ERW%qoks{3zst7J`q2j9sLrHpgCP;(( z*JfX<4l*9eXEG{^x|(uDJWmY$w z(E}C{czJNEpdSm_BgE?XM~Ri^a28aKOzA0OuAi?Cio?6!-Su_uep8pfqjHQFI?D>= zk&2-x^j?NH0}~luR>Md$s2d&Gl(3WtV6(Q@~EZok`*y< zf#~W;zG_rLLum-tL|mz92+7xa!ZH7h24EScCKThV+f6{)R`1t~wTnhC&vPd?C+&2- zUAs9vEE*>5|GsAA{EeY$INW_n ztj^)4F^1BB;PE*=q<>gAzdOH7bwo;w$KqyX_imVYyyMz~j0wjUaOv^Ms8kpy`BdR< 
zK-RcgA>L{3xwZ*Q9kNmz<|R|t!{^fWEus@=DE$ z`m!L85i+O#U=##`()f~J8W zwZuIg03Mtuu{zL0o2;Yz5l*3FaOgwQp&l3LPj%%eE!jAL9b+E)&YidVf5g1oyMt=G z_H@3NdCrnSyJBp?4kGXc_sB40G@)^51BSyCQqy!R76$l3;pu@KvtrZ(=png?h-2h> zVo@v%_28orhNq8J35$IAdc$6b#Q*!6k@e+8x8UDJcd!Wf&s@h(a))+fCmr)3)t;fo zO%cHl)@BOx421)ZlD{js!0A(JWtbSrmEjA2A!<3Z(0? zPWFFH=NXaa`(;&q-m5$ww*q>l4qNT0;8vj`Kp1W~YIeki^lt6c2{yPfCYppm-S?_6HcVfrceZ=vf= z0JV}jT&NBEj&->C`UCHFPRv3OrrI}*=m-VSgewqI(@4J{ABAGHi^wMxNW)9S5eyap zl+gE$+gY&!A<7q!ApV;E91?OA9HyxpX41ooRMp)+ul)4Or-H3S!EnV$1^a@@TEusd zLj$0MD;OAw1c_&O#`vvfy4EeF6McaeM9N&PJJBWzq9d7`NVnc5DwH#^1nOdF%4!MU z8=C%K!uN*OQeXexFCQ)uT^a6?%>G?l+~>A;4*1#j^~+Ue7w{bs3Lg&zC)B&=G#N@L zBvO{~!Rah#A8sEXuJYR%bz?5Bx6-GBZo76AQMU5Kuqm^9-!;Orr z2|Mj_v%hKlXuWiQd0B3swCm4jnXBikQqc47cash`D34$urgYg5fi|v5UkGhy@J{e7 z8J;6Drz=4lYZ1|aisNWUM6xhX2#XXNfuG~*aD#_3OM!8{*~mJmPq6u~fem+c1lSpe z4F)x8J%J6=qCw;<-xn%x|Gr!zRl6`q;0z|*wli#LV54<3*hrU>)h2#5_AtNa-(GGu z5FZS#%&>b~V#UkkgAEb_sK>d{-uStb{1bt~9YP^jsYTL2nN|5VF1|Am7D-W*5@Az% zkfSu((NJM=aPEw2BQ3ldvHgiJFp89s^qee3&IlJ zAAzM4uz2Eg=W0yeh|Pzl<)vn`xqG=?mQ9{;D21+$1a#Ps5eBv<1C#wx7=BlhGZq<{_E*M6Q{$83WZ15CfP2 zuLZohDUz9#1}+rDRJ-~2R=oCo$UTBlPF)Y2t144DRYWzy&8bkJof{6>jDcM!ytcdM*eU5*0A~S~cCUfC&S{wMQElP8ynAtP#f1ySe%FyZ&=ngJQCF zTwRUy6hOs@325M+^G~Rw{R@V}_J{u#bVW^FR3dWJ1YA_ba7W@hNjqmx0jq9s|MYVE z@Tpm-J|Q`ubP_HS(qQUfFIl(A9s1^{rm!$KghjZ*zcen9K;M;HpUc99w79XPNNeg? 
zTil@t|I~%=I zr3*D(8-Dz}Ph~Srk4|!=_e^=dm3|)hsp1+;mPcD8V$Jbz?6C^FnHk_)BM}Uwmg;@Q z__->KM~fa{?(7IsOrb-%C3-u#Yi@QsRH9TGGA;%aJOoMX__Fg>b^r1CV0YN>?ls!2 z+isy19K6nbdG&iwdTJ}Eowj;6rF5=*U3@%JkHsZ?Z&74|yG*w#PWK!6*2YOc$G7d< zIW=#=xP?ryk#qW<9nVC*p&#aW5&Qb+!!?Dva<%< zz9~x#HFk4OpjiaokYbT)Y(}re)fg~>q*~OjE_}ukk4j;15mv6U%M{Z0O+X%*z@1jN z)5DUE1q_&Ul?L#5lL`%?Z62$IhQ-XIp2MS|%%zZ7e7bS^^qAb|&H01L zJyae#7uPkvnoqo9a4}&XiI`Fh?SX2HXpMk>M^~vCN_is-9M3vZ$AAW|+5b^eG{15F z9gG*Q`uz-7G||n`+}w`yCVs**Y)n_;H*A=;TzXPJE#e{6oxsCJXeQs+e*6K33nR7A zM7@mJAuBOlp?!pjG+btmUT{<^`5U&FEtO2Vmp@IN=T5h8204>WRc`Bv*9##h<2pH| z7;cAIJQ8fUi*ux1-6S!jBoIoAxN#kI?i&(`WbLzv@h#U$?mHN+liYW+Tb9om^e3U4 zsMN(z@`sMpO}u{2wOc~tA%M59M_T}`v{tta!v)(4Qazz7cQzNaTcLgG3A9^)<7m4@ zM>~I+_yJd|xHTwej`mud?d5aXhoicwKP6r-81-Bu3xnr6E>e&60*MH)qpfr1R+@-_ zeZ;=zZwR=I>niyO9ayH6zSC?S7B{|!;lgECX}W@Dv(fE`8ZH%WxVVNy!mWz0pS{9P zde=14-ROK%+5>05Ov3wSu2fQSaPEoj*cq?KxH(3lH0ZlytuKM+VXRCdh6#JPbVtz`^>?r zc#fXndY4E0w@r$!>jWVPg*a#?;Ki;kAXAn zl+P?f?bE5@*@y5_UB`lSIVf~2@R3&O~&NDRKKKqZK{AZ*f-Qt;+{Amc z!p&Tpt7|QDpxYIiqwjZi#|E@^7-XaVZ}J`5()guX5Iud=8Z3jp zWe85MZHm{Kr{-?1oW0pBHvr;|g?bu4RJ4(hQTU;*{}`MKF(uj(245rc1xH5WU$p|! 
zCa@0_laEHD_1!)gDr7J}1}nk1eC;MMIDN)xM)L>e#lpq^FzNp6YW$$C##E??{QH^} zj%5wW!X_z^6?oV@3iaUe6GD62``yFG`cdX|zq5-jJpbvkYZ49*%-vgt4oOsOUf8l@ zbPyO`hIUg|$8e)ilM4`G|zVmy9@?p={y7_!{ zysdK&yP1Xd*7!(pAjI8U*P!dBlM_AbJE&0_St+2$y8^$}qJxF!0Cdow=XXyxV6jJ~;|VJ5PFO`wZ8_*5+-zRt{XQv`u}D0)%90YZ&I8BxD>#4`9ztT_ zff3iJLq0O={)rF&b$kFzATssf=B8m%LHw%oM&qzwQvB}8&u`pT0-1e0+e#%39;hdk z%imeINvflds!6fFOBLw;Rd#D2yVdo>3+1-{#Q<%$J=AnT-;Or_uoi@G z2ChhQy$fiZ+kWli7%QJ(@!toFIb)u2AvD9sr$p>#Pp*zmOV@>ioBe|ge|Xf*)X&!s zE6`2pIz>g%6KeacDxB4n^GdCpf4nQiqrZnOM9) zlqoRh{sZ~2pK$S?hYKBD_kg3##hdCL7rR@9jpB3NJZ+W_%4yNktz6<*6R+ci^`v%p z8TXu_$77mZtUNdIk+`a{WeIjKK$&GRwy^nxi%+i!RqYY#Za#kN z6z57&O5ZQn zW)&9QQlDy@KoA8xZ{~d=j{B%9tMdQTaLX-7(@YGr6I#&VA(wbfdTg)XRc0er5C<`iI`1i zQ4amWAL}HYZ`r_v&zP86v-9w=+m^U1Mw$;cwOdK(W+i%YV}kn|ZA{Slk92f=|F3;o zdnk8Ko;Iqcp4qzSoeu5uF!4$`-C~t^%G~U!+Fpm9^@@k}ekt*C`wx3()|AQ>1mO4liu{2) z)%)r5=)PTd6!pad1vga0-Cv(G<{=Rrg#^raGiu$sRlqP%-QVfGlXhl`f(C z0$GT;S7O5-CDZrK`}p(jG@jj^{=loH+p~^;1HXjQaiM|=90&HKTI4LIM~(e;3GrkO z6Nw2=KHO;{NNz^oYh>ELJ55d-ov%Gg(i{F~=Lmtx2m1=55RWicwl%%fYTlpjY~W1y z@w9P$e$w=d(yJtbWBaMpt>ZQ7?3Qp6IvN+-R&4G9G(rbU6V{(~_o}{e zH?w~CBsTllIMVX#gklDHniu)CrSI)I^{O&432mOY!k}ss(F&nqR{v7&n0dOFZ{>@R zaz+hiXLeV=_K4bD{m*v%VwCVfG#d~IM(8)KVP|)H^W?TZ-Fs->9Qbu>WAU)`>c~kQ zqnS9x5F^6dGOTEnNmL3hQ7pPmEOV-PM4Kf^J3)kCKB%AskWN!l^rVy_sYZBb8Xk-I zY7ZLob2`^7 zp*l+zkk~ysuqPFLqus8rlYa>{DvQJ0nrsHn7~Bj4LA0ooaSD+V$}K5xlo-il2%p=a zgLM*MYLO-(i9Rp`93_$SdsXPXr;#}H9;?Mh@Axqp+rLjW)A!yu$_w0n z|JitM^?q+(_kN$gZ@1Hb8&}7_&pow{?q45XZCo`@UO&uN*W-=rZu?zGdbe^9<)53G zgcOUt#(;zhGJFu{_WJJXa&tenYR8Li$8)fJFaI#GImz<;eyD zkc!b&6xsgOAc2qTZ7oBU+`eU^h#cN6$q(k{!N?gHeXv4G-I1exR+2yeaxV8G zfRppn^~cuc)8aoW$^SZr{0lh;D=twLNq-Q+@Tc73?#cD+^FkM17Hdy?)6E~(yC-Ge z62KTTHhFc8Vs3Z$NqHWzuz=tpaMVe3w0ee;-Z%g=yV^TZOV0vEPq#oUVEj&h8b#v} z$?g9p-bL+I=vwSyF$R+`%BEgwwqJxF7JnSyF1C9M?cUSX&GR>qY$uc`NpccA-wL*e zOhFS9A@|lNA=i)*B%QKE5H=!Xe}~+X2?L_C91G|nP^wdRA1LjRVX6q zXFxbcZKf)OLnQbh{6g+wc;l|Su~4t=Jznp$)*rk3%X7QkP`@fW8YpBy(E>OHa?}Af 
zfGI}9NNPTrM$56lA*q%|{B-w4#fFLv!?M9+VTOT`z>?k!U(&L;v^E#c>I;pH`^DPT z?tz~E*xW2V94PN@G0C%Z7=3^Z2^qvr2+>PyI0V7OFZi?xyJADdhGE$dt+)vXFi)^C z4_{1h$IrbxF_~ZQ>|E{{)@*a_rR;D(&nO0$Jm{m)w^Yx;GXv$VLm0wZOVO(@@FDyjNgTE0i4sSXt z=2Ge5U|DWKDUbvQ^Z`1gr$pvp?5riG13D8&MX2ad(P2DvASjMV=GL^T>F`m9`ohEI z!qz`LlMtCimwyHGQ-G_#I)P0X0u`CWVXrwJM%b z$US~`vPrG_9}6v9Tvf2jc2=|)BQ3l*+X+c!(v1vXY~F!pZPg(>Og)@FTK#c<=3;*S z`|*M^bR!2uh$j8M2xjTU>;T9~7p1(s5MvdgPj_lnyr_6FG%wVOdBN#VNQVzqQ1_vI zxwt;Nboz2rU$c$Q%f@EuN~b>M1!9H~0tc`dg2q1kf-(b7%9y83%WPl7pq7)Z3SCsV z7#l7^aoU3*-A}X;I;`7MOLwuR2hG);)8*w*>+H9B-_N|j^b8f{*I%Um;6=y~14WEc z0>pn+IzLfdB-BSGwNE$vR=lWqF*GmCiWgr5L{b_))H!vse73w&KfhSEr;UTfyWQ5o zzI;nu#Nf&d!J-W~BZ(!J9ySh+C8kB57;#grtc!{kBhuoNHYF--Q_>`KQDuah%$fPs zgU7fLt#-H;_L}n-+q2dz8#4NHr$QOW=nyD~1n4i=_ACH$Ub2bGGKr#0)GWl9r(y(y z064$blAE~(vLq*}76Zy)d=p44;-`B$$A^fLG%B?^owM%k`{JFrUnJhqBL~9~_R=17 zPJOB7^^3#n#rxWQZEf>@rgS^+-w|(RLQB6WaiBCnvd96ID2Y{=NIya@-XY}93YxG- zd`G-B4+?soa1?^zz=OHI2x8)&u9f^>Nw*aj&5|o&x|QL(Z7zSb*P1`)pH5Du9yiWz zuAAHYm*?L=y2X=tp%FM*AGk+VHBNh*LZ=<~zk^sO8cQDGP^bRRX^(k>X?ifI5?En{#^dSPUF#o8HIlp* znnwz3^3F85EI%?NJsc=^EHMlkBH;V)_&&|Sg;{l_Y zetL87Yr|jO8=!^!2}#!Tcmv9a4q9>5yA~R+Zo;{Kjc~j_*ae2p<8%=FYKY_Kpk>I8=@!5ws9<;xgsKW2~ff1 zk>HoO1rh<&{mte2rS{Xy6{x~rpEX3Vbmzg--SHS1SUhAnFFA`IM~G=pchh#GSzO?$ z=F#QV(q?bHKD+o0An19J+L-erPzGYyAzFISIk*dtAA*`WS`al)m^JeF&J@QC(UByV z-fprhQ0N(hwco+=~MUMy4}8+ zSzhO@(vRY!fSD$$CXo{HNPTow$$&7AIk-prcU{+$AH|0n#7+GRUG9=Zl0KGmEZ$VW zWa)mw2m@yyX0eUOC->bW;q6vy_FzkiWYh1`c#NsLQS(nx($?wKGb9K3QSOd?P^ zj!}G?kQ*q(uQMV_BHhK-vA_pICV>H9ls)L{bG?qgT&(P${`A?4v&+Tv=iP50*^(p! 
zA%X*FKffWPkQa_RBC(dYnJq%ni}d=c+WeO@kSAT>QJh5qF5c+j8VO;94Z(}J);^h% zUh8^w^KgBt`&8R%mwl!Vxn!@&!Qdno-m*brrXEPqotWTX*>)3UgODe%aW#vtV#B!D zfH7F1>qi!oK8&#Rehr&D-reEZ_4}|k*TC!brQ_|=qXDu&gfXE|f`Shi4!~e+Q36t# z3&|~D;9%9vjEWB9qCHM=)chh@h@umk`m z0(CRDK94zICV7A?XC+LgEOb10$n#T-f?roRSJB~@r$ez{fe?{B!U%EWspYu1Uwgcs z-fS-|&L7@Mt&7)Xr-SF_fn#1ZF6c4{2PbxL2MscbqT2DmA-XC9S8%A{FfKS?48=pf zgQQy$8{V;)I_lcYcCE2AJ9kw-*=g-JYkORJHgFaNfkPfsSH%H1kdVGsvvK6|MkFz> z$V1a=RA)tpanXSgw75f+P0)RWj&@5)zvDe$fTN?w+I+9x(VL~qvf<&g7FCLnA!B3( z;ecj+fe6bKATRt9bty3}LY_b*uI7GMxELQUiYuTk>5K{S@afH`>lIwMY^=0v7gMd( z^_Bf=JpTC|X<=r_Zj_MFAGS2iKrja8=on}cnfx)~0%J4?8YyJ@mdKFC1OP6BZbV`*`zvxGGpv>MKqJZ z%qrj>xV%Y^2xbBjs+4F2i?P8X6qi~O60pz+d+S%O8>c^c_VRLxYgY$Xm*+Fxi`(x< zi5mRDGpckPG=a+46MKxc)EvEib)f+A@{f`t#CA-!8T=Xq;B zEU)hAQ76%BGZyJq()7pvzZ33a zM+UXH6kG)oxkt#s&e&mlW8wMkad&sW-gs%vEgf}BSF!PqXe+TwFhaxr-f7AN%#z%A zVxmuC{K^<+@u|kV6!i;7OjcgY7kMq!EWR(8#Ye!!2fReJ^lrl+@LP9JHwPE{I)9+c zCr5XkXWCfZ`vw%G{CidBoHI5sLV}`5WJL!<9X|)hF^kVdC_xab;LR_G!D!g(q1k-0x=yh%E8(5Z+|d^okYbGUy_QN z$oLUBj!Aq8&k9(Xg1;gIt5JO4cN8Cd;AXQZs*eCy>UFzKe>#7BIey;0T!-z&O(^{+ zKGu){3lKCy@jn#~i*W5lG_~ zuC6SJY=1C^j|xrW2$XvZ!;7`Fj;}ZCqQ}>EwY1Y)d^*|PEZyS80kOs;SZ2A4#T1bH z*ew|bw#0BxPx11Wf?*(|AYV8@wK9&s$T*&Kbw@EAq7{#@Pn-`)d4}(Ked;_l50=i> z7SDd%#*NGN>60uym7NYD|L&s&Cl{A~IPhx(8zjeMiCfmpAXupO8C7f;BO9EGtIA)i zoOFa0e$Ba~v%Qo3-Pz6SK=p9A+2zZrva`W7i{!6WX$by=CP^Z4K zqQnjc%H!D>q0Aj=dBL2s@E68|Dmwh~bU-O~ETVB9Z5!UP*gQErZmb?&EuPjlR(4=+ zuYPn}D?J`GqZpN(^6Dp(0X!%sNODO3r6{GvL!Pi!qiWPvMTc?G0izX$+9L!5!x0we zoo*h}(fa<}@$+og7k(UVJ#9^wz9czsPfgEbWOOqfOxGtWpaFYQKl)esq=~`-DcvTx zT4g&_bQlvI2x9SW5J55!A0h5AyMOw;u{YK2)h>@08m+7PY&$Z@ z1>)d70AW=Cfh5T#riC#?q-q;L#fq`90$B^Qo0vesJiM2&viow-K7LucyuI=B_4e(} z!9x$fBPxO*y5~htk#JyXq_B_)5CLF$S;4|Fk*x|@RngS(1Exb2mS|p=hf558HM4$5+&DsI=x)dYbZ-wxQR^fuypwK zfh0yKxX z)0?oq_vwc4{b5`vCRzx3rTxH5vUsK|M`RyYX|sSDW90;`RO{$f%881g0=@ zzTLt&_;NmZF*>>arO`NFmxB zK1@I}`@m}LK50^VbqF71eZ`OvQYWGO<(vJ*-Qxp0-o4U=^KE*Z+r63o1at=iixIeo 
zXk~p4yIHabqW5AQMpvekA%-Qh=Ep_iJ^|d$ULSc@a&hLjN1mq-?>BzCUc6h{AI1BN zr>SKzcYpK==yuaoBXpO1f8I0lkkTHDpO1~TtIz|UW(rILG{@C}=^7d*X)`n)TzHLC?)vdJ-ii$)>;irCQ3wnd+qM*_BX!yc6Bv5 zxqKp%ztY#u-DT)|4Ya}P<>gs3Q+in^M&y=f0&aSF#I1vdxF zrjQUbQF0)e1Oai|kk!+3B-BF)00V@Qo5M^A%xUS)ktEpmO}_ltJG2|>6g28{s)ia$ z`gZ+Xz1mz{+PmA?-k;n(A70)!xV-tb-}N)OXqqXra|Si(r(=mfBXg87j zqp7`-5tsV8l+F>>LD1-2H|E9`MwgF=uKBhx1LwH%WvSm)#Gs@FJDGqIVUvnr0dOV* zPlO(rH&B?>Oe`&uFv(li*`*|>8D=_b#k?T^`_MHG0}=Tk5vj-H>1YC@eIy2&DwzNx zLemnEz}s3zy-^^mZ`80P$B?(@z;9>gPuG4upNIRgU&rT1cMq?Zzg_gZ0)-YNBN9^* zZi>2a6#cXQ0zs6!u0K{((wW>bF(9cbKTt9uGa#8$l(QPMrT`KiY}@!BD$r0fA=q3Z zC}tL^fT2R<-d2X5d^ThiVmAt8)52<{BfF<5n~1$sI zEcCm=TNVZc;S7XiZ3>}jF~JPgG_vfz##3<@RrCaw=bzFlhQ^i>q?Cbx)fk#2F{b=T zdH5ej;;~Q?2|)-Xi|EUoUo#tRr4EY(F{Fc$1WR)O4T(a?+=Ky$+n(>cbm@Bc{<@Ve zA)3?*695QBcF<;aeDBN4G=IH1yt?)6WN54L$iu(AZ&l}e%7orofQd*Y9g{dHkhGHW zgP;g?Aq=nI*@9ux*m_l)enD+Cw)U#(gcQXjgk4*;2UjzE&s?e22vJ80^Ks2$|Gca7 zv+L!<$3NYh7r&iooF1M$ytvCeqx<~XJ*!CLZ0Dzuo4vg!P=9{$fZ>zF{j)#H)Slk> z?&S;AuMf8yCx1WS`utzM(Du_yHlA+Tm(`Ex{PJ-B@T_sN((dJ-Z~0UIl(r!0zX-V6 zeg5W<1^Opa03)Y1B_I`Y(_haM{3aXKkHYkguoCH!r@0_`}N^)La!8c zB7ji`IY&Qs|9ZT=G&ho~J733cr@s6){jM$l?|~cM3wIV!BuPs$sL(AX!cipL!L$3d zQ&t?Gb(hH@!82O5-DV;v2or-^i?Rnqq;OaN;XKOM=~mEs!vq9DI@nU5_AtIUzp*OQ zix&r1_eWPQd+%EF^1*bQfKVoiqi_@65hYE;FcpKW`w5q>;4KJBNiMgqzf2#W%xxY``qzi8#>K(6>+CnYwPdD4I$4@~ShE?PC0JBIRS_#zIF}$L z2LKH1e%nr0U`J5n^dHa)f1UQwdfF3$-Qn%=?~(K0j*jr^*Xy&1yW5M|#}SOlo6iP2 zLkUa(Q%4hNGF=>{7fxB=Q9P?1;d83rLI_~U|aPVYl85cO0=??wMfcX&56jbNhz@BikV zUm5nS6x307)z$I+!SZkGQ?Y(|G(9yww|kJAednsS#a+2jl7<)~0W~|ToCOKdLX0>W z`;>PgRuFI7w)tc6POSAQ(U1TN>0pM++TrN%Zwtp8w0bx?^V{k9$>M{py$AUBlzP7v z_)5ax3t1#7shFxY@$FI4&DC67DDnMyqqBDl5H%M3?LSv02WoYO@l>^KF(MfbTWKUCj%5^0Znl5U>6o{ImrS=JkS&+ zg|fx6zeM1W54<_|3Ej1AYAm&~t&wEl{&t^g{Bd(>Yx?nFd-1RVn^SlBI1bwPU1}my zL<7JpJ!*&ug%UJ^JreAdMfzJ#q3>%_^DL~WM%qHMTdcJuw*Z&V-l9hF))uw3U)R3Y z^rfBb?p^nwEA9Ce>s!%;Lj9kgt@_eP-+Q|;txZ#UXs9~9{gT`|xP-^CmElL489wE; z{a|J;V(#A5!T|4!v)h>K*UsmAgdQ= 
z^)b~VSqS!2ZdW9`t+e|%O%G=iGhgn$oX?LoMy`)O^DXB!!UO6zu7ikK2W@GOiyOFi zJG10^xI|OK7vocldfxYhGiwPBN+1hzrzXO=0GPT2Du;d74$dAEE@2YRZL^U-65uXg z3%IFjbW$XHvUzlI8Y6Ld4ck1rJ~F?*cOm@`cw`B3K_ItC`Lqfysm4ZagVjG2-}RML z1g;jU@^-*~TYOl!-WaBDld$@5^?0~&dVf9A_ka^~35^hBV??70;SOf3r0GG?$An`f zCQ$q4`Trf_b@>>xsO8%UL&D!)Id#j8yQA^Zq3eaE&2t*RTBNP@%l^k(ybz_3QWyqk z6W&=k9o&N$je9~jd+{U=0We^oXf-{MNK(bj97sGlT5MA`LutD}uX}J05;7l@kf{aa z`KY4NX}Y6c9cryRW<;S*+?VZ382U1t-xiOSFL?NN_-f*G!TMeuSfYSPdLXKkTa!9? zNfH(_0buQo`yyD-%s~K5Ei7BeL-khYVW`=5wfaB}KAjp=c=(>VZncZ>0O~zE({$(n zXF!<0U}V(d(%s%ZeLLoh>j#*pJGvUy?E7n>m^Wi};G&}J zU+GU$gUAkUpXUqb_nRjpcV~O^b6-a`<$n8gceL;EW|6|qE+$TtfK9;xDXwWwn8ebf zoEUHcN@8b8cg7ZQM-ed!GD|}MsnpnUCb8NeO!)u1d%m>R!|h+su9ER4W4#%-fBsmW ze@OuVgvw^p;mx!@GElEIz^H)HK@MWmul=BLbM1|zuW>ku+j~weESx=AtVd#_1wzx*89~{&rP?ThstTi$wwk1g%;p?zyDkQD^BLuaN!3X`MZ;g1 z>Mzk~Yp7RgYu@pd6c`lhqL5(SiVRrRMg|ctK0W?lb2tHa%6AgVRU;M_RseF zze#`+N+24fJ2<&l3>zbo)I>GX-UJ2^w2+5~0GekQwGf*>M;wTmrK7hTFQAn8hGBRE zvH6T*^Y2LHzn3LV&cx)(?$zqM)TU3u1h4U}^?~*jH0J1_ir{MU z3hyp-eQIMB*J5gCcgAKW`k!!&;vB@v9+FMn(i~Y399&T}@jD~!y(Ju7*a!)1fcQU% zhu0(SsUp--?&>FJ=uSlZp_b~pML3k#3iMeBXgj-v%Rz#g`|XJCI*6l ze_u4L$B2_*c-sMNKN8@|SPQrUnq~*XH81aH@6T7x$H%6|cBYrvhZ;j8{hupUhe8nW zEaA!~+$7;%gcF?Df;hh~m(gRsIfx_LkJS932(O=dk)Xn49qci=IG)0}k<+{5xvNop zIDxV4(WCy))~RSQ6E#gHL5?N`$Kp!l#Ac}YeO+w5C0x=)(p(0B{{wivjiO2bWcl_L zYm-y6n~S&yv%0c!aDKKwF}wZv*#Ceh7bf%wRCKlq-@q;?z!Zl1eFb*C1svH!McD>c z_`l+|QtxO+k{MYCR`~L_iXvuBSK+>5v{N+7#pn77P=BBYCmFbjEB0)>KX)HW;8J+=o~ zi4V$3)Z_7VRKe&lkMIs{kPK3Rc)R=JTPKHib6+3MR*uMHk03^C!) 
zc$>b|hKvHrN|xlts69#}cwtmg7cf?-Dg$xJaIw%oE~Hii&r(T}+73h?Xv1gJhE{Yu z9aS#6B@9#_npR~sh7J}doZgTAdOkFDG{3xemfLGEYC~uJ-@~a+g)>~(1A;x8l);OK zfpI!;WRF7A&BEV88l;E`2skGxsc^+=NG42cyMuTj4WCXLhH3$MI;v=NtJXN8g; zn#lZid6*lzp_N}JE>Cdw^z?l3+xc(fcm3}(t7u6S&qP!tk&08nfr(1$MBJmeXB4t9 zz))}jF>cx9&CD)=Vu^^>EDww(;nH>!;y@lgr99LC@^mzz(Y>B*5vr93#?(xFyIV0k zerJd4cO&<^cSkxmyS<%>oBgj2m?a=F!6Z}NDsF{(aH2X8^(tuW#hB5F2vJhCg*bRY z6GJmoK*4HHHX=I*=fHUQAF4wgAWugXjjo9U*2WwFR4qH%fpL2(xBJ6!a`||hi$m*| z4L#_8&qC4)MhJ##P&N(z6xPBajI3^o+$(jEVki+$-=Im1E!2UEqZuTrhdV;Od0 znS+qz|4|<52zffHY;-O3ptbsdlBk#@Z}%;x4~I?~i^GlW<EI zUcfxTIXsM;1cJPXAUYv)Wb{ZN*a~~F7^OfkX{*tas1fRac$jXXSuOmu0ExCET?YE_ z>GYwNk3Yntf=0J;7V7mCDG{0m@OI~7f2MJH@t~($%d<0MIJtf@r^EdpdXSKU=)vMF z3QgI1<6>?Q>3+CRaZ*r4F_KyQY zL{i%U?tLydI*^;@N7n=Lhdxx;=vtVwULTMU0u4LJg4`S*d^|I|RsMsr_M!){XTO)7`L@}6n9x^f_fg@+mx%%1&Swa&tOx3Lg;E9De z*vx>De2@XIC=3Z69{}%Wz)#T9NqabG^plS)E#3{!-p%iS!t1TI>DyHt`~g1*EuO@H z!U_=Dl8m;JffP*UM)0y7V*Ld;GLSNVe>miKLb#z;*ISGe!o2%}?D@#f#oDKnslDU7 zt6I}8zcz$_8T>VQXjqscGfc_Yv~AIbIS2+eB;%vo3m69MWTBlgKM6enMF`M8ZyD9F1;Q8UEe2egE zHzk-^J$rD&mtcBxd+YdSYs05;aq(<@X5#dC@BtoDM939|62PMk;b`HYk^n9*JP4)F z2v=~1_-?BDcS5*V@5N_LPqH2cQm*Q^vx~!Ue0Z=ks(bbGyTzG}^MOY=m=-V;NVhbQ z)+y~QMkWCg9%_T^q+mo6h>QJh2(PR)GE@q<5_39s@1bnYoSZfEa%pU1@-Qbi9*%r= z>}cTOjiM#OlqJ|wgIWeEs4xN`YlZ6>PDQzY1$T6C^CULim>%hAYMMGjo^$Srtl#UzoJ0= zgh_2Bt8QV;%n-2d1!B!khZkTRzOlm`_1%J2?zoa_+j7e>!%CaeeMgU+Bt0O5sFC90=@NjB<@?!631{S`o55D&R z;e{bNFj&=5TSm3LD5`-1vy=5H>+mX*+N_-59v@lBUr-)?%A~fkRyz*@&WJsnRZT~e z&C#KuTsXYu#mSjmSU>Q0&^aVgFd`hx6j@u(FUg`#VyuaX*dKj(ok4A+@R08gYJbSX zyOW2XG^lNcmGWSeNrBxTQ8z#H<#uBWM|X#JE=M+Sa&5?l245Y(wGfAA0HS+~^I%#6 zoLxx*0{RqMyuzfmMo=<;xDLObHvAfs+6L8mPEBTa)b8sb^Kw5sTR)rq{CQ+}W3sls zaq@ZieP|5=kp-b*VB1m%p>X3ga_Y|<=BvzUBQv`O!iOXB%ZbfTn$y;2)WHKINqT7i zUY>96F4pLFS1!jJI&*z|vbh5TuMUV&Tq44Rl~|)K$!Q7X0x77_ry|H}%xROmy4Z(2 z{Ce{6)8@2=E9bP;%oDhWJjTvZ%_ny158FCB%NJk2v&pfM!5`CR-~x!u=#&(ME&6~3 zG&4kzJ^f{QjQ!dUK<7Lsh@!E05GZqdtkV9PHi6&-P>$~|au}am8{0l?EKTS6&!x` zPXVr2gd)UDNSR;OJl|h{BLk@g3wy=-j2_&t`6yD 
z@Z9#`A2rNsg~A}x6~M^bBwR%pKvDx<76JAd;6M%*-(5NWTjATPTfjrmmA!{rEMC(Z zR>$dP=E~N+h~xQm;^_Ux5bDByj@?a7USG@%JjThmfGQaivIDeqE1DM*!a!GL2pMEQ z#U;WdKOo-SB##(#21fUpsJW&2>*FsgOS84yzP*~ctuJ1!4-GuR1*j+onE|7-w^q!l zgbD#W2;xh}So@1`0%Sz_{^2OU9K@@a_g63H1|e{S?!A=yUd(KLxf#)^tFg;{xS1V4 z-C7=ae3MIYGS3tWho|snj}i_9nUU%*+;>-nZpy-dh?a;#LE16TN~oTR0IcoMMu9YO zdqZe{y=Gafg`s*CCQ72KYxln0@Y?w-Y@TggPmG^5n}^r6&-I7B zfma4bEuKj(0_s4mjpkky1;d#m&HEF+a2BG}6ctiu#!kIrB;P;)K%?F8g^)UacX$7X zI=nk|s6xo&)d?G~CJwm_*uw&hTi%&CJGncRhx*X?;_b<$#?s&~5@pMx1dhhU z%+PY`4?;1FKr)jT9{E$L4qQw#4O~(Z)3k$Itc6rlS(u3^+tmRbSsCrSYZgD$;r*#Y z6+#}bHf+3}I&ftRhX;^BdiTAg=9dQxba{C{e{^zodwROEeRFv*`04;zSeV2;Od?vu zL5fPGgF%M8a6@I4JV3h|uoX28H4{`8=@3VlxaD_eRzI}h-D$y-SGNM;wTLgzoCXMq zNZNy6u(vw7KRkK6FnYZbys>pWGjlWg4w#Q#gprdN+!eH??h%DlJ<%csHTyz+j*{WR zriyAA^xe3Dp+j@00tBdjcX#K98oWC-Xh%qO+`#Z^&7Gpp45<=0DM0tVEeq=xBZ~{W zjp2#ahwa&ovC)au=Gfp1gmEDTBw-H;vo_zrQ;^LtL9#yr(ei{1siU(H5P)J#%AVTh zE|6jZe*_M{o;+0I_kveX_0S93LCo-@$X^;pYqmMx1*VBff zDttU%ZOG_V8k+Sl&5lE3T-`Fr$_m(hsq*sdj^lD^eQ{?WmT7)!bY%L92VNM^rGOZj zn1pc4e5z^`&S23LC)4Vj0)|O#=Xq&Bt<`q+J4l0MQBE-Glxyrzc~B>IrOsU?J@NbP zYU%&q)x~*p`RL*A8R#efI#D}2I(_`w|HJ;@z2Z zzCYhRY+RgQ++6>$yn9&RyE%Q-fG7WZ`1I)D{P8Vzzy7y+;|U9zN4vGt|6b6%xcvX8 zyI=38_SK*^+P}Yx#?isidF^!NMfdh-ttbC`$OTEi!OnI4@yCxpkRReTKDW0fiIk!T zlP`07muuHYOIDvdJ3nY_J|F`PJhMfjhz5WM(oKPPNFcaxMPCr~;Z0PCc$o9UDPH|O% z-j)k2PtWGur^DvW(8-nD&Ri`YoXidnK62Ts>cE6l$t>WfZ9!xhDMi%#T z@TdAgrHG^(DY~n+FjG(phFU5E3Q#pOS-2+8CwH2S+WGZmt??E39Xd`a3b>(Tcn5+? 
zJ3)mwRLtI$0qI0g?mCW}S{fh!W9jC+d308PTbj0fas^rK$>p8M%KYOVAaV6!1*W2g zkllNLF($Kd_^=xf>y4AKxy4h5$@M`etg9EJU=uaU2yWr4QxS6%4|0@{mxN}|^OdnU znx}hGG?=13D_{B~^l(=pmzEAWQw6WyCplAc>MRh^s6lK4`usPnWzTZ;{~r&iNMh6DkWlqd0_XZack+k zd3WL}y1BH!IEQ$h8&d->E@WCV2~363!draVa4r;xqHMvy{Sy~5D~L)g+@&vT#G(ZH z3c~VsX7FXc>`JY1P;dS*S!?Xw)f#W?&OZI{zIwEkl*Op-NXBqXbLQ*^g5JMmco_e!kS(qa=T!2_!z0f zb~yf}zI)TG?VQ%He^Q0;n>%>rI^gp`rvy~3oE#o( z41Rkcf)o=a1TsV9r#wxMf^0$&%m5xQ;_2V@z`j)-B$z`S`%U1#fABxrs)NKQ_-Kqk z^Tdw4KD{z`^l5UZakea<|Lr|`MDP#3uPzUr-Ccj#Wj^(}v&D<%&C25CH}Ct;-(FL8 ze#|N+_;cq;F&k(1wRh&_&BHJ2JU_A_i}i=A(D|zze7dm}jSNNh}()?oc=YvX3{h$+R9 zI3fc$#PSJG!=Pw71{1k*VoP5vzM@sE^j?8pZWRp)lUe)j)&8LI+Bn;r{qivP`R4j^ z=IH$FXv+M@C`z1F6ip;a^d6=`b9&)^{I)T9v$(dpF;c%cJDi@A!J{ZV<)nCFS5!9_ zRDQZ&ia<*M8z_N9K=rIw{B`|!mK51aDxza}S3V|3q#HrsW4E70%?<_s^UIs&AG0%4 zOK;21|3CMq`I{=$O>>z63em%v8`qnQO539kDbd_ssxRTrf$a}GQ%wp81_^gakCHq| zkG}9^;b*Cu9gh>vLf>BPOI%!u6Q%Dl;VT95gh2R|3LH>ST!ma!g{W_(J>epNAO;pr$lgWUfCmRb$Sc^bf$bXT&>PT&?zZ#> z7=r>cvl=s)w)F;J0YDDo00PM5`1iN(tz%(#Yy=lqCZO?vJbgCu(5wwDTc^8%qG?GA z4u&QjyA@m_JXlGciQjdzLL0hY(yQ=%HVK?6TlrE9yqDhFweiiRz1hjd@#W#$-O1s* z2l#Y6_`SD=P{@LSos|{s2dgSI!@BG}9~;+AGN40j?1hjMJlqY0@<;MVC~4|w>MvKg zzF7K+(XsKPjicGksgYRvRJ(XM-kiAjiI%>4QUspq31C7`>OO(AzqiJ-SGzkgKXZS+ zvb3?4gA~5&^qw_JtKjg$NuNyQ_X+KmcZX-(B6;gy_pc z)8VBzu6_(J^|km>oz#WuKvb9@O?z0mf48@CY3uj(nWLTkyGC>EsD5>L`3^`)#gPp? 
z4GgJu>@bU>3#qarr1XW9Dn(SmqgzrUCzu2bs#O|YzaNc*hwZuLo2$mv!NJ7snD6x0 zD!q*I3o9FMOZl%K_){oPk=5}ws}p7qwa$a-iN;|7oYeA?80Fa-$b}^*=tL;V-l^W0UJ+(FknDVPw@^ZD?~vMOkqZl;2%q( zbxgWc-4(f6FotSkLT1}tX(6(I+T9r0k&BETODce%*M*#%bwdaH8TXTE}JJ`Dtpi_b4Xask^}?XSlrr5+!ExN zuQ6XRuqyYf?H+vc>b6JR>z3kbrv@Ra3*w5JV%&Xluw(G~4%nSVpJBIIi8o<4MIlx~ zA!g37j^~^dFmp}>MMTE-v^jc${Mv=V-p7fnGi2?41u+W(fVq1B5Ve;Kr6ikrau_Ew z=(~0-DslICwIQxY?WyuBG^>@zHHDL``*3ReY-55a?jPOE?xS&$!nm#?Lqw&x8#G zX+qvZ+xkh~uGaCeu{OOlK0I=f+gHs6dk6H6f|BM0q~@&^4x}gyfet_c;?y^Ka~1&; z12>VtVC|GH31?Se)KH;p%a#x)NxsTJ$2(+um1D7VbB}Ay)wAa1XFP_<>8XcL>njzG z1v-N!x(aG0bV#G%yY?e0VfJ{n0j$>qMWxo5d-Z^fr3LrU-!L~eemPcO7{lSa`7gM0 zzjUy-TYncy)MqM2pkwhe{L$yWI zf>;uk z%ZXuDcPoksC}a7jX3Q?yz=oe6SNIoo8$jnS_8` zK+piVgAc(8fQ;w@g2eB;ia*qbvzH9oRw@fobxHwKk^s=UmzIv1ly@L4P~6d&Q4v~| z1&oR!g`)>MlJ!qopiqJ`fs>eeN;ldst;4lZq>&b5J0Oai0;eW@|qxY-#!zux?bbgn*jGPs(_Op3xioCCHpda+_S)2!{E-0#&7 zo4&m{aWL>qwkS?)LP}&D(lT2t-_90C^liA9Jc@yeb-95O6C%=}|D=6Qn7*v)H+ga0 z{JE3Gk9RN_bHx%-U7`nzu!!#C>6`V~S;OsvH9c6K_;NRVIRE8l0QUlXry>Zdg_Afv zJOzVV18gk_Lc*$Ok>0leo2*j}&{-T6BUHpJ+n36s6WXhs=<^E&*nhHjUc0RvF&KW8 zAi4H{y{gjrFRQtG@F?sO~%@DJH!L z5Z3Z;Z13oLWn+DLX|aAi`|^5ryEu5;4K!>oL2v4AqET_6Z|ZsbSa(wbO<0*ZL(`;1 z#9&Z7Kq5Se5wxk~FI*Xb{-2x8aNYurRDSqh{+d2{uQhko!G?oY2OD7&q&+0>TMd93 zz`kQ0gAFT!5tLz`=q;-7B0+>C2%(S3d@B~gs^ShB=*@e97DqQ1g(P8yI2pkvS%8pf zB;*IQC>z)f;oUMLD8R|Cu53vx&?1LV1tfqRIMjtO{$Xd}$?c~Lq?zTdv7IHkKflb2 zqm}!Lm+~~tM9Bm)B5x7)h*Kg2f_bFHaNjgk8T~JREzP}GiA0Ct|5Du>6#$5^hhV7X zcOJa$upw=NQ6j=zogz#c(k8B-&@dQTi;9yVAtd`hsu{ol%oO$y*!TL9{2Z2Xi3DC< zOM^l}-a|p%?dr|K)GDJ)+MLbi{(5a~JO&;Lm134Eh$4>E07Xcb{eIp9g=Xntjtz+l zWK!Ra|0B)weM!XemCeu18UFmhex?XJ+>#SlNA3im0@{6szf~?~@aJ`46cdoJWE9QD ztRP2WU`Ulrb9vWg-FHVP$8vnoFJDC=NdtU8-ZG6mh`x%*Dt#4)r9ndX;*s;%$mh-N zkj0hN`qVqd5TD_|npb#r?Zpahctk-yp= zUH|0n_CG(dukG!tlFU-w>Y7u8kh-rbnclpgnAF|NrLCnUWgfx%IKCcqB#{E!bau+j zLZB{bGJ-Y%W~Uy`m5;TnvmXv`$9A?4XD{YXcFvb47xxC9 zYa}S(mI0d0Tr+Ez>7Mrt6c^5tl&w``7>ziq7;;GeJEZi5+rP!nXx+xo8z(pQ+uAoP 
zwcE4W!4Kao-X8xFrk;IsaC7|4&p7}1aDVd6_HFIz`u5=Xo0XpdzW7Dd_46+(g6^Y} zoBNYmAHwbrc+tjKowQYWj`ELOIU7_FpWzm)J>++{ee(Kfx_))FF4u>0zASk9GClCb zLwYesW^*^d#(E1hDv9Zgoa$zMRDd+YPFd8|88O5dt<`8mB_jEVAyi+!{NM294WIbK zhBZjG8P;HF20e^7hllyIwf&8YT|U~Wt*!p}JaKR}4BgzVC>sYV5_4uF*;+6FOx-CI z`qy-|71>sZ9YrEs%mAqQ_>F4dA2Q$k3+;ae?Z$8XQOZpLsm^pG1b|TYneMZr7GgzeS2}|#- z-a_!w{a8pv4`uP8KZSRHi}gFw_0PS@`s1nnyv?fCLqRp4fR4r}-FsesJh?qwTH6|5 znR$F(<0p;9!^wdM)FF$YRROBptt9F=CE&Duj_mZX?)>_!M~Dnp_0{H zigz0lzlTeLLIe)osy~QOnKA!IXnt~wh_TSz&Af*Z=P%Fmw~uEFyAO#s z^YgEZKhCaA```mf+}2U1e*%!|MV-jB+dvkA)D%CWCD^bKx%N=M|6C}4dE&prrDb)8 zEevK6>0Xhwx;UzjGBQ!0-P+ihm^+-myO|uo76Xl3qO{}E3_gKN&%%n`QpsQ>Ve&tM za${=Y;{N1wvHb0s{RFeUvR|>f+{VH)L%Dl8*A|~>85B-Y}_bfR)dlP0N|DsN_~-FAWK08LnW4fs_b`86Lsi&Z2J$D{VjVr52CUkwHgam zGL&8}Gv|+I=cZov>&tq7`nWs!$aSwB4LlY|+)i+)4h4mj!Q2!v`htZ}H_i$<#032# zI#v&hx7zw-J^xl;)%Aqzs z5IFk(>fU)~bm4P@t)CzGuSHuFM%C5+7#tEkWbF3O^8EJYLiaA?`u1S^Wcji7L<5f{ zskU=fM%<;Vf8aj>glA#)$gWjV8RClg5v7O2;;qIvCq_Tl zUH$UJf6refN%hgn2!XAKDneP>TX?=$UtSv-+d6rTh}A{D8+6q?dYWB2QwnAkhx^Yhy31>Y}SoDVz`jL=TkYuT-?_dQIA6?66I zQk{gRq9%qPpwu50e>F3^^11Hn=Li04-iq4^2R&Ssfx9QOws5nxGQx)fwYO>Jk>R%dnSGyJza&BnW_5Z=E}+8V(7*} z{c7NuU}$^1Qj1UUR=xH8#K1tn|CMs@f7SStMX(Y~HWx(eK?h-cyYo16es{eO7c*=3 zF#Cj;gMYLVvj2j!LPJlbv=3<9@c;H!oLUJ5S(0Ez_xXm!>yh32*4z9;FF88etl+Ni?Z7LENwSt1{2h&2cLCa zUtHXK_A|RbyE}W)#p6-h9DGTukbTVcm863DKg06_d{$V1c=SIT@?ZB^Vpau(A}g`= zkPbW@yL>pU*N*pJ?rs-0E@zpa?&k)d374Xub$?aWtJze86iCv_&o~wm478*a2;yeVghQy{ zM|6woOVl-Nll=MR##b*k7iTuMYg-Q>6BBET^?xx+tRB%x)#ZS}ikd0iqulbJJ@6(!Wx_b8OBv3zzxX=K`bRP$^O~&%?fYU zJXaHli3}OZKpnO9fKDLiurOg)O$^hap1=9#*G=}_V~4ww>+6dj>n$S`YjcYsLT<>> zy?WE0;>Wr1^Yyda?b_bLGA-A~cRsp)?8CRU3@)LpibN5D*wDR_q$qHv2avN!-y#$! 
zq=D#FqL}7iG#7iCxTQ(oMa>_%q5Wgq)yk`J<4knJ?oIdb;qphkQ9^AvpJ6eMk$)v(rW*TuT9Aq`aIhM3}kQLj!IyXhyV!%c3#a35TtU1ffD)Kw}TQH9O91Pq-!Po(AC>12B?HPve4gG;}V8phRA>riP3R6E1m%&h$@Nn#!g0{V6cWnx`6|C zv^(MA=IKrmD*l#;5h@~LDBwxKKpp?<*b9dCjg5ea7QW-3r3g$qNpMeHLQmjmTo5M5z+bW3ShdCs4G#x+vTvN<1c@% zhLGNQy%8uts;CGd0HHwG@OgpUG>H|}NyMzf_GTV< zb8LG0{P}SG=;?=p_2ZT8Aqb(!&`C51tzlM(rW6!o4@C>)_e4+~CWRE@2-b|i6pj`O zoI-*`;Au`ut)|6ex+=PV1yTRCAPNhKsDPvbr?J2B7gQjUI0%q~0-)n=V#N@h63Npo z-5b1_5JZNO5+q2c)sTxT8wyc?u;^P?TN2p72w4LSI_~czC56b6fW%1OhIa)3m|X}s z(Wv8Y%w94RL=8d!)Ze}vlPG#9n^PCjr)V(}R?r|8;Ww@fOPD60l16ejuBVCn20uXp z5zKDPtZC^Vweg!r^)MOr2!S55l?w|OKaP(7@btWNIijxzCntA$u{rQNKNn+4Om=C? z`KViA4p9a*llS#Z{DiGvC>>q#YwQ2o`oFgRudTmrzF)1nE+$mOA+-CpQLj&Db4#@R zC~@G;GvF12pgcPGEI!YG-BshnKU-<%`4p z(anjMjvb^^9l{ z&#;jpseT#vU&f6nmByWJDds&)WWTD{jSoidruJ4}f4Et{dN|v>n;LH84k^eGNNLJp zKB&f>D3lVP7pxD1aa;9)R7@m|vU?T6xz}#C^oK1RKR6q`%(I=D{n#IB<2JyO#1JZ^ z;`F}R7X34BVMk$t&r3oMh;g^CykMc~dXCgu2L*`N#-RlzJfnfdhX{ODz_P2W6>J<=KEMK3!lriAk6o-At@S^p98)Aca6c zA_JS_B34yR%`Xso1cBQ08f9ijvF`mr+&^w^%GLea*!uFp`0>QJ&dJ4akib|-$PyzY z*xG=Q(66|NL=423z)^rZgV1~<%L!y;@Wnxo96)s%h*1+jdl&ZK7r8DBd`_+3Nj=VOF?M_!8k?9P`^Ov z4TPZTn=BKSj2;|9;?=@A9Sz}_P4&C>;sI6X%&+gpKobBGtA0J&_uDTv<4F?9o1%-_{NtsX@ z{DGBJj*w7tr+vkH zr;e$WhD)bbV{OI!Z8vIx3Nz zn=-74L@8f4Zaq_wSm_cm!yqJj%}sAS?DOjS>)qNOTrQm~Z43tn&5FG0n2_G* z61(K!pDFSXoH8SWzd+~-1SPG^vQn}c^xzLJx3))rT$`((9*y4AXBVFi=5KD+hXaHL zMcx27gf2jEY!NClg(m$kL&bM+Ak`eG38*V}ufP}?oqM?C>Gjpo=h%Fj+`iKvCSQhv zgASTJiV`Y87RhAqym9-DgO5lPw9uxeYC4O5anK_N3e^&AB15MhuHE(zZl6zIe|*?D zy;@w~JX$%dFA)w02+fkbY7X8xWZ_p1{?W~IBL~tvQ!{ndWt3PN?8wV$;q7Tqj(@=1r z)yh;!4>Q8i^4CK@m* zG>^_wRX|hyIwtqVK|6l{Vw64Tvo9|07k+%HKh!sI@qG1eYk%i(en^6%W<6fDH+b*p zsejHvL!p9aGY13#fs9|rgI+nPR@`Kmu=dc5Wri-EQZ`oa#!mTmhgbJs&+mpqans^1 zz1ku8zjjP+NEewl-5;<^L;!qo(CuwHF|_O15(tSsOyGD~8eN>jO?X+FtsRbR&q|zK zFAw_U7<9LSqygcmO#^X_wFqUH2O|piZ*oj{2`1o{Lu166P645mE}p|QCZ^@UYUp-g z^-AsbXB)b>Is5)*>g443m;d03Ywx3(+ zIaJx-?l4YBdRP&JZBfOFswbg7eL9@(9>D=?=a+8FOK8cz;dLMp`HOf(heH0>k4igi 
zDYPW~`Gf6uM1{jsY~*L>`#h(+b-MrI;c0pC_WEL}d3Jhyv2xm29>SSkm^we-nBJP2 zI^MwjtbKtBtL?y2E|kF5La! z?&Ff9s+a;hb0l@Xk4u&j2Ek+kOwNCN`qx)>kMwYEt-|!azB(CQkAm+?*CURV`KhFv zSr7BWYckcioVwVTjStO*(~0%#S+3{!dmfNtLJ91^1}!CfK`6|T8ZJOAGFX+S%J;Yw z3j&WMaZS=L=0TJk5!r?ioxeLrvo}47->=WY)f4i<>EzSRPHoM;?dU}ml$bO!4cXk= zkBDb+2aO~J@gS0atD9-FvO$JODt!u9pNnz;rCxS(tY6t&<6z_T>dWTj?C#;(!}s>E z_6}T%78bV52=n0Q=|f-^k^o7QWKtZBN6`Yo=17T%$KL6JycCI8>0RJ)WK6lXKe z$nZqbv^Jl@P{ff#h1oR7|Ftfq)yxKQDOH233|w83FgJo;RQS0-+&sO>`pW6l`s(ub z+Q|`byaS&i1*%IDkbe(T2ApgE>QlO!Yya)RVd9by9@gnMlRvKA%fHaFMqjOLoiM7KW14}X>g`qp9Q7hgFKpogee_OuKvf%bTKh8}2@bo>; zDa!G}{%v@D2<67zqnWj}6`or-+*rUblUHl?*{jtzEH_q;x2C*sAtaI{9J(Loxeq;C zKe>Ngyw}eMXA@6bx4CpR?vM-S7wY>c1jU6sfr!we5^?d{tAE6q7B(do_D=2SHgKV% zQl<`pOW(L57u`1G*3ZvRu3vZZwB?blEjOW1|Lb?B1{`T$&?KR(-r)qAD%QP<_Hq$7 z8uLf{A7&RP=I6gmK3p#??T$M#-MmB?x`Re05XC9<}PlDA+_hp;pNXvI#{n1Ni9f@NMIwCcH?C&070BC2Far{UNkA%1LL3rVkC@S*)dEr9RlU-9X0melx}`~ z18+e!TCL4uQKgmaPkk!2h)DI-VDGJRGw)*g-*Pp+31cw_N+Yrk=FwR{)T zV_4B3%<6-_ron*T!@{n^J%2jkjk%N6)tTcjwfeelEPfb&T(A{^aBxv~U~a($LLo?q zFa$Qmk>Db%fPFRh?&=wwj$sNThrlq6bg+WP=ET~E*R5y#`ovL|Gb)t;h}Ekkg)typ zy2oWlcQp36RL^3NLv3D;HR3iQgho8F_5Gd$Oq-+^by^w6jup#>| zlYYY_13K-p_999EjHs3o-cK}~VhU(51nJR8g5jyn$Ei4c`eAwP_VViT@No8S+IPn= z7)Gb+awJ)>0`#zvX8wxiE^PI)PCd*ltuAcE^z8Qb_@e@~FoObuyQFXnEEEc|tEzyL zfQaoqc{O#g8Y{#YkqC2 z_UV;5>$hjV9cdv-6*v=B1u46aDC-LcKb$Oz&a9r+K5X4wPTR+W&ts3yV0p$_@dlh_ zRyaUiRD>Ks`l75%Vnd3NWASlp&wr=V-dz-hlxldy+ zUB(c7Mq2g;q$OIghn@o)XOhc*sO9hw_PT`i5cYO#WonhUbP^{<>0U+o@t}VE8Ta`6 z3tPCIu0MSEc)z?e?znWd65wb|?x5-|l9qrH8kE2w;bZwGga&*04j%d7%P@^}2x_dW+q1XSV*T>?ktnemCnwE|>y8xT)eIA%j$z#W z%-Y(*`sCJz)$PFcCy!@p+o$V0!(Nd!1dK{?B#y`c4zX-q@JIzEJV;ZC+Y0VRB~b|j zoRnj9Gi*>u1JTXZqdg+*@E2`3X3LNScvm>?+mx-6Zpev(dbwvav-|ywt=DQfeXurh zQk$yh=KL5gXZ#0K7B55=?%-}}t(MFMLzxoQlxQG*;fO6ckdV;fUCD#{Hsq|-0ysIS zSPwe(o7LUP#qXDG;uepai{~33R(E#i-<=`56;gC!2qYtJw`2=Y5=|kZ7-LIDHz1T8 z{ZR3%nR4Z>3#rat#U)KWd&pjoA7|GpBb(1<@MrrsyW7hJE-aoyaJ%nNH1|FiUesE@>A`kmRc7}j 
z)#n2Nk+7Ft*sF(MWb<%&@d(#0W_K3mn=?z-k3$Cc5Ll`O+X`MF)FHve)LPPeVJL1P z$tr40J~CQDfMP6eYSdMRPlQ1h`mU-_gC+Psk=9UQaq_iO`#&E%+@!vWb)~wO6A;EM z-aYBY&FRF|;`Z&q?9OWa@YWvB|ai&E0bLnkwvoJMX znLoNan6U%pyRGS!&9lUfidh(CaCH+y@x;Kk_e4^}5gi3l9W&ndC`^UD|L**L_*5?jyWMy9JW2n%gsNR-tvLL!i@#eg_Z;FA^#kL}{tfiUI8s1F-89`VSbO z-3aS|0gM7pRFP;)oB1cZha!*M{)cIzof2)fZWHa5liS1Q*>%5Z_Gxo}|D@3=)9%K9 zwBxsV9Qc>1$;;#Mw_1r3sLrm8POZgVOD#xZi7J%J1JO<8avt>J zLfwvCs+*Ak!+WyO!SukJjorDGjefbN-;(avr@oyoSd=UAna)5WJq$*#tW*Bb{Jgre zwR^d~Tz{N@Tx^cLv=Ys!gn&d!N~E@E0vx3vn@Dm%uvW9meoDYWf0uaCutX#UeT@mb zpp{PImUQ-nA=H6h54Qag3qe*B3#1YxcOMQd?=9-1%*NBw{gNEr-tI3f&5VB{zA{k{SWTNDMcvh;>OTENaR5uHv4-SY_(=`sV!fwOQrsQ{RkL5JW4% zsR=5PJ?M2hRd~mKC!+urPf<}aV4>&ORIviWf3DgeX!G354Sl8p5I}~+AQl~fDqb{E z7|aB$1FeK0kd9t6*;0Dzp~uwT^%}$4o4HiMLRCqjmt4rNmnWKy!^68g-duV79;R;h zrpI3BB7{YgH;Ec zgHgM|D_2ie2E7sK6`{}vy zH{0X_A>?G>8sIHvt6ma;!x_InxZfab@b~Q>0pV`S0De%M~wcl zhtor1^v&-9<+nQ*7i%{=H;vx}A-{il&HZ2hAA%$d9frFy9?Z!BhjfeF|QOP0ftZP^HnWMr8wC;uMHNf5S29ETYucA^Imy;XJV z-oDifzkT!O&D;FGAHDtI&D-d%753tG7T&)7u1LR$vO%0CqbxiAdyo#Z*0|((p7#eo zeE)q}BWR^pf7Xh6(N&b>gHgGkzj-ryMmCD4w39`xJWRU1sHjjp6Q&%&bn=HZ8QV2t zScECkXw*=5FaNQ$+lhO zvsYp3e3(7)YLI8?Wpo_3^Yg;`RQhARK{^_PvR0N3yXQ$X7);bj`(Z22?~61jJ6H@u z(MFkaSCk+&fwDE?tYviim^-eIB7t$_t%!2~EK`oHr@b^Q9Q(GdUZ7LD)(l&h-I0Gu zyDZabb7Xv3=JxOIk_klAinW%IEIyVvkK*omJ}QC`ctAiE<2^cKvGgPwe-aA@fSfNg z3iJK#_81<&mfs-@ucGauMl$U69+Ste7DG9m(JV4*CDNWWHSD*;Kk;A~P`xvj47tbr zUYuM`H`hyBmnDS8slR{sBJi3@@G>c=S21&AckuD2xr2CgA+v(Y1=L7!u-n($6w&dXTWS0fJ$ZS>B z{ZFv#Z8dfQL==smf8q@Xt*iK~5%G?R(*5!MjjDjlO+GT=i=~bHKjrgJX9*!nb zH^ZyASB!nxu>!xv_-)r~wKscdd)SKxZzpnBdvP}@iu9t+B-Hb`74{yp^R!>)wno;% zIEk{^(Mhwo8z*7!@L9{fQRS__zmN$GO+5Sr3WKO9TXfcahRhIF$_yd`7@_%9=4QVR z7wdy;=OFE!H}e&z7sKA-v(=jbshqi=Qk5uB6^|lh9e92}YM1?`3#J8K`>yW9(p>)Bp5Y_2DR(=X7jyHEHa{c)Tue? 
zv{srC%~IKz$%xVGESF^}Wb`xJOoA4i2eV94{IZAH6mFFQ=R{H1gLU4&3T_1h6jaAJi&vTfs3CDiP<5#g z`l(>i%Dlm1BM${8Z40{24(b*vd7Y z1txZ0?EI8%tTwT}Q(rlI=r-2|(aO!;cE0#)JTDNyGfJaNqcrG{EW&^1K%-J=2)S}^ z0v(LSW`5kdqz=9QyB|-ke%v~_`)PN3|N3AFaH|lAvQ-D4mKy#FQlk)9Q6M{?)L0-# zqOeqZzn>{J=V2DL{}rhz%`bw~_-e|6DmzfbxhcDIaC=*e8b$@yl1{dJb$NeJ3r|@f z7dZApmKa3EPiqa&sWs4KEBHKR4TfQ}^9il-JgGG=kQ!knM~F+s;}=Se!Rn(&^5mST zn;W->M~x5dEWEEBeY}B#pIUeQu#+r4Zdn5L?49#$h?S?M1}CNFb#QAaVDF8eS8YVp zks(;BvzpHno1?v7lC}FhYNhSyD=JO3?XNF0Le(pp0>94a{4(RCvwXO6TC3}Mvvbo* zGQ-FHg{P~v7{FKBJM%4yI`~Q}8fSHX+IyD zm7h*TuX|wK)#Ag}Q9$CnMFrYiw<5t<;z;J(J+sDu2-xM?-~Bw5S^p(z6!&ev zvc!~zUr%4kGo)FLid0h6I`)?56PD_8u7zi)a}XGG#(ER zyKj&2CgH*UN!_g;b`mq(q?@(Povgg9Fs6KKSa`W&0uVy9;>8&&4CYCxZrC5oa$wmL z5(|4U&d>BcO4`r*^xP$bS)U*PK(7GAuEzfw>3kUgndUX}& zqru-#<0LN!`N?Rm78AgE6qX1mdsXJm&J_JLo<3AB-`48;#5B;SF$@?Y3_herEHA12oPNk*YN!6EmB#25cRW zSjW%a9x*~bY`*Xiz}YWwL6KOkch=eky*aYEzEO)$;=R-DHL1S~(}#=G&GZZ0@Mxab z8jXGTZm0Wk$6TysjjheBzOXkugrPE)FiweuROkOsqT#(c%0FJ^{rBbqwsy83-W?w< z0p}#}!wg^)otfbrTVAqc4dz#1>nWM z$w&qQOJc#XLx+YYxdBcCL5&Hu z7Rc*@l4Je&eqFi`XCE5F>+HJQ-Mrc0=D$ad>gA_Vc9p~+W86R5*}OhmTgfbLuXg!v z^Qc=(JgUE15SG>3mqmj#j%j!Dx1SjTjSy+bxNr4?1p?jT4p($xG`IB zH#E2!Oh_{%GsQcFBXK4-@~=>h?(IJQ<>Z&);_Kn@$=0?GSDqaFT4U8HH6SEh)FP2# zT{X&R<{DKu|6pc!Q=<#^^vUb1<41mMdbPX!bZ1!nV!RqfXgYJG!`#WbM`|IUv?8QI zk$BKrw9ngbR>#k~()4b)6Xs9MrirjSG^Q_!zuS7cd%nB3aE67zhv&ID=Koj%Dsq(`f(%W--AM@Y)1BS< zhh8>gGD2_?N=Bha(F~mIBo0)~j(KQl=B%*LbVHYU(#qI4)R>(ttJ5AV7m}cquaJ2^ z4t6iDA8lU0w~O_wmp?yynsxrZhb~esVxnGJD3%Ug(1M~=RpC}HHGATT5)={uQ#tb( z47*8D1wmmI`ml+2+uuis*LC%KOZ137(SJzS0`v5V-sZ;HBI5k_{fq7AYp<8&4Gqp> z{+Z{E*@O!v=20WeQ_WUIY38U(Hp~us-ij%blwB$_yU;^Zu(?`Nm>7A7c@SYy{~DQh zdcOH;c>Hm9ed+oUT?`Mip4LSVUsyCn2&@Q^tQ^zRR2D}v6-UBTXRd99Nnuc~5RsH| z)T>gGq@ZGyvOR3<^?3Zv>iYV94NW4?q0;`w{Q^ z*6+hUaqA2siBL(S8dE#Vrr}juNHO!wPc!f4b391!blh%KVRk(3K4(5mEc@`2Ng$)r z>F<5M7@n;>kLRn4hg;8V=ZB9Q8WE{bkXbQCI|*4$8Bv6kl=xpF z^ZVn2o4+o1*wCtk+00#CA!8J#TI1ppDGK4_PPWBr4ul29lINfBdim<@(c0Pd(dL_l 
zlZ~^zU-@$J@U7x=XH})AeuT5%GmrrOGQq`L7D?0iCI_CkT(SQPkwA{)^11PkW zkSZq`F=CDbND1H|ofJP?M9+@47TGSB7B3IYU#-2uFONFt*mZ}+;*T9ts|EKuROendgpB(3scKcdFYOx$gBadOG(ZgwY^|LM?F$ zIvR^-h=S6pv^s1GHD^(m|HGy)L*I{VnW@O`yfzbso!K0LU}89mF|xZUj8PUI>=lTZ zWUuRRBa+=)B!@^i7*dX|V!tD9O#5>o6^08>uhw_2_V!*5e;K~M-d@_j-rqU;NVvra8*aM>*#vuCe;ct!EjT4120;c~?hQ5({jjkVv&iYYv0JO_2d8_> zZ;I<;KQG^&?#T1K`G*ZSvQnJGY*csw)Y#RzP(kWBtK7y~Rf?){WkS%!O*PG6j+F4K z?4&kU>QaVptohPdW6_E$m6R(BR=3z-RzX7SQjz32$zDoB1w4#GHN^jI&l*L5MM;EQ z#3xzn`tHc_(3C33Obn&&c3JuZspGOz$F>Jk*o7= zktU@Yq3TG2-WoD9OGP+aP{0e5?5Q;kWe+Hk1lz49Y9ff;O(T`LreGI0z^VwQRq&B< z5N-;UL~1dkBy5tsZaj<}&>CqdC&N9NE$=Ujt9}+jR4xBQh3d!M#lgoF;o4f*7+$Zx zh_`e)_x-7Ak?i3>F$s@WsKi>aT3NcFd}j5jVlseSl_Jn(p*2V`m040&Ms`>z8W4Ku zit!r@$FQqfRzOV4n7HURyrkTrM1UyLc#^?>hr3y5tO9p4b$2bjH`Y{d-sNfKr8#{o z`~({Z!-eIgcPF(!*!V!J`j|8EKnDmse>daqJKtP(x~;)aH&@>o>8FkLkDuNk@2=vnc$4hAgsd!McGm=X#?Iu21>D(d zz~B*iwYTti-JIKf2hQ@~1u;>;XsCq6i;#P&AfOuQ^jtB_iz}mZP~cdTZjGXBtl}C< z z^J}-KTm18v+NaYSL`imyK*LCqAtm-AYU~sU$YfMm#_{{l%gY!O3wu?n3y&&sO1iPU zdDyH>!vVQkh(Go&`-AJ&;0@wthraz%CuGu>i>_1k3wRC};Cx1``Ob%*;g|FI)!y#h z-C*JN^W3M87gG0%<OOO$G3KE6;co)M&VK5W0 zF-C2On8XJ5p33-TZC^(9^UVx!MxF?|m1b78PvjS%;u)&Jgkgq5(2Wv*vD)XD9&{NB z2&GNf_9ZKaAlKDvv*L~T@Oi5~J>~YpUVm`heyNpzuXA$JZ9PxOhQ9n(ZDVld+}btC zENCJYs9yZ9KMP>-6cr@{7Bc(smJFo`5YbQ|#xz)f3P1oE5`$Ri3j|CpIb6K1)G&6< zqlT@Rxjo@CO77`2`E%oi{ZDYa8i#k$xI4b<|9p1ZJXx)MVX*ulk03*Z;l8Bi$2L<7%Co37diHViA5|5%^N zC0HWFgjL0BE?kJlX=F7OuCaTBwG^&+qAr+}S#|O~nS)putnSYpHuhJ}uN$#6cYM*V zeO-wD*w^V$VR%6X%}iGfO_}0-UdaHNw68-{%anw7KSi8ex%L`wAY6Nmx4Wm~Cyf4- zRW)IvA%C@{Ztd&yN!=31f&}P)MgY~j;}TW_0EC+A*4RDu8tRs2R_hj`vPMkaaP@YV zua6dY)_Y9rtBs|*op!VzYF`(GB}J1c6xB$F>OzX3MzX3*br6wW*z_3lOM-?nIZBAs8M-9o;>!OI z=mu3sB#K#R(B$31(tT%TVdeAA@;qa=yY#qv*ZsO)d*~XJaFNI)5YM5FPE#q0VTf!D zKwM+&5-FyJy{q>A<*hC;Jwd#TA&!%eqP)FxOrLo9$`7)2yyqV-yL?=F^BR!!?d_UXsVMF(bU4iH48`rx>LFf`zU%13#B5efd~&{ml+h;|JF%ikMiT~ zN<7Yc>#o-xEd8>0Fj)Mv-~8`EoFpqD4rZC5fRnqR9WXo|#OW_{J1ZycgYARv`lda# 
z_*Dml2b8dM4hIq=8iIIW2@oJjbOQH(0`a5n_kSA&H)ZkAFf=D6V;TiFV+m(ZW+LLi z98GKhi(sMu8@{Vn8VO7*p`9U&_#6|Q!PnS5TYTuQHIA0fKOA)qvemh*J+!f;F(R>x zc*qd66$-f+2cZU`O##|aTv17cEELJaMzM`nFd6^YCJ5Bg2@Qx+5NNyv-LtL6I&U@3 zRYTLw|5rzH~MRt+V94s&qb4Cs`a(FdzfYi#t1zV)DJvrq$J8O^q=IZD3GuocJeYj}Y-0Ah!Rk!BsKuCpM6B8WO zO-2t7EGcdxBFR`=dLZy%1C|*-%=qE8_<=wwM-RryD*TMapw8CSf$_J8=IZ>#aZ8u( zKQGVKjUKS5VQOTkNfLz(@day%3;@;?5*@zBa4DIoK?-8x^}WYX4haRb8C}fi;KcZ z=}<2MQi5289Raw$lL}PM)bJTy%;@6v=t3$-6`G*L%+J_qyY1XxHMh3Tx4&HJ)`v#p z{ALT@8(pAA;p7A}MHj*0R)mTwm^q50LDZHn2t+JRXLK>6i&vuyr)tw7K$d6Sd;Gk% zygk4B*p7z}N1I0r((_)UcCRM>o-eYfd!S-6A=~ilVk-omkj|P^pt*dw_>umv!a1ySuS+)1f}D9NwY z|Mj$Ub~9*Rf5+>SjvG_3nXmXGs{A)}oa2Vz%hwB8sv81I1Xlu`{BC0Zrg7cVd0)Hl zAMb7+&L3QMwrXFqz+M~(Ms2u%2aZEA4qya$+RF{Yq;CE8QGf4Y{v_Iq%RdkHZ}Rb`_C@)$mPm3@ z2vc-s&X5|5i4&`f#D71iM5tO!%q2+!cw*c_{}gp+@qFD**ZcH&u)Ed1I;cHve`D`V zmfE_4F#IZc0iWu5STav$76{l-5=hJ-#LUxYS|mu85s92!l51~V<>De$_4!VB-|qjX z)S@V8FsuuxfW;9pa6q7Pk};c6gng`=B$SjWKEL|<_!r6DeZSMX`iHKPf#3OWT_zeM zPi1(5ntwpXc5VG-x8G?UK6h{CnmebRdsv9d+XuaKn^9wI6aSgvPP{B^0z^u}@Wt|2^;D-PM2g zirQWM*F1hP#lO?d1_Xi+?3*_5a(8?4o@v!p8krhjtNlaX@6+B4U8 z`@)x!XD z{4nmmdb7QA(m&t1JMV79`Qh`_Vg(IKzaPei1wcRzjMa3&trUx}kf=alFx$V~Cr1lN z3iw>FaQ?&Kd#9fyDC?0gQ>)+kX{vp9-)|lLI^g!x*Vap?|5Nv-|I_qCxBJUKTi3@w%{_OH9)A1p)z)?E zR)HctT1ekkj59 z$)A5cmxry!D;+Pu$@%H}Q)lye@iQ9)yw|^{(H<{?BH*6mms7hZH?uDbJ$PMgJnv1n zzufGeRNWcTrD}!&4n(;YV{?RX=hPpu(HaFM3>=s_)Mz-oqc~rcJ?<+?Rh6%ADGDfk1k-&hG$IpnD*5>@IxzO5pSZrMH9_Z(9wLgYFkq?|5XC^*5ZM$>D@z9yqExV&4mBMnMhBFUbnr9@Wy$de)Q?VD z&7J9k%l_KR3vC_SW^d=d>U1C~Aek=CLdipLNEYKT6*cqf!U3F!QL?TpYC24;t_UAR z2Lp#hOB`Ro<*D6hUQVx1+13hPbeiisTkWOF!@&xsp#q5qU<}a#nUdL!gSEnRz+mE3 zgqjXD9VTUqGrA+mhNx;fe%GP7@OZVbH4l3at*5j8{PgwA;r2$==JD( z9cV)&7dIEONI*lXE-lQHLILVZqo&0$EhcX6??jH0=GE86&6+%%+tl>! 
z`qO8Gg*vK0xOt`x&>{-C5G4m0p;ea_A)zW*zwNAPF+p04-bh3c2oCqLd2tEt#%h2+ zPCcGJS@UUs=3;*S^YMZUhEV__q{(0{f+aOL0FaSIs>=&8RuM9*c~SFXQoK+b%?n0k zAsxR}LA}TB)#Cc>(&_7MbImp`UpF=@H#+q$FAxhRI6GLnYYyZWR15;e1Wc-&U!<$0 z2G^~N8W$78h091>5Ttn;jgO1&)Y5%6^q{@EbGp3jjm!N`|MR66h(%SDfBTCx*u3z9 z=#-c#4vCe;1+t`)n$*pUnirERM{Kn5g(H#@kMHn%vV69@(mcOdw&#t5#rxgP!M=P- zT%?CGgIn5AF`_Ewp~fCqVOn4?g{jurMNNxwXfghp645ARCI2 zuK8Yj{$hL9+Eqga3=l%7aAts$i#QApY)5fGHZekFi64Ln)Z&?m8GC5J`>q$E2kk*ay*4+?MhG*2|)Nc={4Ye|iW1QY@f;GyWY1ThJ% zqx(-a>e=D?m-+cZ)#3Qm;ADu`=24K%|nzGH3)DZ zuCpkpvbrjPjhxVUIz7AZ{Kcq75@b_MayA(bYAQsklH^W^r0By5iL}eq=^NueDg+U1VQ8Ewpzl2mv7nN;I3jCk z&>JJc5~R%N>X>Hw?uQ#b4R5PAKyxHPRYyidnb_33j!a&|gu~hN11{-yfqwZk-FVce zgPH5z*6w-l6L8SMjY$p3f`$|%RDLg>|kaAp+TlFKtTZy1_B__^nCZDQ4!V< zWRL{Vh&PWLjN11eH+&QZzpE!$Xn1HkPmqB=T5Kf9GLk-EP0{Oqy_v-Yo@yUmT`z6+ z*PFA8p8$d(=3&f+#xm5y_Ovh#_6Xr0kqic!i&|u+QRFZLxgZ9TfSIMH*&qa!f}OJN zU;pi(3r|a$g0?wLDNQ9VS6K(#} z=(S2QPo|FuGVHCNJ@#f8PM>=RH{I^-%gBQgmL03Y!|U!Ut=#_Pq({^{44y*RsCJb&5! z1d%PMj*t)lI>>J*OLBIeW{ixU;%*5}kJMcIvR|cyKGgOavNcVFMMjV-Zg# zAxmJXu2!cgBcouw^hsM1Z^=xLl(OR0FyKbKBboSeg zJ+3?(xTGMk3o@zV5FAKI!_b(MtJ{%8aRW51r8;XmOe}Rk(9s*JY=RLV*l4$-`_}PZ zEWpvxQ)9m0ywuyJtE%DQy%kkTD9D(sAna&1(1@_201O18Ar;1jGZ7Nky5BV}CWnjB z6VQf?Cd4{6EuL>yaN(-8(rsK!byn9`_HXd`>u02enIVS}fef}S4GRdSGiRU^R~HwU zY2fO0?NQ@ma=7r(;}DFfkRPxK*xj1)$GeSJ`AR1nGc(J#&AG?^XXHg-sDxlL5D)M| zimRl88W>l%Q-KPBm{EHU)wq}(E@UJwEHuOH1FeefOTS!cVRwG}>zAv3Xa00nj;B8z zEiB6KtWagf94JM!h%>VaL^xOXrALC9fO!2%w1&mRVBs)Yiw_A{=mWm>D>tpvuRMEo zwZx6Uttv&-xK#?Oek$Mcl&` zOxc7hE5#RP4CeN~+rs;rKsfLL~MYZ@;6V#=4C;9JH zEwvWkKcK}&z@smCNi}2~|AODTf4)7q*w^_3T|POwzkH#M)xA$ZK|)p6V0S_rl8~Sj zNmdLn)M{$+g$M})*3Fy$90qGCz8@^bM<6~Dfn=IsAMnQAzdVnPyZ++R>D|#Df0>8f zmF{xo5r_p)*@MHuVYr2D3Ly-JW+K&8;==$1tW&}NA_HqFzRxSg2j3~PSrj!s09Wew zdhK{Ve|kNB*}htb?Zr)~yc8d+7hnMb5o8+_DHAUiqMXiPY*cY6J|RXX`rkE@-g(Kz zqqhS&Sb=c-FrRy+r|#YUZ2zo%x4p5lvc~tT-Kysh2E~L?1=Uc7dP(3$N?;VJR#6?k zlmyb0|K0no@96N+$kaH2JQ&6gYv~-{Y&J!YZ|r($r@#1ovb$M1Kys~BnZeIHt)ohp`8-hl%fs{>HKVae4o;y0* 
zJK5iz-Mn#X`r&4eucoTb2Gf!hp%I}fI1GnK4NP?l;OdHWq9hnFO0D2q<6&|w_mOz8 zkns4>82LeesYBh9?qT=%XtufcGQFY;m8Sz1A#=3|*B~8exocq{1$0cP@|shNmBm9kq8imwTQwafS6bM}R(paQ7=FMw?{xE+j@I|*j$dYbvGC<+>v?Os@*~NX zg%N;}G0b%Mb$ytk0Y^Ggg_X6)6d}~=+@Yq!#Bzrm?F}MG1`;3WJIwB%zHIDG_4FtfN>F2olUV zzVOBF>p}PUb?NHvHqO_(cee+Ref*55a6ya+p-93Z(@0?<6CeV>>av1`GZgBUWetnT z!QwlXgpNi=_5pK%)z;R*^d;>y=l7ORj@CLSt(p7F&xZ>u8ct-OY~Vb!I>Ab=;Y6$z zPAyPFBe9+wu5s~CikS&F`BXR>0qH!%!9#P!2}1V zV51}=f|5sMmG!`l1f9HgEv#uVIa=6AS{S*S(+6A&UoIclZ0)u6we|Jv=xKZ9c4O%( zJ|8Z$Km?>HkPJI}b1|X_2_(u24_6@q1`yYB;6Jl>W=m~!M*x16RC3L!-j_@E)huS` z#w9@P`wmZ^IDeAgXfz->A{k>%*}kyFAXA^d>C>mX&(IbFvjx@XEs|YBxsRTj{mJpU zr;VGh+xwd@7t^yluzB#y4dG*zVv30tf?jDq@RBT^>BCGX0khOTT5K=#>{MGxz<@V{3owjdv>2i-=7TWi-zX08Vz+wdM zAzC>;huth$1ktmpQ_2v-l35dmaQ_Lnv)97yN-oaKJ<~mV{Ivb;ZuxQLa2%hmz8x&b z!qdqwK)0Kw8lgjFAiGfUNOMM%493Rd?W*7&8NVCe^-CGBI<^_q#kr5AjO>ll&eX)r z;VeF#PR*{oJnzQHuYfk73WFPCrA@>NIR=px1n)$Yh6T`2q7UPnL(8y=^gYiY? zA($Yzh*nZd;FReo!|6nyE}R&F;!c1@0;-^h`Z}CI7Lr6tAXK2__X89O5mVGqOy}rW zrU62Tin6JwtGT`V9PpU<7zos11uKyzgb4!5 zG;k|eF$fUy2#jFCZdMQ@C1=P$OsT9PJPg%Mhc(HcP6{JbELjLh(@0EEU&R(!h#45& zQpG|T-@Ri8ah-Rp(DwfcRtW0r8I%_4oApp}HW{f_cL~8K(%vi# zJRK21ls{m3rIOC%j)?(DRsBH8gv@|sPEpQk&YA*9czD{zCp6GVCn4BbA}D4SseqwE zy9**#6!MC`Tc#3D!`8BsVTddXQs5SsCUe>;Qklb4e`{cxUlcT;kGyLR$0wbXROTNVZc;S7Xi?MR_% zF~JPgG_oAL>A#RLFCk<7iP(4KsCl@{;T zN7r|+&qj7zFFg9~=Z@;c|6KtlB9(MZ;-Em%%9bAlMW_p5c>S3z7$&XVH{J9%)K+Wv zplT0FBm;LKD_v6nOgheu6^?g^}FM}*4ckvaCiB`T4?Vd-?DXl z(Y~qvh%RrA505WeXKQbN`#&D}Gk=!0An6YT+_t~}^Y?#{pY8-@@cJ$h0b(NSJZt(`1$zK`0e?{#_pHByQS9n)cWq%*~O*<&RW8ZIMP+nsRMB4;+zNoAsXoK0p5QP zfEPkVM?oi}lq%ckC775sj3ooWDsU5^D6kCS{u6GYtc4rR1UdRx`K?Y)%VT?Nr8TvS zvn$YE^t;oo#={-OBUBPa!bo1j?VtsSm;#ehSsDsAqe7VBknSJp&NpHc76D|*p6R~d z+VYv_nUTr+#YNgUTAtWnd6{oI-EJin5Ev+kAv)A;QPcxD7*!n_N_SE+qz>W!5pFfE zh1FLYp<-*p}?h9-#PtIA>;RYy*1{kS=*@<=*kJO;=zpSZI34tLZ zgv0N~_S?hZgx)CXL;#~c_8k4xed}~@WnnDW_qQew%zkK1_xMxT$?rfjywb#i=cC!;%; z%ZGX~|MhUJX}nt@Od>M{aZb?=-qi~TMT#gVl6CS$ivQG4OAbstw} 
z_Ojb*yuq|Mf+4C(pms)|fyD%k4J-`@Uem!PLct|C#RNgUTksi3s=&$$PRP}fA%LLf zzu!*DTWpJCE!!@j7#Y2@y+5`c8~aO(lb2t1*Y|1Z@cC?RwDD|P$v})y<5a*-wz(t_ zpekhMN0gct0Znl5U>6qdI>`b>JkS&+g|hl^&P3o4`t33awQXuFwc6H5GH~af$)0@K zS=pU^dEQ$-Zo$sXW4=s+Hoi+uM2cttc%?@T5us3mMzBYMy>XEKu2bmGHK}-Z=a+f7M!O{6~^P3Q1U)gGl7-8Y?3z0%oZWf}Z-{ zoKW~Tq}>SX^2FhS=`0{(sz8v{4`=l?)goC4_EdhaNOn)D{W43>7gKXzAHQBMj|LoLrs9SRCKM9*=L1Egl|RN%H}ZEI}>^ zx}Jvh=i8U#rSqq|vBm>V%q28J zkc|T<=Y8E!uuE-$gRhd$?=i9rInpan!H`6 z-OZck<1Jo@Qb;Kb1GEG0ESwJRL5#*9LN|NyBn|;EV4&zWJ&;IJ#mpQ?JUJ?KBb%Y< zPX*w^4Vfly$kYMy{pyCsfM(8_)kb&Bh(i4=U-oWb$oGEW7BT`l$0T)YKT-;IrAF!ox?*n{h2j9>5*ogC*cB&6qy>19Upzx!a-q}6yf+WAN}+a zsuTdIh5}t81A+nsY07L$o=~Z?3y%a*qM-^tn+lYQadHwsa3;g68Ie^q-nn0uw~hh8 zk_pMo0t$g>gOUwuoovh*1l_cE**phtjvkI*FD}is9%q;GZ03Hu`Lc;DsUQ$cM4eeA zEX0)Gf1?BQ$E>k|i+G}#1)~RbTVq=ZM*{#LOM>t&!!p28X=s~IXd5WO=_=;poJc~rdIF2&$E<6j1S%Sti>O*?RX9;ZgHpu;(3rbSRTE6mfT3?b zoo`4ByF{3PfP@*p5>6)@??yLO;3{utFh!!l{6;+Ers~{#;^YX|ZF*yW=H=ynVefqD z;&eKiAJ2$T0klX;#R%8RzAu9*3Oi&1HF(!RLoL~VzkT1@v|#cZ{Yh#N*~jDaa_RDE z=WOip;$U%MYiviJ_RibmjmMis3Ol=)I8g$2BnPCpra55}OMFF*~tE*(ZbGs^TuWnaZ(kTFdP}xidmKmsb4KONT zSnsac@jE|i-QRiZWGjxRaqnPe?DQ9K4HqD=OjK1;u9i@FgoU#wi+zx6v_NQ@IwL3> zceRZosH!k3X}3w5$ZXDmGStl{)D4rWlX{AVzY5hm(0K1&RoXhw_(}>43UyIPut8-8 zENe3Z4r1BQr1kZ^r^o4=uOm5ju>Cx~Hw%Xshs|#ipo9{L2I&q??v=pCh$J;pjWj<@ zHss+UfaV!SU1XC52~o@}9lh�i{IOp=>^-Y<|o{{_wR`Fy6II6}S2{qygdJ@&sZ~ zUtr-z7=5H6mnZfXX76SnW*%4W?x$wRPzbPD9%Bw>>;V~8u7@2;NXg) ziINQ_99`H532cb?pNNOoBkrjp)W;y@(UluqTiD!w+!#MQe4cGB&R?E?Z9L+VQp}M# zP~F*H$BaT!LNXHrLBONQh?8Ns(Gc+e0IrO+fGePBruIH4{AvE_a_w?*VrF80c9ngk zH8R%xy;5~31Ocq<$|l^Qa4*6MPHaJ()MUOnh$9eVNcf+GdoAG!DoiH56Ml6%g9~Hl zkEaW_sqDlZ{ znLPtOJu|#fwv8>#J+y7dN|WqwV(A8Z{no zRVi+&j;d*4_&Nx8DCprt$j+W0WNSxFQMvJw7NfHK^b zX!5Q^4IY17RWPdi2!nQmWRMC(AIC86o*h3fY&~DB-?mR$BPW>)k4MdSgGUKNOi&R@ z*4J*xD4?ulNp6fPA0-Dy6?Fk)m8uztONNVuqJw6)foG{CNn}Ged_p%===kHRaxo|| zkk@DKl+_seI5^?_Y5dFO$jr&&>cK_sZNRvVTr|IjQ=JNDxUdHVdvut=i-&=6I&kEN zq^7Il@1h1NVgdrrNlGeQi5ij#(;S9s_-twzsR87VtBS_J;s&YJL!xLRbMJas7`dml 
zFH<*XIDdYAxqN;3ZSt}CYi1QKiQ<`viX>85RB<k~$IVN9lnq3@{X2K#aRKc{8(1 zpjaZJRqD9WBwTp7s`!+8r~>lutBS_J^gy+GU`)-#-j@~glaF@1`8f8}emv2I`MteN z+-bfYFiSvUf=Q;jRhAX%!HMcXq#u@YFlKZjLX=ePA_p&MVrXUxDCpI)Yh>r(90m`c z&<-_#e7~w_3``DKn{xnAwe05%jE8f1I2?_$o0o@N9@)HU=~44n7LrylLNHW=vSaY4 zuoezsWOY;IkFtXlLy15M&?Lq#cA(;D21)AS&QMJ|0FiU%FeUjJ^-xF1_p8drp!}e< zen3f7OcL(BwE1}CthGGa+FLz8T0GcTY~8&~e`&lRRK0+Cf^&EncL;*Kh#)#4b7Xvw zAlM3fuo$I4Flo2ZlBf~t0?z3cniWB~1xOGL{qWiRQ03!Cc~sFDkRR0RD^emf4WReM z#o=7*=IU9`cUR}jcex zP^KZNFYZi8;XcwByXTt|7xUSk#HE-yJkQnT)vfhkz%*PkkU^3?Av;xrh0?`IK^2kZ zLu6xAP!!PJIn%k@{%98GKrk~2b&YEOI50#cas1l4(V=cCudV>{$`2Jb2A1Zm_X83_ zpkW`oAor(7*T=^b2bVFnGyT;c7uPo0Ta6clk{2USVpLC)YMN6cgN!ZN{bT%qSvbIe z9Y{&sx$7<|iZN-Dh&oiG5z=s@9O~iI>7j;?KdvengX_^&sTTw%XcD;gUVHiF?0W5J z^P;`GtDCKz?T4xTFU_y7Le>&Y9O9;womCV@C4v}}gN(r&Tv{wa=)~`K_}!eu&i(4- z*$B4bxjox{nVy_E{6AC{>;IxXM7W-CXEXE2o(bQa8h@BvJsaPc96Nux$Cd513vPal z{CD=QE2)tyif+Xl;ML9HlQ@9M8Ga}MB1a)$VD&c6sMT25fE!GIOihh6>IR*2ZoUs8 zOK4(-sk-$5Jh2c5n;9^YHyPlH!jRzc1@J)z`~)qXlz8y@Cm&f_yc?dqo8SM2*IR4T zx2xFv0Y3;Wp2UE{3J}_pjJA@26intuU=0R1GLSNpej)r%gd1voy~Q{o%m**Xo{#KY ztbIF~+B?3xYIW`MZf4_q^Vj5|VPT5QFePKt%AyT*VB;Va#b`3dRm~LKR)Q&(>!qPLG=p@Q@-xt|*iM9u{C&ImcFFtE}lD%Lc<*I!KUk@!U%w@DVl`MozhN742*vU?ypipqe}_zGp0SbIG|^^ zLcSmiHNUgCH8!*~nTHpxE8myfm6g%!=BtBc$skk?LN)CPU;qk+1O|tM2y;{Cv{4Y$ zo#RU!KA||g_nfw3eR^6fCGZ8q6f%1^X1dw_aT|9=503A~ba8a|0|)~b#dL7cK*-+u zOO`}uRUtPZZSI&hGP{^V%9k{JHfi{e=CqZf{sKv; zH{ZM3{8K5Ay#T8laT2F#C9F_n0&^5$qXrLZBRFG-DSQdUXB3G4U{YJj>RT8yGX!jK zfmpNC;e{LW^Ud*gYk$sW)@JU1Y`#Dslw?5VBqz#hWZ{q&4CKPm+(~Unwcs!J;Zv%^ ze=(`8R9iMB&coPhRXD&A`I<_2zpI5MCIP1A|o^wP#e@i=rAR zFgsb3C$(8Q!9BjRlAll>{*y^RkAW{Rx6=a*zrCo$GUL~Qb)Hd1&9(U&}YHhK7u2DQzwRvwHpDG&!QTFiXE-Ppp> z-Qk_fkqw+&8?vG1s{^s`^h@!E05GZpsR;j+HA`qMa%JH|0 z9LDF?#PV1Oe7sRavtA^b0ma9)RS14;pT!AV9FSBJAJ z({_71KEJv;q?^HW+s!{}nAHk}L8L2ykyj*KMHoO*1Ek3S2Xe4T_`>&}__q2M@DOxm ze?cu4uUZ%L*&v4hQ+nrwcCgH=lay0cq=s#hc&!z{uSl`^%-5N9VKH?S{^ z|LLVT*2n2)=E|mnWBhzNarAy;2z6mU$L^*kuP;OI8isr?HFwj*Q 
zLYnNSxI~!b7sLmfef)i8X||Qyw^uW_?ZvD0p~fRzfQoXE88A9~ zZ^fKSs1UG&AgVPP;RMKt!v8-{rG9yT{c>&)0!J9!OKI=L%+~jt5uLgkyWEGH+40k@ z<;LTiT#A!{GJZl|19lGvU3j_c(s)jFw)Dgeb;nS%@9YP+j ze%N?5f#BL^b0pWm7c9WI<(-MMle<%SXb+7q-kw})EH!_TC|edKa5N@n26?Iv2*of0 z$xM{rMjg1AW*WGpB&KN}w^$3Qrm`>-QC8If9a$Mo<4YYrojTMZgMowaISJ0liM-)=^M2i&Ed>hXoN`?!YDyn7B-^L9LeVRKJAVAf>JcCcC z26YH|y!v6|)#8R{hExfh6kzb)mWB0;k;R4G&hW(Q!}jdP*yzM+cdYpWVO)p-N!UZe ztl}GZ3bGj{NWOtURM~GxeVv7X02E_V_SA~IK#B?c6*zo8d8o(7<5fk*YsteieG_UJ zA^Vk(0F+e(86Yv2<%urY}^>rTrEJ<-^P2eqDCZ2 zq!NtOcl!-E2n#rofmNhIB7ls(0*TM34MTPKc)Tjec&RjW+uyr=hsLaPU&) z<=Gv_<{EGACw8U&T_p{zrT_U<7w6sOqlaH-pr8EfMCb0=K6=_-NW|Y&FP~CJo(?lr$-0p zj~}tyKJT_WPgu}B+HIZw@rLfj<)5$a{`@?xpABlK`un@+9332;w@z37)w4ZX>&ZX9 z)YsUg+(x2AVwYWi{jOp>wSxqr0t7m4`W^34b+TQQJ zDp#v_ta4B&)c^dmA0bJfC1j_)+x-QPeOZ}MKZAk_W+H>f>4(dU9X;I~*}i!o!o$mz zy;X1ira3_?ip*w~N!XLaM-VJXsw|*x==IFhecs!Z=b7oLuuoNHB1?%#3L-M?%TI+O zpzxanHx;gJ2oweVK zrqw&wlhvMF-jA%z|J(y4u0O26RMZf1a1SuXWHt^TcH?2ab22u!cLu?2BHD zTJ&FXCK`JS1(C+>)+3M`wwL0(U&Q@>!mL6?A zWijeIk}(`p83$%8kBlE}C48IlV~U+~)zjt9S>qWiB*jAjF$99(sW%&zg*j5g1&Br7 zq9wr&M@JHwAi))AD-`l%4-PX=&`}(~rgw-c_g*xEiL8k=N^OR6rY;c}l zEavvkWh|af#pM0?(VQF}Y&5?;5J8HG5(1ea@>8CsM?p5B2xb6}SEvWRS=B*;ImGd; z3H<8|f1<5ENPL2i&Ioi*?8w{GD|1KRCU-h#%ktx6?@^Br{L;_W<)O2?>+ieFr#^SK zc+tICS-kwq=l=G$|Grj|;Ln{W#cZ59xLM@o&BO2OJU_A_i|vQ2(D|!uKHb=gMusAL za!POc-JhDpCY0Yv+$@RoqZ`Jr@_U^fMA7;K=F>7&eeiZSU!0hI_%=P=%^9CqgY~Pe zjhp5XQ;H*TLyuALv5)Ed+e1YxC^P^kbJ|A9P+rJEmtvG03%zZeDL;)28 zieT`F9*Sg1NDWLAGo!bO@PZ03_O)ZxY12SpF^mr%;nf*3t{Adb=cs%5UY4wS%c?1( z9$VJel=(lSC~?+NG?66H(FdpK^uqo4ZD;akacy;DqT95gh2R| z3LH>ST!ma!h3L&nd%{HkK@2RMkbQu*0S^v>5d6el12w$?HM(2V8(<6y%*<-cU|Q)7 zzyg3A!~q15$x->~2kKbZ9UH;Ll?muPAWxr-Jak(_%hvCyplDi>f`g$+-);q$2oF|L zXW|dttWZJspuGw&&nAIWZ7W}jfxo2pc5QrfX>WFNaeR6Bc6W04?g73XH^2AR5DHlk zu(PtF{lltC%djnb&&S4XlMLt+8+#$-1P^xuq5MZx6-iS^Q*WrHpBNn*KiW8&-JBYU zrEjf^hvUtOizX(J44&x;U_wtCJb|>ox5l$qyE`#IbAP_Fw6Na2I&1zhJ4RSkEEOp^ zMY%G2Xd#e9!0wuF!V0581P+V?4vPc?V5a|%${)i^?^XS%-0*$JL>%e^QDK5KeZk88 zySZ{kGEN!Fm-Us 
zA534&oEiGnU9RsePi*aO40S%h&L8w5X>94KY}(_Ul9mV&WD2!-yDTw-6~jaU8?jw@ljjQ#~@P_oQ&zGSw*76r9+*5L(=(0X~Xgq&a zxhR?|BLP-tu7L}ii35g_CB9kmSg|N6I1+(_d-hxW$r4}=7Rk(FTJ^9ggcaidm_&Qg zfQ=_X7%(8Gr+A-~6`~*rrZ6K&@QagZ^+}hyry@5C#!yX6$gDh-79#tt-Hnk=K6E0j z7W0o+745IIs?d?@2ik!dGkkEjWpj6V`|?IN{CWfq_EtwaJhQRVeA#n*re=l!R%xDm z8`*=Z%pqYhOA-ulV{xmLxFyIjg+H;%eYV|$Pu^Yqifg$2a;F9%>I>qEnqoXqaw|gMMhoq84|LR|Rpe5IF4za4cNCN~Cm=QNt#BYkVF+{p3J|9^ zqc>*}Ffniw2@KYL>5_1E1x5`O%1X9`I7yQA|H<|`$71Q`9@o07XWh#m@faqjryjno zuhcje=nR_ZDyW&zCyjz1+K;G(+2d6O*lTHxxz`WKSXwX*t~7IFH4(ntdE1)EQXZm&*v4raC|7h2on+smEBv*YHc zDWBh{fWTJL$T8d~;j8zme2B*cyxAT!>nm^t8wsf<%a=&~tw=RQ zk}wBwdlOcK4AN%c;q2!Br;fkzpDo)DqwmyJP`T&3cM)(tlW^A{5l|h(Tdm*h(!c`; zxS*gWnX*QD@>7D6KyktVlQ1-_HWer;BzmZd=G*P>%pysEjBKO|(Jx9C6z~8_6V$M* z9t8v*Y5YHRUTfmynjKX&g?$AxNQfjgU<&Iuv_k*@2$Ck5z(S~)#{dzs$V=7|yXh)-Y%!w|w$iKh*T83hO<0Le@> z-Zug7#e@SgjDlFS>Mp1mAn5;6X+A~%z4t4AwfuY5cPoksDC7&=jGgn2(9q&(d*tfy z7jX3Q?yz-ne6SNI{by_unS_8`K+piVj}O5KfQ;w@g2W%Yia%6~v&Rcw zOIe8OQwo@p1b{Y>wA5!(K7h19aYtiDMd(!)Fe-`^jvnkt_IA<&g%XqroW#^q2GJG* zG|QwfZTX9~)aQJw+5*U)jKO3eB#ipWDtSXyV7=YczNI}Ag#e^P4xXbtTEyGs^-Yc2 zwQue3Yd&=_-~6RKXsX3biCxHCJ)SCxh!zwxLI*)Wd$Sw}CKXpk=b%6m<9-|Npa`P4 z2LeKxRl5)&Ljt_1+aEXY_(1wlwy6)n==ZY0P0)c~EvQ87YA%(6mhMkh4?pHAHU z#NK%|H&O&~{Hsv#il;hFc?{jijyEAB>O)b!x(^>!rrf~83$!t zB7tN5+;1TvC+%W5>DAoCGNX*!j7{h6>g~#C^gR?R#Vl12MI5OKijYdbpZAACvve`X zhC~H2$v;mJadc_DM=`_d1A9Ly(TM9kcLGoW%}%JSjm7l;ybg?F0uq*tqS+i3f);B}umzEnN zVrR1Xs4CXre;AD5#MNSOVgQP!U$ZM3g)3qb)!QB%esHHiuVx)!n}G z#E(#{`XIB^x4PyOAtde?$;tJ*v2oo#U)We!Q05`Li^I!)M-nNpO=qXfEClM3Moz%J zyjGG}9Ec;8Q5-8^nrlX;ea`-IgBebK_3NRgFZ^!(+UhGzNPxfzx+wcrTwL7PfX(&C zLtMV!ncEtDc}dH;H3Yjc14Z02a|ba8xsr|-E&f&y+CpjqXb`E@$@V+M)~ zXGzM|Rbv>sRx#uYVNXpru zj`$3>VC^ElyXoVX2a}D9i&eSYm(xYT4G=nb2o5TI=$;$59*)|_+-L5Qu zdm7uj7=&)_R+Nnc6^S{snQSc>0H*E~3e;1wtq?nkM7Wp%Pz7y6HE@)#o(_YGcI6wt z7u#({MlwPG2w~^*;|>l_8soPoGqWpjb>wGwokLrF4>zP23-AyNVx}gx6MHm=Mchgv z1)FmqK_}O)$&3s%zQ`UNvMb-YlUvZ;4`_C+SM5l2>0)eo<6&>}d1-O`=j9XjQQern 
z*ur#V}5_U?*VnlB4~AhDocle9;O5ww~vt>ANmnW70&R| zc}0|_0f`WQ9yXVr?@oJmk+Sm4-gB8=%cUAKOLj3teVQ+~wrOefY-fJ`_T}JiVRY{2 z{!dA{P|12O#jyum5)>kE=(PHS2$dQ06QMcU<5I={1W0UCjD_ZI=Gr-s&rh@04<~cm z_lZ}tvoG`CrdKAt|A8cKE0pOy08+iE6Pb4E$U=~sB7Gu9cICct*CQw^PwX|9mh~OB zFqlPzJLYnEepnx5Xsj{4v9>ifvp;)tHQt9U2Aa785B- zY)sLlng%5W0KhFNls=>c16c|(7%H*+)5?D5G*O3st?X~v%efyb`%&w$P$fg@CYd>V zI6X7*yxUmRyW@xL@ds}B?LpsTfyC_!4%MNckTRH?BF2YcA=Hrz$ss1__KDPLqgVC( zmFM+brL;|s0yD$ZyJ#Hob2jed?CN0dVqy4tb#LN!65_p!`VaRSr$_IsQig4Dl&IkN zG5JR{Df#z5-m2Ju%im3++NIUpJt=sgi2XNLC?)IT_+hN-+PMusE!~~*{mRn(^u(`N z$Mf1D9Q6N8Vp3`6Xh}-tzW@6x3RAptw9wynwJLg(Xs)hbe%8PBv+7%ix{4$ucYfLb zZfEUcbZYVZVtjYv@@jo{>}j(92L!T1+ie}V2ld}7rSq1dK;Y<~D(^fsJl7*R?S&MnzrS?qr# zQrjt&F;$f6{gLo2%pTcsl~jhfBFbk<_pXl(_e!p+Pwc&3B}x6!$_Rn2i(7=Uurv2` zwz{}7G_rB@5)sSueAEAMzM)s)1EE(&!mMolI(vO^^eViTiKNh!ZW_B^ZpGNn37?+c zEUu4$2ZAM5%+&)s)GJd_6T?q5hJ9*!sYh~Ed0_9=irW7Wg_dZqQPZq&i zFxgxXt&2Jct>$cepxt-MC$Te7;`U8H;`Ce<+Oqd$kHF>O!5$l@hV0ex{or8~2wNdp*3A zRefgfHA~x#nZX2g>O!+F8}su!Pkv%|CpRb0I)6A!>-}HSDrE1IzLHc>|IailEI>R^ zzclOLX_lDPL7~V>Y+Zx{k4Mh$j~lm#yU#b*b8F|*%#U|7{m+C;(ca{*>U&jNbWv#$ zq8ON;NwDUR?)ENwrB~HQ_Fk^M-2_%qW$vZ{$Ljvv{>tOhx~@-6?rk6Z?8DRdlz<-f)?sa`ZEe9R!|R_&Gq-N)z(xZV!v8mSh!|>%_>ut~)jRJEHQR6=x?Ww0 z=UY5|+?zOE`Nn;(Ex=jKl-PyLwYjj>phuJZ)w8%VItK-kw9*;JLV|&olmbCqt(kBL z6^wrMx(02MKR;jj^4a?Q)Y|6l#y!Z`*vfpP2VP?LBRZ-6I$*G(W=cmW*Z;Ho9%W`I z%<7xQZp`Jw z(h8i9Hg0E@uP;yc?}j+f?=Qb^F;$aXLs)||Aj7zB1h^p?A&8|UCRx|hH`}RMb*?55 z6B#m+fjVmI0i8h3VPV3qni!@7J%97nuPgT5Wrw??%geKGjg}sYwaH?LkQ;Jf=UI%$ z_%<_ox_WZGdAl>WNQ;fpt-c=$lffmFRgow{5SzMJk`x8b^Z;@e`7jR!3TYsET~SQ+ z7uCz2CT?kh{~Q0ho$YFEYTP&z-LP}h-M>G7kjb_Et*bqHKHnaOskxKBZ~vCTMT12E zMMBM+Oc%0f5F}A3s!<=Cu5gMY3|4X1KepWZTe}(r?$UW&n2 zcW>t4_;xBzX4kH+<6AV@?JN8H@2wU+r62=4hf`4ME3;M2q8el-Ovw>>>stN zn)QN_5l#w00wz_PH7E%Nf~1&?p4B)87^P^oE>HT)!@fD2UKqBsH9eUfI=$UH-__ak zKhNj?k8{)itvNOnR5b`d0YTxW_Sj+tm_#*_~5|g;2u6k(I)P75vp=T0}Vv0Gt){?N`%+j1edv zh5@3lma_s121F8);Hqz}$5C99Q%PMY*_z%C(gQWbjT7EpPPAmECz~k)+3UBX5||_+ 
zKth3UujT~^QaQpviTw53L5U0waYt}cwhq=q6Lvx&Atr1xJM$z=4@Y+B=3Bxc$WI%|-cyl=`o&h3=Dv9*UPKKahu!cmsfdkW<%W-k@bf*XvCw{dY6%jEM z@T6cMd(%$5a5w{FvRekdT8@jWK#&SBhdR8;EKvzDg9HIXk5|h{302_)U=al7HzC7< z5oUlWst80)>tTkan~8cvy#00sFx^Phl_=o#a#+$45z*BU(%)Wh2nvuYDnbZAC=mYG z{P%8}#ER-9VrE@MVBnpGg^<=y0!nla(k{kgU+hfX^ZLl-cQjBd#i^_n*-oN zk)e}l5L(l$5KSp4#vY0m$Q^r7-A@WB#1X6+fhimc@aWHvPYdTm`m%R)bfahMeZTW_F{Z?1mrBe>-3oJvGN_qg^!e5=l#Z_Wwe^2( z{a;)E*Vf-Q-mg}FFD6vPA++~?%-WvTBse=17OfNYRJ z(iAzBVkj6P8R7cxTDct`AGd5a5a9%@p-xua^Rq~hhM)(fIaJzmrm!o5{Hu`ZAm#(w z7&d~{8a8U80h(*)BMf#Xww7jpem)srJlo$LULSj2yyrf*>|iJ!fa)TSApF|0&5IkT zYo;;!2Uxb1aCA%L00|DQ+O`)_PiI6|&xjWO3>z7e>X&iv@@}Q>THnF#b)zwKx_5VQ+6PAN!NoX|*ongJm5sYVaaFfS zW>?4#Dz>|?Sn!g}X~+glrq#I>)t`8a*2=12mT#H)7NRg8)ppdm-1;THy7 zVITl?7?2==I(IGyjoIf%q?K=D7uO5BOUn}rql>aK5EKM>5rF^?;82g(6tESK@CZ`Y zNdF)Xl$eCc(aprP$^l|YfD{4&i43gzMXaitnqMGv2?Dk0HOkD6Vx8+j+&!$X%f;Qw z$m-(W=;7F?&dAwdkib|-$PyzY*xG=Q@KJsdi5Q46fujJw4MH^|%L!y;@Wny5f)KSF zFlqv*YUi`T7bf6o{Csx(;^0Vb7GiX4WpZ*bI8ZGhqONHg*nB`hDjAj_=w#sbUnv4> zaiAbWLa`K-W)O^1lnnI?gzi8Hss%w!STed0gv5)z5oA7Fm(}HETs`OY+v~vq0kg26 zONI!Gwhb8he&p2Q#&Kja3$Y6+C0fF7!K;kQM+G(*Ss!1IK zA_gKtolC(T%x*u=S7S2tRLUr z&pci{9W35Hz~~q~$o}TQY!Qs?5s-=66h~7k;0TGTnMm{@I|QN@5E4!&ah4YK+|X(~ z0|YaF?GT@Dho~4#=QsVjcc!s6`x=TG3SBr7dlzSW_d8G5%clqEM>7+vW2b{rgb-#W zMHK_U7@+N;vX`Ni`lM6KF%irOV)*4rcRXxQz0JU6_8?O4oPa0Y{ds8mD9H}No{9J`5!gSH{lon#0+c0_k`)8Y1dOT-R8Nmi z$_TEXvfiN5u)A8K;jPfPRNOMmz4hr0i{cXKDI&#s?8K?VXMBm}6T+NzHIaeaAi9{)18Mm$}NUW7a%&^EVN-*{h9%m-k?ss{4_2p(|2hJCc7S;v>gK9-y zy-!H`4jg_9! 
z5Sl9T`alpm0Ku_^tH>0Z^eJ7%f5Cy&bD$=muBaVzFf=@Kf6kMu%fnBx{y4sQqd$*7 z4+IBq)Z|f=Pzka~CL{kC4ys{-?rT$1HJ!!3IOviCh58b0B15Mx(r&wZ*H6bU-|p9r zFXmU*50>^D3xtCKLbW8Xo&){}4(c<{^&Cibrer!>$k+LxOAMIWwB%P0?q;y?^7ibG zoDQ9BZ=BqZZ%-d=#m_h(HQMbf@<0nHrBEPIp!JC!RKv!>Nvi05lxi9X4z#{9Rno(Z zkUO7Mx&$w47YCc06S;mdmWO+{>pu@jfp)bXul{rp|CJhWQ8hBsL<2^J>gX(01vJ&K zWAgjy0fv&WRR<(rXXzTV>H-Ivpw!Kk>Yh2b zB_aU6IOup(HAB0eErF1j-CWigo}a;WcwU&k-5=VVmN>aw?Dxqr=xzl`1Hw@&196SD z2xXWDBMN`qd$myMpN6pJA_ct8-YAMHq|)4B(ARfB+3S4WY{ZcISYeo-JVuKY0vQ ztfab6s862`=XN{)*=~6WE%^t$PHulSIu!EXe^simrO=Y_+b>r4hzf_N7&mnN@9z2G zyT|AC)%)A4wf@EV>D9)0Z~Y%w(+hK#m)rBZb91NLc)0uZ-dE3eXV}Cwz3N_zI2B-uy^&h+W49Q%`fh;XS#9tC5X(8|c2Njn5dn2C?dWUk@9RfC>5}`y$YDk1Mbg>!{ z7`U4!iz-z}(*Ocl5kJATy*2K>x*wMuRmBw8nImcVeq6GQFbF0SU~;Z}`iio9Di`Or zs;2ki{Z*r@3cg?Gdc?6dKb3Sd8{_`)rp)!O=dKQA`(1zKd}ix*k=xmP&jV6SD1jZ= zV4!3#2!%OP!v%;%cp|>XrC1PnB#CR1j&L7D$q|t)KS4)xFh7f5Z7sshGxEy$?DJi> zwP~LZ^&$#NOd6SnZ0^-1;#u55BS}F#h~&TOW~x?Jef!_{DO|rU$^n$dna#0vV@tiG z?em)tJF|=X$D5B|+2iIExD+icY?%?}!7tN?z$_#Ik|xQdI2n(k1%l0y5)qSSlpAIu zk-$p+ze}liDJ>|@W}K1XiK1y0pTbbYkwb;qG|2z6E~R2-6WBng`X?L z{qwtQZ=BCHKVFD-Rc1?PK8G+2~&EY15(YLd8spUC4Y; zLyDqWR6&$63|MJm-2u`pu8ht>fh1OiI|i0yKng>5OruJ<5kMW-2^(5&+lKNqrxMptz7H5D{9vKq-Fuc|GDx3!4%P`>=L& zRi^Ezl&RC-kc&|pa$A>|XSZ*5^R#+owdE!h>c9UT)_@~@U9Vgy>t{HDrizWOqP~zQ?Xy7` zs_@ifbL3`u|EAY#?e{JYPR_eGr>*nL!~3(|8@9{qA7adgImX`_!dai4iZT((7^2_K z2Y*O_wVjzxuY0}KzPRk{NaInfUW|fGbQo%ttRS0E1T%og|hB{Nfo&}tL` z14h*BIS&FB>-amC`n-q2$zv?S?`V-k4yC|21yWTxf zq$q?U=m3=AF)=AILvaPRM8XVlkWnN?PM#2-Xz#90DNr9cRh0tZsTRu0fh?IeR1B%Y zks^}e&cA5|rcx15wX?}hszvkKaM)NRc?{k3)&6Gx{QX1sS#D=eb`NIm_PQT;8t+~l zs6_l84!?9SAW+bmNZB%=ejr)buqTIqPWnBQ3>Y@c+KVUwFrr#U_-mr!6jMM0Z)h+) zxAQa?$IsuaZ{Aq0k(CaGH64^|tf)W&iH|+~LMi`+j3@8aOKo{$9cv7#5XoFDf`O zA<~lK$O#_q2EzQR5{|fIT0l0GwHD{*7y7&XcYDjT(OK$VK7ZI-ylf(CnL>SKlQGG1 z^hoaROFTL6Zp6Eb`L7qgK0aNe)=BeGfmsAOSOg)+`6VjCimDJ1I44+6hO(qYn#8Dj z6emDTkWY!SmN(~G@86EIs%O4EX(84s#u8NpDRJ~ECM!qZoUMv3bk1AvcJHp|?aQOD 
z8;{OldBIunFL0Jw;Q)0}5po216=h`-8&Whi(V8C0vieXC$j#KoAg!6B-JO%0yA{6M zy=&hc?A*7O-Zx&lj3N4hwCrCXEzyEw^c>hYlUzQbmcwJ%8xdCZz_({BQ>(?LlQ=QT z=qk!DkJ_hS;{ktlWh?jd?Z5;b-ir&dI_*e8l#U<(|%rkym)$k-oCxs-@NbMZe2W3G~IXusDP)D5Q+{l-Ym|5 zfP|jvlQUivEMus#M3j|-ahi z+uAEzv%A~Ywj(>7JzZ$+oo{uac^H-ANF0#?9AY^*;E@VSc#x(NR|@V%B~b|joRnj7 zci5ni2BMp*M>Qg>zM@JtY#EXOQ-$MKo3ho?4LMQJIQwiC_P<)NtyU}Nk2Yt{T668( zS#IEP#=kIS@j_(b4(_HlXvthKlqpe7iLAkv97ss$Fjey4R~vHHY5|-aRBQ|#`(0;$ zcJ-@uo4Lo+{_5rSyH0n1dFl+=t&pMCsqQRDoZa^qW{~e!PJL^K~ zvsZCRQ_nH7*VD(vt=9b953$zAy{nT`Ub{a!{HNz{Uf$jP!}B*mumGp1DzVals)fl5 z5d$Xxpj%GtelCN*n7=vNTrO~7@#IfZG&h|CFIqL>RJJp5W%yVs~Y^zp!@u)I63N6vToT2z5wsF|~p8UKom7NV19=lTVD65TF=~n;MOF z!L6U`nm?hOAC5Mcca#lkn`VS+$?&8VP?d{b8^3Brn^YO=(>-G1u2i^YgskesC zsQ{jA97L6CCt3grGgzD{5o(6s{?#M3bYuDCxsh8_UvephAnHep&(oLqd)Mf%*?U1rP`? z5-J!(X{j-Y0UZ$ou&0J_?{b&T%jmPsKjQXob(d87Fv=B4UN^wv=+IN3hJ*!u6jWRLnW4x#++0-H)s;* zDANSx*RhRK{ajTf5|e|%aiLMKU83a)Ap=w<_kOHxR{O-YRaVj_l{R>Vf66&AJRoT|9z@3G3<{q5c5`5UrI_0+d# z6@>a!1vNn>ax}e8ZH2e-GZ_V_c#4XW0SgUu7sLt(|2AqLC)dztDgXgwNDN{ztk~d1 z6NSM{z=qID2m&d*^|IVCdQ8=>R|9Kr=C#R?s*=Juxsczk&-8o8#~<@_XXEKBn7coi zZ@kb&2#Y4kuaz$uKt*L?)8u3><5l0#(ZZ6ZU*|Gz`E3%_w|e@j9{l!|HcItc`T`IX zG~vE0@zkJ$&yT_StzHDdv@D?d*pyH>|IV5A#dSH{0X_A>?G> z8sGzFt6ma;!x_P=4OM zy4t+!-t~SGg#7a5^$&mihrD>4&p-Bi53|D(>!YL3Kl=&pKf|i`lMLtE&%XU0Bfubr HqbmgfcqZ0E diff --git a/docs/website/docs/assets/images/iree_architecture.svg b/docs/website/docs/assets/images/iree_architecture.svg index a98ba0d2e404..60effddf6e76 100644 --- a/docs/website/docs/assets/images/iree_architecture.svg +++ b/docs/website/docs/assets/images/iree_architecture.svg @@ -3,19 +3,8 @@ - IREE ModulesStaticLibrarySharedLibraryVMBytecodeCSourceIREE Compiler using MLIRIREE ImportersPyTorchJAXTFTOSATargetConfigurationflowstreamhalVMlinalgHardware Abstraction Layer for Buffer and Execution ManagementCommand BuffersVMVXDevice Placement and Asynchronous SchedulingLLVMSPIR-VIREE RuntimeTarget HardwareDevice ExecutablesLLVMarithInput MLIRARMPTXDevice Code 
GenerationEmit CHost Code GenerationCPUWASMTensor Program Modeling and Compute Workload PartitioningBindingsPythonCRustTFLiteVMHALCUDAHIPCPUVulkanMetalStacked ColumnProfilingFeedbackToolsiree-benchmark-moduleiree-check-moduleWASMROCDLRISC-Vx86iree-run-module~25-150KBPluginsWebGPUCustomPluginPlugins \ No newline at end of file + IREE ModulesStaticLibrarySharedLibraryVMBytecodeCSourceIREE Compiler using MLIRIREE ImportersPyTorchONNXJAXLiteRTTargetConfigurationflowstreamhalVMlinalgHardware Abstraction Layer for Buffer and Execution ManagementCommand BuffersVMVXDevice Placement and Asynchronous SchedulingLLVMSPIR-VIREE RuntimeTarget HardwareDevice ExecutablesLLVMarithInput MLIRARMPTXDevice Code GenerationEmit CHost Code GenerationCPUWASMTensor Program Modeling and Compute Workload PartitioningBindingsPythonCRustTFLiteVMHALCUDAHIPCPUVulkanMetalProfilingFeedbackToolsiree-benchmark-moduleiree-check-moduleWASMROCDLRISC-Vx86iree-run-module~25-150KBPluginsWebGPUCustomTFPluginsPlugin \ No newline at end of file diff --git a/docs/website/docs/assets/images/iree_architecture_dark.svg b/docs/website/docs/assets/images/iree_architecture_dark.svg index 63dfbcf6d06b..8986ef182b03 100644 --- a/docs/website/docs/assets/images/iree_architecture_dark.svg +++ b/docs/website/docs/assets/images/iree_architecture_dark.svg @@ -3,19 +3,8 @@ - IREE ModulesStaticLibrarySharedLibraryVMBytecodeCSourceIREE Compiler using MLIRIREE ImportersPyTorchJAXTFTOSATargetConfigurationflowstreamhalVMlinalgHardware Abstraction Layer for Buffer and Execution ManagementCommand BuffersVMVXDevice Placement and Asynchronous SchedulingLLVMSPIR-VIREE RuntimeTarget HardwareDevice ExecutablesLLVMarithInput MLIRARMPTXDevice Code GenerationEmit CHost Code GenerationCPUWASMTensor Program Modeling and Compute Workload PartitioningBindingsPythonCRustTFLiteVMHALCUDAHIPCPUVulkanMetalStacked 
ColumnProfilingFeedbackToolsiree-benchmark-moduleiree-check-moduleWASMROCDLRISC-Vx86iree-run-module~25-150KBPluginsWebGPUCustomPluginPlugins \ No newline at end of file + IREE ModulesStaticLibrarySharedLibraryVMBytecodeCSourceIREE Compiler using MLIRIREE ImportersPyTorchONNXJAXLiteRTTargetConfigurationflowstreamhalVMlinalgHardware Abstraction Layer for Buffer and Execution ManagementCommand BuffersVMVXDevice Placement and Asynchronous SchedulingLLVMSPIR-VIREE RuntimeTarget HardwareDevice ExecutablesLLVMarithInput MLIRARMPTXDevice Code GenerationEmit CHost Code GenerationCPUWASMTensor Program Modeling and Compute Workload PartitioningBindingsPythonCRustTFLiteVMHALCUDAHIPCPUVulkanMetalProfilingFeedbackToolsiree-benchmark-moduleiree-check-moduleWASMROCDLRISC-Vx86iree-run-module~25-150KBPluginsWebGPUCustomTFPluginsPlugin \ No newline at end of file From f2690e222c3b9b2a15b3a3b00d73ed2dea3d7a05 Mon Sep 17 00:00:00 2001 From: Scott Todd Date: Mon, 16 Dec 2024 13:37:07 -0800 Subject: [PATCH 26/64] [docs] Expand on instructions for installing torch for CPU. (#19493) Context: https://github.com/iree-org/iree-turbine/pull/343#discussion_r1887549654 Screenshot: ![image](https://github.com/user-attachments/assets/7643cb54-4283-4de8-b8f5-c49d34efc2e9) --- .../docs/guides/ml-frameworks/pytorch.md | 65 ++++++++++++------- 1 file changed, 42 insertions(+), 23 deletions(-) diff --git a/docs/website/docs/guides/ml-frameworks/pytorch.md b/docs/website/docs/guides/ml-frameworks/pytorch.md index 20e5d2971ed8..e2dc9075c4a0 100644 --- a/docs/website/docs/guides/ml-frameworks/pytorch.md +++ b/docs/website/docs/guides/ml-frameworks/pytorch.md @@ -62,37 +62,56 @@ graph LR ## :octicons-download-16: Prerequisites -We recommend first installing a recent version of PyTorch for CPU by following -the [official instructions](https://pytorch.org/get-started/locally/). +1. 
First install a recent version of PyTorch by following + the [official instructions](https://pytorch.org/get-started/locally/): -``` shell -python -m pip install \ - --index-url https://download.pytorch.org/whl/test/cpu torch>=2.3.0 -``` + === ":fontawesome-brands-linux: Linux" -Install iree-turbine: + ``` shell + python -m pip install torch --index-url https://download.pytorch.org/whl/test/cpu + ``` -=== ":octicons-package-16: Stable releases" + === ":fontawesome-brands-apple: macOS" - Stable release packages are - [published to PyPI](https://pypi.org/project/iree-turbine/). + ``` shell + python -m pip install torch + ``` - ``` shell - python -m pip install iree-turbine - ``` + === ":fontawesome-brands-windows: Windows" -=== ":octicons-beaker-16: Nightly pre-releases" + ``` shell + python -m pip install torch + ``` - Nightly pre-releases are published on - [GitHub releases](https://github.com/iree-org/iree-turbine/releases/tag/dev-wheels). + !!! tip - ``` shell hl_lines="2-4" - python -m pip install \ - --find-links https://iree.dev/pip-release-links.html \ - --pre \ - --upgrade \ - iree-turbine - ``` + IREE includes its own GPU support, so we recommend the CPU versions of + PyTorch. You can install CUDA or ROCm as you wish, but those packages + can be quite large. + +2. Then install iree-turbine: + + === ":octicons-package-16: Stable releases" + + Stable release packages are + [published to PyPI](https://pypi.org/project/iree-turbine/). + + ``` shell + python -m pip install iree-turbine + ``` + + === ":octicons-beaker-16: Nightly pre-releases" + + Nightly pre-releases are published on + [GitHub releases](https://github.com/iree-org/iree-turbine/releases/tag/dev-wheels). 
+ + ``` shell hl_lines="2-4" + python -m pip install \ + --find-links https://iree.dev/pip-release-links.html \ + --pre \ + --upgrade \ + iree-turbine + ``` ## :octicons-flame-16: Just-in-time (JIT) execution From 362b554894c46021d32749bf01c9c4410f8cbbc4 Mon Sep 17 00:00:00 2001 From: Scott Todd Date: Mon, 16 Dec 2024 16:11:26 -0800 Subject: [PATCH 27/64] [docs] Refresh `status: new` usage across website pages. (#19495) This puts a little icon to the right of each "new" page in the nav: ![image](https://github.com/user-attachments/assets/949fb830-b48d-4348-9a26-d382498af901) Docs: https://squidfunk.github.io/mkdocs-material/reference/?#setting-the-page-status I liked having the "new" status next to the ONNX and PyTorch guides.... but those pages were added multiple months ago, so it's time to move on from that label. --- docs/website/docs/developers/general/versioning-scheme.md | 1 + docs/website/docs/guides/ml-frameworks/onnx.md | 1 - docs/website/docs/guides/ml-frameworks/pytorch.md | 1 - docs/website/docs/guides/parameters.md | 1 - docs/website/docs/reference/tuning.md | 1 + 5 files changed, 2 insertions(+), 3 deletions(-) diff --git a/docs/website/docs/developers/general/versioning-scheme.md b/docs/website/docs/developers/general/versioning-scheme.md index 918db86d0c91..722f31263ce1 100644 --- a/docs/website/docs/developers/general/versioning-scheme.md +++ b/docs/website/docs/developers/general/versioning-scheme.md @@ -1,5 +1,6 @@ --- icon: octicons/versions-16 +status: new --- # Versioning scheme diff --git a/docs/website/docs/guides/ml-frameworks/onnx.md b/docs/website/docs/guides/ml-frameworks/onnx.md index 64dadad5d2ff..d6134c284406 100644 --- a/docs/website/docs/guides/ml-frameworks/onnx.md +++ b/docs/website/docs/guides/ml-frameworks/onnx.md @@ -6,7 +6,6 @@ tags: - Python - PyTorch icon: simple/onnx -status: new --- # ONNX support diff --git a/docs/website/docs/guides/ml-frameworks/pytorch.md b/docs/website/docs/guides/ml-frameworks/pytorch.md index 
e2dc9075c4a0..d19e4a9ef32e 100644 --- a/docs/website/docs/guides/ml-frameworks/pytorch.md +++ b/docs/website/docs/guides/ml-frameworks/pytorch.md @@ -5,7 +5,6 @@ tags: - Python - PyTorch icon: simple/pytorch -status: new --- # PyTorch + IREE = :octicons-heart-16: diff --git a/docs/website/docs/guides/parameters.md b/docs/website/docs/guides/parameters.md index f41a91a3160e..e4b36b173602 100644 --- a/docs/website/docs/guides/parameters.md +++ b/docs/website/docs/guides/parameters.md @@ -1,6 +1,5 @@ --- icon: octicons/file-symlink-file-16 -status: new --- # Parameters diff --git a/docs/website/docs/reference/tuning.md b/docs/website/docs/reference/tuning.md index fe4a7ff78ab8..5dea9b484282 100644 --- a/docs/website/docs/reference/tuning.md +++ b/docs/website/docs/reference/tuning.md @@ -1,5 +1,6 @@ --- icon: octicons/meter-16 +status: new --- # Tuning From 1894af3bd276183a9088532cfdc813f7d1fbe95c Mon Sep 17 00:00:00 2001 From: Stanley Winata <68087699+raikonenfnu@users.noreply.github.com> Date: Tue, 17 Dec 2024 18:30:47 +0700 Subject: [PATCH 28/64] Update LLVM to llvm/llvm-project@dd6f6a0 (#19489) Update LLVM to https://github.com/llvm/llvm-project/commit/3f136f7 (https://github.com/iree-org/iree/pull/19479) Carrying the following reverts - https://github.com/llvm/llvm-project/pull/116470 - https://github.com/llvm/llvm-project/pull/117424 - https://github.com/llvm/llvm-project/pull/119671 - https://github.com/llvm/llvm-project/pull/119970 First two are carry over from previous-previous integrate. It is being fixed in https://github.com/iree-org/iree/pull/19451 . The third one is from the previous integrate.
The last one is a new error being tracked in https://github.com/iree-org/iree/issues/19498 --------- Signed-off-by: Stanley Winata --- compiler/src/iree/compiler/Codegen/Common/GPU/BUILD.bazel | 1 + compiler/src/iree/compiler/Codegen/Common/GPU/CMakeLists.txt | 1 + .../Codegen/Common/GPU/GPUNestedLayoutDistributionPatterns.cpp | 2 +- third_party/llvm-project | 2 +- 4 files changed, 4 insertions(+), 2 deletions(-) diff --git a/compiler/src/iree/compiler/Codegen/Common/GPU/BUILD.bazel b/compiler/src/iree/compiler/Codegen/Common/GPU/BUILD.bazel index 66177778f683..f4f8c414936e 100644 --- a/compiler/src/iree/compiler/Codegen/Common/GPU/BUILD.bazel +++ b/compiler/src/iree/compiler/Codegen/Common/GPU/BUILD.bazel @@ -125,6 +125,7 @@ iree_compiler_cc_library( "@llvm-project//mlir:GPUDialect", "@llvm-project//mlir:GPUTransformOps", "@llvm-project//mlir:GPUTransforms", + "@llvm-project//mlir:GPUUtils", "@llvm-project//mlir:IR", "@llvm-project//mlir:LLVMCommonConversion", "@llvm-project//mlir:LinalgDialect", diff --git a/compiler/src/iree/compiler/Codegen/Common/GPU/CMakeLists.txt b/compiler/src/iree/compiler/Codegen/Common/GPU/CMakeLists.txt index 2f065df2bb52..4c1422e11ba5 100644 --- a/compiler/src/iree/compiler/Codegen/Common/GPU/CMakeLists.txt +++ b/compiler/src/iree/compiler/Codegen/Common/GPU/CMakeLists.txt @@ -103,6 +103,7 @@ iree_cc_library( MLIRGPUDialect MLIRGPUTransformOps MLIRGPUTransforms + MLIRGPUUtils MLIRIR MLIRLLVMCommonConversion MLIRLinalgDialect diff --git a/compiler/src/iree/compiler/Codegen/Common/GPU/GPUNestedLayoutDistributionPatterns.cpp b/compiler/src/iree/compiler/Codegen/Common/GPU/GPUNestedLayoutDistributionPatterns.cpp index e2d773f13b37..0b4d812def17 100644 --- a/compiler/src/iree/compiler/Codegen/Common/GPU/GPUNestedLayoutDistributionPatterns.cpp +++ b/compiler/src/iree/compiler/Codegen/Common/GPU/GPUNestedLayoutDistributionPatterns.cpp @@ -17,7 +17,7 @@ #include "mlir/Dialect/Affine/IR/AffineOps.h" #include "mlir/Dialect/Affine/Utils.h" 
#include "mlir/Dialect/Arith/IR/Arith.h" -#include "mlir/Dialect/GPU/Transforms/Utils.h" +#include "mlir/Dialect/GPU/Utils/GPUUtils.h" #include "mlir/Dialect/Utils/IndexingUtils.h" #include "mlir/Dialect/Vector/IR/VectorOps.h" #include "mlir/IR/AffineExpr.h" diff --git a/third_party/llvm-project b/third_party/llvm-project index 1cfbe1f2e035..ccdbcf948ba2 160000 --- a/third_party/llvm-project +++ b/third_party/llvm-project @@ -1 +1 @@ -Subproject commit 1cfbe1f2e035fce940fef0dd6a0568a05d989d11 +Subproject commit ccdbcf948ba24cfc80860e9a0256eb343f3373da From a5cf548388e0e923189baa41ad56e31febf1d8a5 Mon Sep 17 00:00:00 2001 From: Benoit Jacob Date: Tue, 17 Dec 2024 12:43:35 -0500 Subject: [PATCH 29/64] [NFC] GPU ukernels cleanups (#19503) 1. Rename `UKernelSpec` to `UKernelConfig`. I was grappling for the right word, but now that it's part of `LoweringConfig`, it's clearer. 2. Drop unused `KernelConfig` case for ukernel ops. The lowering to ukernel ops happens after `KernelConfig`. 3. To stringify types, instead of using a stringstream, we can actually just use `llvm::formatv`. 4. Reorganize LLVMGPUSelectUKernels.cpp to make it easier to add logic for other ukernels. 
Signed-off-by: Benoit Jacob --- .../test/config_ukernel_argmax_gfx942.mlir | 14 +-- .../Codegen/Common/GPU/GPULowerToUKernels.cpp | 2 +- .../GPU/test/gpu_lower_to_ukernels.mlir | 2 +- .../Dialect/GPU/IR/GPULoweringConfigUtils.cpp | 4 +- .../Dialect/GPU/IR/GPULoweringConfigUtils.h | 3 +- .../Codegen/Dialect/GPU/IR/IREEGPUAttrs.td | 8 +- .../compiler/Codegen/LLVMGPU/KernelConfig.cpp | 9 +- .../LLVMGPU/Utils/LLVMGPUSelectUKernels.cpp | 112 +++++++++++------- .../LLVMGPU/Utils/LLVMGPUSelectUKernels.h | 2 +- 9 files changed, 90 insertions(+), 66 deletions(-) diff --git a/compiler/plugins/target/ROCM/test/config_ukernel_argmax_gfx942.mlir b/compiler/plugins/target/ROCM/test/config_ukernel_argmax_gfx942.mlir index 4a7da4befadd..9a537875c6ab 100644 --- a/compiler/plugins/target/ROCM/test/config_ukernel_argmax_gfx942.mlir +++ b/compiler/plugins/target/ROCM/test/config_ukernel_argmax_gfx942.mlir @@ -25,7 +25,7 @@ func.func @argmax_2d_f32i64(%arg0 : tensor<1x?xf32>) -> tensor<1xi64> attributes // CHECK: linalg.generic // CHECK-SAME: hal.executable.objects = [ // CEHCK-SAME: #hal.executable.object<{path = "iree_uk_amdgpu_argmax_f32i64.gfx942.bc", data = dense_resource : vector<{{[0-9]+}}xi8>}>] -// CHECK-SAME: #iree_gpu.lowering_config<{{.*}}ukernel = #iree_gpu.ukernel_spec +// CHECK-SAME: #iree_gpu.lowering_config<{{.*}}ukernel = #iree_gpu.ukernel_config // ----- @@ -54,7 +54,7 @@ func.func @argmax_4d_unit_parallel_f32i64(%arg0 : tensor<1x1x1x?xf32>) -> tensor // CHECK: linalg.generic // CHECK-SAME: hal.executable.objects = [ // CEHCK-SAME: #hal.executable.object<{path = "iree_uk_amdgpu_argmax_f32i64.gfx942.bc", data = dense_resource : vector<{{[0-9]+}}xi8>}>] -// CHECK-SAME: #iree_gpu.lowering_config<{{.*}}ukernel = #iree_gpu.ukernel_spec +// CHECK-SAME: #iree_gpu.lowering_config<{{.*}}ukernel = #iree_gpu.ukernel_config // ----- @@ -82,7 +82,7 @@ func.func @argmax_none_ukernel_enabled(%arg0 : tensor<1x?xf32>) -> tensor<1xi64> // CHECK-LABEL: func 
@argmax_none_ukernel_enabled( // CHECK: linalg.generic // CHECK-NOT: hal.executable.objects -// CHECK-NOT: iree_gpu.ukernel_spec +// CHECK-NOT: iree_gpu.ukernel_config // ----- @@ -111,7 +111,7 @@ func.func @argmax_only_argmax_ukernel_enabled(%arg0 : tensor<1x?xf32>) -> tensor // CHECK: linalg.generic // CHECK-SAME: hal.executable.objects = [ // CHECK-SAME: #hal.executable.object<{path = "iree_uk_amdgpu_argmax_f32i64.gfx942.bc", data = dense_resource : vector<{{[0-9]+}}xi8>}>] -// CHECK-SAME: #iree_gpu.lowering_config<{{.*}}ukernel = #iree_gpu.ukernel_spec +// CHECK-SAME: #iree_gpu.lowering_config<{{.*}}ukernel = #iree_gpu.ukernel_config // ----- @@ -140,7 +140,7 @@ func.func @argmax_only_foo_argmax_bar_ukernel_enabled(%arg0 : tensor<1x?xf32>) - // CHECK: linalg.generic // CHECK-SAME: hal.executable.objects = [ // CHECK-SAME: #hal.executable.object<{path = "iree_uk_amdgpu_argmax_f32i64.gfx942.bc", data = dense_resource : vector<{{[0-9]+}}xi8>}>] -// CHECK-SAME: #iree_gpu.lowering_config<{{.*}}ukernel = #iree_gpu.ukernel_spec +// CHECK-SAME: #iree_gpu.lowering_config<{{.*}}ukernel = #iree_gpu.ukernel_config // ----- @@ -168,7 +168,7 @@ func.func @argmax_only_foo_ukernel_enabled(%arg0 : tensor<1x?xf32>) -> tensor<1x // CHECK-LABEL: func @argmax_only_foo_ukernel_enabled( // CHECK: linalg.generic // CHECK-NOT: hal.executable.objects -// CHECK-NOT: iree_gpu.ukernel_spec +// CHECK-NOT: iree_gpu.ukernel_config // ----- @@ -239,4 +239,4 @@ func.func @argmax_2d_f32i64_custom_bitcode(%arg0 : tensor<1x?xf32>) -> tensor<1x // CHECK-SAME: data = dense<[66, 67, -64, -34, 1, 35, 69, 103, -119, -85, -51, -17]> : tensor<12xi8> // CHECK-SAME: }> // CHECK-SAME: ] -// CHECK-SAME: #iree_gpu.lowering_config<{{.*}}ukernel = #iree_gpu.ukernel_spec +// CHECK-SAME: #iree_gpu.lowering_config<{{.*}}ukernel = #iree_gpu.ukernel_config diff --git a/compiler/src/iree/compiler/Codegen/Common/GPU/GPULowerToUKernels.cpp b/compiler/src/iree/compiler/Codegen/Common/GPU/GPULowerToUKernels.cpp index 
796138d55e3f..fd58c29d2654 100644 --- a/compiler/src/iree/compiler/Codegen/Common/GPU/GPULowerToUKernels.cpp +++ b/compiler/src/iree/compiler/Codegen/Common/GPU/GPULowerToUKernels.cpp @@ -43,7 +43,7 @@ matchArgmaxDAGForUKernel(RewriterBase &rewriter, linalg::GenericOp op) { if (!loweringConfig) { return rewriter.notifyMatchFailure(op, "no lowering_config on this op"); } - IREE::GPU::UKernelSpecAttr ukernelAttr = + IREE::GPU::UKernelConfigAttr ukernelAttr = IREE::GPU::getUkernelSpec(loweringConfig); if (!ukernelAttr) { return rewriter.notifyMatchFailure(op, "no ukernel selected for this op"); diff --git a/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_lower_to_ukernels.mlir b/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_lower_to_ukernels.mlir index 6a13468a1d29..7acab19f945a 100644 --- a/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_lower_to_ukernels.mlir +++ b/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_lower_to_ukernels.mlir @@ -1,6 +1,6 @@ // RUN: iree-opt --split-input-file --pass-pipeline="builtin.module(func.func(iree-codegen-gpu-lower-to-ukernels,cse,canonicalize))" %s | FileCheck %s -#config = #iree_gpu.lowering_config<{ukernel = #iree_gpu.ukernel_spec}> +#config = #iree_gpu.lowering_config<{ukernel = #iree_gpu.ukernel_config}> func.func @argmax_f32i64_with_selected_ukernel(%arg0 : tensor<1x?xf32>) -> tensor<1xi64> attributes { hal.executable.target = #hal.executable.target<"rocm", "rocm-hsaco-fb", {ukernels = "all"}> } { diff --git a/compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/GPULoweringConfigUtils.cpp b/compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/GPULoweringConfigUtils.cpp index 8ebfba912442..df85e48a7379 100644 --- a/compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/GPULoweringConfigUtils.cpp +++ b/compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/GPULoweringConfigUtils.cpp @@ -145,9 +145,9 @@ std::optional> getPaddingList(LoweringConfigAttr config) { return getIntegerVector(array); } 
-IREE::GPU::UKernelSpecAttr +IREE::GPU::UKernelConfigAttr getUkernelSpec(IREE::GPU::LoweringConfigAttr config) { - return config.getAttributes().getAs<IREE::GPU::UKernelSpecAttr>("ukernel"); + return config.getAttributes().getAs<IREE::GPU::UKernelConfigAttr>("ukernel"); } } // namespace mlir::iree_compiler::IREE::GPU diff --git a/compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/GPULoweringConfigUtils.h b/compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/GPULoweringConfigUtils.h index 5bebb64a1b05..b6afde5d4dd4 100644 --- a/compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/GPULoweringConfigUtils.h +++ b/compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/GPULoweringConfigUtils.h @@ -59,7 +59,8 @@ void setPromotedOperandList(MLIRContext *context, /// Helper to retrieve list of operand to pad. std::optional> getPaddingList(LoweringConfigAttr config); -IREE::GPU::UKernelSpecAttr getUkernelSpec(IREE::GPU::LoweringConfigAttr config); +IREE::GPU::UKernelConfigAttr +getUkernelSpec(IREE::GPU::LoweringConfigAttr config); } // namespace mlir::iree_compiler::IREE::GPU diff --git a/compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/IREEGPUAttrs.td b/compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/IREEGPUAttrs.td index 0b1e32fdc362..e4b66bffbd89 100644 --- a/compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/IREEGPUAttrs.td +++ b/compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/IREEGPUAttrs.td @@ -521,12 +521,12 @@ def IREEGPU_LaneIdAttr : AttrDef { - let mnemonic = "ukernel_spec"; +def IREEGPU_UKernelConfigAttr : + AttrDef { + let mnemonic = "ukernel_config"; let summary = "An attribute specifying a ukernel that an op can lower to."; let description = [{ An attribute that can be applied to any operation to specify that it has diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/KernelConfig.cpp b/compiler/src/iree/compiler/Codegen/LLVMGPU/KernelConfig.cpp index 1f44bf693a55..fbc0f37a129b 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/KernelConfig.cpp +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/KernelConfig.cpp @@
-2103,14 +2103,11 @@ static LogicalResult setArgmaxUkernelConfig(IREE::GPU::TargetAttr target, mlir::FunctionOpInterface entryPoint, linalg::GenericOp op) { - // Checks if UKernels are enabled. - IREE::GPU::UKernelSpecAttr ukernelSpec = selectUKernelForArgmax(op); - if (!ukernelSpec) { + IREE::GPU::UKernelConfigAttr ukernelConfig = selectUKernel(op); + if (!ukernelConfig) { return failure(); } - if (failed(isArgmaxOp(op))) - return failure(); SmallVector parallelDims; SmallVector reductionDims; op.getParallelDims(parallelDims); @@ -2161,7 +2158,7 @@ setArgmaxUkernelConfig(IREE::GPU::TargetAttr target, b.getI64ArrayAttr(workgroupTileSizes)); attrs.emplace_back(StringAttr::get(context, "reduction"), b.getI64ArrayAttr(reductionTileSizes)); - attrs.emplace_back(StringAttr::get(context, "ukernel"), ukernelSpec); + attrs.emplace_back(StringAttr::get(context, "ukernel"), ukernelConfig); IREE::GPU::setPromotedOperandList(context, attrs, {0, 1}); auto configDict = DictionaryAttr::get(context, attrs); auto loweringConfig = IREE::GPU::LoweringConfigAttr::get(context, configDict); diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/Utils/LLVMGPUSelectUKernels.cpp b/compiler/src/iree/compiler/Codegen/LLVMGPU/Utils/LLVMGPUSelectUKernels.cpp index 1940e8f0b102..2f2861f926cc 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/Utils/LLVMGPUSelectUKernels.cpp +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/Utils/LLVMGPUSelectUKernels.cpp @@ -18,7 +18,49 @@ namespace mlir::iree_compiler { namespace { -constexpr StringLiteral executableObjectsAttrName = "hal.executable.objects"; +// Returns ukernel name and suffix for argmax. Empty name = no ukernel. 
+static std::tuple +getUKernelNameAndSuffixForArgmax(linalg::GenericOp op) { + Value input = op.getDpsInputOperand(0)->get(); + auto inputType = cast(input.getType()); + Value index = op.getDpsInitOperand(1)->get(); + auto indexType = cast(index.getType()); + return {"argmax", llvm::formatv("{}{}", inputType.getElementType(), + indexType.getElementType())}; +} + +// Returns ukernel name and suffix for any op. Empty name = no ukernel. +static std::tuple +getUKernelNameAndSuffix(Operation *op) { + if (auto genericOp = dyn_cast(op)) { + if (succeeded(isArgmaxOp(genericOp))) { + return getUKernelNameAndSuffixForArgmax(genericOp); + } + } + return {}; +} + +// Returns the UKernelConfigAttr for any op. Returns {} if no ukernel. +static IREE::GPU::UKernelConfigAttr getUKernelConfig(Operation *op) { + MLIRContext *context = op->getContext(); + auto [name, suffix] = getUKernelNameAndSuffix(op); + if (name.empty() || suffix.empty()) { + return {}; + } + auto target = IREE::HAL::ExecutableTargetAttr::lookup(op); + if (!hasUkernel(target, name)) { + return {}; + } + if (isROCMBackend(target)) { + auto nameAttr = StringAttr::get( + context, llvm::formatv("iree_uk_amdgpu_{}_{}", name, suffix)); + auto defsAttr = DictionaryAttr::get( + context, {{StringAttr::get(context, "vm.import.module"), + StringAttr::get(context, "rocm")}}); + return IREE::GPU::UKernelConfigAttr::get(context, nameAttr, defsAttr); + } + return {}; +} // Returns a ExecutableObjectAttr carrying the bitcode for the given ukernel. // @@ -77,7 +119,8 @@ getUKernelBitcode(MLIRContext *context, // array attribute. If the parent hal.executable.variant is reached, its objects // attribute is returned. // Adapted from ExecutableTargetAttr::lookup. 
-static ArrayAttr lookUpExecutableObjects(Operation *op) { +static ArrayAttr lookUpExecutableObjects(Operation *op, + StringRef executableObjectsAttrName) { MLIRContext *context = op->getContext(); auto attrId = StringAttr::get(context, executableObjectsAttrName); while (op) { @@ -97,56 +140,39 @@ static ArrayAttr lookUpExecutableObjects(Operation *op) { return {}; } -/// Returns the function name and attributes to use for a ukernel with given -/// `name` and `suffix` on the target described by `targetAttr`. -static IREE::GPU::UKernelSpecAttr -getUKernelSpec(StringRef name, StringRef suffix, MLIRContext *context, - IREE::HAL::ExecutableTargetAttr targetAttr) { - if (isROCMBackend(targetAttr)) { - auto nameAttr = StringAttr::get( - context, llvm::formatv("iree_uk_amdgpu_{}_{}", name, suffix)); - auto defsAttr = DictionaryAttr::get( - context, {{StringAttr::get(context, "vm.import.module"), - StringAttr::get(context, "rocm")}}); - return IREE::GPU::UKernelSpecAttr::get(context, nameAttr, defsAttr); +// Ensures that the op has ukernel bitcode as a hal.executable.object, stored +// as a hal.executable.objects attribute on the op itself, ready to be hoisted +// by the HoistExecutableObjects pass. +// Returns failure if no bitcode was found for the configured ukernel. 
+static LogicalResult +ensureUKernelBitcode(Operation *op, + IREE::GPU::UKernelConfigAttr ukernelConfig) { + constexpr StringLiteral executableObjectsAttrName = "hal.executable.objects"; + auto target = IREE::HAL::ExecutableTargetAttr::lookup(op); + ArrayAttr sourceExecutableObjects = + lookUpExecutableObjects(op, executableObjectsAttrName); + MLIRContext *context = op->getContext(); + IREE::HAL::ExecutableObjectAttr bitcodeObject = getUKernelBitcode( + context, target, sourceExecutableObjects, ukernelConfig.getName()); + if (!bitcodeObject) { + return failure(); } - return {}; + op->setAttr(executableObjectsAttrName, + ArrayAttr::get(context, bitcodeObject)); + return success(); } } // namespace -IREE::GPU::UKernelSpecAttr selectUKernelForArgmax(linalg::GenericOp op) { - if (failed(isArgmaxOp(op))) { - return {}; - } - auto targetAttr = IREE::HAL::ExecutableTargetAttr::lookup(op); - const char ukernelName[] = "argmax"; - if (!hasUkernel(targetAttr, ukernelName)) { - return {}; - } - Value input = op.getDpsInputOperand(0)->get(); - auto inputType = cast(input.getType()); - Value index = op.getDpsInitOperand(1)->get(); - auto indexType = cast(index.getType()); - std::string suffix; - llvm::raw_string_ostream(suffix) - << inputType.getElementType() << indexType.getElementType(); - MLIRContext *context = op->getContext(); - IREE::GPU::UKernelSpecAttr ukernelSpec = - getUKernelSpec(ukernelName, suffix, context, targetAttr); - if (!ukernelSpec) { +IREE::GPU::UKernelConfigAttr selectUKernel(Operation *op) { + IREE::GPU::UKernelConfigAttr ukernelConfig = getUKernelConfig(op); + if (!ukernelConfig) { return {}; } - auto execTarget = IREE::HAL::ExecutableTargetAttr::lookup(op); - ArrayAttr sourceExecutableObjects = lookUpExecutableObjects(op); - IREE::HAL::ExecutableObjectAttr bitcodeObject = getUKernelBitcode( - context, execTarget, sourceExecutableObjects, ukernelSpec.getName()); - if (!bitcodeObject) { + if (failed(ensureUKernelBitcode(op, ukernelConfig))) { return {}; } 
- op->setAttr(executableObjectsAttrName, - ArrayAttr::get(context, bitcodeObject)); - return ukernelSpec; + return ukernelConfig; } } // namespace mlir::iree_compiler diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/Utils/LLVMGPUSelectUKernels.h b/compiler/src/iree/compiler/Codegen/LLVMGPU/Utils/LLVMGPUSelectUKernels.h index 4ed251b36070..cb7fa2abac61 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/Utils/LLVMGPUSelectUKernels.h +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/Utils/LLVMGPUSelectUKernels.h @@ -10,6 +10,6 @@ namespace mlir::iree_compiler { -IREE::GPU::UKernelSpecAttr selectUKernelForArgmax(linalg::GenericOp op); +IREE::GPU::UKernelConfigAttr selectUKernel(Operation *op); } // namespace mlir::iree_compiler From a31da1f7ab2c81d3fd6eb74f3950d73b56607852 Mon Sep 17 00:00:00 2001 From: Ben Vanik Date: Tue, 17 Dec 2024 09:51:02 -0800 Subject: [PATCH 30/64] Fixing missing trace zone end in iree_io_scope_map. --- runtime/src/iree/io/scope_map.c | 1 + 1 file changed, 1 insertion(+) diff --git a/runtime/src/iree/io/scope_map.c b/runtime/src/iree/io/scope_map.c index 6b22455bccd1..c76cea31ac56 100644 --- a/runtime/src/iree/io/scope_map.c +++ b/runtime/src/iree/io/scope_map.c @@ -40,6 +40,7 @@ IREE_API_EXPORT iree_status_t iree_io_scope_map_lookup( if (iree_string_view_equal(scope, entry->scope)) { IREE_TRACE_ZONE_APPEND_TEXT(z0, "hit"); *out_index = entry->index; + IREE_TRACE_ZONE_END(z0); return iree_ok_status(); } } From 72d98bcafaf91b6bd541480868a4001eabf2c6f4 Mon Sep 17 00:00:00 2001 From: Benoit Jacob Date: Tue, 17 Dec 2024 15:31:52 -0500 Subject: [PATCH 31/64] GPU ukernel lowering config for data-tiled multi_mma, and a simple ukernel. (#19504) This PR adds the KernelConfig logic to generate a lowering_config selecting a ukernel for multi_mma. In order to be able to test it, this PR also adds a very simple `multi_mma` ukernel, but it isn't actually exercised yet, other than successfully compiling to bitcode. 
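
For illustration, the naming scheme that the selection logic follows can be sketched in standalone C++. This is not the IREE implementation: `DataTiledMmaParams`, `multiMmaSuffix`, `ukernelFunctionName` and `ukernelBitcodeFileName` are hypothetical names, and the `<function>.<gpu_arch>.bc` file pattern is an assumption inferred from the bitcode file names that appear in this PR's build rules and tests.

```cpp
#include <cctype>
#include <string>

// Hypothetical stand-in for the fields of the data-tiled MMA layout
// attribute that influence the ukernel name. All factors default to 1.
struct DataTiledMmaParams {
  std::string intrinsic;  // e.g. "MFMA_I32_16x16x32_I8"
  int unrollM = 1, unrollN = 1, unrollK = 1;
  int subgroupsM = 1, subgroupsN = 1;
};

// Builds the multi_mma suffix: the lowercased intrinsic name, followed by
// an unroll tag and a subgroups tag, each emitted only when at least one
// of its factors differs from 1.
std::string multiMmaSuffix(const DataTiledMmaParams &p) {
  std::string suffix;
  for (unsigned char c : p.intrinsic) {
    suffix += static_cast<char>(std::tolower(c));
  }
  if (p.unrollM != 1 || p.unrollN != 1 || p.unrollK != 1) {
    suffix += "_unroll" + std::to_string(p.unrollM) + "x" +
              std::to_string(p.unrollN) + "x" + std::to_string(p.unrollK);
  }
  if (p.subgroupsM != 1 || p.subgroupsN != 1) {
    suffix += "_subgroups" + std::to_string(p.subgroupsM) + "x" +
              std::to_string(p.subgroupsN);
  }
  return suffix;
}

// Mirrors the "iree_uk_amdgpu_{name}_{suffix}" pattern used on ROCm targets.
std::string ukernelFunctionName(const std::string &name,
                                const std::string &suffix) {
  return "iree_uk_amdgpu_" + name + "_" + suffix;
}

// Assumed bitcode file pattern: "<function>.<gpu_arch>.bc".
std::string ukernelBitcodeFileName(const std::string &functionName,
                                   const std::string &gpuArch) {
  return functionName + "." + gpuArch + ".bc";
}
```

With `unroll_m = 8, unroll_n = 2, unroll_k = 2, subgroups_n = 4` on `MFMA_I32_16x16x32_I8` for `gfx942`, this sketch produces the bitcode file name that the new lit test checks for.
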
The compiler logic only cares about the existence of the resulting bitcode file. The actual lowering to ukernel op will come in the next PR. --------- Signed-off-by: Benoit Jacob --- .../target/ROCM/builtins/ukernel/BUILD.bazel | 16 +++++- .../ROCM/builtins/ukernel/CMakeLists.txt | 13 +++++ .../target/ROCM/builtins/ukernel/common.h | 7 +++ ...i32_16x16x32_i8_unroll8x2x2_subgroups1x4.c | 53 +++++++++++++++++++ compiler/plugins/target/ROCM/test/BUILD.bazel | 1 + .../plugins/target/ROCM/test/CMakeLists.txt | 1 + .../test/config_ukernel_multi_mma_gfx942.mlir | 29 ++++++++++ .../Dialect/GPU/TargetUtils/ConfigUtils.cpp | 21 +++++--- .../Dialect/GPU/TargetUtils/ConfigUtils.h | 7 ++- .../compiler/Codegen/LLVMGPU/KernelConfig.cpp | 33 +++--------- .../Codegen/LLVMGPU/ROCDLKernelConfig.cpp | 4 +- .../LLVMGPU/Utils/LLVMGPUSelectUKernels.cpp | 35 ++++++++++-- 12 files changed, 177 insertions(+), 43 deletions(-) create mode 100644 compiler/plugins/target/ROCM/builtins/ukernel/iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8_unroll8x2x2_subgroups1x4.c create mode 100644 compiler/plugins/target/ROCM/test/config_ukernel_multi_mma_gfx942.mlir diff --git a/compiler/plugins/target/ROCM/builtins/ukernel/BUILD.bazel b/compiler/plugins/target/ROCM/builtins/ukernel/BUILD.bazel index aff7b8965b32..840d45fc27cb 100644 --- a/compiler/plugins/target/ROCM/builtins/ukernel/BUILD.bazel +++ b/compiler/plugins/target/ROCM/builtins/ukernel/BUILD.bazel @@ -46,8 +46,8 @@ argmax_types = [ [iree_amdgpu_bitcode_library( name = "iree_uk_amdgpu_argmax_%s_%s" % (type, gpu_arch), srcs = [ - "iree_uk_amdgpu_argmax_%s.c" % type, "common.h", + "iree_uk_amdgpu_argmax_%s.c" % type, ], out = "iree_uk_amdgpu_argmax_%s.%s.bc" % (type, gpu_arch), gpu_arch = gpu_arch, @@ -59,9 +59,21 @@ argmax_bc_files = [ for gpu_arch in gpu_archs ] +iree_amdgpu_bitcode_library( + name = "iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8_unroll8x2x2_subgroups1x4_gfx942", + srcs = [ + "common.h", + 
"iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8_unroll8x2x2_subgroups1x4.c", + ], + out = "iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8_unroll8x2x2_subgroups1x4.gfx942.bc", + gpu_arch = "gfx942", +) + iree_c_embed_data( name = "iree_uk_amdgpu_bitcode", - srcs = argmax_bc_files, + srcs = argmax_bc_files + [ + "iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8_unroll8x2x2_subgroups1x4.gfx942.bc", + ], c_file_output = "iree_uk_amdgpu_bitcode.c", flatten = True, h_file_output = "iree_uk_amdgpu_bitcode.h", diff --git a/compiler/plugins/target/ROCM/builtins/ukernel/CMakeLists.txt b/compiler/plugins/target/ROCM/builtins/ukernel/CMakeLists.txt index 71d4705eed1a..ad1a19028a5b 100644 --- a/compiler/plugins/target/ROCM/builtins/ukernel/CMakeLists.txt +++ b/compiler/plugins/target/ROCM/builtins/ukernel/CMakeLists.txt @@ -206,6 +206,18 @@ iree_amdgpu_bitcode_library( "iree_uk_amdgpu_argmax_f32i64.gfx1100.bc" ) +iree_amdgpu_bitcode_library( + NAME + iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8_unroll8x2x2_subgroups1x4_gfx942 + GPU_ARCH + gfx942 + SRCS + "common.h" + "iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8_unroll8x2x2_subgroups1x4.c" + OUT + "iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8_unroll8x2x2_subgroups1x4.gfx942.bc" +) + iree_c_embed_data( NAME iree_uk_amdgpu_bitcode @@ -226,6 +238,7 @@ iree_c_embed_data( "iree_uk_amdgpu_argmax_f32i64.gfx1100.bc" "iree_uk_amdgpu_argmax_f32i64.gfx90a.bc" "iree_uk_amdgpu_argmax_f32i64.gfx942.bc" + "iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8_unroll8x2x2_subgroups1x4.gfx942.bc" C_FILE_OUTPUT "iree_uk_amdgpu_bitcode.c" H_FILE_OUTPUT diff --git a/compiler/plugins/target/ROCM/builtins/ukernel/common.h b/compiler/plugins/target/ROCM/builtins/ukernel/common.h index 3ccc92afba28..14b65a253c5d 100644 --- a/compiler/plugins/target/ROCM/builtins/ukernel/common.h +++ b/compiler/plugins/target/ROCM/builtins/ukernel/common.h @@ -57,6 +57,13 @@ typedef __UINT64_TYPE__ uint64_t; #define FLT_MIN __FLT_MIN__ #define FLT_MAX __FLT_MAX__ 
+//===----------------------------------------------------------------------===// +// Vector typedefs +//===----------------------------------------------------------------------===// + +typedef __attribute__((__vector_size__(8 * 2))) int64_t int64x2_t; +typedef __attribute__((__vector_size__(4 * 4))) int32_t int32x4_t; + //===----------------------------------------------------------------------===// // Declarations for Clangd, which may be slightly older than actual clang. // Drop these as clangd versions used in practice gain these builtins. diff --git a/compiler/plugins/target/ROCM/builtins/ukernel/iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8_unroll8x2x2_subgroups1x4.c b/compiler/plugins/target/ROCM/builtins/ukernel/iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8_unroll8x2x2_subgroups1x4.c new file mode 100644 index 000000000000..7d0e2643050e --- /dev/null +++ b/compiler/plugins/target/ROCM/builtins/ukernel/iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8_unroll8x2x2_subgroups1x4.c @@ -0,0 +1,53 @@ +// Copyright 2024 The IREE Authors +// +// Licensed under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception + +#include "compiler/plugins/target/ROCM/builtins/ukernel/common.h" + +// Very naive kernel. TODO(bjacob): +// 1. Shared memory: can't allocate it within the microkernel (which is just a +// helper device function, not the actual amdgpu_kernel). Need to get it +// passed down here as a `T [[clang::address_space(3)]] *` parameter. +// 2. Better scheduling via either barrier intrinsics or inline assembly. +// 3. Subgroups1x4 being asymmetric is a historical accident... should be 2x2.
+[[clang::always_inline]] void +iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8_unroll8x2x2_subgroups1x4( + const int8_t *a_buffer, int64_t a_offset, const int8_t *b_buffer, + int64_t b_offset, int32_t *c_buffer, int64_t c_offset, int64_t k_size) { + int tid = __builtin_amdgcn_workitem_id_x(); + + // Load existing accumulators. + int32x4_t acc[8][2] = {{0}}; + int32x4_t *c_global = (int32x4_t *)(c_buffer + c_offset); + for (int i = 0; i < 8; ++i) { + for (int j = 0; j < 2; ++j) { + acc[i][j] = c_global[256 * (2 * i + j) + tid]; + } + } + + // Arithmetic loop. + const int64x2_t *a_global = + (const int64x2_t *)(a_buffer + a_offset) + (tid % 64); + const int64x2_t *b_global = (const int64x2_t *)(b_buffer + b_offset) + tid; + for (int k_outer = 0; k_outer < k_size; ++k_outer) { + for (int i = 0; i < 8; ++i) { + for (int j = 0; j < 2; ++j) { + for (int k = 0; k < 2; ++k) { + acc[i][j] = __builtin_amdgcn_mfma_i32_16x16x32_i8( + a_global[64 * i][k], b_global[256 * j][k], acc[i][j], 0, 0, 0); + } + } + } + a_global += 512; + b_global += 512; + } + + // Store accumulators. 
+ for (int i = 0; i < 8; ++i) { + for (int j = 0; j < 2; ++j) { + c_global[256 * (2 * i + j) + tid] = acc[i][j]; + } + } +} diff --git a/compiler/plugins/target/ROCM/test/BUILD.bazel b/compiler/plugins/target/ROCM/test/BUILD.bazel index 2a71f590c6e3..7201e4b988e8 100644 --- a/compiler/plugins/target/ROCM/test/BUILD.bazel +++ b/compiler/plugins/target/ROCM/test/BUILD.bazel @@ -17,6 +17,7 @@ iree_lit_test_suite( srcs = [ "config_ukernel_argmax_gfx908.mlir", "config_ukernel_argmax_gfx942.mlir", + "config_ukernel_multi_mma_gfx942.mlir", "default_tuning_specs_amdgpu.mlir", "lowering_strategy_from_tuning_spec.mlir", "ukernel_pipeline_transform.mlir", diff --git a/compiler/plugins/target/ROCM/test/CMakeLists.txt b/compiler/plugins/target/ROCM/test/CMakeLists.txt index bab88582a8b0..06249daa0039 100644 --- a/compiler/plugins/target/ROCM/test/CMakeLists.txt +++ b/compiler/plugins/target/ROCM/test/CMakeLists.txt @@ -16,6 +16,7 @@ iree_lit_test_suite( SRCS "config_ukernel_argmax_gfx908.mlir" "config_ukernel_argmax_gfx942.mlir" + "config_ukernel_multi_mma_gfx942.mlir" "default_tuning_specs_amdgpu.mlir" "lowering_strategy_from_tuning_spec.mlir" "ukernel_pipeline_transform.mlir" diff --git a/compiler/plugins/target/ROCM/test/config_ukernel_multi_mma_gfx942.mlir b/compiler/plugins/target/ROCM/test/config_ukernel_multi_mma_gfx942.mlir new file mode 100644 index 000000000000..646418f80666 --- /dev/null +++ b/compiler/plugins/target/ROCM/test/config_ukernel_multi_mma_gfx942.mlir @@ -0,0 +1,29 @@ +// RUN: iree-opt --split-input-file --iree-gpu-test-target=gfx942 --pass-pipeline='builtin.module(iree-llvmgpu-select-lowering-strategy)' %s | FileCheck %s + +func.func @multi_mma_mfma_i32_16x16x32_i8(%a : tensor<1x2x8x4x16x2x8xi8>, + %b : tensor<1x2x4x2x4x16x2x8xi8>, + %c : tensor<1x1x8x4x2x4x16x4xi32>) + -> tensor<1x1x8x4x2x4x16x4xi32> attributes { + hal.executable.target = #hal.executable.target<"rocm", "rocm-hsaco-fb", {ukernels = "multi_mma"}> +} { + %d = iree_gpu.multi_mma %a, %b, %c 
{indexing_maps = [ + affine_map<(d0, d1, d2) -> (d0, d2)>, + affine_map<(d0, d1, d2) -> (d1, d2)>, + affine_map<(d0, d1, d2) -> (d0, d1)> + ], iterator_types = [ + #iree_gpu.iterator_type, + #iree_gpu.iterator_type, + #iree_gpu.iterator_type + ], kind = #iree_gpu.data_tiled_mma_layout< + intrinsic = MFMA_I32_16x16x32_I8, + unroll_m = 8, unroll_n = 2, subgroups_n = 4, unroll_k = 2 + >} : tensor<1x2x8x4x16x2x8xi8>, tensor<1x2x4x2x4x16x2x8xi8> into tensor<1x1x8x4x2x4x16x4xi32> + return %d : tensor<1x1x8x4x2x4x16x4xi32> +} + +// CHECK-LABEL: @multi_mma_mfma_i32_16x16x32_i8 +// CHECK: iree_gpu.multi_mma +// CHECK-SAME: #hal.executable.object<{path = "iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8_unroll8x2x2_subgroups1x4.gfx942.bc" +// CHECK-NOT: promote_operands +// CHECK-SAME: reduction = [0, 0, 0] +// CHECK-SAME: #iree_gpu.ukernel_config(op); if (!multiMmaOp) { return failure(); @@ -70,7 +69,7 @@ setDataTiledMultiMmaLoweringConfig(IREE::GPU::TargetAttr target, SmallVector reductionTileSizes(iterationRank, 0); for (int64_t kDim : contractionDims.k) { workgroupTileSizes[kDim] = 0; - reductionTileSizes[kDim] = 1; + reductionTileSizes[kDim] = ukernelConfig ? 0 : 1; } // Set tile sizes. @@ -81,8 +80,16 @@ setDataTiledMultiMmaLoweringConfig(IREE::GPU::TargetAttr target, b.getI64ArrayAttr(workgroupTileSizes)); attrs.emplace_back(b.getStringAttr("reduction"), b.getI64ArrayAttr(reductionTileSizes)); - // Promote operands to use shared memory for LHS and RHS. - GPU::setPromotedOperandList(context, attrs, {0, 1}); + if (ukernelConfig) { + attrs.emplace_back(b.getStringAttr("ukernel"), ukernelConfig); + } else { + // Promote operands to use shared memory for LHS and RHS. + // Don't do that with ukernels: their untiled reduction dimension is too + // large to fit in shared memory, so they just want global memory and they + // will take care of moving small chunks at a time into a shared memory + // operand that will be created together with the ukernel op. 
+ GPU::setPromotedOperandList(context, attrs, {0, 1}); + } auto configDict = b.getDictionaryAttr(attrs); auto loweringConfig = IREE::GPU::LoweringConfigAttr::get(context, configDict); diff --git a/compiler/src/iree/compiler/Codegen/Dialect/GPU/TargetUtils/ConfigUtils.h b/compiler/src/iree/compiler/Codegen/Dialect/GPU/TargetUtils/ConfigUtils.h index 0458ea91d6ad..636ffe5f0898 100644 --- a/compiler/src/iree/compiler/Codegen/Dialect/GPU/TargetUtils/ConfigUtils.h +++ b/compiler/src/iree/compiler/Codegen/Dialect/GPU/TargetUtils/ConfigUtils.h @@ -16,10 +16,9 @@ namespace mlir::iree_compiler::IREE::GPU { /// Helper for setting up a data tiled multi_mma config based on the specified /// target. -LogicalResult -setDataTiledMultiMmaLoweringConfig(IREE::GPU::TargetAttr target, - mlir::FunctionOpInterface entryPoint, - Operation *op); +LogicalResult setDataTiledMultiMmaLoweringConfig( + IREE::GPU::TargetAttr target, mlir::FunctionOpInterface entryPoint, + Operation *op, IREE::GPU::UKernelConfigAttr ukernelConfig); /// Helper for setting up a convolution config using IGEMM based on the /// specified target. diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/KernelConfig.cpp b/compiler/src/iree/compiler/Codegen/LLVMGPU/KernelConfig.cpp index fbc0f37a129b..7f752e2c559f 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/KernelConfig.cpp +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/KernelConfig.cpp @@ -2099,15 +2099,9 @@ static LogicalResult setTransposeConfig(mlir::FunctionOpInterface entryPoint, /// Set the configuration for argmax when ukernels are enabled. /// Distribute all parallel dim across different workgroups, and only use single /// subgroup per workgroup. 
-static LogicalResult -setArgmaxUkernelConfig(IREE::GPU::TargetAttr target, - mlir::FunctionOpInterface entryPoint, - linalg::GenericOp op) { - IREE::GPU::UKernelConfigAttr ukernelConfig = selectUKernel(op); - if (!ukernelConfig) { - return failure(); - } - +static LogicalResult setArgmaxUkernelConfig( + IREE::GPU::TargetAttr target, mlir::FunctionOpInterface entryPoint, + linalg::GenericOp op, IREE::GPU::UKernelConfigAttr ukernelConfig) { SmallVector parallelDims; SmallVector reductionDims; op.getParallelDims(parallelDims); @@ -2170,15 +2164,6 @@ setArgmaxUkernelConfig(IREE::GPU::TargetAttr target, return success(); } -/// Make UKernels take the LLVMGPUDefault lowering pipeline. -static LogicalResult -setUKernelConfig(mlir::FunctionOpInterface entryPoint, - IREE::Codegen::UKernelOpInterface ukernelOp) { - auto translationInfo = IREE::Codegen::TranslationInfoAttr::get( - entryPoint->getContext(), CodeGenPipeline::LLVMGPUDefault); - return setTranslationInfo(entryPoint, translationInfo); -} - /// Decides the tiling and distribution parameters for one convolution /// dimension. Returns true if we can succesfully deduce. 
/// @@ -2358,13 +2343,14 @@ static LogicalResult setConvolutionConfig( static LogicalResult setRootConfig(IREE::GPU::TargetAttr target, mlir::FunctionOpInterface entryPointFn, Operation *computeOp) { + IREE::GPU::UKernelConfigAttr ukernelConfig = selectUKernel(computeOp); LLVM_DEBUG({ DBGS() << "Selecting root config for: "; computeOp->print(llvm::dbgs(), OpPrintingFlags().skipRegions()); llvm::dbgs() << "\n"; }); if (succeeded(setDataTiledMultiMmaLoweringConfig(target, entryPointFn, - computeOp))) { + computeOp, ukernelConfig))) { LDBG("Tile and fuse data tiled multi_mma config"); return success(); } @@ -2410,8 +2396,9 @@ static LogicalResult setRootConfig(IREE::GPU::TargetAttr target, if (genericOp && succeeded(setTransposeConfig(entryPointFn, genericOp))) { LDBG("Transpose Config"); return success(); - } else if (genericOp && succeeded(setArgmaxUkernelConfig( - target, entryPointFn, genericOp))) { + } else if (genericOp && ukernelConfig && + succeeded(setArgmaxUkernelConfig(target, entryPointFn, genericOp, + ukernelConfig))) { LDBG("Argmax Ukernel Config"); return success(); } @@ -2435,10 +2422,6 @@ static LogicalResult setRootConfig(IREE::GPU::TargetAttr target, LDBG("Pack Config"); return setPackConfig(target, entryPointFn, packOp); }) - .Case([&](auto ukernelOp) { - LDBG("Ukernel Config"); - return setUKernelConfig(entryPointFn, ukernelOp); - }) .Case([&](auto customOp) { LDBG("CustomOp Config"); return setDefaultCustomOpLoweringConfig(entryPointFn, customOp, diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/ROCDLKernelConfig.cpp b/compiler/src/iree/compiler/Codegen/LLVMGPU/ROCDLKernelConfig.cpp index 26245cba1b25..c52cd07a1cd1 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/ROCDLKernelConfig.cpp +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/ROCDLKernelConfig.cpp @@ -6,6 +6,7 @@ #include "iree/compiler/Codegen/LLVMGPU/ROCDLKernelConfig.h" +#include "compiler/src/iree/compiler/Codegen/LLVMGPU/Utils/LLVMGPUSelectUKernels.h" #include 
"iree/compiler/Codegen/Dialect/Codegen/IR/IREECodegenAttrs.h" #include "iree/compiler/Codegen/Dialect/GPU/IR/IREEGPUAttrs.h" #include "iree/compiler/Codegen/Dialect/GPU/TargetUtils/ConfigUtils.h" @@ -272,8 +273,9 @@ setWarpReductionConfig(IREE::GPU::TargetAttr target, static LogicalResult setRootConfig(IREE::GPU::TargetAttr target, mlir::FunctionOpInterface entryPointFn, Operation *computeOp) { + IREE::GPU::UKernelConfigAttr ukernelConfig = selectUKernel(computeOp); if (succeeded(setDataTiledMultiMmaLoweringConfig(target, entryPointFn, - computeOp))) { + computeOp, ukernelConfig))) { return success(); } if (auto linalgOp = dyn_cast(computeOp)) { diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/Utils/LLVMGPUSelectUKernels.cpp b/compiler/src/iree/compiler/Codegen/LLVMGPU/Utils/LLVMGPUSelectUKernels.cpp index 2f2861f926cc..453669db7426 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/Utils/LLVMGPUSelectUKernels.cpp +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/Utils/LLVMGPUSelectUKernels.cpp @@ -5,6 +5,7 @@ // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception #include "iree/compiler/Codegen/LLVMGPU/Utils/LLVMGPUSelectUKernels.h" +#include "iree/compiler/Codegen/Dialect/GPU/IR/IREEGPUOps.h" #include "iree/compiler/Codegen/Utils/GPUUtils.h" #include "iree/compiler/Codegen/Utils/Utils.h" #include "iree/compiler/Utils/EmbeddedDataDirectory.h" @@ -18,8 +19,13 @@ namespace mlir::iree_compiler { namespace { +struct UKernelNameAndSuffix { + std::string name; + std::string suffix; +}; + // Returns ukernel name and suffix for argmax. Empty name = no ukernel. -static std::tuple +static UKernelNameAndSuffix getUKernelNameAndSuffixForArgmax(linalg::GenericOp op) { Value input = op.getDpsInputOperand(0)->get(); auto inputType = cast(input.getType()); @@ -29,13 +35,34 @@ getUKernelNameAndSuffixForArgmax(linalg::GenericOp op) { indexType.getElementType())}; } +// Returns ukernel name and suffix for multi_mma. Empty name = no ukernel. 
+static UKernelNameAndSuffix +getUKernelNameAndSuffixForMultiMma(IREE::GPU::MultiMmaOp op) { + auto mma = dyn_cast(op.getKind()); + if (!mma) { + return {}; // Only handling DataTiledMMAAttr for now. + } + std::string suffix{ + stringifyMMAIntrinsic(mma.getIntrinsic().getValue()).lower()}; + if (mma.getUnrollM() != 1 || mma.getUnrollN() != 1 || mma.getUnrollK() != 1) { + suffix += llvm::formatv("_unroll{}x{}x{}", mma.getUnrollM(), + mma.getUnrollN(), mma.getUnrollK()); + } + if (mma.getSubgroupsM() != 1 || mma.getSubgroupsN() != 1) { + suffix += llvm::formatv("_subgroups{}x{}", mma.getSubgroupsM(), + mma.getSubgroupsN()); + } + return {"multi_mma", suffix}; +} + // Returns ukernel name and suffix for any op. Empty name = no ukernel. -static std::tuple -getUKernelNameAndSuffix(Operation *op) { +static UKernelNameAndSuffix getUKernelNameAndSuffix(Operation *op) { if (auto genericOp = dyn_cast(op)) { if (succeeded(isArgmaxOp(genericOp))) { return getUKernelNameAndSuffixForArgmax(genericOp); } + } else if (auto multiMmaOp = dyn_cast(op)) { + return getUKernelNameAndSuffixForMultiMma(multiMmaOp); } return {}; } @@ -44,7 +71,7 @@ getUKernelNameAndSuffix(Operation *op) { static IREE::GPU::UKernelConfigAttr getUKernelConfig(Operation *op) { MLIRContext *context = op->getContext(); auto [name, suffix] = getUKernelNameAndSuffix(op); - if (name.empty() || suffix.empty()) { + if (name.empty()) { return {}; } auto target = IREE::HAL::ExecutableTargetAttr::lookup(op); From 3509ead1d7b94b873d885f4387d98366fdaac4fd Mon Sep 17 00:00:00 2001 From: MaheshRavishankar <1663364+MaheshRavishankar@users.noreply.github.com> Date: Tue, 17 Dec 2024 14:19:48 -0800 Subject: [PATCH 32/64] Cleanup `ConvertToStream` to accomodate llvm/llvm-project@3f136f7 (#19451) The upstream change https://github.com/llvm/llvm-project/commit/3f136f7 allows `ConvertToStream` to better handle the 1:N type conversion, specifically the type conversion of a `tensor<...>` to `!stream.resource<*>, index`. 
Now instead of trying to work around `builtin.unrealized_conversion_cast`s the conversion can get the converted values directly using the `OneToNAdaptor` and can also replace a `tensor<..>` directly with multiple values using the `ConversionPatternRewriter::replaceOpWithMultiple`. These changes are required to drop the revert of https://github.com/llvm/llvm-project/pull/116470 in the IREE ToM. The change drops these reverts as well. Fixes #19448 --------- Signed-off-by: MaheshRavishankar --- .../Conversion/FlowToStream/Patterns.cpp | 341 ++++++++++-------- .../Conversion/HALToStream/Patterns.cpp | 72 ++-- .../Stream/Conversion/PatternUtils.cpp | 95 ++--- .../Dialect/Stream/Conversion/PatternUtils.h | 45 +-- .../Conversion/StandardToStream/Patterns.cpp | 132 ++++--- .../Conversion/UtilToStream/Patterns.cpp | 68 ++-- .../UtilToStream/test/compiler_hints.mlir | 4 +- .../Stream/Transforms/ConvertToStream.cpp | 18 +- .../Transforms/test/convert_to_stream.mlir | 3 +- third_party/llvm-project | 2 +- 10 files changed, 418 insertions(+), 362 deletions(-) diff --git a/compiler/src/iree/compiler/Dialect/Stream/Conversion/FlowToStream/Patterns.cpp b/compiler/src/iree/compiler/Dialect/Stream/Conversion/FlowToStream/Patterns.cpp index 31d61516e3eb..44c8a4630ea0 100644 --- a/compiler/src/iree/compiler/Dialect/Stream/Conversion/FlowToStream/Patterns.cpp +++ b/compiler/src/iree/compiler/Dialect/Stream/Conversion/FlowToStream/Patterns.cpp @@ -13,6 +13,7 @@ #include "iree/compiler/Dialect/Stream/IR/StreamOps.h" #include "mlir/Dialect/Arith/IR/Arith.h" #include "mlir/Dialect/Tensor/IR/Tensor.h" +#include "mlir/IR/BuiltinDialect.h" #include "mlir/IR/IRMapping.h" #include "mlir/Interfaces/FunctionInterfaces.h" @@ -20,6 +21,14 @@ namespace mlir::iree_compiler { namespace { +static SmallVector flattenValues(ArrayRef values) { + SmallVector vec; + for (auto v : values) { + vec.append(v.begin(), v.end()); + } + return vec; +} + // Inserts a sizeof calculation for the given tensor value type 
and dims. // This should only be used to produce sizes for values produced by an op; the // size of operands must be queried from the input resource. @@ -39,7 +48,7 @@ struct ConvertTensorConstantOp public: using AffinityOpConversionPattern::AffinityOpConversionPattern; LogicalResult matchAndRewriteOnAffinity( - IREE::Flow::TensorConstantOp constantOp, OpAdaptor adaptor, + IREE::Flow::TensorConstantOp constantOp, OneToNOpAdaptor adaptor, IREE::Stream::AffinityAttr executionAffinityAttr, ConversionPatternRewriter &rewriter) const override { // Capture the tensor constant strongly typed with constant lifetime. @@ -55,10 +64,13 @@ struct ConvertTensorConstantOp auto unknownType = rewriter.getType(); auto constantSize = rewriter.createOrFold( constantOp.getLoc(), rewriter.getIndexType(), newOp.getResult()); - rewriter.replaceOpWithNewOp( - constantOp, unknownType, newOp.getResult(), constantSize, constantSize, + auto transferOp = rewriter.create( + constantOp.getLoc(), unknownType, newOp.getResult(), constantSize, + constantSize, /*source_affinity=*/executionAffinityAttr, /*result_affinity=*/executionAffinityAttr); + rewriter.replaceOpWithMultiple(constantOp, + {{transferOp.getResult(), constantSize}}); return success(); } }; @@ -68,7 +80,7 @@ struct ConvertTensorDynamicConstantOp public: using AffinityOpConversionPattern::AffinityOpConversionPattern; LogicalResult matchAndRewriteOnAffinity( - IREE::Flow::TensorDynamicConstantOp constantOp, OpAdaptor adaptor, + IREE::Flow::TensorDynamicConstantOp constantOp, OneToNOpAdaptor adaptor, IREE::Stream::AffinityAttr executionAffinityAttr, ConversionPatternRewriter &rewriter) const override { auto attrType = dyn_cast(constantOp.getValue().getType()); @@ -103,10 +115,12 @@ struct ConvertTensorDynamicConstantOp auto unknownType = rewriter.getType(); auto constantSize = rewriter.createOrFold( constantOp.getLoc(), rewriter.getIndexType(), newOp.getResult()); - rewriter.replaceOpWithNewOp( - constantOp, unknownType, 
newOp.getResult(), constantSize, constantSize, + auto transferOp = rewriter.create( + constantOp.getLoc(), unknownType, newOp.getResult(), constantSize, + constantSize, /*source_affinity=*/executionAffinityAttr, /*result_affinity=*/executionAffinityAttr); + rewriter.replaceOpWithMultiple(constantOp, {{transferOp, constantSize}}); return success(); } }; @@ -123,21 +137,23 @@ struct ConvertTensorCastLikeOp : public AffinityAwareConversionPattern { using AffinityAwareConversionPattern< CastOpTy>::AffinityAwareConversionPattern; - LogicalResult - matchAndRewrite(CastOpTy op, typename CastOpTy::Adaptor adaptor, - ConversionPatternRewriter &rewriter) const override { + LogicalResult matchAndRewrite( + CastOpTy op, + typename OpConversionPattern::OneToNOpAdaptor adaptor, + ConversionPatternRewriter &rewriter) const override { auto resultAffinityAttr = this->lookupResultAffinity(op.getResult()); - auto source = this->transferTensorOperand(op.getLoc(), op.getSource(), - adaptor.getSource(), - resultAffinityAttr, rewriter); + auto source = this->transferTensorOperands(op.getLoc(), op.getSource(), + adaptor.getSource(), + resultAffinityAttr, rewriter); auto resultSize = buildResultSizeOf(op.getLoc(), op.getResult(), op.getResultDims(), resultAffinityAttr, rewriter); auto unknownType = rewriter.getType(); - rewriter.replaceOpWithNewOp( - op, unknownType, source.resource, op.getSource().getType(), + Value cloneOp = rewriter.create( + op.getLoc(), unknownType, source.resource, op.getSource().getType(), op.getSourceDims(), source.resourceSize, op.getResult().getType(), - adaptor.getResultDims(), resultSize, resultAffinityAttr); + flattenValues(adaptor.getResultDims()), resultSize, resultAffinityAttr); + rewriter.replaceOpWithMultiple(op, {{cloneOp, resultSize}}); return success(); } }; @@ -146,15 +162,16 @@ struct ConvertTensorAllocaOp : public AffinityOpConversionPattern { using AffinityOpConversionPattern::AffinityOpConversionPattern; LogicalResult matchAndRewriteOnAffinity( - 
IREE::Flow::TensorAllocaOp op, OpAdaptor adaptor, + IREE::Flow::TensorAllocaOp op, OneToNOpAdaptor adaptor, IREE::Stream::AffinityAttr executionAffinityAttr, ConversionPatternRewriter &rewriter) const override { auto resultSize = buildResultSizeOf(op.getLoc(), op.getResult(), op.getResultDims(), executionAffinityAttr, rewriter); auto unknownType = rewriter.getType(); - rewriter.replaceOpWithNewOp( - op, unknownType, resultSize, executionAffinityAttr); + auto allocaOp = rewriter.create( + op.getLoc(), unknownType, resultSize, executionAffinityAttr); + rewriter.replaceOpWithMultiple(op, {{allocaOp.getResult(), resultSize}}); return success(); } }; @@ -163,16 +180,18 @@ struct ConvertTensorEmptyOp : public AffinityOpConversionPattern { using AffinityOpConversionPattern::AffinityOpConversionPattern; LogicalResult matchAndRewriteOnAffinity( - IREE::Flow::TensorEmptyOp op, OpAdaptor adaptor, + IREE::Flow::TensorEmptyOp op, OneToNOpAdaptor adaptor, IREE::Stream::AffinityAttr executionAffinityAttr, ConversionPatternRewriter &rewriter) const override { auto resultSize = buildResultSizeOf(op.getLoc(), op.getResult(), op.getResultDims(), executionAffinityAttr, rewriter); auto unknownType = rewriter.getType(); - rewriter.replaceOpWithNewOp( - op, unknownType, op.getResult().getType(), adaptor.getResultDims(), - resultSize, executionAffinityAttr); + auto emptyOp = rewriter.create( + op.getLoc(), unknownType, op.getResult().getType(), + flattenValues(adaptor.getResultDims()), resultSize, + executionAffinityAttr); + rewriter.replaceOpWithMultiple(op, {{emptyOp.getResult(), resultSize}}); return success(); } }; @@ -181,16 +200,18 @@ struct ConvertTensorSplatOp : public AffinityOpConversionPattern { using AffinityOpConversionPattern::AffinityOpConversionPattern; LogicalResult matchAndRewriteOnAffinity( - IREE::Flow::TensorSplatOp op, OpAdaptor adaptor, + IREE::Flow::TensorSplatOp op, OneToNOpAdaptor adaptor, IREE::Stream::AffinityAttr executionAffinityAttr, 
ConversionPatternRewriter &rewriter) const override { auto resultSize = buildResultSizeOf(op.getLoc(), op.getResult(), op.getResultDims(), executionAffinityAttr, rewriter); auto unknownType = rewriter.getType(); - rewriter.replaceOpWithNewOp( - op, unknownType, adaptor.getValue(), op.getResult().getType(), - adaptor.getResultDims(), resultSize, executionAffinityAttr); + auto splatOp = rewriter.create( + op.getLoc(), unknownType, adaptor.getValue().front(), + op.getResult().getType(), flattenValues(adaptor.getResultDims()), + resultSize, executionAffinityAttr); + rewriter.replaceOpWithMultiple(op, {{splatOp, resultSize}}); return success(); } }; @@ -199,17 +220,19 @@ struct ConvertTensorCloneOp : public AffinityOpConversionPattern { using AffinityOpConversionPattern::AffinityOpConversionPattern; LogicalResult matchAndRewriteOnAffinity( - IREE::Flow::TensorCloneOp op, OpAdaptor adaptor, + IREE::Flow::TensorCloneOp op, OneToNOpAdaptor adaptor, IREE::Stream::AffinityAttr executionAffinityAttr, ConversionPatternRewriter &rewriter) const override { - auto operand = transferTensorOperand(op.getLoc(), op.getOperand(), - adaptor.getOperand(), - executionAffinityAttr, rewriter); + auto operand = transferTensorOperands(op.getLoc(), op.getOperand(), + adaptor.getOperand(), + executionAffinityAttr, rewriter); auto unknownType = rewriter.getType(); - rewriter.replaceOpWithNewOp( - op, unknownType, operand.resource, op.getOperand().getType(), + auto cloneOp = rewriter.create( + op.getLoc(), unknownType, operand.resource, op.getOperand().getType(), op.getArgumentDims(), operand.resourceSize, op.getResult().getType(), - adaptor.getArgumentDims(), operand.resourceSize, executionAffinityAttr); + flattenValues(adaptor.getArgumentDims()), operand.resourceSize, + executionAffinityAttr); + rewriter.replaceOpWithMultiple(op, {{cloneOp, operand.resourceSize}}); return success(); } }; @@ -218,20 +241,21 @@ struct ConvertTensorTransferOp : public AffinityOpConversionPattern { using 
AffinityOpConversionPattern::AffinityOpConversionPattern; LogicalResult matchAndRewriteOnAffinity( - IREE::Flow::TensorTransferOp op, OpAdaptor adaptor, + IREE::Flow::TensorTransferOp op, OneToNOpAdaptor adaptor, IREE::Stream::AffinityAttr executionAffinityAttr, ConversionPatternRewriter &rewriter) const override { if (!executionAffinityAttr) { return rewriter.notifyMatchFailure(op, "invalid stream affinity attr"); } - auto operand = resolveTensorOperand(op.getLoc(), op.getOperand(), - adaptor.getOperand(), rewriter); + auto operand = resolveTensorOperands(op.getLoc(), op.getOperand(), + adaptor.getOperand(), rewriter); auto unknownType = rewriter.getType(); - rewriter.replaceOpWithNewOp( - op, unknownType, operand.resource, operand.resourceSize, + auto transferOp = rewriter.create( + op.getLoc(), unknownType, operand.resource, operand.resourceSize, operand.resourceSize, /*source_affinity=*/operand.affinity, /*result_affinity=*/executionAffinityAttr); + rewriter.replaceOpWithMultiple(op, {{transferOp, operand.resourceSize}}); return success(); } }; @@ -240,21 +264,24 @@ struct ConvertTensorSliceOp : public AffinityOpConversionPattern { using AffinityOpConversionPattern::AffinityOpConversionPattern; LogicalResult matchAndRewriteOnAffinity( - IREE::Flow::TensorSliceOp op, OpAdaptor adaptor, + IREE::Flow::TensorSliceOp op, OneToNOpAdaptor adaptor, IREE::Stream::AffinityAttr executionAffinityAttr, ConversionPatternRewriter &rewriter) const override { auto source = - transferTensorOperand(op.getLoc(), op.getSource(), adaptor.getSource(), - executionAffinityAttr, rewriter); + transferTensorOperands(op.getLoc(), op.getSource(), adaptor.getSource(), + executionAffinityAttr, rewriter); auto resultSize = buildResultSizeOf(op.getLoc(), op.getResult(), op.getResultDims(), executionAffinityAttr, rewriter); auto unknownType = rewriter.getType(); - rewriter.replaceOpWithNewOp( - op, unknownType, source.resource, op.getSource().getType(), - op.getSourceDims(), source.resourceSize, 
adaptor.getStartIndices(), - adaptor.getLengths(), op.getResult().getType(), adaptor.getResultDims(), - resultSize, executionAffinityAttr); + auto sliceOp = rewriter.create( + op.getLoc(), unknownType, source.resource, op.getSource().getType(), + op.getSourceDims(), source.resourceSize, + flattenValues(adaptor.getStartIndices()), + flattenValues(adaptor.getLengths()), op.getResult().getType(), + flattenValues(adaptor.getResultDims()), resultSize, + executionAffinityAttr); + rewriter.replaceOpWithMultiple(op, {{sliceOp, resultSize}}); return success(); } }; @@ -263,20 +290,23 @@ struct ConvertTensorUpdateOp : public AffinityOpConversionPattern { using AffinityOpConversionPattern::AffinityOpConversionPattern; LogicalResult matchAndRewriteOnAffinity( - IREE::Flow::TensorUpdateOp op, OpAdaptor adaptor, + IREE::Flow::TensorUpdateOp op, OneToNOpAdaptor adaptor, IREE::Stream::AffinityAttr executionAffinityAttr, ConversionPatternRewriter &rewriter) const override { auto target = - transferTensorOperand(op.getLoc(), op.getTarget(), adaptor.getTarget(), - executionAffinityAttr, rewriter); + transferTensorOperands(op.getLoc(), op.getTarget(), adaptor.getTarget(), + executionAffinityAttr, rewriter); auto update = - transferTensorOperand(op.getLoc(), op.getUpdate(), adaptor.getUpdate(), - executionAffinityAttr, rewriter); - rewriter.replaceOpWithNewOp( - op, target.resource.getType(), target.resource, - op.getTarget().getType(), adaptor.getTargetDims(), target.resourceSize, - adaptor.getStartIndices(), update.resource, op.getUpdate().getType(), - op.getUpdateDims(), update.resourceSize, executionAffinityAttr); + transferTensorOperands(op.getLoc(), op.getUpdate(), adaptor.getUpdate(), + executionAffinityAttr, rewriter); + auto updateOp = rewriter.create( + op.getLoc(), target.resource.getType(), target.resource, + op.getTarget().getType(), flattenValues(adaptor.getTargetDims()), + target.resourceSize, flattenValues(adaptor.getStartIndices()), + update.resource, 
op.getUpdate().getType(), op.getUpdateDims(), + update.resourceSize, executionAffinityAttr); + rewriter.replaceOpWithMultiple( + op, {{updateOp.getResult(), target.resourceSize}}); return success(); } }; @@ -296,10 +326,10 @@ struct ConvertTensorLoadOp : public AffinityAwareConversionPattern { using AffinityAwareConversionPattern::AffinityAwareConversionPattern; LogicalResult - matchAndRewrite(IREE::Flow::TensorLoadOp op, OpAdaptor adaptor, + matchAndRewrite(IREE::Flow::TensorLoadOp op, OneToNOpAdaptor adaptor, ConversionPatternRewriter &rewriter) const override { - auto source = resolveTensorOperand(op.getLoc(), op.getSource(), - adaptor.getSource(), rewriter); + auto source = resolveTensorOperands(op.getLoc(), op.getSource(), + adaptor.getSource(), rewriter); // If the source is not a staging resource then we need to transfer it to // a staging resource. We slice out just what is being loaded so that we @@ -311,10 +341,13 @@ struct ConvertTensorLoadOp auto stagingType = rewriter.getType( IREE::Stream::Lifetime::Staging); auto resultType = getTypeConverter()->convertType(op.getResult().getType()); + SmallVector convertedSourceDims = + flattenValues(adaptor.getSourceDims()); + SmallVector convertedIndices = flattenValues(adaptor.getIndices()); if (source.resource.getType() == stagingType) { rewriter.replaceOpWithNewOp( op, resultType, source.resource, op.getSource().getType(), - adaptor.getSourceDims(), source.resourceSize, adaptor.getIndices()); + convertedSourceDims, source.resourceSize, convertedIndices); return success(); } @@ -328,19 +361,18 @@ struct ConvertTensorLoadOp /*result_affinity=*/source.affinity); rewriter.replaceOpWithNewOp( op, resultType, transferOp.getResult(), sourceEncoding, - adaptor.getSourceDims(), transferOp.getResultSize(), - adaptor.getIndices()); + convertedSourceDims, transferOp.getResultSize(), convertedIndices); return success(); } // Slice out the individual element value. 
IndexSet indexSet(op.getLoc(), rewriter); - indexSet.populate(adaptor.getIndices()); + indexSet.populate(convertedIndices); SmallVector sliceIndices; SmallVector sliceLengths; SmallVector loadIndices; SmallVector resultDims; - for (auto index : adaptor.getIndices()) { + for (auto index : convertedIndices) { // TODO(benvanik): support larger buffer slices. sliceIndices.push_back(index); sliceLengths.push_back(indexSet.get(1)); @@ -354,9 +386,8 @@ struct ConvertTensorLoadOp op.getLoc(), resultEncoding, ValueRange{}, source.affinity); auto sliceOp = rewriter.create( op.getLoc(), source.resource.getType(), source.resource, sourceEncoding, - adaptor.getSourceDims(), source.resourceSize, sliceIndices, - sliceLengths, resultEncoding, ValueRange{}, resultSize, - source.affinity); + convertedSourceDims, source.resourceSize, sliceIndices, sliceLengths, + resultEncoding, ValueRange{}, resultSize, source.affinity); auto transferOp = rewriter.create( op.getLoc(), stagingType, sliceOp.getResult(), sliceOp.getResultSize(), sliceOp.getResultSize(), @@ -374,33 +405,37 @@ struct ConvertTensorStoreOp : public AffinityAwareConversionPattern { using AffinityAwareConversionPattern::AffinityAwareConversionPattern; LogicalResult - matchAndRewrite(IREE::Flow::TensorStoreOp op, OpAdaptor adaptor, + matchAndRewrite(IREE::Flow::TensorStoreOp op, OneToNOpAdaptor adaptor, ConversionPatternRewriter &rewriter) const override { - auto target = resolveTensorOperand(op.getLoc(), op.getTarget(), - adaptor.getTarget(), rewriter); + auto target = resolveTensorOperands(op.getLoc(), op.getTarget(), + adaptor.getTarget(), rewriter); // If the target is a staging resource then we can directly store into it // with a fast-path. Otherwise we need to stage an upload. 
auto stagingType = rewriter.getType( IREE::Stream::Lifetime::Staging); if (target.resource.getType() == stagingType) { - rewriter.replaceOpWithNewOp( - op, target.resource.getType(), target.resource, - op.getTarget().getType(), adaptor.getTargetDims(), - target.resourceSize, adaptor.getIndices(), adaptor.getValue()); + auto storeOp = rewriter.create( + op.getLoc(), target.resource.getType(), target.resource, + op.getTarget().getType(), flattenValues(adaptor.getTargetDims()), + target.resourceSize, flattenValues(adaptor.getIndices()), + adaptor.getValue().front()); + rewriter.replaceOpWithMultiple(op, {{storeOp, target.resourceSize}}); return success(); } // Use fill to store the value. // TODO(benvanik): support larger buffer slices (stage + update). IndexSet indexSet(op.getLoc(), rewriter); - indexSet.populate(adaptor.getIndices()); - SmallVector lengths(adaptor.getIndices().size(), indexSet.get(1)); + SmallVector convertedIndices = flattenValues(adaptor.getIndices()); + indexSet.populate(convertedIndices); + SmallVector lengths(convertedIndices.size(), indexSet.get(1)); auto targetEncoding = op.getTarget().getType(); - rewriter.replaceOpWithNewOp( - op, target.resource, targetEncoding, adaptor.getTargetDims(), - target.resourceSize, adaptor.getIndices(), lengths, adaptor.getValue(), - target.affinity); + auto fillOp = rewriter.create( + op.getLoc(), target.resource, targetEncoding, + flattenValues(adaptor.getTargetDims()), target.resourceSize, + convertedIndices, lengths, adaptor.getValue().front(), target.affinity); + rewriter.replaceOpWithMultiple(op, {{fillOp, target.resourceSize}}); return success(); } }; @@ -409,15 +444,15 @@ struct ConvertTensorTraceOp : public AffinityAwareConversionPattern { using AffinityAwareConversionPattern::AffinityAwareConversionPattern; LogicalResult - matchAndRewrite(IREE::Flow::TensorTraceOp op, OpAdaptor adaptor, + matchAndRewrite(IREE::Flow::TensorTraceOp op, OneToNOpAdaptor adaptor, ConversionPatternRewriter &rewriter) const 
override { SmallVector resources; SmallVector resourceSizes; SmallVector resourceEncodings; for (auto [tensorOperand, resourceOperand] : llvm::zip_equal(op.getValues(), adaptor.getValues())) { - auto source = resolveTensorOperand(op.getLoc(), tensorOperand, - resourceOperand, rewriter); + auto source = resolveTensorOperands(op.getLoc(), tensorOperand, + resourceOperand, rewriter); auto stagingType = rewriter.getType( IREE::Stream::Lifetime::Staging); auto traceSource = source.resource; @@ -432,10 +467,10 @@ struct ConvertTensorTraceOp resourceSizes.push_back(source.resourceSize); resourceEncodings.push_back(TypeAttr::get(tensorOperand.getType())); } - rewriter.replaceOpWithNewOp( op, adaptor.getKey(), resources, resourceSizes, - rewriter.getArrayAttr(resourceEncodings), adaptor.getValueDims()); + rewriter.getArrayAttr(resourceEncodings), + flattenValues(adaptor.getValueDims())); return success(); } }; @@ -444,7 +479,7 @@ struct ConvertChannelDefaultOp : public AffinityOpConversionPattern { using AffinityOpConversionPattern::AffinityOpConversionPattern; LogicalResult matchAndRewriteOnAffinity( - IREE::Flow::ChannelDefaultOp op, OpAdaptor adaptor, + IREE::Flow::ChannelDefaultOp op, OneToNOpAdaptor adaptor, IREE::Stream::AffinityAttr executionAffinityAttr, ConversionPatternRewriter &rewriter) const override { rewriter.replaceOpWithNewOp( @@ -497,7 +532,7 @@ struct ConvertAllGatherOp : public AffinityOpConversionPattern { using AffinityOpConversionPattern::AffinityOpConversionPattern; LogicalResult matchAndRewriteOnAffinity( - IREE::Flow::CollectiveAllGatherOp op, OpAdaptor adaptor, + IREE::Flow::CollectiveAllGatherOp op, OneToNOpAdaptor adaptor, IREE::Stream::AffinityAttr executionAffinityAttr, ConversionPatternRewriter &rewriter) const override { auto collectiveAttr = rewriter.getAttr( @@ -509,14 +544,14 @@ struct ConvertAllGatherOp auto elementCount = rewriter.create( op.getLoc(), op.getType().getNumElements()); auto newTargetCast = - 
transferTensorOperand(op.getLoc(), op.getTarget(), adaptor.getTarget(), - executionAffinityAttr, rewriter); + transferTensorOperands(op.getLoc(), op.getTarget(), adaptor.getTarget(), + executionAffinityAttr, rewriter); auto newSourceCast = - transferTensorOperand(op.getLoc(), op.getSource(), adaptor.getSource(), - executionAffinityAttr, rewriter); + transferTensorOperands(op.getLoc(), op.getSource(), adaptor.getSource(), + executionAffinityAttr, rewriter); - rewriter.replaceOpWithNewOp( - op, collectiveAttr, + auto collectiveOp = rewriter.create( + op.getLoc(), collectiveAttr, /*target=*/newTargetCast.resource, /*target_size=*/newTargetCast.resourceSize, /*target_offset=*/zeroOffset, @@ -528,8 +563,10 @@ struct ConvertAllGatherOp /*source_end=*/newSourceCast.resourceSize, /*source_length=*/newSourceCast.resourceSize, /*element_count=*/elementCount, - /*channel=*/adaptor.getChannel(), + /*channel=*/adaptor.getChannel().front(), /*param=*/mlir::Value(), executionAffinityAttr); + rewriter.replaceOpWithMultiple( + op, {{collectiveOp, newTargetCast.resourceSize}}); return success(); } }; @@ -538,7 +575,7 @@ struct ConvertAllReduceOp : public AffinityOpConversionPattern { using AffinityOpConversionPattern::AffinityOpConversionPattern; LogicalResult matchAndRewriteOnAffinity( - IREE::Flow::CollectiveAllReduceOp op, OpAdaptor adaptor, + IREE::Flow::CollectiveAllReduceOp op, OneToNOpAdaptor adaptor, IREE::Stream::AffinityAttr executionAffinityAttr, ConversionPatternRewriter &rewriter) const override { auto collectiveAttr = rewriter.getAttr( @@ -550,14 +587,14 @@ struct ConvertAllReduceOp auto elementCount = rewriter.create( op.getLoc(), op.getType().getNumElements()); auto newTargetCast = - transferTensorOperand(op.getLoc(), op.getTarget(), adaptor.getTarget(), - executionAffinityAttr, rewriter); + transferTensorOperands(op.getLoc(), op.getTarget(), adaptor.getTarget(), + executionAffinityAttr, rewriter); auto newSourceCast = - transferTensorOperand(op.getLoc(), 
op.getSource(), adaptor.getSource(), - executionAffinityAttr, rewriter); + transferTensorOperands(op.getLoc(), op.getSource(), adaptor.getSource(), + executionAffinityAttr, rewriter); - rewriter.replaceOpWithNewOp( - op, collectiveAttr, + auto collectiveOp = rewriter.create( + op.getLoc(), collectiveAttr, /*target=*/newTargetCast.resource, /*target_size=*/newTargetCast.resourceSize, /*target_offset=*/zeroOffset, @@ -569,8 +606,10 @@ struct ConvertAllReduceOp /*source_end=*/newSourceCast.resourceSize, /*source_length=*/newSourceCast.resourceSize, /*element_count=*/elementCount, - /*channel=*/adaptor.getChannel(), + /*channel=*/adaptor.getChannel().front(), /*param=*/mlir::Value(), executionAffinityAttr); + rewriter.replaceOpWithMultiple( + op, {{collectiveOp, newTargetCast.resourceSize}}); return success(); } }; @@ -579,7 +618,7 @@ struct ConvertAllToAllOp : public AffinityOpConversionPattern { using AffinityOpConversionPattern::AffinityOpConversionPattern; LogicalResult matchAndRewriteOnAffinity( - IREE::Flow::CollectiveAllToAllOp op, OpAdaptor adaptor, + IREE::Flow::CollectiveAllToAllOp op, OneToNOpAdaptor adaptor, IREE::Stream::AffinityAttr executionAffinityAttr, ConversionPatternRewriter &rewriter) const override { auto collectiveAttr = rewriter.getAttr( @@ -591,14 +630,14 @@ struct ConvertAllToAllOp auto elementCount = rewriter.create( op.getLoc(), op.getType().getNumElements()); auto newTargetCast = - transferTensorOperand(op.getLoc(), op.getTarget(), adaptor.getTarget(), - executionAffinityAttr, rewriter); + transferTensorOperands(op.getLoc(), op.getTarget(), adaptor.getTarget(), + executionAffinityAttr, rewriter); auto newSourceCast = - transferTensorOperand(op.getLoc(), op.getSource(), adaptor.getSource(), - executionAffinityAttr, rewriter); + transferTensorOperands(op.getLoc(), op.getSource(), adaptor.getSource(), + executionAffinityAttr, rewriter); - rewriter.replaceOpWithNewOp( - op, collectiveAttr, + auto collectiveOp = rewriter.create( + op.getLoc(), 
collectiveAttr, /*target=*/newTargetCast.resource, /*target_size=*/newTargetCast.resourceSize, /*target_offset=*/zeroOffset, @@ -610,8 +649,10 @@ struct ConvertAllToAllOp /*source_end=*/newSourceCast.resourceSize, /*source_length=*/newSourceCast.resourceSize, /*element_count=*/elementCount, - /*channel=*/adaptor.getChannel(), + /*channel=*/adaptor.getChannel().front(), /*param=*/mlir::Value(), executionAffinityAttr); + rewriter.replaceOpWithMultiple( + op, {{collectiveOp, newTargetCast.resourceSize}}); return success(); } }; @@ -620,7 +661,7 @@ struct ConvertReduceScatterOp : public AffinityOpConversionPattern< IREE::Flow::CollectiveReduceScatterOp> { using AffinityOpConversionPattern::AffinityOpConversionPattern; LogicalResult matchAndRewriteOnAffinity( - IREE::Flow::CollectiveReduceScatterOp op, OpAdaptor adaptor, + IREE::Flow::CollectiveReduceScatterOp op, OneToNOpAdaptor adaptor, IREE::Stream::AffinityAttr executionAffinityAttr, ConversionPatternRewriter &rewriter) const override { auto collectiveAttr = rewriter.getAttr( @@ -632,14 +673,14 @@ struct ConvertReduceScatterOp : public AffinityOpConversionPattern< auto elementCount = rewriter.create( op.getLoc(), op.getType().getNumElements()); auto newTargetCast = - transferTensorOperand(op.getLoc(), op.getTarget(), adaptor.getTarget(), - executionAffinityAttr, rewriter); + transferTensorOperands(op.getLoc(), op.getTarget(), adaptor.getTarget(), + executionAffinityAttr, rewriter); auto newSourceCast = - transferTensorOperand(op.getLoc(), op.getSource(), adaptor.getSource(), - executionAffinityAttr, rewriter); + transferTensorOperands(op.getLoc(), op.getSource(), adaptor.getSource(), + executionAffinityAttr, rewriter); - rewriter.replaceOpWithNewOp( - op, collectiveAttr, + auto collectiveOp = rewriter.create( + op.getLoc(), collectiveAttr, /*target=*/newTargetCast.resource, /*target_size=*/newTargetCast.resourceSize, /*target_offset=*/zeroOffset, @@ -651,8 +692,10 @@ struct ConvertReduceScatterOp : public 
AffinityOpConversionPattern< /*source_end=*/newSourceCast.resourceSize, /*source_length=*/newSourceCast.resourceSize, /*element_count=*/elementCount, - /*channel=*/adaptor.getChannel(), + /*channel=*/adaptor.getChannel().front(), /*param=*/mlir::Value(), executionAffinityAttr); + rewriter.replaceOpWithMultiple( + op, {{collectiveOp, newTargetCast.resourceSize}}); return success(); } }; @@ -661,7 +704,7 @@ struct ConvertCollectiveSendRecvOp : public AffinityOpConversionPattern { using AffinityOpConversionPattern::AffinityOpConversionPattern; LogicalResult matchAndRewriteOnAffinity( - IREE::Flow::CollectiveSendRecvOp op, OpAdaptor adaptor, + IREE::Flow::CollectiveSendRecvOp op, OneToNOpAdaptor adaptor, IREE::Stream::AffinityAttr executionAffinityAttr, ConversionPatternRewriter &rewriter) const override { auto collectiveAttr = rewriter.getAttr( @@ -673,11 +716,11 @@ struct ConvertCollectiveSendRecvOp auto elementCount = rewriter.create( op.getLoc(), op.getType().getNumElements()); auto newTargetCast = - transferTensorOperand(op.getLoc(), op.getTarget(), adaptor.getTarget(), - executionAffinityAttr, rewriter); + transferTensorOperands(op.getLoc(), op.getTarget(), adaptor.getTarget(), + executionAffinityAttr, rewriter); auto newSourceCast = - transferTensorOperand(op.getLoc(), op.getSource(), adaptor.getSource(), - executionAffinityAttr, rewriter); + transferTensorOperands(op.getLoc(), op.getSource(), adaptor.getSource(), + executionAffinityAttr, rewriter); // Pack send, recv into param. The values are checked to be within the // 16-bit range during lowering to Flow dialect. 
@@ -693,8 +736,8 @@ struct ConvertCollectiveSendRecvOp rewriter.create(op.getLoc(), 16, 32)); auto param = rewriter.create(op.getLoc(), hi, lo); - rewriter.replaceOpWithNewOp( - op, collectiveAttr, + auto collectiveOp = rewriter.create( + op.getLoc(), collectiveAttr, /*target=*/newTargetCast.resource, /*target_size=*/newTargetCast.resourceSize, /*target_offset=*/zeroOffset, @@ -706,8 +749,10 @@ struct ConvertCollectiveSendRecvOp /*source_end=*/newSourceCast.resourceSize, /*source_length=*/newSourceCast.resourceSize, /*element_count=*/elementCount, - /*channel=*/adaptor.getChannel(), + /*channel=*/adaptor.getChannel().front(), /*param=*/param, executionAffinityAttr); + rewriter.replaceOpWithMultiple( + op, {{collectiveOp, newTargetCast.resourceSize}}); return success(); } }; @@ -716,7 +761,7 @@ struct ConvertDispatchOp : public AffinityOpConversionPattern { using AffinityOpConversionPattern::AffinityOpConversionPattern; LogicalResult matchAndRewriteOnAffinity( - IREE::Flow::DispatchOp op, OpAdaptor adaptor, + IREE::Flow::DispatchOp op, OneToNOpAdaptor adaptor, IREE::Stream::AffinityAttr executionAffinityAttr, ConversionPatternRewriter &rewriter) const override { // Zero is going to be used for each operand to start. 
@@ -729,12 +774,14 @@ struct ConvertDispatchOp SmallVector dispatchOperandEnds; SmallVector dispatchOperandLengths; SmallVector operandSizes; - for (auto [oldOperand, newOperand] : + + for (auto [oldOperand, convertedOperands] : llvm::zip_equal(op.getArguments(), adaptor.getArguments())) { + Value newOperand; if (llvm::isa(oldOperand.getType())) { auto newOperandCast = - transferTensorOperand(op.getLoc(), oldOperand, newOperand, - executionAffinityAttr, rewriter); + transferTensorOperands(op.getLoc(), oldOperand, convertedOperands, + executionAffinityAttr, rewriter); newOperand = newOperandCast.resource; dispatchOperandSizes.push_back(newOperandCast.resourceSize); operandSizes.push_back(newOperandCast.resourceSize); @@ -743,6 +790,7 @@ struct ConvertDispatchOp dispatchOperandLengths.push_back(newOperandCast.resourceSize); } else { operandSizes.push_back({}); + newOperand = convertedOperands.front(); } dispatchOperands.push_back(newOperand); } @@ -773,12 +821,19 @@ struct ConvertDispatchOp } } - auto newOp = rewriter.replaceOpWithNewOp( - op, resultTypes, adaptor.getWorkload(), adaptor.getEntryPointsAttr(), - dispatchOperands, dispatchOperandSizes, dispatchOperandOffsets, - dispatchOperandEnds, dispatchOperandLengths, resultSizes, - adaptor.getTiedOperandsAttr(), executionAffinityAttr); + auto newOp = rewriter.create( + op.getLoc(), resultTypes, flattenValues(adaptor.getWorkload()), + adaptor.getEntryPointsAttr(), dispatchOperands, dispatchOperandSizes, + dispatchOperandOffsets, dispatchOperandEnds, dispatchOperandLengths, + resultSizes, adaptor.getTiedOperandsAttr(), executionAffinityAttr); newOp->setDialectAttrs(op->getDialectAttrs()); + SmallVector> replacementsVec = llvm::map_to_vector( + llvm::zip_equal(newOp->getResults(), resultSizes), [](auto it) { + return SmallVector{std::get<0>(it), std::get<1>(it)}; + }); + SmallVector replacements = llvm::map_to_vector( + replacementsVec, [](ArrayRef v) -> ValueRange { return v; }); + rewriter.replaceOpWithMultiple(op, 
replacements); return success(); } }; @@ -821,7 +876,7 @@ struct ConvertFuncOp : public OpConversionPattern { struct ConvertCallOp : public AffinityOpConversionPattern { using AffinityOpConversionPattern::AffinityOpConversionPattern; LogicalResult matchAndRewriteOnAffinity( - IREE::Flow::CallOp op, OpAdaptor adaptor, + IREE::Flow::CallOp op, OneToNOpAdaptor adaptor, IREE::Stream::AffinityAttr executionAffinityAttr, ConversionPatternRewriter &rewriter) const override { // Zero is going to be used for each operand to start. @@ -834,12 +889,13 @@ struct ConvertCallOp : public AffinityOpConversionPattern { SmallVector callOperandEnds; SmallVector callOperandLengths; SmallVector operandSizes; - for (auto [oldOperand, newOperand] : + for (auto [oldOperand, convertedOperand] : llvm::zip_equal(op.getArguments(), adaptor.getArguments())) { + Value newOperand; if (llvm::isa(oldOperand.getType())) { auto newOperandCast = - transferTensorOperand(op.getLoc(), oldOperand, newOperand, - executionAffinityAttr, rewriter); + transferTensorOperands(op.getLoc(), oldOperand, convertedOperand, + executionAffinityAttr, rewriter); newOperand = newOperandCast.resource; callOperandSizes.push_back(newOperandCast.resourceSize); operandSizes.push_back(newOperandCast.resourceSize); @@ -847,6 +903,7 @@ struct ConvertCallOp : public AffinityOpConversionPattern { callOperandEnds.push_back(newOperandCast.resourceSize); callOperandLengths.push_back(newOperandCast.resourceSize); } else { + newOperand = convertedOperand.front(); operandSizes.push_back({}); } callOperands.push_back(newOperand); @@ -861,6 +918,7 @@ struct ConvertCallOp : public AffinityOpConversionPattern { auto oldResultType = result.value().getType(); if (!llvm::isa(oldResultType)) { resultTypes.push_back(getTypeConverter()->convertType(oldResultType)); + resultSizes.push_back(nullptr); continue; } auto tiedOperand = op.getTiedResultOperandIndex(result.index()); @@ -878,12 +936,13 @@ struct ConvertCallOp : public 
AffinityOpConversionPattern { } } - auto newOp = rewriter.replaceOpWithNewOp( - op, resultTypes, adaptor.getCalleeAttr(), callOperands, + auto newOp = rewriter.create( + op.getLoc(), resultTypes, adaptor.getCalleeAttr(), callOperands, callOperandSizes, callOperandOffsets, callOperandEnds, callOperandLengths, resultSizes, adaptor.getTiedOperandsAttr(), executionAffinityAttr); newOp->setDialectAttrs(op->getDialectAttrs()); + replaceOpWithMultiple(op, newOp->getResults(), resultSizes, rewriter); return success(); } }; diff --git a/compiler/src/iree/compiler/Dialect/Stream/Conversion/HALToStream/Patterns.cpp b/compiler/src/iree/compiler/Dialect/Stream/Conversion/HALToStream/Patterns.cpp index 76eef8b8e56f..e597aaffba8f 100644 --- a/compiler/src/iree/compiler/Dialect/Stream/Conversion/HALToStream/Patterns.cpp +++ b/compiler/src/iree/compiler/Dialect/Stream/Conversion/HALToStream/Patterns.cpp @@ -16,6 +16,14 @@ namespace mlir::iree_compiler { namespace { +/// Flatten the given value ranges into a single vector of values. +static SmallVector flattenValues(ArrayRef values) { + SmallVector result; + for (const auto &vals : values) + llvm::append_range(result, vals); + return result; +} + // %1 = hal.tensor.import %0 : !hal.buffer_view -> tensor<4xf32> // -> // %1 = stream.tensor.import %0 : !hal.buffer_view -> @@ -24,7 +32,7 @@ struct ConvertTensorImportOp : public AffinityOpConversionPattern { using AffinityOpConversionPattern::AffinityOpConversionPattern; LogicalResult matchAndRewriteOnAffinity( - IREE::HAL::TensorImportOp op, OpAdaptor adaptor, + IREE::HAL::TensorImportOp op, OneToNOpAdaptor adaptor, IREE::Stream::AffinityAttr executionAffinityAttr, ConversionPatternRewriter &rewriter) const override { auto sourceType = op.getSource().getType(); @@ -42,9 +50,9 @@ struct ConvertTensorImportOp // mistake and it's better to know of a shape mismatch than just buffer // byte length difference. 
if (auto tensorType = llvm::dyn_cast(targetType)) { - if (failed(buildEncodingAssertions(op.getLoc(), adaptor.getSource(), - op.getNameAttr(), tensorType, - op.getTargetDims(), rewriter))) { + if (failed(buildEncodingAssertions( + op.getLoc(), adaptor.getSource().front(), op.getNameAttr(), + tensorType, op.getTargetDims(), rewriter))) { return rewriter.notifyMatchFailure(op, "unsupported tensor type"); } } @@ -55,11 +63,12 @@ struct ConvertTensorImportOp IREE::Stream::Lifetime::External); Value resultSize = rewriter.create( op.getLoc(), rewriter.getIndexType(), - TypeAttr::get(op.getTarget().getType()), adaptor.getTargetDims(), - executionAffinityAttr); + TypeAttr::get(op.getTarget().getType()), + flattenValues(adaptor.getTargetDims()), executionAffinityAttr); Value resource = rewriter.create( - op.getLoc(), resultType, adaptor.getSource(), TypeAttr::get(targetType), - adaptor.getTargetDims(), resultSize, executionAffinityAttr); + op.getLoc(), resultType, adaptor.getSource().front(), + TypeAttr::get(targetType), flattenValues(adaptor.getTargetDims()), + resultSize, executionAffinityAttr); // Await the fence, if needed. When not specified the resource is assumed to // be immediately available. 
@@ -75,10 +84,11 @@ struct ConvertTensorImportOp } auto unknownType = rewriter.getType(); - rewriter.replaceOpWithNewOp( - op, unknownType, resource, resultSize, resultSize, + Value newImport = rewriter.create( + op.getLoc(), unknownType, resource, resultSize, resultSize, /*source_affinity=*/executionAffinityAttr, /*target_affinity=*/executionAffinityAttr); + rewriter.replaceOpWithMultiple(op, {{newImport, resultSize}}); return success(); } @@ -125,7 +135,7 @@ struct ConvertTensorExportOp : public AffinityOpConversionPattern { using AffinityOpConversionPattern::AffinityOpConversionPattern; LogicalResult matchAndRewriteOnAffinity( - IREE::HAL::TensorExportOp op, OpAdaptor adaptor, + IREE::HAL::TensorExportOp op, OneToNOpAdaptor adaptor, IREE::Stream::AffinityAttr executionAffinityAttr, ConversionPatternRewriter &rewriter) const override { auto sourceType = op.getSourceEncoding(); @@ -136,12 +146,12 @@ struct ConvertTensorExportOp } auto source = - transferTensorOperand(op.getLoc(), op.getSource(), adaptor.getSource(), - executionAffinityAttr, rewriter); + transferTensorOperands(op.getLoc(), op.getSource(), adaptor.getSource(), + executionAffinityAttr, rewriter); // Exporting a produced value - transfer our source value to an externally // usable resource and directly export it. This will cause an allocation. - auto exportSource = adaptor.getSource(); + Value exportSource = adaptor.getSource().front(); auto externalType = rewriter.getType( IREE::Stream::Lifetime::External); if (source.resource.getType() != externalType) { @@ -154,7 +164,8 @@ struct ConvertTensorExportOp // Export (stream resource to buffer view). 
rewriter.replaceOpWithNewOp( op, targetType, exportSource, TypeAttr::get(sourceType), - adaptor.getSourceDims(), source.resourceSize, executionAffinityAttr); + flattenValues(adaptor.getSourceDims()), source.resourceSize, + executionAffinityAttr); return success(); } }; @@ -174,19 +185,21 @@ struct ConvertTensorAliasOp : public AffinityOpConversionPattern { using AffinityOpConversionPattern::AffinityOpConversionPattern; LogicalResult matchAndRewriteOnAffinity( - IREE::HAL::TensorAliasOp op, OpAdaptor adaptor, + IREE::HAL::TensorAliasOp op, OneToNOpAdaptor adaptor, IREE::Stream::AffinityAttr executionAffinityAttr, ConversionPatternRewriter &rewriter) const override { auto sourceType = op.getSource().getType(); auto source = - transferTensorOperand(op.getLoc(), op.getSource(), adaptor.getSource(), - executionAffinityAttr, rewriter); + transferTensorOperands(op.getLoc(), op.getSource(), adaptor.getSource(), + executionAffinityAttr, rewriter); // Query the target storage buffer length; we will only populate up to // what is required for the output. + SmallVector convertedSourceDims = + flattenValues(adaptor.getSourceDims()); Value storageSize = rewriter.create( op.getLoc(), rewriter.getIndexType(), - TypeAttr::get(op.getSource().getType()), adaptor.getSourceDims(), + TypeAttr::get(op.getSource().getType()), convertedSourceDims, executionAffinityAttr); // Import the target storage as a resource that we can use as an update @@ -195,8 +208,8 @@ struct ConvertTensorAliasOp auto externalType = rewriter.getType( IREE::Stream::Lifetime::External); auto importOp = rewriter.create( - op.getLoc(), externalType, adaptor.getStorage(), - TypeAttr::get(sourceType), adaptor.getSourceDims(), storageSize, + op.getLoc(), externalType, adaptor.getStorage().front(), + TypeAttr::get(sourceType), convertedSourceDims, storageSize, executionAffinityAttr); // Await the fence, if needed. 
When not specified the storage is assumed to @@ -235,7 +248,7 @@ struct ConvertTensorAliasOp op.getLoc(), source.resource.getType(), result, source.resourceSize, source.resourceSize, executionAffinityAttr, executionAffinityAttr); } - rewriter.replaceOp(op, result); + rewriter.replaceOpWithMultiple(op, {{result, source.resourceSize}}); return success(); } @@ -254,20 +267,22 @@ struct ConvertTensorBarrierOp : public AffinityAwareConversionPattern { using AffinityAwareConversionPattern::AffinityAwareConversionPattern; LogicalResult - matchAndRewrite(IREE::HAL::TensorBarrierOp op, OpAdaptor adaptor, + matchAndRewrite(IREE::HAL::TensorBarrierOp op, OneToNOpAdaptor adaptor, ConversionPatternRewriter &rewriter) const override { auto timepointType = rewriter.getType(); IREE::Stream::AffinityAttr anyAffinityAttr; SmallVector signaledResources; + SmallVector signaledResourceSizes; SmallVector signaledTimepoints; for (auto [sourceTensor, sourceResource] : llvm::zip_equal(op.getSources(), adaptor.getSources())) { - auto source = resolveTensorOperand(op.getLoc(), sourceTensor, - sourceResource, rewriter); + auto source = resolveTensorOperands(op.getLoc(), sourceTensor, + sourceResource, rewriter); auto barrierOp = rewriter.create( - sourceResource.getLoc(), source.resource.getType(), timepointType, - source.resource, source.resourceSize, source.affinity); + sourceResource.front().getLoc(), source.resource.getType(), + timepointType, source.resource, source.resourceSize, source.affinity); signaledResources.push_back(barrierOp.getResult()); + signaledResourceSizes.push_back(source.resourceSize); signaledTimepoints.push_back(barrierOp.getResultTimepoint()); // When joining from multiple affinities we need to pick one to perform @@ -283,7 +298,8 @@ struct ConvertTensorBarrierOp rewriter.create( op.getLoc(), joinedTimepoint, ValueRange{adaptor.getSignalFence()}, anyAffinityAttr); - rewriter.replaceOp(op, signaledResources); + replaceOpWithMultiple(op, signaledResources, 
signaledResourceSizes, + rewriter); return success(); } }; diff --git a/compiler/src/iree/compiler/Dialect/Stream/Conversion/PatternUtils.cpp b/compiler/src/iree/compiler/Dialect/Stream/Conversion/PatternUtils.cpp index fee06f2df4cb..45122452d64b 100644 --- a/compiler/src/iree/compiler/Dialect/Stream/Conversion/PatternUtils.cpp +++ b/compiler/src/iree/compiler/Dialect/Stream/Conversion/PatternUtils.cpp @@ -44,73 +44,25 @@ tryLookupResultAffinity(Value value, return affinityAnalysis->lookupResourceAffinity(value); } -static std::pair -resolveTensorOperand(Location loc, Value convertedOperand, OpBuilder &builder) { - auto operandType = convertedOperand.getType(); - if (llvm::isa(operandType)) { - // Prior to https://reviews.llvm.org/D111620 this is the path we'd take; - // the tensor operands would be remapped into their new resource types. - // This is still possible during rewriting if we ourselves produce a new - // resource type, but the automatic materialization will go down the - // unrealized_conversion_cast path below. - return std::make_pair(convertedOperand, - builder.createOrFold( - loc, builder.getIndexType(), convertedOperand)); - } else if (auto castOp = - convertedOperand - .getDefiningOp()) { - // We only have a single tensor type conversion and it expands to (resource, - // size) so that's all we look for here. 
- assert(castOp.getNumOperands() == 2 && "expected (resource, size)"); - return std::make_pair(castOp.getOperand(0), castOp.getOperand(1)); - } - assert(false && - "unexpected operand; expected either a IREE::Stream::ResourceType or " - "the result of a mlir::UnrealizedConversionCastOp"); - return std::make_pair(Value{}, Value{}); -} - -void expandResourceOperand(Location loc, Value operand, - SmallVectorImpl &newOperands, - OpBuilder &builder) { - if (llvm::isa(operand.getType())) { - auto [resource, resourceSize] = resolveTensorOperand(loc, operand, builder); - newOperands.push_back(resource); - newOperands.push_back(resourceSize); - } else if (llvm::isa(operand.getType())) { - newOperands.push_back(operand); - newOperands.push_back( - builder.createOrFold(loc, operand)); - } else { - newOperands.push_back(operand); - } -} - -SmallVector expandResourceOperands(Location loc, ValueRange operands, - ConversionPatternRewriter &rewriter) { - SmallVector expandedOperands; - expandedOperands.reserve(operands.size()); - for (auto operand : operands) { - expandResourceOperand(loc, operand, expandedOperands, rewriter); - } - return expandedOperands; -} - -ConvertedTensor resolveTensorOperand( - Location loc, Value originalOperand, Value convertedOperand, +ConvertedTensor resolveTensorOperands( + Location loc, Value originalOperand, ValueRange convertedOperand, IREE::Stream::AffinityAnalysis *affinityAnalysis, OpBuilder &builder) { - auto [resource, resourceSize] = - resolveTensorOperand(loc, convertedOperand, builder); + assert(convertedOperand.size() == 2 && + "expected tensor operands to be converted to `!stream.resource<*>, " + "index`"); auto affinityAttr = affinityAnalysis->lookupResourceAffinity(originalOperand); - return {affinityAttr, resource, resourceSize}; + return {affinityAttr, convertedOperand[0], convertedOperand[1]}; } -ConvertedTensor transferTensorOperand( - Location loc, Value originalOperand, Value convertedOperand, +ConvertedTensor 
transferTensorOperands( + Location loc, Value originalOperand, ValueRange convertedOperand, IREE::Stream::AffinityAttr requiredAffinityAttr, IREE::Stream::AffinityAnalysis *affinityAnalysis, OpBuilder &builder) { - auto [resource, resourceSize] = - resolveTensorOperand(loc, convertedOperand, builder); + assert(convertedOperand.size() == 2 && + "expected tensor operands to be converted to `!stream.resource<*>, " + "index`"); + Value resource = convertedOperand[0]; + Value resourceSize = convertedOperand[1]; auto affinityAttr = affinityAnalysis->lookupResourceAffinity(originalOperand); if (affinityAttr != requiredAffinityAttr) { resource = builder.create( @@ -120,4 +72,25 @@ ConvertedTensor transferTensorOperand( return {requiredAffinityAttr, resource, resourceSize}; } +void replaceOpWithMultiple(Operation *op, + ArrayRef> replacements, + ConversionPatternRewriter &rewriter) { + auto r = llvm::map_to_vector( + replacements, [](ArrayRef v) -> ValueRange { return v; }); + rewriter.replaceOpWithMultiple(op, r); +} + +void replaceOpWithMultiple(Operation *op, ValueRange resources, + ValueRange sizes, + ConversionPatternRewriter &rewriter) { + SmallVector> replacements = llvm::map_to_vector( + llvm::zip_equal(resources, sizes), [](auto it) -> SmallVector { + if (std::get<1>(it)) { + return {std::get<0>(it), std::get<1>(it)}; + } + return {std::get<0>(it)}; + }); + replaceOpWithMultiple(op, replacements, rewriter); +} + } // namespace mlir::iree_compiler diff --git a/compiler/src/iree/compiler/Dialect/Stream/Conversion/PatternUtils.h b/compiler/src/iree/compiler/Dialect/Stream/Conversion/PatternUtils.h index 43cfbb073494..774b7f65b9d2 100644 --- a/compiler/src/iree/compiler/Dialect/Stream/Conversion/PatternUtils.h +++ b/compiler/src/iree/compiler/Dialect/Stream/Conversion/PatternUtils.h @@ -42,18 +42,11 @@ struct ConvertedTensor { Value resourceSize; }; -void expandResourceOperand(Location loc, Value convertedOperand, - SmallVectorImpl &newOperands, - OpBuilder &builder); 
-SmallVector expandResourceOperands(Location loc, - ValueRange convertedOperands, - ConversionPatternRewriter &rewriter); - -ConvertedTensor resolveTensorOperand( - Location loc, Value originalOperand, Value convertedOperand, +ConvertedTensor resolveTensorOperands( + Location loc, Value originalOperand, ValueRange convertedOperand, IREE::Stream::AffinityAnalysis *affinityAnalysis, OpBuilder &builder); -ConvertedTensor transferTensorOperand( - Location loc, Value originalOperand, Value convertedOperand, +ConvertedTensor transferTensorOperands( + Location loc, Value originalOperand, ValueRange convertedOperand, IREE::Stream::AffinityAttr requiredAffinityAttr, IREE::Stream::AffinityAnalysis *affinityAnalysis, OpBuilder &builder); @@ -72,19 +65,19 @@ struct AffinityAwareConversionPattern : public OpConversionPattern { } protected: - ConvertedTensor resolveTensorOperand(Location loc, Value originalOperand, - Value convertedOperand, - OpBuilder &builder) const { - return mlir::iree_compiler::resolveTensorOperand( + ConvertedTensor resolveTensorOperands(Location loc, Value originalOperand, + ValueRange convertedOperand, + OpBuilder &builder) const { + return mlir::iree_compiler::resolveTensorOperands( loc, originalOperand, convertedOperand, affinityAnalysis, builder); } ConvertedTensor - transferTensorOperand(Location loc, Value originalOperand, - Value convertedOperand, - IREE::Stream::AffinityAttr requiredAffinityAttr, - OpBuilder &builder) const { - return mlir::iree_compiler::transferTensorOperand( + transferTensorOperands(Location loc, Value originalOperand, + ValueRange convertedOperand, + IREE::Stream::AffinityAttr requiredAffinityAttr, + OpBuilder &builder) const { + return mlir::iree_compiler::transferTensorOperands( loc, originalOperand, convertedOperand, requiredAffinityAttr, affinityAnalysis, builder); } @@ -110,13 +103,14 @@ struct AffinityOpConversionPattern protected: virtual LogicalResult matchAndRewriteOnAffinity( - OpT op, typename 
OpConversionPattern::OpAdaptor adaptor, + OpT op, typename OpConversionPattern::OneToNOpAdaptor adaptor, IREE::Stream::AffinityAttr executionAffinityAttr, ConversionPatternRewriter &rewriter) const = 0; private: LogicalResult - matchAndRewrite(OpT op, typename OpConversionPattern::OpAdaptor adaptor, + matchAndRewrite(OpT op, + typename OpConversionPattern::OneToNOpAdaptor adaptor, ConversionPatternRewriter &rewriter) const override final { auto executionAffinityAttr = tryLookupExecutionAffinity(op, this->getAffinityAnalysis()); @@ -125,6 +119,13 @@ struct AffinityOpConversionPattern } }; +void replaceOpWithMultiple(Operation *op, + ArrayRef> replacements, + ConversionPatternRewriter &rewriter); +void replaceOpWithMultiple(Operation *op, ValueRange resources, + ValueRange sizes, + ConversionPatternRewriter &rewriter); + } // namespace mlir::iree_compiler #endif // IREE_COMPILER_DIALECT_STREAM_CONVERSION_PATTERN_UTILS_H_ diff --git a/compiler/src/iree/compiler/Dialect/Stream/Conversion/StandardToStream/Patterns.cpp b/compiler/src/iree/compiler/Dialect/Stream/Conversion/StandardToStream/Patterns.cpp index 9924fd2edf1c..ce51aad16c06 100644 --- a/compiler/src/iree/compiler/Dialect/Stream/Conversion/StandardToStream/Patterns.cpp +++ b/compiler/src/iree/compiler/Dialect/Stream/Conversion/StandardToStream/Patterns.cpp @@ -29,11 +29,19 @@ namespace mlir::iree_compiler { namespace { +/// Flatten the given value ranges into a single vector of values. 
+static SmallVector flattenValues(ArrayRef values) { + SmallVector result; + for (const auto &vals : values) + llvm::append_range(result, vals); + return result; +} + struct ConvertTensorConstantOp : public AffinityOpConversionPattern { using AffinityOpConversionPattern::AffinityOpConversionPattern; LogicalResult matchAndRewriteOnAffinity( - arith::ConstantOp constantOp, OpAdaptor adaptor, + arith::ConstantOp constantOp, OneToNOpAdaptor adaptor, IREE::Stream::AffinityAttr executionAffinityAttr, ConversionPatternRewriter &rewriter) const override { // Only handle tensor types - other arith.constant types (like i32) are @@ -53,10 +61,13 @@ struct ConvertTensorConstantOp auto unknownType = rewriter.getType(); auto constantSize = rewriter.createOrFold( constantOp.getLoc(), rewriter.getIndexType(), newOp.getResult()); - rewriter.replaceOpWithNewOp( - constantOp, unknownType, newOp.getResult(), constantSize, constantSize, + auto transferOp = rewriter.create( + constantOp.getLoc(), unknownType, newOp.getResult(), constantSize, + constantSize, /*source_affinity=*/executionAffinityAttr, /*result_affinity=*/executionAffinityAttr); + rewriter.replaceOpWithMultiple(constantOp, + {{transferOp.getResult(), constantSize}}); return success(); } }; @@ -65,13 +76,11 @@ struct BranchOpConversion : public AffinityAwareConversionPattern { using AffinityAwareConversionPattern::AffinityAwareConversionPattern; LogicalResult - matchAndRewrite(mlir::cf::BranchOp op, OpAdaptor adaptor, + matchAndRewrite(mlir::cf::BranchOp op, OneToNOpAdaptor adaptor, ConversionPatternRewriter &rewriter) const override { // Expand any resource operands to resource + size. 
- auto expandedOperands = expandResourceOperands( - op.getLoc(), adaptor.getDestOperands(), rewriter); - rewriter.replaceOpWithNewOp(op, op.getDest(), - expandedOperands); + rewriter.replaceOpWithNewOp( + op, op.getDest(), flattenValues(adaptor.getOperands())); return success(); } }; @@ -80,15 +89,13 @@ struct CondBranchOpConversion : public AffinityAwareConversionPattern { using AffinityAwareConversionPattern::AffinityAwareConversionPattern; LogicalResult - matchAndRewrite(mlir::cf::CondBranchOp op, OpAdaptor adaptor, + matchAndRewrite(mlir::cf::CondBranchOp op, OneToNOpAdaptor adaptor, ConversionPatternRewriter &rewriter) const override { // Expand any resource operands to resource + size. - auto trueDestOperands = expandResourceOperands( - op.getLoc(), adaptor.getTrueDestOperands(), rewriter); - auto falseDestOperands = expandResourceOperands( - op.getLoc(), adaptor.getFalseDestOperands(), rewriter); + auto trueDestOperands = flattenValues(adaptor.getTrueDestOperands()); + auto falseDestOperands = flattenValues(adaptor.getFalseDestOperands()); rewriter.replaceOpWithNewOp( - op, adaptor.getCondition(), op.getTrueDest(), trueDestOperands, + op, adaptor.getCondition().front(), op.getTrueDest(), trueDestOperands, op.getFalseDest(), falseDestOperands); return success(); } @@ -100,18 +107,17 @@ struct SwitchOpConversion : public AffinityAwareConversionPattern { using AffinityAwareConversionPattern::AffinityAwareConversionPattern; LogicalResult - matchAndRewrite(mlir::cf::SwitchOp op, OpAdaptor adaptor, + matchAndRewrite(mlir::cf::SwitchOp op, OneToNOpAdaptor adaptor, ConversionPatternRewriter &rewriter) const override { // Expand any resource operands to resource + size. 
- auto defaultOperands = expandResourceOperands( - op.getLoc(), adaptor.getDefaultOperands(), rewriter); - auto caseOperands = llvm::to_vector( - llvm::map_range(adaptor.getCaseOperands(), [&](ValueRange operands) { - return expandResourceOperands(op.getLoc(), operands, rewriter); + auto defaultOperands = flattenValues(adaptor.getDefaultOperands()); + auto caseOperands = llvm::to_vector(llvm::map_range( + adaptor.getCaseOperands(), [&](ArrayRef operands) { + return flattenValues(operands); })); rewriter.replaceOpWithNewOp( - op, adaptor.getFlag(), op.getDefaultDestination(), defaultOperands, - op.getCaseValuesAttr(), op.getCaseDestinations(), + op, adaptor.getFlag().front(), op.getDefaultDestination(), + defaultOperands, op.getCaseValuesAttr(), op.getCaseDestinations(), llvm::to_vector(llvm::map_range(caseOperands, asValueRange))); return success(); } @@ -121,24 +127,23 @@ struct SelectOpConversion : public AffinityAwareConversionPattern { using AffinityAwareConversionPattern::AffinityAwareConversionPattern; LogicalResult - matchAndRewrite(mlir::arith::SelectOp op, OpAdaptor adaptor, + matchAndRewrite(mlir::arith::SelectOp op, OneToNOpAdaptor adaptor, ConversionPatternRewriter &rewriter) const override { // Only handle selects where the operands are tensors (resources). 
if (!llvm::isa(op.getTrueValue().getType())) return failure(); - auto trueOperand = resolveTensorOperand(op.getLoc(), op.getTrueValue(), - adaptor.getTrueValue(), rewriter); - auto falseOperand = resolveTensorOperand(op.getLoc(), op.getFalseValue(), - adaptor.getFalseValue(), rewriter); + auto trueOperand = resolveTensorOperands(op.getLoc(), op.getTrueValue(), + adaptor.getTrueValue(), rewriter); + auto falseOperand = resolveTensorOperands( + op.getLoc(), op.getFalseValue(), adaptor.getFalseValue(), rewriter); auto resourceSelectOp = rewriter.create( - op.getLoc(), adaptor.getCondition(), trueOperand.resource, + op.getLoc(), adaptor.getCondition().front(), trueOperand.resource, falseOperand.resource); auto sizeSelectOp = rewriter.create( - op.getLoc(), adaptor.getCondition(), trueOperand.resourceSize, + op.getLoc(), adaptor.getCondition().front(), trueOperand.resourceSize, falseOperand.resourceSize); - rewriter.replaceOpWithNewOp( - op, adaptor.getTrueValue().getType(), - ValueRange{resourceSelectOp.getResult(), sizeSelectOp.getResult()}); + rewriter.replaceOpWithMultiple(op, {ValueRange{resourceSelectOp.getResult(), + sizeSelectOp.getResult()}}); return success(); } }; @@ -186,21 +191,19 @@ struct ScfIfOpConversion // Tie all resource results together so we end up with 1:1 results with the // original op. 
SmallVector results; + SmallVector resultSizes; for (auto result : resultMap) { if (llvm::isa(result.newType)) { - auto oldType = op.getResult(result.originalIndex).getType(); auto resource = ifOp.getResult(result.newIndex + 0); auto resourceSize = ifOp.getResult(result.newIndex + 1); - results.push_back(rewriter - .create( - op.getLoc(), TypeRange{oldType}, - ValueRange{resource, resourceSize}) - .getResult(0)); + results.push_back(resource); + resultSizes.push_back(resourceSize); } else { results.push_back(ifOp.getResult(result.newIndex)); + resultSizes.push_back(nullptr); } } - rewriter.replaceOp(op, results); + replaceOpWithMultiple(op, results, resultSizes, rewriter); return success(); } }; @@ -209,13 +212,12 @@ struct ScfForOpConversion : public AffinityAwareConversionPattern { using AffinityAwareConversionPattern::AffinityAwareConversionPattern; LogicalResult - matchAndRewrite(mlir::scf::ForOp op, OpAdaptor adaptor, + matchAndRewrite(mlir::scf::ForOp op, OneToNOpAdaptor adaptor, ConversionPatternRewriter &rewriter) const override { auto &typeConverter = *getTypeConverter(); // Expand any resource operands to resource + size. - auto expandedOperands = - expandResourceOperands(op.getLoc(), adaptor.getInitArgs(), rewriter); + auto expandedOperands = flattenValues(adaptor.getInitArgs()); // Expand any resource results to resource + size. SmallVector expandedTypes; @@ -250,8 +252,9 @@ struct ScfForOpConversion // expanded output results. We can't directly replace the original loop as // the result counts differ. auto forOp = rewriter.create( - op.getLoc(), adaptor.getLowerBound(), adaptor.getUpperBound(), - adaptor.getStep(), expandedOperands); + op.getLoc(), adaptor.getLowerBound().front(), + adaptor.getUpperBound().front(), adaptor.getStep().front(), + expandedOperands); // Inline the block and update the block arguments. 
rewriter.eraseBlock(forOp.getBody()); @@ -265,21 +268,19 @@ struct ScfForOpConversion // Tie all resource results together so we end up with 1:1 results with the // original op. SmallVector results; + SmallVector resultSizes; for (auto result : resultMap) { if (llvm::isa(result.newType)) { - auto oldType = op.getResult(result.originalIndex).getType(); auto resource = forOp.getResult(result.newIndex + 0); auto resourceSize = forOp.getResult(result.newIndex + 1); - results.push_back(rewriter - .create( - op.getLoc(), TypeRange{oldType}, - ValueRange{resource, resourceSize}) - .getResult(0)); + results.push_back(resource); + resultSizes.push_back(resourceSize); } else { results.push_back(forOp.getResult(result.newIndex)); + resultSizes.push_back(nullptr); } } - rewriter.replaceOp(op, results); + replaceOpWithMultiple(op, results, resultSizes, rewriter); return success(); } }; @@ -288,13 +289,12 @@ struct ScfWhileOpConversion : public AffinityAwareConversionPattern { using AffinityAwareConversionPattern::AffinityAwareConversionPattern; LogicalResult - matchAndRewrite(mlir::scf::WhileOp op, OpAdaptor adaptor, + matchAndRewrite(mlir::scf::WhileOp op, OneToNOpAdaptor adaptor, ConversionPatternRewriter &rewriter) const override { auto &typeConverter = *getTypeConverter(); // Expand any resource operands to resource + size. - auto expandedOperands = - expandResourceOperands(op.getLoc(), adaptor.getOperands(), rewriter); + auto expandedOperands = flattenValues(adaptor.getOperands()); // Expand any resource results to resource + size. SmallVector expandedTypes; @@ -351,21 +351,19 @@ struct ScfWhileOpConversion // Tie all resource results together so we end up with 1:1 results with the // original op. 
SmallVector results; + SmallVector resultSizes; for (auto result : resultMap) { if (llvm::isa(result.newType)) { - auto oldType = op.getResult(result.originalIndex).getType(); auto resource = whileOp.getResult(result.newIndex + 0); auto resourceSize = whileOp.getResult(result.newIndex + 1); - results.push_back(rewriter - .create( - op.getLoc(), TypeRange{oldType}, - ValueRange{resource, resourceSize}) - .getResult(0)); + results.push_back(resource); + resultSizes.push_back(resourceSize); } else { results.push_back(whileOp.getResult(result.newIndex)); + resultSizes.push_back(nullptr); } } - rewriter.replaceOp(op, results); + replaceOpWithMultiple(op, results, resultSizes, rewriter); return success(); } }; @@ -374,13 +372,12 @@ struct ScfConditionOpConversion : public AffinityAwareConversionPattern { using AffinityAwareConversionPattern::AffinityAwareConversionPattern; LogicalResult - matchAndRewrite(mlir::scf::ConditionOp op, OpAdaptor adaptor, + matchAndRewrite(mlir::scf::ConditionOp op, OneToNOpAdaptor adaptor, ConversionPatternRewriter &rewriter) const override { // Expand any resource operands to resource + size. - auto expandedOperands = - expandResourceOperands(op.getLoc(), adaptor.getArgs(), rewriter); + auto expandedOperands = flattenValues(adaptor.getArgs()); rewriter.replaceOpWithNewOp( - op, adaptor.getCondition(), expandedOperands); + op, adaptor.getCondition().front(), expandedOperands); return success(); } }; @@ -389,11 +386,10 @@ struct ScfYieldOpConversion : public AffinityAwareConversionPattern { using AffinityAwareConversionPattern::AffinityAwareConversionPattern; LogicalResult - matchAndRewrite(mlir::scf::YieldOp op, OpAdaptor adaptor, + matchAndRewrite(mlir::scf::YieldOp op, OneToNOpAdaptor adaptor, ConversionPatternRewriter &rewriter) const override { // Expand any resource operands to resource + size. 
- auto expandedOperands = - expandResourceOperands(op.getLoc(), adaptor.getOperands(), rewriter); + auto expandedOperands = flattenValues(adaptor.getOperands()); rewriter.replaceOpWithNewOp(op, expandedOperands); return success(); } diff --git a/compiler/src/iree/compiler/Dialect/Stream/Conversion/UtilToStream/Patterns.cpp b/compiler/src/iree/compiler/Dialect/Stream/Conversion/UtilToStream/Patterns.cpp index 35e1ca8760a8..b7c24d4b1820 100644 --- a/compiler/src/iree/compiler/Dialect/Stream/Conversion/UtilToStream/Patterns.cpp +++ b/compiler/src/iree/compiler/Dialect/Stream/Conversion/UtilToStream/Patterns.cpp @@ -19,6 +19,14 @@ namespace mlir::iree_compiler { namespace { +/// Flatten the given value ranges into a single vector of values. +static SmallVector flattenValues(ArrayRef values) { + SmallVector result; + for (const auto &vals : values) + llvm::append_range(result, vals); + return result; +} + //===----------------------------------------------------------------------===// // Structural ops //===----------------------------------------------------------------------===// @@ -71,7 +79,7 @@ struct CallOpConversion : public AffinityAwareConversionPattern { using AffinityAwareConversionPattern::AffinityAwareConversionPattern; LogicalResult - matchAndRewrite(IREE::Util::CallOp op, OpAdaptor adaptor, + matchAndRewrite(IREE::Util::CallOp op, OneToNOpAdaptor adaptor, ConversionPatternRewriter &rewriter) const override { // Create a new call that takes the expanded input operands and returns the // expanded output results. 
We can't directly replace the original call as @@ -85,9 +93,9 @@ struct CallOpConversion bool anyFailed = false; auto callOp = op.cloneAndExpand( [&](unsigned i, Value operand, SmallVectorImpl &newOperands) { - auto adaptorOperand = adaptor.getOperands()[i]; - expandResourceOperand(op.getLoc(), adaptorOperand, newOperands, - rewriter); + SmallVector appendNewOperands = + flattenValues(adaptor.getOperands()[i]); + newOperands.append(appendNewOperands); }, [&](unsigned i, Type type, SmallVectorImpl &newTypes) { size_t newIndex = newTypes.size(); @@ -103,21 +111,19 @@ struct CallOpConversion // Tie all resource results together so we end up with 1:1 results with the // original op. SmallVector results; + SmallVector resourceSizes; for (auto result : resultMap) { if (llvm::isa(result.newType)) { - auto oldType = op.getResult(result.originalIndex).getType(); auto resource = callOp.getResult(result.newIndex + 0); auto resourceSize = callOp.getResult(result.newIndex + 1); - results.push_back(rewriter - .create( - op.getLoc(), TypeRange{oldType}, - ValueRange{resource, resourceSize}) - .getResult(0)); + results.push_back(resource); + resourceSizes.push_back(resourceSize); } else { results.push_back(callOp.getResult(result.newIndex)); + resourceSizes.push_back(nullptr); } } - rewriter.replaceOp(op, results); + replaceOpWithMultiple(op, results, resourceSizes, rewriter); return success(); } @@ -127,11 +133,10 @@ struct ReturnOpConversion : public AffinityAwareConversionPattern { using AffinityAwareConversionPattern::AffinityAwareConversionPattern; LogicalResult - matchAndRewrite(IREE::Util::ReturnOp op, OpAdaptor adaptor, + matchAndRewrite(IREE::Util::ReturnOp op, OneToNOpAdaptor adaptor, ConversionPatternRewriter &rewriter) const override { // Expand any resource operands to resource + size. 
- auto expandedOperands = - expandResourceOperands(op.getLoc(), adaptor.getOperands(), rewriter); + auto expandedOperands = flattenValues(adaptor.getOperands()); rewriter.replaceOpWithNewOp(op, expandedOperands); return success(); } @@ -312,11 +317,12 @@ struct GlobalLoadOpExpansion loadOp.getLoc(), rewriter.getIndexType(), expandedGlobal.resourceSizeOp.getSymName()) .getResult(); - rewriter.replaceOpWithNewOp( - loadOp, unknownType, resource, resourceSize, resourceSize, + auto transferOp = rewriter.create( + loadOp.getLoc(), unknownType, resource, resourceSize, resourceSize, /*source_affinity=*/expandedGlobal.affinityAttr, /*result_affinity=*/expandedGlobal.affinityAttr); - + rewriter.replaceOpWithMultiple(loadOp, + {{transferOp.getResult(), resourceSize}}); return success(); } }; @@ -325,7 +331,7 @@ struct GlobalStoreOpExpansion : public BaseGlobalConversionPattern { using BaseGlobalConversionPattern::BaseGlobalConversionPattern; LogicalResult - matchAndRewrite(IREE::Util::GlobalStoreOp storeOp, OpAdaptor adaptor, + matchAndRewrite(IREE::Util::GlobalStoreOp storeOp, OneToNOpAdaptor adaptor, ConversionPatternRewriter &rewriter) const override { // Only apply to expanded types (tensors/etc). if (!isExpandedType(storeOp.getValue().getType())) @@ -341,8 +347,8 @@ struct GlobalStoreOpExpansion // Insert a transfer/store to the global with unknown lifetime. Lifetime // refinement will make this go away if possible. 
auto value = - resolveTensorOperand(storeOp.getLoc(), storeOp.getValue(), - adaptor.getValue(), affinityAnalysis, rewriter); + resolveTensorOperands(storeOp.getLoc(), storeOp.getValue(), + adaptor.getValue(), affinityAnalysis, rewriter); assert(expandedGlobal.resourceOp && "Missing resource op"); auto transferOp = rewriter.create( storeOp.getLoc(), expandedGlobal.resourceOp.getType(), value.resource, @@ -364,21 +370,27 @@ struct OptimizationBarrierOpConversion : public AffinityAwareConversionPattern { using AffinityAwareConversionPattern::AffinityAwareConversionPattern; LogicalResult - matchAndRewrite(IREE::Util::OptimizationBarrierOp op, OpAdaptor adaptor, + matchAndRewrite(IREE::Util::OptimizationBarrierOp op, OneToNOpAdaptor adaptor, ConversionPatternRewriter &rewriter) const override { SmallVector newOperands; + SmallVector operandSizes; for (auto [originalOperand, convertedOperand] : llvm::zip_equal(op.getOperands(), adaptor.getOperands())) { - if (isa(convertedOperand.getType())) { - newOperands.push_back(resolveTensorOperand(op.getLoc(), originalOperand, - convertedOperand, rewriter) - .resource); + if (isa(originalOperand.getType())) { + auto tensorOperands = resolveTensorOperands( + op.getLoc(), originalOperand, convertedOperand, rewriter); + newOperands.push_back(tensorOperands.resource); + operandSizes.push_back(tensorOperands.resourceSize); } else { - newOperands.push_back(convertedOperand); + assert(convertedOperand.size() == 1 && + "all non-tensor type expected to have a 1-1 conversion"); + newOperands.push_back(convertedOperand.front()); + operandSizes.push_back(nullptr); } } - rewriter.replaceOpWithNewOp(op, - newOperands); + auto barrierOp = rewriter.create( + op.getLoc(), newOperands); + replaceOpWithMultiple(op, barrierOp->getResults(), operandSizes, rewriter); return success(); } }; diff --git a/compiler/src/iree/compiler/Dialect/Stream/Conversion/UtilToStream/test/compiler_hints.mlir 
b/compiler/src/iree/compiler/Dialect/Stream/Conversion/UtilToStream/test/compiler_hints.mlir index c778fbf1e502..7c178e503924 100644 --- a/compiler/src/iree/compiler/Dialect/Stream/Conversion/UtilToStream/test/compiler_hints.mlir +++ b/compiler/src/iree/compiler/Dialect/Stream/Conversion/UtilToStream/test/compiler_hints.mlir @@ -3,9 +3,9 @@ // CHECK-LABEL: @optimizationBarrier util.func public @optimizationBarrier(%arg0: tensor) -> tensor { // CHECK-SAME: %[[ARG0:.+]]: !stream.resource<*> + // CHECK-SAME: %[[ARG1:.+]]: index // CHECK: %[[RESOURCE:.*]] = util.optimization_barrier %[[ARG0]] - // CHECK: %[[SIZE:.*]] = stream.resource.size %[[RESOURCE]] : !stream.resource<*> - // CHECK: util.return %[[RESOURCE]], %[[SIZE]] : !stream.resource<*>, index + // CHECK: util.return %[[RESOURCE]], %[[ARG1]] : !stream.resource<*>, index %0 = util.optimization_barrier %arg0 : tensor util.return %0 : tensor } diff --git a/compiler/src/iree/compiler/Dialect/Stream/Transforms/ConvertToStream.cpp b/compiler/src/iree/compiler/Dialect/Stream/Transforms/ConvertToStream.cpp index 0da7d95f486d..501cbb83fbbb 100644 --- a/compiler/src/iree/compiler/Dialect/Stream/Transforms/ConvertToStream.cpp +++ b/compiler/src/iree/compiler/Dialect/Stream/Transforms/ConvertToStream.cpp @@ -68,7 +68,7 @@ struct GenericResourcePattern : public ConversionPattern { affinityAnalysis(affinityAnalysis) {} LogicalResult - matchAndRewrite(Operation *op, ArrayRef operands, + matchAndRewrite(Operation *op, ArrayRef operands, ConversionPatternRewriter &rewriter) const override { if (!doesOperationNeedWrapping(op)) { return failure(); @@ -80,10 +80,10 @@ struct GenericResourcePattern : public ConversionPattern { SmallVector newOperands; newOperands.reserve(op->getNumOperands()); rewriter.setInsertionPoint(op); - for (auto [oldOperand, newOperand] : + for (auto [oldOperand, convertedOperands] : llvm::zip_equal(op->getOperands(), operands)) { - if (!isa(newOperand.getType())) { - newOperands.push_back(newOperand); + if 
(!isa(oldOperand.getType())) { + newOperands.push_back(convertedOperands.front()); continue; } auto tensorType = dyn_cast(oldOperand.getType()); @@ -94,7 +94,7 @@ struct GenericResourcePattern : public ConversionPattern { auto dynamicDims = IREE::Util::buildDynamicDimsForValue( op->getLoc(), oldOperand, rewriter); newOperands.push_back(buildTensorExportOp( - op->getLoc(), oldOperand, newOperand, tensorType, dynamicDims, + op->getLoc(), oldOperand, convertedOperands, tensorType, dynamicDims, exportAffinityAttr ? exportAffinityAttr : executionAffinityAttr, rewriter)); } @@ -127,13 +127,13 @@ struct GenericResourcePattern : public ConversionPattern { // Builds a stream.tensor.export op that exports a stream resource into an // external tensor value. Value buildTensorExportOp(Location loc, Value originalValue, - Value convertedValue, TensorType targetType, + ValueRange convertedValue, TensorType targetType, ValueRange dynamicDims, IREE::Stream::AffinityAttr executionAffinityAttr, OpBuilder &builder) const { - auto source = - transferTensorOperand(loc, originalValue, convertedValue, - executionAffinityAttr, affinityAnalysis, builder); + auto source = transferTensorOperands(loc, originalValue, convertedValue, + executionAffinityAttr, + affinityAnalysis, builder); // If needed insert a transfer to external resource lifetime. 
auto externalType = builder.getType( diff --git a/compiler/src/iree/compiler/Dialect/Stream/Transforms/test/convert_to_stream.mlir b/compiler/src/iree/compiler/Dialect/Stream/Transforms/test/convert_to_stream.mlir index fd68d30bc5f6..8815f6103f78 100644 --- a/compiler/src/iree/compiler/Dialect/Stream/Transforms/test/convert_to_stream.mlir +++ b/compiler/src/iree/compiler/Dialect/Stream/Transforms/test/convert_to_stream.mlir @@ -130,8 +130,7 @@ util.func public @while_test() { // CHECK: %[[INITIAL_DNO:.+]] = util.optimization_barrier %[[INITIAL]] : !stream.resource<*> %0 = util.optimization_barrier %cst : tensor - // CHECK: %[[VAR_SIZE:.+]] = stream.resource.size %[[INITIAL_DNO]] : !stream.resource<*> - // CHECK: cf.br ^bb1(%[[INITIAL_DNO]], %[[VAR_SIZE]] : !stream.resource<*>, index) + // CHECK: cf.br ^bb1(%[[INITIAL_DNO]], %[[CONSTANT_SIZE]] : !stream.resource<*>, index) cf.br ^bb1(%0 : tensor) // CHECK: ^bb1(%[[BB1_ARG:.+]]: !stream.resource<*>, %[[BB1_ARG_SIZE:.+]]: index): diff --git a/third_party/llvm-project b/third_party/llvm-project index ccdbcf948ba2..078c7bb5c927 160000 --- a/third_party/llvm-project +++ b/third_party/llvm-project @@ -1 +1 @@ -Subproject commit ccdbcf948ba24cfc80860e9a0256eb343f3373da +Subproject commit 078c7bb5c927ab1596d8a508e0b70d5140e59669 From 78ea0adb419b41e8050adcd5cbf8d8b4595dab5d Mon Sep 17 00:00:00 2001 From: Rob Suderman Date: Tue, 17 Dec 2024 15:24:36 -0800 Subject: [PATCH 33/64] Bump to use the flash attention variant (#19505) Signed-off-by: Rob Suderman --- .github/workflows/pkgci_test_sharktank.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/pkgci_test_sharktank.yml b/.github/workflows/pkgci_test_sharktank.yml index a7845498598d..67204a8fcf41 100644 --- a/.github/workflows/pkgci_test_sharktank.yml +++ b/.github/workflows/pkgci_test_sharktank.yml @@ -60,7 +60,7 @@ jobs: uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2 with: repository: iree-org/iree-test-suites 
- ref: c644a9dfc3e5e1a9d071d5e786b79cf612e9b1d3 + ref: dab5402da679e713b94a8bd5c400a51cce0e665a path: iree-test-suites lfs: true - name: Install Sharktank models test suite requirements
From 8ae1b54482043f1d13439d04213c176859211b79 Mon Sep 17 00:00:00 2001
From: Nirvedh Meshram <96096277+nirvedhmeshram@users.noreply.github.com>
Date: Tue, 17 Dec 2024 19:49:18 -0600
Subject: [PATCH 34/64] [GPU] Use padding in IGEMM pipeline to support unaligned to intrinsic shapes (#19484)

This PR does two things:

1. Allow all GEMM shapes to use the padded TileAndFuse Matmul configuration. This is still behind the `iree-codegen-llvmgpu-test-tile-and-fuse-matmul=false` flag by default and does not change the default behavior. However, the following PRs that landed in the past month make it possible to relax the guards we originally had on this:
   https://github.com/iree-org/iree/pull/19196
   https://github.com/iree-org/iree/pull/19307
   https://github.com/llvm/llvm-project/pull/117340
2. Allow fused producers to use the padded TileAndFuse Matmul configuration. The following PRs make this possible now:
   https://github.com/iree-org/iree/pull/19399
   https://github.com/llvm/llvm-project/pull/119039

Together, this allows us to do padded IGEMM with intrinsics for shapes unaligned to the intrinsic, which we use by default. [Here](https://docs.google.com/spreadsheets/d/1O-SdUZCn5pHsxx7JTGjIIdH6PWCFnvlfe4XBbjEBaIM/edit?gid=0#gid=0) is the performance difference observed in the conv cases in iree-kernel-benchmark-module that utilize this change. A median speedup of 2.26x was observed.

The numeric changes I observed when enabling this path were the same as those for any aligned shape when comparing intrinsic vs. no-intrinsic use. Generally, some differences are noticed for narrow types like f16, but they are within a relative error of 0.001. Since our tests use absolute errors, we may have to change some test values to account for this change.
The perf difference in CI seems to be within the noise margin compared to main: https://github.com/iree-org/iree/actions/runs/12323399269/attempts/1#summary-34399247902

---------

Signed-off-by: Nirvedh
---
 .../Dialect/GPU/TargetUtils/ConfigUtils.cpp | 16 +---
 .../iree/compiler/Codegen/LLVMGPU/Passes.cpp | 2 +
 .../ROCDL/config_igemm_tile_and_fuse.mlir | 32 ++++++--
 .../ROCDL/pipeline_igemm_tile_and_fuse.mlir | 80 +++++++++++++++++++
 4 files changed, 112 insertions(+), 18 deletions(-)

diff --git a/compiler/src/iree/compiler/Codegen/Dialect/GPU/TargetUtils/ConfigUtils.cpp b/compiler/src/iree/compiler/Codegen/Dialect/GPU/TargetUtils/ConfigUtils.cpp index 6a0700362ab6..4f4d7784536c 100644 --- a/compiler/src/iree/compiler/Codegen/Dialect/GPU/TargetUtils/ConfigUtils.cpp +++ b/compiler/src/iree/compiler/Codegen/Dialect/GPU/TargetUtils/ConfigUtils.cpp @@ -182,8 +182,7 @@ static FailureOr> getMatmulLoweringConfigAndWorkgroupSize(SmallVector bounds, ArrayRef maps, ArrayRef operands, - IREE::GPU::TargetAttr target, - bool hasFusedLeadingOp) { + IREE::GPU::TargetAttr target) { if (target.getWgp().getMma().empty()) return failure(); @@ -253,13 +252,11 @@ getMatmulLoweringConfigAndWorkgroupSize(SmallVector bounds, std::optional schedule = getMmaScheduleFromProblemAndTarget( target, problem, transposedLhs, transposedRhs); - // TODO (nirvedhmeshram, jerryyin): Support all GEMM types. - // TODO (nirvedhmeshram): Support fused leading op. // TODO (nirvedhmeshram, qedawkins): The performance with this will be bad if // the GEMM is accumulating (i.e doesnt have a zero fill dpsInit) as that // buffer currently gets materialized as private memory. We need to add // missing patterns to fix that.
- if (!schedule && !contractionDims.batch.empty() && !hasFusedLeadingOp) { + if (!schedule) { LDBG("Attempting to deduce unaligned TileAndFuse MMA schedulee"); mustBeAligned = false; doCPromotion = true; @@ -342,9 +339,6 @@ getMatmulLoweringConfigAndWorkgroupSize(SmallVector bounds, } else { // TODO (nirvedhmeshram, Max191, jerryyin) : Add support so that unaligned // shapes do not require c promotion. - // TODO (nirvedhmeshram, jerryyin) : When using c promotion the heuristics - // used during finding a schedule need to be updated to account for the - // extra shared memory for the result. GPU::setPromotedOperandList(context, attrs, {0, 1, 2}); SmallVector paddingTileSizes = workgroupTileSizes; int64_t innerKDim = contractionDims.k.back(); @@ -391,8 +385,7 @@ setIGEMMConvolutionLoweringConfig(IREE::GPU::TargetAttr target, SmallVector bounds = igemmLoopBounds.value(); FailureOr> configAndWgSize = getMatmulLoweringConfigAndWorkgroupSize( - bounds, igemmContractionMaps.value(), igemmOperands.value(), target, - /*hasFusedLeadingOp=*/true); + bounds, igemmContractionMaps.value(), igemmOperands.value(), target); if (failed(configAndWgSize)) { return failure(); } @@ -435,8 +428,7 @@ LogicalResult setMatmulLoweringConfig(IREE::GPU::TargetAttr target, LDBG("Matmul TileAndFuse Config"); FailureOr> configAndWgSize = - getMatmulLoweringConfigAndWorkgroupSize(bounds, maps, operands, target, - hasFusedLeadingOp(linalgOp)); + getMatmulLoweringConfigAndWorkgroupSize(bounds, maps, operands, target); if (failed(configAndWgSize)) { return failure(); } diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.cpp b/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.cpp index 18aa3b41f64e..d460a1b9f56b 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.cpp +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.cpp @@ -1033,6 +1033,8 @@ static void addLowerToLLVMGPUPasses(OpPassManager &modulePassManager, // Pad allocations with dynamic dimension after linalg lowering 
but before // lowering SCF and affine ops. .addPass(createPadDynamicAllocPass) + // Hoist any newly static allocations from PadDynamicAlloc. + .addPass(createHoistStaticallyBoundAllocationsPass) .addPass(createLowerAffinePass) .addPass(createCanonicalizerPass) diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/ROCDL/config_igemm_tile_and_fuse.mlir b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/ROCDL/config_igemm_tile_and_fuse.mlir index 4c9f79e207ba..cf170ef7d930 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/ROCDL/config_igemm_tile_and_fuse.mlir +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/ROCDL/config_igemm_tile_and_fuse.mlir @@ -59,7 +59,7 @@ func.func @nchw_conv_mfma() { // ----- -func.func @nhwc_conv_no_mfma() { +func.func @nhwc_conv_unaligned_mfma() { %cst = arith.constant 0.000000e+00 : f32 %c0 = arith.constant 0 : index %0 = hal.interface.binding.subspan layout(, #hal.pipeline.binding, #hal.pipeline.binding], flags = Indirect>) binding(0) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : !flow.dispatch.tensor> @@ -74,12 +74,22 @@ func.func @nhwc_conv_no_mfma() { return } -// CHECK-LABEL: func.func @nhwc_conv_no_mfma -// CHECK-NOT: use_igemm_convolution = true +// CHECK-LABEL: func.func @nhwc_conv_unaligned_mfma +// CHECK-SAME: #iree_codegen.translation_info +// CHECK-SAME: padding = [2, 1, 32, 64, 32] +// CHECK-SAME: promote_operands = [0, 1, 2] +// CHECK-SAME: reduction = [0, 0, 0, 0, 8] +// CHECK-SAME: subgroup = [2, 1, 2, 1, 0] +// CHECK-SAME: workgroup = [2, 1, 32, 64, 0] // ----- -func.func @nchw_conv_no_mfma() { +func.func @nchw_conv_unaligned_mfma() { %cst = arith.constant 0.000000e+00 : f32 %c0 = arith.constant 0 : index %0 = hal.interface.binding.subspan layout(, #hal.pipeline.binding, #hal.pipeline.binding], flags = Indirect>) binding(0) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : !flow.dispatch.tensor> @@ -94,5 +104,15 @@ func.func @nchw_conv_no_mfma() { return } -// CHECK-LABEL: func.func 
@nchw_conv_no_mfma -// CHECK-NOT: use_igemm_convolution = true +// CHECK-LABEL: func.func @nchw_conv_unaligned_mfma +// CHECK-SAME: #iree_codegen.translation_info +// CHECK-SAME: padding = [1, 64, 2, 32, 32] +// CHECK-SAME: promote_operands = [0, 1, 2] +// CHECK-SAME: reduction = [0, 0, 0, 0, 8] +// CHECK-SAME: subgroup = [1, 2, 2, 1, 0] +// CHECK-SAME: workgroup = [1, 64, 2, 32, 0] diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/ROCDL/pipeline_igemm_tile_and_fuse.mlir b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/ROCDL/pipeline_igemm_tile_and_fuse.mlir index 15d4dc4dae19..3d3504d87a08 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/ROCDL/pipeline_igemm_tile_and_fuse.mlir +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/ROCDL/pipeline_igemm_tile_and_fuse.mlir @@ -78,3 +78,83 @@ hal.executable private @main { // CHECK: } {mapping = [#iree_codegen.workgroup_mapping, #iree_codegen.workgroup_mapping, #iree_codegen.workgroup_mapping]} // TODO(Max191): Add tests for more convolution types + +// ----- + +#pipeline_layout = #hal.pipeline.layout, + #hal.pipeline.binding, + #hal.pipeline.binding +]> +#translation = #iree_codegen.translation_info + }> +#config = #iree_gpu.lowering_config<{ + padding = [2, 1, 32, 16, 16], + workgroup = [2, 1, 32, 16, 0], + reduction = [0, 0, 0, 0, 1], + subgroup = [1, 1, 1, 1, 0], + mma_kind = #iree_gpu.mma_layout, + promote_operands = [0, 1, 2] +}> +hal.executable private @main { + hal.executable.variant public @rocm_hsaco_fb target(<"rocm", "rocm-hsaco-fb">) { + hal.executable.export public @conv_dispatch_0_conv_2d_nhwc_hwcf_2x17x17x1281x3x3x1281_f16xf16xf32 ordinal(0) layout(#pipeline_layout) { + ^bb0(%arg0: !hal.device): + %x, %y, %z = flow.dispatch.workgroup_count_from_slice + hal.return %x, %y, %z : index, index, index + } + builtin.module { + func.func @conv_nhwc_unaligned_stride_2() attributes {translation_info = #translation} { + %cst = arith.constant 0.000000e+00 : f32 + %c0 = arith.constant 0 : 
index + %0 = hal.interface.binding.subspan layout(, #hal.pipeline.binding, #hal.pipeline.binding], flags = Indirect>) binding(0) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : !flow.dispatch.tensor> %1 = hal.interface.binding.subspan layout(, #hal.pipeline.binding, #hal.pipeline.binding], flags = Indirect>) binding(1) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : !flow.dispatch.tensor> + %2 = hal.interface.binding.subspan layout(, #hal.pipeline.binding, #hal.pipeline.binding], flags = Indirect>) binding(2) alignment(64) offset(%c0) flags(Indirect) : !flow.dispatch.tensor> + %3 = flow.dispatch.tensor.load %0, offsets = [0, 0, 0, 0], sizes = [2, 35, 35, 1281], strides = [1, 1, 1, 1] : !flow.dispatch.tensor> -> tensor<2x35x35x1281xf16> + %4 = flow.dispatch.tensor.load %1, offsets = [0, 0, 0, 0], sizes = [3, 3, 1281, 1281], strides = [1, 1, 1, 1] : !flow.dispatch.tensor> -> tensor<3x3x1281x1281xf16> + %5 = tensor.empty() : tensor<2x17x17x1281xf32> + %6 = linalg.fill ins(%cst : f32) outs(%5 : tensor<2x17x17x1281xf32>) -> tensor<2x17x17x1281xf32> + %7 = linalg.conv_2d_nhwc_hwcf {dilations = dense<1> : tensor<2xi64>, lowering_config = #config, strides = dense<2> : tensor<2xi64>} ins(%3, %4 : tensor<2x35x35x1281xf16>, tensor<3x3x1281x1281xf16>) outs(%6 : tensor<2x17x17x1281xf32>) -> tensor<2x17x17x1281xf32> + flow.dispatch.tensor.store %7, %2, offsets = [0, 0, 0, 0], sizes = [2, 17, 17, 1281], strides = [1, 1, 1, 1] : tensor<2x17x17x1281xf32> -> !flow.dispatch.tensor> + return + } + } + } +} + +// CHECK-LABEL: func @conv_nhwc_unaligned +// CHECK-DAG: %[[B0:.+]] = hal.interface.binding.subspan layout({{.+}}) binding(0) +// CHECK-DAG: %[[B1:.+]] = hal.interface.binding.subspan layout({{.+}}) binding(1) +// CHECK-DAG: %[[B2:.+]] = hal.interface.binding.subspan layout({{.+}}) binding(2) +// CHECK-DAG: memref.alloc() : memref<2x1x2x16x1x16xf32, #gpu.address_space> +// CHECK-DAG: memref.alloc() : memref<16x20xf16, #gpu.address_space> +// CHECK-DAG: 
memref.alloc() : memref<2x1x32x20xf16, #gpu.address_space> +// CHECK-DAG: %[[C0:.+]] = arith.constant 0 : index +// CHECK-DAG: %[[C721:.+]] = arith.constant 721 : index +// CHECK-DAG: %[[C1:.+]] = arith.constant 1 : index +// CHECK: scf.forall ({{.*}}) in (17, 81) { +// CHECK: %[[LOOP:.+]] = scf.for %[[IV:.+]] = %[[C0]] to %[[C721]] step %[[C1]] {{.*}} -> (vector<1x1x1x1x4x1xf32>) +// CHECK: gpu.barrier +// CHECK-DAG: %[[LHS_RD:.+]] = vector.transfer_read %[[B0]]{{.*}} vector<1xf16> +// CHECK-DAG: vector.transfer_write %[[LHS_RD]] +// Note that to simplify the test we are not showing the mapping of the RHS_RD +// to its buffer as it goes through an scf.if/else control structure +// involving allocas. +// CHECK-DAG: %[[RHS_RD:.+]] = vector.transfer_read {{.*}} vector<1xf16> +// CHECK-DAG: vector.transfer_write %[[RHS_RD]] +// CHECK: gpu.barrier +// CHECK-DAG: %[[LHS_MM0:.+]] = vector.transfer_read {{.*}} vector<4xf16> +// CHECK-DAG: %[[RHS_MM:.+]] = vector.transfer_read {{.*}} vector<4x1x1xf16> +// CHECK-COUNT-1: amdgpu.mfma {{.*}}blocks = 1 : i32, k = 16 : i32, m = 16 : i32, n = 16 : i32 +// CHECK: %[[LOOP_T:.+]] = vector.shape_cast %[[LOOP]] : vector<1x1x1x1x4x1xf32> to vector<4x1x1xf32> +// CHECK: vector.transfer_write %[[LOOP_T]] +// Note there is a writeback loop here that is skipped to simplify the test. +// CHECK: vector.transfer_write {{.*}}, %[[B2]] +// CHECK: } {mapping = [#iree_codegen.workgroup_mapping, #iree_codegen.workgroup_mapping]} From 345b1dab4a32a058caab6dced991b91039040a16 Mon Sep 17 00:00:00 2001 From: Archana Ramalingam <98564406+archana-ramalingam@users.noreply.github.com> Date: Tue, 17 Dec 2024 22:04:57 -0800 Subject: [PATCH 35/64] Revert "[LLVMGPU] Deprecate the matmul simt pipeline (#19335)" (#19508) This reverts commit 6ff00a8a008d06b604d4ca4e0ae6e601ae810b4f. The above commit causes Llama3.1 8B fp16 model to generate NaN logits for prefill/decode. 
Issue: https://github.com/iree-org/iree/issues/19506 Signed-off-by: archana-ramalingam --- .../test/gpu_reorder_workgroups_static.mlir | 2 +- .../Dialect/Codegen/IR/IREECodegenAttrs.td | 22 +++--- .../compiler/Codegen/LLVMGPU/KernelConfig.cpp | 70 ++----------------- .../LLVMGPU/LLVMGPULowerExecutableTarget.cpp | 3 + .../iree/compiler/Codegen/LLVMGPU/Passes.cpp | 66 +++++++++++++++++ .../iree/compiler/Codegen/LLVMGPU/Passes.h | 4 ++ .../compiler/Codegen/LLVMGPU/Verifiers.cpp | 11 ++- .../Codegen/LLVMGPU/test/config_matvec.mlir | 5 +- .../test/config_root_op_attribute.mlir | 2 +- .../LLVMGPU/test/distribute_to_thread.mlir | 8 +-- .../LLVMGPU/test/gpu_set_num_workgroups.mlir | 28 +++++--- .../LLVMGPU/test/illegal_configuration.mlir | 38 ++++++++++ .../LLVMGPU/test/nvvm_pipeline_test.mlir | 31 ++++---- .../LLVMGPU/test/rocdl_pipeline_test.mlir | 13 ++-- tests/e2e/matmul/BUILD.bazel | 24 +++++++ tests/e2e/matmul/CMakeLists.txt | 26 +++++++ tests/e2e/matmul/generate_e2e_matmul_tests.py | 14 +++- 17 files changed, 259 insertions(+), 108 deletions(-) diff --git a/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_reorder_workgroups_static.mlir b/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_reorder_workgroups_static.mlir index 992dc8ec4435..1b7a99184dcb 100644 --- a/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_reorder_workgroups_static.mlir +++ b/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_reorder_workgroups_static.mlir @@ -25,7 +25,7 @@ ]> hal.executable private @main_dispatch_0 { hal.executable.variant public @rocm_hsaco_fb target(<"rocm", "rocm-hsaco-fb">) { - hal.executable.export public @main_dispatch_0_matmul_transpose_b_32000x32000x4096_f16 ordinal(0) layout(#pipeline_layout) attributes {subgroup_size = 64 : index, translation_info = #iree_codegen.translation_info, workgroup_size = [64 : index, 16 : index, 1 : index]} { + hal.executable.export public @main_dispatch_0_matmul_transpose_b_32000x32000x4096_f16 ordinal(0) 
layout(#pipeline_layout) attributes {subgroup_size = 64 : index, translation_info = #iree_codegen.translation_info, workgroup_size = [64 : index, 16 : index, 1 : index]} { ^bb0(%arg0: !hal.device): %c250 = arith.constant 250 : index %c500 = arith.constant 500 : index diff --git a/compiler/src/iree/compiler/Codegen/Dialect/Codegen/IR/IREECodegenAttrs.td b/compiler/src/iree/compiler/Codegen/Dialect/Codegen/IR/IREECodegenAttrs.td index e5c6f6f649cd..26b37dd07e24 100644 --- a/compiler/src/iree/compiler/Codegen/Dialect/Codegen/IR/IREECodegenAttrs.td +++ b/compiler/src/iree/compiler/Codegen/Dialect/Codegen/IR/IREECodegenAttrs.td @@ -40,24 +40,26 @@ def LLVMGPU_SimpleDistribute : I32EnumAttrCase<"LLVMGPUDistribute", 102>; def LLVMGPU_Vectorize : I32EnumAttrCase<"LLVMGPUVectorize", 103>; +def LLVMGPU_MatmulSimt + : I32EnumAttrCase<"LLVMGPUMatmulSimt", 104>; def LLVMGPU_MatmulTensorCore - : I32EnumAttrCase<"LLVMGPUMatmulTensorCore", 104>; + : I32EnumAttrCase<"LLVMGPUMatmulTensorCore", 105>; def LLVMGPU_TransposeSharedMem - : I32EnumAttrCase<"LLVMGPUTransposeSharedMem", 105>; + : I32EnumAttrCase<"LLVMGPUTransposeSharedMem", 106>; def LLVMGPU_WarpReduction - : I32EnumAttrCase<"LLVMGPUWarpReduction", 106>; + : I32EnumAttrCase<"LLVMGPUWarpReduction", 107>; def LLVMGPU_PackUnPack - : I32EnumAttrCase<"LLVMGPUPackUnPack", 107>; + : I32EnumAttrCase<"LLVMGPUPackUnPack", 108>; def LLVMGPU_MatmulTensorCoreMmaSync - : I32EnumAttrCase<"LLVMGPUMatmulTensorCoreMmaSync", 108>; + : I32EnumAttrCase<"LLVMGPUMatmulTensorCoreMmaSync", 109>; def LLVMGPU_VectorDistribute - : I32EnumAttrCase<"LLVMGPUVectorDistribute", 109>; + : I32EnumAttrCase<"LLVMGPUVectorDistribute", 110>; def LLVMGPU_PadAndVectorDistribute - : I32EnumAttrCase<"LLVMGPUPadAndVectorDistribute", 110>; + : I32EnumAttrCase<"LLVMGPUPadAndVectorDistribute", 111>; def LLVMGPU_WinogradVectorize - : I32EnumAttrCase<"LLVMGPUWinogradVectorize", 111>; + : I32EnumAttrCase<"LLVMGPUWinogradVectorize", 112>; def LLVMGPU_TileAndFuse - : 
I32EnumAttrCase<"LLVMGPUTileAndFuse", 112>; + : I32EnumAttrCase<"LLVMGPUTileAndFuse", 113>; def SPIRV_BaseLowering : I32EnumAttrCase<"SPIRVBaseLowering", 200>; @@ -96,7 +98,7 @@ def DispatchLoweringPassPipelineEnum : I32EnumAttr< // LLVMGPU CodeGen pipelines LLVMGPU_Default, LLVMGPU_BaseLowering, LLVMGPU_SimpleDistribute, - LLVMGPU_Vectorize, LLVMGPU_MatmulTensorCore, + LLVMGPU_Vectorize, LLVMGPU_MatmulSimt, LLVMGPU_MatmulTensorCore, LLVMGPU_TransposeSharedMem, LLVMGPU_WarpReduction, LLVMGPU_PackUnPack, LLVMGPU_MatmulTensorCoreMmaSync, LLVMGPU_VectorDistribute, LLVMGPU_PadAndVectorDistribute, LLVMGPU_WinogradVectorize, diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/KernelConfig.cpp b/compiler/src/iree/compiler/Codegen/LLVMGPU/KernelConfig.cpp index 7f752e2c559f..fc890d1db70d 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/KernelConfig.cpp +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/KernelConfig.cpp @@ -1295,11 +1295,9 @@ static LogicalResult setContractConfig(IREE::GPU::TargetAttr target, CodeGenPipeline pipeline) { TileSizesListType tileSizes; unsigned numParallelLoops = op.getNumParallelLoops(); - unsigned numReductionLoops = op.getNumReductionLoops(); - SmallVector workgroupTileSizes( - numParallelLoops + numReductionLoops, 1); - workgroupTileSizes[numParallelLoops - 2] = tileX; - workgroupTileSizes[numParallelLoops - 1] = tileY; + SmallVector workgroupTileSizes(numParallelLoops - 2, 1); + workgroupTileSizes.append({tileX, tileY}); + workgroupTileSizes.append(op.getNumReductionLoops(), tileK); SmallVector partitionedLoops = cast(op.getOperation()) @@ -1313,65 +1311,11 @@ static LogicalResult setContractConfig(IREE::GPU::TargetAttr target, } } + tileSizes.emplace_back(std::move(workgroupTileSizes)); // Workgroup level. std::optional subgroupSize = std::nullopt; if (!subgroupSizes.empty()) subgroupSize = subgroupSizes.front(); - // For the LLVMGPUTileAndFuse pipeline, we need to split tile sizes - // for workgroup, thread, and reduction. 
- if (pipeline == CodeGenPipeline::LLVMGPUTileAndFuse) { - - auto context = op.getContext(); - Builder b(context); - SmallVector attrs; - - SmallVector threadTileSizes(numParallelLoops + numReductionLoops, - 0); - std::fill(threadTileSizes.begin(), - threadTileSizes.begin() + numParallelLoops, 1); - - threadTileSizes[numParallelLoops - 2] = - (tileX / workgroupSize[0]) < 1 ? 1 : (tileX / workgroupSize[0]); - threadTileSizes[numParallelLoops - 1] = - (tileY / workgroupSize[1]) < 1 ? 1 : (tileY / workgroupSize[1]); - - SmallVector reductionTileSizes( - numParallelLoops + numReductionLoops, 0); - reductionTileSizes[numParallelLoops + numReductionLoops - 1] = tileK; - - attrs.emplace_back(b.getStringAttr("workgroup"), - b.getI64ArrayAttr(workgroupTileSizes)); - attrs.emplace_back(b.getStringAttr("thread"), - b.getI64ArrayAttr(threadTileSizes)); - attrs.emplace_back(b.getStringAttr("reduction"), - b.getI64ArrayAttr(reductionTileSizes)); - - // Promote operands to use shared memory for LHS and RHS. - IREE::GPU::setPromotedOperandList(context, attrs, {0, 1}); - auto configDict = b.getDictionaryAttr(attrs); - auto loweringConfig = - IREE::GPU::LoweringConfigAttr::get(context, configDict); - SmallVector pipelineAttrs; - auto pipelineOptions = IREE::GPU::GPUPipelineOptionsAttr::get( - context, /*prefetchSharedMemory=*/false, - /*no_reduce_shared_memory_bank_conflicts=*/true, - /*use_igemm_convolution=*/false, - /*reorder_workgroups_strategy=*/std::nullopt); - pipelineAttrs.emplace_back( - b.getStringAttr(IREE::GPU::GPUPipelineOptionsAttr::getDictKeyName()), - pipelineOptions); - auto pipelineConfig = b.getDictionaryAttr(pipelineAttrs); - - return setOpConfigAndEntryPointFnTranslation( - entryPoint, op, loweringConfig, pipeline, workgroupSize, subgroupSize, - pipelineConfig); - } - - // Other pipeline (MatmulTensorCore) expect the reduction tile size to be in - // the same list. 
- workgroupTileSizes[numParallelLoops + numReductionLoops - 1] = tileK; - tileSizes.emplace_back(std::move(workgroupTileSizes)); - return setOpConfigAndEntryPointFnTranslation( entryPoint, op, tileSizes, pipeline, workgroupSize, subgroupSize, getSoftwarePipeliningAttrDict(op->getContext(), softwarePipelineDepth, @@ -1446,7 +1390,7 @@ static LogicalResult setContractConfig(IREE::GPU::TargetAttr target, return setMatmulConfig( sizeN, sizeM, 4, {sizeM, sizeN, 1}, target.getWgp().getSubgroupSizeChoices().asArrayRef(), - softwarePipelineDepthSimt, CodeGenPipeline::LLVMGPUTileAndFuse); + softwarePipelineDepthSimt, CodeGenPipeline::LLVMGPUMatmulSimt); } // SIMT matmul case. Query the best configuration. @@ -1460,7 +1404,7 @@ static LogicalResult setContractConfig(IREE::GPU::TargetAttr target, config.tileSize[0], config.tileSize[1], config.tileSize[2], config.workgroupSize, target.getWgp().getSubgroupSizeChoices().asArrayRef(), - softwarePipelineDepthSimt, CodeGenPipeline::LLVMGPUTileAndFuse); + softwarePipelineDepthSimt, CodeGenPipeline::LLVMGPUMatmulSimt); } } } @@ -1485,7 +1429,7 @@ static LogicalResult setContractConfig(IREE::GPU::TargetAttr target, return setMatmulConfig(tileX, tileY, tileK, workgroupSize, target.getWgp().getSubgroupSizeChoices().asArrayRef(), softwarePipelineDepthSimt, - CodeGenPipeline::LLVMGPUTileAndFuse); + CodeGenPipeline::LLVMGPUMatmulSimt); } //====---------------------------------------------------------------------===// diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/LLVMGPULowerExecutableTarget.cpp b/compiler/src/iree/compiler/Codegen/LLVMGPU/LLVMGPULowerExecutableTarget.cpp index 1773e229c284..73688d2b92d5 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/LLVMGPULowerExecutableTarget.cpp +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/LLVMGPULowerExecutableTarget.cpp @@ -114,6 +114,9 @@ void LLVMGPULowerExecutableTargetPass::runOnOperation() { case IREE::Codegen::DispatchLoweringPassPipeline::LLVMGPUWinogradVectorize: 
addGPUWinogradVectorizePassPipeline(pipeline); break; + case IREE::Codegen::DispatchLoweringPassPipeline::LLVMGPUMatmulSimt: + addGPUMatmulSimtPassPipeline(pipeline, pipelineOptions); + break; case IREE::Codegen::DispatchLoweringPassPipeline::LLVMGPUMatmulTensorCore: { FailureOr maybeDepth = getSoftwarePipelineDepth(translationInfo.getConfiguration()); diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.cpp b/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.cpp index d460a1b9f56b..1debcf3bc205 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.cpp +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.cpp @@ -526,6 +526,72 @@ void addGPUWinogradVectorizePassPipeline(OpPassManager &funcPassManager) { funcPassManager.addPass(createOptimizeTensorInsertExtractSlicesPass()); } +//===---------------------------------------------------------------------===// +// MatmulSIMT +//===---------------------------------------------------------------------===// + +void addGPUMatmulSimtPassPipeline(OpPassManager &funcPassManager, + const GPUPipelineOptions &options) { + tileAndDistributeToWorkgroup(funcPassManager, /*useForall=*/false); + + funcPassManager.addPass(createConfigTrackingCanonicalizerPass()); + funcPassManager.addPass(createConfigTrackingCanonicalizerPass()); + funcPassManager.addPass(createCSEPass()); + + funcPassManager.addPass(createGPUTensorTileToSerialLoopsPass()); + funcPassManager.addPass(createGPUTensorAlloc()); + funcPassManager.addPass(createGPUTensorTilePass()); + + // Linalg -> vector + addGPUVectorizationPasses(funcPassManager); + + // tensor to memref + addBufferizePasses(funcPassManager); + + // distribute foreach threads + funcPassManager.addPass(createGPUDistributePass()); + + funcPassManager.addPass(createMemrefCopyToLinalgPass()); + funcPassManager.addPass(createGPUDistributeSharedMemoryCopyPass()); + funcPassManager.addPass(createCanonicalizerPass()); + funcPassManager.addPass(createCSEPass()); + + if 
(options.enableReduceSharedMemoryBankConflicts) { + funcPassManager.addPass(createGPUReduceBankConflictsPass()); + } + + ReorderWorkgroupsStrategy reorderStrategy = + getReorderWorkgroupsStrategy(options.reorderStrategy); + funcPassManager.addPass( + createReorderWorkgroups(reorderStrategy, canReorderWorkgroups)); + + funcPassManager.addPass(createCanonicalizerPass()); + funcPassManager.addPass(createCSEPass()); + + funcPassManager.addPass(memref::createFoldMemRefAliasOpsPass()); + funcPassManager.addPass(createCSEPass()); + funcPassManager.addPass(createCanonicalizerPass()); + funcPassManager.addPass(createCSEPass()); + + // Even though we vectorize before bufferization we are not able to hoist + // accumulator load/store out of the K loop until distribution. This is + // because we materialize the fill and the matmul in two different scf.forall + // regions, when they should be in the same scf.forall. Newer pipelines + // like TileAndFuse don't have this problem, because they coalesce these + // scf.forall regions into a single scf.forall. + // + // Therefore we still rely on buffer level transformations for transfer ops + // hoisting and store to load forwarding. This relies on shacky alias + // analysis and we need to move this to tensor level once we have better + // abstractions. + funcPassManager.addPass(createOptimizeVectorTransferPass()); + + // Hoist loop invariant code to avoid pipelining it. + funcPassManager.addPass(createIREELoopInvariantCodeMotionPass()); + // Pipeline memory operations. 
+ funcPassManager.addPass(createGPUPipeliningPass()); +} + //===---------------------------------------------------------------------===// // Matmul Tensor Core //===---------------------------------------------------------------------===// diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.h b/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.h index 17b7b866be11..caacfb2656e3 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.h +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.h @@ -28,6 +28,10 @@ using IREE::GPU::GPUPipelineOptions; // LLVMGPU Backend Pass Pipelines //----------------------------------------------------------------------------// +/// Lowering using SIMT CUDA core operations. +void addGPUMatmulSimtPassPipeline(OpPassManager &funcPassManager, + const GPUPipelineOptions &options); + /// Lowering using mma.sync Tensor Core operations. void addGPUMatmulTensorCoreMmaSyncPassPipeline( OpPassManager &funcPassManager, const GPUPipelineOptions &options, diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/Verifiers.cpp b/compiler/src/iree/compiler/Codegen/LLVMGPU/Verifiers.cpp index bab5de877eb3..f2e3e2da4e3f 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/Verifiers.cpp +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/Verifiers.cpp @@ -38,6 +38,10 @@ getInstructionShape(Operation *op, CodeGenPipeline pipeline, Type inputElementType, SmallVector &instructionShape) { switch (pipeline) { + case CodeGenPipeline::LLVMGPUMatmulSimt: + // SIMT Pipeline / CUDA Cores + instructionShape = {1, 1, 1}; + break; case CodeGenPipeline::LLVMGPUMatmulTensorCore: // Tensor Core Pipeline / WMMA API if (inputElementType.isF16() || inputElementType.isBF16()) { @@ -77,7 +81,8 @@ verifyGPUMatmulPipeline(Operation *op, ArrayRef workgroupSize) { // This verifier only applies to matmul. 
CodeGenPipeline pipeline = translationInfo.getDispatchLoweringPassPipeline(); - if (pipeline != CodeGenPipeline::LLVMGPUMatmulTensorCore && + if (pipeline != CodeGenPipeline::LLVMGPUMatmulSimt && + pipeline != CodeGenPipeline::LLVMGPUMatmulTensorCore && pipeline != CodeGenPipeline::LLVMGPUMatmulTensorCoreMmaSync) { return success(); } @@ -175,6 +180,10 @@ verifyGPUMatmulPipeline(Operation *op, << pipelineName; } + // Return success for SIMT/CUDA cores. + if (pipeline == CodeGenPipeline::LLVMGPUMatmulSimt) + return success(); + // // Additional verification Tensor Core pipelines. // diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/config_matvec.mlir b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/config_matvec.mlir index 059761c1ae19..1e5dbf63f2f9 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/config_matvec.mlir +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/config_matvec.mlir @@ -267,11 +267,12 @@ func.func @not_vmt() { return } -// CHECK-DAG: #[[$TRANSLATION:.+]] = #iree_codegen.translation_info}> +// CHECK-DAG: #[[$CONFIG:.+]] = #iree_codegen.lowering_config +// CHECK: #[[$TRANSLATION:.+]] = #iree_codegen.translation_info // CHECK: func.func @not_vmt() // CHECK-SAME: translation_info = #[[$TRANSLATION]] // CHECK: linalg.generic -// CHECK-SAME: lowering_config = #iree_gpu.lowering_config<{promote_operands = [0, 1], reduction = [0, 0, 8], thread = [1, 128, 0], workgroup = [1, 128, 1]}> +// CHECK-SAME: lowering_config = #[[$CONFIG]] // ----- diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/config_root_op_attribute.mlir b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/config_root_op_attribute.mlir index 3c7e52aa475a..f3e0d81fb961 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/config_root_op_attribute.mlir +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/config_root_op_attribute.mlir @@ -9,4 +9,4 @@ func.func @matmul(%lhs: tensor<4x4xf32>, %rhs: tensor<4x4xf32>) -> tensor<4x4xf3 return %result : 
tensor<4x4xf32> } -// CHECK: %2 = linalg.matmul {lowering_config = #{{.*}}, root_op} ins(%arg0, %arg1 : tensor<4x4xf32>, tensor<4x4xf32>) outs(%1 : tensor<4x4xf32>) -> tensor<4x4xf32> +// CHECK: %2 = linalg.matmul {lowering_config = #config, root_op} ins(%arg0, %arg1 : tensor<4x4xf32>, tensor<4x4xf32>) outs(%1 : tensor<4x4xf32>) -> tensor<4x4xf32> diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/distribute_to_thread.mlir b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/distribute_to_thread.mlir index cec55cdaf0a5..cd69906aec13 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/distribute_to_thread.mlir +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/distribute_to_thread.mlir @@ -9,7 +9,7 @@ #map = affine_map<()[s0] -> (s0 * 2)> #map1 = affine_map<()[s0] -> (s0 * 256)> #map2 = affine_map<(d0, d1)[s0] -> (d0 * 1024 + s0 + d1)> -#translation = #iree_codegen.translation_info +#translation = #iree_codegen.translation_info func.func @dot_dispatch_0() attributes {translation_info = #translation} { %cst = arith.constant 0.000000e+00 : f32 %c0 = arith.constant 0 : index @@ -79,7 +79,7 @@ func.func @dot_dispatch_0() attributes {translation_info = #translation} { #map2 = affine_map<(d0, d1, d2)[s0] -> (d0 * 32768 + s0 + d1 * 1024 + d2)> #map3 = affine_map<(d0, d1, d2)[s0] -> (d0 * 65536 + s0 + d1 * 64 + d2)> #map4 = affine_map<(d0, d1, d2)[s0] -> (d0 * 2048 + s0 + d1 * 64 + d2)> -#translation = #iree_codegen.translation_info +#translation = #iree_codegen.translation_info func.func @batch_matmul_func() attributes {translation_info = #translation} { %c0 = arith.constant 0 : index %cst = arith.constant 0.000000e+00 : f32 @@ -148,7 +148,7 @@ func.func @batch_matmul_func() attributes {translation_info = #translation} { #map = affine_map<()[s0] -> (s0 * 2)> #map1 = affine_map<()[s0] -> (s0 * 32)> #map2 = affine_map<(d0, d1)[s0] -> (d0 * 1024 + s0 + d1)> -#translation = #iree_codegen.translation_info +#translation = #iree_codegen.translation_info 
func.func @dot_dispatch_0() attributes {translation_info = #translation} { %cst = arith.constant 0.000000e+00 : f32 %c0 = arith.constant 0 : index @@ -312,7 +312,7 @@ module { #hal.pipeline.binding ]> #config = #iree_codegen.lowering_config -#translation = #iree_codegen.translation_info +#translation = #iree_codegen.translation_info #map = affine_map<()[s0] -> (s0 * 2)> #map1 = affine_map<()[s0] -> (s0 * 256)> #map2 = affine_map<(d0)[s0] -> (-d0 + s0, 2)> diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/gpu_set_num_workgroups.mlir b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/gpu_set_num_workgroups.mlir index b407aa64e864..66fc62f2e482 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/gpu_set_num_workgroups.mlir +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/gpu_set_num_workgroups.mlir @@ -54,12 +54,14 @@ func.func @dot_dispatch_1() { return } -// CHECK: #[[TRANSLATION:.+]] = #iree_codegen.translation_info}> +// CHECK-DAG: #[[CONFIG:.+]] = #iree_codegen.lowering_config +// CHECK-DAG: #[[TRANSLATION:.+]] = #iree_codegen.translation_info // CHECK: func.func @dot_dispatch_1 // CHECK-SAME: translation_info = #[[TRANSLATION]] // CHECK: linalg.fill +// CHECK-SAME: lowering_config = #[[CONFIG]] // CHECK: linalg.matmul -// CHECK-SAME: lowering_config = #iree_gpu.lowering_config<{promote_operands = [0, 1], reduction = [0, 0, 4], thread = [2, 1, 0], workgroup = [4, 2, 1]}> +// CHECK-SAME: lowering_config = #[[CONFIG]] // ----- @@ -81,12 +83,14 @@ func.func @unaligned_k() { return } -// CHECK: #[[TRANSLATION:.+]] = #iree_codegen.translation_info}> +// CHECK-DAG: #[[CONFIG:.+]] = #iree_codegen.lowering_config +// CHECK-DAG: #[[TRANSLATION:.+]] = #iree_codegen.translation_info // CHECK: func.func @unaligned_k // CHECK-SAME: translation_info = #[[TRANSLATION]] // CHECK: linalg.fill +// CHECK-SAME: lowering_config = #[[CONFIG]] // CHECK: linalg.matmul -// CHECK-SAME: lowering_config = #iree_gpu.lowering_config<{promote_operands = [0, 1], 
reduction = [0, 0, 2], thread = [1, 16, 0], workgroup = [32, 128, 1]}> +// CHECK-SAME: lowering_config = #[[CONFIG]] // ----- @@ -119,6 +123,7 @@ func.func @predict_dispatch_153() { // CHECK: func.func @predict_dispatch_153() // CHECK-SAME: translation_info = #[[TRANSLATION]] // CHECK: linalg.fill +// CHECK-SAME: lowering_config = #[[CONFIG]] // CHECK: linalg.generic // CHECK-SAME: lowering_config = #[[CONFIG]] @@ -249,7 +254,7 @@ func.func @static_3d_fft_stage3() { #hal.pipeline.binding ]> #config = #iree_codegen.lowering_config -#translation = #iree_codegen.translation_info +#translation = #iree_codegen.translation_info #compilation = #iree_codegen.compilation_info func.func @_lowering_config_test_dispatch_1() { %cst = arith.constant 0.000000e+00 : f32 @@ -269,10 +274,11 @@ func.func @_lowering_config_test_dispatch_1() { } // CHECK-DAG: #[[CONFIG:.+]] = #iree_codegen.lowering_config +// CHECK-DAG: #[[TRANSLATION:.+]] = #iree_codegen.translation_info // CHECK: func.func @_lowering_config_test_dispatch_1() // CHECK-SAME: translation_info = #[[TRANSLATION]] // CHECK: linalg.fill +// CHECK-SAME: lowering_config = #[[CONFIG]] // CHECK: linalg.matmul // CHECK-SAME: lowering_config = #[[CONFIG]] @@ -335,7 +341,7 @@ func.func @matmul_config_sm35() { return } -// CHECK-DAG: #[[TRANSLATION:.+]] = #iree_codegen.translation_info}> +// CHECK-DAG: #[[TRANSLATION:.+]] = #iree_codegen.translation_info // CHECK: func.func @matmul_config_sm35() // CHECK-SAME: translation_info = #[[TRANSLATION]] @@ -495,6 +501,7 @@ func.func @large_matmul_f16() { // SM80: func.func @large_matmul_f16() // SM80-SAME: translation_info = #[[TRANSLATION]] // SM80: linalg.fill +// SM80-SAME: lowering_config = #[[CONFIG]] // SM80: linalg.matmul // SM80-SAME: lowering_config = #[[CONFIG]] @@ -527,6 +534,7 @@ func.func @large_matmul_f32() { // SM80: func.func @large_matmul_f32() // SM80-SAME: translation_info = #[[TRANSLATION]] // SM80: linalg.fill +// SM80-SAME: lowering_config = #[[CONFIG]] // SM80: 
linalg.matmul // SM80-SAME: lowering_config = #[[CONFIG]] @@ -651,12 +659,14 @@ func.func @_main_dispatch_15_generic_512x4x42x42x64_f32() { return } -// CHECK: #[[TRANSLATION:.+]] = #iree_codegen.translation_info}> +// CHECK-DAG: #[[CONFIG:.+]] = #iree_codegen.lowering_config // CHECK: func.func @_main_dispatch_15_generic_512x4x42x42x64_f32() // CHECK-SAME: translation_info = #[[TRANSLATION]] // CHECK: linalg.fill +// CHECK-SAME: lowering_config = #[[CONFIG]] // CHECK: linalg.generic -// CHECK-SAME: lowering_config = #iree_gpu.lowering_config<{promote_operands = [0, 1], reduction = [0, 0, 0, 0, 32], thread = [1, 1, 1, 16, 0], workgroup = [1, 1, 32, 128, 1]}> +// CHECK-SAME: lowering_config = #[[CONFIG]] // ----- diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/illegal_configuration.mlir b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/illegal_configuration.mlir index 8dccac1fb4a6..2c3df44b325b 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/illegal_configuration.mlir +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/illegal_configuration.mlir @@ -1,5 +1,43 @@ // RUN: iree-opt --iree-gpu-test-target=sm_60 --pass-pipeline="builtin.module(iree-llvmgpu-select-lowering-strategy)" --verify-diagnostics --split-input-file %s +#pipeline_layout = #hal.pipeline.layout, + #hal.pipeline.binding, + #hal.pipeline.binding +]> +#config = #iree_codegen.lowering_config +#translation = #iree_codegen.translation_info +func.func @illegal() attributes {translation_info = #translation} { + %c0 = arith.constant 0 : index + %0 = hal.interface.binding.subspan layout(#pipeline_layout) binding(0) : memref<4x8xf32> + %1 = hal.interface.binding.subspan layout(#pipeline_layout) binding(1) : memref<8x16xf32> + %2 = hal.interface.binding.subspan layout(#pipeline_layout) binding(2) : memref<4x16xf32> + // expected-error @+1 {{Total number of threads in a thread block 2048 exceeds the limit of 1024 with compilation pipeline LLVMGPUMatmulSimt}} + linalg.matmul 
{lowering_config = #config} ins(%0, %1 : memref<4x8xf32>, memref<8x16xf32>) outs(%2 : memref<4x16xf32>) + return +} + +// ----- + +#pipeline_layout = #hal.pipeline.layout, + #hal.pipeline.binding, + #hal.pipeline.binding +]> +#config = #iree_codegen.lowering_config +#translation = #iree_codegen.translation_info +func.func @illegal() attributes {translation_info = #translation} { + %c0 = arith.constant 0 : index + %0 = hal.interface.binding.subspan layout(#pipeline_layout) binding(0) : memref<4x8xf32> + %1 = hal.interface.binding.subspan layout(#pipeline_layout) binding(1) : memref<8x16xf32> + %2 = hal.interface.binding.subspan layout(#pipeline_layout) binding(2) : memref<4x16xf32> + // expected-error @+1 {{Expected workgroup size in z-dim = 1, but got 2 with compilation pipeline LLVMGPUMatmulSimt}} + linalg.matmul {lowering_config = #config} ins(%0, %1 : memref<4x8xf32>, memref<8x16xf32>) outs(%2 : memref<4x16xf32>) + return +} + +// ----- + #pipeline_layout = #hal.pipeline.layout, #hal.pipeline.binding, diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/nvvm_pipeline_test.mlir b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/nvvm_pipeline_test.mlir index ee2578c3e9e9..9cb3fed6254c 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/nvvm_pipeline_test.mlir +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/nvvm_pipeline_test.mlir @@ -83,14 +83,20 @@ hal.executable @dot_dispatch_0 { } } -// CHECK-LABEL: hal.executable public @dot_dispatch_0 -// CHECK: hal.executable.variant public @cuda -// CHECK-NOT: llvm.store -// CHECK: llvm.br -// CHECK: llvm.load {{.*}} : !llvm.ptr<3> -> vector<32xf32> -// CHECK-COUNT-32: llvm.load {{.*}} : !llvm.ptr<3> -> vector<16xf32> -// CHECK-COUNT-32: llvm.intr.fmuladd({{.*}}) : (vector<16xf32>, vector<16xf32>, vector<16xf32>) -> vector<16xf32> -// CHECK: llvm.store {{.*}} : vector<16xf32>, !llvm.ptr<1> +// CHECK-LABEL: hal.executable public @dot_dispatch_0 +// CHECK: hal.executable.variant public @cuda +// 
CHECK-NOT: llvm.store +// CHECK-COUNT-3: llvm.load {{.*}} : !llvm.ptr<1> -> vector<4xf32> +// CHECK: llvm.br +// CHECK-COUNT-3: llvm.store {{.*}} : vector<4xf32>, !llvm.ptr<3> +// CHECK-COUNT-32: llvm.load {{.*}} : !llvm.ptr<3> -> vector<4xf32> +// CHECK-COUNT-128: llvm.intr.fmuladd({{.*}}) : (vector<4xf32>, vector<4xf32>, vector<4xf32>) -> vector<4xf32> +// CHECK-COUNT-3: llvm.load {{.*}} : !llvm.ptr<1> -> vector<4xf32> +// CHECK: llvm.br +// CHECK-COUNT-3: llvm.store {{.*}} : vector<4xf32>, !llvm.ptr<3> +// CHECK-COUNT-32: llvm.load {{.*}} : !llvm.ptr<3> -> vector<4xf32> +// CHECK-COUNT-128: llvm.intr.fmuladd({{.*}}) : (vector<4xf32>, vector<4xf32>, vector<4xf32>) -> vector<4xf32> +// CHECK-COUNT-4: llvm.store {{.*}} : vector<4xf32>, !llvm.ptr<1> // ----- @@ -152,10 +158,11 @@ hal.executable @dot_dispatch_0 { } // CHECK-LABEL: hal.executable public @dot_dispatch_0 -// CHECK: hal.executable.variant public @cuda -// CHECK: llvm.br -// CHECK-COUNT-32: llvm.intr.fmuladd({{.*}}) : (vector<16xf32>, vector<16xf32>, vector<16xf32>) -> vector<16xf32> -// CHECK: llvm.store {{.*}} : vector<16xf32>, !llvm.ptr<1> +// CHECK: hal.executable.variant public @cuda +// CHECK: llvm.br +// CHECK-COUNT-8: llvm.intr.fmuladd({{.*}}) : (vector<4xf32>, vector<4xf32>, vector<4xf32>) -> vector<4xf32> +// CHECK: llvm.br +// CHECK-COUNT-2: llvm.store {{.*}} : vector<4xf32>, !llvm.ptr<1> // ----- diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/rocdl_pipeline_test.mlir b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/rocdl_pipeline_test.mlir index 8055b9e8c412..578d28b027b5 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/rocdl_pipeline_test.mlir +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/rocdl_pipeline_test.mlir @@ -87,12 +87,17 @@ hal.executable @dot_dispatch_0 { // RDNA3-LABEL: hal.executable public @dot_dispatch_0 // RDNA3: hal.executable.variant public @rocm // RDNA3-NOT: llvm.store +// RDNA3-COUNT-3: llvm.load {{.*}} : !llvm.ptr<1> -> vector<4xf32> // 
RDNA3: llvm.br -// RDNA3-COUNT-1: llvm.load {{.*}} : !llvm.ptr<3> -> vector<32xf32> -// RDNA3-COUNT-32: llvm.load {{.*}} : !llvm.ptr<3> -> vector<16xf32> -// RDNA3-COUNT-32: llvm.intr.fmuladd({{.*}}) : (vector<16xf32>, vector<16xf32>, vector<16xf32>) -> vector<16xf32> -// RDNA3-COUNT-1: llvm.store {{.*}} : vector<16xf32>, !llvm.ptr<1> +// RDNA3-COUNT-3: llvm.store {{.*}} : vector<4xf32>, !llvm.ptr<3> +// RDNA3-COUNT-32: llvm.load {{.*}} : !llvm.ptr<3> -> vector<4xf32> +// RDNA3-COUNT-128: llvm.intr.fmuladd({{.*}}) : (vector<4xf32>, vector<4xf32>, vector<4xf32>) -> vector<4xf32> +// RDNA3-COUNT-3: llvm.load {{.*}} : !llvm.ptr<1> -> vector<4xf32> // RDNA3: llvm.br +// RDNA3-COUNT-3: llvm.store {{.*}} : vector<4xf32>, !llvm.ptr<3> +// RDNA3-COUNT-32: llvm.load {{.*}} : !llvm.ptr<3> -> vector<4xf32> +// RDNA3-COUNT-128: llvm.intr.fmuladd({{.*}}) : (vector<4xf32>, vector<4xf32>, vector<4xf32>) -> vector<4xf32> +// RDNA3-COUNT-4: llvm.store {{.*}} : vector<4xf32>, !llvm.ptr<1> // ----- diff --git a/tests/e2e/matmul/BUILD.bazel b/tests/e2e/matmul/BUILD.bazel index 8ffe93c0ffac..0bad5e06eef7 100644 --- a/tests/e2e/matmul/BUILD.bazel +++ b/tests/e2e/matmul/BUILD.bazel @@ -385,6 +385,30 @@ X86_64_AVX512_BF16 = X86_64_AVX512 + [ ## ########################################################################### +iree_generated_e2e_runner_test( + name = "e2e_matmul_cuda_f32_large_simt", + generator = ":generate_e2e_matmul_tests", + generator_args = [ + "--lhs_rhs_type=f32", + "--acc_type=f32", + "--shapes=easy_large_static", + "--compilation_info=LLVMGPUMatmulSimt", + ], + tags = [ + # CUDA cuInit fails with sanitizer on. + "noasan", + "nomsan", + "notsan", + "noubsan", + "requires-gpu-nvidia", + ], + target_backends_and_drivers = [ + ("cuda", "cuda"), + ], + test_runner = "//tools/testing/e2e:iree-e2e-matmul-test", + test_type = "matmul", +) + # Testing Ampere + TensorCore path. 
# WMMA TensorCore(F32): wmma.161616.f32.tf32 iree_generated_e2e_runner_test( diff --git a/tests/e2e/matmul/CMakeLists.txt b/tests/e2e/matmul/CMakeLists.txt index b744d346ebef..9e7ec415b564 100644 --- a/tests/e2e/matmul/CMakeLists.txt +++ b/tests/e2e/matmul/CMakeLists.txt @@ -1016,6 +1016,32 @@ iree_generated_e2e_runner_test( "--iree-opt-data-tiling" ) +iree_generated_e2e_runner_test( + NAME + e2e_matmul_cuda_f32_large_simt + TEST_TYPE + matmul + GENERATOR + "generate_e2e_matmul_tests.py" + GENERATOR_ARGS + "--lhs_rhs_type=f32" + "--acc_type=f32" + "--shapes=easy_large_static" + "--compilation_info=LLVMGPUMatmulSimt" + TEST_RUNNER + iree_tools_testing_e2e_iree-e2e-matmul-test + TARGET_BACKENDS + "cuda" + DRIVERS + "cuda" + LABELS + "noasan" + "nomsan" + "notsan" + "noubsan" + "requires-gpu-nvidia" +) + iree_generated_e2e_runner_test( NAME e2e_matmul_cuda_f32_large_tensorcore diff --git a/tests/e2e/matmul/generate_e2e_matmul_tests.py b/tests/e2e/matmul/generate_e2e_matmul_tests.py index 3061fb620af0..a97b5626c069 100644 --- a/tests/e2e/matmul/generate_e2e_matmul_tests.py +++ b/tests/e2e/matmul/generate_e2e_matmul_tests.py @@ -50,6 +50,7 @@ class ShapesId(enum.Enum): @enum.unique class CompilationInfoId(enum.Enum): NONE = "" + LLVMGPUMatmulSimt = "LLVMGPUMatmulSimt" LLVMGPUMatmulTensorCore = "LLVMGPUMatmulTensorCore" LLVMGPUMatmulTensorCoreMmaSync = "LLVMGPUMatmulTensorCoreMmaSync" LLVMGPUVectorDistributeMFMA = "LLVMGPUVectorDistributeMFMA" @@ -460,7 +461,18 @@ def get_test_compilation_infos( software_pipeline_depth = 0 tile_workgroup_size_pairs = [] - if compilation_info_id == CompilationInfoId.SPIRVCooperativeMatrixVectorize: + if compilation_info_id == CompilationInfoId.LLVMGPUMatmulSimt: + tile_workgroup_size_pairs = [ + TileWorkgroupSizePair([[32, 128, 32]], [32, 8, 1]), + TileWorkgroupSizePair([[128, 64, 8]], [16, 8, 1]), + TileWorkgroupSizePair([[16, 256, 32]], [64, 2, 1]), + TileWorkgroupSizePair([[8, 32, 32]], [8, 8, 1]), + TileWorkgroupSizePair([[8, 128, 
4]], [32, 1, 1]), + TileWorkgroupSizePair([[16, 64, 4]], [16, 2, 1]), + TileWorkgroupSizePair([[1, 128, 8]], [32, 1, 1]), + ] + software_pipeline_depth = 3 + elif compilation_info_id == CompilationInfoId.SPIRVCooperativeMatrixVectorize: tile_workgroup_size_pairs = [ TileWorkgroupSizePair( [[64, 128], [32, 64], [0, 0, 32], [16, 16, 16]], [64, 2, 1] From 5b67943517b8f39e633f95696bfee175a225696c Mon Sep 17 00:00:00 2001 From: Ben Vanik Date: Wed, 18 Dec 2024 09:06:33 -0800 Subject: [PATCH 36/64] Moving synchronous HAL file APIs to the public API. (#19512) Implementations backed by device APIs like cuFile/DirectStorage/etc will return false for iree_hal_file_supports_synchronous_io and the respective HAL implementations will handle I/O themselves. --- runtime/src/iree/hal/file.c | 53 ++++++++++ runtime/src/iree/hal/file.h | 56 +++++++++++ runtime/src/iree/hal/utils/BUILD.bazel | 1 - runtime/src/iree/hal/utils/CMakeLists.txt | 1 - runtime/src/iree/hal/utils/file_transfer.c | 21 +++- runtime/src/iree/hal/utils/file_transfer.h | 10 +- runtime/src/iree/hal/utils/memory_file.c | 108 +++++++-------------- runtime/src/iree/hal/utils/memory_file.h | 41 +------- 8 files changed, 172 insertions(+), 119 deletions(-) diff --git a/runtime/src/iree/hal/file.c b/runtime/src/iree/hal/file.c index dceb3c214804..1dea5c877344 100644 --- a/runtime/src/iree/hal/file.c +++ b/runtime/src/iree/hal/file.c @@ -31,3 +31,56 @@ IREE_API_EXPORT iree_status_t iree_hal_file_import( IREE_TRACE_ZONE_END(z0); return status; } + +IREE_API_EXPORT iree_hal_memory_access_t +iree_hal_file_allowed_access(iree_hal_file_t* file) { + IREE_ASSERT_ARGUMENT(file); + return _VTABLE_DISPATCH(file, allowed_access)(file); +} + +IREE_API_EXPORT uint64_t iree_hal_file_length(iree_hal_file_t* file) { + IREE_ASSERT_ARGUMENT(file); + return _VTABLE_DISPATCH(file, length)(file); +} + +IREE_API_EXPORT iree_hal_buffer_t* iree_hal_file_storage_buffer( + iree_hal_file_t* file) { + IREE_ASSERT_ARGUMENT(file); + return 
_VTABLE_DISPATCH(file, storage_buffer)(file); +} + +IREE_API_EXPORT bool iree_hal_file_supports_synchronous_io( + iree_hal_file_t* file) { + IREE_ASSERT_ARGUMENT(file); + return _VTABLE_DISPATCH(file, supports_synchronous_io)(file); +} + +IREE_API_EXPORT iree_status_t iree_hal_file_read( + iree_hal_file_t* file, uint64_t file_offset, iree_hal_buffer_t* buffer, + iree_device_size_t buffer_offset, iree_device_size_t length) { + IREE_ASSERT_ARGUMENT(file); + IREE_ASSERT_ARGUMENT(buffer); + IREE_TRACE_ZONE_BEGIN(z0); + IREE_TRACE_ZONE_APPEND_VALUE_I64(z0, file_offset); + IREE_TRACE_ZONE_APPEND_VALUE_I64(z0, (int64_t)buffer_offset); + IREE_TRACE_ZONE_APPEND_VALUE_I64(z0, (int64_t)length); + iree_status_t status = _VTABLE_DISPATCH(file, read)(file, file_offset, buffer, + buffer_offset, length); + IREE_TRACE_ZONE_END(z0); + return status; +} + +IREE_API_EXPORT iree_status_t iree_hal_file_write( + iree_hal_file_t* file, uint64_t file_offset, iree_hal_buffer_t* buffer, + iree_device_size_t buffer_offset, iree_device_size_t length) { + IREE_ASSERT_ARGUMENT(file); + IREE_ASSERT_ARGUMENT(buffer); + IREE_TRACE_ZONE_BEGIN(z0); + IREE_TRACE_ZONE_APPEND_VALUE_I64(z0, file_offset); + IREE_TRACE_ZONE_APPEND_VALUE_I64(z0, (int64_t)buffer_offset); + IREE_TRACE_ZONE_APPEND_VALUE_I64(z0, (int64_t)length); + iree_status_t status = _VTABLE_DISPATCH(file, write)( + file, file_offset, buffer, buffer_offset, length); + IREE_TRACE_ZONE_END(z0); + return status; +} diff --git a/runtime/src/iree/hal/file.h b/runtime/src/iree/hal/file.h index d727f59051c5..8d75017a1d43 100644 --- a/runtime/src/iree/hal/file.h +++ b/runtime/src/iree/hal/file.h @@ -94,12 +94,68 @@ IREE_API_EXPORT void iree_hal_file_retain(iree_hal_file_t* file); // Releases the given |file| from the caller. IREE_API_EXPORT void iree_hal_file_release(iree_hal_file_t* file); +// Returns the memory access allowed to the file. 
+// This may be more strict than the original file handle backing the resource +// if for example we want to prevent particular users from mutating the file. +IREE_API_EXPORT iree_hal_memory_access_t +iree_hal_file_allowed_access(iree_hal_file_t* file); + +// Returns the total accessible range of the file. +// This may be a portion of the original file backing this handle. +IREE_API_EXPORT uint64_t iree_hal_file_length(iree_hal_file_t* file); + +// Returns an optional device-accessible storage buffer representing the file. +// Available if the implementation is able to perform import/address-space +// mapping/etc such that device-side transfers can directly access the resources +// as if they were a normal device buffer. +IREE_API_EXPORT iree_hal_buffer_t* iree_hal_file_storage_buffer( + iree_hal_file_t* file); + +// TODO(benvanik): truncate/extend? (both can be tricky with async) + +// Returns true if the iree_hal_file_read and iree_hal_file_write APIs are +// available for use on the file. Not all implementations support synchronous +// I/O. +IREE_API_EXPORT bool iree_hal_file_supports_synchronous_io( + iree_hal_file_t* file); + +// Synchronously reads a segment of |file| into |buffer|. +// Blocks the caller until completed. Buffers are always host mappable. +// Only available if iree_hal_file_supports_synchronous_io is true. +IREE_API_EXPORT iree_status_t iree_hal_file_read( + iree_hal_file_t* file, uint64_t file_offset, iree_hal_buffer_t* buffer, + iree_device_size_t buffer_offset, iree_device_size_t length); + +// Synchronously writes a segment of |buffer| into |file|. +// Blocks the caller until completed. Buffers are always host mappable. +// Only available if iree_hal_file_supports_synchronous_io is true. 
+IREE_API_EXPORT iree_status_t iree_hal_file_write( + iree_hal_file_t* file, uint64_t file_offset, iree_hal_buffer_t* buffer, + iree_device_size_t buffer_offset, iree_device_size_t length); + //===----------------------------------------------------------------------===// // iree_hal_file_t implementation details //===----------------------------------------------------------------------===// typedef struct iree_hal_file_vtable_t { void(IREE_API_PTR* destroy)(iree_hal_file_t* IREE_RESTRICT file); + + iree_hal_memory_access_t(IREE_API_PTR* allowed_access)(iree_hal_file_t* file); + + uint64_t(IREE_API_PTR* length)(iree_hal_file_t* file); + + iree_hal_buffer_t*(IREE_API_PTR* storage_buffer)(iree_hal_file_t* file); + + bool(IREE_API_PTR* supports_synchronous_io)(iree_hal_file_t* file); + iree_status_t(IREE_API_PTR* read)(iree_hal_file_t* file, uint64_t file_offset, + iree_hal_buffer_t* buffer, + iree_device_size_t buffer_offset, + iree_device_size_t length); + iree_status_t(IREE_API_PTR* write)(iree_hal_file_t* file, + uint64_t file_offset, + iree_hal_buffer_t* buffer, + iree_device_size_t buffer_offset, + iree_device_size_t length); } iree_hal_file_vtable_t; IREE_HAL_ASSERT_VTABLE_LAYOUT(iree_hal_file_vtable_t); diff --git a/runtime/src/iree/hal/utils/BUILD.bazel b/runtime/src/iree/hal/utils/BUILD.bazel index 0ad6080f91ff..6828271820f2 100644 --- a/runtime/src/iree/hal/utils/BUILD.bazel +++ b/runtime/src/iree/hal/utils/BUILD.bazel @@ -99,7 +99,6 @@ iree_runtime_cc_library( srcs = ["file_transfer.c"], hdrs = ["file_transfer.h"], deps = [ - ":memory_file", "//runtime/src/iree/base", "//runtime/src/iree/base/internal", "//runtime/src/iree/hal", diff --git a/runtime/src/iree/hal/utils/CMakeLists.txt b/runtime/src/iree/hal/utils/CMakeLists.txt index 4bb9c8c7a522..f2a6e92ee57e 100644 --- a/runtime/src/iree/hal/utils/CMakeLists.txt +++ b/runtime/src/iree/hal/utils/CMakeLists.txt @@ -120,7 +120,6 @@ iree_cc_library( SRCS "file_transfer.c" DEPS - ::memory_file iree::base 
iree::base::internal iree::hal diff --git a/runtime/src/iree/hal/utils/file_transfer.c b/runtime/src/iree/hal/utils/file_transfer.c index 193a2f56c289..3ffa57010676 100644 --- a/runtime/src/iree/hal/utils/file_transfer.c +++ b/runtime/src/iree/hal/utils/file_transfer.c @@ -7,7 +7,6 @@ #include "iree/hal/utils/file_transfer.h" #include "iree/base/internal/math.h" -#include "iree/hal/utils/memory_file.h" //===----------------------------------------------------------------------===// // Configuration @@ -40,8 +39,6 @@ // iree_hal_transfer_operation_t //===----------------------------------------------------------------------===// -// TODO(benvanik): move to utils/ without relying on iree_hal_memory_file_t. - // Maximum number of transfer workers that can be used; common usage should be // 1-4 but on very large systems with lots of bandwidth we may be able to // use more. @@ -876,6 +873,15 @@ IREE_API_EXPORT iree_status_t iree_hal_device_queue_read_streaming( target_offset, length, IREE_HAL_COPY_FLAG_NONE); } + // This host-side transfer utility requires synchronous I/O. + // HAL implementations are expected to handle asynchronous files themselves. + if (!iree_hal_file_supports_synchronous_io(source_file)) { + return iree_make_status( + IREE_STATUS_INVALID_ARGUMENT, + "provided source file does not support synchronous I/O and cannot be " + "used with streaming file transfer"); + } + // Allocate full transfer operation. iree_hal_transfer_operation_t* operation = NULL; IREE_RETURN_IF_ERROR(iree_hal_transfer_operation_create( @@ -917,6 +923,15 @@ IREE_API_EXPORT iree_status_t iree_hal_device_queue_write_streaming( (iree_device_size_t)target_offset, length, IREE_HAL_COPY_FLAG_NONE); } + // This host-side transfer utility requires synchronous I/O. + // HAL implementations are expected to handle asynchronous files themselves. 
+ if (!iree_hal_file_supports_synchronous_io(target_file)) { + return iree_make_status( + IREE_STATUS_INVALID_ARGUMENT, + "provided target file does not support synchronous I/O and cannot be " + "used with streaming file transfer"); + } + // Allocate full transfer operation. iree_hal_transfer_operation_t* operation = NULL; IREE_RETURN_IF_ERROR(iree_hal_transfer_operation_create( diff --git a/runtime/src/iree/hal/utils/file_transfer.h b/runtime/src/iree/hal/utils/file_transfer.h index ece694bb0b85..8f61249f9ec2 100644 --- a/runtime/src/iree/hal/utils/file_transfer.h +++ b/runtime/src/iree/hal/utils/file_transfer.h @@ -52,8 +52,9 @@ typedef struct iree_hal_file_transfer_options_t { // The provided |options.loop| is used for any asynchronous host operations // performed as part of the transfer. // -// WARNING: this only works with memory files as created via -// iree_hal_memory_file_wrap. +// Only files that support synchronous I/O are supported. Callers must use +// iree_hal_file_supports_synchronous_io and route asynchronous files to native +// implementations. IREE_API_EXPORT iree_status_t iree_hal_device_queue_read_streaming( iree_hal_device_t* device, iree_hal_queue_affinity_t queue_affinity, const iree_hal_semaphore_list_t wait_semaphore_list, @@ -75,8 +76,9 @@ IREE_API_EXPORT iree_status_t iree_hal_device_queue_read_streaming( // The provided |options.loop| is used for any asynchronous host operations // performed as part of the transfer. // -// WARNING: this only works with memory files as created via -// iree_hal_memory_file_wrap. +// Only files that support synchronous I/O are supported. Callers must use +// iree_hal_file_supports_synchronous_io and route asynchronous files to native +// implementations. 
IREE_API_EXPORT iree_status_t iree_hal_device_queue_write_streaming( iree_hal_device_t* device, iree_hal_queue_affinity_t queue_affinity, const iree_hal_semaphore_list_t wait_semaphore_list, diff --git a/runtime/src/iree/hal/utils/memory_file.c b/runtime/src/iree/hal/utils/memory_file.c index 9de9cca1dae3..3974b846ebe9 100644 --- a/runtime/src/iree/hal/utils/memory_file.c +++ b/runtime/src/iree/hal/utils/memory_file.c @@ -123,17 +123,18 @@ IREE_API_EXPORT iree_status_t iree_hal_memory_file_wrap( iree_allocator_t host_allocator, iree_hal_file_t** out_file) { IREE_ASSERT_ARGUMENT(out_file); *out_file = NULL; - IREE_TRACE_ZONE_BEGIN(z0); // For now we only support host allocations but could open other types that // may be backed by memory if desired. if (iree_io_file_handle_type(handle) != IREE_IO_FILE_HANDLE_TYPE_HOST_ALLOCATION) { - IREE_TRACE_ZONE_END(z0); return iree_make_status(IREE_STATUS_UNIMPLEMENTED, "support for wrapping non-host-allocation file " "handles with memory files is not yet implemented"); } + + IREE_TRACE_ZONE_BEGIN(z0); + iree_byte_span_t contents = iree_io_file_handle_value(handle).host_allocation; // Note that iree_device_size_t (for device offsets/sizes) may be smaller than @@ -274,90 +275,55 @@ static void iree_hal_memory_file_try_import_buffer( iree_status_ignore(status); } -static const iree_hal_file_vtable_t iree_hal_memory_file_vtable = { - .destroy = iree_hal_memory_file_destroy, -}; - -//===----------------------------------------------------------------------===// -// EXPERIMENTAL: synchronous file read/write API -//===----------------------------------------------------------------------===// -// This is incomplete and may not appear like this on the iree_hal_file_t -// vtable; this does work for memory files though. - -IREE_API_EXPORT iree_hal_memory_access_t -iree_hal_file_allowed_access(iree_hal_file_t* base_file) { - IREE_ASSERT_ARGUMENT(base_file); - - // EXPERIMENTAL: today only memory files. 
This should be on the file vtable - // (if supported - not all implementations need to support it). - iree_hal_memory_file_t* file = (iree_hal_memory_file_t*)base_file; - +static iree_hal_memory_access_t iree_hal_memory_file_allowed_access( + iree_hal_file_t* base_file) { + iree_hal_memory_file_t* file = iree_hal_memory_file_cast(base_file); return file->access; } -IREE_API_EXPORT uint64_t iree_hal_file_length(iree_hal_file_t* base_file) { - IREE_ASSERT_ARGUMENT(base_file); - - // EXPERIMENTAL: today only memory files. This should be on the file vtable - // (if supported - not all implementations need to support it). - iree_hal_memory_file_t* file = (iree_hal_memory_file_t*)base_file; - +static uint64_t iree_hal_memory_file_length(iree_hal_file_t* base_file) { + iree_hal_memory_file_t* file = iree_hal_memory_file_cast(base_file); return file->storage->contents.data_length; } -IREE_API_EXPORT iree_hal_buffer_t* iree_hal_file_storage_buffer( +static iree_hal_buffer_t* iree_hal_memory_file_storage_buffer( iree_hal_file_t* base_file) { - IREE_ASSERT_ARGUMENT(base_file); - - // EXPERIMENTAL: today only memory files. This should be on the file vtable - // (if supported - not all implementations need to support it). - iree_hal_memory_file_t* file = (iree_hal_memory_file_t*)base_file; - + iree_hal_memory_file_t* file = iree_hal_memory_file_cast(base_file); return file->imported_buffer; } -IREE_API_EXPORT iree_status_t iree_hal_file_read( - iree_hal_file_t* base_file, uint64_t file_offset, iree_hal_buffer_t* buffer, - iree_device_size_t buffer_offset, iree_device_size_t length) { - IREE_ASSERT_ARGUMENT(base_file); - IREE_ASSERT_ARGUMENT(buffer); - IREE_TRACE_ZONE_BEGIN(z0); - IREE_TRACE_ZONE_APPEND_VALUE_I64(z0, file_offset); - IREE_TRACE_ZONE_APPEND_VALUE_I64(z0, (int64_t)buffer_offset); - IREE_TRACE_ZONE_APPEND_VALUE_I64(z0, (int64_t)length); - - // EXPERIMENTAL: today only memory files. 
This should be on the file vtable - // (if supported - not all implementations need to support it). - iree_hal_memory_file_t* file = (iree_hal_memory_file_t*)base_file; +static bool iree_hal_memory_file_supports_synchronous_io( + iree_hal_file_t* base_file) { + // Memory files always support synchronous IO. + return true; +} - // Copy from the file contents to the staging buffer. +static iree_status_t iree_hal_memory_file_read(iree_hal_file_t* base_file, + uint64_t file_offset, + iree_hal_buffer_t* buffer, + iree_device_size_t buffer_offset, + iree_device_size_t length) { + iree_hal_memory_file_t* file = iree_hal_memory_file_cast(base_file); iree_byte_span_t file_contents = file->storage->contents; - iree_status_t status = iree_hal_buffer_map_write( - buffer, buffer_offset, file_contents.data + file_offset, length); - - IREE_TRACE_ZONE_END(z0); - return status; + return iree_hal_buffer_map_write(buffer, buffer_offset, + file_contents.data + file_offset, length); } -IREE_API_EXPORT iree_status_t iree_hal_file_write( +static iree_status_t iree_hal_memory_file_write( iree_hal_file_t* base_file, uint64_t file_offset, iree_hal_buffer_t* buffer, iree_device_size_t buffer_offset, iree_device_size_t length) { - IREE_ASSERT_ARGUMENT(base_file); - IREE_ASSERT_ARGUMENT(buffer); - IREE_TRACE_ZONE_BEGIN(z0); - IREE_TRACE_ZONE_APPEND_VALUE_I64(z0, file_offset); - IREE_TRACE_ZONE_APPEND_VALUE_I64(z0, (int64_t)buffer_offset); - IREE_TRACE_ZONE_APPEND_VALUE_I64(z0, (int64_t)length); - - // EXPERIMENTAL: today only memory files. This should be on the file vtable - // (if supported - not all implementations need to support it). - iree_hal_memory_file_t* file = (iree_hal_memory_file_t*)base_file; - - // Copy from the staging buffer to the file contents. 
+ iree_hal_memory_file_t* file = iree_hal_memory_file_cast(base_file); iree_byte_span_t file_contents = file->storage->contents; - iree_status_t status = iree_hal_buffer_map_read( - buffer, buffer_offset, file_contents.data + file_offset, length); - - IREE_TRACE_ZONE_END(z0); - return status; + return iree_hal_buffer_map_read(buffer, buffer_offset, + file_contents.data + file_offset, length); } + +static const iree_hal_file_vtable_t iree_hal_memory_file_vtable = { + .destroy = iree_hal_memory_file_destroy, + .allowed_access = iree_hal_memory_file_allowed_access, + .length = iree_hal_memory_file_length, + .storage_buffer = iree_hal_memory_file_storage_buffer, + .supports_synchronous_io = iree_hal_memory_file_supports_synchronous_io, + .read = iree_hal_memory_file_read, + .write = iree_hal_memory_file_write, +}; diff --git a/runtime/src/iree/hal/utils/memory_file.h b/runtime/src/iree/hal/utils/memory_file.h index af03370653aa..433b99b804a9 100644 --- a/runtime/src/iree/hal/utils/memory_file.h +++ b/runtime/src/iree/hal/utils/memory_file.h @@ -19,8 +19,8 @@ extern "C" { // iree_hal_memory_file_t //===----------------------------------------------------------------------===// -// Creates a file handle backed by |contents| without copying the data. -// |release_callback| will be called when the file is destroyed. +// Creates a file backed by |handle| without copying the data. +// Only supports file handles of IREE_IO_FILE_HANDLE_TYPE_HOST_ALLOCATION. // If the memory can be imported into a usable staging buffer |device_allocator| // will be used to do so. 
IREE_API_EXPORT iree_status_t iree_hal_memory_file_wrap( @@ -28,43 +28,6 @@ IREE_API_EXPORT iree_status_t iree_hal_memory_file_wrap( iree_io_file_handle_t* handle, iree_hal_allocator_t* device_allocator, iree_allocator_t host_allocator, iree_hal_file_t** out_file); -//===----------------------------------------------------------------------===// -// EXPERIMENTAL: synchronous file read/write API -//===----------------------------------------------------------------------===// -// This is incomplete and may not appear like this on the iree_hal_file_t -// vtable; this does work for memory files though. - -// Returns the memory access allowed to the file. -// This may be more strict than the original file handle backing the resource -// if for example we want to prevent particular users from mutating the file. -IREE_API_EXPORT iree_hal_memory_access_t -iree_hal_file_allowed_access(iree_hal_file_t* file); - -// Returns the total accessible range of the file. -// This may be a portion of the original file backing this handle. -IREE_API_EXPORT uint64_t iree_hal_file_length(iree_hal_file_t* file); - -// Returns an optional device-accessible storage buffer representing the file. -// Available if the implementation is able to perform import/address-space -// mapping/etc such that device-side transfers can directly access the resources -// as if they were a normal device buffer. -IREE_API_EXPORT iree_hal_buffer_t* iree_hal_file_storage_buffer( - iree_hal_file_t* file); - -// TODO(benvanik): truncate/extend? (both can be tricky with async) - -// Synchronously reads a segment of |file| into |buffer|. -// Blocks the caller until completed. Buffers are always host mappable. -IREE_API_EXPORT iree_status_t iree_hal_file_read( - iree_hal_file_t* file, uint64_t file_offset, iree_hal_buffer_t* buffer, - iree_device_size_t buffer_offset, iree_device_size_t length); - -// Synchronously writes a segment of |buffer| into |file|. -// Blocks the caller until completed. 
Buffers are always host mappable. -IREE_API_EXPORT iree_status_t iree_hal_file_write( - iree_hal_file_t* file, uint64_t file_offset, iree_hal_buffer_t* buffer, - iree_device_size_t buffer_offset, iree_device_size_t length); - #ifdef __cplusplus } // extern "C" #endif // __cplusplus From cbdcdd04867f6a1ea47f95f6e4647946fe2dabeb Mon Sep 17 00:00:00 2001 From: Ben Vanik Date: Wed, 18 Dec 2024 09:27:07 -0800 Subject: [PATCH 37/64] Adding iree_hal_file_from_handle factory for common file impls. (#19513) This lets us hide the details from HAL implementations. Implementations are expected to snoop the file handle type to see if they can do better and otherwise fall back to the common implementations. --- experimental/webgpu/BUILD.bazel | 2 +- experimental/webgpu/CMakeLists.txt | 2 +- experimental/webgpu/webgpu_device.c | 12 ++---- runtime/src/iree/hal/drivers/cuda/BUILD.bazel | 2 +- .../src/iree/hal/drivers/cuda/CMakeLists.txt | 2 +- .../src/iree/hal/drivers/cuda/cuda_device.c | 12 ++---- .../src/iree/hal/drivers/hip/CMakeLists.txt | 2 +- runtime/src/iree/hal/drivers/hip/hip_device.c | 13 ++----- .../iree/hal/drivers/local_sync/BUILD.bazel | 2 +- .../hal/drivers/local_sync/CMakeLists.txt | 2 +- .../iree/hal/drivers/local_sync/sync_device.c | 12 ++---- .../iree/hal/drivers/local_task/BUILD.bazel | 2 +- .../hal/drivers/local_task/CMakeLists.txt | 2 +- .../iree/hal/drivers/local_task/task_device.c | 12 ++---- .../src/iree/hal/drivers/metal/CMakeLists.txt | 2 +- .../src/iree/hal/drivers/metal/metal_device.m | 11 ++---- runtime/src/iree/hal/drivers/null/BUILD.bazel | 2 +- .../src/iree/hal/drivers/null/CMakeLists.txt | 2 +- runtime/src/iree/hal/drivers/null/device.c | 12 ++---- .../src/iree/hal/drivers/vulkan/BUILD.bazel | 2 +- .../iree/hal/drivers/vulkan/CMakeLists.txt | 2 +- .../iree/hal/drivers/vulkan/vulkan_device.cc | 12 ++---- runtime/src/iree/hal/utils/BUILD.bazel | 28 ++++++++------ runtime/src/iree/hal/utils/CMakeLists.txt | 30 ++++++++------- 
runtime/src/iree/hal/utils/file_registry.c | 38 +++++++++++++++++++ runtime/src/iree/hal/utils/file_registry.h | 35 +++++++++++++++++ runtime/src/iree/hal/utils/memory_file.c | 5 ++- runtime/src/iree/hal/utils/memory_file.h | 5 ++- 28 files changed, 150 insertions(+), 115 deletions(-) create mode 100644 runtime/src/iree/hal/utils/file_registry.c create mode 100644 runtime/src/iree/hal/utils/file_registry.h diff --git a/experimental/webgpu/BUILD.bazel b/experimental/webgpu/BUILD.bazel index c7cec08b3070..12d1a44a6e99 100644 --- a/experimental/webgpu/BUILD.bazel +++ b/experimental/webgpu/BUILD.bazel @@ -55,7 +55,7 @@ iree_runtime_cc_library( "//runtime/src/iree/hal/utils:buffer_transfer", "//runtime/src/iree/hal/utils:executable_debug_info", "//runtime/src/iree/hal/utils:file_transfer", - "//runtime/src/iree/hal/utils:memory_file", + "//runtime/src/iree/hal/utils:files", "//runtime/src/iree/schemas:executable_debug_info_c_fbs", "//runtime/src/iree/schemas:webgpu_executable_def_c_fbs", "@webgpu_headers", diff --git a/experimental/webgpu/CMakeLists.txt b/experimental/webgpu/CMakeLists.txt index fa71067dfc69..fa0482089666 100644 --- a/experimental/webgpu/CMakeLists.txt +++ b/experimental/webgpu/CMakeLists.txt @@ -49,7 +49,7 @@ iree_cc_library( iree::experimental::webgpu::platform iree::experimental::webgpu::shaders iree::hal::utils::file_transfer - iree::hal::utils::memory_file + iree::hal::utils::files iree::schemas::webgpu_executable_def_c_fbs PUBLIC ) diff --git a/experimental/webgpu/webgpu_device.c b/experimental/webgpu/webgpu_device.c index c9a2457a4fc6..ab0c455efa9f 100644 --- a/experimental/webgpu/webgpu_device.c +++ b/experimental/webgpu/webgpu_device.c @@ -20,8 +20,8 @@ #include "experimental/webgpu/simple_allocator.h" #include "experimental/webgpu/staging_buffer.h" #include "iree/base/internal/arena.h" +#include "iree/hal/utils/file_registry.h" #include "iree/hal/utils/file_transfer.h" -#include "iree/hal/utils/memory_file.h" 
//===----------------------------------------------------------------------===// // iree_hal_webgpu_device_t @@ -283,14 +283,8 @@ static iree_status_t iree_hal_webgpu_device_import_file( iree_hal_device_t* base_device, iree_hal_queue_affinity_t queue_affinity, iree_hal_memory_access_t access, iree_io_file_handle_t* handle, iree_hal_external_file_flags_t flags, iree_hal_file_t** out_file) { - if (iree_io_file_handle_type(handle) != - IREE_IO_FILE_HANDLE_TYPE_HOST_ALLOCATION) { - return iree_make_status( - IREE_STATUS_UNAVAILABLE, - "implementation does not support the external file type"); - } - return iree_hal_memory_file_wrap( - queue_affinity, access, handle, iree_hal_device_allocator(base_device), + return iree_hal_file_from_handle( + iree_hal_device_allocator(base_device), queue_affinity, access, handle, iree_hal_device_host_allocator(base_device), out_file); } diff --git a/runtime/src/iree/hal/drivers/cuda/BUILD.bazel b/runtime/src/iree/hal/drivers/cuda/BUILD.bazel index f6551efa6bd5..29cbec7ac286 100644 --- a/runtime/src/iree/hal/drivers/cuda/BUILD.bazel +++ b/runtime/src/iree/hal/drivers/cuda/BUILD.bazel @@ -63,7 +63,7 @@ iree_runtime_cc_library( "//runtime/src/iree/hal/utils:deferred_work_queue", "//runtime/src/iree/hal/utils:executable_debug_info", "//runtime/src/iree/hal/utils:file_transfer", - "//runtime/src/iree/hal/utils:memory_file", + "//runtime/src/iree/hal/utils:files", "//runtime/src/iree/hal/utils:resource_set", "//runtime/src/iree/hal/utils:semaphore_base", "//runtime/src/iree/hal/utils:stream_tracing", diff --git a/runtime/src/iree/hal/drivers/cuda/CMakeLists.txt b/runtime/src/iree/hal/drivers/cuda/CMakeLists.txt index a040b1de753f..03ec177c5efa 100644 --- a/runtime/src/iree/hal/drivers/cuda/CMakeLists.txt +++ b/runtime/src/iree/hal/drivers/cuda/CMakeLists.txt @@ -60,7 +60,7 @@ iree_cc_library( iree::hal::utils::deferred_work_queue iree::hal::utils::executable_debug_info iree::hal::utils::file_transfer - iree::hal::utils::memory_file + 
iree::hal::utils::files iree::hal::utils::resource_set iree::hal::utils::semaphore_base iree::hal::utils::stream_tracing diff --git a/runtime/src/iree/hal/drivers/cuda/cuda_device.c b/runtime/src/iree/hal/drivers/cuda/cuda_device.c index 5ff3cdea0cce..efa326232e58 100644 --- a/runtime/src/iree/hal/drivers/cuda/cuda_device.c +++ b/runtime/src/iree/hal/drivers/cuda/cuda_device.c @@ -27,8 +27,8 @@ #include "iree/hal/drivers/cuda/timepoint_pool.h" #include "iree/hal/utils/deferred_command_buffer.h" #include "iree/hal/utils/deferred_work_queue.h" +#include "iree/hal/utils/file_registry.h" #include "iree/hal/utils/file_transfer.h" -#include "iree/hal/utils/memory_file.h" #include "iree/hal/utils/stream_tracing.h" //===----------------------------------------------------------------------===// @@ -897,14 +897,8 @@ static iree_status_t iree_hal_cuda_device_import_file( iree_hal_device_t* base_device, iree_hal_queue_affinity_t queue_affinity, iree_hal_memory_access_t access, iree_io_file_handle_t* handle, iree_hal_external_file_flags_t flags, iree_hal_file_t** out_file) { - if (iree_io_file_handle_type(handle) != - IREE_IO_FILE_HANDLE_TYPE_HOST_ALLOCATION) { - return iree_make_status( - IREE_STATUS_UNAVAILABLE, - "implementation does not support the external file type"); - } - return iree_hal_memory_file_wrap( - queue_affinity, access, handle, iree_hal_device_allocator(base_device), + return iree_hal_file_from_handle( + iree_hal_device_allocator(base_device), queue_affinity, access, handle, iree_hal_device_host_allocator(base_device), out_file); } diff --git a/runtime/src/iree/hal/drivers/hip/CMakeLists.txt b/runtime/src/iree/hal/drivers/hip/CMakeLists.txt index 15e375c1fd2c..dc9355853fc4 100644 --- a/runtime/src/iree/hal/drivers/hip/CMakeLists.txt +++ b/runtime/src/iree/hal/drivers/hip/CMakeLists.txt @@ -70,7 +70,7 @@ iree_cc_library( iree::hal::utils::executable_debug_info iree::hal::utils::deferred_command_buffer iree::hal::utils::file_transfer - 
iree::hal::utils::memory_file + iree::hal::utils::files iree::hal::utils::resource_set iree::hal::utils::semaphore_base iree::hal::utils::stream_tracing diff --git a/runtime/src/iree/hal/drivers/hip/hip_device.c b/runtime/src/iree/hal/drivers/hip/hip_device.c index bfb58f5fad41..06ce1092ddba 100644 --- a/runtime/src/iree/hal/drivers/hip/hip_device.c +++ b/runtime/src/iree/hal/drivers/hip/hip_device.c @@ -31,8 +31,8 @@ #include "iree/hal/drivers/hip/status_util.h" #include "iree/hal/drivers/hip/stream_command_buffer.h" #include "iree/hal/utils/deferred_command_buffer.h" +#include "iree/hal/utils/file_registry.h" #include "iree/hal/utils/file_transfer.h" -#include "iree/hal/utils/memory_file.h" #include "iree/hal/utils/stream_tracing.h" //===----------------------------------------------------------------------===// @@ -849,15 +849,8 @@ static iree_status_t iree_hal_hip_device_import_file( iree_hal_device_t* base_device, iree_hal_queue_affinity_t queue_affinity, iree_hal_memory_access_t access, iree_io_file_handle_t* handle, iree_hal_external_file_flags_t flags, iree_hal_file_t** out_file) { - *out_file = NULL; - if (iree_io_file_handle_type(handle) != - IREE_IO_FILE_HANDLE_TYPE_HOST_ALLOCATION) { - return iree_make_status( - IREE_STATUS_UNAVAILABLE, - "implementation does not support the external file type"); - } - return iree_hal_memory_file_wrap( - queue_affinity, access, handle, iree_hal_device_allocator(base_device), + return iree_hal_file_from_handle( + iree_hal_device_allocator(base_device), queue_affinity, access, handle, iree_hal_device_host_allocator(base_device), out_file); } diff --git a/runtime/src/iree/hal/drivers/local_sync/BUILD.bazel b/runtime/src/iree/hal/drivers/local_sync/BUILD.bazel index e8f6c3cc28f0..5650640fef22 100644 --- a/runtime/src/iree/hal/drivers/local_sync/BUILD.bazel +++ b/runtime/src/iree/hal/drivers/local_sync/BUILD.bazel @@ -37,7 +37,7 @@ iree_runtime_cc_library( "//runtime/src/iree/hal/local:executable_environment", 
"//runtime/src/iree/hal/utils:deferred_command_buffer", "//runtime/src/iree/hal/utils:file_transfer", - "//runtime/src/iree/hal/utils:memory_file", + "//runtime/src/iree/hal/utils:files", "//runtime/src/iree/hal/utils:semaphore_base", ], ) diff --git a/runtime/src/iree/hal/drivers/local_sync/CMakeLists.txt b/runtime/src/iree/hal/drivers/local_sync/CMakeLists.txt index a71f930af7ac..5fc8af6e1b7f 100644 --- a/runtime/src/iree/hal/drivers/local_sync/CMakeLists.txt +++ b/runtime/src/iree/hal/drivers/local_sync/CMakeLists.txt @@ -34,7 +34,7 @@ iree_cc_library( iree::hal::local::executable_environment iree::hal::utils::deferred_command_buffer iree::hal::utils::file_transfer - iree::hal::utils::memory_file + iree::hal::utils::files iree::hal::utils::semaphore_base PUBLIC ) diff --git a/runtime/src/iree/hal/drivers/local_sync/sync_device.c b/runtime/src/iree/hal/drivers/local_sync/sync_device.c index c539dcbcf66f..f81169283987 100644 --- a/runtime/src/iree/hal/drivers/local_sync/sync_device.c +++ b/runtime/src/iree/hal/drivers/local_sync/sync_device.c @@ -18,8 +18,8 @@ #include "iree/hal/local/inline_command_buffer.h" #include "iree/hal/local/local_executable_cache.h" #include "iree/hal/utils/deferred_command_buffer.h" +#include "iree/hal/utils/file_registry.h" #include "iree/hal/utils/file_transfer.h" -#include "iree/hal/utils/memory_file.h" typedef struct iree_hal_sync_device_t { iree_hal_resource_t resource; @@ -267,14 +267,8 @@ static iree_status_t iree_hal_sync_device_import_file( iree_hal_device_t* base_device, iree_hal_queue_affinity_t queue_affinity, iree_hal_memory_access_t access, iree_io_file_handle_t* handle, iree_hal_external_file_flags_t flags, iree_hal_file_t** out_file) { - if (iree_io_file_handle_type(handle) != - IREE_IO_FILE_HANDLE_TYPE_HOST_ALLOCATION) { - return iree_make_status( - IREE_STATUS_UNAVAILABLE, - "implementation does not support the external file type"); - } - return iree_hal_memory_file_wrap( - queue_affinity, access, handle, 
iree_hal_device_allocator(base_device), + return iree_hal_file_from_handle( + iree_hal_device_allocator(base_device), queue_affinity, access, handle, iree_hal_device_host_allocator(base_device), out_file); } diff --git a/runtime/src/iree/hal/drivers/local_task/BUILD.bazel b/runtime/src/iree/hal/drivers/local_task/BUILD.bazel index b6231bf3e3fb..48b78c8669d4 100644 --- a/runtime/src/iree/hal/drivers/local_task/BUILD.bazel +++ b/runtime/src/iree/hal/drivers/local_task/BUILD.bazel @@ -49,7 +49,7 @@ iree_runtime_cc_library( "//runtime/src/iree/hal/local:executable_library", "//runtime/src/iree/hal/utils:deferred_command_buffer", "//runtime/src/iree/hal/utils:file_transfer", - "//runtime/src/iree/hal/utils:memory_file", + "//runtime/src/iree/hal/utils:files", "//runtime/src/iree/hal/utils:resource_set", "//runtime/src/iree/hal/utils:semaphore_base", "//runtime/src/iree/task", diff --git a/runtime/src/iree/hal/drivers/local_task/CMakeLists.txt b/runtime/src/iree/hal/drivers/local_task/CMakeLists.txt index ba8f2a949c86..042ff6b5f942 100644 --- a/runtime/src/iree/hal/drivers/local_task/CMakeLists.txt +++ b/runtime/src/iree/hal/drivers/local_task/CMakeLists.txt @@ -43,7 +43,7 @@ iree_cc_library( iree::hal::local::executable_library iree::hal::utils::deferred_command_buffer iree::hal::utils::file_transfer - iree::hal::utils::memory_file + iree::hal::utils::files iree::hal::utils::resource_set iree::hal::utils::semaphore_base iree::task diff --git a/runtime/src/iree/hal/drivers/local_task/task_device.c b/runtime/src/iree/hal/drivers/local_task/task_device.c index c01979c36f45..05a7ca9a89d5 100644 --- a/runtime/src/iree/hal/drivers/local_task/task_device.c +++ b/runtime/src/iree/hal/drivers/local_task/task_device.c @@ -19,8 +19,8 @@ #include "iree/hal/local/executable_environment.h" #include "iree/hal/local/local_executable_cache.h" #include "iree/hal/utils/deferred_command_buffer.h" +#include "iree/hal/utils/file_registry.h" #include "iree/hal/utils/file_transfer.h" -#include 
"iree/hal/utils/memory_file.h" typedef struct iree_hal_task_device_t { iree_hal_resource_t resource; @@ -346,14 +346,8 @@ static iree_status_t iree_hal_task_device_import_file( iree_hal_device_t* base_device, iree_hal_queue_affinity_t queue_affinity, iree_hal_memory_access_t access, iree_io_file_handle_t* handle, iree_hal_external_file_flags_t flags, iree_hal_file_t** out_file) { - if (iree_io_file_handle_type(handle) != - IREE_IO_FILE_HANDLE_TYPE_HOST_ALLOCATION) { - return iree_make_status( - IREE_STATUS_UNAVAILABLE, - "implementation does not support the external file type"); - } - return iree_hal_memory_file_wrap( - queue_affinity, access, handle, iree_hal_device_allocator(base_device), + return iree_hal_file_from_handle( + iree_hal_device_allocator(base_device), queue_affinity, access, handle, iree_hal_device_host_allocator(base_device), out_file); } diff --git a/runtime/src/iree/hal/drivers/metal/CMakeLists.txt b/runtime/src/iree/hal/drivers/metal/CMakeLists.txt index d43efc06e3eb..9331d2d0e582 100644 --- a/runtime/src/iree/hal/drivers/metal/CMakeLists.txt +++ b/runtime/src/iree/hal/drivers/metal/CMakeLists.txt @@ -42,7 +42,7 @@ iree_cc_library( iree::hal::utils::deferred_command_buffer iree::hal::utils::executable_debug_info iree::hal::utils::file_transfer - iree::hal::utils::memory_file + iree::hal::utils::files iree::hal::utils::resource_set iree::schemas::executable_debug_info_c_fbs iree::schemas::metal_executable_def_c_fbs diff --git a/runtime/src/iree/hal/drivers/metal/metal_device.m b/runtime/src/iree/hal/drivers/metal/metal_device.m index 0a155d8fec13..4f99709ddabd 100644 --- a/runtime/src/iree/hal/drivers/metal/metal_device.m +++ b/runtime/src/iree/hal/drivers/metal/metal_device.m @@ -17,8 +17,8 @@ #include "iree/hal/drivers/metal/shared_event.h" #include "iree/hal/drivers/metal/staging_buffer.h" #include "iree/hal/utils/deferred_command_buffer.h" +#include "iree/hal/utils/file_registry.h" #include "iree/hal/utils/file_transfer.h" -#include 
"iree/hal/utils/memory_file.h" #include "iree/hal/utils/resource_set.h" typedef struct iree_hal_metal_device_t { @@ -288,13 +288,8 @@ static iree_status_t iree_hal_metal_device_import_file(iree_hal_device_t* base_d iree_io_file_handle_t* handle, iree_hal_external_file_flags_t flags, iree_hal_file_t** out_file) { - if (iree_io_file_handle_type(handle) != IREE_IO_FILE_HANDLE_TYPE_HOST_ALLOCATION) { - return iree_make_status(IREE_STATUS_UNAVAILABLE, - "implementation does not support the external file type"); - } - return iree_hal_memory_file_wrap(queue_affinity, access, handle, - iree_hal_device_allocator(base_device), - iree_hal_device_host_allocator(base_device), out_file); + return iree_hal_file_from_handle(iree_hal_device_allocator(base_device), queue_affinity, access, + handle, iree_hal_device_host_allocator(base_device), out_file); } static iree_status_t iree_hal_metal_device_create_semaphore(iree_hal_device_t* base_device, diff --git a/runtime/src/iree/hal/drivers/null/BUILD.bazel b/runtime/src/iree/hal/drivers/null/BUILD.bazel index a33a99c31c8c..5343034c7765 100644 --- a/runtime/src/iree/hal/drivers/null/BUILD.bazel +++ b/runtime/src/iree/hal/drivers/null/BUILD.bazel @@ -44,7 +44,7 @@ iree_runtime_cc_library( "//runtime/src/iree/base/internal", "//runtime/src/iree/hal", "//runtime/src/iree/hal/utils:file_transfer", - "//runtime/src/iree/hal/utils:memory_file", + "//runtime/src/iree/hal/utils:files", "//runtime/src/iree/hal/utils:semaphore_base", ], ) diff --git a/runtime/src/iree/hal/drivers/null/CMakeLists.txt b/runtime/src/iree/hal/drivers/null/CMakeLists.txt index c5ed9eb49d67..fa9f96a79101 100644 --- a/runtime/src/iree/hal/drivers/null/CMakeLists.txt +++ b/runtime/src/iree/hal/drivers/null/CMakeLists.txt @@ -41,7 +41,7 @@ iree_cc_library( iree::base::internal iree::hal iree::hal::utils::file_transfer - iree::hal::utils::memory_file + iree::hal::utils::files iree::hal::utils::semaphore_base PUBLIC ) diff --git a/runtime/src/iree/hal/drivers/null/device.c 
b/runtime/src/iree/hal/drivers/null/device.c index 11953645ff13..d91c23eb187e 100644 --- a/runtime/src/iree/hal/drivers/null/device.c +++ b/runtime/src/iree/hal/drivers/null/device.c @@ -14,8 +14,8 @@ #include "iree/hal/drivers/null/executable.h" #include "iree/hal/drivers/null/executable_cache.h" #include "iree/hal/drivers/null/semaphore.h" +#include "iree/hal/utils/file_registry.h" #include "iree/hal/utils/file_transfer.h" -#include "iree/hal/utils/memory_file.h" //===----------------------------------------------------------------------===// // iree_hal_null_device_t @@ -273,14 +273,8 @@ static iree_status_t iree_hal_null_device_import_file( // definitely prefer that. The emulated file I/O present here as a default is // inefficient. The queue affinity specifies which queues may access the file // via read and write queue operations. - if (iree_io_file_handle_type(handle) != - IREE_IO_FILE_HANDLE_TYPE_HOST_ALLOCATION) { - return iree_make_status( - IREE_STATUS_UNAVAILABLE, - "implementation does not support the external file type"); - } - return iree_hal_memory_file_wrap( - queue_affinity, access, handle, iree_hal_device_allocator(base_device), + return iree_hal_file_from_handle( + iree_hal_device_allocator(base_device), queue_affinity, access, handle, iree_hal_device_host_allocator(base_device), out_file); } diff --git a/runtime/src/iree/hal/drivers/vulkan/BUILD.bazel b/runtime/src/iree/hal/drivers/vulkan/BUILD.bazel index ff0e0899dbdd..ef91748101ea 100644 --- a/runtime/src/iree/hal/drivers/vulkan/BUILD.bazel +++ b/runtime/src/iree/hal/drivers/vulkan/BUILD.bazel @@ -81,7 +81,7 @@ iree_runtime_cc_library( "//runtime/src/iree/hal/utils:deferred_command_buffer", "//runtime/src/iree/hal/utils:executable_debug_info", "//runtime/src/iree/hal/utils:file_transfer", - "//runtime/src/iree/hal/utils:memory_file", + "//runtime/src/iree/hal/utils:files", "//runtime/src/iree/hal/utils:resource_set", "//runtime/src/iree/hal/utils:semaphore_base", 
"//runtime/src/iree/schemas:executable_debug_info_c_fbs", diff --git a/runtime/src/iree/hal/drivers/vulkan/CMakeLists.txt b/runtime/src/iree/hal/drivers/vulkan/CMakeLists.txt index 0084d00387f7..7444ef3a8486 100644 --- a/runtime/src/iree/hal/drivers/vulkan/CMakeLists.txt +++ b/runtime/src/iree/hal/drivers/vulkan/CMakeLists.txt @@ -76,7 +76,7 @@ iree_cc_library( iree::hal::utils::deferred_command_buffer iree::hal::utils::executable_debug_info iree::hal::utils::file_transfer - iree::hal::utils::memory_file + iree::hal::utils::files iree::hal::utils::resource_set iree::hal::utils::semaphore_base iree::schemas::executable_debug_info_c_fbs diff --git a/runtime/src/iree/hal/drivers/vulkan/vulkan_device.cc b/runtime/src/iree/hal/drivers/vulkan/vulkan_device.cc index 2e25d1193c3f..ac82b690781d 100644 --- a/runtime/src/iree/hal/drivers/vulkan/vulkan_device.cc +++ b/runtime/src/iree/hal/drivers/vulkan/vulkan_device.cc @@ -31,8 +31,8 @@ #include "iree/hal/drivers/vulkan/util/arena.h" #include "iree/hal/drivers/vulkan/util/ref_ptr.h" #include "iree/hal/utils/deferred_command_buffer.h" +#include "iree/hal/utils/file_registry.h" #include "iree/hal/utils/file_transfer.h" -#include "iree/hal/utils/memory_file.h" using namespace iree::hal::vulkan; @@ -1599,14 +1599,8 @@ static iree_status_t iree_hal_vulkan_device_import_file( iree_hal_device_t* base_device, iree_hal_queue_affinity_t queue_affinity, iree_hal_memory_access_t access, iree_io_file_handle_t* handle, iree_hal_external_file_flags_t flags, iree_hal_file_t** out_file) { - if (iree_io_file_handle_type(handle) != - IREE_IO_FILE_HANDLE_TYPE_HOST_ALLOCATION) { - return iree_make_status( - IREE_STATUS_UNAVAILABLE, - "implementation does not support the external file type"); - } - return iree_hal_memory_file_wrap( - queue_affinity, access, handle, iree_hal_device_allocator(base_device), + return iree_hal_file_from_handle( + iree_hal_device_allocator(base_device), queue_affinity, access, handle, 
iree_hal_device_host_allocator(base_device), out_file); } diff --git a/runtime/src/iree/hal/utils/BUILD.bazel b/runtime/src/iree/hal/utils/BUILD.bazel index 6828271820f2..76d710e7dab0 100644 --- a/runtime/src/iree/hal/utils/BUILD.bazel +++ b/runtime/src/iree/hal/utils/BUILD.bazel @@ -105,6 +105,23 @@ iree_runtime_cc_library( ], ) +iree_runtime_cc_library( + name = "files", + srcs = [ + "file_registry.c", + "memory_file.c", + ], + hdrs = [ + "file_registry.h", + "memory_file.h", + ], + deps = [ + "//runtime/src/iree/base", + "//runtime/src/iree/hal", + "//runtime/src/iree/io:file_handle", + ], +) + iree_runtime_cc_library( name = "libmpi", srcs = ["libmpi.c"], @@ -130,17 +147,6 @@ iree_runtime_cc_test( ], ) -iree_runtime_cc_library( - name = "memory_file", - srcs = ["memory_file.c"], - hdrs = ["memory_file.h"], - deps = [ - "//runtime/src/iree/base", - "//runtime/src/iree/hal", - "//runtime/src/iree/io:file_handle", - ], -) - iree_runtime_cc_library( name = "mpi_channel_provider", srcs = ["mpi_channel_provider.c"], diff --git a/runtime/src/iree/hal/utils/CMakeLists.txt b/runtime/src/iree/hal/utils/CMakeLists.txt index f2a6e92ee57e..b4e389e56eb9 100644 --- a/runtime/src/iree/hal/utils/CMakeLists.txt +++ b/runtime/src/iree/hal/utils/CMakeLists.txt @@ -126,6 +126,22 @@ iree_cc_library( PUBLIC ) +iree_cc_library( + NAME + files + HDRS + "file_registry.h" + "memory_file.h" + SRCS + "file_registry.c" + "memory_file.c" + DEPS + iree::base + iree::hal + iree::io::file_handle + PUBLIC +) + iree_cc_library( NAME libmpi @@ -153,20 +169,6 @@ iree_cc_test( iree::testing::gtest_main ) -iree_cc_library( - NAME - memory_file - HDRS - "memory_file.h" - SRCS - "memory_file.c" - DEPS - iree::base - iree::hal - iree::io::file_handle - PUBLIC -) - iree_cc_library( NAME mpi_channel_provider diff --git a/runtime/src/iree/hal/utils/file_registry.c b/runtime/src/iree/hal/utils/file_registry.c new file mode 100644 index 000000000000..3926c394b09f --- /dev/null +++ 
b/runtime/src/iree/hal/utils/file_registry.c @@ -0,0 +1,38 @@ +// Copyright 2024 The IREE Authors +// +// Licensed under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception + +#include "iree/hal/utils/file_registry.h" + +#include "iree/hal/utils/memory_file.h" + +IREE_API_EXPORT iree_status_t iree_hal_file_from_handle( + iree_hal_allocator_t* device_allocator, + iree_hal_queue_affinity_t queue_affinity, iree_hal_memory_access_t access, + iree_io_file_handle_t* handle, iree_allocator_t host_allocator, + iree_hal_file_t** out_file) { + IREE_ASSERT_ARGUMENT(handle); + IREE_ASSERT_ARGUMENT(out_file); + *out_file = NULL; + IREE_TRACE_ZONE_BEGIN(z0); + + iree_status_t status = iree_ok_status(); + switch (iree_io_file_handle_type(handle)) { + case IREE_IO_FILE_HANDLE_TYPE_HOST_ALLOCATION: + status = + iree_hal_memory_file_wrap(device_allocator, queue_affinity, access, + handle, host_allocator, out_file); + break; + default: + status = iree_make_status( + IREE_STATUS_UNIMPLEMENTED, + "no common implementation supported for file handles of type %d", + (int)iree_io_file_handle_type(handle)); + break; + } + + IREE_TRACE_ZONE_END(z0); + return status; +} diff --git a/runtime/src/iree/hal/utils/file_registry.h b/runtime/src/iree/hal/utils/file_registry.h new file mode 100644 index 000000000000..ff4c8cc4c4f8 --- /dev/null +++ b/runtime/src/iree/hal/utils/file_registry.h @@ -0,0 +1,35 @@ +// Copyright 2024 The IREE Authors +// +// Licensed under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. 
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception + +#ifndef IREE_HAL_UTILS_FILE_REGISTRY_H_ +#define IREE_HAL_UTILS_FILE_REGISTRY_H_ + +#include "iree/base/api.h" +#include "iree/hal/api.h" +#include "iree/io/file_handle.h" + +#ifdef __cplusplus +extern "C" { +#endif // __cplusplus + +// Creates a file backed by |handle| using a common host implementation. +// Supported file handle types are determined based on compile configuration. +// +// Some implementations - such as for IREE_IO_FILE_HANDLE_TYPE_HOST_ALLOCATION - +// will try to import the backing storage directly into a usable staging buffer +// using |device_allocator| and available with |queue_affinity|. Otherwise the +// file is allowed to be used with any device or queue. +IREE_API_EXPORT iree_status_t iree_hal_file_from_handle( + iree_hal_allocator_t* device_allocator, + iree_hal_queue_affinity_t queue_affinity, iree_hal_memory_access_t access, + iree_io_file_handle_t* handle, iree_allocator_t host_allocator, + iree_hal_file_t** out_file); + +#ifdef __cplusplus +} // extern "C" +#endif // __cplusplus + +#endif // IREE_HAL_UTILS_FILE_REGISTRY_H_ diff --git a/runtime/src/iree/hal/utils/memory_file.c b/runtime/src/iree/hal/utils/memory_file.c index 3974b846ebe9..66a8c54c9bc6 100644 --- a/runtime/src/iree/hal/utils/memory_file.c +++ b/runtime/src/iree/hal/utils/memory_file.c @@ -118,9 +118,10 @@ static void iree_hal_memory_file_try_import_buffer( iree_hal_allocator_t* device_allocator); IREE_API_EXPORT iree_status_t iree_hal_memory_file_wrap( + iree_hal_allocator_t* device_allocator, iree_hal_queue_affinity_t queue_affinity, iree_hal_memory_access_t access, - iree_io_file_handle_t* handle, iree_hal_allocator_t* device_allocator, - iree_allocator_t host_allocator, iree_hal_file_t** out_file) { + iree_io_file_handle_t* handle, iree_allocator_t host_allocator, + iree_hal_file_t** out_file) { IREE_ASSERT_ARGUMENT(out_file); *out_file = NULL; diff --git a/runtime/src/iree/hal/utils/memory_file.h
b/runtime/src/iree/hal/utils/memory_file.h index 433b99b804a9..c3e2a9fbc14f 100644 --- a/runtime/src/iree/hal/utils/memory_file.h +++ b/runtime/src/iree/hal/utils/memory_file.h @@ -24,9 +24,10 @@ extern "C" { // If the memory can be imported into a usable staging buffer |device_allocator| // will be used to do so. IREE_API_EXPORT iree_status_t iree_hal_memory_file_wrap( + iree_hal_allocator_t* device_allocator, iree_hal_queue_affinity_t queue_affinity, iree_hal_memory_access_t access, - iree_io_file_handle_t* handle, iree_hal_allocator_t* device_allocator, - iree_allocator_t host_allocator, iree_hal_file_t** out_file); + iree_io_file_handle_t* handle, iree_allocator_t host_allocator, + iree_hal_file_t** out_file); #ifdef __cplusplus } // extern "C" From 4c00a2283e1b01ff219156a2f1ffffea6fc69a6c Mon Sep 17 00:00:00 2001 From: MaheshRavishankar <1663364+MaheshRavishankar@users.noreply.github.com> Date: Wed, 18 Dec 2024 09:50:50 -0800 Subject: [PATCH 38/64] Enable scatter fusion with index operand. (#19198) This drops a pessimistic check during analysis of the indexing maps of the fused `OpOperand` in the producer and consumer that was preventing fusion of the scatter operation with its index operand producer. Signed-off-by: MaheshRavishankar --- .../DispatchCreation/FormDispatchRegions.cpp | 6 ++-- .../test/form_dispatch_regions.mlir | 32 +++++++++++++++++++ 2 files changed, 35 insertions(+), 3 deletions(-) diff --git a/compiler/src/iree/compiler/DispatchCreation/FormDispatchRegions.cpp b/compiler/src/iree/compiler/DispatchCreation/FormDispatchRegions.cpp index 73d306bd7f6e..8e8c27ef95e1 100644 --- a/compiler/src/iree/compiler/DispatchCreation/FormDispatchRegions.cpp +++ b/compiler/src/iree/compiler/DispatchCreation/FormDispatchRegions.cpp @@ -267,14 +267,14 @@ matchIteratorTypes(const llvm::SmallBitVector &rootOuterParallelLoop, // If the candidate is all parallel, then it should be at least as parallel as // the root. 
- for (int pos : llvm::seq(0, rootOuterParallelLoop.size())) { + for (int pos : llvm::seq(0, std::min(candidateOuterParallelLoop.size(), + rootOuterParallelLoop.size()))) { // If we reach the end of the outer loops of the root, break out of the // loop. if (!rootOuterParallelLoop.test(pos)) break; // If the root loop is parallel, the candidate loop should also be parallel. - if (pos >= candidateOuterParallelLoop.size() || - !candidateOuterParallelLoop.test(pos)) + if (!candidateOuterParallelLoop.test(pos)) return false; } return true; diff --git a/compiler/src/iree/compiler/DispatchCreation/test/form_dispatch_regions.mlir b/compiler/src/iree/compiler/DispatchCreation/test/form_dispatch_regions.mlir index 196fc8795718..2344285e2b1b 100644 --- a/compiler/src/iree/compiler/DispatchCreation/test/form_dispatch_regions.mlir +++ b/compiler/src/iree/compiler/DispatchCreation/test/form_dispatch_regions.mlir @@ -922,3 +922,35 @@ util.func @custom_op_no_producer_fusion(%arg0 : tensor, %arg1 : tensor< // CHECK-SAME: ins(%[[DISPATCH1]], // CHECK: flow.return %[[CUSTOM_OP]] // CHECK: util.return %[[DISPATCH2]] + +// ----- + +util.func @scatter_index_producer_fusion(%arg0 : tensor, + %arg1 : index, %arg2 : tensor, + %arg3 : tensor) -> tensor { + %empty = tensor.empty(%arg1) : tensor + %0 = linalg.generic { + indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, + affine_map<(d0, d1) -> (d0, d1)>], + iterator_types = ["parallel", "parallel"]} + ins(%arg0 : tensor) outs(%empty : tensor) { + ^bb0(%in: i64, %out: i32): + %1 = arith.trunci %in : i64 to i32 + linalg.yield %1 : i32 + } -> tensor + %1 = iree_linalg_ext.scatter + dimension_map = [0] unique_indices(true) + ins(%arg2, %0 : tensor, tensor) + outs(%arg3 : tensor) { + ^bb0(%arg6: f16, %arg7: f16): + iree_linalg_ext.yield %arg6 : f16 + } -> tensor + util.return %1 : tensor +} +// CHECK-LABEL: func public @scatter_index_producer_fusion +// CHECK: %[[DISPATCH:.+]] = flow.dispatch.region +// CHECK: %[[GENERIC:.+]] = linalg.generic 
+// CHECK: %[[SCATTER:.+]] = iree_linalg_ext.scatter +// CHECK-SAME: ins(%{{.+}}, %[[GENERIC]] : +// CHECK: flow.return %[[SCATTER]] +// CHECK: util.return %[[DISPATCH]] From 101f55c686de01fbbbc63cd2f10019031e9f37a0 Mon Sep 17 00:00:00 2001 From: Ben Vanik Date: Wed, 18 Dec 2024 10:01:00 -0800 Subject: [PATCH 39/64] Adding fd-based file handle support to the HAL. (#19514) This allows multi-threaded unbuffered IO via pread/pwrite (with an emulation on Windows). --- runtime/src/iree/hal/utils/BUILD.bazel | 2 + runtime/src/iree/hal/utils/CMakeLists.txt | 2 + runtime/src/iree/hal/utils/fd_file.c | 384 +++++++++++++++++++++ runtime/src/iree/hal/utils/fd_file.h | 34 ++ runtime/src/iree/hal/utils/file_registry.c | 5 + runtime/src/iree/io/file_handle.c | 35 ++ runtime/src/iree/io/file_handle.h | 7 +- 7 files changed, 468 insertions(+), 1 deletion(-) create mode 100644 runtime/src/iree/hal/utils/fd_file.c create mode 100644 runtime/src/iree/hal/utils/fd_file.h diff --git a/runtime/src/iree/hal/utils/BUILD.bazel b/runtime/src/iree/hal/utils/BUILD.bazel index 76d710e7dab0..61cd73d3a0ae 100644 --- a/runtime/src/iree/hal/utils/BUILD.bazel +++ b/runtime/src/iree/hal/utils/BUILD.bazel @@ -108,10 +108,12 @@ iree_runtime_cc_library( iree_runtime_cc_library( name = "files", srcs = [ + "fd_file.c", "file_registry.c", "memory_file.c", ], hdrs = [ + "fd_file.h", "file_registry.h", "memory_file.h", ], diff --git a/runtime/src/iree/hal/utils/CMakeLists.txt b/runtime/src/iree/hal/utils/CMakeLists.txt index b4e389e56eb9..e47c9ab1c4c3 100644 --- a/runtime/src/iree/hal/utils/CMakeLists.txt +++ b/runtime/src/iree/hal/utils/CMakeLists.txt @@ -130,9 +130,11 @@ iree_cc_library( NAME files HDRS + "fd_file.h" "file_registry.h" "memory_file.h" SRCS + "fd_file.c" "file_registry.c" "memory_file.c" DEPS diff --git a/runtime/src/iree/hal/utils/fd_file.c b/runtime/src/iree/hal/utils/fd_file.c new file mode 100644 index 000000000000..3a7fb60f0802 --- /dev/null +++ b/runtime/src/iree/hal/utils/fd_file.c 
@@ -0,0 +1,384 @@ +// Copyright 2024 The IREE Authors +// +// Licensed under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception + +#include "iree/hal/utils/fd_file.h" + +//===----------------------------------------------------------------------===// +// Platform Support +//===----------------------------------------------------------------------===// + +#if IREE_FILE_IO_ENABLE + +#define _GNU_SOURCE + +#include +#include +#include + +#if defined(IREE_PLATFORM_WINDOWS) +#include +#else +#include +#endif // IREE_PLATFORM_WINDOWS + +#if defined(IREE_PLATFORM_WINDOWS) + +// Returns the allowed access and length in bytes of the file descriptor. +// Returns 0 if the file descriptor has no length (a /proc stream, etc). +static iree_status_t iree_hal_platform_fd_stat( + int fd, iree_hal_memory_access_t* out_allowed_access, + uint64_t* out_length) { + IREE_ASSERT_ARGUMENT(out_allowed_access); + IREE_ASSERT_ARGUMENT(out_length); + *out_allowed_access = IREE_HAL_MEMORY_ACCESS_NONE; + *out_length = 0; + + struct _stat64 buffer = {0}; + if (_fstat64(fd, &buffer) == -1) { + return iree_make_status(iree_status_code_from_errno(errno), + "unable to stat file descriptor length"); + } + + *out_allowed_access = + ((buffer.st_mode & _S_IREAD) ? IREE_HAL_MEMORY_ACCESS_READ : 0) | + ((buffer.st_mode & _S_IWRITE) ? 
IREE_HAL_MEMORY_ACCESS_WRITE : 0); + *out_length = (uint64_t)buffer.st_size; + return iree_ok_status(); +} + +static iree_status_t iree_hal_platform_fd_pread( + int fd, void* buffer, iree_host_size_t count, uint64_t offset, + iree_host_size_t* out_bytes_read) { + IREE_ASSERT_ARGUMENT(out_bytes_read); + *out_bytes_read = 0; + + HANDLE handle = (HANDLE)_get_osfhandle(fd); + if (handle == INVALID_HANDLE_VALUE) { + return iree_make_status( + IREE_STATUS_INVALID_ARGUMENT, + "file descriptor is not backed by a valid Win32 HANDLE"); + } + + DWORD bytes_read = 0; + OVERLAPPED overlapped = {0}; + overlapped.Offset = (DWORD)(offset & 0xFFFFFFFFu); + overlapped.OffsetHigh = (DWORD)((offset >> 32) & 0xFFFFFFFFu); + if (!ReadFile(handle, buffer, (DWORD)count, &bytes_read, &overlapped)) { + return iree_make_status(iree_status_code_from_win32_error(GetLastError()), + "failed to read requested buffer length"); + } + + *out_bytes_read = (iree_host_size_t)bytes_read; + return iree_ok_status(); +} + +static iree_status_t iree_hal_platform_fd_pwrite( + int fd, const void* buffer, iree_host_size_t count, uint64_t offset, + iree_host_size_t* out_bytes_written) { + IREE_ASSERT_ARGUMENT(out_bytes_written); + *out_bytes_written = 0; + + HANDLE handle = (HANDLE)_get_osfhandle(fd); + if (handle == INVALID_HANDLE_VALUE) { + return iree_make_status( + IREE_STATUS_INVALID_ARGUMENT, + "file descriptor is not backed by a valid Win32 HANDLE"); + } + + DWORD bytes_written = 0; + OVERLAPPED overlapped = {0}; + overlapped.Offset = (DWORD)(offset & 0xFFFFFFFFu); + overlapped.OffsetHigh = (DWORD)((offset >> 32) & 0xFFFFFFFFu); + if (!WriteFile(handle, buffer, (DWORD)count, &bytes_written, &overlapped)) { + return iree_make_status(iree_status_code_from_win32_error(GetLastError()), + "failed to write requested buffer length"); + } + + *out_bytes_written = (iree_host_size_t)bytes_written; + return iree_ok_status(); +} + +#else + +// Returns the allowed access and length in bytes of the file descriptor. 
+// Returns 0 if the file descriptor has no length (a /proc stream, etc). +static iree_status_t iree_hal_platform_fd_stat( + int fd, iree_hal_memory_access_t* out_allowed_access, + uint64_t* out_length) { + IREE_ASSERT_ARGUMENT(out_allowed_access); + IREE_ASSERT_ARGUMENT(out_length); + *out_allowed_access = IREE_HAL_MEMORY_ACCESS_NONE; + *out_length = 0; + + struct stat buffer = {0}; + if (fstat(fd, &buffer) == -1) { + return iree_make_status(iree_status_code_from_errno(errno), + "unable to stat file descriptor length"); + } + + *out_allowed_access = + ((buffer.st_mode & S_IRUSR) ? IREE_HAL_MEMORY_ACCESS_READ : 0) | + ((buffer.st_mode & S_IWUSR) ? IREE_HAL_MEMORY_ACCESS_WRITE : 0); + *out_length = (uint64_t)buffer.st_size; + return iree_ok_status(); +} + +static iree_status_t iree_hal_platform_fd_pread( + int fd, void* buffer, iree_host_size_t count, uint64_t offset, + iree_host_size_t* out_bytes_read) { + IREE_ASSERT_ARGUMENT(out_bytes_read); + *out_bytes_read = 0; + ssize_t bytes_read = pread(fd, buffer, (size_t)count, (off_t)offset); + if (bytes_read > 0) { + *out_bytes_read = (iree_host_size_t)bytes_read; + return iree_ok_status(); + } else if (bytes_read == 0) { + return iree_make_status(IREE_STATUS_OUT_OF_RANGE, + "end of file hit during read"); + } else { + return iree_make_status(iree_status_code_from_errno(errno), + "failed to read requested buffer length"); + } +} + +static iree_status_t iree_hal_platform_fd_pwrite( + int fd, const void* buffer, iree_host_size_t count, uint64_t offset, + iree_host_size_t* out_bytes_written) { + IREE_ASSERT_ARGUMENT(out_bytes_written); + *out_bytes_written = 0; + ssize_t bytes_written = pwrite(fd, buffer, (size_t)count, (off_t)offset); + if (bytes_written > 0) { + *out_bytes_written = (iree_host_size_t)bytes_written; + return iree_ok_status(); + } else if (bytes_written == 0) { + return iree_make_status(IREE_STATUS_OUT_OF_RANGE, + "end of file hit during write"); + } else { + return 
iree_make_status(iree_status_code_from_errno(errno), + "failed to write requested buffer length"); + } +} + +#endif // IREE_PLATFORM_WINDOWS + +#endif // IREE_FILE_IO_ENABLE + +//===----------------------------------------------------------------------===// +// iree_hal_fd_file_t +//===----------------------------------------------------------------------===// + +#if IREE_FILE_IO_ENABLE + +typedef struct iree_hal_fd_file_t { + iree_hal_resource_t resource; + // Used to allocate this structure. + iree_allocator_t host_allocator; + // Allowed access bits. + iree_hal_memory_access_t access; + // Base file handle, retained. + iree_io_file_handle_t* handle; + // File descriptor, unretained (the handle retains it). + // Note that this descriptor may be shared with multiple threads and all + // operations we perform against it must be stateless. + int fd; + // Total file (stream) length in bytes as queried on creation. + uint64_t length; +} iree_hal_fd_file_t; + +static const iree_hal_file_vtable_t iree_hal_fd_file_vtable; + +static iree_hal_fd_file_t* iree_hal_fd_file_cast( + iree_hal_file_t* IREE_RESTRICT base_value) { + return (iree_hal_fd_file_t*)base_value; +} + +IREE_API_EXPORT iree_status_t iree_hal_fd_file_from_handle( + iree_hal_memory_access_t access, iree_io_file_handle_t* handle, + iree_allocator_t host_allocator, iree_hal_file_t** out_file) { + IREE_ASSERT_ARGUMENT(out_file); + *out_file = NULL; + + // For now we only support posix file descriptors but could support other + // handle types so long as they are compatible with pread/pwrite. + iree_io_file_handle_primitive_t primitive = + iree_io_file_handle_primitive(handle); + if (primitive.type != IREE_IO_FILE_HANDLE_TYPE_FD) { + return iree_make_status(IREE_STATUS_UNIMPLEMENTED, + "support for creating non-fd files not supported"); + } + const int fd = primitive.value.fd; + + IREE_TRACE_ZONE_BEGIN(z0); + + // Query the file length. This also acts as a quick check that the file + // descriptor is accessible. 
+ iree_hal_memory_access_t allowed_access = IREE_HAL_MEMORY_ACCESS_NONE; + uint64_t length = 0; + IREE_RETURN_AND_END_ZONE_IF_ERROR( + z0, iree_hal_platform_fd_stat(fd, &allowed_access, &length)); + + // Verify that the requested access can be satisfied. + if (iree_all_bits_set(access, IREE_HAL_MEMORY_ACCESS_READ) && + !iree_all_bits_set(allowed_access, IREE_HAL_MEMORY_ACCESS_READ)) { + IREE_RETURN_AND_END_ZONE_IF_ERROR( + z0, + iree_make_status( + IREE_STATUS_PERMISSION_DENIED, + "read access requested on a file descriptor that is not readable")); + } else if (iree_all_bits_set(access, IREE_HAL_MEMORY_ACCESS_WRITE) && + !iree_all_bits_set(allowed_access, IREE_HAL_MEMORY_ACCESS_WRITE)) { + IREE_RETURN_AND_END_ZONE_IF_ERROR( + z0, iree_make_status(IREE_STATUS_PERMISSION_DENIED, + "write access requested on a file descriptor that " + "is not writable")); + } + + // Allocate object that retains the underlying file handle and our opened + // descriptor. + iree_hal_fd_file_t* file = NULL; + IREE_RETURN_AND_END_ZONE_IF_ERROR( + z0, iree_allocator_malloc(host_allocator, sizeof(*file), (void**)&file)); + iree_hal_resource_initialize(&iree_hal_fd_file_vtable, &file->resource); + file->host_allocator = host_allocator; + file->access = access; + file->handle = handle; + iree_io_file_handle_retain(file->handle); + file->fd = fd; + file->length = length; + + *out_file = (iree_hal_file_t*)file; + IREE_TRACE_ZONE_END(z0); + return iree_ok_status(); +} + +static void iree_hal_fd_file_destroy(iree_hal_file_t* IREE_RESTRICT base_file) { + iree_hal_fd_file_t* file = iree_hal_fd_file_cast(base_file); + iree_allocator_t host_allocator = file->host_allocator; + IREE_TRACE_ZONE_BEGIN(z0); + + iree_io_file_handle_release(file->handle); + + iree_allocator_free(host_allocator, file); + + IREE_TRACE_ZONE_END(z0); +} + +static iree_hal_memory_access_t iree_hal_fd_file_allowed_access( + iree_hal_file_t* base_file) { + iree_hal_fd_file_t* file = iree_hal_fd_file_cast(base_file); + return 
file->access; +} + +static uint64_t iree_hal_fd_file_length(iree_hal_file_t* base_file) { + iree_hal_fd_file_t* file = iree_hal_fd_file_cast(base_file); + return file->length; +} + +static iree_hal_buffer_t* iree_hal_fd_file_storage_buffer( + iree_hal_file_t* base_file) { + // We could map files if we wanted to provide this interface but today leave + // that up to users (they can pass in HOST_ALLOCATION file handles to import). + return NULL; +} + +static bool iree_hal_fd_file_supports_synchronous_io( + iree_hal_file_t* base_file) { + // Host files always support synchronous IO. + return true; +} + +static iree_status_t iree_hal_fd_file_read(iree_hal_file_t* base_file, + uint64_t file_offset, + iree_hal_buffer_t* buffer, + iree_device_size_t buffer_offset, + iree_device_size_t length) { + if (length == 0) return iree_ok_status(); + iree_hal_fd_file_t* file = iree_hal_fd_file_cast(base_file); + + iree_hal_buffer_mapping_t mapping = {{0}}; + IREE_RETURN_IF_ERROR(iree_hal_buffer_map_range( + buffer, IREE_HAL_MAPPING_MODE_SCOPED, + IREE_HAL_MEMORY_ACCESS_DISCARD_WRITE, buffer_offset, length, &mapping)); + + iree_status_t status = iree_ok_status(); + uint8_t* buffer_ptr = mapping.contents.data; + iree_host_size_t bytes_remaining = mapping.contents.data_length; + while (iree_status_is_ok(status) && bytes_remaining > 0) { + const iree_host_size_t bytes_requested = iree_min(bytes_remaining, INT_MAX); + iree_host_size_t bytes_read = 0; + status = iree_hal_platform_fd_pread(file->fd, buffer_ptr, bytes_requested, + file_offset, &bytes_read); + file_offset += bytes_read; + buffer_ptr += bytes_read; + bytes_remaining -= bytes_read; + } + + if (iree_status_is_ok(status) && + !iree_all_bits_set(iree_hal_buffer_memory_type(buffer), + IREE_HAL_MEMORY_TYPE_HOST_COHERENT)) { + status = + iree_hal_buffer_mapping_flush_range(&mapping, buffer_offset, length); + } + + return iree_status_join(status, iree_hal_buffer_unmap_range(&mapping)); +} + +static iree_status_t 
iree_hal_fd_file_write(iree_hal_file_t* base_file, + uint64_t file_offset, + iree_hal_buffer_t* buffer, + iree_device_size_t buffer_offset, + iree_device_size_t length) { + if (length == 0) return iree_ok_status(); + iree_hal_fd_file_t* file = iree_hal_fd_file_cast(base_file); + + iree_hal_buffer_mapping_t mapping = {{0}}; + IREE_RETURN_IF_ERROR(iree_hal_buffer_map_range( + buffer, IREE_HAL_MAPPING_MODE_SCOPED, IREE_HAL_MEMORY_ACCESS_READ, + buffer_offset, length, &mapping)); + + iree_status_t status = iree_ok_status(); + if (!iree_all_bits_set(iree_hal_buffer_memory_type(buffer), + IREE_HAL_MEMORY_TYPE_HOST_COHERENT)) { + status = iree_hal_buffer_mapping_invalidate_range(&mapping, buffer_offset, + length); + } + + const uint8_t* buffer_ptr = mapping.contents.data; + iree_host_size_t bytes_remaining = mapping.contents.data_length; + while (iree_status_is_ok(status) && bytes_remaining > 0) { + const iree_host_size_t bytes_requested = iree_min(bytes_remaining, INT_MAX); + iree_host_size_t bytes_written = 0; + status = iree_hal_platform_fd_pwrite(file->fd, buffer_ptr, bytes_requested, + file_offset, &bytes_written); + file_offset += bytes_written; + buffer_ptr += bytes_written; + bytes_remaining -= bytes_written; + } + + return iree_status_join(status, iree_hal_buffer_unmap_range(&mapping)); +} + +static const iree_hal_file_vtable_t iree_hal_fd_file_vtable = { + .destroy = iree_hal_fd_file_destroy, + .allowed_access = iree_hal_fd_file_allowed_access, + .length = iree_hal_fd_file_length, + .storage_buffer = iree_hal_fd_file_storage_buffer, + .supports_synchronous_io = iree_hal_fd_file_supports_synchronous_io, + .read = iree_hal_fd_file_read, + .write = iree_hal_fd_file_write, +}; + +#else + +IREE_API_EXPORT iree_status_t iree_hal_fd_file_from_handle( + iree_hal_memory_access_t access, iree_io_file_handle_t* handle, + iree_allocator_t host_allocator, iree_hal_file_t** out_file) { + return iree_make_status(IREE_STATUS_UNAVAILABLE, + "file support has been compiled out of 
this binary; " + "set IREE_FILE_IO_ENABLE=1 to include it"); +} + +#endif // IREE_FILE_IO_ENABLE diff --git a/runtime/src/iree/hal/utils/fd_file.h b/runtime/src/iree/hal/utils/fd_file.h new file mode 100644 index 000000000000..f6f8f371d4c8 --- /dev/null +++ b/runtime/src/iree/hal/utils/fd_file.h @@ -0,0 +1,34 @@ +// Copyright 2024 The IREE Authors +// +// Licensed under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception + +#ifndef IREE_HAL_UTILS_FD_FILE_H_ +#define IREE_HAL_UTILS_FD_FILE_H_ + +#include "iree/base/api.h" +#include "iree/hal/api.h" +#include "iree/io/file_handle.h" + +#ifdef __cplusplus +extern "C" { +#endif // __cplusplus + +//===----------------------------------------------------------------------===// +// iree_hal_fd_file_t +//===----------------------------------------------------------------------===// + +// Creates a file backed by |handle| on disk. +// Only supports file handles of IREE_IO_FILE_HANDLE_TYPE_FD. +// File handles are stateless and each host file opened from one may see +// different versions of the file depending on the platform and file type. 
+IREE_API_EXPORT iree_status_t iree_hal_fd_file_from_handle( + iree_hal_memory_access_t access, iree_io_file_handle_t* handle, + iree_allocator_t host_allocator, iree_hal_file_t** out_file); + +#ifdef __cplusplus +} // extern "C" +#endif // __cplusplus + +#endif // IREE_HAL_UTILS_FD_FILE_H_ diff --git a/runtime/src/iree/hal/utils/file_registry.c b/runtime/src/iree/hal/utils/file_registry.c index 3926c394b09f..8e80e9d82c27 100644 --- a/runtime/src/iree/hal/utils/file_registry.c +++ b/runtime/src/iree/hal/utils/file_registry.c @@ -6,6 +6,7 @@ #include "iree/hal/utils/file_registry.h" +#include "iree/hal/utils/fd_file.h" #include "iree/hal/utils/memory_file.h" IREE_API_EXPORT iree_status_t iree_hal_file_from_handle( @@ -25,6 +26,10 @@ IREE_API_EXPORT iree_status_t iree_hal_file_from_handle( iree_hal_memory_file_wrap(device_allocator, queue_affinity, access, handle, host_allocator, out_file); break; + case IREE_IO_FILE_HANDLE_TYPE_FD: + status = iree_hal_fd_file_from_handle(access, handle, host_allocator, + out_file); + break; default: status = iree_make_status( IREE_STATUS_UNIMPLEMENTED, diff --git a/runtime/src/iree/io/file_handle.c b/runtime/src/iree/io/file_handle.c index 2eb7e19940e7..dea7fcaee8b1 100644 --- a/runtime/src/iree/io/file_handle.c +++ b/runtime/src/iree/io/file_handle.c @@ -9,6 +9,18 @@ #include "iree/base/internal/atomics.h" #include "iree/io/memory_stream.h" +#if IREE_FILE_IO_ENABLE +#if defined(IREE_PLATFORM_WINDOWS) + +#include // _commit + +#else + +#include // fsync + +#endif // IREE_PLATFORM_WINDOWS +#endif // IREE_FILE_IO_ENABLE + //===----------------------------------------------------------------------===// // iree_io_file_handle_t //===----------------------------------------------------------------------===// @@ -101,6 +113,25 @@ iree_io_file_handle_primitive(const iree_io_file_handle_t* handle) { return handle->primitive; } +static iree_status_t iree_io_platform_fd_flush(int fd) { +#if IREE_FILE_IO_ENABLE + +#if 
defined(IREE_PLATFORM_WINDOWS) + int ret = _commit(fd); +#else + int ret = fsync(fd); +#endif // IREE_PLATFORM_WINDOWS + return ret != -1 ? iree_ok_status() + : iree_make_status(iree_status_code_from_errno(errno), + "unable to sync file writes"); + +#else + return iree_make_status(IREE_STATUS_UNAVAILABLE, + "file support has been compiled out of this binary; " + "set IREE_FILE_IO_ENABLE=1 to include it"); +#endif // IREE_FILE_IO_ENABLE +} + IREE_API_EXPORT iree_status_t iree_io_file_handle_flush(iree_io_file_handle_t* handle) { IREE_ASSERT_ARGUMENT(handle); @@ -111,6 +142,10 @@ iree_io_file_handle_flush(iree_io_file_handle_t* handle) { // No-op (though we could flush when known mapped). break; } + case IREE_IO_FILE_HANDLE_TYPE_FD: { + status = iree_io_platform_fd_flush(handle->primitive.value.fd); + break; + } default: { status = iree_make_status(IREE_STATUS_UNIMPLEMENTED, "flush not supported on handle type %d", diff --git a/runtime/src/iree/io/file_handle.h b/runtime/src/iree/io/file_handle.h index 6831cbe6c3e1..6d46ad559d4c 100644 --- a/runtime/src/iree/io/file_handle.h +++ b/runtime/src/iree/io/file_handle.h @@ -46,7 +46,10 @@ typedef enum iree_io_file_handle_type_e { // as long as the file handle referencing it. IREE_IO_FILE_HANDLE_TYPE_HOST_ALLOCATION = 0u, - // TODO(benvanik): file descriptor, FILE*, HANDLE, etc. + // Platform file descriptor (fd). + IREE_IO_FILE_HANDLE_TYPE_FD, + + // TODO(benvanik): FILE*, HANDLE, etc. } iree_io_file_handle_type_t; // A platform handle to a file primitive. @@ -55,6 +58,8 @@ typedef enum iree_io_file_handle_type_e { typedef union iree_io_file_handle_primitive_value_t { // IREE_IO_FILE_HANDLE_TYPE_HOST_ALLOCATION iree_byte_span_t host_allocation; + // IREE_IO_FILE_HANDLE_TYPE_FD + int fd; } iree_io_file_handle_primitive_value_t; // A (type, value) pair describing a system file primitive handle. 
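Reviewer note on PATCH 39/64: the read and write paths in fd_file.c issue positional I/O in bounded chunks and advance the file offset, buffer pointer, and remaining count by whatever each call actually transferred. Because pread/pwrite never move the descriptor's shared file position, the same fd can be used from several threads at once. The following is a minimal standalone sketch of that loop for POSIX only, not the IREE implementation. The names `pread_exact`, `pwrite_exact`, `demo_roundtrip`, and `kMaxChunk` are hypothetical, plain errno values stand in for `iree_status_t`, and the EINTR retry is an addition that the patch does not make.

```c
// Sketch only: chunked positional I/O on a POSIX file descriptor, loosely
// following the loops in iree_hal_fd_file_read/iree_hal_fd_file_write.
// pread_exact, pwrite_exact, demo_roundtrip, and kMaxChunk are hypothetical
// names used for illustration; none of them are part of IREE.
#define _XOPEN_SOURCE 700  // Request pread/pwrite/fileno declarations.

#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

// Upper bound on a single request. Deliberately small so the demo loops;
// the patch clamps each request to INT_MAX instead.
static const size_t kMaxChunk = 4096;

// Reads exactly |count| bytes starting at |offset| without touching the
// descriptor's file position. Returns 0 on success or an errno value on
// failure; reaching end of file early is reported as ERANGE.
static int pread_exact(int fd, void* buffer, size_t count, off_t offset) {
  assert(buffer != NULL);
  uint8_t* ptr = (uint8_t*)buffer;
  while (count > 0) {
    const size_t request = count < kMaxChunk ? count : kMaxChunk;
    const ssize_t n = pread(fd, ptr, request, offset);
    if (n < 0) {
      if (errno == EINTR) continue;  // Assumption: retry; not in the patch.
      return errno;
    }
    if (n == 0) return ERANGE;  // End of file before |count| bytes arrived.
    ptr += n;
    offset += n;
    count -= (size_t)n;
  }
  return 0;
}

// Writes exactly |count| bytes starting at |offset|. Returns 0 on success or
// an errno value on failure; a zero-byte write is reported as EIO.
static int pwrite_exact(int fd, const void* buffer, size_t count,
                        off_t offset) {
  assert(buffer != NULL);
  const uint8_t* ptr = (const uint8_t*)buffer;
  while (count > 0) {
    const size_t request = count < kMaxChunk ? count : kMaxChunk;
    const ssize_t n = pwrite(fd, ptr, request, offset);
    if (n < 0) {
      if (errno == EINTR) continue;  // Assumption: retry; not in the patch.
      return errno;
    }
    if (n == 0) return EIO;
    ptr += n;
    offset += n;
    count -= (size_t)n;
  }
  return 0;
}

// Writes a pattern to a temporary file, reads a sub-range back at a nonzero
// offset, and checks that a read running past the end fails with ERANGE.
// Returns 0 if every step behaves as expected.
int demo_roundtrip(void) {
  FILE* file = tmpfile();
  if (!file) return errno;
  const int fd = fileno(file);

  uint8_t written[10000];
  uint8_t read_back[10000];
  for (size_t i = 0; i < sizeof(written); ++i) written[i] = (uint8_t)(i * 31u);
  memset(read_back, 0, sizeof(read_back));

  int rc = pwrite_exact(fd, written, sizeof(written), 0);
  if (rc == 0) rc = pread_exact(fd, read_back, 6000, 1234);
  if (rc == 0 && memcmp(read_back, written + 1234, 6000) != 0) rc = EIO;
  // 16 bytes requested with only 8 left in the file: must report ERANGE.
  if (rc == 0 &&
      pread_exact(fd, read_back, 16, (off_t)sizeof(written) - 8) != ERANGE) {
    rc = EIO;
  }

  fclose(file);
  return rc;
}
```

Calling `demo_roundtrip()` from a test or a `main` is enough to exercise both loops. The Windows path in the patch takes a different route (ReadFile/WriteFile with an OVERLAPPED offset) and is not covered by this sketch.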
From 700572cc1a31544568286262826a55eb3b9b9ccf Mon Sep 17 00:00:00 2001 From: Ben Vanik Date: Wed, 18 Dec 2024 10:27:58 -0800 Subject: [PATCH 40/64] Adding experimental iree_io_file_map_view API. (#19515) This allows for any iree_io_file_handle_t to be mapped into host memory on platforms where doing so is supported. Eventually this will be replaced with a proper mapping object and view API that will allow us to unify the ELF loading, file_io utils, and parameter handling APIs. --- .../Transforms/ImportParameters.cpp | 4 +- runtime/bindings/python/io.cc | 9 +- runtime/src/iree/base/internal/file_io.c | 4 +- runtime/src/iree/io/file_handle.c | 452 +++++++++++++++++- runtime/src/iree/io/file_handle.h | 103 +++- .../src/iree/io/formats/gguf/gguf_parser.c | 28 +- .../src/iree/io/formats/gguf/gguf_parser.h | 6 +- .../iree/io/formats/gguf/gguf_parser_test.cc | 12 +- .../src/iree/io/formats/irpa/irpa_parser.c | 27 +- .../src/iree/io/formats/irpa/irpa_parser.h | 6 +- .../iree/io/formats/irpa/irpa_parser_test.cc | 12 +- runtime/src/iree/io/formats/parser_registry.c | 9 +- runtime/src/iree/io/formats/parser_registry.h | 5 +- .../formats/safetensors/safetensors_parser.c | 32 +- .../formats/safetensors/safetensors_parser.h | 6 +- .../safetensors/safetensors_parser_test.cc | 9 +- runtime/src/iree/tooling/parameter_util.c | 3 +- 17 files changed, 654 insertions(+), 73 deletions(-) diff --git a/compiler/src/iree/compiler/Modules/IO/Parameters/Transforms/ImportParameters.cpp b/compiler/src/iree/compiler/Modules/IO/Parameters/Transforms/ImportParameters.cpp index 288495eac3d6..91b6e690819b 100644 --- a/compiler/src/iree/compiler/Modules/IO/Parameters/Transforms/ImportParameters.cpp +++ b/compiler/src/iree/compiler/Modules/IO/Parameters/Transforms/ImportParameters.cpp @@ -80,10 +80,12 @@ loadParameterIndex(ModuleOp moduleOp, StringRef path, return failure(); // Parse the archive as a particular format. 
+ iree_allocator_t hostAllocator = iree_allocator_system(); return handleRuntimeError( moduleOp, iree_io_parse_file_index(iree_make_string_view(path.data(), path.size()), - fileHandle->get(), parameterIndex), + fileHandle->get(), parameterIndex, + hostAllocator), "parsing parameter archive"); } diff --git a/runtime/bindings/python/io.cc b/runtime/bindings/python/io.cc index a558cb90a5b0..07218f585a4e 100644 --- a/runtime/bindings/python/io.cc +++ b/runtime/bindings/python/io.cc @@ -98,10 +98,11 @@ void ParameterIndexAddFromFileHandle(ParameterIndex &self, std::string &key, void ParameterIndexParseFileHandle(ParameterIndex &self, FileHandle &file_handle, std::string &format) { - CheckApiStatus(iree_io_parse_file_index( - iree_make_string_view(format.data(), format.size()), - file_handle.raw_ptr(), self.raw_ptr()), - "Could not parse parameter file index"); + CheckApiStatus( + iree_io_parse_file_index( + iree_make_string_view(format.data(), format.size()), + file_handle.raw_ptr(), self.raw_ptr(), iree_allocator_system()), + "Could not parse parameter file index"); } void ParameterIndexLoadFile(ParameterIndex &self, std::string &file_path, diff --git a/runtime/src/iree/base/internal/file_io.c b/runtime/src/iree/base/internal/file_io.c index 8ba489e63477..a0753f09a1fe 100644 --- a/runtime/src/iree/base/internal/file_io.c +++ b/runtime/src/iree/base/internal/file_io.c @@ -466,7 +466,9 @@ iree_status_t iree_file_create_mapped(const char* path, uint64_t file_size, IREE_TRACE_ZONE_BEGIN(z0); iree_file_contents_t* contents = NULL; - iree_allocator_malloc(allocator, sizeof(*contents), (void**)&contents); + IREE_RETURN_AND_END_ZONE_IF_ERROR( + z0, + iree_allocator_malloc(allocator, sizeof(*contents), (void**)&contents)); contents->allocator = allocator; iree_status_t status = iree_file_create_mapped_platform( diff --git a/runtime/src/iree/io/file_handle.c b/runtime/src/iree/io/file_handle.c index dea7fcaee8b1..40d5f8562b2d 100644 --- a/runtime/src/iree/io/file_handle.c +++ 
b/runtime/src/iree/io/file_handle.c @@ -12,11 +12,14 @@ #if IREE_FILE_IO_ENABLE #if defined(IREE_PLATFORM_WINDOWS) -#include // _commit +#include // _commit +#include // WerRegisterExcludedMemoryBlock #else -#include // fsync +#include // mmap +#include // fstat +#include // fsync #endif // IREE_PLATFORM_WINDOWS #endif // IREE_FILE_IO_ENABLE @@ -157,6 +160,451 @@ iree_io_file_handle_flush(iree_io_file_handle_t* handle) { return status; } +//===----------------------------------------------------------------------===// +// iree_io_file_mapping_t support +//===----------------------------------------------------------------------===// + +static iree_status_t iree_io_calculate_file_view_range( + uint64_t file_size, uint64_t offset, iree_host_size_t length, + iree_host_size_t* out_adjusted_length) { + *out_adjusted_length = 0; + + // Check if the start of the range runs off the end of the buffer. + if (IREE_UNLIKELY(offset > file_size)) { + return iree_make_status(IREE_STATUS_OUT_OF_RANGE, + "attempted to access an address off the end of the " + "file range (offset=%" PRIu64 ", length=%" PRIhsz + ", file size=%" PRIu64 ")", + offset, length, file_size); + } + + // Calculate the real length adjusted for our region within the allocation. + const iree_host_size_t adjusted_length = + length == IREE_HOST_SIZE_MAX ? file_size - offset : length; + if (adjusted_length == 0) { + // Fine (but silly) to have a zero length. + return iree_ok_status(); + } + + // Check if the end runs over the allocation. 
+ const uint64_t end = offset + adjusted_length - 1; + if (IREE_UNLIKELY(end >= file_size)) { + return iree_make_status(IREE_STATUS_OUT_OF_RANGE, + "attempted to access an address outside of the " + "file range (offset=%" PRIu64 + ", adjusted_length=%" PRIhsz ", end=%" PRIu64 + ", file size=%" PRIu64 ")", + offset, adjusted_length, end, file_size); + } + + *out_adjusted_length = adjusted_length; + return iree_ok_status(); +} + +static iree_status_t iree_io_file_mapping_from_host_allocation( + iree_byte_span_t buffer, uint64_t offset, iree_host_size_t length, + iree_byte_span_t* out_range) { + *out_range = iree_byte_span_empty(); + + iree_host_size_t adjusted_length = 0; + IREE_RETURN_IF_ERROR(iree_io_calculate_file_view_range( + (uint64_t)buffer.data_length, offset, length, &adjusted_length)); + + *out_range = iree_make_byte_span(buffer.data + offset, adjusted_length); + return iree_ok_status(); +} + +#if defined(IREE_PLATFORM_ANDROID) || defined(IREE_PLATFORM_IOS) || \ + defined(IREE_PLATFORM_LINUX) || defined(IREE_PLATFORM_MACOS) + +static iree_status_t iree_io_file_handle_to_fd( + iree_io_file_handle_primitive_t primitive, int* out_fd) { + *out_fd = -1; + switch (primitive.type) { + case IREE_IO_FILE_HANDLE_TYPE_FD: + *out_fd = primitive.value.fd; + return iree_ok_status(); + default: + return iree_make_status( + IREE_STATUS_UNIMPLEMENTED, + "no file descriptor available for file handles of type %d", + (int)primitive.type); + } +} + +static iree_status_t iree_io_platform_map_file_view( + iree_io_file_handle_primitive_t primitive, iree_io_file_access_t access, + uint64_t offset, iree_host_size_t length, + iree_io_file_mapping_flags_t flags, void** out_impl, + iree_byte_span_t* out_contents) { + *out_impl = NULL; + *out_contents = iree_byte_span_empty(); + + // Attempt to get a file descriptor from the provided IREE file handle. 
+ int fd = -1; + IREE_RETURN_IF_ERROR(iree_io_file_handle_to_fd(primitive, &fd), + "mapping file handle to file descriptor"); + + // Query file size. We don't support extending/truncating files today and make + // the user do that - we just allow the length to be IREE_HOST_SIZE_MAX to + // indicate the remaining file should be mapped. + struct stat file_stat = {0}; + if (fstat(fd, &file_stat) == -1) { + return iree_make_status(iree_status_code_from_errno(errno), + "unable to query file size"); + } + const uint64_t file_size = file_stat.st_size; + + // Validate and adjust view size if needed. + iree_host_size_t adjusted_length = 0; + IREE_RETURN_IF_ERROR(iree_io_calculate_file_view_range( + file_size, offset, length, &adjusted_length)); + + int prot = 0; + if (iree_all_bits_set(access, IREE_IO_FILE_ACCESS_READ)) { + prot |= PROT_READ; + } + if (iree_all_bits_set(access, IREE_IO_FILE_ACCESS_WRITE)) { + prot |= PROT_WRITE; + } + + int map_flags = 0; + if (iree_all_bits_set(flags, IREE_IO_FILE_MAPPING_FLAG_PRIVATE)) { + map_flags |= MAP_PRIVATE; + } else { + map_flags |= MAP_SHARED; + } +#if defined(MAP_HUGETLB) + if (iree_all_bits_set(flags, IREE_IO_FILE_MAPPING_FLAG_LARGE_PAGES)) { + map_flags |= MAP_HUGETLB; + } +#endif // MAP_HUGETLB + + // Map the memory. + void* ptr = mmap(NULL, adjusted_length, prot, map_flags, fd, offset); + if (ptr == MAP_FAILED) { + return iree_make_status(iree_status_code_from_errno(errno), + "failed to map file handle range %" PRIu64 + "-%" PRIu64 " (%" PRIhsz + " bytes) from file of %" PRIu64 " total bytes", + offset, offset + adjusted_length, adjusted_length, file_size); + } + + // Pass hints to the memory manager - informational only.
+ int advice = 0; + if (iree_all_bits_set(flags, IREE_IO_FILE_MAPPING_FLAG_SEQUENTIAL_ACCESS)) { + advice |= MADV_SEQUENTIAL; + } +#if defined(MADV_DONTDUMP) + if (iree_all_bits_set(flags, IREE_IO_FILE_MAPPING_FLAG_EXCLUDE_FROM_DUMPS)) { + advice |= MADV_DONTDUMP; + } +#endif // MADV_DONTDUMP + if (advice) { + madvise(ptr, adjusted_length, advice); + } + + *out_impl = ptr; + *out_contents = iree_make_byte_span(ptr, adjusted_length); + return iree_ok_status(); +} + +static void iree_io_platform_unmap_file_view(iree_io_file_mapping_flags_t flags, + void* impl, + iree_byte_span_t contents) { + if (impl) { + munmap(impl, (size_t)contents.data_length); + } +} + +#elif defined(IREE_PLATFORM_WINDOWS) + +static iree_status_t iree_io_file_handle_to_win32_handle( + iree_io_file_handle_primitive_t primitive, HANDLE* out_handle) { + *out_handle = INVALID_HANDLE_VALUE; + switch (primitive.type) { + case IREE_IO_FILE_HANDLE_TYPE_FD: + *out_handle = (HANDLE)_get_osfhandle(primitive.value.fd); + if (*out_handle == INVALID_HANDLE_VALUE) { + return iree_make_status( + IREE_STATUS_INVALID_ARGUMENT, + "file descriptor is not backed by a valid Win32 HANDLE"); + } + return iree_ok_status(); + default: + return iree_make_status( + IREE_STATUS_UNIMPLEMENTED, + "no Win32 HANDLE available for file handles of type %d", + (int)primitive.type); + } +} + +static iree_status_t iree_io_platform_map_file_view( + iree_io_file_handle_primitive_t primitive, iree_io_file_access_t access, + uint64_t offset, iree_host_size_t length, + iree_io_file_mapping_flags_t flags, void** out_impl, + iree_byte_span_t* out_contents) { + *out_impl = NULL; + *out_contents = iree_byte_span_empty(); + + // Attempt to get a Win32 HANDLE from the provided IREE file handle. + HANDLE handle = INVALID_HANDLE_VALUE; + IREE_RETURN_IF_ERROR(iree_io_file_handle_to_win32_handle(primitive, &handle), + "mapping file handle to win32 handle"); + + // Query file size. 
We don't support extending/truncating files today and make + // the user do that - we just allow the length to be IREE_HOST_SIZE_MAX to + // indicate the remaining file should be mapped. + FILE_STANDARD_INFO file_info = {0}; + if (!GetFileInformationByHandleEx(handle, FileStandardInfo, &file_info, + (DWORD)sizeof(file_info))) { + return iree_make_status(iree_status_code_from_win32_error(GetLastError()), + "failed to query file handle information"); + } + const uint64_t file_size = file_info.EndOfFile.QuadPart; + + // Validate and adjust view size if needed. + iree_host_size_t adjusted_length = 0; + IREE_RETURN_IF_ERROR(iree_io_calculate_file_view_range( + file_size, offset, length, &adjusted_length)); + + // Create a file mapping object which will retain the file handle for the + // lifetime of the mapping. + DWORD protect = 0; + if (iree_all_bits_set(access, IREE_IO_FILE_ACCESS_WRITE)) { + protect |= PAGE_READWRITE; + } else if (iree_all_bits_set(access, IREE_IO_FILE_ACCESS_READ)) { + protect |= PAGE_READONLY; + } + if (iree_all_bits_set(flags, IREE_IO_FILE_MAPPING_FLAG_LARGE_PAGES)) { + protect |= SEC_LARGE_PAGES; + } + HANDLE mapping = + CreateFileMappingA(handle, NULL, protect, /*dwMaximumSizeHigh=*/0, + /*dwMaximumSizeLow=*/0, /*lpName=*/NULL); + if (!mapping) { + return iree_make_status(iree_status_code_from_win32_error(GetLastError()), + "failed to create file mapping for file handle"); + } + + // Map the requested range into the virtual address space of the process.
+ DWORD desired_access = 0; + if (iree_all_bits_set(access, IREE_IO_FILE_ACCESS_WRITE)) { + desired_access |= FILE_MAP_WRITE; + } else if (iree_all_bits_set(access, IREE_IO_FILE_ACCESS_READ)) { + desired_access |= FILE_MAP_READ; + } + LARGE_INTEGER offset_li = {0}; + offset_li.QuadPart = offset; + void* ptr = MapViewOfFileEx(mapping, desired_access, offset_li.HighPart, + offset_li.LowPart, (SIZE_T)adjusted_length, + /*lpBaseAddress=*/NULL); + if (!ptr) { + CloseHandle(mapping); + return iree_make_status( + iree_status_code_from_win32_error(GetLastError()), + "failed to map file handle range %" PRIu64 "-%" PRIu64 " (%" PRIhsz + " bytes) from file of %" PRIu64 " total bytes", + offset, offset + adjusted_length, adjusted_length, file_size); + } + +#if defined(WER_MAX_REGISTERED_ENTRIES) && \ + WINAPI_FAMILY_PARTITION(WINAPI_PARTITION_APP | WINAPI_PARTITION_SYSTEM) + // If the user specified that we should exclude the contents from dumps then + // we need to tell Windows Error Reporting. Unfortunately the API is broken + // and only accepts a DWORD (it was added in Windows 10 **and uses a DWORD for + // size** :facepalm:). This is informational so we just try and maybe fail. + // Note that there's also a very small limit on the number of exclusions + // (WER_MAX_REGISTERED_ENTRIES = 512) so we can't just loop and try to exclude + // 4GB blocks in all cases. We try anyway, though. Maybe this isn't even + // useful - the docs are iffy. Oh well.
+ if (iree_all_bits_set(flags, IREE_IO_FILE_MAPPING_FLAG_EXCLUDE_FROM_DUMPS)) { + iree_host_size_t bytes_excluded = 0; + iree_host_size_t bytes_remaining = adjusted_length; + while (bytes_remaining > 0) { + const DWORD bytes_to_exclude = iree_min(bytes_remaining, UINT32_MAX); + WerRegisterExcludedMemoryBlock((uint8_t*)ptr + bytes_excluded, + bytes_to_exclude); + bytes_excluded += bytes_to_exclude; + bytes_remaining -= bytes_to_exclude; + } + } +#endif // WINAPI_FAMILY_PARTITION(WINAPI_PARTITION_APP | + // WINAPI_PARTITION_SYSTEM) + + *out_impl = mapping; // transferred to caller + *out_contents = iree_make_byte_span(ptr, adjusted_length); + return iree_ok_status(); +} + +static void iree_io_platform_unmap_file_view(iree_io_file_mapping_flags_t flags, + void* impl, + iree_byte_span_t contents) { + if (contents.data) { + UnmapViewOfFile(contents.data); + } + +#if defined(WER_MAX_REGISTERED_ENTRIES) && \ + WINAPI_FAMILY_PARTITION(WINAPI_PARTITION_APP | WINAPI_PARTITION_SYSTEM) + if (contents.data && + iree_all_bits_set(flags, IREE_IO_FILE_MAPPING_FLAG_EXCLUDE_FROM_DUMPS)) { + WerUnregisterExcludedMemoryBlock(contents.data); + iree_host_size_t bytes_unexcluded = 0; + iree_host_size_t bytes_remaining = contents.data_length; + while (bytes_remaining > 0) { + const DWORD bytes_to_unexclude = iree_min(bytes_remaining, UINT32_MAX); + WerUnregisterExcludedMemoryBlock(contents.data + bytes_unexcluded); + bytes_unexcluded += bytes_to_unexclude; + bytes_remaining -= bytes_to_unexclude; + } + } +#endif // WINAPI_FAMILY_PARTITION(WINAPI_PARTITION_APP | + // WINAPI_PARTITION_SYSTEM) + + if (impl) { + CloseHandle((HANDLE)impl); + } +} + +#else + +static iree_status_t iree_io_platform_map_file_view( + iree_io_file_handle_primitive_t primitive, iree_io_file_access_t access, + uint64_t offset, iree_host_size_t length, + iree_io_file_mapping_flags_t flags, void** out_impl, + iree_byte_span_t* out_contents) { + *out_impl = NULL; + *out_contents = iree_byte_span_empty(); + return 
iree_make_status(IREE_STATUS_UNIMPLEMENTED, + "no support for mapping file views on this platform"); +} + +static void iree_io_platform_unmap_file_view(iree_io_file_mapping_flags_t flags, + void* impl, + iree_byte_span_t contents) {} + +#endif // IREE_PLATFORM_* + +//===----------------------------------------------------------------------===// +// iree_io_file_mapping_t +//===----------------------------------------------------------------------===// + +struct iree_io_file_mapping_t { + iree_atomic_ref_count_t ref_count; + iree_allocator_t host_allocator; + // File handle that owns the underlying file. Retained. + iree_io_file_handle_t* handle; + // Flags used when creating the mapping. + iree_io_file_mapping_flags_t flags; + // Platform-defined implementation handle. + // - mmap: base pointer returned from mmap + // - Win32: HANDLE returned by CreateFileMappingA + void* impl; + // Mapped contents in host memory. Access matches that requested on mapping. + iree_byte_span_t contents; +}; + +IREE_API_EXPORT iree_status_t iree_io_file_map_view( + iree_io_file_handle_t* handle, iree_io_file_access_t access, + uint64_t offset, iree_host_size_t length, + iree_io_file_mapping_flags_t flags, iree_allocator_t host_allocator, + iree_io_file_mapping_t** out_mapping) { + IREE_ASSERT_ARGUMENT(handle); + IREE_ASSERT_ARGUMENT(out_mapping); + *out_mapping = NULL; + IREE_TRACE_ZONE_BEGIN(z0); + IREE_TRACE_ZONE_APPEND_VALUE_I64(z0, offset); + IREE_TRACE_ZONE_APPEND_VALUE_I64(z0, length); + IREE_TRACE_ZONE_APPEND_VALUE_I64(z0, flags); + + iree_io_file_mapping_t* mapping = NULL; + IREE_RETURN_AND_END_ZONE_IF_ERROR( + z0, iree_allocator_malloc(host_allocator, sizeof(*mapping), + (void**)&mapping)); + iree_atomic_ref_count_init(&mapping->ref_count); + mapping->host_allocator = host_allocator; + mapping->handle = handle; + iree_io_file_handle_retain(mapping->handle); + mapping->flags = flags; + mapping->contents = iree_byte_span_empty(); + + iree_status_t status = iree_ok_status(); + + 
// Special case for host allocations: we can directly use them (with + // translation). Otherwise we let the platform-specific logic take care of + // things (if it exists). + iree_io_file_handle_primitive_t primitive = + iree_io_file_handle_primitive(handle); + if (primitive.type == IREE_IO_FILE_HANDLE_TYPE_HOST_ALLOCATION) { + iree_byte_span_t file_buffer = primitive.value.host_allocation; + status = iree_io_file_mapping_from_host_allocation( + file_buffer, offset, length, &mapping->contents); + } else { + // Use platform APIs to map the file. + status = + iree_io_platform_map_file_view(primitive, access, offset, length, flags, + &mapping->impl, &mapping->contents); + } + + if (iree_status_is_ok(status)) { + *out_mapping = mapping; + } else { + iree_io_file_mapping_release(mapping); + } + IREE_TRACE_ZONE_END(z0); + return status; +} + +static void iree_io_file_mapping_destroy(iree_io_file_mapping_t* mapping) { + IREE_ASSERT_ARGUMENT(mapping); + IREE_TRACE_ZONE_BEGIN(z0); + iree_allocator_t host_allocator = mapping->host_allocator; + + if (mapping->impl) { + iree_io_platform_unmap_file_view(mapping->flags, mapping->impl, + mapping->contents); + } + + iree_io_file_handle_release(mapping->handle); + + iree_allocator_free(host_allocator, mapping); + + IREE_TRACE_ZONE_END(z0); +} + +IREE_API_EXPORT void iree_io_file_mapping_retain( + iree_io_file_mapping_t* mapping) { + if (IREE_LIKELY(mapping)) { + iree_atomic_ref_count_inc(&mapping->ref_count); + } +} + +IREE_API_EXPORT void iree_io_file_mapping_release( + iree_io_file_mapping_t* mapping) { + if (IREE_LIKELY(mapping) && + iree_atomic_ref_count_dec(&mapping->ref_count) == 1) { + iree_io_file_mapping_destroy(mapping); + } +} + +IREE_API_EXPORT iree_host_size_t +iree_io_file_mapping_length(const iree_io_file_mapping_t* mapping) { + IREE_ASSERT_ARGUMENT(mapping); + return mapping->contents.data_length; +} + +IREE_API_EXPORT iree_const_byte_span_t +iree_io_file_mapping_contents_ro(const iree_io_file_mapping_t*
mapping) { + return iree_make_const_byte_span(mapping->contents.data, + mapping->contents.data_length); +} + +IREE_API_EXPORT iree_byte_span_t +iree_io_file_mapping_contents_rw(iree_io_file_mapping_t* mapping) { + IREE_ASSERT_ARGUMENT(mapping); + return mapping->contents; +} + //===----------------------------------------------------------------------===// // iree_io_stream_t utilities //===----------------------------------------------------------------------===// diff --git a/runtime/src/iree/io/file_handle.h b/runtime/src/iree/io/file_handle.h index 6d46ad559d4c..53e5a37594cc 100644 --- a/runtime/src/iree/io/file_handle.h +++ b/runtime/src/iree/io/file_handle.h @@ -21,13 +21,13 @@ extern "C" { //===----------------------------------------------------------------------===// // Bits defining which operations are allowed on a file. +typedef uint32_t iree_io_file_access_t; enum iree_io_file_access_bits_t { // Allows operations that read from the file. IREE_IO_FILE_ACCESS_READ = 1u << 0, // Allows operations that write to the file. IREE_IO_FILE_ACCESS_WRITE = 1u << 1, }; -typedef uint32_t iree_io_file_access_t; //===----------------------------------------------------------------------===// // iree_io_file_handle_primitive_t @@ -156,6 +156,107 @@ static inline iree_io_file_handle_primitive_value_t iree_io_file_handle_value( IREE_API_EXPORT iree_status_t iree_io_file_handle_flush(iree_io_file_handle_t* handle); +//===----------------------------------------------------------------------===// +// iree_io_file_mapping_t +//===----------------------------------------------------------------------===// +// EXPERIMENTAL: this API may change once proper memory objects and views are +// added to iree/base/. This may just end up as a thin wrapper around that +// lower-level API with more fancy features (address placement, commit/decommit, +// etc) left to the lower-level API. We may add new APIs here for flush/sync +// as required. 
+ +// Flags used to control the behavior of mapped file views. +typedef uint64_t iree_io_file_mapping_flags_t; +enum iree_io_file_mapping_flag_bits_t { + IREE_IO_FILE_MAPPING_FLAG_NONE = 0u, + + // Indicates that the memory access pattern of the view is mostly sequential. + // Hints to the system that an LRU page cache and sequential prefetching are + // likely to be worth it. + // + // Implemented by MADV_SEQUENTIAL. + IREE_IO_FILE_MAPPING_FLAG_SEQUENTIAL_ACCESS = 1ull << 0, + + // Enables large page support for the given view, if available. + // Certain mapping modes such as mapping of existing files or opening + // mappings from another process where the allocation was not made with large + // pages may not support large pages and the flag will be silently ignored. + // In either case the memory view will be padded to the + // iree_memory_info_t::large_page_size regardless of whether the pages are + // actually large to the system. + // + // Use large pages to reduce the overhead involved in accessing + // hot-but-non-localized memory views that may otherwise spend a significant + // amount of time/capacity maintaining the TLB. As the platform and + // machine-dependent large page size is often several orders of magnitude + // larger than the normal page size (MB vs. KB) care should be used to only + // apply this to large allocations. + // + // Implemented by FILE_MAP_LARGE_PAGES/MAP_HUGETLB, where available. + IREE_IO_FILE_MAPPING_FLAG_LARGE_PAGES = 1ull << 1, + + // Excludes the view memory from minidumps/coredumps. + // This is a hint that the memory in the ranges are not useful to include in + // dumps, such as large chunks of read-only file data (model weights, images, + // etc). + // + // Implemented by WerRegisterExcludedMemoryBlock/MADV_DONTDUMP, where + // available. + IREE_IO_FILE_MAPPING_FLAG_EXCLUDE_FROM_DUMPS = 1ull << 2, + + // Privately map the memory into the calling process. 
+ // Other processes that may hold a reference to the file will not see changes. + // This is not a guarantee but an optimization to possibly avoid non-trivial + // kernel overheads. + // + // Implemented by MAP_PRIVATE, where available. + IREE_IO_FILE_MAPPING_FLAG_PRIVATE = 1ull << 3, +}; + +// A mapped file view into host memory. +// +// Thread-safe; the mapping is immutable and may be accessed from any thread. +// The **contents** of the mapping in the file should be coherent across threads +// within the same process but may not be across threads in different processes. +typedef struct iree_io_file_mapping_t iree_io_file_mapping_t; + +// Maps a view of a file into host-accessible memory. +// The provided file |handle| is retained for the lifetime of the view. +// To map the entire file specify a range of [0, IREE_HOST_SIZE_MAX]. +// +// If the provided file |handle| is already available for use as a host pointer +// it is returned directly. +IREE_API_EXPORT iree_status_t iree_io_file_map_view( + iree_io_file_handle_t* handle, iree_io_file_access_t access, + uint64_t offset, iree_host_size_t length, + iree_io_file_mapping_flags_t flags, iree_allocator_t host_allocator, + iree_io_file_mapping_t** out_mapping); + +// Retains the file |mapping| for the caller. The backing file handle will be +// retained as well. +IREE_API_EXPORT void iree_io_file_mapping_retain( + iree_io_file_mapping_t* mapping); + +// Releases the file |mapping| and its reference to the backing file handle. +// If the mapping was the last remaining retainer of the handle it will be +// closed. +IREE_API_EXPORT void iree_io_file_mapping_release( + iree_io_file_mapping_t* mapping); + +// Returns the length of the mapped view in bytes. +IREE_API_EXPORT iree_host_size_t +iree_io_file_mapping_length(const iree_io_file_mapping_t* mapping); + +// Returns a host-accessible read-only pointer to the file mapping memory. +// Returns iree_const_byte_span_empty if the mapping is not readable. 
+IREE_API_EXPORT iree_const_byte_span_t +iree_io_file_mapping_contents_ro(const iree_io_file_mapping_t* mapping); + +// Returns a host-accessible read-write pointer to the file mapping memory. +// Returns iree_byte_span_empty if the mapping is not writable. +IREE_API_EXPORT iree_byte_span_t +iree_io_file_mapping_contents_rw(iree_io_file_mapping_t* mapping); + //===----------------------------------------------------------------------===// // iree_io_stream_t utilities //===----------------------------------------------------------------------===// diff --git a/runtime/src/iree/io/formats/gguf/gguf_parser.c b/runtime/src/iree/io/formats/gguf/gguf_parser.c index 92230270fd4c..a48a2066c490 100644 --- a/runtime/src/iree/io/formats/gguf/gguf_parser.c +++ b/runtime/src/iree/io/formats/gguf/gguf_parser.c @@ -727,26 +727,24 @@ static iree_status_t iree_io_parse_gguf_index_from_memory( } IREE_API_EXPORT iree_status_t iree_io_parse_gguf_index( - iree_io_file_handle_t* file_handle, iree_io_parameter_index_t* index) { + iree_io_file_handle_t* file_handle, iree_io_parameter_index_t* index, + iree_allocator_t host_allocator) { IREE_ASSERT_ARGUMENT(index); IREE_TRACE_ZONE_BEGIN(z0); - // Today we only support memory files. - // TODO(benvanik): support iree_io_stream_t wrapping for parsing the index. - if (iree_io_file_handle_type(file_handle) != - IREE_IO_FILE_HANDLE_TYPE_HOST_ALLOCATION) { - IREE_TRACE_ZONE_END(z0); - return iree_make_status(IREE_STATUS_UNIMPLEMENTED, - "non-memory gguf files not yet supported"); - } - iree_byte_span_t host_allocation = - iree_io_file_handle_primitive(file_handle).value.host_allocation; + // The parser requires a host pointer but will only reference the file handle + // in the index. 
+ iree_io_file_mapping_t* file_mapping = NULL; + IREE_RETURN_AND_END_ZONE_IF_ERROR( + z0, iree_io_file_map_view(file_handle, IREE_IO_FILE_ACCESS_READ, 0, + IREE_HOST_SIZE_MAX, + IREE_IO_FILE_MAPPING_FLAG_EXCLUDE_FROM_DUMPS, + host_allocator, &file_mapping)); iree_status_t status = iree_io_parse_gguf_index_from_memory( - file_handle, - iree_make_const_byte_span(host_allocation.data, - host_allocation.data_length), - index); + file_handle, iree_io_file_mapping_contents_ro(file_mapping), index); + + iree_io_file_mapping_release(file_mapping); IREE_TRACE_ZONE_END(z0); return status; diff --git a/runtime/src/iree/io/formats/gguf/gguf_parser.h b/runtime/src/iree/io/formats/gguf/gguf_parser.h index fc0064baf989..e570e0c57dbe 100644 --- a/runtime/src/iree/io/formats/gguf/gguf_parser.h +++ b/runtime/src/iree/io/formats/gguf/gguf_parser.h @@ -19,8 +19,12 @@ extern "C" { // // Specification: // https://github.com/ggerganov/ggml/blob/master/docs/gguf.md +// +// The provided |host_allocator| may be used for allocations during parsing and +// is allowed to be an arena. 
IREE_API_EXPORT iree_status_t iree_io_parse_gguf_index( - iree_io_file_handle_t* file_handle, iree_io_parameter_index_t* index); + iree_io_file_handle_t* file_handle, iree_io_parameter_index_t* index, + iree_allocator_t host_allocator); #ifdef __cplusplus } // extern "C" diff --git a/runtime/src/iree/io/formats/gguf/gguf_parser_test.cc b/runtime/src/iree/io/formats/gguf/gguf_parser_test.cc index 1845e7afce2d..2477806999d4 100644 --- a/runtime/src/iree/io/formats/gguf/gguf_parser_test.cc +++ b/runtime/src/iree/io/formats/gguf/gguf_parser_test.cc @@ -38,7 +38,8 @@ TEST(GgufFormatTest, Empty) { iree_io_parameter_index_create(iree_allocator_system(), &index)); iree_io_file_handle_t* file_handle = OpenTestFile("empty.gguf"); - IREE_ASSERT_OK(iree_io_parse_gguf_index(file_handle, index)); + IREE_ASSERT_OK( + iree_io_parse_gguf_index(file_handle, index, iree_allocator_system())); iree_io_file_handle_release(file_handle); iree_io_parameter_index_release(index); @@ -50,7 +51,8 @@ TEST(GgufFormatTest, SingleTensor) { iree_io_parameter_index_create(iree_allocator_system(), &index)); iree_io_file_handle_t* file_handle = OpenTestFile("single.gguf"); - IREE_ASSERT_OK(iree_io_parse_gguf_index(file_handle, index)); + IREE_ASSERT_OK( + iree_io_parse_gguf_index(file_handle, index, iree_allocator_system())); iree_io_file_handle_release(file_handle); const iree_io_parameter_index_entry_t* entry0 = NULL; @@ -71,7 +73,8 @@ TEST(GgufFormatTest, SingleTensorV2) { iree_io_parameter_index_create(iree_allocator_system(), &index)); iree_io_file_handle_t* file_handle = OpenTestFile("single_v2.gguf"); - IREE_ASSERT_OK(iree_io_parse_gguf_index(file_handle, index)); + IREE_ASSERT_OK( + iree_io_parse_gguf_index(file_handle, index, iree_allocator_system())); iree_io_file_handle_release(file_handle); const iree_io_parameter_index_entry_t* entry0 = NULL; @@ -91,7 +94,8 @@ TEST(GgufFormatTest, MultipleTensors) { iree_io_parameter_index_create(iree_allocator_system(), &index)); iree_io_file_handle_t* 
file_handle = OpenTestFile("multiple.gguf"); - IREE_ASSERT_OK(iree_io_parse_gguf_index(file_handle, index)); + IREE_ASSERT_OK( + iree_io_parse_gguf_index(file_handle, index, iree_allocator_system())); iree_io_file_handle_release(file_handle); const iree_io_parameter_index_entry_t* entry0 = NULL; diff --git a/runtime/src/iree/io/formats/irpa/irpa_parser.c b/runtime/src/iree/io/formats/irpa/irpa_parser.c index dd09b3a8b441..fe69a4600435 100644 --- a/runtime/src/iree/io/formats/irpa/irpa_parser.c +++ b/runtime/src/iree/io/formats/irpa/irpa_parser.c @@ -322,27 +322,26 @@ static iree_status_t iree_io_parse_irpa_index_from_memory( } IREE_API_EXPORT iree_status_t iree_io_parse_irpa_index( - iree_io_file_handle_t* file_handle, iree_io_parameter_index_t* index) { + iree_io_file_handle_t* file_handle, iree_io_parameter_index_t* index, + iree_allocator_t host_allocator) { IREE_ASSERT_ARGUMENT(index); IREE_TRACE_ZONE_BEGIN(z0); - // Today we only support memory files. - // TODO(benvanik): support iree_io_stream_t wrapping for parsing the index. - if (iree_io_file_handle_type(file_handle) != - IREE_IO_FILE_HANDLE_TYPE_HOST_ALLOCATION) { - IREE_TRACE_ZONE_END(z0); - return iree_make_status(IREE_STATUS_UNIMPLEMENTED, - "non-memory irpa files not yet supported"); - } - iree_byte_span_t host_allocation = - iree_io_file_handle_primitive(file_handle).value.host_allocation; + // The parser requires a host pointer but will only reference the file handle + // in the index. 
+ iree_io_file_mapping_t* file_mapping = NULL; + IREE_RETURN_AND_END_ZONE_IF_ERROR( + z0, iree_io_file_map_view(file_handle, IREE_IO_FILE_ACCESS_READ, 0, + IREE_HOST_SIZE_MAX, + IREE_IO_FILE_MAPPING_FLAG_EXCLUDE_FROM_DUMPS, + host_allocator, &file_mapping)); iree_status_t status = iree_io_parse_irpa_index_from_memory( - file_handle, - iree_make_const_byte_span(host_allocation.data, - host_allocation.data_length), + file_handle, iree_io_file_mapping_contents_ro(file_mapping), /*base_offset=*/0, index); + iree_io_file_mapping_release(file_mapping); + IREE_TRACE_ZONE_END(z0); return status; } diff --git a/runtime/src/iree/io/formats/irpa/irpa_parser.h b/runtime/src/iree/io/formats/irpa/irpa_parser.h index 4423105606c7..e5283fdc725c 100644 --- a/runtime/src/iree/io/formats/irpa/irpa_parser.h +++ b/runtime/src/iree/io/formats/irpa/irpa_parser.h @@ -16,8 +16,12 @@ extern "C" { #endif // __cplusplus // Parses an IREE archive file and merges its contained resources into |index|. +// +// The provided |host_allocator| may be used for allocations during parsing and +// is allowed to be an arena. 
IREE_API_EXPORT iree_status_t iree_io_parse_irpa_index( - iree_io_file_handle_t* file_handle, iree_io_parameter_index_t* index); + iree_io_file_handle_t* file_handle, iree_io_parameter_index_t* index, + iree_allocator_t host_allocator); #ifdef __cplusplus } // extern "C" diff --git a/runtime/src/iree/io/formats/irpa/irpa_parser_test.cc b/runtime/src/iree/io/formats/irpa/irpa_parser_test.cc index 944c2b406568..c6f3bc7909a1 100644 --- a/runtime/src/iree/io/formats/irpa/irpa_parser_test.cc +++ b/runtime/src/iree/io/formats/irpa/irpa_parser_test.cc @@ -38,7 +38,8 @@ TEST(IrpaFormatTest, Empty) { iree_io_parameter_index_create(iree_allocator_system(), &index)); iree_io_file_handle_t* file_handle = OpenTestFile("empty.irpa"); - IREE_ASSERT_OK(iree_io_parse_irpa_index(file_handle, index)); + IREE_ASSERT_OK( + iree_io_parse_irpa_index(file_handle, index, iree_allocator_system())); EXPECT_EQ(0, iree_io_parameter_index_count(index)); iree_io_file_handle_release(file_handle); @@ -51,7 +52,8 @@ TEST(IrpaFormatTest, SingleParameters) { iree_io_parameter_index_create(iree_allocator_system(), &index)); iree_io_file_handle_t* file_handle = OpenTestFile("single.irpa"); - IREE_ASSERT_OK(iree_io_parse_irpa_index(file_handle, index)); + IREE_ASSERT_OK( + iree_io_parse_irpa_index(file_handle, index, iree_allocator_system())); EXPECT_EQ(1, iree_io_parameter_index_count(index)); iree_io_file_handle_release(file_handle); @@ -73,7 +75,8 @@ TEST(IrpaFormatTest, MultipleParameters) { iree_io_parameter_index_create(iree_allocator_system(), &index)); iree_io_file_handle_t* file_handle = OpenTestFile("multiple.irpa"); - IREE_ASSERT_OK(iree_io_parse_irpa_index(file_handle, index)); + IREE_ASSERT_OK( + iree_io_parse_irpa_index(file_handle, index, iree_allocator_system())); EXPECT_EQ(2, iree_io_parameter_index_count(index)); iree_io_file_handle_release(file_handle); @@ -104,7 +107,8 @@ TEST(IrpaFormatTest, MixedDataAndSplats) { iree_io_parameter_index_create(iree_allocator_system(), &index)); 
iree_io_file_handle_t* file_handle = OpenTestFile("mixed.irpa"); - IREE_ASSERT_OK(iree_io_parse_irpa_index(file_handle, index)); + IREE_ASSERT_OK( + iree_io_parse_irpa_index(file_handle, index, iree_allocator_system())); EXPECT_EQ(4, iree_io_parameter_index_count(index)); iree_io_file_handle_release(file_handle); diff --git a/runtime/src/iree/io/formats/parser_registry.c b/runtime/src/iree/io/formats/parser_registry.c index 716f07f905b5..0c6bac6b0e08 100644 --- a/runtime/src/iree/io/formats/parser_registry.c +++ b/runtime/src/iree/io/formats/parser_registry.c @@ -13,7 +13,7 @@ IREE_API_EXPORT iree_status_t iree_io_parse_file_index( iree_string_view_t path, iree_io_file_handle_t* file_handle, - iree_io_parameter_index_t* index) { + iree_io_parameter_index_t* index, iree_allocator_t host_allocator) { IREE_TRACE_ZONE_BEGIN(z0); IREE_TRACE_ZONE_APPEND_TEXT(z0, path.data, path.size); @@ -28,11 +28,12 @@ IREE_API_EXPORT iree_status_t iree_io_parse_file_index( iree_status_t status = iree_ok_status(); if (iree_string_view_equal_case(extension, IREE_SV("irpa"))) { - status = iree_io_parse_irpa_index(file_handle, index); + status = iree_io_parse_irpa_index(file_handle, index, host_allocator); } else if (iree_string_view_equal_case(extension, IREE_SV("gguf"))) { - status = iree_io_parse_gguf_index(file_handle, index); + status = iree_io_parse_gguf_index(file_handle, index, host_allocator); } else if (iree_string_view_equal_case(extension, IREE_SV("safetensors"))) { - status = iree_io_parse_safetensors_index(file_handle, index); + status = + iree_io_parse_safetensors_index(file_handle, index, host_allocator); } else { status = iree_make_status( IREE_STATUS_UNIMPLEMENTED, diff --git a/runtime/src/iree/io/formats/parser_registry.h b/runtime/src/iree/io/formats/parser_registry.h index c982d08a3edd..404f8ec6a4ed 100644 --- a/runtime/src/iree/io/formats/parser_registry.h +++ b/runtime/src/iree/io/formats/parser_registry.h @@ -19,9 +19,12 @@ extern "C" { // |path| is used for 
logging and file format identification. It may either be // the original file path of |file_handle| or an extension (such as `irpa`). // Upon return any parameters in the file are appended to the |index|. +// +// The provided |host_allocator| may be used for allocations during parsing and +// is allowed to be an arena. IREE_API_EXPORT iree_status_t iree_io_parse_file_index( iree_string_view_t path, iree_io_file_handle_t* file_handle, - iree_io_parameter_index_t* index); + iree_io_parameter_index_t* index, iree_allocator_t host_allocator); #ifdef __cplusplus } // extern "C" diff --git a/runtime/src/iree/io/formats/safetensors/safetensors_parser.c b/runtime/src/iree/io/formats/safetensors/safetensors_parser.c index 04eda053d1ae..66c842cc2bde 100644 --- a/runtime/src/iree/io/formats/safetensors/safetensors_parser.c +++ b/runtime/src/iree/io/formats/safetensors/safetensors_parser.c @@ -501,26 +501,28 @@ static iree_status_t iree_io_parse_safetensors_index_from_memory( } IREE_API_EXPORT iree_status_t iree_io_parse_safetensors_index( - iree_io_file_handle_t* file_handle, iree_io_parameter_index_t* index) { + iree_io_file_handle_t* file_handle, iree_io_parameter_index_t* index, + iree_allocator_t host_allocator) { IREE_ASSERT_ARGUMENT(index); IREE_TRACE_ZONE_BEGIN(z0); - // Today we only support memory files. - // TODO(benvanik): support iree_io_stream_t wrapping for parsing the index. - if (iree_io_file_handle_type(file_handle) != - IREE_IO_FILE_HANDLE_TYPE_HOST_ALLOCATION) { - IREE_TRACE_ZONE_END(z0); - return iree_make_status(IREE_STATUS_UNIMPLEMENTED, - "non-memory safetensors files not yet supported"); - } - iree_byte_span_t host_allocation = - iree_io_file_handle_primitive(file_handle).value.host_allocation; + // The parser requires a host pointer but will only reference the file handle + // in the index. 
It'd be easy to change this parser to use stream-based + // reading as we could just read the header bytes and JSON blob into transient + // memory but the intent is that parameter parsing should not allocate large + // amounts of memory and this keeps the behavior consistent with other + // implementations. + iree_io_file_mapping_t* file_mapping = NULL; + IREE_RETURN_AND_END_ZONE_IF_ERROR( + z0, iree_io_file_map_view(file_handle, IREE_IO_FILE_ACCESS_READ, 0, + IREE_HOST_SIZE_MAX, + IREE_IO_FILE_MAPPING_FLAG_EXCLUDE_FROM_DUMPS, + host_allocator, &file_mapping)); iree_status_t status = iree_io_parse_safetensors_index_from_memory( - file_handle, - iree_make_const_byte_span(host_allocation.data, - host_allocation.data_length), - index); + file_handle, iree_io_file_mapping_contents_ro(file_mapping), index); + + iree_io_file_mapping_release(file_mapping); IREE_TRACE_ZONE_END(z0); return status; diff --git a/runtime/src/iree/io/formats/safetensors/safetensors_parser.h b/runtime/src/iree/io/formats/safetensors/safetensors_parser.h index c32c104f06a4..2ec2bcc82e65 100644 --- a/runtime/src/iree/io/formats/safetensors/safetensors_parser.h +++ b/runtime/src/iree/io/formats/safetensors/safetensors_parser.h @@ -31,8 +31,12 @@ extern "C" { // don't take that dependency for a testing tool. Users wanting to productionize // this should implement their own safetensors parser or use the rust one with // all the fun that entails. +// +// The provided |host_allocator| may be used for allocations during parsing and +// is allowed to be an arena. 
IREE_API_EXPORT iree_status_t iree_io_parse_safetensors_index( - iree_io_file_handle_t* file_handle, iree_io_parameter_index_t* index); + iree_io_file_handle_t* file_handle, iree_io_parameter_index_t* index, + iree_allocator_t host_allocator); #ifdef __cplusplus } // extern "C" diff --git a/runtime/src/iree/io/formats/safetensors/safetensors_parser_test.cc b/runtime/src/iree/io/formats/safetensors/safetensors_parser_test.cc index 44caf7bba949..821865c4be20 100644 --- a/runtime/src/iree/io/formats/safetensors/safetensors_parser_test.cc +++ b/runtime/src/iree/io/formats/safetensors/safetensors_parser_test.cc @@ -38,7 +38,8 @@ TEST(SafetensorsFormatTest, Empty) { iree_io_parameter_index_create(iree_allocator_system(), &index)); iree_io_file_handle_t* file_handle = OpenTestFile("empty.safetensors"); - IREE_ASSERT_OK(iree_io_parse_safetensors_index(file_handle, index)); + IREE_ASSERT_OK(iree_io_parse_safetensors_index(file_handle, index, + iree_allocator_system())); iree_io_file_handle_release(file_handle); iree_io_parameter_index_release(index); @@ -50,7 +51,8 @@ TEST(SafetensorsFormatTest, SingleTensor) { iree_io_parameter_index_create(iree_allocator_system(), &index)); iree_io_file_handle_t* file_handle = OpenTestFile("single.safetensors"); - IREE_ASSERT_OK(iree_io_parse_safetensors_index(file_handle, index)); + IREE_ASSERT_OK(iree_io_parse_safetensors_index(file_handle, index, + iree_allocator_system())); iree_io_file_handle_release(file_handle); const iree_io_parameter_index_entry_t* entry0 = NULL; @@ -70,7 +72,8 @@ TEST(SafetensorsFormatTest, MultipleTensors) { iree_io_parameter_index_create(iree_allocator_system(), &index)); iree_io_file_handle_t* file_handle = OpenTestFile("multiple.safetensors"); - IREE_ASSERT_OK(iree_io_parse_safetensors_index(file_handle, index)); + IREE_ASSERT_OK(iree_io_parse_safetensors_index(file_handle, index, + iree_allocator_system())); iree_io_file_handle_release(file_handle); const iree_io_parameter_index_entry_t* entry0 = NULL; diff 
--git a/runtime/src/iree/tooling/parameter_util.c b/runtime/src/iree/tooling/parameter_util.c index 07a52960ae0d..e311f4ff664a 100644 --- a/runtime/src/iree/tooling/parameter_util.c +++ b/runtime/src/iree/tooling/parameter_util.c @@ -106,7 +106,8 @@ static iree_status_t iree_io_append_parameter_file_to_index( z0, iree_io_open_parameter_file(path, host_allocator, &file_handle)); // Index the file based on its (inferred) format. - iree_status_t status = iree_io_parse_file_index(path, file_handle, index); + iree_status_t status = + iree_io_parse_file_index(path, file_handle, index, host_allocator); // Release our file reference - it's still retained by the index if it had any // parameters in it. From 4e29bbb2f9a6e86dd7fc7a19218f93b8621db072 Mon Sep 17 00:00:00 2001 From: Rob Suderman Date: Wed, 18 Dec 2024 12:30:58 -0800 Subject: [PATCH 41/64] Bump Sharktank forward to bypass failing test flag (#19519) --- .github/workflows/pkgci_test_sharktank.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/pkgci_test_sharktank.yml b/.github/workflows/pkgci_test_sharktank.yml index 67204a8fcf41..211d88bd4a8c 100644 --- a/.github/workflows/pkgci_test_sharktank.yml +++ b/.github/workflows/pkgci_test_sharktank.yml @@ -60,7 +60,7 @@ jobs: uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2 with: repository: iree-org/iree-test-suites - ref: dab5402da679e713b94a8bd5c400a51cce0e665a + ref: fece13306aff1d8ef33858dccfdb10eaf8b036c2 path: iree-test-suites lfs: true - name: Install Sharktank models test suite requirements From 078c3ec8061a228cc23ed5952f61700049da8101 Mon Sep 17 00:00:00 2001 From: Boian Petkantchin Date: Wed, 18 Dec 2024 13:12:37 -0800 Subject: [PATCH 42/64] [runtime][python] Add IRPA entry conversion to/from numpy (#19492) Add interop between numpy ndarray and parameter index.
This is an adaptation of the original in IREE Turbine https://github.com/iree-org/iree-turbine/blob/142c8a5044a4fedb43a11229f462363b05743b23/iree/turbine/aot/params.py The goal is to maintain compatibility with IRPA files that were already generated with IREE Turbine. At some point we can refactor the IREE Turbine side to use this implementation. Signed-off-by: Boian Petkantchin --- .../python/iree/runtime/array_interop.py | 1 - runtime/bindings/python/iree/runtime/io.py | 170 +++++++++++++++++- runtime/bindings/python/tests/io_test.py | 41 +++++ ...generate_tensor_saved_with_iree_turbine.py | 7 + .../tensor_saved_with_iree_turbine.irpa | Bin 0 -> 4096 bytes 5 files changed, 214 insertions(+), 5 deletions(-) create mode 100644 runtime/bindings/python/tests/testdata/generate_tensor_saved_with_iree_turbine.py create mode 100644 runtime/bindings/python/tests/testdata/tensor_saved_with_iree_turbine.irpa diff --git a/runtime/bindings/python/iree/runtime/array_interop.py b/runtime/bindings/python/iree/runtime/array_interop.py index 863456228a91..b5859ab1b873 100644 --- a/runtime/bindings/python/iree/runtime/array_interop.py +++ b/runtime/bindings/python/iree/runtime/array_interop.py @@ -294,7 +294,6 @@ def asdevicearray( (np.float16, HalElementType.FLOAT_16), (np.float32, HalElementType.FLOAT_32), (np.float64, HalElementType.FLOAT_64), - (np.float16, HalElementType.FLOAT_16), (np.int32, HalElementType.SINT_32), (np.int64, HalElementType.SINT_64), (np.int16, HalElementType.SINT_16), diff --git a/runtime/bindings/python/iree/runtime/io.py b/runtime/bindings/python/iree/runtime/io.py index 246243a2ca8c..e9d54dae71c7 100644 --- a/runtime/bindings/python/iree/runtime/io.py +++ b/runtime/bindings/python/iree/runtime/io.py @@ -8,13 +8,16 @@ import array from functools import reduce -import numpy +import json +import numpy as np from os import PathLike -from pathlib import Path -from ._binding import ParameterIndex +from ._binding import ParameterIndex, ParameterIndexEntry 
__all__ = [ + "parameter_index_add_numpy_ndarray", + "parameter_index_entry_as_numpy_flat_ndarray", + "parameter_index_entry_as_numpy_ndarray", "SplatValue", "save_archive_file", ] @@ -23,7 +26,7 @@ class SplatValue: def __init__( self, - pattern: Union[array.array, numpy.ndarray], + pattern: Union[array.array, np.ndarray], count: Union[Sequence[int], int], ): if hasattr(pattern, "shape"): @@ -64,3 +67,162 @@ def save_archive_file(entries: dict[str, Union[Any, SplatValue]], file_path: Pat else: index.add_buffer(key, value) index.create_archive_file(str(file_path)) + + +def parameter_index_add_numpy_ndarray( + index: ParameterIndex, name: str, array: np.ndarray +): + """Adds an ndarray to the index.""" + metadata = _make_tensor_metadata(array) + # 0d arrays are special in both torch/numpy in different ways that makes + # it hard to reliably get a memory view of their contents. Since we + # know that 0d is always small, we just force a copy when in numpy + # land and that seems to get it on the happy path. + # See: https://github.com/iree-org/iree-turbine/issues/29 + if len(array.shape) == 0: + flat_array = array.copy() + else: + flat_array = np.ascontiguousarray(array).view(np.uint8) + index.add_buffer(name, flat_array, metadata=metadata) + + +def parameter_index_entry_as_numpy_flat_ndarray( + index_entry: ParameterIndexEntry, +) -> np.ndarray: + """Accesses the contents as a uint8 flat tensor. + + If it is a splat, then the tensor will be a view of the splat pattern. + + Raises a ValueError on unsupported entries. + """ + if index_entry.is_file: + wrapper = np.array(index_entry.file_view, copy=False) + elif index_entry.is_splat: + wrapper = np.array(index_entry.splat_pattern, copy=True) + else: + raise ValueError(f"Unsupported ParameterIndexEntry: {index_entry}") + + return wrapper + + +def parameter_index_entry_as_numpy_ndarray( + index_entry: ParameterIndexEntry, +) -> np.ndarray: + """Returns a tensor viewed with appropriate shape/dtype from metadata. 
+ + Raises a ValueError if unsupported. + """ + + # Decode metadata. + versioned_metadata = index_entry.metadata.decode() + metadata_parts = versioned_metadata.split(_metadata_version_separator, maxsplit=1) + if len(metadata_parts) == 1: + raise ValueError( + ( + f'Invalid metadata for parameter index entry "{index_entry.key}".' + f' Expected format version prefix not found in "{metadata_parts[0][:100]}".' + ) + ) + format_version = metadata_parts[0] + metadata = metadata_parts[1] + if ( + format_version != _metadata_version + and format_version != _metadata_iree_turbine_version + ): + raise ValueError( + ( + f'Unsupported metadata format version "{format_version}" for parameter ' + f'index entry "{index_entry.key}": Cannot convert to tensor' + ) + ) + d = json.loads(metadata) + try: + type_name = d["type"] + if d["type"] != "Tensor": + raise ValueError( + f"Metadata for parameter entry {index_entry.key} is not a Tensor ('{type_name}')" + ) + dtype_name = d["dtype"] + shape = d["shape"] + except KeyError as e: + raise ValueError(f"Bad metadata for parameter entry {index_entry.key}") from e + + # Unpack/validate. 
+ try: + dtype = _NAME_TO_DTYPE[dtype_name] + except KeyError: + raise ValueError(f"Unknown dtype name '{dtype_name}'") + try: + shape = [int(d) for d in shape] + except ValueError as e: + raise ValueError(f"Illegal shape for parameter entry {index_entry.key}") from e + + t = parameter_index_entry_as_numpy_flat_ndarray(index_entry) + return t.view(dtype=dtype).reshape(shape) + + +_DTYPE_TO_NAME = ( + (np.float16, "float16"), + (np.float32, "float32"), + (np.float64, "float64"), + (np.int32, "int32"), + (np.int64, "int64"), + (np.int16, "int16"), + (np.int8, "int8"), + (np.uint32, "uint32"), + (np.uint64, "uint64"), + (np.uint16, "uint16"), + (np.uint8, "uint8"), + (np.bool_, "bool"), + (np.complex64, "complex64"), + (np.complex128, "complex128"), +) + +_NAME_TO_DTYPE: dict[str, np.dtype] = { + name: np_dtype for np_dtype, name in _DTYPE_TO_NAME +} + + +def _map_dtype_to_name(dtype) -> str: + for match_dtype, dtype_name in _DTYPE_TO_NAME: + if match_dtype == dtype: + return dtype_name + + raise KeyError(f"Numpy dtype {dtype} not found.") + + +_metadata_version = "TENSORv0" +"""Magic number to identify the format version. +The current version that will be used when adding tensors to a parameter index.""" + +_metadata_iree_turbine_version = "PYTORCH" +"""There are files created with IREE Turbine that use this prefix. +This is here to maintain the ability to load such files.""" + +_metadata_version_separator = ":" +"""The separator between the format version and the actual metadata. 
+The metadata has the following format """ + + +def _make_tensor_metadata(t: np.ndarray) -> str: + """Makes a tensor metadata blob that can be used to reconstruct the tensor.""" + dtype = t.dtype + dtype_name = _map_dtype_to_name(dtype) + is_complex = np.issubdtype(dtype, np.complexfloating) + is_floating_point = np.issubdtype(dtype, np.floating) + is_signed = np.issubdtype(dtype, np.signedinteger) + dtype_desc = { + "class_name": type(dtype).__name__, + "is_complex": is_complex, + "is_floating_point": is_floating_point, + "is_signed": is_signed, + "itemsize": dtype.itemsize, + } + d = { + "type": "Tensor", + "dtype": dtype_name, + "shape": list(t.shape), + "dtype_desc": dtype_desc, + } + encoded = f"{_metadata_version}{_metadata_version_separator}{json.dumps(d)}" + return encoded diff --git a/runtime/bindings/python/tests/io_test.py b/runtime/bindings/python/tests/io_test.py index 026c9b665364..a854cb9d9a2e 100644 --- a/runtime/bindings/python/tests/io_test.py +++ b/runtime/bindings/python/tests/io_test.py @@ -96,6 +96,47 @@ def verify_archive(file_path: Path): verify_archive(file_path) gc.collect() + def testParameterIndexEntryFromToNumpy(self): + array = np.array([[1, 2], [3, 4]], dtype=np.int32) + index = rt.ParameterIndex() + key = "key" + rt.parameter_index_add_numpy_ndarray(index, key, array) + assert index.items()[0][0] == key + index_entry_as_array = rt.parameter_index_entry_as_numpy_ndarray( + index.items()[0][1] + ) + np.testing.assert_equal(index_entry_as_array, array) + + def testParameterIndexEntryFromToNumpyZeroDims(self): + array = np.array(1234, dtype=np.int32) + index = rt.ParameterIndex() + key = "key" + rt.parameter_index_add_numpy_ndarray(index, key, array) + assert index.items()[0][0] == key + index_entry_as_array = rt.parameter_index_entry_as_numpy_ndarray( + index.items()[0][1] + ) + np.testing.assert_equal(index_entry_as_array, array) + + def testParameterIndexEntryFromIreeTurbine(self): + """Verify that we are able to load a tensor from 
IRPA generated with IREE + Turbine. + We want to maintain backward compatibility with existing IRPA files.""" + index = rt.ParameterIndex() + irpa_path = str( + Path(__file__).resolve().parent + / "testdata" + / "tensor_saved_with_iree_turbine.irpa" + ) + index.load(irpa_path) + items = index.items() + assert len(items) == 1 + key, entry = items[0] + assert key == "the_torch_tensor" + index_entry_as_array = rt.parameter_index_entry_as_numpy_ndarray(entry) + expected_array = np.array([1, 2, 3, 4], dtype=np.uint8) + np.testing.assert_array_equal(index_entry_as_array, expected_array, strict=True) + def testFileHandleWrap(self): fh = rt.FileHandle.wrap_memory(b"foobar") view = fh.host_allocation diff --git a/runtime/bindings/python/tests/testdata/generate_tensor_saved_with_iree_turbine.py b/runtime/bindings/python/tests/testdata/generate_tensor_saved_with_iree_turbine.py new file mode 100644 index 000000000000..3a8167fc296c --- /dev/null +++ b/runtime/bindings/python/tests/testdata/generate_tensor_saved_with_iree_turbine.py @@ -0,0 +1,7 @@ +from iree.turbine.aot import ParameterArchiveBuilder +import torch + +archive = ParameterArchiveBuilder() +tensor = torch.tensor([1, 2, 3, 4], dtype=torch.uint8) +archive.add_tensor("the_torch_tensor", tensor) +archive.save("tensor_saved_with_iree_turbine.irpa") diff --git a/runtime/bindings/python/tests/testdata/tensor_saved_with_iree_turbine.irpa b/runtime/bindings/python/tests/testdata/tensor_saved_with_iree_turbine.irpa new file mode 100644 index 0000000000000000000000000000000000000000..c9a4cea922b824c7aec79b0d79e1ed7777223b83 GIT binary patch literal 4096 zcmeH@JqiLb5QY6afJczjHiCs%DK-j%A{L5B47?Vs7!5;(hoHBYF)G_KE^@)1JU=3?8K40-hLF@ST`oOwZ$Yf8y(?*h8 zvkxL}r3SN~F6WckVA#DddrBqiHrPD(S+l`HZyl7joy>tqUCR<~BCMD!t>WaqB;Opmth@MbF9v)H&PSx?aEjR<2b2 g_r5w*5&>Y5%?d07tk(8hyVZp literal 0 HcmV?d00001 From ce659488d2ed07cc944d05e096b1f91b93f28709 Mon Sep 17 00:00:00 2001 From: Ben Vanik Date: Wed, 18 Dec 2024 13:26:16 -0800 Subject: [PATCH 43/64] 
Adding iree_io_file_handle_create/iree_io_file_handle_open. (#19510) These utilities are optional but an easy way to get a file handle that is compatible with the rest of IREE. This largely replaces the need for the existing file_io utilities but cleanup/unification is left for future changes. Command line tools now default to --parameter_mode=file though users can force mmap or preload. There's some flags that are available on the open that may help or hurt performance and the defaults are provisional. --- runtime/src/iree/base/config.h | 31 ++- runtime/src/iree/base/internal/file_io.h | 2 + runtime/src/iree/io/file_handle.c | 283 ++++++++++++++++++++++ runtime/src/iree/io/file_handle.h | 51 ++++ runtime/src/iree/io/stdio_stream.c | 12 +- runtime/src/iree/tooling/parameter_util.c | 79 +++--- 6 files changed, 418 insertions(+), 40 deletions(-) diff --git a/runtime/src/iree/base/config.h b/runtime/src/iree/base/config.h index 1c87930baade..105ad1766a3c 100644 --- a/runtime/src/iree/base/config.h +++ b/runtime/src/iree/base/config.h @@ -128,6 +128,8 @@ typedef IREE_DEVICE_SIZE_T iree_device_size_t; //===----------------------------------------------------------------------===// // Synchronization and threading //===----------------------------------------------------------------------===// + +#if !defined(IREE_SYNCHRONIZATION_DISABLE_UNSAFE) // On ultra-tiny systems where there may only be a single core - or a single // core that is guaranteed to ever call an IREE API - all synchronization // primitives used throughout IREE can be turned into no-ops. Note that behavior @@ -135,31 +137,46 @@ typedef IREE_DEVICE_SIZE_T iree_device_size_t; // owned by IREE from multiple threads concurrently or across threads without // proper barriers in place. Unless your target system is in a similar class to // an Arduino this is definitely not what you want. 
- -#if !defined(IREE_SYNCHRONIZATION_DISABLE_UNSAFE) #define IREE_SYNCHRONIZATION_DISABLE_UNSAFE 0 #endif // !IREE_SYNCHRONIZATION_DISABLE_UNSAFE //===----------------------------------------------------------------------===// // File I/O //===----------------------------------------------------------------------===// + +#if !defined(IREE_FILE_IO_ENABLE) // On platforms without file systems or in applications where no file I/O // utilities are used, all file I/O operations can be stripped out. Functions // relying on file I/O will still be defined, but they will return errors. - -#if !defined(IREE_FILE_IO_ENABLE) #define IREE_FILE_IO_ENABLE 1 #endif // !IREE_FILE_IO_ENABLE +#if !defined(IREE_MAX_PATH) +// Maximum path C string length in characters excluding the NUL terminator. +// We stack allocate the path and want to keep it small enough to reasonably +// fit on any particular thread's stack (which may be as small as 64KB and we +// don't know who may be in the call stack above us). +// +// PATH_MAX is linux-only but _sometimes_ available on other platforms we use +// this code path for. Only when it's available it may indicate the PATH_MAX +// with _or_ without the NUL terminator. If we guess too large here the platform +// will fail as it scans the path and if we guess too small then users may want +// to re-evaluate their usage of the filesystem. +// +// MAX_PATH is 260 but most systems nowadays have long paths enabled and we +// don't want to limit ourselves to that. +#define IREE_MAX_PATH 2047 +#endif // !IREE_MAX_PATH + //===----------------------------------------------------------------------===// // Statistics/reporting //===----------------------------------------------------------------------===// + +#if !defined(IREE_STATISTICS_ENABLE) // Conditionally enables programmatic access to aggregate statistics. When // enabled statistics requires additional per-operation logic and per-resource // state that can bloat otherwise minimal structures. 
Shared resources may also // require synchronization where there otherwise would not be any. - -#if !defined(IREE_STATISTICS_ENABLE) #define IREE_STATISTICS_ENABLE 1 #endif // !IREE_STATISTICS_ENABLE @@ -173,6 +190,7 @@ typedef IREE_DEVICE_SIZE_T iree_device_size_t; // Specify a custom header with `-DIREE_TRACING_PROVIDER_H="my_provider.h"`. // Specify a dependency with `-DIREE_TRACING_PROVIDER=my_provider_target`. +#if !defined(IREE_TRACING_MODE) // Set IREE_TRACING_FEATURES based on IREE_TRACING_MODE if the user hasn't // overridden it with more specific settings. // @@ -181,7 +199,6 @@ typedef IREE_DEVICE_SIZE_T iree_device_size_t; // IREE_TRACING_MODE = 2: same as 1 with added allocation tracking // IREE_TRACING_MODE = 3: same as 2 with callstacks for allocations // IREE_TRACING_MODE = 4: same as 3 with callstacks for all instrumentation -#if !defined(IREE_TRACING_MODE) #define IREE_TRACING_MODE 0 #endif // !IREE_TRACING_MODE diff --git a/runtime/src/iree/base/internal/file_io.h b/runtime/src/iree/base/internal/file_io.h index 7f64b583061c..85b55c699e59 100644 --- a/runtime/src/iree/base/internal/file_io.h +++ b/runtime/src/iree/base/internal/file_io.h @@ -48,6 +48,8 @@ void iree_file_contents_free(iree_file_contents_t* contents); typedef enum iree_file_read_flag_bits_t { IREE_FILE_READ_FLAG_PRELOAD = (1u << 0), + // TODO(benvanik): drop this (and possibly all file utilities) in favor of + // iree_io_file_handle_t + iree_io_file_map_view. 
IREE_FILE_READ_FLAG_MMAP = (1u << 1), IREE_FILE_READ_FLAG_DEFAULT = IREE_FILE_READ_FLAG_PRELOAD, } iree_file_read_flags_t; diff --git a/runtime/src/iree/io/file_handle.c b/runtime/src/iree/io/file_handle.c index 40d5f8562b2d..9a6727bcf31c 100644 --- a/runtime/src/iree/io/file_handle.c +++ b/runtime/src/iree/io/file_handle.c @@ -12,11 +12,13 @@ #if IREE_FILE_IO_ENABLE #if defined(IREE_PLATFORM_WINDOWS) +#include // _open_osfhandle constants #include // _commit #include // WerRegisterExcludedMemoryBlock #else +#include // open #include // mmap #include // fstat #include // fsync @@ -160,6 +162,287 @@ iree_io_file_handle_flush(iree_io_file_handle_t* handle) { return status; } +//===----------------------------------------------------------------------===// +// iree_io_file_handle_t utilities +//===----------------------------------------------------------------------===// + +#if IREE_FILE_IO_ENABLE + +#if defined(IREE_PLATFORM_WINDOWS) + +static iree_status_t iree_io_file_handle_platform_open( + iree_io_file_mode_t mode, iree_string_view_t path, bool open_existing, + uint64_t initial_size, + iree_io_file_handle_primitive_t* out_handle_primitive) { + IREE_ASSERT_ARGUMENT(out_handle_primitive); + memset(out_handle_primitive, 0, sizeof(*out_handle_primitive)); + + // Convert path from a string view to a NUL-terminated C string. 
+ if (path.size >= IREE_MAX_PATH) { + return iree_make_status(IREE_STATUS_OUT_OF_RANGE, + "path length %" PRIhsz + " exceeds maximum character length of %d", + path.size, IREE_MAX_PATH); + } + char* path_str = iree_alloca(path.size + 1); + iree_string_view_to_cstring(path, path_str, path.size + 1); + + DWORD desired_access = 0; + if (iree_all_bits_set(mode, IREE_IO_FILE_MODE_READ)) { + desired_access |= GENERIC_READ; + } + if (iree_all_bits_set(mode, IREE_IO_FILE_MODE_WRITE)) { + desired_access |= GENERIC_WRITE; + } + + DWORD share_mode = 0; + if (iree_all_bits_set(mode, IREE_IO_FILE_MODE_SHARE_READ)) { + share_mode |= FILE_SHARE_READ; + } + if (iree_all_bits_set(mode, IREE_IO_FILE_MODE_SHARE_WRITE)) { + share_mode |= FILE_SHARE_WRITE; + } + + DWORD creation_disposition = open_existing ? OPEN_EXISTING : CREATE_ALWAYS; + + DWORD flags = FILE_ATTRIBUTE_NORMAL; + if (iree_all_bits_set(mode, IREE_IO_FILE_MODE_RANDOM_ACCESS)) { + flags |= FILE_FLAG_RANDOM_ACCESS; + } else if (iree_all_bits_set(mode, IREE_IO_FILE_MODE_SEQUENTIAL_SCAN)) { + flags |= FILE_FLAG_SEQUENTIAL_SCAN; + } + if (iree_all_bits_set(mode, IREE_IO_FILE_MODE_TEMPORARY)) { + flags |= FILE_FLAG_DELETE_ON_CLOSE; + } + + // Create or open the file. + HANDLE handle = CreateFileA(path_str, desired_access, share_mode, NULL, + creation_disposition, flags, NULL); + if (handle == INVALID_HANDLE_VALUE) { + return iree_make_status(iree_status_code_from_win32_error(GetLastError()), + "failed to open file '%.*s'", (int)path.size, + path.data); + } + + // If we were provided an initial size and are creating the file then + // adjust the file length. + if (!open_existing) { + // Zeroish-extend the file up to the total file size specified by the + // caller. This may be larger than the virtual address space can handle but + // so long as the length requested for mapping is under the size_t limit + // this will succeed. 
+ LARGE_INTEGER file_size = {0}; + file_size.QuadPart = initial_size; + if (!SetFilePointerEx(handle, file_size, NULL, FILE_BEGIN) || + !SetEndOfFile(handle)) { + CloseHandle(handle); + return iree_make_status(iree_status_code_from_win32_error(GetLastError()), + "failed to extend file '%.*s' to %" PRIu64 + " bytes (out of disk space or permission denied)", + (int)path.size, path.data, initial_size); + } + } + + // Transfer ownership of the handle to a CRT file descriptor. + // After this succeeds we cannot call CloseHandle as the CRT owns it. + int open_flags = 0; + if (!iree_all_bits_set(mode, IREE_IO_FILE_MODE_WRITE)) { + open_flags |= _O_RDONLY; + } + int fd = _open_osfhandle((intptr_t)handle, open_flags); + if (fd == -1) { + CloseHandle(handle); // must close since we didn't transfer + return iree_make_status( + IREE_STATUS_INTERNAL, + "unable to transfer Win32 HANDLE to a CRT file descriptor"); + } + + out_handle_primitive->type = IREE_IO_FILE_HANDLE_TYPE_FD; + out_handle_primitive->value.fd = fd; + return iree_ok_status(); +} + +static void iree_io_file_handle_platform_close( + void* user_data, iree_io_file_handle_primitive_t handle_primitive) { + // NOTE: we opened the file using Win32 APIs but it's safe to _close since we + // transferred ownership to the CRT with _open_osfhandle. If we used + // IREE_IO_FILE_HANDLE_TYPE_WIN32_HANDLE we'd want to switch on that instead. + IREE_ASSERT_EQ(handle_primitive.type, IREE_IO_FILE_HANDLE_TYPE_FD); + _close(handle_primitive.value.fd); +} + +#else + +static iree_status_t iree_io_file_handle_platform_open( + iree_io_file_mode_t mode, iree_string_view_t path, bool open_existing, + uint64_t initial_size, + iree_io_file_handle_primitive_t* out_handle_primitive) { + IREE_ASSERT_ARGUMENT(out_handle_primitive); + memset(out_handle_primitive, 0, sizeof(*out_handle_primitive)); + + // Convert path from a string view to a NUL-terminated C string. 
+ if (path.size >= IREE_MAX_PATH) { + return iree_make_status(IREE_STATUS_OUT_OF_RANGE, + "path length %" PRIhsz + " exceeds maximum character length of %d", + path.size, IREE_MAX_PATH); + } + char* path_str = iree_alloca(path.size + 1); + iree_string_view_to_cstring(path, path_str, path.size + 1); + + int flags = 0; + // TODO(benvanik): add a flag for forking behavior. + flags |= O_CLOEXEC; + if (!open_existing) { + // If the file exists open anyway and truncate as if it had been recreated. + // This matches Win32 CREATE_ALWAYS behavior. + flags |= O_CREAT | O_TRUNC; + } + if (iree_all_bits_set(mode, + IREE_IO_FILE_MODE_READ | IREE_IO_FILE_MODE_WRITE)) { + // NOTE: O_RDWR != O_RDONLY | O_WRONLY! + flags |= O_RDWR; + } else if (iree_all_bits_set(mode, IREE_IO_FILE_MODE_READ)) { + flags |= O_RDONLY; + } else if (iree_all_bits_set(mode, IREE_IO_FILE_MODE_WRITE)) { + flags |= O_WRONLY; + } +#if defined(O_DIRECT) + if (iree_all_bits_set(mode, IREE_IO_FILE_MODE_DIRECT)) { + flags |= O_DIRECT; + } +#endif // O_DIRECT +#if defined(O_TMPFILE) + if (iree_all_bits_set(mode, IREE_IO_FILE_MODE_TEMPORARY)) { + flags |= O_TMPFILE; + } +#endif // O_TMPFILE + + // I don't know, unix file permissions are dumb. User and group seems fine? + const mode_t open_mode = (S_IRUSR | S_IWUSR) | (S_IRGRP | S_IWGRP); + + int fd = open(path_str, flags, open_mode); + if (fd == -1) { + return iree_make_status(iree_status_code_from_errno(errno), + "failed to open file '%.*s'", (int)path.size, + path.data); + } + + // If we were provided an initial size and are creating the file then + // adjust the file length. + if (!open_existing) { + // Zero-extend the file up to the total file size specified by the + // caller. Note that `ftruncate` extends too. 
+ if (ftruncate(fd, (off_t)initial_size) == -1) { + return iree_make_status(iree_status_code_from_errno(errno), + "failed to extend file '%.*s' to %" PRIu64 + " bytes (out of disk space or permission denied)", + (int)path.size, path.data, initial_size); + } + } + + out_handle_primitive->type = IREE_IO_FILE_HANDLE_TYPE_FD; + out_handle_primitive->value.fd = fd; + return iree_ok_status(); +} + +static void iree_io_file_handle_platform_close( + void* user_data, iree_io_file_handle_primitive_t handle_primitive) { + IREE_ASSERT_EQ(handle_primitive.type, IREE_IO_FILE_HANDLE_TYPE_FD); + close(handle_primitive.value.fd); +} + +#endif // IREE_PLATFORM_WINDOWS + +static iree_status_t iree_io_file_handle_create_or_open( + iree_io_file_mode_t mode, iree_string_view_t path, bool open_existing, + uint64_t initial_size, iree_allocator_t host_allocator, + iree_io_file_handle_t** out_handle) { + if (iree_all_bits_set(mode, IREE_IO_FILE_MODE_RANDOM_ACCESS | + IREE_IO_FILE_MODE_SEQUENTIAL_SCAN)) { + return iree_make_status(IREE_STATUS_INVALID_ARGUMENT, + "at most one access pattern hint may be specified"); + } + + iree_io_file_handle_primitive_t handle_primitive = {0}; + IREE_RETURN_IF_ERROR(iree_io_file_handle_platform_open( + mode, path, open_existing, initial_size, &handle_primitive)); + + iree_io_file_access_t allowed_access = 0; + if (iree_all_bits_set(mode, IREE_IO_FILE_MODE_READ)) { + allowed_access |= IREE_IO_FILE_ACCESS_READ; + } + if (iree_all_bits_set(mode, IREE_IO_FILE_MODE_WRITE)) { + allowed_access |= IREE_IO_FILE_ACCESS_WRITE; + } + iree_io_file_handle_release_callback_t release_callback = { + .fn = iree_io_file_handle_platform_close, + .user_data = NULL, + }; + iree_io_file_handle_t* handle = NULL; + iree_status_t status = + iree_io_file_handle_wrap(allowed_access, handle_primitive, + release_callback, host_allocator, &handle); + + if (iree_status_is_ok(status)) { + *out_handle = handle; + } else { + release_callback.fn(release_callback.user_data, handle_primitive); + 
} + return status; +} + +IREE_API_EXPORT iree_status_t iree_io_file_handle_create( + iree_io_file_mode_t mode, iree_string_view_t path, uint64_t initial_size, + iree_allocator_t host_allocator, iree_io_file_handle_t** out_handle) { + IREE_ASSERT_ARGUMENT(out_handle); + *out_handle = NULL; + IREE_TRACE_ZONE_BEGIN(z0); + IREE_TRACE_ZONE_APPEND_TEXT(z0, path.data, path.size); + iree_status_t status = iree_io_file_handle_create_or_open( + mode, path, /*open_existing=*/false, initial_size, host_allocator, + out_handle); + IREE_TRACE_ZONE_END(z0); + return status; +} + +IREE_API_EXPORT iree_status_t iree_io_file_handle_open( + iree_io_file_mode_t mode, iree_string_view_t path, + iree_allocator_t host_allocator, iree_io_file_handle_t** out_handle) { + IREE_ASSERT_ARGUMENT(out_handle); + *out_handle = NULL; + IREE_TRACE_ZONE_BEGIN(z0); + IREE_TRACE_ZONE_APPEND_TEXT(z0, path.data, path.size); + iree_status_t status = iree_io_file_handle_create_or_open( + mode, path, /*open_existing=*/true, 0ull, host_allocator, out_handle); + IREE_TRACE_ZONE_END(z0); + return status; +} + +#else + +IREE_API_EXPORT iree_status_t iree_io_file_handle_create( + iree_io_file_mode_t mode, iree_string_view_t path, uint64_t initial_size, + iree_allocator_t host_allocator, iree_io_file_handle_t** out_handle) { + IREE_ASSERT_ARGUMENT(out_handle); + *out_handle = NULL; + return iree_make_status(IREE_STATUS_UNAVAILABLE, + "file support has been compiled out of this binary; " + "set IREE_FILE_IO_ENABLE=1 to include it"); +} + +IREE_API_EXPORT iree_status_t iree_io_file_handle_open( + iree_io_file_mode_t mode, iree_string_view_t path, + iree_allocator_t host_allocator, iree_io_file_handle_t** out_handle) { + IREE_ASSERT_ARGUMENT(out_handle); + *out_handle = NULL; + return iree_make_status(IREE_STATUS_UNAVAILABLE, + "file support has been compiled out of this binary; " + "set IREE_FILE_IO_ENABLE=1 to include it"); +} + +#endif // IREE_FILE_IO_ENABLE + 
//===----------------------------------------------------------------------===// // iree_io_file_mapping_t support //===----------------------------------------------------------------------===// diff --git a/runtime/src/iree/io/file_handle.h b/runtime/src/iree/io/file_handle.h index 53e5a37594cc..908f1e7ae81a 100644 --- a/runtime/src/iree/io/file_handle.h +++ b/runtime/src/iree/io/file_handle.h @@ -156,6 +156,57 @@ static inline iree_io_file_handle_primitive_value_t iree_io_file_handle_value( IREE_API_EXPORT iree_status_t iree_io_file_handle_flush(iree_io_file_handle_t* handle); +//===----------------------------------------------------------------------===// +// iree_io_file_handle_t platform files +//===----------------------------------------------------------------------===// + +// Bits indicating how a file is opened. +typedef uint64_t iree_io_file_mode_t; +enum iree_io_file_mode_bits_t { + // Allow reads of both existing and new content. + IREE_IO_FILE_MODE_READ = 1ull << 0, + // Allow writes. + IREE_IO_FILE_MODE_WRITE = 1ull << 1, + // Hints that the file will be accessed at random (more-so than not). + // Mutually exclusive with IREE_IO_FILE_MODE_SEQUENTIAL_SCAN. If no access + // hint is specified the platform will use its default behavior. + IREE_IO_FILE_MODE_RANDOM_ACCESS = 1ull << 2, + // Hints that the file will be accessed sequentially (contiguous reads/writes + // or small skips forward only). + // Mutually exclusive with IREE_IO_FILE_MODE_RANDOM_ACCESS. If no access + // hint is specified the platform will use its default behavior. + IREE_IO_FILE_MODE_SEQUENTIAL_SCAN = 1ull << 3, + // Hints that the library and system caching are not required. May hurt + // performance more than it helps unless the file is very large and + // exclusively accessed as part of bulk transfer operations that are + // page-aligned. + IREE_IO_FILE_MODE_DIRECT = 1ull << 4, + // Ensures the file is deleted when it is closed. 
Platforms may use this as a + // hint to avoid writing the file contents when cache is available. + IREE_IO_FILE_MODE_TEMPORARY = 1ull << 5, + // Allows subsequent operations to open the file for read access while the + // file is open by the creator. + IREE_IO_FILE_MODE_SHARE_READ = 1ull << 6, + // Allows subsequent operations to open the file for write access while the + // file is open by the creator. + IREE_IO_FILE_MODE_SHARE_WRITE = 1ull << 7, +}; + +// Creates a new platform file at |path| for usage as defined by |mode|. +// The file will be extended to |initial_size| upon creation. +// Returns IREE_STATUS_ALREADY_EXISTS if the file already exists. +// Returns IREE_STATUS_PERMISSION_DENIED if the file cannot be created. +IREE_API_EXPORT iree_status_t iree_io_file_handle_create( + iree_io_file_mode_t mode, iree_string_view_t path, uint64_t initial_size, + iree_allocator_t host_allocator, iree_io_file_handle_t** out_handle); + +// Opens an existing platform file at |path| for usage as defined by |mode|. +// Returns IREE_STATUS_NOT_FOUND if the file does not exist. +// Returns IREE_STATUS_PERMISSION_DENIED if the specified |mode| is disallowed. 
+IREE_API_EXPORT iree_status_t iree_io_file_handle_open( + iree_io_file_mode_t mode, iree_string_view_t path, + iree_allocator_t host_allocator, iree_io_file_handle_t** out_handle); + //===----------------------------------------------------------------------===// // iree_io_file_mapping_t //===----------------------------------------------------------------------===// diff --git a/runtime/src/iree/io/stdio_stream.c b/runtime/src/iree/io/stdio_stream.c index 3f0fea8bd75e..b190585e5b1b 100644 --- a/runtime/src/iree/io/stdio_stream.c +++ b/runtime/src/iree/io/stdio_stream.c @@ -53,8 +53,6 @@ // iree_io_stdio_stream_t //===----------------------------------------------------------------------===// -#define IREE_MAX_PATH ((size_t)2048) - typedef struct iree_io_stdio_stream_t { iree_io_stream_t base; iree_allocator_t host_allocator; @@ -148,13 +146,15 @@ IREE_API_EXPORT iree_status_t iree_io_stdio_stream_open( // We could heap allocate instead but a few thousand chars is quite long and // since Windows doesn't support more than ~256 we generally keep them short // anyway. 
- if (path.size > IREE_MAX_PATH) { + if (path.size >= IREE_MAX_PATH) { IREE_TRACE_ZONE_END(z0); - return iree_make_status(IREE_STATUS_RESOURCE_EXHAUSTED, - "path exceeds reasonable maximum (%" PRIhsz - " > %" PRIhsz ")", + return iree_make_status(IREE_STATUS_OUT_OF_RANGE, + "path length %" PRIhsz + " exceeds maximum character length of %d", path.size, IREE_MAX_PATH); } + char* path_str = iree_alloca(path.size + 1); + iree_string_view_to_cstring(path, path_str, path.size + 1); char* fopen_path = (char*)iree_alloca(path.size + 1); memcpy(fopen_path, path.data, path.size); fopen_path[path.size] = 0; // NUL diff --git a/runtime/src/iree/tooling/parameter_util.c b/runtime/src/iree/tooling/parameter_util.c index e311f4ff664a..bd748ad02f17 100644 --- a/runtime/src/iree/tooling/parameter_util.c +++ b/runtime/src/iree/tooling/parameter_util.c @@ -18,13 +18,20 @@ // Parameter file I/O //===----------------------------------------------------------------------===// +#if IREE_FILE_IO_ENABLE +#define FLAG_PARAMETER_MODE_DEFAULT "file" +#else +#define FLAG_PARAMETER_MODE_DEFAULT "mmap" +#endif // IREE_FILE_IO_ENABLE + IREE_FLAG( - string, parameter_mode, "mmap", - "A parameter I/O mode of ['preload', 'mmap'].\n" + string, parameter_mode, FLAG_PARAMETER_MODE_DEFAULT, + "A parameter I/O mode of ['preload', 'mmap', 'file'].\n" " preload: read entire parameter files into wired memory on startup.\n" " mmap: maps the parameter files into discardable memory - can increase\n" " warm-up time and variance as mapped pages are swapped\n" - " by the OS."); + " by the OS.\n" + " file: uses platform file APIs to read/write the file as needed."); static void iree_file_contents_release_callback( void* user_data, iree_io_file_handle_primitive_t handle_primitive) { @@ -32,6 +39,36 @@ static void iree_file_contents_release_callback( iree_file_contents_free(file_contents); } +// Legacy parameter file open path. We should be able to replace this usage with +// iree_io_file_handle_t-based logic. 
+static iree_status_t iree_io_open_parameter_file_legacy( + iree_string_view_t path, iree_file_read_flags_t read_flags, + iree_allocator_t host_allocator, iree_io_file_handle_t** out_file_handle) { + IREE_ASSERT_ARGUMENT(out_file_handle); + *out_file_handle = NULL; + + char path_str[2048] = {0}; + iree_string_view_to_cstring(path, path_str, sizeof(path_str)); + + // Read (or map) the entire file into host memory. + iree_file_contents_t* file_contents = NULL; + IREE_RETURN_IF_ERROR(iree_file_read_contents(path_str, read_flags, + host_allocator, &file_contents)); + + // Wrap the loaded memory file in a file handle. + const iree_io_file_handle_release_callback_t release_callback = { + .fn = iree_file_contents_release_callback, + .user_data = file_contents, + }; + iree_status_t status = iree_io_file_handle_wrap_host_allocation( + IREE_IO_FILE_ACCESS_READ, file_contents->buffer, release_callback, + host_allocator, out_file_handle); + if (!iree_status_is_ok(status)) { + iree_file_contents_free(file_contents); + } + return status; +} + // Opens the parameter file at |path| with the mode specified by the // --parameter_mode flag and returns its handle. 
static iree_status_t iree_io_open_parameter_file( @@ -42,37 +79,25 @@ static iree_status_t iree_io_open_parameter_file( IREE_TRACE_ZONE_BEGIN(z0); IREE_TRACE_ZONE_APPEND_TEXT(z0, path.data, path.size); - char path_str[2048] = {0}; - iree_string_view_to_cstring(path, path_str, sizeof(path_str)); - iree_file_read_flags_t read_flags = 0; + iree_status_t status = iree_ok_status(); + iree_io_file_handle_t* file_handle = NULL; if (strcmp(FLAG_parameter_mode, "mmap") == 0) { - read_flags |= IREE_FILE_READ_FLAG_MMAP; + status = iree_io_open_parameter_file_legacy(path, IREE_FILE_READ_FLAG_MMAP, + host_allocator, &file_handle); } else if (strcmp(FLAG_parameter_mode, "preload") == 0) { - read_flags |= IREE_FILE_READ_FLAG_PRELOAD; + status = iree_io_open_parameter_file_legacy( + path, IREE_FILE_READ_FLAG_PRELOAD, host_allocator, &file_handle); + } else if (strcmp(FLAG_parameter_mode, "file") == 0) { + status = iree_io_file_handle_open(IREE_IO_FILE_MODE_READ, path, + host_allocator, &file_handle); } else { - IREE_TRACE_ZONE_END(z0); - return iree_make_status(IREE_STATUS_INVALID_ARGUMENT, - "unrecognized --parameter_mode= value '%s'", - FLAG_parameter_mode); + status = iree_make_status(IREE_STATUS_INVALID_ARGUMENT, + "unrecognized --parameter_mode= value '%s'", + FLAG_parameter_mode); } - iree_file_contents_t* file_contents = NULL; - IREE_RETURN_AND_END_ZONE_IF_ERROR( - z0, iree_file_read_contents(path_str, read_flags, host_allocator, - &file_contents)); - - iree_io_file_handle_release_callback_t release_callback = { - .fn = iree_file_contents_release_callback, - .user_data = file_contents, - }; - iree_io_file_handle_t* file_handle = NULL; - iree_status_t status = iree_io_file_handle_wrap_host_allocation( - IREE_IO_FILE_ACCESS_READ, file_contents->buffer, release_callback, - host_allocator, &file_handle); if (iree_status_is_ok(status)) { *out_file_handle = file_handle; - } else { - iree_file_contents_free(file_contents); } IREE_TRACE_ZONE_END(z0); return status; From 
3614f69916a71e6cfbfe3cfc700111774ffc591f Mon Sep 17 00:00:00 2001 From: Stanley Winata <68087699+raikonenfnu@users.noreply.github.com> Date: Thu, 19 Dec 2024 05:07:31 +0700 Subject: [PATCH 44/64] Update LLVM to llvm/llvm-project@b07e7b76c5d532a61 (#19500) --- .../src/iree/compiler/Codegen/LLVMCPU/ConvertToLLVM.cpp | 5 +++++ .../src/iree/compiler/Codegen/LLVMGPU/ConvertToNVVM.cpp | 6 ++++++ .../src/iree/compiler/Codegen/LLVMGPU/ConvertToROCDL.cpp | 6 ++++++ third_party/llvm-project | 2 +- 4 files changed, 18 insertions(+), 1 deletion(-) diff --git a/compiler/src/iree/compiler/Codegen/LLVMCPU/ConvertToLLVM.cpp b/compiler/src/iree/compiler/Codegen/LLVMCPU/ConvertToLLVM.cpp index d369a4f5e517..5ff0e08227fc 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMCPU/ConvertToLLVM.cpp +++ b/compiler/src/iree/compiler/Codegen/LLVMCPU/ConvertToLLVM.cpp @@ -1051,8 +1051,13 @@ void ConvertToLLVMPass::runOnOperation() { // unroll them to 1-D before converting to the LLVM dialect. vector::populateVectorBitCastLoweringPatterns(patterns); populateVectorToLLVMMatrixConversionPatterns(typeConverter, patterns); + vector::populateVectorRankReducingFMAPattern(patterns); + vector::populateVectorInsertExtractStridedSliceTransforms(patterns); + vector::populateVectorStepLoweringPatterns(patterns); populateVectorToLLVMConversionPatterns(typeConverter, patterns, reassociateFpReductions); + vector::populateVectorTransferLoweringPatterns(patterns, + /*maxTransferRank=*/1); if (isAArch64(targetAttr) && (hasAnySVEFeature(targetAttr) || hasSMEFeature(targetAttr))) { diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/ConvertToNVVM.cpp b/compiler/src/iree/compiler/Codegen/LLVMGPU/ConvertToNVVM.cpp index a6d0bbf2f11f..e271b31a0bd9 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/ConvertToNVVM.cpp +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/ConvertToNVVM.cpp @@ -31,6 +31,7 @@ #include "mlir/Dialect/NVGPU/IR/NVGPUDialect.h" #include "mlir/Dialect/Vector/IR/VectorOps.h" #include 
"mlir/Dialect/Vector/Transforms/LoweringPatterns.h" +#include "mlir/Dialect/Vector/Transforms/VectorRewritePatterns.h" #include "mlir/Transforms/GreedyPatternRewriteDriver.h" namespace mlir::iree_compiler { @@ -150,7 +151,12 @@ struct ConvertToNVVMPass final cf::populateControlFlowToLLVMConversionPatterns(converter, llvmPatterns); arith::populateCeilFloorDivExpandOpsPatterns(llvmPatterns); arith::populateArithToLLVMConversionPatterns(converter, llvmPatterns); + vector::populateVectorRankReducingFMAPattern(llvmPatterns); + vector::populateVectorInsertExtractStridedSliceTransforms(llvmPatterns); + vector::populateVectorStepLoweringPatterns(llvmPatterns); populateVectorToLLVMConversionPatterns(converter, llvmPatterns); + vector::populateVectorTransferLoweringPatterns(llvmPatterns, + /*maxTransferRank=*/1); populateGpuToNVVMConversionPatterns(converter, llvmPatterns); populateNVGPUToNVVMConversionPatterns(converter, llvmPatterns); populateGpuWMMAToNVVMConversionPatterns(converter, llvmPatterns); diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/ConvertToROCDL.cpp b/compiler/src/iree/compiler/Codegen/LLVMGPU/ConvertToROCDL.cpp index 49964061f916..694ed778b524 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/ConvertToROCDL.cpp +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/ConvertToROCDL.cpp @@ -31,6 +31,7 @@ #include "mlir/Dialect/MemRef/Transforms/Transforms.h" #include "mlir/Dialect/Vector/IR/VectorOps.h" #include "mlir/Dialect/Vector/Transforms/LoweringPatterns.h" +#include "mlir/Dialect/Vector/Transforms/VectorRewritePatterns.h" #include "mlir/Transforms/GreedyPatternRewriteDriver.h" #define DEBUG_TYPE "iree-convert-to-rocdl" @@ -221,7 +222,12 @@ struct ConvertToROCDLPass final FailureOr maybeChipset = amdgpu::Chipset::parse(chipset); populateAMDGPUToROCDLConversionPatterns( converter, llvmPatterns, maybeChipset.value_or(amdgpu::Chipset())); + vector::populateVectorRankReducingFMAPattern(llvmPatterns); + 
vector::populateVectorInsertExtractStridedSliceTransforms(llvmPatterns); + vector::populateVectorStepLoweringPatterns(llvmPatterns); populateVectorToLLVMConversionPatterns(converter, llvmPatterns); + vector::populateVectorTransferLoweringPatterns(llvmPatterns, + /*maxTransferRank=*/1); populateGpuToROCDLConversionPatterns(converter, llvmPatterns, gpu::amd::Runtime::Unknown); LLVMConversionTarget target(getContext()); diff --git a/third_party/llvm-project b/third_party/llvm-project index 078c7bb5c927..42dc6d585fcb 160000 --- a/third_party/llvm-project +++ b/third_party/llvm-project @@ -1 +1 @@ -Subproject commit 078c7bb5c927ab1596d8a508e0b70d5140e59669 +Subproject commit 42dc6d585fcbdc366215133f23c77c244a7d9c81 From e55342565017e86944ce57d593672e5bfb690a05 Mon Sep 17 00:00:00 2001 From: Bangtian Liu Date: Wed, 18 Dec 2024 17:52:46 -0500 Subject: [PATCH 45/64] [Codegen][Tuner] attr verifier for tuning specs (#19486) This PR is relevant to task in https://github.com/iree-org/iree/issues/19214: add [a discardable attr verifier](https://mlir.llvm.org/docs/DefiningDialects/#discardable-attribute-verification) for entry points iree_codegen.tuning_spec_entrypoint --------- Signed-off-by: Bangtian Liu --- .../Codegen/Common/LinkTuningSpecsPass.cpp | 36 +++---------- .../Common/MaterializeTuningSpecsPass.cpp | 16 +++++- .../compiler/Codegen/Common/test/BUILD.bazel | 1 + .../Codegen/Common/test/CMakeLists.txt | 1 + .../Common/test/verify_tuning_specs.mlir | 53 +++++++++++++++++++ .../Dialect/Codegen/IR/IREECodegenDialect.cpp | 43 +++++++++++++++ .../Dialect/Codegen/IR/IREECodegenDialect.td | 1 + 7 files changed, 120 insertions(+), 31 deletions(-) create mode 100644 compiler/src/iree/compiler/Codegen/Common/test/verify_tuning_specs.mlir diff --git a/compiler/src/iree/compiler/Codegen/Common/LinkTuningSpecsPass.cpp b/compiler/src/iree/compiler/Codegen/Common/LinkTuningSpecsPass.cpp index fdbadfe170d1..1cdbfb1a821d 100644 --- 
a/compiler/src/iree/compiler/Codegen/Common/LinkTuningSpecsPass.cpp +++ b/compiler/src/iree/compiler/Codegen/Common/LinkTuningSpecsPass.cpp @@ -18,6 +18,7 @@ #include "mlir/IR/BuiltinAttributes.h" #include "mlir/IR/BuiltinOps.h" #include "mlir/IR/Location.h" +#include "mlir/IR/Verifier.h" #define DEBUG_TYPE "iree-codegen-link-tuning-specs" #define DBGS() (llvm::dbgs() << "[" DEBUG_TYPE "]: ") @@ -53,27 +54,6 @@ static SmallVector<NamedSequenceOp> findTuningSpecs(ModuleOp module) { }); } -// Returns true iff the entrypoint has the following signature: -// ``` -// transform.named_sequence @name(%arg0: !transform.any_op) -> -// (!transform.any_op) -// ``` -static LogicalResult validateTuningSpec(NamedSequenceOp op) { - ArrayRef<Type> resTypes = op.getFunctionType().getResults(); - if (resTypes.size() != 1 || !isa<transform::AnyOpType>(resTypes[0])) { - return op.emitWarning() - << "Tuning spec entry point expected to return any_op"; - } - - ArrayRef<Type> argTypes = op.getArgumentTypes(); - if (argTypes.size() != 1 || !isa<transform::AnyOpType>(argTypes[0])) { - return op.emitWarning() << "Tuning spec entry point expected to have a " - "single any_op argument"; - } - - return success(); -} - static bool consumesInputOp(NamedSequenceOp op) { if (op.getArgAttr(0, kArgConsumedAttrName)) { return true; @@ -81,7 +61,7 @@ static bool consumesInputOp(NamedSequenceOp op) { return false; } -static NamedSequenceOp +static FailureOr<NamedSequenceOp> emitLinkedTuningSpec(ModuleOp module, ArrayRef<NamedSequenceOp> specsToLink) { OpBuilder builder(module->getContext()); builder.setInsertionPointToEnd(module.getBody()); @@ -144,6 +124,11 @@ emitLinkedTuningSpec(ModuleOp module, ArrayRef<NamedSequenceOp> specsToLink) { } builder.create<transform::YieldOp>(loc, operand); + + if (failed(mlir::verify(module))) { + return module.emitError("Linked tuning spec failed to verify"); + } + return newSpec; } @@ -169,13 +154,6 @@ FailureOr<NamedSequenceOp> linkTuningSpecs(ModuleOp module) { llvm::append_range(tuningSpecs, findTuningSpecs(nested)); } - for (NamedSequenceOp spec : tuningSpecs) { - LDBG("Found tuning spec: " << spec.getSymName()); - if
(failed(validateTuningSpec(spec))) { - return failure(); - } - } - size_t numConsumedSpecs = llvm::count_if(tuningSpecs, consumesInputOp); if (numConsumedSpecs > 0 && numConsumedSpecs != tuningSpecs.size()) { LDBG("Only " << numConsumedSpecs << " tuning specs out of " diff --git a/compiler/src/iree/compiler/Codegen/Common/MaterializeTuningSpecsPass.cpp b/compiler/src/iree/compiler/Codegen/Common/MaterializeTuningSpecsPass.cpp index 7495f4354215..db36cbf3b36b 100644 --- a/compiler/src/iree/compiler/Codegen/Common/MaterializeTuningSpecsPass.cpp +++ b/compiler/src/iree/compiler/Codegen/Common/MaterializeTuningSpecsPass.cpp @@ -27,6 +27,7 @@ #include "mlir/IR/BuiltinTypeInterfaces.h" #include "mlir/IR/Location.h" #include "mlir/IR/OwningOpRef.h" +#include "mlir/IR/Verifier.h" #include "mlir/Support/FileUtilities.h" #define DEBUG_TYPE "iree-codegen-materialize-tuning-specs" @@ -138,8 +139,19 @@ getDefaultTuningSpec(ModuleOp module, // Load the library through the codegen dialect so that we cache the parsed // module. 
- return dialect.getOrParseTransformLibraryModule(defaultTuningSpecName, - *defaultTuningSpecSource); + FailureOr<ModuleOp> defaultTransformLibrary = + dialect.getOrParseTransformLibraryModule(defaultTuningSpecName, + *defaultTuningSpecSource); + +#ifndef NDEBUG + if (succeeded(defaultTransformLibrary) && + failed(mlir::verify(*defaultTransformLibrary))) + return (*defaultTransformLibrary).emitError() + << "Default tuning spec " << defaultTuningSpecName + << " failed to verify"; +#endif + + return defaultTransformLibrary; } static FailureOr diff --git a/compiler/src/iree/compiler/Codegen/Common/test/BUILD.bazel b/compiler/src/iree/compiler/Codegen/Common/test/BUILD.bazel index f0652d2c3636..3834d0a57f13 100644 --- a/compiler/src/iree/compiler/Codegen/Common/test/BUILD.bazel +++ b/compiler/src/iree/compiler/Codegen/Common/test/BUILD.bazel @@ -96,6 +96,7 @@ iree_lit_test_suite( "vector_layout_analysis.mlir", "vectorize_memref_copy.mlir", "vectorize_tensor_pad.mlir", + "verify_tuning_specs.mlir", "verify_workgroup_distribution.mlir", "vmvx_materialize_encoding.mlir", ], diff --git a/compiler/src/iree/compiler/Codegen/Common/test/CMakeLists.txt b/compiler/src/iree/compiler/Codegen/Common/test/CMakeLists.txt index 2d707f68c3aa..2ef1b63bb3fe 100644 --- a/compiler/src/iree/compiler/Codegen/Common/test/CMakeLists.txt +++ b/compiler/src/iree/compiler/Codegen/Common/test/CMakeLists.txt @@ -92,6 +92,7 @@ iree_lit_test_suite( "vector_layout_analysis.mlir" "vectorize_memref_copy.mlir" "vectorize_tensor_pad.mlir" + "verify_tuning_specs.mlir" "verify_workgroup_distribution.mlir" "vmvx_materialize_encoding.mlir" TOOLS diff --git a/compiler/src/iree/compiler/Codegen/Common/test/verify_tuning_specs.mlir b/compiler/src/iree/compiler/Codegen/Common/test/verify_tuning_specs.mlir new file mode 100644 index 000000000000..aede375adb5b --- /dev/null +++ b/compiler/src/iree/compiler/Codegen/Common/test/verify_tuning_specs.mlir @@ -0,0 +1,53 @@ +// RUN: iree-opt --verify-diagnostics --split-input-file
%s + +module @foo_module attributes { transform.with_named_sequence } { + func.func @baz(%arg0: i32) -> () { + return + } + transform.named_sequence @bar(%arg0: !transform.any_op {transform.readonly}) -> !transform.any_op + attributes { iree_codegen.something } { + transform.yield %arg0 : !transform.any_op + } + // expected-error @+1{{'iree_codegen.tuning_spec_entrypoint' attribute must be a UnitAttr}} + transform.named_sequence @foo(%arg0: !transform.any_op {transform.readonly}) -> !transform.any_op + attributes { iree_codegen.tuning_spec_entrypoint = "foo" } { + transform.yield %arg0 : !transform.any_op + } +} + +// ----- + +module @foo_module attributes { transform.with_named_sequence } { + // expected-error @+1{{Tuning spec entry point expected to have a single any_op argument}} + transform.named_sequence @foo(%arg0: !transform.any_op {transform.readonly}, %arg1: !transform.any_op {transform.readonly}) -> !transform.any_op + attributes { iree_codegen.tuning_spec_entrypoint } { + transform.yield %arg0 : !transform.any_op + } +} + +// ----- + +module @foo_module attributes { transform.with_named_sequence } { + // expected-error @+1{{Tuning spec entry point expected to have a single any_op argument}} + transform.named_sequence @foo(%arg0: i32) -> !transform.any_op + attributes { iree_codegen.tuning_spec_entrypoint } {} +} + +// ----- + +module @foo_module attributes { transform.with_named_sequence } { + // expected-error @+1{{Tuning spec entry point expected to return any_op}} + transform.named_sequence @foo(%arg0: !transform.any_op {transform.readonly}) -> i32 + attributes { iree_codegen.tuning_spec_entrypoint } { + %0 = arith.constant 0 : i32 + transform.yield %0 : i32 + } +} + +// ----- + +module @foo_module attributes { transform.with_named_sequence } { + // expected-error @+1{{Tuning spec entry point expected to return any_op}} + transform.named_sequence @foo(%arg0: !transform.any_op {transform.readonly}) + attributes { iree_codegen.tuning_spec_entrypoint } 
{} +} diff --git a/compiler/src/iree/compiler/Codegen/Dialect/Codegen/IR/IREECodegenDialect.cpp b/compiler/src/iree/compiler/Codegen/Dialect/Codegen/IR/IREECodegenDialect.cpp index 9116691400b5..4a2281eef60e 100644 --- a/compiler/src/iree/compiler/Codegen/Dialect/Codegen/IR/IREECodegenDialect.cpp +++ b/compiler/src/iree/compiler/Codegen/Dialect/Codegen/IR/IREECodegenDialect.cpp @@ -10,6 +10,7 @@ #include "iree/compiler/Codegen/Dialect/Codegen/IR/IREECodegenDialect.cpp.inc" #include "iree/compiler/Codegen/Dialect/Codegen/IR/IREECodegenOps.h" #include "iree/compiler/Codegen/Dialect/Codegen/IR/UKernelOps.h" +#include "mlir/Dialect/Transform/IR/TransformOps.h" #include "mlir/IR/DialectImplementation.h" namespace mlir::iree_compiler::IREE::Codegen { @@ -45,4 +46,46 @@ void IREECodegenDialect::initialize() { >(); } +LogicalResult +IREECodegenDialect::verifyOperationAttribute(Operation *op, + NamedAttribute attribute) { + StringRef symbol = attribute.getName().strref(); + Attribute attr = attribute.getValue(); + + // This function verifies the validity of a specific operation attribute. + // - If the attribute's name matches `kTuningSpecEntrypointAttrName` + // ("iree_codegen.tuning_spec_entrypoint"): + // 1. The attribute value must be a UnitAttr. + // 2. If the operation is a transform::NamedSequenceOp: + // - The operation's function signature must satisfy the following: + // a. It must have exactly one result type, and the result must be of + // type `transform::AnyOpType`. + // b. It must have exactly one argument type, and the argument must be + // of type `transform::AnyOpType`. 
+ + if (symbol != kTuningSpecEntrypointAttrName) + return success(); + + if (!isa<UnitAttr>(attr)) { + return op->emitError("'") << symbol << "' attribute must be a UnitAttr"; + } + + if (auto namedSeqOp = dyn_cast<transform::NamedSequenceOp>(op)) { + ArrayRef<Type> resTypes = namedSeqOp.getFunctionType().getResults(); + if (resTypes.size() != 1 || !isa<transform::AnyOpType>(resTypes[0])) { + return namedSeqOp.emitError() + << "Tuning spec entry point expected to return any_op"; + } + + ArrayRef<Type> argTypes = namedSeqOp.getArgumentTypes(); + if (argTypes.size() != 1 || !isa<transform::AnyOpType>(argTypes[0])) { + return namedSeqOp.emitError() + << "Tuning spec entry point expected to have a " + "single any_op argument"; + } + } + + return success(); +} + } // namespace mlir::iree_compiler::IREE::Codegen diff --git a/compiler/src/iree/compiler/Codegen/Dialect/Codegen/IR/IREECodegenDialect.td b/compiler/src/iree/compiler/Codegen/Dialect/Codegen/IR/IREECodegenDialect.td index 35775d08ccf2..a51ff09e552a 100644 --- a/compiler/src/iree/compiler/Codegen/Dialect/Codegen/IR/IREECodegenDialect.td +++ b/compiler/src/iree/compiler/Codegen/Dialect/Codegen/IR/IREECodegenDialect.td @@ -68,6 +68,7 @@ def IREECodegen_Dialect : Dialect { std::mutex libraryMutex; }]; let useDefaultAttributePrinterParser = 1; + let hasOperationAttrVerify = 1; } def AnyRankedTensorOrMemRefType : AnyTypeOf<[AnyRankedTensor, AnyMemRef]>; From 16097c1fe724aa494010c175e939be0b5ca73b5e Mon Sep 17 00:00:00 2001 From: Prashant Kumar Date: Thu, 19 Dec 2024 20:41:31 +0700 Subject: [PATCH 46/64] Remove the operand promotion for LHS and RHS. (#19516) Operand promotion for unaligned matmul cases is leading to dynamic trip count and forall loop fusion is not taking place by iree-codegen-gpu-fuse-and-hoist-parallel-loops.
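
For readers of this patch, the tile-size split that the KernelConfig.cpp change performs for the LLVMGPUTileAndFuse pipeline can be sketched in isolation. This is only a minimal sketch under stated assumptions, not the IREE implementation: `TileSizes` and `splitTileSizes` are hypothetical names, plain `std::vector` stands in for the MLIR containers, and the loop over `partitionedLoops` that the hunk elides between its context lines is not reproduced.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <vector>

// Hypothetical container for the three tile-size lists; not an IREE type.
struct TileSizes {
  std::vector<int64_t> workgroup;
  std::vector<int64_t> thread;
  std::vector<int64_t> reduction;
};

// Hypothetical helper mirroring the arithmetic visible in the hunk.
// Assumes numParallelLoops >= 2 and numReductionLoops >= 1.
TileSizes splitTileSizes(unsigned numParallelLoops, unsigned numReductionLoops,
                         int64_t tileX, int64_t tileY, int64_t tileK,
                         const std::array<int64_t, 3> &workgroupSize) {
  const unsigned numLoops = numParallelLoops + numReductionLoops;
  TileSizes sizes;

  // Workgroup level: 1 for every loop, with the two innermost parallel
  // loops tiled by tileX and tileY. (The adjustment the patch applies via
  // partitionedLoops is elided in the hunk and not modeled here.)
  sizes.workgroup.assign(numLoops, 1);
  sizes.workgroup[numParallelLoops - 2] = tileX;
  sizes.workgroup[numParallelLoops - 1] = tileY;

  // Thread level: 1 for parallel loops, 0 for reduction loops; the two
  // innermost parallel loops get the per-thread share, clamped to >= 1.
  sizes.thread.assign(numLoops, 0);
  std::fill(sizes.thread.begin(), sizes.thread.begin() + numParallelLoops, 1);
  sizes.thread[numParallelLoops - 2] =
      std::max<int64_t>(1, tileX / workgroupSize[0]);
  sizes.thread[numParallelLoops - 1] =
      std::max<int64_t>(1, tileY / workgroupSize[1]);

  // Reduction level: 0 everywhere except the innermost loop, tiled by tileK.
  sizes.reduction.assign(numLoops, 0);
  sizes.reduction[numLoops - 1] = tileK;
  return sizes;
}
```

Calling `splitTileSizes(2, 1, 32, 128, 4, {32, 8, 1})` models a plain matmul with two parallel loops and one reduction loop. In the patch itself the three lists are attached to the lowering config under the `workgroup`, `thread`, and `reduction` keys.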
--- .../test/gpu_reorder_workgroups_static.mlir | 2 +- .../Dialect/Codegen/IR/IREECodegenAttrs.td | 22 +++--- .../compiler/Codegen/LLVMGPU/KernelConfig.cpp | 68 +++++++++++++++++-- .../LLVMGPU/LLVMGPULowerExecutableTarget.cpp | 3 - .../iree/compiler/Codegen/LLVMGPU/Passes.cpp | 66 ------------------ .../iree/compiler/Codegen/LLVMGPU/Passes.h | 4 -- .../compiler/Codegen/LLVMGPU/Verifiers.cpp | 11 +-- .../Codegen/LLVMGPU/test/config_matvec.mlir | 5 +- .../test/config_root_op_attribute.mlir | 2 +- .../LLVMGPU/test/distribute_to_thread.mlir | 8 +-- .../LLVMGPU/test/gpu_set_num_workgroups.mlir | 28 +++----- .../LLVMGPU/test/illegal_configuration.mlir | 38 ----------- .../LLVMGPU/test/nvvm_pipeline_test.mlir | 31 ++++----- .../LLVMGPU/test/rocdl_pipeline_test.mlir | 13 ++-- tests/e2e/matmul/BUILD.bazel | 24 ------- tests/e2e/matmul/CMakeLists.txt | 26 ------- tests/e2e/matmul/generate_e2e_matmul_tests.py | 14 +--- 17 files changed, 106 insertions(+), 259 deletions(-) diff --git a/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_reorder_workgroups_static.mlir b/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_reorder_workgroups_static.mlir index 1b7a99184dcb..992dc8ec4435 100644 --- a/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_reorder_workgroups_static.mlir +++ b/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_reorder_workgroups_static.mlir @@ -25,7 +25,7 @@ ]> hal.executable private @main_dispatch_0 { hal.executable.variant public @rocm_hsaco_fb target(<"rocm", "rocm-hsaco-fb">) { - hal.executable.export public @main_dispatch_0_matmul_transpose_b_32000x32000x4096_f16 ordinal(0) layout(#pipeline_layout) attributes {subgroup_size = 64 : index, translation_info = #iree_codegen.translation_info, workgroup_size = [64 : index, 16 : index, 1 : index]} { + hal.executable.export public @main_dispatch_0_matmul_transpose_b_32000x32000x4096_f16 ordinal(0) layout(#pipeline_layout) attributes {subgroup_size = 64 : index, translation_info = 
#iree_codegen.translation_info, workgroup_size = [64 : index, 16 : index, 1 : index]} { ^bb0(%arg0: !hal.device): %c250 = arith.constant 250 : index %c500 = arith.constant 500 : index diff --git a/compiler/src/iree/compiler/Codegen/Dialect/Codegen/IR/IREECodegenAttrs.td b/compiler/src/iree/compiler/Codegen/Dialect/Codegen/IR/IREECodegenAttrs.td index 26b37dd07e24..e5c6f6f649cd 100644 --- a/compiler/src/iree/compiler/Codegen/Dialect/Codegen/IR/IREECodegenAttrs.td +++ b/compiler/src/iree/compiler/Codegen/Dialect/Codegen/IR/IREECodegenAttrs.td @@ -40,26 +40,24 @@ def LLVMGPU_SimpleDistribute : I32EnumAttrCase<"LLVMGPUDistribute", 102>; def LLVMGPU_Vectorize : I32EnumAttrCase<"LLVMGPUVectorize", 103>; -def LLVMGPU_MatmulSimt - : I32EnumAttrCase<"LLVMGPUMatmulSimt", 104>; def LLVMGPU_MatmulTensorCore - : I32EnumAttrCase<"LLVMGPUMatmulTensorCore", 105>; + : I32EnumAttrCase<"LLVMGPUMatmulTensorCore", 104>; def LLVMGPU_TransposeSharedMem - : I32EnumAttrCase<"LLVMGPUTransposeSharedMem", 106>; + : I32EnumAttrCase<"LLVMGPUTransposeSharedMem", 105>; def LLVMGPU_WarpReduction - : I32EnumAttrCase<"LLVMGPUWarpReduction", 107>; + : I32EnumAttrCase<"LLVMGPUWarpReduction", 106>; def LLVMGPU_PackUnPack - : I32EnumAttrCase<"LLVMGPUPackUnPack", 108>; + : I32EnumAttrCase<"LLVMGPUPackUnPack", 107>; def LLVMGPU_MatmulTensorCoreMmaSync - : I32EnumAttrCase<"LLVMGPUMatmulTensorCoreMmaSync", 109>; + : I32EnumAttrCase<"LLVMGPUMatmulTensorCoreMmaSync", 108>; def LLVMGPU_VectorDistribute - : I32EnumAttrCase<"LLVMGPUVectorDistribute", 110>; + : I32EnumAttrCase<"LLVMGPUVectorDistribute", 109>; def LLVMGPU_PadAndVectorDistribute - : I32EnumAttrCase<"LLVMGPUPadAndVectorDistribute", 111>; + : I32EnumAttrCase<"LLVMGPUPadAndVectorDistribute", 110>; def LLVMGPU_WinogradVectorize - : I32EnumAttrCase<"LLVMGPUWinogradVectorize", 112>; + : I32EnumAttrCase<"LLVMGPUWinogradVectorize", 111>; def LLVMGPU_TileAndFuse - : I32EnumAttrCase<"LLVMGPUTileAndFuse", 113>; + : I32EnumAttrCase<"LLVMGPUTileAndFuse", 112>; 
def SPIRV_BaseLowering : I32EnumAttrCase<"SPIRVBaseLowering", 200>; @@ -98,7 +96,7 @@ def DispatchLoweringPassPipelineEnum : I32EnumAttr< // LLVMGPU CodeGen pipelines LLVMGPU_Default, LLVMGPU_BaseLowering, LLVMGPU_SimpleDistribute, - LLVMGPU_Vectorize, LLVMGPU_MatmulSimt, LLVMGPU_MatmulTensorCore, + LLVMGPU_Vectorize, LLVMGPU_MatmulTensorCore, LLVMGPU_TransposeSharedMem, LLVMGPU_WarpReduction, LLVMGPU_PackUnPack, LLVMGPU_MatmulTensorCoreMmaSync, LLVMGPU_VectorDistribute, LLVMGPU_PadAndVectorDistribute, LLVMGPU_WinogradVectorize, diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/KernelConfig.cpp b/compiler/src/iree/compiler/Codegen/LLVMGPU/KernelConfig.cpp index fc890d1db70d..808d35644baf 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/KernelConfig.cpp +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/KernelConfig.cpp @@ -1295,9 +1295,11 @@ static LogicalResult setContractConfig(IREE::GPU::TargetAttr target, CodeGenPipeline pipeline) { TileSizesListType tileSizes; unsigned numParallelLoops = op.getNumParallelLoops(); - SmallVector workgroupTileSizes(numParallelLoops - 2, 1); - workgroupTileSizes.append({tileX, tileY}); - workgroupTileSizes.append(op.getNumReductionLoops(), tileK); + unsigned numReductionLoops = op.getNumReductionLoops(); + SmallVector workgroupTileSizes( + numParallelLoops + numReductionLoops, 1); + workgroupTileSizes[numParallelLoops - 2] = tileX; + workgroupTileSizes[numParallelLoops - 1] = tileY; SmallVector partitionedLoops = cast(op.getOperation()) @@ -1311,11 +1313,63 @@ static LogicalResult setContractConfig(IREE::GPU::TargetAttr target, } } - tileSizes.emplace_back(std::move(workgroupTileSizes)); // Workgroup level. std::optional subgroupSize = std::nullopt; if (!subgroupSizes.empty()) subgroupSize = subgroupSizes.front(); + // For the LLVMGPUTileAndFuse pipeline, we need to split tile sizes + // for workgroup, thread, and reduction. 
+ if (pipeline == CodeGenPipeline::LLVMGPUTileAndFuse) { + + auto context = op.getContext(); + Builder b(context); + SmallVector attrs; + + SmallVector threadTileSizes(numParallelLoops + numReductionLoops, + 0); + std::fill(threadTileSizes.begin(), + threadTileSizes.begin() + numParallelLoops, 1); + + threadTileSizes[numParallelLoops - 2] = + (tileX / workgroupSize[0]) < 1 ? 1 : (tileX / workgroupSize[0]); + threadTileSizes[numParallelLoops - 1] = + (tileY / workgroupSize[1]) < 1 ? 1 : (tileY / workgroupSize[1]); + + SmallVector reductionTileSizes( + numParallelLoops + numReductionLoops, 0); + reductionTileSizes[numParallelLoops + numReductionLoops - 1] = tileK; + + attrs.emplace_back(b.getStringAttr("workgroup"), + b.getI64ArrayAttr(workgroupTileSizes)); + attrs.emplace_back(b.getStringAttr("thread"), + b.getI64ArrayAttr(threadTileSizes)); + attrs.emplace_back(b.getStringAttr("reduction"), + b.getI64ArrayAttr(reductionTileSizes)); + + auto configDict = b.getDictionaryAttr(attrs); + auto loweringConfig = + IREE::GPU::LoweringConfigAttr::get(context, configDict); + SmallVector pipelineAttrs; + auto pipelineOptions = IREE::GPU::GPUPipelineOptionsAttr::get( + context, /*prefetchSharedMemory=*/false, + /*no_reduce_shared_memory_bank_conflicts=*/true, + /*use_igemm_convolution=*/false, + /*reorder_workgroups_strategy=*/std::nullopt); + pipelineAttrs.emplace_back( + b.getStringAttr(IREE::GPU::GPUPipelineOptionsAttr::getDictKeyName()), + pipelineOptions); + auto pipelineConfig = b.getDictionaryAttr(pipelineAttrs); + + return setOpConfigAndEntryPointFnTranslation( + entryPoint, op, loweringConfig, pipeline, workgroupSize, subgroupSize, + pipelineConfig); + } + + // Other pipelines (MatmulTensorCore) expect the reduction tile size to be + // in the same list.
+ workgroupTileSizes[numParallelLoops + numReductionLoops - 1] = tileK; + tileSizes.emplace_back(std::move(workgroupTileSizes)); + return setOpConfigAndEntryPointFnTranslation( entryPoint, op, tileSizes, pipeline, workgroupSize, subgroupSize, getSoftwarePipeliningAttrDict(op->getContext(), softwarePipelineDepth, @@ -1390,7 +1444,7 @@ static LogicalResult setContractConfig(IREE::GPU::TargetAttr target, return setMatmulConfig( sizeN, sizeM, 4, {sizeM, sizeN, 1}, target.getWgp().getSubgroupSizeChoices().asArrayRef(), - softwarePipelineDepthSimt, CodeGenPipeline::LLVMGPUMatmulSimt); + softwarePipelineDepthSimt, CodeGenPipeline::LLVMGPUTileAndFuse); } // SIMT matmul case. Query the best configuration. @@ -1404,7 +1458,7 @@ static LogicalResult setContractConfig(IREE::GPU::TargetAttr target, config.tileSize[0], config.tileSize[1], config.tileSize[2], config.workgroupSize, target.getWgp().getSubgroupSizeChoices().asArrayRef(), - softwarePipelineDepthSimt, CodeGenPipeline::LLVMGPUMatmulSimt); + softwarePipelineDepthSimt, CodeGenPipeline::LLVMGPUTileAndFuse); } } } @@ -1429,7 +1483,7 @@ static LogicalResult setContractConfig(IREE::GPU::TargetAttr target, return setMatmulConfig(tileX, tileY, tileK, workgroupSize, target.getWgp().getSubgroupSizeChoices().asArrayRef(), softwarePipelineDepthSimt, - CodeGenPipeline::LLVMGPUMatmulSimt); + CodeGenPipeline::LLVMGPUTileAndFuse); } //====---------------------------------------------------------------------===// diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/LLVMGPULowerExecutableTarget.cpp b/compiler/src/iree/compiler/Codegen/LLVMGPU/LLVMGPULowerExecutableTarget.cpp index 73688d2b92d5..1773e229c284 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/LLVMGPULowerExecutableTarget.cpp +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/LLVMGPULowerExecutableTarget.cpp @@ -114,9 +114,6 @@ void LLVMGPULowerExecutableTargetPass::runOnOperation() { case IREE::Codegen::DispatchLoweringPassPipeline::LLVMGPUWinogradVectorize: 
addGPUWinogradVectorizePassPipeline(pipeline); break; - case IREE::Codegen::DispatchLoweringPassPipeline::LLVMGPUMatmulSimt: - addGPUMatmulSimtPassPipeline(pipeline, pipelineOptions); - break; case IREE::Codegen::DispatchLoweringPassPipeline::LLVMGPUMatmulTensorCore: { FailureOr maybeDepth = getSoftwarePipelineDepth(translationInfo.getConfiguration()); diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.cpp b/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.cpp index 1debcf3bc205..d460a1b9f56b 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.cpp +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.cpp @@ -526,72 +526,6 @@ void addGPUWinogradVectorizePassPipeline(OpPassManager &funcPassManager) { funcPassManager.addPass(createOptimizeTensorInsertExtractSlicesPass()); } -//===---------------------------------------------------------------------===// -// MatmulSIMT -//===---------------------------------------------------------------------===// - -void addGPUMatmulSimtPassPipeline(OpPassManager &funcPassManager, - const GPUPipelineOptions &options) { - tileAndDistributeToWorkgroup(funcPassManager, /*useForall=*/false); - - funcPassManager.addPass(createConfigTrackingCanonicalizerPass()); - funcPassManager.addPass(createConfigTrackingCanonicalizerPass()); - funcPassManager.addPass(createCSEPass()); - - funcPassManager.addPass(createGPUTensorTileToSerialLoopsPass()); - funcPassManager.addPass(createGPUTensorAlloc()); - funcPassManager.addPass(createGPUTensorTilePass()); - - // Linalg -> vector - addGPUVectorizationPasses(funcPassManager); - - // tensor to memref - addBufferizePasses(funcPassManager); - - // distribute foreach threads - funcPassManager.addPass(createGPUDistributePass()); - - funcPassManager.addPass(createMemrefCopyToLinalgPass()); - funcPassManager.addPass(createGPUDistributeSharedMemoryCopyPass()); - funcPassManager.addPass(createCanonicalizerPass()); - funcPassManager.addPass(createCSEPass()); - - if 
(options.enableReduceSharedMemoryBankConflicts) { - funcPassManager.addPass(createGPUReduceBankConflictsPass()); - } - - ReorderWorkgroupsStrategy reorderStrategy = - getReorderWorkgroupsStrategy(options.reorderStrategy); - funcPassManager.addPass( - createReorderWorkgroups(reorderStrategy, canReorderWorkgroups)); - - funcPassManager.addPass(createCanonicalizerPass()); - funcPassManager.addPass(createCSEPass()); - - funcPassManager.addPass(memref::createFoldMemRefAliasOpsPass()); - funcPassManager.addPass(createCSEPass()); - funcPassManager.addPass(createCanonicalizerPass()); - funcPassManager.addPass(createCSEPass()); - - // Even though we vectorize before bufferization we are not able to hoist - // accumulator load/store out of the K loop until distribution. This is - // because we materialize the fill and the matmul in two different scf.forall - // regions, when they should be in the same scf.forall. Newer pipelines - // like TileAndFuse don't have this problem, because they coalesce these - // scf.forall regions into a single scf.forall. - // - // Therefore we still rely on buffer level transformations for transfer ops - // hoisting and store to load forwarding. This relies on shacky alias - // analysis and we need to move this to tensor level once we have better - // abstractions. - funcPassManager.addPass(createOptimizeVectorTransferPass()); - - // Hoist loop invariant code to avoid pipelining it. - funcPassManager.addPass(createIREELoopInvariantCodeMotionPass()); - // Pipeline memory operations. 
- funcPassManager.addPass(createGPUPipeliningPass()); -} - //===---------------------------------------------------------------------===// // Matmul Tensor Core //===---------------------------------------------------------------------===// diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.h b/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.h index caacfb2656e3..17b7b866be11 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.h +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.h @@ -28,10 +28,6 @@ using IREE::GPU::GPUPipelineOptions; // LLVMGPU Backend Pass Pipelines //----------------------------------------------------------------------------// -/// Lowering using SIMT CUDA core operations. -void addGPUMatmulSimtPassPipeline(OpPassManager &funcPassManager, - const GPUPipelineOptions &options); - /// Lowering using mma.sync Tensor Core operations. void addGPUMatmulTensorCoreMmaSyncPassPipeline( OpPassManager &funcPassManager, const GPUPipelineOptions &options, diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/Verifiers.cpp b/compiler/src/iree/compiler/Codegen/LLVMGPU/Verifiers.cpp index f2e3e2da4e3f..bab5de877eb3 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/Verifiers.cpp +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/Verifiers.cpp @@ -38,10 +38,6 @@ getInstructionShape(Operation *op, CodeGenPipeline pipeline, Type inputElementType, SmallVector &instructionShape) { switch (pipeline) { - case CodeGenPipeline::LLVMGPUMatmulSimt: - // SIMT Pipeline / CUDA Cores - instructionShape = {1, 1, 1}; - break; case CodeGenPipeline::LLVMGPUMatmulTensorCore: // Tensor Core Pipeline / WMMA API if (inputElementType.isF16() || inputElementType.isBF16()) { @@ -81,8 +77,7 @@ verifyGPUMatmulPipeline(Operation *op, ArrayRef workgroupSize) { // This verifier only applies to matmul. 
CodeGenPipeline pipeline = translationInfo.getDispatchLoweringPassPipeline(); - if (pipeline != CodeGenPipeline::LLVMGPUMatmulSimt && - pipeline != CodeGenPipeline::LLVMGPUMatmulTensorCore && + if (pipeline != CodeGenPipeline::LLVMGPUMatmulTensorCore && pipeline != CodeGenPipeline::LLVMGPUMatmulTensorCoreMmaSync) { return success(); } @@ -180,10 +175,6 @@ verifyGPUMatmulPipeline(Operation *op, << pipelineName; } - // Return success for SIMT/CUDA cores. - if (pipeline == CodeGenPipeline::LLVMGPUMatmulSimt) - return success(); - // // Additional verification Tensor Core pipelines. // diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/config_matvec.mlir b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/config_matvec.mlir index 1e5dbf63f2f9..3a029f2968d0 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/config_matvec.mlir +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/config_matvec.mlir @@ -267,12 +267,11 @@ func.func @not_vmt() { return } -// CHECK-DAG: #[[$CONFIG:.+]] = #iree_codegen.lowering_config -// CHECK: #[[$TRANSLATION:.+]] = #iree_codegen.translation_info +// CHECK-DAG: #[[$TRANSLATION:.+]] = #iree_codegen.translation_info}> // CHECK: func.func @not_vmt() // CHECK-SAME: translation_info = #[[$TRANSLATION]] // CHECK: linalg.generic -// CHECK-SAME: lowering_config = #[[$CONFIG]] +// CHECK-SAME: lowering_config = #iree_gpu.lowering_config<{reduction = [0, 0, 8], thread = [1, 128, 0], workgroup = [1, 128, 1]}> // ----- diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/config_root_op_attribute.mlir b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/config_root_op_attribute.mlir index f3e0d81fb961..3c7e52aa475a 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/config_root_op_attribute.mlir +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/config_root_op_attribute.mlir @@ -9,4 +9,4 @@ func.func @matmul(%lhs: tensor<4x4xf32>, %rhs: tensor<4x4xf32>) -> tensor<4x4xf3 return %result : tensor<4x4xf32> } -// CHECK: %2 = 
linalg.matmul {lowering_config = #config, root_op} ins(%arg0, %arg1 : tensor<4x4xf32>, tensor<4x4xf32>) outs(%1 : tensor<4x4xf32>) -> tensor<4x4xf32> +// CHECK: %2 = linalg.matmul {lowering_config = #{{.*}}, root_op} ins(%arg0, %arg1 : tensor<4x4xf32>, tensor<4x4xf32>) outs(%1 : tensor<4x4xf32>) -> tensor<4x4xf32> diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/distribute_to_thread.mlir b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/distribute_to_thread.mlir index cd69906aec13..cec55cdaf0a5 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/distribute_to_thread.mlir +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/distribute_to_thread.mlir @@ -9,7 +9,7 @@ #map = affine_map<()[s0] -> (s0 * 2)> #map1 = affine_map<()[s0] -> (s0 * 256)> #map2 = affine_map<(d0, d1)[s0] -> (d0 * 1024 + s0 + d1)> -#translation = #iree_codegen.translation_info +#translation = #iree_codegen.translation_info func.func @dot_dispatch_0() attributes {translation_info = #translation} { %cst = arith.constant 0.000000e+00 : f32 %c0 = arith.constant 0 : index @@ -79,7 +79,7 @@ func.func @dot_dispatch_0() attributes {translation_info = #translation} { #map2 = affine_map<(d0, d1, d2)[s0] -> (d0 * 32768 + s0 + d1 * 1024 + d2)> #map3 = affine_map<(d0, d1, d2)[s0] -> (d0 * 65536 + s0 + d1 * 64 + d2)> #map4 = affine_map<(d0, d1, d2)[s0] -> (d0 * 2048 + s0 + d1 * 64 + d2)> -#translation = #iree_codegen.translation_info +#translation = #iree_codegen.translation_info func.func @batch_matmul_func() attributes {translation_info = #translation} { %c0 = arith.constant 0 : index %cst = arith.constant 0.000000e+00 : f32 @@ -148,7 +148,7 @@ func.func @batch_matmul_func() attributes {translation_info = #translation} { #map = affine_map<()[s0] -> (s0 * 2)> #map1 = affine_map<()[s0] -> (s0 * 32)> #map2 = affine_map<(d0, d1)[s0] -> (d0 * 1024 + s0 + d1)> -#translation = #iree_codegen.translation_info +#translation = #iree_codegen.translation_info func.func @dot_dispatch_0() attributes 
{translation_info = #translation} { %cst = arith.constant 0.000000e+00 : f32 %c0 = arith.constant 0 : index @@ -312,7 +312,7 @@ module { #hal.pipeline.binding ]> #config = #iree_codegen.lowering_config -#translation = #iree_codegen.translation_info +#translation = #iree_codegen.translation_info #map = affine_map<()[s0] -> (s0 * 2)> #map1 = affine_map<()[s0] -> (s0 * 256)> #map2 = affine_map<(d0)[s0] -> (-d0 + s0, 2)> diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/gpu_set_num_workgroups.mlir b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/gpu_set_num_workgroups.mlir index 66fc62f2e482..642c6ed1a179 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/gpu_set_num_workgroups.mlir +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/gpu_set_num_workgroups.mlir @@ -54,14 +54,12 @@ func.func @dot_dispatch_1() { return } -// CHECK-DAG: #[[CONFIG:.+]] = #iree_codegen.lowering_config -// CHECK-DAG: #[[TRANSLATION:.+]] = #iree_codegen.translation_info +// CHECK: #[[TRANSLATION:.+]] = #iree_codegen.translation_info}> // CHECK: func.func @dot_dispatch_1 // CHECK-SAME: translation_info = #[[TRANSLATION]] // CHECK: linalg.fill -// CHECK-SAME: lowering_config = #[[CONFIG]] // CHECK: linalg.matmul -// CHECK-SAME: lowering_config = #[[CONFIG]] +// CHECK-SAME: lowering_config = #iree_gpu.lowering_config<{reduction = [0, 0, 4], thread = [2, 1, 0], workgroup = [4, 2, 1]}> // ----- @@ -83,14 +81,12 @@ func.func @unaligned_k() { return } -// CHECK-DAG: #[[CONFIG:.+]] = #iree_codegen.lowering_config -// CHECK-DAG: #[[TRANSLATION:.+]] = #iree_codegen.translation_info +// CHECK: #[[TRANSLATION:.+]] = #iree_codegen.translation_info}> // CHECK: func.func @unaligned_k // CHECK-SAME: translation_info = #[[TRANSLATION]] // CHECK: linalg.fill -// CHECK-SAME: lowering_config = #[[CONFIG]] // CHECK: linalg.matmul -// CHECK-SAME: lowering_config = #[[CONFIG]] +// CHECK-SAME: lowering_config = #iree_gpu.lowering_config<{reduction = [0, 0, 2], thread = [1, 16, 0], workgroup = 
[32, 128, 1]}> // ----- @@ -123,7 +119,6 @@ func.func @predict_dispatch_153() { // CHECK: func.func @predict_dispatch_153() // CHECK-SAME: translation_info = #[[TRANSLATION]] // CHECK: linalg.fill -// CHECK-SAME: lowering_config = #[[CONFIG]] // CHECK: linalg.generic // CHECK-SAME: lowering_config = #[[CONFIG]] @@ -254,7 +249,7 @@ func.func @static_3d_fft_stage3() { #hal.pipeline.binding ]> #config = #iree_codegen.lowering_config -#translation = #iree_codegen.translation_info +#translation = #iree_codegen.translation_info #compilation = #iree_codegen.compilation_info func.func @_lowering_config_test_dispatch_1() { %cst = arith.constant 0.000000e+00 : f32 @@ -274,11 +269,10 @@ func.func @_lowering_config_test_dispatch_1() { } // CHECK-DAG: #[[CONFIG:.+]] = #iree_codegen.lowering_config +// CHECK-DAG: #[[TRANSLATION:.+]] = #iree_codegen.translation_info // CHECK: func.func @_lowering_config_test_dispatch_1() // CHECK-SAME: translation_info = #[[TRANSLATION]] // CHECK: linalg.fill -// CHECK-SAME: lowering_config = #[[CONFIG]] // CHECK: linalg.matmul // CHECK-SAME: lowering_config = #[[CONFIG]] @@ -341,7 +335,7 @@ func.func @matmul_config_sm35() { return } -// CHECK-DAG: #[[TRANSLATION:.+]] = #iree_codegen.translation_info +// CHECK-DAG: #[[TRANSLATION:.+]] = #iree_codegen.translation_info}> // CHECK: func.func @matmul_config_sm35() // CHECK-SAME: translation_info = #[[TRANSLATION]] @@ -501,7 +495,6 @@ func.func @large_matmul_f16() { // SM80: func.func @large_matmul_f16() // SM80-SAME: translation_info = #[[TRANSLATION]] // SM80: linalg.fill -// SM80-SAME: lowering_config = #[[CONFIG]] // SM80: linalg.matmul // SM80-SAME: lowering_config = #[[CONFIG]] @@ -534,7 +527,6 @@ func.func @large_matmul_f32() { // SM80: func.func @large_matmul_f32() // SM80-SAME: translation_info = #[[TRANSLATION]] // SM80: linalg.fill -// SM80-SAME: lowering_config = #[[CONFIG]] // SM80: linalg.matmul // SM80-SAME: lowering_config = #[[CONFIG]] @@ -659,14 +651,12 @@ func.func 
@_main_dispatch_15_generic_512x4x42x42x64_f32() { return } -// CHECK-DAG: #[[CONFIG:.+]] = #iree_codegen.lowering_config +// CHECK: #[[TRANSLATION:.+]] = #iree_codegen.translation_info}> // CHECK: func.func @_main_dispatch_15_generic_512x4x42x42x64_f32() // CHECK-SAME: translation_info = #[[TRANSLATION]] // CHECK: linalg.fill -// CHECK-SAME: lowering_config = #[[CONFIG]] // CHECK: linalg.generic -// CHECK-SAME: lowering_config = #[[CONFIG]] +// CHECK-SAME: lowering_config = #iree_gpu.lowering_config<{reduction = [0, 0, 0, 0, 32], thread = [1, 1, 1, 16, 0], workgroup = [1, 1, 32, 128, 1]}> // ----- diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/illegal_configuration.mlir b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/illegal_configuration.mlir index 2c3df44b325b..8dccac1fb4a6 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/illegal_configuration.mlir +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/illegal_configuration.mlir @@ -1,43 +1,5 @@ // RUN: iree-opt --iree-gpu-test-target=sm_60 --pass-pipeline="builtin.module(iree-llvmgpu-select-lowering-strategy)" --verify-diagnostics --split-input-file %s -#pipeline_layout = #hal.pipeline.layout, - #hal.pipeline.binding, - #hal.pipeline.binding -]> -#config = #iree_codegen.lowering_config -#translation = #iree_codegen.translation_info -func.func @illegal() attributes {translation_info = #translation} { - %c0 = arith.constant 0 : index - %0 = hal.interface.binding.subspan layout(#pipeline_layout) binding(0) : memref<4x8xf32> - %1 = hal.interface.binding.subspan layout(#pipeline_layout) binding(1) : memref<8x16xf32> - %2 = hal.interface.binding.subspan layout(#pipeline_layout) binding(2) : memref<4x16xf32> - // expected-error @+1 {{Total number of threads in a thread block 2048 exceeds the limit of 1024 with compilation pipeline LLVMGPUMatmulSimt}} - linalg.matmul {lowering_config = #config} ins(%0, %1 : memref<4x8xf32>, memref<8x16xf32>) outs(%2 : memref<4x16xf32>) - return -} - -// ----- 
- -#pipeline_layout = #hal.pipeline.layout, - #hal.pipeline.binding, - #hal.pipeline.binding -]> -#config = #iree_codegen.lowering_config -#translation = #iree_codegen.translation_info -func.func @illegal() attributes {translation_info = #translation} { - %c0 = arith.constant 0 : index - %0 = hal.interface.binding.subspan layout(#pipeline_layout) binding(0) : memref<4x8xf32> - %1 = hal.interface.binding.subspan layout(#pipeline_layout) binding(1) : memref<8x16xf32> - %2 = hal.interface.binding.subspan layout(#pipeline_layout) binding(2) : memref<4x16xf32> - // expected-error @+1 {{Expected workgroup size in z-dim = 1, but got 2 with compilation pipeline LLVMGPUMatmulSimt}} - linalg.matmul {lowering_config = #config} ins(%0, %1 : memref<4x8xf32>, memref<8x16xf32>) outs(%2 : memref<4x16xf32>) - return -} - -// ----- - #pipeline_layout = #hal.pipeline.layout, #hal.pipeline.binding, diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/nvvm_pipeline_test.mlir b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/nvvm_pipeline_test.mlir index 9cb3fed6254c..ad6aad32420c 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/nvvm_pipeline_test.mlir +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/nvvm_pipeline_test.mlir @@ -83,20 +83,14 @@ hal.executable @dot_dispatch_0 { } } -// CHECK-LABEL: hal.executable public @dot_dispatch_0 -// CHECK: hal.executable.variant public @cuda -// CHECK-NOT: llvm.store -// CHECK-COUNT-3: llvm.load {{.*}} : !llvm.ptr<1> -> vector<4xf32> -// CHECK: llvm.br -// CHECK-COUNT-3: llvm.store {{.*}} : vector<4xf32>, !llvm.ptr<3> -// CHECK-COUNT-32: llvm.load {{.*}} : !llvm.ptr<3> -> vector<4xf32> -// CHECK-COUNT-128: llvm.intr.fmuladd({{.*}}) : (vector<4xf32>, vector<4xf32>, vector<4xf32>) -> vector<4xf32> -// CHECK-COUNT-3: llvm.load {{.*}} : !llvm.ptr<1> -> vector<4xf32> -// CHECK: llvm.br -// CHECK-COUNT-3: llvm.store {{.*}} : vector<4xf32>, !llvm.ptr<3> -// CHECK-COUNT-32: llvm.load {{.*}} : !llvm.ptr<3> -> vector<4xf32> -// 
CHECK-COUNT-128: llvm.intr.fmuladd({{.*}}) : (vector<4xf32>, vector<4xf32>, vector<4xf32>) -> vector<4xf32> -// CHECK-COUNT-4: llvm.store {{.*}} : vector<4xf32>, !llvm.ptr<1> +// CHECK-LABEL: hal.executable public @dot_dispatch_0 +// CHECK: hal.executable.variant public @cuda +// CHECK-NOT: llvm.store +// CHECK: llvm.br +// CHECK: llvm.load {{.*}} : !llvm.ptr<1> -> vector<32xf32> +// CHECK-COUNT-32: llvm.load {{.*}} : !llvm.ptr<1> -> vector<16xf32> +// CHECK-COUNT-32: llvm.intr.fmuladd({{.*}}) : (vector<16xf32>, vector<16xf32>, vector<16xf32>) -> vector<16xf32> +// CHECK: llvm.store {{.*}} : vector<16xf32>, !llvm.ptr<1> // ----- @@ -158,11 +152,10 @@ hal.executable @dot_dispatch_0 { } // CHECK-LABEL: hal.executable public @dot_dispatch_0 -// CHECK: hal.executable.variant public @cuda -// CHECK: llvm.br -// CHECK-COUNT-8: llvm.intr.fmuladd({{.*}}) : (vector<4xf32>, vector<4xf32>, vector<4xf32>) -> vector<4xf32> -// CHECK: llvm.br -// CHECK-COUNT-2: llvm.store {{.*}} : vector<4xf32>, !llvm.ptr<1> +// CHECK: hal.executable.variant public @cuda +// CHECK: llvm.br +// CHECK-COUNT-32: llvm.intr.fmuladd({{.*}}) : (vector<16xf32>, vector<16xf32>, vector<16xf32>) -> vector<16xf32> +// CHECK: llvm.store {{.*}} : vector<16xf32>, !llvm.ptr<1> // ----- diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/rocdl_pipeline_test.mlir b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/rocdl_pipeline_test.mlir index 578d28b027b5..2e7cd879d328 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/rocdl_pipeline_test.mlir +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/rocdl_pipeline_test.mlir @@ -87,17 +87,12 @@ hal.executable @dot_dispatch_0 { // RDNA3-LABEL: hal.executable public @dot_dispatch_0 // RDNA3: hal.executable.variant public @rocm // RDNA3-NOT: llvm.store -// RDNA3-COUNT-3: llvm.load {{.*}} : !llvm.ptr<1> -> vector<4xf32> // RDNA3: llvm.br -// RDNA3-COUNT-3: llvm.store {{.*}} : vector<4xf32>, !llvm.ptr<3> -// RDNA3-COUNT-32: llvm.load {{.*}} : 
!llvm.ptr<3> -> vector<4xf32> -// RDNA3-COUNT-128: llvm.intr.fmuladd({{.*}}) : (vector<4xf32>, vector<4xf32>, vector<4xf32>) -> vector<4xf32> -// RDNA3-COUNT-3: llvm.load {{.*}} : !llvm.ptr<1> -> vector<4xf32> +// RDNA3-COUNT-1: llvm.load {{.*}} : !llvm.ptr<1> -> vector<32xf32> +// RDNA3-COUNT-32: llvm.load {{.*}} : !llvm.ptr<1> -> vector<16xf32> +// RDNA3-COUNT-32: llvm.intr.fmuladd({{.*}}) : (vector<16xf32>, vector<16xf32>, vector<16xf32>) -> vector<16xf32> +// RDNA3-COUNT-1: llvm.store {{.*}} : vector<16xf32>, !llvm.ptr<1> // RDNA3: llvm.br -// RDNA3-COUNT-3: llvm.store {{.*}} : vector<4xf32>, !llvm.ptr<3> -// RDNA3-COUNT-32: llvm.load {{.*}} : !llvm.ptr<3> -> vector<4xf32> -// RDNA3-COUNT-128: llvm.intr.fmuladd({{.*}}) : (vector<4xf32>, vector<4xf32>, vector<4xf32>) -> vector<4xf32> -// RDNA3-COUNT-4: llvm.store {{.*}} : vector<4xf32>, !llvm.ptr<1> // ----- diff --git a/tests/e2e/matmul/BUILD.bazel b/tests/e2e/matmul/BUILD.bazel index 0bad5e06eef7..8ffe93c0ffac 100644 --- a/tests/e2e/matmul/BUILD.bazel +++ b/tests/e2e/matmul/BUILD.bazel @@ -385,30 +385,6 @@ X86_64_AVX512_BF16 = X86_64_AVX512 + [ ## ########################################################################### -iree_generated_e2e_runner_test( - name = "e2e_matmul_cuda_f32_large_simt", - generator = ":generate_e2e_matmul_tests", - generator_args = [ - "--lhs_rhs_type=f32", - "--acc_type=f32", - "--shapes=easy_large_static", - "--compilation_info=LLVMGPUMatmulSimt", - ], - tags = [ - # CUDA cuInit fails with sanitizer on. - "noasan", - "nomsan", - "notsan", - "noubsan", - "requires-gpu-nvidia", - ], - target_backends_and_drivers = [ - ("cuda", "cuda"), - ], - test_runner = "//tools/testing/e2e:iree-e2e-matmul-test", - test_type = "matmul", -) - # Testing Ampere + TensorCore path. 
# WMMA TensorCore(F32): wmma.161616.f32.tf32 iree_generated_e2e_runner_test( diff --git a/tests/e2e/matmul/CMakeLists.txt b/tests/e2e/matmul/CMakeLists.txt index 9e7ec415b564..b744d346ebef 100644 --- a/tests/e2e/matmul/CMakeLists.txt +++ b/tests/e2e/matmul/CMakeLists.txt @@ -1016,32 +1016,6 @@ iree_generated_e2e_runner_test( "--iree-opt-data-tiling" ) -iree_generated_e2e_runner_test( - NAME - e2e_matmul_cuda_f32_large_simt - TEST_TYPE - matmul - GENERATOR - "generate_e2e_matmul_tests.py" - GENERATOR_ARGS - "--lhs_rhs_type=f32" - "--acc_type=f32" - "--shapes=easy_large_static" - "--compilation_info=LLVMGPUMatmulSimt" - TEST_RUNNER - iree_tools_testing_e2e_iree-e2e-matmul-test - TARGET_BACKENDS - "cuda" - DRIVERS - "cuda" - LABELS - "noasan" - "nomsan" - "notsan" - "noubsan" - "requires-gpu-nvidia" -) - iree_generated_e2e_runner_test( NAME e2e_matmul_cuda_f32_large_tensorcore diff --git a/tests/e2e/matmul/generate_e2e_matmul_tests.py b/tests/e2e/matmul/generate_e2e_matmul_tests.py index a97b5626c069..3061fb620af0 100644 --- a/tests/e2e/matmul/generate_e2e_matmul_tests.py +++ b/tests/e2e/matmul/generate_e2e_matmul_tests.py @@ -50,7 +50,6 @@ class ShapesId(enum.Enum): @enum.unique class CompilationInfoId(enum.Enum): NONE = "" - LLVMGPUMatmulSimt = "LLVMGPUMatmulSimt" LLVMGPUMatmulTensorCore = "LLVMGPUMatmulTensorCore" LLVMGPUMatmulTensorCoreMmaSync = "LLVMGPUMatmulTensorCoreMmaSync" LLVMGPUVectorDistributeMFMA = "LLVMGPUVectorDistributeMFMA" @@ -461,18 +460,7 @@ def get_test_compilation_infos( software_pipeline_depth = 0 tile_workgroup_size_pairs = [] - if compilation_info_id == CompilationInfoId.LLVMGPUMatmulSimt: - tile_workgroup_size_pairs = [ - TileWorkgroupSizePair([[32, 128, 32]], [32, 8, 1]), - TileWorkgroupSizePair([[128, 64, 8]], [16, 8, 1]), - TileWorkgroupSizePair([[16, 256, 32]], [64, 2, 1]), - TileWorkgroupSizePair([[8, 32, 32]], [8, 8, 1]), - TileWorkgroupSizePair([[8, 128, 4]], [32, 1, 1]), - TileWorkgroupSizePair([[16, 64, 4]], [16, 2, 1]), - 
TileWorkgroupSizePair([[1, 128, 8]], [32, 1, 1]), - ] - software_pipeline_depth = 3 - elif compilation_info_id == CompilationInfoId.SPIRVCooperativeMatrixVectorize: + if compilation_info_id == CompilationInfoId.SPIRVCooperativeMatrixVectorize: tile_workgroup_size_pairs = [ TileWorkgroupSizePair( [[64, 128], [32, 64], [0, 0, 32], [16, 16, 16]], [64, 2, 1] From ed9a028d3f3bfb0ab32004881c87539577048aa8 Mon Sep 17 00:00:00 2001 From: Benoit Jacob Date: Thu, 19 Dec 2024 09:42:21 -0500 Subject: [PATCH 47/64] GPU Data-tiled multi-mma: subgroup dimensions should be outer (#19521) This was already the idea, but there was an accidental exception: in the accumulator tensor, if there were both an `unroll_m` dimension and a `subgroup_n` dimension, then the `subgroup_n` dimension wasn't on the outside of `unroll_m` as it was meant to be. I noticed this when that ordering required corresponding strides in the ukernel. Signed-off-by: Benoit Jacob --- .../test/gpu_materialize_encoding_gfx942.mlir | 30 +++++++++---------- .../Dialect/GPU/IR/GPUTileSwizzleUtils.cpp | 6 ++-- .../test/ROCDL/pipeline_tile_and_fuse.mlir | 10 +++---- 3 files changed, 23 insertions(+), 23 deletions(-) diff --git a/compiler/src/iree/compiler/Codegen/Common/test/gpu_materialize_encoding_gfx942.mlir b/compiler/src/iree/compiler/Codegen/Common/test/gpu_materialize_encoding_gfx942.mlir index 2544fc127f89..392ed2927498 100644 --- a/compiler/src/iree/compiler/Codegen/Common/test/gpu_materialize_encoding_gfx942.mlir +++ b/compiler/src/iree/compiler/Codegen/Common/test/gpu_materialize_encoding_gfx942.mlir @@ -230,8 +230,8 @@ func.func @set_encoding_ACC_unroll8x8x4_MFMA_F32_16x16x4_F32() { // CHECK-SAME : tensor<2x5x128x128xf32> into tensor<2x5x8x4x4x4x2x16xf32> // CHECK: %[[TRANSPOSE:.*]] = linalg.transpose // CHECK-SAME: ins(%[[EXPAND]] : tensor<2x5x8x4x4x4x2x16xf32>) -// CHECK-SAME: outs({{.*}} : tensor<2x5x8x4x2x4x16x4xf32>) -// CHECK-SAME: permutation = [0, 1, 2, 5, 6, 3, 7, 4] +// CHECK-SAME: outs({{.*}} : 
tensor<2x5x4x8x2x4x16x4xf32>) +// CHECK-SAME: permutation = [0, 1, 5, 2, 6, 3, 7, 4] // CHECK: flow.dispatch.tensor.store %[[TRANSPOSE]] // ----- @@ -255,9 +255,9 @@ func.func @unset_encoding_ACC_unroll8x8x4_MFMA_F32_16x16x4_F32() { // CHECK-LABEL: func.func @unset_encoding_ACC_unroll8x8x4_MFMA_F32_16x16x4_F32() { // CHECK: %[[TRANSPOSE:.*]] = linalg.transpose -// CHECK-SAME: ins(%{{.+}} : tensor<2x5x8x4x2x4x16x4xf32>) +// CHECK-SAME: ins(%{{.+}} : tensor<2x5x4x8x2x4x16x4xf32>) // CHECK-SAME: outs({{.*}} : tensor<2x5x8x4x4x4x2x16xf32>) -// CHECK-SAME: permutation = [0, 1, 2, 5, 7, 3, 4, 6] +// CHECK-SAME: permutation = [0, 1, 3, 5, 7, 2, 4, 6] // CHECK: %[[COLLAPSE:.*]] = tensor.collapse_shape %[[TRANSPOSE]] // CHECK-SAME: : tensor<2x5x8x4x4x4x2x16xf32> into tensor<2x5x128x128xf32> // CHECK: %[[UNPACK:.*]] = tensor.unpack %[[COLLAPSE]] @@ -298,9 +298,9 @@ func.func @unset_encoding_ACC_dynamic_unroll8x8x4_MFMA_F32_16x16x4_F32() { } // CHECK-LABEL: func.func @unset_encoding_ACC_dynamic_unroll8x8x4_MFMA_F32_16x16x4_F32 // CHECK: %[[TRANSPOSE:.*]] = linalg.transpose -// CHECK-SAME: ins(%{{.+}} : tensor) +// CHECK-SAME: ins(%{{.+}} : tensor) // CHECK-SAME: outs({{.*}} : tensor) -// CHECK-SAME: permutation = [0, 1, 2, 5, 7, 3, 4, 6] +// CHECK-SAME: permutation = [0, 1, 3, 5, 7, 2, 4, 6] // CHECK: %[[COLLAPSE:.*]] = tensor.collapse_shape %[[TRANSPOSE]] // CHECK-SAME: : tensor into tensor // CHECK: %[[UNPACK:.*]] = tensor.unpack %[[COLLAPSE]] @@ -362,7 +362,7 @@ func.func @matmul_lowering_MFMA_F32_16x16x4_F32() { // CHECK-DAG: %[[ACC_BINDING:.+]] = hal.interface.binding.subspan {{.+}} binding(2) // CHECK-DAG: %[[LHS:.+]] = flow.dispatch.tensor.load %[[LHS_BINDING]]{{.+}} -> tensor // CHECK-DAG: %[[RHS:.+]] = flow.dispatch.tensor.load %[[RHS_BINDING]]{{.+}} -> tensor -// CHECK-DAG: %[[ACC:.+]] = flow.dispatch.tensor.load %[[ACC_BINDING]]{{.+}} -> tensor +// CHECK-DAG: %[[ACC:.+]] = flow.dispatch.tensor.load %[[ACC_BINDING]]{{.+}} -> tensor // CHECK: %[[MMA:.+]] = 
iree_gpu.multi_mma %[[LHS]], %[[RHS]], %[[ACC]] // CHECK-SAME: indexing_maps = [#[[MAP0]], #[[MAP1]], #[[MAP2]]], // CHECK-SAME: iterator_types = [#iree_gpu.iterator_type, #iree_gpu.iterator_type, #iree_gpu.iterator_type] @@ -422,7 +422,7 @@ func.func @batch_matmul_lowering_MFMA_F32_16x16x4_F32() { // CHECK-DAG: %[[ACC_BINDING:.+]] = hal.interface.binding.subspan {{.+}} binding(2) // CHECK-DAG: %[[LHS:.+]] = flow.dispatch.tensor.load %[[LHS_BINDING]]{{.+}} -> tensor // CHECK-DAG: %[[RHS:.+]] = flow.dispatch.tensor.load %[[RHS_BINDING]]{{.+}} -> tensor -// CHECK-DAG: %[[ACC:.+]] = flow.dispatch.tensor.load %[[ACC_BINDING]]{{.+}} -> tensor +// CHECK-DAG: %[[ACC:.+]] = flow.dispatch.tensor.load %[[ACC_BINDING]]{{.+}} -> tensor // CHECK: %[[MMA:.+]] = iree_gpu.multi_mma %[[LHS]], %[[RHS]], %[[ACC]] // CHECK-SAME: indexing_maps = [#[[MAP0]], #[[MAP1]], #[[MAP2]]], // CHECK-SAME: iterator_types = [#iree_gpu.iterator_type, #iree_gpu.iterator_type, #iree_gpu.iterator_type, #iree_gpu.iterator_type] @@ -528,8 +528,8 @@ func.func @set_encoding_ACC_unroll8x8x2_MFMA_I32_16x16x32_I8() { // CHECK-SAME : tensor<2x5x128x128xi32> into tensor<2x5x8x4x4x4x2x16xi32> // CHECK: %[[TRANSPOSE:.*]] = linalg.transpose // CHECK-SAME: ins(%[[EXPAND]] : tensor<2x5x8x4x4x4x2x16xi32>) -// CHECK-SAME: outs({{.*}} : tensor<2x5x8x4x2x4x16x4xi32>) -// CHECK-SAME: permutation = [0, 1, 2, 5, 6, 3, 7, 4] +// CHECK-SAME: outs({{.*}} : tensor<2x5x4x8x2x4x16x4xi32>) +// CHECK-SAME: permutation = [0, 1, 5, 2, 6, 3, 7, 4] // CHECK: flow.dispatch.tensor.store %[[TRANSPOSE]] // ----- @@ -553,9 +553,9 @@ func.func @unset_encoding_ACC_unroll8x8x2_MFMA_I32_16x16x32_I8() { // CHECK-LABEL: func.func @unset_encoding_ACC_unroll8x8x2_MFMA_I32_16x16x32_I8() { // CHECK: %[[TRANSPOSE:.*]] = linalg.transpose -// CHECK-SAME: ins(%{{.+}} : tensor<2x5x8x4x2x4x16x4xi32>) +// CHECK-SAME: ins(%{{.+}} : tensor<2x5x4x8x2x4x16x4xi32>) // CHECK-SAME: outs({{.*}} : tensor<2x5x8x4x4x4x2x16xi32>) -// CHECK-SAME: permutation = [0, 1, 
2, 5, 7, 3, 4, 6] +// CHECK-SAME: permutation = [0, 1, 3, 5, 7, 2, 4, 6] // CHECK: %[[COLLAPSE:.*]] = tensor.collapse_shape %[[TRANSPOSE]] // CHECK-SAME: : tensor<2x5x8x4x4x4x2x16xi32> into tensor<2x5x128x128xi32> // CHECK: %[[UNPACK:.*]] = tensor.unpack %[[COLLAPSE]] @@ -618,7 +618,7 @@ func.func @matmul_lowering_MFMA_I32_16x16x32_I8() { // CHECK-DAG: %[[ACC_BINDING:.+]] = hal.interface.binding.subspan {{.+}} binding(2) // CHECK-DAG: %[[LHS:.+]] = flow.dispatch.tensor.load %[[LHS_BINDING]]{{.+}} -> tensor // CHECK-DAG: %[[RHS:.+]] = flow.dispatch.tensor.load %[[RHS_BINDING]]{{.+}} -> tensor -// CHECK-DAG: %[[ACC:.+]] = flow.dispatch.tensor.load %[[ACC_BINDING]]{{.+}} -> tensor +// CHECK-DAG: %[[ACC:.+]] = flow.dispatch.tensor.load %[[ACC_BINDING]]{{.+}} -> tensor // CHECK: %[[MMA:.+]] = iree_gpu.multi_mma %[[LHS]], %[[RHS]], %[[ACC]] // CHECK-SAME: indexing_maps = [#[[MAP0]], #[[MAP1]], #[[MAP2]]], // CHECK-SAME: iterator_types = [#iree_gpu.iterator_type, #iree_gpu.iterator_type, #iree_gpu.iterator_type] @@ -1124,7 +1124,7 @@ func.func @batch_matmul_lowering_MFMA_F32_16x16x32_F8E4M3FNUZ() { // CHECK-DAG: %[[ACC_BINDING:.+]] = hal.interface.binding.subspan {{.+}} binding(2) // CHECK-DAG: %[[LHS:.+]] = flow.dispatch.tensor.load %[[LHS_BINDING]]{{.+}} -> tensor // CHECK-DAG: %[[RHS:.+]] = flow.dispatch.tensor.load %[[RHS_BINDING]]{{.+}} -> tensor -// CHECK-DAG: %[[ACC:.+]] = flow.dispatch.tensor.load %[[ACC_BINDING]]{{.+}} -> tensor +// CHECK-DAG: %[[ACC:.+]] = flow.dispatch.tensor.load %[[ACC_BINDING]]{{.+}} -> tensor // CHECK: %[[MMA:.+]] = iree_gpu.multi_mma %[[LHS]], %[[RHS]], %[[ACC]] // CHECK-SAME: indexing_maps = [#[[MAP0]], #[[MAP1]], #[[MAP2]]], // CHECK-SAME: iterator_types = [#iree_gpu.iterator_type, #iree_gpu.iterator_type, #iree_gpu.iterator_type, #iree_gpu.iterator_type] @@ -1184,7 +1184,7 @@ func.func @batch_matmul_lowering_MFMA_F32_16x16x16_BF16() { // CHECK-DAG: %[[ACC_BINDING:.+]] = hal.interface.binding.subspan {{.+}} binding(2) // CHECK-DAG: 
%[[LHS:.+]] = flow.dispatch.tensor.load %[[LHS_BINDING]]{{.+}} -> tensor // CHECK-DAG: %[[RHS:.+]] = flow.dispatch.tensor.load %[[RHS_BINDING]]{{.+}} -> tensor -// CHECK-DAG: %[[ACC:.+]] = flow.dispatch.tensor.load %[[ACC_BINDING]]{{.+}} -> tensor +// CHECK-DAG: %[[ACC:.+]] = flow.dispatch.tensor.load %[[ACC_BINDING]]{{.+}} -> tensor // CHECK: %[[MMA:.+]] = iree_gpu.multi_mma %[[LHS]], %[[RHS]], %[[ACC]] // CHECK-SAME: indexing_maps = [#[[MAP0]], #[[MAP1]], #[[MAP2]]], // CHECK-SAME: iterator_types = [#iree_gpu.iterator_type, #iree_gpu.iterator_type, #iree_gpu.iterator_type, #iree_gpu.iterator_type] diff --git a/compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/GPUTileSwizzleUtils.cpp b/compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/GPUTileSwizzleUtils.cpp index f80c345d8513..cbff06a0fb2c 100644 --- a/compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/GPUTileSwizzleUtils.cpp +++ b/compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/GPUTileSwizzleUtils.cpp @@ -183,12 +183,12 @@ TileSwizzle getSwizzle(IREE::GPU::DataTiledMMAAttr mma, if (mma.getUnrollN() > 1) { expand(swizzle, 1, {Kind::CrossIntrinsic, mma.getUnrollN()}); } - if (mma.getSubgroupsN() > 1) { - expand(swizzle, 1, {Kind::CrossThread, mma.getSubgroupsN()}); - } if (mma.getUnrollM() > 1) { expand(swizzle, 0, {Kind::CrossIntrinsic, mma.getUnrollM()}); } + if (mma.getSubgroupsN() > 1) { + expand(swizzle, 1, {Kind::CrossThread, mma.getSubgroupsN()}); + } if (mma.getSubgroupsM() > 1) { expand(swizzle, 0, {Kind::CrossThread, mma.getSubgroupsM()}); } diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/ROCDL/pipeline_tile_and_fuse.mlir b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/ROCDL/pipeline_tile_and_fuse.mlir index a716c6b7579c..f71add60f4b1 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/ROCDL/pipeline_tile_and_fuse.mlir +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/ROCDL/pipeline_tile_and_fuse.mlir @@ -755,11 +755,11 @@ hal.executable public @main { // CHECK: gpu.barrier 
 // CHECK-DAG: %[[A_READ:.+]] = vector.transfer_read %[[A_ALLOC]]{{.*}} vector<8x1x1x4xf32>
 // CHECK-DAG: %[[B_READ:.+]] = vector.transfer_read %[[B_ALLOC]]{{.*}} vector<2x1x1x4xf32>
-// CHECK-DAG: %[[C_READ:.+]] = vector.transfer_read %[[BINDING_C]]{{.*}} vector<8x1x2x1x1x4xf32>
-// CHECK-DAG: %[[C_00_0:.+]] = vector.extract %[[C_READ]][0, 0, 0, 0, 0] : vector<4xf32> from vector<8x1x2x1x1x4xf32>
-// CHECK-DAG: %[[C_01_0:.+]] = vector.extract %[[C_READ]][0, 0, 1, 0, 0] : vector<4xf32> from vector<8x1x2x1x1x4xf32>
-// CHECK-DAG: %[[C_70_0:.+]] = vector.extract %[[C_READ]][7, 0, 0, 0, 0] : vector<4xf32> from vector<8x1x2x1x1x4xf32>
-// CHECK-DAG: %[[C_71_0:.+]] = vector.extract %[[C_READ]][7, 0, 1, 0, 0] : vector<4xf32> from vector<8x1x2x1x1x4xf32>
+// CHECK-DAG: %[[C_READ:.+]] = vector.transfer_read %[[BINDING_C]]{{.*}} vector<8x2x1x1x4xf32>
+// CHECK-DAG: %[[C_00_0:.+]] = vector.extract %[[C_READ]][0, 0, 0, 0] : vector<4xf32> from vector<8x2x1x1x4xf32>
+// CHECK-DAG: %[[C_01_0:.+]] = vector.extract %[[C_READ]][0, 1, 0, 0] : vector<4xf32> from vector<8x2x1x1x4xf32>
+// CHECK-DAG: %[[C_70_0:.+]] = vector.extract %[[C_READ]][7, 0, 0, 0] : vector<4xf32> from vector<8x2x1x1x4xf32>
+// CHECK-DAG: %[[C_71_0:.+]] = vector.extract %[[C_READ]][7, 1, 0, 0] : vector<4xf32> from vector<8x2x1x1x4xf32>
 // CHECK-DAG: %[[A_EXTRACT00:.+]] = vector.extract %[[A_READ]][0, 0, 0, 0] : f32 from vector<8x1x1x4xf32>
 // CHECK-DAG: %[[A_EXTRACT01:.+]] = vector.extract %[[A_READ]][0, 0, 0, 1] : f32 from vector<8x1x1x4xf32>
 // CHECK-DAG: %[[A_EXTRACT02:.+]] = vector.extract %[[A_READ]][0, 0, 0, 2] : f32 from vector<8x1x1x4xf32>

From 01f090030d7a6c790fadfce117585eeec74a139d Mon Sep 17 00:00:00 2001
From: Han-Chung Wang
Date: Thu, 19 Dec 2024 08:02:52 -0800
Subject: [PATCH 48/64] [NFC] Fixing typo (mutli -> multi). (#19526)

I found the typo while doing multi-device related work; this revision fixes it. I verified that no "mutli" typos remain in the IREE repo with the fix.
Signed-off-by: hanhanW --- .../Codegen/Common/test/iree_comprehensive_bufferize.mlir | 4 ++-- runtime/src/iree/schemas/vulkan_executable_def.fbs | 2 +- tests/e2e/stablehlo_ops/dot_general.mlir | 2 +- tools/test/iree-run-module-multi.mlir | 6 +++--- 4 files changed, 7 insertions(+), 7 deletions(-) diff --git a/compiler/src/iree/compiler/Codegen/Common/test/iree_comprehensive_bufferize.mlir b/compiler/src/iree/compiler/Codegen/Common/test/iree_comprehensive_bufferize.mlir index f0fe5eea375c..953d361e3361 100644 --- a/compiler/src/iree/compiler/Codegen/Common/test/iree_comprehensive_bufferize.mlir +++ b/compiler/src/iree/compiler/Codegen/Common/test/iree_comprehensive_bufferize.mlir @@ -2172,7 +2172,7 @@ func.func @operand_fusion() { #map3 = affine_map<(d0, d1, d2, d3) -> (d0, d1, d3)> #map4 = affine_map<(d0, d1, d2, d3) -> (d0, d3, d2)> #map5 = affine_map<(d0, d1, d2, d3) -> (d0, d1, d2)> -func.func @dot_general_nontrivial_batching_mutliple_parallel_dimension() { +func.func @dot_general_nontrivial_batching_multiple_parallel_dimension() { %cst = arith.constant dense<0.000000e+00> : vector<1x4x2xf32> %c1 = arith.constant 1 : index %c6 = arith.constant 6 : index @@ -2217,7 +2217,7 @@ func.func @dot_general_nontrivial_batching_mutliple_parallel_dimension() { } return } -// CHECK-LABEL: func.func @dot_general_nontrivial_batching_mutliple_parallel_dimension() +// CHECK-LABEL: func.func @dot_general_nontrivial_batching_multiple_parallel_dimension() // CHECK-NOT: memref.alloc // ----- diff --git a/runtime/src/iree/schemas/vulkan_executable_def.fbs b/runtime/src/iree/schemas/vulkan_executable_def.fbs index 476e0d3a0499..ee57bf8ce0de 100644 --- a/runtime/src/iree/schemas/vulkan_executable_def.fbs +++ b/runtime/src/iree/schemas/vulkan_executable_def.fbs @@ -93,7 +93,7 @@ table ExecutableDef { descriptor_set_layouts:[DescriptorSetLayoutDef]; // A list of pipeline layouts. 
Exports reference layouts in this list and - // multiple exports present in mutliple shader modules may share layouts. + // multiple exports present in multiple shader modules may share layouts. // This list may not have the same size as the pipelines list. pipeline_layouts:[PipelineLayoutDef]; diff --git a/tests/e2e/stablehlo_ops/dot_general.mlir b/tests/e2e/stablehlo_ops/dot_general.mlir index 69f92d92c9af..5bfb8b8680ad 100644 --- a/tests/e2e/stablehlo_ops/dot_general.mlir +++ b/tests/e2e/stablehlo_ops/dot_general.mlir @@ -126,7 +126,7 @@ func.func @large_dot_general() { return } -func.func @dot_general_nontrivial_batching_mutliple_parallel_dimension() { +func.func @dot_general_nontrivial_batching_multiple_parallel_dimension() { %lhs = util.unfoldable_constant dense<[ [[[0.0], [1.0]], [[2.0], [3.0]], [[ 4.0], [ 5.0]]], [[[6.0], [7.0]], [[8.0], [9.0]], [[10.0], [11.0]]] diff --git a/tools/test/iree-run-module-multi.mlir b/tools/test/iree-run-module-multi.mlir index 341259652818..8f1d5f6aad64 100644 --- a/tools/test/iree-run-module-multi.mlir +++ b/tools/test/iree-run-module-multi.mlir @@ -11,17 +11,17 @@ // RUN: --iree-hal-local-target-device-backends=vmvx | \ // RUN: iree-run-module \ // RUN: --module=- \ -// RUN: --function=mutli_device_mul \ +// RUN: --function=multi_device_mul \ // RUN: --input=4xf32=10,11,12,13 \ // RUN: --device=local-task \ // RUN: --device=local-task \ // RUN: --task_topology_group_count=1) | \ // RUN: FileCheck %s -// CHECK: EXEC @mutli_device_mul +// CHECK: EXEC @multi_device_mul // CHECK-NEXT: result[0]: hal.buffer_view // CHECK-NEXT: 4xf32=0 55 144 273 -func.func public @mutli_device_mul( +func.func public @multi_device_mul( // Input argument is resident on device_a (tooling default to first device). 
%input_a: tensor<4xf32> {iree.abi.affinity = #hal.device.promise<@device_a>} ) -> (

From 5c4bc678f9b7356fd083c20821fa2b92a48ab4fd Mon Sep 17 00:00:00 2001
From: Scott Todd
Date: Thu, 19 Dec 2024 08:49:50 -0800
Subject: [PATCH 49/64] Trigger presubmit ci workflows from `ci.yml` via `workflow_call`. (#19445)

## History

These workflow jobs were originally part of `ci.yml` but they were moved out as part of https://github.com/iree-org/iree/issues/17957.

## Reasons to keep them separate

* Run history for each workflow is independent, e.g. https://github.com/iree-org/iree/actions/workflows/ci_linux_x64_clang.yml and https://github.com/iree-org/iree/actions/workflows/ci_windows_x64_msvc.yml.
* Performance metrics for each workflow are independent: ![image](https://github.com/user-attachments/assets/5832d491-541d-46e4-9626-422f989e8879)
* Status badges are independent

## Reasons to merge them back

* Less code duplication, particularly for `setup` and `summary` steps (see https://github.com/iree-org/iree/pull/19444)
* Less noise in the checks view in pull requests (note all the `setup` jobs, and the PR above will add similarly duplicated `summary` jobs): ![image](https://github.com/user-attachments/assets/37628719-4d23-46f5-89ff-9da7f391695b)
* Centralized logs for checks, not split between multiple pages like they are currently:

  Current | With this PR
  -- | --
  ![image](https://github.com/user-attachments/assets/5adc117e-aa76-47a7-973f-6f1e7adcbc9c) | ![image](https://github.com/user-attachments/assets/2cb55430-63f5-4b1f-b64c-550b919fe51f)

* This gives us a single `ci_summary` check in `ci.yml` to set as a required check instead of multiple split across workflows
* Given enough runner capacity / budget, we could run all these jobs on presubmit/postsubmit and not just on nightly schedules

---
 .github/workflows/ci.yml | 31 +++++++++
 .github/workflows/ci_linux_x64_bazel.yml | 26 +-------
 .github/workflows/ci_linux_x64_clang.yml | 26 +-------
.github/workflows/ci_linux_x64_clang_asan.yml | 26 +------- README.md | 6 +- .../docs/developers/general/github-actions.md | 64 +++++++------------ 6 files changed, 65 insertions(+), 114 deletions(-) diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index 3246beb3ab80..a56853a0bcb5 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -46,6 +46,9 @@ jobs: setup: uses: ./.github/workflows/setup.yml + ############################################################################## + # Runtime builds + runtime: needs: setup name: "runtime :: ${{ matrix.name }}" @@ -196,6 +199,27 @@ jobs: - name: CMake - build run: cmake --build ${BUILD_DIR} -- -k 0 + ############################################################################## + # Full project builds + + linux_x64_bazel: + needs: setup + if: contains(fromJson(needs.setup.outputs.enabled-jobs), 'linux_x64_bazel') + uses: ./.github/workflows/ci_linux_x64_bazel.yml + secrets: inherit + + linux_x64_clang: + needs: setup + if: contains(fromJson(needs.setup.outputs.enabled-jobs), 'linux_x64_clang') + uses: ./.github/workflows/ci_linux_x64_clang.yml + secrets: inherit + + linux_x64_clang_asan: + needs: setup + if: contains(fromJson(needs.setup.outputs.enabled-jobs), 'linux_x64_clang_asan') + uses: ./.github/workflows/ci_linux_x64_clang_asan.yml + secrets: inherit + ############################################################################## # Aggregate job status and alerting on failures. @@ -203,9 +227,16 @@ jobs: if: always() needs: - setup + + # Runtime builds. - runtime - runtime_small - runtime_tracing + + # Full project builds. 
+ - linux_x64_bazel + - linux_x64_clang + - linux_x64_clang_asan uses: ./.github/workflows/workflow_summary.yml secrets: inherit with: diff --git a/.github/workflows/ci_linux_x64_bazel.yml b/.github/workflows/ci_linux_x64_bazel.yml index ce9c4859ef4f..ad23ecf74e40 100644 --- a/.github/workflows/ci_linux_x64_bazel.yml +++ b/.github/workflows/ci_linux_x64_bazel.yml @@ -7,27 +7,11 @@ name: CI - Linux x64 bazel on: + workflow_call: workflow_dispatch: - pull_request: - push: - branches: - - main - -concurrency: - # A PR number if a pull request and otherwise the commit hash. This cancels - # queued and in-progress runs for the same PR (presubmit) or commit - # (postsubmit). The workflow name is prepended to avoid conflicts between - # different workflows. - group: ${{ github.workflow }}-${{ github.event.number || github.sha }} - cancel-in-progress: true jobs: - setup: - uses: ./.github/workflows/setup.yml - linux_x64_bazel: - needs: setup - if: contains(fromJson(needs.setup.outputs.enabled-jobs), 'linux_x64_bazel') runs-on: azure-linux-scale container: image: ghcr.io/iree-org/cpubuilder_ubuntu_jammy@sha256:78a558b999b230f7e1da376639e14b44f095f30f1777d6a272ba48c0bbdd4ccb @@ -54,10 +38,4 @@ jobs: /usr/local/bin/fetch_cuda_deps.sh ${IREE_CUDA_DEPS_DIR} ./build_tools/bazel/build_test_all.sh - - name: Post to Discord on Failure - uses: sarisia/actions-status-discord@ce8cc68e4e626000136b3c702d049a154243e490 # v1.14.7 - if: failure() && github.ref_name == 'main' && github.repository_owner == 'iree-org' - with: - webhook: ${{ secrets.DISCORD_WEBHOOK }} - description: "The ${{ github.workflow }} workflow failed" - url: "${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}/attempts/${{ github.run_attempt }}" + # Alerting on failure is the responsibility of the calling job. 
diff --git a/.github/workflows/ci_linux_x64_clang.yml b/.github/workflows/ci_linux_x64_clang.yml index e8825636a9ee..2d1a44864dd5 100644 --- a/.github/workflows/ci_linux_x64_clang.yml +++ b/.github/workflows/ci_linux_x64_clang.yml @@ -7,27 +7,11 @@ name: CI - Linux x64 clang on: + workflow_call: workflow_dispatch: - pull_request: - push: - branches: - - main - -concurrency: - # A PR number if a pull request and otherwise the commit hash. This cancels - # queued and in-progress runs for the same PR (presubmit) or commit - # (postsubmit). The workflow name is prepended to avoid conflicts between - # different workflows. - group: ${{ github.workflow }}-${{ github.event.number || github.sha }} - cancel-in-progress: true jobs: - setup: - uses: ./.github/workflows/setup.yml - linux_x64_clang: - needs: setup - if: contains(fromJson(needs.setup.outputs.enabled-jobs), 'linux_x64_clang') runs-on: azure-linux-scale container: image: ghcr.io/iree-org/cpubuilder_ubuntu_jammy@sha256:78a558b999b230f7e1da376639e14b44f095f30f1777d6a272ba48c0bbdd4ccb @@ -71,10 +55,4 @@ jobs: - name: Test iree-dialects run: ./build_tools/cmake/test_iree_dialects.sh "${BUILD_DIR}" - - name: Post to Discord on Failure - uses: sarisia/actions-status-discord@ce8cc68e4e626000136b3c702d049a154243e490 # v1.14.7 - if: failure() && github.ref_name == 'main' && github.repository_owner == 'iree-org' - with: - webhook: ${{ secrets.DISCORD_WEBHOOK }} - description: "The ${{ github.workflow }} workflow failed" - url: "${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}/attempts/${{ github.run_attempt }}" + # Alerting on failure is the responsibility of the calling job. 
diff --git a/.github/workflows/ci_linux_x64_clang_asan.yml b/.github/workflows/ci_linux_x64_clang_asan.yml index c352dbda4572..bc7b7a935b59 100644 --- a/.github/workflows/ci_linux_x64_clang_asan.yml +++ b/.github/workflows/ci_linux_x64_clang_asan.yml @@ -7,27 +7,11 @@ name: CI - Linux x64 clang ASan on: + workflow_call: workflow_dispatch: - pull_request: - push: - branches: - - main - -concurrency: - # A PR number if a pull request and otherwise the commit hash. This cancels - # queued and in-progress runs for the same PR (presubmit) or commit - # (postsubmit). The workflow name is prepended to avoid conflicts between - # different workflows. - group: ${{ github.workflow }}-${{ github.event.number || github.sha }} - cancel-in-progress: true jobs: - setup: - uses: ./.github/workflows/setup.yml - linux_x64_clang_asan: - needs: setup - if: contains(fromJson(needs.setup.outputs.enabled-jobs), 'linux_x64_clang_asan') runs-on: azure-linux-scale container: ghcr.io/iree-org/cpubuilder_ubuntu_jammy@sha256:78a558b999b230f7e1da376639e14b44f095f30f1777d6a272ba48c0bbdd4ccb defaults: @@ -54,10 +38,4 @@ jobs: ./build_tools/cmake/build_and_test_asan.sh sccache --show-stats - - name: Post to Discord on Failure - uses: sarisia/actions-status-discord@ce8cc68e4e626000136b3c702d049a154243e490 # v1.14.7 - if: failure() && github.ref_name == 'main' && github.repository_owner == 'iree-org' - with: - webhook: ${{ secrets.DISCORD_WEBHOOK }} - description: "The ${{ github.workflow }} workflow failed" - url: "${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}/attempts/${{ github.run_attempt }}" + # Alerting on failure is the responsibility of the calling job. 
diff --git a/README.md b/README.md index dcd58e65f385..d04fe4ac496b 100644 --- a/README.md +++ b/README.md @@ -39,9 +39,11 @@ Python iree-base-runtime | [![PyPI version](https://badge.fury.io/py/iree-base-r [![CI](https://github.com/iree-org/iree/actions/workflows/ci.yml/badge.svg?query=branch%3Amain+event%3Apush)](https://github.com/iree-org/iree/actions/workflows/ci.yml?query=branch%3Amain+event%3Apush) [![PkgCI](https://github.com/iree-org/iree/actions/workflows/pkgci.yml/badge.svg?query=branch%3Amain+event%3Apush)](https://github.com/iree-org/iree/actions/workflows/pkgci.yml?query=branch%3Amain+event%3Apush) -| Host platform | Build status | +#### Nightly build status + +| Operating system | Build status | | -- | --: | -Linux | [![CI - Linux x64 clang](https://github.com/iree-org/iree/actions/workflows/ci_linux_x64_clang.yml/badge.svg?query=branch%3Amain+event%3Apush)](https://github.com/iree-org/iree/actions/workflows/ci_linux_x64_clang.yml?query=branch%3Amain+event%3Apush)
[![CI - Linux arm64 clang](https://github.com/iree-org/iree/actions/workflows/ci_linux_arm64_clang.yml/badge.svg?query=branch%3Amain+event%3Aschedule)](https://github.com/iree-org/iree/actions/workflows/ci_linux_arm64_clang.yml?query=branch%3Amain+event%3Aschedule) +Linux | [![CI - Linux arm64 clang](https://github.com/iree-org/iree/actions/workflows/ci_linux_arm64_clang.yml/badge.svg?query=branch%3Amain+event%3Aschedule)](https://github.com/iree-org/iree/actions/workflows/ci_linux_arm64_clang.yml?query=branch%3Amain+event%3Aschedule) macOS | [![CI - macOS x64 clang](https://github.com/iree-org/iree/actions/workflows/ci_macos_x64_clang.yml/badge.svg?query=branch%3Amain+event%3Aschedule)](https://github.com/iree-org/iree/actions/workflows/ci_macos_x64_clang.yml?query=branch%3Amain+event%3Aschedule) Windows | [![CI - Windows x64 MSVC](https://github.com/iree-org/iree/actions/workflows/ci_windows_x64_msvc.yml/badge.svg?query=branch%3Amain+event%3Aschedule)](https://github.com/iree-org/iree/actions/workflows/ci_windows_x64_msvc.yml?query=branch%3Amain+event%3Aschedule) diff --git a/docs/website/docs/developers/general/github-actions.md b/docs/website/docs/developers/general/github-actions.md index 8d947f85c32b..4480e092daf1 100644 --- a/docs/website/docs/developers/general/github-actions.md +++ b/docs/website/docs/developers/general/github-actions.md @@ -69,7 +69,29 @@ graph ## :material-list-status: Workflow descriptions and status -### Package tests +### "CI" - Core builds and tests + +These workflows build the project from source then run unit tests. + +* To keep these workflows focused, they should not need any special hardware + (e.g. GPUs). +* Some workflows in this category use sanitizers, debug builds, alternate + compilers, and other features that maintainers want automated coverage for. 
+ +Workflow file | Build status | Event triggers +-- | --: | -- +[`ci.yml`](https://github.com/iree-org/iree/blob/main/.github/workflows/ci.yml) | [![cI](https://github.com/iree-org/iree/actions/workflows/ci.yml/badge.svg?query=branch%3Amain+event%3Apush)](https://github.com/iree-org/iree/actions/workflows/ci.yml?query=branch%3Amain+event%3Apush) | `pull_request`, `push` + | | +[`ci_linux_arm64_clang.yml`](https://github.com/iree-org/iree/blob/main/.github/workflows/ci_linux_arm64_clang.yml) | [![CI - Linux arm64 clang](https://github.com/iree-org/iree/actions/workflows/ci_linux_arm64_clang.yml/badge.svg?query=branch%3Amain+event%3Aschedule)](https://github.com/iree-org/iree/actions/workflows/ci_linux_arm64_clang.yml?query=branch%3Amain+event%3Aschedule) | `schedule` +[`ci_macos_x64_clang.yml`](https://github.com/iree-org/iree/blob/main/.github/workflows/ci_macos_x64_clang.yml) | [![CI - macOS x64 clang](https://github.com/iree-org/iree/actions/workflows/ci_macos_x64_clang.yml/badge.svg?query=branch%3Amain+event%3Aschedule)](https://github.com/iree-org/iree/actions/workflows/ci_macos_x64_clang.yml?query=branch%3Amain+event%3Aschedule) | `schedule` +[`ci_windows_x64_msvc.yml`](https://github.com/iree-org/iree/blob/main/.github/workflows/ci_windows_x64_msvc.yml) | [![CI - Windows x64 MSVC](https://github.com/iree-org/iree/actions/workflows/ci_windows_x64_msvc.yml/badge.svg?query=branch%3Amain+event%3Aschedule)](https://github.com/iree-org/iree/actions/workflows/ci_windows_x64_msvc.yml?query=branch%3Amain+event%3Aschedule) | `schedule` + | | +[`ci_linux_x64_clang_byollvm.yml`](https://github.com/iree-org/iree/blob/main/.github/workflows/ci_linux_x64_clang_byollvm.yml) | [![CI - Linux x64 clang_byollvm](https://github.com/iree-org/iree/actions/workflows/ci_linux_x64_clang_byollvm.yml/badge.svg?query=branch%3Amain+event%3Aschedule)](https://github.com/iree-org/iree/actions/workflows/ci_linux_x64_clang_byollvm.yml?query=branch%3Amain+event%3Aschedule) | `schedule` 
+[`ci_linux_x64_clang_debug.yml`](https://github.com/iree-org/iree/blob/main/.github/workflows/ci_linux_x64_clang_debug.yml) | [![CI - Linux x64 clang debug](https://github.com/iree-org/iree/actions/workflows/ci_linux_x64_clang_debug.yml/badge.svg?query=branch%3Amain+event%3Aschedule)](https://github.com/iree-org/iree/actions/workflows/ci_linux_x64_clang_debug.yml?query=branch%3Amain+event%3Aschedule) | `schedule` +[`ci_linux_x64_clang_tsan.yml`](https://github.com/iree-org/iree/blob/main/.github/workflows/ci_linux_x64_clang_tsan.yml) | [![CI - Linux x64 clang TSan](https://github.com/iree-org/iree/actions/workflows/ci_linux_x64_clang_tsan.yml/badge.svg?query=branch%3Amain+event%3Aschedule)](https://github.com/iree-org/iree/actions/workflows/ci_linux_x64_clang_tsan.yml?query=branch%3Amain+event%3Aschedule) | `schedule` +[`ci_linux_x64_gcc.yml`](https://github.com/iree-org/iree/blob/main/.github/workflows/ci_linux_x64_gcc.yml) | [![CI - Linux x64 gcc](https://github.com/iree-org/iree/actions/workflows/ci_linux_x64_gcc.yml/badge.svg?query=branch%3Amain+event%3Aschedule)](https://github.com/iree-org/iree/actions/workflows/ci_linux_x64_gcc.yml?query=branch%3Amain+event%3Aschedule) | `schedule` + +### "PkgCI" - Package builds and tests These workflows build packages from source then run test suites using them. @@ -102,50 +124,12 @@ Workflow file | Build status | Event triggers Package tests | | [`pkgci.yml`](https://github.com/iree-org/iree/blob/main/.github/workflows/pkgci.yml) | [![PkgCI](https://github.com/iree-org/iree/actions/workflows/pkgci.yml/badge.svg?query=branch%3Amain+event%3Apush)](https://github.com/iree-org/iree/actions/workflows/pkgci.yml?query=branch%3Amain+event%3Apush) | `pull_request`, `push` -### Platform builds - -These workflows build the full project from source using standard options then -run basic tests. - -* To keep these workflows focused, they should not need any special hardware - (e.g. GPUs). 
- -Workflow file | Build status | Event triggers --- | --: | -- -[`ci_linux_x64_clang.yml`](https://github.com/iree-org/iree/blob/main/.github/workflows/ci_linux_x64_clang.yml) | [![CI - Linux x64 clang](https://github.com/iree-org/iree/actions/workflows/ci_linux_x64_clang.yml/badge.svg?query=branch%3Amain+event%3Apush)](https://github.com/iree-org/iree/actions/workflows/ci_linux_x64_clang.yml?query=branch%3Amain+event%3Apush) | `pull_request`, `push` -[`ci_linux_arm64_clang.yml`](https://github.com/iree-org/iree/blob/main/.github/workflows/ci_linux_arm64_clang.yml) | [![CI - Linux arm64 clang](https://github.com/iree-org/iree/actions/workflows/ci_linux_arm64_clang.yml/badge.svg?query=branch%3Amain+event%3Aschedule)](https://github.com/iree-org/iree/actions/workflows/ci_linux_arm64_clang.yml?query=branch%3Amain+event%3Aschedule) | `schedule` -[`ci_macos_x64_clang.yml`](https://github.com/iree-org/iree/blob/main/.github/workflows/ci_macos_x64_clang.yml) | [![CI - macOS x64 clang](https://github.com/iree-org/iree/actions/workflows/ci_macos_x64_clang.yml/badge.svg?query=branch%3Amain+event%3Aschedule)](https://github.com/iree-org/iree/actions/workflows/ci_macos_x64_clang.yml?query=branch%3Amain+event%3Aschedule) | `schedule` -[`ci_windows_x64_msvc.yml`](https://github.com/iree-org/iree/blob/main/.github/workflows/ci_windows_x64_msvc.yml) | [![CI - Windows x64 MSVC](https://github.com/iree-org/iree/actions/workflows/ci_windows_x64_msvc.yml/badge.svg?query=branch%3Amain+event%3Aschedule)](https://github.com/iree-org/iree/actions/workflows/ci_windows_x64_msvc.yml?query=branch%3Amain+event%3Aschedule) | `schedule` - - - -### Other build configurations - -These workflows build the full project from source using optional settings -then run basic tests. - -* Workflows in this category can use sanitizers, debug builds, alternate - compilers, and other features that maintainers want automated coverage for. 
- -Workflow file | Build status | Event triggers --- | --: | -- -[`ci_linux_x64_clang_asan.yml`](https://github.com/iree-org/iree/blob/main/.github/workflows/ci_linux_x64_clang_asan.yml) | [![CI - Linux x64 clang ASan](https://github.com/iree-org/iree/actions/workflows/ci_linux_x64_clang_asan.yml/badge.svg?query=branch%3Amain+event%3Apush)](https://github.com/iree-org/iree/actions/workflows/ci_linux_x64_clang_asan.yml?query=branch%3Amain+event%3Apush) | `pull_request`, `push` -[`ci_linux_x64_clang_tsan.yml`](https://github.com/iree-org/iree/blob/main/.github/workflows/ci_linux_x64_clang_tsan.yml) | [![CI - Linux x64 clang TSan](https://github.com/iree-org/iree/actions/workflows/ci_linux_x64_clang_tsan.yml/badge.svg?query=branch%3Amain+event%3Aschedule)](https://github.com/iree-org/iree/actions/workflows/ci_linux_x64_clang_tsan.yml?query=branch%3Amain+event%3Aschedule) | `schedule` -[`ci_linux_x64_clang_debug.yml`](https://github.com/iree-org/iree/blob/main/.github/workflows/ci_linux_x64_clang_debug.yml) | [![CI - Linux x64 clang debug](https://github.com/iree-org/iree/actions/workflows/ci_linux_x64_clang_debug.yml/badge.svg?query=branch%3Amain+event%3Aschedule)](https://github.com/iree-org/iree/actions/workflows/ci_linux_x64_clang_debug.yml?query=branch%3Amain+event%3Aschedule) | `schedule` -[`ci_linux_x64_gcc.yml`](https://github.com/iree-org/iree/blob/main/.github/workflows/ci_linux_x64_gcc.yml) | [![CI - Linux x64 gcc](https://github.com/iree-org/iree/actions/workflows/ci_linux_x64_gcc.yml/badge.svg?query=branch%3Amain+event%3Aschedule)](https://github.com/iree-org/iree/actions/workflows/ci_linux_x64_gcc.yml?query=branch%3Amain+event%3Aschedule) | `schedule` -[`ci_linux_x64_clang_byollvm.yml`](https://github.com/iree-org/iree/blob/main/.github/workflows/ci_linux_x64_clang_byollvm.yml) | [![CI - Linux x64 
clang_byollvm](https://github.com/iree-org/iree/actions/workflows/ci_linux_x64_clang_byollvm.yml/badge.svg?query=branch%3Amain+event%3Aschedule)](https://github.com/iree-org/iree/actions/workflows/ci_linux_x64_clang_byollvm.yml?query=branch%3Amain+event%3Aschedule) | `schedule` -[`ci_linux_x64_bazel.yml`](https://github.com/iree-org/iree/blob/main/.github/workflows/ci_linux_x64_bazel.yml) | [![CI - Linux x64 bazel](https://github.com/iree-org/iree/actions/workflows/ci_linux_x64_bazel.yml/badge.svg?query=branch%3Amain+event%3Apush)](https://github.com/iree-org/iree/actions/workflows/ci_linux_x64_bazel.yml?query=branch%3Amain+event%3Apush) | `pull_request`, `push` - - - - ### Other workflows Workflow file | Build status | Event triggers -- | --: | -- -[`ci.yml`](https://github.com/iree-org/iree/blob/main/.github/workflows/ci.yml) | [![cI](https://github.com/iree-org/iree/actions/workflows/ci.yml/badge.svg?query=branch%3Amain+event%3Apush)](https://github.com/iree-org/iree/actions/workflows/ci.yml?query=branch%3Amain+event%3Apush) | `pull_request`, `push` [`build_package.yml`](https://github.com/iree-org/iree/blob/main/.github/workflows/build_package.yml) | [![Build Release Packages](https://github.com/iree-org/iree/actions/workflows/build_package.yml/badge.svg)](https://github.com/iree-org/iree/actions/workflows/build_package.yml) | `schedule` -[`publish_website.yml`](https://github.com/iree-org/iree/blob/main/.github/workflows/publish_website.yml) | [![publish_website](https://github.com/iree-org/iree/actions/workflows/publish_website.yml/badge.svg?query=branch%3Amain+event%3Apush)](https://github.com/iree-org/iree/actions/workflows/publish_website.yml?query=branch%3Amain+event%3Apush) | `push` +[`publish_website.yml`](https://github.com/iree-org/iree/blob/main/.github/workflows/publish_website.yml) | 
[![publish_website](https://github.com/iree-org/iree/actions/workflows/publish_website.yml/badge.svg?query=branch%3Amain+event%3Apush)](https://github.com/iree-org/iree/actions/workflows/publish_website.yml?query=branch%3Amain+event%3Apush) | `push`, `release`, `schedule` [`samples.yml`](https://github.com/iree-org/iree/blob/main/.github/workflows/samples.yml) | [![Samples](https://github.com/iree-org/iree/actions/workflows/samples.yml/badge.svg?query=branch%3Amain+event%3Aschedule)](https://github.com/iree-org/iree/actions/workflows/samples.yml?query=branch%3Amain+event%3Aschedule) | `schedule` ## :octicons-pencil-16: Writing and editing workflows From fb4d09470dc4be674de810dbfbf2d3764e2970ba Mon Sep 17 00:00:00 2001 From: Benoit Jacob Date: Thu, 19 Dec 2024 12:31:42 -0500 Subject: [PATCH 50/64] Ukernel lowering for data-tiled `multi_mma` with `mfma_i32_16x16x32_i8` (#19522) This finishes implementing an initial ukernel for `multi_mma` for `DataTiledMMAAttr` with `kind = mfma_i32_16x16x32_i8`. The ukernel takes unroll and subgroup parameters as function parameters. The idea is that once inlining works as intended, these function parameters will be constants and the optimized code will be the same as if we had hardcoded specific values. This inlining isn't happening at the moment, but that is a bug that we should fix first. It is happening in LLVMCPU, so that's probably something missing in LLVMGPU. The ukernel file has a comment with a few TODOs to get from this initial naive ukernel to something faster. The first step is to fix the above-mentioned inlining problem, then get shared memory, then get better instruction scheduling. 
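
The parameter-passing idea above can be illustrated with a minimal C sketch. This is not the ukernel added by this patch: `toy_mma`, `toy_mma_8x2x2`, and `toy_mma_selfcheck` are hypothetical names, the arithmetic is a plain scalar loop rather than MFMA intrinsics, and the 8x2x2 unroll shape is only an example. The sketch assumes the compiler inlines `toy_mma` into its caller, which is the behavior this message says is not yet happening on LLVMGPU.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical stand-in for a ukernel body (not the real
// iree_uk_amdgpu_multi_mma_* code): the unroll factors are ordinary function
// parameters instead of being encoded in the function name.
static inline void toy_mma(const int8_t *lhs, const int8_t *rhs, int32_t *acc,
                           int unroll_m, int unroll_n, int unroll_k) {
  assert(unroll_m > 0 && unroll_n > 0 && unroll_k > 0);
  for (int m = 0; m < unroll_m; ++m) {
    for (int n = 0; n < unroll_n; ++n) {
      for (int k = 0; k < unroll_k; ++k) {
        // Accumulate one int8 x int8 product into the int32 accumulator tile.
        acc[m * unroll_n + n] +=
            (int32_t)lhs[m * unroll_k + k] * (int32_t)rhs[n * unroll_k + k];
      }
    }
  }
}

// Hypothetical call site that passes constants. If toy_mma is inlined here,
// the loop bounds are known at compile time, so the optimizer is free to
// unroll the loops as if the shape had been hardcoded.
static void toy_mma_8x2x2(const int8_t *lhs, const int8_t *rhs, int32_t *acc) {
  toy_mma(lhs, rhs, acc, 8, 2, 2);
}

// Self-check: with all-ones inputs, every accumulator gains unroll_k (= 2).
static int toy_mma_selfcheck(void) {
  int8_t lhs[8 * 2];
  int8_t rhs[2 * 2];
  int32_t acc[8 * 2] = {0};
  for (int i = 0; i < 8 * 2; ++i) lhs[i] = 1;
  for (int i = 0; i < 2 * 2; ++i) rhs[i] = 1;
  toy_mma_8x2x2(lhs, rhs, acc);
  for (int i = 0; i < 8 * 2; ++i) {
    if (acc[i] != 2) return 0;
  }
  return 1;
}
```

The design point is that a single generic function can replace one source file per unroll/subgroup shape, provided inlining turns the parameters back into constants at each call site.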
Signed-off-by: Benoit Jacob --- .../target/ROCM/builtins/ukernel/BUILD.bazel | 8 +-- .../ROCM/builtins/ukernel/CMakeLists.txt | 8 +-- .../target/ROCM/builtins/ukernel/common.h | 1 - ...uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8.c | 65 +++++++++++++++++++ ...i32_16x16x32_i8_unroll8x2x2_subgroups1x4.c | 53 --------------- .../test/config_ukernel_multi_mma_gfx942.mlir | 4 +- .../Codegen/Common/GPU/GPULowerToUKernels.cpp | 52 +++++++++++++-- .../compiler/Codegen/Common/GPU/Passes.td | 2 + .../GPU/test/gpu_lower_to_ukernels.mlir | 32 +++++++-- .../Dialect/GPU/Transforms/Transforms.cpp | 11 +--- .../test/distribute_mma_to_lanes.mlir | 2 +- .../iree/compiler/Codegen/LLVMGPU/Passes.cpp | 3 + .../LLVMGPU/Utils/LLVMGPUSelectUKernels.cpp | 13 +--- tests/e2e/matmul/CMakeLists.txt | 30 +++++++++ 14 files changed, 189 insertions(+), 95 deletions(-) create mode 100644 compiler/plugins/target/ROCM/builtins/ukernel/iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8.c delete mode 100644 compiler/plugins/target/ROCM/builtins/ukernel/iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8_unroll8x2x2_subgroups1x4.c diff --git a/compiler/plugins/target/ROCM/builtins/ukernel/BUILD.bazel b/compiler/plugins/target/ROCM/builtins/ukernel/BUILD.bazel index 840d45fc27cb..7bcedce7e29f 100644 --- a/compiler/plugins/target/ROCM/builtins/ukernel/BUILD.bazel +++ b/compiler/plugins/target/ROCM/builtins/ukernel/BUILD.bazel @@ -60,19 +60,19 @@ argmax_bc_files = [ ] iree_amdgpu_bitcode_library( - name = "iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8_unroll8x2x2_subgroups1x4_gfx942", + name = "iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8_gfx942", srcs = [ "common.h", - "iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8_unroll8x2x2_subgroups1x4.c", + "iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8.c", ], - out = "iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8_unroll8x2x2_subgroups1x4.gfx942.bc", + out = "iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8.gfx942.bc", gpu_arch = "gfx942", ) iree_c_embed_data( name = 
"iree_uk_amdgpu_bitcode", srcs = argmax_bc_files + [ - "iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8_unroll8x2x2_subgroups1x4.gfx942.bc", + "iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8.gfx942.bc", ], c_file_output = "iree_uk_amdgpu_bitcode.c", flatten = True, diff --git a/compiler/plugins/target/ROCM/builtins/ukernel/CMakeLists.txt b/compiler/plugins/target/ROCM/builtins/ukernel/CMakeLists.txt index ad1a19028a5b..97962aaff481 100644 --- a/compiler/plugins/target/ROCM/builtins/ukernel/CMakeLists.txt +++ b/compiler/plugins/target/ROCM/builtins/ukernel/CMakeLists.txt @@ -208,14 +208,14 @@ iree_amdgpu_bitcode_library( iree_amdgpu_bitcode_library( NAME - iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8_unroll8x2x2_subgroups1x4_gfx942 + iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8_gfx942 GPU_ARCH gfx942 SRCS "common.h" - "iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8_unroll8x2x2_subgroups1x4.c" + "iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8.c" OUT - "iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8_unroll8x2x2_subgroups1x4.gfx942.bc" + "iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8.gfx942.bc" ) iree_c_embed_data( @@ -238,7 +238,7 @@ iree_c_embed_data( "iree_uk_amdgpu_argmax_f32i64.gfx1100.bc" "iree_uk_amdgpu_argmax_f32i64.gfx90a.bc" "iree_uk_amdgpu_argmax_f32i64.gfx942.bc" - "iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8_unroll8x2x2_subgroups1x4.gfx942.bc" + "iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8.gfx942.bc" C_FILE_OUTPUT "iree_uk_amdgpu_bitcode.c" H_FILE_OUTPUT diff --git a/compiler/plugins/target/ROCM/builtins/ukernel/common.h b/compiler/plugins/target/ROCM/builtins/ukernel/common.h index 14b65a253c5d..d046986cc9b5 100644 --- a/compiler/plugins/target/ROCM/builtins/ukernel/common.h +++ b/compiler/plugins/target/ROCM/builtins/ukernel/common.h @@ -61,7 +61,6 @@ typedef __UINT64_TYPE__ uint64_t; // Vector typedefs //===----------------------------------------------------------------------===// -typedef __attribute__((__vector_size__(8 * 2))) int64_t int64x2_t; 
typedef __attribute__((__vector_size__(4 * 4))) int32_t int32x4_t; //===----------------------------------------------------------------------===// diff --git a/compiler/plugins/target/ROCM/builtins/ukernel/iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8.c b/compiler/plugins/target/ROCM/builtins/ukernel/iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8.c new file mode 100644 index 000000000000..9029a86ddb59 --- /dev/null +++ b/compiler/plugins/target/ROCM/builtins/ukernel/iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8.c @@ -0,0 +1,65 @@ +// Copyright 2024 The IREE Authors +// +// Licensed under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception + +#include "compiler/plugins/target/ROCM/builtins/ukernel/common.h" + +// Very naive kernel. TODO(bjacob): +// 1. Inlining: the `always_inline` attribute here is correctly preserved in +// the bitcode, but isn't having the intended effect of inlining calls to +// this function. Making that work is key as various function parameters +// (e.g. `unroll_m`) are meant to be constants. +// 2. Shared memory: can't allocate it within the microkernel (which is just a +// helper device function, not the actual amdgpu_kernel). Need to get it +// passed down here as a `T [[clang::address_space(3)]] *` parameter. +// 3. Better scheduling via either barrier intrinsics or inline assemby. +// 4. Subgroups1x4 being asymmetric is a historical accident... should be 2x2. +[[clang::always_inline]] void iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8( + const int8_t *a_buffer, int64_t a_offset, const int8_t *b_buffer, + int64_t b_offset, int32_t *c_buffer, int64_t c_offset, int32_t k_size, + int32_t unroll_m, int32_t subgroups_m, int32_t unroll_n, + int32_t subgroups_n, int32_t unroll_k) { + /* + TODO(bjacob): reenable this once inlining works. + // Load existing accumulators. 
This is a VLA, but should become fixed-size + // once this function is inlined and unroll_* factors become constants. + int32x4_t c[unroll_m][unroll_n]; + */ + // Load existing accumulators. + if (unroll_m > 8 || unroll_n > 2) { + __builtin_trap(); + } + int32x4_t c[8][2]; + int32x4_t *c_global = (int32x4_t *)(c_buffer + c_offset); + for (int m = 0; m < unroll_m; ++m) { + for (int n = 0; n < unroll_n; ++n) { + c[m][n] = c_global[64 * (m * unroll_n + n)]; + } + } + + // Arithmetic loop. + const int64_t *a_global = (const int64_t *)(a_buffer + a_offset); + const int64_t *b_global = (const int64_t *)(b_buffer + b_offset); + for (int k_outer = 0; k_outer < k_size; ++k_outer) { + for (int m = 0; m < unroll_m; ++m) { + for (int n = 0; n < unroll_n; ++n) { + for (int k = 0; k < unroll_k; ++k) { + c[m][n] = __builtin_amdgcn_mfma_i32_16x16x32_i8( + a_global[64 * unroll_k * m + k], b_global[64 * unroll_k * n + k], + c[m][n], 0, 0, 0); + } + } + } + a_global += 64 * unroll_m * subgroups_m * unroll_k; + b_global += 64 * unroll_n * subgroups_n * unroll_k; + } + + // Store accumulators. + for (int m = 0; m < unroll_m; ++m) { + for (int n = 0; n < unroll_n; ++n) { + c_global[64 * (m * unroll_n + n)] = c[m][n]; + } + } +} diff --git a/compiler/plugins/target/ROCM/builtins/ukernel/iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8_unroll8x2x2_subgroups1x4.c b/compiler/plugins/target/ROCM/builtins/ukernel/iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8_unroll8x2x2_subgroups1x4.c deleted file mode 100644 index 7d0e2643050e..000000000000 --- a/compiler/plugins/target/ROCM/builtins/ukernel/iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8_unroll8x2x2_subgroups1x4.c +++ /dev/null @@ -1,53 +0,0 @@ -// Copyright 2024 The IREE Authors -// -// Licensed under the Apache License v2.0 with LLVM Exceptions. -// See https://llvm.org/LICENSE.txt for license information. 
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception - -#include "compiler/plugins/target/ROCM/builtins/ukernel/common.h" - -// Very naive kernel. TODO(bjacob): -// 1. Shared memory: can't allocate it within the microkernel (which is just a -// helper device function, not the actual amdgpu_kernel). Need to get it -// passed down here as a `T [[clang::address_space(3)]] *` parameter. -// 2. Better scheduling via either barrier intrinsics or inline assemby. -// 3. Subgroups1x4 being asymmetric is a historical accident... should be 2x2. -[[clang::always_inline]] void -iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8_unroll8x2x2_subgroups1x4( - const int8_t *a_buffer, int64_t a_offset, const int8_t *b_buffer, - int64_t b_offset, int32_t *c_buffer, int64_t c_offset, int64_t k_size) { - int tid = __builtin_amdgcn_workitem_id_x(); - - // Load existing accumulators. - int32x4_t acc[8][2] = {{0}}; - int32x4_t *c_global = (int32x4_t *)(c_buffer + c_offset); - for (int i = 0; i < 8; ++i) { - for (int j = 0; j < 2; ++j) { - acc[i][j] = c_global[256 * (2 * i + j) + tid]; - } - } - - // Arithmetic loop. - const int64x2_t *a_global = - (const int64x2_t *)(a_buffer + a_offset) + (tid % 64); - const int64x2_t *b_global = (const int64x2_t *)(b_buffer + b_offset) + tid; - for (int k_outer = 0; k_outer < k_size; ++k_outer) { - for (int i = 0; i < 8; ++i) { - for (int j = 0; j < 2; ++j) { - for (int k = 0; k < 2; ++k) { - acc[i][j] = __builtin_amdgcn_mfma_i32_16x16x32_i8( - a_global[64 * i][k], b_global[256 * j][k], acc[i][j], 0, 0, 0); - } - } - } - a_global += 512; - b_global += 512; - } - - // Store accumulators. 
- for (int i = 0; i < 8; ++i) { - for (int j = 0; j < 2; ++j) { - c_global[256 * (2 * i + j) + tid] = acc[i][j]; - } - } -} diff --git a/compiler/plugins/target/ROCM/test/config_ukernel_multi_mma_gfx942.mlir b/compiler/plugins/target/ROCM/test/config_ukernel_multi_mma_gfx942.mlir index 646418f80666..2fd78139ed59 100644 --- a/compiler/plugins/target/ROCM/test/config_ukernel_multi_mma_gfx942.mlir +++ b/compiler/plugins/target/ROCM/test/config_ukernel_multi_mma_gfx942.mlir @@ -23,7 +23,7 @@ func.func @multi_mma_mfma_i32_16x16x32_i8(%a : tensor<1x2x8x4x16x2x8xi8>, // CHECK-LABEL: @multi_mma_mfma_i32_16x16x32_i8 // CHECK: iree_gpu.multi_mma -// CHECK-SAME: #hal.executable.object<{path = "iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8_unroll8x2x2_subgroups1x4.gfx942.bc" +// CHECK-SAME: #hal.executable.object<{path = "iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8.gfx942.bc" // CHECK-NOT: promote_operands // CHECK-SAME: reduction = [0, 0, 0] -// CHECK-SAME: #iree_gpu.ukernel_config matchArgmaxDAGForUKernel(RewriterBase &rewriter, linalg::GenericOp op) { Value input = op.getDpsInputOperand(0)->get(); - auto inputType = cast(input.getType()); Value index = op.getDpsInitOperand(1)->get(); auto indexType = cast(index.getType()); - std::string suffix; - llvm::raw_string_ostream(suffix) - << inputType.getElementType() << indexType.getElementType(); auto loweringConfig = getLoweringConfig(op); if (!loweringConfig) { return rewriter.notifyMatchFailure(op, "no lowering_config on this op"); @@ -84,6 +81,50 @@ struct LowerArgmaxToUKernelPattern : OpRewritePattern { } }; +struct LowerMultiMmaToUKernelPattern : OpRewritePattern { + LowerMultiMmaToUKernelPattern(MLIRContext *context) + : OpRewritePattern(context) {} + + LogicalResult matchAndRewrite(IREE::GPU::MultiMmaOp op, + PatternRewriter &rewriter) const override { + auto loweringConfig = getLoweringConfig(op); + if (!loweringConfig) { + return rewriter.notifyMatchFailure(op, "no lowering_config on this op"); + } + 
IREE::GPU::UKernelConfigAttr ukernelAttr = + IREE::GPU::getUkernelSpec(loweringConfig); + if (!ukernelAttr) { + return rewriter.notifyMatchFailure(op, "no ukernel selected for this op"); + } + auto mma = dyn_cast(op.getKind()); + if (!mma) { + return rewriter.notifyMatchFailure(op, "unhandled MMAInterfaceAttr"); + } + auto castIndexToI32 = [&](Value val) { + return rewriter.create(op.getLoc(), + rewriter.getI32Type(), val); + }; + auto constI32 = [&](int val) { + return rewriter.create(op.getLoc(), val, + rewriter.getI32Type()); + }; + Value k = castIndexToI32( + rewriter.create(op.getLoc(), op.getLhs(), 1)); + Value unrollM = constI32(mma.getUnrollM()); + Value subgroupsM = constI32(mma.getSubgroupsM()); + Value unrollN = constI32(mma.getUnrollN()); + Value subgroupsN = constI32(mma.getSubgroupsN()); + Value unrollK = constI32(mma.getUnrollK()); + rewriter.replaceOpWithNewOp( + op, TypeRange{op.getAccType()}, ukernelAttr.getName(), + ValueRange{op.getLhs(), op.getRhs()}, op.getAcc(), + ValueRange{k, unrollM, subgroupsM, unrollN, subgroupsN, unrollK}, + ukernelAttr.getDefAttrs(), + /*strided_outer_dims=*/rewriter.getIndexAttr(0)); + return success(); + } +}; + struct GPULowerToUKernelsPass final : impl::GPULowerToUKernelsPassBase { void runOnOperation() override { @@ -101,7 +142,8 @@ struct GPULowerToUKernelsPass final // evidence that it is difficult for codegen to consistently approach // microkernels performance, and that consideration overrides the benefit of // fusions for these ops. 
- patterns.insert(context); + patterns.add( + context); if (failed(applyPatternsAndFoldGreedily(getOperation(), std::move(patterns)))) { return signalPassFailure(); diff --git a/compiler/src/iree/compiler/Codegen/Common/GPU/Passes.td b/compiler/src/iree/compiler/Codegen/Common/GPU/Passes.td index ff2b2b94f9b2..24552cbdfee0 100644 --- a/compiler/src/iree/compiler/Codegen/Common/GPU/Passes.td +++ b/compiler/src/iree/compiler/Codegen/Common/GPU/Passes.td @@ -111,6 +111,8 @@ def GPULowerToUKernelsPass : let dependentDialects = [ "::mlir::iree_compiler::IREE::Codegen::IREECodegenDialect", "::mlir::iree_compiler::IREE::GPU::IREEGPUDialect", + "::mlir::arith::ArithDialect", + "::mlir::tensor::TensorDialect", ]; } diff --git a/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_lower_to_ukernels.mlir b/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_lower_to_ukernels.mlir index 7acab19f945a..bc9331fea2cc 100644 --- a/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_lower_to_ukernels.mlir +++ b/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_lower_to_ukernels.mlir @@ -1,9 +1,7 @@ // RUN: iree-opt --split-input-file --pass-pipeline="builtin.module(func.func(iree-codegen-gpu-lower-to-ukernels,cse,canonicalize))" %s | FileCheck %s #config = #iree_gpu.lowering_config<{ukernel = #iree_gpu.ukernel_config}> -func.func @argmax_f32i64_with_selected_ukernel(%arg0 : tensor<1x?xf32>) -> tensor<1xi64> attributes { - hal.executable.target = #hal.executable.target<"rocm", "rocm-hsaco-fb", {ukernels = "all"}> -} { +func.func @argmax_f32i64_with_selected_ukernel(%arg0 : tensor<1x?xf32>) -> tensor<1xi64> { %c0_i64 = arith.constant 0 : i64 %cst = arith.constant 0xFF800000 : f32 %0 = tensor.empty() : tensor<1xi64> @@ -42,9 +40,7 @@ func.func @argmax_f32i64_with_selected_ukernel(%arg0 : tensor<1x?xf32>) -> tenso // ----- -func.func @argmax_f32i64_without_selected_ukernel(%arg0 : tensor<1x?xf32>) -> tensor<1xi64> attributes { - hal.executable.target = 
#hal.executable.target<"rocm", "rocm-hsaco-fb", {ukernels = "all"}> -} { +func.func @argmax_f32i64_without_selected_ukernel(%arg0 : tensor<1x?xf32>) -> tensor<1xi64> { %c0_i64 = arith.constant 0 : i64 %cst = arith.constant 0xFF800000 : f32 %0 = tensor.empty() : tensor<1xi64> @@ -70,3 +66,27 @@ func.func @argmax_f32i64_without_selected_ukernel(%arg0 : tensor<1x?xf32>) -> te //CHECK-LABEL: func @argmax_f32i64_without_selected_ukernel( // CHECK-NOT: iree_codegen.ukernel.generic // CHECK: linalg.generic + +// ----- + +func.func @multi_mma_mfma_i32_16x16x32_i8(%a : tensor<1x2x8x1x1x2x8xi8>, %b : tensor<1x2x1x2x1x1x2x8xi8>, %c : tensor<1x1x1x8x2x1x1x4xi32>) -> tensor<1x1x1x8x2x1x1x4xi32> { + %d = iree_gpu.multi_mma %a, %b, %c { + indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>], + iterator_types = [#iree_gpu.iterator_type, #iree_gpu.iterator_type, #iree_gpu.iterator_type], + kind = #iree_gpu.data_tiled_mma_layout, + lowering_config = #iree_gpu.lowering_config<{ + reduction = [0, 0, 0], + ukernel = #iree_gpu.ukernel_config, + workgroup = [1, 1, 0]}> + } : tensor<1x2x8x1x1x2x8xi8>, tensor<1x2x1x2x1x1x2x8xi8> into tensor<1x1x1x8x2x1x1x4xi32> + return %d : tensor<1x1x1x8x2x1x1x4xi32> +} + +// CHECK-LABEL: func @multi_mma_mfma_i32_16x16x32_i8( +// CHECK-DAG: %c2_i32 = arith.constant 2 : i32 +// CHECK-DAG: %c8_i32 = arith.constant 8 : i32 +// CHECK-DAG: %c1_i32 = arith.constant 1 : i32 +// CHECK-DAG: %c4_i32 = arith.constant 4 : i32 +// CHECK: %[[MICRO_KERNEL:.+]] = iree_codegen.ukernel.generic +// CHECK-SAME: "iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8" +// CHECK-SAME: (%c2_i32, %c8_i32, %c1_i32, %c2_i32, %c4_i32, %c2_i32 : i32, i32, i32, i32, i32, i32) diff --git a/compiler/src/iree/compiler/Codegen/Dialect/GPU/Transforms/Transforms.cpp b/compiler/src/iree/compiler/Codegen/Dialect/GPU/Transforms/Transforms.cpp index 75bf5e51d54c..659f2a9487a1 100644 --- 
a/compiler/src/iree/compiler/Codegen/Dialect/GPU/Transforms/Transforms.cpp +++ b/compiler/src/iree/compiler/Codegen/Dialect/GPU/Transforms/Transforms.cpp @@ -620,16 +620,11 @@ distributeMultiMmaOp(RewriterBase &rewriter, IREE::GPU::MultiMmaOp mmaOp, accStrides); // Step 3. Create the new multi_mma op. - auto newKind = mmaOp.getKind(); - if (auto dataTiledMma = dyn_cast(newKind)) { - newKind = DataTiledMMAAttr::get( - context, dataTiledMma.getIntrinsic(), dataTiledMma.getUnrollM(), - /*subgroups_m=*/1, dataTiledMma.getUnrollN(), - /*subgroups_n=*/1, dataTiledMma.getUnrollK()); - } auto newMmaOp = rewriter.create( loc, lhsSlice, rhsSlice, accSlice, mmaOp.getIndexingMaps(), - mmaOp.getIteratorTypes(), newKind); + mmaOp.getIteratorTypes(), mmaOp.getKind()); + + newMmaOp->setDiscardableAttrs(mmaOp->getDiscardableAttrDictionary()); // Step 4. Insert the result of the multi_mma using the same offsets/sizes as // the accumulator slice. diff --git a/compiler/src/iree/compiler/Codegen/Dialect/GPU/Transforms/test/distribute_mma_to_lanes.mlir b/compiler/src/iree/compiler/Codegen/Dialect/GPU/Transforms/test/distribute_mma_to_lanes.mlir index 07729a11e2b5..a5a0ff14e9cb 100644 --- a/compiler/src/iree/compiler/Codegen/Dialect/GPU/Transforms/test/distribute_mma_to_lanes.mlir +++ b/compiler/src/iree/compiler/Codegen/Dialect/GPU/Transforms/test/distribute_mma_to_lanes.mlir @@ -471,7 +471,7 @@ func.func @data_tiled_2x2x4_tensor_multi_mma_unrolled_to_subgroups(%lhs: tensor< // CHECK-DAG: %[[ACC_SLICE:.+]] = tensor.extract_slice %[[ACC_ARG]] // CHECK-SAME: [0, 0, %[[ACC_IDS]]#1, %[[ACC_IDS]]#2, %[[ACC_IDS]]#3, %[[ACC_IDS]]#4, 0] [1, 1, 1, 1, 1, 1, 4] [1, 1, 1, 1, 1, 1, 1] // CHECK: %[[MMA:.+]] = iree_gpu.multi_mma %[[LHS_SLICE]], %[[RHS_SLICE]], %[[ACC_SLICE]] -// CHECK-SAME: kind = #iree_gpu.data_tiled_mma_layout} +// CHECK-SAME: kind = #iree_gpu.data_tiled_mma_layout} // CHECK-SAME: : tensor<1x1x1x1x1x4xf32>, tensor<1x1x1x1x1x4xf32> into tensor<1x1x1x1x1x1x4xf32> // CHECK: 
tensor.parallel_insert_slice %[[MMA]] into %[[ACC_ARG]] // CHECK-SAME: [0, 0, %[[ACC_IDS]]#1, %[[ACC_IDS]]#2, %[[ACC_IDS]]#3, %[[ACC_IDS]]#4, 0] [1, 1, 1, 1, 1, 1, 4] [1, 1, 1, 1, 1, 1, 1] diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.cpp b/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.cpp index d460a1b9f56b..f8399d3c69a2 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.cpp +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.cpp @@ -410,6 +410,9 @@ void addGPUTileAndFusePassPipeline(OpPassManager &funcPassManager, } funcPassManager.addPass(IREE::GPU::createDistributeMmaToLanesPass()); + // Step 4.5. Things that need to happen right after distribution to threads. + funcPassManager.addPass(createGPULowerToUKernelsPass()); + // Normalize loop bounds for later lowerings. funcPassManager.addPass(iree_compiler::createNormalizeLoopBoundsPass( NormalizeLoopBoundsPassOptions{/*normalizeFor=*/false, diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/Utils/LLVMGPUSelectUKernels.cpp b/compiler/src/iree/compiler/Codegen/LLVMGPU/Utils/LLVMGPUSelectUKernels.cpp index 453669db7426..8d81cf78ec61 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/Utils/LLVMGPUSelectUKernels.cpp +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/Utils/LLVMGPUSelectUKernels.cpp @@ -42,17 +42,8 @@ getUKernelNameAndSuffixForMultiMma(IREE::GPU::MultiMmaOp op) { if (!mma) { return {}; // Only handling DataTiledMMAAttr for now. 
} - std::string suffix{ - stringifyMMAIntrinsic(mma.getIntrinsic().getValue()).lower()}; - if (mma.getUnrollM() != 1 || mma.getUnrollN() != 1 || mma.getUnrollK() != 1) { - suffix += llvm::formatv("_unroll{}x{}x{}", mma.getUnrollM(), - mma.getUnrollN(), mma.getUnrollK()); - } - if (mma.getSubgroupsM() != 1 || mma.getSubgroupsN() != 1) { - suffix += llvm::formatv("_subgroups{}x{}", mma.getSubgroupsM(), - mma.getSubgroupsN()); - } - return {"multi_mma", suffix}; + return {"multi_mma", + stringifyMMAIntrinsic(mma.getIntrinsic().getValue()).lower()}; } // Returns ukernel name and suffix for any op. Empty name = no ukernel. diff --git a/tests/e2e/matmul/CMakeLists.txt b/tests/e2e/matmul/CMakeLists.txt index b744d346ebef..cf1ed28038d1 100644 --- a/tests/e2e/matmul/CMakeLists.txt +++ b/tests/e2e/matmul/CMakeLists.txt @@ -1600,6 +1600,36 @@ iree_generated_e2e_runner_test( "requires-gpu-cdna3" ) +iree_generated_e2e_runner_test( + NAME + e2e_matmul_cdna3_dt_uk_i8 + TEST_TYPE + matmul + GENERATOR + "generate_e2e_matmul_tests.py" + GENERATOR_ARGS + "--lhs_rhs_type=i8" + "--acc_type=i32" + TEST_RUNNER + iree_tools_testing_e2e_iree-e2e-matmul-test + TARGET_BACKENDS + "rocm" + DRIVERS + "hip" + COMPILER_FLAGS + ${IREE_HIP_TEST_COMPILER_FLAGS} + "--iree-opt-data-tiling" + "--iree-global-opt-experimental-rocm-data-tiling" + "--iree-global-opt-enable-early-materialization=true" + "--iree-hip-enable-ukernels=multi_mma" + LABELS + "noasan" + "nomsan" + "notsan" + "noubsan" + "requires-gpu-cdna3" +) + iree_generated_e2e_runner_test( NAME e2e_matmul_cdna3_dt_f32 From 83af67948d99cd2731c68dccbbbaba102948b1dc Mon Sep 17 00:00:00 2001 From: Rob Suderman Date: Thu, 19 Dec 2024 12:15:49 -0800 Subject: [PATCH 51/64] Bump to llvm/torch-mlir@061bbc5e1bc4f7880bb565e404a6709f97396818 (#19531) Bumping torch-mlir forwards to head --- .../onnx_ops/onnx_ops_cpu_llvm_sync.json | 12 ++++++++---- .../onnx_ops/onnx_ops_gpu_rocm_rdna3.json | 4 ---- .../onnx_ops/onnx_ops_gpu_vulkan.json | 2 +- 
third_party/torch-mlir | 2 +- 4 files changed, 10 insertions(+), 10 deletions(-) diff --git a/tests/external/iree-test-suites/onnx_ops/onnx_ops_cpu_llvm_sync.json b/tests/external/iree-test-suites/onnx_ops/onnx_ops_cpu_llvm_sync.json index 351c3420407c..8b01994bac56 100644 --- a/tests/external/iree-test-suites/onnx_ops/onnx_ops_cpu_llvm_sync.json +++ b/tests/external/iree-test-suites/onnx_ops/onnx_ops_cpu_llvm_sync.json @@ -120,6 +120,14 @@ "onnx/node/generated/test_nonmaxsuppression_two_batches", "onnx/node/generated/test_nonmaxsuppression_two_classes", "onnx/node/generated/test_nonzero_example", + "onnx/node/generated/test_pow_types_float32_int32", + "onnx/node/generated/test_pow_types_float32_int64", + "onnx/node/generated/test_pow_types_float32_uint32", + "onnx/node/generated/test_pow_types_float32_uint64", + "onnx/node/generated/test_pow_types_int32_float32", + "onnx/node/generated/test_pow_types_int32_int32", + "onnx/node/generated/test_pow_types_int64_float32", + "onnx/node/generated/test_pow_types_int64_int64", "onnx/node/generated/test_quantizelinear_axis", "onnx/node/generated/test_quantizelinear_blocked_asymmetric", "onnx/node/generated/test_quantizelinear_blocked_symmetric", @@ -333,10 +341,6 @@ "onnx/node/generated/test_lstm_with_peepholes", "onnx/node/generated/test_pow", "onnx/node/generated/test_pow_example", - "onnx/node/generated/test_pow_types_float32_int32", - "onnx/node/generated/test_pow_types_float32_int64", - "onnx/node/generated/test_pow_types_float32_uint32", - "onnx/node/generated/test_pow_types_float32_uint64", "onnx/node/generated/test_qlinearmatmul_2D_int8_float16", "onnx/node/generated/test_qlinearmatmul_2D_int8_float32", "onnx/node/generated/test_qlinearmatmul_3D_int8_float16", diff --git a/tests/external/iree-test-suites/onnx_ops/onnx_ops_gpu_rocm_rdna3.json b/tests/external/iree-test-suites/onnx_ops/onnx_ops_gpu_rocm_rdna3.json index a215adb62b79..608667a92f40 100644 --- 
a/tests/external/iree-test-suites/onnx_ops/onnx_ops_gpu_rocm_rdna3.json +++ b/tests/external/iree-test-suites/onnx_ops/onnx_ops_gpu_rocm_rdna3.json @@ -348,10 +348,6 @@ "onnx/node/generated/test_lstm_with_peepholes", "onnx/node/generated/test_pow", "onnx/node/generated/test_pow_example", - "onnx/node/generated/test_pow_types_float32_int32", - "onnx/node/generated/test_pow_types_float32_int64", - "onnx/node/generated/test_pow_types_float32_uint32", - "onnx/node/generated/test_pow_types_float32_uint64", "onnx/node/generated/test_qlinearmatmul_2D_int8_float16", "onnx/node/generated/test_qlinearmatmul_2D_int8_float32", "onnx/node/generated/test_qlinearmatmul_3D_int8_float16", diff --git a/tests/external/iree-test-suites/onnx_ops/onnx_ops_gpu_vulkan.json b/tests/external/iree-test-suites/onnx_ops/onnx_ops_gpu_vulkan.json index eb8e94e5aa36..884d4cda28f7 100644 --- a/tests/external/iree-test-suites/onnx_ops/onnx_ops_gpu_vulkan.json +++ b/tests/external/iree-test-suites/onnx_ops/onnx_ops_gpu_vulkan.json @@ -160,7 +160,6 @@ "onnx/node/generated/test_nonmaxsuppression_flipped_coordinates", "onnx/node/generated/test_nonmaxsuppression_identical_boxes", "onnx/node/generated/test_nonmaxsuppression_limit_output_size", - "onnx/node/generated/test_nonmaxsuppression_single_box", "onnx/node/generated/test_nonmaxsuppression_suppress_by_IOU", "onnx/node/generated/test_nonmaxsuppression_suppress_by_IOU_and_scores", "onnx/node/generated/test_nonmaxsuppression_two_batches", @@ -466,6 +465,7 @@ "onnx/node/generated/test_mod_mixed_sign_int8", "onnx/node/generated/test_mod_uint16", "onnx/node/generated/test_mod_uint8", + "onnx/node/generated/test_nonmaxsuppression_single_box", "onnx/node/generated/test_or_bcast3v1d", "onnx/node/generated/test_or_bcast4v2d", "onnx/node/generated/test_or_bcast4v4d", diff --git a/third_party/torch-mlir b/third_party/torch-mlir index 99115dcdc8cf..061bbc5e1bc4 160000 --- a/third_party/torch-mlir +++ b/third_party/torch-mlir @@ -1 +1 @@ -Subproject commit 
99115dcdc8cff8ce07bd027a12b001ddd7e957f3 +Subproject commit 061bbc5e1bc4f7880bb565e404a6709f97396818 From 07f81f0e074b2846c512012fe7e2f1bfb00b2dee Mon Sep 17 00:00:00 2001 From: MaheshRavishankar <1663364+MaheshRavishankar@users.noreply.github.com> Date: Fri, 20 Dec 2024 02:03:42 -0800 Subject: [PATCH 52/64] Revert "Enable scatter fusion with index operand. (#19198)" (#19535) This reverts commit 4c00a2283e1b01ff219156a2f1ffffea6fc69a6c. Seems to be cause of https://github.com/iree-org/iree/issues/19533 --- .../DispatchCreation/FormDispatchRegions.cpp | 6 ++-- .../test/form_dispatch_regions.mlir | 32 ------------------- 2 files changed, 3 insertions(+), 35 deletions(-) diff --git a/compiler/src/iree/compiler/DispatchCreation/FormDispatchRegions.cpp b/compiler/src/iree/compiler/DispatchCreation/FormDispatchRegions.cpp index 8e8c27ef95e1..73d306bd7f6e 100644 --- a/compiler/src/iree/compiler/DispatchCreation/FormDispatchRegions.cpp +++ b/compiler/src/iree/compiler/DispatchCreation/FormDispatchRegions.cpp @@ -267,14 +267,14 @@ matchIteratorTypes(const llvm::SmallBitVector &rootOuterParallelLoop, // If the candidate is all parallel, then it should be at least as parallel as // the root. - for (int pos : llvm::seq(0, std::min(candidateOuterParallelLoop.size(), - rootOuterParallelLoop.size()))) { + for (int pos : llvm::seq(0, rootOuterParallelLoop.size())) { // If we reach the end of the outer loops of the root, break out of the // loop. if (!rootOuterParallelLoop.test(pos)) break; // If the root loop is parallel, the candidate loop should also be parallel. 
- if (!candidateOuterParallelLoop.test(pos)) + if (pos >= candidateOuterParallelLoop.size() || + !candidateOuterParallelLoop.test(pos)) return false; } return true; diff --git a/compiler/src/iree/compiler/DispatchCreation/test/form_dispatch_regions.mlir b/compiler/src/iree/compiler/DispatchCreation/test/form_dispatch_regions.mlir index 2344285e2b1b..196fc8795718 100644 --- a/compiler/src/iree/compiler/DispatchCreation/test/form_dispatch_regions.mlir +++ b/compiler/src/iree/compiler/DispatchCreation/test/form_dispatch_regions.mlir @@ -922,35 +922,3 @@ util.func @custom_op_no_producer_fusion(%arg0 : tensor, %arg1 : tensor< // CHECK-SAME: ins(%[[DISPATCH1]], // CHECK: flow.return %[[CUSTOM_OP]] // CHECK: util.return %[[DISPATCH2]] - -// ----- - -util.func @scatter_index_producer_fusion(%arg0 : tensor, - %arg1 : index, %arg2 : tensor, - %arg3 : tensor) -> tensor { - %empty = tensor.empty(%arg1) : tensor - %0 = linalg.generic { - indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, - affine_map<(d0, d1) -> (d0, d1)>], - iterator_types = ["parallel", "parallel"]} - ins(%arg0 : tensor) outs(%empty : tensor) { - ^bb0(%in: i64, %out: i32): - %1 = arith.trunci %in : i64 to i32 - linalg.yield %1 : i32 - } -> tensor - %1 = iree_linalg_ext.scatter - dimension_map = [0] unique_indices(true) - ins(%arg2, %0 : tensor, tensor) - outs(%arg3 : tensor) { - ^bb0(%arg6: f16, %arg7: f16): - iree_linalg_ext.yield %arg6 : f16 - } -> tensor - util.return %1 : tensor -} -// CHECK-LABEL: func public @scatter_index_producer_fusion -// CHECK: %[[DISPATCH:.+]] = flow.dispatch.region -// CHECK: %[[GENERIC:.+]] = linalg.generic -// CHECK: %[[SCATTER:.+]] = iree_linalg_ext.scatter -// CHECK-SAME: ins(%{{.+}}, %[[GENERIC]] : -// CHECK: flow.return %[[SCATTER]] -// CHECK: util.return %[[DISPATCH]] From 7ff83ea165f191161a94114b50a8b089afaa90c3 Mon Sep 17 00:00:00 2001 From: Andrew Woloszyn Date: Fri, 20 Dec 2024 10:00:18 -0800 Subject: [PATCH 53/64] [hip][cuda] Increase the size of the query pool. 
(#19542)

Some of our models now have > 30k commands in them, and each command
takes up 2 events, so increase the total so we don't run out.

Signed-off-by: Andrew Woloszyn
---
 runtime/src/iree/hal/utils/stream_tracing.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/runtime/src/iree/hal/utils/stream_tracing.c b/runtime/src/iree/hal/utils/stream_tracing.c
index f99e932cbb51..a60bdbb4ecd2 100644
--- a/runtime/src/iree/hal/utils/stream_tracing.c
+++ b/runtime/src/iree/hal/utils/stream_tracing.c
@@ -11,7 +11,7 @@
 // Total number of events per tracing context. This translates to the maximum
 // number of outstanding timestamp queries before collection is required.
 // To prevent spilling pages we leave some room for the context structure.
-#define IREE_HAL_TRACING_DEFAULT_QUERY_CAPACITY (32 * 1024 - 256)
+#define IREE_HAL_TRACING_DEFAULT_QUERY_CAPACITY (128 * 1024 - 256)
 
 // iree_hal_stream_tracing_context_event_t contains a native event that is used
 // to record timestamps for tracing GPU execution. In this struct, there are

From 604cba8d601c911d740ac99462cdf2ef4f87bef8 Mon Sep 17 00:00:00 2001
From: Andrew Woloszyn
Date: Fri, 20 Dec 2024 12:13:46 -0800
Subject: [PATCH 54/64] Fix incorrect offset in fd_file.c (#19543)

iree_hal_buffer_mapping_{flush,invalidate}_range take values that are
relative to the mappings, not the buffers themselves.
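The offset convention described in this message can be modeled with a small sketch. It assumes a simplified picture of a mapping as a window into a larger buffer; `toy_mapping_t` and `toy_mapping_flush_range` are hypothetical names and do not reproduce the IREE HAL types or signatures.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical stand-in for a mapped window of a larger buffer. This is not
// the IREE HAL type, only a model of the offset convention.
typedef struct {
  size_t buffer_offset;  // where the window starts inside the buffer
  size_t length;         // size of the window in bytes
} toy_mapping_t;

// Models a flush/invalidate call whose offset is relative to the mapping.
// Returns 0 when [offset, offset + length) lies inside the window and -1
// otherwise.
int toy_mapping_flush_range(const toy_mapping_t *mapping, size_t offset,
                            size_t length) {
  if (offset > mapping->length) return -1;
  if (length > mapping->length - offset) return -1;
  return 0;
}
```

With a 256-byte window that starts 4096 bytes into its buffer, flushing at offset 0 covers the whole window, while passing the buffer offset a second time points past the end of the window and is rejected in this model.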
Signed-off-by: Andrew Woloszyn
---
 runtime/src/iree/hal/utils/fd_file.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/runtime/src/iree/hal/utils/fd_file.c b/runtime/src/iree/hal/utils/fd_file.c
index 3a7fb60f0802..6941e2e5f5de 100644
--- a/runtime/src/iree/hal/utils/fd_file.c
+++ b/runtime/src/iree/hal/utils/fd_file.c
@@ -319,8 +319,7 @@ static iree_status_t iree_hal_fd_file_read(iree_hal_file_t* base_file,
   if (iree_status_is_ok(status) &&
       !iree_all_bits_set(iree_hal_buffer_memory_type(buffer),
                          IREE_HAL_MEMORY_TYPE_HOST_COHERENT)) {
-    status =
-        iree_hal_buffer_mapping_flush_range(&mapping, buffer_offset, length);
+    status = iree_hal_buffer_mapping_flush_range(&mapping, 0, length);
   }
 
   return iree_status_join(status, iree_hal_buffer_unmap_range(&mapping));
@@ -342,8 +341,7 @@ static iree_status_t iree_hal_fd_file_write(iree_hal_file_t* base_file,
   iree_status_t status = iree_ok_status();
   if (!iree_all_bits_set(iree_hal_buffer_memory_type(buffer),
                          IREE_HAL_MEMORY_TYPE_HOST_COHERENT)) {
-    status = iree_hal_buffer_mapping_invalidate_range(&mapping, buffer_offset,
-                                                      length);
+    status = iree_hal_buffer_mapping_invalidate_range(&mapping, 0, length);
   }
 
   const uint8_t* buffer_ptr = mapping.contents.data;

From 9b8bba82b676749600d502f98e32997d58c4c229 Mon Sep 17 00:00:00 2001
From: Andrew Woloszyn
Date: Fri, 20 Dec 2024 12:14:06 -0800
Subject: [PATCH 55/64] [hip] Fixed a busy wait in event_semaphore. (#19540)

Because of how locks were scheduled in iree_hal_hip_semaphore_wait we
would run scheduled callbacks after preparing our notification but before
waiting on it. However in running the scheduled callbacks it would
unconditionally post to the notification. This change only posts to the
notification if there was any forward progress made.
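The idea of posting only on forward progress can be sketched as follows. This is a single-threaded toy under stated assumptions: `toy_semaphore_t`, `toy_run_scheduled_callbacks`, and `toy_advance` are hypothetical names, and the counters stand in for the real notification and callback list in the HIP driver.

```c
#include <assert.h>
#include <stdbool.h>

// Hypothetical toy semaphore: counts callbacks that are ready to run and how
// many times waiters were woken. Not the HIP driver's real data structure.
typedef struct {
  int ready_callbacks;  // callbacks whose wait condition is already met
  int posts;            // number of wake-ups sent to waiters
} toy_semaphore_t;

// Runs every ready callback and reports whether anything actually ran.
bool toy_run_scheduled_callbacks(toy_semaphore_t *sem) {
  bool made_progress = sem->ready_callbacks > 0;
  sem->ready_callbacks = 0;
  return made_progress;
}

// Wakes waiters only when the callbacks made forward progress. An
// unconditional post here would wake a waiter that has nothing new to see,
// and a waiter that re-runs this path on every wake-up would spin.
void toy_advance(toy_semaphore_t *sem) {
  if (toy_run_scheduled_callbacks(sem)) {
    ++sem->posts;
  }
}
```

In this model a call with no ready callbacks leaves `posts` unchanged, so a thread that has prepared a wait is not woken for nothing.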
--------- Signed-off-by: Andrew Woloszyn --- runtime/src/iree/hal/drivers/hip/event_semaphore.c | 4 ---- 1 file changed, 4 deletions(-) diff --git a/runtime/src/iree/hal/drivers/hip/event_semaphore.c b/runtime/src/iree/hal/drivers/hip/event_semaphore.c index 276290c5f006..ddd00fc42a4d 100644 --- a/runtime/src/iree/hal/drivers/hip/event_semaphore.c +++ b/runtime/src/iree/hal/drivers/hip/event_semaphore.c @@ -842,10 +842,6 @@ static iree_status_t iree_hal_hip_semaphore_wait( iree_notification_prepare_wait(&semaphore->state_notification); iree_slim_mutex_unlock(&semaphore->mutex); - // We are going to pick up the correct status from query_locked below. - iree_status_ignore( - iree_hal_hip_event_semaphore_run_scheduled_callbacks(base_semaphore)); - // We have to wait for the semaphore to catch up. bool committed = iree_notification_commit_wait(&semaphore->state_notification, wait, From d917e7da77fe520dd49ac0f93500f0d3f00314a7 Mon Sep 17 00:00:00 2001 From: Andrew Woloszyn Date: Fri, 20 Dec 2024 12:14:14 -0800 Subject: [PATCH 56/64] [hip] Fixes a race in allocator_free_async. 
(#19541) If we freed a buffer with hipFree and it was immediately re-used before the call to IREE_TRACE_FREE_NAMED, we could end up in a case where Tracy would see the same buffer allocated twice (and subsequently crash). Signed-off-by: Andrew Woloszyn --- runtime/src/iree/hal/drivers/hip/hip_allocator.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/runtime/src/iree/hal/drivers/hip/hip_allocator.c b/runtime/src/iree/hal/drivers/hip/hip_allocator.c index eb618abe68fb..0e89ca10a7bb 100644 --- a/runtime/src/iree/hal/drivers/hip/hip_allocator.c +++ b/runtime/src/iree/hal/drivers/hip/hip_allocator.c @@ -692,14 +692,15 @@ iree_status_t iree_hal_hip_allocator_free_async( return iree_ok_status(); } - IREE_RETURN_IF_ERROR(IREE_HIP_CALL_TO_STATUS(allocator->symbols, - hipFree(device_ptr), "hipFree")); - iree_hal_hip_buffer_set_allocation_empty(buffer); - IREE_TRACE_FREE_NAMED(IREE_HAL_HIP_ALLOCATOR_ID, (void*)device_ptr); IREE_STATISTICS(iree_hal_allocator_statistics_record_free( &allocator->statistics, iree_hal_buffer_memory_type(buffer), iree_hal_buffer_allocation_size(buffer))); + + IREE_RETURN_IF_ERROR(IREE_HIP_CALL_TO_STATUS(allocator->symbols, + hipFree(device_ptr), "hipFree")); + iree_hal_hip_buffer_set_allocation_empty(buffer); + return iree_ok_status(); } From 47ccd93ca3f1a55f08e983e92f5960a8d26f577c Mon Sep 17 00:00:00 2001 From: Andrew Woloszyn Date: Fri, 20 Dec 2024 12:39:12 -0800 Subject: [PATCH 57/64] [hip] Implement asynchronous file reads in hip. (#19545) This moves the blocking reads off the main thread, and so long as your PCI bandwidth outpaces your file IO bandwidth, reduces the amount of time required to read files dramatically.
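A minimal sketch of the staging ring buffer idea used by this change, under stated assumptions: the `sketch_*` names are hypothetical, the real code blocks on a notification instead of returning 0, and this sketch wraps `head` with a modulo as a simplification. A reservation may straddle the end of the buffer, so it is reported as up to two chunks, and a reservation only succeeds while strictly more than `size` bytes are free, which keeps `head == tail` meaning "empty":

```c
#include <stddef.h>

typedef struct {
  size_t offset;  // start of the chunk inside the staging buffer
  size_t size;    // chunk length in bytes, may be 0
} sketch_chunk_t;

typedef struct {
  size_t capacity;  // total staging buffer size
  size_t head;      // next byte to hand out
  size_t tail;      // oldest byte still in flight
} sketch_ring_t;

// Free bytes in the ring; head == tail is treated as empty.
static size_t sketch_ring_size_left(const sketch_ring_t* r) {
  return r->head >= r->tail ? r->capacity - (r->head - r->tail)
                            : r->tail - r->head;
}

// Reserves `size` bytes as up to two chunks (the second one wraps to 0).
// Returns 1 on success, 0 if the caller would have to wait for space.
static int sketch_ring_reserve(sketch_ring_t* r, size_t size,
                               sketch_chunk_t out_chunks[2]) {
  if (sketch_ring_size_left(r) <= size) return 0;
  size_t first = r->capacity - r->head;
  if (first > size) first = size;
  out_chunks[0].offset = r->head;
  out_chunks[0].size = first;
  out_chunks[1].offset = 0;
  out_chunks[1].size = size - first;
  r->head = (r->head + size) % r->capacity;
  return 1;
}

// Returns `size` bytes to the ring once a transfer has completed.
static void sketch_ring_release(sketch_ring_t* r, size_t size) {
  r->tail = (r->tail + size) % r->capacity;
  if (r->head == r->tail) {
    // Empty ring: restart at 0 so later reservations are less likely to wrap.
    r->head = 0;
    r->tail = 0;
  }
}
```

Keeping each reservation no larger than the chunk size, which is smaller than the buffer, is what lets a file read into one chunk overlap with the device copy out of another.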
Signed-off-by: Andrew Woloszyn --------- Signed-off-by: Andrew Woloszyn --- runtime/src/iree/hal/drivers/hip/api.h | 8 + runtime/src/iree/hal/drivers/hip/hip_device.c | 1113 +++++++++++------ .../hal/drivers/hip/per_device_information.h | 9 + 3 files changed, 767 insertions(+), 363 deletions(-) diff --git a/runtime/src/iree/hal/drivers/hip/api.h b/runtime/src/iree/hal/drivers/hip/api.h index 2218b19f2aa3..7f91a2d5a18c 100644 --- a/runtime/src/iree/hal/drivers/hip/api.h +++ b/runtime/src/iree/hal/drivers/hip/api.h @@ -91,6 +91,14 @@ typedef struct iree_hal_hip_device_params_t { // device. Defaults to true when the device supports it. bool async_allocations; + // The reserved buffer size for asynchronous file transfers. + iree_device_size_t file_transfer_buffer_size; + + // The maximum chunk size for any single asynchronous file transfer. + // This should be smaller than the full buffer size to allow overlapping + // cpu and gpu workloads. + iree_device_size_t file_transfer_chunk_size; + // Parameters for each hipMemPool_t used for queue-ordered allocations. 
iree_hal_hip_memory_pooling_params_t memory_pools; diff --git a/runtime/src/iree/hal/drivers/hip/hip_device.c b/runtime/src/iree/hal/drivers/hip/hip_device.c index 06ce1092ddba..47217fafdc73 100644 --- a/runtime/src/iree/hal/drivers/hip/hip_device.c +++ b/runtime/src/iree/hal/drivers/hip/hip_device.c @@ -35,6 +35,9 @@ #include "iree/hal/utils/file_transfer.h" #include "iree/hal/utils/stream_tracing.h" +#define IREE_HAL_DEVICE_TRANSFER_DEFAULT_BUFFER_SIZE (128 * 1024 * 1024) +#define IREE_HAL_DEVICE_MAX_TRANSFER_DEFAULT_CHUNK_SIZE (64 * 1024 * 1024) + //===----------------------------------------------------------------------===// // iree_hal_hip_device_t //===----------------------------------------------------------------------===// @@ -221,6 +224,10 @@ IREE_API_EXPORT void iree_hal_hip_device_params_initialize( out_params->command_buffer_mode = IREE_HAL_HIP_COMMAND_BUFFER_MODE_STREAM; out_params->stream_tracing = 0; out_params->async_allocations = true; + out_params->file_transfer_buffer_size = + IREE_HAL_DEVICE_TRANSFER_DEFAULT_BUFFER_SIZE; + out_params->file_transfer_chunk_size = + IREE_HAL_DEVICE_MAX_TRANSFER_DEFAULT_CHUNK_SIZE; out_params->allow_inline_execution = false; } @@ -368,6 +375,34 @@ static iree_status_t iree_hal_hip_device_initialize_internal( } } + if (iree_status_is_ok(status)) { + for (iree_host_size_t i = 0; i < device->device_count; ++i) { + iree_hal_buffer_params_t buffer_params = { + .usage = IREE_HAL_BUFFER_USAGE_TRANSFER | + IREE_HAL_BUFFER_USAGE_MAPPING_SCOPED, + .access = IREE_HAL_MEMORY_ACCESS_READ | IREE_HAL_MEMORY_ACCESS_WRITE | + IREE_HAL_MEMORY_ACCESS_DISCARD, + .type = IREE_HAL_MEMORY_TYPE_HOST_VISIBLE | + IREE_HAL_MEMORY_TYPE_DEVICE_VISIBLE, + .queue_affinity = (iree_hal_queue_affinity_t)1 << i, + .min_alignment = 0, + }; + status = iree_hal_allocator_allocate_buffer( + device->device_allocator, buffer_params, + params->file_transfer_buffer_size, + &device->devices[i].file_transfer_staging_buffer.buffer); + if 
(!iree_status_is_ok(status)) { + break; + } + device->devices[i].file_transfer_staging_buffer.head = 0; + device->devices[i].file_transfer_staging_buffer.tail = 0; + iree_slim_mutex_initialize( + &device->devices[i].file_transfer_staging_buffer.mutex); + iree_notification_initialize( + &device->devices[i].file_transfer_staging_buffer.notify); + } + } + if (!iree_status_is_ok(status)) { iree_hal_device_release((iree_hal_device_t*)device); } @@ -479,6 +514,15 @@ static void iree_hal_hip_device_destroy(iree_hal_device_t* base_device) { iree_hal_hip_cleanup_thread_deinitialize(device->cleanup_thread); iree_hal_hip_cleanup_thread_deinitialize(device->buffer_free_thread); + for (iree_host_size_t i = 0; i < device->device_count; ++i) { + iree_hal_resource_release( + device->devices[i].file_transfer_staging_buffer.buffer); + iree_slim_mutex_deinitialize( + &device->devices[i].file_transfer_staging_buffer.mutex); + iree_notification_deinitialize( + &device->devices[i].file_transfer_staging_buffer.notify); + } + // There should be no more buffers live that use the allocator. 
iree_hal_allocator_release(device->device_allocator); @@ -920,40 +964,23 @@ static iree_status_t iree_hal_hip_device_prepare_async_alloc( IREE_TRACE_ZONE_END(z0); return status; } -typedef enum iree_hal_hip_device_semaphore_buffer_operation_type_e { - IREE_HAL_HIP_DEVICE_SEMAPHORE_OPERATION_ASYNC_ALLOC, - IREE_HAL_HIP_DEVICE_SEMAPHORE_OPERATION_ASYNC_DEALLOC, - IREE_HAL_HIP_DEVICE_SEMAPHORE_OPERATION_MAX = - IREE_HAL_HIP_DEVICE_SEMAPHORE_OPERATION_ASYNC_DEALLOC, -} iree_hal_hip_device_semaphore_buffer_operation_type_t; -typedef struct iree_hal_hip_device_semaphore_buffer_operation_callback_data_t { +typedef struct iree_hal_hip_semaphore_callback_data_t { iree_allocator_t host_allocator; iree_atomic_int64_t wait_semaphore_count; iree_hal_hip_device_t* device; iree_hal_queue_affinity_t queue_affinity; + iree_hal_hip_dispatch_callback_t dispatch_fn; iree_hal_semaphore_list_t wait_semaphore_list; iree_hal_semaphore_list_t signal_semaphore_list; - iree_hal_buffer_t* buffer; - iree_hal_hip_device_semaphore_buffer_operation_type_t type; iree_slim_mutex_t status_mutex; iree_status_t status; -} iree_hal_hip_device_semaphore_buffer_operation_callback_data_t; - -static iree_status_t iree_hal_hip_device_make_buffer_callback_data( - iree_hal_hip_device_t* device, iree_allocator_t host_allocator, - iree_hal_queue_affinity_t queue_affinity, - const iree_hal_semaphore_list_t wait_semaphore_list, - const iree_hal_semaphore_list_t signal_semaphore_list, - iree_hal_buffer_t* buffer, - iree_hal_hip_device_semaphore_buffer_operation_type_t type, - iree_hal_hip_device_semaphore_buffer_operation_callback_data_t** out_data) { - *out_data = NULL; - IREE_TRACE_ZONE_BEGIN(z0); - // Embed captured tables in the action allocation. 
- iree_hal_hip_device_semaphore_buffer_operation_callback_data_t* - callback_data = NULL; +} iree_hal_hip_semaphore_callback_data_t; +iree_host_size_t +iree_hal_hip_semaphore_callback_data_get_additional_allocation_size( + iree_hal_semaphore_list_t wait_semaphore_list, + iree_hal_semaphore_list_t signal_semaphore_list) { const iree_host_size_t wait_semaphore_list_size = wait_semaphore_list.count * sizeof(*wait_semaphore_list.semaphores) + wait_semaphore_list.count * sizeof(*wait_semaphore_list.payload_values); @@ -961,34 +988,39 @@ static iree_status_t iree_hal_hip_device_make_buffer_callback_data( signal_semaphore_list.count * sizeof(*signal_semaphore_list.semaphores) + signal_semaphore_list.count * sizeof(*signal_semaphore_list.payload_values); + return wait_semaphore_list_size + signal_semaphore_list_size; +} - const iree_host_size_t total_callback_size = sizeof(*callback_data) + - wait_semaphore_list_size + - signal_semaphore_list_size; - IREE_RETURN_AND_END_ZONE_IF_ERROR( - z0, iree_allocator_malloc(host_allocator, total_callback_size, - (void**)&callback_data)); - uint8_t* callback_ptr = (uint8_t*)callback_data + sizeof(*callback_data); +void iree_hal_hip_semaphore_callback_data_initialize( + iree_allocator_t host_allocator, iree_hal_hip_device_t* device, + iree_hal_queue_affinity_t queue_affinity, + iree_hal_hip_dispatch_callback_t dispatch_fn, + iree_hal_semaphore_list_t wait_semaphore_list, + iree_hal_semaphore_list_t signal_semaphore_list, + void* additional_data_offset, + iree_hal_hip_semaphore_callback_data_t* data) { + data->host_allocator = host_allocator; + iree_atomic_store(&data->wait_semaphore_count, wait_semaphore_list.count, + iree_memory_order_relaxed); - iree_atomic_store(&callback_data->wait_semaphore_count, - wait_semaphore_list.count, iree_memory_order_relaxed); + data->device = device; + data->queue_affinity = queue_affinity; + data->dispatch_fn = dispatch_fn; - callback_data->host_allocator = host_allocator; - callback_data->device = 
device; - callback_data->queue_affinity = queue_affinity; + uint8_t* callback_ptr = (uint8_t*)additional_data_offset; - // Copy wait list for later access. - callback_data->wait_semaphore_list.count = wait_semaphore_list.count; - callback_data->wait_semaphore_list.semaphores = - (iree_hal_semaphore_t**)callback_ptr; - memcpy(callback_data->wait_semaphore_list.semaphores, - wait_semaphore_list.semaphores, + const iree_host_size_t wait_semaphore_list_size = + wait_semaphore_list.count * sizeof(*wait_semaphore_list.semaphores) + + wait_semaphore_list.count * sizeof(*wait_semaphore_list.payload_values); + data->wait_semaphore_list.count = wait_semaphore_list.count; + data->wait_semaphore_list.semaphores = (iree_hal_semaphore_t**)callback_ptr; + memcpy(data->wait_semaphore_list.semaphores, wait_semaphore_list.semaphores, wait_semaphore_list.count * sizeof(*wait_semaphore_list.semaphores)); - callback_data->wait_semaphore_list.payload_values = + data->wait_semaphore_list.payload_values = (uint64_t*)(callback_ptr + wait_semaphore_list.count * sizeof(*wait_semaphore_list.semaphores)); memcpy( - callback_data->wait_semaphore_list.payload_values, + data->wait_semaphore_list.payload_values, wait_semaphore_list.payload_values, wait_semaphore_list.count * sizeof(*wait_semaphore_list.payload_values)); for (iree_host_size_t i = 0; i < wait_semaphore_list.count; ++i) { @@ -997,41 +1029,28 @@ static iree_status_t iree_hal_hip_device_make_buffer_callback_data( callback_ptr += wait_semaphore_list_size; // Copy signal list for later access. 
- callback_data->signal_semaphore_list.count = signal_semaphore_list.count; - callback_data->signal_semaphore_list.semaphores = - (iree_hal_semaphore_t**)callback_ptr; + data->signal_semaphore_list.count = signal_semaphore_list.count; + data->signal_semaphore_list.semaphores = (iree_hal_semaphore_t**)callback_ptr; memcpy( - callback_data->signal_semaphore_list.semaphores, - signal_semaphore_list.semaphores, + data->signal_semaphore_list.semaphores, signal_semaphore_list.semaphores, signal_semaphore_list.count * sizeof(*signal_semaphore_list.semaphores)); - callback_data->signal_semaphore_list.payload_values = + data->signal_semaphore_list.payload_values = (uint64_t*)(callback_ptr + signal_semaphore_list.count * sizeof(*signal_semaphore_list.semaphores)); - memcpy(callback_data->signal_semaphore_list.payload_values, + memcpy(data->signal_semaphore_list.payload_values, signal_semaphore_list.payload_values, signal_semaphore_list.count * sizeof(*signal_semaphore_list.payload_values)); for (iree_host_size_t i = 0; i < signal_semaphore_list.count; ++i) { iree_hal_resource_retain(signal_semaphore_list.semaphores[i]); } - callback_ptr += signal_semaphore_list_size; - - callback_data->buffer = buffer; - iree_hal_buffer_retain(buffer); - callback_data->type = type; - iree_slim_mutex_initialize(&callback_data->status_mutex); - callback_data->status = iree_ok_status(); - *out_data = callback_data; - IREE_TRACE_ZONE_END(z0); - return iree_ok_status(); + iree_slim_mutex_initialize(&data->status_mutex); + data->status = iree_ok_status(); } -void iree_hal_hip_device_destroy_buffer_callback_data( - iree_hal_hip_device_semaphore_buffer_operation_callback_data_t* data) { - if (!data) { - return; - } +void iree_hal_hip_semaphore_callback_data_deinitialize( + iree_hal_hip_semaphore_callback_data_t* data) { iree_slim_mutex_deinitialize(&data->status_mutex); for (iree_host_size_t i = 0; i < data->wait_semaphore_list.count; ++i) { 
iree_hal_resource_release(data->wait_semaphore_list.semaphores[i]); @@ -1039,9 +1058,37 @@ void iree_hal_hip_device_destroy_buffer_callback_data( for (iree_host_size_t i = 0; i < data->signal_semaphore_list.count; ++i) { iree_hal_resource_release(data->signal_semaphore_list.semaphores[i]); } - iree_hal_buffer_release(data->buffer); +} + +typedef enum iree_hal_hip_device_semaphore_buffer_operation_type_e { + IREE_HAL_HIP_DEVICE_SEMAPHORE_OPERATION_ASYNC_ALLOC, + IREE_HAL_HIP_DEVICE_SEMAPHORE_OPERATION_ASYNC_DEALLOC, + IREE_HAL_HIP_DEVICE_SEMAPHORE_OPERATION_MAX = + IREE_HAL_HIP_DEVICE_SEMAPHORE_OPERATION_ASYNC_DEALLOC, +} iree_hal_hip_device_semaphore_buffer_operation_type_t; + +static iree_status_t iree_hal_hip_device_stream_add_cleanup( + iree_hal_hip_device_t* device, iree_hal_hip_cleanup_thread_t* thread, + iree_host_size_t device_ordinal, iree_hal_hip_cleanup_callback_t callback, + void* user_data) { + iree_hal_hip_event_t* event = NULL; + iree_status_t status = iree_hal_hip_event_pool_acquire( + device->devices[device_ordinal].device_event_pool, 1, &event); + + if (iree_status_is_ok(status)) { + status = IREE_HIP_CALL_TO_STATUS( + device->hip_symbols, + hipEventRecord(iree_hal_hip_event_handle(event), + device->devices[device_ordinal].hip_dispatch_stream)); + } - iree_allocator_free(data->host_allocator, data); + if (iree_status_is_ok(status)) { + status = iree_hal_hip_cleanup_thread_add_cleanup(thread, event, callback, + user_data); + } else { + iree_hal_hip_event_release(event); + } + return status; } static iree_status_t @@ -1071,26 +1118,10 @@ iree_hal_hip_device_stream_signal_semaphores_and_add_cleanup( signal_semaphore_list.payload_values[i]); } - iree_hal_hip_event_t* event = NULL; - if (iree_status_is_ok(status)) { - status = iree_hal_hip_event_pool_acquire( - device->devices[device_ordinal].device_event_pool, 1, &event); - } - - if (iree_status_is_ok(status)) { - status = IREE_HIP_CALL_TO_STATUS( - device->hip_symbols, - 
hipEventRecord(iree_hal_hip_event_handle(event), - device->devices[device_ordinal].hip_dispatch_stream)); - } - if (iree_status_is_ok(status)) { - status = iree_hal_hip_cleanup_thread_add_cleanup(thread, event, callback, - user_data); - } else { - iree_hal_hip_event_release(event); + status = iree_hal_hip_device_stream_add_cleanup( + device, thread, device_ordinal, callback, user_data); } - IREE_TRACE_ZONE_END(z0); return status; } @@ -1155,49 +1186,6 @@ static iree_status_t iree_hal_hip_async_free_buffer(void* user_data, return status; } -static iree_status_t iree_hal_hip_device_complete_buffer_operation( - void* user_data, iree_hal_hip_event_t* event, iree_status_t status) { - IREE_TRACE_ZONE_BEGIN(z0); - - iree_hal_hip_device_semaphore_buffer_operation_callback_data_t* data = - (iree_hal_hip_device_semaphore_buffer_operation_callback_data_t*) - user_data; - - // Free the event we specifically created. - iree_hal_hip_event_release(event); - - // Notify all of the signal semaphores that they have been incremented. 
- for (iree_host_size_t i = 0; i < data->signal_semaphore_list.count; ++i) { - iree_status_ignore(iree_hal_hip_event_semaphore_advance( - data->signal_semaphore_list.semaphores[i])); - } - - if (data->buffer && - data->type == IREE_HAL_HIP_DEVICE_SEMAPHORE_OPERATION_ASYNC_DEALLOC) { - int device_ordinal = - iree_math_count_trailing_zeros_u64(data->queue_affinity); - if (data->device->supports_memory_pools) { - status = iree_status_join( - status, iree_hal_hip_memory_pools_deallocate( - &data->device->devices[device_ordinal].memory_pools, - data->device->devices[device_ordinal].hip_dispatch_stream, - data->buffer)); - } else { - status = iree_status_join( - status, - iree_hal_hip_allocator_free_async( - iree_hal_device_allocator((iree_hal_device_t*)data->device), - data->device->devices[device_ordinal].hip_dispatch_stream, - data->buffer)); - } - } - - iree_hal_hip_device_destroy_buffer_callback_data(data); - - IREE_TRACE_ZONE_END(z0); - return status; -} - static iree_status_t iree_hal_hip_device_stream_wait_for_semaphores( iree_hal_hip_device_t* device, iree_hal_semaphore_list_t wait_semaphore_list, @@ -1236,6 +1224,91 @@ static iree_status_t iree_hal_hip_device_stream_wait_for_semaphores( return status; } +static iree_status_t iree_hal_hip_device_semaphore_callback( + void* user_context, iree_hal_semaphore_t* semaphore, iree_status_t status) { + iree_hal_hip_semaphore_callback_data_t* data = + (iree_hal_hip_semaphore_callback_data_t*)user_context; + + if (!iree_status_is_ok(status)) { + iree_slim_mutex_lock(&data->status_mutex); + data->status = iree_status_join(data->status, status); + iree_slim_mutex_unlock(&data->status_mutex); + } + if (iree_atomic_fetch_sub(&data->wait_semaphore_count, 1, + iree_memory_order_acq_rel) != 1) { + return iree_ok_status(); + } + + int device_ordinal = iree_math_count_trailing_zeros_u64(data->queue_affinity); + + // Now the actual submit happens, as all semaphore have been satisfied + // (by satisfied here, we specifically mean 
that the semaphore has been + // scheduled, not necessarily completed) + return iree_hal_hip_dispatch_thread_add_dispatch( + data->device->devices[device_ordinal].dispatch_thread, data->dispatch_fn, + data); +} + +typedef struct iree_hal_hip_device_semaphore_buffer_operation_callback_data_t { + iree_hal_hip_semaphore_callback_data_t base; + iree_hal_buffer_t* buffer; + iree_hal_hip_device_semaphore_buffer_operation_type_t type; +} iree_hal_hip_device_semaphore_buffer_operation_callback_data_t; + +void iree_hal_hip_device_destroy_buffer_callback_data( + iree_hal_hip_device_semaphore_buffer_operation_callback_data_t* data) { + if (!data) { + return; + } + iree_hal_buffer_release(data->buffer); + iree_hal_hip_semaphore_callback_data_deinitialize(&data->base); + iree_allocator_free(data->base.host_allocator, data); +} + +static iree_status_t iree_hal_hip_device_complete_buffer_operation( + void* user_data, iree_hal_hip_event_t* event, iree_status_t status) { + IREE_TRACE_ZONE_BEGIN(z0); + iree_hal_hip_device_semaphore_buffer_operation_callback_data_t* data = + (iree_hal_hip_device_semaphore_buffer_operation_callback_data_t*) + user_data; + + // Free the event we specifically created. + iree_hal_hip_event_release(event); + + // Notify all of the signal semaphores that they have been incremented. 
+ for (iree_host_size_t i = 0; i < data->base.signal_semaphore_list.count; + ++i) { + iree_status_ignore(iree_hal_hip_event_semaphore_advance( + data->base.signal_semaphore_list.semaphores[i])); + } + + if (data->buffer && + data->type == IREE_HAL_HIP_DEVICE_SEMAPHORE_OPERATION_ASYNC_DEALLOC) { + int device_ordinal = + iree_math_count_trailing_zeros_u64(data->base.queue_affinity); + if (data->base.device->supports_memory_pools) { + status = iree_status_join( + status, + iree_hal_hip_memory_pools_deallocate( + &data->base.device->devices[device_ordinal].memory_pools, + data->base.device->devices[device_ordinal].hip_dispatch_stream, + data->buffer)); + } else { + status = iree_status_join( + status, + iree_hal_hip_allocator_free_async( + iree_hal_device_allocator((iree_hal_device_t*)data->base.device), + data->base.device->devices[device_ordinal].hip_dispatch_stream, + data->buffer)); + } + } + + iree_hal_hip_device_destroy_buffer_callback_data(data); + + IREE_TRACE_ZONE_END(z0); + return status; +} + static iree_status_t iree_hal_hip_device_perform_buffer_operation_now( void* user_data, iree_status_t status) { IREE_TRACE_ZONE_BEGIN(z0); @@ -1245,26 +1318,27 @@ static iree_status_t iree_hal_hip_device_perform_buffer_operation_now( user_data; IREE_ASSERT_LE(data->type, IREE_HAL_HIP_DEVICE_SEMAPHORE_OPERATION_MAX); - iree_hal_hip_device_t* device = data->device; + iree_hal_hip_device_t* device = data->base.device; // If we had a semaphore failure then we should propagate it // but not run anything. 
- if (!iree_status_is_ok(data->status)) { - status = iree_status_join(data->status, status); + if (!iree_status_is_ok(data->base.status)) { + status = iree_status_join(data->base.status, status); } - int device_ordinal = iree_math_count_trailing_zeros_u64(data->queue_affinity); + int device_ordinal = + iree_math_count_trailing_zeros_u64(data->base.queue_affinity); if (iree_status_is_ok(status)) { status = IREE_HIP_CALL_TO_STATUS( - data->device->hip_symbols, - hipCtxPushCurrent(data->device->devices[device_ordinal].hip_context)); + device->hip_symbols, + hipCtxPushCurrent(device->devices[device_ordinal].hip_context)); } IREE_TRACE_ZONE_APPEND_VALUE_I64(z0, device_ordinal); if (iree_status_is_ok(status)) { status = iree_hal_hip_device_stream_wait_for_semaphores( - data->device, data->wait_semaphore_list, device_ordinal); + device, data->base.wait_semaphore_list, device_ordinal); } // We have satisfied all of the waits. @@ -1281,7 +1355,7 @@ static iree_status_t iree_hal_hip_device_perform_buffer_operation_now( break; } status = iree_hal_hip_allocator_alloc_async( - iree_hal_device_allocator((iree_hal_device_t*)data->device), + iree_hal_device_allocator((iree_hal_device_t*)device), device->devices[device_ordinal].hip_dispatch_stream, data->buffer); break; case IREE_HAL_HIP_DEVICE_SEMAPHORE_OPERATION_ASYNC_DEALLOC: { @@ -1293,17 +1367,18 @@ static iree_status_t iree_hal_hip_device_perform_buffer_operation_now( } IREE_TRACE_ZONE_END(z3); - const iree_hal_hip_dynamic_symbols_t* symbols = data->device->hip_symbols; + const iree_hal_hip_dynamic_symbols_t* symbols = device->hip_symbols; if (iree_status_is_ok(status)) { // Data may get deleted any time after adding it to the cleanup, // so retain the symbols here. 
status = iree_hal_hip_device_stream_signal_semaphores_and_add_cleanup( - data->device, data->device->cleanup_thread, data->signal_semaphore_list, + device, device->cleanup_thread, data->base.signal_semaphore_list, device_ordinal, &iree_hal_hip_device_complete_buffer_operation, data); } else { - for (iree_host_size_t i = 0; i < data->signal_semaphore_list.count; ++i) { - iree_hal_semaphore_fail(data->signal_semaphore_list.semaphores[i], - iree_status_clone(data->status)); + for (iree_host_size_t i = 0; i < data->base.signal_semaphore_list.count; + ++i) { + iree_hal_semaphore_fail(data->base.signal_semaphore_list.semaphores[i], + iree_status_clone(data->base.status)); } iree_hal_hip_device_destroy_buffer_callback_data(data); } @@ -1313,28 +1388,43 @@ static iree_status_t iree_hal_hip_device_perform_buffer_operation_now( status, IREE_HIP_CALL_TO_STATUS(symbols, hipCtxPopCurrent(NULL))); } -static iree_status_t iree_hal_hip_device_semaphore_buffer_operation_callback( - void* user_context, iree_hal_semaphore_t* semaphore, iree_status_t status) { - iree_hal_hip_device_semaphore_buffer_operation_callback_data_t* data = - (iree_hal_hip_device_semaphore_buffer_operation_callback_data_t*) - user_context; - if (!iree_status_is_ok(status)) { - iree_slim_mutex_lock(&data->status_mutex); - data->status = iree_status_join(data->status, status); - iree_slim_mutex_unlock(&data->status_mutex); - } - if (iree_atomic_fetch_sub(&data->wait_semaphore_count, 1, - iree_memory_order_acq_rel) != 1) { - return iree_ok_status(); - } +static iree_status_t iree_hal_hip_device_make_buffer_callback_data( + iree_hal_hip_device_t* device, iree_allocator_t host_allocator, + iree_hal_queue_affinity_t queue_affinity, + const iree_hal_semaphore_list_t wait_semaphore_list, + const iree_hal_semaphore_list_t signal_semaphore_list, + iree_hal_buffer_t* buffer, + iree_hal_hip_device_semaphore_buffer_operation_type_t type, + iree_hal_hip_device_semaphore_buffer_operation_callback_data_t** out_data) { + 
*out_data = NULL; + IREE_TRACE_ZONE_BEGIN(z0); + // Embed captured tables in the action allocation. + iree_hal_hip_device_semaphore_buffer_operation_callback_data_t* + callback_data = NULL; - int device_ordinal = iree_math_count_trailing_zeros_u64(data->queue_affinity); - // Now the actual buffer_operation happens, as all semaphore have been - // satisfied (by satisfied here, we specifically mean that the semaphore has - // been scheduled, not necessarily completed). - return iree_hal_hip_dispatch_thread_add_dispatch( - data->device->devices[device_ordinal].dispatch_thread, - &iree_hal_hip_device_perform_buffer_operation_now, data); + const iree_host_size_t additional_data_for_base = + iree_hal_hip_semaphore_callback_data_get_additional_allocation_size( + wait_semaphore_list, signal_semaphore_list); + + const iree_host_size_t total_callback_size = + sizeof(*callback_data) + additional_data_for_base; + IREE_RETURN_AND_END_ZONE_IF_ERROR( + z0, iree_allocator_malloc(host_allocator, total_callback_size, + (void**)&callback_data)); + iree_hal_hip_semaphore_callback_data_initialize( + host_allocator, device, queue_affinity, + &iree_hal_hip_device_perform_buffer_operation_now, wait_semaphore_list, + signal_semaphore_list, + (void*)((uint8_t*)callback_data + sizeof(*callback_data)), + &callback_data->base); + + callback_data->buffer = buffer; + iree_hal_buffer_retain(buffer); + callback_data->type = type; + + *out_data = callback_data; + IREE_TRACE_ZONE_END(z0); + return iree_ok_status(); } // TODO: implement multiple streams; today we only have one and queue_affinity @@ -1391,8 +1481,7 @@ static iree_status_t iree_hal_hip_device_queue_alloca( wait_semaphore_list.semaphores[i], wait_semaphore_list.payload_values[i], device->devices[device_ordinal].device_event_pool, - &iree_hal_hip_device_semaphore_buffer_operation_callback, - callback_data)); + &iree_hal_hip_device_semaphore_callback, callback_data)); } } else { 
iree_hal_hip_device_destroy_buffer_callback_data(callback_data); @@ -1482,8 +1571,7 @@ static iree_status_t iree_hal_hip_device_queue_dealloca( wait_semaphore_list.semaphores[i], wait_semaphore_list.payload_values[i], device->devices[device_ordinal].device_event_pool, - &iree_hal_hip_device_semaphore_buffer_operation_callback, - callback_data)); + &iree_hal_hip_device_semaphore_callback, callback_data)); } } else { iree_hal_hip_device_destroy_buffer_callback_data(callback_data); @@ -1519,7 +1607,369 @@ static iree_status_t iree_hal_hip_device_queue_dealloca( return status; } -static iree_status_t iree_hal_hip_device_queue_read( +typedef struct iree_hal_hip_device_semaphore_queue_read_callback_data_t { + iree_hal_hip_semaphore_callback_data_t base; + iree_hal_file_t* source_file; + uint64_t source_offset; + iree_hal_buffer_t* target_buffer; + iree_device_size_t target_offset; + iree_device_size_t length; + iree_hal_read_flags_t flags; + int64_t read_chunks_completed; + uint64_t num_read_chunks; + uint64_t* read_chunk_sizes; + iree_hal_command_buffer_t** command_buffers; +} iree_hal_hip_device_semaphore_queue_read_callback_data_t; + +void iree_hal_hip_device_destroy_queue_read_callback_data( + iree_hal_hip_device_semaphore_queue_read_callback_data_t* data) { + if (!data) { + return; + } + iree_hal_resource_release(data->target_buffer); + iree_hal_hip_semaphore_callback_data_deinitialize(&data->base); + iree_allocator_free(data->base.host_allocator, data); +} + +static iree_status_t iree_hal_hip_device_complete_queue_read_operation( + void* user_data, iree_hal_hip_event_t* event, iree_status_t status) { + IREE_TRACE_ZONE_BEGIN(z0); + iree_hal_hip_device_semaphore_queue_read_callback_data_t* data = + (iree_hal_hip_device_semaphore_queue_read_callback_data_t*)user_data; + + // Free the event we specifically created. + iree_hal_hip_event_release(event); + + // Release the next chunk of buffer back to the ring. 
+ int device_ordinal = + iree_math_count_trailing_zeros_u64(data->base.queue_affinity); + iree_slim_mutex_lock(&data->base.device->devices[device_ordinal] + .file_transfer_staging_buffer.mutex); + data->base.device->devices[device_ordinal] + .file_transfer_staging_buffer.tail += + data->read_chunk_sizes[data->read_chunks_completed]; + data->base.device->devices[device_ordinal] + .file_transfer_staging_buffer.tail %= + data->base.device->params.file_transfer_buffer_size; + if (data->base.device->devices[device_ordinal] + .file_transfer_staging_buffer.head == + data->base.device->devices[device_ordinal] + .file_transfer_staging_buffer.tail) { + // Slight optimization here. If the buffer is empty, reset it to 0, so that + // we are less likely to wrap. + data->base.device->devices[device_ordinal] + .file_transfer_staging_buffer.head = 0; + data->base.device->devices[device_ordinal] + .file_transfer_staging_buffer.tail = 0; + } + iree_slim_mutex_unlock(&data->base.device->devices[device_ordinal] + .file_transfer_staging_buffer.mutex); + iree_notification_post(&data->base.device->devices[device_ordinal] + .file_transfer_staging_buffer.notify, + IREE_ALL_WAITERS); + + iree_hal_command_buffer_t* command_buffer = + data->command_buffers[data->read_chunks_completed]; + if (iree_hal_hip_multi_queue_command_buffer_isa(command_buffer)) { + status = iree_hal_hip_multi_queue_command_buffer_get( + command_buffer, data->base.queue_affinity, &command_buffer); + } + + status = iree_status_join( + status, + iree_hal_stream_tracing_context_collect_list( + // Get the tracing context from the device/stream/queue affinity. + data->base.device->devices[device_ordinal].tracing_context, + // Get the tracing event list from the command buffer. 
+ iree_hal_hip_stream_command_buffer_tracing_events(command_buffer) + .head)); + + iree_hal_resource_release(data->command_buffers[data->read_chunks_completed]); + // If there are more chunks to this transfer don't destroy and advance + // the semaphores yet. + if (++data->read_chunks_completed != data->num_read_chunks) { + IREE_TRACE_ZONE_END(z0); + return status; + } + + // Notify all of the signal semaphores that they have been incremented. + for (iree_host_size_t i = 0; i < data->base.signal_semaphore_list.count; + ++i) { + iree_status_ignore(iree_hal_hip_event_semaphore_advance( + data->base.signal_semaphore_list.semaphores[i])); + } + + iree_hal_hip_device_destroy_queue_read_callback_data(data); + + IREE_TRACE_ZONE_END(z0); + return status; +} + +iree_device_size_t iree_hal_hip_transfer_buffer_size_left( + iree_hal_hip_device_t* device, iree_hal_hip_per_device_info_t* info) { + iree_slim_mutex_lock(&info->file_transfer_staging_buffer.mutex); + iree_device_size_t size_left = + info->file_transfer_staging_buffer.head >= + info->file_transfer_staging_buffer.tail + ? 
device->params.file_transfer_buffer_size - + (info->file_transfer_staging_buffer.head - + info->file_transfer_staging_buffer.tail) + : info->file_transfer_staging_buffer.tail - + info->file_transfer_staging_buffer.head; + iree_slim_mutex_unlock(&info->file_transfer_staging_buffer.mutex); + return size_left; +} + +typedef struct iree_hal_hip_transfer_buffer_chunk_t { + iree_device_size_t offset; + iree_device_size_t size; +} iree_hal_hip_transfer_buffer_chunk_t; + +typedef struct iree_hal_hip_transfer_buffer_size_check_data_t { + iree_host_size_t device_ordinal; + iree_device_size_t num_bytes; + iree_hal_hip_device_t* device; +} iree_hal_hip_transfer_buffer_size_check_data_t; + +bool iree_hal_hip_transfer_buffer_size_check_condition(void* user_data) { + iree_hal_hip_transfer_buffer_size_check_data_t* data = + (iree_hal_hip_transfer_buffer_size_check_data_t*)user_data; + return iree_hal_hip_transfer_buffer_size_left( + data->device, &data->device->devices[data->device_ordinal]) > + data->num_bytes; +} + +// Returns two chunks that are needed to cover the buffer. Pass in an +// array of 2 chunks to be filled in. It will wait on the availability +// of the chunks if there are transfers in progress. It is possible that +// either of the chunks may have a size of 0. The requested size must be +// less than or equal to the file_transfer_chunk_size. 
+void iree_hal_hip_transfer_buffer_reserve_chunks( + iree_hal_hip_device_t* device, iree_host_size_t device_ordinal, + iree_device_size_t size, iree_hal_hip_transfer_buffer_chunk_t* out_chunks) { + IREE_ASSERT_ARGUMENT(out_chunks); + IREE_ASSERT(size <= device->params.file_transfer_chunk_size, + "Trying to allocate a chunk that is too large."); + iree_hal_hip_transfer_buffer_size_check_data_t size_check = { + .device_ordinal = device_ordinal, .num_bytes = size, .device = device}; + + iree_notification_await( + &device->devices[device_ordinal].file_transfer_staging_buffer.notify, + iree_hal_hip_transfer_buffer_size_check_condition, (void*)&size_check, + iree_infinite_timeout()); + + iree_hal_hip_per_device_info_t* info = &device->devices[device_ordinal]; + iree_slim_mutex_lock(&info->file_transfer_staging_buffer.mutex); + out_chunks[0].offset = info->file_transfer_staging_buffer.head; + out_chunks[0].size = + iree_min(size, device->params.file_transfer_buffer_size - + info->file_transfer_staging_buffer.head); + if (size != out_chunks->size) { + out_chunks[1].offset = 0; + out_chunks[1].size = size - out_chunks[0].size; + info->file_transfer_staging_buffer.head = out_chunks[1].size; + } else { + out_chunks[1].offset = 0; + out_chunks[1].size = 0; + info->file_transfer_staging_buffer.head += size; + } + iree_slim_mutex_unlock(&info->file_transfer_staging_buffer.mutex); +} + +static iree_status_t iree_hal_hip_device_perform_queue_read_now( + void* user_data, iree_status_t status) { + IREE_TRACE_ZONE_BEGIN(z0); + + iree_hal_hip_device_semaphore_queue_read_callback_data_t* data = + (iree_hal_hip_device_semaphore_queue_read_callback_data_t*)user_data; + + iree_hal_hip_device_t* device = data->base.device; + + // If we had a semaphore failure then we should propagate it + // but not run anything. 
+ if (!iree_status_is_ok(data->base.status)) { + status = iree_status_join(data->base.status, status); + } + + int device_ordinal = + iree_math_count_trailing_zeros_u64(data->base.queue_affinity); + + if (iree_status_is_ok(status)) { + status = IREE_HIP_CALL_TO_STATUS( + device->hip_symbols, + hipCtxPushCurrent(device->devices[device_ordinal].hip_context)); + } + IREE_TRACE_ZONE_APPEND_VALUE_I64(z0, device_ordinal); + + if (iree_status_is_ok(status)) { + status = iree_hal_hip_device_stream_wait_for_semaphores( + device, data->base.wait_semaphore_list, device_ordinal); + } + + const iree_hal_hip_dynamic_symbols_t* symbols = device->hip_symbols; + iree_device_size_t amount_left = data->length; + iree_device_size_t offset = 0; + for (iree_host_size_t i = 0; + i < data->num_read_chunks && iree_status_is_ok(status); ++i) { + iree_device_size_t chunk_size = + iree_min(device->params.file_transfer_chunk_size, amount_left); + iree_hal_hip_transfer_buffer_chunk_t chunks[2]; + iree_hal_hip_transfer_buffer_reserve_chunks(device, device_ordinal, + chunk_size, &chunks[0]); + + iree_device_size_t read_offset = offset; + for (iree_host_size_t j = 0; j < 2; ++j) { + if (chunks[j].size) { + status = iree_hal_file_read( + data->source_file, data->source_offset + read_offset, + device->devices[device_ordinal].file_transfer_staging_buffer.buffer, + chunks[j].offset, chunks[j].size); + if (!iree_status_is_ok(status)) { + break; + } + read_offset += chunks[j].size; + } + } + + // We use a command buffer because it allows us to easily get tracing events + // into the trace, and the actual overhead is quite minimal We only start it + // here, rather than creating up above the read which is more natural, + // because it will show up as actively doing work while we are recording, + // because hip stream command buffers are executed at record-time, so we + // don't want to get the file io mixed with the record. 
+ iree_hal_command_buffer_t* stream_command_buffer = NULL; + if (iree_status_is_ok(status)) { + status = iree_hal_hip_device_create_stream_command_buffer( + (iree_hal_device_t*)device, + IREE_HAL_COMMAND_BUFFER_MODE_ONE_SHOT | + IREE_HAL_COMMAND_BUFFER_MODE_ALLOW_INLINE_EXECUTION | + IREE_HAL_COMMAND_BUFFER_MODE_UNVALIDATED, + IREE_HAL_COMMAND_CATEGORY_TRANSFER, data->base.queue_affinity, 0, + &stream_command_buffer); + } + if (iree_status_is_ok(status)) { + status = iree_hal_command_buffer_begin(stream_command_buffer); + } + for (iree_host_size_t j = 0; j < 2; ++j) { + if (iree_status_is_ok(status)) { + iree_hal_buffer_ref_t src = {0}; + src.buffer = + device->devices[device_ordinal].file_transfer_staging_buffer.buffer; + src.offset = chunks[j].offset; + src.length = chunks[j].size; + + iree_hal_buffer_ref_t dst = {0}; + dst.buffer = data->target_buffer; + dst.offset = data->target_offset + offset; + dst.length = chunks[j].size; + + status = iree_hal_command_buffer_copy_buffer( + stream_command_buffer, src, dst, IREE_HAL_COPY_FLAG_NONE); + } + offset += chunks[j].size; + amount_left -= chunks[j].size; + } + + if (iree_status_is_ok(status)) { + status = iree_hal_command_buffer_end(stream_command_buffer); + data->command_buffers[i] = stream_command_buffer; + } + + if (iree_status_is_ok(status)) { + data->read_chunk_sizes[i] = chunk_size; + // We only want to signal the semaphores on the final + // chunk. 
+ if (i == data->num_read_chunks - 1) { + status = iree_hal_hip_device_stream_signal_semaphores_and_add_cleanup( + device, device->cleanup_thread, data->base.signal_semaphore_list, + device_ordinal, &iree_hal_hip_device_complete_queue_read_operation, + data); + } else { + status = iree_hal_hip_device_stream_add_cleanup( + device, device->cleanup_thread, device_ordinal, + &iree_hal_hip_device_complete_queue_read_operation, data); + } + } + } + + if (!iree_status_is_ok(status)) { + for (iree_host_size_t i = 0; i < data->base.signal_semaphore_list.count; + ++i) { + iree_hal_semaphore_fail(data->base.signal_semaphore_list.semaphores[i], + iree_status_clone(data->base.status)); + } + iree_hal_hip_device_destroy_queue_read_callback_data(data); + } + + IREE_TRACE_ZONE_END(z0); + return iree_status_join( + status, IREE_HIP_CALL_TO_STATUS(symbols, hipCtxPopCurrent(NULL))); +} + +static iree_status_t iree_hal_hip_device_make_queue_read_callback_data( + iree_hal_hip_device_t* device, iree_allocator_t host_allocator, + iree_hal_queue_affinity_t queue_affinity, + const iree_hal_semaphore_list_t wait_semaphore_list, + const iree_hal_semaphore_list_t signal_semaphore_list, + iree_hal_file_t* source_file, uint64_t source_offset, + iree_hal_buffer_t* target_buffer, iree_device_size_t target_offset, + iree_device_size_t length, iree_hal_read_flags_t flags, + iree_hal_hip_device_semaphore_queue_read_callback_data_t** out_data) { + *out_data = NULL; + IREE_TRACE_ZONE_BEGIN(z0); + + iree_hal_hip_device_semaphore_queue_read_callback_data_t* callback_data = + NULL; + + uint64_t chunk_count = + (length + device->params.file_transfer_chunk_size - 1) / + device->params.file_transfer_chunk_size; + + const iree_host_size_t additional_data_for_base = + iree_hal_hip_semaphore_callback_data_get_additional_allocation_size( + wait_semaphore_list, signal_semaphore_list); + const iree_host_size_t additional_data_for_chunks = + sizeof(*callback_data->read_chunk_sizes) * chunk_count + + 
sizeof(*callback_data->command_buffers) * chunk_count; + + const iree_host_size_t total_callback_size = sizeof(*callback_data) + + additional_data_for_base + + additional_data_for_chunks; + IREE_RETURN_AND_END_ZONE_IF_ERROR( + z0, iree_allocator_malloc(host_allocator, total_callback_size, + (void**)&callback_data)); + iree_hal_hip_semaphore_callback_data_initialize( + host_allocator, device, queue_affinity, + &iree_hal_hip_device_perform_queue_read_now, wait_semaphore_list, + signal_semaphore_list, + (void*)((uint8_t*)callback_data + sizeof(*callback_data)), + &callback_data->base); + + uint64_t* chunk_base = + (void*)((uint8_t*)callback_data + sizeof(*callback_data) + + additional_data_for_base); + iree_hal_command_buffer_t** command_buffer_base = + (iree_hal_command_buffer_t**)((uint8_t*)chunk_base + + sizeof(*callback_data->read_chunk_sizes) * + chunk_count); + callback_data->source_file = source_file; + callback_data->source_offset = source_offset; + callback_data->target_buffer = target_buffer; + iree_hal_resource_retain(target_buffer); + callback_data->target_offset = target_offset; + callback_data->length = length; + callback_data->flags = flags; + callback_data->read_chunks_completed = 0; + callback_data->num_read_chunks = chunk_count; + callback_data->read_chunk_sizes = chunk_base; + callback_data->command_buffers = command_buffer_base; + + *out_data = callback_data; + IREE_TRACE_ZONE_END(z0); + return iree_ok_status(); +} + +static iree_status_t iree_hal_hip_device_queue_read( iree_hal_device_t* base_device, iree_hal_queue_affinity_t queue_affinity, const iree_hal_semaphore_list_t wait_semaphore_list, const iree_hal_semaphore_list_t signal_semaphore_list, @@ -1528,21 +1978,39 @@ static iree_status_t iree_hal_hip_device_queue_read( iree_device_size_t length, iree_hal_read_flags_t flags) { IREE_TRACE_ZONE_BEGIN(z0); - // TODO: expose streaming chunk count/size options. 
- iree_status_t loop_status = iree_ok_status(); - iree_hal_file_transfer_options_t options = { - .loop = iree_loop_inline(&loop_status), - .chunk_count = IREE_HAL_FILE_TRANSFER_CHUNK_COUNT_DEFAULT, - .chunk_size = IREE_HAL_FILE_TRANSFER_CHUNK_SIZE_DEFAULT, - }; - IREE_RETURN_AND_END_ZONE_IF_ERROR( - z0, iree_hal_device_queue_read_streaming( - base_device, queue_affinity, wait_semaphore_list, - signal_semaphore_list, source_file, source_offset, target_buffer, - target_offset, length, flags, options)); + if (queue_affinity == IREE_HAL_QUEUE_AFFINITY_ANY) { + queue_affinity = 0x1; + } + iree_hal_hip_device_t* device = iree_hal_hip_device_cast(base_device); + const int device_ordinal = iree_math_count_trailing_zeros_u64(queue_affinity); + + iree_hal_hip_device_semaphore_queue_read_callback_data_t* callback_data = + NULL; + iree_status_t status = iree_hal_hip_device_make_queue_read_callback_data( + device, device->host_allocator, queue_affinity, wait_semaphore_list, + signal_semaphore_list, source_file, source_offset, target_buffer, + target_offset, length, flags, &callback_data); + + if (iree_status_is_ok(status) && wait_semaphore_list.count == 0) { + status = iree_hal_hip_dispatch_thread_add_dispatch( + device->devices[device_ordinal].dispatch_thread, + &iree_hal_hip_device_perform_queue_read_now, callback_data); + } else if (iree_status_is_ok(status) && wait_semaphore_list.count != 0) { + for (iree_host_size_t i = 0; + i < wait_semaphore_list.count && iree_status_is_ok(status); ++i) { + status = iree_status_join( + status, iree_hal_hip_semaphore_notify_work( + wait_semaphore_list.semaphores[i], + wait_semaphore_list.payload_values[i], + device->devices[device_ordinal].device_event_pool, + &iree_hal_hip_device_semaphore_callback, callback_data)); + } + } else { + iree_hal_hip_device_destroy_queue_read_callback_data(callback_data); + } IREE_TRACE_ZONE_END(z0); - return loop_status; + return status; } static iree_status_t iree_hal_hip_device_queue_write( @@ -1572,153 
+2040,20 @@ static iree_status_t iree_hal_hip_device_queue_write( } typedef struct iree_hal_hip_device_semaphore_submit_callback_data_t { - iree_allocator_t host_allocator; - iree_atomic_int64_t wait_semaphore_count; - iree_hal_hip_device_t* device; - iree_hal_queue_affinity_t queue_affinity; + iree_hal_hip_semaphore_callback_data_t base; iree_hal_command_buffer_t* command_buffer; iree_hal_buffer_binding_table_t binding_table; - iree_hal_semaphore_list_t wait_semaphore_list; - iree_hal_semaphore_list_t signal_semaphore_list; iree_hal_resource_set_t* resource_set; - iree_slim_mutex_t status_mutex; - iree_status_t status; } iree_hal_hip_device_semaphore_submit_callback_data_t; -static iree_status_t iree_hal_hip_device_make_callback_data( - iree_hal_hip_device_t* device, iree_allocator_t host_allocator, - iree_arena_block_pool_t* block_pool, - iree_hal_queue_affinity_t queue_affinity, - const iree_hal_semaphore_list_t wait_semaphore_list, - const iree_hal_semaphore_list_t signal_semaphore_list, - iree_hal_command_buffer_t* command_buffer, - iree_hal_buffer_binding_table_t binding_table, - iree_hal_hip_device_semaphore_submit_callback_data_t** out_data) { - IREE_TRACE_ZONE_BEGIN(z0); - - *out_data = NULL; - - // Embed captured tables in the action allocation. 
- iree_hal_hip_device_semaphore_submit_callback_data_t* callback_data = NULL; - - const iree_host_size_t wait_semaphore_list_size = - wait_semaphore_list.count * sizeof(*wait_semaphore_list.semaphores) + - wait_semaphore_list.count * sizeof(*wait_semaphore_list.payload_values); - const iree_host_size_t signal_semaphore_list_size = - signal_semaphore_list.count * sizeof(*signal_semaphore_list.semaphores) + - signal_semaphore_list.count * - sizeof(*signal_semaphore_list.payload_values); - - const iree_host_size_t payload_size = - binding_table.count * sizeof(*binding_table.bindings); - - const iree_host_size_t total_callback_size = - sizeof(*callback_data) + wait_semaphore_list_size + - signal_semaphore_list_size + payload_size; - IREE_RETURN_AND_END_ZONE_IF_ERROR( - z0, iree_allocator_malloc(host_allocator, total_callback_size, - (void**)&callback_data)); - uint8_t* callback_ptr = (uint8_t*)callback_data + sizeof(*callback_data); - - callback_data->host_allocator = host_allocator; - callback_data->device = device; - - iree_atomic_store(&callback_data->wait_semaphore_count, - wait_semaphore_list.count, iree_memory_order_relaxed); - // Copy wait list for later access. 
- callback_data->wait_semaphore_list.count = wait_semaphore_list.count; - callback_data->wait_semaphore_list.semaphores = - (iree_hal_semaphore_t**)callback_ptr; - memcpy(callback_data->wait_semaphore_list.semaphores, - wait_semaphore_list.semaphores, - wait_semaphore_list.count * sizeof(*wait_semaphore_list.semaphores)); - callback_data->wait_semaphore_list.payload_values = - (uint64_t*)(callback_ptr + wait_semaphore_list.count * - sizeof(*wait_semaphore_list.semaphores)); - memcpy( - callback_data->wait_semaphore_list.payload_values, - wait_semaphore_list.payload_values, - wait_semaphore_list.count * sizeof(*wait_semaphore_list.payload_values)); - for (iree_host_size_t i = 0; i < wait_semaphore_list.count; ++i) { - iree_hal_resource_retain(wait_semaphore_list.semaphores[i]); - } - callback_ptr += wait_semaphore_list_size; - - // Copy signal list for later access. - callback_data->signal_semaphore_list.count = signal_semaphore_list.count; - callback_data->signal_semaphore_list.semaphores = - (iree_hal_semaphore_t**)callback_ptr; - memcpy( - callback_data->signal_semaphore_list.semaphores, - signal_semaphore_list.semaphores, - signal_semaphore_list.count * sizeof(*signal_semaphore_list.semaphores)); - callback_data->signal_semaphore_list.payload_values = - (uint64_t*)(callback_ptr + signal_semaphore_list.count * - sizeof(*signal_semaphore_list.semaphores)); - memcpy(callback_data->signal_semaphore_list.payload_values, - signal_semaphore_list.payload_values, - signal_semaphore_list.count * - sizeof(*signal_semaphore_list.payload_values)); - for (iree_host_size_t i = 0; i < signal_semaphore_list.count; ++i) { - iree_hal_resource_retain(signal_semaphore_list.semaphores[i]); - } - callback_ptr += signal_semaphore_list_size; - - // Copy the execution resources for later access. - callback_data->queue_affinity = queue_affinity; - callback_data->command_buffer = command_buffer; - - // Retain all command buffers and semaphores. 
- iree_status_t status = - iree_hal_resource_set_allocate(block_pool, &callback_data->resource_set); - if (iree_status_is_ok(status)) { - status = iree_hal_resource_set_insert(callback_data->resource_set, - wait_semaphore_list.count, - wait_semaphore_list.semaphores); - } - if (iree_status_is_ok(status)) { - status = iree_hal_resource_set_insert(callback_data->resource_set, - signal_semaphore_list.count, - signal_semaphore_list.semaphores); - } - if (iree_status_is_ok(status)) { - status = iree_hal_resource_set_insert(callback_data->resource_set, 1, - &command_buffer); - } - - callback_data->binding_table = binding_table; - iree_hal_buffer_binding_t* binding_element_ptr = - (iree_hal_buffer_binding_t*)callback_ptr; - callback_data->binding_table.bindings = binding_element_ptr; - memcpy(binding_element_ptr, binding_table.bindings, - sizeof(*binding_element_ptr) * binding_table.count); - status = iree_hal_resource_set_insert_strided( - callback_data->resource_set, binding_table.count, - callback_data->binding_table.bindings, - offsetof(iree_hal_buffer_binding_t, buffer), - sizeof(iree_hal_buffer_binding_t)); - - callback_data->status = iree_ok_status(); - iree_slim_mutex_initialize(&callback_data->status_mutex); - *out_data = callback_data; - IREE_TRACE_ZONE_END(z0); - return status; -} - void iree_hal_hip_device_destroy_callback_data( iree_hal_hip_device_semaphore_submit_callback_data_t* data) { if (!data) { return; } - iree_slim_mutex_deinitialize(&data->status_mutex); iree_hal_resource_set_free(data->resource_set); - for (iree_host_size_t i = 0; i < data->wait_semaphore_list.count; ++i) { - iree_hal_resource_release(data->wait_semaphore_list.semaphores[i]); - } - for (iree_host_size_t i = 0; i < data->signal_semaphore_list.count; ++i) { - iree_hal_resource_release(data->signal_semaphore_list.semaphores[i]); - } - iree_allocator_free(data->host_allocator, data); + iree_hal_hip_semaphore_callback_data_deinitialize(&data->base); + 
iree_allocator_free(data->base.host_allocator, data); } static iree_status_t iree_hal_hip_device_complete_submission( @@ -1727,10 +2062,11 @@ static iree_status_t iree_hal_hip_device_complete_submission( iree_hal_hip_device_semaphore_submit_callback_data_t* data = (iree_hal_hip_device_semaphore_submit_callback_data_t*)user_data; - iree_hal_hip_device_t* device = data->device; + iree_hal_hip_device_t* device = data->base.device; // Get the device_context from the queue_affinity. - int device_ordinal = iree_math_count_trailing_zeros_u64(data->queue_affinity); + int device_ordinal = + iree_math_count_trailing_zeros_u64(data->base.queue_affinity); // Read any tracing events that were submitted. @@ -1738,7 +2074,7 @@ static iree_status_t iree_hal_hip_device_complete_submission( iree_hal_command_buffer_t* command_buffer = data->command_buffer; if (iree_hal_hip_multi_queue_command_buffer_isa(command_buffer)) { status = iree_hal_hip_multi_queue_command_buffer_get( - command_buffer, data->queue_affinity, &command_buffer); + command_buffer, data->base.queue_affinity, &command_buffer); } if (iree_status_is_ok(status)) { @@ -1764,9 +2100,10 @@ static iree_status_t iree_hal_hip_device_complete_submission( iree_hal_hip_event_release(event); // Notify all of the signal semaphores that they have been incremented. 
- for (iree_host_size_t i = 0; i < data->signal_semaphore_list.count; ++i) { + for (iree_host_size_t i = 0; i < data->base.signal_semaphore_list.count; + ++i) { iree_status_ignore(iree_hal_hip_event_semaphore_advance( - data->signal_semaphore_list.semaphores[i])); + data->base.signal_semaphore_list.semaphores[i])); } iree_hal_hip_device_destroy_callback_data(data); @@ -1780,27 +2117,28 @@ static iree_status_t iree_hal_hip_device_execute_now(void* user_data, iree_hal_hip_device_semaphore_submit_callback_data_t* data = (iree_hal_hip_device_semaphore_submit_callback_data_t*)user_data; - IREE_ASSERT_EQ(iree_math_count_ones_u64(data->queue_affinity), 1, + IREE_ASSERT_EQ(iree_math_count_ones_u64(data->base.queue_affinity), 1, "Cannot execute a command buffer on more than one queue"); - iree_hal_hip_device_t* device = data->device; + iree_hal_hip_device_t* device = data->base.device; // If we had a semaphore failure then we should propagate it // but not run anything. - status = iree_status_join(status, data->status); + status = iree_status_join(status, data->base.status); - int device_ordinal = iree_math_count_trailing_zeros_u64(data->queue_affinity); + int device_ordinal = + iree_math_count_trailing_zeros_u64(data->base.queue_affinity); IREE_TRACE_ZONE_APPEND_VALUE_I64(z0, device_ordinal); if (iree_status_is_ok(status)) { status = IREE_HIP_CALL_TO_STATUS( - data->device->hip_symbols, - hipCtxPushCurrent(data->device->devices[device_ordinal].hip_context)); + device->hip_symbols, + hipCtxPushCurrent(device->devices[device_ordinal].hip_context)); } if (iree_status_is_ok(status)) { status = iree_hal_hip_device_stream_wait_for_semaphores( - data->device, data->wait_semaphore_list, device_ordinal); + device, data->base.wait_semaphore_list, device_ordinal); } // We have satisfied all of the waits. 
@@ -1809,7 +2147,7 @@ static iree_status_t iree_hal_hip_device_execute_now(void* user_data, if (iree_status_is_ok(status)) { if (iree_hal_hip_multi_queue_command_buffer_isa(command_buffer)) { status = iree_hal_hip_multi_queue_command_buffer_get( - command_buffer, data->queue_affinity, &command_buffer); + command_buffer, data->base.queue_affinity, &command_buffer); } } if (iree_status_is_ok(status)) { @@ -1825,9 +2163,8 @@ static iree_status_t iree_hal_hip_device_execute_now(void* user_data, ? IREE_HAL_COMMAND_BUFFER_MODE_UNVALIDATED : 0); status = iree_hal_hip_device_create_stream_command_buffer( - (iree_hal_device_t*)data->device, mode, - command_buffer->allowed_categories, data->queue_affinity, 0, - &stream_command_buffer); + (iree_hal_device_t*)device, mode, command_buffer->allowed_categories, + data->base.queue_affinity, 0, &stream_command_buffer); if (iree_status_is_ok(status)) { status = iree_hal_resource_set_insert(data->resource_set, 1, &stream_command_buffer); @@ -1850,7 +2187,7 @@ static iree_status_t iree_hal_hip_device_execute_now(void* user_data, hipGraphExec_t exec = iree_hal_hip_graph_command_buffer_handle(command_buffer); status = IREE_HIP_CALL_TO_STATUS( - data->device->hip_symbols, + device->hip_symbols, hipGraphLaunch( exec, device->devices[device_ordinal].hip_dispatch_stream)); IREE_TRACE_ZONE_END(z2); @@ -1865,18 +2202,19 @@ static iree_status_t iree_hal_hip_device_execute_now(void* user_data, // Store symbols, because the cleanup may trigger off-thread // before it returns. 
- const iree_hal_hip_dynamic_symbols_t* symbols = data->device->hip_symbols; + const iree_hal_hip_dynamic_symbols_t* symbols = device->hip_symbols; if (iree_status_is_ok(status)) { status = iree_hal_hip_device_stream_signal_semaphores_and_add_cleanup( - data->device, data->device->cleanup_thread, data->signal_semaphore_list, + device, device->cleanup_thread, data->base.signal_semaphore_list, device_ordinal, iree_hal_hip_device_complete_submission, data); } if (!iree_status_is_ok(status)) { - for (iree_host_size_t i = 0; i < data->signal_semaphore_list.count; ++i) { - iree_hal_semaphore_fail(data->signal_semaphore_list.semaphores[i], - iree_status_clone(data->status)); + for (iree_host_size_t i = 0; i < data->base.signal_semaphore_list.count; + ++i) { + iree_hal_semaphore_fail(data->base.signal_semaphore_list.semaphores[i], + iree_status_clone(data->base.status)); } iree_hal_hip_device_destroy_callback_data(data); } @@ -1886,29 +2224,79 @@ static iree_status_t iree_hal_hip_device_execute_now(void* user_data, status, IREE_HIP_CALL_TO_STATUS(symbols, hipCtxPopCurrent(NULL))); } -static iree_status_t iree_hal_hip_device_semaphore_submit_callback( - void* user_context, iree_hal_semaphore_t* semaphore, iree_status_t status) { - iree_hal_hip_device_semaphore_submit_callback_data_t* data = - (iree_hal_hip_device_semaphore_submit_callback_data_t*)user_context; +static iree_status_t iree_hal_hip_device_make_callback_data( + iree_hal_hip_device_t* device, iree_allocator_t host_allocator, + iree_arena_block_pool_t* block_pool, + iree_hal_queue_affinity_t queue_affinity, + const iree_hal_semaphore_list_t wait_semaphore_list, + const iree_hal_semaphore_list_t signal_semaphore_list, + iree_hal_command_buffer_t* command_buffer, + iree_hal_buffer_binding_table_t binding_table, + iree_hal_hip_device_semaphore_submit_callback_data_t** out_data) { + IREE_TRACE_ZONE_BEGIN(z0); - if (!iree_status_is_ok(status)) { - iree_slim_mutex_lock(&data->status_mutex); - data->status = 
iree_status_join(data->status, status); - iree_slim_mutex_unlock(&data->status_mutex); + *out_data = NULL; + + // Embed captured tables in the action allocation. + iree_hal_hip_device_semaphore_submit_callback_data_t* callback_data = NULL; + + const iree_host_size_t payload_size = + binding_table.count * sizeof(*binding_table.bindings); + + const iree_host_size_t additional_data_for_base = + iree_hal_hip_semaphore_callback_data_get_additional_allocation_size( + wait_semaphore_list, signal_semaphore_list); + + const iree_host_size_t total_callback_size = + sizeof(*callback_data) + additional_data_for_base + payload_size; + IREE_RETURN_AND_END_ZONE_IF_ERROR( + z0, iree_allocator_malloc(host_allocator, total_callback_size, + (void**)&callback_data)); + + iree_hal_hip_semaphore_callback_data_initialize( + host_allocator, device, queue_affinity, &iree_hal_hip_device_execute_now, + wait_semaphore_list, signal_semaphore_list, + (void*)((uint8_t*)callback_data + sizeof(*callback_data)), + &callback_data->base); + + // Copy the execution resources for later access. + callback_data->command_buffer = command_buffer; + + // Retain all command buffers and semaphores. 
+ iree_status_t status = + iree_hal_resource_set_allocate(block_pool, &callback_data->resource_set); + if (iree_status_is_ok(status)) { + status = iree_hal_resource_set_insert(callback_data->resource_set, + wait_semaphore_list.count, + wait_semaphore_list.semaphores); } - if (iree_atomic_fetch_sub(&data->wait_semaphore_count, 1, - iree_memory_order_acq_rel) != 1) { - return iree_ok_status(); + if (iree_status_is_ok(status)) { + status = iree_hal_resource_set_insert(callback_data->resource_set, + signal_semaphore_list.count, + signal_semaphore_list.semaphores); + } + if (iree_status_is_ok(status)) { + status = iree_hal_resource_set_insert(callback_data->resource_set, 1, + &command_buffer); } - int device_ordinal = iree_math_count_trailing_zeros_u64(data->queue_affinity); + callback_data->binding_table = binding_table; + iree_hal_buffer_binding_t* binding_element_ptr = + (iree_hal_buffer_binding_t*)((uint8_t*)callback_data + + sizeof(*callback_data) + + additional_data_for_base); + callback_data->binding_table.bindings = binding_element_ptr; + memcpy(binding_element_ptr, binding_table.bindings, + sizeof(*binding_element_ptr) * binding_table.count); + status = iree_hal_resource_set_insert_strided( + callback_data->resource_set, binding_table.count, + callback_data->binding_table.bindings, + offsetof(iree_hal_buffer_binding_t, buffer), + sizeof(iree_hal_buffer_binding_t)); - // Now the actual submit happens, as all semaphore have been satisfied - // (by satisfied here, we specifically mean that the semaphore has been - // scheduled, not necessarily completed) - return iree_hal_hip_dispatch_thread_add_dispatch( - data->device->devices[device_ordinal].dispatch_thread, - &iree_hal_hip_device_execute_now, data); + *out_data = callback_data; + IREE_TRACE_ZONE_END(z0); + return status; } static iree_status_t iree_hal_hip_device_queue_execute( @@ -1953,12 +2341,11 @@ static iree_status_t iree_hal_hip_device_queue_execute( if (iree_status_is_ok(status)) { for (iree_host_size_t 
i = 0; i < wait_semaphore_list.count; ++i) { status = iree_status_join( - status, - iree_hal_hip_semaphore_notify_work( - wait_semaphore_list.semaphores[i], - wait_semaphore_list.payload_values[i], - device->devices[device_ordinal].device_event_pool, - &iree_hal_hip_device_semaphore_submit_callback, callback_data)); + status, iree_hal_hip_semaphore_notify_work( + wait_semaphore_list.semaphores[i], + wait_semaphore_list.payload_values[i], + device->devices[device_ordinal].device_event_pool, + &iree_hal_hip_device_semaphore_callback, callback_data)); } } else { iree_hal_hip_device_destroy_callback_data(callback_data); diff --git a/runtime/src/iree/hal/drivers/hip/per_device_information.h b/runtime/src/iree/hal/drivers/hip/per_device_information.h index 88c332f1b3e9..45454696994f 100644 --- a/runtime/src/iree/hal/drivers/hip/per_device_information.h +++ b/runtime/src/iree/hal/drivers/hip/per_device_information.h @@ -7,6 +7,7 @@ #ifndef IREE_HAL_DRIVERS_HIP_PER_DEVICE_INFORMATION_H_ #define IREE_HAL_DRIVERS_HIP_PER_DEVICE_INFORMATION_H_ +#include "iree/base/internal/synchronization.h" #include "iree/hal/drivers/hip/dispatch_thread.h" #include "iree/hal/drivers/hip/hip_headers.h" #include "iree/hal/drivers/hip/memory_pools.h" @@ -26,6 +27,14 @@ typedef struct iree_hal_hip_per_device_info_t { iree_hal_hip_dispatch_thread_t* dispatch_thread; + struct { + iree_hal_buffer_t* buffer; + iree_host_size_t head; + iree_host_size_t tail; + iree_slim_mutex_t mutex; + iree_notification_t notify; + } file_transfer_staging_buffer; + iree_hal_hip_memory_pools_t memory_pools; } iree_hal_hip_per_device_info_t; From 0a0483e997bea993cdf26e62ef5c68a043011e1b Mon Sep 17 00:00:00 2001 From: Andrew Woloszyn Date: Fri, 20 Dec 2024 12:48:40 -0800 Subject: [PATCH 58/64] [hip] Add trace zones to copy/fill/update buffer commands. (#19544) They were missing in the hip stream case. 
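For illustration only, the pairing this change adds can be sketched as below. This is a minimal sketch, not the driver code: `sketch_trace_list_t`, `sketch_zone_begin`, `sketch_zone_end`, and `sketch_copy_buffer` are hypothetical names standing in for the stream tracing macros and the copy command, and `memcpy` stands in for the asynchronous device copy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical stand-in for a per-command-buffer tracing event list.
typedef struct sketch_trace_list_t {
  int begun;  // zones opened so far
  int ended;  // zones closed so far
} sketch_trace_list_t;

static void sketch_zone_begin(sketch_trace_list_t* list) { ++list->begun; }
static void sketch_zone_end(sketch_trace_list_t* list) { ++list->ended; }

// Hypothetical copy command: the transfer is bracketed by a begin/end pair so
// that it shows up as its own zone on the stream timeline.
static int sketch_copy_buffer(sketch_trace_list_t* list, void* dst,
                              const void* src, size_t length) {
  assert(list && dst && src);
  sketch_zone_begin(list);
  memcpy(dst, src, length);  // stands in for the asynchronous device copy
  sketch_zone_end(list);
  return 0;  // 0 means success in this sketch
}
```

In this sketch every begin has a matching end on the success path; a real command would likely also need to consider early-return error paths between the two.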
Signed-off-by: Andrew Woloszyn --- .../hal/drivers/hip/stream_command_buffer.c | 21 +++++++++++++++++-- 1 file changed, 19 insertions(+), 2 deletions(-) diff --git a/runtime/src/iree/hal/drivers/hip/stream_command_buffer.c b/runtime/src/iree/hal/drivers/hip/stream_command_buffer.c index d0dcea11678c..6ccc93c8f849 100644 --- a/runtime/src/iree/hal/drivers/hip/stream_command_buffer.c +++ b/runtime/src/iree/hal/drivers/hip/stream_command_buffer.c @@ -328,6 +328,9 @@ static iree_status_t iree_hal_hip_stream_command_buffer_fill_buffer( iree_hal_buffer_byte_offset(target_ref.buffer) + target_ref.offset; hipDeviceptr_t dst = (uint8_t*)target_device_buffer + target_offset; size_t num_elements = target_ref.length / pattern_length; + IREE_HAL_STREAM_TRACE_ZONE_BEGIN(command_buffer->tracing_context, + &command_buffer->tracing_event_list, + IREE_HAL_STREAM_TRACING_VERBOSITY_FINE); switch (pattern_length) { case 4: { @@ -359,7 +362,9 @@ static iree_status_t iree_hal_hip_stream_command_buffer_fill_buffer( return iree_make_status(IREE_STATUS_INTERNAL, "unsupported fill pattern length"); } - + IREE_HAL_STREAM_TRACE_ZONE_END(command_buffer->tracing_context, + &command_buffer->tracing_event_list, + IREE_HAL_STREAM_TRACING_VERBOSITY_FINE); IREE_TRACE_ZONE_END(z0); return iree_ok_status(); } @@ -397,12 +402,17 @@ static iree_status_t iree_hal_hip_stream_command_buffer_update_buffer( hipDeviceptr_t dst = (uint8_t*)target_device_buffer + iree_hal_buffer_byte_offset(target_ref.buffer) + target_ref.offset; + IREE_HAL_STREAM_TRACE_ZONE_BEGIN(command_buffer->tracing_context, + &command_buffer->tracing_event_list, + IREE_HAL_STREAM_TRACING_VERBOSITY_FINE); IREE_HIP_RETURN_AND_END_ZONE_IF_ERROR( z0, command_buffer->hip_symbols, hipMemcpyHtoDAsync(dst, (void*)src, target_ref.length, command_buffer->hip_stream), "hipMemcpyHtoDAsync"); - + IREE_HAL_STREAM_TRACE_ZONE_END(command_buffer->tracing_context, + &command_buffer->tracing_event_list, + IREE_HAL_STREAM_TRACING_VERBOSITY_FINE); 
IREE_TRACE_ZONE_END(z0); return iree_ok_status(); } @@ -417,6 +427,9 @@ static iree_status_t iree_hal_hip_stream_command_buffer_copy_buffer( IREE_RETURN_AND_END_ZONE_IF_ERROR( z0, iree_hal_hip_stream_command_buffer_flush_collectives(command_buffer)); + IREE_HAL_STREAM_TRACE_ZONE_BEGIN(command_buffer->tracing_context, + &command_buffer->tracing_event_list, + IREE_HAL_STREAM_TRACING_VERBOSITY_FINE); hipDeviceptr_t target_device_buffer = iree_hal_hip_buffer_device_pointer( iree_hal_buffer_allocated_buffer(target_ref.buffer)); @@ -435,6 +448,10 @@ static iree_status_t iree_hal_hip_stream_command_buffer_copy_buffer( command_buffer->hip_stream), "hipMemcpyAsync"); + IREE_HAL_STREAM_TRACE_ZONE_END(command_buffer->tracing_context, + &command_buffer->tracing_event_list, + IREE_HAL_STREAM_TRACING_VERBOSITY_FINE); + IREE_TRACE_ZONE_END(z0); return iree_ok_status(); } From 76a7b893e4c62d52eae2c165bdb23952a8589689 Mon Sep 17 00:00:00 2001 From: Andrew Woloszyn Date: Fri, 20 Dec 2024 14:21:00 -0800 Subject: [PATCH 59/64] Revert "[hip] Fixed a busy wait in event_semaphore." (#19548) Reverts iree-org/iree#19540 This might be causing flakes in some of the bots. --- runtime/src/iree/hal/drivers/hip/event_semaphore.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/runtime/src/iree/hal/drivers/hip/event_semaphore.c b/runtime/src/iree/hal/drivers/hip/event_semaphore.c index ddd00fc42a4d..276290c5f006 100644 --- a/runtime/src/iree/hal/drivers/hip/event_semaphore.c +++ b/runtime/src/iree/hal/drivers/hip/event_semaphore.c @@ -842,6 +842,10 @@ static iree_status_t iree_hal_hip_semaphore_wait( iree_notification_prepare_wait(&semaphore->state_notification); iree_slim_mutex_unlock(&semaphore->mutex); + // We are going to pick up the correct status from query_locked below. + iree_status_ignore( + iree_hal_hip_event_semaphore_run_scheduled_callbacks(base_semaphore)); + // We have to wait for the semaphore to catch up. 
bool committed = iree_notification_commit_wait(&semaphore->state_notification, wait, From 0184eeec8ed5620ad4aa97048da4d00f67c71109 Mon Sep 17 00:00:00 2001 From: MaheshRavishankar <1663364+MaheshRavishankar@users.noreply.github.com> Date: Sat, 21 Dec 2024 11:02:39 -0800 Subject: [PATCH 60/64] [Codegen][RoCDL] Add patterns for lowering bit-width emulation operations to LLVM (#19551) Signed-off-by: MaheshRavishankar --- .../Codegen/LLVMGPU/ConvertToROCDL.cpp | 4 + .../LLVMGPU/test/convert_to_rocdl.mlir | 245 ++++++++++++------ 2 files changed, 169 insertions(+), 80 deletions(-) diff --git a/compiler/src/iree/compiler/Codegen/LLVMGPU/ConvertToROCDL.cpp b/compiler/src/iree/compiler/Codegen/LLVMGPU/ConvertToROCDL.cpp index 694ed778b524..bf7860ab7f43 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/ConvertToROCDL.cpp +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/ConvertToROCDL.cpp @@ -159,7 +159,10 @@ struct ConvertToROCDLPass final populateDropSharedMemoryDeallocOpPatterns(patterns); populateScalarizeMathOps(patterns); vector::populateVectorToVectorCanonicalizationPatterns(patterns); + vector::populateBubbleVectorBitCastOpPatterns(patterns); vector::populateVectorBroadcastLoweringPatterns(patterns); + vector::populateVectorInterleaveLoweringPatterns(patterns); + vector::populateVectorInterleaveToShufflePatterns(patterns); vector::populateVectorContractLoweringPatterns( patterns, vector::VectorTransformsOptions().setVectorTransformsOptions( @@ -225,6 +228,7 @@ struct ConvertToROCDLPass final vector::populateVectorRankReducingFMAPattern(llvmPatterns); vector::populateVectorInsertExtractStridedSliceTransforms(llvmPatterns); vector::populateVectorStepLoweringPatterns(llvmPatterns); + vector::populateVectorBitCastLoweringPatterns(llvmPatterns); populateVectorToLLVMConversionPatterns(converter, llvmPatterns); vector::populateVectorTransferLoweringPatterns(llvmPatterns, /*maxTransferRank=*/1); diff --git 
a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/convert_to_rocdl.mlir b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/convert_to_rocdl.mlir index 896b6f2294a4..2401b6573291 100644 --- a/compiler/src/iree/compiler/Codegen/LLVMGPU/test/convert_to_rocdl.mlir +++ b/compiler/src/iree/compiler/Codegen/LLVMGPU/test/convert_to_rocdl.mlir @@ -1,5 +1,5 @@ -// RUN: iree-opt --split-input-file --iree-gpu-test-target=gfx908 --pass-pipeline="builtin.module(hal.executable(hal.executable.variant(builtin.module(iree-convert-to-rocdl))))" %s | FileCheck %s -// RUN: iree-opt --split-input-file --iree-gpu-test-target=gfx908 --pass-pipeline="builtin.module(hal.executable(hal.executable.variant(builtin.module(iree-convert-to-rocdl))))" --iree-hip-index-bits=32 %s | FileCheck %s --check-prefix=INDEX32 +// RUN: iree-opt --split-input-file --iree-gpu-test-target=gfx908 --iree-convert-to-rocdl %s | FileCheck %s +// RUN: iree-opt --split-input-file --iree-gpu-test-target=gfx908 --iree-convert-to-rocdl --iree-hip-index-bits=32 %s | FileCheck %s --check-prefix=INDEX32 // Test that that standard and GPU ops are converted to LLVM and NVVM. 
#pipeline_layout = #hal.pipeline.layout, #hal.pipeline.binding ]> -hal.executable @abs_ex_dispatch_0 { - hal.executable.variant @rocm target(<"rocm", "rocm-hsaco-fb">) { - hal.executable.export @abs_ex_dispatch_0 layout(#pipeline_layout) - builtin.module { - func.func @abs_ex_dispatch_0() { - %c0 = arith.constant 0 : index - %0 = hal.interface.binding.subspan layout(#pipeline_layout) binding(0) flags(ReadOnly) : memref<16xf32> - %1 = hal.interface.binding.subspan layout(#pipeline_layout) binding(1) : memref<16xf32> - %2 = hal.interface.binding.subspan layout(#pipeline_layout) binding(2) : memref<16xf32> - %3 = gpu.block_id x - %4 = gpu.block_dim x - %5 = gpu.thread_id x - %6 = arith.muli %3, %4 : index - %7 = arith.addi %6, %5 : index - %9 = memref.load %1[%7] : memref<16xf32> - %10 = memref.load %2[%7] : memref<16xf32> - %11 = arith.addf %9, %10 : f32 - memref.store %11, %0[%7] : memref<16xf32> - return - } - } +builtin.module { + func.func @abs_ex_dispatch_0() { + %c0 = arith.constant 0 : index + %0 = hal.interface.binding.subspan layout(#pipeline_layout) binding(0) flags(ReadOnly) : memref<16xf32> + %1 = hal.interface.binding.subspan layout(#pipeline_layout) binding(1) : memref<16xf32> + %2 = hal.interface.binding.subspan layout(#pipeline_layout) binding(2) : memref<16xf32> + %3 = gpu.block_id x + %4 = gpu.block_dim x + %5 = gpu.thread_id x + %6 = arith.muli %3, %4 : index + %7 = arith.addi %6, %5 : index + %9 = memref.load %1[%7] : memref<16xf32> + %10 = memref.load %2[%7] : memref<16xf32> + %11 = arith.addf %9, %10 : f32 + memref.store %11, %0[%7] : memref<16xf32> + return } } // CHECK-LABEL: llvm.func @abs_ex_dispatch_0 @@ -49,23 +44,18 @@ hal.executable @abs_ex_dispatch_0 { #hal.pipeline.binding, #hal.pipeline.binding ]> -hal.executable @abs_ex_dispatch_0 { - hal.executable.variant @rocm target(<"rocm", "rocm-hsaco-fb">) { - hal.executable.export @abs_ex_dispatch_0 layout(#pipeline_layout) - builtin.module { - func.func @reduction_maximum() { - %c0 = 
arith.constant 0 : index - %0 = hal.interface.binding.subspan layout(#pipeline_layout) binding(0) alignment(64) offset(%c0) flags(ReadOnly) : - memref<32x64x64xf32, strided<[4096, 64, 1], offset: ?>> - %1 = hal.interface.binding.subspan layout(#pipeline_layout) binding(1) alignment(64) offset(%c0) : memref<32x64x64xf32, - strided<[4096, 64, 1], offset: ?>> - %2 = vector.load %0[%c0, %c0, %c0] : memref<32x64x64xf32, strided<[4096, 64, 1], offset: ?>>, vector<2xf32> - %3 = vector.reduction , %2 : vector<2xf32> into f32 - %4 = vector.splat %3 : vector<2xf32> - vector.store %4, %1[%c0, %c0, %c0] : memref<32x64x64xf32, strided<[4096, 64, 1], offset: ?>>, vector<2xf32> - return - } - } +builtin.module { + func.func @reduction_maximum() { + %c0 = arith.constant 0 : index + %0 = hal.interface.binding.subspan layout(#pipeline_layout) binding(0) alignment(64) offset(%c0) flags(ReadOnly) : + memref<32x64x64xf32, strided<[4096, 64, 1], offset: ?>> + %1 = hal.interface.binding.subspan layout(#pipeline_layout) binding(1) alignment(64) offset(%c0) : memref<32x64x64xf32, + strided<[4096, 64, 1], offset: ?>> + %2 = vector.load %0[%c0, %c0, %c0] : memref<32x64x64xf32, strided<[4096, 64, 1], offset: ?>>, vector<2xf32> + %3 = vector.reduction , %2 : vector<2xf32> into f32 + %4 = vector.splat %3 : vector<2xf32> + vector.store %4, %1[%c0, %c0, %c0] : memref<32x64x64xf32, strided<[4096, 64, 1], offset: ?>>, vector<2xf32> + return } } // CHECK-LABEL: llvm.func @reduction_maximum @@ -76,15 +66,10 @@ hal.executable @abs_ex_dispatch_0 { #pipeline_layout = #hal.pipeline.layout ]> -hal.executable @simple_barrier { - hal.executable.variant @rocm target(<"rocm", "rocm-hsaco-fb">) { - hal.executable.export @simple_barrier layout(#pipeline_layout) - builtin.module { - func.func @simple_barrier() { - gpu.barrier - return - } - } +builtin.module { + func.func @simple_barrier() { + gpu.barrier + return } } // CHECK-LABEL: llvm.func @simple_barrier @@ -95,22 +80,17 @@ hal.executable @simple_barrier { 
#hal.pipeline.binding, #hal.pipeline.binding ]> -hal.executable @masked_load_store { - hal.executable.variant @rocm target(<"rocm", "rocm-hsaco-fb">) { - hal.executable.export @masked_load_store layout(#pipeline_layout) - builtin.module { - func.func @masked_load_store() { - %c0 = arith.constant 0 : index - %idx = gpu.thread_id x - %pass_thru = arith.constant dense<0.000000e+00> : vector<1xf32> - %0 = hal.interface.binding.subspan layout(#pipeline_layout) binding(0) alignment(64) offset(%c0) flags(ReadOnly) : memref<64xf32, #gpu.address_space> - %1 = hal.interface.binding.subspan layout(#pipeline_layout) binding(1) alignment(64) offset(%c0) : memref<64xf32, #gpu.address_space> - %mask = vector.create_mask %idx : vector<1xi1> - %ld = vector.maskedload %0[%idx], %mask, %pass_thru : memref<64xf32, #gpu.address_space>, vector<1xi1>, vector<1xf32> into vector<1xf32> - vector.maskedstore %1[%idx], %mask, %ld : memref<64xf32, #gpu.address_space>, vector<1xi1>, vector<1xf32> - return - } - } +builtin.module { + func.func @masked_load_store() { + %c0 = arith.constant 0 : index + %idx = gpu.thread_id x + %pass_thru = arith.constant dense<0.000000e+00> : vector<1xf32> + %0 = hal.interface.binding.subspan layout(#pipeline_layout) binding(0) alignment(64) offset(%c0) flags(ReadOnly) : memref<64xf32, #gpu.address_space> + %1 = hal.interface.binding.subspan layout(#pipeline_layout) binding(1) alignment(64) offset(%c0) : memref<64xf32, #gpu.address_space> + %mask = vector.create_mask %idx : vector<1xi1> + %ld = vector.maskedload %0[%idx], %mask, %pass_thru : memref<64xf32, #gpu.address_space>, vector<1xi1>, vector<1xf32> into vector<1xf32> + vector.maskedstore %1[%idx], %mask, %ld : memref<64xf32, #gpu.address_space>, vector<1xi1>, vector<1xf32> + return } } // CHECK-LABEL: llvm.func @masked_load_store @@ -125,23 +105,128 @@ hal.executable @masked_load_store { #hal.pipeline.binding, #hal.pipeline.binding ]> -hal.executable private @interface_wg_size { - hal.executable.variant 
@rocm target(<"rocm", "rocm-hsaco-fb">) { - hal.executable.export @interface_wg_size layout(#pipeline_layout) attributes { - workgroup_size = [32: index, 1: index, 1: index] - } - builtin.module attributes {} { - func.func @interface_wg_size() { - %c0 = arith.constant 0.0 : f32 - %workgroup_size_x = hal.interface.workgroup.size[0] : index - %workgroup_size_y = hal.interface.workgroup.size[1] : index - %subspan = hal.interface.binding.subspan layout(#pipeline_layout) binding(0) : memref<64x64xf32> - memref.store %c0, %subspan[%workgroup_size_x, %workgroup_size_y] : memref<64x64xf32> - return - } - } +builtin.module attributes {} { + func.func @interface_wg_size() { + %c0 = arith.constant 0.0 : f32 + %workgroup_size_x = hal.interface.workgroup.size[0] : index + %workgroup_size_y = hal.interface.workgroup.size[1] : index + %subspan = hal.interface.binding.subspan layout(#pipeline_layout) binding(0) : memref<64x64xf32> + memref.store %c0, %subspan[%workgroup_size_x, %workgroup_size_y] : memref<64x64xf32> + return } } // CHECK-LABEL: llvm.func @interface_wg_size // CHECK: %[[WGDIMX:.+]] = rocdl.workgroup.dim.x // CHECK: %[[WGDIMY:.+]] = rocdl.workgroup.dim.y + +// ----- + +// Check that the operations generated by emulate bit widths are lowered correctly + +module { + func.func @emulation_lowering() { + %cst = arith.constant dense<4> : vector<2x4xi8> + %cst_0 = arith.constant dense<15> : vector<2x4xi8> + %c1 = arith.constant 1 : index + %cst_1 = arith.constant dense<0> : vector<2x8xi4> + %cst_2 = arith.constant dense<0.000000e+00> : vector<8x1x2xf16> + %c32 = arith.constant 32 : index + %c2 = arith.constant 2 : index + %c8 = arith.constant 8 : index + %c4 = arith.constant 4 : index + %c0 = arith.constant 0 : index + %thread_id_x = gpu.thread_id x upper_bound 64 + %0 = hal.interface.binding.subspan layout(, #hal.pipeline.binding, #hal.pipeline.binding, #hal.pipeline.binding, #hal.pipeline.binding], flags = Indirect>) binding(0) alignment(64) offset(%c0) flags(ReadOnly) : 
memref<131072x192xf16, #gpu.address_space> + memref.assume_alignment %0, 64 : memref<131072x192xf16, #gpu.address_space> + %1 = hal.interface.binding.subspan layout(, #hal.pipeline.binding, #hal.pipeline.binding, #hal.pipeline.binding, #hal.pipeline.binding], flags = Indirect>) binding(1) alignment(64) offset(%c0) flags(ReadOnly) : memref<131072x192xf16, #gpu.address_space> + memref.assume_alignment %1, 64 : memref<131072x192xf16, #gpu.address_space> + %2 = hal.interface.binding.subspan layout(, #hal.pipeline.binding, #hal.pipeline.binding, #hal.pipeline.binding, #hal.pipeline.binding], flags = Indirect>) binding(2) alignment(64) offset(%c0) flags(ReadOnly) : memref<402653184xi8, #gpu.address_space> + memref.assume_alignment %2, 64 : memref<402653184xi8, #gpu.address_space> + %3 = hal.interface.binding.subspan layout(, #hal.pipeline.binding, #hal.pipeline.binding, #hal.pipeline.binding, #hal.pipeline.binding], flags = Indirect>) binding(4) alignment(64) offset(%c0) flags(Indirect) : memref<131072x192x32xf16, #gpu.address_space> + memref.assume_alignment %3, 64 : memref<131072x192x32xf16, #gpu.address_space> + %4 = arith.divui %thread_id_x, %c4 : index + %5 = arith.remui %thread_id_x, %c4 : index + %6 = arith.muli %5, %c8 : index + %workgroup_id_x = hal.interface.workgroup.id[0] upper_bound 6 : index + %workgroup_id_y = hal.interface.workgroup.id[1] upper_bound 131072 : index + %7 = arith.muli %4, %c2 : index + %8 = arith.muli %workgroup_id_x, %c32 : index + %9 = arith.addi %7, %8 : index + %10 = vector.load %0[%workgroup_id_y, %9] : memref<131072x192xf16, #gpu.address_space>, vector<2xf16> + %11 = vector.broadcast %10 : vector<2xf16> to vector<1x2xf16> + %12 = vector.insert %11, %cst_2 [0] : vector<1x2xf16> into vector<8x1x2xf16> + %13 = vector.insert %11, %12 [1] : vector<1x2xf16> into vector<8x1x2xf16> + %14 = vector.insert %11, %13 [2] : vector<1x2xf16> into vector<8x1x2xf16> + %15 = vector.insert %11, %14 [3] : vector<1x2xf16> into vector<8x1x2xf16> + %16 = 
vector.insert %11, %15 [4] : vector<1x2xf16> into vector<8x1x2xf16> + %17 = vector.insert %11, %16 [5] : vector<1x2xf16> into vector<8x1x2xf16> + %18 = vector.insert %11, %17 [6] : vector<1x2xf16> into vector<8x1x2xf16> + %19 = vector.insert %11, %18 [7] : vector<1x2xf16> into vector<8x1x2xf16> + %20 = vector.transpose %19, [1, 2, 0] : vector<8x1x2xf16> to vector<1x2x8xf16> + %21 = vector.load %1[%workgroup_id_y, %9] : memref<131072x192xf16, #gpu.address_space>, vector<2xf16> + %22 = vector.broadcast %21 : vector<2xf16> to vector<1x2xf16> + %23 = vector.insert %22, %cst_2 [0] : vector<1x2xf16> into vector<8x1x2xf16> + %24 = vector.insert %22, %23 [1] : vector<1x2xf16> into vector<8x1x2xf16> + %25 = vector.insert %22, %24 [2] : vector<1x2xf16> into vector<8x1x2xf16> + %26 = vector.insert %22, %25 [3] : vector<1x2xf16> into vector<8x1x2xf16> + %27 = vector.insert %22, %26 [4] : vector<1x2xf16> into vector<8x1x2xf16> + %28 = vector.insert %22, %27 [5] : vector<1x2xf16> into vector<8x1x2xf16> + %29 = vector.insert %22, %28 [6] : vector<1x2xf16> into vector<8x1x2xf16> + %30 = vector.insert %22, %29 [7] : vector<1x2xf16> into vector<8x1x2xf16> + %31 = vector.transpose %30, [1, 2, 0] : vector<8x1x2xf16> to vector<1x2x8xf16> + %c3072 = arith.constant 3072 : index + %32 = arith.muli %workgroup_id_y, %c3072 : index + %c16 = arith.constant 16 : index + %33 = arith.muli %9, %c16 : index + %34 = arith.addi %32, %33 : index + %c2_3 = arith.constant 2 : index + %c0_4 = arith.constant 0 : index + %c-1 = arith.constant -1 : index + %35 = arith.cmpi slt, %6, %c0_4 : index + %36 = arith.subi %c-1, %6 : index + %37 = arith.select %35, %36, %6 : index + %38 = arith.divsi %37, %c2_3 : index + %39 = arith.subi %c-1, %38 : index + %40 = arith.select %35, %39, %38 : index + %41 = arith.addi %34, %40 : index + %42 = vector.load %2[%41] : memref<402653184xi8, #gpu.address_space>, vector<4xi8> + %43 = vector.bitcast %42 : vector<4xi8> to vector<8xi4> + %44 = vector.insert %43, %cst_1 [0] : 
vector<8xi4> into vector<2x8xi4> + %45 = arith.addi %9, %c1 : index + %c3072_5 = arith.constant 3072 : index + %46 = arith.muli %workgroup_id_y, %c3072_5 : index + %c16_6 = arith.constant 16 : index + %47 = arith.muli %45, %c16_6 : index + %48 = arith.addi %46, %47 : index + %c2_7 = arith.constant 2 : index + %c0_8 = arith.constant 0 : index + %c-1_9 = arith.constant -1 : index + %49 = arith.cmpi slt, %6, %c0_8 : index + %50 = arith.subi %c-1_9, %6 : index + %51 = arith.select %49, %50, %6 : index + %52 = arith.divsi %51, %c2_7 : index + %53 = arith.subi %c-1_9, %52 : index + %54 = arith.select %49, %53, %52 : index + %55 = arith.addi %48, %54 : index + %56 = vector.load %2[%55] : memref<402653184xi8, #gpu.address_space>, vector<4xi8> + %57 = vector.bitcast %56 : vector<4xi8> to vector<8xi4> + %58 = vector.insert %57, %44 [1] : vector<8xi4> into vector<2x8xi4> + %59 = vector.bitcast %58 : vector<2x8xi4> to vector<2x4xi8> + %60 = arith.andi %59, %cst_0 : vector<2x4xi8> + %61 = arith.shrui %59, %cst : vector<2x4xi8> + %62 = vector.interleave %60, %61 : vector<2x4xi8> -> vector<2x8xi8> + %63 = arith.extui %62 : vector<2x8xi8> to vector<2x8xi32> + %64 = arith.uitofp %63 : vector<2x8xi32> to vector<2x8xf16> + %65 = vector.extract %20[0] : vector<2x8xf16> from vector<1x2x8xf16> + %66 = arith.mulf %64, %65 : vector<2x8xf16> + %67 = vector.extract %31[0] : vector<2x8xf16> from vector<1x2x8xf16> + %68 = arith.addf %66, %67 : vector<2x8xf16> + %69 = vector.extract %68[0] : vector<8xf16> from vector<2x8xf16> + vector.store %69, %3[%workgroup_id_y, %9, %6] : memref<131072x192x32xf16, #gpu.address_space>, vector<8xf16> + %70 = vector.extract %68[1] : vector<8xf16> from vector<2x8xf16> + vector.store %70, %3[%workgroup_id_y, %45, %6] : memref<131072x192x32xf16, #gpu.address_space>, vector<8xf16> + return + } +} +// CHECK-LABEL: llvm.func @emulation_lowering( +// CHECK-NOT: builtin.unrealized_conversion_cast From 1f19761d98dde9f512feddc524d76c39bd66a42f Mon Sep 17 00:00:00 2001 
From: Andrew Woloszyn Date: Mon, 23 Dec 2024 12:50:29 -0500 Subject: [PATCH 61/64] Enable peering among all devices on the system. (#19555) We have to do this for models that are using multiple devices but are not using them as a single logical device. Signed-off-by: Andrew Woloszyn --- runtime/src/iree/hal/drivers/hip/hip_device.c | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/runtime/src/iree/hal/drivers/hip/hip_device.c b/runtime/src/iree/hal/drivers/hip/hip_device.c index 47217fafdc73..3cedff6f2ab7 100644 --- a/runtime/src/iree/hal/drivers/hip/hip_device.c +++ b/runtime/src/iree/hal/drivers/hip/hip_device.c @@ -455,13 +455,16 @@ iree_status_t iree_hal_hip_device_create( } if (iree_status_is_ok(status)) { - for (iree_host_size_t j = 0; - j < device_count && iree_status_is_ok(status); ++j) { - if (i == j) { + int hip_device_count = 0; + status = IREE_HIP_CALL_TO_STATUS( + symbols, hipGetDeviceCount(&hip_device_count), "hipGetDeviceCount"); + + for (int j = 0; j < hip_device_count && iree_status_is_ok(status); ++j) { + if (j == device->devices[i].hip_device) { continue; } - status = IREE_HIP_CALL_TO_STATUS( - symbols, hipDeviceEnablePeerAccess(devices[j], 0)); + status = + IREE_HIP_CALL_TO_STATUS(symbols, hipDeviceEnablePeerAccess(j, 0)); } } } From f1e1866b6acfa8f05add4e16007ab4f2f3062df7 Mon Sep 17 00:00:00 2001 From: Stanley Winata <68087699+raikonenfnu@users.noreply.github.com> Date: Fri, 27 Dec 2024 08:15:22 +0700 Subject: [PATCH 62/64] Update LLVM to llvm/llvm-project@b13592219c421820b (#19554) --- .github/workflows/pkgci_regression_test.yml | 2 +- .../regression_suite/shark-test-suite-models/sd3/test_clip.py | 2 +- .../regression_suite/shark-test-suite-models/sd3/test_mmdit.py | 2 +- .../regression_suite/shark-test-suite-models/sd3/test_vae.py | 2 +- .../regression_suite/shark-test-suite-models/sdxl/test_clip.py | 2 +- .../regression_suite/shark-test-suite-models/sdxl/test_unet.py | 2 +- 
.../regression_suite/shark-test-suite-models/sdxl/test_vae.py | 2 +- third_party/llvm-project | 2 +- 8 files changed, 8 insertions(+), 8 deletions(-) diff --git a/.github/workflows/pkgci_regression_test.yml b/.github/workflows/pkgci_regression_test.yml index 04ae87f37105..98b4dc1c8a37 100644 --- a/.github/workflows/pkgci_regression_test.yml +++ b/.github/workflows/pkgci_regression_test.yml @@ -85,7 +85,7 @@ jobs: uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2 with: repository: nod-ai/SHARK-TestSuite - ref: f5615ab29da491c0047146258dfa3a0c40c735e5 + ref: 601db0e472600a94ddb69b37d05cd7d4a17f89b2 path: SHARK-TestSuite submodules: false lfs: true diff --git a/experimental/regression_suite/shark-test-suite-models/sd3/test_clip.py b/experimental/regression_suite/shark-test-suite-models/sd3/test_clip.py index d5e2778481c6..0ba70978937e 100644 --- a/experimental/regression_suite/shark-test-suite-models/sd3/test_clip.py +++ b/experimental/regression_suite/shark-test-suite-models/sd3/test_clip.py @@ -63,7 +63,7 @@ ) sd3_clip_mlir = fetch_source_fixture( - "https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sd3-prompt-encoder/model.mlirbc", + "https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sd3-prompt-encoder/model.mlir", group="sd3_clip", ) diff --git a/experimental/regression_suite/shark-test-suite-models/sd3/test_mmdit.py b/experimental/regression_suite/shark-test-suite-models/sd3/test_mmdit.py index 09e14e2e969c..34376fad22c1 100644 --- a/experimental/regression_suite/shark-test-suite-models/sd3/test_mmdit.py +++ b/experimental/regression_suite/shark-test-suite-models/sd3/test_mmdit.py @@ -48,7 +48,7 @@ ) sd3_mmdit_mlir = fetch_source_fixture( - "https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sd3-mmdit/model.mlirbc", + "https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sd3-mmdit/model.mlir", group="sd3_mmdit", ) diff --git a/experimental/regression_suite/shark-test-suite-models/sd3/test_vae.py 
b/experimental/regression_suite/shark-test-suite-models/sd3/test_vae.py index 72ae9e28167e..47f5865d9f59 100644 --- a/experimental/regression_suite/shark-test-suite-models/sd3/test_vae.py +++ b/experimental/regression_suite/shark-test-suite-models/sd3/test_vae.py @@ -33,7 +33,7 @@ ) sd3_vae_mlir = fetch_source_fixture( - "https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sd3-vae/model.mlirbc", + "https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sd3-vae/model.mlir", group="sd3_vae", ) diff --git a/experimental/regression_suite/shark-test-suite-models/sdxl/test_clip.py b/experimental/regression_suite/shark-test-suite-models/sdxl/test_clip.py index 978595cd4c7d..04f24a4833ed 100644 --- a/experimental/regression_suite/shark-test-suite-models/sdxl/test_clip.py +++ b/experimental/regression_suite/shark-test-suite-models/sdxl/test_clip.py @@ -53,7 +53,7 @@ ) sdxl_clip_mlir = fetch_source_fixture( - "https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-prompt-encoder/model.mlirbc", + "https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-prompt-encoder/model.mlir", group="sdxl_clip", ) diff --git a/experimental/regression_suite/shark-test-suite-models/sdxl/test_unet.py b/experimental/regression_suite/shark-test-suite-models/sdxl/test_unet.py index 62113c39ea7c..d8824bc3c508 100644 --- a/experimental/regression_suite/shark-test-suite-models/sdxl/test_unet.py +++ b/experimental/regression_suite/shark-test-suite-models/sdxl/test_unet.py @@ -52,7 +52,7 @@ ) sdxl_unet_fp16_mlir = fetch_source_fixture( - "https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-scheduled-unet/model.mlirbc", + "https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-scheduled-unet/model.mlir", group="sdxl_unet_fp16", ) diff --git a/experimental/regression_suite/shark-test-suite-models/sdxl/test_vae.py b/experimental/regression_suite/shark-test-suite-models/sdxl/test_vae.py index 6eb8d903759b..bc674c0ee44b 100644 --- 
a/experimental/regression_suite/shark-test-suite-models/sdxl/test_vae.py +++ b/experimental/regression_suite/shark-test-suite-models/sdxl/test_vae.py @@ -33,7 +33,7 @@ ) sdxl_vae_mlir = fetch_source_fixture( - "https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-vae-decode/model.mlirbc", + "https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-vae-decode/model.mlir", group="sdxl_vae", ) diff --git a/third_party/llvm-project b/third_party/llvm-project index 42dc6d585fcb..13ae7e49984f 160000 --- a/third_party/llvm-project +++ b/third_party/llvm-project @@ -1 +1 @@ -Subproject commit 42dc6d585fcbdc366215133f23c77c244a7d9c81 +Subproject commit 13ae7e49984f863fdaa01833a3bd0ff20c71dc6e From d746a572f505f0697285a16d1edeb945a0ddfdd4 Mon Sep 17 00:00:00 2001 From: Kunwar Grover Date: Fri, 27 Dec 2024 16:26:48 +0000 Subject: [PATCH 63/64] Remove revert for https://github.com/llvm/llvm-project/pull/120115 (#19567) --- .../iree/compiler/Codegen/Common/CPU/CPUPrepareUkernels.cpp | 4 ++-- .../iree/compiler/Codegen/Common/DecomposePackUnPackOps.cpp | 2 +- .../src/iree/compiler/Codegen/Common/GPU/GPUTensorTile.cpp | 4 ++-- .../iree/compiler/Codegen/Common/GPU/GPUTileReduction.cpp | 3 ++- .../Codegen/LLVMCPU/LLVMCPU2DScalableTo1DScalable.cpp | 2 +- .../iree/compiler/Codegen/LLVMCPU/LLVMCPUSplitReduction.cpp | 5 +++-- compiler/src/iree/compiler/Codegen/LLVMCPU/LLVMCPUTile.cpp | 2 +- compiler/src/iree/compiler/Codegen/Transforms/Transforms.cpp | 2 +- third_party/llvm-project | 2 +- 9 files changed, 14 insertions(+), 12 deletions(-) diff --git a/compiler/src/iree/compiler/Codegen/Common/CPU/CPUPrepareUkernels.cpp b/compiler/src/iree/compiler/Codegen/Common/CPU/CPUPrepareUkernels.cpp index 0f8c38bebdd3..7f4d62b26bda 100644 --- a/compiler/src/iree/compiler/Codegen/Common/CPU/CPUPrepareUkernels.cpp +++ b/compiler/src/iree/compiler/Codegen/Common/CPU/CPUPrepareUkernels.cpp @@ -75,7 +75,7 @@ static void tileNonPackedDimsFor3DPackOps(RewriterBase &rewriter, FailureOr 
tilingResult = scf::tileUsingSCF(rewriter, tilingInterfaceOp, options); assert(succeeded(tilingResult)); - rewriter.replaceOp(packOp, tilingResult->replacements); + rewriter.replaceOp(packOp, tilingResult->mergeResult.replacements); }); } @@ -110,7 +110,7 @@ static void tileNonPackedDimsFor5DPUnpackOps(RewriterBase &rewriter, FailureOr tilingResult = scf::tileUsingSCF(rewriter, tilingInterfaceOp, options); assert(succeeded(tilingResult)); - rewriter.replaceOp(unpackOp, tilingResult->replacements); + rewriter.replaceOp(unpackOp, tilingResult->mergeResult.replacements); }); } diff --git a/compiler/src/iree/compiler/Codegen/Common/DecomposePackUnPackOps.cpp b/compiler/src/iree/compiler/Codegen/Common/DecomposePackUnPackOps.cpp index 763b04042e30..173f50b4eed2 100644 --- a/compiler/src/iree/compiler/Codegen/Common/DecomposePackUnPackOps.cpp +++ b/compiler/src/iree/compiler/Codegen/Common/DecomposePackUnPackOps.cpp @@ -200,7 +200,7 @@ static LogicalResult commonRunOnOperation( unpackTilingOptions); if (failed(tilingResult)) return WalkResult::interrupt(); - rewriter.replaceOp(op, tilingResult->replacements); + rewriter.replaceOp(op, tilingResult->mergeResult.replacements); return WalkResult::advance(); }); if (status.wasInterrupted()) { diff --git a/compiler/src/iree/compiler/Codegen/Common/GPU/GPUTensorTile.cpp b/compiler/src/iree/compiler/Codegen/Common/GPU/GPUTensorTile.cpp index a7f88042d3df..0487a898cdca 100644 --- a/compiler/src/iree/compiler/Codegen/Common/GPU/GPUTensorTile.cpp +++ b/compiler/src/iree/compiler/Codegen/Common/GPU/GPUTensorTile.cpp @@ -90,7 +90,7 @@ class TileConsumerAndFuseInputProducer final } // Replace the tiled op with replacements. 
-    rewriter.replaceOp(op, tilingResult->replacements);
+    rewriter.replaceOp(op, tilingResult->mergeResult.replacements);
     filter.replaceLinalgTransformationFilter(rewriter,
                                              tilingResult->tiledOps.front());
 
@@ -292,7 +292,7 @@ static LogicalResult tileParallelDims(mlir::FunctionOpInterface funcOp,
     if (failed(tilingResult)) {
       return tilingOp->emitOpError("failed to tile to scf.forall");
     }
-    rewriter.replaceOp(tilingOp, tilingResult->replacements);
+    rewriter.replaceOp(tilingOp, tilingResult->mergeResult.replacements);
   }
   return success();
 }
diff --git a/compiler/src/iree/compiler/Codegen/Common/GPU/GPUTileReduction.cpp b/compiler/src/iree/compiler/Codegen/Common/GPU/GPUTileReduction.cpp
index 28bed8f037f9..bad51d02d4c1 100644
--- a/compiler/src/iree/compiler/Codegen/Common/GPU/GPUTileReduction.cpp
+++ b/compiler/src/iree/compiler/Codegen/Common/GPU/GPUTileReduction.cpp
@@ -38,10 +38,11 @@ static LogicalResult tileReduction(linalg::LinalgOp op) {
     sizes.push_back(rewriter.getIndexAttr(size));
   }
   rewriter.setInsertionPoint(op);
-  FailureOr results = scf::tileReductionUsingScf(
+  FailureOr results = scf::tileReductionUsingScf(
       rewriter, cast(op.getOperation()), sizes);
   if (failed(results))
     return failure();
+  rewriter.replaceOp(op, results->mergeResult.replacements);
   return success();
 }
 
diff --git a/compiler/src/iree/compiler/Codegen/LLVMCPU/LLVMCPU2DScalableTo1DScalable.cpp b/compiler/src/iree/compiler/Codegen/LLVMCPU/LLVMCPU2DScalableTo1DScalable.cpp
index 917fe500513f..f8400fe2ea6d 100644
--- a/compiler/src/iree/compiler/Codegen/LLVMCPU/LLVMCPU2DScalableTo1DScalable.cpp
+++ b/compiler/src/iree/compiler/Codegen/LLVMCPU/LLVMCPU2DScalableTo1DScalable.cpp
@@ -161,7 +161,7 @@ dropScalabilityFromUnsupportedOperations(mlir::FunctionOpInterface funcOp,
       setLoweringConfig(newOp, newLoweringConfig);
     }
 
-    rewriter.replaceOp(tilingOp, tilingResult->replacements);
+    rewriter.replaceOp(tilingOp, tilingResult->mergeResult.replacements);
   }
   return success();
 }
diff --git a/compiler/src/iree/compiler/Codegen/LLVMCPU/LLVMCPUSplitReduction.cpp b/compiler/src/iree/compiler/Codegen/LLVMCPU/LLVMCPUSplitReduction.cpp
index 4ef391df6200..0627ae3ccbae 100644
--- a/compiler/src/iree/compiler/Codegen/LLVMCPU/LLVMCPUSplitReduction.cpp
+++ b/compiler/src/iree/compiler/Codegen/LLVMCPU/LLVMCPUSplitReduction.cpp
@@ -132,7 +132,7 @@ LogicalResult splitReductionImpl(Operation *op, int64_t size,
     LLVM_DEBUG(llvm::dbgs() << "failed on step 1 (SCFTiling)\n");
     return failure();
   }
-  rewriter.replaceOp(linalgOp, tileResFirst->replacements);
+  rewriter.replaceOp(linalgOp, tileResFirst->mergeResult.replacements);
 
   // 2) Apply splitReduction on the single vector-length array.
   // splitReduction already replaces the op.
@@ -159,7 +159,8 @@ LogicalResult splitReductionImpl(Operation *op, int64_t size,
     LLVM_DEBUG(llvm::dbgs() << "failed on step 3 (SCFTiling)\n");
     return failure();
   }
-  rewriter.replaceOp(splitRes->splitLinalgOp, tileRes->replacements);
+  rewriter.replaceOp(splitRes->splitLinalgOp,
+                     tileRes->mergeResult.replacements);
 
   return success();
 }
diff --git a/compiler/src/iree/compiler/Codegen/LLVMCPU/LLVMCPUTile.cpp b/compiler/src/iree/compiler/Codegen/LLVMCPU/LLVMCPUTile.cpp
index aba35c0222ce..a0529d6410e9 100644
--- a/compiler/src/iree/compiler/Codegen/LLVMCPU/LLVMCPUTile.cpp
+++ b/compiler/src/iree/compiler/Codegen/LLVMCPU/LLVMCPUTile.cpp
@@ -101,7 +101,7 @@ void LLVMCPUTilePass::runOnOperation() {
         scf::tileUsingSCF(rewriter, op, options);
     if (failed(tiledResults))
       continue;
-    rewriter.replaceOp(op, tiledResults->replacements);
+    rewriter.replaceOp(op, tiledResults->mergeResult.replacements);
   }
 
   RewritePatternSet patterns =
diff --git a/compiler/src/iree/compiler/Codegen/Transforms/Transforms.cpp b/compiler/src/iree/compiler/Codegen/Transforms/Transforms.cpp
index 980925ccb11a..3b6aa5fa4beb 100644
--- a/compiler/src/iree/compiler/Codegen/Transforms/Transforms.cpp
+++ b/compiler/src/iree/compiler/Codegen/Transforms/Transforms.cpp
@@ -1256,7 +1256,7 @@ LogicalResult tileLinalgOpsWithFilter(mlir::FunctionOpInterface funcOp,
     for (auto tiledOp : tiledResults->tiledOps) {
       filter.replaceLinalgTransformationFilter(rewriter, tiledOp);
     }
-    rewriter.replaceOp(op, tiledResults->replacements);
+    rewriter.replaceOp(op, tiledResults->mergeResult.replacements);
   }
 
   return success();
diff --git a/third_party/llvm-project b/third_party/llvm-project
index 13ae7e49984f..82c37276df5b 160000
--- a/third_party/llvm-project
+++ b/third_party/llvm-project
@@ -1 +1 @@
-Subproject commit 13ae7e49984f863fdaa01833a3bd0ff20c71dc6e
+Subproject commit 82c37276df5bb5f6351079fa4df8d609abd908b7

From a43d893b4d1946fabd7a6c7eb74c63ba7d42cdd5 Mon Sep 17 00:00:00 2001
From: Ian Wood
Date: Fri, 27 Dec 2024 12:08:55 -0800
Subject: [PATCH 64/64] [Dispatch] Disable scatter fusion with producers (#19565)

Backends don't currently support scatter fusion and will silently compile
incorrect code. This should be turned off in order to prevent backends from
generating incorrect results. I don't think any users are running into this
currently, but it's best to keep it off for now.

Similar to https://github.com/iree-org/iree/pull/19535 but for both
`indices` and `updates`.

---------

Signed-off-by: Ian Wood
---
 .../iree/compiler/DispatchCreation/FormDispatchRegions.cpp | 3 ++-
 .../DispatchCreation/test/dispatch_linalg_ext_fusion.mlir | 4 ++--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/compiler/src/iree/compiler/DispatchCreation/FormDispatchRegions.cpp b/compiler/src/iree/compiler/DispatchCreation/FormDispatchRegions.cpp
index 73d306bd7f6e..7cac15a828df 100644
--- a/compiler/src/iree/compiler/DispatchCreation/FormDispatchRegions.cpp
+++ b/compiler/src/iree/compiler/DispatchCreation/FormDispatchRegions.cpp
@@ -651,7 +651,8 @@ isFusableWithProducer(OpOperand &operand,
   }
 
   // Don't fuse attention with it's producer
-  if (isa(consumer)) {
+  // TODO: Enable scatter fusion when supported by backends.
+  if (isa(consumer)) {
     return false;
   }
 
diff --git a/compiler/src/iree/compiler/DispatchCreation/test/dispatch_linalg_ext_fusion.mlir b/compiler/src/iree/compiler/DispatchCreation/test/dispatch_linalg_ext_fusion.mlir
index e1bc91eafa80..1575cf7a46b4 100644
--- a/compiler/src/iree/compiler/DispatchCreation/test/dispatch_linalg_ext_fusion.mlir
+++ b/compiler/src/iree/compiler/DispatchCreation/test/dispatch_linalg_ext_fusion.mlir
@@ -37,9 +37,9 @@ util.func public @linalgext_scatter_dispatch() -> tensor<8192x16x8x128xf32> {
 }
 
 // CHECK-LABEL: util.func public @linalgext_scatter_dispatch
+// CHECK-DAG: %[[INDICES:.+]] = flow.dispatch.region
+// CHECK-DAG: %[[UPDATE:.+]] = flow.dispatch.region
 // CHECK: %[[RESULT:.+]] = flow.dispatch.region
-// CHECK: %[[INDICES:.+]] = linalg.generic
-// CHECK: %[[UPDATE:.+]] = linalg.generic
 // CHECK: %[[SCATTER_RESULT:.+]] = iree_linalg_ext.scatter
 // CHECK-SAME: ins(%[[UPDATE]], %[[INDICES]] : tensor<4x1x16x8x128xf32>, tensor<4x1xi32>)
 // CHECK: flow.return %[[SCATTER_RESULT]]