Migrating regression suite to iree-test-suite #20124

Closed
wants to merge 41 commits

Conversation

@geomin12 commented Feb 27, 2025

Moving the regression suite over to iree-test-suite and calling those regression tests via a composite action.

Comment on lines -130 to -139
--goldentime-rocm-e2e-ms 1100.0 \
--goldentime-rocm-unet-ms 255.0 \
--goldentime-rocm-clip-ms 14.5 \
--goldentime-rocm-vae-ms 310.0 \
--goldendispatch-rocm-unet 1236 \
--goldendispatch-rocm-clip 967 \
--goldendispatch-rocm-vae 208 \
--goldensize-rocm-unet-bytes 2280000 \
--goldensize-rocm-clip-bytes 860000 \
--goldensize-rocm-vae-bytes 840000 \
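For context, these `--golden*` values are consumed as pytest options by the regression suite. A minimal sketch of how options like these are typically wired up in a `conftest.py`; the flag names come from the lines above, everything else (defaults, fixture name) is hypothetical and not the actual suite code:

```python
# conftest.py -- hypothetical sketch, not the actual regression suite code.
import pytest


def pytest_addoption(parser):
    # Golden values passed in from the CI workflow, as in the flags above.
    parser.addoption("--goldentime-rocm-unet-ms", type=float, default=None,
                     help="Expected UNet latency in milliseconds.")
    parser.addoption("--goldendispatch-rocm-unet", type=int, default=None,
                     help="Expected UNet dispatch count.")


@pytest.fixture
def goldentime_rocm_unet_ms(request):
    # Tests compare the measured latency against this value (with some tolerance).
    return request.config.getoption("--goldentime-rocm-unet-ms")
```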
Member

Let's make sure these values can still be tracked in the IREE repo before migrating, or switch to some other mechanism for watching for benchmark metrics regressions, like comparing the observed values against a baseline stored in a remote database.

See also the discussion from this morning on Discord about updating benchmark limits.

The workflow we want is:

  1. Developer changes the compiler code in a pull request
  2. The CI runs tests and checks the metrics
  3. The metrics are compared against a baseline (either sourced from the repository itself or some external source)
  4. If the metrics change significantly, the developer can choose to accept those changes or adjust their code. Accepting the changes can mean updating a line of source code like here, or it can mean merging the change and letting the database pick up the new results automatically, or something else.

What we want to avoid is having unactionable errors. If the expected results are stored in another repository and test results not matching those expectations result in failed workflow runs, the developer is stuck.

Here's an example of what we had last year running on GCP infra: #17765 (comment). Benchmarks would run and report stats. If some stats were outside of some ranges, the top N out of all stats would be summarized. As long as all models compiled and ran, the workflow runs would succeed regardless of those benchmark results. After a commit from a merged pull request landed on main, the database would be updated to track the new benchmark results, and they would be visible on a website.
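A minimal sketch of that style of non-fatal reporting, assuming hypothetical `observed` and `baseline` dicts of metric values; this is not the actual GCP-era implementation:

```python
# Hypothetical sketch of non-fatal benchmark regression reporting.
def summarize_regressions(observed, baseline, tolerance=0.05, top_n=10):
    """Return the top_n metrics whose relative change exceeds the tolerance."""
    deltas = []
    for name, value in observed.items():
        expected = baseline.get(name)
        if not expected:
            continue  # No baseline yet (or zero); nothing to compare against.
        change = (value - expected) / expected
        if abs(change) > tolerance:
            deltas.append((name, expected, value, change))
    deltas.sort(key=lambda d: abs(d[3]), reverse=True)
    return deltas[:top_n]


# The workflow prints this summary but still exits 0 as long as the models
# compiled and ran; the baseline store is updated after the change lands on main.
print(summarize_regressions({"unet_ms": 270.0}, {"unet_ms": 255.0}))
```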

Today we post job summaries with the benchmark results, but they are buried, for example: https://github.com/iree-org/iree/actions/runs/13593525275#summary-38005698303. We could elevate those to pull request comments, especially if the results are outside of expected ranges. Only some parts of the project can affect the current benchmarks though, so I'd be wary of generating too much noise with automated PR comments. See, for example, the MiGraphX repo, which runs benchmarks and posts comments on every PR, even those that only update documentation, like ROCm/AMDMIGraphX#3832 (comment).

Author


Yeah, that makes sense! The stats are still tracked in iree-test-suite just as they were before this line, but it would be helpful to display that information in the PR comments.

In terms of storing the data in a database, I was going to hold off on that until the dashboard work is productionized, so we can get a UI along with the data.
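For illustration, surfacing those stats in a more visible place could be as simple as appending a Markdown table to the GitHub Actions job summary (or reusing the same text as the body of a PR comment). A minimal sketch, assuming a list of (name, observed, golden) tuples; the real iree-test-suite reporting may differ:

```python
# Hypothetical sketch: append a benchmark table to the GitHub Actions job summary.
import os


def post_job_summary(results):
    """results: iterable of (metric_name, observed, golden) tuples."""
    summary_path = os.environ.get("GITHUB_STEP_SUMMARY")
    if not summary_path:
        return  # Not running under GitHub Actions.
    lines = ["| Metric | Observed | Golden |", "| --- | --- | --- |"]
    lines += [f"| {name} | {observed} | {golden} |" for name, observed, golden in results]
    with open(summary_path, "a") as f:
        f.write("\n".join(lines) + "\n")


post_job_summary([("unet_ms", 270.0, 255.0), ("unet_dispatch_count", 1236, 1236)])
```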

rocm-chip: gfx942
backend: rocm
sku: mi300
runs-on: nodai-amdgpu-mi300-x86-64
Member


FYI we may want to migrate workflows from these `nodai-amdgpu-mi300-x86-64` runners to a new cluster of runners with the `linux-mi300-gpu-1` label: nod-ai/shark-ai#793

That should be separate from this PR though.

yzhang93 and others added 17 commits March 5, 2025 13:02
There are some attribute changes upstream:

- TOSA: llvm/llvm-project#128115
- NVPTX: llvm/llvm-project#127736

---------

Signed-off-by: yzhang93 <zhyuhang88@gmail.com>
Signed-off-by: geomin12 <geomin12@amd.com>
The convention is [compiler target, runtime HAL driver/device].

This was reported [here on
Discord](https://discord.com/channels/689900678990135345/1166024193599615006/1341505000303759401).

ci-exactly: build_packages, test_sharktank
Signed-off-by: geomin12 <geomin12@amd.com>
Renames follow these upstream changes:

- llvm/llvm-project#128751
- llvm/llvm-project#128869

---------

Signed-off-by: yzhang93 <zhyuhang88@gmail.com>
Signed-off-by: geomin12 <geomin12@amd.com>
…ed. (iree-org#20099)

The revision teaches the SpecializeEncoding pass about the difference
between dropping encodings and failing to resolve encodings. It
introduces an `IdentityEncodingAttr` resolver that always produces
identity encodings. The implementation uses
`iree_encoding.pad_encoding_layout` with a zero-element padding config.

If an encoding resolver (e.g., UnsupportedEncodingAttr) cannot get the
resolved layout, the pass signals a failure.

With the change, the HAL default encoding resolver is switched from
UnsupportedEncodingAttr to IdentityEncodingAttr.

---------

Signed-off-by: hanhanW <hanhan0912@gmail.com>
Signed-off-by: geomin12 <geomin12@amd.com>
Add a verifier for tile sizes.

---------

Signed-off-by: erman-gurses <erman@nod-labs.com>
Signed-off-by: geomin12 <geomin12@amd.com>
…ree-org#20114)

With this change, I am able to now follow our ROCm instructions at
https://iree.dev/guides/deployment-configurations/gpu-rocm/ on Windows,
running MobileNet and getting the expected outputs 🎉

This does two things:

1. Only try to enable peering between devices if more than 1 device is
going to be used
2. Check `hipDeviceCanAccessPeer` before calling
`hipDeviceEnablePeerAccess` so there is a chance to log a better error
message

### More information

On Windows 11 with 2x w7900 and the 6.2 HIP SDK, the previous code
returned `hipErrorInvalidDevice` when calling
`hipDeviceEnablePeerAccess`. On the same system under Linux (unknown
HIP/ROCm version but pretty similar), the code works without errors. In
both cases, I only requested a single device, so this peering code did
not need to run at all. See also [this discussion on
Discord](https://discord.com/channels/689900678990135345/1282818085153407038/1344105197353570377).

See documentation for these APIs at
https://rocm.docs.amd.com/projects/HIP/en/latest/doxygen/html/group___peer_to_peer.html.

### Testing

> [!WARNING]
> I've only tested on the machines I have easy access to. I think we
still need CI coverage for multi-device (e.g.
iree-org#19995)

Signed-off-by: geomin12 <geomin12@amd.com>
…ree-org#20126)

The revision simplifies the encoding content to `encoded` because
encodings can have complicated parameters. E.g., they can have arbitrary
dictionary attributes, and naming the variable with such information
makes readability worse.

Fixes iree-org#20090

Signed-off-by: hanhanW <hanhan0912@gmail.com>
Signed-off-by: geomin12 <geomin12@amd.com>
Make sure `rdna4@vulkan` is recognized as a Vulkan target. Handle wmma
shapes not supported by the cooperative matrix extension.

Signed-off-by: geomin12 <geomin12@amd.com>
…ree-org#20120)

This PR flips the switch for the `allow-zero-slice` option, which means
a zero slice will be tolerated for IGEMM that requires padding. The
dynamic if-condition check is unnecessary because we compute the K loop
bound in a reasonable fashion, so the zero-slice situation shouldn't happen.
The motivation for the PR is the performance benefit of removing an
unnecessary conditional from padded GEMM global loading.

Signed-off-by: jerryyin <zhuoryin@amd.com>
Signed-off-by: geomin12 <geomin12@amd.com>
Signed-off-by: Jakub Kuderski <jakub@nod-labs.com>
Signed-off-by: geomin12 <geomin12@amd.com>
When running the ElideTimepoints pass on a large program (llama 8b for
example) we exhaust our state, and fail to elide timepoints.

This significantly reduces the number of timepoints that end up in the
waiting set in those cases.

The number of timepoint waits in the stream output is as follows:
```
23765 << State Size 256
8801 << State Size 8192
```

Signed-off-by: Andrew Woloszyn <awoloszy@amd.com>
Signed-off-by: geomin12 <geomin12@amd.com>
No local patch is carried.

---------

Signed-off-by: yzhang93 <zhyuhang88@gmail.com>
Signed-off-by: geomin12 <geomin12@amd.com>
…ree-org#20125)

Small constants can be inlined into dispatches, and they can become
the source of `set_encoding` ops. The revision adds support
for hoisting both operations out if the source is a constant.

---------

Signed-off-by: hanhanW <hanhan0912@gmail.com>
Signed-off-by: geomin12 <geomin12@amd.com>
…ups (iree-org#20123)

Adds cleanup patterns to run during
`tileConsumerAndFuseProducersUsingSCF` in
TileAndDistributeToWorkgroupsUsingForall. The PR adds the
`SwapExtractWithExpandPattern` and some extra cleanup patterns for
slices.

This enables codegen for set_encoding ops on GPU, since the
materialization of set_encoding contains an expand_shape that will block
distribution tiling of the full dispatch.

---------

Signed-off-by: Max Dawkins <max.dawkins@gmail.com>
Signed-off-by: geomin12 <geomin12@amd.com>
…iree-org#20133)

The previous test did not catch the case because it did not have an
affinity. The revision adds the affinity to the test, which ensures the
case is actually exercised.

Signed-off-by: hanhanW <hanhan0912@gmail.com>
Signed-off-by: geomin12 <geomin12@amd.com>
…ree-org#20131)

Based on the announcement

https://www.amd.com/en/newsroom/press-releases/2025-2-28-amd-unveils-next-generation-amd-rdna-4-architectu.html.

Also sort the rocm and vulkan lists alphabetically, since any other
order is hard to maintain by hand.

Signed-off-by: Jakub Kuderski <jakub@nod-labs.com>
Signed-off-by: geomin12 <geomin12@amd.com>
sogartar and others added 21 commits March 5, 2025 13:02
…ree-org#20142)

When running the build action in the same thread we must still do a
start-finish to get an initialized completed future.

Signed-off-by: Boian Petkantchin <boian.petkantchin@amd.com>
Signed-off-by: geomin12 <geomin12@amd.com>
This reflects improvements in the compiler for these models.

Signed-off-by: MaheshRavishankar <mravisha@amd.com>
Signed-off-by: geomin12 <geomin12@amd.com>
The type that IREE's code referred to as f8E4M3, which is an 8-bit float
that has no infinities, 4 bits of exponent, and 3 bits of mantissa, is
called f8E4M3FN in MLIR and LLVM's APFloat. MLIR has a separate f8E4M3
type that has different semantics. The fact that IREE would sometimes be
using a different MLIR type from what we meant was causing crashes on
RDNA4.

Therefore, rename IREE's type to match.

Signed-off-by: geomin12 <geomin12@amd.com>
Bumps the github-actions group with 3 updates:
[actions/download-artifact](https://github.com/actions/download-artifact),
[actions/cache](https://github.com/actions/cache) and
[dawidd6/action-download-artifact](https://github.com/dawidd6/action-download-artifact).

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: geomin12 <geomin12@amd.com>
…op. (iree-org#20148)

The encoding content could be vague because the attribute could be
descriptive. However, we realized that some encodings with different
content can result in the same encoding layout. E.g., a matmul
encoding can encode indexing maps for the B/M/N/K dimensions, while an
encoding attribute may only care about the `K` dimension. One example
is `IREE::GPU::GPUPadLayoutAttr`.

The revision introduces encoding compatibility in
SerializableEncodingAttrInterface and relaxes the verifier of the
`stream.tensor.clone` op. If either encoding is not serialized, they are
compatible. If both encodings are serialized, it checks whether they are
identical.

Note that it defers the real check until after encoding specialization. If
they end up with different layouts, a failure is detected.

Signed-off-by: hanhanW <hanhan0912@gmail.com>
Signed-off-by: geomin12 <geomin12@amd.com>
…#20117)

I removed the duplicated code in the implementations of
`CPUDeviceEncodingLayoutResolverAttrInterface`,
`VMVXDeviceEncodingLayoutResolverAttrInterface`, and
`GPUDeviceEncodingLayoutResolverAttrInterface`. I added a base class,
`WrappedExternalModel`, that provides `getConfiguration` and
`getEncodingInfo`, so the derived classes only need to implement their
own `getEncodingInfoImpl` method.

Fixes iree-org#20050

---------

Signed-off-by: Jinjie Liu <jinjie.liu@usc.edu>
Signed-off-by: geomin12 <geomin12@amd.com>
Drops a test in LLVMGPU/test/rocdl_pipeline_test.mlir according to
changes in iree-org#20152

---------

Signed-off-by: Max Dawkins <max.dawkins@gmail.com>
Signed-off-by: geomin12 <geomin12@amd.com>
…iree-org#20152)

With RDNA4, we now have cards that allow the OCP FP8 formats (f8E5M2 and
f8E4M3FN). This means that the previous check that caused any module
that used those formats to be rejected needs to be relaxed.

However, the backend uses the same intrinsics for the gfx942-only FNUZ
types and the OCP types. Therefore, the data type validity check needs
to be aware of the chipset being targeted in order to validate that the
dispatch can be compiled correctly. This patch implements this improved
set of checks.

In addition, it adds tests for these checks (sadly, we can't use
expected-error here because the pipeline strips debug info) and improves
these checks to also look inside vector<> and memref<>.

(This also means that there is now an IREE-side mechanism to reject
compiling FP8 code pre-gfx942.)

Signed-off-by: geomin12 <geomin12@amd.com>
…rg#20161)

Fixes the llama fp8 perf regression introduced by
iree-org#20106. That PR stopped the
linalg.generic from getting hoisted, which caused a broadcast to get
fused and a `tensor<1x1x131072x131072xi1>` to be recomputed on each
prefill call.

---------

Signed-off-by: Ian Wood <ianwood2024@u.northwestern.edu>
Signed-off-by: geomin12 <geomin12@amd.com>
…g#20143)

This PR restricts the fusion of set_encoding ops to only dispatches with
element-wise linalg ops. We don't have good codegen for fusions with
reduction ops on GPU yet, so this will allow using the late
materialization data tiling path on GPU while the codegen improves.

---------

Signed-off-by: Max Dawkins <max.dawkins@gmail.com>
Signed-off-by: geomin12 <geomin12@amd.com>
@geomin12 (Author) commented Mar 5, 2025

Apologies everyone, this PR has been open for quite some time. I will get those fixes in, but I'm closing it for now to prevent notifications and avoid bothering everyone!

@geomin12 closed this Mar 5, 2025
@ScottTodd (Member)

Aw, the review history would have been nice to keep here. Looks like some git rebase issues got the branch into a bad state? Feel free to ask for help with git and pull request workflows.

@geomin12 (Author) commented Mar 5, 2025

> Aw, the review history would have been nice to keep here. Looks like some git rebase issues got the branch into a bad state? Feel free to ask for help with git and pull request workflows.

I am fixing it! I just feel bad about the notifications to folks.
