enforce irpa extension on save #11

Open · wants to merge 34 commits into main from forcibly_add_irpa_extension

Conversation

dan-garvey (Contributor):

Avoids user error such as calling `save_module_parameters("my_mod.safetensors", myModule)`, which would otherwise write a parameter archive without the `.irpa` extension.
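A minimal sketch of the enforcement idea, assuming a standalone helper (the helper name is hypothetical; the actual change lives in the archive-save path quoted in the review thread below):

```python
# Hypothetical helper illustrating the behavior this PR enforces: any path
# that does not already end in ".irpa" gets the extension appended on save.
def _enforce_irpa_extension(file_path) -> str:
    str_file_path = str(file_path)
    if not str_file_path.endswith(".irpa"):
        str_file_path += ".irpa"
    return str_file_path

assert _enforce_irpa_extension("my_mod.safetensors") == "my_mod.safetensors.irpa"
assert _enforce_irpa_extension("my_mod.irpa") == "my_mod.irpa"
```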

@dan-garvey dan-garvey force-pushed the forcibly_add_irpa_extension branch 3 times, most recently from ac89160 to f1beea9 Compare May 2, 2024 22:46
harsh-nod and others added 20 commits September 23, 2024 20:13
This PR modifies the insertion point for iter args to ensure that the
iter args are in the same order as the init args and outputs. This
simplifies the mapping between init args, iter args and outputs.

Signed-off-by: Harsh Menon <harsh@nod-labs.com>
Fixes #85

PR based on the work of @maxbartel 

Requires changes in torch-mlir: llvm/torch-mlir#3688

Adds the mutable modifier to a global buffer and lifts said buffer to a
global if there is a store-producer node associated with it.

Signed-off-by: Christopher McGirr <mcgirr@roofline.ai>
Co-authored-by: Maximilian Bartel <bartel@roofline.ai>
…#162)

This PR introduces changes to handle elementwise or general arithmetic
operations that come after a tiled loop-reduction ("Reduction") operation.

The main problem with the current stack is that the indexing_dims
information for Reduction relies on the user. This works if its
user/consumer is tkw.write, but in other cases, such as BinaryPyOp or
UnaryPyOp, it lacks that information.

To make matters worse, BinaryPyOp/UnaryPyOp depends on its src/producer
for indexing dims, while the Reduction op depends on its dst/consumer for
its indexing dim information. This ended up causing an infinite loop
between UnaryPyOp/BinaryPyOp <-> Reduction.

This PR fixes the indexing dimension logic for Reduction and GetResult
(required for expanded Reduction) to be based on its reduction axis (for
Reduction) and its source/consumer information.
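A heavily simplified, hypothetical sketch of why anchoring Reduction's dims on its own recorded axis breaks the query cycle (toy classes, not the real tkw ops):

```python
# Toy node model: Reduction now answers the dims query from its producers'
# dims minus its own recorded axis instead of asking its consumer, so the
# Binary/UnaryPyOp <-> Reduction query cycle disappears.
class Node:
    def __init__(self, kind, producers=(), axis=None, dims=None):
        self.kind, self.producers = kind, list(producers)
        self.axis, self.dims = axis, dims

def indexing_dims(node):
    if node.dims is not None:              # e.g. a read/write with known dims
        return set(node.dims)
    if node.kind == "Reduction":
        producer_dims = set()
        for p in node.producers:
            producer_dims |= indexing_dims(p)
        return producer_dims - {node.axis}  # drop the reduction axis
    # elementwise (Unary/BinaryPyOp): inherit from producers only
    return set().union(*(indexing_dims(p) for p in node.producers))

read = Node("read", dims={"M", "K"})
red = Node("Reduction", producers=[read], axis="K")
add = Node("BinaryPyOp", producers=[red])
assert indexing_dims(add) == {"M"}
```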

---------

Signed-off-by: Stanley Winata <stanley.winata@amd.com>
This PR removes the need for propagating indices using
post expansion. The new approach propagates the MMA
indices to the MMA dimensions of all tensors (rather
than just MMA nodes) and then specializes them depending
on whether they lie within the backward slices of the
LHS and RHS or forward slices of the ACC.

---------

Signed-off-by: Harsh Menon <harsh@nod-labs.com>
This PR adds more documentation about tkw. Specifically, it provides a
first draft of the introduction and adds a section on memory access
patterns.

Signed-off-by: Harsh Menon <harsh@nod-labs.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
The main motivation behind this PR is to enable multiple induction
variables/iterArgs on the same tiled "Reduction" loop. To enable the
above, we did a couple of things:

1. Enable lowering/expansion on `operator.getitem` (the op that extracts
multiple results in Python, i.e. `res0, res1 = fn`) by templating it on
`GetResult(CustomOp)`, since they have the same args and interface and
can reuse most of the indexing/expansion helpers.

2. Introduce `res_idx`, a variable to represent which result index of an
op we are referring to during expansion and context mapping. This is
useful for ops that have more than one result/variable as outputs.

3. Fix a bug in expand_reduction by hoisting the iteration and expansion
of `reduction.init_args` out of the loop that iterates and expands over
the `yield`/`return_val` of the reduction loop. The size of `init_args`
is expected to match the size of `yield`/`return_val`, so with N
iter_args/yields we ended up expanding the `init_args` N x N times
instead of N times. We hadn't seen this before because we had only been
playing with 1 init_arg/iterArg, and 1x1 == 1.

4. Introduce a canonicalization pattern to fold chains of GetResult.
By semantics/design, GetResult is only expected to extract and have one
result, so a chain of GetResult ops should collapse into a single
GetResult. This helps clean up the IR.

Item 4 also helps circumvent an issue where Reduction and GetResult are
expanded completely on their own, without following the per-dimension
DFS structure of the rest of the expansion code. This becomes especially
problematic with multiple IterArgs, since Getitem does not expect its
source value to be expanded without it.
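A toy sketch of the item-4 folding rule (illustrative classes only, not the real tkw CustomOp hierarchy):

```python
# GetResult always yields exactly one value, so extracting a result from a
# GetResult is a no-op; a chain collapses to its innermost GetResult.
class GetResult:
    def __init__(self, source, res_idx):
        self.source, self.res_idx = source, res_idx

def fold_get_result_chain(node: "GetResult") -> "GetResult":
    while isinstance(node.source, GetResult):
        node = node.source
    return node

multi_result_op = object()
inner = GetResult(multi_result_op, res_idx=1)
chained = GetResult(GetResult(inner, 0), 0)
assert fold_get_result_chain(chained) is inner
```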

---------

Signed-off-by: Stanley Winata <stanley.winata@amd.com>
Signed-off-by: Boian Petkantchin <boian.petkantchin@amd.com>
Instead of generating individual element comparisons and doing
`vector.insertelement`, generate the whole mask using vector ops.

Add support for vector codegen when generating MLIR IR from sympy
expressions. Add a method `IndexingContext.iota` to generate special
symbols which map to `(1, 2, ..., n-1)` vector expressions. `gen_sympy_index`
will start to generate vector ops when encountering such symbols,
inserting proper `splat`s around scalar values when necessary.
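A conceptual sketch of the mask construction, with NumPy standing in for the generated vector MLIR ops (the real lowering emits `vector` dialect code):

```python
import numpy as np

# One vectorized compare against an iota vector replaces n scalar
# comparisons plus n vector.insertelement ops.
def make_mask(base: int, bound: int, n: int) -> np.ndarray:
    iota = np.arange(n)   # plays the role of the IndexingContext.iota symbols
    return (base + iota) < bound

print(make_mask(base=6, bound=8, n=4))  # [ True  True False False]
```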

---------

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
* Adds an option to `aot.export(import_symbolic_shape_expressions=True)`
to enable emission of torch-mlir symbolic shape constraints (a usage
sketch follows the list below). This is currently set to False until
IREE is ready to ingest these by default.

Rough sequence of work in IREE proper:

* Custom lowering of `torch.symbolic_int` and
`torch.bind_symbolic_shape` ops to IREE util "assume" ops. Note that we
are only planning to lower "terminal" bindings (basically function
arguments and a couple of other such categories).
* Canonicalizations to ensure that assume equalities are == 0 (versus
the native form from torch, where they assume a non-zero equality).
* Fusion will clone corresponding bindings on dependent dims into
dispatch regions.
* Existing linalg shape analysis extended and queryable by codegen.
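A minimal usage sketch, assuming a toy module with a dynamic batch dimension; apart from `import_symbolic_shape_expressions`, the `aot.export` arguments shown are ordinary torch.export-style kwargs and may differ in detail:

```python
import torch
from iree.turbine import aot  # package path after the shark-turbine rename below

class Scale(torch.nn.Module):
    def forward(self, x):
        return x * 2.0

batch = torch.export.Dim("batch")
exported = aot.export(
    Scale(),
    args=(torch.randn(4, 16),),
    dynamic_shapes={"x": {0: batch}},
    import_symbolic_shape_expressions=True,  # the new option from this PR
)
exported.print_readable()  # IR now carries symbolic shape binding ops
```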

---------

Signed-off-by: Stella Laurenzo <stellaraccident@gmail.com>
This PR adds code to construct the epilogue, kernel
and prologue once we have computed a schedule. We
simulate rotating registers in software and add
visualization tools to show the pipelined graphs.

---------

Signed-off-by: Harsh Menon <harsh@nod-labs.com>
This PR adds support for dynamic dimensions in the
kernels. The user specifies the dynamic dimensions
by
- Not adding them to the hyperparameter dictionary
- Explicitly specifying them with the dynamic_symbols kwarg
  and the dynamic_symbols_mapping kwarg to specify which
  values to use for the dynamic dims at runtime

This PR does not modify the codegen, so incorrect or unsupported values
for the dynamic dims will result in incorrect results
(garbage in -> garbage out).
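A hedged sketch of what specifying a dynamic dim looks like under these rules (the symbol names are made up; the kwargs are the ones named above, but the exact kernel entry point may differ):

```python
import iree.turbine.kernel.lang as tkl

M = tkl.sym.M   # static: supplied in the hyperparameter dict
B = tkl.sym.B   # dynamic: deliberately omitted from the dict

hyperparams = {M: 128}              # B is absent, so it is treated as dynamic
dynamic_symbols = [B]               # passed via the dynamic_symbols kwarg
dynamic_symbols_mapping = {B: 32}   # runtime value supplied at launch
```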

---------

Signed-off-by: Harsh Menon <harsh@nod-labs.com>
* Rework how we lower `rational` sympy expressions: instead of delayed
materialization via lambdas, introduce a `_Rational` type and propagate
`numerator`/`denominator` values independently. Division is only
materialized on an explicit `sympy.floor`/`sympy.ceiling` op (see the
sketch after this list).
* Rework how igemm test cases are generated and introduce a few real
shapes.
* Use custom pytest markers to separate perf/non-perf tests
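An illustrative-only sketch of the `_Rational` idea (the real type lives in the sympy-to-MLIR emitter and works on MLIR values, not Python ints):

```python
from dataclasses import dataclass

@dataclass
class _Rational:
    numerator: int
    denominator: int

    def __mul__(self, other: "_Rational") -> "_Rational":
        # Propagate numerator and denominator independently; no division yet.
        return _Rational(self.numerator * other.numerator,
                         self.denominator * other.denominator)

    def floor(self) -> int:
        # Division is only materialized at an explicit floor/ceiling.
        return self.numerator // self.denominator

assert (_Rational(3, 4) * _Rational(2, 3)).floor() == 0  # floor(6/12) == 0
```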

---------

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Signed-off-by: erman-gurses <erman@nod-labs.com>
The motivation of this pass is to generalize the register analysis pass,
which is used to determine the thread shape of TKW.Register, to all
other operations.

One main use case is to allow reduction, and later on "broadcast", to
use thread shape information from the kernel as opposed to relying on
vector_shape, which may not always be valid.

We generalize the register analysis method by finding a few anchor ops
whose thread shape information is determined, and then propagating it to
their successors and ancestors.
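A simplified, hypothetical sketch of that anchor-based propagation (toy graph model, not the real tkw slice utilities):

```python
from collections import deque

# Flood-fill from anchors in both graph directions, refusing to cross other
# anchors (the role played by control_fn in the real slices).
def propagate_thread_shapes(anchors, neighbors):
    """anchors: {node: thread_shape}; neighbors: node -> adjacent nodes."""
    shapes = dict(anchors)
    queue = deque(anchors)
    while queue:
        node = queue.popleft()
        for nxt in neighbors(node):
            if nxt in shapes:        # already an anchor or already visited
                continue
            shapes[nxt] = shapes[node]
            queue.append(nxt)
    return shapes
```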

In addition to that, we also implemented a couple of helper
functions/attributes:

1. A control_fn on BFS, ForwardSlice, and BackwardSlice. This makes it
easier to control/stop the search when we hit ops we do not want to
explore; in this case, we do not want to explore/propagate onto other
anchor ops and their children.

2. Introduce parent_op on IterArg and the region of Reduction, for
developer ergonomics.

3. Move the handling of IterArg and GetUser in BackwardSlice/BFS's
get_input exploration phase so that each is handled individually, as
opposed to being handled when its consumer is explored. Previously, to
explore/propagate IterArg/GetUser we needed to explore its consumer;
exploring IterArg/GetUser alone would not be handled correctly. This is
useful for the case where we want to propagate/explore mma.acc (usually
an IterArg) directly.

---------

Signed-off-by: Stanley Winata <stanley.winata@amd.com>
We would like this to be controlled with a flag.

Signed-off-by: Harsh Menon <harsh@nod-labs.com>
Our tests are flaky; `fail-fast: false` keeps one failing build from
aborting the others.

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Initial version of IGEMM benchmarking.

* If the `--runperf` pytest option is set, generate IREE ref code and run
both TKW and the ref code with `run_bench=True`
* Add `--dump-perf-files-path` option to save perf info files into
provided directory (filenames based on test name and params)
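A conftest.py-style sketch of how such options are typically wired up with the standard pytest hook (the option names come from the bullets above; the real fixture plumbing in the repo may differ):

```python
def pytest_addoption(parser):
    parser.addoption(
        "--runperf", action="store_true",
        help="also generate IREE ref code and run with run_bench=True")
    parser.addoption(
        "--dump-perf-files-path", default=None,
        help="directory for perf info files (named by test name and params)")
```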

---------

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
* Add `arith.andi`, `arith.cmpi`, `vector.maskedload`, `vector.gather`,
`vector.constant_mask`, `vector.insertelement`, `vector.splat`; support
non-splatted constants.
* Add `interpret_ndrange` helper

---------

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
@dan-garvey dan-garvey enabled auto-merge (squash) October 4, 2024 18:06
raikonenfnu and others added 3 commits October 4, 2024 11:16
The motivation of this PR is to be able to codegen/lower broadcast
properly. With that in mind, we implemented these things:

1. A BroadcastOp class, op, and lowering, to represent and store
broadcasting information, mostly so that we can query target shape
information and the source operand of the broadcast.
2. Treat broadcast-add as an index conflict and handle it by emitting a
BroadcastOp.

---------

Signed-off-by: Stanley Winata <stanley.winata@amd.com>
This PR adds a flag to dump intermediates, including .ll and .s files,
to see what instructions were generated.

---------

Signed-off-by: Harsh Menon <harsh@nod-labs.com>
* Main CI is flaky; add a separate pipeline, which tests only TK, as a
temporary solution
* Make `pytest` output more verbose
* Remove unnecessary stuff from perf pipeline

---------

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
erman-gurses and others added 10 commits October 4, 2024 14:02
Signed-off-by: erman-gurses <erman@nod-labs.com>
* Move files from `shark-turbine` to `iree/turbine`.
* Update imports
* Update `setup.py`
* Make backward redirect `shark-turbine` -> `iree.turbine` (do we need
this?)

Progress on #28

---------

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
…e. (#196)

Signed-off-by: Stanley Winata <stanley.winata@amd.com>
* Test `nchw_fchw` and `nhwc_hwcf` igemm conv layouts.
* Perf tests will use `nhwc_hwcf`, as IREE seems to produce the best
result for it.

---------

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
This PR adds more information about the language.

Signed-off-by: Harsh Menon <harsh@nod-labs.com>
This PR adds ops for scheduling barriers and scheduling group barriers.
These are placed after every cycle in the kernel.

Signed-off-by: Harsh Menon <harsh@nod-labs.com>
Adds a preliminary TikZ overview figure and an initial description of
the fx tracing process.
This solves the problem in iree-org/iree#18283. The issue is that we
generate casts to/from dynamic tensors that later lowering in IREE
chokes on. My assumption is that IREE should be able to digest this IR,
since it is of the form:

```mlir
    %2 = torch_c.to_builtin_tensor %arg0 : !torch.vtensor<[2,3,11,13],f32> -> tensor<2x3x11x13xf32>
    %cast = tensor.cast %2 : tensor<2x3x11x13xf32> to tensor<?x?x?x?xf32>
    %c0 = arith.constant 0 : index
    %dim = tensor.dim %cast, %c0 : tensor<?x?x?x?xf32>
    %c1 = arith.constant 1 : index
    %dim_0 = tensor.dim %cast, %c1 : tensor<?x?x?x?xf32>
    %c2 = arith.constant 2 : index
    %dim_1 = tensor.dim %cast, %c2 : tensor<?x?x?x?xf32>
    %c3 = arith.constant 3 : index
    %dim_2 = tensor.dim %cast, %c3 : tensor<?x?x?x?xf32>
    %3 = flow.tensor.transfer %cast : tensor<?x?x?x?xf32>{%dim, %dim_0, %dim_1, %dim_2} to #hal.device.promise<@__device_0>
    %cast_3 = tensor.cast %3 : tensor<?x?x?x?xf32> to tensor<2x3x11x13xf32>
    %4 = torch_c.from_builtin_tensor %cast_3 : tensor<2x3x11x13xf32> -> !torch.vtensor<[2,3,11,13],f32>
```
It essentially casts to a dynamic `tensor<...>` for the purpose of
performing `flow.tensor.transfer` and then casts back to a static
`torch.vtensor`. So it should be fine.

With this change we get:
```mlir
    %2 = torch_c.to_builtin_tensor %arg0 : !torch.vtensor<[2,3,11,13],f32> -> tensor<2x3x11x13xf32>
    %3 = flow.tensor.transfer %2 : tensor<2x3x11x13xf32> to #hal.device.promise<@__device_0>
    %4 = torch_c.from_builtin_tensor %3 : tensor<2x3x11x13xf32> -> !torch.vtensor<[2,3,11,13],f32>
```

Signed-off-by: Boian Petkantchin <boian.petkantchin@amd.com>
Signed-off-by: dan <danimal197@gmail.com>
```python
self._index.create_archive_file(str(file_path))
str_file_path = str(file_path)
if not str_file_path.endswith(".irpa"):
    file_path = str_file_path + ".irpa"
```
Contributor:
Should this be `str_file_path = str_file_path + ".irpa"`?

@ScottTodd (Member):

Something is off with the commits on this PR after a force push.

@ScottTodd ScottTodd removed their request for review October 17, 2024 18:04