[ARITH] DeduceBound #40

Merged
merged 27 commits into from
Feb 17, 2017
@ZihengJiang ZihengJiang commented Feb 10, 2017

WIP

std::vector<const Node*> path_;

private:
bool finded_{false};
Member

Rename `finded_` to `found_`.

// a visitor to find the path to the target variable
// from a expression.
class VariableFinder: public IRVisitor {
public:
Member

Rename to `VariablePathFinder`.

Member

We need to watch for errors when the target variable appears in multiple locations in the expression.
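
One way to guard against this is to count occurrences while recording the path, and to report no path unless the target occurs exactly once. The following is a minimal sketch on a toy expression tree, not TVM's IR: `Node`, `Var`, `Bin`, `PathFinder`, and `FindUniquePath` are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical miniature expression node (assumption: not TVM's Node/Expr).
struct Node {
  std::string name;          // non-empty for variables
  std::shared_ptr<Node> a;   // left child, null for leaves
  std::shared_ptr<Node> b;   // right child, null for leaves
};
using NodePtr = std::shared_ptr<Node>;

NodePtr Var(const std::string& name) {
  return std::make_shared<Node>(Node{name, nullptr, nullptr});
}
NodePtr Bin(NodePtr a, NodePtr b) {
  return std::make_shared<Node>(Node{"", std::move(a), std::move(b)});
}

// Walks the tree, counts how often the target node is reached (by identity),
// and remembers the root-to-target path of the first occurrence.
class PathFinder {
 public:
  explicit PathFinder(const Node* target) : target_(target) {}

  void Visit(const Node* n) {
    if (n == nullptr) return;
    stack_.push_back(n);
    if (n == target_) {
      if (count_ == 0) path_ = stack_;
      ++count_;
    }
    Visit(n->a.get());
    Visit(n->b.get());
    stack_.pop_back();
  }

  int count() const { return count_; }
  const std::vector<const Node*>& path() const { return path_; }

 private:
  const Node* target_;
  std::vector<const Node*> stack_;
  std::vector<const Node*> path_;
  int count_{0};
};

// Returns the path only when the target occurs exactly once; otherwise empty.
std::vector<const Node*> FindUniquePath(const Node* root, const Node* target) {
  assert(root != nullptr && target != nullptr);
  PathFinder finder(target);
  finder.Visit(root);
  if (finder.count() != 1) return {};
  return finder.path();
}
```

With this shape the caller can treat an empty path as "cannot deduce" instead of silently following only one of several occurrences.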

bool left = op->a.get() == path_[iter_];
Expr operand = left ? op->b : op->a;
if (is_negative_const(operand)) is_greater = !is_greater;
result /= operand;
Member

There will be a rounding problem here. If the result is a lower bound and the division rounds toward 0, the deduced bound can be wrong.

Maybe we should determine the direction first, before doing the deduction.
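
To make the rounding concern concrete: if the division truncates toward zero, as C++ integer division does, then solving `x * 2 >= 3` with a plain `/` gives `x >= 1`, which admits `x = 1` even though `1 * 2 < 3`. The sketch below fixes the direction first and then picks the rounding mode from it. It assumes a nonzero constant multiplier and plain 64-bit integers rather than TVM `Expr` nodes; `FloorDiv`, `CeilDiv`, `Bound`, and `SolveMul` are hypothetical names, not part of this PR.

```cpp
#include <cassert>
#include <cstdint>

// Division rounding toward negative infinity.
int64_t FloorDiv(int64_t a, int64_t b) {
  assert(b != 0);
  int64_t q = a / b;  // truncates toward zero
  if ((a % b != 0) && ((a < 0) != (b < 0))) --q;
  return q;
}

// Division rounding toward positive infinity.
int64_t CeilDiv(int64_t a, int64_t b) {
  assert(b != 0);
  int64_t q = a / b;  // truncates toward zero
  if ((a % b != 0) && ((a < 0) == (b < 0))) ++q;
  return q;
}

// is_greater == true means x >= value, otherwise x <= value.
struct Bound {
  bool is_greater;
  int64_t value;
};

// Solves x * c >= rhs (is_greater) or x * c <= rhs (!is_greater) for x.
Bound SolveMul(bool is_greater, int64_t c, int64_t rhs) {
  assert(c != 0);
  // Step 1: decide the direction. Dividing by a negative constant flips it.
  if (c < 0) is_greater = !is_greater;
  // Step 2: round so that the bound stays valid: up for a lower bound,
  // down for an upper bound.
  int64_t value = is_greater ? CeilDiv(rhs, c) : FloorDiv(rhs, c);
  return Bound{is_greater, value};
}
```

For example, `SolveMul(true, 2, 3)` yields `x >= 2`, and `SolveMul(false, 2, -3)` yields `x <= -2`, whereas truncating division would have produced `1` and `-1` respectively.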

bool left = op->a.get() == path_[iter_];
Expr operand = left ? op->b : op->a;
if (is_negative_const(operand)) is_greater = !is_greater;
result = left ? result * operand : operand / result;
Member

Same comment about rounding applies here.

Type t = deducer.result.type();
return deducer.is_greater ?
IntSet::range(Range(deducer.result, Cast::make(t, Interval::pos_inf))) :
IntSet::range(Range(Cast::make(t, Interval::neg_inf), deducer.result));
Member

Do not use `range` here; we can use `IntervalSet::make`. Maybe we can consider exposing `IntervalSet` under `arith`.

Member

Consider putting `IntervalSet` and the other data structures in `int_set_internal.h`.

void Visit_(const Mul* op) final {
bool left = op->a.get() == path_[iter_];
Expr operand = left ? op->b : op->a;
if (is_negative_const(operand)) is_greater = !is_greater;
Member

There could be cases where we cannot prove that the operand is either a negative constant or a positive constant, which results in a deduction failure.
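
A three-way sign check makes the failure case explicit: the direction is kept for a provably positive operand, flipped for a provably negative one, and the deduction is abandoned otherwise. This is a minimal sketch under the assumption that the operand's value range is known as a closed integer interval; `Sign`, `Interval`, `ProveSign`, `Direction`, and `DirectionAfterMul` are hypothetical names, and only `is_greater` and `success` echo names used in the diff.

```cpp
#include <cassert>
#include <cstdint>

enum class Sign { kPositive, kNegative, kUnknown };

// Hypothetical closed interval [lo, hi] describing the operand's possible values.
struct Interval {
  int64_t lo;
  int64_t hi;
};

// Proves the sign only when the whole interval lies strictly on one side of 0.
Sign ProveSign(const Interval& r) {
  assert(r.lo <= r.hi);
  if (r.lo > 0) return Sign::kPositive;
  if (r.hi < 0) return Sign::kNegative;
  return Sign::kUnknown;
}

struct Direction {
  bool success;     // false when the sign could not be proven
  bool is_greater;  // meaningful only when success is true
};

// Direction of the inequality after dividing both sides by the operand.
Direction DirectionAfterMul(bool is_greater, const Interval& operand) {
  switch (ProveSign(operand)) {
    case Sign::kPositive:
      return Direction{true, is_greater};
    case Sign::kNegative:
      return Direction{true, !is_greater};
    case Sign::kUnknown:
      break;
  }
  return Direction{false, is_greater};
}
```

An operand range such as `[-1, 3]` contains zero and both signs, so the sketch reports failure rather than guessing a direction.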

@ZihengJiang ZihengJiang changed the title [LANG/PASS] PartitionLoops Deduce Feb 13, 2017
result += op->b;
} else {
result -= op->a;
result = -1 * result;
Member

Change to `-result` or `0 - result`; negation should be overloaded already?


void Visit_(const LT* op) final {
is_greater = false;
is_equal = false;
Member

It seems like bad practice to detect the comparison inside the visitor. What if we have something like

a > 10 && b < 8

Since only the simple case is handled, consider doing the detection outside the visitor.

Contributor Author

@ZihengJiang ZihengJiang Feb 17, 2017

I added a check before deducing.
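
A check of this kind could classify the top-level condition once, before any visitor runs, and reject anything that is not a single comparison (such as the `&&` example above). The sketch below is an assumption about what such a pre-check might look like, not the code that was merged; `CmpKind`, `CompareInfo`, `ClassifyCondition`, and `InclusiveLimit` are hypothetical names, and the `result - 1` case for strict upper bounds is inferred by symmetry with the `result + 1` case in the diff.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical tag for the root node of the condition.
enum class CmpKind { kLT, kLE, kGT, kGE, kAnd, kOther };

struct CompareInfo {
  bool supported;   // false for anything that is not a single comparison
  bool is_greater;  // true for > and >=
  bool is_equal;    // true for <= and >=
};

// Decides direction and strictness up front, outside the visitor.
CompareInfo ClassifyCondition(CmpKind kind) {
  switch (kind) {
    case CmpKind::kLT: return CompareInfo{true, false, false};
    case CmpKind::kLE: return CompareInfo{true, false, true};
    case CmpKind::kGT: return CompareInfo{true, true, false};
    case CmpKind::kGE: return CompareInfo{true, true, true};
    default:           return CompareInfo{false, false, false};
  }
}

// Converts a deduced value into an inclusive integer limit.
// Strict comparisons move the limit by one (assumption for the upper-bound case).
int64_t InclusiveLimit(const CompareInfo& info, int64_t result) {
  assert(info.supported);
  if (info.is_equal) return result;
  return info.is_greater ? result + 1 : result - 1;
}
```

The visitor then only has to walk arithmetic nodes, and an unsupported condition is rejected before any state is mutated.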

if (!d.success) return IntSet();
Expr min = Interval::neg_inf, max = Interval::pos_inf;
if (d.is_greater) {
min = d.is_equal ? d.result : d.result+1;
Member

Add spaces around the `+`.

Expr expr_;
const std::unordered_map<const Variable*, IntSet>& dom_map_;
std::vector<const Node*> path_;
size_t iter_;
Member

Initialize it to 0 directly here.

// a visitor to deduce the bound of a variable from a expression
class BoundDeducer: public IRVisitor {
public:
BoundDeducer(Var target, Expr expr,
Member

Doing everything in the constructor has the problem of not being able to throw an exception out; consider doing it in another function, say `Deduce`.
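
The suggested split might look like the sketch below: the constructor only stores its inputs, and a separate `Deduce()` call does the work and reports failure through a `success` flag. This is a toy model over integers that handles only `+` and `-` steps, not the class from this PR; `Op`, `Step`, `BoundDeducerSketch`, and `Result` are hypothetical names, while `Deduce`, `success`, `path_`, and `iter_` follow names that appear in the thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// One step on the path from the root to the target: (target-side) op operand.
enum class Op { kAdd, kSub, kUnsupported };
struct Step {
  Op op;
  int64_t operand;
};

class BoundDeducerSketch {
 public:
  // The constructor only records the inputs; it cannot fail.
  BoundDeducerSketch(std::vector<Step> path, int64_t rhs)
      : result_(rhs), path_(std::move(path)) {}

  // All work happens here, so failure can be reported to the caller.
  void Deduce() {
    success = true;
    for (iter_ = 0; iter_ < path_.size(); ++iter_) {
      const Step& step = path_[iter_];
      switch (step.op) {
        case Op::kAdd: result_ -= step.operand; break;  // undo "+ c"
        case Op::kSub: result_ += step.operand; break;  // undo "- c"
        default: success = false; return;               // give up
      }
    }
  }

  // Only valid after a successful Deduce().
  int64_t Result() const {
    assert(success);
    return result_;
  }

  bool success{false};

 private:
  int64_t result_;
  std::vector<Step> path_;
  std::size_t iter_{0};
};
```

For `(x + 3) - 2 >= 10` the path from the root is `- 2` then `+ 3`, and `Deduce()` leaves `Result() == 9`, that is `x >= 9`. A caller can check `success` after `Deduce()` and return an empty set on failure, as the `if (!d.success) return IntSet();` line in the diff does.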

@tqchen tqchen merged commit 5198c10 into apache:master Feb 17, 2017
vinx13 added a commit to vinx13/tvm that referenced this pull request Mar 27, 2023
* [microTVM] Fix tvmc tutorial (#14076)

This PR applies appropriate changes to make sure the CI fails if micro_tvmc.sh tutorial fails. This issue was captured in #14074.
This PR also makes changes to avoid this breakage in bash script tutorials in future. In addition, this PR fixes the bug in running TVMC tutorial which happened due to renaming zephyr_board to board.

* [MetaSchedule] Introduce Async Pipeline in MultiLevelTiling (#14009)

This PR introduces async pipeline in the current TVM's MultiLevelTiling Rules. This PR is based on apache/tvm#13966, which is already merged. This is because some conv2d workloads use `tir.if_then_else` to pad the input to the correct size, and this PR uses async copy in such copy statements.

1. Add a subrule in `src/meta_schedule/schedule_rule/multi_level_tiling.h/.cc` that annotates async copy for MultiLevelTiling on supported architectures (>= sm80).

In CUDA Core, this PR has a perf boost of around 1T GFLOP/s in most Conv2d test cases and 1T ~ 2T in most GEMM test cases.

All generated codes, scripts, and traces are available at https://github.com/Rainy-Memory/tvm-async-rule-benchmark.

Currently tested on commit `afbfb7aa7e43732cb716f8e443df696110be6afc` in conv2d NHWC workload, with a RTX 3080 GPU.

**Notice: given the stochastic nature of evolutionary search, performance might become worse if this PR is enabled.**

Workload: Conv2d NHWC

|Shape|Mainline TVM|Mainline TVM with Async|Performance Boost|
|-|-|-|-|
|N=1_H=224_W=224_C=3_K=64_R=7_S=7_STR=2_PAD=3_DIL=1|13838.05219|14687.89452|6.141343581679319%|
|N=1_H=56_W=56_C=64_K=64_R=1_S=1_STR=1_PAD=0_DIL=1|5398.305085|5613.892553|3.9936140067192905%|
|N=1_H=56_W=56_C=64_K=64_R=3_S=3_STR=1_PAD=1_DIL=1|11652.96825|13157.88249|12.91442839038028%|
|N=1_H=56_W=56_C=64_K=256_R=1_S=1_STR=1_PAD=0_DIL=1|10638.8309|11674.68499|9.736540600527816%|
|N=1_H=56_W=56_C=256_K=64_R=1_S=1_STR=1_PAD=0_DIL=1|8692.32829|9469.264089|8.938178277203573%|
|N=1_H=56_W=56_C=256_K=128_R=1_S=1_STR=2_PAD=0_DIL=1|4685.767442|5698.19634|21.606469175684712%|
|N=1_H=28_W=28_C=128_K=128_R=3_S=3_STR=1_PAD=1_DIL=1|9872.787087|10404.60405|5.38669535070061%|
|N=1_H=28_W=28_C=128_K=512_R=1_S=1_STR=1_PAD=0_DIL=1|9974.281496|10073.31657|0.9929043414276753%|
|N=1_H=28_W=28_C=512_K=128_R=1_S=1_STR=1_PAD=0_DIL=1|7075.866932|8564.572712|21.039199780135142%|
|N=1_H=28_W=28_C=512_K=256_R=1_S=1_STR=2_PAD=0_DIL=1|3648.330914|4021.923142|10.240086132713124%|
|N=1_H=14_W=14_C=256_K=256_R=3_S=3_STR=1_PAD=1_DIL=1|8192.954618|9160.182054|11.805599824451525%|
|N=1_H=14_W=14_C=256_K=1024_R=1_S=1_STR=1_PAD=0_DIL=1|8008.870153|9362.825279|16.90569456283206%|
|N=1_H=14_W=14_C=1024_K=256_R=1_S=1_STR=1_PAD=0_DIL=1|5210.062241|6051.208379|16.144646629759908%|
|N=1_H=14_W=14_C=1024_K=512_R=1_S=1_STR=2_PAD=0_DIL=1|2550.787202|3587.902938|40.65865373586739%|
|N=1_H=7_W=7_C=512_K=512_R=3_S=3_STR=1_PAD=1_DIL=1|4350.626084|5432.788068|24.873706981617943%|
|N=1_H=7_W=7_C=512_K=2048_R=1_S=1_STR=1_PAD=0_DIL=1|6672.068026|7663.725217|14.862815953549454%|
|N=1_H=7_W=7_C=2048_K=512_R=1_S=1_STR=1_PAD=0_DIL=1|3142.564263|4297.988014|36.766909259541826%|

Workload: GEMM NN

|Shape|Mainline TVM|Mainline TVM with Async|Performance Boost|
|-|-|-|-|
|M=512_N=256_K=640|8678.46|10607.37|22.226408832903555%|
|M=512_N=384_K=256|8109.13|10290.72|26.902886006267003%|
|M=512_N=512_K=512|11419.83|14000.86|22.601299669084398%|
|M=512_N=3072_K=768|19709.39|18351.61|-6.8890006235606425%|
|M=512_N=768_K=3072|12844.59|13730.88|6.90010346768561%|
|M=896_N=896_K=896|16149.91|16131.39|-0.11467556165947945%|
|M=1024_N=1024_K=1024|18842.11|19662.8|4.355616223448428%|
|M=1152_N=1152_K=1152|15386.79|16736.1|8.769275462913303%|
|M=1536_N=1536_K=1536|18522.67|18872.06|1.88628313304725%|
|M=2048_N=2048_K=2048|19515.42|18874.85|-3.282378754851291%|
|M=3072_N=3072_K=3072|19233.9|19291.42|0.2990553137948975%|
|M=4096_N=4096_K=4096|17122.17|19259.01|12.479960191961652%|
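
The "Performance Boost" column in both tables appears to be the relative change in throughput, i.e. (async minus mainline) divided by mainline, times 100. A minimal sketch that reproduces two GEMM rows from the figures above; the helper name `performance_boost` is hypothetical and not part of the PR:

```python
def performance_boost(mainline: float, with_async: float) -> float:
    """Relative throughput change in percent; negative means a slowdown."""
    return (with_async - mainline) / mainline * 100.0


# Rows taken from the GEMM NN table above.
speedup = performance_boost(8678.46, 10607.37)    # M=512_N=256_K=640
slowdown = performance_boost(19709.39, 18351.61)  # M=512_N=3072_K=768
print(f"{speedup:.2f}% {slowdown:.2f}%")
```

Negative rows such as M=512_N=3072_K=768 are cases where the async variant measured slower, consistent with the notice about the stochastic search.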

* [TVMScript] Use op attribute to control whether to print dtype in TVMScript (#14111)

This PR adds an op attribute `TScriptDtypePrintLocation` and modifies the dtype printing logic of the builtin ops to check this attribute, so that user-defined operators can specify how their dtype arguments are printed by attaching attributes instead of appending members to `dtype_first_arg`/`dtype_last_arg`.

* [Fix][TVMScript] Fix index of metadata in printed script (#14130)

Currently, if the same metadata object (e.g. a multi-line `tir.StringImm`) is referenced more than once in an IRModule, each reference gets a different index into the metadata array. For example, this code

```
from tvm.script import ir as I
from tvm.script import tir as T

str_imm = T.StringImm("aaa\nbbb\n")
@I.ir_module
class Module:
    @T.prim_func
    def foo() -> None:
        A = str_imm
        B = str_imm

    @T.prim_func
    def foo1() -> None:
        A = str_imm
Module.show()
```

where `str_imm` is referenced three times, will generate such output:

```
@I.ir_module
class Module:
    @T.prim_func
    def foo():
        A: T.handle = metadata["tir.StringImm"][0]
        B: T.handle = metadata["tir.StringImm"][1]
        T.evaluate(0)

    @T.prim_func
    def foo1():
        A: T.handle = metadata["tir.StringImm"][2]
        T.evaluate(0)
```

Each reference gets a different metadata index.

This PR fixes this problem by detecting duplicate items in `IRDocsifierNode::AddMetadata`.
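
The deduplication idea can be sketched in plain Python. This is a toy model, not the C++ code of `IRDocsifierNode::AddMetadata`; the class `MetadataTable`, its `add` method, and the choice of identity comparison are all assumptions made for illustration:

```python
from typing import Any, Dict, List


class MetadataTable:
    """Toy stand-in for the printer's metadata arrays, keyed by type name."""

    def __init__(self) -> None:
        self.arrays: Dict[str, List[Any]] = {}

    def add(self, key: str, obj: Any) -> int:
        """Return the index of `obj`, appending it only if it is not already stored."""
        array = self.arrays.setdefault(key, [])
        for index, existing in enumerate(array):
            if existing is obj:  # same object: reuse its index
                return index
        array.append(obj)
        return len(array) - 1


table = MetadataTable()
str_imm = object()  # placeholder for a multi-line tir.StringImm
indices = [table.add("tir.StringImm", str_imm) for _ in range(3)]
print(indices)
```

With this lookup in place, all three references to the same object resolve to one index instead of three.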

* [Pytorch] frontend full_impl fix (#14122)

Minor fix in the PyTorch frontend to compile the GPT-2 model. Reproduction script:
torch_version = 1.13.1
transformers_version = 4.26.1

```
from transformers import GPT2LMHeadModel
import torch
import tvm
from tvm import relay

inp = torch.ones((1, 128)).to(torch.int64)
input_shapes = [("input_ids", ((1, 128), "int64"))]

model = GPT2LMHeadModel.from_pretrained('gpt2', return_dict=False)
trace_model = torch.jit.trace(model, inp, strict=False)
outputs = trace_model(inp)

mod, params = relay.frontend.from_pytorch(trace_model, input_shapes)
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target='llvm', params=params)

runtime = tvm.contrib.graph_executor.GraphModule(lib["default"](tvm.device('cpu', 0)))
runtime.set_input("input_ids", inp.numpy())
runtime.run()
out = runtime.get_output(0).numpy()
print(out)
print('Done...')
```

Before the fix, the error message was:
```
Traceback (most recent call last):
  File "gpt2_compile.py", line 13, in <module>
    mod, params = relay.frontend.from_pytorch(trace_model, input_shapes)
  File "/home/ubuntu/apache_tvm/tvm/python/tvm/relay/frontend/pytorch.py", line 4791, in from_pytorch
    outputs = converter.convert_operators(_get_operator_nodes(graph.nodes()), outputs, ret_name)
  File "/home/ubuntu/apache_tvm/tvm/python/tvm/relay/frontend/pytorch.py", line 4164, in convert_operators
    relay_out = relay_op(
  File "/home/ubuntu/apache_tvm/tvm/python/tvm/relay/frontend/pytorch.py", line 841, in full
    return self.full_impl(data, fill_value, dtype)
  File "/home/ubuntu/apache_tvm/tvm/python/tvm/relay/frontend/pytorch.py", line 743, in full_impl
    fill_value = _expr.const(fill_value, dtype=dtype)
  File "/home/ubuntu/apache_tvm/tvm/python/tvm/relay/expr.py", line 707, in const
    raise ValueError("value has to be scalar or NDArray")
ValueError: value has to be scalar or NDArray
```

because `fill_value` is
```
%0 = cast(64, dtype="float32");
power(%0, 0.5f)
```

* [DOCKER] Configurable NDK version support (#14000)

Make the Android NDK version configurable as a command line argument.

* [Fix][TIR] SampleCategorical apply-to-schedule (#14133)

This PR is another way to fix the issue described in #14118.

Since we do not have a standard for json file on the format of float
numbers (for example, we cannot require a json file producer to print
the "integer" float numbers with at least one decimal), and the json
parser is not responsible for determining if an integer in a json file
should be parsed to a float or an int, the most convenient way of fixing
the SampleCategorical issue will be allowing both FloatImms and IntImms
as input, and converting all IntImms to FloatImms accordingly.

This PR fixes the issue in this way.
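
The ambiguity described above is easy to reproduce with any JSON parser. The sketch below uses Python's `json` module and plain floats rather than TVM's `IntImm`/`FloatImm`; the helper `normalize_probs` is hypothetical and only illustrates the "accept both, convert to float" approach:

```python
import json
from typing import List, Union


def normalize_probs(probs: List[Union[int, float]]) -> List[float]:
    """Accept ints and floats alike and return floats only."""
    out = []
    for p in probs:
        if isinstance(p, bool) or not isinstance(p, (int, float)):
            raise TypeError(f"expected a number, got {type(p).__name__}")
        out.append(float(p))
    return out


# A JSON producer may print 1.0 as "1", so the parser yields an int.
parsed = json.loads("[1, 0, 0.5]")
probs = normalize_probs(parsed)
print(probs)
```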

* [Arith] ConstIntBound was incorrectly assuming bounds were over int64… (#13918)

[Arith] ConstIntBound was incorrectly assuming bounds were over int64_t range

This commit improves the floormod and floordiv conversion check to be
simpler for the negative range and adds a test to cover all integer data types.

* [CMSIS-NN] Reduction in code size of AOT test runner binary (#13815)

* [CMSIS-NN] Reduction in code size of AOT test runner binary


Co-authored-by: Ashutosh Parkhi <ashutosh.parkhi@arm.com>

* [CMSIS-NN] Add a runtime error message (#13643)

[CMSIS-NN] Add a runtime error message

APIs TVMAPISetLastError and TVMGetLastError are used to propagate CMSIS-NN
errors caught in the backend. AOT test runner was improved to observe the contents
of this global variable. A test was added to check for the last set error as part of this
commit.

* [CRT]Cleanup unused macros in crt_config.h.template (#14125)

This PR removes old macros in crt_config.h.template.

* [Fix][Relay] Fix axis transformation in squeeze shape function (#14135)

* fix squeeze shape function issue and add testcase.

* fix lint

* [Unittest] merge test_cp_async_in_if_then_else into test_tir_transform_inject_ptx_async_copy (#14138)

This PR merges two related unittests into one.

* [Frontend][TFLite] Fix conv2d import bug (#14124)

* Fix TFLite frontend bug and add test

* lint

* [ONNX][TORCH] Replace scatter op by scatter_elements (#14019)

* remove scatter attr class

* update pytorch: scatter was replaced by scatter_elements

* remove scatter compute and strategy registration

* remove scatter attrs registration

* update onnx front-end: replace _op.scatter by _op.scatter_elements, add checks

* update oneflow front-end

* update paddlepaddle front-end

* update pytorch utils

* remove front-end scatter definition

* fix scatter strategy for rocm

* small update

* remove scatter definition in back-end

* remove scatter strategy for cuda, gpu. transfer special case to scatter_elements

* fix test

* small fix

* upstream scatter with torch description

* last upstream of scatter in pytorch front-end

* fix reduction attribute in cuda strategy

* set scalar to test instead of tensor. update check for dynamic dim

* skip scalar source check in tests for scatter due to issue on torch side

* remove scatter op implementation from topi/cuda

* remove scatter op implementation from topi. small clean code

---------

Co-authored-by: Valery Chernov <valery.chernov@deelvin.com>

* [TVMScript][Printer] Remove relax prefix for now (#14140)

Remove relax prefix for now

This PR cleans up relax prefix in printer for now.
While these setups are useful and do not cause any
technical debt in the codebase, we remove them given requests.
They can be added back to the unity branch and later as part
of upstream.

* [microNPU] Sum legalization support (#13997)

Supports legalizing a relay sum operation to an equivalent series of NPU operations. It supports the case with an int8 output type and the channel axis.

* [Fix][MetaSchedule] Fix redundant stages in async pipeline for mlt (#14143)

This PR fixes redundant stages if visiting `InitializeWithTuneContext`
multiple times.

* [COMMUNITY] Cheng Wen -> Reviewer (#14153)

Please join me in welcoming @chengven027-intellif as a new Reviewer in TVM.
Cheng has contributed to ONNX/PyTorch frontend and Relay passes, making TVM support more input models.

- [Commits History](https://github.com/apache/tvm/pulls?q=author%3Achengven027-intellif+)
- [Code Review](https://github.com/apache/tvm/pulls?q=reviewed-by%3Achengven027-intellif+)

* [Runtime] Fix high RAM usage when saving / loading parameters of big models (#14147)

* add load_params_from_file

* add save_params_to_file

* avoid making another copy in save_params

* black

* add test

* update doc

* [Relay][Frontend] Span Filling PyTorch (#14050)

* [Relay][Frontend] Span Filling PyTorch

- Construct debug name of C graph instruction as the source name of span for pytorch model.
- To get references to renamed nodes, add a function to export the converted C graph after conversion.
- Add structural_equal comparisons with and without set_span to the existing test cases.
- Add span test cases for frequent conversions.
- Add span test case for exporting model parameter.

* [SpanFillingPyTorch]

- Return TupleGetItem expr from TupleWrapper with the span of its Tuple.
- Add None type symbol in set span for certain conversion.
- Add current_op member variable to PyTorchOpConverter to track which op
  is being converted in the PyTorch frontend.

* [SpanFillingPyTorch]

- Fix the error caused by the quantized params not found after renaming
  the debug name of C graph.

---------

Co-authored-by: Joey Tsai <chunit@qti.qualcomm.com>

* [TRT][BYOC] allow strided_slice ops on selected dimensions (#14142) (#14144)

* [ONNX][TOPI] Add `DFT` operator (#13999)

* init convertor for DFT

* init test for DFT

* init DFT operator in Relay

* update topi implementation for DFT

* clean up

* update ONNX frontend

* support  attribute

* fix error: Expected Array[Tensor], but got Array[index 0: Array]

* support inverse, onesided, dft_length

* update tests for DFT

* update TOPI test for DFT

* add documentation

* fix pylint

* fix cpplint

* fix cpplint

* fix threshold for FP16 (ARM)

* add CUDA compute

* fix pylint

* fix doc string

* code review fixes for ONNX front-end

* code review fixes for TOPI

* rename: stft.py -> signal.py

* pass input_shape and output_shape to verify_dft

* [CRT][microTVM] Enable USMP by default for AoTExecutor + CRT runtime (#14107)

This PR enables USMP by default when AoTExecutor and CRT runtime are selected. Check forum discussion about this change: https://discuss.tvm.apache.org/t/enable-usmp-by-default-in-aot-executor-with-runtime-crt/14406

As a result, the workspace memory in mlperftiny project type is removed since memory allocation is not required. If we keep this workspace, the model doesn't fit since some of the memory is allocated twice.

* [Android] Fix using system libraries in Android apps (#14145)

- Starting from API 31, using `uses-native-library` is required if we
  want to open system library:
  https://developer.android.com/about/versions/12/reference/compat-framework-changes#enforce_native_shared_library_dependencies

  We should specify the OpenCL library in `uses-native-library` in all
  applications where OpenCL backend might be used.

- Updated README files and described how to fix synchronization issues
  in Android Studio.

* [microTVM]Enable TVMC micro with AoT Executor  (#14077)

This PR enables AoT Executor for tvmc micro compilation.

* [bugfix] Fix the write buffer scope of `mma_store_impl` (#14174)

fix

* [Relay] Enhance EliminateCommonSubexpr to support Tuple argument (#14169)

If an argument of a call is a Tuple, we should check its fields.
Different tuples with the same fields should be treated as same inputs

* [TIR] Fix typo in doc (#14178)

* [microTVM] Use QNN schedules to give SOTA performance (#13752)

In #13242, I rewrote microTVM's convolution schedules to give a major improvement in performance. While I demonstrated in tests that my changes worked, they could not be used with relay.build.

This pull request expands the functionality of #13242 and adds new legalize and alter_op passes to take advantage of the quantized schedules. This dramatically improves performance on some models, dramatically cuts RAM usage, and removes the need for autotuning on microTVM. More specifically, for the vww model from MLPerf Tiny running on the nucleo_l4r5zi, this pull request:

- Improves untuned performance from 1741 ms to 137 ms - a 6.8x improvement!
- Improves tuned performance from 337 ms to 137 ms.
- Sets a new state-of-the-art for MLPerf Tiny, beating Plumerai's previous 208 ms record
- Reduces RAM consumption by 73 KB (a large amount on microcontrollers!) by eliminating intermediate buffers.
- Reduces flash consumption for model weights by 5x
- Slightly improves accuracy

@mehrdadh has kindly tested these changes himself, and has confirmed my 137 ms figure.

To enable the schedules that grant these performance improvements, this pull request:

1. Adds out_layout support to the regular and depthwise conv2d schedules from [microTVM] Modernize Arm Cortex-M convolution schedules #13242.
2. Generalizes the schedules from [microTVM] Modernize Arm Cortex-M convolution schedules #13242 to be more widely applicable.
3. Adds a layout alteration pass to ensure regular and depthwise conv2d schedules always get their desired input formats.
4. Adds a conv2d -> depthwise conv2d -> unpadded conv2d rewrite step to remove empty channels from conv2d operators.
5. Adds a conv2d -> average pool -> dense rewrite step to remove empty channels from conv2d operators.
6. Adds an alter_op pass to fold padding into a separate Relay operator.

* Add v0.11.0 docs link to site (#14181)

Update the version menu in TVM documentation to add a specific v0.11.0 release docs link.

* [TIR] Allow TransformLayout with non-inversible index map (#14095)

* [TIR] Allow TransformLayout with non-inversible index map

TransformLayout requires the index map to have an inverse map that can be
calculated by the analyzer in order to check whether padding is added.
However, such a check doesn't work in all cases because of
limitations of the affine analysis, which can only handle a set of
supported patterns. In some cases, even if the index map doesn't
introduce padding, the schedule primitive throws `TransformationIntroducesPaddingError` because it
fails to calculate the inverse index map.

It is safe to allow the buffer to be padded without providing pad_value
because the original loop extent is not changed and the padded region is not accessed.
This PR changes the behavior of `TransformLayout` to allow
non-inversible index maps.

Previous discussion:
https://discuss.tvm.apache.org/t/conflict-free-shared-memory-permutation-in-tensorir/13959/9

* add assume_injective_transform option

* Apply suggestions from code review

Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>

---------

Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>

* [TIR][Analysis] Implement IdentifyMemCpy analysis function (#13947)

* [HotFix][MetaSchedule] Turn off database shash check (#14188)

At this moment, the structural hash values of IR in TVM are platform
dependent (e.g., the hash values of a String may differ on different
platforms). In our recent practice, we found this an obstacle for us
to apply one existing database on different platforms (say we tune
an IRModule with MetaSchedule on Metal, and then apply the database
on CUDA, etc.)

To clear this obstacle, we decide to remove the shash value check. The
purpose of that check is mainly to ensure safety, and thus turning it
off will make no difference in terms of using MetaSchedule in most of
the cases that we can imagine.

Meanwhile, it is equally important that we need to make our structural
hash platform independent. There are plans ongoing for this target.

* [TOPI] Batch Norm Training Mode (#14190)

Prior to this PR, TOPI batch_norm only supports inference.

This PR adds training: bool flag and momentum: float argument to support training mode (update moving_mean / var and return), which aligns with torch.nn.functional.batch_norm.
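
The moving-statistics update in training mode can be sketched for a single channel as follows. This assumes the `torch.nn.functional.batch_norm` convention for `momentum` (new = (1 - momentum) * old + momentum * batch statistic) and uses the population variance for simplicity; PyTorch itself uses the unbiased estimate for the running variance, and which one TOPI uses is not stated here. The function `update_moving_stats` is hypothetical:

```python
from typing import List, Tuple


def update_moving_stats(
    batch: List[float], moving_mean: float, moving_var: float, momentum: float = 0.1
) -> Tuple[float, float]:
    """Blend batch statistics into the moving statistics for one channel."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((v - mean) ** 2 for v in batch) / n  # population variance (assumption)
    new_mean = (1.0 - momentum) * moving_mean + momentum * mean
    new_var = (1.0 - momentum) * moving_var + momentum * var
    return new_mean, new_var


new_mean, new_var = update_moving_stats([1.0, 2.0, 3.0, 4.0], 0.0, 1.0, momentum=0.1)
print(new_mean, new_var)
```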

* [TOPI] Group normalization (#14193)

As more and more ML models nowadays contain the group normalization
computation, we find it beneficial to introduce this op to TOPI level.
It will enable us to optimize the group normalization operation as a
whole in a more convenient way.

This PR introduces the group normalization op to TOPI. The group norm
operation was introduced in https://arxiv.org/abs/1803.08494. The
implementation uses tuple reduction, same as the implementation of layer
norm. Implemented with tuple reduction, the corresponding generated TIR
function can be optimized by cross-thread reduction or rfactor through
MetaSchedule.
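
As a rough illustration of the computation, the sketch below normalizes a flat list of channel values group by group, accumulating the sum and the sum of squares together in one pass (the idea behind a tuple reduction). It is plain Python rather than TOPI code, omits the affine `gamma`/`beta` parameters, and the function name and signature are assumptions:

```python
import math
from typing import List


def group_norm(x: List[float], num_groups: int, eps: float = 1e-5) -> List[float]:
    """Normalize a flat list of channel values group by group."""
    assert len(x) % num_groups == 0, "channels must divide evenly into groups"
    size = len(x) // num_groups
    out: List[float] = []
    for g in range(num_groups):
        group = x[g * size:(g + 1) * size]
        # "Tuple reduction": accumulate sum and sum of squares in one pass.
        total, total_sq = 0.0, 0.0
        for v in group:
            total, total_sq = total + v, total_sq + v * v
        mean = total / size
        var = total_sq / size - mean * mean
        out.extend((v - mean) / math.sqrt(var + eps) for v in group)
    return out


y = group_norm([1.0, 3.0, 10.0, 30.0], num_groups=2)
print(y)
```

Each group is normalized independently, so both groups in the example end up close to `[-1, 1]` despite their different scales.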


Co-authored-by: Bohan Hou <spectrometerh@gmail.com>

* [Fix][TIR] LowerCrossThreadReduction with write-back predicate (#14199)

Prior to this PR, the cross-thread reduction lowering pass does not
add a store predicate to the write-back block. This is in consideration
that for a certain write-back buffer position, all values being stored
(by all the threads) in the write-back block are the same. Since all
threads are writing the same value, we were assuming that not having a
write-back block predicate is fine, because the result will not be wrong
in any way.

However, recently we noticed that some GPU backend compilers detect
this behavior (multiple threads writing to the same position) as a race
condition and thus throw a compilation error. The compiler does not take
into account the fact that all values being stored are the same, and
insists on complaining.

This means that we still need the write-back block predicate to
make things work, and this PR makes that change. I have done integration
tests locally to make sure that the generated kernels are right and
produce the right results numerically.
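
The effect of the predicate can be modelled with a small simulation. The sketch assumes the predicate has the form "thread index equals 0", which is an illustrative guess rather than something stated in this message; `write_back` is a hypothetical helper, not part of the lowering pass:

```python
from typing import List, Optional, Tuple


def write_back(
    reduced: float, num_threads: int, use_predicate: bool
) -> Tuple[Optional[float], List[int]]:
    """Simulate the write-back block; return the stored value and the writing threads."""
    out: List[Optional[float]] = [None]  # one write-back buffer position
    writers: List[int] = []
    for tx in range(num_threads):
        # With the predicate, only thread 0 performs the store.
        if not use_predicate or tx == 0:
            out[0] = reduced
            writers.append(tx)
    return out[0], writers


value_a, writers_a = write_back(42.0, num_threads=4, use_predicate=False)
value_b, writers_b = write_back(42.0, num_threads=4, use_predicate=True)
print(value_a, writers_a, value_b, writers_b)
```

Both variants store the same value, but only the predicated one has a single writer per position, which is what a race detector wants to see.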

* [Unity] Relax VM (#13878)

This PR implements a flexible register-based VM to execute relax programs with dynamic shape and control flow. Design: https://github.com/tlc-pack/relax/wiki/Relax-VM-Design.

Co-Authored-by: Ziheng Jiang <ziheng@apache.org>
Co-Authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-Authored-by: Sunghyun Park <49998730+sunggg@users.noreply.github.com>
Co-Authored-by: Junru Shao <junrushao1994@gmail.com>
Co-Authored-by: Prakalp Srivastava <prakalp@octoml.ai>
Co-Authored-by: Yong Wu <yongcale@gmail.com>
Co-Authored-by: Steven S. Lyubomirsky <slyubomirsky@octoml.ai>
Co-Authored-by: Tianqi Chen <tianqi.tchen@gmail.com>
Co-Authored-by: Hongyi Jin <3231950289@qq.com>

* [Unity] Relax expressions and types (#13901)

* [Unity][IR] First-class StructInfo (#13907)

* [Unity][IR] First-class StructInfo

Relax tracks structural information (such as tensor shape) via `StructInfo` about the values in Relax.

* Fix rust build

---------

Co-authored-by: Junru Shao <junrushao1994@gmail.com>

* [Unity][CI] Unity specific jenkins setup (do not upstream to main) (#13910)

This PR sets up a unity-specific Jenkins with a minimal Jenkinsfile
without sharding and disables most of the tests to reduce overall
cost. We can add tests for the unity branch by configuring the specific
groovy file.

* [Unity] Basic StructInfo Analysis and Expr construction (#13916)

[Unity] Basic StructInfo Analysis and Expr construction.

This PR adds struct info analysis and expr support.
These are logics to construct the IR node and perform
struct info related analysis.

Testcases are added to cover the IR node construction
and related struct info analysis checks.

Co-authored-by: Tianqi Chen <tianqi.tchen@gmail.com>
Co-authored-by: Altan Haan <altanh@cs.washington.edu>
Co-authored-by: Andrew Liu <andrewlliu@gmail.com>
Co-authored-by: Hongyi Jin <3231950289@qq.com>
Co-authored-by: Jiawei Liu <jaway.liu@gmail.com>
Co-authored-by: Junru Shao <junrushao1994@gmail.com>
Co-authored-by: Lesheng Jin <34279105+LeshengJin@users.noreply.github.com>
Co-authored-by: masahi <masahi129@gmail.com>
Co-authored-by: Prakalp Srivastava <prakalp@octoml.ai>
Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-authored-by: Steven S. Lyubomirsky <slyubomirsky@octoml.ai>
Co-authored-by: Sunghyun Park <49998730+sunggg@users.noreply.github.com>
Co-authored-by: Yixin Dong <ubospica@gmail.com>
Co-authored-by: Yong Wu <yongcale@gmail.com>
Co-authored-by: Ziheng Jiang <ziheng@apache.org>

* [Unity] Relax BlockBuilder and ExprMutator (#13926)

This PR adds BlockBuilder: the core data structure to construct Relax AST, and ExprMutator: performs AST mutation for implementing transformation passes.

Co-Authored-by: Tianqi Chen <tianqi.tchen@gmail.com>
Co-Authored-by: Altan Haan <altanh@cs.washington.edu>
Co-Authored-by: Andrew Liu <andrewlliu@gmail.com>
Co-Authored-by: Hongyi Jin <3231950289@qq.com>
Co-Authored-by: Jiawei Liu <jaway.liu@gmail.com>
Co-Authored-by: Junru Shao <junrushao1994@gmail.com>
Co-Authored-by: Lesheng Jin <34279105+LeshengJin@users.noreply.github.com>
Co-Authored-by: masahi <masahi129@gmail.com>
Co-Authored-by: Prakalp Srivastava <prakalp@octoml.ai>
Co-Authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-Authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-Authored-by: Steven S. Lyubomirsky <slyubomirsky@octoml.ai>
Co-Authored-by: Sunghyun Park <49998730+sunggg@users.noreply.github.com>
Co-Authored-by: Yixin Dong <ubospica@gmail.com>
Co-Authored-by: Yong Wu <yongcale@gmail.com>
Co-Authored-by: Ziheng Jiang <ziheng@apache.org>

* [Unity] Relax TVMScript Parser. (#13932)

This PR adds the TVMScript parser/ir_builder support based on the blockbuilder.

Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-authored-by: Junru Shao <junrushao1994@gmail.com>
Co-authored-by: Tianqi Chen <tianqi.tchen@gmail.com>
Co-authored-by: Yuchen Jin <yuchenj@cs.washington.edu>
Co-authored-by: Steven S. Lyubomirsky <slyubomirsky@gmail.com>
Co-authored-by: Yong Wu <yongcale@gmail.com>

* [Unity] Relax TVMScript Printer (#13944)

This PR introduces Relax as a dialect supported by the TVMScript
Printer. Some caveats:
- Needs to rebase to mainline before merging.
- Some tests are skipped because some operators are not upstreamed to
  the unity branch yet.

Co-authored-by: Tianqi Chen <tianqi.tchen@gmail.com>
Co-authored-by: Yuchen Jin <yuchenj@cs.washington.edu>
Co-authored-by: Steven S. Lyubomirsky <slyubomirsky@gmail.com>
Co-authored-by: Yong Wu <yongcale@gmail.com>
Co-authored-by: Prakalp Srivastava <prakalp@octoml.ai>
Co-authored-by: Sunghyun Park <49998730+sunggg@users.noreply.github.com>
Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-authored-by: Hongyi Jin <3231950289@qq.com>
Co-authored-by: Bohan Hou <32121147+spectrometerHBH@users.noreply.github.com>
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>

* [Unity] Relax VM codegen (#13954)

* [Unity] Relax VM shape lowering pass (#13956)

This PR introduces Relax `FunctionPass` and `DataflowBlockPass` API, and the `VMShapeLower` pass to lower the shape expression in Relax to TIR functions and VM shape heap builtin functions.

Co-Authored-by: Ziheng Jiang <ziheng@apache.org>
Co-Authored-by: Lesheng Jin <34279105+LeshengJin@users.noreply.github.com>
Co-Authored-by: Altan Haan <altanh@cs.washington.edu>
Co-Authored-by: Junru Shao <junrushao1994@gmail.com>
Co-Authored-by: Prakalp Srivastava <prakalp@octoml.ai>
Co-Authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-Authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-Authored-by: Steven S. Lyubomirsky <slyubomirsky@octoml.ai>
Co-Authored-by: Sunghyun Park <49998730+sunggg@users.noreply.github.com>
Co-Authored-by: Tianqi Chen <tianqi.tchen@gmail.com>
Co-Authored-by: Yong Wu <yongcale@gmail.com>

* [Unity] e2e Relax minimum build flow (#13961)

This PR introduces the e2e Relax lowering flow (`relax.vm.build`). Tests for each pass in the flow are added.

Co-Authored-by: Altan Haan <altanh@cs.washington.edu>
Co-Authored-by: Andrew Liu <andrewlliu@gmail.com>
Co-Authored-by: Hongyi Jin <3231950289@qq.com>
Co-Authored-by: Jiawei Liu <jaway.liu@gmail.com>
Co-Authored-by: Junru Shao <junrushao1994@gmail.com>
Co-Authored-by: Prakalp Srivastava <prakalp@octoml.ai>
Co-Authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-Authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-Authored-by: Steven S. Lyubomirsky <slyubomirsky@octoml.ai>
Co-Authored-by: Sunghyun Park <49998730+sunggg@users.noreply.github.com>
Co-Authored-by: Tianqi Chen <tianqi.tchen@gmail.com>
Co-Authored-by: Yong Wu <yongcale@gmail.com>
Co-Authored-by: Ziheng Jiang <ziheng@apache.org>

* [Unity][TVMScript] Use explicit `R.shape` in TVMScript (#13979)

As we've introduced `arg_sinfo` in CallNode, implicit shape constructor
is not widely used in TVMScript. This PR removes the implicit shape since
it may cause confusion between shape and tuple.

* [Unity] Relax op: index (#13987)

This PR is about the high-level tensor computation operators in Relax.

This PR includes the tensor indexing operators.

* [Unity] Relax op: datatype (#13986)

* [Unity] Relax op: set (#13990)

This PR is about the high-level tensor computation operators in Relax.

This PR includes the set operators.

Co-authored-by: Prakalp Srivastava <prakalp@octoml.ai>

* [Unity] Relax op: image (#13994)

This PR is about the high-level tensor computation operators in Relax.

This PR includes the image operators.

* [Unity] Relax op: arithmetic, comparison (#13983)

This PR is about the high-level tensor computation operators in Relax.

This PR includes the unary, binary and ternary arithmetic and
comparison operators.

Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-authored-by: Chaofan Lin <1713833595@qq.com>

* [Unity] Relax op: statistical (#13991)

This PR is about the high-level tensor computation operators in Relax.

This PR includes the statistical operators.

* [Unity] Relax op: neural networks (#13993)

This PR is about the high-level tensor computation operators in Relax.

This PR includes the neural network operators.

* [Unity] Relax op: creation (#13984)

This PR is about the high-level tensor computation operators in Relax.

This PR includes the tensor creation operators.

* [Unity] Relax op: linear algebra (#13988)

This PR is about the high-level tensor computation operators in Relax.

This PR includes the linear algebra operators.

Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>

* [Unity] Relax op: search (#13992)

This PR is about the high-level tensor computation operators in Relax.

This PR includes the search operators.

* [Unity] Relax op: manipulation (#13989)

This PR is about the high-level tensor computation operators in Relax.

This PR includes the tensor manipulation operators.

Co-authored-by: Prakalp Srivastava <prakalp@octoml.ai>

* [Unity] NestedMsg Support utility (#13995)

This PR introduces NestedMsg to robustly handle nested-tuple analysis.

Relax supports nested tuple structures in the IR.
Nested tuple structure is important to support advanced groupings in
cases such as gradient calculation and other scenarios.

The possible presence of nested tuples does mean that we need to
robustly handle analysis that contains nested tuple structures in a dataflow graph.

This PR introduces a NestedMsg<T> class that corresponds to a possibly
nested message tuple for a given leaf message class T.
We also introduce various helper functions to compose and decompose messages.
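
To make the idea concrete, here is a small Python analogue that uses ordinary tuples, with `None` standing in for a null message. The helpers `map_nested` and `flatten_nested` are hypothetical and do not mirror the names or signatures of the C++ utilities:

```python
from typing import Any, Callable, List


def map_nested(fn: Callable[[Any], Any], msg: Any) -> Any:
    """Apply `fn` to every leaf of a possibly nested tuple; None stays None."""
    if msg is None:
        return None
    if isinstance(msg, tuple):
        return tuple(map_nested(fn, m) for m in msg)
    return fn(msg)


def flatten_nested(msg: Any) -> List[Any]:
    """Decompose a nested message into a flat list of its non-null leaves."""
    if msg is None:
        return []
    if isinstance(msg, tuple):
        return [leaf for m in msg for leaf in flatten_nested(m)]
    return [msg]


nested = (1, (2, None), ((3,), 4))
doubled = map_nested(lambda v: v * 2, nested)
leaves = flatten_nested(nested)
print(doubled, leaves)
```

An analysis written against helpers like these does not need to special-case how deeply a tuple is nested.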

Co-authored-by: Bohan Hou <32121147+spectrometerHBH@users.noreply.github.com>
Co-authored-by: Yixin Dong <ubospica@gmail.com>
Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>

* [Unity][Pass] Operator Fusion Passes (#14001)

[Unity][Pass] Operator fusion passes

This PR introduces three passes for operator fusion:
1. AnnotateTIROpPattern: analyzes the operator kind from PrimFunc.
2. FuseOps: fuse operators for Relax functions, which adds a new fused
relax primitive function.
3. FuseTIR: fuse corresponding TIR PrimFuncs for the fused relax.

* [Unity][Pass] LambdaLift pass (#14012)

* [Unity][VM] Supporting "compiled" exec mode. (#14015)

[VM] Supporting "compiled" exec mode.

This PR adds support for a "compiled" mode to the VM. The compiled mode translates
the Relax function into a TIR function and drives execution through that TIR function.

It is different from the micro AOT codegen, which generates TIR code that targets
the micro C runtime environment and is useful for resource-limited settings with
a smaller set of features. Both leverage the low-level TIR build that is also shared with TensorIR.

The current implementation targets the full TVM (VM) runtime, which comes with PackedFunc,
object, tuple, closure and all kinds of rich structure support. This also means that
we can leverage the full runtime support to handle things like allocation, dynamic shape,
easy plugins and Python interaction, which are not available in a more limited runtime.

The user uses the same API to load the generated code regardless of
compiled mode or bytecode, and only needs to change one line:

```python
ex = relax.vm.build(mod, target, exec_mode="compiled")
```

The simplicity is thanks to the TVM runtime architecture that allows us to compose things together in objects.
The only difference is how the PackedFunc of high-level driving is being provided. In the case of bytecode
it is normal interpretation and in the case of compiled mode it is TIR.

It is a complete implementation. Unit test cases are added. All codegen build tests are updated to include the two
exec_modes and have passed locally.

Co-authored-by: Junru Shao <junrushao1994@gmail.com>

* [Unity][Pass] BindParams pass, FoldConstant pass (#14016)

This PR introduces FoldConstant/BindParam passes.

Co-authored-by: Yong Wu <yongcale@gmail.com>
Co-Authored-by: Hongyi Jin <3231950289@qq.com>
Co-Authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>

* [Unity][Pass][TuningAPI] Introduce TuningAPI and MetaSchedule pass (#14014)

Add TuningAPI and MetaSchedule tuning pass

* [Unity] Relay -> Relax translator  (#14026)

This PR implements a Relay to Relax translator, which allows us to import Relay workloads to Relax for benchmarking and development purposes (tests and examples are added).

* [Unity][Pass] Normalize Pass (#14031)

This PR implements relax `Normalize` Pass, which allows users to transform Relax IR to normal form, i.e., the expressions are normalized (no nesting and hence the AST is in ANF), and all `checked_type_` and `shape_` of expressions are available. (tests are added).

Co-Authored-by: Yuchen Jin <yuchenj@cs.washington.edu>
Co-Authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-Authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-Authored-by: Tianqi Chen <tianqi.tchen@gmail.com>

* [Unity][BlockBuilder] CallTE convert PrimValue args  (#14028)

Prior to this PR, the `call_te` of BlockBuilder is not capable of converting PrimValue arguments and directly rejects PrimValues instead. This PR fixes this behavior with PrimValue conversion support and one regression test.

Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>

* [Unity][Pass] Wellformed Analysis (#14032)

This PR implements relax wellformed analysis, which checks if the IRModule is well-formed. (tests and examples are added).

Co-Authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-Authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-Authored-by: Tianqi Chen <tianqi.tchen@gmail.com>
Co-authored-by: Steven S. Lyubomirsky <slyubomirsky@gmail.com>
Co-authored-by: Yong Wu <yongcale@gmail.com>
Co-Authored-by: Yuchen Jin <yuchenj@cs.washington.edu>
Co-Authored-by: Yixin Dong <ubospica@gmail.com>
Co-Authored-by: Chaofan Lin <siriusneo@sjtu.edu.cn>
Co-Authored-by: Prakalp Srivastava <prakalp@octoml.ai>
Co-Authored-by: Junru Shao <junrushao1994@gmail.com>

* [Unity][TVMScript] Move tir/relax import in script out of __init__.py (#14033)

Prior to this PR, `python/tvm/script/__init__.py` imports both tir and relax
submodules. As a result, when people do
```python
from tvm.script import tir as T
```
the relax submodule is implicitly imported by `__init__.py` as well.

Since TIR does not rely on Relax, it is good not to import both of them
at the same time. (This can prevent cyclic imports sometimes.)

This PR does this decoupling by introducing two files

* `python/tvm/script/relax.py`
* `python/tvm/script/tir.py`

and removing the imports from `python/tvm/script/__init__.py` and
`python/tvm/script/parser/__init__.py`. With this change, we force people to
manually do `from tvm.script import tir` and `from tvm.script import relax`
to use the TVMScript parser, which is already our conventional way.

* [Unity][Pass] Operator legalization (#14029)

This PR is the operator legalization pass, which transforms high-level
operator calls to `call_tir`s of corresponding low-level TIR PrimFuncs.

- The legalization pass provides customizability, which enables people
to pass in a customized legalization map to override the default
legalization method.

- The legalization supports symbolic shape. (At this moment only pooling
does not support symbolic shape, as TOPI pooling does not support it. This
needs to be fixed in followup PRs.)
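
The customization described above can be pictured as a default table that a user-supplied table overrides. The following is a minimal sketch under stated assumptions, not the pass itself: `DEFAULT_LEGALIZE_MAP`, `legalize`, and the string form of the lowered call are all hypothetical stand-ins, and leaving unregistered ops untouched is an assumption of this sketch.

```python
from typing import Callable, Dict, Optional

# Hypothetical stand-in: a legalization function maps the op's arguments
# to a textual "call_tir" form. The real pass builds IR nodes instead.
LegalizeFunc = Callable[[tuple], str]

DEFAULT_LEGALIZE_MAP: Dict[str, LegalizeFunc] = {
    "relax.add": lambda args: f"call_tir(add, {args})",
    "relax.multiply": lambda args: f"call_tir(multiply, {args})",
}


def legalize(
    op_name: str,
    args: tuple,
    custom_map: Optional[Dict[str, LegalizeFunc]] = None,
) -> str:
    """Lower one high-level op call; entries in custom_map take precedence."""
    table = dict(DEFAULT_LEGALIZE_MAP)
    table.update(custom_map or {})
    if op_name not in table:
        # Assumption of this sketch: ops without a rule are left as they are.
        return f"{op_name}{args}"
    return table[op_name](args)
```

Copying the default table before applying the override keeps the defaults unchanged across calls.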


Co-authored-by: Chaofan Lin <siriusneo@sjtu.edu.cn>
Co-authored-by: Yixin Dong <ubospica@gmail.com>
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>

* [Unity][Op] Add ShapeExpr Tests for Reshape Op (#14035)

This PR specially checks the relax.reshape operator when the input is a ShapeExpr.

* [Unity] Initial PyTorch Frontend (#14037)

[Unity] Initial PyTorch Frontend

This PR introduces initial PyTorch frontend components of Relax, including
- an FX translator that translates a Torch FX graph module to a TVM IRModule,
- a Relax backend of Torch Dynamo, which brings the mechanism to build PyTorch models using the Relax compilation pipeline,
- a pipeline prototype that contains the collection of pre-defined pipelines that optimize and lower an IRModule before passing it to minimum build.

Co-authored-by: Bohan Hou <32121147+spectrometerHBH@users.noreply.github.com>
Co-authored-by: Tianqi Chen <tianqi.tchen@gmail.com>
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>

* [Unity][Pass] Block-level static memory planning (#14038)

This PR introduces the static memory planning pass on the binding block level, as well as an analysis function that estimates the memory usage after the memory planning pass. It supports the following features: nested tuples, reusing the memory of the input of reshape ops, and an estimator that returns the total memory size needed to be allocated before and after memory planning, as well as the number of tensors / memory blocks to be allocated before and after memory planning.

The estimation is static -- it does not consider control flows (such as “if” and cross-function calls). It simply accumulates the size of every alloc_tensor and alloc_storage.
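
A minimal sketch of what such a static accumulation looks like, assuming each allocation is described by a shape and a dtype. The `DTYPE_BYTES` table and the `estimate_bytes` helper are hypothetical and only illustrate the "sum every allocation, ignore control flow" idea.

```python
from math import prod

# Illustrative subset of dtype sizes in bytes (an assumption of this sketch).
DTYPE_BYTES = {"float32": 4, "float16": 2, "int32": 4, "int8": 1}


def estimate_bytes(allocs):
    """Sum the size of every allocation without looking at control flow.

    allocs: iterable of (shape, dtype) pairs, one per alloc_tensor/alloc_storage.
    """
    return sum(prod(shape) * DTYPE_BYTES[dtype] for shape, dtype in allocs)
```

Because branches are not modeled, allocations on both sides of an `if` are counted, so the result is an upper-bound style estimate rather than a measured peak.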

We will produce “`relax.memory.alloc_tensor/storage`” as the results produced by memory planning.

* [Unity] Disallow inline prim_func in relax IR (#14040)

Disallow inline prim_func in relax IR

* [Unity] Update tests to adapt to latest TVMScript syntax (#14039)

Given that some latest changes of TVMScript syntax have been merged,
some test files are now containing deprecated uses of TVMScript syntax.
This PR updates the test files with latest TVMScript syntax so that
running the tests will not trigger deprecation warnings.

Co-authored-by: Tianqi Chen <tqchen@users.noreply.github.com>

* [Unity] Relax dataflow pattern language (matching) (#14041)

The dataflow pattern language for Relax (originally from https://github.com/tlc-pack/relax/pull/163).

The implementation splits patterns into two parts:

- Match an Expression: match an expression syntactically (MatchExprPattern, i.e., DFPatternMatcher);
- Match a Graph: match a graph (cross multiple VarBinding) topologically (MatchGraphPattern);

* [Unity] Statement rewriter for DataflowBlock (#14043)

This PR implements a few APIs to quickly perform statement-level mutation:
`add`/`remove_unused`/`remove_all_unused`/`replace_all_uses`.
It also implements `remove_all_unused` to remove dead statements inside `DataflowBlock`.

* [Unity][Pass] FuseOps FuseTIR fixes (#14044)

This PR fixes two bugs of FuseOps and FuseTIR:

It fixes FuseOps, which previously only rewrote the "main" function of the
IRModule. After the fix, FuseOps goes through each non-primitive
Relax function. Test cases for both FuseOps and FuseTIR are added
to ensure that both passes work for cases of multiple Relax
functions.

It also fixes FuseOps and FuseTIR, which did not take "call_dps_packed" style
"call_tir" into account. The previous behavior directly downcast
the first argument of "call_tir" to GlobalVar, which is not right when
the "call_tir" is in "call_dps_packed" style and the first argument is
a PackedFunc. With this fix, FuseOps and FuseTIR will skip such
"call_tir"s. Tests for both CallTIR and CallOps are added accordingly.

* [Unity][TVMScript] Overload `__neg__` for relax expr (#14045)

This PR overloads `__neg__` given that `relax.negative` is now supported. Besides, it adds `test_op_misc.py` and brings tests for calling overloaded operators.

* [Unity][VM] Add per-op profiling support  (#14053)

Adds per-op profiling support to Relax VM, in a way similar to how Relay VM is instrumented via the common profiling infra in the runtime. Profiling over RPC is supported.

Example output:
```
Name                          Duration (us)  Percent    Device  Count                                 Argument Shapes  
conv2d1                          705,779.00    51.22  hexagon0      1  float32[1, 64, 56, 56], float32[1, 64, 54, 54]  
conv2d                           669,589.00    48.60  hexagon0      1  float32[1, 64, 56, 56], float32[1, 64, 56, 56]  
relu                                 683.00     0.05  hexagon0      1  float32[1, 64, 56, 56], float32[1, 64, 56, 56]  
relu1                                679.00     0.05  hexagon0      1  float32[1, 64, 54, 54], float32[1, 64, 54, 54]  
vm.builtin.check_tensor_info          28.00     0.00  hexagon0      1                          float32[1, 64, 56, 56]  
vm.builtin.match_shape                25.00     0.00  hexagon0      1                          float32[1, 64, 56, 56]  
----------                                                                                                             
Sum                            1,376,783.00    99.93                6                                                  
Total                                  0.00               cpu0      1                                                  
Total                          1,377,809.00           hexagon0      1                                                  

Configuration
-------------
Number of threads: 4
Executor: VM
```


The original PR: https://github.com/tlc-pack/relax/pull/422

* [Unity][BYOC] Add pattern-based partitioning pass (#14054)

This adds a new pass, FuseOpsByPattern, which applies pattern matching to each function in the given module, and groups matched expressions into a new function. The end result is similar to FuseOps, but fusion is driven completely by
the provided patterns. The implementation also reuses OperatorFusor used by FuseOps to create grouped functions from partitioned groups, further illustrating the similarity between the two passes.

The new pass will serve the same role the MergeComposite pass plays in Relay BYOC - grouped functions are annotated with the "composite" attribute to denote what operations a given function consists of, and offloaded to external backends. But it can be also useful in non-BYOC settings, for example to support advanced fusion that the op-kind based one doesn't handle (fused MHA, conv2d / gemm + reduction fusion, etc).

The original PR: https://github.com/tlc-pack/relax/pull/366

* [Unity] Relax op: collapse sum (#14059)

This PR brings high-level operators `relax.collapse_sum_like` and `relax.collapse_sum_to`, which are useful when doing AD in Relax. To achieve this, it exposes the interface of `topi.collapse_sum`. Moreover, this PR also implements the legalization of these ops and adds corresponding tests.

* [Unity][Fix][Pass] Fix FuseOps for lack graph edges (#14058)

This PR fixes a mistake of #14044. In #14044, in VisitLeaf of graph
construction of FuseOps, we first check if the input node is a Leaf and
then check if it is a Tuple. This is not right: as Tuple is not
categorized as a leaf node, when the input node is a Tuple, the
function returns early since the input is not a LeafNode, and the check
for Tuple is never reached.

Our existing unit tests failed to catch this
mistake. This PR adds a regression test for this case, which ensures
that the tuple is always visited.
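
The ordering problem can be reproduced in a few lines. This is a sketch with hypothetical stand-in classes (`Leaf`, `TupleNode`) rather than the real Relax node types; it only shows why checking for a leaf before checking for a tuple makes the tuple branch unreachable.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Leaf:
    name: str


@dataclass
class TupleNode:
    fields: List[object] = field(default_factory=list)


def visit_leaf_buggy(node, visited):
    # Leaf check first: a TupleNode is not a Leaf, so we return here
    # and never reach the tuple branch below.
    if not isinstance(node, Leaf):
        return
    if isinstance(node, TupleNode):  # unreachable
        for f in node.fields:
            visit_leaf_buggy(f, visited)
        return
    visited.append(node.name)


def visit_leaf_fixed(node, visited):
    # Tuple check first, so every field of the tuple gets visited.
    if isinstance(node, TupleNode):
        for f in node.fields:
            visit_leaf_fixed(f, visited)
        return
    if not isinstance(node, Leaf):
        return
    visited.append(node.name)
```

With a tuple of two leaves as input, the buggy version records nothing while the fixed version records both fields, which is the kind of case a regression test would pin down.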

* [Unity][Pass] Remove Unused Function (#14061)

This PR implements a pass to clean up unused functions. 

Co-authored-by: masahi <masahi129@gmail.com>

* [Unity][BYOC] Add pass to merge composite functions to offload large subgraphs (#14062)

This PR adds a pass that merges neighboring calls to composite functions offloaded to the same external backend into one function. This is important for backends that want to receive as large subgraph as possible, for example TensorRT. It plays the same role as the MergeCompilerRegion pass in Relay BYOC does, and the algorithm follows the same idea described in https://discuss.tvm.apache.org/t/relay-improved-graph-partitioning-algorithm/5830. 

Original PR
https://github.com/tlc-pack/relax/pull/372

Substantial improvement by @yelite 
https://github.com/tlc-pack/relax/pull/411

Related fix PR by @yelite 
https://github.com/tlc-pack/relax/pull/406

Co-authored-by: Lite Ye  <yelite958@gmail.com>

* [Unity][Frontend] Annotate number of non-static input of FX function (#14067)

* [Unity][Transform] Add LiftTransformParams pass (#14069)

This PR adds a pass `LiftTransformParams`. It allows compiling the
end-to-end model without weights provided. The idea is to annotate the
input parameters that are weights, identify and lift the
transformations applied to weights, and compile them into a separate function
`transform_params` that can be executed at runtime. Users can run
`transform_params` with weights to get the weights for the optimized
model as a prep step before deployment. In this way, we perform the
same optimizations and defer the weight transformations to the user
side, while the overhead of the deferred weight transformation can be
ignored as it only needs to be run once.

This pass is integrated with the default `vm.build`. It is optional and
only necessary when the parameters are kept as inputs when importing the
model from the frontend.

* [Unity][BYOC][Pass] RunCodegen and TensorRT  (#14078)

This PR introduces the fundamental workflow for BYOC and integrates TensorRT as a demonstration.

* [Unity][Pass] Canonicalize Bindings (#14079)

It may be useful for some passes to collapse chains of definitions, particularly after other compiler transformations that may reduce or simplify some expressions.

This pass will take chains of definitions and replace references to later definitions with references to the original one. It works by checking `LookupBinding` for each var use-site and replacing the var with its definition if the definition was another var. Additionally, `MatchCast` bindings where the LHS and the RHS are guaranteed to match at compile time are canonicalized into ordinary `VarBinding`s.

Example:
```python
y = x
z = y
w = z
o = w
p = o
```
Will be replaced with
```python
y = x
z = x
w = x
o = x
p = x
```
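
The rewrite above can be sketched in plain Python. This is an illustration under simplifying assumptions, not the pass: bindings are `(var, rhs)` pairs, a string `rhs` stands for a variable reference, anything else stands for an opaque expression, and the hypothetical `canonicalize_bindings` helper does not rewrite variable uses nested inside expressions.

```python
def canonicalize_bindings(bindings):
    """Replace each var-to-var binding with a binding to the chain's origin.

    bindings: list of (var, rhs) in program order. A str rhs is a variable
    reference; any other rhs is treated as an opaque expression.
    """
    origin = {}  # var -> the original definition it ultimately aliases
    result = []
    for var, rhs in bindings:
        if isinstance(rhs, str):
            rhs = origin.get(rhs, rhs)  # follow the alias if there is one
            origin[var] = rhs
        result.append((var, rhs))
    return result


chain = [("y", "x"), ("z", "y"), ("w", "z"), ("o", "w"), ("p", "o")]
print(canonicalize_bindings(chain))
```

Since each alias is resolved as soon as it is bound, one forward pass over the bindings is enough; no chain has to be walked more than one step.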

Original PR: https://github.com/tlc-pack/relax/pull/233

Co-authored-by: Steven S. Lyubomirsky <slyubomirsky@gmail.com>

* [Unity] Add testcases for `expr_args_converter` (#14080)

This test file was missing when we added the `expr_args_converter`. This
PR adds it.

* [Unity][BYOC] Add CUTLASS backend (#14081)



Co-authored-by: Lite Ye  <yelite958@gmail.com>

* [Unity][BYOC] Add DNNL backend (#14082)

This PR adds dnnl backend to the unity flow.

* [Unity][Op] `log_softmax` and `cross_entropy_with_logits` (#14083)

This PR introduces two high-level operators log_softmax and cross_entropy_with_logits, which are important when we are calculating CrossEntropyLoss (in torch).

Co-authored-by: Yixin Dong <ubospica@gmail.com>

* [Unity][Analysis] TIR pattern kind analysis for multi-buffer write block (#14075)

This PR supports TIR pattern kind analysis for TIR blocks which write
to multiple buffers, which is helpful for normalization operators like
layernorm, groupnorm, etc.

Prior to this PR, the analyzer does not support blocks which write to
multiple buffers. On seeing such a block, the analyzer simply sets the
analysis result to "opaque". With this PR, on seeing a block which
writes multiple buffers, the analyzer will check if all the BufferStores
have the same indices. And it will only set the result to "opaque" when
the BufferStores have different indices.

By doing this, the analysis works for common cases where a block may
write to multiple buffers, like layernorm or groupnorm.
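
The rule described above reduces to comparing the index tuples of all stores in a block. A minimal sketch, assuming each store is represented as a `(buffer_name, indices)` pair; `stores_force_opaque` is a hypothetical helper name, not the analyzer's API.

```python
def stores_force_opaque(stores):
    """Return True only if the block's BufferStores use different indices.

    stores: list of (buffer_name, indices) pairs, indices being a tuple.
    """
    if len(stores) <= 1:
        return False
    first_indices = stores[0][1]
    return any(indices != first_indices for _, indices in stores[1:])
```

A layernorm-style block that writes a mean and a variance buffer at the same `(i,)` position passes this check, while a block writing `A[i, j]` and `B[j, i]` would still be treated as opaque.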

Besides the unit test for the analysis itself, this PR also adds a unit
test for the FuseOps pass, to make sure that a "layernorm + relu" pattern can
be fused together.

* [Unity][Fix][Pass] FoldConstant with DCE in dataflow block (#14087)

The current FoldConstant pass does not support removing unused bindings
in the post-folding function. Therefore, for large real-world models,
the built executable will be excessively large because of the redundant unused
constants.

This PR removes the redundant unused constant bindings in FoldConstant
by using the analysis function "RemoveAllUnused".

Note that "RemoveAllUnused" only works at dataflow block level.
Therefore FoldConstant will not remove unused bindings outside of
dataflow blocks.
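
Removing unused bindings inside a single block is a standard backward liveness sweep. The sketch below is a generic illustration under assumptions and is not the "RemoveAllUnused" implementation: `drop_unused_bindings` is a hypothetical name, and each binding is a `(var, used_vars)` pair in program order.

```python
def drop_unused_bindings(bindings, outputs):
    """Keep only bindings that (transitively) feed one of the block outputs.

    bindings: list of (var, used_vars) in program order.
    outputs: iterable of variable names that leave the block.
    """
    live = set(outputs)
    kept = []
    # Walk backwards so a binding's uses become live before we reach
    # the bindings that define them.
    for var, used_vars in reversed(bindings):
        if var in live:
            kept.append((var, used_vars))
            live.update(used_vars)
    kept.reverse()
    return kept
```

After folding, the original constants typically feed nothing, so a sweep like this drops them while keeping the folded result that the output depends on.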

* [Unity] Refactor Relax Build JIT UX (#14088)

This PR refactors relax build so it gets exposed at the opt-level.
We also introduce an explicit jit functionality to handle
live loading of compiled artifacts from cutlass.

We also move relax vm to runtime so it can be clearly isolated
from the rest of the compiler stack.

* [Unity][Relax] Set Shape Function to Be Host Function (#14090)

Set shape function to be host func.

* [Unity] Fix typo in the comment (#14096)

* [Unity] Lower `shape_of` to a builtin (#14093)

This PR lowers shape_of op to a Relax VM builtin, and changes a utility function to take StructInfo as input.

Co-authored-by: Steven S. Lyubomirsky <slyubomirsky@gmail.com>

* [Unity] Relax Recursive function (#14092)

This PR adds TVMScript local recursive function support. It also updates the lambda lifting pass, removes CalledGlobalVars since it is no longer used, and updates the well-formed pass to allow undefined vars for recursive calls.

* [Unity][Layout] Add layout transformation analysis for PrimFunc (#14066)

* [Layout] Add layout transformation analysis for PrimFunc.

This change adds a PrimFunc level analysis to suggest layout transformations to block and buffers in the PrimFunc based on the layout transformations to PrimFunc outputs.

* Add support for multiple blocks such as split op.

* Add negative tests and increase coverage.

* fix warning message

* fix lint

* remove unused header

* Address comments.
Moved some utility functions to support/array.h
improve doc

* fix deprecation warn T.var("int64") to T.int64()

* address comments

* [Unity] Remove attributes of relax.print, assert and unique (#14101)

Remove the attributes of operators assert, print and unique.
Use PrimValue as substitute.

Co-authored-by: Steven S. Lyubomirsky <slyubomirsky@gmail.com>
Co-authored-by: Prakalp Srivastava <prakalp@octoml.ai>

* [Unity][BYOC]Add relax backend pattern registry (#14106)

* Add relax backend pattern registry

* Add doc

* [Unity] Update tests again to adapt to latest TVMScript syntax (#14115)

* finished

* fix

* rollback merge_composite_functions

* [Unity][Fix] Fix bug in MergeCompositeFunctions (#14117)

Currently `MergeCompositeFunctions` modifies the map while iterating over it, which makes tests/python/relax/test_transform_merge_composite_functions.py fail. This PR fixes this bug.
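
The same class of bug is easy to show in Python, where mutating a dict during iteration raises an error instead of silently misbehaving. This is only an analogy under assumptions: the pass itself is not written in Python, and `prune_unsafe` / `prune_safe` are hypothetical helpers, not the actual fix.

```python
def prune_unsafe(groups):
    """Deletes entries while iterating the same dict; raises RuntimeError."""
    for name, members in groups.items():
        if not members:
            del groups[name]
    return groups


def prune_safe(groups):
    """Iterates over a snapshot of the keys, so mutation is safe."""
    for name in list(groups):
        if not groups[name]:
            del groups[name]
    return groups
```

Iterating over a snapshot (or collecting the changes first and applying them after the loop) is the usual way to avoid invalidating the iteration.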

* [Unity][BlockBuilder] Add `name_hint` argument for `emit` and `emit_output` (#14126)

This PR adds a `name_hint` argument to the `emit` and `emit_output` APIs of the Relax BlockBuilder. The argument exists on the C++ side but was not exposed to the Python side (so users of the Python `bb.emit` get `name_hint` as `""` by default).

Co-authored-by: Yixin Dong <ubospica@gmail.com>

* [Unity][WEB] Relax vm on web runtime (#14131)

This PR brings initial relax vm support on web runtime

* [Unity] Add Global info (#14132)

* [Unity][BYOC] Add transposed matmul support to Relax CUTLASS BYOC (#14128)

Add transposed matmul support for Relax CUTLASS

* [Unity][TVMScript] emit_te sugar (#14123)

This PR adds R.emit_te meta-programming mechanism to emit a topi operator from TVMScript

* [Unity][BYOC] Assign group to unused bindings and ignroe PrimFunc (#14139)

* [Unity][BYOC] Assign group to unused bindings and ignroe PrimFunc

* Update fuse_ops.cc

* [Unity] Add callback to FuseOpsByPattern to check match result is accepted (#14109)

* [Unity] Add callback to FuseOpsByPattern to check match result is accepted

* add callnode to callback args

* update pattern registry

* fix

* [Unity][Legalize] Fix Scalar Constant Legalization (#14127)

This PR fixes the issue of loss of data type during Legalization. Previously, if we use a constant scalar in operators like `multiply`, it will automatically be converted to a python data type variable, which may lose its original data type. For example, `float16` may become python `float` and be interpreted as `float32` later.

This is now fixed by avoiding scalar value conversion. The conversion could be added back once we have better support for scalar prim value.

Co-authored-by: Sunghyun Park <49998730+sunggg@users.noreply.github.com>
Co-authored-by: Wuwei Lin <wuwei@apache.org>

* [Unity][Pass] Enhance constant folding to fold relax ops by evaluating them. (#14146)

* [Unity][Pass] Enhance constant folding to fold relax ops
by evaluating them.

This uses the registered legalization function attached to
the op to lower it to call_tir and uses the existing call_tir
folding mechanism to fold it.

This kind of op folding is only allowed within dataflow block
as ops could have side-effects.

Limitations:
* This currently does not support folding ops
that could lower to multiple call_tir bindings.
* Folding by evaluating ops is not always beneficial.
We need a heuristic to check if it is useful. This is
not implemented yet and folding is always allowed
by evaluating expressions.

* fix ci error

* fix doc

* fix bug

* [Unity][Debugging] AST printer (#14152)

This PR transfers over the AST printer from tlc-pack/relax. The AST printer is a debugging tool that prints out a Relax AST in a precise and human-readable format, which can be helpful for debugging the parser or various passes.

Co-authored-by: Yuchen Jin <yuchenj@cs.washington.edu>
Co-authored-by: Lesheng Jin <34279105+LeshengJin@users.noreply.github.com>
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-authored-by: Ruihang Lai <ruihangl@cd.cmu.edu>
Co-authored-by: Tianqi Chen <tqchen@users.noreply.github.com>

* [Unity][Pass] Support Symbolic Shape Deduction during BindParam (#14154)

`BindParam` replaces function params with constant nodes. However, it will
drop the shape information of the params. Consider the following case:

```python
@R.function
def main(
    x: R.Tensor(("batch", "m"), dtype="float32"),
    w0: R.Tensor(("n", "m"), dtype="float32"),
    b0: R.Tensor(("n",), dtype="float32"),
    w1: R.Tensor(("k", "n"), dtype="float32"),
    b1: R.Tensor(("k",), dtype="float32"),
) -> R.Tensor(("batch", "k"), dtype="float32"):
    batch = T.Var("batch", "int64")
    k = T.Var("k", "int64")
    m = T.Var("m", "int64")
    n = T.Var("n", "int64")
    with R.dataflow():
        lv0 = R.call_tir("linear0", (x, w0, b0), out_sinfo=R.Tensor((batch, n), dtype="float32"))
        out = R.call_tir("linear1", (lv0, w1, b1), out_sinfo=R.Tensor((batch, k), dtype="float32"))
        R.output(out)
    return out
```

The current pass will simply drop the symbolic vars `n` and `k`, which causes
undefined vars during build:
```python
@R.function
def main(x: R.Tensor((1, "m"), dtype="float32")) -> R.Tensor(dtype="float32", ndim=2):
    m = T.Var("m", "int64")
    n = T.Var("n", "int64")
    k = T.Var("k", "int64")
    with R.dataflow():
        lv0 = R.call_tir("linear0", (x, metadata["relax.expr.Constant"][0], metadata["relax.expr.Constant"][1]), out_sinfo=R.Tensor((1, n), dtype="float32"))
        out = R.call_tir("linear1", (lv0, metadata["relax.expr.Constant"][2], metadata["relax.expr.Constant"][3]), out_sinfo=R.Tensor((1, k), dtype="float32"))
        R.output(out)
    return out
```

This PR updates the pass to bind the symbolic shape during binding.
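
One way to picture the binding step is as unification between the declared symbolic shapes and the concrete shapes of the bound constants. The sketch below assumes shapes are tuples of ints and symbol names; `bind_symbolic_shapes` and `substitute` are hypothetical helpers for illustration and are not the pass's API.

```python
def bind_symbolic_shapes(param_shapes, bound_shapes):
    """Derive symbol values from the shapes of the bound constants.

    param_shapes: {param: tuple of int or str}, str meaning a symbolic dim.
    bound_shapes: {param: tuple of int}, shapes of the constants being bound.
    """
    symbols = {}
    for name, concrete in bound_shapes.items():
        declared = param_shapes[name]
        if len(declared) != len(concrete):
            raise ValueError(f"rank mismatch for {name}")
        for dim, value in zip(declared, concrete):
            if isinstance(dim, str):
                # setdefault returns the earlier value if the symbol is known.
                if symbols.setdefault(dim, value) != value:
                    raise ValueError(f"conflicting values for {dim}")
            elif dim != value:
                raise ValueError(f"static dim mismatch for {name}")
    return symbols


def substitute(shape, symbols):
    """Replace every known symbolic dim in shape by its bound value."""
    return tuple(symbols.get(d, d) if isinstance(d, str) else d for d in shape)
```

For the function above, binding `w0` with shape `(8, 4)` and `w1` with shape `(2, 8)` yields `n = 8`, `m = 4`, `k = 2`, so an output annotated `(batch, n)` becomes `(batch, 8)` and no dangling symbol is left behind.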

* [Unity][Analysis] Checking function return struct info in well-formed check (#14155)

The current well-formed check misses the check of function return struct info,
so a function may mistakenly pass the check even if there are undefined vars in its
return struct info.

* [Unity][BYOC] Use Relax legalize + CPU build for reference in tests (#14162)

* clean dnnl test

* clean trt test

* clean cutlass test

* fix gelu legalize for fp16

* use memoize in dnnl and trt tests

* [Unity] Add bind_constants option to FuseOpsByPattern (#14151)

* [Unity] Add lift_constatns option to FuseOpsByPattern

* lift_constants -> bind_constants

* [Unity][Analysis] Analysis for detecting recursion in Relax (#14149)

* DFS based attempt to detect mutual recursion

* Use Johnson's circuit-detecting algorithm instead

* Fix control flow test

* Detect all recursion anyway

* Add new test cases for simple recursion

* Fix mistake in test case

* Include missing dependencies

* Remove trailing whitespace

* Dependencies are simply references, not necessarily calls

* More trailing whitespace

* Newline at end of file

* Fix spacing in docstring

Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>

---------

Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>

* [Unity][BYOC] Add batch matmul support to Relax CUTLASS BYOC (#14166)

* Add batch matmul support to Relax CUTLASS BYOC

* Allow more dtypes

* Fix tests

* Revert how to get batch attr

* [Unity][Op] Full support of Relax op `power` (#14171)

This PR provides full support of `R.power`, including op registration, legalization, overloading `__pow__` for Expr, and the torch fx frontend.

* [Unity][Analysis] Restore Python bindings for var analyses (#14180)

Restore Python bindings for var analyses

* [Unity][OP] Add an operator for fused multi head attention (#14150)

* [Unity][OP] Add an operator for fused multi head attention

This PR introduces the new relax operator `R.nn.attention` for fused multi head attention, and adds support for fused multi head attention to relax cutlass BYOC. The inputs of the operator are the query, key and value tensors, with `BSNH` layout, namely `[batch size, sequence length, number of heads, dimension of heads]`. The output shares the same layout as the input tensors.

* remove useless codes, remove attrs and add memoize

* add more dispatches

* nit and fix rebase

* fix linter

* add support for bias

* fix lint

* BNSS layout for bias

* update doc

* fix typo

* support bias broadcast

* [Unity][WEBGPU] Codegen improvements and WebRuntime (#14187)

This PR makes various improvements to web codegen in the relax web runtime:

- Correct support of shift operators.
- Update relax vm to make most use of internal allocators.
- Update the webgpu API to the latest spec.

* [Unity][Transform] LiftTransformParams handling multiple functions (#14192)

Previously, the LiftTransformParams pass only worked on function
`"main"`. This is a bit restrictive, as in our recent practice on stable
diffusion there are cases where multiple Relax functions inside an
IRModule all need to be transformed.

Therefore, this PR enhances the LiftTransformParams pass, so that it
will now transform **all** functions **with attribute `num_input`**. For
functions without this attribute, the pass will simply skip them.
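
The selection rule can be sketched as a filter over a module's functions. Everything here is a hypothetical stand-in: a module is modeled as a dict from function name to an attribute dict, and treating the first `num_input` parameters as runtime inputs and the rest as weights is an assumption of this sketch.

```python
def functions_to_transform(module):
    """Return the names of functions that carry the num_input attribute."""
    return [name for name, attrs in module.items() if "num_input" in attrs]


def split_params(params, num_input):
    """Assumed convention: the first num_input params are runtime inputs,
    the remaining ones are weights whose transformations can be lifted."""
    return params[:num_input], params[num_input:]
```

Functions without the attribute simply never enter the list, which matches the "skip them" behavior described above.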

* [Unity][Op] Group normalization (#14194)

* [TOPI] Group normalization

As more and more ML models nowadays contain the group normalization
computation, we find it beneficial to introduce this op to TOPI level.
It will enable us to optimize the group normalization operation as a
whole in a more convenient way.

This PR introduces the group normalization op to TOPI. The group norm
operation was introduced in https://arxiv.org/abs/1803.08494. The
implementation uses tuple reduction, same as the implementation of layer
norm. Implemented with tuple reduction, the corresponding generated TIR
function can be optimized by cross-thread reduction or rfactor through
MetaSchedule.
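
As a reference for the computation itself, here is a plain-Python sketch of group normalization for one sample. It is not the TOPI implementation: the `group_norm` helper, the list-of-channels layout, and the omission of the learned scale and shift are all simplifying assumptions. The sum and the sum of squares are accumulated together in one pass, which is the idea behind computing both statistics in a single tuple reduction.

```python
import math


def group_norm(x, num_groups, eps=1e-5):
    """Normalize each group of channels to zero mean and unit variance.

    x: list of channels for one sample, each channel a flat list of floats.
    """
    channels = len(x)
    if channels % num_groups != 0:
        raise ValueError("number of channels must be divisible by num_groups")
    per_group = channels // num_groups
    out = []
    for g in range(num_groups):
        group = x[g * per_group:(g + 1) * per_group]
        # Accumulate both statistics in one pass over the group.
        total, total_sq, count = 0.0, 0.0, 0
        for channel in group:
            for value in channel:
                total += value
                total_sq += value * value
                count += 1
        mean = total / count
        var = total_sq / count - mean * mean
        scale = 1.0 / math.sqrt(var + eps)
        out.extend([[(v - mean) * scale for v in channel] for channel in group])
    return out
```

Deriving the variance from the two running sums avoids a second pass over the data, at the cost of some numerical robustness compared with a two-pass formulation.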

Prior to this PR, the group normalization operations in frontend models
are translated to a series of operations, which brings inconvenience
when we want to optimize the group norm op as a whole.

With the TOPI implementation of group norm being introduced by #14193,
we can now use it to legalize the high-level group norm op and optimize
it using cross-thread reduction or rfactor via MetaSchedule.


Co-authored-by: Bohan Hou <spectrometerh@gmail.com>

* [Unity][Op] Argmax and argmin (#14195)

This PR introduces full support for the argmax and argmin ops to the unity
branch, including the structure info inference, the legalization, and
the translation from Torch FX.

* [Unity][Op] Legalize `round`, `floor`, `ceil`, `sign` (#14198)

This PR implements the legalization for four unary operators:
* round,
* floor,
* ceil,
* sign.

Unit tests are provided accordingly.

* [Unity][Frontend] FX translator supporting more ops (#14196)

This PR improves the torch FX translator in the following ways:
* support unary op `sigmoid` and `round`,
* support in-place `fill`, `triu` and `tril`,
* support `tensor`, `arange`, `empty`,
* support `bmm` (batch matrix multiplication),
* support `astype`,
* support `chunk` and `squeeze`.

This PR also fixes `Embedding`. Previously the translation assumed that
the input to Embedding would only be 1-dimensional, and threw an
exception when the input had more than one dimension (i.e., batched).
This PR adds support for such inputs.

* [Unity][Frontend] FX translator returning weights with `keep_params_as_input` (#14197)

PR #14067 introduces the flag `keep_params_as_input` to the FX
translator, for the purpose of handling the model weights outside of the
translated Relax function.

This PR takes a further step, by returning the model weights as
NDArrays when the flag `keep_params_as_input` is true. With this PR, the
translator can now return the weights upon request. Otherwise,
after the import we would lose the model weights in the given PyTorch
model.

* [Unity][Fix] FX translating dtype (#14201)

This PR fixes a bug of the current FX translator when dealing with
dtype.

Previously, the translator did not take cases such as
```python
dtype = x.getattr("dtype")
```
into consideration. In this case, the dtype will be a fx.Node object,
while the translator assumes that the dtype is either a string or
a torch native datatype (e.g., torch.float32).

This PR fixes this by doing an environment table lookup first for all
dtypes.
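
The lookup-first idea can be sketched without PyTorch. The `Node` class below is a hypothetical stand-in for an FX node, `env` stands for the translator's environment table, and `resolve_dtype` is an illustrative helper rather than the translator's actual method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Stand-in for an FX graph node (hypothetical, for illustration)."""
    name: str


def resolve_dtype(dtype, env):
    """Resolve a dtype that may be given directly or as a graph node."""
    # Look the value up in the environment table first, so that a node
    # produced by something like x.getattr("dtype") is replaced by its value.
    if isinstance(dtype, Node):
        dtype = env[dtype]
    if not isinstance(dtype, str):
        raise TypeError(f"unsupported dtype: {dtype!r}")
    return dtype
```

Doing the lookup before any type-specific handling means the rest of the dtype conversion only ever sees concrete values.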

* [Unity][TIR][Pass] ForceNarrowIndexToInt32 (#14203)

[TIR][Pass] ForceNarrowIndexToInt32

This PR introduces a pass which forces every index expression in a
PrimFunc to have dtype int32. Meanwhile, it also checks if all integer
buffers in the PrimFunc have int32 dtype, and reports an error if some
integer buffer has dtype other than int32.

In terms of implementation, this pass leverages the
IndexDataTypeNormalizer, with the target dtype being int32.

This PR contains a few basic tests that come from
`test_tir_transform_narrow_datatype.py`, and contains some negative
tests as well.

* [Unity][Frontend] FX translator support torch.baddbmm (#14202)

This PR brings the support of translating `torch.baddbmm` into
combination of operators (matmul, add, multiply). Unit tests
are provided accordingly.
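
The decomposition follows from the definition of `torch.baddbmm`, which computes `beta * input + alpha * (batch1 @ batch2)`. The sketch below spells this out on nested Python lists; `bmm` and `baddbmm` here are illustrative helpers, and broadcasting of `inp` is deliberately left out as a simplification.

```python
def bmm(a, b):
    """Batched matrix multiply on nested lists of shape [B][N][K] and [B][K][M]."""
    return [
        [[sum(x * y for x, y in zip(row, col)) for col in zip(*mat_b)] for row in mat_a]
        for mat_a, mat_b in zip(a, b)
    ]


def baddbmm(inp, batch1, batch2, beta=1.0, alpha=1.0):
    """beta * inp + alpha * (batch1 @ batch2), written as matmul, multiply, add."""
    product = bmm(batch1, batch2)  # matmul
    return [
        [[beta * i + alpha * p for i, p in zip(row_i, row_p)]  # multiply, then add
         for row_i, row_p in zip(mat_i, mat_p)]
        for mat_i, mat_p in zip(inp, product)
    ]
```

Each step of the nested-list version corresponds to one of the three operators named above.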

This PR also fixes the kwarg fetching issue of `torch.interpolate`.

* [CI] Point cpu ci to dep with onnx (#40)

Point cpu ci to dep with onnx

* [Unity] Introduce Default GPU Schedule Pass (#14182)

* Implement default schedule.

* Add test.

* Add tests.

* Fix linting.

* Skip scheduled blocks.

* Address issues.

* Use target current.

* Minor fixes.

* Remove Mutator.

* Move pas…
mikeseven pushed a commit to mikeseven/tvm that referenced this pull request Sep 27, 2023
Added make_custom_op_from_attrs_str function.

Approved-by: Ashok Sudarsanam