[RFC][TIR] TIR Non-scalar Constants #22
Conversation
* added the markdown * added a commit msg header Change-Id: I0fb3e6b97242ba219c157c9abe5184f14a9f8eff
* adding the PR number Change-Id: Ia84e39506934919c25f8265fc4f6d3bc6f3a5140
thanks @manupa-arm, this proposal looks good. i mainly would like to ask for clarifications; if they're made while i'm out and @jroesch @mbs-octoml @tqchen approve this, feel free to merge without my approval
> ## Winograd Constants
>
> The Winograd transformation (used for fast GEMMs) involves multiplication by a hard-coded constant tensor. This is currently accomplished in TE using a complicated TE compute expression with many nested selects. Being able to directly express a constant tensor here would significantly simplify this code.
The constants in Winograd convolutions are a relatively small matrix, which is always unrolled in TIR scheduling and inlined in the generated binary, and thus won't incur any extra storage the way relay.Constant does. So my question is: how does this RFC handle inlining such constants into the corresponding unrolled operations?
As a concrete example, suppose we have the code snippet below:

    tir.constant(c, [1, 2, 3])  # a temporary syntax for declaring constants
    for i in tir.unroll(3):
        a[i] = b[i] * c[i]

Does this RFC consider a pass that inlines these constants into the code and transforms it into the following TIR?

    a[0] = b[0] * 1
    a[1] = b[1] * 2
    a[2] = b[2] * 3
That's a good question @junrushao1994 .
I think this becomes useful with full unrolling (which is only useful for small matrices, as you correctly point out), where variable indices become constant indices and the accesses could be replaced with immediate values. I think we are not planning to improve this just yet, but it is something worth revisiting later, depending on observations -- especially after observing how downstream compilers perceive/optimize constant propagation in such non-aliased array accesses.
cc: @d-smirnov
Thanks @manupa-arm! This is the particular case I had in mind where we didn't do well with TE: we use a lot of if-then-elses to represent a Winograd array :-) Therefore, if I understand correctly, this is not going to be addressed within the scope of this RFC. Is that accurate? Thanks in advance!
My second thought: if we are able to identify if a buffer is constant in tir.Load, then we can actually simplify the tir.Load into its constant value - which IMO perfectly resolves the question I asked above. What do you think?
Yes, agreed.
I don't disagree with the suggestion; it is just that I think this could be a follow-up, with the introduction of a pass to change from:

    a[0] = b[0] * c[0]
    a[1] = b[1] * c[1]
    a[2] = b[2] * c[2]

to

    a[0] = b[0] * 1
    a[1] = b[1] * 2
    a[2] = b[2] * 3

I think we'd want to evaluate the benefit of implementing that after we land the tir.allocate_const support proposed in this RFC.
Does that sound good?
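Purely as an illustration of the kind of follow-up pass being discussed here (it is not proposed by this RFC, and the helper `find_const_data` is hypothetical), a sketch using TVM's Python pass infrastructure could look roughly like this:

```python
# Sketch only: fold loads from a constant buffer at constant indices into
# immediates after full unrolling. `find_const_data` is a hypothetical helper
# that maps a constant buffer's data Var to its backing NDArray; a 1-D buffer
# is assumed for brevity.
import tvm
from tvm import tir
from tvm.tir import stmt_functor

@tvm.tir.transform.prim_func_pass(opt_level=0)
def FoldConstantBufferLoads(func, mod, ctx):
    const_data = find_const_data(func)  # hypothetical: Var -> NDArray

    def rewrite(op):
        if isinstance(op, tir.BufferLoad) and op.buffer.data in const_data:
            idx = op.indices[0]
            if isinstance(idx, tir.IntImm):  # only fold fully-constant indices
                value = const_data[op.buffer.data].numpy()[idx.value]
                return tir.const(value, op.dtype)
        return None  # leave every other node untouched

    new_body = stmt_functor.ir_transform(func.body, None, rewrite, ["tir.BufferLoad"])
    return func.with_body(new_body)
```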
> ## Weight compression
>
> When lowering for accelerators (e.g. [Arm(R) Ethos(TM)-U NPU](https://github.com/apache/tvm-rfcs/pull/11)), certain operations will need to be tiled to co-optimize performance and memory utilization. Such tiling patterns create slices of weights, needing compression, that end up with varying sizes. Therefore, the knowledge that some tir::Vars refer to constants is critical at the TIR level to perform this.
Hey, I have a question here. I wonder whether the values inside these constant weights actually matter, or whether what matters is only their shapes/dtypes. If we only care about the shapes/dtypes, we don't need to link the actual values into TIR, is that correct? Thanks!
@junrushao1994 , the values do matter: with compression, the entropy of the constant data determines the size post-compression.
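As a generic illustration of that point (this uses plain zlib, not the actual weight compression scheme used for the NPU): two buffers with identical shape and dtype can compress to very different sizes depending on their values.

```python
# Two int8 buffers of the same shape/dtype; only their contents differ.
import zlib
import numpy as np

zeros = np.zeros(1024, dtype=np.int8).tobytes()
noise = np.random.randint(-128, 128, size=1024, dtype=np.int8).tobytes()

print(len(zlib.compress(zeros)))  # low-entropy data: compresses to a few bytes
print(len(zlib.compress(noise)))  # high-entropy data: stays close to 1024 bytes
```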
That makes sense to me. Thanks for the explanation!
What I am thinking of, to be consistent with our unified IR design, is: if we have a section in the IRModule attributes (which @electriclilies has done plenty of work on recently) in which those parameters are linked, then we could refer to these parameters via IRModule attributes, and in the meantime Relay and TIR could share the same mechanism.
Doesn't Relay use relay.Constant() ?
Are you suggesting that we just use the TIR node while actually placing the constant NDArray in the attribute section?
> ## Memory Planning
>
> The TIR program has the ability to express both inter- and intra-operator memory requirements, post-scheduling, as explained further by the [Unified Static Memory Planning RFC](https://github.com/apache/tvm-rfcs/pull/9). It would be better if the constants could be embedded in the TIR PrimFunc. Moreover, this allows various [target-dependent lowerings](https://github.com/apache/tvm-rfcs/pull/10) to produce TIR PrimFuncs with constants in them.
I am sorry, but I am not sure I understand this claim correctly. Do we need to know the actual values inside the constants to do memory planning? My understanding is that if we know the shapes/dtypes, and the addresses at which the constants are stored in memory, then we can do all sorts of memory planning and prefetching. Is that correct? Thanks a lot!
As I've replied before, the values determine the size in the case of compression.
Moreover, knowing the values of the constants (as opposed to holding a de-referencing map to get the values) will be beneficial for writing lowering passes with maximum access to knowledge of the program/operator.
May I ask what the benefit is of hiding the constant values from TIR passes?
Oh, I am not thinking of hiding the content of constant tensors, but of putting them somewhere else, e.g. the IRModule's attributes, to mimic the behavior where we put those data in the .rodata section of an artifact. The benefit of doing this is that it is more consistent with our unified IR approach, and this mechanism could be shared between Relay and TIR.
The reason I was asking whether we care about the values of those constants is that I was trying to figure out what works best for our human-readable TIR printer/parser. If we directly print a chunk of data on screen, it might be hard to read.
Let me know what you think :-)
Doesn't Relay use show_meta=off for this exact problem with relay.Constant?
I don't see a printing problem associated with the introduction of the node.
Have a look at the PR: https://github.com/apache/tvm/pull/8472/files
The reason we prefer not to store them in the IRModule is that our passes will perform optimizations (such as slicing weights using compute_at and performing compression again), and all such passes would have to go through an indirection. That feels unnecessary if we had a leaf node to represent constants. Moreover, given such passes, not being able to represent non-scalar constants in TIR feels a bit limiting. I personally don't feel like putting such an intimate construct, one used directly with the compute, into an 'attribute'.
Let me know if you have further concerns here.
ok so i think here we are printing the NDArray contents to TVMScript but not to repr()--does that match your understanding @manupa-arm ? i'm generally ok with this PR as long as we don't cause all TIR printing to be consumed by constant data.
> ```
> param = tir.allocate_const([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], "int32", [10])
> ```
>
> This follows closely the semantics of tir.allocate, the difference being that it represents a buffer filled with constants.
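As a usage sketch in the same temporary syntax (purely illustrative; the final TVMScript spelling may differ), the resulting buffer is read like any other allocated buffer:

```
# illustrative pseudo-syntax only, mirroring the snippet quoted above
param = tir.allocate_const([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], "int32", [10])
for i in tir.serial(0, 10):
    b[i] = a[i] * param[i]  # reads from the constant buffer, as with tir.allocate'd buffers
```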
Perhaps it is worthwhile to discuss a bit more about the semantics :-)
Will the constants be allocated on stack or on heap? Is this designed for small matrices (e.g. the small matrix in winograd), or relatively larger matrices (e.g. the weight that needs prefetching)? How will lowering and code generation be affected? Does it work for GPU and other devices? How does it affect linkers' job?
I don't think where constants are allocated needs to be decided here based on how we decide to represent constants in TIR.
The proposal here is to represent constants directly in the TIR PrimFunc, so that they can be lowered in whatever way each target wants (if it supports it).

> Will the constants be allocated on stack or on heap?

So currently, where we want to 'link' in the parameters, they will be generated as part of the runtime.Module and linked via the executor: https://github.com/apache/tvm/pull/6917/files.
For CPU targets, they will go to a .rodata section, where the constants are held to keep compatibility with the linked-param code generation that exists today.
I am not sure we want them on the stack or the heap; however, we might want them in different sections if the system has multiple non-volatile memories.

> Is this designed for small matrices (e.g. the small matrix in winograd), or relatively larger matrices (e.g. the weight that needs prefetching)?

This is size agnostic, therefore I'd not expect a difference here.

> How will lowering and code generation be affected?

This will only be supported (at least initially, as agreed with @tqchen) for targets that support code generation for constants (currently it uses the link-params target argument). Therefore, if a target has this capability (and it is enabled in a given compilation flow), we go with the assumption that the target knows how to generate code for the constants to be used by the operator.

> Does it work for GPU and other devices?

I don't think GPU is a target that (currently) supports code generation for constants; therefore the constants will live in the tvm_main function (as relay.Constants).

> How does it affect linkers' job?

There are mainly two ways the linkers' job could be affected, AFAIK.
1.) If code generation for constants is supported by the respective target, we'll assume the code will be generated with appropriate sections (if it is C-like) or consumed by any other artifact that expects the constants to be embedded. If the target supports neither, then it is not a target that requires constants 'link'ed into the TIR.
2.) If the USMP is invoked, it will pool all the constants, pull them out of tvm_main, and expose them to the application layer.
For basic and most use cases, the constant pools will be generated in the metadata module (see U1 of https://github.com/apache/tvm-rfcs/blob/c520a3912279bcd3c432dafd4070661f614882cf/rfcs/0009_Unified_Static_Memory_Planning.md).
For certain use cases (see U3 of https://github.com/apache/tvm-rfcs/blob/c520a3912279bcd3c432dafd4070661f614882cf/rfcs/0009_Unified_Static_Memory_Planning.md), this would be where the user writes the application.
@junrushao1994 does this answer your questions?
it does seem like we could specify placement of the parameters in terms of memory pool/scope in the tir.allocate_const node if we wanted to, though. the main use case i could see for this is marking weights to be loaded into e.g. accelerator memory. i do think that sort of thing is related to this proposal, but i don't think anything here stops this feature from being implemented later on. so i think we can continue with the proposal as written
Thanks for answering my questions! It is super helpful for me to understand the motivation and scope of this RFC :-)
I like the idea of having tir.allocate_constant, which I dreamed of before, and of course I am not going to be against it. I was just being really cautious when extending an IR with new nodes, because it usually means breaking most of the passes (which usually don't handle unknown node types), and this might impact the overall TVM stack, as well as the compilation workflows for various different architectures and use cases.
Here are my thoughts (I'm sure I am limited in certain aspects, so please feel free to disagree and propose alternatives):
A. To be consistent with the unified IR effort, what about using the IRModule-level attributes that @electriclilies has been pushing for recently, and putting those parameters in as IRModule attributes? They could be directly translated into the .rodata section in LLVM-based codegen, which both the CPU target and the CUDA host target use. We can do some small refactors to make both Relay- and TIR-level tensor constants use IRModule attributes in a unified approach.
B. tir.allocate_constant needs proper support by the TVM script parser and printer. Otherwise, the roundtrippable workflow will be broken after this node is introduced. In this case, we might need to think about supporting parsing tensors (or the meta info of tensors if the actual parameters are not linked in yet).
C. On architectures other than CPU, like GPUs, my quick thought is: at least we can lower this into a call that copies a chunk of data onto the GPU using TVM's C API. Let me know if it makes sense.
D. If I understand correctly, because the data is generated into the .rodata section, which is loaded directly to certain memory, it's not going to be on the stack. Also, there is probably a zero-copy mechanism so that we don't waste any extra copy -- which sounds cool to me.
E. On winograd weights, my second thought is that if we are able to identify whether a buffer is constant in tir.Load, then we can actually simplify the tir.Load into its corresponding scalar constant -- which IMO perfectly resolves the question I asked above. What do you think?
Since this is more like a summary of your thoughts let me answer below as a main comment.
Thanks @manupa-arm for the nice RFC! I left a few questions, and happy to discuss more!
@junrushao1994 please follow up on the comments here
How can we move this forward? This appears to be getting stalled.
@junrushao1994 has followed up with individual comments and it's pending my response :)
For A.) The reason is that constants represent something intimate to the compute and that requires space. Moreover, in the scheduling passes where we want to do slicing in loops, where the weights get sliced and need to undergo transformations (e.g. compression), we would need to keep reading and writing attributes throughout such passes. That seems like we are abusing attributes for something that should be expressible in TIR. I guess the main difference is that, unlike in the existing passes, the constant will be treated as a primary entity in TIR rather than as just another input. Therefore, I would argue against putting constants as attributes of the IRModule.

For B.) I believe this support is already implemented in the PR: apache/tvm#8472

For C.) I feel that if the constants do not get embedded in the compilation artifact of TVM, we should avoid 'linking' the constant into the device PrimFunc; however, the host PrimFunc could codegen it in an appropriate section and pass a pointer to the device PrimFunc if the non-volatile memory is accessible by the device. Otherwise, the host PrimFunc would need a cache_read stage to copy it to local storage before passing it on to the device PrimFunc.

For D.) By default, the data could be generated into the .rodata section; however, USMP could allow the user to place it in a different section that goes to different memory in the case where the SoC has multiple non-volatile memories. See U3 here: https://github.com/apache/tvm-rfcs/blob/c520a3912279bcd3c432dafd4070661f614882cf/rfcs/0009_Unified_Static_Memory_Planning.md

For E.) As I replied in the original discussion, I think after the initial stage we could revisit this and insert a new pass to do this optimization.

I believe I have answered all your questions. Let me know what you think.
Thanks @manupa-arm for the reply! These are all good points 👍 Again, I am supportive of this change, and the reason I am asking so many questions is that it is a change to the generic IR, which is supposed to be the most stable part of the system. As a generic concept, "constant" is supposed to be something people will assume works generically in the future. Therefore, I wanted to be serious and make sure it does work generically.

On A). As the primary author of TensorIR scheduling, I feel like reading/writing attributes seems to be the expected behavior by design of our unified IR infrastructure; I would love to learn more about why these are issues here.

On B). Thanks for supporting this in TVM script! I definitely see a path by which we can complete this feature. The reason I am asking about "parsing tensors" is that tensors can be large, and I am not sure they could fit in Python's AST (too slow or something). In Relay, the workaround is to treat tensors as metadata, so that they can be parsed in a separate file with a much faster parser. Again, I would love to hear your opinions about what the best design is to parse tensors in TVM script, which handles:

On C). I think we are on the same page, but to clarify,

On D). Thanks for the clarification! I am definitely not an expert on these SoCs. I think it's cool as long as we have control at compilation time 😄

On E). Yeah, I think we are on the same page.

To summarize the discussion, I think we have consensus on most of the points, but the final sticky point here is the unified IR infrastructure. Currently Relay has support for constant NDArrays but TIR doesn't, and I would hope to have a unified representation of such constants, because divergence is always less desirable (and TVM developers did pay the price for divergence before). It would be nice if all of us could think about this representation, as well as the text format and serialization of these constants. Here is my proposal:

Note that I am not indicating that we have to get everything implemented initially, but I wanted to lay out a path to make this feature complete and work in a generic way.
Also CC @tqchen for comments
Hi @junrushao1994 ,
We have discussed this internally, and we find that referring to NDArrays in IRModule attributes through tir.allocate_const nodes seems reasonable. I'll do a pass to modify the text in the RFC. @d-smirnov, any thoughts from you? -- I think we will have to store the NDArrays as IRModule attrs and cross-refer to them in the PRs.
If we all agree, I can go ahead and do the RFC changes while @d-smirnov does the necessary changes in the PRs.
@tqchen @jroesch @mbs-octoml could you please review this? it's been out for quite a while and it would be great to move forward. as it is a quite invasive IR change, i'd like to ensure we are all okay with this.
Thanks @manupa-arm! Glad that we had an in-depth discussion and finally reached a solution that we all agree on :-) Great work btw!
I read everything and will respond in depth in the morning after I have thought about it for a bit. I see both of the concerns here.
a couple thoughts on @junrushao1994 's comments:

> Because the scope of the RFC is more like generic constant nodes in TIR, I would say the size could be as large as weights in a NN, which can be MBs or even more. Parsing these large weights is generally super slow and thus we should support saving them in separate files.

Yeah I agree, and it can be simpler: as long as it is a packed function in the global registry, we can use
@junrushao1994 so the concern is embedding e.g. the output of other things it seems like we need to place in an external
None of these questions are addressed here, and I think we need to address them if we are adopting the IRModule metadata approach.
@areusch Thanks for the discussion! Yep, I agree with you that we need to address these points, and it's certainly doable under the current design. @manupa-arm would you like to respond to Andrew's question? Thanks a lot!
Since relay.Constants has the same need to parse in constants, it would be appreciated not to block progress on deciding on the mechanics of parsing NDArrays into an IRModule. I think we could separate out the discussion of enabling the additional pathway of getting the constants into the IRModule from the general compilation flow.
@manupa-arm i agree we don't need to care so much about how we parse and load
* adding how constants are stored in the IRModule Change-Id: Ie45b1c76e9c522595646fd3f49d5c253d7818e9b
* adding IRNode definition * further explaining how constants gets added to IRModule
@areusch @junrushao1994 I have added a section to say how constants are added to the IRModule now.

Summary: the constants will be stored in the IRModule in a "Constants" attribute as an Array<NDArray>.

Basically, if the tir.allocate_const node is created first, then the PrimFunc, and lastly it gets added to the IRModule, then IRModule::Add(...) will need to traverse the stmts, fetch the constants, store the ObjectRefs in the IRModule, and cross-update the optional index in the allocate_const node. On the other hand, if the constants are already present in the IRModule, the tir.allocate_const node could be created directly using the NDArray ObjectRef and the index of the NDArray in the IRModule.
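To make the bookkeeping concrete, here is a runnable toy model of that flow (plain Python dataclasses standing in for the real nodes; the actual logic is proposed for C++ IRModule::Add(...) and the TIR allocate_const node, so none of these names are TVM API):

```python
# Toy model only -- not TVM code. It mirrors the described flow: hoist each
# constant into the module-level "Constants" list and cross-update the node's
# optional index, or build the node directly from an existing index.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AllocateConstNode:                      # stands in for tir.allocate_const
    data: Optional[list] = None               # inline constant contents ("NDArray")
    irmod_storage_idx: Optional[int] = None   # index into the module's "Constants"

@dataclass
class Module:                                  # stands in for IRModule
    constants: List[list] = field(default_factory=list)   # the "Constants" attribute

    def add(self, alloc_consts: List[AllocateConstNode]) -> None:
        # Models IRModule::Add(...): traverse, hoist, cross-update the index.
        for node in alloc_consts:
            if node.irmod_storage_idx is None:
                self.constants.append(node.data)
                node.irmod_storage_idx = len(self.constants) - 1

mod = Module()
node = AllocateConstNode(data=[1, 2, 3])
mod.add([node])
assert node.irmod_storage_idx == 0 and mod.constants[0] == [1, 2, 3]

# Reverse direction: the constant is already in the module, so the node
# just references it by index.
node_from_idx = AllocateConstNode(irmod_storage_idx=0)
```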
* Updating drawbacks related IRModule::Add(...) changes
* Addressing nits
LGTM
> this is an optional index to indicate the index within
> "Constants" attribute, that is a Array<NDArray> of IRModule.
> */
> Optional<Integer> irmod_storage_idx;
nit: Suggest we make data and irmod_storage_idx mutually exclusive. I assume you'll want a Pass to hoist (and share?) the data into the IRModule "Constants" attribute, in which case you'll rewrite from data to irmod_storage_idx.
This is almost identical to how the Relay parser uses the 'meta' syntax and MetaTable to refer to and resolve relay.Constants, except:
- The array is keyed by 'relay.Constant'.
- The array holds Relay Constants, not NDArrays.
- The MetaTable is just to help parsing and is discarded.
This suggests we should similarly move the contents of the MetaTable into IRModule attributes (keyed by "Constants"), and allow a Relay Constant to similarly represent the NDArray immediately or via an index.
If that were already in place then I'd suggest we replace data and irmod_storage_idx with just a Relay Constant so that constants can easily make the hop from Relay to TIR. Could you capture that under the alts considered so I don't forget it?
Hi @mbs-octoml ,
Thanks for taking a look.
I've made a change to make it mutually exclusive.
I think the world we want to move to is one in which we hold NDArrays and cross-refer to them from both Relay and TIR.
Therefore, it would be a bit confusing to discuss relay.Constant in this RFC, as we would be taking a 'future' precedence on what relay.Constants should become. I'd suggest we discuss that in the future RFC in which we make relay.Constants actually work like that.
It would be great to finalize this -- so we can amend the PRs.
* making irmod_storage_idx and data mutually exclusive
@areusch a friendly ping!
@d-smirnov -- I think the design is stable (just waiting on @areusch), shall we look to update the PRs:
good with this one now! thanks @manupa-arm and apologies for the delay!
@manupa-arm can you please create an RFC tracking issue and reference this PR from there?