[RFC][TIR] TIR Non-scalar Constants #22
Conversation
* added the markdown * added a commit msg header Change-Id: I0fb3e6b97242ba219c157c9abe5184f14a9f8eff
* adding the PR number Change-Id: Ia84e39506934919c25f8265fc4f6d3bc6f3a5140
thanks @manupa-arm, this proposal looks good. i mainly would like to ask for clarifications; if they're made while i'm out and @jroesch @mbs-octoml @tqchen approve this, feel free to merge without my approval
> ## Winograd Constants
>
> The Winograd transformation (used for fast GEMMs) involves multiplication by a hard-coded constant tensor. This is currently accomplished in TE using a complicated TE compute expression with many nested selects. Being able to directly express a constant tensor here would significantly simplify this code.
The constants in Winograd convolutions are a relatively small matrix, which is always unrolled in TIR scheduling and inlined in the generated binary, and thus won't incur any extra storage the way relay.Constant does. So my question is: how does this RFC handle inlining such constants into the corresponding unrolled operations?
As a concrete example, suppose we have the code snippet below:

    tir.constant(c, [1, 2, 3])  # a temporary syntax for declaring constants
    for i in tir.unroll(3):
        a[i] = b[i] * c[i]

Does this RFC consider a pass that inlines these constants into the code and transforms it into the following TIR?

    a[0] = b[0] * 1
    a[1] = b[1] * 2
    a[2] = b[2] * 3
That's a good question @junrushao1994 .
I think this becomes useful with full unrolling (which is only useful for small matrices, as you correctly point out), where variable indices become constant indices and the accesses could be replaced with immediate values. I think we are not planning to improve this just yet, but it is something worth revisiting later, depending on observations -- especially after observing how downstream compilers perceive/optimize constant propagation in such non-aliased array accesses.
cc: @d-smirnov
Thanks @manupa-arm! This is the particular case I had in mind where we didn't do well with TE: we use a lot of if-then-elses to represent a Winograd array :-) Therefore, if I understand correctly, this is not going to be addressed within the scope of this RFC. Is that accurate? Thanks in advance!
My second thought: if we are able to identify if a buffer is constant in tir.Load, then we can actually simplify the tir.Load into its constant value - which IMO perfectly resolves the question I asked above. What do you think?
Yes, agreed.
I don't disagree with the suggestion; it is just that I think this could be a follow-up, with the introduction of a pass to change from:

    a[0] = b[0] * c[0]
    a[1] = b[1] * c[1]
    a[2] = b[2] * c[2]

to

    a[0] = b[0] * 1
    a[1] = b[1] * 2
    a[2] = b[2] * 3

I think we'd want to evaluate the benefit of implementing that after we land the tir.allocate_const support proposed in this RFC.
Does that sound good?
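Purely as an illustration of the kind of follow-up pass being discussed here (it is not proposed by this RFC, and the helper `find_const_data` is hypothetical), a sketch using TVM's Python pass infrastructure could look roughly like this:

```python
# Sketch only: fold loads from a constant buffer at constant indices into
# immediates after full unrolling. `find_const_data` is a hypothetical helper
# that maps a constant buffer's data Var to its backing NDArray; a 1-D buffer
# is assumed for brevity.
import tvm
from tvm import tir
from tvm.tir import stmt_functor

@tvm.tir.transform.prim_func_pass(opt_level=0)
def FoldConstantBufferLoads(func, mod, ctx):
    const_data = find_const_data(func)  # hypothetical: Var -> NDArray

    def rewrite(op):
        if isinstance(op, tir.BufferLoad) and op.buffer.data in const_data:
            idx = op.indices[0]
            if isinstance(idx, tir.IntImm):  # only fold fully-constant indices
                value = const_data[op.buffer.data].numpy()[idx.value]
                return tir.const(value, op.dtype)
        return None  # leave every other node untouched

    new_body = stmt_functor.ir_transform(func.body, None, rewrite, ["tir.BufferLoad"])
    return func.with_body(new_body)
```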
> ## Weight compression
>
> When lowering for accelerators (e.g. [Arm(R) Ethos(TM)-U NPU](https://github.com/apache/tvm-rfcs/pull/11)), certain operations will need to be tiled to co-optimize performance and memory utilization. Such tiling patterns create slices of weights, needing compression, that end up with varying sizes. Therefore, the knowledge that some tir::Vars refer to constants is critical at the TIR level to perform this.
Hey, I have a question here. I wonder whether the values inside these constant weights actually matter, or whether what matters is only their shapes/dtypes. If we only care about the shapes/dtypes, we don't need to link the actual values into TIR, is that correct? Thanks!
@junrushao1994 , the values do matter: with compression, the entropy of the constant data determines the size post-compression.
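As a generic illustration of that point (this uses plain zlib, not the actual weight compression scheme used for the NPU): two buffers with identical shape and dtype can compress to very different sizes depending on their values.

```python
# Two int8 buffers of the same shape/dtype; only their contents differ.
import zlib
import numpy as np

zeros = np.zeros(1024, dtype=np.int8).tobytes()
noise = np.random.randint(-128, 128, size=1024, dtype=np.int8).tobytes()

print(len(zlib.compress(zeros)))  # low-entropy data: compresses to a few bytes
print(len(zlib.compress(noise)))  # high-entropy data: stays close to 1024 bytes
```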
That makes sense to me. Thanks for the explanation!
What I am thinking of, to be consistent with our unified IR design, is: if we have a section in the IRModule attributes (which @electriclilies has done plenty of work on recently) in which those parameters are linked, then we could refer to these parameters via IRModule attributes, and in the meantime Relay and TIR could share the same mechanism.
Doesn't Relay use relay.Constant() ?
Are you suggesting that we just use the TIR node while actually placing the constant NDArray in the attribute section?
> ## Memory Planning
>
> The TIR program has the ability to express both inter- and intra-operator memory requirements, post-scheduling, as explained further by the [Unified Static Memory Planning RFC](https://github.com/apache/tvm-rfcs/pull/9). It would be better if the constants could be embedded in the TIR PrimFunc. Moreover, this allows various [target-dependent lowerings](https://github.com/apache/tvm-rfcs/pull/10) to produce TIR PrimFuncs with constants in them.
I am sorry, but I am not sure I understand this claim correctly. Do we need to know the actual values inside the constants to do memory planning? My understanding is that if we know the shapes/dtypes, and the addresses at which the constants are stored in memory, then we can do all sorts of memory planning and prefetching. Is that correct? Thanks a lot!
As I've replied before, the values determine the size in the case of compression.
Moreover, knowing the values of the constants (as opposed to holding a de-referencing map to get the values) will be beneficial for writing lowering passes with maximum access to knowledge of the program/operator.
May I ask what the benefit is of hiding the constant values from TIR passes?
Oh, I am not thinking of hiding the content of constant tensors, but of putting them somewhere else, e.g. the IRModule's attributes, to mimic the behavior where we put those data in the .rodata section of an artifact. The benefit of doing this is that it is more consistent with our unified IR approach, and this mechanism could be shared between Relay and TIR.
The reason I was asking whether we care about the values of those constants is that I was trying to figure out what works best for our human-readable TIR printer/parser. If we directly print a chunk of data on screen, it might be hard to read.
Let me know what you think :-)
Doesn't Relay use show_meta=off for this exact problem with relay.Constant?
I don't see a printing problem associated with the introduction of the node.
Have a look at the PR: https://github.com/apache/tvm/pull/8472/files
The reason we prefer not to store them in the IRModule is that our passes will perform optimizations (such as slicing weights using compute_at and performing compression again), and all such passes would have to go through an indirection. That feels unnecessary if we had a leaf node to represent constants. Moreover, given such passes, not being able to represent non-scalar constants in TIR feels a bit limiting. I personally don't feel like putting such an intimate construct, one used directly with the compute, into an 'attribute'.
Let me know if you have further concerns here.
ok so i think here we are printing the NDArray contents to TVMScript but not to repr()--does that match your understanding @manupa-arm ? i'm generally ok with this PR as long as we don't cause all TIR printing to be consumed by constant data.
> ```
> param = tir.allocate_const([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], "int32", [10])
> ```
>
> This follows closely the semantics of tir.allocate, the difference being that it represents a buffer filled with constants.
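As a usage sketch in the same temporary syntax (purely illustrative; the final TVMScript spelling may differ), the resulting buffer is read like any other allocated buffer:

```
# illustrative pseudo-syntax only, mirroring the snippet quoted above
param = tir.allocate_const([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], "int32", [10])
for i in tir.serial(0, 10):
    b[i] = a[i] * param[i]  # reads from the constant buffer, as with tir.allocate'd buffers
```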
Perhaps it is worthwhile to discuss a bit more about the semantics :-)
Will the constants be allocated on stack or on heap? Is this designed for small matrices (e.g. the small matrix in winograd), or relatively larger matrices (e.g. the weight that needs prefetching)? How will lowering and code generation be affected? Does it work for GPU and other devices? How does it affect linkers' job?
I don't think where constants are allocated needs to be decided here based on how we decide to represent constants in TIR.
The proposal here is to represent constants directly in the TIR PrimFunc, so that they can be lowered in whatever way each target wants (if it supports it).

> Will the constants be allocated on stack or on heap?

So currently, where we want to 'link' in the parameters, they will be generated as part of the runtime.Module and linked via the executor: https://github.com/apache/tvm/pull/6917/files.
For CPU targets, they will go to a .rodata section, where the constants are held to keep compatibility with the linked-param code generation that exists today.
I am not sure we want them on the stack or the heap; however, we might want them in different sections if the system has multiple non-volatile memories.

> Is this designed for small matrices (e.g. the small matrix in winograd), or relatively larger matrices (e.g. the weight that needs prefetching)?

This is size agnostic, therefore I'd not expect a difference here.

> How will lowering and code generation be affected?

This will only be supported (at least initially, as agreed with @tqchen) for targets that support code generation for constants (currently it uses the link-params target argument). Therefore, if a target has this capability (and it is enabled in a given compilation flow), we go with the assumption that the target knows how to generate code for the constants to be used by the operator.

> Does it work for GPU and other devices?

I don't think GPU is a target that (currently) supports code generation for constants; therefore the constants will live in the tvm_main function (as relay.Constants).

> How does it affect linkers' job?

There are mainly two ways the linkers' job could be affected, AFAIK.
1.) If code generation for constants is supported by the respective target, we'll assume the code will be generated with appropriate sections (if it is C-like) or consumed by any other artifact that expects the constants to be embedded. If the target supports neither, then it is not a target that requires constants 'link'ed into the TIR.
2.) If the USMP is invoked, it will pool all the constants, pull them out of tvm_main, and expose them to the application layer.
For basic and most use cases, the constant pools will be generated in the metadata module (see U1 of https://github.com/apache/tvm-rfcs/blob/c520a3912279bcd3c432dafd4070661f614882cf/rfcs/0009_Unified_Static_Memory_Planning.md).
For certain use cases (see U3 of https://github.com/apache/tvm-rfcs/blob/c520a3912279bcd3c432dafd4070661f614882cf/rfcs/0009_Unified_Static_Memory_Planning.md), this would be where the user writes the application.
@junrushao1994 does this answer your questions?
it does seem like we could specify placement of the parameters in terms of memory pool/scope in the tir.allocate_const node if we wanted to, though. the main use case i could see for this is marking weights to be loaded into e.g. accelerator memory. i do think that sort of thing is related to this proposal, but i don't think anything here stops this feature from being implemented later on. so i think we can continue with the proposal as written
Thanks for answering my questions! It is super helpful for me to understand the motivation and scope of this RFC :-)
I like the idea of having tir.allocate_constant, which I dreamed of before, and of course I am not going to be against it. I was just being really cautious when extending an IR with new nodes, because it usually means breaking most of the passes (which usually don't handle unknown node types), and this might impact the overall TVM stack, as well as the compilation workflows for various different architectures and use cases.
Here are my thoughts (I'm sure I am limited in certain aspects, so please feel free to disagree and propose alternatives):
A. To be consistent with the unified IR effort, what about using the IRModule-level attributes that @electriclilies has been pushing for recently, and putting those parameters in as IRModule attributes? They could be directly translated into the .rodata section in LLVM-based codegen, which both the CPU target and the CUDA host target use. We can do some small refactors to make both Relay- and TIR-level tensor constants use IRModule attributes in a unified approach.
B. tir.allocate_constant needs proper support by the TVM script parser and printer. Otherwise, the roundtrippable workflow will be broken after this node is introduced. In this case, we might need to think about supporting parsing tensors (or the meta info of tensors if the actual parameters are not linked in yet).
C. On architectures other than CPU, like GPUs, my quick thought is: at least we can lower this into a call that copies a chunk of data onto the GPU using TVM's C API. Let me know if it makes sense.
D. If I understand correctly, because the data is generated into the .rodata section, which is loaded directly to certain memory, it's not going to be on the stack. Also, there is probably a zero-copy mechanism so that we don't waste any extra copy -- which sounds cool to me.
E. On winograd weights, my second thought is that if we are able to identify whether a buffer is constant in tir.Load, then we can actually simplify the tir.Load into its corresponding scalar constant -- which IMO perfectly resolves the question I asked above. What do you think?
Since this is more like a summary of your thoughts let me answer below as a main comment.
Thanks @manupa-arm for the nice RFC! I left a few questions, and happy to discuss more!
@junrushao1994 please follow up on the comments here
How can we move this forward? This appears to be getting stalled.
@junrushao1994 has followed up with individual comments and it's pending my response :)
For A.) The reason is that constants represent something intimate to the compute and that requires space. Moreover, in the scheduling passes where we want to do slicing in loops, where the weights get sliced and need to undergo transformations (e.g. compression), we would need to keep reading and writing attributes throughout such passes. That seems like we are abusing attributes for something that should be expressible in TIR. I guess the main difference is that, unlike in the existing passes, the constant will be treated as a primary entity in TIR rather than as just another input. Therefore, I would argue against putting constants as attributes of the IRModule.

For B.) I believe this support is already implemented in the PR: apache/tvm#8472

For C.) I feel that if the constants do not get embedded in the compilation artifact of TVM, we should avoid 'linking' the constant into the device PrimFunc; however, the host PrimFunc could codegen it in an appropriate section and pass a pointer to the device PrimFunc if the non-volatile memory is accessible by the device. Otherwise, the host PrimFunc would need a cache_read stage to copy it to local storage before passing it on to the device PrimFunc.

For D.) By default, the data could be generated into the .rodata section; however, USMP could allow the user to place it in a different section that goes to different memory in the case where the SoC has multiple non-volatile memories. See U3 here: https://github.com/apache/tvm-rfcs/blob/c520a3912279bcd3c432dafd4070661f614882cf/rfcs/0009_Unified_Static_Memory_Planning.md

For E.) As I replied in the original discussion, I think after the initial stage we could revisit this and insert a new pass to do this optimization.

I believe I have answered all your questions. Let me know what you think.
Thanks @manupa-arm for the reply! These are all good points 👍 Again, I am supportive of this change, and the reason I am asking so many questions is that it is a change to the generic IR, which is supposed to be the most stable part of the system. As a generic concept, "constant" is supposed to be something people will assume works generically in the future. Therefore, I wanted to be serious and make sure it does work generically.

On A). As the primary author of TensorIR scheduling, I feel like reading/writing attributes seems to be the expected behavior by design of our unified IR infrastructure; I would love to learn more about why these are issues here.

On B). Thanks for supporting this in TVM script! I definitely see a path by which we can complete this feature. The reason I am asking about "parsing tensors" is that tensors can be large, and I am not sure they could fit in Python's AST (too slow or something). In Relay, the workaround is to treat tensors as metadata, so that they can be parsed in a separate file with a much faster parser. Again, I would love to hear your opinions about what the best design is to parse tensors in TVM script, which handles:

On C). I think we are on the same page, but to clarify,

On D). Thanks for the clarification! I am definitely not an expert on these SoCs. I think it's cool as long as we have control at compilation time 😄

On E). Yeah, I think we are on the same page.

To summarize the discussion, I think we have consensus on most of the points, but the final sticky point here is the unified IR infrastructure. Currently Relay has support for constant NDArrays but TIR doesn't, and I would hope to have a unified representation of such constants, because divergence is always less desirable (and TVM developers did pay the price for divergence before). It would be nice if all of us could think about this representation, as well as the text format and serialization of these constants. Here is my proposal:

Note that I am not indicating that we have to get everything implemented initially, but I wanted to lay out a path to make this feature complete and work in a generic way.
Also CC @tqchen for comments
Hi @junrushao1994 ,
We have discussed this internally, and we find that referring to NDArrays in IRModule attributes through tir.allocate_const nodes seems reasonable. I'll do a pass to modify the text in the RFC. @d-smirnov, any thoughts from you? -- I think we will have to store the NDArrays as IRModule attrs and cross-refer to them in the PRs.
If we all agree, I can go ahead and do the RFC changes while @d-smirnov does the necessary changes in the PRs.
@tqchen @jroesch @mbs-octoml could you please review this? it's been out for quite a while and it would be great to move forward. as it is a quite invasive IR change, i'd like to ensure we are all okay with this.
Thanks @manupa-arm! Glad that we had an in-depth discussion and finally reached a solution that we all agree on :-) Great work btw!
I read everything and will respond in depth in the morning after I have thought about it for a bit. I see both of the concerns here.
a couple thoughts on @junrushao1994 's comments:

> Because the scope of the RFC is more like generic constant nodes in TIR, I would say the size could be as large as weights in a NN, which can be MBs or even more. Parsing these large weights is generally super slow and thus we should support saving them in separate files.

Yeah I agree, and it can be simpler: as long as it is a packed function in the global registry, we can use
@junrushao1994 so the concern is embedding e.g. the output of other things it seems like we need to place in an external
None of these questions are addressed here, and I think we need to address them if we are adopting the IRModule metadata approach.
@areusch Thanks for the discussion! Yep, I agree with you that we need to address these points, and it's certainly doable under the current design. @manupa-arm would you like to respond to Andrew's question? Thanks a lot!
Since relay.Constants has the same need to parse in constants, it would be appreciated not to block progress on deciding on the mechanics of parsing NDArrays into an IRModule. I think we could separate out the discussion of enabling the additional pathway of getting the constants into the IRModule from the general compilation flow.
@manupa-arm i agree we don't need to care so much about how we parse and load
* adding how constants are stored in the IRModule Change-Id: Ie45b1c76e9c522595646fd3f49d5c253d7818e9b
* adding IRNode definition * further explaining how constants gets added to IRModule
@areusch @junrushao1994 I have added a section to say how constants are added to the IRModule now.

Summary: the constants will be stored in the IRModule in a "Constants" attribute as an Array<NDArray>.

Basically, if the tir.allocate_const node is created first, then the PrimFunc, and lastly it gets added to the IRModule, then IRModule::Add(...) will need to traverse the stmts, fetch the constants, store the ObjectRefs in the IRModule, and cross-update the optional index in the allocate_const node. On the other hand, if the constants are already present in the IRModule, the tir.allocate_const node could be created directly using the NDArray ObjectRef and the index of the NDArray in the IRModule.
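To make the bookkeeping concrete, here is a runnable toy model of that flow (plain Python dataclasses standing in for the real nodes; the actual logic is proposed for C++ IRModule::Add(...) and the TIR allocate_const node, so none of these names are TVM API):

```python
# Toy model only -- not TVM code. It mirrors the described flow: hoist each
# constant into the module-level "Constants" list and cross-update the node's
# optional index, or build the node directly from an existing index.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AllocateConstNode:                      # stands in for tir.allocate_const
    data: Optional[list] = None               # inline constant contents ("NDArray")
    irmod_storage_idx: Optional[int] = None   # index into the module's "Constants"

@dataclass
class Module:                                  # stands in for IRModule
    constants: List[list] = field(default_factory=list)   # the "Constants" attribute

    def add(self, alloc_consts: List[AllocateConstNode]) -> None:
        # Models IRModule::Add(...): traverse, hoist, cross-update the index.
        for node in alloc_consts:
            if node.irmod_storage_idx is None:
                self.constants.append(node.data)
                node.irmod_storage_idx = len(self.constants) - 1

mod = Module()
node = AllocateConstNode(data=[1, 2, 3])
mod.add([node])
assert node.irmod_storage_idx == 0 and mod.constants[0] == [1, 2, 3]

# Reverse direction: the constant is already in the module, so the node
# just references it by index.
node_from_idx = AllocateConstNode(irmod_storage_idx=0)
```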
* Updating drawbacks related IRModule::Add(...) changes
* Addressing nits
LGTM
> this is an optional index to indicate the index within
> "Constants" attribute, that is a Array<NDArray> of IRModule.
> */
> Optional<Integer> irmod_storage_idx;
nit: Suggest we make data and irmod_storage_idx mutually exclusive. I assume you'll want a Pass to hoist (and share?) the data into the IRModule "Constants" attribute, in which case you'll rewrite from data to irmod_storage_idx.
This is almost identical to how the Relay parser uses the 'meta' syntax and MetaTable to refer to and resolve relay.Constants, except:
- The array is keyed by 'relay.Constant'.
- The array holds Relay Constants, not NDArrays.
- The MetaTable is just to help parsing and is discarded.
This suggests we should similarly move the contents of the MetaTable into IRModule attributes (keyed by "Constants"), and allow a Relay Constant to similarly represent the NDArray immediately or via an index.
If that were already in place then I'd suggest we replace data and irmod_storage_idx with just a Relay Constant so that constants can easily make the hop from Relay to TIR. Could you capture that under the alts considered so I don't forget it?
Hi @mbs-octoml ,
Thanks for taking a look.
I've made a change to make it mutually exclusive.
I think the world we want to move to is one in which we hold NDArrays and cross-refer to them from both Relay and TIR.
Therefore, it would be a bit confusing to discuss relay.Constant in this RFC, as we would be taking a 'future' precedence on what relay.Constants should become. I'd suggest we discuss that in the future RFC in which we make relay.Constants actually work like that.
It would be great to finalize this -- so we can amend the PRs.
* making irmod_storage_idx and data mutually exclusive
@areusch a friendly ping!
@d-smirnov -- I think the design is stable (just waiting on @areusch), shall we look to update the PRs:
good with this one now! thanks @manupa-arm and apologies for the delay!
@manupa-arm can you please create an RFC tracking issue and reference this PR from there?