[Bug] BuiltinLower has hard-coded heuristic for alloca which is not appropriate for all kDLCPU target devices #9022
@mbs-octoml To give a bit of context: in the case of CPU, we want to preserve small alloca until the code generation point, and the code generator will then emit the stack alloca explicitly. Only when the memory is big enough (bigger than a constant threshold) do we use an opaque allocation instead. Stack allocation is important for the performance of CPU code. In the case of TVM, we do not have an explicit concept of registers in most cases. Instead, we rely on LLVM's mem2reg pass to take constant-indexed accesses to a stack allocation and turn them into registers, so the code can run efficiently. So removing this code path can complicate the code-generator-side optimization by quite a bit and slow down the CPU code. Of course this can be a target-specific thing. LowerTVMBuiltin right now assumes it only runs on host (CPU) code.
So the rationales for the specific CPU-side logic:
My guess is we need to look into why the VM cannot work with code that allocates on the stack in the multiple-target case.
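To make the mem2reg point above concrete, here is a minimal standalone C++ illustration (not TVM-generated code; the function name and sizes are just for the example): a small, constant-sized scratch buffer that stays on the stack can typically be promoted to registers by LLVM (SROA/mem2reg after unrolling), whereas a buffer obtained through an opaque runtime allocation call cannot.

```cpp
#include <cstdio>

// Small fixed-size scratch buffer with only constant-bound indexing: after the
// trip-count-4 loop is unrolled, LLVM's SROA/mem2reg can usually promote the
// four slots to registers, so no stack traffic remains in the optimized code.
// If `acc` instead came from an opaque allocator call, this would be impossible.
static float dot4_stack(const float* a, const float* b) {
  float acc[4];
  for (int i = 0; i < 4; ++i) acc[i] = a[i] * b[i];
  return acc[0] + acc[1] + acc[2] + acc[3];
}

int main() {
  float a[4] = {1.f, 2.f, 3.f, 4.f};
  float b[4] = {5.f, 6.f, 7.f, 8.f};
  printf("%f\n", dot4_stack(a, b));
  return 0;
}
```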
Thanks so much for the context. I'll try to capture that in a comment.
Right, this then gets to the target-dependent generation regime, where a TargetKind attribute is indeed the right solution. We should also send a PR to add comments to that code block so we have more context in the future.
This is not specific to the Arm(R) Ethos(TM)-U codegen; it is generally applicable to any microcontroller where we want to avoid allocating memory on the stack and instead service allocations via the platform abstraction, handled via TVMBackendAllocWorkspace --> TVMPlatformAllocate. This only showed up in the Arm(R) Ethos(TM)-U codegen because we use TVMPlatformAllocate to allocate memory from a buffer placed in a memory that is accessible by both the CPU and the NPU. That is what makes this a functional bug.
The correct way to represent this seems to be using tir.allocate with storage_scope="local" for device=CPU to go onto the stack. For targets that need this behavior, there should be an explicit pass to convert the allocates to 'local' so they are placed on the stack, rather than assuming this as the default behaviour.
Clearly, this definition of CPU leaves out micros.
This certainly sounds like we could use an optimization pass to convert the tir.allocate's storage_scope for targets that require this, rather than making that the default behaviour for tir.allocates with "global" storage scope.
Our proposal is to add a check to that loop for whether the allocate has 'local' storage_scope before we place it on the stack, as that is the solution that works for the wider definition of CPU, rather than performing a hidden optimization in the codegen that is applicable only to a subset of CPUs.
@manupa-arm In CPU we do not necessarily differentiate local from global for now, as they come from the same namespace. I can understand the need from the micro side, and I believe this can be resolved by making TVMLowerBuiltin target dependent and querying the max stack alloca property from the target. We can set the value to zero for micro-related targets.
I feel that is a workaround for an optimization that a certain set of CPUs requires. Moreover, TVMBAW provides an abstraction to use an Arena that could be backed by any memory for non-micro cases. Therefore it feels more like the general solution we want, unless a specific target/option requires placement on the stack. Landing the above change, which provides an optimization for certain CPUs in certain cases while creating functional problems for other backends, does not seem like the right thing to do. I'd rather revert the change for now and land the correct one that works for the general case.
Thanks for the discussions. Before we suggest a resolution, it would be helpful to first go over the semantics involved.

Semantics of Allocate and storage_scope

Allocate at the moment does not specify the way to allocate the memory. In practice, we see two different kinds of needs:

N0 is needed to get the best performing kernel, since a native way of allocation (e.g. the stack) can be optimized further by the code generator.
N1 is needed for "driving" the set of computations.

One possible approach is to try to ask the user to differentiate these two kinds of allocation. In specific specialized target devices, we also see an additional need:

N2: memory that needs to be accessible from more than one device. For example, a unified device pinned memory that is accessible from both NPU and CPU. Right now N2 can be covered by introducing a special memory tag -- "global.npu_unified".

Possible Resolutions

Based on the discussions so far, there are two suggested resolutions.
So in the above post I tried to summarize the state. Now let me try to share some of my thoughts based on the summary. First of all, R0 and R1 are not that different in nature. Both try to introduce two separate scopes that bring different behavior. The main question boils down to how we name the "global" scope. Per the allocate semantics, we treat "global" as normal CPU memory, which can come from the stack or a platform-specific allocation. The system can choose the best way of doing such lowering. Always lowering to TVMBAW is indeed more general for the need of N1. However, the need N0 would favor stack allocation when possible. Note that we will likely need a related behavior for micro devices as well when generating operator kernels. While it is OK to differentiate stack-allocated memory from a platform-specific one, doing so would bring additional burden to the user and would require a significant refactor of the operator implementations. The main requests so far come from the need N1. In that case, it would be easy for the AOT generator to allocate memory with special tags ("global.workspace") that enforce workspace allocation, since in this setting there is a single expected behavior. So my suggestion would be R1+R2, as it helps to resolve the need in a way that is compatible with the current semantics and use cases. It will also open doors for more future scope-dependent optimizations.
Thanks @tqchen for summarizing the ideas and presenting possible resolutions. The two needs seem very valid. For N0, the operators should really be tagged with the 'local' storage scope, as those allocations are quite local to the operator PrimFunc and they benefit from further optimizations within and beyond TVM -- i.e., the follow-up C compiler / LLVM. For N1, we could use the 'global' tag to give the application/runtime layer responsibility for servicing the allocation. Therefore, the actual fix should have been tagging the allocates that are expected to be optimized as 'local' to the PrimFunc, rather than treating 'global' allocates on CPU as local.
I feel we are incorrectly tagging the storage_scopes here. They should really be 'local' for this specific use case.
In solution R1, I still see that as a workaround for the incorrect treatment of 'global' scoped memories, where we create an override of an actual 'global' with what we declare as 'global.workspace'. In shared-memory SoCs, it would lead to an un-scalable explosion of tags if we want to keep tagging memories with the devices that have access to them. I would think we'd want to treat memories and devices as having a many-to-many relationship. The lowering we had until a few days back was general enough that allocations were serviced by a runtime/application layer routine, and that was more aligned with what we call 'global' (with respect to codegen) scoped storage.
Can you explain what you define as 'normal CPU memory'? A CPU can technically have access to many memories.
It would have been nice to have an RFC (apologies in advance if I missed it, if there was one already) to discuss before we moved away from TVMBAW-style allocation, which I find more generic than just stack allocations. It almost feels like the schedules should have tagged them 'local' if this was the expectation, rather than assuming a combined logic: 'global' and CPU.
Wouldn't it be simpler to tag allocations for N0 as 'local' and for N1 as 'global'?
It's a memory that multiple Targets have access to, which the runtime/application could provide via TVMBAW with a specialized global.<pool_name>.
Hmmmm, this argument seems counter-intuitive to me. I think we should assume 'global' to be accessible unless it is explicitly specified to be restricted, i.e. global.<pool_name>. Otherwise, the terminology is confusing.
Ideally, the allocates destined to end up on the stack should have been 'local'. Moreover, at that point we can decide not to tag allocates that exceed the target-specific attribute, rather than dealing with this in the codegen or the lower_tvm_builtin pass. I would propose R4: R0 + an optional pass to tag memories with 'local' -- I think that is cleaner. At that point, we should only tag them according to the target-specific max stack alloca size attribute. That could work until we fix the schedules that want allocations on the stack to use the 'local' storage scope for need N0 -- that is a conscious decision the schedule writer / autoscheduler takes.
Thanks @manupa-arm. Trying to capture some of the discussions.
The main reason I would suggest R1 is that it is a backward-compatible change (no changes to TOPI and other libraries needed). Additionally, it reduces the users' overall mental cost of thinking about and choosing another kind of memory. Note that introducing an explicit tag for TVMBAW via "global.workspace" won't necessarily mean we need to introduce tags for all kinds of memory. It does place constraints on how this kind of memory can be allocated. So things can work out of the box.
Thanks @tqchen. In R4, I was not suggesting changing TOPI, but saying we could just add a pass to change the storage scope. What are your thoughts about making the pass set the storage_scope to 'global.stack'? This should be backward compatible, as it just performs an explicit specialization transparently in the IR. We could even do the max alloca size check in this particular pass. This makes it much clearer and more specific. Moreover, global.stack will not require further specialization, unlike global.workspace, where we would want to choose between multiple workspace buffers. The reason being that I'd like to keep the tag for distinguishing between global memories that the compute devices have access to.
plus @csullivan who also needs finer-grained scopes
Thanks @manupa-arm. I understand that proposal R4 can also work by having a pass to convert "global" to something more specialized (essentially R1 and R4 are not that different except for different choices of scope names). The main question is what the semantics around the scope "global" are. Each memory scope represents a "constraint" on what kind of memory it is. Right now, when the device type is CPU, "global" means any memory that can be accessed by the host CPU. This means the actual implementation can come from TVMBAW, memory from the stack, or memory allocated by other means. While memory allocated by TVMBAW can have other benefits (e.g. being accessible by other devices because it is pinned), that is not the constraint specified by the "global" scope. We can of course further constrain the setting to, say, "global.workspace", which reduces the possible ways to allocate the memory but still does not preclude choosing between multiple workspace buffers.

So from the semantics point of view, the pass can indeed choose to return "global" or rewrite to "global.stack" to ensure it is a stack allocation. But if the scope remains "global", we should not preclude the possibility of downstream stages allocating from the stack (the code generator should be able to choose any kind that satisfies the constraint). To say it another way, we cannot say that "global" definitely means no stack allocation. If the code needs to impose an additional constraint that the memory must be accessible from a separate device (e.g. the NPU), it certainly requires a more specialized constraint that is better spelled out explicitly. As we can see, this is another kind of flexibility we want to enable here -- the flexibility of picking possible backend allocation implementations without over-constraining the code generator to a backend-specific behavior that is platform dependent (like the case of pinned memory).
@mbs-octoml The PR #8366 I did might answer your question. The storage scope attached to |
@tqchen,
The current issue is that for device "CPU" && 'global', it currently does mean a definite stack allocation whenever the size is below the heuristic threshold, not the other way around.
Yes, this is something we want eventually and will be working towards with the USMP work. Until we have that, the natural assumption should be, in the absence of a 'constraint', that the memories are more accessible rather than less accessible (e.g. the stack). It's unfortunate that the current design prefers the latter, especially in the absence of a constraint.

Short term solution: I think you are right, we might want to unblock this using a target-dependent kMaxStackAllocaSize. May I ask why the default was chosen to be this value?

tvm/include/tvm/runtime/device_api.h, lines 60 to 61 in 1fd8f61
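For reference, the constant behind that default lives in include/tvm/runtime/device_api.h; at around the referenced commit it reads roughly as below (quoted from memory, so treat the exact wording as approximate):

```cpp
/*! \brief Maximum size that can be allocated on stack. */
constexpr int kMaxStackAlloca = 1024;
```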
It's interesting, because the stack usage can still grow beyond that size, as the check only looks at a single allocate at a time; i.e. you could have multiple allocates that are each less than 1024 bytes (for example, eight separate 1000-byte allocates each pass the check but together consume about 8 KB of stack). So the stack usage is not even bounded by the current approach. Therefore, to both unblock us on Ethos-U and also somewhat solve the problem that current micro builds use the stack for tensors < 1024 bytes instead of the workspace buffer provided, maybe we should just make kMaxStackAllocaSize=0 (a binary decision rather than a value range).

@Mousius @leandron @areusch, this means there is going to be another argument for a simple micro deployment to be added to the already long list of arguments. Something like "--use-external-workspace"?

@tqchen, I still feel it would have been super helpful if kMaxStackAllocaSize were zero by default, with the option of going higher based on a user argument, e.g. --max-stack-alloca-size=1024. It is not very convincing that we default to stack allocation of tensors on the prospect of them being optimized by mem2reg, without doing any marking (i.e. a special storage scope).
Please allow me to explain the overall rationale here, in particular around the term "constraint".
The two ways to see the constraint

The way C0 sees the constraint is about the possible accessors of the memory.
The way C1 sees the constraint is about possible memories to choose from.
Discussions

When we say a compiler IR is more constrained than another, we usually mean that fewer optimizations can be performed, because there is a lack of flexibility in terms of rewriting. This makes C1 more aligned with common compiler IR design. Note that "memory that is accessible from all devices" is a term that depends on the specific runtime platform, and is not very well defined in a generic IR. Under the current semantics, "CPU" && "global" can result in stack allocation. Note that this is one kind of flexibility we want to offer to later stages so that specializations can be made.
One thing we should keep in mind is that the codegen and AOT compiler should not rely on the behavior of TVMLowerBuiltin to ensure correctness (since it can choose to do anything in the case of "global", including dispatching to another custom allocator). If a special kind of memory is needed, we should declare such a constraint through the IR. Attaching a special scope is the best way to do so under the current semantics, regardless of the implementation of TVMLowerBuiltin. TVMLowerBuiltin picks
Thanks @tqchen for the explanation of the two viewpoints of how we could see the constraints. I do agree that we should put more constraints in progressively to guide the compiler in the lowering. It is just that the treatment of TVMBAW as a peer to heap and stack seemed not right; in fact, one could use TVMBAW to perform heap allocation. My concern was that we are taking a decision way down in the compilation flow when it could have been taken transparently in the IR itself a bit higher up. I think we are moving there with scopes, but it would have been nicer to stick to TVMBAW for now, as it is the more general lowering for 'global', and I would not necessarily see that as an orthogonal choice to the list you have mentioned. It just boils down to the fact that we want allocations on the stack for mem2reg optimizations. While I don't disagree with the logic of the argument, I wish it was more explicit higher up in the compilation flow. If it was not for mem2reg optimizations, one could simply provide an Arena that could provide the required workspace just as the stack would -- thus it seemed to me like a better interim solution until we specialize the scope in the lowering.

Yes, as you suggested, the implementation of the target-dependent query for the max alloca size is not particularly challenging; it is just the API that we provide for the user that we were worried about. This is important especially because "micro" is not really a target in TVM -- so the short-term solution seems like we would need one of the following TargetKind attributes for the C and LLVM backends:

A1: --max-stack-alloca-size

So here, ideally from a UX point of view, it would be better to give the user the A2 argument, rather than constraining the size of a single allocation, which seems like a proxy to control the behaviour of mem2reg transformations. However, that does not match the behavior controlled by kMaxStackAllocaSize. A3, on the other hand, sets kMaxStackAllocaSize to zero, forcing all the allocates to be serviced by TVMBAW, which could be from the heap or an Arena placed anywhere, controlled by the runtime/application.
Thanks @manupa-arm. I agree that putting TVMBAW as a peer to heap is not right (that was meant as an example to demonstrate the viewpoint; I do not necessarily want to enforce heap as a peer to TVMBAW). For the case of AOT, however, I think it would still be useful to introduce a specialized scope eventually, so we can place the constraint that the memory is accessible by the NPU explicitly in the IR. While there is no micro-specific target, we can introduce micro-specific tags that set these properties. See https://github.com/apache/tvm/blob/main/src/target/tag.cc#L84 as an example of CUDA tags. I think all three attributes you proposed could work. A1 might be the simplest for now, and it won't preclude the A2/A3 options in the future.
apologies for the delay in reply here. I agree that allocating on the stack is, in the current GraphExecutor world, a codegen-level optimization. In my mind, the introduction of AOTExecutor (and actually also VMExecutor) means that this is moving one level up. Here is why:
Now that this is the case, it doesn't make sense to make stack-allocation of
The question then is how to model this in the program. I think using
Under this definition, a special
Something we still have yet to address is how to configure the
It's likely that a scalable solution in micro-land is to have the Project API server able to provide a boilerplate set of
Lastly, as to the
in the former case, it's likely that
in the latter case, we probably additionally need a thread identifier, e.g.
Until we have USMP, my thoughts are that the short-term solution should be to stick with a codegen-level optimization and add an attribute which can be used to disable the stack-allocation optimization. What do you think @tqchen @manupa-arm?
One thing I want to note here is that the prefixes (global/shared/warp/local) have to do with the thread execution hierarchy, so we cannot add arbitrary prefixes here. We can, however, add any tags that follow. In most cases the storage is assumed local to the execution context (of the thread), so we do not need to annotate threading as part of the storage scope.
So there are two sub-issues here, if you follow the discussion:

P1: The AoT executor codegen creates tir.allocates that should more generally be served by TVMBAW calls.

With fixes for P1 and P2, I feel that should solve the issues in scope.

@areusch shall we move the discussion to another issue/discuss post about promoting/introducing a storage_scope that is higher than the current definition of 'global'? I agree with you that the current definition of 'global' is a misnomer. There are certain memories (e.g. the stack) that could be 'owned' by a TargetDevice -- in which case we could use <TargetDevice>.stack. In a way, the stack is special, as it does not create an opaque call to get hold of the memory, so LLVM (or any other post-TVM compiler) can do optimizations; hence my emphasis on treating it specially with the 'local' storage_scope. However, I think it would be too big of a change given the historical precedent in TVM, as @tqchen points out.

However, in shared-memory SoCs, the ownership of memories by a TargetDevice will not scale very well. For example, if we have an SRAM that is accessed by the CPU, DSP, and NPU in the same SoC, what should the prefix be? The current approach we are taking in USMP (as discussed in the RFC) is to use the storage_scope global.<pool_name>, where pool_name points to a PoolInfo that describes which TargetDevices have access to it. Therefore, I believe we could make an Arena that is pinned to the memories in the runtime layer to serve each Pool via TVMBAWs. If USMP is invoked, the static-sized allocations will be translated to fixed offsets within a static buffer pinned to the same memories.

I guess my point is that the provision of Arenas/buffers should be lifted out of the codegen to the application/runtime layer and interfaced by generated code via TVMBAWs (for non-static-sized allocates) or offsets (for static-sized allocates).
P1: #9065
okay @manupa-arm, this short-term solution seems workable to unblock the Ethos-U merge. I agree we should revisit how to solve this more generally in AOT once we have USMP and TargetDevice merged. We can discuss this in a separate post.
Can we close this?
P2: #9950
tvm/src/tir/transforms/lower_tvm_builtin.cc, line 116 in 2aebd33
This particular code has a complex history, but its purpose is to avoid pool allocations for small tensors on CPU targets (a big perf win). However, not all devices of type 'CPU' support alloca, since the stack may not be hardware addressable. In particular, the EthosU target requires all mem ops to be rewritten to the pool allocation primitives for later handling.
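For readers without the source at hand, the early-exit decision is roughly of the following shape. This is a simplified, self-contained sketch of the logic only -- AllocInfo and LeaveAsStackAlloca are placeholder names, not the actual types or functions in lower_tvm_builtin.cc:

```cpp
#include <cstdint>
#include <cstdio>

constexpr int kDLCPU = 1;                  // dlpack device type code for CPU
constexpr int64_t kMaxStackAlloca = 1024;  // the hard-coded threshold under discussion

// Placeholder for what the pass knows at a tir.Allocate: the device type of the
// enclosing PrimFunc and the allocation size in bytes (if it is a constant).
struct AllocInfo {
  int device_type;
  int64_t constant_bytes;  // -1 when the size is not a compile-time constant
};

// true  -> the pass leaves the allocation alone (early exit), so a later stage
//          emits a plain stack alloca;
// false -> the allocation is rewritten into TVMBackendAllocWorkspace /
//          TVMBackendFreeWorkspace calls (pool allocation).
bool LeaveAsStackAlloca(const AllocInfo& info) {
  return info.device_type == kDLCPU && info.constant_bytes >= 0 &&
         info.constant_bytes <= kMaxStackAlloca;
}

int main() {
  AllocInfo small{kDLCPU, 512}, large{kDLCPU, 4096};
  printf("512B on CPU stays on the stack: %d\n", LeaveAsStackAlloca(small));
  printf("4KB on CPU goes to the workspace pool: %d\n", !LeaveAsStackAlloca(large));
  return 0;
}
```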
So the logic which fires this early-exit heuristic (so that a later pass will rewrite allocs to alloca) is too hard-coded.
An easy fix would be to make the kMaxStackAlloca threshold zero on targets which do not support alloca. That can be done by moving the value to a TargetKind attribute.
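A sketch of how that could look against TVM's TargetKind / Target attribute API; the attribute name "max-stack-alloca-size", the kind name "example_cpu", and the MaxStackAllocaBytes helper are all hypothetical, and in practice the existing "c"/"llvm" registrations would be extended rather than a new kind added:

```cpp
#include <tvm/runtime/device_api.h>
#include <tvm/target/target.h>
#include <tvm/target/target_kind.h>

namespace tvm {

// (1) Expose the threshold as a per-kind option (here on a placeholder kind),
//     defaulting to today's constant; micro-style targets would set it to 0.
TVM_REGISTER_TARGET_KIND("example_cpu", kDLCPU)
    .add_attr_option<Integer>("max-stack-alloca-size", Integer(runtime::kMaxStackAlloca));

// (2) In LowerTVMBuiltin, query the attribute of the current target instead of
//     comparing against the hard-coded constant.
inline int64_t MaxStackAllocaBytes(const Target& target) {
  return target
      ->GetAttr<Integer>("max-stack-alloca-size", Integer(runtime::kMaxStackAlloca))
      .value()
      ->value;
}

}  // namespace tvm
```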