[RFC] Discuss New Features of AOT Runtime #2122
One thing that would be useful (that I think is orthogonal to JIT vs AOT) is considering relaxing the requirement on fully-specified shapes in graph_runtime. That is, we'd like to allow variable dimensions (consider supporting e.g. variable batch size) and allocate (at least some) memory dynamically.
@ajtulloch can you give some specific application examples? For example, dynamic batch size in image modeling, dynamic length in language models, etc. We do support certain forms of type inference with dynamic variables (the current type inference supports a symbolic integer in the batch dimension, so we can know the shape of a tensor is (n, 100, 100)), and likely that would help in certain cases.
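As a concrete illustration of the symbolic-shape support mentioned above, here is a minimal sketch against the TVM Python API of the time (newer releases move these helpers under `tvm.te`); the elementwise kernel and the concrete batch sizes are placeholders, not something from this thread:

```python
import numpy as np
import tvm

# A toy elementwise kernel whose batch dimension stays symbolic, so one
# compiled function can serve any batch size at call time.
n = tvm.var("n")                                  # symbolic batch dimension
A = tvm.placeholder((n, 100, 100), name="A")      # shape is (n, 100, 100)
B = tvm.compute(A.shape, lambda i, j, k: A[i, j, k] * 2.0, name="B")

s = tvm.create_schedule(B.op)
f = tvm.build(s, [A, B], target="llvm")

for batch in (1, 8, 32):                          # one binary, several batch sizes
    x = tvm.nd.array(np.random.rand(batch, 100, 100).astype("float32"))
    y = tvm.nd.array(np.zeros((batch, 100, 100), dtype="float32"))
    f(x, y)
```

The same compiled function serves every batch size because the leading dimension stays symbolic through type checking and code generation.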
@tqchen yes, a simple case would be supporting using a graph in
I agree with @ajtulloch. More dynamism is necessary for future applications. We should list a few tasks for each field, including CV/NLP/RL/ASR/TTS etc., and check them one by one to see what kind of features will be necessary and what the trends are. In this phase, domain experts' input will be very helpful.
For embedded applications it is often important not to use dynamic memory. For example, some tiny systems may lack a memory manager altogether (stubs in place of malloc/free in libc). In my opinion, the user should always have an option to generate static-memory code, or code where the upper bound on memory usage is well-defined. @ajtulloch Is there a possibility to support varying shapes but still use static memory allocations? For example, could we generate static-memory code which would allow users to vary one dimension of one input tensor?
@grwlf I don't envision this would require dynamic memory allocations, more that it would make them possible. If you know all shapes statically, then of course you can just statically allocate. This is more about enabling new cases where you don't know all dimensions of all tensors statically.
FWIW here are some thoughts on possible usages for the NNVMv2 runtime:
Is training in the scope of discussion?
Another thing that could be useful in the runtime is support for graph partitioning, in case certain operators are not supported by accelerators or runtime resources don't permit them.
I would list the three most demanded features of our runtime: 1) handling inputs with dynamic shapes, 2) handling execution on multiple compute resources (CPU, GPU, DSP, NPU, etc.), and 3) enabling operator-level parallelism. I think having multiple runtimes makes sense, as the scenarios of server and edge are vastly different. On the server side things are relatively flexible, thanks to sufficient compute resources and fewer power-consumption constraints. For example, we can use JIT to handle dynamic shapes and execute sample runs to determine resource availability beforehand. The critical part is how to design a runtime under the constraints of edge devices. I think the minimal static interpreter @ajtulloch suggested makes sense. However, an edge device may not have much space to store many pre-compiled object files.
Several concerns:
I agree with @szha that we should define the scope of the discussion. In this thread, should we talk about future integration into upstream deep learning frameworks (e.g. MXNet)? Should we talk about inference vs. training, should we consider distributed setups, and should we consider thread safety? These are all concerns at different scopes.
To limit the scope of the discussion, let us first focus on the low-resource, pure AOT scenario. Multi-target execution is already supported in the current graph runtime as of #1695, and we only need to build compiler support for that. My guess is that JIT and training are their own beasts and would deserve another thread.
#1695 is great but doesn't cover all I meant by "handling execution on multiple compute resources". Given multiple targets, how to schedule the execution on them in parallel would be an interesting research/engineering topic to explore. Also, the TVM runtime may call a third-party runtime (e.g. TensorRT) for a particular target in a heterogeneous environment.
@szha We would be glad to continue discussion related to training in another thread: #1996 (please cc @sgrechanik-h)
@grwlf This sounds pretty cool! So how could we do manual (or automatic) scheduling on an automatically generated backprop? CC: @were seems to be interested in this as well.
@ajtulloch I meant the case where we have to allocate memory statically but still want to vary one of the input dimensions. With this feature implemented, TVM could gain ground in the domain of resource-critical embedded applications. If I am correct, dynamic batching, as mentioned by @junrushao1994, typically varies the batch-size dimension; it is probably a good example of such a case.
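A minimal sketch of that idea (plain NumPy, not TVM code), assuming a compile-time upper bound on the varying dimension; `MAX_BATCH` and the stand-in kernel are hypothetical placeholders:

```python
import numpy as np

MAX_BATCH = 32            # compile-time upper bound on the varying dimension
FEATURES = (100, 100)

# One statically sized arena: peak memory is known at build time even though
# the leading dimension of the input varies from call to call.
input_buf = np.zeros((MAX_BATCH, *FEATURES), dtype="float32")
output_buf = np.zeros((MAX_BATCH, *FEATURES), dtype="float32")

def run(batch_data):
    """Run on a batch whose leading dimension may vary, up to MAX_BATCH."""
    n = batch_data.shape[0]
    assert n <= MAX_BATCH, "batch exceeds the static upper bound"
    input_buf[:n] = batch_data
    # A real deployment would invoke the compiled kernel on the views
    # input_buf[:n] / output_buf[:n]; this stand-in just doubles the values.
    output_buf[:n] = input_buf[:n] * 2.0
    return output_buf[:n]
```

The point is that an AOT-generated runtime could size all buffers for the declared upper bound and only ever use a prefix of each, keeping memory usage statically bounded.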
+1 to assigning graph partitions to threads instead of axes. There are some convincing benchmarks and discussions which suggest that cache locality is a primary performance booster.
Normally, if the workload is big enough, our past experience in MXNet suggests that parallelization within an op has more potential than binding parts of the graph to threads. Pipeline partitioning would be useful on small ops, though.
In inference, graph-level parallelism does not help that much (at least on CPU/GPU), because normally an operator is big enough to occupy all CPU threads or GPU streaming processors. About small workloads, @tqchen @nhynes do you have any specific examples of such workloads, i.e. ones that could be accelerated by multiple issuing but couldn't be fused into a single kernel?
Glow is a framework that does not support multi-threading (at least at the time of writing this post), i.e. all their operators are single-threaded. This could somehow explain why multiple issuing helps in Glow (imho).
Right, but that's what they're planning on doing now. They're going the graph partitioning route and have expressed that cache locality makes this approach more efficient than op-parallelism. I can't quite think of an example of a "small" op in the world of fusion, though.
Looks very interesting. @were do you have some bandwidth to look into this?
related #2810
As we move to NNVMv2 (Relay), there is a clear separation of compiler and runtime. The compiler IR is maximally flexible, while the runtime is a virtual machine (interpreter) that executes the code the compiler generates.
This is an RFC to discuss what kinds of runtimes we should have. Given the separation of compiler and runtime, it might make sense to have multiple runtimes, since there is a tradeoff between how rich a feature set we want to support (e.g. JIT) and the minimalism we need on an embedded device. Likely we will need both an AOT runtime and a JIT one.
There are also questions about which data structures we want to expose in the runtime. The TVM runtime and PackedFunc are likely already our friends, but we may need a few more things to accommodate control flow and other applications.
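For reference, here is roughly what the existing graph runtime plus PackedFunc surface looks like from Python (a sketch: the file names and the "data" input name are placeholders, and newer releases rename some of these modules). The question for an AOT runtime is how much of this surface can shrink on a minimal device:

```python
import numpy as np
import tvm
from tvm.contrib import graph_runtime

# Artifacts produced by the compiler; the file names here are placeholders.
lib = tvm.module.load("deploy_lib.so")
graph_json = open("deploy_graph.json").read()
params = bytearray(open("deploy_params.bin", "rb").read())

# The graph runtime itself is driven entirely through PackedFuncs.
module = graph_runtime.create(graph_json, lib, tvm.cpu(0))
module.load_params(params)
module.set_input("data", tvm.nd.array(np.zeros((1, 3, 224, 224), "float32")))
module.run()
out = module.get_output(0)
```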