[RFC] [Contrib] [Runtime] Minimal runtime (~12kb .text on ARMv7/x86) for subset of TVM models #3567
This is a great step toward putting TVM on more resource-constrained devices. We have another effort (uTVM, @weberlo) that aims to enable automatic optimizations, but we still lack a minimal runtime that we can serve on the device. This PR is a big step in that direction. One thing we can try is to consolidate it with uTVM and put it under the tvm/runtime/micro namespace later. A fun challenge would be to iterate further and remove most dependencies on the OS (mainly alloc) so we can really run it on bare-metal devices.
@tqchen yes absolutely - from talking to you yesterday I hadn't thought of the uTVM application, but it certainly could be interesting. One possible improvement in that direction could be to create a mmap'able representation of the parsed graph_json, i.e. these fields of
which would allow us to construct the GraphRuntime allocation-free (and eliminate the code-size cost of the JSON parser). The remaining allocations are then the NDArray tensor allocations themselves, which could be handled via a static storage plan or similar?
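A rough sketch of what an mmap'able graph representation could look like, assuming a flat blob with a fixed header followed by fixed-size node records. The layout, the field names, `FlatGraphHeader`, `FlatNode`, `FlatGraphView`, and `kFlatGraphMagic` are all hypothetical and are not TVM's graph format; the point is only that reading such a blob needs no heap allocation and no JSON parser.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical flat layout (illustrative only, not TVM's format).
struct FlatGraphHeader {
  std::uint32_t magic;         // identifies the blob
  std::uint32_t num_nodes;     // number of FlatNode records
  std::uint32_t nodes_offset;  // byte offset of the first FlatNode
};

struct FlatNode {
  std::uint32_t op_id;       // index into a table of compiled functions
  std::uint32_t num_inputs;  // number of inputs consumed by this node
};

constexpr std::uint32_t kFlatGraphMagic = 0x4D565475u;  // arbitrary value

// Read-only view over a blob that was mmap'ed or linked into flash.
// No allocation happens here; records are copied out with memcpy so the
// blob does not have to be suitably aligned.
class FlatGraphView {
 public:
  // Returns false if the blob is too small or has the wrong magic.
  bool Init(const unsigned char* data, std::size_t size) {
    if (data == nullptr || size < sizeof(FlatGraphHeader)) return false;
    FlatGraphHeader h;
    std::memcpy(&h, data, sizeof(h));
    if (h.magic != kFlatGraphMagic) return false;
    const std::uint64_t end =
        static_cast<std::uint64_t>(h.nodes_offset) +
        static_cast<std::uint64_t>(h.num_nodes) * sizeof(FlatNode);
    if (end > size) return false;
    header_ = h;
    data_ = data;
    return true;
  }

  std::uint32_t num_nodes() const { return header_.num_nodes; }

  FlatNode node(std::uint32_t i) const {
    assert(data_ != nullptr && i < header_.num_nodes);
    FlatNode n;
    std::memcpy(&n,
                data_ + header_.nodes_offset +
                    static_cast<std::size_t>(i) * sizeof(FlatNode),
                sizeof(n));
    return n;
  }

 private:
  FlatGraphHeader header_{};
  const unsigned char* data_ = nullptr;
};
```

A real format would also need to cover shapes, dtypes, and the storage plan, and would have to pin down endianness; none of that is attempted here.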
Most microcontrollers do have stacks (heaps); we just need to pre-define a section in the memory space and implement an arena-style allocator (always allocate, never de-allocate individually) that recycles all memory at an RAII point.
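A minimal sketch of the arena idea described above, assuming a statically reserved buffer stands in for the "pre-defined section". The names `Arena` and `g_arena_storage` are hypothetical and do not come from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bump ("arena") allocator over a caller-provided memory region.
// Individual blocks are never freed; Reset() recycles everything at once.
class Arena {
 public:
  Arena(void* base, std::size_t size)
      : base_(static_cast<unsigned char*>(base)), size_(size), used_(0) {}

  // Returns nullptr when the region is exhausted.
  // `align` must be a power of two.
  void* Allocate(std::size_t bytes,
                 std::size_t align = alignof(std::max_align_t)) {
    assert(align != 0 && (align & (align - 1)) == 0);
    const std::uintptr_t start =
        reinterpret_cast<std::uintptr_t>(base_) + used_;
    const std::uintptr_t aligned =
        (start + align - 1) & ~(static_cast<std::uintptr_t>(align) - 1);
    const std::size_t padding = static_cast<std::size_t>(aligned - start);
    const std::size_t remaining = size_ - used_;
    if (padding > remaining || bytes > remaining - padding) return nullptr;
    unsigned char* p = base_ + used_ + padding;
    used_ += padding + bytes;
    return p;
  }

  // Recycle the whole region in one step.
  void Reset() { used_ = 0; }

  std::size_t used() const { return used_; }

 private:
  unsigned char* base_;
  std::size_t size_;
  std::size_t used_;
};

// Stand-in for a memory section reserved by the linker script (assumption).
alignas(std::max_align_t) static unsigned char g_arena_storage[4096];
```

Because nothing is freed individually, the allocator cannot fragment; the cost is that memory is only reclaimed in bulk.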
+1 on getting this integrated with uTVM. @weberlo, care to take a look at this PR and make some high-level comments?
This is really cool work. I wonder if we could also provide a simple step-by-step guide to deploying a simple model on an ARMv7 device with this minimal runtime. It would certainly help bring people up to speed on using this runtime on their edge devices.
This looks great. As mentioned above it potentially fits well with uTVM. For use with uTVM it would be useful to have this runtime or a derivative built in C rather than C++ in order to be deployable to the various embedded environments out there that don't have C++ runtime / tooling support. |
@mshawcroft oh interesting - yeah, I started off with a pure C API (https://github.com/dmlc/tvm/pull/3567/files#diff-cf8621d821243d3ba906f0d9154abcea), but internally it's implemented with C++ (although it's deliberately designed to be compiled with -fno-rtti, -fno-exceptions, etc) - is the constraint that any use of C++ makes this unsuitable for embedded environments? |
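The "pure C API over a C++ implementation" pattern mentioned above can be sketched as follows. This is a generic illustration, not the API from the PR: `Runtime`, `ExampleRuntimeHandle`, and the `ExampleRuntime*` functions are hypothetical names. Errors are reported through return codes and `new (std::nothrow)` is used, so the code does not rely on exceptions or RTTI.

```cpp
#include <cassert>
#include <new>

namespace {
// Internal C++ implementation, hidden from C callers (hypothetical stand-in).
class Runtime {
 public:
  explicit Runtime(int num_outputs) : num_outputs_(num_outputs) {}
  int num_outputs() const { return num_outputs_; }

 private:
  int num_outputs_;
};
}  // namespace

extern "C" {

// Opaque handle: C callers never see the C++ type.
typedef void* ExampleRuntimeHandle;

// Returns nullptr on allocation failure instead of throwing.
ExampleRuntimeHandle ExampleRuntimeCreate(int num_outputs) {
  return new (std::nothrow) Runtime(num_outputs);
}

// Returns 0 on success and -1 on invalid arguments.
int ExampleRuntimeNumOutputs(ExampleRuntimeHandle handle, int* out) {
  if (handle == nullptr || out == nullptr) return -1;
  *out = static_cast<Runtime*>(handle)->num_outputs();
  return 0;
}

void ExampleRuntimeDestroy(ExampleRuntimeHandle handle) {
  delete static_cast<Runtime*>(handle);
}

}  // extern "C"
```

This keeps the C++ out of the public header, but the target toolchain still has to compile and link C++.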
@ajtulloch the situation is not black and white. At one end of the scale is pure C; at the other end is C++ using the standard C++ libraries and all the language bells and whistles; in the middle is a bunch of intermediate restricted subsets of C++ with arbitrary subsets of the C++ standard library. The broadest reach and lowest friction for potential users is at the C end of the scale. Aside from the language subset used, other issues are the availability (and size!) of the standard C++ library on a platform, and the memory management strategy used (at the small end, memory fragmentation kills you, hence arbitrary use of the heap is undesirable). By way of example, last time I checked, Zephyr RTOS's C++ application support was broadly: no use of new/delete, no RTTI, no exceptions, no static global object destruction... (note that the new/delete ban has a significant impact on the std C++ library available!). Other RTOS environments are richer, others are more constrained. There is a limited cost to the TVM community in providing a C runtime rather than a C++ runtime, but doing so broadens TVM's reach. BTW, I'm really excited to see all the current activity in the uTVM, small runtime, embedded space ;-)
To summarize some of the points.
The arena-style allocator may be fine for most of our cases. The idea is that we always allocate and de-allocate in bulk, which lets us keep most of the allocations in a single user-defined stack on a memory region.

void MyApp() {
  // RAII: everything allocated within the function only gets space here;
  // the space is de-allocated when MyApp exits.
  tvm::micro::AllocatorContext ctx;
}

A slight variation would be to have the allocator remember the number of objects it has allocated so far in the current context. When we call free, it only decrements the counter, and we recycle everything when the counter reaches zero. This should work for most cases we care about (where the allocation/free pattern is stack-like).
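Both ideas from the summary above can be sketched together: an RAII scope that rewinds the arena on exit, and a counter that recycles the region once every allocation has been freed. `CountingArena` and `ArenaScope` are hypothetical names; `tvm::micro::AllocatorContext` in the comment above is a proposal, and this is not its implementation. The sketch assumes the backing buffer is max-aligned.

```cpp
#include <cassert>
#include <cstddef>

// Arena that counts live allocations. Free() only decrements the counter;
// the whole region is recycled when the counter reaches zero.
class CountingArena {
 public:
  CountingArena(unsigned char* base, std::size_t size)
      : base_(base), size_(size), used_(0), live_(0) {}

  void* Allocate(std::size_t bytes) {
    // Round up so every block stays max-aligned (base_ assumed max-aligned).
    const std::size_t a = alignof(std::max_align_t);
    const std::size_t rounded = (bytes + a - 1) / a * a;
    if (rounded < bytes || rounded > size_ - used_) return nullptr;
    void* p = base_ + used_;
    used_ += rounded;
    ++live_;
    return p;
  }

  void Free(void* /*ptr*/) {
    assert(live_ > 0);
    if (--live_ == 0) used_ = 0;  // last object gone: recycle everything
  }

  // Used by ArenaScope to restore an earlier state.
  void Rewind(std::size_t used, std::size_t live) {
    used_ = used;
    live_ = live;
  }

  std::size_t used() const { return used_; }
  std::size_t live() const { return live_; }

 private:
  unsigned char* base_;
  std::size_t size_;
  std::size_t used_;
  std::size_t live_;
};

// RAII scope: everything allocated while the scope is alive is released
// in bulk when the scope is destroyed.
class ArenaScope {
 public:
  explicit ArenaScope(CountingArena* arena)
      : arena_(arena), used_(arena->used()), live_(arena->live()) {}
  ~ArenaScope() { arena_->Rewind(used_, live_); }
  ArenaScope(const ArenaScope&) = delete;
  ArenaScope& operator=(const ArenaScope&) = delete;

 private:
  CountingArena* arena_;
  std::size_t used_;
  std::size_t live_;
};
```

The counter variant only reclaims memory once all objects are freed, so it fits stack-like allocation patterns and not long-lived objects mixed with short-lived ones.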
Given the current discussions, perhaps we can decide on the naming and make a few improvements, if you feel you can push some of them in a few days, and then merge it in. In terms of naming and code location, given the relation to uTVM, we could think about a good name for the minimal runtime. One example is "src/runtime/micro/standalone"; perhaps @mshawcroft @ajtulloch @weberlo have better ideas.
@ajtulloch Awesome work on this! We'll need a runtime for uTVM when we want to try self-hosted models, so the timing on this is great. My general understanding is that it's much more common for bare-metal devices to support C, so it'd be interesting to see if we could incrementally whittle this down to pure C, like @mshawcroft said. Even if not, this would be a nice bonus for users targeting devices that do have C++ support. If we want to merge this into the µTVM namespace, |
To make this PR actionable, @ajtulloch can you decide on the namespace choices, make the changes, and fix the CI so we can merge it in?
OK, so changes planned are:
Will work on it right now, thank you folks. |
@tqchen does this look good to you? |
Mostly high-level naming convention changes. The overall code looks good.
 * under the License.
 */

#include "minimalruntime_api.h"
utvm_runtime_api.cc
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
utvm_runtime_standalone_test.cc
@mshawcroft @weberlo @tmoreau89 please help review if you have time, and see https://docs.tvm.ai/contribute/code_review.html#approve-and-request-changes-explicitly
}
}

void parseAttrs(const picojson::object& jattr, GraphAttr* attr) {
ParseAttrs (to be consistent with Google C style)
void* lib_handle_{nullptr};
};

struct GraphAttr {
Document each struct, field, and function.
So I've not had the time to study the code in detail, sorry. I would like to, but it won't happen this week. Skimming the code does raise one immediate question: are we sure the memory management policy implemented in the module does not lead to fragmentation?
@ajtulloch Which models have you been able to run on this runtime so far? |
@weberlo e.g. CPU CNNs like MobileNet, ResNet, etc. One thing not supported is e.g. tvm.extern, since we don't support packed funcs.
 * under the License.
 */

#pragma once
I don't know if we allow #pragma once, for compatibility reasons. I hope I'm mistaken, because header guards are gross.
Let us still use header guard as per google C style
some final nits
 * under the License.
 */

#pragma once
Let us still use header guard macro as per Google C style
RUNTIME_MICRO_STANDALONE_MINIMAL_VECTOR_H_
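For illustration, the requested change swaps `#pragma once` for a guard macro of the form below. The macro name is the one suggested in the review; the `example::Placeholder` declaration is a hypothetical stand-in for the header's real contents.

```cpp
#ifndef RUNTIME_MICRO_STANDALONE_MINIMAL_VECTOR_H_
#define RUNTIME_MICRO_STANDALONE_MINIMAL_VECTOR_H_

// Header contents go here. A placeholder declaration stands in for them.
namespace example {
struct Placeholder {
  int value;
};
}  // namespace example

#endif  // RUNTIME_MICRO_STANDALONE_MINIMAL_VECTOR_H_
```

In the Google style the review refers to, the guard name is derived from the file's path in the source tree, which keeps guards unique across the project.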
@antinucleon @weberlo please see https://docs.tvm.ai/contribute/code_review.html#approve-and-request-changes-explicitly To be clear, the current set of changes does not yet meet the no-std requirement: it still depends on new/malloc, etc. Further refactoring will be necessary so that the uTVM standalone runtime takes in a pre-allocated memory region and only uses memory from that region to allocate most of the executables.
@ajtulloch can you act on the final comments and let us get it in:) |
Will do today @tqchen, my bad. |
@ajtulloch please look into the CI error and see if we can fix it. |
@tqchen sure, will do on the weekend. |
ping @ajtulloch |
… of TVM models This is an alternative implementation of a subset of the TVM runtime API (and graph runtime) that focuses entirely on reducing code size, at the expense of functionality (no tvm.extern(..) calls via PackedFunc, CPU only, etc). It might be worth incrementally expanding the surface area if there's interest. The motivation for this work was seeing what the minimal useful subset of the TVM runtime is. This is relevant for e.g. super code-size constrained applications in e.g. embedded/mobile. The current runtime is more like O(100KiB) or so, so this might be compelling for some users. The smaller surface area for auditing might make this relevant for apache#3159, or the usecases I was thinking about in apache#2523 (comment) re: the Rust runtime. The symbols in the tvm::minimalruntime space (i.e. excluding std:: and picojson::) are about 5KiB, so I think there's a bunch of room here (i.e. we could replace picojson:: with [`jsmn`](https://zserge.com/jsmn.html) or something, and we could replace more of the `std::unordered_map` usage, etc with custom primitives as well (similar to the `DynArray`).
Thanks @ajtulloch @weberlo @antinucleon @mshawcroft, this PR is now merged |
Summary
This is an alternative implementation of a subset of the TVM runtime API (and
graph runtime) that focuses entirely on reducing code size, at the expense of
functionality (no tvm.extern(..) calls via PackedFunc, CPU only, etc). It might
be worth incrementally expanding the surface area if there's interest.
Motivation
The motivation for this work was seeing what the minimal useful subset of the
TVM runtime is. This is relevant for e.g. super code-size constrained
applications in e.g. embedded/mobile. The current runtime is more like O(100KiB)
or so, so this might be compelling for some users.
The smaller surface area for auditing might make this relevant for
#3159, or the usecases I was thinking about in
#2523 (comment) re: the Rust
runtime.
Analysis
The symbols in the tvm::minimalruntime space (i.e. excluding std:: and
picojson::) are about 5KiB, so I think there's a bunch of room here (i.e. we
could replace picojson:: with jsmn or something, and we could replace more of
the std::unordered_map usage, etc. with custom primitives as well (similar to
the DynArray)).
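To make the "custom primitives" idea concrete, here is a sketch of a fixed-size array that could stand in for `std::vector` where the size is known once at construction. It is written in the spirit of the `DynArray` mentioned above but is not the PR's implementation; `FixedArray` is a hypothetical name. Leaving out growth, capacity tracking, and allocator support is where the code-size saving would come from.

```cpp
#include <cassert>
#include <cstddef>

// Heap-backed array whose size is fixed at construction (hypothetical).
template <typename T>
class FixedArray {
 public:
  FixedArray() = default;
  // Value-initializes `size` elements.
  explicit FixedArray(std::size_t size) : data_(new T[size]()), size_(size) {}
  ~FixedArray() { delete[] data_; }

  // Move-only: copying would require another allocation.
  FixedArray(const FixedArray&) = delete;
  FixedArray& operator=(const FixedArray&) = delete;
  FixedArray(FixedArray&& other) noexcept
      : data_(other.data_), size_(other.size_) {
    other.data_ = nullptr;
    other.size_ = 0;
  }
  FixedArray& operator=(FixedArray&& other) noexcept {
    if (this != &other) {
      delete[] data_;
      data_ = other.data_;
      size_ = other.size_;
      other.data_ = nullptr;
      other.size_ = 0;
    }
    return *this;
  }

  T& operator[](std::size_t i) {
    assert(i < size_);
    return data_[i];
  }
  const T& operator[](std::size_t i) const {
    assert(i < size_);
    return data_[i];
  }

  std::size_t size() const { return size_; }
  T* begin() { return data_; }
  T* end() { return data_ + size_; }
  const T* begin() const { return data_; }
  const T* end() const { return data_ + size_; }

 private:
  T* data_ = nullptr;
  std::size_t size_ = 0;
};
```

This version still calls `new[]`, so it would not by itself satisfy the no-std goal discussed in the review; the storage would have to come from a pre-allocated region for that.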