[RFC] [Contrib] [Runtime] Minimal runtime (~12kb .text on ARMv7/x86) for subset of TVM models #3567

Merged
merged 1 commit into from
Sep 12, 2019

Conversation

ajtulloch
Contributor

Summary

This is an alternative implementation of a subset of the TVM runtime API (and
graph runtime) that focuses entirely on reducing code size, at the expense of
functionality (no tvm.extern(..) calls via PackedFunc, CPU only, etc). It might
be worth incrementally expanding the surface area if there's interest.

Motivation

The motivation for this work was seeing what the minimal useful subset of the
TVM runtime is. This is relevant for severely code-size-constrained
applications, e.g. embedded/mobile. The current runtime is more like O(100KiB)
or so, so this might be compelling for some users.

The smaller surface area for auditing might make this relevant for
#3159, or the use cases I was thinking about in
#2523 (comment) re: the Rust
runtime.

Analysis

The symbols in the tvm::minimalruntime space (i.e. excluding std:: and
picojson::) are about 5KiB, so I think there's a bunch of room here: we
could replace picojson:: with jsmn or something, and we could replace more
of the std::unordered_map usage, etc. with custom primitives as well
(similar to the DynArray).
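
To make the idea of swapping standard containers for custom primitives more concrete, here is a minimal sketch. The `DynArray` below is only an assumption about what such a primitive might look like (a fixed-size, heap-backed array with no growth logic), and `NamedIndex` / `FindByName` are hypothetical stand-ins for a `std::unordered_map` lookup; none of this is taken from the PR itself.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>

// Hypothetical fixed-size array: sized once and never grown, so none of the
// reallocation logic of std::vector is pulled into the binary.
template <typename T>
class DynArray {
 public:
  explicit DynArray(std::size_t size = 0)
      : data_(size ? new T[size]() : nullptr), size_(size) {}
  T& operator[](std::size_t i) { return data_[i]; }
  const T& operator[](std::size_t i) const { return data_[i]; }
  std::size_t size() const { return size_; }

 private:
  std::unique_ptr<T[]> data_;
  std::size_t size_;
};

// Hypothetical key/value entry for a small lookup table.
struct NamedIndex {
  const char* name;
  unsigned index;
};

// Linear scan instead of a hash map: O(n) per lookup, but very little code.
// Every entry of `table` must have a non-null name.
inline int FindByName(const DynArray<NamedIndex>& table, const char* name) {
  for (std::size_t i = 0; i < table.size(); ++i) {
    if (std::strcmp(table[i].name, name) == 0) {
      return static_cast<int>(table[i].index);
    }
  }
  return -1;  // not found
}
```

A linear scan is usually acceptable when the table only holds a handful of entries, which is the trade-off this kind of replacement relies on.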

@ajtulloch ajtulloch changed the title [RFC] [Contrib] Minimal runtime (~12kb .text on ARMv7/x86) for subset of TVM models [RFC] [Contrib] [Runtime] Minimal runtime (~12kb .text on ARMv7/x86) for subset of TVM models Jul 18, 2019
@tqchen
Member

tqchen commented Jul 18, 2019

This is a great step toward putting TVM into more resource-constrained devices. Although we have another effort (uTVM, @weberlo) that aims to enable automatic optimizations, we still lack a minimal runtime that we can deploy on the device.

This PR is a big step in that direction. One thing we can try to do is to consolidate it with uTVM and put it under the tvm/runtime/micro namespace later.

A fun challenge would be to iterate further and remove most dependencies on the OS (mainly alloc) so we can really run it on bare-metal devices.

@ajtulloch
Contributor Author

@tqchen yes absolutely - from talking to you yesterday I hadn't thought of the uTVM application, but it certainly could be interesting. One possible improvement in that direction could be to create a mmap'able representation of the parsed graph_json, i.e. these fields of MinimalGraphRuntime:

  DynArray<Node> nodes_;
  DynArray<uint32_t> input_nodes_;
  DynArray<uint32_t> node_row_ptr_;
  DynArray<NodeEntry> outputs_;

which would allow us to construct the GraphRuntime without allocations (and eliminate the code-size cost of the JSON parser). The remaining allocations would then be the NDArray tensor allocations themselves, which could be handled via a static storage plan or similar?
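
As a rough illustration of what a mmap'able representation could look like, the sketch below lays out two of the arrays mentioned above behind a small header and builds non-owning views over them without allocating. The header layout, the magic value, and the names `FlatGraphHeader`, `U32Span`, `FlatGraphView`, and `ViewFlatGraph` are all assumptions made for this example; the `Node` and `NodeEntry` arrays are left out because their layout is not shown here.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical blob header; field names mirror the members above, but the
// layout itself is an assumption for illustration only.
struct FlatGraphHeader {
  std::uint32_t magic;             // sanity check for the blob
  std::uint32_t num_input_nodes;   // length of the input_nodes array
  std::uint32_t num_node_row_ptr;  // length of the node_row_ptr array
};

// Non-owning view over a uint32_t array that lives inside the blob.
struct U32Span {
  const std::uint32_t* data;
  std::uint32_t size;
};

struct FlatGraphView {
  U32Span input_nodes;
  U32Span node_row_ptr;
};

constexpr std::uint32_t kFlatGraphMagic = 0x47524148u;

// Builds a view over `blob` without allocating. Returns false if the blob is
// too small or has the wrong magic. `blob` must be 4-byte aligned, which
// holds for page-aligned mmap'ed memory.
inline bool ViewFlatGraph(const void* blob, std::size_t len, FlatGraphView* out) {
  if (len < sizeof(FlatGraphHeader)) return false;
  FlatGraphHeader h;
  std::memcpy(&h, blob, sizeof(h));
  if (h.magic != kFlatGraphMagic) return false;
  const std::size_t count =
      static_cast<std::size_t>(h.num_input_nodes) + h.num_node_row_ptr;
  if ((len - sizeof(h)) / sizeof(std::uint32_t) < count) return false;
  const std::uint32_t* arrays = reinterpret_cast<const std::uint32_t*>(
      static_cast<const char*>(blob) + sizeof(h));
  out->input_nodes = {arrays, h.num_input_nodes};
  out->node_row_ptr = {arrays + h.num_input_nodes, h.num_node_row_ptr};
  return true;
}
```

Because the view only stores pointers into the blob, the blob has to outlive the view; a file mapped for the lifetime of the process or a constant array in flash would both satisfy that.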

@tqchen
Member

tqchen commented Jul 18, 2019

Most microcontrollers do have stacks (heaps); we just need to pre-define a section in the memory space and implement an arena-style allocator (always allocate, without de-allocation) that recycles all memory at an RAII point.
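
One way to picture the arena idea is a bump allocator over a caller-provided region, paired with an RAII marker that rewinds the arena when it goes out of scope. This is only a sketch under stated assumptions: `Arena` and `ArenaScope` are hypothetical names, and the region passed in is assumed to be suitably aligned already.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical bump allocator over a pre-defined memory region.
// Assumes `base` is aligned to alignof(std::max_align_t).
class Arena {
 public:
  Arena(void* base, std::size_t capacity)
      : base_(static_cast<std::uint8_t*>(base)), capacity_(capacity), used_(0) {}

  // Returns `size` bytes aligned to `align` (a power of two), or nullptr if
  // the region is exhausted. There is no per-object free.
  void* Allocate(std::size_t size,
                 std::size_t align = alignof(std::max_align_t)) {
    const std::size_t start = (used_ + align - 1) & ~(align - 1);
    if (start > capacity_ || size > capacity_ - start) return nullptr;
    used_ = start + size;
    return base_ + start;
  }

  std::size_t used() const { return used_; }

 private:
  friend class ArenaScope;
  std::uint8_t* base_;
  std::size_t capacity_;
  std::size_t used_;
};

// RAII marker: everything allocated while the scope is alive is recycled in
// one step when the scope is destroyed.
class ArenaScope {
 public:
  explicit ArenaScope(Arena* arena) : arena_(arena), mark_(arena->used_) {}
  ~ArenaScope() { arena_->used_ = mark_; }
  ArenaScope(const ArenaScope&) = delete;
  ArenaScope& operator=(const ArenaScope&) = delete;

 private:
  Arena* arena_;
  std::size_t mark_;
};
```

Since memory is only ever handed out in increasing order and released in bulk, this scheme cannot fragment the region, at the cost of not being able to free individual objects early.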

@tmoreau89
Contributor

tmoreau89 commented Jul 18, 2019

+ 1 on getting this integrated with uTVM. @weberlo, care to take a look at this PR and make some high-level comments?

Contributor

@tmoreau89 tmoreau89 left a comment

This is really cool work. I wonder if we could also provide a simple step-by-step guide to deploying a simple model on an ARMv7 device with this minimal runtime. It would certainly help bring people up to speed on using this runtime on their edge devices.

@mshawcroft
Contributor

This looks great. As mentioned above it potentially fits well with uTVM. For use with uTVM it would be useful to have this runtime or a derivative built in C rather than C++ in order to be deployable to the various embedded environments out there that don't have C++ runtime / tooling support.

@ajtulloch
Contributor Author

> This looks great. As mentioned above it potentially fits well with uTVM. For use with uTVM it would be useful to have this runtime or a derivative built in C rather than C++ in order to be deployable to the various embedded environments out there that don't have C++ runtime / tooling support.

@mshawcroft oh interesting - yeah, I started off with a pure C API (https://github.com/dmlc/tvm/pull/3567/files#diff-cf8621d821243d3ba906f0d9154abcea), but internally it's implemented with C++ (although it's deliberately designed to be compiled with -fno-rtti, -fno-exceptions, etc) - is the constraint that any use of C++ makes this unsuitable for embedded environments?
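
For readers unfamiliar with the pattern described here (a pure C API backed by a C++ implementation that avoids RTTI and exceptions), a minimal sketch follows. The `ExampleRuntime*` names are hypothetical and are not the API in this PR. In a real project the declarations would live in a header wrapped in `#ifdef __cplusplus` so that C callers can include it.

```cpp
#include <new>

// ---- Public interface: plain C declarations (all names hypothetical). ----
extern "C" {
typedef struct ExampleRuntime ExampleRuntime;  // opaque handle for C callers
ExampleRuntime* ExampleRuntimeCreate(int num_outputs);
int ExampleRuntimeNumOutputs(const ExampleRuntime* rt);
void ExampleRuntimeDestroy(ExampleRuntime* rt);
}  // extern "C"

// ---- Implementation: C++ that uses neither RTTI nor exceptions. ----
struct ExampleRuntime {
  explicit ExampleRuntime(int n) : num_outputs(n) {}
  int num_outputs;
};

extern "C" ExampleRuntime* ExampleRuntimeCreate(int num_outputs) {
  if (num_outputs < 0) return nullptr;  // errors are reported by return value
  // nothrow new returns nullptr on failure instead of throwing.
  return new (std::nothrow) ExampleRuntime(num_outputs);
}

extern "C" int ExampleRuntimeNumOutputs(const ExampleRuntime* rt) {
  return rt ? rt->num_outputs : -1;
}

extern "C" void ExampleRuntimeDestroy(ExampleRuntime* rt) { delete rt; }
```

Note that this sketch still relies on `new`/`delete`, so it addresses only the language-surface concern and not the heap-usage concern raised elsewhere in this thread.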

@mshawcroft
Contributor

@ajtulloch the situation is not black and white. At one end of the scale is pure C; at the other end is C++ using the standard C++ libraries and all the language bells and whistles; in the middle is a bunch of intermediate restricted subsets of C++ with arbitrary subsets of the C++ std library. The broadest reach and lowest friction for potential users is at the C end of the scale. Aside from the language subset used, other issues are the availability (and size!) of the std C++ library on a platform, and the memory management strategy used (at the small end, memory fragmentation kills you, hence arbitrary use of the heap is undesirable). By way of example, last time I checked on Zephyr RTOS, their C++ application support was broadly: no use of new/delete, no RTTI, no exceptions, no static global object destruction... (note that the new/delete ban has a significant impact on the std C++ library available!) Other RTOS environments are richer, others are more constrained.

There is a limited cost to the tvm community to provide a 'C' runtime rather than a C++ runtime, but doing so broadens tvm's reach.

BTW... I'm really excited to see all the current activity in the uTVM, small runtime, embedded space... ;-)

@tqchen
Member

tqchen commented Jul 18, 2019

To summarize some of the points.

  • No new/delete, but allow the use of custom allocators that do arena-like allocations.
  • C++ is fine, templates are fine, but maybe no STL.

The arena-style allocator may be fine for most of our cases. The idea is that we always allocate and de-allocate in bulk. This allows us to keep most of the allocations in a single user-defined stack on a memory region.

void MyApp() {
  // RAII: everything allocated within this function only takes space from
  // the context, which de-allocates that space when MyApp exits.
  tvm::micro::AllocatorContext ctx;
}

A slight variation would be to have the allocator remember the number of objects it has allocated so far in the current context. When we call free, it only decreases the counter, and we recycle everything when the counter goes to zero. This should work for most cases we care about (where the allocation/free pattern is like a stack).
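
The counter-based variation can be sketched as follows. `CountingArena` is a hypothetical name chosen for this example, and the region passed in is assumed to be aligned to `alignof(std::max_align_t)`; it is a sketch of the idea rather than a proposed implementation.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical arena that counts live allocations. Free() does not return
// memory by itself; the whole region is recycled once the count hits zero.
class CountingArena {
 public:
  CountingArena(void* base, std::size_t capacity)
      : base_(static_cast<std::uint8_t*>(base)), capacity_(capacity) {}

  // Bump-allocates `size` bytes, rounded up so the next block stays aligned.
  void* Allocate(std::size_t size) {
    constexpr std::size_t kAlign = alignof(std::max_align_t);
    const std::size_t rounded = (size + kAlign - 1) & ~(kAlign - 1);
    if (rounded < size || rounded > capacity_ - used_) return nullptr;
    void* p = base_ + used_;
    used_ += rounded;
    ++live_;
    return p;
  }

  // Only decrements the live counter; when it reaches zero, the entire
  // region becomes available again.
  void Free(void* p) {
    if (p == nullptr || live_ == 0) return;
    if (--live_ == 0) used_ = 0;
  }

  std::size_t used() const { return used_; }

 private:
  std::uint8_t* base_;
  std::size_t capacity_;
  std::size_t used_ = 0;
  std::size_t live_ = 0;
};
```

The catch is that a single long-lived allocation keeps the whole region pinned, which is why this works best when allocations and frees follow a stack-like pattern.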

@tqchen
Member

tqchen commented Jul 19, 2019

Given the current discussions, perhaps we can decide on the naming and make a few improvements, if you feel you can push some of them in a few days. Then we merge it in.

In terms of naming and code location, given the relation to uTVM, we could think about a good name for the minimal runtime. One example is "src/runtime/micro/standalone"; perhaps @mshawcroft @ajtulloch @weberlo have better ideas.

@weberlo
Contributor

weberlo commented Jul 20, 2019

@ajtulloch Awesome work on this! We'll need a runtime for uTVM when we want to try self-hosted models, so the timing on this is great.

My general understanding is that it's much more common for bare-metal devices to support C, so it'd be interesting to see if we could incrementally whittle this down to pure C, like @mshawcroft said. Even if not, this would be a nice bonus for users targeting devices that do have C++ support.

If we want to merge this into the µTVM namespace, src/runtime/micro/standalone seems fine. But since this is code that would be loaded onto the device, we could also put it in src/runtime/micro/device/standalone. Then we could move the current runtime in device into its own subfolder device/host_driven (or we could name it something else).

@tqchen
Member

tqchen commented Jul 22, 2019

To make this PR actionable, @ajtulloch can you decide on the name space choices, make the changes, fix the CI and let us merge it in?

@ajtulloch
Contributor Author

OK, so changes planned are:

  • Move this to src/runtime/micro/standalone
  • Rename flag from MINIMAL_RUNTIME to MICRO_STANDALONE_RUNTIME
  • Fix CI

Will work on it right now, thank you folks.

@ajtulloch ajtulloch force-pushed the minimal-runtime branch 3 times, most recently from ab32dfa to 4ec5d16 Compare July 22, 2019 23:32
@ajtulloch
Contributor Author

@tqchen does this look good to you?

Member

@tqchen tqchen left a comment

Mostly high-level naming convention changes. The overall code looks good.

* under the License.
*/

#include "minimalruntime_api.h"
Member

utvm_runtime_api.cc

* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
Member

utvm_runtime_standalone_test.cc

@tqchen
Member

tqchen commented Jul 23, 2019

}
}

void parseAttrs(const picojson::object& jattr, GraphAttr* attr) {
Member

ParseAttrs (to be consistent with Google C style)

void* lib_handle_{nullptr};
};

struct GraphAttr {
Member

Document each struct, field, and function.

@mshawcroft
Contributor

> @mshawcroft @weberlo @tmoreau89 please help to review if you have time and https://docs.tvm.ai/contribute/code_review.html#approve-and-request-changes-explicitly

So I've not had the time to study the code in detail, sorry. I would like to, but it won't happen this week. Skimming the code does raise one immediate question:

Are we sure the memory management policy implemented in the module does not lead to fragmentation?

@ajtulloch ajtulloch force-pushed the minimal-runtime branch 4 times, most recently from b6e941a to dd6f59e Compare July 23, 2019 22:18
@weberlo
Contributor

weberlo commented Jul 23, 2019

@ajtulloch Which models have you been able to run on this runtime so far?

@ajtulloch
Contributor Author

@weberlo e.g. CPU CNNs like mobilenet, resnet, etc. One thing not supported is e.g. tvm.extern, since we don't support packed funcs.

* under the License.
*/

#pragma once
Contributor

I don't know if we allow #pragma once, for compatibility reasons. I hope I'm mistaken, because header guards are gross.

Member

Let us still use header guards, as per Google C style.

Member

@tqchen tqchen left a comment

some final nits

* under the License.
*/

#pragma once
Member

Let us still use header guard macro as per Google C style

Member

RUNTIME_MICRO_STANDALONE_MINIMAL_VECTOR_H_

@tqchen
Member

tqchen commented Jul 25, 2019

@antinucleon @weberlo please https://docs.tvm.ai/contribute/code_review.html#approve-and-request-changes-explicitly

To be clear, the current set of changes does not yet meet the no-std requirement. It still depends on new/malloc, etc. Further refactoring will be necessary to make sure that the uTVM standalone runtime takes in a pre-allocated memory region and only uses memory from that region to allocate most of the executables.

@tqchen
Member

tqchen commented Jul 27, 2019

@ajtulloch can you act on the final comments and let us get it in:)

@ajtulloch
Contributor Author

Will do today @tqchen, my bad.

@ajtulloch ajtulloch force-pushed the minimal-runtime branch 4 times, most recently from b80a33d to 4a7c3f6 Compare July 29, 2019 21:42
@tqchen
Member

tqchen commented Aug 1, 2019

@ajtulloch please look into the CI error and see if we can fix it.

@ajtulloch
Contributor Author

@tqchen sure, will do on the weekend.

@tqchen
Member

tqchen commented Aug 12, 2019

ping @ajtulloch

@tqchen tqchen merged commit 1de52bb into apache:master Sep 12, 2019
@tqchen
Member

tqchen commented Sep 12, 2019

Thanks @ajtulloch @weberlo @antinucleon @mshawcroft, this PR is now merged

wweic pushed a commit to wweic/tvm that referenced this pull request Sep 16, 2019
wweic pushed a commit to neo-ai/tvm that referenced this pull request Sep 16, 2019