[Doc][Relay] Add VM doc #3188

Merged
merged 16 commits into from
May 27, 2019
318 changes: 318 additions & 0 deletions docs/dev/virtual_machine.rst
@@ -0,0 +1,318 @@
.. Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

.. http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.

Putting the VM in TVM: The Relay Virtual Machine
================================================

Relay, a new program representation, has enabled the representation and optimization of
a greater breadth of machine learning programs.
Unfortunately, by supporting a more expressive set of programs, we have
introduced several new execution challenges.

Relay's “debug” interpreter can execute the full language but has notable limitations
that make it unsuited for production deployments. It is structured as an
interpreter that performs AST traversal to execute the program. This approach is conceptually
simple but requires a traversal of the program for each evaluation. Because the program is stored as a
tree, execution makes heavy use of indirection and is inefficient.

There are still open challenges around dynamism, such as dynamic scheduling and allocation,
fully dynamic tensor shapes, and control flow. The interpreter offers simple solutions
for these, but none of them is compelling or optimized.

The second execution mechanism is the existing graph runtime. In order to target Relay
programs to it, we translate a small subset of them to the old graph format and execute
them on the runtime.
This provides a solid execution experience but only for a very limited subset of Relay programs.
Review comment (Contributor):
What is meant by "solid" here? Free from bugs and fast? I think the text should be specific about this.


An alternative but non-standard approach is Relay's ahead-of-time compiler,
which transforms a Relay program into a shared library containing an
ahead-of-time implementation. The ahead-of-time compiler provides compelling performance
but is difficult to extend and instrument, because doing so requires modifying the
code generation and optimizations.

The Relay virtual machine is intended to be a framework that balances these competing
approaches, providing a dynamic execution environment which can be extended, instrumented,
and integrated with other approaches like ahead-of-time compilation via a flexible extension
mechanism.

The virtual machine is designed to strike a balance between performance and flexibility
when deploying and executing Relay programs, without giving up the benefits of TVM.

Virtual machine (VM) design is a well-studied area in programming languages and systems,
and there have been various virtual machine designs for both full-fledged
and embedded programming languages.
Previous language VM designs have been heavily tailored to the execution profile of traditional programs.
Traditional programs manipulate small scalar values and consist of a large number of low-level instructions.
The sheer quantity of instructions requires instruction execution and dispatch to be extremely efficient.
In the context of machine learning, we manipulate primarily tensor values, using a (relatively)
low number of high-level instructions. The cost centers of ML programs are expensive operator invocations,
such as GEMM or convolution, over a large input. Due to the execution profile exhibited by ML programs,
micro-optimizations present in scalar VMs are dramatically less important.
A model’s runtime will be dominated by executing expensive operators on large inputs.

TVM has provided strong support for vision models,
but we want to grow to support a wider variety of models.
The graph runtime is able to utilize the fully static nature of the input graphs to perform
aggressive optimization such as fully static allocation and optimal memory reuse.
When we introduce models which make use of control flow, recursion, dynamic shapes, and dynamic
allocation, we must change how execution works.
Review comment (Contributor):
I think "we must change how execution works" is too vague. I think the sentence should reflect that we have to have mechanisms in our compilation tailored to accommodate these features, or something that is more clear about how we are changing the execution mechanisms.


The rest of this document provides a high-level overview of the Relay
virtual machine design and its instruction set.

Design
------

The VM's design is focused on simplicity without sacrificing performance.
In order to accomplish this, we have set aside traditional wisdom in scalar
VM design and focused on designing a tensor VM.

In the tensor VM setting, we optimize for cheap “allocation” of objects (by trying to avoid real allocation),
reuse of static fragments, and the ability to handle dynamic shapes (i.e. jagged tensors).

Instruction Set
~~~~~~~~~~~~~~~

The critical design choice of a VM is the instruction set and its representation.
The current representation of the instructions is a tagged union containing the op-code and the data payload.
An important design decision is the level of abstraction of the instructions (RISC vs. CISC) and how they
take their data (fixed-width instruction encoding vs. variable-length encoding). The current version is closer
to CISC, with complex instructions like AllocTensor, and is variable-length due to the inclusion of the shape
as part of the instruction. The current instruction set is very high-level and corresponds roughly to
high-level operations in Relay.
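
As a rough illustration of the tagged-union representation described above, the following minimal sketch pairs an op-code with a payload selected by that op-code. All names here (`Opcode`, `RetArgs`, `MakeIf`, and so on) are hypothetical and cover only a few fixed-size instructions; the actual definitions, including variable-length instructions such as AllocTensor, live in `vm.h` and may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical register name type; an assumption for this sketch.
using RegName = std::int64_t;

// The tag of the union: one entry per instruction kind in this sketch.
enum class Opcode { Ret, Goto, If, LoadConst };

// One payload struct per instruction, mirroring the argument lists
// documented below.
struct RetArgs { RegName result; };
struct GotoArgs { std::size_t pc_offset; };
struct IfArgs {
  RegName if_cond;
  std::size_t true_offset;
  std::size_t false_offset;
};
struct LoadConstArgs { std::size_t const_index; };

struct Instruction {
  Opcode op;    // selects which union member is valid
  RegName dst;  // destination register; unused by Goto and If
  union {
    RetArgs ret;
    GotoArgs jump;
    IfArgs branch;
    LoadConstArgs load_const;
  };
};

// Build an If instruction; only the `branch` member is meaningful afterwards.
Instruction MakeIf(RegName cond, std::size_t true_offset, std::size_t false_offset) {
  Instruction instr;
  instr.op = Opcode::If;
  instr.dst = -1;
  instr.branch = IfArgs{cond, true_offset, false_offset};
  return instr;
}

// Build a Ret instruction; only the `ret` member is meaningful afterwards.
Instruction MakeRet(RegName dst, RegName result) {
  Instruction instr;
  instr.op = Opcode::Ret;
  instr.dst = dst;
  instr.ret = RetArgs{result};
  return instr;
}
```

A consumer of such an instruction is expected to inspect `op` first and read only the matching payload member.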

Ret
^^^
**Arguments**:
::

  RegName dst
  RegName result

Returns the object in register "result" to the caller's register "dst".

InvokePacked
^^^^^^^^^^^^
**Arguments**:
::

  size_t packed_index
  size_t arity
  size_t output_size
  RegName* packed_args

Invoke the packed function denoted by `packed_index`. The `arity`
and `output_size` are used to inform the VM how many inputs and
outputs to expect. `packed_args` stores the list of argument registers.

AllocTensor
^^^^^^^^^^^
**Arguments**:
::

  RegName dst
  RegName shape_register
  size_t ndim
  DLDataType dtype

Allocate a tensor value of the appropriate shape (stored in `shape_register`) and `dtype`. The result
is saved to register `dst`.
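
To make the role of the shape and dtype concrete, the sketch below computes how many bytes such an allocation would need. It is only an illustration under stated assumptions: `DataType` is a hypothetical stand-in that mirrors the `code`, `bits`, and `lanes` fields of `DLDataType`, and `TensorBytes` is an assumed helper, not a function of the VM.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in mirroring the fields of DLDataType.
struct DataType {
  std::uint8_t code;    // type code (e.g. int, uint, float)
  std::uint8_t bits;    // bits per lane
  std::uint16_t lanes;  // number of lanes (1 for scalars)
};

// Number of bytes needed for a dense tensor with the given shape and dtype.
// An empty shape denotes a scalar, which holds exactly one element.
std::size_t TensorBytes(const std::vector<std::int64_t>& shape, DataType dtype) {
  std::size_t num_elements = 1;
  for (std::int64_t dim : shape) {
    num_elements *= static_cast<std::size_t>(dim);
  }
  // Round the element width up to a whole number of bytes.
  std::size_t bytes_per_element =
      (static_cast<std::size_t>(dtype.bits) * dtype.lanes + 7) / 8;
  return num_elements * bytes_per_element;
}
```

Because the shape is read from a register at run time, a computation like this cannot in general be folded away ahead of time, unlike the fully static allocation used by the graph runtime.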

AllocDatatype
^^^^^^^^^^^^^
**Arguments**:
::

  RegName dst
  size_t tag
  size_t num_fields
  RegName* datatype_fields

Allocate a data type with the tag `tag` using the `num_fields` entries
from registers `datatype_fields`. The result is saved to register `dst`.

AllocClosure
^^^^^^^^^^^^
**Arguments**:
::

  RegName dst
  size_t clo_index
  size_t num_freevar
  RegName* free_vars

Allocate a closure with the VMFunction at `clo_index` as
its code, and the `num_freevar` entries from registers in
`free_vars`. The result is saved to register `dst`.

GetField
^^^^^^^^
**Arguments**:
::

  RegName dst
  RegName object
  size_t field_index

Get the field value with index `field_index` from `object` and save the result to register `dst`.

If
^^
**Arguments**:
::

  RegName if_cond
  size_t true_offset
  size_t false_offset

Check if the object at register `if_cond` is `true` or `false`.
If true, relative jump by `true_offset`; otherwise, relative
jump by `false_offset`.

Goto
^^^^
**Arguments**:
::

  size_t pc_offset

Relative unconditional jump by `pc_offset`.

Invoke
^^^^^^
**Arguments**:
::

  size_t func_index

Invoke the function at `func_index`, consuming the number of arguments contained in the VMFunction's
arity field.

InvokeClosure
^^^^^^^^^^^^^
**Arguments**:
::

  RegName closure
  size_t closure_args_num
  RegName* closure_args

Invoke `closure`, consuming the number of arguments declared in the closure's VMFunction.

LoadConst
^^^^^^^^^
**Arguments**:
::

  RegName dst
  size_t const_index

Load the constant at `const_index` from the constant pool. The result is saved to register `dst`.

Object Representation
~~~~~~~~~~~~~~~~~~~~~
We use a simple object representation that uses shared pointers and tagging.
There is a huge space of possible object representations and trade-offs, but we
believe micro-optimizing this code has little to no effect on end-to-end performance.

::

  struct ObjectCell {
    ObjectTag tag;
    ...
  };

  struct Object {
    std::shared_ptr<ObjectCell> ptr;
    ...
  };

See `vm.h` for more details.

Currently we support 3 types of objects: tensors, data types, and closures.

::

  VMObject VMTensor(const tvm::runtime::NDArray& data);
  VMObject VMDatatype(size_t tag, const std::vector<VMObject>& fields);
  VMObject VMClosure(size_t func_index, std::vector<VMObject> free_vars);
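
The following is a minimal, self-contained sketch of how a tagged, shared-pointer object representation along these lines could work. It is not the code from `vm.h`: the cell types, the `MakeTensor` and `MakeDatatype` helpers, and the use of `std::vector<float>` in place of an `NDArray` are all assumptions made for illustration, and closures are omitted for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

enum class ObjectTag { kTensor, kDatatype, kClosure };

// Base cell: every heap object carries a tag identifying its kind.
struct ObjectCell {
  ObjectTag tag;
  explicit ObjectCell(ObjectTag t) : tag(t) {}
  virtual ~ObjectCell() = default;
};

// An Object is a cheap, reference-counted handle to a cell.
struct Object {
  std::shared_ptr<ObjectCell> ptr;
};

// Placeholder tensor cell; a real VM would hold an NDArray here.
struct TensorCell : ObjectCell {
  std::vector<float> data;
  explicit TensorCell(std::vector<float> d)
      : ObjectCell(ObjectTag::kTensor), data(std::move(d)) {}
};

// Data type cell: a constructor tag plus its fields.
struct DatatypeCell : ObjectCell {
  std::size_t constructor_tag;
  std::vector<Object> fields;
  DatatypeCell(std::size_t t, std::vector<Object> f)
      : ObjectCell(ObjectTag::kDatatype), constructor_tag(t), fields(std::move(f)) {}
};

Object MakeTensor(std::vector<float> data) {
  return Object{std::make_shared<TensorCell>(std::move(data))};
}

Object MakeDatatype(std::size_t tag, std::vector<Object> fields) {
  return Object{std::make_shared<DatatypeCell>(tag, std::move(fields))};
}

// Mirrors the GetField instruction: check the tag, then index the fields.
Object GetField(const Object& object, std::size_t field_index) {
  assert(object.ptr->tag == ObjectTag::kDatatype);
  const auto* cell = static_cast<const DatatypeCell*>(object.ptr.get());
  return cell->fields.at(field_index);
}
```

Copying an `Object` only bumps a reference count, so storing the same tensor in several fields or registers does not copy the underlying data.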


Stack and State
~~~~~~~~~~~~~~~

The Relay VM maintains a frame stack, which contains information about how to resume the
previous call. Registers are allocated in a contiguous space (a virtual register file) for each function.

We keep track of the Relay functions we have called, a pointer into the current function's bytecode, and an offset into that bytecode (known as the program counter).

::

  struct VirtualMachine {
    ...
    std::vector<VMFrame> frames;
    ...
    // Current function.
    size_t func_index;
    // Pointer into the current function's instructions.
    const Instruction* code;
    // Current program counter relative to the code pointer.
    size_t pc;
    // The current base pointer.
    size_t bp;
    ...
  };
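
To illustrate how a frame stack can record what is needed to resume the caller, here is a small sketch. It is an assumption-laden simplification, not the layout used in `vm.h`: `Frame`, `MiniVM`, `PushFrame`, and `PopFrame` are hypothetical names, registers hold plain integers instead of objects, and each function gets its own register vector rather than a window into one shared register file.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Everything needed to resume the caller after the callee returns.
struct Frame {
  std::size_t return_pc;                // caller's pc after the call
  std::size_t func_index;               // caller's function
  std::vector<std::int64_t> registers;  // caller's register file
  std::size_t dst;                      // caller register receiving the result
};

struct MiniVM {
  std::vector<Frame> frames;
  std::size_t func_index = 0;
  std::size_t pc = 0;
  std::vector<std::int64_t> regs;

  // Save the caller's state and switch to a fresh register file for the callee.
  void PushFrame(std::size_t callee, std::size_t callee_registers, std::size_t dst) {
    frames.push_back(Frame{pc + 1, func_index, std::move(regs), dst});
    func_index = callee;
    pc = 0;
    regs.assign(callee_registers, 0);
  }

  // Restore the caller's state and write the result into its `dst` register,
  // which is what the Ret instruction does.
  void PopFrame(std::int64_t result) {
    Frame frame = std::move(frames.back());
    frames.pop_back();
    func_index = frame.func_index;
    pc = frame.return_pc;
    regs = std::move(frame.registers);
    regs[frame.dst] = result;
  }
};
```

In this sketch an invocation pushes a frame and a `Ret` pops it, so the depth of `frames` always equals the current call depth.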


Dispatch Loop
~~~~~~~~~~~~~
A critical piece of a VM is the dispatch loop. The dispatch loop usually dominates the execution time of a
virtual machine, but we have found experimentally that the performance of the loop is not of much importance here.
We have implemented a simple `switch`/`goto` dispatch loop which dispatches based on the instruction op-code.

This loop is implemented by `VirtualMachine::Run()`.

We believe that this code is not as important to end-to-end performance as allocation
and memory reuse.
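
A switch-based dispatch loop of the kind described above can be sketched as follows. This is not `VirtualMachine::Run()` itself: it is a toy interpreter over integer registers that handles only LoadConst, If, Goto, and Ret, with a flattened hypothetical `Instruction` layout and helper names (`Run`, `ExampleProgram`) invented for this example. Offsets are unsigned, so jumps in this sketch only go forward.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Opcode { LoadConst, If, Goto, Ret };

// Flattened operands; their meaning depends on the op-code:
//   LoadConst: a = dst register, b = const_index
//   If:        a = condition register, b = true_offset, c = false_offset
//   Goto:      a = pc_offset
//   Ret:       a = result register
struct Instruction {
  Opcode op;
  std::size_t a;
  std::size_t b;
  std::size_t c;
};

std::int64_t Run(const std::vector<Instruction>& code,
                 const std::vector<std::int64_t>& constants,
                 std::size_t num_registers) {
  std::vector<std::int64_t> regs(num_registers, 0);
  std::size_t pc = 0;
  while (pc < code.size()) {
    const Instruction& instr = code[pc];
    switch (instr.op) {
      case Opcode::LoadConst:
        regs[instr.a] = constants[instr.b];
        ++pc;
        break;
      case Opcode::If:
        // Relative jump by true_offset or false_offset.
        pc += (regs[instr.a] != 0) ? instr.b : instr.c;
        break;
      case Opcode::Goto:
        pc += instr.a;  // relative unconditional jump
        break;
      case Opcode::Ret:
        return regs[instr.a];
    }
  }
  return 0;  // fell off the end without a Ret
}

// r0 = constants[0]; r1 = r0 ? constants[1] : constants[2]; return r1
std::vector<Instruction> ExampleProgram() {
  return {
      {Opcode::LoadConst, 0, 0, 0},  // pc 0: r0 <- constants[0]
      {Opcode::If, 0, 1, 3},         // pc 1: true -> pc 2, false -> pc 4
      {Opcode::LoadConst, 1, 1, 0},  // pc 2: r1 <- constants[1]
      {Opcode::Goto, 2, 0, 0},       // pc 3: jump to pc 5
      {Opcode::LoadConst, 1, 2, 0},  // pc 4: r1 <- constants[2]
      {Opcode::Ret, 1, 0, 0},        // pc 5: return r1
  };
}
```

Each iteration does a constant amount of bookkeeping, which is negligible next to a single expensive operator invocation; this is why the loop itself is not the bottleneck for tensor programs.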

VM Compiler
~~~~~~~~~~~

An important part of this infrastructure is a compiler from Relay's full IR into a sequence of bytecode.
The VM compiler transforms a `tvm::relay::Module` into a `tvm::relay::vm::VirtualMachine`. The virtual
machine contains a set of compiled functions, each represented as a `tvm::relay::vm::Function`. Each function contains metadata about the function as well as its compiled bytecode. For full definitions of the data structures see `vm.h`.

Optimizations
~~~~~~~~~~~~~

There are quite a few optimizations required by the VM compiler.

We have implemented them in the old pass style, but plan to port them to
the new pass manager (#2546) before merging.

- A-Normal Form
- Lambda Lift (see `src/relay/vm/lambda_lift.cc`)
- Inline Primitives (see `src/relay/vm/inline_primitives.cc`)
- Inliner (see `src/relay/pass/inliner.cc`)
- Tail Call Optimization (see ...)
- Constant Pool Layout (see ...)
- ADT Tag Allocation (see ...)
- Liveness Analysis (see ...)

Serialization
~~~~~~~~~~~~~

A final and yet-to-be-implemented part of the VM design is serialization. The accompanying PR will introduce the bytecode and its serialization, as well as VM-level serialization. The idea is that a VM can be efficiently stored to disk and resumed at a later time. This would also allow us to efficiently schedule many models onto a single machine in order to obtain good utilization.

Unresolved Questions
~~~~~~~~~~~~~~~~~~~~

How do we handle dynamic shapes?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I have another prototype extension to Relay which adds initial support for compiling and executing programs containing fully dynamic shapes. I will post an RFC and prototype PR on this subject soon.
Review comment (Contributor):
Do we want such design discussion to be in the doc? It sounds like it would quickly become outdated (and docs don't tend to be updated that frequently). The same goes for the above remarks on serialization.

Reply (Contributor Author):
@slyubomirsky good catch! I'll probably add a TODO section here since we haven't finalized on dynamic shape yet.

How can we modify the VM to support JIT compilation of certain code paths?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In the code generation space there are still many tradeoffs to be analyzed and the VM is designed
to be very flexible so we can modify it for future experiments.

How do we support heterogeneous execution?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Heterogeneous execution should work out of the box assuming we have annotated the appropriate device copies.
In order to do this properly we need to run the device annotation and copying passes. We foresee nothing too complex in this work.
Review comment (Contributor):
I think the last sentence would be better left unsaid.