From f3b4c80e3f8574f46f74a929bf6628ef38b96c74 Mon Sep 17 00:00:00 2001
From: Wei Chen
Date: Tue, 28 May 2019 06:24:54 +0800
Subject: [PATCH] [Doc][Relay] Add VM doc (#3188)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* [Doc][Relay] Add VM doc

* Add Apache header

* Apply suggestions from code review

Co-Authored-By: Steven S. Lyubomirsky
Co-Authored-By: 雾雨魔理沙
Co-Authored-By: Logan Weber <36520469+weberlo@users.noreply.github.com>
Co-Authored-By: Zhi <5145158+zhiics@users.noreply.github.com>

* Junru's comment

* More fix

* More fix

* More fix

* last fix

* Apply suggestions from code review

Co-Authored-By: 雾雨魔理沙

* Apply suggestions from code review

Co-Authored-By: Logan Weber <36520469+weberlo@users.noreply.github.com>

* Add code links

* Remove unused bp

* Update docs/dev/virtual_machine.rst

Co-Authored-By: Logan Weber <36520469+weberlo@users.noreply.github.com>

* Explain TODO

* Yong's comment

Co-Authored-By: Yong Wu <55wuyong@163.com>

* Comment
---
 docs/dev/virtual_machine.rst | 314 +++++++++++++++++++++++++++++++++++
 1 file changed, 314 insertions(+)
 create mode 100644 docs/dev/virtual_machine.rst

diff --git a/docs/dev/virtual_machine.rst b/docs/dev/virtual_machine.rst
new file mode 100644
index 000000000000..a59620a0a861
--- /dev/null
+++ b/docs/dev/virtual_machine.rst
@@ -0,0 +1,314 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+   or more contributor license agreements. See the NOTICE file
+   distributed with this work for additional information
+   regarding copyright ownership. The ASF licenses this file
+   to you under the Apache License, Version 2.0 (the
+   "License"); you may not use this file except in compliance
+   with the License. You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+   software distributed under the License is distributed on an
+   "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+   KIND, either express or implied. See the License for the
+   specific language governing permissions and limitations
+   under the License.
+
+Putting the VM in TVM: The Relay Virtual Machine
+================================================
+
+Relay, a new program representation, has enabled the representation and optimization of
+a great breadth of machine learning programs.
+Unfortunately, by supporting a more expressive set of programs, we have
+introduced several new execution challenges.
+
+Relay's interpreter can execute the full language but has notable limitations
+that make it unsuited for production deployments. It executes programs by
+traversing the AST, an approach that is conceptually simple but inefficient,
+as the traversal relies heavily on indirection.
+
+There are further challenges in compiling dynamic code, such as dynamic scheduling and allocation,
+fully dynamic tensor shapes, and control flow. The interpreter offers simple solutions
+for these, but none is sufficiently compelling or optimized.
+
+The second execution mechanism is the existing graph runtime. In order to target Relay
+programs to it, we compile a small subset of them to the old graph format and execute
+them on the runtime. The graph runtime provides fast execution, but only for a very limited
+subset of Relay programs.
+
+An alternative, but non-standard, approach is Relay's ahead-of-time compiler,
+which compiles a Relay program into a shared library containing an
+ahead-of-time implementation. The ahead-of-time compiler provides compelling performance,
+but it is difficult to extend and instrument, since this can only be done by modifying the
+code generation and optimization mechanisms.
+
+The Relay virtual machine is intended to be a framework that balances these competing
+approaches, providing a dynamic execution environment which can be extended, instrumented,
+and integrated with other approaches like ahead-of-time compilation via a flexible extension
+mechanism.
+
+The virtual machine is designed to strike a balance between performance and flexibility
+when deploying and executing Relay programs, without giving up the benefits of TVM.
+
+Virtual machine (VM) design is a well-studied area in programming languages and systems,
+and there have been various virtual machine designs for both full-fledged
+and embedded programming languages.
+Previous language VM designs have been heavily tailored to the execution profile of traditional programs.
+Traditional programs manipulate small scalar values and consist of a large number of low-level instructions.
+The sheer quantity of instructions requires instruction execution and dispatch to be extremely efficient.
+In the context of machine learning we manipulate primarily tensor values, using a (relatively)
+low number of high-level instructions. ML programs' cost centers are expensive operator invocations,
+such as GEMM or convolution, over a large input. Due to the execution profile exhibited by ML programs,
+micro-optimizations present in scalar VMs are dramatically less important.
+
+TVM has provided strong support for vision models,
+but we want to grow to support a wider variety of models.
+The graph runtime is able to utilize the fully static nature of the input graphs to perform
+aggressive optimizations such as fully static allocation and optimal memory reuse.
+When we introduce models which make use of control flow, recursion, dynamic shapes, and dynamic
+allocation, we must change how execution works. A virtual machine for Relay is a natural choice.
+
+The rest of this document provides a high-level overview of the Relay
+virtual machine design and its instruction set.
+
+Design
+------
+
+The VM's design is focused on simplicity without sacrificing performance.
+In order to accomplish this, we have focused on designing a tensor VM rather than a scalar VM.
+
+In the tensor VM setting, we optimize for cheap “allocation” of objects (by trying to avoid real allocation),
+reuse of static fragments, and the ability to handle dynamic shapes (i.e., jagged tensors).
+
+Instruction Set
+~~~~~~~~~~~~~~~
+
+The choices of an instruction set and instruction representation are the most critical design decisions for a VM.
+The current representation of the instructions is a tagged union containing the op-code and the data payload.
+An important design decision is the level of abstraction of the instructions (RISC vs. CISC) and how they take
+their data (fixed-width instruction encoding vs. variable-length encoding). The current version is closer to CISC,
+with complex instructions like AllocTensor, and is variable-length due to the inclusion of the shape as part of
+the instruction. The current instruction set is very high-level and corresponds roughly to high-level operations in Relay.
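+
+To make this concrete, the sketch below shows one way such a tagged-union encoding can be written
+down. The names used here (`Opcode`, `RegName`, the payload structs, and the `MakeGoto` helper) are
+illustrative assumptions for exposition only, not TVM's actual definitions; see
+`include/tvm/runtime/vm.h` for the real data structures.
+
+::
+
+  #include <cstddef>
+  #include <cstdint>
+
+  using RegName = std::size_t;  // hypothetical register-name type
+  using Index = std::size_t;
+
+  enum class Opcode : std::uint8_t { Ret, Goto, If, AllocTensor, InvokePacked };
+
+  // Per-instruction payloads.
+  struct RetArgs { RegName result; };
+  struct GotoArgs { Index pc_offset; };
+  struct IfArgs { RegName if_cond; Index true_offset; Index false_offset; };
+  struct AllocTensorArgs { RegName shape_register; Index ndim; };
+
+  struct Instruction {
+    Opcode op;    // the tag: which instruction this is
+    RegName dst;  // destination register, for instructions that produce a value
+    union {       // the payload, interpreted according to `op`
+      RetArgs ret;
+      GotoArgs jump;
+      IfArgs branch;
+      AllocTensorArgs alloc_tensor;
+    };
+  };
+
+  // Example: build a relative unconditional jump by three instructions.
+  inline Instruction MakeGoto(Index pc_offset) {
+    Instruction instr{};
+    instr.op = Opcode::Goto;
+    instr.jump = GotoArgs{pc_offset};
+    return instr;
+  }
+
+Keeping the payload inline in a tagged union keeps dispatch on the op-code cheap while letting each
+instruction carry only the fields it needs.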
+
+Ret
+^^^
+**Arguments**:
+::
+  RegName dst
+  RegName result
+
+Returns the object in register `result` to the caller's register `dst`.
+
+InvokePacked
+^^^^^^^^^^^^
+**Arguments**:
+::
+  size_t packed_index
+  size_t arity
+  size_t output_size
+  RegName* packed_args
+
+Invoke the packed function denoted by `packed_index`. The `arity`
+and `output_size` are used to inform the VM how many inputs and
+outputs to expect. `packed_args` stores the list of argument registers.
+
+AllocTensor
+^^^^^^^^^^^
+**Arguments**:
+::
+  RegName dst
+  RegName shape_register
+  size_t ndim
+  DLDataType dtype
+
+Allocate a tensor value of the appropriate shape (stored in `shape_register`) and `dtype`. The result
+is saved to register `dst`.
+
+AllocDatatype
+^^^^^^^^^^^^^
+**Arguments**:
+::
+  RegName dst
+  size_t tag
+  size_t num_fields
+  RegName* datatype_fields
+
+Allocate a data type with the tag `tag` using the `num_fields` entries
+from registers `datatype_fields`. The result is saved to register `dst`.
+
+AllocClosure
+^^^^^^^^^^^^
+**Arguments**:
+::
+  RegName dst
+  size_t clo_index
+  size_t num_freevar
+  RegName* free_vars
+
+Allocate a closure with the VMFunction at `clo_index` as
+its code, and the `num_freevar` entries from registers in
+`free_vars`. The result is saved to register `dst`.
+
+GetField
+^^^^^^^^
+**Arguments**:
+::
+  RegName dst
+  RegName object
+  size_t field_index
+
+Get the field value at index `field_index` from `object`, and save the result to register `dst`.
+
+If
+^^
+**Arguments**:
+::
+  RegName if_cond
+  size_t true_offset
+  size_t false_offset
+
+Check if the object at register `if_cond` is `true` or `false`.
+If `true`, make a relative jump by `true_offset`; otherwise, make a relative
+jump by `false_offset`.
+
+Goto
+^^^^
+**Arguments**:
+::
+  size_t pc_offset
+
+Relative unconditional jump by `pc_offset`.
+
+Invoke
+^^^^^^
+**Arguments**:
+::
+  size_t func_index
+
+Invoke the function at `func_index`, consuming the number of arguments specified in the
+VMFunction's arity field.
+
+InvokeClosure
+^^^^^^^^^^^^^
+**Arguments**:
+::
+  RegName closure
+  size_t closure_args_num
+  RegName* closure_args
+
+Invokes `closure`, consuming the number of arguments declared in the closure's VMFunction.
+
+LoadConst
+^^^^^^^^^
+**Arguments**:
+::
+  RegName dst
+  size_t const_index
+
+Load the constant at `const_index` from the constant pool. The result is saved to register `dst`.
+
+Object Representation
+~~~~~~~~~~~~~~~~~~~~~
+We use a simple object representation that uses shared pointers and tagging.
+There is a huge space of possible trade-offs in object representation, but we
+believe micro-optimizing this code has little to no effect on the end-to-end performance.
+
+::
+
+  struct ObjectCell {
+    ObjectTag tag;
+    ...
+  };
+
+  struct Object {
+    std::shared_ptr<ObjectCell> ptr;
+    ...
+  };
+
+See `include/tvm/runtime/vm.h` for more details.
+
+Currently, we support 3 types of objects: tensors, data types, and closures.
+
+::
+
+  VMObject VMTensor(const tvm::runtime::NDArray& data);
+  VMObject VMDatatype(size_t tag, const std::vector<VMObject>& fields);
+  VMObject VMClosure(size_t func_index, std::vector<VMObject> free_vars);
+
+
+Stack and State
+~~~~~~~~~~~~~~~
+
+The Relay VM maintains a stack of frames, each of which contains information about how to resume the
+previous call. Registers are allocated in a contiguous space (virtual register file) for each function.
+
+We keep track of the set of Relay functions we have called, a pointer into the current function's
+bytecode, and an offset into that bytecode, known as the program counter.
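+
+For illustration, the information a single call frame needs to capture might look like the sketch
+below. The field names here are assumptions made for exposition, not the exact frame definition used
+by the VM; see `include/tvm/runtime/vm.h` for the actual data structure.
+
+::
+
+  #include <cstddef>
+  #include <vector>
+
+  using RegName = std::size_t;
+  using Index = std::size_t;
+
+  struct Instruction;  // the VM's instruction type, defined elsewhere
+  struct Object {};    // placeholder for the VM object handle described above
+
+  // One entry of the VM's call stack: everything needed to resume the caller.
+  struct Frame {
+    Index caller_func_index;            // which function to resume
+    const Instruction* caller_code;     // the caller's instruction stream
+    Index caller_pc;                    // where in that stream to continue
+    RegName return_register;            // caller register that receives the result
+    std::vector<Object> register_file;  // virtual registers of the callee
+  };
+
+The virtual machine then keeps this stack of frames alongside the current function, a pointer to its
+instructions, and the program counter.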
+::
+
+  struct VirtualMachine {
+    ...
+    std::vector<VMFrame> frames;
+    ...
+    // Current function.
+    size_t func_index;
+    // Pointer into the current function's instructions.
+    const Instruction* code;
+    // Current program counter relative to the code pointer.
+    size_t pc;
+    ...
+  };
+
+
+Dispatch Loop
+~~~~~~~~~~~~~
+A critical piece of a VM is the dispatch loop. The dispatch loop usually dominates the execution time of a
+virtual machine, but we have experimentally found this not to be the case for Relay. We have just implemented
+a simple `switch`/`goto` dispatch loop which dispatches based on the instruction op-code; a schematic sketch of
+such a loop is included at the end of this document.
+
+This loop is implemented by `VirtualMachine::Run()`.
+
+VM Compiler
+~~~~~~~~~~~
+
+An important part of this infrastructure is a compiler from Relay's full IR into a sequence of bytecode.
+The VM compiler transforms a `tvm::relay::Module` into a `tvm::relay::vm::VirtualMachine`. The virtual
+machine contains a set of compiled functions, each represented by a `tvm::relay::vm::Function`. The functions
+contain metadata about the function as well as its compiled bytecode. For full definitions of the data
+structures, see `vm.h`.
+
+Optimizations
+~~~~~~~~~~~~~
+
+There are quite a few optimizations required by the VM compiler.
+
+We have implemented them in the old pass style, but plan to port them to
+the new pass manager (#2546) before merging.
+
+Optimizations marked with `TODO` are not implemented yet.
+
+- A-Normal Form
+- Lambda Lift (see `src/relay/vm/lambda_lift.cc`)
+- Inline Primitives (see `src/relay/vm/inline_primitives.cc`)
+- Inliner (see `src/relay/pass/inliner.cc`)
+- Constant Pool Layout (see `src/relay/backend/vm/compiler.cc`)
+- ADT Tag Allocation (see `src/relay/backend/vm/compiler.cc`)
+- Tail Call Optimization (TODO)
+- Liveness Analysis (TODO)
+
+Serialization
+~~~~~~~~~~~~~
+
+A final and yet-to-be-implemented part of the VM design is serialization. The accompanying PR will introduce
+both the bytecode and its serialization, as well as VM-level serialization. The design premise is that a VM
+can be efficiently stored to disk and resumed at a later time. This would also allow us to efficiently
+schedule many models onto a single machine in order to obtain good utilization.
+
+Unresolved Questions
+~~~~~~~~~~~~~~~~~~~~
+
+How do we handle dynamic shapes?
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+TODO
+
+How can we modify the VM to support JIT compilation of certain code paths?
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+In the code generation space there are still many tradeoffs to be analyzed, and the VM is designed
+to be very flexible so that we can modify it for future experiments.
+
+How do we support heterogeneous execution?
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Heterogeneous execution should work out of the box assuming we have annotated the appropriate device copies.
+In order to do this properly we need to run the device annotation and copying passes.
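+
+As a closing illustration of the Dispatch Loop section above, the sketch below shows the shape of a
+`switch`-based dispatch loop. It reuses the hypothetical `Opcode`/`Instruction` encoding sketched in
+the Instruction Set section and is a schematic outline only, not the actual body of
+`VirtualMachine::Run()`.
+
+::
+
+  #include <vector>
+
+  // `Instruction`, `Opcode`, and `Index` are the illustrative types defined in the
+  // Instruction Set sketch above.
+  void Run(const std::vector<Instruction>& code) {
+    Index pc = 0;
+    while (pc < code.size()) {
+      const Instruction& instr = code[pc];
+      switch (instr.op) {
+        case Opcode::Goto:
+          // Relative unconditional jump by the encoded offset.
+          pc += instr.jump.pc_offset;
+          break;
+        case Opcode::Ret:
+          // The real VM pops a frame and resumes the caller here.
+          return;
+        default:
+          // Execute the instruction's effect, then advance to the next one.
+          ++pc;
+          break;
+      }
+    }
+  }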