diff --git a/docs/api/python/index.rst b/docs/api/python/index.rst index f62a4b848650..796bba8c591c 100644 --- a/docs/api/python/index.rst +++ b/docs/api/python/index.rst @@ -21,6 +21,7 @@ Python API .. toctree:: :maxdepth: 2 + tvm runtime ndarray error diff --git a/docs/api/python/relay/op.rst b/docs/api/python/relay/op.rst index 4c62a06d0959..3d8460adcce7 100644 --- a/docs/api/python/relay/op.rst +++ b/docs/api/python/relay/op.rst @@ -53,8 +53,3 @@ tvm.relay.op .. automodule:: tvm.relay.op.nn :members: - -.. automodule:: tvm.relay.op.vision.multibox - :members: - -.. autofunction:: tvm.relay.vision.nms diff --git a/docs/api/python/runtime.rst b/docs/api/python/runtime.rst index 9e395712aa6d..30d1b98650a3 100644 --- a/docs/api/python/runtime.rst +++ b/docs/api/python/runtime.rst @@ -27,7 +27,7 @@ tvm.runtime .. autoclass:: tvm.runtime.PackedFunc :members: - :inheritated-members: + :inherited-members: .. autofunction:: tvm.register_func diff --git a/docs/contribute/pull_request.rst b/docs/contribute/pull_request.rst index 7ad53758c54b..51626a16eb1a 100644 --- a/docs/contribute/pull_request.rst +++ b/docs/contribute/pull_request.rst @@ -29,12 +29,11 @@ This is a quick guide to submit a pull request, please also refer to the detaile git rebase upstream/master - Make sure code style check pass by typing the following command, and all the existing test-cases pass. - - ``docker/bash.sh tvmai/ci-lint ./tests/scripts/task_lint.sh`` - (Note: You must install docker beforehand so you can run a docker image.) +- ``docker/bash.sh tvmai/ci-lint ./tests/scripts/task_lint.sh``. (Note: You must install docker beforehand so you can run a docker image.) - Add test-cases to cover the new features or bugfix the patch introduces. - Document the code you wrote, see more at :ref:`doc_guide` -- Send the pull request, fix the problems reported by automatic checks. - Request code reviews from other contributors and improves your patch according to feedbacks. +- Send the pull request and fix the problems reported by automatic checks. +- Request code reviews from other contributors and improve your patch according to the feedback. - To get your code reviewed quickly, we encourage you to help review others' code so they can do the favor in return. - Code review is a shepherding process that helps to improve contributor's code quality. diff --git a/docs/deploy/index.rst b/docs/deploy/index.rst index 9a30b96ca66e..a43cce728f61 100644 --- a/docs/deploy/index.rst +++ b/docs/deploy/index.rst @@ -56,7 +56,6 @@ embedded devices is through TVM's RPC API. Here are the links to the related tutorials. - :ref:`tutorial-cross-compilation-and-rpc` -- :ref:`tutorial-deploy-model-on-mali-gpu` - :ref:`tutorial-deploy-model-on-rasp` After you finished tuning and benchmarking, you might need to deploy the model on the @@ -68,3 +67,5 @@ target device without relying on RPC. see the following resources on how to do s cpp_deploy android integrate + aocl_fpga + aws_fpga diff --git a/docs/dev/inferbound.rst b/docs/dev/inferbound.rst index d9fedf8296ef..2f4d428dba99 100644 --- a/docs/dev/inferbound.rst +++ b/docs/dev/inferbound.rst @@ -118,13 +118,11 @@ In the Operation class declaration above, we can see that each operation also ha .. image:: https://raw.githubusercontent.com/tvmai/tvmai.github.io/master/images/docs/inferbound/stage_graph.png :align: center - :scale: 70% InferBound makes one pass through the graph, visiting each stage exactly once.
InferBound starts from the output stages (i.e., the solid blue nodes in the graph above), and moves upwards (in the opposite direction of the edges). This is achieved by performing a reverse topological sort on the nodes of the graph. Therefore, when InferBound visits a stage, each of its consumer stages has already been visited. .. image:: https://raw.githubusercontent.com/tvmai/tvmai.github.io/master/images/docs/inferbound/inferbound_traversal.png :align: center - :scale: 70% The InferBound pass is shown in the following pseudo-code: @@ -162,7 +160,6 @@ Recall that all IterVars of the stage are related by IterVarRelations. The IterV .. image:: https://raw.githubusercontent.com/tvmai/tvmai.github.io/master/images/docs/inferbound/relations.png :align: center - :scale: 70% The above diagram shows the IterVar hyper-graph for one stage. The stage has one root_iter_var, ``i``. It has been split, and the resulting inner axis ``i.inner``, has been split again. The leaf_iter_vars of the stage are shown in green: ``i.outer``, ``i.inner.outer``, and ``i.inner.inner``. @@ -208,7 +205,6 @@ As mentioned above, a consumer may only require a small number of elements from .. image:: https://raw.githubusercontent.com/tvmai/tvmai.github.io/master/images/docs/inferbound/inferbound_phases.png :align: center - :scale: 70% IntSets ~~~~~~~ @@ -323,14 +319,12 @@ A ComputeOp has only a single output Tensor, whose axes correspond to the axis v .. image:: https://raw.githubusercontent.com/tvmai/tvmai.github.io/master/images/docs/inferbound/gatherbound.png :align: center - :scale: 70% The union of IntSets is computed by converting each IntSet to an Interval, and then taking the minimum of all minimums, and the maximum of all of these interval's maximums. .. image:: https://raw.githubusercontent.com/tvmai/tvmai.github.io/master/images/docs/inferbound/union.png :align: center - :scale: 70% This clearly results in some unnecessary computation, i.e., tensor elements will be computed that are never used. @@ -340,7 +334,6 @@ Unfortunately, even if we're lucky and the IntervalSet unions do not produce unn .. image:: https://raw.githubusercontent.com/tvmai/tvmai.github.io/master/images/docs/inferbound/gatherbound_problem.png :align: center - :scale: 70% .. _InferBoundCA: @@ -696,7 +689,6 @@ When InferRootBound is working on stage B, it visits B's consumer stage C to fin .. image:: https://raw.githubusercontent.com/tvmai/tvmai.github.io/master/images/docs/inferbound/passupdomain_problem.png :align: center - :scale: 70% @@ -756,17 +748,14 @@ If the split factor is 4, or 8, in the above example, the region of B needed in .. image:: https://raw.githubusercontent.com/tvmai/tvmai.github.io/master/images/docs/inferbound/passupdomain_div.png :align: center - :scale: 70% However, if the split factor is changed from 4 to 3 in the example above, it is easy to see that the region of B that C needs can no longer be described by an independent Range for each of its axes. .. image:: https://raw.githubusercontent.com/tvmai/tvmai.github.io/master/images/docs/inferbound/passupdomain_nodiv.png :align: center - :scale: 70% The best that can be done with rectangular regions is shown in the following diagram. The orange regions are the minimum rectangular regions covering the region of B that needs to be computed, at each iteration of the outer loop. .. 
image:: https://raw.githubusercontent.com/tvmai/tvmai.github.io/master/images/docs/inferbound/passupdomain_min.png :align: center - :scale: 70% diff --git a/docs/dev/relay_bring_your_own_codegen.rst b/docs/dev/relay_bring_your_own_codegen.rst index 3e3369dba52f..b7d5fa9f5fd6 100644 --- a/docs/dev/relay_bring_your_own_codegen.rst +++ b/docs/dev/relay_bring_your_own_codegen.rst @@ -535,7 +535,7 @@ To simplify, we define a graph representation named "ExampleJSON" in this guide. Then the ExampleJON of this subgraph looks like: -.. code-block:: json +.. code-block:: none subgraph_0 input 0 10 10 @@ -544,7 +544,7 @@ Then the ExampleJON of this subgraph looks like: input 3 10 10 add 4 inputs: 0 1 shape: 10 10 sub 5 inputs: 4 2 shape: 10 10 - add 6 inputs: 5 3 shape: 10 10 + mul 6 inputs: 5 3 shape: 10 10 The ``input`` keyword declares an input tensor with its ID and shape; while the other statements describes computations in `` inputs: [input ID] shape: [shape]`` syntax. diff --git a/docs/dev/relay_intro.rst b/docs/dev/relay_intro.rst index 526822afe422..fac447987717 100644 --- a/docs/dev/relay_intro.rst +++ b/docs/dev/relay_intro.rst @@ -39,7 +39,6 @@ compile for heterogeneous execution environments (e.g., executing parts of the g .. image:: https://raw.githubusercontent.com/tvmai/tvmai.github.io/master/images/relay/dataflow.png :align: center - :scale: 70% You can use Relay to build a computational (dataflow) graph. Specifically, the above code shows how to @@ -130,7 +129,6 @@ The code example below shows one program with two forms side by side. .. image:: https://raw.githubusercontent.com/tvmai/tvmai.github.io/master/images/relay/dataflow_vs_func.png :align: center - :scale: 70% The nested let binding is called A-normal form, and it is commonly used as IRs in functional programming languages. @@ -155,7 +153,6 @@ which does not use let bindings. .. image:: https://raw.githubusercontent.com/tvmai/tvmai.github.io/master/images/relay/let_scope.png :align: center - :scale: 70% The problem comes when we try to decide where we should evaluate node ``%1``. In particular, while the text format seems to suggest that we should evaluate node ``%1`` outside the if scope, the AST(as shown in the picture) does not suggest so. diff --git a/docs/dev/runtime.rst b/docs/dev/runtime.rst index 353b3392d181..5ed5f86ed44b 100644 --- a/docs/dev/runtime.rst +++ b/docs/dev/runtime.rst @@ -258,6 +258,7 @@ It also allows us to get members of an object easily in front-end language. For example, in the following code, we accessed the op field of the TensorNode. .. 
code:: python + import tvm x = tvm.placeholder((3,4), name="x") diff --git a/docs/dev/virtual_machine.rst b/docs/dev/virtual_machine.rst index 338fb46050e9..5bb5adee5459 100644 --- a/docs/dev/virtual_machine.rst +++ b/docs/dev/virtual_machine.rst @@ -91,6 +91,7 @@ Ret ^^^ **Arguments**: :: + RegName dst RegName result @@ -100,6 +101,7 @@ InvokePacked ^^^^^^^^^^^^ **Arguments**: :: + Index packed_index Index arity Index output_size @@ -114,6 +116,7 @@ AllocTensor ^^^^^^^^^^^ **Arguments**: :: + RegName dst RegName storage uint32_t ndim @@ -127,6 +130,7 @@ AllocTensorReg ^^^^^^^^^^^^^^ **Arguments**: :: + RegName dst RegName storage RegName shape_register @@ -139,6 +143,7 @@ AllocStorage ^^^^^^^^^^^^ **Arguments**: :: + RegName dst RegName size RegName alignment @@ -151,6 +156,7 @@ AllocADT ^^^^^^^^ **Arguments**: :: + RegName dst Index tag Index num_fields @@ -163,6 +169,7 @@ AllocClosure ^^^^^^^^^^^^ **Arguments**: :: + RegName dst Index clo_index Index num_freevar @@ -176,6 +183,7 @@ GetField ^^^^^^^^ **Arguments**: :: + RegName dst RegName object Index field_index @@ -186,6 +194,7 @@ If ^^ **Arguments**: :: + RegName test RegName target Index true_offset @@ -199,6 +208,7 @@ GetTag ^^^^^^ **Arguments**: :: + RegName object RegName dst @@ -212,6 +222,7 @@ Goto ^^^^ **Arguments**: :: + Index pc_offset Relative unconditional jump by ``pc_offset``. @@ -220,6 +231,7 @@ Invoke ^^^^^^ **Arguments**: :: + Index func_index Invoke function at ``func_index``, consumes the number of arguments contained in the VMFunction's @@ -229,6 +241,7 @@ InvokeClosure ^^^^^^^^^^^^^ **Arguments**: :: + RegName closure Index num_closure_args RegName* closure_args @@ -239,6 +252,7 @@ LoadConst ^^^^^^^^^ **Arguments**: :: + RegName dst Index const_index @@ -248,6 +262,7 @@ LoadConsti ^^^^^^^^^^ **Arguments**: :: + Index val RegName dst @@ -277,7 +292,7 @@ previous call. Registers are allocated in a continuous space (virtual register f We keep track of a set of Relay functions we have called, a pointer into its bytecode, an offset into the byte code (known as the program counter). -:: +.. code-block:: c struct VirtualMachine { ... @@ -331,6 +346,7 @@ Optimizations marked with `TODO` are not implemented yet. Serialization ~~~~~~~~~~~~~ + Serializing and deserializing the executable generated by the Relay VM compiler is a must as we may want to save the model to the disk and perform inference later. Previously, Relay has produced a serialized form in a json file for the graph runtime. However, the same format is not directly @@ -372,14 +388,17 @@ Unresolved Questions How do we handle dynamic shapes? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + TODO How can we modify the VM to support JIT compilation of certain code paths? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + In the code generation space there are still many tradeoffs to be analyzed and the VM is designed to be very flexible so we can modify it for future experiments. How do we support heterogenous execution? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + Heterogenous execution should work out of the box assuming we have annotated the appropriate device copies. In order to do this properly we need to run the device annotation and copying passes. 
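To make the VM design described above concrete, here is a minimal, illustrative sketch of compiling and running a tiny Relay function on the VM executor from Python. It assumes a TVM build from roughly the same period as these docs, where ``relay.create_executor("vm", ...)`` is available; the function, shape, and input data are made up for the example.

.. code-block:: python

    import numpy as np
    import tvm
    from tvm import relay

    # A one-operator Relay function: f(x) = x + x
    x = relay.var("x", shape=(10,), dtype="float32")
    mod = tvm.IRModule.from_expr(relay.Function([x], relay.add(x, x)))

    # Compile the module for the Relay VM and invoke it on the CPU
    executor = relay.create_executor("vm", mod=mod, ctx=tvm.cpu(0), target="llvm")
    result = executor.evaluate()(np.arange(10).astype("float32"))
    print(result)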
diff --git a/docs/vta/dev/hardware.rst b/docs/vta/dev/hardware.rst index 7ec073b7d905..784cd54d6d45 100644 --- a/docs/vta/dev/hardware.rst +++ b/docs/vta/dev/hardware.rst @@ -215,6 +215,7 @@ This would result in a ``load-gemm-activate-store`` task pipeline which closely Adding more stages has a cost however: it can add storage and extra logic overhead, which is why we opted for a default 3-stage pipeline. .. _vta-uarch: + Microarchitectural Overview --------------------------- diff --git a/docs/vta/hardware.rst b/docs/vta/hardware.rst deleted file mode 100644 index cfd7be333081..000000000000 --- a/docs/vta/hardware.rst +++ /dev/null @@ -1,19 +0,0 @@ -.. Licensed to the Apache Software Foundation (ASF) under one - or more contributor license agreements. See the NOTICE file - distributed with this work for additional information - regarding copyright ownership. The ASF licenses this file - to you under the Apache License, Version 2.0 (the - "License"); you may not use this file except in compliance - with the License. You may obtain a copy of the License at - -.. http://www.apache.org/licenses/LICENSE-2.0 - -.. Unless required by applicable law or agreed to in writing, - software distributed under the License is distributed on an - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - KIND, either express or implied. See the License for the - specific language governing permissions and limitations - under the License. - -VTA Hardware Design Overview -============================ diff --git a/tutorials/frontend/build_gcn.py b/tutorials/frontend/build_gcn.py index 077b066f7f62..d385dc9e72ab 100644 --- a/tutorials/frontend/build_gcn.py +++ b/tutorials/frontend/build_gcn.py @@ -16,32 +16,29 @@ # under the License. """ Building a Graph Convolutional Network -===================== +====================================== **Author**: `Yulun Yao `_, \ `Chien-Yu Lin `_ This article is an introductory tutorial to build a Graph Convolutional Network (GCN) with Relay. - In this tutorial, we will run our GCN on Cora dataset to demonstrate. - Cora dataset is a common benchmark for Graph Neural Networks (GNN) and frameworks that support GNN training and inference. - We directly load the dataset from DGL library to do the apples to apples comparison against DGL. Please refer to DGL doc for DGL installation at -https://docs.dgl.ai/install/index.html +https://docs.dgl.ai/install/index.html. -and refer to PyTorch guide for PyTorch installation at -https://pytorch.org/get-started/locally/ +Please refer to PyTorch guide for PyTorch installation at +https://pytorch.org/get-started/locally/. """ ###################################################################### # Define GCN in DGL with PyTorch backend -# ------------------ +# -------------------------------------- # # DGL example: https://github.com/dmlc/dgl/tree/master/examples/pytorch/gcn -# This part reuses the code from the above example +# This part reuses the code from the above example. 
import torch import torch.nn as nn import torch.nn.functional as F @@ -78,7 +75,7 @@ def forward(self, features): ###################################################################### # Define the functions to load dataset and evaluate accuracy -# ------------------ +# ---------------------------------------------------------- # You may substitute this part with your own dataset, here we load data from DGL from dgl.data import load_data from collections import namedtuple @@ -106,7 +103,7 @@ def evaluate(data, logits): ###################################################################### # Load the data and set up model parameters -# ------------------ +# ----------------------------------------- """ Parameters ---------- @@ -136,7 +133,7 @@ def evaluate(data, logits): ###################################################################### # Set up the DGL-PyTorch model and get the golden results -# ------------------ +# ------------------------------------------------------- # # The weights are trained with https://github.com/dmlc/dgl/blob/master/examples/pytorch/gcn/train.py from tvm.contrib.download import download_testdata @@ -162,7 +159,7 @@ def evaluate(data, logits): ###################################################################### # Run the DGL model and test for accuracy -# ------------------ +# --------------------------------------- torch_model.eval() with torch.no_grad(): logits_torch = torch_model(features) @@ -174,9 +171,8 @@ def evaluate(data, logits): ###################################################################### # Define Graph Convolution Layer in Relay -# ---------------------------- +# --------------------------------------- # To run GCN on TVM, we first need to implement Graph Convolution Layer. -# # You may refer to https://github.com/dmlc/dgl/blob/master/python/dgl/nn/mxnet/conv.py for a GraphConv Layer implemented in DGL with MXNet Backend # # The layer is defined with below operations, note that we apply two transposes to keep adjacency matrix on right hand side of sparse_dense operator, @@ -251,7 +247,7 @@ def GraphConv(layer_name, ###################################################################### # Prepare the parameters needed in the GraphConv layers -# ------------------ +# ----------------------------------------------------- # import numpy as np import networkx as nx @@ -282,7 +278,7 @@ def prepare_params(g, data): ###################################################################### # Put layers together -# ------------------ +# ------------------- # Define input features, norms, adjacency matrix in Relay infeats = relay.var("infeats", shape=data.features.shape) @@ -321,7 +317,8 @@ def prepare_params(g, data): ###################################################################### # Compile and run with TVM -# ------------------ +# ------------------------ +# # Export the weigths from PyTorch model to Python Dict model_params = {} for param_tensor in torch_model.state_dict(): @@ -345,7 +342,7 @@ def prepare_params(g, data): ###################################################################### # Run the TVM model, test for accuracy and verify with DGL -# ------------------ +# -------------------------------------------------------- m.run() logits_tvm = m.get_output(0).asnumpy() print("Print the first five outputs from TVM execution\n", logits_tvm[:5]) diff --git a/tutorials/frontend/from_tflite.py b/tutorials/frontend/from_tflite.py index 85ece270a21b..e93a71ce4a78 100644 --- a/tutorials/frontend/from_tflite.py +++ 
b/tutorials/frontend/from_tflite.py @@ -16,13 +16,12 @@ # under the License. """ Compile TFLite Models -=================== +===================== **Author**: `Zhao Wu `_ This article is an introductory tutorial to deploy TFLite models with Relay. To get started, Flatbuffers and TFLite package needs to be installed as prerequisites. - A quick solution is to install Flatbuffers via pip .. code-block:: bash @@ -68,7 +67,7 @@ """ ###################################################################### # Utils for downloading and extracting zip files -# --------------------------------------------- +# ---------------------------------------------- import os def extract(path): @@ -84,28 +83,28 @@ def extract(path): ###################################################################### # Load pretrained TFLite model -# --------------------------------------------- -# we load mobilenet V1 TFLite model provided by Google +# ---------------------------- +# Load mobilenet V1 TFLite model provided by Google from tvm.contrib.download import download_testdata model_url = "http://download.tensorflow.org/models/mobilenet_v1_2018_08_02/mobilenet_v1_1.0_224.tgz" -# we download model tar file and extract, finally get mobilenet_v1_1.0_224.tflite +# Download model tar file and extract it to get mobilenet_v1_1.0_224.tflite model_path = download_testdata(model_url, "mobilenet_v1_1.0_224.tgz", module=['tf', 'official']) model_dir = os.path.dirname(model_path) extract(model_path) -# now we have mobilenet_v1_1.0_224.tflite on disk and open it +# Now we can open mobilenet_v1_1.0_224.tflite tflite_model_file = os.path.join(model_dir, "mobilenet_v1_1.0_224.tflite") tflite_model_buf = open(tflite_model_file, "rb").read() -# get TFLite model from buffer +# Get TFLite model from buffer import tflite.Model tflite_model = tflite.Model.Model.GetRootAsModel(tflite_model_buf, 0) ###################################################################### # Load a test image -# --------------------------------------------- +# ----------------- # A single cat dominates the examples! 
from PIL import Image from matplotlib import pyplot as plt @@ -118,10 +117,10 @@ def extract(path): plt.show() image_data = np.asarray(resized_image).astype("float32") -# after expand_dims, we have format NHWC +# Add a dimension to the image so that we have NHWC format layout image_data = np.expand_dims(image_data, axis=0) -# preprocess image as described here: +# Preprocess image as described here: # https://github.com/tensorflow/models/blob/edb6ed22a801665946c63d650ab9a0b23d98e1b1/research/slim/preprocessing/inception_preprocessing.py#L243 image_data[:, :, :, 0] = 2.0 / 255.0 * image_data[:, :, :, 0] - 1 image_data[:, :, :, 1] = 2.0 / 255.0 * image_data[:, :, :, 1] - 1 @@ -130,50 +129,50 @@ def extract(path): ###################################################################### # Compile the model with relay -# --------------------------------------------- +# ---------------------------- # TFLite input tensor name, shape and type input_tensor = "input" input_shape = (1, 224, 224, 3) input_dtype = "float32" -# parse TFLite model and convert into Relay computation graph +# Parse TFLite model and convert it to a Relay module from tvm import relay mod, params = relay.frontend.from_tflite(tflite_model, shape_dict={input_tensor: input_shape}, dtype_dict={input_tensor: input_dtype}) -# target x86 CPU +# Build the module for x86 CPU target = "llvm" with relay.build_config(opt_level=3): graph, lib, params = relay.build(mod, target, params=params) ###################################################################### # Execute on TVM -# --------------------------------------------- +# -------------- import tvm from tvm.contrib import graph_runtime as runtime -# create a runtime executor module +# Create a runtime executor module module = runtime.create(graph, lib, tvm.cpu()) -# feed input data +# Feed input data module.set_input(input_tensor, tvm.nd.array(image_data)) -# feed related params +# Feed related params module.set_input(**params) -# run +# Run module.run() -# get output +# Get output tvm_output = module.get_output(0).asnumpy() ###################################################################### # Display results -# --------------------------------------------- +# --------------- -# load label file +# Load label file label_file_url = ''.join(['https://raw.githubusercontent.com/', 'tensorflow/tensorflow/master/tensorflow/lite/java/demo/', 'app/src/main/assets/', @@ -181,15 +180,15 @@ def extract(path): label_file = "labels_mobilenet_quant_v1_224.txt" label_path = download_testdata(label_file_url, label_file, module='data') -# list of 1001 classes +# List of 1001 classes with open(label_path) as f: labels = f.readlines() -# convert result to 1D data +# Convert result to 1D data predictions = np.squeeze(tvm_output) -# get top 1 prediction +# Get top 1 prediction prediction = np.argmax(predictions) -# convert id to class name and show the result +# Convert id to class name and show the result print("The image prediction result is: id " + str(prediction) + " name: " + labels[prediction]) diff --git a/tutorials/frontend/using_external_lib.py b/tutorials/frontend/using_external_lib.py index 35b015bffcd3..71acedaf181b 100644 --- a/tutorials/frontend/using_external_lib.py +++ b/tutorials/frontend/using_external_lib.py @@ -16,7 +16,7 @@ # under the License.
""" Using External Libraries in Relay -================================ +================================= **Author**: `Masahiro Masuda `_, `Truman Tian `_ This is a short tutorial on how to use external libraries such as cuDNN, or cuBLAS with Relay. diff --git a/tutorials/language/tuple_inputs.py b/tutorials/language/tuple_inputs.py index 0c5c85ca585a..715e2ef36f7e 100644 --- a/tutorials/language/tuple_inputs.py +++ b/tutorials/language/tuple_inputs.py @@ -56,7 +56,7 @@ # operators, and the inputs will collaborate together, e.g. :code:`argmax`. # In the reduction procedure, :code:`argmax` need to compare the value of # operands, also need to keep the index of operand. It can be expressed -# with :any:`comm_reducer` as below: +# with :py:func:`tvm.comm_reducer` as below: # x and y are the operands of reduction, both of them is a tuple of index # and value. diff --git a/tutorials/optimize/opt_conv_tensorcore.py b/tutorials/optimize/opt_conv_tensorcore.py index 774b4c7258bb..ef840892d7d5 100644 --- a/tutorials/optimize/opt_conv_tensorcore.py +++ b/tutorials/optimize/opt_conv_tensorcore.py @@ -18,7 +18,7 @@ .. _opt-conv-tensorcore: How to optimize convolution using TensorCores -================================== +============================================= **Author**: `Siyuan Feng `_ In this tutorial, we will demonstrate how to write a high performance convolution @@ -29,7 +29,7 @@ ################################################################ # TensorCore Introduction -# ------------------------- +# ----------------------- # Each Tensor Core provides a 4x4x4 matrix processing array that operates # :code:`D = A * B + C`, where A, B, C and D are 4x4 matrices as Figure shows. # The matrix multiplication inputs A and B are FP16 matrices, while the accumulation @@ -45,7 +45,7 @@ ################################################################ # Preparation and Algorithm -# -------------------------- +# ------------------------- # We use the fixed size for input tensors with 256 channels and 14 x 14 dimensions. # The batch size is 256. Convolution filters contain 512 filters of size 3 x 3. # We use stride size 1 and padding size 1 for the convolution. In the example, we use @@ -126,8 +126,7 @@ ############################################################################### # Memory Scope -# ---------------- -# +# ------------ # In traditional GPU schedule, we have global, shared and local memory scope. # To support TensorCores, we add another three special memory scope: :code:`wmma.matrix_a`, # :code:`wmma.matrix_b` and :code:`wmma.accumulator`. On hardware, all fragments scope @@ -142,6 +141,7 @@ ############################################################################### # Define Tensor Intrinsic +# ----------------------- # In fact, TensorCore is a special hardware operation. So, we can just use tensorize # to replace a unit of computation with the TensorCore instruction. The first thing is # that we need to define tensor intrinsic. @@ -246,7 +246,6 @@ def intrin_func(ins, outs): # easiest way to solve this. Then We can bind threadIdx.x to any loops except those contain # TensorCore intrinsics directly or indirectly. Also note that it is not the unique solution. # The only thing we should do is to make sure all threads in a warp can call TensorCore at the same time. 
-# # Define tiling sizes block_row_warps = 4 @@ -312,10 +311,9 @@ def intrin_func(ins, outs): ############################################################################### # Lowering Computation to Intrinsics -# -------------------------- +# ---------------------------------- # The last phase is to lower the computation loops down to TensorCore hardware intrinsics # by mapping the 2D convolution to tensor intrinsics -# s[AF].tensorize(AF.op.axis[-2], intrin_wmma_load_matrix('wmma.matrix_a')) s[WF].tensorize(WF.op.axis[-2], intrin_wmma_load_matrix('wmma.matrix_b')) @@ -344,5 +342,6 @@ def intrin_func(ins, outs): ############################################################################### # Summary +# ------- # This tutorial demonstrates how TVM scheduling primitives can be used to # call TensorCores on specific GPUs. diff --git a/tutorials/optimize/opt_gemm.py b/tutorials/optimize/opt_gemm.py index a23589a4ab19..8ed152aee918 100644 --- a/tutorials/optimize/opt_gemm.py +++ b/tutorials/optimize/opt_gemm.py @@ -232,7 +232,6 @@ # # .. image:: https://github.com/dmlc/web-data/raw/master/tvm/tutorial/array-packing.png # :align: center -# :scale: 100% # diff --git a/tutorials/optimize/opt_matmul_auto_tensorcore.py b/tutorials/optimize/opt_matmul_auto_tensorcore.py index 00dbe9de838f..eb6501ba4f3d 100644 --- a/tutorials/optimize/opt_matmul_auto_tensorcore.py +++ b/tutorials/optimize/opt_matmul_auto_tensorcore.py @@ -18,7 +18,7 @@ .. _opt-matmul-auto-tensorcore: How to optimize matmul with Auto TensorCore CodeGen -================================== +=================================================== **Author**: `Minmin Sun `_, \ `Lanbo Li `_, \ `Chenfan Jia `_, \ @@ -31,12 +31,11 @@ Users can also write schedule with tensorization to generate TensorCore code. Both solutions use the same tensorcore intrinsics. Please refer to :ref:`opt-conv-tensorcore` tutorial for more details. - """ ################################################################ # Preparation and Algorithm -# -------------------------- +# ------------------------- # 2 kinds of input data types are supported: float16 and int8. # For float16, the accumulator is float32. # For int8, the accumulator is int32. @@ -215,7 +214,7 @@ def test_gemm(N, L, M, dtype, layout): ############################################################################### # AutoTune and Test -# -------------------- +# ----------------- # Finally we use a tuner to tune the schedule, generate code with best config # and run the kernel to compare with numpy to check whether the results are correct. @@ -460,6 +459,6 @@ def tune_and_evaluate(M, N, L, dtype, layout): ############################################################################### # Summary -# -------------------------- +# ------- # This tutorial demonstrates how to use the AutoTensorCoreCodeGen of TVM # to generate tensorcore kernels. diff --git a/tutorials/relay_quick_start.py b/tutorials/relay_quick_start.py index 5461b08a4b35..6cded3325ad6 100644 --- a/tutorials/relay_quick_start.py +++ b/tutorials/relay_quick_start.py @@ -18,7 +18,7 @@ .. _tutorial-relay-quick-start: Quick Start Tutorial for Compiling Deep Learning Models -====================================================== +======================================================= **Author**: `Yao Wang `_, `Truman Tian `_ This example shows how to build a neural network with Relay python frontend and @@ -33,7 +33,6 @@ # # .. 
image:: https://github.com/dmlc/web-data/raw/master/tvm/tutorial/tvm_support_list.png # :align: center -# :scale: 100% # # In this tutorial, we'll choose cuda and llvm as target backends. # To begin with, let's import Relay and TVM. @@ -47,7 +46,7 @@ ###################################################################### # Define Neural Network in Relay -# ----------------------------- +# ------------------------------ # First, let's define a neural network with relay python frontend. # For simplicity, we'll use pre-defined resnet-18 network in Relay. # Parameters are initialized with Xavier initializer. diff --git a/vta/tutorials/frontend/README.txt b/vta/tutorials/frontend/README.txt index 319506d21f8f..7adec27a9bc0 100644 --- a/vta/tutorials/frontend/README.txt +++ b/vta/tutorials/frontend/README.txt @@ -1,4 +1,4 @@ -.. _tutorial-frontend: +.. _vta-tutorial-frontend: Compile Deep Learning Models ---------------------------- diff --git a/vta/tutorials/frontend/deploy_vision_on_vta.py b/vta/tutorials/frontend/deploy_vision_on_vta.py index c410d24d07ae..df02b4842488 100644 --- a/vta/tutorials/frontend/deploy_vision_on_vta.py +++ b/vta/tutorials/frontend/deploy_vision_on_vta.py @@ -94,7 +94,7 @@ ###################################################################### # Obtain an execution remote -# --------------------------------- +# -------------------------- # When target is 'pynq', reconfigure FPGA and runtime. # Otherwise, if target is 'sim', execute locally. @@ -136,15 +136,16 @@ # --------------------------------- # Grab vision model from Gluon model zoo and compile with Relay. # The compilation steps are: -# 1) Front end translation from MxNet into Relay module. -# 2) Apply 8-bit quantization: here we skip the first conv layer, -# and dense layer which will both be executed in fp32 on the CPU. -# 3) Perform graph packing to alter the data layout for tensorization. -# 4) Perform constant folding to reduce number of operators (e.g. eliminate -# batch norm multiply). -# 5) Perform relay build to object file. -# 6) Load the object file onto remote (FPGA device). -# 7) Generate graph runtime, `m`. +# +# 1. Front end translation from MxNet into Relay module. +# 2. Apply 8-bit quantization: here we skip the first conv layer, +# and dense layer which will both be executed in fp32 on the CPU. +# 3. Perform graph packing to alter the data layout for tensorization. +# 4. Perform constant folding to reduce number of operators (e.g. eliminate batch norm multiply). +# 5. Perform relay build to object file. +# 6. Load the object file onto remote (FPGA device). +# 7. Generate graph runtime, `m`. +# # Load pre-configured AutoTVM schedules with autotvm.tophub.context(target):
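The numbered compilation steps above can be sketched end to end in plain Python. The snippet below is only an illustration of steps 1, 2, 5, and 7 on a regular CPU (``llvm``) target: it skips the VTA-specific graph packing and the remote upload, and the model name, input shape, and quantization scale are assumed values rather than the tutorial's exact configuration.

.. code-block:: python

    import numpy as np
    import tvm
    from tvm import autotvm, relay
    from tvm.contrib import graph_runtime
    from mxnet.gluon.model_zoo import vision

    target = tvm.target.create("llvm")

    # 1. Front-end translation from a Gluon model into a Relay module
    gluon_model = vision.get_model("resnet18_v1", pretrained=True)
    mod, params = relay.frontend.from_mxnet(gluon_model, shape={"data": (1, 3, 224, 224)})

    # 2. 8-bit quantization, keeping the first conv layer in fp32
    with relay.quantize.qconfig(global_scale=8.0, skip_conv_layers=[0]):
        mod = relay.quantize.quantize(mod, params=params)

    # 5. Relay build to a deployable module, reusing pre-tuned schedules when available
    with autotvm.tophub.context(target):
        with relay.build_config(opt_level=3):
            graph, lib, params = relay.build(mod, target=target, params=params)

    # 7. Generate the graph runtime and run one inference locally
    m = graph_runtime.create(graph, lib, tvm.cpu(0))
    m.set_input(**params)
    m.set_input("data", np.random.uniform(size=(1, 3, 224, 224)).astype("float32"))
    m.run()
    out = m.get_output(0).asnumpy()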