latency measurements and exporter functions to nnp and onnx #7

Merged
10 changes: 9 additions & 1 deletion .gitignore
@@ -136,4 +136,12 @@ venv.bak/
dmypy.json

# Pyre type checker
.pyre/

# NNP - ONNX
*.onnx
*.nnp
*.lat
graph
graph.pdf
examples/*
118 changes: 93 additions & 25 deletions docs/source/features/estimator.rst
@@ -1,72 +1,140 @@
Estimating the Latency of DNN Architectures
-------------------------------------------

Hardware aware NAS addresses the problem of how to fit the architecture of DNNs to specific target devices,
such that they fulfill given performance requirements. This is, for example, important if we want to deploy
DNN-based algorithms to mobile devices. Naturally, we want to find DNN architectures that run fast and require
only little memory. More specifically, we might be interested in DNNs that have

- a low latency.
- a small parameter memory footprint.
- a small activation memory footprint.
- a high throughput.
- a low power consumption.

To perform hardware aware NAS, we therefore need tools to estimate such performance measures on target devices.
The modules inside nnabla_nas.utils.estimator.latency implement
such tools, i.e., they provide methods to estimate the latency and the memory footprint
of DNN architectures.



How to estimate the latency of DNN architectures
................................................

There are different ways to estimate the latency of a DNN architecture on a device.
Two naive approaches are shown in the figure below, namely

- network-based estimation
- layer-based estimation

.. image:: images/measurement.png

Here, z is a random vector which encodes the structure of the network.
A network-based latency estimator instantiates the computational graph and measures the time it takes to calculate
its output at once. We call the resulting latency the true latency.
A layer-based estimator instantiates the computational graph and estimates the latency of
each layer separately. The latency of the whole network is then calculated as the sum of all the
latencies of the individual layers. We call this the accumulated latency. Because the individual
calculation of each layer causes some computational overhead, the layer-based latency estimate
is not the same as the true latency. However, experiments show that the differences between
the true and the accumulated latency estimates are small, meaning that both can be used for hardware aware NAS.
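
As a rough sketch of the layer-based idea (plain Python timing, not the NNabla NAS API; the lists *layers* and *layer_inputs* are assumed to be given), the accumulated latency is simply the sum of the individually measured layer times:

.. code-block:: python

import time

def measure(fn, x, n_warmup=10, n_runs=100):
    # average wall-clock time of fn(x) over n_runs, after n_warmup warm-up calls
    for _ in range(n_warmup):
        fn(x)
    start = time.perf_counter()
    for _ in range(n_runs):
        fn(x)
    return (time.perf_counter() - start) / n_runs

# accumulated (layer-based) latency: measure every layer separately, then sum
accumulated_latency = sum(measure(layer, x) for layer, x in zip(layers, layer_inputs))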

In the NNabla NAS framework, we only implement layer-based latency estimators:

- *nnabla_nas.utils.estimator.latency.LatencyEstimator*: a layer-based estimator that extracts the layers of the network based on the active modules contained in the network
- *nnabla_nas.utils.estimator.latency.LatencyGraphEstimator*: a layer-based estimator that extracts the layers of the network based on the NNabla graph of the network

The reason for this is that we want the estimators to run offline, i.e., before the architecture search.
Depending on the target hardware, a latency measurement on a device can take considerable time. Therefore,
latency measurements during the architecture search are not desirable.
With network-based estimators, the number of networks to measure grows exponentially with the number of
layers and the number of candidates per layer. With the layer-based approach, the growth is only linear.
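
A quick worked example of this growth argument (the numbers are purely illustrative):

.. code-block:: python

# 10 layers with 4 candidate modules each
layers, candidates = 10, 4
network_based = candidates ** layers  # 4**10 = 1048576 full-network measurements
layer_based = layers * candidates     # only 40 individual layer measurements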

However, you can also use the network-based latency estimator from NNabla, which we make available as
*nnabla_nas.utils.estimator.latency.Profiler*.


How to use the estimators
.........................

The following example shows how to use an estimator. First, we instantiate the model we want to estimate the latency of.
To this end, we borrow the implementation of the MobileNet from :ref:`mobilenet`. If the network
is constructed from dynamic modules (as in this case), the NNabla graph must be constructed once, such that each
module knows its input shapes. We can then feed the model to the estimator to calculate the latency. Please note that
the estimator always assumes a batch size of one. Further, the model will always be profiled with the input shapes
that were calculated when the last NNabla graph was created.

.. code-block:: python

from nnabla_nas.contrib.classification.mobilenet import SearchNet
from nnabla_nas.utils.estimator.latency import LatencyGraphEstimator, LatencyEstimator
import nnabla as nn
from nnabla.ext_utils import get_extension_context

# Parameters for the Latency Estimation
outlier = 0.05
max_measure_execution_time = 500
time_scale = "m"
n_warmup = 10
n_runs = 100
device_id = 0
ext_name = 'cudnn'

ctx = get_extension_context(ext_name=ext_name, device_id=device_id)
nn.set_default_context(ctx)

layer_based_estim_by_module = LatencyEstimator(
    device_id=device_id, ext_name=ext_name, outlier=outlier, time_scale=time_scale, n_warmup=n_warmup,
    max_measure_execution_time=max_measure_execution_time, n_run=n_runs
)

layer_based_estim_by_graph = LatencyGraphEstimator(
    device_id=device_id, ext_name=ext_name, outlier=outlier, time_scale=time_scale, n_warmup=n_warmup,
    max_measure_execution_time=max_measure_execution_time, n_run=n_runs
)

# create the network
net = SearchNet()

# For *dynamic graphs* like MobileNetV2, we need to create the nnabla graph once
# This defines the input shapes of all modules
inp = nn.Variable((1,3,32,32))
out = net(inp)

layer_based_latency_by_m = layer_based_estim_by_module.get_estimation(net)
layer_based_latency_by_g = layer_based_estim_by_graph.get_estimation(out)

The two results differ slightly, only because the measurements are carried out in two separate runs,
once per estimator, for each layer found. Conceptually, both estimates should be identical.

Please note: if the candidate space contains zero modules, the estimate can deviate considerably
if the model is constructed from *dynamic* modules. To make this clearer, we continue the code example from above:

.. code-block:: python

inp2 = nn.Variable((1,3,1024,1024))
out2 = net(inp2)
layer_based_latency_by_m2 = layer_based_estim_by_module.get_estimation(net)
layer_based_latency_by_g2 = layer_based_estim_by_graph.get_estimation(out2)

Because we constructed a second NNabla graph (out2) that has a much larger input, the input shapes of all modules
in the network change accordingly. Therefore, these new latencies will be much larger than the previously
measured ones.

Profiling *static* graphs is similar. The only difference is that the input shapes of
static modules cannot change after instantiation, meaning that we do not need to construct the NNabla
graph before latency estimation.


As a further example, you can also use the network-based estimator (Profiler) by continuing the code above:

.. code-block:: python

from nnabla_nas.utils.estimator.latency import Profiler

network_based_estim = Profiler(
    out2,
    device_id=device_id, ext_name=ext_name, outlier=outlier, time_scale=time_scale, n_warmup=n_warmup,
    max_measure_execution_time=max_measure_execution_time, n_run=n_runs
)
network_based_estim.run()
net_based_latency = float(network_based_estim.result['forward_all'])
2 changes: 1 addition & 1 deletion docs/source/index.rst
@@ -9,7 +9,7 @@ NNablaNAS is a Python package that provides methods in neural architecture search

- Searcher algorithms to learn the architecture and model parameters (e.g., *DartsSearcher* and *ProxylessNasSearcher*)

- Regularizers (e.g., *LatencyGraphEstimator* and *MemoryEstimator*) which can be used to enforce hardware constraints

In this document, we will describe how to use the Python APIs, some examples, and the contribution guidelines for developers. The latest release version can be installed from `here <https://github.com/sony/nnabla-nas>`_.

71 changes: 38 additions & 33 deletions docs/source/tutorials/static_module_construct.rst
@@ -18,12 +18,10 @@ that defines a simple d layer CNN:

from nnabla_nas import module as Mo
import nnabla as nn

inp = nn.Variable((10, 3, 32, 32))

def net(x, d=10):
    c_inp = Mo.Conv(3, 64, (3,3))
    c_l = [Mo.Conv(64, 64, (3,3)) for i in range(d-1)]
    x = c_inp(x)

    for i in range(d-1):
@@ -32,7 +30,7 @@ that defines a simple d layer CNN:

out = net(inp)

The network consists of 10 convolutional layers with 3x3 kernels. Each layer
computes 64 feature maps. Following the dynamic graph paradigm,
the structure of the network is only defined in the code, i.e., it is only defined
by the sequence in which we apply the layers c_l. The modules themselves are agnostic to
@@ -49,16 +47,18 @@ The example network from the example above can, for example, be defined as:

.. code-block:: python

from nnabla_nas.module import static as Smo
import nnabla as nn


inp = nn.Variable((10, 3, 32, 32))

def net(x, d=10):
    modules = [Smo.Input(nn.Variable((10, 3, 32, 32)))]
    for i in range(d-1):
        modules.append(Smo.Conv(parents=[modules[-1]], in_channels=modules[-1].shape[1], out_channels=64, kernel=(3,3)))
    return modules[-1]

out = net(inp)

In comparison to dynamic modules, each static module keeps a list of its parents. The graph structure is therefore stored within the modules and can later be retrieved from them. Furthermore, static modules introduce a kind of shape safety, i.e., once a module is instantiated, its input and output shapes are fixed and cannot be changed anymore.
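
As a small illustration of both properties, continuing the snippet above (the .shape accessor also appears in the example; _parents is the internal parent list visible in the code further below, so treat the exact attribute names as illustrative):

.. code-block:: python

inp = Smo.Input(nn.Variable((10, 3, 32, 32)))
conv = Smo.Conv(parents=[inp], in_channels=inp.shape[1], out_channels=64, kernel=(3, 3))

print(inp.shape)      # fixed at instantiation: (10, 3, 32, 32)
print(conv._parents)  # the graph structure is stored in the module: [inp]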

@@ -76,45 +76,49 @@ while others are dropped. For an efficient search, it is desirable to have simple
graph optimization algorithms in place, i.e., algorithms that optimize the computational
graph of the selected subnetworks before executing them.

Consider for example the following search space:

- The network applies an input convolution (conv 1).
- Two candidate layers are applied to the output of conv 1: a zero operation and another convolution (conv 2).
- The Join layer randomly selects the output of one of the candidate layers and feeds it to conv 3.

If Join selects Conv 2, we need to calculate the output of Conv 1, Conv 2 and Conv 3. However, if Join selects Zero, only the output of Conv 3 must be calculated, because
selecting Zero effectively cuts the computational graph, meaning that all layers that are parents of Zero and that have no shortcut connection to any following layers can be deleted from the computational graph.

Static modules implement such graph optimization, meaning that they can speed up computations.

.. image:: ../images/static_example_graph.png

A second reason why a static graph definition is a natural choice for hardware aware NAS is related to latency modeling.
In order to perform hardware aware NAS, we need to estimate the latency of the subnetworks that have been
drawn from the candidate space in order to decide whether the network meets our latency requirements or not.
Typically, the latency of each layer (module) within the search space is measured once individually. The latency of a
subnetwork of the search space, then, is a function of those individual latencies and of the structure of the subnetwork.
Simply summing up all the latencies of the modules that are contained in the subnetwork is wrong.

This is obvious if we reconsider the example from above. All the modules Conv 1 to Conv 3 have a latency > 0, while Zero and Join have a latency of 0. If Join selects Zero,
Conv 1, Zero, Join and Conv 3 are part of the subnetwork. However, summing up the latencies of Conv 1, Zero, Join and Conv 3 is wrong. The correct latency is obtained by considering only Conv 3.
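
A minimal sketch of such a structure-aware accumulation (hypothetical attribute names — latency, parents, selected_parent — and not the NNabla NAS API):

.. code-block:: python

def subnetwork_latency(module, seen=None):
    # Sum measured latencies over the active subgraph that ends in `module`.
    seen = set() if seen is None else seen
    if module in seen or isinstance(module, Zero):
        return 0.0  # Zero cuts the walk: its exclusive ancestors contribute nothing
    seen.add(module)
    # a Join forwards only one of its candidate inputs
    parents = [module.selected_parent] if isinstance(module, Join) else module.parents
    return module.latency + sum(subnetwork_latency(p, seen) for p in parents)

# called on the output module, e.g. subnetwork_latency(conv3), this returns
# only the latency of Conv 3 when Join selects Zero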

Other problems that need knowledge of the graph structure are, for example:

- Graph similarity calculation
- NAS using Bayesian optimization algorithms
- Modeling the memory footprint of DNNs (activation memory)

Which modules are currently implemented?
........................................

There is a static version of all dynamic modules implemented in nnabla_nas.modules. There are currently two static search spaces, namely contrib.zoph and contrib.random_wired.

Implementing new static modules
...............................

There are different ways to define static modules.

You can derive a static version from a dynamic module. Consider the following
example, where we want to derive a static Conv module from the dynamic Conv module.
First, we derive our StaticConv module from:

- the dynamic Conv class
- the StaticModule base class

We call the __init__() of both parent classes. Please note that the order of inheritance is important!

.. code-block:: python

Expand All @@ -128,8 +132,9 @@ We call the __init__() of both parent classes. Please note, that the order of in
if len(self._parents) > 1:
    raise RuntimeError
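
The code above shows only a fragment of the class (the diff truncates it). A minimal complete sketch, assuming the static base class is importable as Smo.Module and takes the parent list in its constructor, could look like this (illustrative, not the original implementation):

.. code-block:: python

from nnabla_nas import module as Mo
from nnabla_nas.module import static as Smo

class StaticConv(Mo.Conv, Smo.Module):
    # the order of inheritance is important: dynamic class first, static base second
    def __init__(self, parents, *args, **kwargs):
        if len(parents) > 1:
            raise RuntimeError("StaticConv accepts exactly one parent")
        Smo.Module.__init__(self, parents=parents)
        Mo.Conv.__init__(self, *args, **kwargs)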

We can also implement a new static module from scratch, implementing the call method. Please follow the same steps that are documented in the dynamic module tutorial.

In the following example, we define a StaticConv, implementing the call method. You can either use the NNabla API or dynamic modules to define the transfer function. In our case, we use dynamic modules.

.. code-block:: python
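
# NOTE: the body of this example is truncated in the diff view; the following
# is an illustrative sketch only. It assumes that Smo.Module stores the parent
# list and invokes call() with the parents' outputs. The transfer function is
# defined with a dynamic Conv module (the NNabla API would work as well).
from nnabla_nas import module as Mo
from nnabla_nas.module import static as Smo

class StaticConv(Smo.Module):
    def __init__(self, parents, in_channels, out_channels, kernel):
        if len(parents) > 1:
            raise RuntimeError("StaticConv accepts exactly one parent")
        Smo.Module.__init__(self, parents=parents)
        # dynamic Conv module that implements the transfer function
        self._conv = Mo.Conv(in_channels, out_channels, kernel)

    def call(self, *inputs):
        # inputs[0] is the nnabla output of the single parent module
        return self._conv(inputs[0])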
