latency measurements and exporter functions to nnp and onnx #7

Merged
10 changes: 9 additions & 1 deletion .gitignore
@@ -136,4 +136,12 @@ venv.bak/
dmypy.json

# Pyre type checker
.pyre/

# NNP - ONNX
*.onnx
*.nnp
*.lat
graph
graph.pdf
examples/*
118 changes: 93 additions & 25 deletions docs/source/features/estimator.rst
@@ -1,72 +1,140 @@
Estimating the Latency of DNN Architectures
-------------------------------------------

Hardware aware NAS addresses the problem of how to fit the architecture of DNNs to specific target devices,
such that they fulfill given performance requirements. This is, for example, important if we want to deploy
DNN-based algorithms to mobile devices. Naturally, we want to find DNN architectures that run fast and require
only little memory. More specifically, we might be interested in DNNs that have

- a low latency.
- a small parameter memory footprint.
- a small activation memory footprint.
- a high throughput.
- a low power consumption.

To perform hardware aware NAS, we therefore need tools to estimate such performance measures on target devices.
The modules inside nnabla_nas.utils.estimator.latency implement
such tools, i.e., they provide methods to estimate the latency and the memory footprint
of DNN architectures.



How to estimate the latency of DNN architectures
................................................

There are different ways to estimate the latency of a DNN architecture on a device.
Two naive approaches are shown in the figure below, namely

- network-based estimation
- layer-based estimation

.. image:: images/measurement.png

Here, z is a random vector which encodes the structure of the network.
A network-based latency estimator instantiates the computational graph and measures the time it takes to calculate
its output at once. We call the resulting latency the true latency.
A layer-based estimator instantiates the computational graph and estimates the latency of
each layer separately. The latency of the whole network is then calculated as the sum of all the
latencies of the individual layers. We call this the accumulated latency. Because the individual
calculation of each layer causes some computational overhead, the layer-based latency estimate
is not the same as the true latency. However, experiments show that the differences between
the true and the accumulated latency estimates are small, meaning that both can be used for hardware aware NAS.
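
As a rough sketch of the layer-based idea (plain Python timing, not the NNabla NAS API; the lists *layers* and *layer_inputs* are assumed to be given), the accumulated latency is simply the sum of the individually measured layer times:

.. code-block:: python

import time

def measure(fn, x, n_warmup=10, n_runs=100):
    # average wall-clock time of fn(x) over n_runs, after n_warmup warm-up calls
    for _ in range(n_warmup):
        fn(x)
    start = time.perf_counter()
    for _ in range(n_runs):
        fn(x)
    return (time.perf_counter() - start) / n_runs

# accumulated (layer-based) latency: measure every layer separately, then sum
accumulated_latency = sum(measure(layer, x) for layer, x in zip(layers, layer_inputs))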

In the NNabla NAS framework, we only implement layer-based latency estimators:

- *nnabla_nas.utils.estimator.latency.LatencyEstimator*: a layer-based estimator that extracts the layers of the network based on the active modules contained in the network
- *nnabla_nas.utils.estimator.latency.LatencyGraphEstimator*: a layer-based estimator that extracts the layers of the network based on the NNabla graph of the network

The reason for this is that we want the estimators to run offline, i.e., before the architecture search.
Depending on the target hardware, a latency measurement on a device can take considerable time. Therefore,
latency measurements during the architecture search are not desirable.
With network-based estimators, the number of networks to measure grows exponentially with the number of
layers and the number of candidates per layer. With the layer-based approach, the growth is only linear.
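
A quick worked example of this growth argument (the numbers are purely illustrative):

.. code-block:: python

# 10 layers with 4 candidate modules each
layers, candidates = 10, 4
network_based = candidates ** layers  # 4**10 = 1048576 full-network measurements
layer_based = layers * candidates     # only 40 individual layer measurements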

However, you can also use the network-based latency estimator from NNabla, which we make available as
*nnabla_nas.utils.estimator.latency.Profiler*.


How to use the estimators
.........................

The following example shows how to use an estimator. First, we instantiate the model we want to estimate the latency of.
To this end, we borrow the implementation of the MobileNet from :ref:`mobilenet`. If the network
is constructed from dynamic modules (as in this case), the NNabla graph must be constructed once, such that each
module knows its input shapes. We can then feed the model to the estimator to calculate the latency. Please note that
the estimator always assumes a batch size of one. Further, the model will always be profiled with the input shapes
that were calculated when the last NNabla graph was created.

.. code-block:: python

from nnabla_nas.contrib.classification.mobilenet import SearchNet
from nnabla_nas.utils.estimator.latency import LatencyGraphEstimator, LatencyEstimator
import nnabla as nn
from nnabla.ext_utils import get_extension_context

# Parameters for the Latency Estimation
outlier = 0.05
max_measure_execution_time = 500
time_scale = "m"
n_warmup = 10
n_runs = 100
device_id = 0
ext_name = 'cudnn'

ctx = get_extension_context(ext_name=ext_name, device_id=device_id)
nn.set_default_context(ctx)

layer_based_estim_by_module = LatencyEstimator(
    device_id=device_id, ext_name=ext_name, outlier=outlier, time_scale=time_scale, n_warmup=n_warmup,
    max_measure_execution_time=max_measure_execution_time, n_run=n_runs
)

layer_based_estim_by_graph = LatencyGraphEstimator(
    device_id=device_id, ext_name=ext_name, outlier=outlier, time_scale=time_scale, n_warmup=n_warmup,
    max_measure_execution_time=max_measure_execution_time, n_run=n_runs
)

# create the network
net = SearchNet()

# For *dynamic graphs* like MobileNetV2, we need to create the nnabla graph once
# This defines the input shapes of all modules
inp = nn.Variable((1,3,32,32))
out = net(inp)

layer_based_latency_by_m = layer_based_estim_by_module.get_estimation(net)
layer_based_latency_by_g = layer_based_estim_by_graph.get_estimation(out)

The two results differ slightly, only because the measurements are carried out in two separate runs,
once per estimator, for each layer found. Conceptually, both estimates should be identical.

Please note: if the candidate space contains zero modules, the estimate can deviate considerably
if the model is constructed from *dynamic* modules. To make this clearer, we continue the code example from above:

.. code-block:: python

inp2 = nn.Variable((1,3,1024,1024))
out2 = net(inp2)
layer_based_latency_by_m2 = layer_based_estim_by_module.get_estimation(net)
layer_based_latency_by_g2 = layer_based_estim_by_graph.get_estimation(out2)

Because we constructed a second NNabla graph (out2) that has a much larger input, the input shapes of all modules
in the network change accordingly. Therefore, these new latencies will be much larger than the previously
measured ones.

Profiling *static* graphs is similar. The only difference is that the input shapes of
static modules cannot change after instantiation, meaning that we do not need to construct the NNabla
graph before latency estimation.


As a further example, you can also use the network-based estimator (Profiler) by continuing the code above:

.. code-block:: python

from nnabla_nas.utils.estimator.latency import Profiler

network_based_estim = Profiler(
    out2,
    device_id=device_id, ext_name=ext_name, outlier=outlier, time_scale=time_scale, n_warmup=n_warmup,
    max_measure_execution_time=max_measure_execution_time, n_run=n_runs
)
network_based_estim.run()
net_based_latency = float(network_based_estim.result['forward_all'])
2 changes: 1 addition & 1 deletion docs/source/index.rst
@@ -9,7 +9,7 @@ NNablaNAS is a Python package that provides methods in neural architecture search

- Searcher algorithms to learn the architecture and model parameters (e.g., *DartsSearcher* and *ProxylessNasSearcher*)

- Regularizers (e.g., *LatencyGraphEstimator* and *MemoryEstimator*) which can be used to enforce hardware constraints

In this document, we will describe how to use the Python APIs, some examples, and the contribution guidelines for developers. The latest release version can be installed from `here <https://github.com/sony/nnabla-nas>`_.

71 changes: 38 additions & 33 deletions docs/source/tutorials/static_module_construct.rst
@@ -18,12 +18,10 @@ that defines a simple d layer CNN:

from nnabla_nas import module as Mo
import nnabla as nn

inp = nn.Variable((10, 3, 32, 32))

def net(x, d=10):
    c_inp = Mo.Conv(3, 64, (3,3))
    c_l = [Mo.Conv(64, 64, (3,3)) for i in range(d-1)]
    x = c_inp(x)

    for i in range(d-1):
@@ -32,7 +30,7 @@ that defines a simple d layer CNN:

out = net(inp)

The network consists of 10 convolutional layers with 3x3 kernels. Each layer
computes 64 feature maps. Following the dynamic graph paradigm,
the structure of the network is only defined in the code, i.e., it is only defined
by the sequence in which we apply the layers c_l. The modules themselves are agnostic to
@@ -49,16 +47,18 @@ The example network from the example above can, for example, be defined as:

.. code-block:: python

from nnabla_nas.module import static as Smo
import nnabla as nn


inp = nn.Variable((10, 3, 32, 32))

def net(x, d=10):
    modules = [Smo.Input(nn.Variable((10, 3, 32, 32)))]
    for i in range(d-1):
        modules.append(Smo.Conv(parents=[modules[-1]], in_channels=modules[-1].shape[1], out_channels=64, kernel=(3,3)))
    return modules[-1]

out = net(inp)

In comparison to dynamic modules, each static module keeps a list of its parents. The graph structure is therefore stored within the modules and can later be retrieved from them. Furthermore, static modules introduce a kind of shape safety, i.e., once a module is instantiated, its input and output shapes are fixed and cannot be changed anymore.
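
As a small illustration of both properties, continuing the snippet above (the .shape accessor also appears in the example; _parents is the internal parent list visible in the code further below, so treat the exact attribute names as illustrative):

.. code-block:: python

inp = Smo.Input(nn.Variable((10, 3, 32, 32)))
conv = Smo.Conv(parents=[inp], in_channels=inp.shape[1], out_channels=64, kernel=(3, 3))

print(inp.shape)      # fixed at instantiation: (10, 3, 32, 32)
print(conv._parents)  # the graph structure is stored in the module: [inp]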

@@ -76,45 +76,49 @@ while others are dropped. For an efficient search, it is desirable to have simple
graph optimization algorithms in place, i.e., algorithms that optimize the computational
graph of the selected subnetworks before executing them.

Consider for example the following search space:

- The network applies an input convolution (conv 1).
- Two candidate layers are applied to the output of conv 1: a zero operation and another convolution (conv 2).
- The Join layer randomly selects the output of one of the candidate layers and feeds it to conv 3.

If Join selects Conv 2, we need to calculate the output of Conv 1, Conv 2 and Conv 3. However, if Join selects Zero, only the output of Conv 3 must be calculated, because
selecting Zero effectively cuts the computational graph, meaning that all layers that are parents of Zero and that have no shortcut connection to any following layers can be deleted from the computational graph.

Static modules implement such graph optimization, meaning that they can speed up computations.

.. image:: ../images/static_example_graph.png

A second reason why a static graph definition is a natural choice for hardware aware NAS is related to latency modeling.
In order to perform hardware aware NAS, we need to estimate the latency of the subnetworks that have been
drawn from the candidate space in order to decide whether the network meets our latency requirements or not.
Typically, the latency of each layer (module) within the search space is measured once individually. The latency of a
subnetwork of the search space, then, is a function of those individual latencies and of the structure of the subnetwork.
Simply summing up all the latencies of the modules that are contained in the subnetwork is wrong.

This is obvious if we reconsider the example from above. All the modules Conv 1 to Conv 3 have a latency > 0, while Zero and Join have a latency of 0. If Join selects Zero,
Conv 1, Zero, Join and Conv 3 are part of the subnetwork. However, summing up the latencies of Conv 1, Zero, Join and Conv 3 is wrong. The correct latency is obtained by considering only Conv 3.
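
A minimal sketch of such a structure-aware accumulation (hypothetical attribute names — latency, parents, selected_parent — and not the NNabla NAS API):

.. code-block:: python

def subnetwork_latency(module, seen=None):
    # Sum measured latencies over the active subgraph that ends in `module`.
    seen = set() if seen is None else seen
    if module in seen or isinstance(module, Zero):
        return 0.0  # Zero cuts the walk: its exclusive ancestors contribute nothing
    seen.add(module)
    # a Join forwards only one of its candidate inputs
    parents = [module.selected_parent] if isinstance(module, Join) else module.parents
    return module.latency + sum(subnetwork_latency(p, seen) for p in parents)

# called on the output module, e.g. subnetwork_latency(conv3), this returns
# only the latency of Conv 3 when Join selects Zero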

Other problems that need knowledge of the graph structure are, for example:

- Graph similarity calculation
- NAS using Bayesian optimization algorithms
- Modeling the memory footprint of DNNs (activation memory)

Which modules are currently implemented?
........................................

There is a static version of all dynamic modules implemented in nnabla_nas.modules. There are currently two static search spaces, namely contrib.zoph and contrib.random_wired.

Implementing new static modules
...............................

There are different ways to define static modules.

You can derive a static version from a dynamic module. Consider the following
example, where we want to derive a static Conv module from the dynamic Conv module.
First, we derive our StaticConv module from:

- the dynamic Conv class
- the StaticModule base class

We call the __init__() of both parent classes. Please note that the order of inheritance is important!

.. code-block:: python

Expand All @@ -128,8 +132,9 @@ We call the __init__() of both parent classes. Please note, that the order of in
if len(self._parents) > 1:
    raise RuntimeError
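
The code above shows only a fragment of the class (the diff truncates it). A minimal complete sketch, assuming the static base class is importable as Smo.Module and takes the parent list in its constructor, could look like this (illustrative, not the original implementation):

.. code-block:: python

from nnabla_nas import module as Mo
from nnabla_nas.module import static as Smo

class StaticConv(Mo.Conv, Smo.Module):
    # the order of inheritance is important: dynamic class first, static base second
    def __init__(self, parents, *args, **kwargs):
        if len(parents) > 1:
            raise RuntimeError("StaticConv accepts exactly one parent")
        Smo.Module.__init__(self, parents=parents)
        Mo.Conv.__init__(self, *args, **kwargs)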

We can also implement a new static module from scratch, implementing the call method. Please follow the same steps that are documented in the dynamic module tutorial.

In the following example, we define a StaticConv, implementing the call method. You can either use the NNabla API or dynamic modules to define the transfer function. In our case, we use dynamic modules.

.. code-block:: python
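
# NOTE: the body of this example is truncated in the diff view; the following
# is an illustrative sketch only. It assumes that Smo.Module stores the parent
# list and invokes call() with the parents' outputs. The transfer function is
# defined with a dynamic Conv module (the NNabla API would work as well).
from nnabla_nas import module as Mo
from nnabla_nas.module import static as Smo

class StaticConv(Smo.Module):
    def __init__(self, parents, in_channels, out_channels, kernel):
        if len(parents) > 1:
            raise RuntimeError("StaticConv accepts exactly one parent")
        Smo.Module.__init__(self, parents=parents)
        # dynamic Conv module that implements the transfer function
        self._conv = Mo.Conv(in_channels, out_channels, kernel)

    def call(self, *inputs):
        # inputs[0] is the nnabla output of the single parent module
        return self._conv(inputs[0])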
