There are three main parts in the hardware profiling module.

The first part defines the search space and then traverses it to generate all primitives.
We define a tuple `Prim` to describe primitives, which consists of:

- `prim_type`: the primitive type. It must be defined in `aw_nas/ops` and can be fetched by `get_op`.
- `spatial_size`: the input feature map size of the primitive.
- `C`: the input channel number.
- `C_out`: the output channel number.
- `stride`: only set for convolutional primitives, 1 or 2.
- `kernel_size`: only set for convolutional primitives.
- `kwargs`: a dict for extra keys.
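For illustration, such a primitive could be represented as a plain namedtuple (a sketch following the fields above; the concrete values and the `expansion` kwarg are made-up examples):

```python
from collections import namedtuple

# Illustrative only: a primitive description with the fields listed above.
# The actual Prim in aw_nas may differ in details.
Prim = namedtuple(
    "Prim",
    ["prim_type", "spatial_size", "C", "C_out", "stride", "kernel_size", "kwargs"],
)

# Example: a 3x3 stride-2 MobileNetV3 block on a 112x112 input feature map.
prim = Prim(
    prim_type="mobilenet_v3_block",
    spatial_size=112,
    C=16,
    C_out=24,
    stride=2,
    kernel_size=3,
    kwargs={"expansion": 4},  # hypothetical extra key
)
```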
After traversing the search space, all primitives are sent to the `ProfilingNetAssembler` to be assembled into complete networks for profiling. Each assembled network is represented as an `aw_nas` YAML config file that can be passed to the general model defined in `aw_nas/final/general_model.py`.
The second part measures these networks on the target hardware. For DPUs or other embedded devices, the measurement has to be taken offline. We provide an example, `DPUCompiler`, defined in `aw_nas/hardware/dpu.py`, which includes the pytorch2caffe conversion and fixed-point processes; you can also adopt your own tools. After the measurement finishes, the result, which consists of the performances of all basic layers supported on the embedded device, should be parsed by `DPUCompiler.parse_file`. For GPU/CPU, we provide a script to profile these networks online.
The final part organizes the previous results to build a model that predicts the hardware-related measures of an arbitrary architecture in the search space. Currently, only the latency table is available; its predictions may be inconsistent with real performance because it is based on a linear hypothesis. We will provide revised models later.
`aw_nas` provides a command-line interface, `awnas-hw`, to orchestrate the hardware-related objective (e.g., latency, energy, etc.) profiling and parsing flow. A complete workflow example is illustrated as follows.
Use the `genprof` command to generate a series of config files to be profiled:

```
awnas-hw genprof examples/hardware/configs/ofa_final.yaml examples/hardware/configs/ofa_lat.yaml --result-dir ./results --compile-hardware dpu
```
- `ofa_final.yaml`: the aw_nas config file. The search space should be defined here.
- `ofa_lat.yaml`: the hardware-objective config file for generating the config files to be profiled. `profiling_primitive_cfg`, `hwobjmodel_type`, `mixin_search_space_type` and their corresponding configs should be defined here; it should also provide a template, defined in `profiling_net_cfg.base_cfg_template`, to generate the profiling config files.
- (optional) `--result-dir`: the directory to save the results in. Use it CAREFULLY, because it will erase all contents of the directory if it already exists.
- (optional) `--compile-hardware`: specifies which hardware compiler to use. Compilers are defined in `hardware/compiler`. The interface `compile` must be implemented, and conversion / quantization / fixed-point or other steps can be instantiated here. In our example `hardware/compiler/dpu.py`, we provide an instance that converts the to-be-profiled config files to Caffe models, which needs to import the extra module pytorch2caffe.
The result is shown as follows:

```
results
├── config.yaml
├── hardwares
│   └── 0-dpu
│       └── ...
├── hwobj_config.yaml
├── prof_nets
│   └── ...
└── prof_prims.yaml
```
- `config.yaml`: a copy of the aw_nas config file.
- `hardwares/{$exp-num}-{$compiler}`: the Caffe models converted from the aw_nas config files.
- `prof_nets`: the aw_nas config files to be profiled.
- `prof_prims.yaml`: includes all the primitives to be profiled.
- `hwobj_config.yaml`:
  - (optional) `pytorch_to_caffe`: meta information that contains a mapping from pytorch module names to Caffe prototype names (or other deployable formats), which is necessary for later profiling.
Next, measure all the Caffe models generated by the previous step. Notice: some problems may occur during the compiling process because of unsupported layers in your pytorch code. You can cope with them by either removing the unsupported layers or adding them to the pytorch2caffe module.
Use the `parse` command to parse the DPU measurement result files into YAML format files:

```
awnas-hw parse {$hw_cfg_file} {$prof_result_dir} {$prof_prim_file} {$prim_to_ops_file} --hwobj-type latency --result-dir profiled_nets
```
- `hw_cfg_file`: the config file for the hardware; an example can be found in `examples/hardware/configs/ofa_lat.yaml`. `hardware_compiler_type` and `hardware_compiler_cfg` should be defined here.
- `prof_result_dir`: the actual profiled results measured offline, which contain the performances (latency, energy, memory, etc.) of each layer (represented as a Caffe prototype name or other deployable format). Notice: layer fusion on some devices may cause uncertainty for some layers; the parsing method can be implemented as `Compiler.parse_file`.
- `prof_prim_file`: contains all primitives generated by step 1.
- `prim_to_ops_file`: meta information generated by step 1. It contains a mapping from pytorch module names to Caffe prototype names (or other deployable formats).
- `result_dir`: has exactly the same structure as `prof_result_dir`, but consists of YAML files that contain the performances of each primitive in the network.
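For a quick sanity check, you can load one of the parsed YAML files (the file name and exact schema below are assumptions; the real layout depends on the `Compiler.parse_file` implementation):

```python
import yaml

# Hypothetical file name; parsed results live under --result-dir
# ("profiled_nets" in the command above).
with open("profiled_nets/net_0.yaml") as f:
    prim_perfs = yaml.safe_load(f)

# Assumed structure: one entry per primitive with its measured performance.
for entry in prim_perfs:
    print(entry)
```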
Use the `genmodel` command to generate a latency model:

```
awnas-hw genmodel {$cfg_file} {$hwobj_cfg_file} {$prof_prim_dir} --result-file {$result_file}
```
- `cfg_file`: the aw_nas config file that defines the search space.
- `hwobj_cfg_file`: the hardware config file that contains `mixin_search_space_cfg`, `profiling_primitive_cfg` and `hwobjmodel_type`.
- `prof_prim_dir`: the profiled networks generated by the previous step.
- `--result-file`: where to dump the hardware objective model (a latency model by default; other types of measures such as energy can also be incorporated).
In `hwobj_cfg_file`, the entry `prof_prims_cfg` must be defined with the following keys specified: `sample`, `as_dict`, `spatial_size`, `base_channels`, `mult_ratio`, `strides`, `acts`, `use_ses`, `stem_stride`, `primitive_type`. An example:
```yaml
prof_prims_cfg:
  sample: null # or an int
  as_dict: true # if set to false, the return value is a namedtuple
  spatial_size: 300
  base_channels: [16, 16, 24, 32, 64, 96, 160, 960, 1280]
  mult_ratio: 1.
  strides: [1, 2, 2, 2, 1, 2]
  acts: ["relu6", "relu6", "relu6", "h_swish", "h_swish", "h_swish"]
  use_ses: [False, False, True, False, True, True]
  stem_stride: 2
  primitive_type: mobilenet_v3_block
```
The profiling primitive configurations must be identical to the hardware objective configurations in the search space configuration file. For examples, please refer to `examples/hardware/det_ofa_xavier.yaml` and `examples/hardware/det_ofa_hardware.yaml`.
Compared with DPUs or other embedded devices, profiling for CPU/GPU is easier to implement because there is no need to measure performances offline.

First, do the same as step 1 of DPU profiling, except that no compiler needs to be specified:

```
awnas-hw genprof examples/hardware/configs/ofa_final.yaml examples/hardware/configs/ofa_lat.yaml --result-dir ./results
```
Then run the online profiling script:

```
python scripts/hardware/latency.py config_0.yaml [config_1.yaml ...] --device {$device_id} --perf_dir {$result_directory}
```
- `config_i.yaml`: the aw_nas config files generated by the previous step. You can pass an arbitrary number of files to the script by using shell globbing or brace expansion, e.g. `python latency.py config_{0..20}.yaml` or `python latency.py config_dir/*.yaml`.
- `--device`: the device id. `0` to `MAX_CUDA_NUM - 1` specifies which GPU device will be used to measure performances, and `-1` specifies the CPU. Passing any other device id raises a `ValueError`.
- `--perf_dir`: the result directory. The result in this directory has exactly the same format as the result of step 3 of DPU profiling.
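At its core, such online profiling is a timed forward pass; the following minimal sketch shows the idea (the bundled `scripts/hardware/latency.py` may differ in details such as batch size, warmup counts, and result serialization):

```python
import time

import torch


def measure_latency(model, spatial_size=224, device="cuda:0", warmup=10, repeat=50):
    """Average forward latency in milliseconds (a rough sketch)."""
    model = model.to(device).eval()
    inputs = torch.randn(1, 3, spatial_size, spatial_size, device=device)
    with torch.no_grad():
        for _ in range(warmup):  # warm up kernels and caches first
            model(inputs)
        if device.startswith("cuda"):
            torch.cuda.synchronize()
        start = time.time()
        for _ in range(repeat):
            model(inputs)
        if device.startswith("cuda"):
            torch.cuda.synchronize()
    return (time.time() - start) / repeat * 1000.0
```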
Since parsing the profiled results of GPU/CPU is already done by step 2, we can directly generate the latency model as in step 4 of DPU profiling:

```
awnas-hw genmodel {$cfg_file} {$hwobj_cfg_file} {$prof_prim_dir} --result-file {$result_file}
```
Different latency models take different input features. For example, the linear regression model predicts the actual network latency from the latency sum of its blocks, the MLP takes a padded list of latency data as input for each prediction, and the LSTM's input features contain the configurations of every block. These different input formats are constructed by preprocessors, and each hardware cost model uses a list of preprocessors (a sketch of the regression case follows the list below):

Legal preprocessor combinations:
- table: `["block_sum", "remove_anomaly", "flatten"]`
- regression: `["block_sum", "remove_anomaly", "flatten", "extract_sum_features"]`
- mlp: `["block_sum", "remove_anomaly", "flatten", "padding"]`
- lstm: `["block_sum", "remove_anomaly", "flatten", "extract_lstm_features"]`
You can now apply the latency model during the search process by combining your objective with the `HardwareObjective` defined in `aw_nas/objective/hardware.py`. We provide a cascading objective called `ContainerObjective` to do this, which accepts a series of objectives and then assembles their performances, losses, and rewards. You can find more details in `aw_nas/objective/container.py`, and an example in `examples/hardware/configs/hardware_obj.yaml`.
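Conceptually, a cascading objective does something like the following simplified sketch (not the actual `ContainerObjective`; the class name, method signature, and weighting scheme are illustrative assumptions):

```python
# Simplified sketch of a cascading objective: it delegates to several
# sub-objectives (e.g., accuracy and a latency model) and combines their
# rewards. See aw_nas/objective/container.py for the real implementation.
class SimpleContainer:
    def __init__(self, sub_objectives, reward_weights):
        assert len(sub_objectives) == len(reward_weights)
        self.sub_objectives = sub_objectives
        self.reward_weights = reward_weights

    def get_reward(self, inputs, outputs, targets, cand_net):
        rewards = [
            obj.get_reward(inputs, outputs, targets, cand_net)
            for obj in self.sub_objectives
        ]
        # e.g., a weighted sum of the accuracy reward and a latency reward
        return sum(w * r for w, r in zip(self.reward_weights, rewards))
```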
For CPU/GPU profiling, we provide some latency tables and latency regression models here. You can reuse them directly during the search process.
We provide a mixin class `MixinProfilingSearchSpace`. This interface has two methods that must be implemented:

- `generate_profiling_primitives`: profiling cfgs => returns the profiling primitive list
- `parse_profiling_primitives`: primitive hw-related objective list, profiling/hwobj model cfgs => hwobj model
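A minimal sketch of a class implementing this interface might look as follows (signatures and bodies are illustrative assumptions; see `aw_nas/hardware/ofa_obj.py` for real implementations):

```python
# Illustrative only: a search space implementing the two mixin methods.
class MyProfilingSearchSpace:  # would also inherit MixinProfilingSearchSpace
    def generate_profiling_primitives(self, **profiling_cfgs):
        """Profiling cfgs => the profiling primitive list."""
        prims = []
        # Enumerate the (prim_type, spatial_size, C, C_out, stride,
        # kernel_size, kwargs) combinations of the search space here.
        return prims

    def parse_profiling_primitives(self, prim_perfs, prof_prims_cfg, hwobjmodel_cfg):
        """Primitive hw-related objective list + cfgs => hwobj model."""
        raise NotImplementedError  # build e.g. a latency table or regression model
```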
You might also need to implement the hardware-related objective model class for the new search space; you can reuse some code in `aw_nas/hardware/ofa_obj.py`.
To implement a hardware-specific compilation and parsing process, create a new class inheriting `BaseHardwareCompiler`, and implement the `compile` and `hwobj_net_to_primitive` methods. As stated before, you can put your new hardware implementation python file into `AWNAS_HOME/plugins` to make it accessible to `aw_nas`.
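For example, a new compiler could be sketched as follows (the method bodies and the `parse_file` signature are illustrative assumptions; see `aw_nas/hardware/dpu.py` for a concrete reference):

```python
# Illustrative sketch of a new hardware compiler. Inheriting the real base
# class requires importing BaseHardwareCompiler from aw_nas.
class MyHardwareCompiler:  # would inherit BaseHardwareCompiler
    NAME = "my_hardware"

    def compile(self, compile_name, net_cfg, result_dir):
        # Convert an aw_nas net config into your deployable format here;
        # conversion / quantization / fixed-point steps belong in this method.
        raise NotImplementedError

    def hwobj_net_to_primitive(self, net, prim_to_ops):
        # Map the measured layer names of a compiled net back to the
        # profiling primitives they came from.
        raise NotImplementedError

    def parse_file(self, prof_result_file, prof_prim_file):
        # Parse offline measurement results back into per-primitive
        # performances (signature is an assumption).
        raise NotImplementedError
```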