NVIDIA DLA hardware is a fixed-function accelerator engine targeted for deep learning operations. It’s designed to do full hardware acceleration of convolutional neural networks, supporting various layers such as convolution, deconvolution, fully connected, activation, pooling, batch normalization, and others. NVIDIA’s Orin SoCs feature up to two second-generation DLAs while Xavier SoCs feature up to two first-generation DLAs.
DLA software consists of the DLA compiler and the DLA runtime stack. The offline compiler translates the neural network graph into a DLA loadable binary and can be invoked using NVIDIA TensorRT™, NvMedia-DLA or cuDLA. The runtime stack consists of the DLA firmware, kernel mode driver, and user mode driver.
For more details, see the DLA product page.
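For example, the TensorRT path can be exercised via `trtexec`. The following is a minimal sketch, assuming a placeholder `model.onnx` (the exact commands for the models covered in this repo are listed in `scripts/prepare_models/README.md`):

```sh
# Compile an ONNX model into a TensorRT engine that targets DLA core 0 in INT8.
# Layers that DLA does not support fall back to the GPU.
# (For meaningful INT8 accuracy, a calibration cache or per-tensor scales would be needed.)
trtexec --onnx=model.onnx \
        --int8 \
        --useDLACore=0 \
        --allowGPUFallback \
        --saveEngine=model_dla0.engine
```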
- DLA peak performance contributes between 38% and 74% to Orin's total Deep Learning (DL) performance (depending on the power mode, see table below)
- DLA is on average 3x to 5x more power-efficient than the GPU (depending on the power mode and the workload, see DLA Performance per Watt (Power Efficiency) for details)
Here is the distribution of DL TOPs between the GPU and the DLAs on a Jetson AGX Orin 64GB, depending on the power mode:
| | Power mode: MAXN | Power mode: 50W | Power mode: 30W | Power mode: 15W |
|---|---|---|---|---|
| GPU sparse INT8 peak DL performance | 171 TOPs | 109 TOPs | 41 TOPs | 14 TOPs |
| 2x DLA sparse INT8 peak performance | 105 TOPs | 92 TOPs | 90 TOPs | 40 TOPs |
| Total Orin peak INT8 DL performance | 275 TOPs | 200 TOPs | 131 TOPs | 54 TOPs |
| Percentage: DLA peak INT8 performance of total Orin peak DL INT8 performance | 38% | 46% | 69% | 74% |
Note:
- The DLA TOPs of the 30W & 50W power modes on Jetson AGX Orin 64GB are comparable to those at maximum DLA clocks on DRIVE Orin platforms for Automotive
- The maximum DLA TOPs on Jetson Orin NX 16GB are comparable to the 15W power mode on Jetson AGX Orin 64GB
In this repo, we will cover a few key DLA-related metrics for standard deep learning model architectures in the context of common reference application implementations.
The goal is to provide a reference baseline for how these network architectures map to DLA, as well as for their INT8 accuracy.
Use case | Network | INT8 Accuracy on Orin’s DLA | Layers always running on GPU | Instructions |
---|---|---|---|---|
Object Detection | RetinaNet ResNeXt-50 | mAP OpenImages MLPerf validation set*: 0.3741 (GPU INT8: 0.3740, FP32 reference: 0.3757) | NMS (Last node of the network) | See RetinaNet ResNeXt-50 section in scripts/prepare_models/README.md |
Classification | ResNet-50 | Top-1 ImageNet 2012*: 75.54% (GPU INT8: 76.00%, FP32 reference: 76.46%) | Top-K (Last node of the network) | See ResNet-50 section in scripts/prepare_models/README.md |
Object Detection | SSD-ResNet-34 | mAP COCO 2017*: 0.21 (GPU INT8: 0.21, FP32 reference: 0.20) | NMS (Last node of the network) | See SSD-ResNet-34 section in scripts/prepare_models/README.md |
Object Detection | SSD-MobileNetV1 | mAP COCO 2017*: 0.23 (GPU INT8: 0.23, FP32 reference: 0.23) | NMS (Last node of the network) | See SSD-MobileNetV1 section in scripts/prepare_models/README.md |
*Accuracy measured internally by NVIDIA, there may be slight differences compared to previous MLPerf Inference submissions.
Key takeaways:
- Networks tend to share common backbones and vary mostly at the end. You can run the compute-intensive backbone on DLA and the final post-processing layers on the GPU (see the example command after this list).
- GPU and DLA do not produce bit-identical results, so small numerical differences are expected; the accuracy typically stays within an acceptable percentage of the FP32 reference.
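As a sketch of this split, assuming a placeholder ONNX file name: building with `--allowGPUFallback` keeps every DLA-supported layer on DLA and lets TensorRT place the remaining layers, such as a final NMS or Top-K node, on the GPU.

```sh
# Backbone layers run on DLA; layers DLA cannot run (e.g., the final NMS node)
# fall back to the GPU thanks to --allowGPUFallback.
# --verbose logs where each layer was placed during the build.
trtexec --onnx=retinanet_resnext50.onnx \
        --int8 \
        --useDLACore=0 \
        --allowGPUFallback \
        --verbose
```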
More resources:
- MLPerf Inference 3.1 (Orin DLA)
- MLPerf Inference 3.0 (Orin DLA)
- MLPerf Inference 2.1 (Orin DLA)
- MLPerf Inference 2.0 (Xavier & Orin DLA)
DLA on the NVIDIA Orin platform supports Structured Sparsity, which offers the opportunity to minimize latency and maximize throughput in production. See the TensorRT documentation for details (note that the listed restrictions may no longer apply to the most recent DLA releases).
The case study below shows that training models for Structured Sparsity is expected to maintain accuracy:
Network | Weight mask pattern | Waymo Open Dataset Test Accuracy – 2D Detection 2020 IoU=0.50:0.95 (all sizes) @ FP16 on Host | Steps involved |
---|---|---|---|
RetinaNet ResNet-34 | No mask / Dense | 44.7 | Trained from scratch for 26 epochs. |
RetinaNet ResNet-34 | 2:4 Sparse | 44.8 | Sparsified dense model as detailed in sparsity whitepaper, then trained for another 26 epochs. |
2x DLA images per second on a Jetson AGX Orin 64GB in dense operation measured with JetPack 5.1.1, depending on the power mode:
Network | Power mode: MAXN | Power mode: 50W | Power mode: 30W | Power mode: 15W |
---|---|---|---|---|
RetinaNet ResNeXt-50 (800x800, bs=1) | 78 | 72 | 71 | 36 |
ResNet-34 backbone (1280x768, bs=1) | 285 | 260 | 255 | 121 |
RetinaNet ResNet-34 (1280x768, bs=1) | 108 | 98 | 96 | 45 |
SSD-ResNet-34 (1200x1200, bs=1) | 83 | 76 | 74 | 36 |
ResNet-50 (224x224, bs=2) | 2037 | 1948 | 1928 | 1072 |
SSD-MobileNetV1 (300x300, bs=2) | 2664 | 2506 | 2472 | 1313 |
Key takeaways:
- DLA throughput for these models in MAXN power mode is close to the throughput achieved in the 50W and 30W power modes: even at a reduced power budget, your DLA-based inference pipeline can sustain high throughput.
- Attaching a RetinaNet head to a ResNet-34 backbone adds a considerable number of theoretical MAC operations per inference and is hence expected to result in lower end-to-end throughput. Choose the task-specific head that you attach to a backbone carefully.
- Even at the lowest power mode, all of the benchmarks with high input resolution and batch size 1 can still sustain 30 images per second on 2 DLAs combined.
Generally, Structured Sparsity shows perf improvements over dense operation for dense Convolution layers that are already math-bound. The more math-bound a layer in dense operation, the higher the expected dense->sparse speedup after applying a 2:4 sparsity pattern.
2x DLA images per second on a Jetson AGX Orin 64GB in sparse operation measured with JetPack 5.1.1, depending on the power mode (with dense->sparse speedup):
Network | Power mode: MAXN | Power mode: 50W | Power mode: 30W | Power mode: 15W |
---|---|---|---|---|
RetinaNet ResNeXt-50 (800x800, bs=1) | 102 (1.31x) | 96 (1.34x) | 95 (1.34x) | 51 (1.43x) |
ResNet-34 backbone (1280x768, bs=1) | 384 (1.35x) | 360 (1.38x) | 354 (1.39x) | 176 (1.45x) |
RetinaNet ResNet-34 (1280x768, bs=1) | 143 (1.32x) | 133 (1.36x) | 131 (1.37x) | 66 (1.47x) |
SSD-ResNet-34 (1200x1200, bs=1) | 103 (1.24x) | 97 (1.28x) | 95 (1.28x) | 49 (1.36x) |
To reproduce, add `--sparsity=force` to the `trtexec` commands from `scripts/prepare_models/README.md`, for example:
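The sketch below uses a placeholder ONNX file name; the actual file names and remaining flags come from that README.

```sh
# Same DLA build as in dense operation, with trtexec forcing the weights
# into a 2:4 sparsity pattern so the sparse hardware path is exercised
# (forced sparsity is for performance measurement, not accuracy).
trtexec --onnx=retinanet_resnet34.onnx \
        --int8 \
        --useDLACore=0 \
        --allowGPUFallback \
        --sparsity=force
```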
Key takeaways:
- Decreasing the power budget on Jetson Orin generally reduces the ratio of DLA TOPs to DRAM bandwidth (the hardware's op-per-byte ratio). A lower op-per-byte ratio means that more layers get the chance to become math-bound, which is why higher dense->sparse speedups can be observed in lower power modes (see the relation sketched below).
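As a rough roofline-style sketch (generic symbols, not measured values): a layer is math-bound when its arithmetic intensity exceeds the hardware's op-per-byte ratio, so lowering the right-hand side (fewer peak ops at roughly similar DRAM bandwidth) lets more layers satisfy the inequality.

```math
\underbrace{\frac{\text{ops per inference}}{\text{bytes moved per inference}}}_{\text{layer arithmetic intensity}} \;>\; \underbrace{\frac{\text{peak ops/s}}{\text{DRAM bytes/s}}}_{\text{hardware op-per-byte ratio}} \;\;\Longrightarrow\;\; \text{layer is math-bound}
```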
DLA & GPU form your perfect team for Deep Learning inference on the SoC. While the GPU delivers the most TOPs in high-power profiles, DLA excels at power efficiency.
The table below shows the Perf/W ratio of DLA relative to the GPU (accelerator power only, perf metric: images per second) on a Jetson AGX Orin 64GB, measured with JetPack 5.1.1, depending on the power mode:
Network | Weight mask mode | Power mode: MAXN | Power mode: 50W | Power mode: 30W | Power mode: 15W |
---|---|---|---|---|---|
RetinaNet ResNeXt-50 (800x800, bs=1) | Dense | 3.8x | 3.1x | 2.8x | 4.2x |
RetinaNet ResNeXt-50 (800x800, bs=1) | Sparse | 4.7x | 4.1x | 3.7x | 5.6x |
ResNet-34 backbone (1280x768, bs=1) | Dense | 4.3x | 3.7x | 3.4x | 4.9x |
ResNet-34 backbone (1280x768, bs=1) | Sparse | 3.4x | 3.0x | 2.7x | 4.1x |
RetinaNet ResNet-34 (1280x768, bs=1) | Dense | 4.2x | 3.5x | 3.1x | 4.6x |
RetinaNet ResNet-34 (1280x768, bs=1) | Sparse | 3.5x | 2.9x | 2.6x | 3.9x |
SSD-ResNet-34 (1200x1200, bs=1) | Dense | 3.7x | 3.0x | 2.7x | 4.1x |
SSD-ResNet-34 (1200x1200, bs=1) | Sparse | 3.5x | 3.3x | 2.9x | 4.3x |
ResNet-50 (224x224, DLA bs=2, GPU bs=256) | Dense | 3.5x | 2.9x | 2.6x | 3.9x |
SSD-MobileNetV1 (300x300, DLA bs=2, GPU bs=128) | Dense | 4.2x | 3.6x | 3.2x | 5.0x |
Key takeaways:
- DLA is about 3x to 5x more power efficient than GPU for these benchmarks
- Enabling Structured Sparsity generally improves DLA's power efficiency
- DLA's power efficiency relative to the GPU is highest at the lowest power mode of 15W (the power mode where 74% of total Orin peak INT8 DL performance comes from the DLAs)
See operators/README.md for details on which ONNX operators are already supported on DLA and which ones are planned for future releases.
Install the Python dependencies (only supported on x86 hosts) with:
`python3 -m pip install -r requirements.txt`