Paper title:
TensorDash: Exploiting Sparsity to Accelerate Deep Neural Network Training
Publication:
MICRO 2020
Problem to solve:
- The sparsity pattern during training is dynamic and cannot be known ahead of time.
- During training, each tensor (activations, weights, gradients) participates in two of the per-layer convolutions/operations (see the sketch after this list).
- During inference, activations can be discarded after each layer; during training they must be saved for use by the backward pass.
- Inference accelerators use narrow fixed-point arithmetic, whereas training today is done predominantly in floating point.
- Training starts with randomly initialized values that keep evolving throughout the training process.
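To make the second and third points concrete, here is a minimal NumPy sketch (illustrative, not from the paper; the fully connected layer and the names A, W, dZ are assumptions) of the per-layer operations during training. Each tensor appears in two of them, and the activations A must be kept around for the backward pass.

```python
# Minimal sketch (assumed fully connected layer, not from the paper) of the
# per-layer training operations around Z = A @ W. Each tensor is used twice:
#   forward:              Z  = A  @ W     (uses A, W)
#   activation gradient:  dA = dZ @ W.T   (uses dZ, W)
#   weight gradient:      dW = A.T @ dZ   (uses A, dZ)
import numpy as np

rng = np.random.default_rng(0)
A  = rng.standard_normal((4, 8))    # input activations; kept for the backward pass
W  = rng.standard_normal((8, 16))   # weights
dZ = rng.standard_normal((4, 16))   # gradient arriving from the next layer

Z  = A @ W       # forward pass
dA = dZ @ W.T    # gradient w.r.t. activations, propagated to the previous layer
dW = A.T @ dZ    # gradient w.r.t. weights, consumed by the optimizer

print(Z.shape, dA.shape, dW.shape)  # (4, 16) (4, 8) (8, 16)
```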
Major contribution:
- TensorDash exploits the naturally occurring sparsity during training, which appears predominantly in the activations and the gradients. Sparsity is exploited dynamically and entirely in hardware by a low-overhead hardware scheduler that advances MAC operations in time (an earlier cycle) and in space (another MAC unit) so that the overall computation finishes earlier (a minimal sketch of the idea follows this list).
- When incorporated into an accelerator based on Tensorcore processing units, TensorDash improves performance by 1.95× and energy efficiency by 1.5× (1.8× for the compute units) on average over a set of deep learning models covering a wide range of applications.
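The sketch below is an idealized Python model of the scheduling idea, not the TensorDash hardware (the function processing_cycles and the lanes parameter are hypothetical): multiplications whose sparse operand is zero are skipped and the surviving MACs are packed into the available lanes, which approximates promoting work in time and space. The real scheduler is constrained by a limited lookahead/lookaside window, so this fully-packed model gives an upper bound on the speedup.

```python
# Idealized sketch of sparsity-aware MAC scheduling (an assumption, not the
# paper's scheduler): zero operands are dropped and the remaining work is
# packed into the MAC lanes, so the same dot products finish in fewer cycles.
import numpy as np

def processing_cycles(operand_stream, lanes=4):
    """Cycles needed to stream one operand through a group of MAC lanes.

    dense:  every value, zero or not, occupies a lane slot.
    sparse: zero values are skipped and the remaining values are packed.
    """
    flat = np.asarray(operand_stream).ravel()
    dense = int(np.ceil(flat.size / lanes))
    packed = int(np.ceil(np.count_nonzero(flat) / lanes))
    return dense, max(packed, 1)

rng = np.random.default_rng(0)
acts = rng.standard_normal(1024)
acts[rng.random(1024) < 0.6] = 0.0   # ~60% zeros, e.g. post-ReLU activations

dense, sparse = processing_cycles(acts)
print(f"dense: {dense} cycles, sparse-aware: {sparse} cycles, "
      f"speedup ~ {dense / sparse:.2f}x")
```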