abstract.tex

\section{Abstract}
Deep neural networks (DNNs) are being studied and used in various fields such as image classification, captioning, object detection, and image generation. In particular, convolutional neural networks (CNNs) and multilayer perceptrons (MLPs), which run on a variety of frameworks such as Caffe and TensorFlow, are among the most popular neural networks used in image classification. Image classification is the most widely used of these applications, and ever more complex and accurate network structures continue to be proposed.
Accelerators such as the Tensor Processing Unit (TPU) have been proposed to accelerate neural networks, but the GPU remains the typical hardware for accelerating them. Using single instruction, multiple thread (SIMT) execution, a GPU computes multiple elements in parallel when running these deep learning workloads.
The DRAM system is reported to account for a significant part of the power budget of a GPU server, so the demand for an energy-efficient DRAM system is very high. Row buffer activation and precharge energy account for 20 to 40\% of total DRAM system energy. Partial row activation has been studied as a way to reduce this activation and precharge energy.
Although partial row activation has been studied extensively, previous proposals have several problems: the DRAM controller on the host side must also be modified, which a DRAM vendor cannot do; the die size of cost-sensitive DRAM increases; the DRAM core structure must be changed significantly; or a new DRAM command must be added.
As a result, most of these proposals are not compliant with the JEDEC standard DRAM protocol and interface.
To address these problems, this thesis proposes a practical partial row activation scheme for deep learning workloads that is a drop-in replacement for existing HBM2 devices, with minimal changes to the JEDEC-compliant DRAM interface. Based on the HBM2 device, Half-DRAM and subarray partitioning schemes are applied to divide the existing row buffer into 8 sectors, and only the sector corresponding to each column access is activated, using the proposed delayed-activation timing.
The performance degradation caused by the added latency of delayed activation can be minimized, because the GPU inherently hides DRAM latency and because accesses are interleaved across the many banks (32) of HBM2 with pseudo channels. Evaluated on CNN workloads, the Rodinia benchmark suite, and the STREAM benchmark, the proposed scheme shows an average DRAM system energy gain of about 7\%.