TY - GEN
T1 - An Energy-Efficient Accelerator with Relative-Indexing Memory for Sparse Compressed Convolutional Neural Network
AU - Wu, I-Chen
AU - Huang, Po-Tsang
AU - Lo, Chin-Yang
AU - Hwang, Wei
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/3
Y1 - 2019/3
N2 - Deep convolutional neural networks (CNNs) are widely used in image recognition and feature classification. However, deep CNNs are difficult to deploy fully on edge devices because of their computation-intensive and memory-intensive workloads. The energy efficiency of CNNs is dominated by off-chip memory accesses and convolution computation. In this paper, an energy-efficient accelerator is proposed for sparse compressed CNNs by reducing DRAM accesses and eliminating zero-operand computation. Weight compression is utilized for sparse compressed CNNs to reduce the required memory capacity/bandwidth, since a large portion of connections are removed; in addition, the ReLU function produces zero-valued activations. Additionally, the workloads are distributed based on channels to increase the degree of task parallelism, and all-row-to-all-row non-zero element multiplication is adopted to skip redundant computation. Simulation results show that, compared with a dense accelerator, the proposed accelerator achieves a 1.79x speedup and reduces on-chip memory size, energy, and DRAM accesses by 23.51%, 69.53%, and 88.67%, respectively, for VGG-16.
AB - Deep convolutional neural networks (CNNs) are widely used in image recognition and feature classification. However, deep CNNs are difficult to deploy fully on edge devices because of their computation-intensive and memory-intensive workloads. The energy efficiency of CNNs is dominated by off-chip memory accesses and convolution computation. In this paper, an energy-efficient accelerator is proposed for sparse compressed CNNs by reducing DRAM accesses and eliminating zero-operand computation. Weight compression is utilized for sparse compressed CNNs to reduce the required memory capacity/bandwidth, since a large portion of connections are removed; in addition, the ReLU function produces zero-valued activations. Additionally, the workloads are distributed based on channels to increase the degree of task parallelism, and all-row-to-all-row non-zero element multiplication is adopted to skip redundant computation. Simulation results show that, compared with a dense accelerator, the proposed accelerator achieves a 1.79x speedup and reduces on-chip memory size, energy, and DRAM accesses by 23.51%, 69.53%, and 88.67%, respectively, for VGG-16.
UR - http://www.scopus.com/inward/record.url?scp=85070453873&partnerID=8YFLogxK
U2 - 10.1109/AICAS.2019.8771600
DO - 10.1109/AICAS.2019.8771600
M3 - Conference contribution
AN - SCOPUS:85070453873
T3 - Proceedings - 2019 IEEE International Conference on Artificial Intelligence Circuits and Systems, AICAS 2019
SP - 42
EP - 45
BT - Proceedings - 2019 IEEE International Conference on Artificial Intelligence Circuits and Systems, AICAS 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 1st IEEE International Conference on Artificial Intelligence Circuits and Systems, AICAS 2019
Y2 - 18 March 2019 through 20 March 2019
ER -