TY - JOUR
T1 - Data and Hardware Efficient Design for Convolutional Neural Network
AU - Lin, Yue Jin
AU - Chang, Tian-Sheuan
PY - 2018/5/1
Y1 - 2018/5/1
N2 - Hardware design of deep convolutional neural networks (CNNs) faces challenges of high computational complexity and data bandwidth as well as huge divergence in different CNN network layers, in which the throughput of the convolutional layer would be bounded by available hardware resource, and throughput of the fully connected layer would be bounded by available data bandwidth. Thus, a highly flexible and efficient design is desired to meet these needs. This paper presents an end-to-end CNN accelerator that maximizes hardware utilization with run-time configurations of different kernel sizes. It also minimizes data bandwidth with the output first strategy to improve the data reuse of the convolutional layers by up to 300× ∼ 600× compared with the non-reused case. The whole CNN implementation of the target network is generated optimally for both hardware and data efficiency under design resource constraints, which can be run-time reconfigured by the layer optimized parameters to achieve real-time and end-to-end CNN acceleration. An implementation example for AlexNet consumes a 1.783 M gate count for 216 MACs and a 142.64 kb internal buffer with TSMC 40 nm process, and achieves 99.7 and 61.6 f/s under 454 MHz clock frequency for the convolutional layers and the whole AlexNet, respectively.
AB - Hardware design of deep convolutional neural networks (CNNs) faces challenges of high computational complexity and data bandwidth as well as huge divergence in different CNN network layers, in which the throughput of the convolutional layer would be bounded by available hardware resource, and throughput of the fully connected layer would be bounded by available data bandwidth. Thus, a highly flexible and efficient design is desired to meet these needs. This paper presents an end-to-end CNN accelerator that maximizes hardware utilization with run-time configurations of different kernel sizes. It also minimizes data bandwidth with the output first strategy to improve the data reuse of the convolutional layers by up to 300× ∼ 600× compared with the non-reused case. The whole CNN implementation of the target network is generated optimally for both hardware and data efficiency under design resource constraints, which can be run-time reconfigured by the layer optimized parameters to achieve real-time and end-to-end CNN acceleration. An implementation example for AlexNet consumes a 1.783 M gate count for 216 MACs and a 142.64 kb internal buffer with TSMC 40 nm process, and achieves 99.7 and 61.6 f/s under 454 MHz clock frequency for the convolutional layers and the whole AlexNet, respectively.
KW - Convolutional neural networks (CNNs)
KW - hardware design
UR - http://www.scopus.com/inward/record.url?scp=85033707033&partnerID=8YFLogxK
U2 - 10.1109/TCSI.2017.2759803
DO - 10.1109/TCSI.2017.2759803
M3 - Article
AN - SCOPUS:85033707033
SN - 1549-8328
VL - 65
SP - 1642
EP - 1651
JO - IEEE Transactions on Circuits and Systems I: Regular Papers
JF - IEEE Transactions on Circuits and Systems I: Regular Papers
IS - 5
ER -