TY - JOUR
T1 - VWA
T2 - Hardware Efficient Vectorwise Accelerator for Convolutional Neural Network
AU - Chang, Kuo-Wei
AU - Chang, Tian-Sheuan
N1 - Publisher Copyright:
© 2004-2012 IEEE.
PY - 2020/1
Y1 - 2020/1
N2 - Hardware accelerators for convolutional neural networks (CNNs) enable real-time applications of artificial intelligence technology. However, most existing designs suffer from low hardware utilization or high area cost due to complex data flow. This paper proposes a hardware-efficient vectorwise CNN accelerator that adopts a 3 × 3 filter optimized systolic array using 1-D broadcast data flow to generate partial sums. This enables easy reconfiguration for different kinds of kernels with interleaved input or elementwise input data flow. This simple and regular data flow results in low area cost while attaining high hardware utilization. The presented design achieves 99%, 97%, 93.7%, and 94% hardware utilization for VGG-16, ResNet-34, GoogLeNet, and MobileNet, respectively. Hardware implementation in TSMC 40 nm technology takes a 266.9K NAND gate count and 191 KB of SRAM to support 168 GOPS throughput while consuming only 154.98 mW at a 500 MHz operating frequency, offering superior area and power efficiency compared with other designs.
AB - Hardware accelerators for convolutional neural networks (CNNs) enable real-time applications of artificial intelligence technology. However, most existing designs suffer from low hardware utilization or high area cost due to complex data flow. This paper proposes a hardware-efficient vectorwise CNN accelerator that adopts a 3 × 3 filter optimized systolic array using 1-D broadcast data flow to generate partial sums. This enables easy reconfiguration for different kinds of kernels with interleaved input or elementwise input data flow. This simple and regular data flow results in low area cost while attaining high hardware utilization. The presented design achieves 99%, 97%, 93.7%, and 94% hardware utilization for VGG-16, ResNet-34, GoogLeNet, and MobileNet, respectively. Hardware implementation in TSMC 40 nm technology takes a 266.9K NAND gate count and 191 KB of SRAM to support 168 GOPS throughput while consuming only 154.98 mW at a 500 MHz operating frequency, offering superior area and power efficiency compared with other designs.
KW - Convolution neural networks (CNNs)
KW - accelerators
KW - hardware design
UR - http://www.scopus.com/inward/record.url?scp=85078352992&partnerID=8YFLogxK
U2 - 10.1109/TCSI.2019.2942529
DO - 10.1109/TCSI.2019.2942529
M3 - Article
AN - SCOPUS:85078352992
SN - 1549-8328
VL - 67
SP - 145
EP - 154
JO - IEEE Transactions on Circuits and Systems I: Regular Papers
JF - IEEE Transactions on Circuits and Systems I: Regular Papers
IS - 1
M1 - 8854849
ER -