TY - GEN
T1 - A Coarse-Grained Dual-Convolver Based CNN Accelerator with High Computing Resource Utilization
AU - Lu, Yi
AU - Wu, Yi Lin
AU - Huang, Juinn Dar
N1 - Publisher Copyright:
© 2020 IEEE.
PY - 2020/8
Y1 - 2020/8
N2 - Deep learning technologies have developed rapidly in recent years and play an important role in our lives. Among them, the convolutional neural network (CNN) performs well in many applications. The quality of results generally improves as the number of convolutional layers increases, which also increases the computational complexity. Hence, a highly resource-efficient accelerator is in demand. In this paper, we propose a new CNN accelerator that features a delay-chain-free input data aligner as well as a dual-convolver processing element (DCPE). Our architecture does not require delay chains with a large number of registers for input data alignment, which not only reduces area and power but also improves overall resource utilization. In addition, a set of DCPEs shares the same input aligner to produce multiple output feature maps concurrently, which offers the desired computing power and reduces external memory traffic. An accelerator instance with 8 DCPEs (144 MACs) has been implemented in a TSMC 40nm process. The internal logic consumes only 285K gates and the total internal memory size is merely 44KB. When running VGG-16, the average performance is 190 GOPS (@750MHz), the resource (MAC) utilization reaches 85.3%, and the energy efficiency is 481 GOPS/W.
AB - Deep learning technologies have developed rapidly in recent years and play an important role in our lives. Among them, the convolutional neural network (CNN) performs well in many applications. The quality of results generally improves as the number of convolutional layers increases, which also increases the computational complexity. Hence, a highly resource-efficient accelerator is in demand. In this paper, we propose a new CNN accelerator that features a delay-chain-free input data aligner as well as a dual-convolver processing element (DCPE). Our architecture does not require delay chains with a large number of registers for input data alignment, which not only reduces area and power but also improves overall resource utilization. In addition, a set of DCPEs shares the same input aligner to produce multiple output feature maps concurrently, which offers the desired computing power and reduces external memory traffic. An accelerator instance with 8 DCPEs (144 MACs) has been implemented in a TSMC 40nm process. The internal logic consumes only 285K gates and the total internal memory size is merely 44KB. When running VGG-16, the average performance is 190 GOPS (@750MHz), the resource (MAC) utilization reaches 85.3%, and the energy efficiency is 481 GOPS/W.
KW - convolutional neural network (CNN)
KW - hardware accelerator
KW - high resource utilization
KW - low data bandwidth
UR - http://www.scopus.com/inward/record.url?scp=85084975345&partnerID=8YFLogxK
U2 - 10.1109/AICAS48895.2020.9073835
DO - 10.1109/AICAS48895.2020.9073835
M3 - Conference contribution
AN - SCOPUS:85084975345
T3 - Proceedings - 2020 IEEE International Conference on Artificial Intelligence Circuits and Systems, AICAS 2020
SP - 198
EP - 202
BT - Proceedings - 2020 IEEE International Conference on Artificial Intelligence Circuits and Systems, AICAS 2020
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2020 IEEE International Conference on Artificial Intelligence Circuits and Systems, AICAS 2020
Y2 - 31 August 2020 through 2 September 2020
ER -