TY - GEN
T1 - High-speed power-efficient coarse-grained convolver architecture using depth-first compression scheme
AU - Wu, Yi Lin
AU - Lu, Yi
AU - Huang, Juinn Dar
N1 - Publisher Copyright:
© 2020 IEEE
PY - 2020/10
Y1 - 2020/10
AB - Convolutional neural networks (CNNs) play an important role in various applications, e.g., computer vision. Since CNN computations require numerous multiply-accumulate (MAC) operations, performing them efficiently is a crucial issue for CNN hardware accelerators. In this paper, we propose a high-speed, power-efficient convolver architecture for CNN acceleration. A 3×3 convolver is expected to produce an output every cycle; this is commonly accomplished by summing up the results of nine parallel multiplications, which requires ten carry-propagation adders (CPAs) in total. In contrast, the proposed coarse-grained convolver breaks the boundaries between multipliers and reduces all partial products in a more global way; consequently, it requires only one CPA to generate the final outcome. It also features a globally delay-optimized partial product reduction tree and a depth-first compression scheme for both area and power minimization. The proposed convolver has been implemented in TSMC 40 nm technology. Compared to a conventional 3×3 convolver baseline design, our design reduces area and power by 15.8% and 26.5%, respectively, at a clock rate of 1 GHz.
KW - Compensation vector
KW - Convolutional neural network (CNN)
KW - Convolver design
KW - Depth-first compression
KW - Hardware accelerator
KW - Multiply-accumulate operation
UR - http://www.scopus.com/inward/record.url?scp=85109256563&partnerID=8YFLogxK
U2 - 10.1109/ISCAS45731.2020.9180406
DO - 10.1109/ISCAS45731.2020.9180406
M3 - Conference contribution
AN - SCOPUS:85109256563
T3 - Proceedings - IEEE International Symposium on Circuits and Systems
BT - 2020 IEEE International Symposium on Circuits and Systems, ISCAS 2020 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 52nd IEEE International Symposium on Circuits and Systems, ISCAS 2020
Y2 - 10 October 2020 through 21 October 2020
ER -