TY - GEN
T1 - A 2.25 TOPS/W fully-integrated deep CNN learning processor with on-chip training
AU - Lu, Cheng Hsun
AU - Wu, Yi Chung
AU - Yang, Chia Hsiang
N1 - Publisher Copyright:
© 2019 Institute of Electrical and Electronics Engineers Inc. All rights reserved.
PY - 2019/11
Y1 - 2019/11
N2 - This paper presents a deep learning processor that supports both inference and training for entire convolutional neural networks (CNNs) of any size. The proposed design enables on-chip training for applications that demand high security and privacy. Techniques across design abstraction levels are applied to improve energy efficiency. Rearrangement of the weights in filters is leveraged to reduce the processing latency by 88%. Integration of fixed-point and floating-point arithmetic reduces the area of the multiplier by 56.8%, resulting in a unified processing element (PE) with 33% less area. In the low-precision mode, clock gating and data gating are employed to reduce the power of the PE cluster by 62%. Max-pooling and ReLU modules are co-designed to reduce memory usage by 75%. A modified softmax function is utilized to reduce the area by 78%. Fabricated in 40-nm CMOS, the chip consumes 18.7 mW and 64.5 mW for inference and training, respectively, at 82 MHz from a 0.6-V supply. It achieves an energy efficiency of 2.25 TOPS/W, which is 2.67 times higher than that of state-of-the-art learning processors. The chip also achieves a 2×10⁵ times higher energy efficiency in training than a high-end CPU.
AB - This paper presents a deep learning processor that supports both inference and training for entire convolutional neural networks (CNNs) of any size. The proposed design enables on-chip training for applications that demand high security and privacy. Techniques across design abstraction levels are applied to improve energy efficiency. Rearrangement of the weights in filters is leveraged to reduce the processing latency by 88%. Integration of fixed-point and floating-point arithmetic reduces the area of the multiplier by 56.8%, resulting in a unified processing element (PE) with 33% less area. In the low-precision mode, clock gating and data gating are employed to reduce the power of the PE cluster by 62%. Max-pooling and ReLU modules are co-designed to reduce memory usage by 75%. A modified softmax function is utilized to reduce the area by 78%. Fabricated in 40-nm CMOS, the chip consumes 18.7 mW and 64.5 mW for inference and training, respectively, at 82 MHz from a 0.6-V supply. It achieves an energy efficiency of 2.25 TOPS/W, which is 2.67 times higher than that of state-of-the-art learning processors. The chip also achieves a 2×10⁵ times higher energy efficiency in training than a high-end CPU.
KW - CMOS digital integrated circuits
KW - Convolutional neural network
KW - Deep learning
KW - Specialized processor
UR - http://www.scopus.com/inward/record.url?scp=85090188108&partnerID=8YFLogxK
U2 - 10.1109/A-SSCC47793.2019.9056967
DO - 10.1109/A-SSCC47793.2019.9056967
M3 - Conference contribution
AN - SCOPUS:85090188108
T3 - Proceedings - 2019 IEEE Asian Solid-State Circuits Conference, A-SSCC 2019
SP - 65
EP - 68
BT - Proceedings - 2019 IEEE Asian Solid-State Circuits Conference, A-SSCC 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 15th IEEE Asian Solid-State Circuits Conference, A-SSCC 2019
Y2 - 4 November 2019 through 6 November 2019
ER -