TY - GEN
T1 - A Novel Number Representation and Its Hardware Support for Accurate Low-Bit Quantization on Large Recommender Systems
AU - Chu, Yu Da
AU - Kuo, Pei Hsuan
AU - Ho, Lyu Ming
AU - Huang, Juinn Dar
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Deep-learning-based recommender systems with large embedding tables have become pivotal for web content recommendation. However, the growing size of those tables, reaching tens of gigabytes or even terabytes, presents a tough challenge for conducting inference on resource-constrained hardware. In this paper, we present a novel 6-bit fixed-point number representation format for more precise quantization of recommender models. The proposed format is specifically designed to accommodate the nonuniform weight distribution inside those huge embedding tables. To further reduce the model size, the well-known K-means quantization technique is utilized for 4-bit quantization and beyond. Moreover, we also propose dedicated hardware decoder architectures for both 6-bit and 4-bit quantization to ensure efficient runtime inference. Experimental results show that the proposed low-bit (8~3-bit) quantization techniques on embedding tables yield a 4~10.7x model size reduction with minor accuracy loss compared to the original FP32 model. Therefore, the proposed number representation format and low-bit quantization techniques can effectively and drastically reduce the model size of large recommender systems at a very low area cost while keeping the accuracy loss minimal.
AB - Deep-learning-based recommender systems with large embedding tables have become pivotal for web content recommendation. However, the growing size of those tables, reaching tens of gigabytes or even terabytes, presents a tough challenge for conducting inference on resource-constrained hardware. In this paper, we present a novel 6-bit fixed-point number representation format for more precise quantization of recommender models. The proposed format is specifically designed to accommodate the nonuniform weight distribution inside those huge embedding tables. To further reduce the model size, the well-known K-means quantization technique is utilized for 4-bit quantization and beyond. Moreover, we also propose dedicated hardware decoder architectures for both 6-bit and 4-bit quantization to ensure efficient runtime inference. Experimental results show that the proposed low-bit (8~3-bit) quantization techniques on embedding tables yield a 4~10.7x model size reduction with minor accuracy loss compared to the original FP32 model. Therefore, the proposed number representation format and low-bit quantization techniques can effectively and drastically reduce the model size of large recommender systems at a very low area cost while keeping the accuracy loss minimal.
KW - number representation
KW - quantization
KW - recommender system
UR - http://www.scopus.com/inward/record.url?scp=85199874903&partnerID=8YFLogxK
U2 - 10.1109/AICAS59952.2024.10595902
DO - 10.1109/AICAS59952.2024.10595902
M3 - Conference contribution
AN - SCOPUS:85199874903
T3 - 2024 IEEE 6th International Conference on AI Circuits and Systems, AICAS 2024 - Proceedings
SP - 437
EP - 441
BT - 2024 IEEE 6th International Conference on AI Circuits and Systems, AICAS 2024 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 6th IEEE International Conference on AI Circuits and Systems, AICAS 2024
Y2 - 22 April 2024 through 25 April 2024
ER -