TY - GEN
T1 - Near-DRAM Accelerated Matrix Multiplications
AU - Sinha, Aman
AU - Lai, Bo Cheng
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - General matrix multiplication (GeMM) is a fundamental computing operation underpinning applications in machine learning, data science, computer graphics, and scientific simulations. However, its performance on von Neumann computers, such as CPUs and GPUs, is bottlenecked by insufficient memory access bandwidth, low cache efficiency, hardware under-utilization, and high power consumption. GPU performance suffers drastically on the complex sequential analyses that often accompany GeMMs in big data pipelines. Furthermore, the GeMM-optimized Tensor Cores available in modern Nvidia GPUs offer no extensibility to non-GeMM tasks. Various Near-Memory Computing (NMC) architectures have recently been explored to alleviate the data-intensive nature of analyses such as GeMM. This work evaluates the performance potential of GeMM using NMC through clusters of simple interconnected processing cores on a stacked DRAM platform. The proposed design shows efficiency comparable to high-end Nvidia GPUs while consuming less power and remaining highly extensible to logically complex non-GeMM workloads.
KW - DNNs
KW - GeMM
KW - Matrix Multiplication
KW - Near-Memory Computing
KW - RISC-V
KW - Stacked Memory
UR - http://www.scopus.com/inward/record.url?scp=85217076404&partnerID=8YFLogxK
U2 - 10.1109/MCSoC64144.2024.00048
DO - 10.1109/MCSoC64144.2024.00048
M3 - Conference contribution
AN - SCOPUS:85217076404
T3 - Proceedings - 2024 IEEE 17th International Symposium on Embedded Multicore/Many-core Systems-on-Chip, MCSoC 2024
SP - 245
EP - 248
BT - Proceedings - 2024 IEEE 17th International Symposium on Embedded Multicore/Many-core Systems-on-Chip, MCSoC 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 17th IEEE International Symposium on Embedded Multicore/Many-core Systems-on-Chip, MCSoC 2024
Y2 - 16 December 2024 through 19 December 2024
ER -