TY - GEN
T1 - Substitution of kernel functions based on pattern matching on schedule trees
AU - Chen, Zi Xuan
AU - Yang, Wuu
N1 - Publisher Copyright:
© 2024 Owner/Author.
PY - 2024/8/12
Y1 - 2024/8/12
N2 - With the rise of AI, computing hardware with varying architectures has emerged. For some frequently used AI kernels, this hardware provides special accelerators and related instructions. For example, since the Volta architecture, Nvidia GPUs have provided tensor cores to optimize operations related to matrix multiplication. The vector extension of the RISC-V architecture provides instruction-level parallelism for kernels. We design and implement a pattern-matching language with which a user can define patterns for kernels. We identify segments of the schedule trees that match the defined patterns and replace those segments with calls to kernel functions (in libraries) or intrinsics that are optimized for the specific accelerators. In the experiments, the Polybench benchmarks are optimized for (and hence linked with) the following libraries: CBLAS on the x64 platform, CuBLAS with tensor-core instructions on GPU, and OpenBLAS with vector instructions on the RISC-V platform (software emulation, using the vector-instruction emulation capability provided by the Ara vector unit). The average (geomean) performance improvements on selected BLAS benchmarks are: (1) a 1.38x run-time speedup for CBLAS on the x64 platform; (2) a 5.27x run-time speedup for CuBLAS with tensor-core instructions on GPU; and (3) a 5.78x cycle-count speedup for OpenBLAS with vector instructions on the RISC-V platform.
AB - With the rise of AI, computing hardware with varying architectures has emerged. For some frequently used AI kernels, this hardware provides special accelerators and related instructions. For example, since the Volta architecture, Nvidia GPUs have provided tensor cores to optimize operations related to matrix multiplication. The vector extension of the RISC-V architecture provides instruction-level parallelism for kernels. We design and implement a pattern-matching language with which a user can define patterns for kernels. We identify segments of the schedule trees that match the defined patterns and replace those segments with calls to kernel functions (in libraries) or intrinsics that are optimized for the specific accelerators. In the experiments, the Polybench benchmarks are optimized for (and hence linked with) the following libraries: CBLAS on the x64 platform, CuBLAS with tensor-core instructions on GPU, and OpenBLAS with vector instructions on the RISC-V platform (software emulation, using the vector-instruction emulation capability provided by the Ara vector unit). The average (geomean) performance improvements on selected BLAS benchmarks are: (1) a 1.38x run-time speedup for CBLAS on the x64 platform; (2) a 5.27x run-time speedup for CuBLAS with tensor-core instructions on GPU; and (3) a 5.78x cycle-count speedup for OpenBLAS with vector instructions on the RISC-V platform.
KW - GPU
KW - RISC-V
KW - pattern matching
KW - polyhedral compilation
KW - x86-64
UR - http://www.scopus.com/inward/record.url?scp=85202865175&partnerID=8YFLogxK
U2 - 10.1145/3677333.3678152
DO - 10.1145/3677333.3678152
M3 - Conference contribution
AN - SCOPUS:85202865175
T3 - ACM International Conference Proceeding Series
SP - 48
EP - 57
BT - 53rd International Conference on Parallel Processing, ICPP 2024 - Workshops Proceedings
PB - Association for Computing Machinery
T2 - 53rd International Conference on Parallel Processing, ICPP 2024
Y2 - 12 August 2024 through 15 August 2024
ER -