TY - GEN
T1 - Multi-scale Motion-Aware Module for Video Action Recognition
AU - Peng, Huai Wei
AU - Tseng, Yu Chee
N1 - Publisher Copyright: © 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2023
Y1 - 2023
N2 - Because computing optical flow is slow, recent works have proposed the correlation operation as an alternative way to extract motion features. Although correlation operations yield significant improvements with negligible FLOPs, they incur much more latency per FLOP than convolution operations, and the latency grows noticeably as a larger searching patch is applied. Shrinking the searching patch, however, degrades performance because the operation can no longer capture larger displacements. In this paper, we propose an effective and low-latency Multi-Scale Motion-Aware (MSMA) module. It uses smaller searching patches at different scales to efficiently extract motion features from large displacements. It can be installed into different CNN backbones and generalizes well across them. When installed into TSM ResNet-50, the MSMA module introduces ≈ 17.6% more latency on an NVIDIA Tesla V100 GPU, yet it achieves state-of-the-art performance on Something-Something V1 & V2 and Diving-48.
KW - Correlation operations
KW - Latency-performance trade-off
KW - Motion features extracting
KW - Video classification
UR - http://www.scopus.com/inward/record.url?scp=85150964996&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-25075-0_40
DO - 10.1007/978-3-031-25075-0_40
M3 - Conference contribution
AN - SCOPUS:85150964996
SN - 9783031250743
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 589
EP - 606
BT - Computer Vision – ECCV 2022 Workshops, Proceedings
A2 - Karlinsky, Leonid
A2 - Michaeli, Tomer
A2 - Nishino, Ko
PB - Springer Science and Business Media Deutschland GmbH
T2 - 17th European Conference on Computer Vision, ECCV 2022
Y2 - 23 October 2022 through 27 October 2022
ER -