A Unified SpatioTemporal Network with Structural Pruning for Video Action Recognition

  Yang Jie Chen*, Rashid Ali, Hsu Feng Hsiao
  *Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Video action recognition poses significant challenges in capturing and integrating the complex spatiotemporal patterns and motion dynamics necessary for robust understanding. Despite recent advancements, existing deep learning approaches often struggle to efficiently model these interactions over extended temporal ranges. To address this, we propose the Unified SpatioTemporal Network (USTN), a novel framework that fuses segment-level spatiotemporal features with long-range temporal difference information. By strategically employing sparse frame sampling, USTN constructs a rich, coarse-grained representation encapsulating both spatial structure and temporal evolution. Furthermore, we introduce a structural pruning technique to identify and remove redundant parameters, mitigating overfitting and enhancing computational efficiency without compromising performance. Extensive evaluations on the challenging UCF101 and HMDB51 benchmarks, using USTN instantiated with ResNet backbones, demonstrate the superiority of our approach.
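
The abstract names three mechanisms: sparse frame sampling over segments, long-range temporal differences, and structural pruning. The sketch below illustrates one plausible reading of each in plain Python. It is not the authors' implementation. The function names, the choice of the middle frame per segment, the simple consecutive-feature difference, and the L1-norm channel ranking are all assumptions made for illustration; the abstract does not specify any of them.

```python
from typing import List, Sequence


def sparse_sample_indices(num_frames: int, num_segments: int) -> List[int]:
    """Split a video into equal segments and pick each segment's middle frame.

    Assumption: middle-frame selection; the paper may sample differently.
    """
    if num_segments <= 0 or num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    seg_len = num_frames / num_segments
    return [int(seg_len * i + seg_len / 2) for i in range(num_segments)]


def temporal_differences(features: Sequence[Sequence[float]]) -> List[List[float]]:
    """Element-wise difference between consecutive sampled feature vectors.

    Because the samples are sparse, each difference spans a long time range.
    """
    return [
        [b - a for a, b in zip(prev, nxt)]
        for prev, nxt in zip(features, features[1:])
    ]


def channels_to_keep(filters: Sequence[Sequence[float]], prune_ratio: float) -> List[int]:
    """Return indices of output channels that survive pruning.

    Assumption: channels are ranked by the L1 norm of their flattened weights,
    a common structural-pruning heuristic, not necessarily the paper's criterion.
    """
    if not 0.0 <= prune_ratio < 1.0:
        raise ValueError("prune_ratio must be in [0, 1)")
    norms = [sum(abs(w) for w in f) for f in filters]
    num_keep = max(1, len(filters) - int(len(filters) * prune_ratio))
    ranked = sorted(range(len(filters)), key=lambda i: norms[i], reverse=True)
    return sorted(ranked[:num_keep])


if __name__ == "__main__":
    # 32 frames split into 8 segments of 4 frames each.
    print(sparse_sample_indices(32, 8))
    # Toy 2-dimensional features for three sampled frames.
    print(temporal_differences([[1.0, 2.0], [3.0, 5.0], [3.0, 4.0]]))
    # Four toy filters; prune the weakest half.
    print(channels_to_keep([[0.1, -0.1], [1.0, 2.0], [0.0, 0.05], [-3.0, 0.5]], 0.5))
```

Removing whole channels rather than individual weights is what makes the pruning structural: the layer shrinks, so the saving shows up as fewer parameters and less computation without needing sparse kernels.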

Original language: English
Title of host publication: ISCAS 2025 - IEEE International Symposium on Circuits and Systems, Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798350356830
State: Published - 2025
Event: 2025 IEEE International Symposium on Circuits and Systems, ISCAS 2025 - London, United Kingdom
Duration: 25 May 2025 to 28 May 2025

Publication series

Name: Proceedings - IEEE International Symposium on Circuits and Systems
ISSN (Print): 0271-4310

Conference

Conference: 2025 IEEE International Symposium on Circuits and Systems, ISCAS 2025
Country/Territory: United Kingdom
City: London
Period: 25/05/25 to 28/05/25
