Video summarization with frame index vision transformer

Tzu Chun Hsu, Yi Sheng Liao, Chun Rong Huang

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

6 Scopus citations

Abstract

In this paper, we propose a novel frame index vision transformer for video summarization. Given training frames, we linearly project the content of the frames to obtain frame embeddings. By combining the frame embeddings with the index embedding and the class embedding, the proposed frame index vision transformer can efficiently and effectively learn the importance of the input frames. As shown in the experimental results, the proposed method outperforms state-of-the-art deep learning methods, including recurrent neural network (RNN) and convolutional neural network (CNN) based methods, on both the SumMe and TVSum datasets. In addition, our method achieves real-time computational efficiency during testing.
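
The embedding step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes the usual vision transformer convention, which the abstract does not confirm: the index embedding is added to each projected frame and the class embedding is prepended as an extra token. The function name, parameter names, and dimensions are all hypothetical, and the weights are random rather than learned.

```python
import numpy as np

def build_token_sequence(frame_features, w_proj, index_table, class_token):
    """Combine frame, index, and class embeddings into one token sequence.

    All arguments are hypothetical stand-ins for learned parameters:
      frame_features: (T, D_in) per-frame content features
      w_proj:         (D_in, D) linear projection weights
      index_table:    (T_max, D) one embedding per frame index
      class_token:    (D,) class embedding
    """
    num_frames = frame_features.shape[0]
    # Linear projection of frame content gives the frame embeddings.
    frame_emb = frame_features @ w_proj                # (T, D)
    # Assumption: index embeddings are added element-wise, as in ViT.
    tokens = frame_emb + index_table[:num_frames]      # (T, D)
    # Assumption: the class embedding is prepended as token 0.
    return np.vstack([class_token[None, :], tokens])   # (T + 1, D)

rng = np.random.default_rng(0)
T, D_in, D = 8, 16, 4
tokens = build_token_sequence(
    rng.normal(size=(T, D_in)),
    rng.normal(size=(D_in, D)),
    rng.normal(size=(32, D)),
    np.zeros(D),
)
print(tokens.shape)  # (9, 4)
```

A sequence like this would then be passed to a transformer encoder whose outputs are mapped to per-frame importance scores; that part is omitted here because the abstract gives no architectural details.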

Original language: English
Title of host publication: Proceedings of MVA 2021 - 17th International Conference on Machine Vision Applications
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9784901122207
State: Published - 25 Jul 2021
Event: 17th International Conference on Machine Vision Applications, MVA 2021 - Aichi, Japan
Duration: 25 Jul 2021 - 27 Jul 2021

Publication series

Name: Proceedings of MVA 2021 - 17th International Conference on Machine Vision Applications

Conference

Conference: 17th International Conference on Machine Vision Applications, MVA 2021
Country/Territory: Japan
City: Aichi
Period: 25/07/21 - 27/07/21
