Exploring Cross-Video and Cross-Modality Signals for Weakly-Supervised Audio-Visual Video Parsing

Yan Bo Lin, Hung Yu Tseng, Hsin Ying Lee, Yen Yu Lin, Ming Hsuan Yang

Research output: Conference contribution, peer-reviewed

45 citations (Scopus)

Abstract

The audio-visual video parsing task aims to temporally parse a video into audio or visual event categories. However, it is labor-intensive to temporally annotate audio and visual events and thus hampers the learning of a parsing model. To this end, we propose to explore additional cross-video and cross-modality supervisory signals to facilitate weakly-supervised audio-visual video parsing. The proposed method exploits both the common and diverse event semantics across videos to identify audio or visual events. In addition, our method explores event co-occurrence across audio, visual, and audio-visual streams. We leverage the explored cross-modality co-occurrence to localize segments of target events while excluding irrelevant ones. The discovered supervisory signals across different videos and modalities can greatly facilitate the training with only video-level annotations. Quantitative and qualitative results demonstrate that the proposed method performs favorably against existing methods on weakly-supervised audio-visual video parsing.

Original language: English
Title of host publication: Advances in Neural Information Processing Systems 34 - 35th Conference on Neural Information Processing Systems, NeurIPS 2021
Editors: Marc'Aurelio Ranzato, Alina Beygelzimer, Yann Dauphin, Percy S. Liang, Jenn Wortman Vaughan
Publisher: Neural Information Processing Systems Foundation
Pages: 11449-11461
Number of pages: 13
ISBN (electronic): 9781713845393
Publication status: Published - 2021
Event: 35th Conference on Neural Information Processing Systems, NeurIPS 2021 - Virtual, Online
Duration: 6 Dec 2021 to 14 Dec 2021

Publication series

Name: Advances in Neural Information Processing Systems
Volume: 14
ISSN (print): 1049-5258

Conference

Conference: 35th Conference on Neural Information Processing Systems, NeurIPS 2021
City: Virtual, Online
Period: 6/12/21 to 14/12/21

