A multi-embedding neural model for incident video retrieval

Ting Hui Chiang*, Yi Chun Tseng, Yu Chee Tseng

*Corresponding author of this work

Research output: Article, peer-reviewed

Abstract

Many internet search engines have been developed; however, the retrieval of video clips remains a challenge. This paper considers the retrieval of incident videos, which tend to carry richer spatial and temporal semantics. We propose an encoder-decoder ConvLSTM model that explores multiple embeddings of a video to facilitate similarity comparison between a pair of videos. The model encodes a video into an embedding that integrates both its spatial information and its temporal semantics. Multiple video embeddings are then generated from coarse- and fine-grained features of the video to capture both high- and low-level meanings. Subsequently, a learning-based comparative model is proposed to compare the similarity of two videos based on their embeddings. Extensive evaluations show that our model outperforms state-of-the-art methods on several video retrieval tasks over the FIVR-200K, CC_WEB_VIDEO, and EVVE datasets.
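The following is a minimal sketch, assuming PyTorch, of the kind of pipeline the abstract outlines. All names and sizes here (ConvLSTMCell, VideoEncoder, video_similarity, the 64-channel hidden state) are hypothetical illustrations, not the authors' code; in particular, the paper proposes a learned comparative model over coarse- and fine-grained embeddings, whereas this sketch substitutes a plain cosine average for the learned comparator.

```python
# Illustrative sketch only; layer names and sizes are assumptions,
# not the architecture published in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvLSTMCell(nn.Module):
    """Single ConvLSTM cell: gates are computed with 2-D convolutions instead
    of the dense products of a plain LSTM, so spatial layout is preserved."""
    def __init__(self, in_ch, hid_ch, kernel=3):
        super().__init__()
        # One convolution yields all four gates (input, forget, cell, output).
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel,
                               padding=kernel // 2)

    def forward(self, x, state):
        h, c = state
        i, f, g, o = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class VideoEncoder(nn.Module):
    """Encodes a frame sequence into one embedding by rolling a ConvLSTM over
    per-frame feature maps and pooling the final hidden state."""
    def __init__(self, in_ch=3, hid_ch=64):
        super().__init__()
        self.cell = ConvLSTMCell(in_ch, hid_ch)
        self.hid_ch = hid_ch

    def forward(self, frames):                      # frames: (B, T, C, H, W)
        b, t, _, h, w = frames.shape
        hx = frames.new_zeros(b, self.hid_ch, h, w)
        cx = frames.new_zeros(b, self.hid_ch, h, w)
        for step in range(t):
            hx, cx = self.cell(frames[:, step], (hx, cx))
        # Global average pooling collapses the spatial map into one vector.
        return F.normalize(hx.mean(dim=(2, 3)), dim=1)

def video_similarity(embs_a, embs_b):
    """Cosine similarity between two embedding sets, averaged over
    granularities (a stand-in for the paper's learned comparator)."""
    return torch.stack([F.cosine_similarity(a, b)
                        for a, b in zip(embs_a, embs_b)]).mean(0)

if __name__ == "__main__":
    enc = VideoEncoder()
    clip1 = torch.randn(1, 8, 3, 32, 32)   # one 8-frame toy clip
    clip2 = torch.randn(1, 8, 3, 32, 32)
    print(video_similarity([enc(clip1)], [enc(clip2)]).item())
```

The point the sketch tries to make concrete is why a ConvLSTM suits this task: its gate activations are convolutions over feature maps rather than products over flattened vectors, so the hidden state retains spatial structure while accumulating temporal context across frames.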

Original language: English
Article number: 108807
Journal: Pattern Recognition
Volume: 130
DOIs
Publication status: Published - October 2022
