A multi-embedding neural model for incident video retrieval

Ting Hui Chiang*, Yi Chun Tseng, Yu Chee Tseng

*Corresponding author for this work

Research output: Contribution to journal, Article, peer-review

1 Scopus citation


Many internet search engines have been developed; however, the retrieval of video clips remains a challenge. This paper considers the retrieval of incident videos, which may contain more spatial and temporal semantics. We propose an encoder-decoder ConvLSTM model that explores multiple embeddings of a video to facilitate similarity comparison between a pair of videos. The model encodes a video into an embedding that integrates both its spatial information and its temporal semantics. Multiple video embeddings are then generated from coarse- and fine-grained features of a video to capture high- and low-level meanings. Subsequently, a learning-based comparative model is proposed to compare the similarity of two videos based on their embeddings. Extensive evaluations show that our model outperforms state-of-the-art methods on several video retrieval tasks on the FIVR-200K, CC_WEB_VIDEO, and EVVE datasets.

Original language: English
Article number: 108807
Journal: Pattern Recognition
State: Published - Oct 2022


  • Artificial intelligence
  • Computer vision
  • Deep metric learning
  • Incident video retrieval

