Many internet search engines have been developed, however, the retrieval of video clips remains a challenge. This paper considers the retrieval of incident videos, which may contain more spatial and temporal semantics. We propose an encoder-decoder ConvLSTM model that explores multiple embeddings of a video to facilitate comparison of similarity between a pair of videos. The model is able to encode a video into an embedding that integrates both its spatial information and temporal semantics. Multiple video embeddings are then generated from coarse- and fine-grained features of a video to capture high- and low-level meanings. Subsequently, a learning-based comparative model is proposed to compare the similarity of two videos based on their embeddings. Extensive evaluations are presented and show that our model outperforms state-of-the-art methods for several video retrieval tasks on the FIVR-200K, CC_WEB_VIDEO, and EVVE datasets.