TY - GEN
T1 - Weakly-supervised video re-localization with multiscale attention model
AU - Huang, Yung-Han
AU - Hsu, Kuang-Jui
AU - Jeng, Shyh-Kang
AU - Lin, Yen-Yu
N1 - Publisher Copyright:
Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
PY - 2020/4/3
Y1 - 2020/4/3
N2 - Video re-localization aims to localize a sub-sequence, called the target segment, in an untrimmed reference video that is similar to a given query video. In this work, we propose an attention-based model to accomplish this task in a weakly supervised setting. Namely, we derive our CNN-based model without using the annotated locations of the target segments in reference videos. Our model contains three modules. First, it employs a pre-trained C3D network for feature extraction. Second, we design an attention mechanism to extract multiscale temporal features, which are then used to estimate the similarity between the query video and a reference video. Third, a localization layer detects where the target segment is in the reference video by determining whether each frame in the reference video is consistent with the query video. The resultant CNN model is derived based on the proposed coattention loss, which discriminatively separates the target segment from the reference video. This loss maximizes the similarity between the query video and the target segment while minimizing the similarity between the target segment and the rest of the reference video. Our model can also be modified for fully supervised re-localization. Our method is evaluated on a public dataset and achieves state-of-the-art performance under both weakly supervised and fully supervised settings.
UR - http://www.scopus.com/inward/record.url?scp=85106420780&partnerID=8YFLogxK
U2 - 10.1609/aaai.v34i07.6763
DO - 10.1609/aaai.v34i07.6763
M3 - Conference contribution
AN - SCOPUS:85106420780
T3 - AAAI 2020 - 34th AAAI Conference on Artificial Intelligence
SP - 11077
EP - 11084
BT - AAAI 2020 - 34th AAAI Conference on Artificial Intelligence
PB - AAAI Press
T2 - 34th AAAI Conference on Artificial Intelligence, AAAI 2020
Y2 - 7 February 2020 through 12 February 2020
ER -