TY - GEN
T1 - Attention-based Video Virtual Try-On
AU - Tsai, Wen Jiin
AU - Tien, Yi Cheng
N1 - Publisher Copyright:
© 2023 ACM.
PY - 2023/6/12
Y1 - 2023/6/12
N2 - This paper presents a parsing-free video virtual try-on model based on appearance flow warping. In this model, we adopt attention mechanisms from the Transformer [15] and propose three attention-based modules: a Person-Cloth Transformer, a Self-Attention Generator, and a Cloth Refinement Transformer. The Person-Cloth Transformer enables clothing features to attend to person information, which benefits style-vector calculation and also improves the style warping process to estimate better appearance flows. The Self-Attention Generator applies a self-attention mechanism at the deepest feature layer, allowing each position in the feature map to draw on global context from all other pixels and helping the generator synthesize more realistic results. The Cloth Refinement Transformer uses two cross-attention modules: one lets the currently warped clothes attend to previously warped clothes to ensure temporal consistency, and the other lets the currently warped clothes attend to person information to ensure spatial alignment. An ablation study shows that each proposed module contributes to the improvement of the results. Experimental results show that our model generates realistic, high-quality try-on videos and outperforms existing methods.
AB - This paper presents a parsing-free video virtual try-on model based on appearance flow warping. In this model, we adopt attention mechanisms from the Transformer [15] and propose three attention-based modules: a Person-Cloth Transformer, a Self-Attention Generator, and a Cloth Refinement Transformer. The Person-Cloth Transformer enables clothing features to attend to person information, which benefits style-vector calculation and also improves the style warping process to estimate better appearance flows. The Self-Attention Generator applies a self-attention mechanism at the deepest feature layer, allowing each position in the feature map to draw on global context from all other pixels and helping the generator synthesize more realistic results. The Cloth Refinement Transformer uses two cross-attention modules: one lets the currently warped clothes attend to previously warped clothes to ensure temporal consistency, and the other lets the currently warped clothes attend to person information to ensure spatial alignment. An ablation study shows that each proposed module contributes to the improvement of the results. Experimental results show that our model generates realistic, high-quality try-on videos and outperforms existing methods.
KW - attention
KW - parsing-free
KW - Virtual try-on
UR - http://www.scopus.com/inward/record.url?scp=85163681200&partnerID=8YFLogxK
U2 - 10.1145/3591106.3592252
DO - 10.1145/3591106.3592252
M3 - Conference contribution
AN - SCOPUS:85163681200
T3 - ICMR 2023 - Proceedings of the 2023 ACM International Conference on Multimedia Retrieval
SP - 209
EP - 216
BT - ICMR 2023 - Proceedings of the 2023 ACM International Conference on Multimedia Retrieval
PB - Association for Computing Machinery, Inc
T2 - 2023 ACM International Conference on Multimedia Retrieval, ICMR 2023
Y2 - 12 June 2023 through 15 June 2023
ER -