TY - GEN
T1 - Temporal-aware self-supervised learning for 3D hand pose and mesh estimation in videos
AU - Chen, Liangjian
AU - Lin, Shih Yao
AU - Xie, Yusheng
AU - Lin, Yen-Yu
AU - Xie, Xiaohui
N1 - Funding Information:
This work was supported in part by the Ministry of Science and Technology (MOST) under grants MOST 107-2628-E-009-007-MY3, MOST 109-2634-F-007-013, and MOST 109-2221-E-009-113-MY3, and by Qualcomm through a Taiwan University Research Collaboration Project.
Publisher Copyright:
© 2021 IEEE.
PY - 2021/1
Y1 - 2021/1
N2 - Estimating 3D hand pose directly from RGB images is challenging but has made steady progress recently by training deep models with annotated 3D poses. However, annotating 3D poses is difficult, and as a result only a few 3D hand pose datasets are available, all with limited sample sizes. In this study, we propose a new framework for training 3D pose estimation models from RGB images without explicit 3D annotations, i.e., using only 2D information. Our framework is motivated by two observations: 1) videos provide richer information for estimating 3D poses than static images; 2) estimated 3D poses ought to be consistent whether the video is viewed in forward or reverse order. We leverage these two observations to develop a self-supervised learning model called the temporal-aware self-supervised network (TASSN). By enforcing temporal consistency constraints, TASSN learns 3D hand poses and meshes from videos with only 2D keypoint position annotations. Experiments show that our model achieves surprisingly good results, with 3D estimation accuracy on par with state-of-the-art models trained with 3D annotations, highlighting the benefit of temporal consistency in constraining 3D prediction models.
AB - Estimating 3D hand pose directly from RGB images is challenging but has made steady progress recently by training deep models with annotated 3D poses. However, annotating 3D poses is difficult, and as a result only a few 3D hand pose datasets are available, all with limited sample sizes. In this study, we propose a new framework for training 3D pose estimation models from RGB images without explicit 3D annotations, i.e., using only 2D information. Our framework is motivated by two observations: 1) videos provide richer information for estimating 3D poses than static images; 2) estimated 3D poses ought to be consistent whether the video is viewed in forward or reverse order. We leverage these two observations to develop a self-supervised learning model called the temporal-aware self-supervised network (TASSN). By enforcing temporal consistency constraints, TASSN learns 3D hand poses and meshes from videos with only 2D keypoint position annotations. Experiments show that our model achieves surprisingly good results, with 3D estimation accuracy on par with state-of-the-art models trained with 3D annotations, highlighting the benefit of temporal consistency in constraining 3D prediction models.
UR - http://www.scopus.com/inward/record.url?scp=85116168016&partnerID=8YFLogxK
U2 - 10.1109/WACV48630.2021.00109
DO - 10.1109/WACV48630.2021.00109
M3 - Conference contribution
AN - SCOPUS:85116168016
T3 - Proceedings - 2021 IEEE Winter Conference on Applications of Computer Vision, WACV 2021
SP - 1049
EP - 1058
BT - Proceedings - 2021 IEEE Winter Conference on Applications of Computer Vision, WACV 2021
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2021 IEEE Winter Conference on Applications of Computer Vision, WACV 2021
Y2 - 5 January 2021 through 9 January 2021
ER -