Transformer-based spatial-temporal feature lifting for 3D hand mesh reconstruction

Meng Xue Lin, Wen Jiin Tsai

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

This paper presents a novel model for reconstructing hand meshes from video sequences. The model extends the MobRecon [1] pipeline with a variant of the Transformer architecture that models both spatial and temporal relationships using distinct positional encodings. The Transformer encoder enhances the feature representation by modeling relationships among joints and learning hidden depth information. Leveraging temporal information from consecutive frames, the Transformer decoder further refines the features used by the mesh decoder for its final prediction. Additionally, we incorporate techniques such as Twice-LN, confidence-based attention, scaling in place of Softmax, and learnable encodings to improve the feature representation. Experimental results demonstrate that the proposed method outperforms existing approaches.
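As a rough illustration of the spatial-temporal lifting idea the abstract describes, the sketch below runs attention first across joints within a frame (with a spatial positional encoding) and then across frames for each joint (with a temporal positional encoding). This is a hedged approximation, not the paper's implementation: the shapes, function names, and use of standard scaled-dot-product softmax attention are assumptions, and the paper's Twice-LN, confidence-based attention, and Softmax-replacing scaling are not reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # standard scaled dot-product attention (the paper replaces
    # Softmax with a scaling operation, which is omitted here)
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ v

def spatial_temporal_lift(feats, spatial_pe, temporal_pe):
    """feats: (T, J, D) per-frame, per-joint features.

    Spatial stage: joints attend to joints within each frame,
    using a spatial positional encoding. Temporal stage: each
    joint attends across frames, using a distinct temporal
    positional encoding. All names/shapes are illustrative."""
    x = feats + spatial_pe[None, :, :]                   # add spatial encoding
    x = attention(x, x, x)                               # (T, J, D): mix joints
    x = x.transpose(1, 0, 2) + temporal_pe[None, :, :]   # (J, T, D) + temporal encoding
    x = attention(x, x, x)                               # (J, T, D): mix frames
    return x.transpose(1, 0, 2)                          # back to (T, J, D)

rng = np.random.default_rng(0)
T, J, D = 4, 21, 8                                       # 4 frames, 21 hand joints
feats = rng.standard_normal((T, J, D))
out = spatial_temporal_lift(feats,
                            0.02 * rng.standard_normal((J, D)),
                            0.02 * rng.standard_normal((T, D)))
print(out.shape)  # (4, 21, 8)
```

In a trained model the two positional encodings would be learnable parameters, which is what allows the spatial and temporal axes to be treated with distinct encodings as the abstract describes.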

Original language: English
Title of host publication: 2023 IEEE International Conference on Visual Communications and Image Processing, VCIP 2023
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798350359855
DOIs
State: Published - 2023
Event: 2023 IEEE International Conference on Visual Communications and Image Processing, VCIP 2023 - Jeju, Korea, Republic of
Duration: 4 Dec 2023 - 7 Dec 2023

Publication series

Name: 2023 IEEE International Conference on Visual Communications and Image Processing, VCIP 2023

Conference

Conference: 2023 IEEE International Conference on Visual Communications and Image Processing, VCIP 2023
Country/Territory: Korea, Republic of
City: Jeju
Period: 4/12/23 - 7/12/23

Keywords

  • attention mechanism
  • deep learning
  • hand mesh
  • hand pose
  • machine learning
  • transformer
