TY - GEN
T1 - Enhanced Pedestrian Trajectory Prediction via the Cross-Modal Feature Fusion Transformer
AU - Ali, Rashid
AU - Hsiao, Hsu-Feng
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - We address the challenge of predicting pedestrian trajectories in videos, a task inherently complex due to the diverse and intricate nature of human motion and interactions within their environment. The accurate anticipation of trajectories necessitates a holistic comprehension of the temporal evolution of past events in videos. Regrettably, existing methods often neglect the fusion of critical features, such as human behavior, motion, and interaction, thereby limiting their efficacy in tackling these challenges. To overcome these limitations, we propose the Cross-modal Feature Fusion Transformer, a novel approach for pedestrian trajectory prediction. Our model seamlessly integrates multimodal features, including human behavior, position, speed, and interaction with surroundings, to effectively encapsulate the temporal progression of observed frames. It consists of transformer-based cross-modal fusion encoder and decoder modules, adeptly melding the interactions between the multimodal features through a multi-head co-attentional mechanism. This enables the precise prediction of future trajectories. Additionally, we incorporate auxiliary self-supervised future prediction losses to learn the temporal evolution of past and future multimodal features. We evaluate our approach on the ETH/UCY and ActEV/VIRAT datasets and demonstrate its superior performance compared to state-of-the-art methods.
AB - We address the challenge of predicting pedestrian trajectories in videos, a task inherently complex due to the diverse and intricate nature of human motion and interactions within their environment. The accurate anticipation of trajectories necessitates a holistic comprehension of the temporal evolution of past events in videos. Regrettably, existing methods often neglect the fusion of critical features, such as human behavior, motion, and interaction, thereby limiting their efficacy in tackling these challenges. To overcome these limitations, we propose the Cross-modal Feature Fusion Transformer, a novel approach for pedestrian trajectory prediction. Our model seamlessly integrates multimodal features, including human behavior, position, speed, and interaction with surroundings, to effectively encapsulate the temporal progression of observed frames. It consists of transformer-based cross-modal fusion encoder and decoder modules, adeptly melding the interactions between the multimodal features through a multi-head co-attentional mechanism. This enables the precise prediction of future trajectories. Additionally, we incorporate auxiliary self-supervised future prediction losses to learn the temporal evolution of past and future multimodal features. We evaluate our approach on the ETH/UCY and ActEV/VIRAT datasets and demonstrate its superior performance compared to state-of-the-art methods.
KW - Cross-Attention
KW - Self-supervised learning
KW - Trajectory prediction
KW - Transformer
UR - http://www.scopus.com/inward/record.url?scp=85184855791&partnerID=8YFLogxK
U2 - 10.1109/VCIP59821.2023.10402669
DO - 10.1109/VCIP59821.2023.10402669
M3 - Conference contribution
AN - SCOPUS:85184855791
T3 - 2023 IEEE International Conference on Visual Communications and Image Processing, VCIP 2023
BT - 2023 IEEE International Conference on Visual Communications and Image Processing, VCIP 2023
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2023 IEEE International Conference on Visual Communications and Image Processing, VCIP 2023
Y2 - 4 December 2023 through 7 December 2023
ER -