TY - JOUR
T1 - Decoupling-Cooperative Framework for Referring Expression Comprehension
AU - Song, Yun Zhu
AU - Chen, Yi Syuan
AU - Shuai, Hong Han
N1 - Publisher Copyright: © 1994-2012 IEEE.
PY - 2023
Y1 - 2023
N2 - Referring Expression Comprehension (REC) aims to locate a specific object within an image by interpreting a referring expression articulated in natural language. This task comprises two essential branches: understanding and localizing. The former entails processing cognitive information from multimodal data, while the latter involves realizing the predictions in the perceptive visual space. Although various advanced approaches have been developed for each of these branches separately, existing REC approaches are unable to effectively leverage them due to the specific designs of architectures or objectives for REC, which bind understanding and localizing inseparably. To overcome this challenge, we propose the Decoupling-Cooperative Framework (DCF). The decoupling scheme in DCF enables us to utilize up-to-date methods for understanding and localizing with minimal constraints. Meanwhile, the proposed cooperative modules enable better integration of the strengths from both branches to achieve further enhancements. Extensive experiments demonstrate that DCF achieves state-of-the-art performance across four benchmarks, thus highlighting the generalizability of DCF.
AB - Referring Expression Comprehension (REC) aims to locate a specific object within an image by interpreting a referring expression articulated in natural language. This task comprises two essential branches: understanding and localizing. The former entails processing cognitive information from multimodal data, while the latter involves realizing the predictions in the perceptive visual space. Although various advanced approaches have been developed for each of these branches separately, existing REC approaches are unable to effectively leverage them due to the specific designs of architectures or objectives for REC, which bind understanding and localizing inseparably. To overcome this challenge, we propose the Decoupling-Cooperative Framework (DCF). The decoupling scheme in DCF enables us to utilize up-to-date methods for understanding and localizing with minimal constraints. Meanwhile, the proposed cooperative modules enable better integration of the strengths from both branches to achieve further enhancements. Extensive experiments demonstrate that DCF achieves state-of-the-art performance across four benchmarks, thus highlighting the generalizability of DCF.
KW - Multimodal understanding
KW - object detection and localization
KW - referring expression comprehension (REC)
UR - http://www.scopus.com/inward/record.url?scp=85176295363&partnerID=8YFLogxK
U2 - 10.1109/LSP.2023.3327651
DO - 10.1109/LSP.2023.3327651
M3 - Article
AN - SCOPUS:85176295363
SN - 1070-9908
VL - 30
SP - 1542
EP - 1546
JO - IEEE Signal Processing Letters
JF - IEEE Signal Processing Letters
ER -