TY - GEN
T1 - A Hybrid Convolutional and Transformer Network for Salient Object Detection
AU - Li, Bei Sin
AU - Hsiao, Hsu Feng
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - We present a novel hybrid architecture that seamlessly merges transformers and convolutional neural networks to enhance the performance of RGB-D salient object detection. Transformer-based models have recently demonstrated their potential in this field, owing to their ability to encode long-range information via the self-attention mechanism. This mechanism mirrors human visual perception by capturing long-distance dependencies and selectively focusing on the most relevant regions of the input image. In contrast, convolutional neural networks, with their robust generalization and trainability, have proven invaluable for a wide array of image processing tasks. By fusing the strengths of these two models, our proposed hybrid architecture outperforms either transformers or convolutional neural networks used in isolation. Our architecture employs an encoder-decoder framework: the hybrid model serves as the feature encoder, while the decoder combines a convolutional neural network with deep layer aggregation to merge features of varying resolutions produced by the transformer-based encoder. This design exploits the computational modeling strengths of convolutional neural networks in tasks such as saliency prediction, while also benefiting from the long-range dependency modeling offered by the hybrid model. We also use a Siamese architecture with shared parameters in the encoder to concurrently learn salient features from RGB and depth data. By harnessing the complementary strengths of both models, the proposed hybrid architecture demonstrates superior performance.
AB - We present a novel hybrid architecture that seamlessly merges transformers and convolutional neural networks to enhance the performance of RGB-D salient object detection. Transformer-based models have recently demonstrated their potential in this field, owing to their ability to encode long-range information via the self-attention mechanism. This mechanism mirrors human visual perception by capturing long-distance dependencies and selectively focusing on the most relevant regions of the input image. In contrast, convolutional neural networks, with their robust generalization and trainability, have proven invaluable for a wide array of image processing tasks. By fusing the strengths of these two models, our proposed hybrid architecture outperforms either transformers or convolutional neural networks used in isolation. Our architecture employs an encoder-decoder framework: the hybrid model serves as the feature encoder, while the decoder combines a convolutional neural network with deep layer aggregation to merge features of varying resolutions produced by the transformer-based encoder. This design exploits the computational modeling strengths of convolutional neural networks in tasks such as saliency prediction, while also benefiting from the long-range dependency modeling offered by the hybrid model. We also use a Siamese architecture with shared parameters in the encoder to concurrently learn salient features from RGB and depth data. By harnessing the complementary strengths of both models, the proposed hybrid architecture demonstrates superior performance.
KW - RGB-D salient object detection
KW - Siamese network
KW - transformers
UR - http://www.scopus.com/inward/record.url?scp=85184850814&partnerID=8YFLogxK
U2 - 10.1109/VCIP59821.2023.10402625
DO - 10.1109/VCIP59821.2023.10402625
M3 - Conference contribution
AN - SCOPUS:85184850814
T3 - 2023 IEEE International Conference on Visual Communications and Image Processing, VCIP 2023
BT - 2023 IEEE International Conference on Visual Communications and Image Processing, VCIP 2023
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2023 IEEE International Conference on Visual Communications and Image Processing, VCIP 2023
Y2 - 4 December 2023 through 7 December 2023
ER -