Bayesian Transformer Using Disentangled Mask Attention

Jen-Tzung Chien, Yu-Han Huang

Research output: Conference article, peer-reviewed

6 citations (Scopus)


The transformer conducts self-attention, which has achieved state-of-the-art performance in many applications. Multi-head attention in the transformer basically gathers the features of individual tokens in the input sequence to form the mapping to the output sequence. However, the transformer has two weaknesses. First, since the attention mechanism naturally mixes up the features of different tokens in the input and output sequences, the representation of input tokens is likely to contain redundant information. Second, the patterns of attention weights across different heads tend to be similar, so the model capacity is bounded. To strengthen sequential learning, this paper presents a variational disentangled mask attention in the transformer where the redundant features are enhanced with semantic information. Latent disentanglement in multi-head attention is learned. The attention weights are filtered by a mask which is optimized through semantic clustering. The proposed attention mechanism is then implemented according to Bayesian learning for clustered disentanglement. Experiments on machine translation show the merit of the disentangled mask attention.
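The paper's mask is learned through semantic clustering under a Bayesian objective; the minimal numpy sketch below only illustrates the generic mechanism the abstract describes — filtering multi-head attention weights with a per-head mask and renormalizing — and is not the authors' implementation. All names, shapes, and the fixed random mask are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_multi_head_attention(x, Wq, Wk, Wv, mask, n_heads):
    # x: (T, d) token features; mask: (n_heads, T, T) values in [0, 1].
    # Here the mask is fixed for illustration; in the paper it is
    # optimized by semantic clustering via Bayesian learning.
    T, d = x.shape
    dh = d // n_heads
    split = lambda W: (x @ W).reshape(T, n_heads, dh).transpose(1, 0, 2)
    q, k, v = split(Wq), split(Wk), split(Wv)             # (H, T, dh) each
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)       # (H, T, T)
    attn = softmax(scores, axis=-1)
    attn = attn * mask                                    # filter the weights
    attn = attn / (attn.sum(-1, keepdims=True) + 1e-9)    # renormalize rows
    out = (attn @ v).transpose(1, 0, 2).reshape(T, d)     # merge heads
    return out, attn

T, d, H = 5, 8, 2
x = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
# Illustrative mask: each head keeps a different subset of token links;
# the diagonal is always kept so every row stays nonempty.
mask = (rng.random((H, T, T)) > 0.3).astype(float)
mask = np.maximum(mask, np.eye(T))
out, attn = masked_multi_head_attention(x, Wq, Wk, Wv, mask, H)
```

Because each head's mask removes a different set of token-to-token links, the heads are pushed toward dissimilar attention patterns, which is the capacity issue the abstract targets.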

Pages (from-to): 1761-1765
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publication status: Published - 2022
Event: 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022 - Incheon, Korea, Republic of
Duration: 18 Sep 2022 - 22 Sep 2022
