TY - GEN
T1 - Incorporating Domain Knowledge into Language Transformers for Multi-Label Classification of Chinese Medical Questions
AU - Chen, Po-Han
AU - Zeng, Yu-Xiang
AU - Lee, Lung-Hao
N1 - Publisher Copyright:
© 2021 ROCLING 2021 - Proceedings of the 33rd Conference on Computational Linguistics and Speech Processing. All rights reserved.
PY - 2021
Y1 - 2021
N2 - In this paper, we propose a knowledge infusion mechanism to incorporate domain knowledge into language transformers. Weakly supervised data is regarded as the main source for knowledge acquisition. We pre-train the language models to capture masked knowledge of focuses and aspects, and then fine-tune them to obtain better performance on the downstream tasks. Due to the lack of publicly available datasets for multi-label classification of Chinese medical questions, we crawled questions from medical question/answer forums and manually annotated them using eight predefined classes: persons and organizations, symptom, cause, examination, disease, information, ingredient, and treatment. In total, we collected 1,814 questions with 2,340 labels, an average of 1.29 labels per question. We used Baidu Medical Encyclopedia as the knowledge resource. Two transformers, BERT and RoBERTa, were implemented to compare performance on our constructed dataset. Experimental results showed that our proposed model with the knowledge infusion mechanism achieved better performance across all evaluation metrics: Macro F1, Micro F1, Weighted F1, and Subset Accuracy.
AB - In this paper, we propose a knowledge infusion mechanism to incorporate domain knowledge into language transformers. Weakly supervised data is regarded as the main source for knowledge acquisition. We pre-train the language models to capture masked knowledge of focuses and aspects, and then fine-tune them to obtain better performance on the downstream tasks. Due to the lack of publicly available datasets for multi-label classification of Chinese medical questions, we crawled questions from medical question/answer forums and manually annotated them using eight predefined classes: persons and organizations, symptom, cause, examination, disease, information, ingredient, and treatment. In total, we collected 1,814 questions with 2,340 labels, an average of 1.29 labels per question. We used Baidu Medical Encyclopedia as the knowledge resource. Two transformers, BERT and RoBERTa, were implemented to compare performance on our constructed dataset. Experimental results showed that our proposed model with the knowledge infusion mechanism achieved better performance across all evaluation metrics: Macro F1, Micro F1, Weighted F1, and Subset Accuracy.
KW - Biomedical informatics
KW - Domain knowledge extraction
KW - Pretrained language models
KW - Text classification
UR - http://www.scopus.com/inward/record.url?scp=85127417457&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85127417457
T3 - ROCLING 2021 - Proceedings of the 33rd Conference on Computational Linguistics and Speech Processing
SP - 265
EP - 270
BT - ROCLING 2021 - Proceedings of the 33rd Conference on Computational Linguistics and Speech Processing
A2 - Lee, Lung-Hao
A2 - Chang, Chia-Hui
A2 - Chen, Kuan-Yu
PB - The Association for Computational Linguistics and Chinese Language Processing (ACLCLP)
T2 - 33rd Conference on Computational Linguistics and Speech Processing, ROCLING 2021
Y2 - 15 October 2021 through 16 October 2021
ER -