TY - GEN
T1 - An Attention-based Neural Network on Multiple Speaker Diarization
AU - Cheng, Shao Wen
AU - Hung, Kai Jyun
AU - Chang, Hsie Chia
AU - Liao, Yen Chin
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - Speaker diarization is the task of labeling audio or video recordings with classes corresponding to speaker identity at each point in time, which is useful in multi-speaker conversational settings such as meetings or interviews. Moreover, speaker diarization can improve the performance of automatic speech recognition. This paper presents an end-to-end diarization model based on an attention mechanism, combined with data augmentation and several data pre-processing and post-processing steps. On the CALLHOME data set, the model achieves a 9.12% diarization error rate in the two-speaker case. We combine the speaker diarization model with an automatic speech recognition model and implement a transcript conversion system on an edge device: the proposed speaker diarization serves as a preprocessing step that segments the recording by speaker, and the ASR model then transcribes each segmented utterance to complete the transcript conversion on the edge device. Experiments show that our model also performs well in the two-speaker scenario on edge devices, in terms of both accuracy and inference time.
AB - Speaker diarization is the task of labeling audio or video recordings with classes corresponding to speaker identity at each point in time, which is useful in multi-speaker conversational settings such as meetings or interviews. Moreover, speaker diarization can improve the performance of automatic speech recognition. This paper presents an end-to-end diarization model based on an attention mechanism, combined with data augmentation and several data pre-processing and post-processing steps. On the CALLHOME data set, the model achieves a 9.12% diarization error rate in the two-speaker case. We combine the speaker diarization model with an automatic speech recognition model and implement a transcript conversion system on an edge device: the proposed speaker diarization serves as a preprocessing step that segments the recording by speaker, and the ASR model then transcribes each segmented utterance to complete the transcript conversion on the edge device. Experiments show that our model also performs well in the two-speaker scenario on edge devices, in terms of both accuracy and inference time.
KW - Attention Mechanism
KW - End-to-end Diarization Model
KW - Speaker Diarization
KW - Transcript Conversion
UR - http://www.scopus.com/inward/record.url?scp=85138996309&partnerID=8YFLogxK
U2 - 10.1109/AICAS54282.2022.9870007
DO - 10.1109/AICAS54282.2022.9870007
M3 - Conference contribution
AN - SCOPUS:85138996309
T3 - Proceeding - IEEE International Conference on Artificial Intelligence Circuits and Systems, AICAS 2022
SP - 431
EP - 434
BT - Proceeding - IEEE International Conference on Artificial Intelligence Circuits and Systems, AICAS 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 4th IEEE International Conference on Artificial Intelligence Circuits and Systems, AICAS 2022
Y2 - 13 June 2022 through 15 June 2022
ER -