TY - GEN
T1 - Self-Supervised Learning for Online Speaker Diarization
AU - Chien, Jen-Tzung
AU - Luo, Sixun
N1 - Publisher Copyright:
© 2021 APSIPA.
PY - 2021
Y1 - 2021
N2 - Speaker diarization addresses the question of 'who spoke when' by splitting an utterance into homogeneous segments, each attributed to an individual speaker. Traditional methods were implemented in an offline, supervised manner, which limits their usefulness in practical systems; real-time processing and self-supervised learning are required. This paper addresses speaker diarization by relaxing the need to read the whole utterance and to collect speaker labels. The online pipeline components, including feature extraction, voice activity detection, speech segmentation and speaker clustering, are implemented. Importantly, an efficient end-to-end speech feature extractor is trained by an unsupervised or self-supervised method and then combined with online clustering to carry out online speaker diarization. This feature extractor merges a bidirectional long short-term memory and a time-delay neural network to capture global and local features, respectively. Contrastive learning is introduced to improve the initial speaker clusters, and augmentation invariance is imposed to assure model robustness. Online clustering based on autoregressive and fast-match clustering is investigated. Experiments on speaker diarization over the NIST Speaker Recognition Evaluation show the merits of the proposed methods.
UR - http://www.scopus.com/inward/record.url?scp=85126671630&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85126671630
T3 - 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2021 - Proceedings
SP - 2036
EP - 2042
BT - 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2021 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2021
Y2 - 14 December 2021 through 17 December 2021
ER -