TY - JOUR
T1 - Strategies for end-to-end text-independent speaker verification
AU - Lin, Weiwei
AU - Mak, Man-Wai
AU - Chien, Jen-Tzung
N1 - Publisher Copyright:
Copyright © 2020 ISCA
PY - 2020
Y1 - 2020
N2 - State-of-the-art speaker verification (SV) systems typically consist of two distinct components: a deep neural network (DNN) for creating speaker embeddings and a backend for improving the embeddings' discriminative ability. This raises the question: can we train an SV system without a backend? We believe that the backend compensates for the network being trained entirely on short speech segments. This paper shows that, with several modifications to the x-vector system, DNN embeddings can be used directly for verification. The proposed modifications include: (1) a mask-pooling layer that augments the training samples by randomly masking the frame-level activations and then computing temporal statistics, (2) a sampling scheme that produces diverse training samples by randomly splicing several speech segments from each utterance, and (3) additional convolutional layers designed to reduce the temporal resolution and thus save computational cost. Experiments on NIST SRE 2016 and 2018 show that our method achieves state-of-the-art performance with simple cosine similarity while requiring only half the computational cost of the x-vector network.
AB - State-of-the-art speaker verification (SV) systems typically consist of two distinct components: a deep neural network (DNN) for creating speaker embeddings and a backend for improving the embeddings' discriminative ability. This raises the question: can we train an SV system without a backend? We believe that the backend compensates for the network being trained entirely on short speech segments. This paper shows that, with several modifications to the x-vector system, DNN embeddings can be used directly for verification. The proposed modifications include: (1) a mask-pooling layer that augments the training samples by randomly masking the frame-level activations and then computing temporal statistics, (2) a sampling scheme that produces diverse training samples by randomly splicing several speech segments from each utterance, and (3) additional convolutional layers designed to reduce the temporal resolution and thus save computational cost. Experiments on NIST SRE 2016 and 2018 show that our method achieves state-of-the-art performance with simple cosine similarity while requiring only half the computational cost of the x-vector network.
KW - Deep neural network
KW - End-to-end speaker embedding
KW - Speaker verification
KW - X-vector
UR - http://www.scopus.com/inward/record.url?scp=85098160006&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2020-2092
DO - 10.21437/Interspeech.2020-2092
M3 - Conference article
AN - SCOPUS:85098160006
SN - 2308-457X
VL - 2020-October
SP - 4308
EP - 4312
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020
Y2 - 25 October 2020 through 29 October 2020
ER -