TY - JOUR
T1 - SVSNet+
T2 - 25th Interspeech Conference 2024
AU - Yin, Chun
AU - Chi, Tai Shih
AU - Tsao, Yu
AU - Wang, Hsin Min
N1 - Publisher Copyright:
© 2024 International Speech Communication Association. All rights reserved.
PY - 2024
Y1 - 2024
N2 - Representations from pre-trained speech foundation models (SFMs) have shown impressive performance in many downstream tasks. However, the potential benefits of incorporating pre-trained SFM representations into speaker voice similarity assessment have not been thoroughly investigated. In this paper, we propose SVSNet+, a model that integrates pre-trained SFM representations to improve performance in assessing speaker voice similarity. Experimental results on the Voice Conversion Challenge 2018 and 2020 datasets show that SVSNet+ incorporating WavLM representations shows significant improvements compared to baseline models. In addition, while fine-tuning WavLM with a small dataset of the downstream task does not improve performance, using the same dataset to learn a weighted-sum representation of WavLM can substantially improve performance. Furthermore, when WavLM is replaced by other SFMs, SVSNet+ still outperforms the baseline models and exhibits strong generalization ability.
AB - Representations from pre-trained speech foundation models (SFMs) have shown impressive performance in many downstream tasks. However, the potential benefits of incorporating pre-trained SFM representations into speaker voice similarity assessment have not been thoroughly investigated. In this paper, we propose SVSNet+, a model that integrates pre-trained SFM representations to improve performance in assessing speaker voice similarity. Experimental results on the Voice Conversion Challenge 2018 and 2020 datasets show that SVSNet+ incorporating WavLM representations shows significant improvements compared to baseline models. In addition, while fine-tuning WavLM with a small dataset of the downstream task does not improve performance, using the same dataset to learn a weighted-sum representation of WavLM can substantially improve performance. Furthermore, when WavLM is replaced by other SFMs, SVSNet+ still outperforms the baseline models and exhibits strong generalization ability.
KW - pre-trained speech foundation model
KW - speaker voice similarity assessment
UR - http://www.scopus.com/inward/record.url?scp=85214789355&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2024-1540
DO - 10.21437/Interspeech.2024-1540
M3 - Conference article
AN - SCOPUS:85214789355
SN - 2308-457X
SP - 1195
EP - 1199
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Y2 - 1 September 2024 through 5 September 2024
ER -