SVSNet+: Enhancing Speaker Voice Similarity Assessment Models with Representations from Speech Foundation Models

Chun Yin, Tai-Shih Chi, Yu Tsao, Hsin-Min Wang

Research output: Contribution to journal › Conference article › peer-review

Abstract

Representations from pre-trained speech foundation models (SFMs) have shown impressive performance in many downstream tasks. However, the potential benefits of incorporating pre-trained SFM representations into speaker voice similarity assessment have not been thoroughly investigated. In this paper, we propose SVSNet+, a model that integrates pre-trained SFM representations to improve performance in assessing speaker voice similarity. Experimental results on the Voice Conversion Challenge 2018 and 2020 datasets show that SVSNet+ incorporating WavLM representations achieves significant improvements over baseline models. In addition, while fine-tuning WavLM with a small dataset of the downstream task does not improve performance, using the same dataset to learn a weighted-sum representation of WavLM can substantially improve performance. Furthermore, when WavLM is replaced by other SFMs, SVSNet+ still outperforms the baseline models and exhibits strong generalization ability.
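The "weighted-sum representation" mentioned in the abstract is commonly realized as a learnable softmax-weighted combination of an SFM's hidden-layer outputs, trained jointly with the downstream model. The sketch below is illustrative only, not the authors' implementation; the layer count, sequence length, and feature dimension are assumptions (13 hidden layers and 768-dimensional features roughly match a base-size WavLM), and NumPy stands in for an actual training framework.

```python
import numpy as np

# Assumed shapes: L hidden layers from an SFM (e.g., WavLM base),
# each producing a (T, D) representation for one utterance.
L, T, D = 13, 50, 768
rng = np.random.default_rng(0)
layer_reps = rng.standard_normal((L, T, D))  # stacked layer outputs

# Per-layer logits would be learnable parameters in practice;
# zeros here give a uniform average as the initial state.
logits = np.zeros(L)
weights = np.exp(logits) / np.exp(logits).sum()  # softmax over layers

# Weighted sum across the layer axis -> a single (T, D) representation
# that the similarity assessment model would consume.
fused = np.tensordot(weights, layer_reps, axes=(0, 0))
print(fused.shape)  # (50, 768)
```

With zero-initialized logits the fused output equals the plain mean over layers; training then shifts the weights toward the layers most useful for the similarity task.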

Original language: English
Pages (from-to): 1195-1199
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
State: Published - 2024
Event: 25th Interspeech Conference 2024 - Kos Island, Greece
Duration: 1 Sep 2024 - 5 Sep 2024

Keywords

  • pre-trained speech foundation model
  • speaker voice similarity assessment
