TY - GEN
T1 - Group-and-Conquer for Multi-Speaker Single-Channel Speech Separation
AU - Yen, Ya Fan
AU - Shuai, Hong Han
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - We propose to reduce the difficulty of multi-speaker single-channel speech separation by decomposing the separation process in a group-and-conquer manner. Specifically, in the first stage, we propose a prediction model that estimates the optimal number of groups from the input mixed signal. In addition, we train a group separation model to separate the mixed signal into multiple groups according to the predicted number of groups. By jointly training a vocal network with a triplet cosine loss and a group separation network, the proposed group separation model better learns the latent features of each group. As such, given the predicted number of groups, the group separation model can automatically separate the input audio signal into several groups. In the second stage, for groups containing more than one speaker, a separation model focuses on fine-grained information to better separate the speech within each group. Experimental results show that our approach outperforms state-of-the-art models by at least 8.68% in SI-SNRi.
AB - We propose to reduce the difficulty of multi-speaker single-channel speech separation by decomposing the separation process in a group-and-conquer manner. Specifically, in the first stage, we propose a prediction model that estimates the optimal number of groups from the input mixed signal. In addition, we train a group separation model to separate the mixed signal into multiple groups according to the predicted number of groups. By jointly training a vocal network with a triplet cosine loss and a group separation network, the proposed group separation model better learns the latent features of each group. As such, given the predicted number of groups, the group separation model can automatically separate the input audio signal into several groups. In the second stage, for groups containing more than one speaker, a separation model focuses on fine-grained information to better separate the speech within each group. Experimental results show that our approach outperforms state-of-the-art models by at least 8.68% in SI-SNRi.
KW - single channel
KW - speech separation
KW - voice embedding
UR - http://www.scopus.com/inward/record.url?scp=85215682780&partnerID=8YFLogxK
U2 - 10.1109/WOCC61718.2024.10786031
DO - 10.1109/WOCC61718.2024.10786031
M3 - Conference contribution
AN - SCOPUS:85215682780
T3 - 2024 33rd Wireless and Optical Communications Conference, WOCC 2024
SP - 165
EP - 169
BT - 2024 33rd Wireless and Optical Communications Conference, WOCC 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 33rd Wireless and Optical Communications Conference, WOCC 2024
Y2 - 25 October 2024 through 26 October 2024
ER -