Singing voice separation using spectro-temporal modulation features

Frederick Yen, Yin Jyun Luo, Tai-Shih Chi

Research output: Contribution to conference › Paper › peer-review


Abstract

An auditory-perception-inspired singing voice separation algorithm for monaural music recordings is proposed in this paper. Under the framework of computational auditory scene analysis (CASA), the music recordings are first transformed into auditory spectrograms. After extracting the spectro-temporal modulation content of the time-frequency (T-F) units through a two-stage auditory model, we define modulation features pertaining to three categories of music audio signals: vocal, harmonic, and percussive. The T-F units are then clustered into these three categories, and the singing voice is synthesized from T-F units in the vocal category via time-frequency masking. The algorithm was tested on the MIR-1K dataset and produced results comparable to those of other unsupervised masking approaches. Moreover, the set of novel features offers a possible explanation of how the auditory cortex analyzes and identifies the singing voice in music audio mixtures.
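The pipeline described in the abstract (auditory spectrogram, two-stage modulation analysis, three-way clustering of T-F units, and mask-based resynthesis) can be sketched in a few lines. The sketch below is an illustrative approximation only: it substitutes an ordinary STFT for the auditory spectrogram, a 2-D FFT over local spectrogram patches for the cortical rate-scale model, and k-means with a heuristic cluster pick for the paper's explicit vocal/harmonic/percussive feature definitions. All function names and parameters here are assumptions, not the authors' implementation.

```python
# Minimal sketch of a modulation-feature T-F masking pipeline (assumptions noted below).
import numpy as np
from scipy.signal import stft, istft
from sklearn.cluster import KMeans

def modulation_features(S, patch=(16, 16)):
    """Rate-scale-like features: 2-D FFT magnitude over local spectrogram patches.

    This stands in for the paper's two-stage auditory (cortical) model.
    """
    F, T = S.shape
    pf, pt = patch
    feats = np.zeros((F, T, pf * pt // 2))
    mag = np.abs(S)
    padded = np.pad(mag, ((pf // 2, pf // 2), (pt // 2, pt // 2)), mode="reflect")
    for f in range(F):
        for t in range(T):
            block = padded[f:f + pf, t:t + pt]
            # Spectro-temporal modulation magnitude of the local patch.
            spec2d = np.abs(np.fft.rfft2(block))
            feats[f, t] = spec2d.ravel()[:pf * pt // 2]
    return feats.reshape(F * T, -1)

def separate_vocals(x, sr, n_fft=1024):
    """Cluster T-F units into three categories and resynthesize one as vocals."""
    f, t, S = stft(x, fs=sr, nperseg=n_fft)
    feats = modulation_features(S)
    # Cluster T-F units into three categories (vocal / harmonic / percussive).
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(feats)
    labels = labels.reshape(S.shape)
    # Assumption: pick the highest-variance cluster as "vocal"; the paper
    # instead defines explicit modulation criteria for each category.
    vocal = max(range(3), key=lambda k: np.std(np.abs(S)[labels == k]))
    mask = (labels == vocal).astype(float)
    _, y = istft(S * mask, fs=sr, nperseg=n_fft)
    return y
```

Calling separate_vocals on a mono waveform returns a vocal estimate via a binary T-F mask; the highest-variance cluster pick is a placeholder for the modulation-based category definitions described in the paper.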

Original language: English
Pages: 617-622
Number of pages: 6
State: Published - 1 Jan 2014
Event: 15th International Society for Music Information Retrieval Conference, ISMIR 2014 - Taipei, Taiwan
Duration: 27 Oct 2014 - 31 Oct 2014

Conference

Conference: 15th International Society for Music Information Retrieval Conference, ISMIR 2014
Country/Territory: Taiwan
City: Taipei
Period: 27/10/14 - 31/10/14
