TY - JOUR
T1 - MOSA: Music Motion With Semantic Annotation Dataset for Cross-Modal Music Processing
AU - Huang, Yu Fen
AU - Moran, Nikki
AU - Coleman, Simon
AU - Kelly, Jon
AU - Wei, Shun Hwa
AU - Chen, Po Yin
AU - Huang, Yun Hsin
AU - Chen, Tsung Ping
AU - Kuo, Yu Chia
AU - Wei, Yu Chi
AU - Li, Chih Hsuan
AU - Huang, Da Yu
AU - Kao, Hsuan Kai
AU - Lin, Ting Wei
AU - Su, Li
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - In cross-modal music processing, translation between visual, auditory, and semantic content opens up new possibilities as well as challenges. The construction of such a transformative scheme depends upon a benchmark corpus with a comprehensive data infrastructure. In particular, the assembly of a large-scale cross-modal dataset presents major challenges. In this paper, we present the MOSA (Music mOtion with Semantic Annotation) dataset, which contains high-quality 3-D motion capture data, aligned audio recordings, and note-by-note semantic annotations of pitch, beat, phrase, dynamics, articulation, and harmony for 742 professional music performances by 23 professional musicians, comprising more than 30 hours and 570 K notes of data. To our knowledge, this is the largest cross-modal music dataset with note-level annotations to date. To demonstrate the use of the MOSA dataset, we present several innovative cross-modal music information retrieval (MIR) and musical content generation tasks, including the detection of beats, downbeats, phrases, and expressive content from audio, video, and motion data, and the generation of musicians’ body motion from given music audio.
AB - In cross-modal music processing, translation between visual, auditory, and semantic content opens up new possibilities as well as challenges. The construction of such a transformative scheme depends upon a benchmark corpus with a comprehensive data infrastructure. In particular, the assembly of a large-scale cross-modal dataset presents major challenges. In this paper, we present the MOSA (Music mOtion with Semantic Annotation) dataset, which contains high-quality 3-D motion capture data, aligned audio recordings, and note-by-note semantic annotations of pitch, beat, phrase, dynamics, articulation, and harmony for 742 professional music performances by 23 professional musicians, comprising more than 30 hours and 570 K notes of data. To our knowledge, this is the largest cross-modal music dataset with note-level annotations to date. To demonstrate the use of the MOSA dataset, we present several innovative cross-modal music information retrieval (MIR) and musical content generation tasks, including the detection of beats, downbeats, phrases, and expressive content from audio, video, and motion data, and the generation of musicians’ body motion from given music audio.
KW - Music information retrieval
KW - artificial intelligence
KW - cross-modal
KW - motion capture
KW - music semantics
UR - http://www.scopus.com/inward/record.url?scp=85194846512&partnerID=8YFLogxK
U2 - 10.1109/TASLP.2024.3407529
DO - 10.1109/TASLP.2024.3407529
M3 - Article
AN - SCOPUS:85194846512
SN - 2329-9290
VL - 32
SP - 4157
EP - 4170
JO - IEEE/ACM Transactions on Audio, Speech, and Language Processing
JF - IEEE/ACM Transactions on Audio, Speech, and Language Processing
ER -