Multimedia content creation and manipulation have attracted great attention in recent years due to the popularity of mobile image and video capturing devices. In daily life, the most common subjects appearing in the captured content are people. Hence, creating images and videos by manipulating the appearance or motion of the human characters inside these media has become an important research issue. In this paper, we propose a novel idea that requires the fusion of video and audio intelligence. We develop a system whose inputs are a monocular video captured by a stationary RGBD sensor and a music clip. In the context of a human character performing a music-conducting motion, the output is a retargeted video in which the performer's motion is manipulated according to the emotional cues extracted from the music clip. To achieve this goal, our system is decomposed into three major stages. First, one needs access to the geometric and appearance information pertaining to meaningful and representative targets in the video sequence. Second, a systematic way is needed to reliably identify and classify important emotions from the music. Third, to complete the emotion transfer, one has to manipulate the video targets based on the extracted music emotions.
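To make the three-stage decomposition concrete, the following Python sketch outlines one possible pipeline interface. It is only a minimal illustration of the stage boundaries: the data containers (`PerformerModel`, `EmotionCue`), the stage functions, the placeholder bodies, and the input file names are all assumptions introduced here, not the system's actual implementation.

```python
from dataclasses import dataclass
from typing import List

# Illustrative interfaces for the three stages described above.
# Names, fields, and placeholder logic are assumptions, not the paper's API.

@dataclass
class PerformerModel:
    """Stage 1 output: per-frame geometry and appearance of the performer."""
    joint_positions: List[List[float]]  # 3D skeleton joints recovered from RGBD
    appearance_ref: str                 # handle to the performer's appearance data

@dataclass
class EmotionCue:
    """Stage 2 output: a time-stamped emotion extracted from the music."""
    start_sec: float
    end_sec: float
    label: str        # e.g. "joy", "tension"
    intensity: float  # normalized to [0, 1]

def extract_targets(rgbd_video: str) -> PerformerModel:
    """Stage 1: recover geometric and appearance information for the targets."""
    # Placeholder: a real system would run RGBD-based skeleton tracking here.
    return PerformerModel(joint_positions=[[0.0, 0.0, 0.0]],
                          appearance_ref=rgbd_video)

def classify_emotions(music_clip: str) -> List[EmotionCue]:
    """Stage 2: segment the music clip and classify each segment's emotion."""
    # Placeholder: a real system would extract audio features and classify them.
    return [EmotionCue(start_sec=0.0, end_sec=4.0, label="joy", intensity=0.8)]

def retarget(model: PerformerModel, cues: List[EmotionCue]) -> List[str]:
    """Stage 3: manipulate the performer's motion to follow the music cues."""
    # Placeholder: a real system would warp the conducting motion per cue.
    return [f"frames {c.start_sec:.1f}-{c.end_sec:.1f}s restyled as "
            f"'{c.label}' (intensity {c.intensity})" for c in cues]

if __name__ == "__main__":
    performer = extract_targets("conductor.rgbd")  # hypothetical input file
    cues = classify_emotions("symphony.wav")       # hypothetical input file
    for segment in retarget(performer, cues):
        print(segment)
```

The design point the sketch makes is that the stages communicate only through the `PerformerModel` and `EmotionCue` containers, so the video-analysis, music-analysis, and retargeting components can be developed and evaluated independently.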