Cameras and IoT devices each have particular capabilities for tracking human behavior and status, but the correlation between the data they produce is unclear. In this work, we propose a framework that integrates video and wearable sensing data for smart surveillance tasks such as person identification and tracking. Biometric features such as fingerprint, iris, gait, and face can yield good recognition results, but these approaches are all limited in working distance and raise privacy concerns. We therefore present a deep-learning-based data fusion framework for video and wearable sensing data, in which deep learning is used to adaptively learn the hidden bindings between the two data sources. We demonstrate how to retrieve data of interest from IoT devices attached to people and correctly tag those data onto the corresponding people captured by a camera, thereby correlating video and IoT data. Potential applications of this framework include smart surveillance and user-friendly visualization. We also present a case study that integrates video data with body movement and physiological data.
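
To make the tagging problem concrete, the following is a minimal illustrative sketch and not the deep learning fusion model proposed in this work. It assumes that a per-frame motion magnitude has already been extracted for each person track in the video and for each wearable device, and it swaps in plain Pearson correlation with a brute-force assignment as a stand-in for the learned binding. All names (`pearson`, `tag_tracks`, the track and device IDs) are hypothetical.

```python
from itertools import permutations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Returns 0.0 when either sequence is constant (correlation undefined).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def tag_tracks(video_motion, device_motion):
    """Map each device ID to one video track ID.

    Assumption: both inputs are dicts of time-aligned motion-magnitude
    series of equal length, and there are at least as many tracks as
    devices (otherwise an empty dict is returned).
    The assignment maximizes the summed correlation by brute force over
    permutations, so it is only practical for a handful of people.
    """
    track_ids = list(video_motion)
    device_ids = list(device_motion)
    best_score = float("-inf")
    best_assignment = {}
    for perm in permutations(track_ids, len(device_ids)):
        score = sum(
            pearson(video_motion[track], device_motion[device])
            for track, device in zip(perm, device_ids)
        )
        if score > best_score:
            best_score = score
            best_assignment = dict(zip(device_ids, perm))
    return best_assignment
```

Given two video tracks and two wearables, `tag_tracks` returns a dictionary from device ID to the track whose motion pattern it most closely follows. A learned model, as proposed here, would replace the hand-picked correlation score with a similarity learned from the data.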