Object detection is an important application for modern smart embedded devices. It enables a device to recognize its surrounding environment and support intelligent applications. However, its intensive computation requirements make object detection expensive to run on resource-constrained embedded devices. Parallel processing on multi-core systems can boost performance, but the memory bottleneck limits performance scalability. Improving the data locality of the on-chip cache has therefore become a critical design concern. This paper analyzes the memory behavior of a parallel Viola-Jones algorithm and proposes a scheme to enhance the data locality of the on-chip cache. Evaluated with a multi-threaded object detection algorithm running on a cycle-accurate multi-core simulator, the proposed approach achieves up to 58% better performance than the original parallel program.