Multi-camera systems are widely used in video surveillance applications. When an event is monitored across multiple cameras, an expert can easily form the corresponding spatial representation and follow the sequence of events. For users new to the environment, however, this is not trivial. Supported by psychological evidence, we propose an approach that mimics the generation of a pictorial representation of mental images as a target moves across camera views. We first conduct a ball-rolling experiment to compare this approach with others. The empirical results show that users perform significantly better with this approach than with the others. We suggest that this is because the approach helps users preserve their spatial representation of the environment during transitions between camera views. We then propose a framework that realizes this approach. Demonstrations in different situations indicate the validity of the framework.