This paper proposes a semantic video object segmentation system that combines spatio-temporal video segmentation and region tracking to extract important semantic objects from videos. First, video frames are segmented into regions using multiple cues, including color, edges, motion, and kernel-based models. Because these features complement one another, all desired regions can be segmented well from the input frames even when they are captured by a non-stationary camera. Next, using the temporal information of each segmented region, a region adjacency graph (RAG) is constructed to record the relative relations among regions. Based on the RAG, we propose a Bayesian classifier that groups regions by checking their spatial and temporal similarities, so that related regions are merged and associated to form a meaningful object. Since kernel-based analysis is incorporated into the designed classifier, all desired semantic objects can be extracted well even when they remain static in the video. Experimental results demonstrate the superiority of the proposed method in object segmentation.
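The RAG-based grouping described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, adjacency is taken as 4-connectivity on a label map, and the paper's Bayesian spatial-temporal test is replaced here by a simple cosine-similarity threshold on per-region feature vectors.

```python
import numpy as np
from collections import defaultdict

def build_rag(labels):
    """Build a region adjacency graph from a 2-D label map.

    Nodes are region labels; an edge links two regions that share
    a 4-connected pixel boundary.
    """
    edges = set()
    # Compare each pixel with its right and bottom neighbours.
    for dy, dx in [(0, 1), (1, 0)]:
        a = labels[:labels.shape[0] - dy, :labels.shape[1] - dx]
        b = labels[dy:, dx:]
        boundary = a != b
        for u, v in zip(a[boundary].ravel(), b[boundary].ravel()):
            edges.add((min(u, v), max(u, v)))
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    return graph

def merge_similar(graph, features, threshold=0.9):
    """Greedily group adjacent regions whose feature similarity exceeds
    `threshold` (a stand-in for the paper's Bayesian classifier)."""
    parent = {r: r for r in graph}

    def find(r):  # union-find with path compression
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    for u in graph:
        for v in graph[u]:
            if u < v and cosine(features[u], features[v]) > threshold:
                parent[find(u)] = find(v)
    return {r: find(r) for r in parent}

# Toy 4x4 label map with three regions.
labels = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1],
                   [2, 2, 1, 1],
                   [2, 2, 2, 2]])
rag = build_rag(labels)
# Hypothetical per-region feature vectors (e.g. mean color/motion).
feats = {0: np.array([1.0, 0.0]),
         1: np.array([0.0, 1.0]),
         2: np.array([0.95, 0.1])}
groups = merge_similar(rag, feats)  # regions 0 and 2 merge into one object
```

In the actual system, the similarity check would combine the spatial and temporal evidence described in the abstract rather than a single cosine score, but the graph-then-merge structure is the same.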