Surface layout recovery helps computers understand the intricate information in an image by assigning local segments to different geometric classes. It greatly reduces the complexity of subsequent image processing and is widely used in computer vision applications. However, the algorithm visits every image pixel and is therefore computationally intensive. Through a comprehensive analysis of its execution behavior, this paper identifies significant parallelism inherent in the algorithm. With careful attention to both the multi-threaded software and the parallel hardware, the optimized parallel design on a modern GPGPU achieves an average performance improvement of 10.7X.