Modern GPGPUs implement an on-chip shared cache to better exploit the data reuse of various general-purpose applications. Given the massive number of concurrent threads in a GPGPU, striking a balance between data locality and load balance has become a critical design concern. To achieve the best performance, the two factors must be optimized jointly rather than in isolation. This paper proposes a dynamic thread scheduler that co-optimizes data locality and load balance on a GPGPU. The proposed approach is evaluated using three applications with various input datasets. The results show that it reduces overall execution cycles by up to 16% compared with approaches that address only one of the two objectives.