In the H.264 video compression standard, the deblocking filtering contributes about one-third of all computation in the decoder. With many-core architectures becoming the future trend of system design, computation time can be reduced if the deblocking appropriately apportions its operations to multiple processing elements. In this study, we used a four-pixel-long boundary as the basis for analyzing and exploiting possible parallelism. Compared with the two-dimensional (2D) wavefront method order for deblocking both 1920×1080- and 1080×1920-pixel frames, the proposed design exhibits speedups of 1.92 and 2.44 times, respectively, given an unlimited number of processing elements. Compared with our previous design, it gains speedups of 1.25 and 1.13 times, respectively. In addition, as the frame size grows, this approach requires only extra time that is proportional to the square root of the frame size increase (keeping the same width to height ratio), pushing the boundary of practical real-time deblocking of increasingly larger video sizes.