This paper presents an efficient hardware-software implementation of all-binary motion estimation (ABME), using macroblock-based pipelining and a bus-interleaved architecture. ABME has been shown to be simple and low-cost to implement in hardware. The bus-interleaved preprocessing module of the ABME architecture generates downsampled and binarized data in the same flow, without additional dedicated hardware. With the three-layer binary bitplanes of ABME, we use a two-dimensional (2-D) mapping unit and a binary adder tree, instead of a systolic array, to compute the block-matching metric, the sum of differences (SoD), in one cycle. In addition, we propose a new bus bandwidth reduction scheme that reuses the binarized image and saves up to 67% of the bus bandwidth. Experiments show that our design completes ABME for each macroblock within 283 cycles, with a gate count of 65k when synthesized with the UMC 0.18 µm cell library.
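To make the block-matching metric concrete, the following is a minimal software sketch of an SoD computation on binary blocks. It assumes that the per-pixel difference between two binarized blocks is an exclusive-OR and that the adder tree simply counts the mismatching bits; the function names (`pack_row`, `sod`, `best_match`) and the row-packed block layout are hypothetical and are not taken from the proposed hardware.

```python
def pack_row(bits):
    """Pack a list of 0/1 pixel values into one integer (first pixel is the MSB)."""
    word = 0
    for b in bits:
        word = (word << 1) | (b & 1)
    return word


def sod(cur_rows, ref_rows):
    """Sum of differences between two binary blocks given as packed rows.

    Assumption: a 'difference' is a bit mismatch, so each row costs
    popcount(cur XOR ref). In hardware the popcount would be an adder tree.
    """
    return sum(bin(c ^ r).count("1") for c, r in zip(cur_rows, ref_rows))


def best_match(cur_rows, candidates):
    """Return (index, cost) of the candidate block with the smallest SoD.

    Ties are resolved in favor of the earliest candidate.
    """
    best_idx, best_cost = -1, None
    for idx, ref_rows in enumerate(candidates):
        cost = sod(cur_rows, ref_rows)
        if best_cost is None or cost < best_cost:
            best_idx, best_cost = idx, cost
    return best_idx, best_cost


# Tiny 2x4 example block and two candidate reference blocks.
cur = [pack_row([1, 0, 1, 1]), pack_row([0, 0, 1, 0])]
ref_a = [pack_row([1, 1, 1, 1]), pack_row([0, 0, 0, 0])]
ref_b = [pack_row([1, 0, 1, 1]), pack_row([0, 0, 1, 0])]
result = best_match(cur, [ref_a, ref_b])
```

The sketch evaluates candidates sequentially, whereas the architecture described here computes the SoD of a candidate in a single cycle; the code illustrates only the arithmetic, not the timing.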