This paper presents an implementation of an energy efficient bit-plane payload design for machine learning processor. The proposed architecture facilitates high parallelism and high data bandwidth and thus improves the model learning/training time of machine learning algorithms. By assembling multiple bits as a bit-plane and enlarging query parallelism with a central compare-flag updater, data processing parallelism can be increased. Binary sequential partition (BSP), a fast density estimation algorithm capable of dealing with high dimensional data sets, is realized. Fabricated in 90nm 1P9M CMOS process, the processing rate can achieve 16.9 Gb/sec with 8 queries for data dimension D=210. The test chip integrates 64 counting cells and provides 5 modes with power consumptions of 1.86mJ/Gb per Query.