This paper introduces a turbo decoder that utilizes multiple soft-in/soft-out (SISO) decoders to decode one codeword. In addition, each SISO decoder is modified to allow simultaneous execution over multiple successive trellis stages. The design issues related to the architecture with parallel high-radix SISO decoders are discussed. First, a contention-free interleaver for the hybrid parallelism is presented to overcome the complicated collision problem as well as reduce interconnection network complexity. Second, two techniques for the high-speed add-compare-select (ACS) circuits are given to lessen area overhead of the SISO decoder. Third, a modification of the processing schedule is made for higher operating efficiency. Two designs with parallel architecture have been implemented. The first design with 32 SISO decoders, each of which processes 2 symbols per cycle, has 160 Mb/s and 0.22 nJ/b/iter after measurement. The second design uses 16 SISO decoders to deal with 4 symbols per cycle and achieves 100% efficiency, leading to 1000 Mb/s and 0.15 nJ/b/iter in post-layout simulation.