Exploring the VLSI scalability of stream processors
Stream processors are high-performance programmable processors optimized to run media applications. Recent work has shown these processors to be more area-and energy-efficient than conventional programmable architectures. This paper explores the scalability of stream architectures to future VLSI technologies where over a thousand floating-point units on a single chip will be feasible. Two techniques for increasing the number of ALU in a stream processor are presented: intracluster and intercluster scaling. These scaling techniques are shown to be cost-efficient to tens of ALU per cluster and to hundreds of arithmetic clusters. A 640-ALU stream processor with 128 clusters and 5 ALU per cluster is shown to be feasible in 45 nanometer technology, sustaining over 300 GOPS on kernels and providing 15.3× of kernel speedup and 8.0× of application speedup over a 40-ALU stream processor with a 2% degradation in area per ALU and a 7% degradation in energy dissipated per ALU operation.