INCA: Input-stationary Dataflow at Outside-the-box Thinking about Deep Learning Accelerators
This paper presents INCA, an input-stationary (IS) crossbar accelerator that supports both inference and training for deep neural networks (DNNs). Processing-in-memory (PIM) accelerators for DNNs have been actively researched, particularly with resistive random-access memory (RRAM), owing to RRAM's combined computing and storage capabilities and its device merits. To the best of our knowledge, all previous PIM accelerators store weights in RRAMs and inputs (activations) in conventional memories, which naturally forms a weight-stationary (WS) dataflow. WS has generally been considered the optimal choice for high parallelism and data reuse. However, WS-based PIM accelerators exhibit fundamental limitations: first, a persistent dependence on DRAM and buffers for fetching and storing inputs (activations); second, a substantial number of extra RRAMs for transposed weights and additional computational intermediates during training; third, coarse-grained arrays that demand high-bit analog-to-digital converters (ADCs) and suffer poor utilization in depthwise and pointwise convolution; finally, degraded accuracy due to sensitivity of the weights to RRAM's nonideality. On the other hand, we observe that an IS dataflow, in which RRAMs retain inputs (activations), can effectively address these limitations: only weights need to be loaded, no extra RRAMs are required, fine-grained accelerator design becomes feasible, and accuracy is less affected by input (activation) variance. However, IS dataflow is hard to realize on the existing crossbar structure, since kernel sliding is difficult to implement while preserving high parallelism. To support kernel movement, we devise a two-transistor-one-RRAM (2T1R) cell structure. Based on the 2T1R cell, we design a novel three-dimensional (3D) architecture for high parallelism in batch training. Our experimental results demonstrate the potential of INCA.
Compared to the WS accelerator, INCA achieves up to 20.6× and 260× energy-efficiency improvement in inference and training, respectively, as well as 4.8× (inference) and 18.6× (training) speedup. While accuracy under WS drops to 15% in our high-noise simulation, INCA remains robust at 86% accuracy.
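The WS/IS contrast at the heart of the abstract can be sketched numerically: a convolution lowered (via im2col) to a matmul can be computed either by keeping the weight matrix stationary in the crossbar and streaming input columns, or by keeping the input-patch matrix stationary and streaming weight rows. The sketch below is illustrative only; the matrix sizes and the `crossbar_mac` helper are assumptions for demonstration, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Convolution lowered to a matmul: Y = W @ X
# W: [C_out, K] weight matrix; X: [K, P] im2col input patches (hypothetical sizes)
C_out, K, P = 4, 9, 16
W = rng.standard_normal((C_out, K))
X = rng.standard_normal((K, P))

def crossbar_mac(stationary, streamed_vec):
    """One crossbar step: the stationary matrix multiplies one streamed vector."""
    return stationary @ streamed_vec

# Weight-stationary (WS): W is programmed into the crossbar once;
# input columns are streamed in, one per cycle.
Y_ws = np.stack([crossbar_mac(W, X[:, p]) for p in range(P)], axis=1)

# Input-stationary (IS): the input-patch matrix is held in the crossbar;
# weight rows are streamed in instead (note the transposed orientation).
Y_is = np.stack([crossbar_mac(X.T, W[c, :]) for c in range(C_out)], axis=0)

# Both dataflows compute the same lowered convolution.
assert np.allclose(Y_ws, W @ X)
assert np.allclose(Y_is, W @ X)
```

The equivalence shows why the choice between WS and IS is purely a question of what data stays resident in the crossbar: in IS, the weights become the streamed operand, which is what eliminates the DRAM/buffer traffic for activations that the abstract identifies as a WS limitation.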