Robin: An FPGA-Based RWKV Accelerator Using Block Circulant Matrices
Recent advancements in linear-attention models such as RWKV have opened up new possibilities for efficient sequence processing by reducing the computational overhead of traditional Transformer architectures. Field-programmable gate arrays (FPGAs) offer a compelling platform for deep learning applications, providing customizable hardware architectures that enhance computational efficiency and flexibility. However, deploying these models on FPGAs introduces several challenges. Previous FPGA deployment workflows tend to target general machine learning tasks and lack sufficient integration between software and hardware optimizations. In addition, FPGAs are constrained by limited on-chip and off-chip memory, which poses significant challenges for weight storage. Moreover, linear operations dominate the GPU runtime, creating significant computational bottlenecks. These obstacles call for innovative solutions that bridge the performance gap between FPGAs and GPUs while preserving model accuracy.

To overcome these challenges, we introduce Robin, a fine-grained FPGA accelerator workflow that integrates algorithm-level and hardware-level optimizations. Robin leverages a weight compression technique based on Partial Block Circulant Matrices (PBCM), which effectively reduces storage demands while maintaining accuracy. Building on PBCM, our design employs a configurable circulant computing core that fully exploits the bit-width efficiency of DSP48E resources through two DSP packing strategies, supporting both circulant and standard matrix operations. This end-to-end software-hardware co-design enables Robin to achieve up to a 3.09× increase in throughput and a 7.31× boost in energy efficiency compared to a high-end Tesla A100 GPU implementation, making it a compelling solution for deploying RWKV models on FPGAs.
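To illustrate the basic idea behind block-circulant weight compression that PBCM builds on (this is a minimal sketch, not the paper's implementation; the block size, helper names, and random weights are assumptions for illustration), each b×b block of a weight matrix is replaced by a circulant block defined by a single length-b vector, cutting per-block storage from b² to b, and the block matrix-vector product reduces to circular convolutions that can be computed with FFTs:

```python
# Minimal block-circulant matrix-vector sketch (illustrative only).
import numpy as np

def circulant_matvec(c, x):
    """y = C @ x, where C is the circulant matrix whose first column is c.
    Computed as a circular convolution via FFT."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

def block_circulant_matvec(blocks, x, b):
    """blocks[i][j] is the length-b defining vector of the (i, j) circulant block."""
    rows, cols = len(blocks), len(blocks[0])
    y = np.zeros(rows * b)
    for i in range(rows):
        acc = np.zeros(b)
        for j in range(cols):
            acc += circulant_matvec(blocks[i][j], x[j * b:(j + 1) * b])
        y[i * b:(i + 1) * b] = acc
    return y

# Example: a 4x8 weight matrix compressed with block size b = 4
# stores only 2 * b values instead of 4 * 8.
b = 4
blocks = [[np.random.randn(b) for _ in range(2)]]  # 1x2 grid of circulant blocks
x = np.random.randn(2 * b)
y = block_circulant_matvec(blocks, x, b)
```

In a "partial" scheme like PBCM, only a subset of the weight matrices or blocks would be constrained to this circulant structure so that accuracy-sensitive weights remain unrestricted; the exact partitioning and the hardware mapping onto DSP48E resources are specific to the paper's design.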