Scholars@Duke publication: FlashSVD: Memory-Efficient Inference with Streaming for Low-Rank Model

FlashSVD: Memory-Efficient Inference with Streaming for Low-Rank Model

Publication, Conference

Shao, Z; Wang, Y; Wang, Q; Jiang, T; Du, Z; Ye, H; Zhuo, D; Chen, Y; Li, H

Published in: Proceedings of the AAAI Conference on Artificial Intelligence

January 1, 2026

Singular Value Decomposition (SVD) has recently gained traction as an effective compression technique for large language models (LLMs), with many studies reporting 20–80% parameter reduction at minimal accuracy cost. However, despite reducing weight memory, existing SVD-based approaches still rely on standard dense CUDA kernels during inference, which incur substantial (and ultimately unnecessary) activation memory overhead. Our analysis reveals that this kernel-induced cost, which grows with sequence length and hidden size, in the worst case prevents any real reduction in peak inference memory, limiting the practical impact of SVD compression for on-device deployment. To address this bottleneck, we propose FlashSVD, an end-to-end, rank-aware streaming inference framework for SVD-compressed LLMs. FlashSVD integrates seamlessly with any SVD-based model and directly fuses low-rank projection kernels into self-attention and feed-forward pipelines. This design avoids materializing large activation buffers by streaming small tiles of truncated factors through on-chip SRAM, performing on-the-fly multiplication and reduction, and immediately evicting results, thus preserving high GPU occupancy without introducing latency. On standard benchmarks (e.g., BERT-Base), FlashSVD reduces peak activation memory by up to 70.2% and transient memory by 75%, with zero accuracy loss against low-rank baselines, enabling truly memory-efficient deployment of low-rank LLMs.
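
The tiling idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' kernel: it runs on the CPU, uses ReLU as a stand-in activation, and every name in it (`ffn_lowrank_reference`, `ffn_lowrank_streamed`, `tile`, the factor names `U1`, `V1`, `U2`, `V2`) is hypothetical. It assumes a feed-forward block whose two weight matrices have each been replaced by a rank-`r` product, and shows only that processing the hidden dimension in column tiles gives the same output without ever holding the full `(seq, d_ff)` activation.

```python
import numpy as np


def ffn_lowrank_reference(x, U1, V1, U2, V2):
    """Baseline: materializes the full (seq, d_ff) hidden activation."""
    h = np.maximum((x @ U1) @ V1, 0.0)  # (seq, d_ff) buffer
    return (h @ U2) @ V2


def ffn_lowrank_streamed(x, U1, V1, U2, V2, tile=64):
    """Streams d_ff in column tiles; only a (seq, tile) buffer is live at once."""
    p = x @ U1  # (seq, r): small rank-space projection of the input
    z = np.zeros((x.shape[0], U2.shape[1]), dtype=x.dtype)  # (seq, r) accumulator
    d_ff = V1.shape[1]
    for start in range(0, d_ff, tile):
        stop = min(start + tile, d_ff)
        # Expand one tile of hidden units, apply the elementwise activation,
        # and immediately reduce it back into rank space.
        h_tile = np.maximum(p @ V1[:, start:stop], 0.0)
        z += h_tile @ U2[start:stop, :]
    return z @ V2


# Illustrative sizes only; they are not taken from the paper.
rng = np.random.default_rng(0)
seq, d_model, d_ff, r = 128, 64, 256, 16
x = rng.standard_normal((seq, d_model))
U1 = rng.standard_normal((d_model, r))
V1 = rng.standard_normal((r, d_ff))
U2 = rng.standard_normal((d_ff, r))
V2 = rng.standard_normal((r, d_model))

y_ref = ffn_lowrank_reference(x, U1, V1, U2, V2)
y_str = ffn_lowrank_streamed(x, U1, V1, U2, V2, tile=64)
outputs_match = bool(np.allclose(y_ref, y_str))
```

The streamed version works because the activation is elementwise, so each tile of hidden units can be expanded, activated, and folded into the rank-`r` accumulator independently. A fused GPU kernel would keep the tile in on-chip memory rather than in a NumPy temporary, which this sketch does not model.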

Duke Scholars

Author Yiran Chen Pierre R. Lamond Department of Electrical and Computer Engin ...

Published In

Proceedings of the AAAI Conference on Artificial Intelligence

DOI

10.1609/aaai.v40i30.39720

EISSN

2374-3468

ISSN

2159-5399

Publication Date

January 1, 2026

Volume

40

Issue

30

Start / End Page

25278 / 25285

Citation

APA

Shao, Z., Wang, Y., Wang, Q., Jiang, T., Du, Z., Ye, H., … Li, H. (2026). FlashSVD: Memory-Efficient Inference with Streaming for Low-Rank Model. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 40, pp. 25278–25285). https://doi.org/10.1609/aaai.v40i30.39720

Chicago

Shao, Z., Y. Wang, Q. Wang, T. Jiang, Z. Du, H. Ye, D. Zhuo, Y. Chen, and H. Li. “FlashSVD: Memory-Efficient Inference with Streaming for Low-Rank Model.” In Proceedings of the AAAI Conference on Artificial Intelligence, 40:25278–85, 2026. https://doi.org/10.1609/aaai.v40i30.39720.

ICMJE

Shao Z, Wang Y, Wang Q, Jiang T, Du Z, Ye H, et al. FlashSVD: Memory-Efficient Inference with Streaming for Low-Rank Model. In: Proceedings of the AAAI Conference on Artificial Intelligence. 2026. p. 25278–85.

MLA

Shao, Z., et al. “FlashSVD: Memory-Efficient Inference with Streaming for Low-Rank Model.” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 30, 2026, pp. 25278–85. Scopus, doi:10.1609/aaai.v40i30.39720.

NLM

Shao Z, Wang Y, Wang Q, Jiang T, Du Z, Ye H, Zhuo D, Chen Y, Li H. FlashSVD: Memory-Efficient Inference with Streaming for Low-Rank Model. Proceedings of the AAAI Conference on Artificial Intelligence. 2026. p. 25278–25285.
