
FlashSVD: Memory-Efficient Inference with Streaming for Low-Rank Model

Publication, Conference
Shao, Z; Wang, Y; Wang, Q; Jiang, T; Du, Z; Ye, H; Zhuo, D; Chen, Y; Li, H
Published in: Proceedings of the AAAI Conference on Artificial Intelligence
January 1, 2026

Singular Value Decomposition (SVD) has recently gained traction as an effective compression technique for large language models (LLMs), with many studies reporting 20-80% parameter reduction at minimal accuracy cost. However, despite reducing weight memory, existing SVD-based approaches still rely on standard dense CUDA kernels during inference, which incur substantial, and ultimately unnecessary, activation memory overhead. Our analysis reveals that this kernel-induced cost, which grows with sequence length and hidden size, can in the worst case prevent any real reduction in peak inference memory, limiting the practical impact of SVD compression for on-device deployment. To address this bottleneck, we propose FlashSVD, an end-to-end, rank-aware streaming inference framework for SVD-compressed LLMs. FlashSVD integrates seamlessly with any SVD-based model and directly fuses low-rank projection kernels into the self-attention and feed-forward pipelines. This design avoids materializing large activation buffers by streaming small tiles of the truncated factors through on-chip SRAM, performing on-the-fly multiplication and reduction, and immediately evicting results, thus preserving high GPU occupancy without introducing latency. On standard benchmarks (e.g., BERT-Base), FlashSVD reduces peak activation memory by up to 70.2% and transient memory by 75%, with zero accuracy loss against low-rank baselines, enabling truly memory-efficient deployment of low-rank LLMs.
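The core idea of the abstract, streaming small tiles of the truncated factors so that the low-rank intermediate is never materialized for the whole sequence at once, can be sketched in plain NumPy. This is a hypothetical illustration under simplified assumptions (a single linear layer with factors `U`, `V`, and a row-tile loop standing in for the fused on-chip SRAM kernel), not the authors' CUDA implementation:

```python
import numpy as np

def lowrank_matmul_tiled(X, U, V, tile=128):
    """Compute Y = (X @ U) @ V one row-tile at a time.

    X: (n, d) activations; U: (d, r), V: (r, d_out) truncated SVD factors.
    Only a (tile, r) intermediate exists at any moment, analogous to a
    tile streamed through on-chip SRAM and evicted immediately.
    Illustrative sketch only; names and tiling are assumptions.
    """
    n = X.shape[0]
    Y = np.empty((n, V.shape[1]), dtype=X.dtype)
    for i in range(0, n, tile):
        P = X[i:i + tile] @ U      # small low-rank projection for this tile
        Y[i:i + tile] = P @ V      # expand, write out, then discard P
    return Y
```

The result is numerically identical to the dense two-step product `(X @ U) @ V`, but the peak size of the intermediate drops from `n × r` to `tile × r`, which is the memory effect the paper targets at the kernel level.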


Published In

Proceedings of the AAAI Conference on Artificial Intelligence

DOI

10.1609/aaai.v40i30.39720

EISSN

2374-3468

ISSN

2159-5399

Publication Date

January 1, 2026

Volume

40

Issue

30

Start / End Page

25278 / 25285

Citation

APA: Shao, Z., Wang, Y., Wang, Q., Jiang, T., Du, Z., Ye, H., … Li, H. (2026). FlashSVD: Memory-Efficient Inference with Streaming for Low-Rank Model. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 40, pp. 25278–25285). https://doi.org/10.1609/aaai.v40i30.39720
Chicago: Shao, Z., Y. Wang, Q. Wang, T. Jiang, Z. Du, H. Ye, D. Zhuo, Y. Chen, and H. Li. “FlashSVD: Memory-Efficient Inference with Streaming for Low-Rank Model.” In Proceedings of the AAAI Conference on Artificial Intelligence, 40:25278–85, 2026. https://doi.org/10.1609/aaai.v40i30.39720.
ICMJE: Shao Z, Wang Y, Wang Q, Jiang T, Du Z, Ye H, et al. FlashSVD: Memory-Efficient Inference with Streaming for Low-Rank Model. In: Proceedings of the AAAI Conference on Artificial Intelligence. 2026. p. 25278–85.
MLA: Shao, Z., et al. “FlashSVD: Memory-Efficient Inference with Streaming for Low-Rank Model.” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 30, 2026, pp. 25278–85. Scopus, doi:10.1609/aaai.v40i30.39720.
NLM: Shao Z, Wang Y, Wang Q, Jiang T, Du Z, Ye H, Zhuo D, Chen Y, Li H. FlashSVD: Memory-Efficient Inference with Streaming for Low-Rank Model. Proceedings of the AAAI Conference on Artificial Intelligence. 2026. p. 25278–25285.