
Ecco: Improving Memory Bandwidth and Capacity for LLMs via Entropy-Aware Cache Compression

Publication · Conference
Cheng, F; Guo, C; Wei, C; Zhang, J; Zhou, C; Hanson, E; Liu, X; Li, H; Chen, Y
Published in: Proceedings International Symposium on Computer Architecture
June 21, 2025

Large language models (LLMs) have demonstrated transformative capabilities across diverse artificial intelligence applications, yet their deployment is hindered by substantial memory and computational demands, especially in resource-constrained environments. Quantization techniques have emerged as a critical solution, reducing data precision to enhance memory and computational efficiency. However, existing methods often suffer from high runtime overheads and potential accuracy degradation. To address these challenges, we propose Ecco, an entropy-based cache compression technique tailored for LLMs. Ecco combines group-wise and nonuniform quantization with predefined shared k-means patterns and Huffman coding to exploit the inherent entropy characteristics of LLM cache data. Recognizing the inefficiencies of traditional Huffman coding in terms of parallelism and latency, we introduce a novel parallel Huffman-based decoding process with a multi-stage pipeline design, reducing latency by two orders of magnitude and achieving throughput comparable to GPU L2 caches. Comprehensive evaluations demonstrate that Ecco achieves up to a 2.9× and 1.9× speedup over the state-of-the-art AWQ and SmoothQuant frameworks, respectively, and a 2.4× speedup over the Olive accelerator, all while increasing memory capacity by nearly 4× and maintaining state-of-the-art LLM accuracy. These results underscore the effectiveness of our entropy-based cache compression in enhancing LLM performance and efficiency, paving the way for more deployable large-scale AI models.
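The pipeline the abstract describes (grouping cache values, quantizing each group nonuniformly with a k-means codebook, then entropy-coding the resulting indices with Huffman coding) can be illustrated with a minimal pure-Python sketch. This is not the paper's implementation — Ecco uses predefined shared patterns and a hardware parallel decoder — and the group size, cluster count, and helper names here are illustrative assumptions.

```python
import heapq
import random
from collections import Counter

def kmeans_1d(values, k=4, iters=10):
    """Tiny 1-D k-means: returns a codebook (centroids) and index assignments.
    Stand-in for Ecco's predefined shared k-means patterns."""
    centroids = sorted(random.sample(values, k))
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        for c in range(k):
            members = [v for v, a in zip(values, assign) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, assign

def huffman_codes(symbols):
    """Build a Huffman code (symbol -> bitstring) from symbol frequencies."""
    freq = Counter(symbols)
    if len(freq) == 1:  # degenerate case: a single symbol still needs one bit
        return {next(iter(freq)): "0"}
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)  # keeps the heap from ever comparing dicts
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

random.seed(0)
group = [random.gauss(0.0, 1.0) for _ in range(64)]   # one hypothetical cache group
codebook, idx = kmeans_1d(group, k=4)                 # nonuniform 2-bit quantization
codes = huffman_codes(idx)                            # entropy-code the indices
bitstream = "".join(codes[i] for i in idx)            # vs. 64 * 16 bits for fp16

# Sequential decode (the part Ecco replaces with a parallel pipelined decoder):
inverse = {code: s for s, code in codes.items()}
decoded, buf = [], ""
for bit in bitstream:
    buf += bit
    if buf in inverse:
        decoded.append(inverse[buf])
        buf = ""
assert decoded == idx  # lossless recovery of the quantized indices
```

The sequential bit-by-bit decode loop at the end is exactly the serial dependency that makes classical Huffman decoding slow; the paper's contribution is restructuring it into a multi-stage parallel pipeline so decompression keeps pace with GPU L2 cache bandwidth.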


Published In

Proceedings International Symposium on Computer Architecture

DOI

10.1145/3695053.3731024
EISSN

2575-713X

ISSN

1063-6897

Publication Date

June 21, 2025

Start / End Page

793 / 807

Citation

APA: Cheng, F., Guo, C., Wei, C., Zhang, J., Zhou, C., Hanson, E., … Chen, Y. (2025). Ecco: Improving Memory Bandwidth and Capacity for LLMs via Entropy-Aware Cache Compression. In Proceedings International Symposium on Computer Architecture (pp. 793–807). https://doi.org/10.1145/3695053.3731024

Chicago: Cheng, F., C. Guo, C. Wei, J. Zhang, C. Zhou, E. Hanson, X. Liu, H. Li, and Y. Chen. “Ecco: Improving Memory Bandwidth and Capacity for LLMs via Entropy-Aware Cache Compression.” In Proceedings International Symposium on Computer Architecture, 793–807, 2025. https://doi.org/10.1145/3695053.3731024.

ICMJE: Cheng F, Guo C, Wei C, Zhang J, Zhou C, Hanson E, et al. Ecco: Improving Memory Bandwidth and Capacity for LLMs via Entropy-Aware Cache Compression. In: Proceedings International Symposium on Computer Architecture. 2025. p. 793–807.

MLA: Cheng, F., et al. “Ecco: Improving Memory Bandwidth and Capacity for LLMs via Entropy-Aware Cache Compression.” Proceedings International Symposium on Computer Architecture, 2025, pp. 793–807. Scopus, doi:10.1145/3695053.3731024.

NLM: Cheng F, Guo C, Wei C, Zhang J, Zhou C, Hanson E, Liu X, Li H, Chen Y. Ecco: Improving Memory Bandwidth and Capacity for LLMs via Entropy-Aware Cache Compression. Proceedings International Symposium on Computer Architecture. 2025. p. 793–807.
