Athena: A Plug-and-Play Advisor for Retrieval-Augmented Generation using VectorDB
Retrieval-Augmented Generation (RAG) has emerged as a popular technique for addressing several limitations of Large Language Model (LLM) systems, including static model knowledge, hallucination, and limited input sequence lengths. Although RAG mitigates these pitfalls, its inherent heterogeneity and configurability introduce new challenges. RAG performance is crucial for meeting the high-throughput, low-latency demands of LLM services. Different RAG components run on different hardware platforms, and each component's behavior depends on how the rest of the system is configured. For example, larger embeddings may improve retrieval accuracy, but they also increase the latency of embedding creation and indexing, degrading the system's performance and raising its energy consumption. A comprehensive characterization of an end-to-end RAG system therefore becomes necessary. In this work, we build Athena, an end-to-end RAG benchmarking framework that supports various embedding models, vector databases, index/search algorithms, and LLMs. By characterizing the system under a range of RAG settings built with Athena, we demystify RAG: we identify performance bottlenecks and quantify the impact of each sub-component on overall system performance. In addition, the plug-and-play, open-sourced Athena framework is designed to assist future RAG research.
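To make the pipeline stages the abstract refers to concrete, the following is a minimal, self-contained sketch of a RAG flow (embed, index, retrieve, augment the prompt). It is purely illustrative and not part of Athena: the bag-of-characters `embed` function and the brute-force `VectorIndex` are hypothetical stand-ins for a learned embedding model and a real vector database with approximate-nearest-neighbor search.

```python
import math

def embed(text, dim=16):
    """Toy bag-of-characters embedding; a real system uses a neural model,
    and a larger `dim` would mirror the accuracy/latency trade-off above."""
    vec = [0.0] * dim
    for ch in text.lower():
        vec[ord(ch) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class VectorIndex:
    """Brute-force in-memory index; stands in for a vector database."""
    def __init__(self):
        self.entries = []  # list of (embedding, document) pairs

    def add(self, doc):
        self.entries.append((embed(doc), doc))

    def search(self, query, k=2):
        # Rank stored documents by cosine similarity to the query embedding.
        q = embed(query)
        scored = sorted(self.entries,
                        key=lambda e: -sum(a * b for a, b in zip(e[0], q)))
        return [doc for _, doc in scored[:k]]

def build_prompt(query, index, k=2):
    """Augment the user query with retrieved context before calling the LLM."""
    context = "\n".join(index.search(query, k))
    return f"Context:\n{context}\n\nQuestion: {query}"

index = VectorIndex()
for d in ["RAG retrieves documents to ground LLM answers.",
          "Vector databases store embeddings for similarity search.",
          "Larger embeddings can raise accuracy but add latency."]:
    index.add(d)

prompt = build_prompt("How does RAG use a vector database?", index)
```

Each stage in this sketch maps to a configurable Athena sub-component (embedding model, vector database, index/search algorithm, LLM), which is what makes the end-to-end characterization non-trivial.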