Retrieval-augmented generation (RAG) has become the de facto way of customizing large language models (LLMs) with bespoke information. However, RAG comes with upfront technical costs and can be slow. Now, thanks to advances in long-context LLMs, enterprises can bypass RAG by inserting all of their proprietary information directly into the prompt.
A new study by researchers at National Chengchi University in Taiwan shows that, by using long-context LLMs and caching techniques, you can create customized applications that outperform RAG pipelines. Called cache-augmented generation (CAG), this approach can be a simple and efficient replacement for RAG in enterprise settings where the knowledge corpus fits in the model’s context window.
Limitations of RAG
RAG is an effective method for handling open-domain questions and specialized tasks. It uses retrieval algorithms to gather documents that are relevant to the request and adds them to the prompt as context, enabling the LLM to craft more accurate responses.
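To make that concrete, here is a minimal sketch of the retrieve-then-generate pattern. TF-IDF similarity stands in for a production embedding model and vector database, and llm_generate() is a hypothetical placeholder for whatever LLM API you would actually call.

```python
# Minimal RAG sketch: retrieve the top-k most relevant documents,
# then prepend them to the prompt before calling the LLM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Refunds are accepted within 30 days of purchase.",
    "Enterprise plans include 24/7 support and a dedicated account manager.",
    "The public API is rate-limited to 1,000 requests per minute per key.",
]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(docs)
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, doc_vectors)[0]
    return [docs[i] for i in scores.argsort()[::-1][:k]]

def llm_generate(prompt: str) -> str:
    """Placeholder: swap in a call to your LLM provider of choice."""
    raise NotImplementedError

def answer(query: str) -> str:
    context = "\n\n".join(retrieve(query, documents))
    prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {query}"
    return llm_generate(prompt)
```

Each of those steps — embedding or scoring, indexing, ranking, prompt assembly — is extra machinery to build and maintain, which is where RAG's drawbacks come from.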
However, RAG introduces several limitations to LLM applications. The added retrieval step introduces latency that can degrade the user experience. The result also depends on the quality of the document selection and ranking step. In many cases, the limitations of the models used for retrieval require documents to be broken down into smaller chunks, which can harm the retrieval process.
And in general, RAG adds complexity to the LLM application, requiring the development, integration and maintenance of additional components. The added overhead slows the development process.
Cache-augmented generation
RAG (top) vs CAG (bottom) (source: arXiv)
The alternative to developing a RAG pipeline is to insert the entire document corpus into the prompt and have the model choose which bits are relevant to the request. This approach removes the complexity of the RAG pipeline and the problems caused by retrieval errors.
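The caching part of CAG is what makes this front-loading practical: the model's key-value (KV) cache for the corpus is computed once and reused across queries, so the knowledge base doesn't have to be re-processed on every request. Below is a rough sketch of that idea using Hugging Face transformers; the model name, file paths and question are illustrative assumptions, not the paper's exact setup.

```python
# Sketch: front-load a document corpus once, reuse its KV cache per query.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # assumed; any long-context model works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Step 1: run the full corpus through the model once and keep the KV cache.
corpus = "\n\n".join(open(path).read() for path in ["policy.txt", "handbook.txt"])
corpus_ids = tokenizer(corpus, return_tensors="pt").input_ids
with torch.no_grad():
    corpus_cache = model(corpus_ids, use_cache=True).past_key_values

# Step 2: at query time, append only the question; the corpus tokens are
# already encoded in the cache, so the model skips recomputing them.
# (For repeated queries, reuse a copy of the cache, since generation extends it.)
question = "\n\nQuestion: What is the refund window?\nAnswer:"
question_ids = tokenizer(question, return_tensors="pt", add_special_tokens=False).input_ids
full_ids = torch.cat([corpus_ids, question_ids], dim=-1)
with torch.no_grad():
    output_ids = model.generate(full_ids, past_key_values=corpus_cache, max_new_tokens=100)
print(tokenizer.decode(output_ids[0, full_ids.shape[-1]:], skip_special_tokens=True))
```

The corpus is encoded once up front, so each query pays only for its own tokens and the generated answer rather than re-reading the entire knowledge base.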
However, there are three key challenges with front-loading all documents into the prompt. First, long prompts will slow down the model and increase the costs of inference. Second, the length of the LLM’s context window sets a limit on the number of documents that fit in the prompt. And finally, adding irrelevant information to the prompt can confuse the model and reduce the quality of its answers. So, simply stuffing all your documents into the prompt instead of choosing the most relevant ones can end up hurting the model’s performance.