CacheBlend (Best Paper @ ACM EuroSys'25): Enabling 100% KV Cache Hit Rate in RAG

By LMCache Team


Breaking News: “CacheBlend” Receives the BEST PAPER AWARD at ACM EuroSys 2025

This week, at ACM EuroSys 2025 (a top academic conference in computer systems), Jiayi Yao, the first author of the CacheBlend paper, will present our work on improving LLM efficiency, particularly in retrieval-augmented generation (RAG) applications. The paper has already made waves in the research community, receiving a unanimous “accept” decision from reviewers and already inspiring multiple follow-up works.


CacheBlend is now released as part of the LMCache (https://github.com/LMCache/LMCache) and vLLM Production Stack (https://github.com/vllm-project/production-stack) packages, which allow effortless and performant K8s-native vLLM deployment.


“vLLM Production Stack [which includes CacheBlend] showcases how solid research can drive real-world impact through open sourcing within the vLLM ecosystem. By providing an optimized reference system for scalable vLLM deployment, it plays a crucial role in bridging the gap between cutting-edge innovation and enterprise-grade LLM serving.”

- Prof. Ion Stoica, Director of Sky Computing Lab at the University of California, Berkeley

Challenge: Low KV Cache Hit Rate in RAG Systems

In the world of large language models (LLMs), KV cache reuse is a crucial factor in serving performance. The standard approach in most LLM serving systems is to reuse the key-value (KV) cache only for a common prefix of the input sequence, which results in a low KV cache hit rate. In RAG applications, where retrieved text chunks are concatenated in a query-dependent order, most of the input is not a shared prefix, so the cache is underutilized, computation is repeated, and responses are slower.

Despite the importance of maximizing cache hits for LLM efficiency, traditional prefix-based caching cannot capture the true potential of the KV cache in RAG scenarios, where most of the reusable text sits at non-prefix positions in the input. This is where CacheBlend comes in.
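
To make the limitation concrete, here is a minimal, self-contained Python sketch (illustrative only, not LMCache or vLLM code) that counts how many input tokens a prefix-only cache can reuse when the same retrieved chunks appear in a different order for a new query:

```python
# Illustrative sketch (not LMCache/vLLM code): prefix-only KV reuse in a RAG prompt.
# Each "chunk" stands for a retrieved document whose KV cache was computed earlier.

def prefix_hit_tokens(cached_prompt: list[str], new_prompt: list[str]) -> int:
    """Tokens reusable under prefix-only caching: the length of the common prefix."""
    hits = 0
    for cached_chunk, new_chunk in zip(cached_prompt, new_prompt):
        if cached_chunk != new_chunk:
            break
        hits += len(cached_chunk.split())      # crude per-chunk token count
    return hits

system = "You are a helpful assistant ."
chunk_a, chunk_b, chunk_c = ("doc A " * 100, "doc B " * 100, "doc C " * 100)

query_1 = [system, chunk_a, chunk_b, "question 1 ?"]
query_2 = [system, chunk_b, chunk_c, "question 2 ?"]   # overlapping chunks, different mix

total = sum(len(chunk.split()) for chunk in query_2)
print(f"prefix-only reuse: {prefix_hit_tokens(query_1, query_2)}/{total} tokens")
```

Only the short shared system prompt hits the cache; the retrieved chunks, which dominate the prompt, are prefilled from scratch for every query even though their KV caches already exist.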

CacheBlend’s Breakthrough Approach

CacheBlend tackles this challenge head-on by innovating on cache reuse techniques, allowing a near-100% KV cache hit rate in RAG applications. Here’s how it works (a conceptual sketch follows the list):

  • Near-100% KV Cache Hit Rate: Unlike standard methods that only reuse KV caches for a common prefix, CacheBlend enables the LLM to reuse the KV cache of non-prefix texts. This drastically improves the efficiency of LLM processing in RAG scenarios by reducing the need for recomputation of key-value pairs.

  • Efficient Positional Encoding Updates: CacheBlend doesn’t simply reuse cached data as-is; it efficiently updates the positional encoding of reused KV entries to match their new positions in the combined input, so outputs remain accurate and high quality.

  • Selective Cross-Attention Recalculation: CacheBlend selectively recomputes the KV cache of the small fraction of tokens most affected by the cross-attention between reused chunks, maintaining high output quality without sacrificing efficiency.
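
The toy sketch below shows, at a purely conceptual level, where these three ideas fit together. The data structures, helper names, and “deviation” scores (random numbers here) are illustrative stand-ins, not LMCache internals:

```python
# Toy, runnable sketch of CacheBlend-style reuse (illustrative, not LMCache internals).
import random

def load_chunk_kv(chunk_id, num_tokens):
    """Pretend to load the pre-computed KV cache of one retrieved chunk."""
    return [{"chunk": chunk_id, "local_pos": i, "global_pos": None,
             # How much this token's KV would change once it can attend to the other
             # chunks in the prompt (estimated cheaply in the real system; random here
             # only to keep the sketch runnable).
             "deviation": random.random()}
            for i in range(num_tokens)]

def reapply_positions(kv, offset):
    """Shift each reused entry's positional encoding to its new global position."""
    for entry in kv:
        entry["global_pos"] = offset + entry["local_pos"]
    return kv

def blended_prefill(chunks, recompute_ratio=0.15):
    kv_cache, offset = [], 0
    for chunk_id, num_tokens in chunks:
        kv = load_chunk_kv(chunk_id, num_tokens)    # 1) reuse non-prefix KV caches
        kv_cache += reapply_positions(kv, offset)   # 2) update positional encodings
        offset += num_tokens

    # 3) selectively recompute only the tokens whose KV deviates the most once
    #    cross-chunk attention is taken into account, capped at a small budget.
    budget = int(recompute_ratio * len(kv_cache))
    to_recompute = sorted(kv_cache, key=lambda e: e["deviation"], reverse=True)[:budget]
    for entry in to_recompute:
        entry["deviation"] = 0.0                    # stand-in for a real KV recompute
    return kv_cache

cache = blended_prefill([("doc_a", 512), ("doc_b", 512), ("user_query", 32)])
print(f"{len(cache)} tokens reused, {int(0.15 * len(cache))} selectively recomputed")
```

The key point is that the recompute budget is a small, tunable fraction of the prompt, which is why the cost approaches that of a full cache hit.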

CacheBlend represents a significant shift in how cache reuse is handled in large-scale LLM systems. At its core, the technique involves three key components:

  1. Positional Encoding Optimization: One of the primary challenges in reusing KV cache across non-prefix texts is keeping the positional encoding accurate and aligned with each token’s new position in the input sequence. CacheBlend efficiently updates the positional encoding of reused entries, preserving the integrity of the model’s attention mechanism (see the RoPE sketch after this list).

  2. Selective Cross-Attention Recalculation: When chunks are prefilled independently, the attention between tokens in different chunks (the cross-attention between chunks) is missing from their KV caches. CacheBlend recomputes the KV of only the tokens that this cross-attention affects the most, so the model does not reprocess the whole prompt while still maintaining high-quality generation for each token.

  3. Cache-Aware Serving Pipeline: CacheBlend requires no changes to the model itself; its cache management is integrated into the serving system’s computation pipeline, where loading KV caches from storage is overlapped with the selective recomputation. This tight integration lets the model reuse KV cache across a broader range of inputs, significantly improving performance in real-world RAG applications.
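
To illustrate the positional-encoding step concretely, the sketch below shows how cached keys can be shifted to a new position for models that use rotary position embeddings (RoPE), as Llama-style models do. This is standard RoPE math, not LMCache code; it only demonstrates why the update is cheap: re-rotating a cached key by the position offset is equivalent to recomputing the embedding at its new position.

```python
# RoPE re-alignment of cached keys (standard rotary-embedding math, not LMCache code).
import numpy as np

def rope_rotate(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embeddings to x of shape [num_tokens, head_dim]."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)             # per-pair rotation frequencies
    angles = positions[:, None] * freqs[None, :]          # [num_tokens, half]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# A chunk's keys were cached at positions 0..n-1 (the chunk was prefilled in isolation)...
head_dim, n, new_offset = 64, 512, 777
raw_keys = np.random.randn(n, head_dim)                   # keys before RoPE
cached_keys = rope_rotate(raw_keys, np.arange(n))         # what sits in the KV cache

# ...but in the new prompt the chunk starts at `new_offset`. Because a rotation by
# (p + delta) equals a rotation by p followed by a rotation by delta, the cached keys
# can be re-aligned by rotating them by the offset alone, with no forward pass:
realigned = rope_rotate(cached_keys, np.full(n, new_offset))
expected = rope_rotate(raw_keys, np.arange(n) + new_offset)
print(np.allclose(realigned, expected))                   # True
```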

Evaluation Highlights: Achieving Unprecedented Efficiency

CacheBlend has undergone rigorous evaluation to validate these claims; here, we highlight two particular settings.

On the popular RAG dataset “2WikiMQA”, with a chunk size of 512 tokens, the following figure plots the average time to first token (TTFT) and the average answer quality (F1 score) of CacheBlend versus standard prefix-only KV cache reuse (as widely used in vLLM), across 1,500 randomly sampled queries. The model is Llama 70B, running on two A40 GPUs, and the KV cache of each chunk is independently pre-computed and stored on SSD. By enabling reuse of non-prefix KV caches, CacheBlend vastly increases the KV cache hit rate and reduces TTFT by 3x, while quality is preserved by selectively recomputing the KV cache for only a small fraction of tokens.

[Figure: average TTFT and F1 score of CacheBlend vs. standard prefix-only KV cache reuse on 2WikiMQA]
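
For readers wondering what “independently pre-computed and stored on SSD” can look like, here is a hedged sketch using the Hugging Face transformers API rather than LMCache’s actual storage backend; the model name and cache path are placeholders, and the exact structure of past_key_values varies across transformers versions:

```python
# Illustrative per-chunk KV pre-computation (transformers API, not LMCache's backend).
# Model name and cache directory are placeholders.
import hashlib, os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "mistralai/Mistral-7B-Instruct-v0.2"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16).cuda()

def precompute_chunk_kv(chunk_text: str, cache_dir: str = "/mnt/ssd/kv") -> str:
    """Prefill one retrieved chunk in isolation and persist its KV cache to SSD."""
    os.makedirs(cache_dir, exist_ok=True)
    ids = tok(chunk_text, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        out = model(ids, use_cache=True)              # prefill only, no decoding
    key = hashlib.sha256(chunk_text.encode()).hexdigest()
    path = f"{cache_dir}/{key}.pt"
    torch.save(out.past_key_values, path)             # per-layer key/value tensors
    return path
```

At query time, the stored caches of the retrieved chunks are loaded and blended as described above instead of being prefilled again.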

Next, we stress test the system by issuing multiple queries per second. A higher query rate increases each query’s response delay (TTFT), but because CacheBlend significantly reduces prefill computation, it sustains lower delay at a higher number of queries per second, i.e., higher throughput. By selectively recomputing only a small fraction of the KV cache, CacheBlend improves throughput by 3x.

[Figure: TTFT versus queries per second for CacheBlend vs. standard prefix-only KV cache reuse]
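
If you want to run a similar TTFT measurement against your own deployment, one simple approach is to stream responses from vLLM’s OpenAI-compatible endpoint and record when the first token arrives. The URL and model name below are placeholders for whatever your server exposes:

```python
# Measure time-to-first-token against an OpenAI-compatible endpoint (e.g., a vLLM server).
# Endpoint URL and model name are placeholders for your own deployment.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def measure_ttft(prompt: str, model: str = "meta-llama/Llama-3.1-8B-Instruct") -> float:
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=64,
    )
    for chunk in stream:                      # the first content chunk marks the first token
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    return float("nan")

print(f"TTFT: {measure_ttft('Summarize the retrieved documents ...'):.3f}s")
```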

Key results from our research paper include:

  • Near-100% KV Cache Hit Rate: In RAG applications, CacheBlend demonstrated an exceptional increase in the KV cache hit rate, nearing 100%. This means that the model can reuse almost all of its previous computations, vastly improving throughput and reducing latency.

  • Minimal Loss in Generation Quality: Even with the drastic increase in cache reuse, the model’s output quality remains high. Selectively recomputing the KV cache of the most-affected tokens ensures that CacheBlend does not compromise the integrity of the generated text, even when non-prefix text is reused from the cache.

  • Significant Speedup in Inference: By minimizing the amount of computation required for each generation step, CacheBlend achieves significant speedup in inference times, making it particularly suitable for real-time applications where speed is crucial.

Impact on Open-Source AI & RAG Applications

CacheBlend’s integration with the open-source LMCache library marks a milestone for the LLM community, especially in the context of RAG applications. The research has already led to tangible results in industry, with LMCache being widely adopted within the vLLM ecosystem.

CacheBlend’s efficiency improvements have the potential to drastically reduce the infrastructure cost of serving LLMs, making advanced AI accessible to a broader range of applications and industries. Its integration into vLLM, a popular library for efficient LLM serving, ensures that these advancements can be quickly adopted and scaled across the AI ecosystem.

The figure below shows how CacheBlend is deployed with vLLM in practice, via two of our open-source projects, LMCache (https://github.com/LMCache/LMCache) and vLLM Production Stack (https://github.com/vllm-project/production-stack).

[Figure: CacheBlend deployment with vLLM via LMCache and the vLLM Production Stack]
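
As a concrete (and hedged) starting point, the snippet below shows how a vLLM instance can be pointed at LMCache through vLLM’s KV-connector interface using the offline Python API. The class name, connector name, environment variable, and config path follow recent LMCache and vLLM documentation and should be treated as assumptions; check both repositories for the exact options in your version, including the experimental blending settings that enable CacheBlend:

```python
# Hedged sketch: serving with vLLM + LMCache via vLLM's KV-connector interface.
# Names below (KVTransferConfig, "LMCacheConnectorV1", LMCACHE_CONFIG_FILE) follow
# recent documentation and may differ across versions -- verify against the repos.
import os
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# LMCache reads its settings (storage backend, plus the experimental blending
# options that implement CacheBlend) from a YAML file referenced here.
os.environ["LMCACHE_CONFIG_FILE"] = "lmcache_config.yaml"    # assumed config path

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",                # placeholder model
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",                   # vLLM <-> LMCache connector
        kv_role="kv_both",                                   # both store and load KV cache
    ),
)
print(llm.generate(["Hello!"], SamplingParams(max_tokens=8))[0].outputs[0].text)
```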

How to Try It Out?

Ready to test CacheBlend for yourself? It’s now available as part of the LMCache library. You can follow this link to the demo from a previous post on how to run CacheBlend in LMCache; the guide walks you through the setup process and gives you hands-on experience with the technique.
