Initiated and Officially Supported by Tensormesh

Accelerating OpenClaw Agents with CacheBlend

By Jiayue Chen and LMCache Team

The standard approach to reducing LLM inference costs is prefix caching, which reuses previously computed token states to avoid redundant computation. In practice, however, this approach misses significant caching opportunities in real-world agentic workloads!

Caching in Agentic Workflows

In agentic workloads, shared content (e.g., retrieved contexts and documents) frequently appears across requests at varied positions, rendering prefix caching ineffective. Prefix caching requires an exact match from the beginning of the token sequence, so any change to the prefix breaks cache reuse from that point forward. All subsequent tokens must be recomputed from scratch, even when the vast majority of the content is identical to the previous request. Let’s look at OpenClaw as an example.

OpenClaw and the Limits of Prefix Caching

OpenClaw is an open-source agentic framework that orchestrates multi-step AI workflows, coordinating tool calls, memory retrieval, and reasoning across extended interactions. 

At each turn, the agent assembles a full context window from several sources (system instructions, conversation history, retrieved documents, and tool outputs) and passes that context to the model to determine the next action.

Although many of the retrieved documents and contexts remain constant across turns, their position in the sequence shifts as new content is prepended or interleaved before them. This displacement breaks prefix cache reuse, forcing the model to recompute attention over the full context on every step. In practice, these documents and contexts can easily grow to tens of thousands of tokens, so for long-running workflows this redundant computation dominates inference cost and latency.

Diagram illustrating OPENCLAW Inputs, showing two sections: 'System prompt + Conv. history' labeled as 'Cacheable' and 'Retrieved docs + User question' labeled as 'Not Prefix Cacheable'.

Looking more closely at how OpenClaw structures its prompts, as shown in the figure above, the conversation history is handled well by prefix caching: each new turn appends to a stable base, extending the cached prefix rather than invalidating it. The breakdown happens specifically with retrieved contexts and documents. They can shift position or be dropped and reintroduced across turns, and any such change invalidates the cache for all tokens that follow, even if 99% of the content is identical to the previous turn.

To isolate and demonstrate this problem clearly, our demo uses a simplified prompt structure, placing the user question between the system prompt and the supporting document. This represents the worst-case prefix-breaking scenario and makes the cache hit difference between prefix caching and CacheBlend straightforward to observe directly.

CacheBlend: Non-Prefix KV Cache Reuse

CacheBlend is a technique introduced by the Tensormesh team and contributed to LMCache that enables reuse of precomputed key-value (KV) caches beyond prefix boundaries. Rather than requiring an exact prefix match, CacheBlend identifies and reuses any contiguous block of previously computed tokens regardless of position, then selectively recomputes the KV values of a small subset of tokens to partially update each reused cache block.

Consider the following example:

Turn 1 retrieved context: [Chunk A] + [Chunk B] + [Chunk C]
Turn 2 retrieved context: [Chunk A] + [Chunk C] + [Chunk B]

Under prefix caching, only the shared prefix ([Chunk A]) can be reused in the second turn, resulting in a cache hit rate of roughly 1/3, assuming chunks A, B, and C have equal token counts. The reordering of chunks breaks the prefix structure, preventing reuse of [Chunk B] and [Chunk C]. With CacheBlend, all three chunks are recognized as previously computed and reused, regardless of their new ordering.
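The difference between the two hit rates can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: chunks are modeled as tuples of made-up token ids, matching is done at whole-chunk granularity, and the helper names (`chunk_key`, `prefix_hit_rate`, `chunk_hit_rate`) are hypothetical and not part of LMCache or CacheBlend.

```python
import hashlib

def chunk_key(chunk):
    """Content hash of a chunk, independent of where the chunk appears."""
    return hashlib.sha256(" ".join(map(str, chunk)).encode()).hexdigest()

def prefix_hit_rate(prev_chunks, new_chunks):
    """Fraction of new tokens covered by the longest shared leading run of chunks."""
    reused = 0
    for old, new in zip(prev_chunks, new_chunks):
        if old != new:
            break  # the first mismatch ends prefix reuse
        reused += len(new)
    return reused / sum(len(c) for c in new_chunks)

def chunk_hit_rate(prev_chunks, new_chunks):
    """Fraction of new tokens whose chunk was seen before, at any position."""
    seen = {chunk_key(c) for c in prev_chunks}
    reused = sum(len(c) for c in new_chunks if chunk_key(c) in seen)
    return reused / sum(len(c) for c in new_chunks)

# Three equally sized chunks of illustrative token ids (assumed data).
A = tuple(range(0, 100))
B = tuple(range(100, 200))
C = tuple(range(200, 300))

turn1 = [A, B, C]
turn2 = [A, C, B]

prefix_rate = prefix_hit_rate(turn1, turn2)     # 1/3: only Chunk A matches
non_prefix_rate = chunk_hit_rate(turn1, turn2)  # 1.0: all three chunks are found
print(prefix_rate, non_prefix_rate)
```

Note that a position-independent lookup hit is not the whole story: the reused KV entries still need the selective recomputation described in the rest of this section before they can be trusted.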

A natural question, then, is why we cannot simply reuse all previously computed KV caches by concatenating them in the new order. The challenge lies in the nature of the attention mechanism: each token’s representation depends on its interactions with all preceding tokens. When a chunk is computed in isolation, its KV cache encodes attention patterns based only on its original context. Reordering chunks changes these dependencies, because tokens in later chunks should now attend to different preceding tokens. As a result, simply concatenating precomputed KV caches produces inconsistent attention states and leads to incorrect outputs.
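A toy NumPy experiment makes this dependence concrete. It is a sketch under strong assumptions: one attention head, random weights, no positional encoding, and the same key projection reused for the "next layer" purely for brevity. None of it reflects a real model; it only shows that the hidden states of a chunk, and therefore the keys derived from them in the following layer, change once another chunk precedes it.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden size (assumption)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def causal_attention(x):
    """Single-head causal self-attention over a (tokens, d) array."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    # Mask out future positions so each token sees only itself and earlier tokens.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

chunk_a = rng.standard_normal((4, d))
chunk_b = rng.standard_normal((4, d))

# Hidden states of chunk B computed in isolation vs. with chunk A in front.
b_alone = causal_attention(chunk_b)
b_after_a = causal_attention(np.vstack([chunk_a, chunk_b]))[4:]

# Next-layer keys are projections of those hidden states, so they differ too.
k_alone = b_alone @ Wk
k_after_a = b_after_a @ Wk
max_gap = np.abs(k_alone - k_after_a).max()
print(f"largest key difference for chunk B: {max_gap:.4f}")
```

The gap printed at the end is the kind of mismatch that naive concatenation silently carries into generation.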

CacheBlend bridges the gap between full KV reuse (fast but inaccurate) and full KV recomputation (accurate but slow) through a technique called selective KV recomputation. The key insight is that not all tokens need to be recomputed, only those whose KV representations differ most between naive reuse (e.g., concatenated KV caches) and full recomputation under the correct context. CacheBlend calls these High-KV-Deviation (HKVD) tokens. By recomputing only the HKVD tokens, roughly 10-15% of the total, CacheBlend closely approximates the quality of full recomputation while retaining much of the efficiency of KV reuse.
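The selection step can be illustrated with a small NumPy sketch. It is not the LMCache implementation: the function names (`select_hkvd`, `blend`), the L2-norm deviation measure, and the synthetic data are all assumptions made for illustration. In particular, the sketch takes a fully recomputed reference as given, whereas a real system has to estimate the deviation without paying for full recomputation; how that is done is outside the scope of this example.

```python
import numpy as np

def select_hkvd(reused_kv, fresh_kv, ratio=0.15):
    """Indices of the tokens whose reused KV deviates most from the reference."""
    deviation = np.linalg.norm(reused_kv - fresh_kv, axis=-1)  # one value per token
    k = max(1, int(round(ratio * len(deviation))))
    return np.sort(np.argsort(deviation)[-k:])

def blend(reused_kv, fresh_kv, hkvd_idx):
    """Keep the reused KV everywhere except the selected tokens."""
    blended = reused_kv.copy()
    blended[hkvd_idx] = fresh_kv[hkvd_idx]
    return blended

# Synthetic data (assumption): most tokens deviate slightly, a few deviate a lot.
rng = np.random.default_rng(0)
n_tokens, d = 100, 16
fresh_kv = rng.standard_normal((n_tokens, d))
reused_kv = fresh_kv + 0.01 * rng.standard_normal((n_tokens, d))
hot = np.array([3, 40, 77])
reused_kv[hot] += 2.0

idx = select_hkvd(reused_kv, fresh_kv, ratio=0.15)
blended = blend(reused_kv, fresh_kv, idx)

err_before = np.abs(reused_kv - fresh_kv).max()
err_after = np.abs(blended - fresh_kv).max()
print(len(idx), err_before > err_after)
```

With these synthetic inputs, the three strongly deviating tokens fall inside the selected 15%, so replacing only that subset removes the largest errors while the remaining 85% of the cache is reused untouched.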

Demo and Performance Results

We evaluate CacheBlend on OpenClaw questions from the MTRAG benchmark Cloud corpus with two turns. Each prompt is assembled as: system prompt + user query + retrieved supporting document, with the user query deliberately placed between the system prompt and the document to create a realistic prefix-breaking scenario. Orange blocks indicate cached content and green blocks indicate recomputed content.

The impact shows up directly in two key metrics: Time to First Token (TTFT), how long a user waits before receiving the first token of the response, and Cache Hit Rate, the percentage of prompt tokens reused from cache. In agentic workloads, both metrics degrade progressively as context grows and prompt structure diverges, even when the majority of content is semantically unchanged across turns.

Prefix Caching (Baseline)

On the second turn (Q2: “Can you tell me more about WebSocket API?”), only the system prompt forms a reusable prefix. The supporting document, though identical across both turns, must be fully recomputed because the user query breaks prefix alignment ahead of it. This yields a modest TTFT improvement from 5.553s to 4.055s, with a cache hit rate of only 48%, meaning more than half of the prompt tokens are still being recomputed unnecessarily.

Non-Prefix Caching with CacheBlend

Diagram illustrating CacheBlend system with two questions about parameters and API, highlighting time to first token and cache hit rate improvements.

Both the system prompt and the supporting document are served from cache, regardless of the query positioned between them. This results in a TTFT of 2.325s and a cache hit rate of 98%. Compared to the prefix caching baseline (4.055s and 48%), that is a reduction in TTFT of more than 42% and a roughly 2x improvement in cache hit rate.
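The relative improvements follow directly from the measurements quoted in this section. The short calculation below reproduces them; the variable names are illustrative, and the 5.553s value is taken as the starting TTFT quoted for the baseline comparison.

```python
# Measurements quoted in the text.
ttft_start = 5.553   # seconds, starting point quoted for the prefix caching baseline
ttft_prefix = 4.055  # seconds, second turn with prefix caching
ttft_blend = 2.325   # seconds, second turn with CacheBlend
hit_prefix = 0.48    # cache hit rate with prefix caching
hit_blend = 0.98     # cache hit rate with CacheBlend

def reduction(before, after):
    """Relative reduction, e.g. 0.25 means 25% lower."""
    return (before - after) / before

prefix_gain = reduction(ttft_start, ttft_prefix)  # about 0.27
blend_gain = reduction(ttft_prefix, ttft_blend)   # about 0.427
hit_ratio = hit_blend / hit_prefix                # about 2.04

print(f"prefix caching TTFT reduction: {prefix_gain:.1%}")
print(f"CacheBlend TTFT reduction vs. prefix caching: {blend_gain:.1%}")
print(f"cache hit rate ratio: {hit_ratio:.2f}x")
```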
