
Stop Calling It KV Cache: It’s Something Much Bigger

By Junchen Jiang

For years, we have referred to one of the most critical components of modern LLM inference as a “KV cache.” That name made sense once. Today, it is increasingly misleading.

What began as a small, ephemeral optimization inside a single inference pass has quietly evolved into something far more important: a first-class data object with its own lifecycle, storage stack, and economic value.

Here is how that happened, and why the name needs to change.

The Origin: A True Cache

The KV cache originates in the Transformer architecture. During inference, each token's Key (K) and Value (V) projections are computed once and reused to compute attention for every subsequent token. This reuse avoids redundant computation and dramatically speeds up decoding. That is why we called it a cache.
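
As a minimal sketch of that mechanism, here is single-head attention with a KV cache in PyTorch. Shapes and names are illustrative only; this is not any engine's actual implementation.

```python
import torch

d = 64  # per-head hidden dimension (illustrative)

def decode_step(x_t, W_q, W_k, W_v, k_cache, v_cache):
    """One decoding step: compute attention for the newest token only,
    reusing the cached K/V of every earlier token."""
    q = x_t @ W_q                                  # (1, d) query for the new token
    k_cache = torch.cat([k_cache, x_t @ W_k])      # append new key:   (t, d)
    v_cache = torch.cat([v_cache, x_t @ W_v])      # append new value: (t, d)
    scores = (q @ k_cache.T) / d ** 0.5            # attend over all t tokens
    out = torch.softmax(scores, dim=-1) @ v_cache  # (1, d) attention output
    return out, k_cache, v_cache
```

Without the cache, every step would recompute the K and V projections for the entire prefix; with it, each step does only the new token's work.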

In its original form, it was temporary, confined to a single request, stored entirely in GPU memory, and discarded the moment a query completed. It behaved exactly like a traditional cache.

For a long time, that description held. Even though KV cache was theoretically reusable across requests, in practice GPU memory was too limited and too expensive to justify persistence. Cross-request reuse was rare, and the system treated KV cache as disposable.

2025: The Turning Point

Everything changed when KV cache began leaving GPU memory.

The trigger was a shift in how LLMs are used. Multi-turn conversations, long-context document analysis, and agent workflows all require reusing prefixes across sessions. Suddenly, ephemeral storage was not enough.

Systems began treating KV cache less like a temporary artifact and more like reusable data that needed to be stored persistently and retrieved globally. The storage stack expanded rapidly: offloading from GPU to CPU memory, spilling from CPU memory pools to SSD, and tiering from SSD out to remote storage services like S3. Commercial storage vendors including Weka, VAST, and DDN began offering dedicated storage systems for KV cache.
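
A hedged sketch of what such a tiered store looks like, with hypothetical class and method names (this is not LMCache's or any vendor's actual API):

```python
import torch

class TieredKVStore:
    """Toy two-tier KV cache store: GPU-resident entries spill to CPU."""

    def __init__(self, gpu_capacity):
        self.gpu, self.cpu = {}, {}   # prefix_hash -> KV tensor
        self.gpu_capacity = gpu_capacity

    def put(self, prefix_hash, kv):
        if len(self.gpu) >= self.gpu_capacity:
            # Evict the oldest GPU entry to CPU memory (FIFO for brevity;
            # real systems use LRU or cost-aware eviction policies).
            victim, tensor = next(iter(self.gpu.items()))
            self.cpu[victim] = tensor.to("cpu")
            del self.gpu[victim]
        self.gpu[prefix_hash] = kv

    def get(self, prefix_hash):
        if prefix_hash in self.gpu:
            return self.gpu[prefix_hash]
        if prefix_hash in self.cpu:
            return self.cpu[prefix_hash].to("cuda")  # promote on hit
        return None  # miss: recompute via prefill, or fetch from a remote tier
```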

This was the moment KV cache stopped being “just a cache.”

2026: A First-Class Data Object

Fast forward to today. KV cache is no longer an optimization at the margins of inference. It is the core infrastructure.

NVIDIA introduced ICMS (now renamed CMX) to formalize inference context storage. KV cache now represents the majority of the inference storage footprint. DRAM and SSD costs are becoming a significant line item in AI system budgets.

We are witnessing the emergence of a new category: inference-time data as a persistent, valuable asset.

At NVIDIA GTC last month, I joined NVIDIA and Tencent in the first-ever industry tutorial dedicated to KV cache technology (slides). What was striking was the framing: KV cache was not discussed as a feature of an inference engine. It was discussed as a standalone data object, one that is beginning to form its own ecosystem and generate value in its own right.

The Deeper Shift: KV Cache Has Semantics

Here is the part that changes everything.

KV cache is not just reusable. It is semantically meaningful. It represents the model’s internal understanding of context: a compressed encoding of everything the model has processed, directly usable by the model, and opaque to humans.

Put another way, KV cache is the native memory format of a Transformer. And once you see it that way, the old framing falls apart entirely.

From Immutable Cache to Mutable Memory

Traditional caches are opaque, immutable, and replaceable. KV cache no longer fits any of those descriptions.

Recent research shows it can be compressed without meaningful accuracy loss, approximated while preserving performance, and even modified to improve outputs. Projects like Cartridges (Stanford), LLMSteer (UChicago), and PASTA (Georgia Tech) have demonstrated that editing KV cache directly steers model behavior.
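
Staying with the compression claim for a moment, here is a toy per-tensor int8 quantizer. It is only an illustration of the idea; none of the cited projects uses this exact scheme.

```python
import torch

def quantize_kv(kv):
    """Symmetric per-tensor int8 quantization: roughly 2x smaller than fp16."""
    scale = kv.abs().amax().clamp(min=1e-8) / 127.0
    q = (kv / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize_kv(q, scale):
    """Reconstruct an approximate fp16 KV block for use at attention time."""
    return q.to(torch.float16) * scale
```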

That is not caching. That is memory manipulation.

A New Mental Model

KV cache already departs from traditional caching in three fundamental ways.

First, it is persistent and multi-tier. It is stored across GPU, CPU, SSD, and cloud, survives beyond a single request, and is managed more like a data pipeline than a local buffer (sketched below).

Second, it is not a black box. It can be compressed, quantized, and transformed. Its internal structure is exposed and exploitable by the systems that manage it.

Third, it is semantically mutable. It can be edited to improve model outcomes. It encodes meaning, not just computation.
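
As a minimal illustration of the first point, here is a sketch that treats a KV cache as a persistent object, keyed by a hash of the token prefix it encodes. The storage path and helper names are hypothetical.

```python
import hashlib
import torch

def prefix_key(token_ids):
    """Content-address a KV cache by the token prefix it encodes."""
    return hashlib.sha256(repr(token_ids).encode()).hexdigest()

def save_kv(token_ids, kv, root="/var/kvstore"):
    torch.save(kv, f"{root}/{prefix_key(token_ids)}.pt")

def load_kv(token_ids, root="/var/kvstore"):
    try:
        return torch.load(f"{root}/{prefix_key(token_ids)}.pt")
    except FileNotFoundError:
        return None  # cache miss: recompute the prefix via prefill
```

Keying by prefix is what lets a later request, even from a different session, pick up exactly where an earlier one left off.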

Given all of this, the term “KV cache” is actively misleading. It implies ephemerality, simplicity, and a lack of intrinsic value. None of those things are true anymore.

Figure: Comparison between a classic cache and KV cache, highlighting key features such as persistence, sharing, and semantic richness.

What should we call it instead? If this is the native memory of an LLM, it deserves a name that reflects that. Some candidates: AI-Native Memory, Model-Native Memory, Inference State Object, Semantic Context Memory. Or perhaps something entirely new. Let me know in the comments.

A Final Thought

We are watching the birth of a new abstraction layer in AI systems: a form of machine-native memory that is persistent, valuable, and manipulable.

Calling it a cache undersells its importance. And as the ecosystem around it grows, naming will shape how we design, optimize, and reason about these systems.

It is time to retire the old name.

