
LMCache: A Journey

By Junchen Jiang

GTC wrapped up a month ago. 

Our open-source KV cache management library, LMCache, was shown in Jensen Huang’s keynote and spotlighted by NVIDIA SVP Kevin Deierling; I was invited to speak at the first-ever industry KV cache tutorial; and the project was mentioned in talk after talk by industry partners. At one point, someone even came to our booth just to take a photo with Yihua Cheng, one of LMCache’s core maintainers.

It all felt, frankly, surreal. 

The whole team is energized and eager to keep pushing. Recognition is a hell of a motivator. But I can’t help looking back at the journey that brought us here: a journey of joyful surprises and painful lessons.

To keep it readable, I’ll pick three moments: three years ago, two years ago, and one year ago.

[Figure: a timeline of the LMCache journey, 2023–2026: KV cache research at the University of Chicago, the open-source launch of LMCache, broader industry adoption in 2025, and going mainstream in 2026, overlaid on a line chart of GitHub star growth.]

Three years ago (early 2023): Pivot

In January 2023, barely one month after the “ChatGPT moment,” I was one of the (un)fortunate faculty members confronted with a question from students: “Since the whole world is talking about ChatGPT, can I switch my research to it?” First came Yuhan Liu, then a junior PhD student still wrapping up her first project. Then others joined, all with little background in LLMs but every bit of passion to jump on the bandwagon.

My instinct was skepticism. Another hype cycle, I thought. Why chase a single app? LLM inference wasn’t even considered interesting in systems research; only training was. And I had a personal reason: I knew squat about Transformers! The limited ML I understood was CNNs, and even that I’d learned mostly from my PhD student Kuntai Du. Was I really going to start over? What if LLMs faded fast? And if we abandoned our current work, how long before we could publish again at top venues?

But those concerns didn’t last. 

I was too weak to resist the nudging. I gathered the “GPT enthusiasts,” planning to talk them down, and ended up being convinced myself. As we placed GPT in a historical context, something clicked. This wasn’t hype. The scale was real, the public fascination unprecedented; none of it resembled past AI applications. If LLMs ran everywhere, inference would matter more than training. And if models were generating and consuming data autonomously, it would herald a new kind of internet for AI, with ChatGPT as its first user. (We were talking about agents before we had the word for it.)

The excitement became undeniable. 

In the end, my students persuaded me. While others tried to steer their students away from the GPT wave, we embraced it as a paradigm shift, believing that everything in today’s internet would need to be reinvented for this new one, even if that new internet only existed in our imaginations.

Where to start? 

Credit goes to Yuhan: she proposed focusing on KV cache. We had all worked on video encoding and streaming, and a KV cache is also a 3D tensor. Maybe it could be compressed and streamed the same way? That intuition led to Yuhan’s CacheGen paper and later Jiayi Yao’s CacheBlend. We were among the first to think about KV cache from a systems perspective, and likely the first from a networking one.
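For the curious, the intuition is easy to sketch. Here is a toy illustration, emphatically not CacheGen’s actual pipeline (the shapes and the delta-plus-quantize scheme are my own stand-ins), treating one layer’s key cache like a short video clip:

```python
import torch

# One layer's key (or value) cache is a 3D tensor,
# heads x tokens x head_dim, like a stack of video frames
# indexed by token position. Shapes here are made up.
num_heads, seq_len, head_dim = 32, 2048, 128
k = torch.randn(num_heads, seq_len, head_dim, dtype=torch.float16)

# Neighboring "frames" (tokens) tend to be similar, so delta-encode
# along the token axis, then quantize coarsely: the same two moves
# a video encoder makes before entropy coding.
delta = torch.cat([k[:, :1], k[:, 1:] - k[:, :-1]], dim=1)
scale = delta.abs().amax().float() / 127
q = (delta.float() / scale).round().clamp(-127, 127).to(torch.int8)

# int8 deltas already halve the fp16 footprint, before entropy coding.
print(k.numel() * k.element_size(), "->", q.numel() * q.element_size())
```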

Looking back, it’s almost embarrassing how little we knew. Yuhan had to explain to me and her co-advisor Shan Lu what KV cache even was. At one point, we thought caching and reusing KV cache was a breakthrough idea, only to discover it was obvious to anyone who actually understood Transformers, and already standard practice in HuggingFace.

Two years ago (early 2024): Open source

Researchers always overestimate how much others care about their work. 

A year into KV cache work, we had successfully managed to publish zero papers on the subject. The bigger challenge wasn’t the learning curve; it was convincing others that the problem was real. CacheGen and CacheBlend were supposed to run in an infrastructure that managed and optimized KV cache distribution, not just in GPU memory but across the whole system. To me, such an infrastructure should obviously exist: reusing KV cache saves GPU cycles, but first you have to store it somewhere outside the GPU and load it back, and that loading latency was the bottleneck we were addressing.

But my reasoning collapsed when tested in the real world. After giving talks across the industry, I realized nothing was moving KV cache outside GPUs in the first place, so there was no loading latency to optimize. Back then, everyone was scaling models up, so GPU memory was expected to scale with them and, the thinking went, would be enough to store all reusable KV caches. If anyone had done the math, they’d have seen that GPU memory would never be enough, but it wasn’t a priority for anyone I spoke to.
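The math is short. A rough sketch at fp16, using Llama-2-7B’s published dimensions (the bandwidth figure is an assumed PCIe-class number, not a measurement of any deployment):

```python
# Back-of-envelope KV cache sizing at fp16, Llama-2-7B dimensions.
layers, hidden_dim = 32, 4096
bytes_fp16 = 2

kv_per_token = 2 * layers * hidden_dim * bytes_fp16  # K and V, all layers
print(kv_per_token)                # 524288 bytes: ~0.5 MB per token

one_context = 32_768 * kv_per_token                  # a 32K-token context
print(one_context / 2**30)         # 16.0 GiB for a single long context

# Weights alone take ~14 GB at fp16, so an 80 GB GPU fits only a handful
# of such contexts; caching KV for many users quickly outgrows GPU memory.
# Loading one context back over ~32 GB/s (PCIe 4.0 x16) then costs:
print(one_context / (32 * 2**30))  # ~0.5 s of pure transfer time
```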

Faced with lukewarm reception, most researchers either move on or keep publishing and hope the industry catches up. I didn’t want to go either way.

If the system didn’t exist, why not build it ourselves and then convince the industry to use it?

That decision led to LMCache, and eventually, Tensormesh.

In April 2024, I made my first move, starting with Yihua. “The industry isn’t ready for what we do,” I told him. “We arrived too early, so let’s start the party ourselves.” His research hadn’t been about KV cache, and he’d been bouncing between ill-fated projects, mostly due to my missteps. Still, I knew he should lead the open-source effort; no one else had both the engineering depth and the communication skills to pull it off.

As it turned out, not only was he on board; the role was a perfect fit. Yihua thrives with full visibility across a codebase, and he has a rare ability to communicate engineering ideas with clarity and enthusiasm. He became the architect and the natural leader of the developer community. Oh, and he coined the name LMCache.

That decision to build LMCache wasn’t just about open source or enabling research. It exposed us to a new world: not what the industry might want tomorrow, but what it wants today, and what it takes to get people to actually use the code.

We were new to open source, but thankfully, Ion Stoica introduced us to the vLLM project, and Kuntai spent a whole summer there, learning not just how inference engines work, but how open-source communities do. It’s hard to overstate how far ahead Berkeley’s SkyLab is in churning out successful open-source projects.

LMCache had a humble start. Unlike most well-known open-source projects, we had no backing from a big company. We had to figure everything out: managing GitHub issues and PRs, promoting on social media, running community meetings. I still remember texting friends the day before our first weekly community meeting, begging them to show up because I was afraid no one would come. There were many embarrassing moments like that.

One year ago (early 2025): First GTC

Today, NVIDIA GTC is arguably the most-attended AI event in Silicon Valley. So it’s embarrassing to admit, but we first heard about GTC only a few weeks before the 2025 edition. When collaborators from Pliops offered us a couple of free tickets to “this event called GTC,” our reaction was: GTC? What’s that?

By early 2025, we had incorporated Tensormesh, the first company focused on KV caching; LMCache had barely over 500 stars; and we had just open-sourced the vLLM Production Stack, an attempt to fill another gap: no Kubernetes framework existed that could deploy inference engines and KV cache storage together.

We were frustrated that the world wasn’t paying more attention. What we didn’t realize was how much was about to change.

The first surprise at GTC 2025 was Dynamo. When Jensen Huang announced it, the whole world paid attention. It offered nearly every feature our projects had, including a KV cache manager called KVBM. In a few hours, it attracted more attention than LMCache had in almost a year.

Everyone who knew us asked if our projects were about to be replaced.

We didn’t panic. Instead, we immediately tested its performance and found we still had an edge, but for how long? We needed to differentiate, partner, or both. That question occupied most of the next year.

The second surprise was Kevin Deierling. Through Vikram Sharma Mailthody, Dynamo’s project lead, I learned that a group Kevin led at NVIDIA had been quietly following our research and open-source work. So I attended his talk, unsure what to say. I had never spoken to an SVP before, and barely understood what the title meant.

When he walked by, I stopped him with the most awkward opener: “Kevin? I’m Junchen from UChicago.”

He lit up: “Oh yes, from LMCache. We should talk about CacheBlend.” 

I was stunned! CacheBlend was still an obscure arXiv paper at the time. Kevin had seen the importance of KV cache storage and optimization before almost anyone else in the industry. A year later, that importance is widely recognized, and NVIDIA was the first to demonstrate it publicly.

The third surprise was Samuel Shen, then a UChicago senior and later our first full-time engineer at Tensormesh. Sam had been trying to join my PhD group, not yet knowing I was on leave. We met in person at GTC while he was in the Bay Area for spring break. He later told me it was the most surprising conversation of his life. I had already decided he’d be a great fit for the startup, but I had to pitch him on it. I don’t remember exactly what I said, but it probably involved arguing that a PhD was overrated and that joining us would be the most exciting thing he could do. Well, words I’d be embarrassed to say while wearing my faculty hat. Sam became one of the most valued members of the team and a highly regarded expert in the industry.

Today

Time moves fast.

  • Three years ago, we pivoted to KV cache as a research direction.
  • Two years ago, we were just learning what open source meant. 
  • One year ago, we showed up to GTC having never heard of it, and left with a clearer sense of how hard the road ahead would be.
  • This year, our project appeared in Jensen’s keynote.

I’ve learned more in the last two years than in the previous twenty.

Friends often ask how the industry has changed my life. Honestly, I’m not sure it has. Every true researcher and entrepreneur is driven by the same instinct: the feeling of being the first on a new frontier. It’s what drove NASA engineers to put humans on the moon, PARC scientists to invent the PC and Ethernet, and Steve Jobs to reimagine the phone.

My PhD advisor, Hui Zhang, used to call it being the “lead dog”: whether you fall into a trap or find a bone, you’re always the first to experience it.

That feeling hasn’t changed. Everything else has.

(There are many people I owe thanks to along this journey. That moment will come.)
