Modern LLM serving workloads are defined by strict latency requirements, high concurrency, and rapidly growing context lengths. Applications such as multi-turn chat, AI agents, and retrieval-augmented generation continuously build on prior context, leading to substantial reuse of previously computed states. In production, systems must minimize time-to-first-token (TTFT) while maintaining stable decoding throughput under heavy concurrent…

The standard approach to reducing LLM inference costs is prefix caching, which reuses previously computed token states to avoid redundant computation. In practice, however, this approach misses significant caching opportunities in real-world agentic workloads!

Caching in Agentic Workflows

In agentic workloads, shared content (e.g., retrieved contexts and documents) frequently appears across requests at varied positions,…
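
To make the limitation concrete, here is a minimal sketch (purely illustrative, not LMCache's implementation) of a block-hash prefix cache: every block is keyed on the hash of the entire prefix up to it, so the same retrieved document appearing after two different prompts produces disjoint keys and no reuse.

```python
# Illustrative only: a toy block-hash prefix cache, not LMCache's implementation.
# Each block's key covers the *entire* token prefix, so identical content at a
# different offset can never be reused.
import hashlib

BLOCK = 4  # tiny block size, just for the example

def block_keys(tokens):
    """Return one cache key per block; each key hashes the full prefix so far."""
    keys, prefix = [], b""
    for i in range(0, len(tokens), BLOCK):
        prefix += " ".join(map(str, tokens[i:i + BLOCK])).encode()
        keys.append(hashlib.sha256(prefix).hexdigest())
    return keys

doc = [101, 102, 103, 104, 105, 106, 107, 108]  # shared retrieved document
req_a = [1, 2, 3, 4] + doc                      # document appears after prompt A
req_b = [9, 8, 7, 6] + doc                      # same document after prompt B

shared = set(block_keys(req_a)) & set(block_keys(req_b))
print(f"blocks reused by strict prefix caching: {len(shared)}")  # -> 0
```

Recovering this kind of missed overlap requires matching shared content beyond strict prefixes, which is exactly the gap the post highlights.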

We have some exciting news to share: NVIDIA Dynamo has officially hit v1.0, and we couldn’t be more thrilled. This is a huge milestone for the LLM inference ecosystem, and for us at LMCache it’s a moment worth celebrating.

What Is NVIDIA Dynamo, and Why Does It Matter?

If you haven’t been following Dynamo’s journey,…

Over the last few months, Claude Code has quietly become one of the most interesting and widely adopted real-world agentic systems available to normal developers. It is unlike cloud-only agents such as Perplexity, Devin, or Manus, whose internals remain hidden behind API gateways, and unlike fully open-source agents such as Mini SWE Agent or Terminus 2, where you can…

Announcing Tensormesh

First, I want to repeat here what I posted on the LMCache #general Slack channel last week: I am delighted to announce that the team that founded the LMCache project decided a few months ago to form a company, Tensormesh. As we are announcing the beta of our first product, we have…

A flexible plugin system for enhanced observability and management

Abstract

In large-scale language model inference scenarios, efficient memory management and KV cache optimization are crucial. LMCache, a KV cache management system designed specifically for vLLM, needs more flexible extension mechanisms to meet the needs of monitoring, troubleshooting, and state inspection when facing complex production…
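
As a rough illustration of the kind of extension point the post motivates, the sketch below shows an observer-style plugin hook; the class and method names are assumptions for illustration, not LMCache's actual plugin API.

```python
# Hypothetical observer-style plugin hook; names are illustrative assumptions,
# not LMCache's actual plugin API.
from abc import ABC, abstractmethod
from typing import List

class CachePlugin(ABC):
    """A plugin receives callbacks on cache events for monitoring and insight."""

    @abstractmethod
    def on_store(self, key: str, num_tokens: int) -> None: ...

    @abstractmethod
    def on_lookup(self, key: str, hit: bool) -> None: ...

class MetricsPlugin(CachePlugin):
    """Counts hits, misses, and stored tokens for an operator to scrape."""

    def __init__(self) -> None:
        self.hits = self.misses = self.stored_tokens = 0

    def on_store(self, key: str, num_tokens: int) -> None:
        self.stored_tokens += num_tokens

    def on_lookup(self, key: str, hit: bool) -> None:
        if hit:
            self.hits += 1
        else:
            self.misses += 1

class PluginManager:
    """The cache engine calls notify_* so every registered plugin observes events."""

    def __init__(self) -> None:
        self._plugins: List[CachePlugin] = []

    def register(self, plugin: CachePlugin) -> None:
        self._plugins.append(plugin)

    def notify_store(self, key: str, num_tokens: int) -> None:
        for p in self._plugins:
            p.on_store(key, num_tokens)

    def notify_lookup(self, key: str, hit: bool) -> None:
        for p in self._plugins:
            p.on_lookup(key, hit)
```

A manager like this keeps observability concerns out of the cache engine itself, which is the flexibility the abstract argues for.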

We’re thrilled to announce that Nvidia Dynamo has integrated LMCache as a KV caching layer solution. This is a big milestone: Dynamo gets a battle-tested caching solution, and LMCache becomes part of a data center-scale inference platform used by many developers worldwide to deploy AI at scale. For comprehensive details about Dynamo’s KV cache optimization…

In large language model inference scenarios, the performance and flexibility of the KV cache system directly impact overall serving efficiency. LMCache, a high-performance caching framework for large models, gives developers rich extension capabilities through its modular backend design. This article starts from the LMCache backend extension mechanism, using the officially provided lmc_external_log_backend as an example,…
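
The sketch below shows the general shape of such a backend extension: a toy backend that logs every operation. The interface here is an assumption for illustration; the actual contract is defined by LMCache's storage backend code and the official lmc_external_log_backend example.

```python
# Toy logging backend sketch; the method names below are illustrative assumptions,
# not the exact interface required by LMCache. Consult the official
# lmc_external_log_backend example for the real contract.
import logging
from typing import Dict, Optional

logger = logging.getLogger("external_log_backend")

class InMemoryLoggingBackend:
    """Stores KV cache blobs in a dict and logs every put/get."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        logger.info("put %s (%d bytes)", key, len(value))
        self._store[key] = value

    def contains(self, key: str) -> bool:
        return key in self._store

    def get(self, key: str) -> Optional[bytes]:
        value = self._store.get(key)
        logger.info("get %s -> %s", key, "hit" if value is not None else "miss")
        return value
```

Because backends are swappable behind the same engine, an extension like this can be developed and tested in isolation before being wired into a deployment.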

We’re thrilled to share that LMCache has officially crossed 5,000 GitHub stars! 🚀 This milestone is not just a number — it’s a strong signal that KV cache technology has become a first-class citizen in the LLM inference stack, and that our community is leading the way. What is LMCache? LMCache is the first open-source…

TL;DR: LLMs are transforming every product and service—from chatbots and copilots to intelligent document search and enterprise workflows. But running LLMs in production is still painfully slow, prohibitively expensive, and complex to manage. That changes today. We’re excited to announce the launch of LMIgnite — the first one-click deployable high-performance LLM inference backend for Conversational…
