LMCache Blog

Initiated and Officially Supported by Tensormesh

Author: LMCache Team

When Open Source Meets Open Source: A Joint Effort Between LMCache and Mooncake

May 26, 2026

Behind the Build, lmcache

A collaboration story about LMCache multiprocess mode + MooncakeStore: from 0 to 1, from functional to optimized.
1. Before We Begin
Recently, the LMCache community and the Mooncake community carried out a series of valuable open-source collaborations around the Mooncake Store L2 adapter for LMCache MP (multiprocess) mode. The main contributors include: This was…

Benchmarking LMCache for Multi-Turn Agentic Workloads on AMD MI300X

May 12, 2026

AMD, Benchmark, lmcache, Performance

A practitioner’s guide to KV-cache tiering on ROCm: what works, what doesn’t, and the regime where it actually matters.
Key Summary
We benchmarked multi-turn agentic workloads using 739 anonymized Claude Code conversation traces from kv-cache-tester against MiniMax-M2.5 (230 GB FP8 MoE) on 2× AMD MI300X with vLLM 0.19.0 + LMCache (built from source for…

LMCache’s New Architecture Boosts MoE Inference Performance by 10×

April 3, 2026

Benchmark, lmcache, New features, Performance, Tutorial

lmcache, vLLM

Modern LLM serving workloads are defined by strict latency requirements, high concurrency, and rapidly growing context lengths. Applications such as multi-turn chat, AI agents, and retrieval-augmented generation continuously build on prior context, leading to substantial reuse of previously computed states. In production, systems must minimize time-to-first-token (TTFT) while maintaining stable decoding throughput under heavy concurrent…

Accelerating OpenClaw Agents with CacheBlend

April 1, 2026

Agent, lmcache, Performance

cacheblend, Demo, KVCache, lmcache, OpenClaw, RAG

The standard approach to reducing LLM inference costs is prefix caching, which reuses previously computed token states to avoid redundant computation. In practice, however, this approach misses significant caching opportunities in real-world agentic workloads!
Caching in Agentic Workflows
In agentic workloads, shared content (e.g., retrieved contexts and documents) frequently appears across requests at varied positions,…

LMCache + NVIDIA Dynamo 1.0: A Match Made in Inference Heaven 🚀

March 16, 2026

News, NVIDIA

dynamo, lmcache

We have some exciting news to share: NVIDIA Dynamo has officially hit v1.0, and we couldn’t be more thrilled. This is a huge milestone for the LLM inference ecosystem, and for us at LMCache it’s a moment worth celebrating.
What Is NVIDIA Dynamo, and Why Does It Matter?
If you haven’t been following Dynamo’s journey,…

NVIDIA Dynamo integrates LMCache, Accelerating LLM Inference

September 18, 2025

News

dynamo, lmcache, nvidia, vLLM

We’re thrilled to announce that NVIDIA Dynamo has integrated LMCache as its KV caching layer. This is a big milestone: Dynamo gets a battle-tested caching solution, and LMCache becomes part of a data center-scale inference platform used by many developers worldwide to deploy AI at scale. For comprehensive details about Dynamo’s KV cache optimization…

Nvidia Dynamo + LMCache: Accelerating the Future of LLM Inference

September 7, 2025

Best practices, Performance

collaboration, distributed-inference, dynamo, nvidia, performance

We’re thrilled to announce that the NVIDIA Dynamo project has integrated LMCache as its KV caching layer. This is a big milestone: Dynamo gets a battle-tested caching solution, and LMCache becomes part of a production-scale ecosystem used by many developers worldwide.
Why KV Caching Matters
KV caching is a foundational optimization for modern LLM…

🎉 LMCache Hits 5,000+ GitHub Stars — Thank You, Community!

August 28, 2025

News

community, github, lmcache, milestone, stars

We’re thrilled to share that LMCache has officially crossed 5,000 GitHub stars! 🚀 This milestone is not just a number: it’s a strong signal that KV cache technology has become a first-class citizen in the LLM inference stack, and that our community is leading the way.
What is LMCache?
LMCache is the first open-source…

LMCache Extends Its Turbo-Boost to Multimodal Models in vLLM V1

July 3, 2025

New features, News

kv cache, lmcache, mm_hash, multimodal

TL;DR: The latest LMCache release plugs seamlessly into vLLM’s new multimodal stack. By hashing image-side tokens (mm_hashes) and caching their key-value (KV) pairs, LMCache reuses vision embeddings across requests, slashing time-to-first-token and GPU memory use for visual LLMs.
Summary: Why This Matters
Multimodal large language models (MLLMs) multiply KV-cache traffic: every image can add thousands of “vision…

LLM Production Stack Goes Cross-Hardware: Ascend, Arm, and AMD Support Incoming

June 20, 2025

New features, News

AMD, Arm, Ascend, CUDA, kernel, lmcache, production stack, pytorch, TPU

TL;DR: Our LLM Production Stack project just hit another milestone. We’re integrating with more hardware accelerators, including Ascend, Arm, and AMD, signaling growing maturity and broader applicability across enterprise and research settings.
🚀 LMCache Is Gaining Traction
LMCache has quietly become the unsung hero in the LLM inference world. As a core component…
