lmcache Archives | LMCache Blog

A New Chapter for LMCache and the KV Cache Community

June 2, 2026

Behind the Build, lmcache, News

TL;DR: A key contributor to the LMCache community just secured a major investment. This will greatly accelerate our mission of building the best KV cache library for every developer. Come join us in building the future AI-native data layer!
An Independent Layer for the KV Cache
KV cache is no longer just a byproduct of…

When Open Source Meets Open Source: A Joint Effort Between LMCache and Mooncake

May 26, 2026

Behind the Build, lmcache

A collaboration story about LMCache multiprocess mode + MooncakeStore: from 0 to 1, from functional to optimized.
1. Before We Begin
Recently, the LMCache community and the Mooncake community carried out a series of valuable open-source collaborations around the Mooncake Store L2 adapter for LMCache MP (multiprocess) mode. The main contributors include: This was…

OpenAI API Is the New IPv4

May 20, 2026

lmcache, Tech Explained

A new system stack is quietly taking shape around LLM serving. What makes it interesting is not just how quickly it is evolving, but how familiar the shape of that evolution looks if you’ve spent time studying large-scale systems like the internet or web infrastructure. These systems were never cleanly designed from first principles; they…

Benchmarking LMCache for Multi-Turn Agentic Workloads on AMD MI300X

May 12, 2026

AMD, Benchmark, lmcache, Performance

A practitioner’s guide to KV-cache tiering on ROCm: what works, what doesn’t, and the regime where it actually matters.
Key Summary
We benchmarked multi-turn agentic workloads using 739 anonymized Claude Code conversation traces from kv-cache-tester against MiniMax-M2.5 (230 GB FP8 MoE) on 2× AMD MI300X with vLLM 0.19.0 + LMCache (built from source for…

Stop Calling It KV Cache: It’s Something Much Bigger

April 28, 2026

lmcache, NVIDIA

For years, we have referred to one of the most critical components of modern LLM inference as a “KV cache.” That name made sense once. Today, it is increasingly misleading. What began as a small, ephemeral optimization inside a single inference pass has quietly evolved into something far more important: a first-class data object with…

LMCache’s New Architecture Boosts MoE Inference Performance by 10×

April 3, 2026

Benchmark, lmcache, New features, Performance, Tutorial

lmcache, vLLM

Modern LLM serving workloads are defined by strict latency requirements, high concurrency, and rapidly growing context lengths. Applications such as multi-turn chat, AI agents, and retrieval-augmented generation continuously build on prior context, leading to substantial reuse of previously computed states. In production, systems must minimize time-to-first-token (TTFT) while maintaining stable decoding throughput under heavy concurrent…

Accelerating OpenClaw Agents with CacheBlend

April 1, 2026

Agent, lmcache, Performance

cacheblend, Demo, KVCache, lmcache, OpenClaw, RAG

The standard approach to reducing LLM inference costs is prefix caching, which reuses previously computed token states to avoid redundant computation. In practice, however, this approach misses significant caching opportunities in real-world agentic workloads!
Caching in Agentic Workflows
In agentic workloads, shared content (e.g., retrieved contexts and documents) frequently appears across requests at varied positions,…

LMCache Multi-node P2P CPU Memory Sharing & Control: From Experimental Feature to Production

January 21, 2026

lmcache, New features, Performance

Baolong Mao (Tencent), Chunxiao Zheng (Tencent), Weishu Deng (Tensormesh), Darren Peng (Tensormesh), Samuel Shen (Tensormesh)
What is P2P and what does it promise?
In this blog post, we will go over:
Most production vLLM deployments run multiple identical instances behind a load balancer. Each instance builds its own KV cache only from the traffic it…

AMD × LMCache: AMD GPU Acceleration with LMCache

January 9, 2026

AMD, Benchmark, lmcache

Introduction
LLM inference becomes increasingly challenging as context length grows and workloads scale. Traditional serving engines rely on prefix-based KV cache reuse, which limits opportunities for optimization, especially when processing long, repeated, or overlapping text across different requests. LMCache addresses this challenge. It is an extension to LLM serving engines that dramatically reduces time-to-first-token (TTFT)…

Context Engineering & Reuse Pattern Under the Hood of Claude Code

December 23, 2025

Benchmark, lmcache

cacheblend, claude-code, lmcache

Over the last few months, Claude Code has quietly become one of the most interesting and widely adopted real-world agentic systems available to normal developers. Unlike cloud-only agents such as Perplexity, Devin, or Manus, whose internals remain hidden behind API gateways, and unlike fully open-source agents such as Mini SWE Agent or Terminus 2, where you can…
