LMCache Blog | This is the blog of the LMCache community. It provides caching knowledge for your LLM , Accelerating and optimizing your GPU KVcache

LMCache Multi-node P2P CPU Memory Sharing & Control: From Experimental Feature to Production

By
slshen

Jan 21, 2026

lmcache, New features, Performance

Baolong Mao (Tencent), Chunxiao Zheng (Tencent), Weishu Deng (Tensormesh), Darren Peng (Tensormesh), Samuel Shen (Tensormesh) What is P2P and what does it promise? In this blog post, we will go over: Most production vLLM deployments run multiple identical instances behind a load balancer. Each instance builds its own KV cache only from the traffic it…
Read more: LMCache Multi-node P2P CPU Memory Sharing & Control: From Experimental Feature to Production
AMD × LMcache: AMD GPU Acceleration with LMcache

By
Andy Luo
,
Haichen Zhang
,
AMD AIG
,
Yihua
,
nijaba
and
LMCache Lab

Jan 9, 2026

AMD, Benchmark, lmcache

Introduction LLM inference becomes increasingly challenging as context length grows and workloads scale. Traditional serving engines rely on prefix-based KV cache reuse, which limits opportunities for optimization, especially when processing long, repeated, or overlapping text across different requests. LMCache addresses this challenge. It is an extension to LLM serving engines that dramatically reduces time-to-first-token (TTFT)…
Read more: AMD × LMcache: AMD GPU Acceleration with LMcache
Context Engineering & Reuse Pattern Under the Hood of Claude Code

By
Kobe
and
Mengbing

Dec 23, 2025

Benchmark, lmcache

cacheblend, claude-code, lmcache

Over the last few months, Claude Code has quietly become one of the most interesting & widely-adopted real-world agentic systems available to normal developers. Unlike cloud-only agents whose internals remain hidden behind API gateways like Perplexity, Devin, or Manus, nor as fully open source agents like Mini SWE Agent or Terminus 2 where you can…
Read more: Context Engineering & Reuse Pattern Under the Hood of Claude Code
LMCache x Ascend: Accelerating LLM inference on Ascend NPUs

By
LMCache Lab

Nov 4, 2025

ascend, lmcache, News

Supporting Ascend NPUs We’re delighted to announce that LMCache now officially supports Ascend NPUs with the release of the LMCache-Ascend plugin. LMCache-Ascend supports a broad range of Ascend compute platforms from the cloud to the edge. This major platform expansion underscores LMCache’s commitment to delivering leading performance across a diverse hardware ecosystem, enabling developers to…
Read more: LMCache x Ascend: Accelerating LLM inference on Ascend NPUs
Tensormesh unveiled and LMCache joins the PyTorch Foundation

By
Junchen Jiang

Oct 31, 2025

News

lmcache, pytorch, tensormesh

Announcing Tensormesh First I wanted to repeat here what I posted on the LMCache #general Slack channel last week: I am delighted to announce that the team that founded the LMCache project has decided to form a company, Tensormesh, a few months ago. As we are announcing the beta of our first product, we have…
Read more: Tensormesh unveiled and LMCache joins the PyTorch Foundation
Breaking the Memory Barrier: How LMCache and CoreWeave Power Efficient LLM Inference for Cohere

By
Walter Beller-Morales (Cohere)
,
Kishor Aher (CoreWeave)
and
Samuel Shen (Tensormesh)

Oct 29, 2025

Benchmark, Performance

benchmark, CAIOS, cohere, coreweave, RAG, storage, tensormesh

The challenge: Scaling enterprise AI Enterprises today are racing to integrate large language models (LLMs) into their products and workflows, but doing it at scale brings challenges in performance, cost, and accuracy. Organizations need models to be based on their specific data, while making sure that this information remains private. Cohere, one of the leading…
Read more: Breaking the Memory Barrier: How LMCache and CoreWeave Power Efficient LLM Inference for Cohere
LMCache on Google Kubernetes Engine: Boosting LLM Inference Performance with KV Cache on Tiered Storage

By
Danna Wang (Google)

Oct 7, 2025

Benchmark

benchmark, gke, Google, storage, vLLM

Overview of the Collaboration The KV Cache is a memory optimization that makes Large Language Models(LLMs) run the forward pass faster by storing Key (K) and Value (V) matrices to prevent the model from recalculating them for the entire text sequence with every new generated token. Maximizing the KV Cache hit rate with storage is…
Read more: LMCache on Google Kubernetes Engine: Boosting LLM Inference Performance with KV Cache on Tiered Storage
Implementing LMCache Plugin Framework & lmcache_frontend: Design Philosophy

By
Kobe
and
Baolong

Sep 23, 2025

New features

lmcache, vLLM

A flexible plugin system for enhanced observability and management Abstract In large-scale language model inference scenarios, efficient memory management and KV cache optimization are crucial. LMCache, as a KV cache management system specifically designed for vLLM, requires more flexible extension mechanisms to meet the needs of monitoring, troubleshooting, and state insight when facing complex production…
Read more: Implementing LMCache Plugin Framework & lmcache_frontend: Design Philosophy
NVIDIA Dynamo integrates LMCache, Accelerating LLM Inference

By
LMCache Team

Sep 18, 2025

News

dynamo, lmcache, nvidia, vLLM

We’re thrilled to announce that Nvidia Dynamo has integrated LMCache as a KV caching layer solution. This is a big milestone: Dynamo gets a battle-tested caching solution, and LMCache becomes part of a data center-scale inference platform used by many developers worldwide to deploy AI at scale. For comprehensive details about Dynamo’s KV cache optimization…
Read more: NVIDIA Dynamo integrates LMCache, Accelerating LLM Inference
Extending LMCache Backends: A Comprehensive Guide to Custom Backend Development

By
Baolong
and
Kobe

Sep 11, 2025

Tutorial

backend, customization, extension, lmcache, storage

In large language model inference scenarios, the performance and flexibility of KVCache caching systems directly impact overall service efficiency. LMCache, as a high-performance large model caching framework, provides developers with rich extension capabilities through its modular backend design. This article will start with LMCache backend’s extension mechanism, using the officially provided lmc_external_log_backend as an example,…
Read more: Extending LMCache Backends: A Comprehensive Guide to Custom Backend Development