Initiated and Officially Supported by Tensormesh
TL;DR: LLMs are rapidly becoming the dominant workload in enterprise AI. As more applications rely on real-time generation, inference performance — measured in speed, cost, and reliability — becomes the key bottleneck. Today, the industry focuses primarily on speeding up inference engines like vLLM, SGLang, and TensorRT. But in doing so, we’re overlooking a much…
TL;DR: The latest LMCache release plugs seamlessly into vLLM’s new multimodal stack. By hashing image-side tokens (mm_hashes) and caching their key-value (KV) pairs, LMCache reuses vision embeddings across requests—slashing time-to-first-token and GPU memory for visual-LLMs. Summary — Why This Matters Multimodal large language models (MLLMs) multiply KV-cache traffic: every image can add thousands of “vision…
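The reuse idea in this post can be sketched in a few lines: key cached KV entries by a content hash of the image, so identical images across requests skip re-encoding. This is a minimal, dict-backed illustration; the function and class names here are hypothetical, and the real LMCache/vLLM integration keys entries by the engine's `mm_hashes` and stores GPU tensors, not Python lists.

```python
import hashlib

def mm_hash(image_bytes: bytes) -> str:
    # Hash the raw image content so identical images map to one cache key.
    return hashlib.sha256(image_bytes).hexdigest()

class ImageKVCache:
    """Toy content-addressed cache standing in for a KV-cache layer."""

    def __init__(self):
        self._store = {}   # hash -> precomputed "KV pairs"
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, image_bytes: bytes, encode):
        key = mm_hash(image_bytes)
        if key in self._store:
            self.hits += 1           # reuse: skip the vision encoder entirely
        else:
            self.misses += 1
            self._store[key] = encode(image_bytes)
        return self._store[key]

# Usage: two requests carrying the same image; the second hits the cache.
cache = ImageKVCache()
fake_encode = lambda b: [len(b)]     # stand-in for vision-encoder KV output
cache.get_or_compute(b"same-image", fake_encode)
cache.get_or_compute(b"same-image", fake_encode)
print(cache.hits, cache.misses)      # -> 1 1
```

The second lookup avoids the encode call entirely, which is the source of the time-to-first-token and GPU-memory savings the post describes.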
TL;DR: Our LLM Production Stack project just hit another milestone. We’re integrating with more hardware accelerators — including Ascend, Arm, and AMD — signaling growing maturity and broader applicability across enterprise and research settings. 🚀 LMCache Is Gaining Traction LMCache has quietly become the unsung hero in the LLM inference world. As a core component…
We’re delighted to announce that LMCache is joining forces with Red Hat and other industry leaders on some exciting open source project collaborations. LMCache has been selected to be a core component of llm-d, a new open source project led by Red Hat to drive more scalable, efficient distributed inferencing across clusters of vLLM servers…
TL;DR: LMCache, the state-of-the-art KV cache layer library developed by TensorMesh and the project’s open-source community, delivers breakthrough performance improvements to modern enterprise LLM inference frameworks, including the vLLM Production Stack, KServe, and NVIDIA’s Dynamo. With fast and scalable caching of long-context KV cache, LMCache helps reduce inference costs and ensures SLOs for both latency…
Overview of the Collaboration LMCache and Mooncake have announced a strategic collaboration aimed at pioneering a KVCache-centric Large Language Model (LLM) serving system. This partnership seeks to significantly enhance the efficiency, scalability, and responsiveness of LLM applications. By combining LMCache’s advanced KVCache management techniques with Mooncake’s powerful and optimized backend infrastructure, the collaboration aims to…
TL;DR: In our previous blog, we introduced **LMCache**’s integration with vLLM v1 and NVIDIA’s NIXL used in Dynamo, enabling Prefill-Decode (PD) Disaggregation for LLM inference. Today, we’re excited to share benchmark results confirming that this system achieves state-of-the-art PD performance, balancing time-to-first-token (TTFT) and inter-token latency (ITL) with unprecedented consistency. Here’s an example result (scroll down…
Highlights: This article covers LMCache V1 (experimental) at commit-01277a1, and introduces it in the context of the V0 version of the vLLM inference engine. LMCache Architecture and Position in the Ecosystem LMCache is an intelligent caching middleware designed specifically for Large Language Model (LLM) inference. Here’s a breakdown of its architecture and position: In the…
Highlights: Today, LMCache shares two key designs in LLM infrastructure for disaggregated prefill and more: Together, these updates mark a pivotal leap forward in PD disaggregation for vLLM, towards better system flexibility and multi-node scale-out capabilities. A high-level architecture diagram of “vLLM V1 + NIXL + LMCache” integration: vLLM V1 Gets a Major Upgrade with…
Breaking News: “CacheBlend” Receives BEST PAPER AWARD at ACM EuroSys 2025 This week, at ACM EuroSys 2025 (a top academic conference in computer systems), Jiayi Yao, the first author of the groundbreaking paper on CacheBlend, will present our innovative work that redefines the landscape of LLM efficiency, particularly in retrieval-augmented generation (RAG) applications. This paper has…