We’re delighted to announce that LMCache is joining forces with Red Hat and other industry leaders on some exciting open source project collaborations. LMCache has been selected to be a core component of llm-d, a new open source project led by Red Hat to drive more scalable, efficient distributed inferencing across clusters of vLLM servers…
TL;DR: LMCache, the state-of-the-art KV cache layer library developed by TensorMesh and the project’s open-source community, delivers breakthrough performance improvements to modern enterprise LLM inference frameworks, including the vLLM Production Stack, KServe, and NVIDIA’s Dynamo. With fast and scalable caching of long-context KV cache, LMCache helps reduce inference costs and meet SLOs for both latency…
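For readers who want a concrete starting point, here is a minimal sketch of pointing vLLM at LMCache through vLLM’s KV-connector interface. The specific names used here (`KVTransferConfig`, the `LMCacheConnectorV1` connector string, and the `LMCACHE_*` environment variables) reflect our reading of recent LMCache/vLLM examples and may differ between versions, so treat this as an illustration rather than a pinned recipe.

```python
# Sketch: running vLLM with LMCache as the KV cache layer.
# Assumes a recent vLLM build with the KV-connector interface and the
# `lmcache` package installed; exact names may vary by version.
import os

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# LMCache is configured through environment variables (or a YAML file pointed
# to by LMCACHE_CONFIG_FILE); these particular knobs are assumptions.
os.environ.setdefault("LMCACHE_CHUNK_SIZE", "256")        # KV cache chunking granularity
os.environ.setdefault("LMCACHE_LOCAL_CPU", "True")        # spill KV cache to CPU memory
os.environ.setdefault("LMCACHE_MAX_LOCAL_CPU_SIZE", "5")  # CPU cache budget in GB

# Route vLLM's KV cache load/store through the LMCache connector.
ktc = KVTransferConfig(kv_connector="LMCacheConnectorV1", kv_role="kv_both")

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    kv_transfer_config=ktc,
    gpu_memory_utilization=0.8,
)

long_doc = "..."  # a long shared context whose KV cache can be reused across queries
prompts = [
    long_doc + "\n\nQ: Summarize the document.",
    long_doc + "\n\nQ: List the key action items.",
]
outputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=128))
for out in outputs:
    print(out.outputs[0].text)
```

The second prompt shares the long document prefix with the first, so its prefill can reuse the KV cache that LMCache stored for the first request instead of recomputing it.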
Overview of the Collaboration: LMCache and Mooncake have announced a strategic collaboration aimed at pioneering a KVCache-centric Large Language Model (LLM) serving system. This partnership seeks to significantly enhance the efficiency, scalability, and responsiveness of LLM applications. By combining LMCache’s advanced KVCache management techniques with Mooncake’s powerful and optimized backend infrastructure, the collaboration aims to…
TL;DR: In our previous blog, we introduced LMCache’s integration with vLLM v1 and NVIDIA’s NIXL used in Dynamo, enabling Prefill-Decode Disaggregation (PD) for LLM inference. Today, we’re excited to share benchmark results that confirm this system achieves state-of-the-art PD performance, balancing time-to-first-token (TTFT) and inter-token latency (ITL) with unprecedented consistency. Here’s an example result (scroll down…
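To make the two metrics concrete, below is a small, self-contained sketch of how TTFT and ITL can be measured against any token-streaming endpoint. The `stream_tokens` generator is a hypothetical stand-in for whatever streaming client you actually use.

```python
# Sketch: measuring time-to-first-token (TTFT) and inter-token latency (ITL)
# from a streaming generation. `stream_tokens` is a placeholder for a real
# streaming client (e.g. an OpenAI-compatible /v1/completions stream).
import time
from typing import Iterable, Iterator


def stream_tokens(prompt: str) -> Iterator[str]:
    # Placeholder: pretend the server emits one token every 20 ms.
    for tok in prompt.split():
        time.sleep(0.02)
        yield tok


def measure_ttft_itl(token_stream: Iterable[str]) -> tuple[float, float]:
    start = time.perf_counter()
    arrival_times = []
    for _ in token_stream:
        arrival_times.append(time.perf_counter())
    ttft = arrival_times[0] - start                      # time until the first token arrives
    gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0         # mean gap between subsequent tokens
    return ttft, itl


ttft, itl = measure_ttft_itl(stream_tokens("a benchmark prompt with several tokens"))
print(f"TTFT: {ttft * 1000:.1f} ms, mean ITL: {itl * 1000:.1f} ms")
```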
Highlights: This article describes LMCache V1 (experimental), based on commit 01277a1, and introduces it in the context of vLLM’s V0 inference engine. LMCache Architecture and Position in the Ecosystem: LMCache is an intelligent caching middleware specifically designed for Large Language Model (LLM) inference. Here’s a breakdown of its architecture and position: In the…
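As a mental model of what “caching middleware” means here, the toy sketch below stores KV cache chunks keyed by a hash of the token prefix, so a later request that shares a prefix can skip recomputing it. This is an illustration of the general idea, not LMCache’s actual implementation; all names and the chunk size are made up for the example.

```python
# Toy sketch of a KV cache layer: store the KV cache of fixed-size token
# chunks keyed by the hash of the prefix ending at that chunk, so a later
# request sharing the prefix can reuse it instead of re-running prefill.
import hashlib
from typing import Dict, List

CHUNK = 256  # tokens per cached chunk (arbitrary for the sketch)


def chunk_keys(token_ids: List[int]) -> List[str]:
    """One key per chunk; each key covers the whole prefix ending at that chunk."""
    keys = []
    for end in range(CHUNK, len(token_ids) + 1, CHUNK):
        prefix = str(token_ids[:end]).encode("utf-8")
        keys.append(hashlib.sha256(prefix).hexdigest())
    return keys


class KVCacheStore:
    def __init__(self) -> None:
        self._store: Dict[str, object] = {}  # key -> opaque KV tensors

    def lookup(self, token_ids: List[int]) -> int:
        """Return the number of prefix tokens whose KV cache is already stored."""
        hit_tokens = 0
        for i, key in enumerate(chunk_keys(token_ids)):
            if key not in self._store:
                break
            hit_tokens = (i + 1) * CHUNK
        return hit_tokens

    def insert(self, key: str, kv_tensors: object) -> None:
        self._store[key] = kv_tensors


store = KVCacheStore()
tokens = list(range(1000))              # pretend token ids of a long prompt
print(store.lookup(tokens))             # 0 tokens cached on a cold start
for key in chunk_keys(tokens):
    store.insert(key, kv_tensors=None)  # placeholder for real KV tensors
print(store.lookup(tokens))             # 768: the three full 256-token chunks now hit
```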
Highlights: Today, LMCache shares two key designs in LLM infrastructure for disaggregated prefill and more. Together, these updates mark a pivotal leap forward in PD disaggregation for vLLM, towards better system flexibility and multi-node scale-out capabilities. A high-level architecture diagram of the “vLLM V1 + NIXL + LMCache” integration: vLLM V1 Gets a Major Upgrade with…
Breaking News: “CacheBlend” Receives BEST PAPER AWARD at ACM EuroSys 2025. This week, at ACM EuroSys 2025 (a top academic conference in computer systems), Jiayi Yao, the first author of the groundbreaking paper on CacheBlend, will present our innovative work that redefines the landscape of LLM efficiency, particularly in retrieval-augmented generation (RAG) applications. This paper has…
A picture is worth a thousand words: Executive Summary: [vLLM Production Stack Github] | [Get In Touch] | [Slack] | [LinkedIn] | [Twitter] Benchmark setups Methods: Workload: Inspired by our production deployments, we create workloads that emulate typical chatbot document analysis. By default, each LLM query input has 9K tokens, including a document…
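The workload described above is straightforward to approximate. The sketch below builds requests with a long document prefix plus a short question, in the spirit of the benchmark description; the exact token counts, question pool, and per-user document assignment are illustrative assumptions, not the benchmark’s actual generator.

```python
# Sketch: a chatbot document-analysis workload with ~9K input tokens per query,
# dominated by a per-user document. Token counts are approximated as
# whitespace-separated words for simplicity.
import random

DOC_TOKENS = 8800   # approximate document length in tokens (assumption)
QUESTION_POOL = [
    "Summarize the key findings of this document.",
    "What risks does the document identify?",
    "List the action items mentioned in the document.",
]


def make_document(doc_id: int, n_tokens: int = DOC_TOKENS) -> str:
    # Stand-in for a real corpus: tagged filler tokens give each document a
    # distinct content and a predictable length.
    return " ".join(f"doc{doc_id}_tok{i}" for i in range(n_tokens))


def make_workload(num_users: int, queries_per_user: int, seed: int = 0):
    """Yield (user_id, prompt) pairs; each user keeps asking about one document."""
    rng = random.Random(seed)
    docs = {u: make_document(u) for u in range(num_users)}
    for _ in range(queries_per_user):
        for user in range(num_users):
            question = rng.choice(QUESTION_POOL)
            yield user, f"{docs[user]}\n\nUser: {question}\nAssistant:"


for user, prompt in make_workload(num_users=2, queries_per_user=1):
    print(user, len(prompt.split()), "tokens (approx.)")
```

Because each user’s queries repeat the same document prefix, this kind of workload is exactly where KV cache reuse pays off.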
TL;DR Why vLLM Production Stack? AGI isn’t just about better models; it is also about better systems that serve those models to the wider public, so that everyone has access to the new capabilities! In order to fully harness the power of Generative AI, every organization that takes this AI revolution seriously needs to have…
TL;DR [Github Link] | [More Tutorials] | [Get In Touch] [AWS Tutorial] | [GKE Tutorial] The Context: vLLM has taken the open-source community by storm, with unparalleled hardware and model support plus an active ecosystem of top-notch contributors. But until now, vLLM has mostly focused on single-node deployments. vLLM Production Stack is an…
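Once the stack is running (for example via the Kubernetes tutorials linked above), clients talk to the router through an OpenAI-compatible endpoint. The sketch below queries it with the standard `openai` client; the base URL, port, and model name are placeholders for whatever your own deployment exposes.

```python
# Sketch: querying a vLLM Production Stack deployment through its
# OpenAI-compatible router endpoint. base_url, port, and model are
# placeholders; substitute the values from your own deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30080/v1",  # e.g. a port-forwarded router service
    api_key="EMPTY",                       # vLLM does not require a real key by default
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Give me a one-line summary of vLLM."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```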