Overview of the Collaboration: The KV Cache is a memory optimization that speeds up the forward pass of Large Language Models (LLMs) by storing the Key (K) and Value (V) matrices, so the model does not have to recalculate them for the entire text sequence with every newly generated token. Maximizing the KV Cache hit rate with storage is…
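To make the mechanism concrete, here is a minimal, framework-free sketch (not LMCache code) of why caching K and V helps: each decode step appends one new K/V row and attends over the cached rows, instead of recomputing the projections for the whole prefix.

```python
import numpy as np

def attention_step(q, K_cache, V_cache):
    """Attention for one new query token over all cached key/value rows."""
    scores = K_cache @ q                      # (num_cached_tokens,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V_cache                  # (d,)

def generate(num_steps: int, d: int = 8) -> int:
    rng = np.random.default_rng(0)
    K_cache = np.empty((0, d))
    V_cache = np.empty((0, d))
    for _ in range(num_steps):
        # In a real model these come from the W_q / W_k / W_v projections of
        # the newly generated token; random vectors stand in for them here.
        q_new, k_new, v_new = rng.normal(size=(3, d))
        # Append one row instead of recomputing K/V for the whole prefix.
        K_cache = np.vstack([K_cache, k_new])
        V_cache = np.vstack([V_cache, v_new])
        _ = attention_step(q_new, K_cache, V_cache)
    return len(K_cache)  # one cached K/V row per generated token

print(generate(5))  # 5
```

The cost of each step stays proportional to the prefix length for attention only; the K/V projections themselves are computed once per token, which is exactly the work a cache hit saves.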

A flexible plugin system for enhanced observability and management. Abstract: In large-scale language model inference, efficient memory management and KV cache optimization are crucial. LMCache, a KV cache management system designed specifically for vLLM, needs more flexible extension mechanisms to support monitoring, troubleshooting, and state insight in complex production…
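The excerpt motivates an extension mechanism rather than spelling out its API, so the sketch below is a hypothetical illustration of the kind of observability hook such a plugin system enables. The names here (CachePlugin, on_lookup, HitRateMonitor) are invented for this example and are not LMCache's actual interface.

```python
from abc import ABC, abstractmethod

class CachePlugin(ABC):
    """Hypothetical observability hook; LMCache's real plugin API may differ."""

    @abstractmethod
    def on_lookup(self, key: str, hit: bool) -> None: ...

class HitRateMonitor(CachePlugin):
    """Tracks cache hit rate without touching the cache's data path."""

    def __init__(self) -> None:
        self.hits = 0
        self.lookups = 0

    def on_lookup(self, key: str, hit: bool) -> None:
        self.lookups += 1
        self.hits += hit

    @property
    def hit_rate(self) -> float:
        return self.hits / self.lookups if self.lookups else 0.0

class ObservableCache:
    """Toy cache that notifies registered plugins on every lookup."""

    def __init__(self, plugins: list[CachePlugin]) -> None:
        self._store: dict[str, bytes] = {}
        self._plugins = plugins

    def get(self, key: str) -> bytes | None:
        value = self._store.get(key)
        for plugin in self._plugins:
            plugin.on_lookup(key, value is not None)
        return value

    def put(self, key: str, value: bytes) -> None:
        self._store[key] = value

monitor = HitRateMonitor()
cache = ObservableCache([monitor])
cache.put("prefix:abc", b"kv-bytes")
cache.get("prefix:abc")
cache.get("prefix:xyz")
print(f"hit rate: {monitor.hit_rate:.0%}")  # 50%
```

Keeping plugins on a notification-only path is the design point the post argues for: monitoring and state insight can evolve independently of the cache's storage and transfer logic.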

We’re thrilled to announce that Nvidia Dynamo has integrated LMCache as its KV caching layer. This is a big milestone: Dynamo gains a battle-tested caching solution, and LMCache becomes part of a data-center-scale inference platform used by developers worldwide to deploy AI at scale. For comprehensive details about Dynamo’s KV cache optimization…

LMCache now supports OpenAI’s newly released GPT-OSS models (20B and 120B parameters) from day one! This post provides a complete guide to setting up vLLM with LMCache for GPT-OSS models and demonstrates significant performance improvements through our CPU offloading capabilities. Step 1: Install the GPT-OSS version of vLLM and test the installation. Step 2: Install LMCache…
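For orientation, here is a minimal sketch of the vLLM + LMCache wiring with CPU offloading enabled. The connector name, environment variables, and model ID follow the integration pattern in LMCache's documentation at the time of writing, but treat them as assumptions to verify against the full post; Step 1 (installing the GPT-OSS-compatible vLLM build) is not shown here.

```python
import os

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# LMCache CPU-offloading knobs (assumed names/values, based on LMCache's
# documented environment variables; verify against the post's instructions).
os.environ.setdefault("LMCACHE_CHUNK_SIZE", "256")         # tokens per KV chunk
os.environ.setdefault("LMCACHE_LOCAL_CPU", "True")         # enable CPU offloading
os.environ.setdefault("LMCACHE_MAX_LOCAL_CPU_SIZE", "20")  # GiB of CPU cache

# Assumes the GPT-OSS-compatible vLLM build from Step 1 is already installed.
llm = LLM(
    model="openai/gpt-oss-20b",
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",  # route KV blocks through LMCache
        kv_role="kv_both",                  # this process both saves and loads KV
    ),
)

out = llm.generate(
    ["Summarize the key findings of the attached report."],
    SamplingParams(max_tokens=64),
)
print(out[0].outputs[0].text)
```

With CPU offloading on, KV chunks evicted from GPU memory are kept in host RAM, so repeated prompts (shared documents, long system prompts) can skip prefill on subsequent requests.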

Highlights: Today, LMCache shares two key designs in LLM infrastructure for disaggregated prefill and more. Together, these updates mark a pivotal leap forward in PD (prefill-decode) disaggregation for vLLM, toward better system flexibility and multi-node scale-out. A high-level architecture diagram of the “vLLM V1 + NIXL + LMCache” integration: vLLM V1 Gets a Major Upgrade with…
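As a conceptual aid only, the toy sketch below shows the shape of PD disaggregation: a prefill worker exports the KV cache for a prompt, and a decode worker imports it instead of redoing prefill. Every name here is hypothetical; the real integration moves KV tensors over NIXL and LMCache, not a Python dict.

```python
from dataclasses import dataclass

@dataclass
class KVBlock:
    prompt_hash: str
    kv_bytes: bytes  # stand-in for the serialized K/V tensors

class SharedKVStore:
    """Stands in for the LMCache-backed transfer layer between nodes."""

    def __init__(self) -> None:
        self._blocks: dict[str, KVBlock] = {}

    def put(self, block: KVBlock) -> None:
        self._blocks[block.prompt_hash] = block

    def get(self, prompt_hash: str) -> KVBlock | None:
        return self._blocks.get(prompt_hash)

def prefill_node(prompt: str, store: SharedKVStore) -> str:
    """Runs the (simulated) prefill pass and exports the resulting KV cache."""
    prompt_hash = str(hash(prompt))
    store.put(KVBlock(prompt_hash, kv_bytes=prompt.encode()))
    return prompt_hash

def decode_node(prompt_hash: str, store: SharedKVStore) -> str:
    """Loads the exported KV cache instead of redoing prefill, then decodes."""
    block = store.get(prompt_hash)
    assert block is not None, "KV transfer missed; would fall back to prefill"
    return f"decoded using {len(block.kv_bytes)} cached KV bytes"

store = SharedKVStore()
handle = prefill_node("Analyze this long document ...", store)
print(decode_node(handle, store))
```

The point of the split is that prefill (compute-bound) and decode (memory-bandwidth-bound) can be scaled and scheduled independently, with the KV transfer layer as the contract between them.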

A picture is worth a thousand words. Executive Summary: [vLLM Production Stack Github] | [Get In Touch] | [Slack] | [Linkedin] | [Twitter] Benchmark setup. Methods and workload: Inspired by our production deployments, we create workloads that emulate a typical chat-bot document-analysis scenario. By default, each LLM query input has 9K tokens, including a document…
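A hypothetical generator for that kind of workload might look like the following; the token counts and round structure are assumptions chosen to match the description (a long shared document plus short follow-up queries), not the benchmark's actual harness.

```python
import random

DOC_TOKENS = 8_800       # assumed long shared context (the post cites ~9K total input)
QUESTION_TOKENS = 200    # assumed short per-round question

def make_round(doc_id: int, round_idx: int) -> dict:
    """One chat-bot query against a shared document."""
    return {
        "doc_id": doc_id,
        "round": round_idx,
        # Token counts only; a real harness would carry actual text.
        "input_tokens": DOC_TOKENS + QUESTION_TOKENS,
        "reused_prefix_tokens": DOC_TOKENS if round_idx > 0 else 0,
    }

def make_workload(num_docs: int = 4, rounds_per_doc: int = 3, seed: int = 0) -> list[dict]:
    rng = random.Random(seed)
    queries = [make_round(d, r) for d in range(num_docs) for r in range(rounds_per_doc)]
    rng.shuffle(queries)  # interleave users, as a serving mix would
    return queries

workload = make_workload()
reusable = sum(q["reused_prefix_tokens"] for q in workload)
print(f"{len(workload)} queries, {reusable} prefix tokens eligible for cache reuse")
```

The reusable-prefix count is what a KV-cache-aware stack can exploit: every round after the first can hit the cached document prefix rather than paying full prefill again.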

TL;DR Why vLLM Production Stack? AGI isn’t just about better models; it is also about better systems that serve those models to the wider public, so that everyone has access to the new capabilities! To fully harness the power of Generative AI, every organization that takes this AI revolution seriously needs to have…

TL;DR [Github Link] | [More Tutorials] | [Get In Touch] AWS Tutorial (click here) GKE Tutorial (click here) The Context: vLLM has taken the open-source community by storm, with unparalleled hardware and model support plus an active ecosystem of top-notch contributors. But until now, vLLM has mostly focused on single-node deployments. vLLM Production-stack is an…

TL;DR [Github Link] | [More Tutorials] | [Interest Form] Tutorial Video (click below) The Context: vLLM has taken the open-source community by storm, with unparalleled hardware and model support plus an active ecosystem of top-notch contributors. But until now, vLLM has mostly focused on single-node deployments. vLLM Production-stack is an open-source reference implementation of an…

TL;DR The Context: In the AI arms race, it’s no longer just about who has the best model; it’s about who has the best LLM serving system. vLLM has taken the open-source community by storm, with unparalleled hardware and model support plus an active ecosystem of top-notch contributors. But until now, vLLM has mostly focused on…
