Tutorial Archives | LMCache Blog

vLLM + LMCache: A Starter Guide, No GPU Required

June 23, 2026

lmcache, Tech Explained, Tutorial

Get started easily: a single MacBook is all you need to develop vLLM + LMCache. For New Contributors · Covering Frontend / L1 Eviction / L2 Storage / Observability. If you ever skipped LMCache because you didn’t have a GPU on hand, this guide was written for you. LMCache’s multi-platform framework has already decoupled the GPU…

Understanding LMCache MP Mode Transfer Paths: A Beginner’s Guide

June 15, 2026

lmcache, Tutorial

In the traditional setup, the KV cache is usually managed inside the inference engine process. This means the cache is closely tied to the lifetime of that engine. If the inference engine restarts or crashes, the KV cache may be lost as well. To address this, LMCache introduces multiprocess (MP) mode. In MP mode, LMCache runs as…

LMCache’s New Architecture Boosts MoE Inference Performance by 10×

April 3, 2026

Benchmark, lmcache, New features, Performance, Tutorial

lmcache, vLLM

Modern LLM serving workloads are defined by strict latency requirements, high concurrency, and rapidly growing context lengths. Applications such as multi-turn chat, AI agents, and retrieval-augmented generation continuously build on prior context, leading to substantial reuse of previously computed states. In production, systems must minimize time-to-first-token (TTFT) while maintaining stable decoding throughput under heavy concurrent…

Extending LMCache Backends: A Comprehensive Guide to Custom Backend Development

September 11, 2025

Tutorial

backend, customization, extension, lmcache, storage

In large language model inference, the performance and flexibility of the KV cache system directly affect overall service efficiency. LMCache, a high-performance caching framework for large models, gives developers rich extension capabilities through its modular backend design. This article starts with the extension mechanism of LMCache backends, using the officially provided lmc_external_log_backend as an example,…

CacheGen: Store Your KV Cache on Disk or S3—Load Blazingly Fast!

July 31, 2025

Tutorial

cachegen, kv cache, quantization, s3, storage

TL;DR: 🚀 CacheGen lets you store KV caches on disk or AWS S3 and load them way faster than recomputing! It compresses your KV cache up to 3× smaller than quantization so that you can load your KV cache blazingly fast while keeping response quality high. Stop wasting compute — use CacheGen to fully utilize…

Extending LMCache Remote Connectors: MooncakeStore as an Example

April 22, 2025

Tutorial

connector, lmcache, mooncake, tencent

Highlights: This article is based on LMCache V1 (experimental) at commit-01277a1 and presents it in the context of the V0 version of the inference engine vLLM. LMCache Architecture and Position in the Ecosystem: LMCache is an intelligent caching middleware specifically designed for Large Language Model (LLM) inference. Here’s a breakdown of its architecture and position: In the…

Shaping NIXL-based PD Disaggregation in vLLM V1

April 11, 2025

Tutorial

kv cache, NIXL, PD disaggregation, prefill, vLLM

Highlights: Today, LMCache shares two key designs in LLM infrastructure for disaggregated prefill and more: Together, these updates mark a pivotal leap forward in PD disaggregation for vLLM, towards better system flexibility and multi-node scale-out capabilities. A high-level architecture diagram of “vLLM V1 + NIXL + LMCache” integration: vLLM V1 Gets a Major Upgrade with…

Deploying LLMs in Clusters #2: running “vLLM production-stack” on AWS EKS and GCP GKE

February 20, 2025

Tutorial

aws, eks, gcp, gke, lambda, lambda lab, production stack, vLLM

The Context: vLLM has taken the open-source community by storm, with unparalleled hardware and model support plus an active ecosystem of top-notch contributors. But until now, vLLM has mostly focused on single-node deployments. vLLM Production-stack is an…

Deploying LLMs in Clusters #1: running “vLLM production-stack” on a cloud VM

February 13, 2025

Tutorial

deployment, k8s, kubernetes, production stack, vLLM

The Context: vLLM has taken the open-source community by storm, with unparalleled hardware and model support plus an active ecosystem of top-notch contributors. But until now, vLLM has mostly focused on single-node deployments. vLLM Production-stack is an open-source reference implementation of an…

Beyond Prefix Caching! How LMCache Speeds Up RAG by 4.5x By One Line of Change

October 9, 2024

Tutorial

cacheblend, paper, RAG

TL;DR: Your RAG can run up to 4.5× faster by pairing vLLM with LMCache. The paper will appear at the 20th ACM EuroSys (European Conference on Computer Systems) 2025. The Problem: RAG is WAY TOO SLOW. Retrieval-Augmented Generation (RAG) has become a key technique in…
