In large language model (LLM) inference, the performance and flexibility of the KV cache system directly affect overall serving efficiency. LMCache, a high-performance caching framework for large models, gives developers rich extension capabilities through its modular backend design. This article starts from LMCache's backend extension mechanism, using the officially provided lmc_external_log_backend as an example,…
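To give a flavor of that extension pattern, here is a minimal, hypothetical sketch of a logging backend that decorates another storage backend. The class and method names below are illustrative assumptions, not LMCache's actual backend interface; see the lmc_external_log_backend example in the post for the real API.

```python
# Hypothetical sketch of an external LMCache-style storage backend that logs
# every cache operation. All names here are assumptions for illustration;
# the real extension point is defined by LMCache itself.
from typing import Optional
import logging

logger = logging.getLogger("lmc_external_log_backend_sketch")


class LoggingBackend:
    """Wraps an underlying storage backend and logs puts/gets of KV-cache chunks."""

    def __init__(self, inner):
        self.inner = inner  # the real storage backend being decorated

    def contains(self, key: str) -> bool:
        found = self.inner.contains(key)
        logger.info("contains(%s) -> %s", key, found)
        return found

    def put(self, key: str, value: bytes) -> None:
        logger.info("put(%s, %d bytes)", key, len(value))
        self.inner.put(key, value)

    def get(self, key: str) -> Optional[bytes]:
        value = self.inner.get(key)
        logger.info("get(%s) -> %s", key, "hit" if value is not None else "miss")
        return value
```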

TL;DR: 🚀 CacheGen lets you store KV caches on disk or AWS S3 and load them way faster than recomputing! It compresses your KV cache up to 3× smaller than quantization so that you can load your KV cache blazingly fast while keeping response quality high. Stop wasting compute — use CacheGen to fully utilize…

Highlights: This article is based on LMCache V1 (experimental) at commit 01277a1, and introduces it in the context of the inference engine vLLM's V0 version. LMCache Architecture and Position in the Ecosystem: LMCache is an intelligent caching middleware designed specifically for Large Language Model (LLM) inference. Here's a breakdown of its architecture and position: In the…

Highlights: Today, LMCache shares two key designs in LLM infrastructure for disaggregated prefill and more. Together, these updates mark a pivotal leap forward in PD (prefill-decode) disaggregation for vLLM, towards better system flexibility and multi-node scale-out capabilities. The post includes a high-level architecture diagram of the "vLLM V1 + NIXL + LMCache" integration. vLLM V1 Gets a Major Upgrade with…

TL;DR: [Github Link] | [More Tutorials] | [Get In Touch] | [AWS Tutorial] | [GKE Tutorial]. The Context: vLLM has taken the open-source community by storm, with unparalleled hardware and model support plus an active ecosystem of top-notch contributors. But until now, vLLM has mostly focused on single-node deployments. vLLM Production-stack is an…

TL;DR: [Github Link] | [More Tutorials] | [Interest Form] | [Tutorial Video]. The Context: vLLM has taken the open-source community by storm, with unparalleled hardware and model support plus an active ecosystem of top-notch contributors. But until now, vLLM has mostly focused on single-node deployments. vLLM Production-stack is an open-source reference implementation of an…

TL;DR: Your RAG can run up to 4.5× faster by pairing vLLM with LMCache. [💻 Source code] [📚 Paper], to appear at the 10th ACM EuroSys (European Conference on Computer Systems) 2025, [🎬 3-minute introduction video]. The Problem: RAG is WAY TOO SLOW. Retrieval-Augmented Generation (RAG) has become a key technique in…

Are you a vLLM user? Unlock 100x more KV cache storage space for your multi-round conversation and document QA applications using LMCache! Just ONE line change to your code! Offline inference: LMCache takes just two steps. First run the install command, then change a single line of your code, and you are good to go! Like…
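A minimal sketch of that one-line change, assuming the lmcache_vllm wrapper package described in the LMCache quickstart; the exact package name, import path, and model below are illustrative and may differ across LMCache versions.

```python
# Step 1 (assumed package names; check the LMCache docs for your version):
#   pip install lmcache lmcache_vllm

# Step 2: the ONE line change -- import vLLM through the LMCache wrapper
# instead of importing vLLM directly.
# Before: from vllm import LLM, SamplingParams
from lmcache_vllm.vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
sampling_params = SamplingParams(temperature=0.0, max_tokens=64)

# Repeated long prefixes (multi-round chat history, shared documents) can now
# be served from LMCache's KV cache storage instead of being recomputed.
outputs = llm.generate(["Summarize the following document: ..."], sampling_params)
print(outputs[0].outputs[0].text)
```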

TL;DR: LMCache turboboosts vLLM with 7× faster access to 100x more KV caches, for both multi-turn conversation and RAG. [💻 Source code] [📚 Paper1] [📚 Paper2] [🎬 3-minute introduction video] LLMs are ubiquitous across industries, but when using them with long documents, it takes forever for the model even to spit…
