TL;DR [Github Link] | [More Tutorials] | [Interest Form] | [Tutorial Video] The Context: vLLM has taken the open-source community by storm, with unparalleled hardware and model support plus an active ecosystem of top-notch contributors. But until now, vLLM has mostly focused on single-node deployments. vLLM Production-stack is an open-source reference implementation of an…
TL;DR The Context: In the AI arms race, it’s no longer just about who has the best model; it’s about who has the best LLM serving system. vLLM has taken the open-source community by storm, with unparalleled hardware and model support plus an active ecosystem of top-notch contributors. But until now, vLLM has mostly focused on…
🚀 Building your system on KV cache? Try building it on LMCache! By using LMCache for your research, you can focus on KV cache management while we handle all the vLLM integration and compatibility for you. Here’s why LMCache works as your research testbed. Check our codebase and documentation for more information! Update: we are…
We’re excited to announce that our LMCache documentation is now live! 🎉 This documentation website will help you get started quickly and understand all the key features. Here’s what you’ll find. Our documentation is designed for both beginners and experienced developers who want to optimize LLM inference and explore cutting-edge techniques. Check out the documentation…
TL;DR: Your RAG can run up to 4.5× faster by pairing vLLM with LMCache. [💻 Source code] [📚 Paper] (to appear in the 10th ACM EuroSys (European Conference on Computer Systems) 2025) [🎬 3-minute introduction video] The Problem: RAG is WAY TOO SLOW. Retrieval-Augmented Generation (RAG) has become a key technique in…
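As a rough illustration of what "pairing vLLM with LMCache" can look like, the sketch below wires LMCache into vLLM through vLLM's KV-connector interface. The connector name, config fields, environment variables, and model are assumptions about the current integration and may differ between releases; treat this as a sketch rather than the post's exact setup.

```python
# Hedged sketch: enabling LMCache-backed KV cache reuse inside vLLM.
# Connector name, config fields, and env vars are assumptions and may
# vary across vLLM/LMCache versions.
import os

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# Assumed env vars letting LMCache keep KV caches in CPU memory as well.
os.environ.setdefault("LMCACHE_LOCAL_CPU", "True")
os.environ.setdefault("LMCACHE_MAX_LOCAL_CPU_SIZE", "5")  # GB, illustrative

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # placeholder model
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",  # assumed connector name
        kv_role="kv_both",                  # both store and load KV cache
    ),
)

# In a RAG loop, many queries share the same retrieved chunks; with the
# connector enabled, their KV caches can be reused instead of being
# recomputed during prefill on every query.
out = llm.generate(
    ["Context: <retrieved documents>\n\nQuestion: ..."],
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(out[0].outputs[0].text)
```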
Are you a vLLM user? Unlock 100x more KV cache storage space for your multi-round conversation and document QA applications using LMCache! Just ONE line change to your code! Offline inference: you can use LMCache within two steps. First run the install command, then make the one-line change, and you are good to go! Like…
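To make the "ONE line change" concrete, here is a minimal sketch of what the offline-inference swap can look like. The `lmcache_vllm` import path and the install command are assumptions about the wrapper package this post refers to, and the model and prompt are placeholders.

```python
# Step 1 (assumed install command): pip install lmcache lmcache_vllm
#
# Step 2, the one-line change: import vLLM's entry points through the
# LMCache wrapper instead of from vLLM directly.
#
# from vllm import LLM, SamplingParams               # original import
from lmcache_vllm.vllm import LLM, SamplingParams    # assumed drop-in replacement

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # placeholder model
params = SamplingParams(temperature=0.0, max_tokens=64)

# Prompts that share a long prefix (e.g. the same document in document QA)
# can now hit LMCache's KV store instead of being prefilled from scratch.
outputs = llm.generate(["<long shared document>\n\nQuestion: ..."], params)
print(outputs[0].outputs[0].text)
```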
TL;DR: LMCache turboboosts vLLM with 7× faster access to 100x more KV caches, for both multi-turn conversation and RAG . [💻 Source code] [📚 Paper1] [📚 Paper2] [🎬 3-minute introduction video] LLMs are ubiquitous across industries, but when using them with long documents, it takes forever for the model even to spit…
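To see why multi-turn conversation benefits, note that every new turn resends the growing chat history, so without KV reuse the server re-prefills all earlier turns. The sketch below times successive chat turns against an OpenAI-compatible vLLM endpoint; the endpoint URL and model name are assumptions, and the serving stack must already have LMCache enabled for the caching effect to show.

```python
# Hedged sketch: timing successive chat turns against an OpenAI-compatible
# vLLM server (assumed to be running with LMCache enabled at this URL).
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")  # assumed endpoint
model = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder model

history = [{"role": "system", "content": "You are a helpful assistant."}]
for turn, question in enumerate(["Summarize chapter 1.", "And chapter 2?", "Compare them."], 1):
    history.append({"role": "user", "content": question})
    start = time.perf_counter()
    reply = client.chat.completions.create(model=model, messages=history, max_tokens=128)
    latency = time.perf_counter() - start
    history.append({"role": "assistant", "content": reply.choices[0].message.content})
    # With KV reuse, later turns avoid re-prefilling the earlier history,
    # so latency should grow far more slowly than the prompt length.
    print(f"turn {turn}: {latency:.2f}s")
```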