TL;DR [Github Link] | [More Tutorials] | [Interest Form] | [Tutorial Video] The Context: vLLM has taken the open-source community by storm, with unparalleled hardware and model support plus an active ecosystem of top-notch contributors. But until now, vLLM has mostly focused on single-node deployments. vLLM Production-stack is an open-source reference implementation of an…
TL;DR The Context: In the AI arms race, it’s no longer just about who has the best model; it’s about who has the best LLM serving system. vLLM has taken the open-source community by storm, with unparalleled hardware and model support plus an active ecosystem of top-notch contributors. But until now, vLLM has mostly focused on…
🚀 Building your system on KV cache? Try building it on LMCache! By using LMCache for your research, you can focus on KV cache management while we handle all the vLLM integration and compatibility for you. Here’s why LMCache works as your research testbed. Check our codebase and documentation for more information! Update: we are…
We’re excited to announce that our LMCache documentation is now live! 🎉 This documentation website will help you get started quickly and understand all the key features. Here’s what you’ll find. Our documentation is designed for both beginners and experienced developers who want to optimize LLM inference and explore cutting-edge techniques. Check out the documentation…
TL;DR: Your RAG can run up to 4.5× faster by pairing vLLM with LMCache. [💻 Source code] [📚 Paper] (to appear in the 10th ACM EuroSys (European Conference on Computer Systems) 2025) [🎬 3-minute introduction video] The Problem: RAG is WAY TOO SLOW. Retrieval-Augmented Generation (RAG) has become a key technique in…
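As a rough illustration of what "pairing vLLM with LMCache" can look like, the sketch below wires LMCache into vLLM through vLLM's KV-connector interface. The connector name, config fields, environment variables, and model are assumptions about the current integration and may differ between releases; treat this as a sketch rather than the post's exact setup.

```python
# Hedged sketch: enabling LMCache-backed KV cache reuse inside vLLM.
# Connector name, config fields, and env vars are assumptions and may
# vary across vLLM/LMCache versions.
import os

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# Assumed env vars letting LMCache keep KV caches in CPU memory as well.
os.environ.setdefault("LMCACHE_LOCAL_CPU", "True")
os.environ.setdefault("LMCACHE_MAX_LOCAL_CPU_SIZE", "5")  # GB, illustrative

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # placeholder model
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",  # assumed connector name
        kv_role="kv_both",                  # both store and load KV cache
    ),
)

# In a RAG loop, many queries share the same retrieved chunks; with the
# connector enabled, their KV caches can be reused instead of being
# recomputed during prefill on every query.
out = llm.generate(
    ["Context: <retrieved documents>\n\nQuestion: ..."],
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(out[0].outputs[0].text)
```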
Are you a vLLM user? Unlock 100x more KV cache storage space for your multi-round conversation and document QA applications using LMCache! Just ONE line change to your code! Offline inference: you can use LMCache within two steps. First run the install command, then make the one-line change, and you are good to go! Like…
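To make the "ONE line change" concrete, here is a minimal sketch of what the offline-inference swap can look like. The `lmcache_vllm` import path and the install command are assumptions about the wrapper package this post refers to, and the model and prompt are placeholders.

```python
# Step 1 (assumed install command): pip install lmcache lmcache_vllm
#
# Step 2, the one-line change: import vLLM's entry points through the
# LMCache wrapper instead of from vLLM directly.
#
# from vllm import LLM, SamplingParams               # original import
from lmcache_vllm.vllm import LLM, SamplingParams    # assumed drop-in replacement

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # placeholder model
params = SamplingParams(temperature=0.0, max_tokens=64)

# Prompts that share a long prefix (e.g. the same document in document QA)
# can now hit LMCache's KV store instead of being prefilled from scratch.
outputs = llm.generate(["<long shared document>\n\nQuestion: ..."], params)
print(outputs[0].outputs[0].text)
```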
TL;DR: LMCache turboboosts vLLM with 7× faster access to 100x more KV caches, for both multi-turn conversation and RAG . [💻 Source code] [📚 Paper1] [📚 Paper2] [🎬 3-minute introduction video] LLMs are ubiquitous across industries, but when using them with long documents, it takes forever for the model even to spit…
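To see why multi-turn conversation benefits, note that every new turn resends the growing chat history, so without KV reuse the server re-prefills all earlier turns. The sketch below times successive chat turns against an OpenAI-compatible vLLM endpoint; the endpoint URL and model name are assumptions, and the serving stack must already have LMCache enabled for the caching effect to show.

```python
# Hedged sketch: timing successive chat turns against an OpenAI-compatible
# vLLM server (assumed to be running with LMCache enabled at this URL).
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")  # assumed endpoint
model = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder model

history = [{"role": "system", "content": "You are a helpful assistant."}]
for turn, question in enumerate(["Summarize chapter 1.", "And chapter 2?", "Compare them."], 1):
    history.append({"role": "user", "content": question})
    start = time.perf_counter()
    reply = client.chat.completions.create(model=model, messages=history, max_tokens=128)
    latency = time.perf_counter() - start
    history.append({"role": "assistant", "content": reply.choices[0].message.content})
    # With KV reuse, later turns avoid re-prefilling the earlier history,
    # so latency should grow far more slowly than the prompt length.
    print(f"turn {turn}: {latency:.2f}s")
```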