Follow us on: X, LinkedIn
Initiated and Officially Supported by Tensormesh
TL;DR: The latest LMCache release plugs seamlessly into vLLM’s new multimodal stack. By hashing image-side tokens (mm_hashes) and caching their key-value (KV) pairs, LMCache reuses vision embeddings across requests—slashing time-to-first-token and GPU memory for visual-LLMs. Summary — Why This Matters Multimodal large language models (MLLMs) multiply KV-cache traffic: every image can add thousands of “vision…

TL;DR: Our LLM Production Stack project just hit another milestone. We’re integrating with more hardware accelerators — including Ascend, Arm, and AMD — signaling growing maturity and broader applicability across enterprise and research settings. 🚀 LMCache Is Gaining Traction LMCache has quietly become the unsung hero in the LLM inference world. As a core component…
We’re delighted to announce that LMCache is joining forces with Red Hat and other industry leaders on some exciting open source project collaborations. LMCache has been selected to be a core component of llm-d, a new open source project led by Red Hat to drive more scalable, efficient distributed inferencing across clusters of vLLM servers…

TL;DR LMCache, the state-of-the-art KV cache layer library developed by TensorMesh and the project’s open-source community, delivers breakthrough performance improvements to modern enterprise LLM inference frameworks, including the vLLM Production Stack, KServe, and NVIDIA’s Dynamo. With fast and scalable caching of long-context KV cache, LMCache helps reduce inference costs and ensures SLOs for both latency…

Overview of the Collaboration LMCache and Mooncake have announced a strategic collaboration aimed at pioneering a KVCache-centric Large Language Model (LLM) serving system. This partnership seeks to significantly enhance the efficiency, scalability, and responsiveness of LLM applications. By combining LMCache’s advanced KVCache management techniques with Mooncake’s powerful and optimized backend infrastructure, the collaboration aims to…

TL;DR:In our previous blog, we introduced **LMCache**’s integration with vLLM v1 and NVIDIA’s NIXL used in Dynamo, enabling Prefill-Decode Disaggregation (PD) for LLM inference. Today, we’re excited to share benchmark results that confirm this system achieves state-of-the-art PD performance, balancing time-to-first-token (TTFT) and inter-token latency (ITL) with unprecedented consistency. Here’s an example result (scroll down…

Highlights: This article refers to LMCache based on commit-01277a1 LMCache V1(experimental), and introduces it in the context of the inference engine vLLM’s V0 version. LMCache Architecture and Position in the Ecosystem LMCache is an intelligent caching middleware specifically designed for Large Language Model (LLM) inference. Here’s a breakdown of its architecture and position: In the…

We’re excited to announce that our LMCache documentation is now live! 🎉 This documentation website to help you get started quickly and understand all the key features. Here’s what you’ll find: Our documentation is designed for both beginners and experienced developers who want to optimize LLM inference and explore cutting-edge techniques. Check out the documentation…

Are you a vLLM user? Unlock 100x more KV cache storage space for your multi-round conversation and document QA applications using LMCache! Just ONE line change to your code! Offline inference For offline inference, you can use LMCache within two steps: First run And then change to and now you are good to go! Like…
