lmcache Archives | Page 2 of 3

LMCache Extends Its Turbo-Boost to Multimodal Models in vLLM V1

July 3, 2025

New features, News

kv cache, lmcache, mm_hash, multimodal

TL;DR: The latest LMCache release plugs seamlessly into vLLM’s new multimodal stack. By hashing image-side tokens (mm_hashes) and caching their key-value (KV) pairs, LMCache reuses vision embeddings across requests—slashing time-to-first-token and GPU memory for visual-LLMs. Summary — Why This Matters Multimodal large language models (MLLMs) multiply KV-cache traffic: every image can add thousands of “vision…

LLM Production Stack Goes Cross-Hardware: Ascend, Arm, and AMD Support Incoming

June 20, 2025

New features, News

AMD, Arm, Ascend, CUDA, kernel, lmcache, production stack, pytorch, TPU

TL;DR: Our LLM Production Stack project just hit another milestone. We’re integrating with more hardware accelerators — including Ascend, Arm, and AMD — signaling growing maturity and broader applicability across enterprise and research settings. 🚀 LMCache Is Gaining Traction LMCache has quietly become the unsung hero in the LLM inference world. As a core component…

LMCache Announces Exciting Collaboration with Red Hat, with LMCache Serving as a Founding Supporter of the llm-d project

May 22, 2025

News

llm-d, lmcache, Red Hat

We’re delighted to announce that LMCache is joining forces with Red Hat and other industry leaders on some exciting open source project collaborations. LMCache has been selected to be a core component of llm-d, a new open source project led by Red Hat to drive more scalable, efficient distributed inferencing across clusters of vLLM servers…

How LMCache Turbocharges Enterprise LLM Inference Frameworks

May 16, 2025

Performance

benchmark, ITL, lmcache, PD disaggregation, performance, RAG, TTFT

TL;DR: LMCache, the state-of-the-art KV cache layer library developed by TensorMesh and the project’s open-source community, delivers breakthrough performance improvements to modern enterprise LLM inference frameworks, including the vLLM Production Stack, KServe, and NVIDIA’s Dynamo. With fast and scalable caching of long-context KV cache, LMCache helps reduce inference costs and ensures SLOs for both latency…

LMCache x Mooncake: Unite to Pioneer KVCache-Centric LLM Serving System

May 8, 2025

News

collaboration, lmcache, mooncake, storage

Overview of the Collaboration LMCache and Mooncake have announced a strategic collaboration aimed at pioneering a KVCache-centric Large Language Model (LLM) serving system. This partnership seeks to significantly enhance the efficiency, scalability, and responsiveness of LLM applications. By combining LMCache’s advanced KVCache management techniques with Mooncake’s powerful and optimized backend infrastructure, the collaboration aims to…

Bringing State-Of-The-Art PD Speed to vLLM v1 with LMCache

April 29, 2025

Benchmark

dynamo, lmcache, NIXL, PD disaggregation

TL;DR: In our previous blog, we introduced LMCache’s integration with vLLM v1 and NVIDIA’s NIXL used in Dynamo, enabling Prefill-Decode Disaggregation (PD) for LLM inference. Today, we’re excited to share benchmark results that confirm this system achieves state-of-the-art PD performance, balancing time-to-first-token (TTFT) and inter-token latency (ITL) with unprecedented consistency. Here’s an example result (scroll down…

Extending LMCache Remote Connectors: MooncakeStore as an Example

April 22, 2025

Tutorial

connector, lmcache, mooncake, tencent

Highlights: This article refers to LMCache based on commit-01277a1 LMCache V1 (experimental), and introduces it in the context of the inference engine vLLM’s V0 version. LMCache Architecture and Position in the Ecosystem LMCache is an intelligent caching middleware specifically designed for Large Language Model (LLM) inference. Here’s a breakdown of its architecture and position: In the…

📖 Explore LMCache Documentation

October 17, 2024

News

documentation, lmcache

We’re excited to announce that our LMCache documentation is now live! 🎉 This documentation website is here to help you get started quickly and understand all the key features. Here’s what you’ll find: Our documentation is designed for both beginners and experienced developers who want to optimize LLM inference and explore cutting-edge techniques. Check out the documentation…

Are you a vLLM user? Change just ONE line of code to unlock 100x more KV cache storage power!

September 23, 2024

Tutorial

lmcache, vLLM

Are you a vLLM user? Unlock 100x more KV cache storage space for your multi-round conversation and document QA applications using LMCache! Just ONE line change to your code! Offline inference For offline inference, you can use LMCache within two steps: First run And then change to and now you are good to go! Like…

Introducing LMCache: Watch Our 3-Minute Demo on YouTube!

September 20, 2024

News

lmcache, video
