We’re thrilled to announce that the NVIDIA Dynamo project has integrated LMCache as its KV-caching layer. This is a big milestone: Dynamo gets a battle-tested caching solution, and LMCache becomes part of a production-scale ecosystem used by many developers worldwide. Why KV Caching Matters: KV caching is a foundational optimization for modern LLM…
We’re thrilled to share that LMCache has officially crossed 5,000 GitHub stars! 🚀 This milestone is not just a number — it’s a strong signal that KV cache technology has become a first-class citizen in the LLM inference stack, and that our community is leading the way. What is LMCache? LMCache is the first open-source…
LMCache now supports OpenAI’s newly released GPT-OSS models (20B and 120B parameters) from day one! This post provides a complete guide to setting up vLLM with LMCache for GPT-OSS models and demonstrates significant performance improvements through our CPU offloading capabilities. Step 1: Install the GPT-OSS version of vLLM and test the installation. Step 2: Install LMCache…
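To make the setup concrete, here is a minimal Python sketch of the pattern the guide walks through: vLLM’s offline API with LMCache attached as a KV-transfer connector and CPU offloading enabled. The connector name, environment variables, and model id are assumptions drawn from LMCache’s standard vLLM integration rather than copied from the post, so check the full guide for the exact commands.

```python
# Hedged sketch: run GPT-OSS-20B through vLLM with LMCache CPU offloading.
import os

# LMCache configuration via environment variables (names assumed from LMCache's
# usual vLLM integration; verify against the post / docs).
os.environ["LMCACHE_CHUNK_SIZE"] = "256"          # tokens per KV-cache chunk
os.environ["LMCACHE_LOCAL_CPU"] = "True"          # enable CPU offloading
os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "40"   # CPU cache budget in GB

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# Attach LMCache to vLLM as a KV-transfer connector.
ktc = KVTransferConfig(kv_connector="LMCacheConnectorV1", kv_role="kv_both")

llm = LLM(
    model="openai/gpt-oss-20b",
    kv_transfer_config=ktc,
    max_model_len=16384,
)

prompt = "Explain KV caching in two sentences."
print(llm.generate([prompt], SamplingParams(max_tokens=64))[0].outputs[0].text)
```

The same idea typically carries over to the serving path by passing the connector configuration as a JSON argument when launching the vLLM server, but the post has the authoritative commands for GPT-OSS.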
TL;DR: 🚀 CacheGen lets you store KV caches on disk or AWS S3 and load them way faster than recomputing! It compresses your KV cache to a size up to 3× smaller than quantization alone, so you can load your KV cache blazingly fast while keeping response quality high. Stop wasting compute — use CacheGen to fully utilize…
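As a rough illustration of how this looks in practice, the sketch below configures LMCache to send KV chunks to a storage backend with CacheGen as the serializer before starting vLLM. The environment-variable names, the `lm://` URL, and the `"cachegen"` serde value are assumptions based on LMCache’s configuration conventions; the post has the authoritative settings.

```python
# Hedged sketch: offload compressed KV caches to a storage backend via CacheGen.
import os

os.environ["LMCACHE_CHUNK_SIZE"] = "256"
os.environ["LMCACHE_LOCAL_CPU"] = "False"
# Remote backend: an LMCache storage server (which could itself be backed by
# local disk or S3).
os.environ["LMCACHE_REMOTE_URL"] = "lm://localhost:65432"
# Serialize KV chunks with CacheGen's compression before they leave the GPU.
os.environ["LMCACHE_REMOTE_SERDE"] = "cachegen"

from vllm import LLM
from vllm.config import KVTransferConfig

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1", kv_role="kv_both"
    ),
)
# The first request computes and stores the compressed KV cache; repeated
# prefixes are then loaded from storage instead of being recomputed.
```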
TL;DR: ⚡ Shortest Prefill First (SPF) scheduling cuts LLM time-to-first-token by up to 18% in prefill-decode disaggregation—unlocking even greater gains when combined with LMCache! At LMCache Lab, we’re obsessed with LLM performance. As prefill-decode disaggregation becomes the norm, we spotted a major, untapped scheduling opportunity for prefill nodes. That’s why we developed SPF (Shortest Prefill First,…
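The core idea is simple enough to sketch: keep the prefill queue ordered by how many tokens each request still needs to prefill, so short (or mostly cached) prompts jump ahead of long ones. The code below is an illustrative toy queue, not LMCache’s actual scheduler.

```python
# Illustrative Shortest-Prefill-First queue: requests are served in order of
# remaining prefill length, so short prompts are not stuck behind long ones.
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class PrefillRequest:
    prefill_tokens: int                       # SPF priority key
    request_id: str = field(compare=False)    # not used for ordering


class SPFQueue:
    def __init__(self) -> None:
        self._heap: list[PrefillRequest] = []

    def add(self, request_id: str, prompt_len: int, cached_len: int = 0) -> None:
        # With LMCache, already-cached prefix tokens need no prefilling, which
        # shrinks the effective prefill length used for ordering.
        heapq.heappush(self._heap, PrefillRequest(prompt_len - cached_len, request_id))

    def next(self) -> PrefillRequest:
        return heapq.heappop(self._heap)


q = SPFQueue()
q.add("long-doc", prompt_len=32000)
q.add("chat", prompt_len=400)
q.add("rag", prompt_len=8000, cached_len=7500)   # mostly cached -> near the front
print([q.next().request_id for _ in range(3)])    # ['chat', 'rag', 'long-doc']
```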
TL;DR: 🚀 LMCache Lab cuts decoding latency for code/text editing by 60% with speculative decoding! ⚡ You might know LMCache Lab for our KV cache optimizations that make LLM prefilling a breeze. But that’s not all! We’re now focused on speeding up decoding too—so your LLM agents can generate new content even faster. In other…
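For intuition, here is a toy sketch of the draft-and-verify loop behind speculative decoding for editing: the original text serves as a free draft, and the target model only has to confirm it or correct the first token that diverges. The `verify_step` callback is a hypothetical stand-in for a batched target-model forward pass; this is not LMCache Lab’s exact implementation.

```python
# Toy draft-and-verify loop for speculative decoding over an editing task.
from typing import Callable, List, Tuple


def speculative_edit(
    original_tokens: List[int],
    verify_step: Callable[[List[int], List[int]], Tuple[int, int]],
    prefix: List[int],
    max_new_tokens: int,
    draft_window: int = 8,
) -> List[int]:
    """Edit by proposing chunks of the original text and verifying them in bulk."""
    output = list(prefix)
    cursor = 0
    while len(output) - len(prefix) < max_new_tokens and cursor < len(original_tokens):
        draft = original_tokens[cursor:cursor + draft_window]
        # One target-model pass scores the whole draft: it reports how many draft
        # tokens it accepts and supplies its own token for the first mismatch
        # (or a bonus token when the whole draft is accepted).
        n_accepted, next_token = verify_step(output, draft)
        output.extend(draft[:n_accepted])
        output.append(next_token)
        # Skip past the accepted span plus the corrected position so the loop
        # always makes progress, even when nothing from the draft is accepted.
        cursor += n_accepted + 1
    return output
```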
TL;DR: LLMs are transforming every product and service—from chatbots and copilots to intelligent document search and enterprise workflows. But running LLMs in production is still painfully slow, prohibitively expensive, and complex to manage. That changes today. We’re excited to announce the launch of LMIgnite — the first one-click deployable high-performance LLM inference backend for Conversational…
TL;DR: LLMs are rapidly becoming the dominant workload in enterprise AI. As more applications rely on real-time generation, inference performance — measured in speed, cost, and reliability — becomes the key bottleneck. Today, the industry focuses primarily on speeding up inference engines like vLLM, SGLang, and TensorRT. But in doing so, we’re overlooking a much…
TL;DR: The latest LMCache release plugs seamlessly into vLLM’s new multimodal stack. By hashing image-side tokens (mm_hashes) and caching their key-value (KV) pairs, LMCache reuses vision embeddings across requests—slashing time-to-first-token and GPU memory for visual LLMs. Summary — Why This Matters: Multimodal large language models (MLLMs) multiply KV-cache traffic: every image can add thousands of “vision…
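Conceptually, the mechanism can be sketched as a cache keyed on content hashes of the images (the mm_hashes) combined with the text prefix, so a repeated image maps to the same KV entries regardless of which request it arrives in. The class below is an illustrative toy, not LMCache’s internal data structure.

```python
# Illustrative multimodal KV cache keyed on image content hashes (mm_hashes).
import hashlib
from typing import Any, Dict, List, Optional, Tuple

KVChunk = Any  # stands in for the real KV tensor chunk type


def mm_hash(image_bytes: bytes) -> str:
    """Stable identifier for one image's vision-token block."""
    return hashlib.sha256(image_bytes).hexdigest()


class MultimodalKVCache:
    def __init__(self) -> None:
        self._store: Dict[Tuple[str, ...], KVChunk] = {}

    def key(self, text_prefix_hash: str, image_hashes: List[str]) -> Tuple[str, ...]:
        # The cache key mixes the text prefix with the mm_hashes of any images,
        # so the same image reused in a new conversation still matches.
        return (text_prefix_hash, *image_hashes)

    def lookup(self, text_prefix_hash: str, images: List[bytes]) -> Optional[KVChunk]:
        return self._store.get(self.key(text_prefix_hash, [mm_hash(i) for i in images]))

    def insert(self, text_prefix_hash: str, images: List[bytes], kv: KVChunk) -> None:
        self._store[self.key(text_prefix_hash, [mm_hash(i) for i in images])] = kv
```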
TL;DR: Our LLM Production Stack project just hit another milestone. We’re integrating with more hardware accelerators — including Ascend, Arm, and AMD — signaling growing maturity and broader applicability across enterprise and research settings. 🚀 LMCache Is Gaining Traction LMCache has quietly become the unsung hero in the LLM inference world. As a core component…