A flexible plugin system for enhanced observability and management. Abstract: In large-scale language model inference scenarios, efficient memory management and KV cache optimization are crucial. LMCache, a KV cache management system designed specifically for vLLM, needs more flexible extension mechanisms for monitoring, troubleshooting, and state introspection when facing complex production…
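
To make the idea concrete, here is a minimal sketch of what an observability plugin hook could look like; the `ObservabilityPlugin`, `PluginManager`, and event names below are illustrative assumptions, not LMCache's actual plugin API:

```python
# Hypothetical sketch of a plugin hook for observability events.
# The interface below is illustrative only; it is NOT LMCache's real plugin API.
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class ObservabilityPlugin(ABC):
    @abstractmethod
    def on_event(self, name: str, payload: Dict[str, Any]) -> None:
        """Called when the cache emits an event (e.g. hit, miss, eviction)."""


class PluginManager:
    """Keeps registered plugins and fans out events to all of them."""

    def __init__(self) -> None:
        self._plugins: List[ObservabilityPlugin] = []

    def register(self, plugin: ObservabilityPlugin) -> None:
        self._plugins.append(plugin)

    def emit(self, name: str, payload: Dict[str, Any]) -> None:
        for plugin in self._plugins:
            plugin.on_event(name, payload)


class LoggingPlugin(ObservabilityPlugin):
    """Simplest possible plugin: print every event for troubleshooting."""

    def on_event(self, name: str, payload: Dict[str, Any]) -> None:
        print(f"[lmcache-event] {name}: {payload}")


if __name__ == "__main__":
    manager = PluginManager()
    manager.register(LoggingPlugin())
    manager.emit("kv_cache_hit", {"request_id": "r1", "tokens_reused": 512})
```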

TL;DR: ⚡ Shortest Prefill First (SPF) scheduling cuts LLM time-to-first-token by up to 18% in prefill-decode disaggregation, unlocking even greater gains when combined with LMCache! At LMCache Lab, we’re obsessed with LLM performance. As prefill-decode disaggregation becomes the norm, we spotted a major, untapped scheduling opportunity for prefill nodes. That’s why we developed SPF (Shortest Prefill First,…
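
As a rough illustration of the policy (not LMCache's scheduler code), the sketch below orders pending prefill requests by prompt length with a simple heap; the `Request` class and its fields are hypothetical:

```python
# Illustrative sketch of a Shortest-Prefill-First (SPF) queue for a prefill node.
# `Request` and `SPFQueue` are hypothetical; the real scheduler API may differ.
import heapq
from dataclasses import dataclass
from itertools import count


@dataclass
class Request:
    request_id: str
    prompt_len: int  # number of prompt tokens that must be prefilled


class SPFQueue:
    """Pops the request with the fewest prompt tokens first, so short
    prefills are not stuck behind long ones (reducing time-to-first-token)."""

    def __init__(self) -> None:
        self._heap = []
        self._tie = count()  # FIFO tie-breaker for equal prompt lengths

    def push(self, req: Request) -> None:
        heapq.heappush(self._heap, (req.prompt_len, next(self._tie), req))

    def pop(self) -> Request:
        return heapq.heappop(self._heap)[-1]


if __name__ == "__main__":
    q = SPFQueue()
    q.push(Request("a", prompt_len=4096))
    q.push(Request("b", prompt_len=128))
    q.push(Request("c", prompt_len=1024))
    # Shortest prefill is served first: b, c, a
    print([q.pop().request_id for _ in range(3)])
```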

TL;DR: LLMs are transforming every product and service—from chatbots and copilots to intelligent document search and enterprise workflows. But running LLMs in production is still painfully slow, prohibitively expensive, and complex to manage. That changes today. We’re excited to announce the launch of LMIgnite — the first one-click deployable high-performance LLM inference backend for Conversational…

TL;DR: LLMs are rapidly becoming the dominant workload in enterprise AI. As more applications rely on real-time generation, inference performance — measured in speed, cost, and reliability — becomes the key bottleneck. Today, the industry focuses primarily on speeding up inference engines like vLLM, SGLang, and TensorRT. But in doing so, we’re overlooking a much…

TL;DR: The latest LMCache release plugs seamlessly into vLLM’s new multimodal stack. By hashing image-side tokens (mm_hashes) and caching their key-value (KV) pairs, LMCache reuses vision embeddings across requests, slashing time-to-first-token and GPU memory for visual LLMs. Summary — Why This Matters: Multimodal large language models (MLLMs) multiply KV-cache traffic: every image can add thousands of “vision…
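
The sketch below illustrates the caching idea under simplifying assumptions: image bytes are content-hashed (mirroring vLLM-style `mm_hashes`) and used as keys for precomputed KV entries. The `VisionKVCache` class and helper names are hypothetical, not LMCache's real interface:

```python
# Illustrative sketch of reusing KV entries keyed by image content hashes.
# The cache structure and names here are hypothetical; they only mirror the
# idea of keying vision KV pairs by vLLM-style mm_hashes.
import hashlib


def mm_hash(image_bytes: bytes) -> str:
    """Content hash of an image; identical images map to the same cache key."""
    return hashlib.sha256(image_bytes).hexdigest()


class VisionKVCache:
    def __init__(self) -> None:
        self._store = {}  # mm_hash -> precomputed KV tensors (opaque here)

    def get_or_compute(self, image_bytes: bytes, compute_kv):
        key = mm_hash(image_bytes)
        if key in self._store:
            # Cache hit: skip the vision encoder + prefill for these tokens.
            return self._store[key]
        kv = self._store[key] = compute_kv(image_bytes)  # expensive first pass
        return kv


if __name__ == "__main__":
    cache = VisionKVCache()
    calls = {"n": 0}

    def fake_compute(img: bytes):
        calls["n"] += 1
        return f"kv({len(img)} bytes)"

    img = b"\x89PNG...same image bytes..."
    cache.get_or_compute(img, fake_compute)  # computed on first sight
    cache.get_or_compute(img, fake_compute)  # served from the cache
    print(calls["n"])  # 1 -> the second request reused the cached entry
```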

TL;DR: Our LLM Production Stack project just hit another milestone. We’re integrating with more hardware accelerators — including Ascend, Arm, and AMD — signaling growing maturity and broader applicability across enterprise and research settings. 🚀 LMCache Is Gaining Traction LMCache has quietly become the unsung hero in the LLM inference world. As a core component…

TL;DR: Why vLLM Production Stack? AGI isn’t just about better models; it is also about better systems that serve those models to the wider public, so that everyone has access to these new capabilities! To fully harness the power of Generative AI, every organization that takes this AI revolution seriously needs to have…
