A practitioner’s guide to KV-cache tiering on ROCm: what works, what doesn’t, and the regime where it actually matters.

Key Summary
We benchmarked multi-turn agentic workloads using 739 anonymized Claude Code conversation traces from kv-cache-tester against MiniMax-M2.5 (230 GB FP8 MoE) on 2× AMD MI300X with vLLM 0.19.0 + LMCache (built from source for…
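For readers who want to wire up a similar stack, here is a minimal sketch of connecting vLLM to LMCache through the KV-connector path with a CPU offload tier. The model id, sizes, and environment-variable values are illustrative placeholders, not the benchmark’s actual configuration, and exact flags can differ across vLLM and LMCache versions:

```python
# Sketch: vLLM + LMCache with a CPU-RAM KV tier (illustrative configuration,
# not the exact benchmark setup described above).
import os

# LMCache reads its configuration from environment variables at engine startup.
os.environ["LMCACHE_CHUNK_SIZE"] = "256"          # tokens per KV chunk
os.environ["LMCACHE_LOCAL_CPU"] = "True"          # enable the CPU-RAM tier
os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "40"   # CPU tier size in GB (placeholder)

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

llm = LLM(
    model="facebook/opt-125m",        # small placeholder; the post used a 230 GB MoE model
    tensor_parallel_size=2,           # e.g. 2x MI300X
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",
        kv_role="kv_both",            # both store and retrieve KV through LMCache
    ),
)

print(llm.generate(["Hello"], SamplingParams(max_tokens=8)))
```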

Modern LLM serving workloads are defined by strict latency requirements, high concurrency, and rapidly growing context lengths. Applications such as multi-turn chat, AI agents, and retrieval-augmented generation continuously build on prior context, leading to substantial reuse of previously computed states. In production, systems must minimize time-to-first-token (TTFT) while maintaining stable decoding throughput under heavy concurrent…

The standard approach to reducing LLM inference costs is prefix caching, which reuses previously computed token states to avoid redundant computation. In practice, however, this approach misses significant caching opportunities in real-world agentic workloads!

Caching in Agentic Workflows
In agentic workloads, shared content (e.g., retrieved contexts and documents) frequently appears across requests at varied positions,…
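A toy sketch (plain Python, no vLLM APIs) makes the limitation concrete: prefix caching keys each block on a hash of the entire token prefix, so the same document inserted at a different offset in two requests produces disjoint keys and is never reused:

```python
# Toy illustration of prefix-block caching: each block is keyed by the hash of
# the full token prefix up to and including that block, so a shared document
# placed at a different position yields different keys and no reuse.
from hashlib import sha256

BLOCK = 4  # tokens per block (real systems use larger blocks, e.g. 16)

def block_keys(tokens: list[str]) -> list[str]:
    keys = []
    for end in range(BLOCK, len(tokens) + 1, BLOCK):
        prefix = "|".join(tokens[:end])          # key depends on the full prefix
        keys.append(sha256(prefix.encode()).hexdigest()[:8])
    return keys

doc = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dogs"]
req_a = ["sys", "prompt", "user", "q1"] + doc                     # doc at offset 4
req_b = ["sys", "prompt", "tool", "out", "extra", "ctx", "user", "q2"] + doc  # doc at offset 8

shared = set(block_keys(req_a)) & set(block_keys(req_b))
print("blocks reused across requests:", len(shared))  # -> 0: no prefix match
```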

Baolong Mao (Tencent), Chunxiao Zheng (Tencent), Weishu Deng (Tensormesh), Darren Peng (Tensormesh), Samuel Shen (Tensormesh)

What is P2P and what does it promise?
In this blog post, we will go over:
Most production vLLM deployments run multiple identical instances behind a load balancer. Each instance builds its own KV cache only from the traffic it…
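As a conceptual sketch of the problem and of what peer-to-peer sharing buys (illustrative Python only, not the actual LMCache P2P API), independent per-instance caches duplicate prefill work, while a peer lookup lets one replica reuse KV another replica already computed:

```python
# Conceptual sketch: identical serving instances behind a load balancer.
# Without P2P sharing, each instance only reuses KV it computed itself;
# with P2P sharing, a local miss can be served from a peer's cache instead
# of redoing the prefill. (Not the actual LMCache P2P API.)

class Instance:
    def __init__(self, name):
        self.name = name
        self.cache = {}          # prefix hash -> KV entry (placeholder)
        self.peers = []          # other instances reachable over the network

    def serve(self, prefix_hash, p2p=False):
        if prefix_hash in self.cache:
            return f"{self.name}: local cache hit"
        if p2p:
            for peer in self.peers:
                if prefix_hash in peer.cache:
                    self.cache[prefix_hash] = peer.cache[prefix_hash]
                    return f"{self.name}: fetched KV from {peer.name}"
        self.cache[prefix_hash] = "recomputed-KV"
        return f"{self.name}: cache miss, full prefill recomputed"

a, b = Instance("vllm-0"), Instance("vllm-1")
a.peers, b.peers = [b], [a]

print(a.serve("doc-123"))              # miss: first time anywhere
print(b.serve("doc-123"))              # miss again: duplicate prefill without P2P
print(a.serve("doc-456"))              # miss: computed on vllm-0
print(b.serve("doc-456", p2p=True))    # fetched KV from vllm-0: no recompute
```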

The challenge: Scaling enterprise AI
Enterprises today are racing to integrate large language models (LLMs) into their products and workflows, but doing so at scale brings challenges in performance, cost, and accuracy. Organizations need models grounded in their own data while ensuring that data remains private. Cohere, one of the leading…

We’re thrilled to announce that the NVIDIA Dynamo project has integrated LMCache as its KV caching layer solution. This is a big milestone: Dynamo gains a battle-tested caching layer, and LMCache becomes part of a production-scale ecosystem used by many developers worldwide.

Why KV Caching Matters
KV caching is a foundational optimization for modern LLM…
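For readers new to the mechanism, a minimal single-head attention sketch (NumPy, purely illustrative) shows the idea: the key/value projections of past tokens never change during decoding, so caching them lets each new token attend over stored K/V instead of re-projecting the whole sequence:

```python
# Minimal single-head attention sketch showing why KV caching saves work:
# past keys/values are fixed, so each decode step appends one new K/V row
# and attends over the cache instead of recomputing all projections.
import numpy as np

d = 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

k_cache, v_cache = [], []          # the KV cache: one entry per past token

def decode_step(x_new):
    """Attend the newest token over all cached keys/values."""
    q = x_new @ Wq
    k_cache.append(x_new @ Wk)     # only one new K/V projection per step
    v_cache.append(x_new @ Wv)
    K, V = np.stack(k_cache), np.stack(v_cache)
    attn = softmax(q @ K.T / np.sqrt(d))
    return attn @ V

tokens = rng.standard_normal((5, d))   # stand-in embeddings for 5 decoded tokens
outputs = [decode_step(t) for t in tokens]
print(len(k_cache), "K/V entries cached after 5 decode steps")
```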

TL;DR
LMCache, the state-of-the-art KV cache layer library developed by TensorMesh and the project’s open-source community, delivers breakthrough performance improvements to modern enterprise LLM inference frameworks, including the vLLM Production Stack, KServe, and NVIDIA’s Dynamo. With fast and scalable caching of long-context KV cache, LMCache helps reduce inference costs and meet SLOs for both latency…
