Introduction: LLM inference becomes increasingly challenging as context length grows and workloads scale. Traditional serving engines rely on prefix-based KV cache reuse, which limits opportunities for optimization, especially when processing long, repeated, or overlapping text across different requests. LMCache addresses this challenge. It is an extension to LLM serving engines that dramatically reduces time-to-first-token (TTFT)…
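To make the prefix-reuse limitation concrete, here is a minimal, purely illustrative sketch (not LMCache's actual API): a cached KV entry helps only when a new request starts with exactly the same tokens, so repeated or overlapping text that is not a strict prefix gets no benefit. All class and variable names below are hypothetical.

```python
# Illustrative only: prefix-based KV cache lookup.
# A cached entry is reusable only if the new request begins with exactly
# the same tokens; overlapping text elsewhere in the prompt gets no hit.
from typing import Dict, List, Optional, Tuple

class PrefixKVCache:
    def __init__(self) -> None:
        # Maps a token prefix (as a tuple) to an opaque KV-cache handle.
        self._store: Dict[Tuple[int, ...], object] = {}

    def put(self, tokens: List[int], kv_handle: object) -> None:
        self._store[tuple(tokens)] = kv_handle

    def longest_prefix_hit(self, tokens: List[int]) -> Optional[Tuple[int, object]]:
        """Return (matched_length, kv_handle) for the longest cached strict prefix."""
        best = None
        for prefix, handle in self._store.items():
            n = len(prefix)
            if n <= len(tokens) and tuple(tokens[:n]) == prefix:
                if best is None or n > best[0]:
                    best = (n, handle)
        return best

# A request that merely contains the cached text (but not as a prefix)
# gets no reuse, so its KV must be recomputed from scratch.
cache = PrefixKVCache()
cache.put([1, 2, 3, 4], kv_handle="kv-A")
print(cache.longest_prefix_hit([1, 2, 3, 4, 5, 6]))  # (4, 'kv-A') -> reuse
print(cache.longest_prefix_hit([9, 1, 2, 3, 4]))     # None -> no reuse despite overlap
```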

Over the last few months, Claude Code has quietly become one of the most interesting and widely adopted real-world agentic systems available to everyday developers. Unlike cloud-only agents such as Perplexity, Devin, or Manus, whose internals remain hidden behind API gateways, and unlike fully open-source agents such as Mini SWE Agent or Terminus 2, where you can…

The challenge: Scaling enterprise AI. Enterprises today are racing to integrate large language models (LLMs) into their products and workflows, but doing so at scale brings challenges in performance, cost, and accuracy. Organizations need models grounded in their specific data, while ensuring that this information remains private. Cohere, one of the leading…

Overview of the Collaboration: The KV Cache is a memory optimization that makes Large Language Models (LLMs) run the forward pass faster by storing the Key (K) and Value (V) matrices so the model does not recalculate them for the entire text sequence with every newly generated token. Maximizing the KV Cache hit rate with storage is…
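As a quick refresher on the mechanism described above, the sketch below shows single-token decoding with a KV cache in plain NumPy: each step computes K and V only for the newest token and appends them to the cache instead of recomputing them for the whole sequence. This is a conceptual illustration, not code from the collaboration.

```python
# Conceptual sketch: why a KV cache speeds up decoding.
# Without it, every new token would recompute K and V for the entire
# sequence; with it, only the newest token's K/V are computed and appended.
import numpy as np

d = 8                                # head dimension (illustrative)
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
K_cache, V_cache = [], []            # one cached K and V row per past token

def decode_step(x_new: np.ndarray) -> np.ndarray:
    """Attention for one new token embedding x_new, reusing cached K/V."""
    q = x_new @ Wq
    K_cache.append(x_new @ Wk)       # only the new token's K/V are computed
    V_cache.append(x_new @ Wv)
    K, V = np.stack(K_cache), np.stack(V_cache)
    scores = q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

for _ in range(4):                   # four decode steps
    out = decode_step(np.random.randn(d))
print(out.shape)                     # (8,)
```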

LMCache now supports OpenAI’s newly released GPT-OSS models (20B and 120B parameters) from day one! This post provides a complete guide to setting up vLLM with LMCache for GPT-OSS models and demonstrates significant performance improvements through our CPU offloading capabilities. Step 1: Installing the vLLM GPT-OSS version (Installation, Test the Installation); Step 2: Install LMCache…
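Once the server from the steps above is running, a quick way to sanity-check the setup is to hit vLLM's OpenAI-compatible endpoint. The port and model id below are assumptions for illustration and should be adjusted to your deployment; the snippet is not taken from the post.

```python
# Smoke test against a locally running vLLM + LMCache server.
# Assumptions (not from the post): the server listens on localhost:8000
# and serves the model id "openai/gpt-oss-20b"; adjust both as needed.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```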

TL;DR: ⚡ Shortest Prefill First (SPF) scheduling cuts LLM time-to-first-token by up to 18% in prefill-decode disaggregation—unlocking even greater gains when combined with LMCache! At LMCache Lab, we’re obsessed with LLM performance. As prefill-decode disaggregation becomes the norm, we spotted a major, untapped scheduling opportunity for prefill nodes. That’s why we developed SPF (Shortest Prefill First,…
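The excerpt above only states the idea; the sketch below is a minimal reading of Shortest Prefill First, not the actual LMCache scheduler: a prefill node keeps a priority queue keyed by prompt-token count and always serves the shortest outstanding prefill, so short requests stop waiting behind long ones. All names are illustrative.

```python
# Illustrative sketch of the Shortest Prefill First idea: on a prefill node,
# pick the waiting request with the fewest prompt tokens first, which lowers
# average time-to-first-token when short and long prefills are mixed.
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class PrefillJob:
    prompt_tokens: int                        # heap key: shorter prefill first
    request_id: str = field(compare=False)

class SPFQueue:
    def __init__(self) -> None:
        self._heap: list = []

    def submit(self, request_id: str, prompt_tokens: int) -> None:
        heapq.heappush(self._heap, PrefillJob(prompt_tokens, request_id))

    def next_job(self) -> PrefillJob:
        return heapq.heappop(self._heap)      # shortest outstanding prefill

q = SPFQueue()
q.submit("req-long", 9000)
q.submit("req-short", 800)
q.submit("req-medium", 3000)
print([q.next_job().request_id for _ in range(3)])
# ['req-short', 'req-medium', 'req-long']
```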

TL;DR: 🚀 LMCache Lab cuts decoding latency for code/text editing by 60% with speculative decoding! ⚡ You might know LMCache Lab for our KV cache optimizations that make LLM prefilling a breeze. But that’s not all! We’re now focused on speeding up decoding too—so your LLM agents can generate new content even faster. In other…
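The excerpt does not show how the speedup is achieved; the sketch below illustrates the general accept/reject rule of speculative decoding, under the assumption that for editing tasks the draft tokens can simply be taken from the original text being edited. A real system verifies all draft positions in one batched forward pass; the sequential calls here only keep the example short, and every name is hypothetical.

```python
# Generic draft-and-verify sketch of speculative decoding for editing tasks
# (conceptual; not the LMCache Lab implementation). Intuition: when an LLM
# edits text, most output tokens copy the original, so the original can act
# as a cheap draft that the target model checks.
from typing import Callable, List

def speculative_edit(
    prefix: List[int],
    draft_tokens: List[int],                       # e.g. the unchanged original text
    target_next_token: Callable[[List[int]], int]  # target model: context -> next token
) -> List[int]:
    """Accept draft tokens while the target model agrees; stop at the first mismatch."""
    accepted: List[int] = []
    context = list(prefix)
    for drafted in draft_tokens:
        predicted = target_next_token(context)
        if predicted != drafted:
            accepted.append(predicted)             # correction from the target model
            break
        accepted.append(drafted)                   # draft token accepted as-is
        context.append(drafted)
    return accepted

# Toy usage: the "model" copies the draft except at position 2.
original = [10, 11, 12, 13, 14]
fake_target = lambda ctx: 99 if len(ctx) == 2 else original[len(ctx)]
print(speculative_edit([], original, fake_target))   # [10, 11, 99]
```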

TL;DR: In our previous blog, we introduced **LMCache**’s integration with vLLM v1 and NVIDIA’s NIXL, used in Dynamo, enabling Prefill-Decode Disaggregation (PD) for LLM inference. Today, we’re excited to share benchmark results that confirm this system achieves state-of-the-art PD performance, balancing time-to-first-token (TTFT) and inter-token latency (ITL) with unprecedented consistency. Here’s an example result (scroll down…
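For context on the two metrics being balanced, here is a small, generic sketch (not the benchmark harness from this post) of how TTFT and ITL can be computed from per-token arrival timestamps in a streamed response.

```python
# Generic definitions of the two metrics discussed here:
# TTFT = time from request submission to the first generated token,
# ITL  = average gap between consecutive generated tokens after the first.
from typing import List, Tuple

def ttft_and_itl(request_start: float, token_times: List[float]) -> Tuple[float, float]:
    """token_times are absolute timestamps (seconds) of each streamed token."""
    if not token_times:
        raise ValueError("no tokens were generated")
    ttft = token_times[0] - request_start
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, itl

# Toy usage with hand-written timestamps (seconds).
print(ttft_and_itl(0.0, [0.35, 0.40, 0.46, 0.51]))   # (0.35, ~0.053)
```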

A picture is worth a thousand words. Executive Summary: [vLLM Production Stack Github]. Benchmark setup, methods, and workload: Inspired by our production deployments, we create workloads that emulate a typical chatbot document-analysis workload. By default, each LLM query input has 9K tokens, including a document…
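To make the workload description concrete, here is a hypothetical generator in the same spirit: each query combines a long per-user document with a short fresh question, for roughly 9K input tokens. All sizes, seeds, and field names are illustrative assumptions, not the benchmark's exact configuration.

```python
# Hypothetical workload generator in the spirit of the described benchmark:
# a long shared document per user plus a short question per round, ~9K
# input tokens total. Sizes and names are illustrative assumptions.
import random

DOC_TOKENS = 8800        # assumed long-document portion
QUESTION_TOKENS = 200    # assumed chat/question portion

def fake_tokens(n: int, seed: int) -> list:
    rng = random.Random(seed)
    return [rng.randint(0, 50_000) for _ in range(n)]

def make_workload(num_users: int, rounds: int) -> list:
    """One shared document per user, re-sent with a fresh question each round."""
    requests = []
    for user in range(num_users):
        doc = fake_tokens(DOC_TOKENS, seed=user)          # reused across rounds
        for r in range(rounds):
            question = fake_tokens(QUESTION_TOKENS, seed=10_000 + user * rounds + r)
            requests.append({"user": user, "round": r, "input_ids": doc + question})
    random.shuffle(requests)                              # interleave users
    return requests

workload = make_workload(num_users=4, rounds=3)
print(len(workload), len(workload[0]["input_ids"]))       # 12 9000
```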
