
Bringing State-Of-The-Art PD Speed to vLLM v1 with LMCache

By the LMCache Team

TL;DR:
In our previous blog post, we introduced **LMCache**’s integration with vLLM v1 and NVIDIA’s NIXL (the transfer library used in Dynamo), enabling Prefill-Decode (PD) Disaggregation for LLM inference. Today, we’re excited to share benchmark results confirming that this system achieves state-of-the-art PD performance, balancing time-to-first-token (TTFT) and inter-token latency (ITL) with unprecedented consistency.

Here’s an example result (scroll down for more details and results):

[Figure: example benchmark result]

This integration results from an active collaboration with NVIDIA Dynamo, with the goal of enhancing performance based on insights gained from LMCache.

Enter Prefill-Decode (PD) Disaggregation

Prefill-Decode (PD) Disaggregation, a promising solution to the problem of prefill work stalling ongoing token generation, has become the default architecture in many cutting-edge inference stacks. The core idea is to separate the “prefill” and “decode” stages of inference into different processes, allowing decoders to stream tokens without being blocked by ongoing prefill jobs.

Why does PD matter?

  • Without PD: New requests preempt ongoing decoding, leading to high tail latencies for long-form completions.
  • With PD: Decoders are dedicated to generating tokens, leading to smooth, uninterrupted streaming.

PD is now adopted in several production-grade inference engines and internal deployments at top-tier AI platforms.
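
To make the separation concrete, here is a minimal conceptual sketch of the disaggregated flow. It is purely illustrative (not LMCache or vLLM code), with placeholder names, and it uses threads and strings where real deployments use separate processes on separate GPUs and real KV tensors.

# Conceptual sketch of PD disaggregation -- illustrative only, not LMCache/vLLM code.
import queue
import threading

def prefill_worker(requests, handoff: queue.Queue) -> None:
    # The compute-heavy stage: build the prompt's KV cache, then hand it off.
    for req in requests:
        kv_cache = f"kv({req['prompt']})"   # stand-in for the real KV tensors
        handoff.put((req, kv_cache))
    handoff.put(None)                       # signal that no more requests are coming

def decode_worker(handoff: queue.Queue) -> None:
    # The latency-sensitive stage: stream tokens using the received KV cache.
    # No incoming prefill job can preempt this loop, because prefill runs elsewhere.
    while (item := handoff.get()) is not None:
        req, kv_cache = item
        for step in range(req["max_new_tokens"]):
            print(f"{req['id']}: token {step}")

if __name__ == "__main__":
    handoff: queue.Queue = queue.Queue()
    requests = [
        {"id": "r1", "prompt": "hello", "max_new_tokens": 2},
        {"id": "r2", "prompt": "world", "max_new_tokens": 2},
    ]
    producer = threading.Thread(target=prefill_worker, args=(requests, handoff))
    producer.start()
    decode_worker(handoff)
    producer.join()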

LMCache + vLLM v1 + NIXL = PD at State-of-the-Art Speed

Today, we present benchmark results that validate the performance of this integration, using vLLM v1 and Dynamo’s latest release (as of 04/28) as reference points. We used the benchmark script from vllm/benchmarks.

Launch parameters

Setup #1: 1P1D

  • Topology: 1 prefiller + 1 decoder (1p1d)
  • Model: meta-llama/Llama-3.1-8B-Instruct
  • Hardware: an 8x H100 node with NVLink
  • Workload: 3.6 queries per second, each with 8000 input tokens (prefill) + 200 output tokens (decode)
  • Benchmark command
python3 benchmark_serving.py --port 8080 --seed $(date +%s) \
      --model meta-llama/Llama-3.1-8B-Instruct \
      --dataset-name random --random-input-len 8000 --random-output-len 200 \
      --num-prompts 200 --burstiness 100 --request-rate 3.6 --metric-percentiles 95 \
      --backend openai-chat --endpoint /v1/chat/completions --ignore-eos | tee benchmark.log

Results:

[Figure: 1P1D benchmark results]

Setup #2: 2P1D

  • Topology: 2 prefillers + 1 decoder (2p1d), using 3 GPUs on one node with NVLink
  • Model: meta-llama/Llama-3.1-8B-Instruct
  • Hardware: an 8x H100 node with NVLink
  • Workload: 5.5 queries per second, each with 10000 input tokens (prefill) + 100 output tokens (decode)
  • Benchmark command:
python3 benchmark_serving.py --port 8080 --seed $(date +%s) \
      --model meta-llama/Llama-3.1-8B-Instruct \
      --dataset-name random --random-input-len 10000 --random-output-len 100 \
      --num-prompts 250 --burstiness 100 --request-rate 5.5 --metric-percentiles 95 \
      --backend openai-chat --endpoint /v1/chat/completions --ignore-eos | tee benchmark.log

Results:

[Figure: 2P1D benchmark results]

Key takeaway:

vLLM v1 (w/ PD) outperforms vLLM (w/o PD) and achieves performance similar to Dynamo across both TTFT and ITL.

Design Deep Dive: How We Achieve It

The core challenge of any PD system lies in efficiently transferring the KV cache between prefillers and decoders.

Here’s how we do it:

  1. The KV cache is extracted from vLLM’s paged memory.
  2. LMCache collects and assembles the KV blocks into a GPU-resident contiguous buffer.
  3. The buffer is sent over the network to the decoders via NIXL.
[Figure: KV cache transfer path from the prefiller to the decoder via NIXL]
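
As a rough sketch of steps 1 and 2 (our simplification, not the actual LMCache code), the snippet below gathers a request’s scattered KV blocks from a paged layout into one contiguous GPU tensor that can then be handed to the transport as a single large transfer. The tensor shape and names are assumed for illustration and correspond to a single layer’s cache.

# Minimal sketch: batch scattered KV blocks into one contiguous GPU buffer.
# Shapes and names are assumptions, not LMCache internals.
import torch

def assemble_kv_buffer(paged_kv: torch.Tensor, block_ids: list[int]) -> torch.Tensor:
    # paged_kv: [num_blocks, 2 (K/V), block_size, num_kv_heads, head_dim]
    # block_ids: the blocks holding this request's tokens, in order.
    idx = torch.tensor(block_ids, device=paged_kv.device)
    # One batched gather on the GPU yields a single contiguous tensor,
    # which can be shipped as one large transfer instead of many tiny ones.
    return paged_kv.index_select(0, idx).contiguous()

if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    paged_kv = torch.randn(1024, 2, 16, 8, 128, dtype=torch.float16, device=device)
    buf = assemble_kv_buffer(paged_kv, block_ids=[3, 17, 42, 7])
    print(buf.shape)  # torch.Size([4, 2, 16, 8, 128]), ready for one send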

Why Use a Buffer?

At first glance, adding a buffer sounds like extra overhead. But let’s consider the alternative: sending KV cache block by block directly from vLLM’s paged memory. With a default block size of 16 tokens, this results in many tiny transfers, each underutilizing bandwidth, even with NIXL.

In contrast, LMCache batches the KV blocks into a single large buffer, achieving:

  • High GPU-to-NIC throughput
  • Minimal fragmentation
  • Near-zero memory copy overhead (since it’s all in GPU memory)

It’s analogous to how OSes use I/O buffers to reduce syscall overhead and improve disk/network performance.

Why Page Size Matters

KV cache transfer speed varies significantly with vLLM’s page size. A smaller page means more tiny KV transfers; a larger page introduces memory fragmentation and hurts prefix caching.

LMCache solves this by buffering at send-time, decoupling network transfer efficiency from page size.

The delay to transfer the KV cache of a 7500-token input over NVLink:

  • 20ms, if page size is 16 tokens
  • 8ms, if page size is 128 tokens
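
The gap comes from transfer granularity rather than total data volume. As a back-of-envelope illustration (assuming Llama-3.1-8B’s configuration of 32 layers, 8 KV heads, head dimension 128, and fp16 KV values; the numbers below are our own estimates, not measurements), the payload is roughly 1 GB at either page size, but a 16-token page splits it into 8x more, 8x smaller transfers, each paying its own per-transfer overhead:

# Back-of-envelope arithmetic; model config values are assumptions for illustration.
BYTES_PER_TOKEN = 2 * 32 * 8 * 128 * 2   # K+V, layers, KV heads, head_dim, fp16 bytes
PROMPT_TOKENS = 7500

for block_size in (16, 128):
    num_blocks = -(-PROMPT_TOKENS // block_size)     # ceiling division
    bytes_per_block = block_size * BYTES_PER_TOKEN
    total_gb = num_blocks * bytes_per_block / 1e9
    print(f"block_size={block_size:>3}: {num_blocks} transfers of "
          f"{bytes_per_block / 1e6:.1f} MB each (~{total_gb:.2f} GB total)")
# block_size= 16: 469 transfers of 2.1 MB each (~0.98 GB total)
# block_size=128: 59 transfers of 16.8 MB each (~0.99 GB total)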

Why Not Just Increase Page Size in vLLM?

Great question.

vLLM sets a default block size of 16 tokens for a reason—prefix caching. Larger blocks reduce prefix hit rates unless complex partial-matching mechanisms are added. Moreover, large blocks fragment GPU memory, clashing with the paged attention philosophy.

Again, this is similar to OS memory paging—smaller pages allow fine-grained caching and less fragmentation.
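
As a simple illustration of the prefix-hit effect (our own arithmetic, not vLLM internals): block-granular prefix caching can only reuse whole blocks of a shared prefix, so a hypothetical 1000-token shared prefix loses more reusable tokens as the block size grows.

# Illustrative arithmetic: reusable prefix tokens under block-granular caching.
SHARED_PREFIX_TOKENS = 1000   # hypothetical shared prefix length

for block_size in (16, 128):
    reusable = (SHARED_PREFIX_TOKENS // block_size) * block_size
    print(f"block_size={block_size:>3}: {reusable} of "
          f"{SHARED_PREFIX_TOKENS} prefix tokens hit the cache")
# block_size= 16: 992 of 1000 prefix tokens hit the cache
# block_size=128: 896 of 1000 prefix tokens hit the cache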

By leveraging LMCache’s decoupled buffering, we get the best of both worlds:

  • Retain small pages for better memory usage and prefix hits
  • Still achieve efficient KV transfer for PD

Final Notes

Disaggregated Prefill is not just a nice-to-have—it’s becoming foundational to high-performance LLM inference.

With LMCache’s PD support for vLLM v1 + NIXL, we bring production-grade PD to open-source stacks, with state-of-the-art performance and a robust, future-proof architecture.

In all fairness, we don’t think LMCache’s current design is optimal. The LMCache Lab works closely with the vLLM team to explore better designs for PD disaggregation.

Stay tuned—we’re just getting started.
