Bringing State-Of-The-Art PD Speed to vLLM v1 with LMCache

By LMCache Team


TL;DR:
In our previous blog, we introduced LMCache’s integration with vLLM v1 and NVIDIA’s NIXL (the transfer library used in Dynamo), enabling Prefill-Decode (PD) disaggregation for LLM inference. Today, we’re excited to share benchmark results that confirm this system achieves state-of-the-art PD performance, keeping both time-to-first-token (TTFT) and inter-token latency (ITL) consistently low.

Here’s an example result (scroll down for more details and results):

[Figure: example benchmark result]

This integration is the result of an active collaboration with the NVIDIA Dynamo team, aimed at improving performance based on insights gained from LMCache.


Why This Matters: LLM Inference on the Critical Path

Large language models are no longer confined to offline batch tasks—they are increasingly on the critical path of real-time, latency-sensitive applications like:

  • Real-time copilots in IDEs (e.g., GitHub Copilot, Cursor)
  • Voice assistants and smart reply systems
  • Conversational agents for customer support (e.g., Intercom, Ada)
  • Real-time search and retrieval-augmented generation (RAG) systems

In these settings, it’s not enough for a model to eventually return results—the generation process needs to be fast and smooth. The two most important latency metrics are:

  • Time-to-first-token (TTFT): How fast the model starts responding
  • Inter-token latency (ITL): How consistently fast the tokens are streamed afterward

In short, TTFT determines perceived responsiveness, while ITL ensures that long-form completions don’t lag or stall midway. For enterprise-grade deployments, both must be tightly bounded to meet service-level objectives (SLOs).
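To make the two metrics concrete, here is a minimal sketch (not taken from any specific client library) of how TTFT and ITL can be computed from the arrival timestamps of a streamed response:

    # Minimal sketch: derive TTFT and ITL from per-token arrival timestamps.
    # Field names and the percentile choice are illustrative.

    def latency_metrics(request_sent_at: float, token_times: list[float]) -> dict:
        """token_times: wall-clock arrival time of each streamed token, in seconds."""
        ttft = token_times[0] - request_sent_at                        # time to first token
        itls = [b - a for a, b in zip(token_times, token_times[1:])]   # inter-token gaps
        return {
            "ttft_s": ttft,
            "mean_itl_s": sum(itls) / len(itls) if itls else 0.0,
            "p95_itl_s": sorted(itls)[int(0.95 * (len(itls) - 1))] if itls else 0.0,
        }

    # Example: first token arrives after 0.35 s, then a steady 20 ms per token.
    print(latency_metrics(0.0, [0.35 + 0.02 * i for i in range(50)]))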


Enter Prefill-Decode (PD) Disaggregation

A promising solution to this problem, Prefill-Decode (PD) Disaggregation, has become the default architecture in many cutting-edge inference stacks. The core idea is to separate the “prefill” and “decode” stages of inference into different processes, allowing decoders to stream tokens without being blocked by ongoing prefill jobs.

Why does PD matter?

  • Without PD: New requests preempt ongoing decoding, leading to high tail latencies for long-form completions.
  • With PD: Decoders are dedicated to generating tokens, leading to smooth, uninterrupted streaming.

PD is now adopted in several production-grade inference engines and internal deployments at top-tier AI platforms.
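As a toy illustration of the "with vs. without PD" contrast above, the sketch below models one decoding request on a single GPU, with and without prefill jobs preempting it. The step costs are made up purely for illustration; this is not a benchmark of any real system.

    # Toy timeline model: one streaming request, with and without prefill
    # jobs preempting the same GPU. All costs are assumed, not measured.

    def simulate_itl(num_tokens: int, prefill_arrivals_ms: list[float],
                     prefill_ms: float = 300.0, decode_ms: float = 20.0) -> list[float]:
        """Return inter-token latencies (ms) for one streaming request.

        Without PD, an arriving prefill job runs to completion on the same GPU
        before the next decode step; with PD, pass an empty arrival list.
        """
        t, finish_times = 0.0, []
        pending = sorted(prefill_arrivals_ms)
        for _ in range(num_tokens):
            while pending and pending[0] <= t:
                pending.pop(0)
                t += prefill_ms          # prefill preempts the decode step
            t += decode_ms
            finish_times.append(t)
        return [b - a for a, b in zip(finish_times, finish_times[1:])]

    no_pd = simulate_itl(100, prefill_arrivals_ms=[200, 700, 1300])
    with_pd = simulate_itl(100, prefill_arrivals_ms=[])
    print(f"max ITL without PD: {max(no_pd):.0f} ms")   # spikes to ~320 ms
    print(f"max ITL with PD:    {max(with_pd):.0f} ms") # steady 20 ms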


The Gap in vLLM v1 — Until Now

vLLM is arguably the most popular open-source inference engine for LLMs, known for its paged attention and efficient scheduling. With the release of vLLM v1, major improvements were introduced, such as chunked prefill and smarter memory management.

Yet, PD was missing from the official vLLM stack—especially the XpYd setup (X prefillers, Y decoders), which is widely used in production to maximize multi-node throughput.

This is where LMCache enters.

In our previous blog, we announced LMCache’s integration with vLLM v1 and NVIDIA’s NIXL to support Disaggregated Prefill.

LMCache’s upstream support in vLLM


LMCache + vLLM v1 + NIXL = PD at State-of-the-Art Speed

Today, we present the benchmark results that validate its performance, using vLLM v1 and Dynamo’s latest release (as of 04/28) as reference points. We used the benchmark script from vllm/benchmarks.

Launch parameters

  • Dynamo: We follow rainj-me’s PR in vLLM to reduce unnecessary gateway overhead, and then build prefill-decode connections using the add_remote_prefill_eps interface.
  • LMCache: Launch parameters are the same as in https://github.com/vllm-project/vllm/blob/main/examples/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh, except that the ports differ in the 2P1D scenario (a rough sketch of the resulting configuration follows below).
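For readers who want a concrete picture without opening the script, the sketch below prints the two vllm serve commands a 1P1D launch roughly corresponds to. The JSON passed to --kv-transfer-config and the port numbers are our assumptions based on the linked disagg_vllm_launcher.sh; consult that script for the authoritative connector name, role strings, ports, and LMCache environment variables.

    # Illustrative only: approximate shape of a 1P1D launch.
    # The kv-transfer-config contents and ports are assumptions; the linked
    # launcher script in the vLLM repository is the authoritative reference.
    import json

    MODEL = "meta-llama/Llama-3.1-8B-Instruct"

    def serve_cmd(port: int, kv_role: str) -> str:
        kv_cfg = {"kv_connector": "LMCacheConnectorV1", "kv_role": kv_role}
        return f"vllm serve {MODEL} --port {port} --kv-transfer-config '{json.dumps(kv_cfg)}'"

    print(serve_cmd(8100, "kv_producer"))   # prefiller
    print(serve_cmd(8200, "kv_consumer"))   # decoder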

Setup #1: 1P1D

  • Topology: 1 prefiller + 1 decoder (1p1d)
  • Model: meta-llama/Llama-3.1-8B-Instruct
  • Hardware: an 8x H100 node with NVLink
  • Workload: 3.6 queries per second, each with 8000 input tokens (prefill) + 200 output tokens (decode)
  • Benchmark command:
    python3 benchmark_serving.py --port 8080 --seed $(date +%s) \
          --model meta-llama/Llama-3.1-8B-Instruct \
          --dataset-name random --random-input-len 8000 --random-output-len 200 \
          --num-prompts 200 --burstiness 100 --request-rate 3.6 --metric-percentiles 95 \
          --backend openai-chat --endpoint /v1/chat/completions --ignore-eos | tee benchmark.log
    

Results:

[Figure: 1P1D benchmark results]

Setup #2: 2P1D

  • Topology: 2 prefillers + 1 decoder (2p1d), using 3 GPUs connected via NVLink
  • Model: meta-llama/Llama-3.1-8B-Instruct
  • Hardware: an 8x H100 node with NVLink
  • Workload: 5.5 queries per second, each with 10000 input tokens (prefill) + 100 output tokens (decode)
  • Benchmark command:
    python3 benchmark_serving.py --port 8080 --seed $(date +%s) \
          --model meta-llama/Llama-3.1-8B-Instruct \
          --dataset-name random --random-input-len 10000 --random-output-len 100 \
          --num-prompts 250 --burstiness 100 --request-rate 5.5 --metric-percentiles 95 \
          --backend openai-chat --endpoint /v1/chat/completions --ignore-eos | tee benchmark.log
    

Results:

[Figure: 2P1D benchmark results]

Key takeaway:

vLLM v1 with PD (via LMCache) outperforms vLLM v1 without PD, and achieves performance comparable to Dynamo on both TTFT and ITL.


Design Deep Dive: How We Achieve It

The core challenge of any PD system lies in efficiently transferring the KV cache between prefillers and decoders.

Here’s how we do it:

  1. KV cache is extracted from vLLM’s paged memory
  2. LMCache collects and assembles the KV into a GPU-resident contiguous buffer
  3. The buffer is sent over the network via NIXL to decoders

[Figure: KV cache transfer pipeline from prefiller to decoder]

Why Use a Buffer?

At first glance, adding a buffer sounds like extra overhead. But let’s consider the alternative: sending KV cache block by block directly from vLLM’s paged memory. With a default block size of 16 tokens, this results in many tiny transfers, each underutilizing bandwidth, even with NIXL.

In contrast, LMCache batches the KV blocks into a single large buffer, achieving:

  • High GPU-to-NIC throughput
  • Minimal fragmentation
  • Near-zero memory copy overhead (since it’s all in GPU memory)

It’s analogous to how OSes use I/O buffers to reduce syscall overhead and improve disk/network performance.
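Concretely, the assemble-then-send step might look like the PyTorch sketch below: scattered 16-token pages are gathered into one contiguous GPU buffer, and the whole prompt’s KV goes out as a single transfer. The tensor shapes, names, and the nixl_send placeholder are ours for illustration, not LMCache’s actual internals.

    # Illustrative PyTorch sketch of the gather step. Shapes and nixl_send()
    # are stand-ins, not LMCache's real code paths.
    import torch

    BLOCK_SIZE = 16                                  # vLLM's default page size (tokens)
    NUM_BLOCKS, KV_HEADS, HEAD_DIM = 1024, 8, 128
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # A paged KV pool, roughly as vLLM holds it: [num_blocks, block_size, heads, head_dim]
    paged_k = torch.randn(NUM_BLOCKS, BLOCK_SIZE, KV_HEADS, HEAD_DIM,
                          device=device, dtype=torch.float16)

    def assemble(block_ids: list[int]) -> torch.Tensor:
        """Copy the request's pages into one contiguous, GPU-resident buffer."""
        idx = torch.tensor(block_ids, device=paged_k.device)
        return paged_k.index_select(0, idx).contiguous()   # one batched gather

    def nixl_send(buf: torch.Tensor) -> None:
        """Placeholder for the NIXL transfer of the assembled buffer."""
        pass

    # A ~7500-token prompt spans ~469 pages: one large transfer instead of
    # hundreds of tiny per-page sends.
    nixl_send(assemble(list(range(469))))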

Why Page Size Matters

KV cache transfer speed varies significantly with vLLM’s page size. A smaller page means more tiny KV transfers; a larger page introduces memory fragmentation and hurts prefix caching.

LMCache solves this by buffering at send-time, decoupling network transfer efficiency from page size.

For example, transferring the KV cache of a 7500-token input over NVLink takes:

  • 20ms, if page size is 16 tokens
  • 8ms, if page size is 128 tokens
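As a rough sanity check on those numbers, the back-of-envelope arithmetic below estimates the raw KV payload for a 7500-token Llama-3.1-8B prompt. The model dimensions (32 layers, 8 KV heads via GQA, head dim 128, fp16) are the published ones; the NVLink bandwidth figure is an assumption. The point is that the payload is only on the order of a gigabyte, so most of the gap between 20 ms and 8 ms comes from per-transfer overhead at small page sizes rather than from bandwidth.

    # Back-of-envelope estimate of the KV payload for a 7500-token prompt on
    # Llama-3.1-8B (32 layers, 8 KV heads via GQA, head dim 128, fp16).
    LAYERS, KV_HEADS, HEAD_DIM, BYTES = 32, 8, 128, 2
    TOKENS = 7500

    kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES   # K and V
    print(f"KV per token: {kv_bytes_per_token / 2**10:.0f} KiB")    # ~128 KiB
    print(f"Total for {TOKENS} tokens: {TOKENS * kv_bytes_per_token / 2**30:.2f} GiB")

    # At an assumed ~400 GB/s effective NVLink bandwidth, ~1 GiB moves in a
    # few milliseconds, so most of the 20 ms seen with 16-token pages is
    # fixed per-transfer overhead (7500 / 16 ≈ 469 separate sends).
    NVLINK_BPS = 400e9
    print(f"Raw transfer time: {TOKENS * kv_bytes_per_token / NVLINK_BPS * 1e3:.1f} ms")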

Why Not Just Increase Page Size in vLLM?

Great question.

vLLM sets a default block size of 16 tokens for a reason—prefix caching. Larger blocks reduce prefix hit rates unless complex partial-matching mechanisms are added. Moreover, large blocks fragment GPU memory, clashing with the paged attention philosophy.

Again, this is similar to OS memory paging—smaller pages allow fine-grained caching and less fragmentation.
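A quick worked example of the prefix-caching cost: since reuse happens at whole-block granularity (absent partial-block matching), larger blocks leave more of a shared prefix unreused. The helper below is ours, purely for illustration:

    # Worked example: prefix-cache reuse at whole-block granularity.
    def reusable_tokens(shared_prefix_len: int, block_size: int) -> int:
        return (shared_prefix_len // block_size) * block_size

    for bs in (16, 128):
        print(f"block size {bs:>3}: 1000-token prefix -> {reusable_tokens(1000, bs)} tokens reused, "
              f"120-token prefix -> {reusable_tokens(120, bs)} tokens reused")
    # block size  16: 1000-token prefix -> 992 tokens reused, 120-token prefix -> 112 tokens reused
    # block size 128: 1000-token prefix -> 896 tokens reused, 120-token prefix -> 0 tokens reused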

By leveraging LMCache’s decoupled buffering, we get the best of both worlds:

  • Retain small pages for better memory usage and prefix hits
  • Still achieve efficient KV transfer for PD

Final Notes

Disaggregated Prefill is not just a nice-to-have—it’s becoming foundational to high-performance LLM inference.

With LMCache’s PD support for vLLM v1 + NIXL, we bring production-grade PD to open-source stacks, with state-of-the-art performance and a robust, future-proof architecture.

To be clear, we don’t claim LMCache’s current design is optimal. The LMCache Lab works closely with the vLLM team to explore better designs for PD disaggregation.

Stay tuned—we’re just getting started.
