TL;DR:
In our previous blog, we introduced **LMCache**’s integration with vLLM v1 and NVIDIA’s NIXL (the transfer library used in Dynamo), enabling Prefill-Decode (PD) Disaggregation for LLM inference. Today, we’re excited to share benchmark results that confirm this system achieves state-of-the-art PD performance, balancing time-to-first-token (TTFT) and inter-token latency (ITL) with unprecedented consistency.
Here’s an example result (scroll down for more details and results):
This integration is the result of an active collaboration with the NVIDIA Dynamo team, with the goal of improving performance based on insights gained from LMCache.
Enter Prefill-Decode (PD) Disaggregation
A promising solution to this interference problem, Prefill-Decode (PD) Disaggregation, has become the default architecture in many cutting-edge inference stacks. The core idea is to separate the “prefill” and “decode” stages of inference into different processes, allowing decoders to stream tokens without being blocked by ongoing prefill jobs.
Why does PD matter?
- Without PD: New requests preempt ongoing decoding, leading to high tail latencies for long-form completions.
- With PD: Decoders are dedicated to generating tokens, leading to smooth, uninterrupted streaming.
PD is now adopted in several production-grade inference engines and internal deployments at top-tier AI platforms.
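To make the prefill/decode split concrete, here is a toy Python sketch of the hand-off. It is purely illustrative and not the vLLM or Dynamo implementation: real deployments run prefill and decode in separate processes on separate GPUs, and the hand-off carries the actual KV cache rather than a placeholder string.

```python
# Toy illustration of PD disaggregation (not the actual vLLM/Dynamo code).
# A dedicated prefill worker computes the KV cache for each new prompt and hands
# it off; the decode worker streams tokens and is never preempted by new prefills.
import asyncio

async def prefill_worker(requests: asyncio.Queue, handoff: asyncio.Queue):
    while True:
        prompt = await requests.get()
        kv_cache = f"<KV for {prompt!r}>"        # stand-in for the real prefill pass
        await handoff.put((prompt, kv_cache))

async def decode_worker(handoff: asyncio.Queue):
    while True:
        prompt, kv_cache = await handoff.get()
        for step in range(3):                    # stand-in for autoregressive decoding
            print(f"{prompt}: token {step} (decoded with {kv_cache})")
            await asyncio.sleep(0)               # keeps streaming; never waits on prefill

async def main():
    requests, handoff = asyncio.Queue(), asyncio.Queue()
    for p in ("request A", "request B"):
        requests.put_nowait(p)
    tasks = [asyncio.create_task(prefill_worker(requests, handoff)),
             asyncio.create_task(decode_worker(handoff))]
    await asyncio.sleep(0.1)                     # let the toy pipeline run briefly
    for t in tasks:
        t.cancel()

asyncio.run(main())
```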
LMCache + vLLM v1 + NIXL = PD at State-of-the-Art Speed
Today, we present the benchmark results that validate its performance, using vLLM v1 and Dynamo’s latest release (as of 04/28) as reference points. We used the benchmark script from vllm/benchmarks.
Launch parameters
- Dynamo: Following rainj-me’s PR in vLLM to reduce unnecessary gateway overhead, and then building prefill-decode connections using the add_remote_prefill_eps interface.
- LMCache: Launch parameters are the same as in https://github.com/vllm-project/vllm/blob/main/examples/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh, except that the ports differ in the 2P1D scenario.
Set up #1: 1P1D
- Topology: 1 prefiller + 1 decoder (1P1D)
- Model: meta-llama/Llama-3.1-8B-Instruct
- Hardware: an 8x H100 node with NVLink
- Workload: 3.6 queries per second, each with 8000 input tokens (prefill) + 200 output tokens (decode)
- Benchmark command:
python3 benchmark_serving.py --port 8080 --seed $(date +%s) \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset-name random --random-input-len 8000 --random-output-len 200 \
--num-prompts 200 --burstiness 100 --request-rate 3.6 --metric-percentiles 95 \
--backend openai-chat --endpoint /v1/chat/completions --ignore-eos | tee benchmark.log
Results:
Set up #2: 2P1D
- Topology: 2 prefillers + 1 decoder (2P1D) on 3 GPUs with NVLink
- Model: meta-llama/Llama-3.1-8B-Instruct
- Hardware: an 8x H100 node with NVLink
- Workload: 5.5 queries per second, each with 10000 input tokens (prefill) + 100 output tokens (decode)
- Benchmark command:
python3 benchmark_serving.py --port 8080 --seed $(date +%s) \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset-name random --random-input-len 10000 --random-output-len 100 \
--num-prompts 250 --burstiness 100 --request-rate 5.5 --metric-percentiles 95 \
--backend openai-chat --endpoint /v1/chat/completions --ignore-eos | tee benchmark.log
Results:
Key takeaway:
vLLM v1 (w/ PD) outperforms vLLM (w/o PD) and achieves performance comparable to Dynamo across both TTFT and ITL.
Design Deep Dive: How We Achieve It
The core challenge of any PD system lies in efficiently transferring the KV cache between prefillers and decoders.
Here’s how we do it (a simplified code sketch follows the list):
- KV cache is extracted from vLLM’s paged memory
- LMCache collects and assembles the KV into a GPU-resident contiguous buffer
- The buffer is sent over the network via NIXL to decoders
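In code, the gather-and-send step looks roughly like the following sketch, which uses plain PyTorch and made-up shapes (num_pool_blocks, block_table, and so on) rather than LMCache’s actual internals:

```python
# Simplified sketch of the gather step (illustrative; not LMCache's actual code).
# vLLM keeps the KV cache in fixed-size pages scattered across a big pool; before
# sending, the pages belonging to one request are copied into a single contiguous
# GPU buffer so the transport can issue one large transfer instead of many tiny ones.
import torch

num_pool_blocks, block_size = 1024, 16           # toy paged-KV pool dimensions
num_kv_heads, head_dim = 8, 128                  # per-layer KV shape (illustrative)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Paged pool for one layer: [num_blocks, block_size, num_kv_heads, head_dim]
kv_pool = torch.randn(num_pool_blocks, block_size, num_kv_heads, head_dim,
                      dtype=torch.float16, device=device)

# Block table of one request: which pool pages hold its KV, in order.
block_table = torch.tensor([3, 17, 42, 99, 100], device=device)

# Gather the scattered pages into one contiguous, GPU-resident buffer.
send_buffer = kv_pool.index_select(0, block_table)           # already contiguous
send_buffer = send_buffer.view(-1, num_kv_heads, head_dim)   # [tokens, heads, head_dim]

# `send_buffer` is now a single region that can be handed to the transport
# (NIXL, in LMCache's case) as one large transfer.
print(send_buffer.shape, send_buffer.is_contiguous())
```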
Why Use a Buffer?
At first glance, adding a buffer sounds like extra overhead. But let’s consider the alternative: sending KV cache block by block directly from vLLM’s paged memory. With a default block size of 16 tokens, this results in many tiny transfers, each underutilizing bandwidth, even with NIXL.
In contrast, LMCache batches the KV blocks into a single large buffer, achieving:
- High GPU-to-NIC throughput
- Minimal fragmentation
- Near-zero memory copy overhead (since it’s all in GPU memory)
It’s analogous to how OSes use I/O buffers to reduce syscall overhead and improve disk/network performance.
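The gap is easy to reproduce with a rough micro-benchmark. The sketch below is not LMCache code; it assumes two CUDA devices and roughly 128 KiB of FP16 KV per token (in the ballpark of an 8B-class model), and the absolute numbers will vary with the interconnect:

```python
# Rough micro-benchmark sketch: many page-sized copies vs. one batched copy
# between two GPUs. Assumes >= 2 CUDA devices; timings are purely illustrative.
import time
import torch

assert torch.cuda.device_count() >= 2, "needs two CUDA devices for this illustration"

tokens, page_tokens = 8000, 16
bytes_per_token = 128 * 1024                     # assumed KV footprint per token (FP16)
elems = tokens * bytes_per_token // 2            # number of FP16 elements
src = torch.randn(elems, dtype=torch.float16, device="cuda:0")
dst = torch.empty(elems, dtype=torch.float16, device="cuda:1")

def timed_ms(fn):
    torch.cuda.synchronize(0); torch.cuda.synchronize(1)
    start = time.perf_counter()
    fn()
    torch.cuda.synchronize(0); torch.cuda.synchronize(1)
    return (time.perf_counter() - start) * 1e3

page_elems = elems // (tokens // page_tokens)    # elements per 16-token page
per_page = timed_ms(lambda: [dst[i:i + page_elems].copy_(src[i:i + page_elems])
                             for i in range(0, elems, page_elems)])
batched = timed_ms(lambda: dst.copy_(src))
print(f"{tokens // page_tokens} page-sized copies: {per_page:.1f} ms; "
      f"one batched copy: {batched:.1f} ms")
```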
Why Page Size Matters
KV cache transfer speed varies significantly with vLLM’s page size. A smaller page means more tiny KV transfers; a larger page introduces memory fragmentation and hurts prefix caching.
LMCache solves this by buffering at send-time, decoupling network transfer efficiency from page size.
The delay to transfer the KV cache of a 7500-token input through NVLink:
- 20ms, if page size is 16 tokens
- 8ms, if page size is 128 tokens
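For context, here is a back-of-the-envelope check on what those measurements imply, assuming Llama-3.1-8B’s KV layout (32 layers, 8 KV heads of dimension 128, FP16 keys and values); the delays are the measured values listed above:

```python
# Back-of-the-envelope check on the numbers above. Assumed KV layout for
# Llama-3.1-8B: 32 layers, 8 KV heads, head_dim 128, FP16 keys and values.
layers, kv_heads, head_dim, bytes_per_elem = 32, 8, 128, 2
kv_bytes_per_token = layers * kv_heads * head_dim * bytes_per_elem * 2   # K and V
tokens = 7500
total_gb = tokens * kv_bytes_per_token / 1e9                             # ~0.98 GB of KV

for page_size, delay_ms in [(16, 20), (128, 8)]:                         # measured above
    transfers = -(-tokens // page_size)                                  # ceil division
    eff_bw = total_gb / (delay_ms / 1e3)
    print(f"page size {page_size:>3}: ~{transfers} transfers, "
          f"~{eff_bw:.0f} GB/s effective bandwidth")
# page size  16: ~469 transfers, ~49 GB/s effective bandwidth
# page size 128: ~59 transfers, ~123 GB/s effective bandwidth
```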
Why Not Just Increase Page Size in vLLM?
Great question.
vLLM sets a default block size of 16 tokens for a reason—prefix caching. Larger blocks reduce prefix hit rates unless complex partial-matching mechanisms are added. Moreover, large blocks fragment GPU memory, clashing with the paged attention philosophy.
Again, this is similar to OS memory paging—smaller pages allow fine-grained caching and less fragmentation.
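To see why block granularity hurts prefix hits, consider a toy version of block-granular prefix matching (a simplification of vLLM’s prefix caching, where only blocks fully covered by the shared prefix can be reused):

```python
# Toy model of block-granular prefix caching: a cached block is reusable only if
# the shared prefix covers the entire block, so larger blocks waste more of it.
def reusable_prefix_tokens(shared_prefix_len: int, block_size: int) -> int:
    return (shared_prefix_len // block_size) * block_size

shared_prefix = 1000   # tokens shared between two requests (illustrative)
for block_size in (16, 128):
    hit = reusable_prefix_tokens(shared_prefix, block_size)
    print(f"block_size={block_size:>3}: {hit}/{shared_prefix} prefix tokens "
          f"served from cache ({hit / shared_prefix:.0%})")
# block_size= 16: 992/1000 prefix tokens served from cache (99%)
# block_size=128: 896/1000 prefix tokens served from cache (90%)
```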
By leveraging LMCache’s decoupled buffering, we get the best of both worlds:
- Retain small pages for better memory usage and prefix hits
- Still achieve efficient KV transfer for PD
Final Notes
Disaggregated Prefill is not just a nice-to-have—it’s becoming foundational to high-performance LLM inference.
With LMCache’s PD support for vLLM v1 + NIXL, we bring production-grade PD to open-source stacks, with state-of-the-art performance and a robust, future-proof architecture.
In all fairness, we don’t think LMCache’s current design is optimal. The LMCache Lab works closely with the vLLM team to explore better designs for PD disaggregation.
Stay tuned—we’re just getting started.
