Initiated and Officially Supported by Tensormesh

LMCache’s New Architecture Boosts MoE Inference Performance by 10×

By Weishu Deng and LMCache Team

Modern LLM serving workloads are defined by strict latency requirements, high concurrency, and rapidly growing context lengths. Applications such as multi-turn chat, AI agents, and retrieval-augmented generation continuously build on prior context, leading to substantial reuse of previously computed states. In production, systems must minimize time-to-first-token (TTFT) while maintaining stable decoding throughput under heavy concurrent load.

Figure: a distributed serving system with multiple data-parallel paths (DP 0, DP 1, …, DP N), each running local attention modules behind a load balancer.

To meet these demands, vLLM commonly combines Data Parallelism (DP) with automatic Expert Parallelism (EP). Compared to pure tensor parallelism (TP) at a similar scale, this configuration consistently delivers better throughput (TPS) and lower TTFT in real-world deployments. In this design, attention layers are replicated across GPUs while expert layers are sharded, ensuring that latency-critical attention computation and KV-cache access remain local. This reduces per-GPU memory pressure and enables more efficient utilization of compute resources.

Each request is routed to a specific DP rank, where attention is executed locally. During MoE layers, tokens are dynamically dispatched to experts across GPUs and then aggregated. This architecture provides strong scalability, efficient memory usage, and minimal communication overhead—making it well suited for serving large MoE models.
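The routing described above can be sketched in a few lines. This is a conceptual illustration, not vLLM code; the rank count, expert count, and hash-based routing are all stand-ins (real systems use load-aware balancing and a learned MoE router):

```python
# Conceptual sketch (not vLLM internals): a request pins to one DP rank for
# attention, while each token's MoE compute fans out to sharded experts.
import hashlib

DP_RANKS = 8          # replicated attention workers (illustrative)
NUM_EXPERTS = 128     # experts sharded across the same GPUs (illustrative)
TOP_K = 2             # experts activated per token (illustrative)

def route_request(request_id: str) -> int:
    """Pick a DP rank; production routers balance on load, not a hash."""
    h = int(hashlib.sha256(request_id.encode()).hexdigest(), 16)
    return h % DP_RANKS

def dispatch_token(token_id: int) -> list[int]:
    """Stand-in for the learned MoE router: choose TOP_K expert ids."""
    return [(token_id * 2654435761 + i) % NUM_EXPERTS for i in range(TOP_K)]

rank = route_request("conv-42")   # attention and KV cache stay on this rank
experts = dispatch_token(17)      # expert compute may cross GPU boundaries
```

The key property is that the attention path (and its KV cache) never leaves the chosen rank; only the expert dispatch crosses GPUs.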

However, a key limitation remains: each DP rank maintains its own isolated KV-cache buffer within its local process. Even with CPU offloading enabled, these caches are not shared across workers. As a result, identical or overlapping contexts processed by different ranks cannot be reused, leading to redundant prefill computation and inefficient memory utilization.

LMCache’s Multi-Process (MP) mode fundamentally addresses this limitation by introducing a unified KV-cache layer. Instead of maintaining fragmented, per-process KV buffers, MP mode centralizes KV-cache management into a shared memory layer that is accessible across all serving processes. This unified design enables true cross-process cache reuse, eliminating redundant computation and significantly improving memory efficiency. The benefits are especially pronounced in multi-turn workloads, where context grows incrementally and naturally creates high cache reuse opportunities across requests and processes.
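A toy model makes the difference concrete. The sketch below is not the LMCache implementation; it only assumes that cache entries are keyed by a hash of the token prefix, so the contrast between per-rank and shared stores is visible:

```python
# Toy model of per-process vs. shared KV caches (not LMCache code).
# Entries are keyed by a prefix hash, so any worker seeing the same
# context can reuse blocks another worker already computed.
import hashlib

def prefix_key(tokens: tuple[int, ...]) -> str:
    return hashlib.sha256(repr(tokens).encode()).hexdigest()

class KVCache:
    def __init__(self):
        self.store: dict[str, str] = {}
        self.hits = 0

    def lookup_or_prefill(self, tokens: tuple[int, ...]) -> str:
        key = prefix_key(tokens)
        if key in self.store:
            self.hits += 1                       # reuse: no prefill needed
        else:
            self.store[key] = f"kv-{key[:8]}"    # placeholder for KV blocks
        return self.store[key]

ctx = (1, 2, 3, 4)

# In-process offload: each DP rank owns a private cache -> no cross-rank hits.
rank_caches = [KVCache() for _ in range(2)]
rank_caches[0].lookup_or_prefill(ctx)
rank_caches[1].lookup_or_prefill(ctx)   # same context, recomputed anyway

# MP mode: one shared cache -> the second rank reuses the first rank's work.
shared = KVCache()
shared.lookup_or_prefill(ctx)           # rank 0 prefills
shared.lookup_or_prefill(ctx)           # rank 1 hits
```

With private caches the second rank's identical context is a miss; with a shared store it is a hit, which is exactly the redundant prefill MP mode eliminates.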

Figure: In-Process offload vs. Unified KV (MP) mode, showing data pathways between DRAM and per-GPU HBM.

Deployment and Benchmarking

Below, we present the deployment configurations for both in-process offloading and LMCache MP mode. We evaluate their effectiveness using a conversation-based benchmark that reflects realistic multi-turn workloads on the Qwen3-235B-A22B-Instruct-2507-FP8 model. All experiments are conducted on an 8× NVIDIA H100 80GB GPU server using vLLM 0.18.1 and LMCache 0.4.3-dev (with the official 0.4.3 release forthcoming).

In-Process offload

Step 1: Launch vLLM Serving

We first launch a baseline vLLM instance with 8-way data parallelism and in-process KV-cache offloading:

export VLLM_USE_FLASHINFER_MOE_FP8=0
export VLLM_USE_DEEP_GEMM=1
vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 \
  --data-parallel-size 8 \
  --enable-expert-parallel \
  --gpu-memory-utilization 0.8 \
  --max-num-batched-tokens 1024 \
  --kv-offloading-size 50 \
  --disable-hybrid-kv-cache-manager \
  --max-model-len auto

In this setup, each data-parallel rank is allocated 50 GB of host memory for KV-cache offloading, resulting in an aggregate 400 GB CPU memory pool. While this allows the system to support long-context workloads and maintain high GPU utilization, the KV cache remains process-local and fragmented, preventing reuse across workers.
(Note: --disable-hybrid-kv-cache-manager is required for HMA models with the native OffloadingConnector.)


LMCache MP offload

Unlike the previous in-process offload, LMCache MP mode runs as an independent service. It dynamically detects and registers serving engines at runtime, enabling seamless integration across multiple serving processes.

Step 1: Start LMCache MP Server

We first launch the LMCache multi-process server with 400GB of host memory:

lmcache server --l1-size-gb 400 --eviction-policy LRU

Step 2: Launch vLLM with LMCache MP

export VLLM_USE_FLASHINFER_MOE_FP8=0
export VLLM_USE_DEEP_GEMM=1
vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 \
  --data-parallel-size 8 \
  --enable-expert-parallel \
  --gpu-memory-utilization 0.8 \
  --max-num-batched-tokens 1024 \
  --max-model-len auto \
  --kv-transfer-config '{"kv_connector":"LMCacheMPConnector","kv_role":"kv_both"}'

In this configuration, all GPU workers dynamically register with the LMCache server and access a shared KV-cache pool backed by host memory. This transforms KV-cache management from isolated per-process buffers into a unified, system-wide caching layer, enabling efficient reuse across DP ranks and even across independent serving instances.
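From the client's perspective nothing changes: vLLM still exposes an OpenAI-compatible /v1/chat/completions endpoint, and a multi-turn client simply resends the growing message list each turn. The sketch below builds such payloads (the model name matches this deployment; the helper function is our own illustration). Every turn repeats the earlier turns verbatim, which is precisely the shared prefix any DP rank can now serve from the unified pool:

```python
# Hypothetical client-side helper: build chat payloads whose message lists
# grow turn by turn, so each request embeds the previous one as a prefix.
def build_payload(history: list[dict], user_msg: str) -> dict:
    messages = history + [{"role": "user", "content": user_msg}]
    return {
        "model": "Qwen/Qwen3-235B-A22B-Instruct-2507-FP8",
        "messages": messages,
        "max_tokens": 256,
    }

history: list[dict] = []
turn1 = build_payload(history, "Summarize the design.")
history = turn1["messages"] + [{"role": "assistant", "content": "..."}]
turn2 = build_payload(history, "Now compare it to pure TP.")

# The second request contains the first request's messages as a prefix,
# so its KV blocks can be served from cache regardless of which DP rank
# receives it.
prefix_ok = turn2["messages"][: len(turn1["messages"])] == turn1["messages"]
```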

Multi-Turn Benchmark

To capture realistic behavior, we evaluate using a multi-round conversation benchmark:

lmcache bench engine --engine-url http://localhost:8000 \
  --workload multi-round-chat \
  --mrc-qps 2.0 --mrc-duration 120

This workload naturally exposes repeated prefixes and growing context, making KV-cache reuse critical for performance.
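A back-of-the-envelope model shows why this shape of workload is so cache-friendly. Assuming fixed-size cache blocks (the 256-token chunk size and 512-token turns below are illustrative, not the benchmark's actual parameters), each new turn re-reads every earlier chunk and adds only a few new ones:

```python
# Rough sketch: in multi-round chat, turn k re-reads all chunks written by
# turns 1..k-1, so the cumulative hit rate climbs steadily. Chunk size and
# turn lengths are illustrative assumptions.
CHUNK = 256  # tokens per cache block (assumed)

seen: set[int] = set()
hits = total = 0
context_len = 0
for turn_tokens in [512, 512, 512, 512]:   # four turns growing the context
    context_len += turn_tokens
    for c in range(context_len // CHUNK):  # chunks read this turn
        total += 1
        if c in seen:
            hits += 1                      # prefix chunk already cached
        else:
            seen.add(c)                    # new chunk: prefill once

# 20 chunk reads, 12 of them hits -> 60% hit rate after only four turns.
hit_rate = hits / total
```

The longer the conversation runs, the closer the hit rate gets to 1, which is why a unified cache pays off most in exactly this workload.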


Key Difference

  • In-process offload: KV cache is fragmented across DP ranks → no cross-process reuse
  • LMCache MP mode: KV cache is unified at the host layer → shared across all processes

This architectural shift—from isolated buffers to a unified KV-cache layer—is the core reason behind the performance gains.

Figure: bar chart of average TTFT for In-Process Offload vs. LMCache MP, showing a 13.6× reduction for LMCache MP.

Metric          Statistic   LMCache MP Mode   In-Process Offload
TTFT (s)        Mean        0.29              3.98
                p99         1.30              13.55
Decoding speed  Mean        37.47 tok/s       9.81 tok/s
                p99         45.14 tok/s       34.27 tok/s


LMCache MP mode delivers substantial improvements in both latency and system efficiency. It reduces mean TTFT by approximately 13× (0.29 s vs. 3.98 s) and improves tail latency by over 10× at p99 (1.30 s vs. 13.55 s). Mean decoding throughput also increases by nearly 4× (37.47 vs. 9.81 tok/s).

These gains stem directly from the unified host-side KV-cache layer, which dramatically increases cache hit rates and eliminates redundant prefill computation. By avoiding repeated work and reducing memory fragmentation, MP mode frees up GPU resources for decoding and delivers more stable, predictable performance under high concurrency.


LMCache MP Mode Roadmap

Currently, LMCache MP mode operates at the node level, where multiple serving processes—across DP ranks and independent instances—share a unified KV-cache layer backed by host memory (L1). This design centralizes cache management, reduces per-process overhead, and integrates seamlessly with L2 storage backends, including local storage and remote connectors.

Looking ahead, MP mode is being extended beyond a single node. Upcoming features such as peer-to-peer (P2P) cache sharing and prefill–decode (PD) disaggregation will enable cross-node KV reuse and distributed cache orchestration. These advancements will evolve the unified KV-cache layer from a node-local optimization into a cluster-wide caching system, unlocking even greater scalability and efficiency. Enhanced observability is also planned to provide deeper visibility into cache behavior and system performance.

For more details, refer to the documentation: https://docs.lmcache.ai/mp/index.html
