Initiated and Officially Supported by Tensormesh

Understanding LMCache MP Mode Transfer Paths: A Beginner’s Guide

By Jiayue Chen, Tony Lin (Intel), and LMCache Team

In the traditional setup, KV cache is usually managed inside the inference engine process. This means the cache is closely tied to the lifetime of that engine. If the inference engine restarts or crashes, the KV cache may be lost as well.
To address this, LMCache introduces multiprocess (MP) mode. In MP mode, LMCache runs as a standalone daemon process that is decoupled from the inference engine. The inference engine focuses on running model inference, while the LMCache MP server independently manages KV cache storage and reuse. This separation makes KV cache management more robust, especially when inference engine workers restart or fail.

Diagram illustrating the overall architecture and KV transfer paths in LMCache MP mode, depicting interactions between vLLM workers, LMCache MP server, and various storage backends.

In this blog, we use vLLM as an example to walk through that transfer path in a beginner-friendly way. The goal is to help new contributors understand the basic MP mode architecture, build enough context to pick up their next beginner-friendly issue, and contribute meaningfully to LMCache.

vLLM + LMCache

In a vLLM deployment, the vLLM worker acts as a client and the LMCache MP server acts as an independent cache manager. The worker sends store and retrieve requests, while LMCache handles the storage backends.

Because the worker and the server are separate processes, they do not share memory. That raises the central systems question of this post: how do massive KV cache tensors physically move between them? The mechanism responsible for this movement is called the transfer path.
Every transfer path supports two operations: store (worker -> server) and retrieve (server -> worker).

Gathering and Scattering KV Blocks

Before looking at the paths themselves, it helps to understand what the data being moved looks like.

Inference engines like vLLM do not store a request’s KV cache as one contiguous memory region. With PagedAttention, the KV cache is split into small fixed-size blocks that are allocated on demand as the request grows, so one request’s KV cache ends up scattered across many GPU memory blocks. A block table records which block IDs belong to each request.

To move a request’s KV cache, one has to walk the block table and copy the scattered blocks into one contiguous buffer; this step is called gathering. The reverse, copying a contiguous buffer back into the paged blocks during retrieve, is called scattering. (You will see these names directly in the LMCache code as gather_paged_kv_to_cpu and scatter_cpu_to_paged_kv.)
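To make gathering and scattering concrete, here is a minimal byte-level sketch. It is a toy model, not LMCache's gather_paged_kv_to_cpu or scatter_cpu_to_paged_kv: the function names, the BLOCK_BYTES constant, and the use of a bytearray in place of GPU tensors are all assumptions made for illustration.

```python
from typing import Dict, List

BLOCK_BYTES = 4  # hypothetical block size; real KV blocks are far larger


def gather_blocks(paged_pool: bytearray, block_ids: List[int]) -> bytearray:
    """Copy the blocks named in block_ids into one contiguous buffer."""
    out = bytearray(len(block_ids) * BLOCK_BYTES)
    for i, block_id in enumerate(block_ids):
        src = block_id * BLOCK_BYTES
        out[i * BLOCK_BYTES:(i + 1) * BLOCK_BYTES] = paged_pool[src:src + BLOCK_BYTES]
    return out


def scatter_blocks(contiguous: bytes, paged_pool: bytearray, block_ids: List[int]) -> None:
    """Copy a contiguous buffer back into the blocks named in block_ids."""
    for i, block_id in enumerate(block_ids):
        dst = block_id * BLOCK_BYTES
        paged_pool[dst:dst + BLOCK_BYTES] = contiguous[i * BLOCK_BYTES:(i + 1) * BLOCK_BYTES]


# A pool of four blocks; the request owns blocks 3 and 1, in that order.
pool = bytearray(b"AAAABBBBCCCCDDDD")
block_table: Dict[str, List[int]] = {"req-0": [3, 1]}
gathered = gather_blocks(pool, block_table["req-0"])

# On retrieve, the same bytes may land in different blocks of another pool.
new_pool = bytearray(16)
scatter_blocks(gathered, new_pool, [0, 2])
```

The block table is the only thing that gives the scattered blocks an order, which is why both directions iterate over it rather than over the pool.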

The Anatomy of a Transfer Path

A transfer path therefore has two jobs:

  1. Gather or scatter KV blocks: gather scattered KV blocks into a contiguous buffer before transfer, or scatter the contiguous buffer back into paged KV blocks during retrieval.
  2. Move the buffer between processes: transfer those bytes between the inference engine worker and the LMCache MP server.

Depending on the hardware environment, LMCache handles this process-boundary transfer differently. This brings us to the two core transfer paths LMCache supports: the CUDA path and the non-CUDA path.

The CUDA Path

Normally, a GPU memory pointer is only valid inside the process that allocated it. Another process cannot use that pointer directly. CUDA Inter-Process Communication (IPC) solves this by letting the owning process export a GPU memory allocation as a small IPC handle. Another process can import that handle and get its own valid pointer to the same underlying GPU memory.

LMCache uses this mechanism in the CUDA path. Instead of copying the whole KV cache out of the worker and sending it to the LMCache server, the worker sends a CUDA IPC handle. The handle is not the KV cache data itself; it is a lightweight reference that lets the LMCache server access the worker’s GPU memory from a separate process.

In LMCache, CudaIPCWrapper wraps a GPU tensor into a sendable object. It bundles the CUDA IPC handle with tensor metadata, such as shape, dtype, and layout information, so the server knows how to interpret the memory. A small CUDA event also tells the other process when the GPU operation is finished and when it is safe to read from or write to the shared memory, avoiding race conditions.
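The kind of bundle described above can be pictured as a small record. The sketch below is not the actual CudaIPCWrapper: the class name IpcTensorDescriptor, its fields, the DTYPE_BYTES table, and the placeholder handle bytes are all hypothetical, and no real CUDA call is made. It only illustrates why the object that crosses the process boundary is tiny compared to the memory it describes.

```python
from dataclasses import dataclass
from math import prod
from typing import Tuple

# Hypothetical item sizes, for illustration only.
DTYPE_BYTES = {"float16": 2, "bfloat16": 2, "float32": 4}


@dataclass(frozen=True)
class IpcTensorDescriptor:
    """Hypothetical stand-in for what a wrapper like CudaIPCWrapper carries."""

    ipc_handle: bytes        # opaque handle exported by the owning process
    shape: Tuple[int, ...]   # tensor dimensions
    dtype: str               # element type name
    device_index: int        # which GPU the allocation lives on

    def nbytes(self) -> int:
        """Size of the GPU memory the descriptor refers to."""
        return prod(self.shape) * DTYPE_BYTES[self.dtype]


# Placeholder bytes stand in for a real exported handle.
desc = IpcTensorDescriptor(
    ipc_handle=b"\x00" * 64,
    shape=(2, 16, 8, 128),
    dtype="float16",
    device_index=0,
)
described_bytes = desc.nbytes()      # memory the server can reach via the handle
handle_bytes = len(desc.ipc_handle)  # what is actually sent
```

Here the descriptor refers to 64 KiB of tensor data while the handle itself is a few dozen bytes; for real KV caches the gap is many orders of magnitude larger.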

The actual KV cache still has to be copied into LMCache-managed storage. For a store operation, LMCache first uses the CUDA IPC handle to access the worker’s GPU memory. It then performs the gather step described earlier: the scattered paged KV blocks are copied into a contiguous temporary GPU staging buffer. After that, the contiguous buffer is copied from GPU memory into LMCache-managed CPU memory, as shown in LMCache’s lmcache_driven_transfer.py implementation. So the store path can be summarized as: worker paged KV blocks on GPU → GPU staging buffer → LMCache CPU memory. A retrieve then runs the same path in reverse.

The Non-CUDA Path

The IPC handle is a CUDA-only feature. This means CPUs and non-CUDA accelerators, such as Intel XPU or Habana HPU, cannot use the same mechanism to share device memory across processes. This is where LMCache community maintainer hlin99 made an important contribution through two PRs, #3259 and #3359, which built the alternative transfer path and expanded LMCache MP mode support across multiple platforms.

To move data between processes, you could either copy the bytes through a channel (e.g. a socket) or share a region of memory that both processes can read and write. hlin99’s two PRs implement exactly these two options for the non-CUDA path:

  • PR #3259 – the pickle path (copy through a channel)
  • PR #3359 – the shared-memory (SHM) path (share memory)

Shared Memory (SHM) — 1 Copy

When the LMCache server starts with --l1-size-gb, it sets up an L1 pool. This L1 pool is LMCache’s primary high-speed storage space in host memory. By default, LMCache allocates this pool inside /dev/shm, a Linux shared-memory area backed by RAM that allows separate processes to access the same memory region.

In non-CUDA setups, the worker uses this shared memory pool to transfer KV cache data to the LMCache server. Because the L1 pool lives in shared memory, both the vLLM worker and the LMCache server can map the same physical memory region. The worker can then write its KV cache directly into the server’s L1 buffer, instead of copying the data through a separate communication channel. As a result, a store operation only needs one data copy: gathered worker KV cache -> Shared L1 buffer.
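The idea of two parties writing to and reading from the same region can be sketched with Python's standard multiprocessing.shared_memory module, which on Linux is backed by POSIX shared memory (typically visible under /dev/shm). This is an illustration of the mechanism only, not LMCache's L1 pool code: the function names are hypothetical, and for brevity the "worker" and "server" roles run in a single process here.

```python
from multiprocessing import shared_memory


def create_pool(size_bytes: int) -> shared_memory.SharedMemory:
    """Server side: allocate a named shared-memory segment."""
    return shared_memory.SharedMemory(create=True, size=size_bytes)


def store_into_pool(pool_name: str, offset: int, payload: bytes) -> None:
    """Worker side: attach by name and write directly into the server's pool."""
    shm = shared_memory.SharedMemory(name=pool_name)
    try:
        shm.buf[offset:offset + len(payload)] = payload  # the single copy
    finally:
        shm.close()  # detach only; the server still owns the segment


def read_from_pool(pool: shared_memory.SharedMemory, offset: int, length: int) -> bytes:
    """Server side: read bytes that the worker wrote."""
    return bytes(pool.buf[offset:offset + length])


pool = create_pool(1024)
try:
    store_into_pool(pool.name, 0, b"gathered-kv")
    stored = read_from_pool(pool, 0, len(b"gathered-kv"))
finally:
    pool.close()
    pool.unlink()  # the owner removes the segment when done
```

Only the segment name has to be communicated between the two sides; the payload itself never passes through a socket or pipe.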

Note: “1 copy” assumes the device-specific C ops backend. The default Python fallback adds an extra staging buffer (2 copies) to coalesce small chunks into one large transfer, which maximizes transfer throughput at the cost of one extra copy.

Pickle — 4 Copies

The SHM path only works if the L1 pool actually fits inside /dev/shm. If the requested L1 pool is larger than the /dev/shm capacity (or shared memory is disabled), LMCache allocates the L1 pool in the server’s private RAM instead, which the worker cannot map into its own address space.
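The fallback rule can be summarized as a small decision function. This is a sketch of the logic as described in this post, not LMCache's actual selection code: TransferPath, choose_transfer_path, and dev_shm_capacity are hypothetical names, and the sketch compares against total /dev/shm size even though a real check might prefer free space.

```python
import shutil
from enum import Enum

GiB = 1024 ** 3


class TransferPath(Enum):
    CUDA_IPC = "cuda_ipc"
    SHM = "shm"
    PICKLE = "pickle"


def choose_transfer_path(
    has_cuda: bool,
    l1_bytes: int,
    shm_capacity_bytes: int,
    shm_enabled: bool = True,
) -> TransferPath:
    """Pick a path following the rules described in the text (assumed logic)."""
    if has_cuda:
        return TransferPath.CUDA_IPC
    if shm_enabled and l1_bytes <= shm_capacity_bytes:
        return TransferPath.SHM
    # The pool cannot live in shared memory, so bytes must be sent instead.
    return TransferPath.PICKLE


def dev_shm_capacity(path: str = "/dev/shm") -> int:
    """Total size in bytes of the shared-memory filesystem (Linux only)."""
    return shutil.disk_usage(path).total


fits = choose_transfer_path(False, l1_bytes=4 * GiB, shm_capacity_bytes=8 * GiB)
too_big = choose_transfer_path(False, l1_bytes=16 * GiB, shm_capacity_bytes=8 * GiB)
```

On a Linux host, `dev_shm_capacity()` gives a quick way to check whether a planned --l1-size-gb value is likely to fit; container runtimes often set /dev/shm much smaller than host RAM.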

When that happens, KV cache cannot be shared directly between the worker and the server, so the data has to be moved through a byte stream. In this path, the worker first gathers KV blocks by their block IDs into contiguous CPU chunks. It then serializes those chunks with pickle.dumps and sends the bytes to the server over the ZMQ socket as a COMMIT_STORE request. On the server side, LMCache deserializes the bytes back into tensors and writes them into its private L1 pool.

Diagram illustrating the data transfer paths of KV tensors from the vLLM worker to the LMCache MP server in MP mode, detailing CUDA and non-CUDA transport processes.

This gives the pickle path four copies in total: gather -> serialize -> deserialize -> write.
Unlike the CUDA path, which shares GPU memory access through an IPC handle, or the SHM path, which lets both processes map the same shared buffer, the pickle path actually moves the KV cache bytes through the ZMQ channel. This makes it a platform-independent and universal fallback path for data transfer.
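The four copies can be traced in a short, self-contained sketch. To keep it runnable with only the standard library, a socket.socketpair with length-prefixed framing stands in for the ZMQ socket; that substitution, the message layout (a dict with "op" and "chunk" keys), and all function names are assumptions for illustration and do not reflect LMCache's wire format.

```python
import pickle
import socket
import struct
from typing import List


def send_message(sock: socket.socket, payload: bytes) -> None:
    """Send one message prefixed with its 8-byte big-endian length."""
    sock.sendall(struct.pack("!Q", len(payload)) + payload)


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from the socket."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)


def recv_message(sock: socket.socket) -> bytes:
    (length,) = struct.unpack("!Q", recv_exact(sock, 8))
    return recv_exact(sock, length)


def worker_store(sock: socket.socket, paged_pool: bytes,
                 block_ids: List[int], block_bytes: int) -> None:
    # Copy 1: gather the blocks into one contiguous CPU chunk.
    chunk = b"".join(
        paged_pool[i * block_bytes:(i + 1) * block_bytes] for i in block_ids
    )
    # Copy 2: serialize the chunk into a byte stream.
    payload = pickle.dumps({"op": "COMMIT_STORE", "chunk": chunk})
    send_message(sock, payload)


def server_handle_store(sock: socket.socket, l1_pool: bytearray, offset: int) -> int:
    # Copy 3: deserialize the byte stream back into an object.
    request = pickle.loads(recv_message(sock))
    chunk = request["chunk"]
    # Copy 4: write the chunk into the server's private L1 pool.
    l1_pool[offset:offset + len(chunk)] = chunk
    return len(chunk)


worker_sock, server_sock = socket.socketpair()
l1 = bytearray(16)
worker_store(worker_sock, b"AAAABBBBCCCCDDDD", [3, 1], block_bytes=4)
written = server_handle_store(server_sock, l1, offset=0)
worker_sock.close()
server_sock.close()
```

Because pickle.loads can execute arbitrary code from a crafted payload, a channel like this should only ever connect processes that trust each other.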

| Path | Data flow | Copies | What crosses the process boundary |
| --- | --- | --- | --- |
| CUDA IPC | Gather GPU blocks -> GPU staging buffer -> CPU L1 | 2 | A small IPC handle that lets the server read the worker’s GPU directly |
| SHM | Gather KV -> straight into the shared /dev/shm L1 buffer | 1* | Nothing – both map the same memory region |
| Pickle | Gather KV -> CPU chunk -> serialize => ZMQ => deserialize -> write to L1 | 4 | The serialized KV bytes |

* 1 copy with the device-specific C ops backend; the default Python fallback uses 2 (see note above).

KV Transfer Paths (store operation)

Deployment Tutorial and Explanation

https://github.com/glbyktjys/lmcache-mp-transfer-paths-tutorial

Final Thoughts

If you have read this far, you should now have a clearer mental model of how LMCache moves KV cache in MP mode. The CUDA path shares GPU memory through an IPC handle, while the non-CUDA transfer paths either share a /dev/shm buffer through SHM or move bytes over a socket through pickle.

Hopefully, this gives you enough context to start exploring the codebase with more confidence. One possible area to look at is DeepSeek V4 support. DeepSeek V4 is already supported in MP mode through the CUDA path via PR #3171, but extending that support to the non-CUDA transfer paths remains an open opportunity. In particular, accelerators such as Intel XPU and Habana HPU, as well as the CPU/SHM/pickle paths used for GPU-free testing, still need support for V4’s hybrid KV cache groups. This could be a well-scoped contribution for someone interested in MP mode, multi-platform support, and hybrid KV layouts.

Another direction we’re exploring is making the SHM store path asynchronous. Today, SHM transfer is synchronous, so the worker blocks until the copy finishes. An async design would let the vLLM worker use a “fire-and-forget” store flow, similar to how the CUDA path uses events to decouple the two processes. This could meaningfully improve throughput. This is still an early-stage idea, but if these design tradeoffs interest you, we’d love for you to join the discussion and help build it with us.

LMCache’s multi-platform support is also continuing to evolve. A recent follow-up, PR #3352, added CPU support to the SHM-based EngineDrivenContext path, enabling store/retrieve flows to run entirely over shared memory without GPU or CUDA dependencies. These updates simplify testing of the non-CUDA MP path and make LMCache’s MP mode easier to extend across hardware platforms and use without accelerators.

Together, these improvements make MP mode easier to explore, debug, and contribute to, especially for developers working in non-CUDA or CPU-only environments. We hope this guide makes the MP code path feel more approachable, and we welcome contributors who are interested in helping LMCache run well across more platforms.
