
LMCache Multi-node P2P CPU Memory Sharing & Control: From Experimental Feature to Production

By Baolong Mao (Tencent), Chunxiao Zheng (Tencent), Weishu Deng (Tensormesh), Darren Peng (Tensormesh), Samuel Shen (Tensormesh)

What is P2P and what does it promise?

In this blog post, we will go over:

  1. a short motivation of the P2PBackend in LMCache and how it differs from existing KV Caching solutions
  2. how to run and benchmark performance on the P2PBackend
  3. design decisions and pain points in making P2P lightweight and production-ready

[Figure: KV caching solutions and their trade-offs: Local CPU Backend, Remote KV Pool, and P2P CPU Sharing, compared on capacity, transfer latency, and fault tolerance.]

Most production vLLM deployments run multiple identical instances behind a load balancer. Each instance builds its own KV cache only from the traffic it sees, creating cache silos.

Concrete example (cache silo problem):

  • 4 vLLM instances behind a load balancer
  • Each instance has 10 GB CPU KV cache (Total capacity: 40 GB)
  • Each request can only benefit from the cache of the instance it lands on → effectively ~10 GB usable per request
  • Result: duplicated compute, lower KV cache reuse, wasted RAM, and RAM contention between instances on the same host

A solution LMCache currently provides is to act as a connector that lets serving engines (e.g. vLLM, SGLang) attach to a remote shared cache (Redis/S3/custom KV store), in theory providing persistent and infinitely scalable KV storage. The current downsides of this approach are:

  • deployment/resource overhead of introducing a new stateful service
  • non-trivial transfer latency (typically TCP-bound)

Over the course of two months, Tencent and the LMCache team implemented a production-grade solution for multi-node CPU P2P sharing within the LMCache open-source project.

P2P is a read-sharing supplement layered on top of the LocalCPU backend, not a replacement. Unlike an isolated shared cache service, total storage capacity is limited by the available RAM in your cluster. However, the underlying scalability assumption is that the memory demands of your inference workload scale directly with the number of active instances.

LMCache implements this using a controller-based architecture and a NIXL-based transfer path for high-performance KV movement.

Quickstart Benchmark: Proof of Concept + Performance

A quickstart for running a two-instance P2P setup can be found in the LMCache Documentation at:

https://docs.lmcache.ai/kv_cache/p2p_sharing.html

The example assumes that the host has an RDMA-enabled NIC (if not, performance may be slightly worse).

After launching the controller:

PYTHONHASHSEED=123 lmcache_controller --host localhost --port 9000 --monitor-ports '{"pull": 8300, "reply": 8400, "heartbeat": 8082}'

and the two vLLM instances with LMCache workers attached:

PYTHONHASHSEED=123  CUDA_VISIBLE_DEVICES=0 LMCACHE_CONFIG_FILE=/path/to/example1.yaml \
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
    --gpu-memory-utilization 0.8 \
    --port 8010 \
    --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'

PYTHONHASHSEED=123  CUDA_VISIBLE_DEVICES=1 LMCACHE_CONFIG_FILE=/path/to/example2.yaml \
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
    --gpu-memory-utilization 0.8 \
    --port 8011 \
    --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
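
Before running the full workload, you can sanity-check both endpoints and get a rough feel for TTFT with a streamed request. Below is a minimal sketch against vLLM's OpenAI-compatible completions API; the prompt placeholder and the helper name are ours, not part of LMCache:

import time
import requests

def first_token_latency(port: int, prompt: str) -> float:
    """Send a streamed completion request and time the first returned chunk."""
    start = time.time()
    resp = requests.post(
        f"http://localhost:{port}/v1/completions",
        json={
            "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
            "prompt": prompt,
            "max_tokens": 32,
            "stream": True,
        },
        stream=True,
    )
    for line in resp.iter_lines():
        if line:  # first non-empty SSE line roughly marks the first token
            return time.time() - start
    return float("inf")

long_prompt = "<paste a long shared context here>"
print("instance 1 TTFT:", first_token_latency(8010, long_prompt))
print("instance 2 TTFT:", first_token_latency(8011, long_prompt))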

We use our Long Doc QA workload generator to first query instance 1 with 50 contexts of 10k tokens each, for a total of 500k tokens on Llama 3.1 8B (~62 GB of unique KV cache), and then send the same workload to instance 2; both instances have registered with the Cache Controller.

Because instance 2 can retrieve KV cache from its peer (instance 1), it achieves a 4x improvement in TTFT and a 5x improvement in total round time.

# instance 1 (cold peer: prefill i.e. no reuse of context)
Query round mean TTFT: 2.028s
Query round time: 38.323s
# instance 2 (start after instance 1 i.e. consuming KV Cache from instance 1)
Query round mean TTFT: 0.490s
Query round time: 7.964s

Granted, this is a contrived benchmark with full workload reuse but suffices for a proof of concept. After a period of battle testing and experimentation, we plan to revisit this architecture and discuss best practices for deployment at scale and what kind of performance improvements might be expected on various real production workloads.

We’ll now go over some architectural decisions behind the P2PBackend.

Coming up with a Controller Metadata Architecture

[Figure: RegistryTree structure: each instance holds rw workers (WorkerNodes), whose storage locations point to KV pools.]

To make P2P efficient, the controller needs to remember:

  • which instances exist
  • which workers each instance has
  • what KV chunks live in which storage locations for each worker

Metadata Designs We Considered (S0, S1, S2)

Before committing to a production-ready controller metadata layout, we evaluated three approaches for tracking KV cache ownership across the cluster. In all experiments, we simulated 100 instances, each containing 1 million KV chunks, and measured four dimensions that matter in production: lookup latency, memory footprint, full-state reporting time (for controller recovery), and worker de-registration time (for elastic scaling).

S0 – Flat Indexing:

Idea: keep a single global mapping:

chunk_hash_to_worker: dict[int, (instance_id, worker_id, location)]

This kept lookups fast but came at a massive cost for writes — every admit/evict had to update flat tables, IP groupings, and nested structures. Deregistering a worker meant rebuilding huge portions of these flat indexes.
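
As a rough sketch (not the actual LMCache code), the flat layout and its deregistration cost look something like this:

# S0 sketch: one global dict keyed by chunk hash.
chunk_hash_to_worker: dict[int, tuple[str, str, str]] = {}  # hash -> (instance_id, worker_id, location)

def admit_flat(chunk_hash: int, instance_id: str, worker_id: str, location: str) -> None:
    chunk_hash_to_worker[chunk_hash] = (instance_id, worker_id, location)

def deregister_worker_flat(instance_id: str, worker_id: str) -> None:
    # Must scan every entry in the global table: O(total chunks in the cluster).
    stale = [h for h, (inst, wk, _) in chunk_hash_to_worker.items()
             if inst == instance_id and wk == worker_id]
    for h in stale:
        del chunk_hash_to_worker[h]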

S1 – RegistryTree: 

Idea: store metadata hierarchically


RegistryTree → InstanceNode → WorkerNode → Location → Chunks
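
A minimal sketch of this hierarchy (illustrative only; the real implementation adds locking, richer location metadata, and more):

# S1 sketch: nested dicts mirroring RegistryTree -> InstanceNode -> WorkerNode -> Location -> chunks.
registry: dict[str, dict[str, dict[str, set[int]]]] = {}
#          instance_id -> worker_id -> location -> set of chunk hashes

def admit(instance_id: str, worker_id: str, location: str, chunk_hash: int) -> None:
    registry.setdefault(instance_id, {}) \
            .setdefault(worker_id, {}) \
            .setdefault(location, set()) \
            .add(chunk_hash)

def deregister_worker(instance_id: str, worker_id: str) -> None:
    # Drop the whole subtree in one step; no global scan needed.
    registry.get(instance_id, {}).pop(worker_id, None)

def lookup(chunk_hash: int):
    # Lookups walk the tree, which is why S1 is slower here than a flat index.
    for instance_id, workers in registry.items():
        for worker_id, locations in workers.items():
            for location, chunks in locations.items():
                if chunk_hash in chunks:
                    return instance_id, worker_id, location
    return None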

While S1’s lookup time (8 microseconds) is higher than S0, S1 achieves:

  • memory reduction: From 19.7 GB to 5.9 GB — crucial for cost-effective scaling
  • 4,000× faster deregistration: From 32 seconds to 8 milliseconds — enabling rapid elastic scaling
  • 23× faster full reporting: From 1.3 seconds to 60 milliseconds — essential for controller recovery

Our evaluation was that these trade-offs are better suited to production because they enable:

  • Elastic scaling: Workers can join and leave the cluster in milliseconds, not minutes
  • Fast recovery: Controller restarts trigger a full sync that completes in seconds
  • Cost efficiency: Lower memory footprint means more instances per machine

S2 – Reverse Chunk to Worker Index

Idea: keep S1’s hierarchical structure, but add a global reverse map for near O(1) lookup:

key_to_worker_index: dict[int, (instance_id, worker_id, location)]

However, this introduced new issues:

  • 2× memory: storing every chunk-to-worker mapping twice (once in tree, once in index)
  • 285× slower deregistration (8ms → 2,425ms): Removing a worker requires cleaning up thousands/millions of index entries
  • 87× slower full reports (60ms → 5,259ms): Reporting state requires walking both structures
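
Sketched on top of the S1 structure above (again purely illustrative, reusing the registry/admit/deregister_worker helpers from that sketch), the extra bookkeeping looks like this:

# S2 sketch: the hierarchical registry plus a reverse index for near O(1) lookup.
key_to_worker_index: dict[int, tuple[str, str, str]] = {}

def admit_s2(instance_id: str, worker_id: str, location: str, chunk_hash: int) -> None:
    admit(instance_id, worker_id, location, chunk_hash)                    # write 1: the tree
    key_to_worker_index[chunk_hash] = (instance_id, worker_id, location)   # write 2: the index

def deregister_worker_s2(instance_id: str, worker_id: str) -> None:
    deregister_worker(instance_id, worker_id)            # still a cheap subtree drop...
    stale = [h for h, (inst, wk, _) in key_to_worker_index.items()
             if inst == instance_id and wk == worker_id]
    for h in stale:                                      # ...but the global index scan is back
        del key_to_worker_index[h]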

After these experiments, the RegistryTree was chosen as the most scalable and robust architecture for centralized metadata storage. 

Decision: Why we chose RegistryTree (S1)

Solution                                  Lookup (ms)   Memory (GB)   Full Report (ms)   Deregister (ms)
S0 (Flat Indexing)                        0.000089      19.720        1,373              32,597
S1 (RegistryTree)                         0.008197      5.870         60                 8
S2 (S1 + reverse key-to-worker index)     0.00006       19.720        5,259              2,425

Even though S0 and S2 achieve very fast lookups, S1 is the most robust and scalable in the scenarios that dominate real operations: worker churn, controller recovery, and keeping controller memory bounded. For that reason, RegistryTree (S1) was chosen as the base metadata architecture.

Fine-Grained Locking for High Performance

Rather than using a single global lock that would create bottlenecks, we implemented a layered locking strategy with read-write locks at each level:

Level          Lock Type                 Purpose
RegistryTree   Read-Write Lock           Protects the instances dictionary
InstanceNode   Read-Write Lock           Protects the workers dictionary
WorkerNode     FastLock (non-blocking)   Protects individual KV store operations

This solution provides: 

  • Concurrent reads: Multiple threads can query different instances simultaneously
  • Parallel operations: Different instances can be modified in parallel without blocking each other
  • Fine-grained Isolation: Adding or removing workers in one instance doesn’t affect operations on other instances
  • Syscall-free hot path: The non-blocking fast lock on KV chunk admit/evict operations avoids expensive syscalls and context switches entirely
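
A simplified sketch of the layering, not the LMCache implementation: Python has no built-in read-write lock, so a toy one stands in here, and a plain lock stands in for the non-blocking fast lock:

import threading

class RWLock:
    """Toy readers-writer lock: many concurrent readers, one writer."""
    def __init__(self):
        self._readers = 0
        self._readers_lock = threading.Lock()
        self._writer_lock = threading.Lock()

    def acquire_read(self):
        with self._readers_lock:
            self._readers += 1
            if self._readers == 1:
                self._writer_lock.acquire()

    def release_read(self):
        with self._readers_lock:
            self._readers -= 1
            if self._readers == 0:
                self._writer_lock.release()

    def acquire_write(self):
        self._writer_lock.acquire()

    def release_write(self):
        self._writer_lock.release()

class WorkerNode:
    def __init__(self):
        # The real code uses a non-blocking fast lock here to avoid syscalls on admit/evict.
        self._lock = threading.Lock()
        self.chunks: set[int] = set()

    def admit(self, chunk_hash: int) -> None:
        with self._lock:
            self.chunks.add(chunk_hash)

class InstanceNode:
    def __init__(self):
        self.lock = RWLock()                      # guards the workers dictionary
        self.workers: dict[str, WorkerNode] = {}

class RegistryTree:
    def __init__(self):
        self.lock = RWLock()                      # guards the instances dictionary
        self.instances: dict[str, InstanceNode] = {}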

Fault Tolerance: Heartbeat-Driven Recovery


The Controller keeps its centralized metadata in memory only, for performance reasons. So how do we implement fault tolerance? The sheer volume of admit/evict operations rules out a write-ahead log (WAL).

Two Critical Failure Scenarios:

  • Controller crashes: All metadata lost, RegistryTree becomes empty
  • Worker dies: Controller holds stale metadata pointing to a dead worker

We use a two-layer approach:

  • Heartbeat: Detects failures and maintains liveness
  • Full Sync: Recovers state after Controller restart

Layer 1: Heartbeat – The Health Check

Every 10 seconds, the workers and the controller perform a REQ-REP cycle as a health check. The heartbeat mechanism doubles as a command channel — the Controller can send commands through heartbeat responses.
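
As an illustration of the REQ-REP cycle (the message fields mirror the figure below, but the exact wire format and the helper are assumptions, not the real LMCache protocol), a worker-side loop over ZeroMQ might look like:

import time
import zmq

def re_register_and_full_sync() -> None:
    # Hypothetical helper: re-send registration plus a full report of locally cached chunks.
    pass

def heartbeat_loop(controller_url: str, instance_id: str, worker_id: str,
                   ip: str, port: int, peer_init_url: str) -> None:
    ctx = zmq.Context()
    while True:
        sock = ctx.socket(zmq.REQ)
        sock.connect(controller_url)              # e.g. "tcp://localhost:8082"
        sock.send_json({
            "instance_id": instance_id,
            "worker_id": worker_id,
            "ip": ip,
            "port": port,
            "peer_init_url": peer_init_url,
        })
        reply = sock.recv_json()
        sock.close()
        if not reply.get("success", True):
            # The controller does not know us (e.g. it restarted): re-register and full sync.
            re_register_and_full_sync()
        time.sleep(10)                            # heartbeat interval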

[Figure: heartbeat flow between Worker, Controller, and Registry. Every 10 seconds the worker sends its instance ID, worker ID, IP address, port, and peer-initialization URL; the Controller updates the heartbeat timestamp and checks whether a sync is needed before the worker resumes normal operation.]

Condition              Detection                          Action
Worker alive           Heartbeat arrives on time          Update timestamp → Normal
Worker dead            No heartbeat for 30s+              Deregister worker → Remove from cluster
Controller restarted   Success = false (worker unknown)   Auto re-register → Trigger Full Sync

Layer 2: Full Sync – State Recovery

When the Controller restarts, it has no memory of which workers exist or what chunks they hold. The heartbeat mechanism triggers Full Sync to rebuild this state.

Controller Restart Flow:

[Figure: Worker, Controller, and Registry message flow during heartbeat-driven state recovery after a Controller restart.]
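
A rough controller-side sketch of that flow, with hypothetical handler names and message fields, reusing the registry/admit helpers from the S1 sketch above:

import time

heartbeat_ts: dict[tuple[str, str], float] = {}

def handle_heartbeat(msg: dict) -> dict:
    inst, wk = msg["instance_id"], msg["worker_id"]
    if wk not in registry.get(inst, {}):
        # A freshly restarted controller has an empty RegistryTree: ask the worker
        # to re-register and push a full report of the chunks it currently holds.
        return {"success": False, "command": "full_sync"}
    heartbeat_ts[(inst, wk)] = time.time()
    return {"success": True}

def handle_full_report(msg: dict) -> None:
    # Rebuild this worker's subtree from its reported chunk hashes.
    for location, hashes in msg["chunks_by_location"].items():
        for h in hashes:
            admit(msg["instance_id"], msg["worker_id"], location, h)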

What is Freeze Mode?

During Full Sync, workers enter Freeze Mode to prevent data inconsistencies:

  • All store operations will be SKIPPED (no new data stored)
  • Only Local CPU will be used for retrieval (no peers or remote storage)
  • No admit/evict messages will be generated
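
A minimal sketch of how that gating might look inside a worker (class and attribute names are made up for illustration):

class P2PWorkerSketch:
    """Illustrative only: shows how Freeze Mode gates store/retrieve."""

    def __init__(self):
        self.frozen = False                  # set to True while a Full Sync is in flight
        self._local_cpu: dict[int, bytes] = {}

    def store(self, chunk_hash: int, kv_bytes: bytes) -> None:
        if self.frozen:
            return                           # skip stores: no new data, no admit message
        self._local_cpu[chunk_hash] = kv_bytes
        # ... send an admit message to the controller here (omitted)

    def retrieve(self, chunk_hash: int):
        local = self._local_cpu.get(chunk_hash)
        if self.frozen or local is not None:
            return local                     # Local CPU only while frozen; no peers or remote
        # ... otherwise fall back to peer / remote lookup (omitted)
        return None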

Cluster Monitoring

To make the Controller's state easier to observe, we also provide a Dashboard that offers visibility into Controller health and behavior. Please see: https://docs.lmcache.ai/controller/index.html

[Screenshot: LMCache Controller Dashboard overview showing system status, total key count, and recent activities.]

The Instances page displays information such as instance IPs, the number of workers per instance, and key counts.

[Screenshot: Instance Management view listing Instance ID, IP Address, Status, Worker Count, Key Count, Last Heartbeat, and view/remove actions.]

The Workers view allows you to inspect a specific worker’s key count, IP address, port number, and related information.

[Screenshot: Worker Management view listing instance IDs, worker IDs, IP addresses, ports, statuses, key counts, last heartbeats, and actions.]

In addition, it supports general capabilities such as retrieving metrics, thread information, and environment variables.

[Screenshot: Dashboard metrics view showing garbage collection statistics and worker counts.]

If you encounter any issues, please open an issue in the LMCache repository! We love to hear about production use cases and always welcome contributions to our open-source community.
