Extending LMCache Remote Connectors: MooncakeStore as an Example

By LMCache Team and Tencent


Highlights:

  • Tencent engineers joined forces with the LMCache team to build the new MooncakeStoreConnector, enabling Mooncake Store as a remote backend for LMCache!
  • This post delves into the design principles and implementation details of Remote Connectors in the LMCache system, illustrated through a case study of the MooncakeStoreConnector.
  • Specifically, we will explore:
    1. The overall architecture of LMCache and the core concepts behind Remote Connectors.
    2. A detailed breakdown of the interface implementations within MooncakeStoreConnector.
    3. Methods for extending the system and potential future optimizations.
  • This work represents an ongoing collaboration between engineers from Tencent and the LMCache Team, aimed at further enhancing LMCache’s capabilities in real production scenarios.

This article is based on LMCache V1 (experimental) at commit 01277a1, and presents it in the context of vLLM’s V0 engine.


Fig 1: LMCache x Tencent Collaboration

LMCache Architecture and Position in the Ecosystem

LMCache is an intelligent caching middleware specifically designed for Large Language Model (LLM) inference. Here’s a breakdown of its architecture and position:

  • Positioning: It sits between LLM inference engines (like vLLM, SGLang) and remote Key-Value (KV) storage systems.
  • Adaptability: Utilizes a unified RemoteConnector interface to seamlessly connect with various backend storage systems (e.g., Redis, Infinistore, MooncakeStore).
  • Core Functionality: Leverages a deep understanding of the inference engine’s KVCache management to efficiently extract and inject the attention key-value cache.
  • Key Value Proposition:
    • Significantly reduces VRAM usage and computational overhead during LLM inference.
    • Achieves this through multi-level caching strategies (memory -> local disk -> remote storage) and intelligent prefetching mechanisms.
  • Design Philosophy: Its modular design allows flexible integration and support for storage backends with diverse features.
  • Overall Goal: To provide an efficient and adaptable caching solution tailored for distributed LLM inference scenarios.

In the remote backend diagram above, MooncakeStore, Valkey, and DFS are Tencent’s contributions to LMCache.

Performance benchmarks comparing these different remote backends are currently being conducted by Tencent and will be shared in a future update; this post focuses on the architectural integration.

How LMCache Remote Connectors Work


Fig 2: LMCache Architecture Overview

LMCache employs a layered storage architecture where the StorageManager acts as the core management layer, coordinating between local disk storage (LocalDiskBackend) and remote storage (RemoteBackend). The RemoteBackend connects to various storage backends using different RemoteConnector implementations. The MooncakeStoreConnector, for example, is specifically designed for the Mooncake distributed storage system.

Here’s the initialization flow:

  1. When LMCache’s LMCacheConnector is initialized, it calls the init_lmcache_engine method.
  2. This method constructs an LMCacheEngine.
  3. During the LMCacheEngine construction, a StorageManager is created for it.
  4. While constructing the StorageManager, different StorageBackend instances are created based on the configuration:
    • If local_disk is set to True and max_local_disk_size (in GiB) has a valid value, a LocalDiskBackend is constructed.
    • If remote_url is not empty, a RemoteBackend is constructed.
    • If both local_disk and remote_url are configured, both backends are constructed. If neither is configured, neither is constructed.

During the StorageManager construction, it also checks if local_cpu is True. If so, it sets use_hot to True. When use_hot is enabled, KV cache entries being put are stored in the hot_cache, and get operations will prioritize retrieving from the hot_cache.
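As a minimal, self-contained sketch of that branching (class and field names here are illustrative, not LMCache’s actual code):

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class CacheConfig:
        local_cpu: bool = False
        local_disk: bool = False
        max_local_disk_size: float = 0.0  # GiB
        remote_url: Optional[str] = None

    def select_backends(config: CacheConfig) -> List[str]:
        """Mirror the StorageManager branching described above."""
        backends = []
        if config.local_disk and config.max_local_disk_size > 0:
            backends.append("LocalDiskBackend")
        if config.remote_url:
            backends.append("RemoteBackend")
        if config.local_cpu:
            # use_hot: entries are kept in hot_cache and read from it first.
            backends.append("hot_cache")
        return backends

    print(select_backends(CacheConfig(
        local_cpu=True, remote_url="mooncakestore://localhost:50051/")))
    # ['RemoteBackend', 'hot_cache']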

Due to scope limitations, this article will not delve into LocalDiskBackend, hot_cache, or MemoryAllocator. Our focus is solely on RemoteBackend.

When constructing a RemoteBackend, the configured remote_url is parsed. The scheme extracted from the URL determines which specific connector implementation to build.

The RemoteConnector serves as the abstract interface for LMCache’s interaction with remote storage systems. Its primary responsibilities include:

  • Connection Management
  • Data Storage and Retrieval
  • Resource Management
  • Error Handling

Fig 3: Remote Connectors Overview
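As a hedged sketch, the RemoteConnector abstraction can be pictured as follows; the method names mirror the responsibilities above, while LMCache’s actual signatures (key types, buffer objects, and any additional operations) may differ:

    from abc import ABC, abstractmethod
    from typing import Optional

    class RemoteConnector(ABC):
        """Abstract interface between LMCache and a remote KV store."""

        @abstractmethod
        async def exists(self, key: str) -> bool:
            """Check whether a KV cache entry is present remotely."""

        @abstractmethod
        async def get(self, key: str) -> Optional[bytes]:
            """Fetch the serialized KV cache bytes, or None on a miss."""

        @abstractmethod
        async def put(self, key: str, data: bytes) -> None:
            """Upload a serialized KV cache entry."""

        @abstractmethod
        async def close(self) -> None:
            """Release connections and other resources."""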

You can wire up a remote connector for the LMCache KV connector as follows: enable LMCacheConnector in vLLM (1), then specify the remote URL either via an environment variable (2) or via a configuration file (3):

  1. Via vLLM Startup Argument: Specify LMCacheConnector as the kv_connector for vLLM:
    --kv-transfer-config '{"kv_connector":"LMCacheConnector","kv_role":"kv_both","kv_parallel_size":2}'
    
  2. Via Environment Variable: Set the LMCACHE_REMOTE_URL environment variable:
    LMCACHE_REMOTE_URL="<REMOTE_URL>"
    
  3. Via Configuration File: Specify the path to a configuration file using the LMCACHE_CONFIG_FILE environment variable:
    LMCACHE_CONFIG_FILE=/tmp/lmcache_example.yaml
    

    Inside the lmcache_example.yaml file (YAML format), specify the remote connector using the remote_url key:

    remote_url: "<REMOTE_URL>"
    

In the methods above, <REMOTE_URL> follows this format: <REMOTE_SCHEME>://<HOST0>:<PORT0>,<HOST1>:<PORT1>,.../<PATH>/?device=<DEVICE_NAME>

Here are some examples:

  • lm://localhost:65432 -> Remote is an LMCache server; the service address is localhost:65432.
  • redis://localhost:6379 -> Remote is Redis; the service address is localhost:6379.
  • redis-sentinel://localhost:26379,localhost:26380,localhost:26381 -> Remote is Redis in Sentinel mode, specifying three Sentinel service addresses: localhost:26379, localhost:26380, and localhost:26381.
  • mooncakestore://localhost:50051/ -> Remote is Mooncake Store; the Mooncake master address is localhost:50051. Mooncake Store’s other settings must be supplied via its environment variables, e.g. MOONCAKE_CONFIG_PATH=/tmp/mooncake.json.
  • blackhole://host:0/ -> Remote is a blackhole: writes are discarded, reads return None, and existence checks return false. Useful for benchmarking the performance ceiling of the remote path (PR #505).
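For illustration, here is how the scheme, host list, and optional device parameter can be recovered from such a URL using Python’s standard library (this is not LMCache’s actual parser):

    from urllib.parse import urlparse, parse_qs

    def parse_remote_url(url: str):
        parsed = urlparse(url)
        # Multiple host:port pairs may share one netloc, comma-separated.
        hosts = [hp for hp in parsed.netloc.split(",") if hp]
        device = parse_qs(parsed.query).get("device", [None])[0]
        return parsed.scheme, hosts, device

    print(parse_remote_url(
        "redis-sentinel://localhost:26379,localhost:26380,localhost:26381"))
    # ('redis-sentinel', ['localhost:26379', 'localhost:26380', 'localhost:26381'], None)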

How to Extend with a New Remote Backend Connector

If your scenario requires adapting a new remote backend connector, how do you implement one?


Fig 4: LMCache Examples

You can refer to the existing examples in LMCache: write a new connector implementation, then construct the corresponding remote backend connector in __init__.py based on the scheme specified in the user’s remote URL, as sketched below.
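A minimal sketch of that scheme dispatch, assuming a hypothetical create_connector helper (LMCache’s actual CreateConnector takes more arguments, and MyNewConnector is a stand-in for your connector class):

    from urllib.parse import urlparse

    class MyNewConnector:
        """Stand-in for your RemoteConnector subclass."""
        def __init__(self, url: str):
            self.url = url

    def create_connector(url: str):
        # Dispatch on the URL scheme, mirroring what LMCache's
        # CreateConnector does for redis://, mooncakestore://, etc.
        scheme = urlparse(url).scheme
        registry = {
            "myscheme": MyNewConnector,
            # "redis": RedisConnector, "mooncakestore": MooncakestoreConnector, ...
        }
        if scheme not in registry:
            raise ValueError(f"Unsupported remote scheme: {scheme}")
        return registry[scheme](url)

    connector = create_connector("myscheme://localhost:9000/")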

How to Extend the MooncakeStore Remote Connector

Introduction to MooncakeStore


Fig 5: Mooncake Store Architecture

Mooncake Store is a high-performance distributed KV cache storage engine designed specifically for LLM inference scenarios.

Mooncake Store is managed by a global Master Service responsible for allocating pooled storage space. As shown in the figure above, Mooncake Store has two key components: the Master Service and the Client (the Client resides within the vLLM instance process). For a detailed introduction, please refer to the official Mooncake Store documentation: https://github.com/kvcache-ai/Mooncake/blob/main/doc/en/mooncake-store-preview.md

Detailed Process for Extending the MooncakeStore Remote Connector

In vLLM, Mooncake Store can be used as a remote KV store to implement Prefill/Decode (PD) separation, or for KV cache offload & reuse. This helps improve the hit rate in multi-turn conversation scenarios.

Considering that LMCache has already done extensive work on vLLM KV cache adaptation and has integrated several excellent third-party KV stores such as Redis and Infinistore, Mooncake Store can likewise be integrated as an LMCache backend. This lets the integration focus on the store itself rather than on the details of adapting to vLLM.

Since LMCache already provides a well-designed front-end/back-end architecture and interface abstractions, it is straightforward for open-source contributors to extend it with connectors for new KV stores. This section uses MooncakeStoreConnector as an example to detail how to implement a new remote connector.

Creating the MooncakestoreConnector Class

MooncakestoreConnector must subclass RemoteConnector and implement all of its abstract methods.


Fig 6: Mooncake Connector Demonstration

Initialization and Configuration

The initialization process for MooncakeStoreConnector involves three key steps:

  1. Dependency Check: Verify if the Mooncake library is installed.
  2. Configuration Loading: Load configuration from the file specified by an environment variable.
  3. Storage Initialization: Set up the MooncakeDistributedStore instance.

Fig 7: Mooncake Connector Initialization Workflow

Import MooncakeDistributedStore via from mooncake.store import MooncakeDistributedStore and construct an instance. First parse the Mooncake configuration, then initialize the instance via MooncakeDistributedStore.setup; this completes initialization. All subsequent interactions with the Mooncake store go through this instance.
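Here is a hedged sketch of those three steps. The setup() argument list mirrors the fields of the mooncake.json example shown later and Mooncake’s documented usage; it may vary across Mooncake versions, and the segment/buffer sizes are illustrative assumptions:

    import json
    import os

    # 1. Dependency check: fail early if mooncake is not installed.
    try:
        from mooncake.store import MooncakeDistributedStore
    except ImportError as e:
        raise RuntimeError("mooncake is not installed") from e

    # 2. Configuration loading from the file named by MOONCAKE_CONFIG_PATH
    #    (fields match the mooncake.json example in the testing section).
    with open(os.environ["MOONCAKE_CONFIG_PATH"]) as f:
        cfg = json.load(f)

    # 3. Storage initialization. The exact setup() signature may vary by
    #    Mooncake version; the sizes below are illustrative assumptions.
    store = MooncakeDistributedStore()
    store.setup(
        cfg["local_hostname"],
        cfg["metadata_server"],
        cfg.get("global_segment_size", 3355443200),  # 3.125 GiB, assumed
        cfg.get("local_buffer_size", 1073741824),    # 1 GiB, assumed
        cfg["protocol"],
        cfg["device_name"],
        cfg["master_server_address"],
    )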

Implementation Details

exists

The simplest implementation just needs to call the is_exists method of MooncakeDistributedStore.

put

The put operation first constructs the metadata, then sends the metadata together with the serialized KV bytes to the backend store. Note that after the main logic of put completes, the reference count of memory_obj must be decremented to prevent memory leaks.

get

The get method retrieves data from the Mooncake store and constructs an LMCache-managed MemoryObj.


Fig 8: Mooncake Connector Online Workflow
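Putting the pieces together, here is a hedged, self-contained sketch of the three methods. The metadata layout and the MemoryObj class are simplified stand-ins for LMCache’s internals; only the overall flow and the is_exists delegation follow the description above:

    import struct
    from typing import Optional

    class MemoryObj:
        """Stand-in for LMCache's ref-counted buffer object."""
        def __init__(self, data: bytes):
            self.byte_array = bytearray(data)
            self._ref = 1

        def ref_count_down(self) -> None:
            self._ref -= 1

    class MooncakestoreConnector:
        def __init__(self, store):
            self.store = store  # a MooncakeDistributedStore instance

        async def exists(self, key: str) -> bool:
            # Directly delegates to the store's existence check.
            return self.store.is_exists(key)

        async def put(self, key: str, memory_obj: MemoryObj) -> None:
            try:
                # Prepend a small metadata header (here just the payload
                # length) to the serialized KV bytes before storing.
                payload = bytes(memory_obj.byte_array)
                metadata = struct.pack("<I", len(payload))
                self.store.put(key, metadata + payload)
            finally:
                # Must drop the reference once put completes, or the
                # pinned buffer leaks.
                memory_obj.ref_count_down()

        async def get(self, key: str) -> Optional[MemoryObj]:
            data = self.store.get(key)
            if not data:
                return None
            (length,) = struct.unpack_from("<I", data)
            # Rebuild an LMCache-managed object around the fetched bytes.
            return MemoryObj(bytes(data[4:4 + length]))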

Testing Process

Start vLLM + LMCacheConnector + Mooncake:

  1. Start etcd:
    # Start etcd
    docker run  -p 2379:2379 -p 2380:2380 --rm -e ALLOW_NONE_AUTHENTICATION=yes  --name etcd  bitnami/etcd
    

    … (other dependencies if needed)

  2. Start mooncake_master:
    # Start mooncake_master
    /mooncake/mooncake-store/src/mooncake_master -v=1 -port=50051
    
  3. Start vLLM with LMCacheConnector and Mooncake Store as the remote backend:
    # Start vllm with lmcache connector and specific mooncake store as remote backend of lmcache
    VLLM_USE_V1=0 \
    MOONCAKE_CONFIG_PATH=./mooncake.json \
    LMCACHE_USE_EXPERIMENTAL=True LMCACHE_TRACK_USAGE=false \
    LMCACHE_CHUNK_SIZE=16 LMCACHE_LOCAL_CPU=False LMCACHE_MAX_LOCAL_CPU_SIZE=5.0 \
    LMCACHE_REMOTE_URL=mooncakestore://localhost:50051/ \
    LMCACHE_REMOTE_SERDE="cachegen" \
    vllm serve /disc/f/models/opt-125m/ \
              --served-model-name "facebook/opt-125m" \
              --enforce-eager  \
              --port 8000 \
              --gpu-memory-utilization 0.8 \
              --kv-transfer-config '{"kv_connector":"LMCacheConnector","kv_role":"kv_both","kv_parallel_size":2}' \
              --trust-remote-code
    
  4. Example mooncake.json configuration:
    {
        "local_hostname": "localhost",
        "metadata_server": "etcd://localhost:2379",
        "protocol": "tcp",
        "device_name": "",
        "master_server_address": "localhost:50051"
    }
    

Testing

We now issue the same curl request twice to verify the cache-miss (cold) and cache-hit (hot) scenarios.

root@docker-desktop:/vllm-workspace: curl http://localhost:8000/v1/completions     -H "Content-Type: application/json"     -d '{
        "model": "facebook/opt-125m",
        "prompt": "San Francisco is a 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15",
        "max_tokens": 10,
        "temperature": 0
    }'

{
  "id": "cmpl-e9a422dd7e1b41afa58c2c666edd80a5",
  "object": "text_completion",
  "created": 1744729807,
  "model": "facebook/opt-125m",
  "choices": [
    {
      "index": 0,
      "text": " 16 17 18 19 20 21 22 23 24 25",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 20,
    "total_tokens": 30,
    "completion_tokens": 10,
    "prompt_tokens_details": null
  }
}

root@docker-desktop:/vllm-workspace: curl http://localhost:8000/v1/completions     -H "Content-Type: application/json"     -d '{
        "model": "facebook/opt-125m",
        "prompt": "San Francisco is a 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15",
        "max_tokens": 10,
        "temperature": 0
    }'

{
  "id": "cmpl-132df463ebca49949906792ad629a9b3",
  "object": "text_completion",
  "created": 1744729824,
  "model": "facebook/opt-125m",
  "choices": [
    {
      "index": 0,
      "text": " 16 17 18 19 20 21 22 23 24 25",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 20,
    "total_tokens": 30,
    "completion_tokens": 10,
    "prompt_tokens_details": null
  }
}

Log Analysis

The Mooncake master logs are divided into four sections:

  1. Startup Logs: These indicate that the service has started successfully.
  2. Buffer Allocator Initialization: As vLLM starts, buffer-registration requests (seen as MountSegment in the logs) arrive. These requests originate from the Mooncake store client embedded within the vLLM inference engine process.
  3. Cold Run Test Request Logs: This represents the first run where the cache is entirely missed. The logs show that the get_replica_list operation returns object_not_found, followed by logs for the put operation.
  4. Hot Run Test Request Logs: This represents the second run where the cache is fully hit. The logs show only successful get_replica_list operations.
root@docker-desktop:/vllm-workspace# /opt/venv/lib/python3.12/site-packages/mooncake/mooncake_master -v=1
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0419 14:41:06.556012 12693 master.cpp:27] Master service started on port 50051, enable_gc=0, max_threads=4
I0419 14:41:06.559537 12693 master_service.cpp:77] action=gc_disabled



I0419 14:42:04.932854 12703 scoped_vlog_timer.h:42] MountSegment request: buffer=140097034911744, size=3355443200, segment_name=localhost:13503
I0419 14:42:04.933364 12703 allocator.cpp:24] initializing_buffer_allocator segment_name=localhost:13503 base_address=0x7f6ae2000000 size=3355443200
I0419 14:42:04.933933 12703 allocator.cpp:50] buffer_allocator_initialized pool_id=0
I0419 14:42:04.934000 12703 scoped_vlog_timer.h:77] MountSegment response: {"error_code":0}, latency=1163us


I0419 14:42:47.904599 12703 scoped_vlog_timer.h:42] GetReplicaList request: key=vllm@/disc/f/models/opt-125m/@1@0@3d29b85e7dc2d3eaa9b1a4f3ecbbb70e2b239d6b8d00ea82b88b1758b00ad754
I0419 14:42:47.906082 12703 master_service.cpp:116] key=vllm@/disc/f/models/opt-125m/@1@0@3d29b85e7dc2d3eaa9b1a4f3ecbbb70e2b239d6b8d00ea82b88b1758b00ad754, info=object_not_found
I0419 14:42:47.906311 12703 scoped_vlog_timer.h:77] GetReplicaList response: {"replica_list":[],"error_code":-704}, latency=2480us
I0419 14:42:48.147836 12703 scoped_vlog_timer.h:42] GetReplicaList request: key=vllm@/disc/f/models/opt-125m/@1@0@3d29b85e7dc2d3eaa9b1a4f3ecbbb70e2b239d6b8d00ea82b88b1758b00ad754
I0419 14:42:48.147917 12703 master_service.cpp:116] key=vllm@/disc/f/models/opt-125m/@1@0@3d29b85e7dc2d3eaa9b1a4f3ecbbb70e2b239d6b8d00ea82b88b1758b00ad754, info=object_not_found
I0419 14:42:48.147954 12703 scoped_vlog_timer.h:77] GetReplicaList response: {"replica_list":[],"error_code":-704}, latency=125us
I0419 14:42:48.149864 12703 scoped_vlog_timer.h:42] GetReplicaList request: key=vllm@/disc/f/models/opt-125m/@1@0@3d29b85e7dc2d3eaa9b1a4f3ecbbb70e2b239d6b8d00ea82b88b1758b00ad754
I0419 14:42:48.149912 12703 master_service.cpp:116] key=vllm@/disc/f/models/opt-125m/@1@0@3d29b85e7dc2d3eaa9b1a4f3ecbbb70e2b239d6b8d00ea82b88b1758b00ad754, info=object_not_found
I0419 14:42:48.149919 12703 scoped_vlog_timer.h:77] GetReplicaList response: {"replica_list":[],"error_code":-704}, latency=66us
I0419 14:42:49.235935 12703 scoped_vlog_timer.h:42] PutStart request: key=vllm@/disc/f/models/opt-125m/@1@0@3d29b85e7dc2d3eaa9b1a4f3ecbbb70e2b239d6b8d00ea82b88b1758b00ad754, value_length=1384015, slice_lengths=1
I0419 14:42:49.248529 12703 master_service.cpp:175] key=vllm@/disc/f/models/opt-125m/@1@0@3d29b85e7dc2d3eaa9b1a4f3ecbbb70e2b239d6b8d00ea82b88b1758b00ad754, value_length=1384015, slice_count=1, config=ReplicateConfig: { replica_num: 1 }, action=put_start_begin
I0419 14:42:49.261435 12703 allocator.cpp:75] allocation_succeeded size=1384015 segment=localhost:13503 address=0x7f6ae2000000
I0419 14:42:49.261528 12703 master_service.cpp:221] key=vllm@/disc/f/models/opt-125m/@1@0@3d29b85e7dc2d3eaa9b1a4f3ecbbb70e2b239d6b8d00ea82b88b1758b00ad754, replica_id=0, slice_index=0, handle=AllocatedBuffer: { segment_name: localhost:13503, size: 1384015, status: INIT, buffer_ptr: 0x7f6ae2000000 }, action=slice_allocated
I0419 14:42:49.261662 12703 scoped_vlog_timer.h:77] PutStart response: {"replica_list":[{"buffer_descriptors":[{"segment_name_":"localhost:13503","size_":1384015,"buffer_address_":140097034911744,"status_":0}],"status":2}],"error_code":0}, latency=26121us
I0419 14:42:49.512735 12703 scoped_vlog_timer.h:42] PutEnd request: key=vllm@/disc/f/models/opt-125m/@1@0@3d29b85e7dc2d3eaa9b1a4f3ecbbb70e2b239d6b8d00ea82b88b1758b00ad754
I0419 14:42:49.512790 12703 scoped_vlog_timer.h:77] PutEnd response: {"error_code":0}, latency=65us
I0419 14:42:49.514432 12703 scoped_vlog_timer.h:42] GetReplicaList request: key=vllm@/disc/f/models/opt-125m/@1@0@a7301c544c86fe97d973cb0e2a154e1d54aba116874e07c18df1efc6ffdddd60
I0419 14:42:49.515123 12703 master_service.cpp:116] key=vllm@/disc/f/models/opt-125m/@1@0@a7301c544c86fe97d973cb0e2a154e1d54aba116874e07c18df1efc6ffdddd60, info=object_not_found
I0419 14:42:49.515190 12703 scoped_vlog_timer.h:77] GetReplicaList response: {"replica_list":[],"error_code":-704}, latency=768us
I0419 14:42:49.541345 12703 scoped_vlog_timer.h:42] PutStart request: key=vllm@/disc/f/models/opt-125m/@1@0@a7301c544c86fe97d973cb0e2a154e1d54aba116874e07c18df1efc6ffdddd60, value_length=1315431, slice_lengths=1
I0419 14:42:49.541409 12703 master_service.cpp:175] key=vllm@/disc/f/models/opt-125m/@1@0@a7301c544c86fe97d973cb0e2a154e1d54aba116874e07c18df1efc6ffdddd60, value_length=1315431, slice_count=1, config=ReplicateConfig: { replica_num: 1 }, action=put_start_begin
I0419 14:42:49.541421 12703 allocator.cpp:75] allocation_succeeded size=1315431 segment=localhost:13503 address=0x7f6ae216fbb8
I0419 14:42:49.541425 12703 master_service.cpp:221] key=vllm@/disc/f/models/opt-125m/@1@0@a7301c544c86fe97d973cb0e2a154e1d54aba116874e07c18df1efc6ffdddd60, replica_id=0, slice_index=0, handle=AllocatedBuffer: { segment_name: localhost:13503, size: 1315431, status: INIT, buffer_ptr: 0x7f6ae216fbb8 }, action=slice_allocated
I0419 14:42:49.541462 12703 scoped_vlog_timer.h:77] PutStart response: {"replica_list":[{"buffer_descriptors":[{"segment_name_":"localhost:13503","size_":1315431,"buffer_address_":140097036417976,"status_":0}],"status":2}],"error_code":0}, latency=126us
I0419 14:42:49.559959 12703 scoped_vlog_timer.h:42] PutEnd request: key=vllm@/disc/f/models/opt-125m/@1@0@a7301c544c86fe97d973cb0e2a154e1d54aba116874e07c18df1efc6ffdddd60
I0419 14:42:49.560063 12703 scoped_vlog_timer.h:77] PutEnd response: {"error_code":0}, latency=212us


I0419 14:43:04.882916 12703 scoped_vlog_timer.h:42] GetReplicaList request: key=vllm@/disc/f/models/opt-125m/@1@0@3d29b85e7dc2d3eaa9b1a4f3ecbbb70e2b239d6b8d00ea82b88b1758b00ad754
I0419 14:43:04.883111 12703 scoped_vlog_timer.h:77] GetReplicaList response: {"replica_list":[{"buffer_descriptors":[{"segment_name_":"localhost:13503","size_":1384015,"buffer_address_":140097034911744,"status_":1}],"status":3}],"error_code":0}, latency=238us
I0419 14:43:05.049643 12703 scoped_vlog_timer.h:42] GetReplicaList request: key=vllm@/disc/f/models/opt-125m/@1@0@a7301c544c86fe97d973cb0e2a154e1d54aba116874e07c18df1efc6ffdddd60
I0419 14:43:05.049695 12703 scoped_vlog_timer.h:77] GetReplicaList response: {"replica_list":[{"buffer_descriptors":[{"segment_name_":"localhost:13503","size_":1315431,"buffer_address_":140097036417976,"status_":1}],"status":3}],"error_code":0}, latency=59us
I0419 14:43:05.105633 12703 scoped_vlog_timer.h:42] GetReplicaList request: key=vllm@/disc/f/models/opt-125m/@1@0@3d29b85e7dc2d3eaa9b1a4f3ecbbb70e2b239d6b8d00ea82b88b1758b00ad754
I0419 14:43:05.105705 12703 scoped_vlog_timer.h:77] GetReplicaList response: {"replica_list":[{"buffer_descriptors":[{"segment_name_":"localhost:13503","size_":1384015,"buffer_address_":140097034911744,"status_":1}],"status":3}],"error_code":0}, latency=80us
I0419 14:43:05.106576 12703 scoped_vlog_timer.h:42] GetReplicaList request: key=vllm@/disc/f/models/opt-125m/@1@0@a7301c544c86fe97d973cb0e2a154e1d54aba116874e07c18df1efc6ffdddd60
I0419 14:43:05.106652 12703 scoped_vlog_timer.h:77] GetReplicaList response: {"replica_list":[{"buffer_descriptors":[{"segment_name_":"localhost:13503","size_":1315431,"buffer_address_":140097036417976,"status_":1}],"status":3}],"error_code":0}, latency=86us

Tencent’s Other Contributions to Remote Backends

  • Added device parsing to the remote URL parser; Infinistore now supports device configuration via URL.
  • Abstracted and refactored the Redis connector to reduce code conflicts between sentinel and standalone modes.
  • Redis, Valkey, MooncakeStore, etc., now support both naive and CacheGen serialization/deserialization (serde) methods.

Future Optimization Directions

  • Adopt a more flexible framework for managing connector extensions to avoid modifying the CreateConnector method for each new connector.
  • During testing, LMCache checks for a key’s existence multiple times. To reduce interaction with the remote store, cache the existence status of keys for a period (see the sketch after this list).
  • Enhance observability by adding metrics to track memory allocator allocate and free operations for performance analysis.
  • Enable CacheGen Serde support for MLA.
  • Adapt LMCache for other inference engines like SGLang.
  • More ideas… (feel free to submit an RFC!)
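As a hedged sketch of the existence-caching idea above (the wrapper name and TTL are illustrative):

    import time

    class CachedExistence:
        """Remember positive exists() results for a short TTL to avoid
        repeated round trips to the remote store."""

        def __init__(self, connector, ttl_seconds: float = 5.0):
            self.connector = connector
            self.ttl = ttl_seconds
            self._seen = {}  # key -> timestamp of last confirmed hit

        async def exists(self, key: str) -> bool:
            now = time.monotonic()
            stamp = self._seen.get(key)
            if stamp is not None and now - stamp < self.ttl:
                return True
            found = await self.connector.exists(key)
            if found:
                self._seen[key] = now
            return found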