Highlights:
- Tencent engineers join forces with LMCache, creating the innovative MooncakeStoreConnector to enable Mooncake Store use in LMCache!
- This post delves into the design principles and implementation details of Remote Connectors in the LMCache system, illustrated through a case study of the MooncakeStoreConnector.
- Specifically, we will explore:
- The overall architecture of LMCache and the core concepts behind Remote Connectors.
- A detailed breakdown of the interface implementations within MooncakeStoreConnector.
- Methods for extending the system and potential future optimizations.
- This work represents an ongoing collaboration between engineers from Tencent and the LMCache Team, aimed at further enhancing LMCache’s capabilities in real production scenarios.
This article is based on LMCache V1 (experimental) at commit 01277a1 and describes it in the context of the V0 version of the vLLM inference engine.

Fig 1: LMCache x Tencent Collaboration
LMCache Architecture and Position in the Ecosystem
LMCache is an intelligent caching middleware specifically designed for Large Language Model (LLM) inference. Here’s a breakdown of its architecture and position:
- Positioning: It sits between LLM inference engines (like vLLM, SGLang) and remote Key-Value (KV) storage systems.
- Adaptability: Uses a unified RemoteConnector interface to connect seamlessly with various backend storage systems (e.g., Redis, Infinistore, MooncakeStore).
- Core Functionality: Leverages a deep understanding of the inference engine’s KVCache management to efficiently extract and inject the attention key-value cache.
- Key Value Proposition:
- Significantly reduces VRAM usage and computational overhead during LLM inference.
- Achieves this through multi-level caching strategies (memory -> local disk -> remote storage) and intelligent prefetching mechanisms.
- Design Philosophy: Its modular design allows flexible integration and support for storage backends with diverse features.
- Overall Goal: To provide an efficient and adaptable caching solution tailored for distributed LLM inference scenarios.
In the remote backend diagram above, MooncakeStore, Valkey, and DFS are Tencent’s contributions to LMCache.
Performance benchmarks comparing these different remote backends are currently being conducted by Tencent and will be shared in a future update; this post focuses on the architectural integration.
How LMCache Remote Connectors Work

Fig 2: LMCache Architecture Overview
LMCache employs a layered storage architecture in which the StorageManager acts as the core management layer, coordinating between local disk storage (LocalDiskBackend) and remote storage (RemoteBackend). The RemoteBackend connects to various storage backends using different RemoteConnector implementations. The MooncakeStoreConnector, for example, is specifically designed for the Mooncake distributed storage system.
Here’s the initialization flow:
- When LMCache’s LMCacheConnector is initialized, it calls the init_lmcache_engine method.
- This method constructs an LMCacheEngine.
- During the LMCacheEngine construction, a StorageManager is created for it.
- While constructing the StorageManager, different StorageBackend instances are created based on the configuration:
- If local_disk is set to True and max_local_disk_size (in GiB) has a valid value, a LocalDiskBackend is constructed.
- If remote_url is not empty, a RemoteBackend is constructed.
- If both local_disk and remote_url are configured, both backends are constructed. If neither is configured, neither is constructed.
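The backend selection described above can be summarized in a short sketch. The Config dataclass and the select_backends function are hypothetical names used only for illustration, not LMCache’s actual classes; only the configuration keys (local_disk, max_local_disk_size, remote_url) come from the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Config:
    # Hypothetical stand-in for the LMCache engine configuration.
    local_disk: bool = False
    max_local_disk_size: float = 0.0  # GiB
    remote_url: Optional[str] = None


def select_backends(cfg: Config) -> List[str]:
    """Return the names of the storage backends that would be constructed."""
    backends: List[str] = []
    # A local disk backend needs both the flag and a positive size.
    if cfg.local_disk and cfg.max_local_disk_size > 0:
        backends.append("LocalDiskBackend")
    # A remote backend is built whenever a non-empty URL is configured.
    if cfg.remote_url:
        backends.append("RemoteBackend")
    return backends
```

With both options configured the function returns both names, and with neither it returns an empty list, mirroring the four cases in the list above.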
During the StorageManager construction, it also checks whether local_cpu is True. If so, it sets use_hot to True. When use_hot is enabled, KV cache entries being put are stored in the hot_cache, and get operations prioritize retrieving from the hot_cache.
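As a rough illustration of the use_hot behaviour, the toy class below keeps an in-memory dictionary that is consulted before any slower backend. HotCacheSketch and fallback_get are hypothetical names; the real StorageManager manages memory objects rather than raw bytes.

```python
from typing import Callable, Dict, Optional


class HotCacheSketch:
    """Toy model of use_hot: an in-memory dict checked before slower backends."""

    def __init__(self, use_hot: bool,
                 fallback_get: Callable[[str], Optional[bytes]]):
        self.use_hot = use_hot
        self.hot_cache: Dict[str, bytes] = {}
        # fallback_get stands in for local disk or remote lookups.
        self.fallback_get = fallback_get

    def put(self, key: str, value: bytes) -> None:
        # Entries are kept in the hot cache only when use_hot is enabled.
        if self.use_hot:
            self.hot_cache[key] = value

    def get(self, key: str) -> Optional[bytes]:
        # The hot cache takes priority; slower backends are the fallback.
        if self.use_hot and key in self.hot_cache:
            return self.hot_cache[key]
        return self.fallback_get(key)
```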
Due to scope limitations, this article will not delve into LocalDiskBackend, hot_cache, or MemoryAllocator. Our focus is solely on RemoteBackend.
When constructing a RemoteBackend, the configured remote_url is parsed. The scheme extracted from the URL determines which specific connector implementation to build.
The RemoteConnector serves as the abstract interface for LMCache’s interaction with remote storage systems. Its primary responsibilities include:
- Connection Management
- Data Storage and Retrieval
- Resource Management
- Error Handling
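To make these responsibilities concrete, here is a simplified, synchronous sketch of what such an interface could look like. RemoteConnectorSketch and InMemoryConnector are hypothetical names, and the method set is an assumption based on the operations discussed in this post (exists, get, put) plus a close method for resource cleanup; the actual RemoteConnector class in LMCache may differ in signatures and may be asynchronous.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class RemoteConnectorSketch(ABC):
    """Simplified stand-in for an abstract remote connector interface."""

    @abstractmethod
    def exists(self, key: str) -> bool: ...

    @abstractmethod
    def get(self, key: str) -> Optional[bytes]: ...

    @abstractmethod
    def put(self, key: str, value: bytes) -> None: ...

    @abstractmethod
    def close(self) -> None: ...


class InMemoryConnector(RemoteConnectorSketch):
    """Dict-backed implementation, useful only for illustrating the contract."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}
        self._closed = False

    def _check_open(self) -> None:
        # Error handling: refuse to operate on a closed connection.
        if self._closed:
            raise ConnectionError("connector is closed")

    def exists(self, key: str) -> bool:
        self._check_open()
        return key in self._data

    def get(self, key: str) -> Optional[bytes]:
        self._check_open()
        return self._data.get(key)

    def put(self, key: str, value: bytes) -> None:
        self._check_open()
        self._data[key] = value

    def close(self) -> None:
        # Resource management: release stored data and mark as closed.
        self._data.clear()
        self._closed = True
```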

Fig 3: Remote Connectors Overview
You can specify the remote connector for the LMCache KV connector using the following methods:
- Via vLLM Startup Argument: Specify LMCacheConnector as the kv_connector for vLLM:
--kv-transfer-config '{"kv_connector":"LMCacheConnector","kv_role":"kv_both","kv_parallel_size":2}'
- Via Environment Variable: Set the LMCACHE_REMOTE_URL environment variable:
LMCACHE_REMOTE_URL="<REMOTE_URL>"
- Via Configuration File: Specify the path to a configuration file using the LMCACHE_CONFIG_FILE environment variable:
LMCACHE_CONFIG_FILE=/tmp/lmcache_example.yaml
Inside the lmcache_example.yaml file (YAML format), specify the remote connector using the remote_url key:
remote_url: "<REMOTE_URL>"
In the methods above, <REMOTE_URL> follows this format:
<REMOTE_SCHEME>://<HOST0>:<PORT0>,<HOST1>:<PORT1>.../<PATH>/?device=<DEVICE_NAME>
Here are some examples:
Example | Description |
---|---|
lm://localhost:65432 | Remote is an LMCache server; the service address is localhost:65432. |
redis://localhost:6379 | Remote is Redis; the service address is localhost:6379. |
redis-sentinel://localhost:26379,localhost:26380,localhost:26381 | Remote is Redis in Sentinel mode, specifying 3 Sentinel service addresses: localhost:26379, localhost:26380, localhost:26381. |
mooncakestore://localhost:50051/ | Remote is Mooncake Store; the Mooncake master address is localhost:50051. Other Mooncake Store settings are specified through a Mooncake environment variable, e.g. MOONCAKE_CONFIG_PATH=/tmp/mooncake.json. |
blackhole://host:0/ | Remote is blackhole. As the name suggests, write operations are discarded, read operations return None, and existence checks return false. It is used for benchmarking the remote performance ceiling (PR#505). |
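The URL format and the examples in the table can be parsed with Python’s standard library. The following is a minimal sketch; ParsedRemoteURL and parse_remote_url are hypothetical names, and LMCache’s own parser may behave differently (for instance in how it validates hosts or handles the path).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple
from urllib.parse import parse_qs, urlparse


@dataclass
class ParsedRemoteURL:
    scheme: str
    hosts: List[Tuple[str, int]]
    path: str
    device: Optional[str]


def parse_remote_url(url: str) -> ParsedRemoteURL:
    """Split a remote URL into scheme, host/port pairs, path, and device."""
    parsed = urlparse(url)
    if not parsed.scheme or not parsed.netloc:
        raise ValueError(f"invalid remote URL: {url!r}")
    hosts: List[Tuple[str, int]] = []
    # Several host:port pairs may be given, separated by commas.
    for entry in parsed.netloc.split(","):
        host, sep, port = entry.rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError(f"expected host:port, got {entry!r}")
        hosts.append((host, int(port)))
    # The optional device name travels in the query string.
    device = parse_qs(parsed.query).get("device", [None])[0]
    return ParsedRemoteURL(parsed.scheme, hosts, parsed.path, device)
```

The scheme field of the result is what a backend would dispatch on to choose a connector implementation.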
How to Extend with a New Remote Backend Connector
If your scenario requires adapting a new remote backend Connector, how can you implement it?


Fig 4: LMCache Examples
You can refer to the existing examples in LMCache, write a new connector implementation, and construct the corresponding remote backend connector in __init__.py based on the scheme specified in the user’s remote URL.
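One way to picture scheme-based construction is a small registry that maps URL schemes to connector classes. This is a hypothetical sketch, not how LMCache’s __init__.py is organized: register_connector, create_connector, and BlackholeSketch are illustrative names, and the blackhole behaviour follows the description in the table above.

```python
from typing import Callable, Dict, Optional
from urllib.parse import urlparse

# Maps a URL scheme to a callable that builds a connector from the URL.
_CONNECTOR_REGISTRY: Dict[str, Callable[[str], object]] = {}


def register_connector(scheme: str):
    """Class decorator that registers a connector under a URL scheme."""
    def decorator(factory):
        _CONNECTOR_REGISTRY[scheme] = factory
        return factory
    return decorator


@register_connector("blackhole")
class BlackholeSketch:
    """Discards writes, returns None on reads, and reports no keys."""

    def __init__(self, url: str) -> None:
        self.url = url

    def exists(self, key: str) -> bool:
        return False

    def get(self, key: str) -> Optional[bytes]:
        return None

    def put(self, key: str, value: bytes) -> None:
        pass


def create_connector(url: str):
    """Build the connector registered for the URL's scheme."""
    scheme = urlparse(url).scheme
    try:
        factory = _CONNECTOR_REGISTRY[scheme]
    except KeyError:
        raise ValueError(f"unknown remote scheme: {scheme!r}") from None
    return factory(url)
```

With this layout, adding a backend means writing one class and decorating it; the dispatch function itself does not need to change.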
How to Extend the MooncakeStore Remote Connector
Introduction to MooncakeStore

Fig 5: Mooncake Store Architecture
Mooncake Store is a high-performance distributed KV cache storage engine designed specifically for LLM inference scenarios.
Mooncake Store is managed by a global Master Service responsible for allocating storage space pools. As shown in the figure above, Mooncake Store has two key components: the Master Service and the Client (the Client resides within the vLLM instance process). For a detailed introduction, please refer to the official Mooncake Store documentation: https://github.com/kvcache-ai/Mooncake/blob/main/doc/en/mooncake-store-preview.md
Detailed Process for Extending the MooncakeStore Remote Connector
In vLLM, Mooncake Store can be used as a remote KV store to implement Prefill/Decode (PD) separation, or for KV cache offload & reuse. This helps improve the hit rate in multi-turn conversation scenarios.
Considering that LMCache has already done extensive work on vLLM KV cache adaptation and has integrated several excellent third-party KV stores like Redis and Infinistore, Mooncake Store can also be integrated into the LMCache backend. This allows focusing more on the Store itself without needing to worry too much about the details of adapting to vLLM.
Because LMCache already provides a well-designed front-end/back-end architecture and interface abstractions, open-source contributors can readily extend it with connectors for new KV stores. This section uses MooncakeStoreConnector as an example to detail how to implement a new remote connector.
Related PRs
- https://github.com/LMCache/LMCache/pull/430
- https://github.com/LMCache/LMCache/pull/489
- https://github.com/LMCache/LMCache/pull/498
Creating the MooncakestoreConnector Class
MooncakestoreConnector needs to subclass the RemoteConnector class and implement all of its abstract methods.

Fig 6: Mooncake Connector Demonstration
Initialization and Configuration
The initialization process for MooncakeStoreConnector involves three key steps:
- Dependency Check: Verify if the Mooncake library is installed.
- Configuration Loading: Load configuration from the file specified by an environment variable.
- Storage Initialization: Set up the MooncakeDistributedStore instance.

Fig 7: Mooncake Connector Initialization Workflow
Import and construct a MooncakeDistributedStore instance using from mooncake.store import MooncakeDistributedStore. First, parse the Mooncake configuration information, then initialize MooncakeDistributedStore via MooncakeDistributedStore.setup. This completes all initialization tasks. Subsequent interactions with the Mooncake store are handled by the provided MooncakeDistributedStore.
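The first two initialization steps can be sketched with the standard library alone. In the sketch below, MooncakeConfigSketch, mooncake_available, and load_mooncake_config are hypothetical names; the field names are taken from the example mooncake.json shown later in this post, and MOONCAKE_CONFIG_PATH is the environment variable used in the test setup. The third step, the call to MooncakeDistributedStore.setup, is left as a comment because its arguments are not covered here.

```python
import importlib.util
import json
import os
from dataclasses import dataclass, fields


@dataclass
class MooncakeConfigSketch:
    # Field names mirror the example mooncake.json in this post.
    local_hostname: str
    metadata_server: str
    protocol: str
    device_name: str
    master_server_address: str


def mooncake_available() -> bool:
    """Step 1: check whether the mooncake package can be imported."""
    return importlib.util.find_spec("mooncake") is not None


def load_mooncake_config(env_var: str = "MOONCAKE_CONFIG_PATH") -> MooncakeConfigSketch:
    """Step 2: load the JSON file named by the environment variable."""
    path = os.environ.get(env_var)
    if not path:
        raise ValueError(f"{env_var} is not set")
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)
    names = [field.name for field in fields(MooncakeConfigSketch)]
    missing = [name for name in names if name not in raw]
    if missing:
        raise ValueError(f"missing keys in {path}: {missing}")
    return MooncakeConfigSketch(**{name: raw[name] for name in names})


# Step 3 (not shown): construct MooncakeDistributedStore and call its
# setup method with values taken from the loaded configuration.
```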
Implementation Details
exists
The simplest implementation just needs to call the is_exists method of MooncakeDistributedStore.
put
The put operation first constructs Metadata, then sends the metadata and the kvbytes data content to the backend store. It’s worth noting that after the standard logic within put is completed, the reference count of memory_obj must be decremented to prevent leaks.
get
The get method retrieves data from the Mooncake store and constructs an LMCache-managed MemoryObj.
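The three operations can be illustrated end to end with a dict-backed stand-in for the store. Everything in this sketch is hypothetical: FakeStore and its contains/write/read methods do not represent the Mooncake API, FakeMemoryObj is a simplified substitute for LMCache’s MemoryObj, and the fixed-size header is just one possible metadata encoding.

```python
import struct
from typing import Dict, Optional, Tuple

# Illustrative metadata header: payload length, then a 2-D shape.
_HEADER = struct.Struct("<QII")


class FakeMemoryObj:
    """Simplified memory object with a manual reference count."""

    def __init__(self, data: bytes, shape: Tuple[int, int]) -> None:
        self.data = data
        self.shape = shape
        self.ref_count = 1

    def ref_count_down(self) -> None:
        self.ref_count -= 1


class FakeStore:
    """Dict-backed stand-in for a remote KV store."""

    def __init__(self) -> None:
        self._kv: Dict[str, bytes] = {}

    def contains(self, key: str) -> bool:
        return key in self._kv

    def write(self, key: str, value: bytes) -> None:
        self._kv[key] = value

    def read(self, key: str) -> Optional[bytes]:
        return self._kv.get(key)


class ConnectorSketch:
    def __init__(self, store: FakeStore) -> None:
        self.store = store

    def exists(self, key: str) -> bool:
        return self.store.contains(key)

    def put(self, key: str, memory_obj: FakeMemoryObj) -> None:
        try:
            # Metadata first, then the KV bytes, sent as one value.
            header = _HEADER.pack(len(memory_obj.data), *memory_obj.shape)
            self.store.write(key, header + memory_obj.data)
        finally:
            # Release the reference even if the write fails.
            memory_obj.ref_count_down()

    def get(self, key: str) -> Optional[FakeMemoryObj]:
        blob = self.store.read(key)
        if blob is None:
            return None
        length, rows, cols = _HEADER.unpack_from(blob)
        payload = blob[_HEADER.size:_HEADER.size + length]
        return FakeMemoryObj(payload, (rows, cols))
```

Placing the decrement in a finally block is a design choice of this sketch: it keeps the reference count balanced even when the store raises.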

Fig 8: Mooncake Connector Online Workflow
Testing Process
Start vLLM + LMCacheConnector + Mooncake:
- Start etcd:
# Start etcd
docker run -p 2379:2379 -p 2380:2380 --rm -e ALLOW_NONE_AUTHENTICATION=yes --name etcd bitnami/etcd
… (other dependencies if needed)
- Start mooncake_master:
# Start mooncake_master
/mooncake/mooncake-store/src/mooncake_master -v=1 -port=50051
- Start vLLM with LMCacheConnector and Mooncake Store as the remote backend:
# Start vllm with lmcache connector and specific mooncake store as remote backend of lmcache
VLLM_USE_V1=0 \
MOONCAKE_CONFIG_PATH=./mooncake.json \
LMCACHE_USE_EXPERIMENTAL=True LMCACHE_TRACK_USAGE=false \
LMCACHE_CHUNK_SIZE=16 LMCACHE_LOCAL_CPU=False LMCACHE_MAX_LOCAL_CPU_SIZE=5.0 \
LMCACHE_REMOTE_URL=mooncakestore://localhost:50051/ \
LMCACHE_REMOTE_SERDE="cachegen" \
vllm serve /disc/f/models/opt-125m/ \
--served-model-name "facebook/opt-125m" \
--enforce-eager \
--port 8000 \
--gpu-memory-utilization 0.8 \
--kv-transfer-config '{"kv_connector":"LMCacheConnector","kv_role":"kv_both","kv_parallel_size":2}' \
--trust-remote-code
- Example mooncake.json configuration:
{
  "local_hostname": "localhost",
  "metadata_server": "etcd://localhost:2379",
  "protocol": "tcp",
  "device_name": "",
  "master_server_address": "localhost:50051"
}
Testing
We now send the same curl request twice to verify the cache-miss and cache-hit scenarios.
root@docker-desktop:/vllm-workspace: curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
"model": "facebook/opt-125m",
"prompt": "San Francisco is a 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15",
"max_tokens": 10,
"temperature": 0
}'
{
"id": "cmpl-e9a422dd7e1b41afa58c2c666edd80a5",
"object": "text_completion",
"created": 1744729807,
"model": "facebook/opt-125m",
"choices": [
{
"index": 0,
"text": " 16 17 18 19 20 21 22 23 24 25",
"logprobs": null,
"finish_reason": "length",
"stop_reason": null,
"prompt_logprobs": null
}
],
"usage": {
"prompt_tokens": 20,
"total_tokens": 30,
"completion_tokens": 10,
"prompt_tokens_details": null
}
}
root@docker-desktop:/vllm-workspace: curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
"model": "facebook/opt-125m",
"prompt": "San Francisco is a 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15",
"max_tokens": 10,
"temperature": 0
}'
{
"id": "cmpl-132df463ebca49949906792ad629a9b3",
"object": "text_completion",
"created": 1744729824,
"model": "facebook/opt-125m",
"choices": [
{
"index": 0,
"text": " 16 17 18 19 20 21 22 23 24 25",
"logprobs": null,
"finish_reason": "length",
"stop_reason": null,
"prompt_logprobs": null
}
],
"usage": {
"prompt_tokens": 20,
"total_tokens": 30,
"completion_tokens": 10,
"prompt_tokens_details": null
}
}
Log Analysis
The Mooncake master logs are divided into four sections:
- Startup Logs: These indicate that the service has started successfully.
- Buffer Allocator Initialization: As vLLM starts, register_buffer requests arrive. These requests originate from the Mooncake store embedded within the vLLM inference engine process.
- Cold Run Test Request Logs: This represents the first run, where the cache is entirely missed. The logs show that the get_replica_list operation returns object_not_found, followed by logs for the put operation.
- Hot Run Test Request Logs: This represents the second run, where the cache is fully hit. The logs show only successful get_replica_list operations.
root@docker-desktop:/vllm-workspace# /opt/venv/lib/python3.12/site-packages/mooncake/mooncake_master -v=1
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0419 14:41:06.556012 12693 master.cpp:27] Master service started on port 50051, enable_gc=0, max_threads=4
I0419 14:41:06.559537 12693 master_service.cpp:77] action=gc_disabled
I0419 14:42:04.932854 12703 scoped_vlog_timer.h:42] MountSegment request: buffer=140097034911744, size=3355443200, segment_name=localhost:13503
I0419 14:42:04.933364 12703 allocator.cpp:24] initializing_buffer_allocator segment_name=localhost:13503 base_address=0x7f6ae2000000 size=3355443200
I0419 14:42:04.933933 12703 allocator.cpp:50] buffer_allocator_initialized pool_id=0
I0419 14:42:04.934000 12703 scoped_vlog_timer.h:77] MountSegment response: {"error_code":0}, latency=1163us
I0419 14:42:47.904599 12703 scoped_vlog_timer.h:42] GetReplicaList request: key=vllm@/disc/f/models/opt-125m/@1@0@3d29b85e7dc2d3eaa9b1a4f3ecbbb70e2b239d6b8d00ea82b88b1758b00ad754
I0419 14:42:47.906082 12703 master_service.cpp:116] key=vllm@/disc/f/models/opt-125m/@1@0@3d29b85e7dc2d3eaa9b1a4f3ecbbb70e2b239d6b8d00ea82b88b1758b00ad754, info=object_not_found
I0419 14:42:47.906311 12703 scoped_vlog_timer.h:77] GetReplicaList response: {"replica_list":[],"error_code":-704}, latency=2480us
I0419 14:42:48.147836 12703 scoped_vlog_timer.h:42] GetReplicaList request: key=vllm@/disc/f/models/opt-125m/@1@0@3d29b85e7dc2d3eaa9b1a4f3ecbbb70e2b239d6b8d00ea82b88b1758b00ad754
I0419 14:42:48.147917 12703 master_service.cpp:116] key=vllm@/disc/f/models/opt-125m/@1@0@3d29b85e7dc2d3eaa9b1a4f3ecbbb70e2b239d6b8d00ea82b88b1758b00ad754, info=object_not_found
I0419 14:42:48.147954 12703 scoped_vlog_timer.h:77] GetReplicaList response: {"replica_list":[],"error_code":-704}, latency=125us
I0419 14:42:48.149864 12703 scoped_vlog_timer.h:42] GetReplicaList request: key=vllm@/disc/f/models/opt-125m/@1@0@3d29b85e7dc2d3eaa9b1a4f3ecbbb70e2b239d6b8d00ea82b88b1758b00ad754
I0419 14:42:48.149912 12703 master_service.cpp:116] key=vllm@/disc/f/models/opt-125m/@1@0@3d29b85e7dc2d3eaa9b1a4f3ecbbb70e2b239d6b8d00ea82b88b1758b00ad754, info=object_not_found
I0419 14:42:48.149919 12703 scoped_vlog_timer.h:77] GetReplicaList response: {"replica_list":[],"error_code":-704}, latency=66us
I0419 14:42:49.235935 12703 scoped_vlog_timer.h:42] PutStart request: key=vllm@/disc/f/models/opt-125m/@1@0@3d29b85e7dc2d3eaa9b1a4f3ecbbb70e2b239d6b8d00ea82b88b1758b00ad754, value_length=1384015, slice_lengths=1
I0419 14:42:49.248529 12703 master_service.cpp:175] key=vllm@/disc/f/models/opt-125m/@1@0@3d29b85e7dc2d3eaa9b1a4f3ecbbb70e2b239d6b8d00ea82b88b1758b00ad754, value_length=1384015, slice_count=1, config=ReplicateConfig: { replica_num: 1 }, action=put_start_begin
I0419 14:42:49.261435 12703 allocator.cpp:75] allocation_succeeded size=1384015 segment=localhost:13503 address=0x7f6ae2000000
I0419 14:42:49.261528 12703 master_service.cpp:221] key=vllm@/disc/f/models/opt-125m/@1@0@3d29b85e7dc2d3eaa9b1a4f3ecbbb70e2b239d6b8d00ea82b88b1758b00ad754, replica_id=0, slice_index=0, handle=AllocatedBuffer: { segment_name: localhost:13503, size: 1384015, status: INIT, buffer_ptr: 0x7f6ae2000000 }, action=slice_allocated
I0419 14:42:49.261662 12703 scoped_vlog_timer.h:77] PutStart response: {"replica_list":[{"buffer_descriptors":[{"segment_name_":"localhost:13503","size_":1384015,"buffer_address_":140097034911744,"status_":0}],"status":2}],"error_code":0}, latency=26121us
I0419 14:42:49.512735 12703 scoped_vlog_timer.h:42] PutEnd request: key=vllm@/disc/f/models/opt-125m/@1@0@3d29b85e7dc2d3eaa9b1a4f3ecbbb70e2b239d6b8d00ea82b88b1758b00ad754
I0419 14:42:49.512790 12703 scoped_vlog_timer.h:77] PutEnd response: {"error_code":0}, latency=65us
I0419 14:42:49.514432 12703 scoped_vlog_timer.h:42] GetReplicaList request: key=vllm@/disc/f/models/opt-125m/@1@0@a7301c544c86fe97d973cb0e2a154e1d54aba116874e07c18df1efc6ffdddd60
I0419 14:42:49.515123 12703 master_service.cpp:116] key=vllm@/disc/f/models/opt-125m/@1@0@a7301c544c86fe97d973cb0e2a154e1d54aba116874e07c18df1efc6ffdddd60, info=object_not_found
I0419 14:42:49.515190 12703 scoped_vlog_timer.h:77] GetReplicaList response: {"replica_list":[],"error_code":-704}, latency=768us
I0419 14:42:49.541345 12703 scoped_vlog_timer.h:42] PutStart request: key=vllm@/disc/f/models/opt-125m/@1@0@a7301c544c86fe97d973cb0e2a154e1d54aba116874e07c18df1efc6ffdddd60, value_length=1315431, slice_lengths=1
I0419 14:42:49.541409 12703 master_service.cpp:175] key=vllm@/disc/f/models/opt-125m/@1@0@a7301c544c86fe97d973cb0e2a154e1d54aba116874e07c18df1efc6ffdddd60, value_length=1315431, slice_count=1, config=ReplicateConfig: { replica_num: 1 }, action=put_start_begin
I0419 14:42:49.541421 12703 allocator.cpp:75] allocation_succeeded size=1315431 segment=localhost:13503 address=0x7f6ae216fbb8
I0419 14:42:49.541425 12703 master_service.cpp:221] key=vllm@/disc/f/models/opt-125m/@1@0@a7301c544c86fe97d973cb0e2a154e1d54aba116874e07c18df1efc6ffdddd60, replica_id=0, slice_index=0, handle=AllocatedBuffer: { segment_name: localhost:13503, size: 1315431, status: INIT, buffer_ptr: 0x7f6ae216fbb8 }, action=slice_allocated
I0419 14:42:49.541462 12703 scoped_vlog_timer.h:77] PutStart response: {"replica_list":[{"buffer_descriptors":[{"segment_name_":"localhost:13503","size_":1315431,"buffer_address_":140097036417976,"status_":0}],"status":2}],"error_code":0}, latency=126us
I0419 14:42:49.559959 12703 scoped_vlog_timer.h:42] PutEnd request: key=vllm@/disc/f/models/opt-125m/@1@0@a7301c544c86fe97d973cb0e2a154e1d54aba116874e07c18df1efc6ffdddd60
I0419 14:42:49.560063 12703 scoped_vlog_timer.h:77] PutEnd response: {"error_code":0}, latency=212us
I0419 14:43:04.882916 12703 scoped_vlog_timer.h:42] GetReplicaList request: key=vllm@/disc/f/models/opt-125m/@1@0@3d29b85e7dc2d3eaa9b1a4f3ecbbb70e2b239d6b8d00ea82b88b1758b00ad754
I0419 14:43:04.883111 12703 scoped_vlog_timer.h:77] GetReplicaList response: {"replica_list":[{"buffer_descriptors":[{"segment_name_":"localhost:13503","size_":1384015,"buffer_address_":140097034911744,"status_":1}],"status":3}],"error_code":0}, latency=238us
I0419 14:43:05.049643 12703 scoped_vlog_timer.h:42] GetReplicaList request: key=vllm@/disc/f/models/opt-125m/@1@0@a7301c544c86fe97d973cb0e2a154e1d54aba116874e07c18df1efc6ffdddd60
I0419 14:43:05.049695 12703 scoped_vlog_timer.h:77] GetReplicaList response: {"replica_list":[{"buffer_descriptors":[{"segment_name_":"localhost:13503","size_":1315431,"buffer_address_":140097036417976,"status_":1}],"status":3}],"error_code":0}, latency=59us
I0419 14:43:05.105633 12703 scoped_vlog_timer.h:42] GetReplicaList request: key=vllm@/disc/f/models/opt-125m/@1@0@3d29b85e7dc2d3eaa9b1a4f3ecbbb70e2b239d6b8d00ea82b88b1758b00ad754
I0419 14:43:05.105705 12703 scoped_vlog_timer.h:77] GetReplicaList response: {"replica_list":[{"buffer_descriptors":[{"segment_name_":"localhost:13503","size_":1384015,"buffer_address_":140097034911744,"status_":1}],"status":3}],"error_code":0}, latency=80us
I0419 14:43:05.106576 12703 scoped_vlog_timer.h:42] GetReplicaList request: key=vllm@/disc/f/models/opt-125m/@1@0@a7301c544c86fe97d973cb0e2a154e1d54aba116874e07c18df1efc6ffdddd60
I0419 14:43:05.106652 12703 scoped_vlog_timer.h:77] GetReplicaList response: {"replica_list":[{"buffer_descriptors":[{"segment_name_":"localhost:13503","size_":1315431,"buffer_address_":140097036417976,"status_":1}],"status":3}],"error_code":0}, latency=86us
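To check the cold-run and hot-run pattern without reading every line, the GetReplicaList responses can be tallied with a short script. This is a sketch that assumes the log layout shown above; summarize_get_replica_list is a hypothetical helper, and it treats error_code 0 as a hit and any other code (such as -704, which accompanies object_not_found above) as a miss.

```python
import json
import re
from collections import Counter
from typing import Iterable

# Captures the JSON body of each GetReplicaList response line.
_RESPONSE_RE = re.compile(r"GetReplicaList response: (\{.*\}), latency=(\d+)us")


def summarize_get_replica_list(lines: Iterable[str]) -> Counter:
    """Count GetReplicaList responses as hits (error_code 0) or misses."""
    counts: Counter = Counter()
    for line in lines:
        match = _RESPONSE_RE.search(line)
        if not match:
            continue  # skip requests, PutStart/PutEnd, and other lines
        body = json.loads(match.group(1))
        counts["hit" if body.get("error_code") == 0 else "miss"] += 1
    return counts
```

Applied to the excerpt above, this counts four misses (the cold run) and four hits (the hot run).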
Tencent’s Other Contributions to different Remote Backends
- Added device parsing to the remote URL parser; Infinistore now supports device configuration via URL.
- Abstracted and refactored the Redis connector to reduce code conflicts between sentinel and standalone modes.
- Redis, Valkey, MooncakeStore, etc., now support both naive and CacheGen serialization/deserialization (serde) methods.
Future Optimization Directions
- Adopt a more flexible framework for managing connector extensions to avoid modifying the CreateConnector method for each new connector.
- During testing, LMCache checks for a key’s existence multiple times. To reduce interaction with the remote store, cache the existence status of keys for a period.
- Enhance observability by adding metrics to track memory allocator allocate and free operations for performance analysis.
- Enable CacheGen Serde support for MLA.
- Adapt LMCache for other inference engines like SGLang.
- More ideas… (Feel free to open an RFC!)
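As an illustration of the existence-caching idea in the list above, the sketch below wraps an existence check with a time-to-live cache. ExistenceCache is a hypothetical name and the design is only one possible approach, not a planned LMCache implementation.

```python
import time
from typing import Callable, Dict, Tuple


class ExistenceCache:
    """Remembers the result of an existence check for ttl_seconds."""

    def __init__(self, check: Callable[[str], bool],
                 ttl_seconds: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._check = check  # the real (remote) existence check
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries: Dict[str, Tuple[bool, float]] = {}

    def exists(self, key: str) -> bool:
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0]  # still fresh, no remote call needed
        result = self._check(key)
        self._entries[key] = (result, now)
        return result

    def invalidate(self, key: str) -> None:
        # Call after a put so a cached negative result does not linger.
        self._entries.pop(key, None)
```

Because negative results are cached as well, a connector using this approach would likely want to call invalidate after each successful put; otherwise a freshly stored key could be reported as missing until the entry expires.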
Links
- LMCache Github: https://github.com/LMCache/LMCache
- Chat with the Developers Interest Form
- LMCache slack
- vLLM Production-Stack channel