Initiated and Officially Supported by Tensormesh

When Open Source Meets Open Source: A Joint Effort Between LMCache and Mooncake

By Baolong and LMCache Team

A collaboration story about LMCache multiprocess mode + MooncakeStore: from 0 to 1, from functional to optimized.


1. Before We Begin

Recently, the LMCache community and the Mooncake community carried out a series of valuable open-source collaborations around the Mooncake Store L2 adapter for LMCache MP (multiprocess) mode.

The main contributors include:

  • maobaolong (LMCache community maintainer): Built the initial integration from scratch, connecting LMCache MP mode -> MooncakeStore for the first time.
  • fangchizheng (fcczzz) (from the Mooncake open-source community): Contributed a series of key optimizations, including MooncakeStore RDMA pre-registration, batch operation support, and per-op worker pools.
  • chunxiaozheng (LMCache community maintainer): Provided critical review feedback in each PR and helped ensure code quality.

This was not a one-sided product delivery from a single company. It was a true joint development effort between two open-source communities: one side brought the framework and real-world usage scenarios, while the other brought the underlying engine and performance optimizations.

In this blog, we will look back on this collaboration and walk through the technical results that eventually landed.


2. Meet LMCache and Mooncake Store

2.1 What is LMCache?

LMCache is a middleware layer in the vLLM ecosystem focused on KV cache reuse and persistence. Its core goal is to “save” the KV cache generated during LLM inference, so prompts with the same prefix do not need to recompute the prefill stage. Instead, they can directly reuse historical KV cache, significantly reducing TTFT (Time-To-First-Token) and compute cost.

LMCache provides two storage layers:

  • L1: A KV cache pool in CPU shared memory — fast but volatile.
  • L2: A persistent secondary storage layer, such as NVMe, file systems, distributed KV stores, or object storage, connected through different L2 adapters.
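
To make the prefix-reuse idea concrete, here is a minimal sketch. It assumes KV cache is stored in fixed-size token chunks and that each chunk's key is a hash of the entire prefix up to that chunk; the chunk size, the hashing scheme, and the function names are illustrative assumptions, not LMCache's actual implementation.

```python
import hashlib

CHUNK_TOKENS = 4  # hypothetical chunk size, kept tiny for illustration


def chunk_keys(token_ids):
    """Return one key per full chunk; each key covers the whole prefix so far."""
    keys = []
    running = hashlib.sha256()
    full_length = len(token_ids) - len(token_ids) % CHUNK_TOKENS
    for start in range(0, full_length, CHUNK_TOKENS):
        chunk = token_ids[start:start + CHUNK_TOKENS]
        # Feeding every chunk into the same hash makes the key prefix-dependent.
        running.update(",".join(map(str, chunk)).encode() + b";")
        keys.append(running.hexdigest())
    return keys


def reusable_prefix_tokens(token_ids, cached_keys):
    """Count the leading tokens whose KV cache is already stored."""
    hits = 0
    for key in chunk_keys(token_ids):
        if key not in cached_keys:
            break  # reuse stops at the first missing chunk
        hits += 1
    return hits * CHUNK_TOKENS
```

With this scheme, a new prompt that extends a cached 8-token prompt can skip prefill for those 8 tokens, while a prompt that differs in its first token reuses nothing.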

2.2 What is Mooncake Store?

Mooncake is an open-source infrastructure project designed for distributed large-model inference. Within Mooncake, Mooncake Store serves as a distributed KV cache storage engine. It natively supports RDMA, zero-copy transfer, and multi-replica metadata management, and is positioned as a high-performance storage foundation for PD disaggregation/KV-centric inference architectures.

A simple way to summarize their roles:

| Project | Role |
|---|---|
| LMCache | The manager and scheduler of KV cache: deciding what to store, what to retrieve, and which layer to place it in. |
| Mooncake Store | The high-performance remote storage engine for KV cache: deciding how to store and retrieve it quickly and reliably. |

3. LMCache MP Mode: Overall Architecture

Before diving into the details of this collaboration, let’s first take a brief look at the overall architecture of LMCache MP mode.

In MP (multiprocess) mode, LMCache is no longer loaded in-process by vLLM as a library. Instead, it runs as an independent process and communicates with vLLM workers through RPC over ZMQ, which fully separates KV cache management from the vLLM main process. This design brings three major benefits:

  1. Process isolation: Failures in LMCache do not affect the main vLLM process.
  2. Resource decoupling: CPU, memory, and I/O resources can be allocated to LMCache independently.
  3. Cross-instance sharing: Multiple vLLM instances can share the same LMCache service, forming a machine-level or even cluster-level KV cache pool.

3.1 Architecture Diagram

[Figure: Overall architecture of LMCache in MP mode, showing the vLLM worker processes, the LMCache MP server, the ZMQ RPC layer, the store/lookup/retrieve functions, L1 and L2 management, and the external storage backends.]

A few key points:
1. The L1 Manager maintains a CPU shared-memory pool. All vLLM workers read from and write to the same L1 through RPC.
2. The StoreController asynchronously copies newly generated KV chunks from L1 to L2 in the background.
3. When LOOKUP hits in L2, the PrefetchController asynchronously pulls the data from L2 back into L1. vLLM can then fetch it through a RETRIEVE RPC.
4. The L2 Adapter layer is fully pluggable. Each storage backend is implemented as an independent adapter and integrated through the unified L2AdapterInterface. The mooncake_store adapter discussed in this blog is a new member of this layer.
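
The store, lookup, and retrieve flow described in the key points above can be sketched with a toy two-tier cache. This is a simplified, synchronous model: the class and method names are hypothetical, plain dictionaries stand in for the shared-memory pool and the L2 backend, and the real StoreController and PrefetchController run asynchronously.

```python
class TinyTwoTierCache:
    """Toy model of the L1/L2 flow; not the LMCache implementation."""

    def __init__(self):
        self.l1 = {}       # stand-in for the CPU shared-memory pool
        self.l2 = {}       # stand-in for a pluggable L2 backend
        self.pending = []  # keys waiting to be copied from L1 to L2

    def store(self, key, value):
        """STORE: write to L1 and queue the key for write-back to L2."""
        self.l1[key] = value
        self.pending.append(key)

    def flush_to_l2(self):
        """Stand-in for the background StoreController."""
        while self.pending:
            key = self.pending.pop(0)
            if key in self.l1:
                self.l2[key] = self.l1[key]

    def evict_l1(self, key):
        """Drop a key from L1, as an eviction policy eventually would."""
        self.l1.pop(key, None)

    def lookup(self, key):
        """LOOKUP: report where the key was found; an L2 hit is pulled into L1."""
        if key in self.l1:
            return "l1"
        if key in self.l2:
            self.l1[key] = self.l2[key]  # stand-in for the PrefetchController
            return "l2"
        return None

    def retrieve(self, key):
        """RETRIEVE: serve data from L1 only."""
        return self.l1.get(key)
```

The point of the sketch is the ordering: data evicted from L1 is still recoverable, because a lookup that hits L2 brings it back into L1 before the retrieve.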


4. Before the Collaboration: The Foundation LMCache Had Already Built

The smooth collaboration between LMCache and Mooncake would not have been possible without the infrastructure the LMCache community had already put in place. In simple terms, there were three essential pieces.

4.1 Piece One: The Native Connector Framework (#2642)

[Southbound]: Create a Native Protocol for MP and non-MP
— by Samuel Shen from TensorMesh

In the early days, LMCache only had one C++ connector: Redis. Its code lived under csrc/redis/ and was tightly coupled with the RESP protocol. This PR introduced a much cleaner abstraction and refactored the connector layer from the ground up:

  • Promoted csrc/redis into a more general csrc/storage_backends/ structure.
  • Introduced the IStorageConnector interface and the ConnectorBase<T> CRTP template.
  • Added a set of pybind macros in connector_pybind_utils.h, allowing new backends to complete Python bindings with only a few dozen lines of code.
  • Introduced NativeConnectorL2Adapter on the Python side as a general bridge between any native connector and the L2AdapterInterface.
  • Supported both MP and non-MP modes, allowing the same C++ code to be reused across both scenarios.

In short, LMCache created a standard slot for all future native storage backends.

4.2 Piece Two: The First Native FS Connector (#2779)

Introduce native fs connector
— by maobaolong

After the native connector framework was abstracted, the first L2 native adapter to plug into it was the local file system:

  • Implemented a C++ file system connector under csrc/storage_backends/fs/.
  • It served both as a usable feature and as a minimal working example. In environments without dependencies such as NIXL or Mooncake, the FS adapter is enough to run through the entire L2 path. For future contributors, it also provides a reference implementation for adding new backends. In addition, some distributed file systems can be integrated with LMCache through NFS or mounted file systems together with this FS adapter.

4.3 Piece Three: Dynamic Loading for Third-Party Native Connectors (#2851)

[MP] Refactor l2 plugin framework to support dynamic load third-party native l2 connector
— by maobaolong

The final piece, and also the most important one, was enabling third-party connectors to be loaded without being merged into the LMCache main repository.

This PR:

  • Refactored the L2 plugin framework into a registry-based and dynamically discovered model.
  • Provided a complete external plugin package template, lmc_external_native_connector/, so external users can package their own C++ backend as an independent pip package. Once installed, it can be automatically discovered and loaded by LMCache.
  • Added the corresponding design documentation in docs/design/l2_adapters/plugin.md.

Before this refactor, adding a new backend meant modifying the factory function’s if/elif chain, updating registries, and changing imports. After the refactor, extension became truly zero-modification — the way an open-source plugin system should work.
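
The difference between an if/elif factory and a registry can be shown in a few lines. The sketch below is a generic registry pattern with hypothetical names (`register_l2_adapter`, `create_l2_adapter`, `demo_memory`); how LMCache actually discovers installed plugin packages is documented in the design document mentioned above and is not reproduced here.

```python
_ADAPTER_FACTORIES = {}  # maps an adapter type name to its factory function


def register_l2_adapter(type_name):
    """Decorator that adds a factory to the registry under type_name."""
    def decorator(factory):
        if type_name in _ADAPTER_FACTORIES:
            raise ValueError(f"adapter type already registered: {type_name!r}")
        _ADAPTER_FACTORIES[type_name] = factory
        return factory
    return decorator


def create_l2_adapter(config):
    """Look up the factory by config['type']; no if/elif chain to edit."""
    type_name = config["type"]
    try:
        factory = _ADAPTER_FACTORIES[type_name]
    except KeyError:
        raise ValueError(f"unknown L2 adapter type: {type_name!r}") from None
    return factory(config)


# A third-party package would only need to run a registration like this one.
@register_l2_adapter("demo_memory")
def _make_demo_adapter(config):
    return {"backend": "demo_memory", "capacity": int(config.get("capacity", 0))}
```

Adding a backend here touches only the new backend's own module; the core lookup function never changes.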

With the foundation in place, it was time for Mooncake Store to take the stage.


5. The Main Story: Bringing the MooncakeStore L2 Adapter From Scratch

5.1 Step One: Connecting Mooncake to LMCache (#2911)

[MP] Introduce l2 mooncake adapter
— by maobaolong, 2026-04-03

This was the first shot in the collaboration. Building on the native connector framework described in the previous section, this PR added csrc/storage_backends/mooncake/:

  • C++ side: connector.cpp, connector.h, and pybind.cpp wrap the Mooncake client. Altogether, the implementation is only around 140 lines.
  • Python side: mooncake_store_l2_adapter.py provides MooncakeStoreL2AdapterConfig and the corresponding factory function.

A key design choice here is configuration pass-through. LMCache only handles the few configuration fields it needs to manage itself, such as num_workers. All Mooncake-specific options, including local_hostname, master_server_addr, protocol, rdma_devices, and others, are passed through as a single dictionary to Mooncake SDK’s setup_internal(ConfigDict). As a result, if Mooncake adds, removes, or changes its own configuration options later, LMCache does not need to update its adapter code.

This design choice reflects an important principle: do not reinterpret third-party configuration; pass it through as-is.
In open-source collaboration, this kind of boundary is valuable. LMCache does not try to absorb Mooncake’s internal configuration model or make assumptions about its lower-level details. Instead, it gives Mooncake the freedom to evolve its own interface while keeping the integration stable.
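
A minimal sketch of the pass-through idea follows. The set of adapter-owned keys and the helper name are assumptions for illustration; the only behaviour taken from the text is that LMCache keeps a few fields for itself and forwards everything else to Mooncake untouched as a string-to-string dictionary.

```python
# Assumption: the fields the adapter itself interprets. Everything else is opaque.
ADAPTER_OWNED_KEYS = {"type", "num_workers"}


def split_adapter_config(raw):
    """Split a raw adapter config into adapter-owned fields and pass-through options."""
    owned = {k: v for k, v in raw.items() if k in ADAPTER_OWNED_KEYS}
    # Pass-through values are stringified but never validated or renamed here.
    passthrough = {k: str(v) for k, v in raw.items() if k not in ADAPTER_OWNED_KEYS}
    return owned, passthrough
```

An option the adapter has never heard of, such as a hypothetical `some_future_option`, flows through to the storage engine without any change to the adapter code.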

The result: the full MP mode + Mooncake Store over TCP path was working end to end, including store, lookup, and load. The first version was not fast yet, but it worked.

5.2 Step Two: Mooncake Gives Back — RDMA Pre-Registration (#3018)

[Feat] Add RDMA L1 memory preregistration support for MooncakeStore L2 adapter
— by Chizheng Fang, 2026-04-23

After the first version landed, fcczzz from the Mooncake community stepped in. He identified a core issue: in Mooncake’s RDMA mode, LMCache’s L1 memory must be registered with the RDMA NIC in advance. Otherwise, every I/O operation would trigger page pinning, causing a major performance penalty.

This PR introduced:

  • A preregister_l1_memory path in the C++ MooncakeConnector.
  • One-time registration of LMCache’s L1 shared memory pointer with Mooncake’s transfer engine when the adapter is created.
  • A cleanup of the unnecessary owned_real_client_ complexity from the first version, simplifying the implementation to a single client instance.
  • A standalone preregister_l1_memory switch in MooncakeStoreL2AdapterConfig, along with corresponding tests.
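
Why one-time registration helps can be illustrated with a toy model that simply counts registrations. `FakeNic` and the other names are hypothetical, and the model ignores real RDMA details such as deregistration and registration caches; it only shows that registering the whole L1 pool once replaces a registration on every transfer.

```python
class FakeNic:
    """Toy NIC that tracks registered memory regions as (start, length) pairs."""

    def __init__(self):
        self.regions = []
        self.registrations = 0

    def register(self, start, length):
        self.regions.append((start, length))
        self.registrations += 1  # in real RDMA this is the expensive step

    def is_registered(self, start, length):
        return any(s <= start and start + length <= s + n for s, n in self.regions)


def transfer(nic, start, length):
    """Register on demand if the buffer is not covered, then 'transfer'."""
    if not nic.is_registered(start, length):
        nic.register(start, length)  # slow path hit on the I/O itself


def count_registrations(preregister, num_ops=100, chunk_bytes=4096):
    """Run num_ops transfers over distinct chunks of one contiguous pool."""
    nic = FakeNic()
    if preregister:
        nic.register(0, num_ops * chunk_bytes)  # whole pool, once, at setup
    for i in range(num_ops):
        transfer(nic, i * chunk_bytes, chunk_bytes)
    return nic.registrations
```

In this model, 100 transfers cost 100 registrations without pre-registration and exactly one with it.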

What made this PR especially meaningful was that fcczzz was not just reporting a bug as an external user. He read and understood LMCache’s L1 Manager, wrote patches across both the C++ and Python layers of LMCache, and even improved the organization of the native connector code along the way.

This is one of the most valuable forms of open-source collaboration: the boundary between “your project” and “our project” starts to disappear, and everyone becomes an owner of the community.

5.3 Step Three: Batch Operations (#3172)

[MP]: Add batch operations to Mooncake L2 adapter
— by Chizheng Fang, 2026-05-07

Once RDMA was working, the next step was to squeeze out more performance. fcczzz followed up with another important PR:

  • Implemented batch store, lookup, and delete in MooncakeConnector, upgrading the previous loop of single operations into true batch calls.
  • Fixed an edge case: Mooncake’s “key exists” error should be treated as a cache miss rather than thrown as an exception.
  • Isolated the RDMA integration tests to avoid competing with the default TCP adapter for the same segment.
  • Added batch_delete tests covering mixed existing and missing keys.

Why does batch operation matter?

A single LMCache store operation often involves hundreds or even thousands of KV chunks, each corresponding to a token block. Sending them through RPC one by one pays the per-request overhead once per chunk, which adds up to unacceptable latency. Batching pays that overhead once per call instead, so throughput rises substantially.
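
A back-of-the-envelope cost model shows the effect. The overhead and per-key figures below are made-up illustrative numbers, not measurements from Mooncake or LMCache.

```python
def total_latency_us(num_keys, per_call_overhead_us, per_key_cost_us, batch_size):
    """Total time when num_keys are sent in calls of at most batch_size keys."""
    num_calls = -(-num_keys // batch_size)  # ceiling division
    return num_calls * per_call_overhead_us + num_keys * per_key_cost_us


# Hypothetical figures: 1000 chunks, 50 us fixed cost per call, 5 us per key.
one_by_one = total_latency_us(1000, 50, 5, batch_size=1)
batched = total_latency_us(1000, 50, 5, batch_size=1000)
print(one_by_one, batched)
```

Under these assumed numbers the fixed overhead dominates the one-by-one case, which is the situation batching removes.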

5.4 Step Four: Per-Operation Worker Pools (#3227)

[MP]: Add per-operation dedicated worker pools to Mooncake L2 connector
— by Chizheng Fang

This improvement addresses a very concrete problem: under a shared worker pool, a burst of store requests can drag down lookup performance.

Why? Because lookup, retrieve, store, and delete have very different latency profiles. lookup is extremely fast but highly latency-sensitive. store is slower but more tolerant of delay. If all operations are placed into the same worker pool, then during a burst of store requests, a lookup that should take only a few dozen microseconds may be forced to wait behind store operations that take tens of milliseconds. The result is an exploding p99.

The solution in #3227:

  • Introduced WorkerPoolConfig in ConnectorBase, allowing separate workers to be allocated for the lookup, retrieve, store, and delete lanes.
  • Routed requests to the corresponding lane based on operation type. Unconfigured lanes fall back to the shared pool.
  • Exposed per_op_workers: dict[str, int] in the Python configuration, with full validation.
  • Implemented the lane mechanism in ConnectorBase, so all native connectors can benefit from it. Other backends only need to wire WorkerPoolConfig through their pybind layer.
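
The lane idea can be sketched in Python with one thread pool per operation type and a shared fallback. This is an analogy only: the real mechanism lives in the C++ ConnectorBase, and `LaneRouter` and `current_lane` are hypothetical names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

VALID_OPS = ("lookup", "retrieve", "store", "delete")


class LaneRouter:
    """Routes each operation type to its own pool, or to a shared pool."""

    def __init__(self, shared_workers, per_op_workers=None):
        per_op_workers = per_op_workers or {}
        for op, count in per_op_workers.items():
            if op not in VALID_OPS:
                raise ValueError(f"unknown operation: {op!r}")
            if not isinstance(count, int) or count < 1:
                raise ValueError(f"worker count for {op!r} must be a positive int")
        self._shared = ThreadPoolExecutor(
            max_workers=shared_workers, thread_name_prefix="shared")
        self._lanes = {
            op: ThreadPoolExecutor(max_workers=count, thread_name_prefix=op)
            for op, count in per_op_workers.items()
        }

    def submit(self, op, fn, *args):
        """Run fn on the lane for op; unconfigured ops use the shared pool."""
        if op not in VALID_OPS:
            raise ValueError(f"unknown operation: {op!r}")
        return self._lanes.get(op, self._shared).submit(fn, *args)

    def shutdown(self):
        for pool in [self._shared, *self._lanes.values()]:
            pool.shutdown(wait=True)


def current_lane():
    """Return the lane name encoded in the current worker thread's name."""
    return threading.current_thread().name.rsplit("_", 1)[0]
```

With `per_op_workers={"lookup": 1, "retrieve": 1, "store": 2}`, a lookup never queues behind a store, while delete, which has no lane of its own, still runs on the shared pool.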

Measured results with Mooncake RDMA, 1024 keys, 1 MiB values, and 4 stores per round. All numbers are in milliseconds:

| Configuration | Lookup avg | Lookup p99 | Load avg | Load p99 | Store avg | Store p99 |
|---|---|---|---|---|---|---|
| Shared 4 workers | 0.947 | 16.832 | 3.399 | 41.165 | 5.292 | 51.390 |
| Per-op workers (1/1/2) | 0.266 | 0.483 | 2.590 | 3.506 | 3.216 | 67.434 |

  • Lookup p99 dropped by ~35×, from 16.8 ms to 0.48 ms.
  • Load p99 dropped by ~12×, from 41.2 ms to 3.5 ms.
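
The quoted ratios can be checked directly from the p99 values in the measurements above. The snippet only redoes the arithmetic; the dictionary names are ours.

```python
# p99 latencies in milliseconds, copied from the measured results.
shared_pool = {"lookup": 16.832, "load": 41.165, "store": 51.390}
per_op_pools = {"lookup": 0.483, "load": 3.506, "store": 67.434}

# Ratio > 1 means the per-op configuration is faster at p99.
speedup = {op: shared_pool[op] / per_op_pools[op] for op in shared_pool}

for op, ratio in speedup.items():
    print(f"{op} p99: {ratio:.2f}x")
```

The lookup and load ratios round to 35 and 12. The store ratio is below 1, which reflects the higher Store p99 in the per-op row of the same measurements, while the Store average improved.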

This is not the kind of performance improvement that comes from simple parameter tuning. It is an architectural gain. Isolating workloads with different SLOs is one of the simplest and most effective principles in high-performance storage systems.


6. Looking Back at the Timeline

[Figure: Timeline of the collaboration milestones between LMCache and Mooncake, from March to May: Native Connector framework, Native FS Connector, Dynamic third-party plugin loading, MooncakeStore L2 adapter, RDMA L1 preregistration, Batch operations, and Per-Op Worker Pools.]

In just over a month, the integration went from working, to fast, and then to stable.

None of these steps was completed by one side alone. LMCache provided the framework and code review, while the Mooncake community brought in lower-level expertise and optimization ideas. Along the way, reviewers like chunxiaozheng patiently helped maintain code quality across every PR.


7. What This Collaboration Taught Us

7.1 Clear Boundaries Are a Prerequisite for Good Open-Source Collaboration

LMCache did not translate every Mooncake configuration option into its own dataclass. Instead, it chose to pass through a Dict[str, str].

At first glance, this may look like “taking a shortcut.” But in reality, it gives the collaborating project enough room to evolve. In open-source collaboration, the clearer the boundaries are, the easier it is for both sides to move forward.

7.2 Pluginization Is the Foundation for Sustainable Community Collaboration

Without the earlier native connector refactor and dynamic loading mechanism, the cost of this Mooncake integration would have been much higher. It might have required adding more if/elif branches to the LMCache main repository, modifying multiple factory functions, and dealing with one merge conflict after another. Zero-modification extension is a friction reducer for open-source collaboration. The project that gets this right first becomes much easier to work with.

7.3 Win-Win Is the Simplest Logic of Open Source

  • For LMCache, this brought in a production-grade RDMA-based distributed L2 backend, along with a series of optimizations that can benefit all native connectors, including per-op worker pools, batch operations, and the idea of L1 preregistration.
  • For Mooncake, this opened a path to vLLM users, allowing its RDMA capabilities to be used directly by LLM inference workloads.
  • For users, this provides a simple, ready-to-use LMCache MP + MooncakeStore setup that is both easy to run and high-performance.

That is the meaning of open source: do what you are good at, share it freely, and often end up gaining more in return.


8. A Small Request for Everyone Who Made It This Far

If you are working on:

  • large-model inference serving,
  • building an inference cluster with shared KV cache across instances,
  • performance-sensitive RDMA or distributed KV storage,
  • or simply curious about what a well-designed open-source plugin framework looks like,

we highly recommend trying out the LMCache MP Mode + Mooncake Store combination:

# 1. Enable the Mooncake extension at build time
BUILD_MOONCAKE=1 pip install -e . --verbose

# 2. Start the LMCache server
lmcache server --l1-size-gb 100 --eviction-policy LRU \
  --l2-adapter '{
    "type": "mooncake_store",
    "num_workers": 4,
    "per_op_workers": {"lookup": 1, "retrieve": 1, "store": 2},
    "preregister_l1_memory": true,
    "local_hostname": "node01",
    "metadata_server": "http://localhost:8080/metadata",
    "master_server_addr": "localhost:50051",
    "protocol": "rdma",
    "rdma_devices": "mlx5_0",
    "global_segment_size": "3221225472"
  }'
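
If you prefer to generate the `--l2-adapter` argument rather than hand-write the JSON, a small Python helper avoids quoting mistakes. This is a convenience sketch that reuses only the flags and values shown above; it builds and prints the command and does not run it.

```python
import json
import shlex

# Same adapter settings as the command above; adjust hosts and devices to your setup.
adapter = {
    "type": "mooncake_store",
    "num_workers": 4,
    "per_op_workers": {"lookup": 1, "retrieve": 1, "store": 2},
    "preregister_l1_memory": True,
    "local_hostname": "node01",
    "metadata_server": "http://localhost:8080/metadata",
    "master_server_addr": "localhost:50051",
    "protocol": "rdma",
    "rdma_devices": "mlx5_0",
    "global_segment_size": "3221225472",
}

command = [
    "lmcache", "server",
    "--l1-size-gb", "100",
    "--eviction-policy", "LRU",
    "--l2-adapter", json.dumps(adapter),
]

# shlex.join quotes the JSON so the line can be pasted into a shell.
print(shlex.join(command))
```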

The detailed benchmark results, configuration notes, and documentation links can be found in #3227 and docs/source/mp/l2_storage.rst.

Come support our communities: give both projects a star, open an issue, or submit a PR. Open-source collaboration is open to everyone — every contributor helps pave a wider path.
