Abstract
In large-scale language model inference scenarios, efficient memory management and KV cache optimization are crucial. LMCache, a KV cache management system designed specifically for vLLM, needs more flexible extension mechanisms to support monitoring, troubleshooting, and state inspection in complex production environments.
Rather than customizing the LMCache core directly, we introduced the LMCache Plugin Framework, a lightweight yet powerful plugin system that allows developers to run custom scripts within LMCache processes.
Based on this plugin framework, we implemented lmcache_frontend, a monitoring and proxy service that runs as a subprocess only on scheduler nodes. It provides a Web interface for cluster status visualization and implements request forwarding functionality through HTTP proxy services. This design not only facilitates deployment and management but also provides developers with an excellent plugin implementation example, demonstrating how to use the Plugin Framework to enhance system observability and control capabilities.
This article shows developers how to configure and use lmcache_frontend, and also explains how the LMCache Plugin Framework works so they can implement plugins of their own.
Overall Architecture
As shown in the diagram above, each LMCache runs within a vLLM process, and each LMCache starts an InternalApiServer that provides internal interaction through a socket path.
Specifically, the LMCache plugin launcher discovers that we have defined a plugin named scheduler_lmc_frontend_plugin.py, so it starts a subprocess in the scheduler role's LMCache. This subprocess acts as an interaction proxy between users (external systems) and the various LMCache services, hence we call it the LMCache frontend.
LMCache Plugin Framework
The LMCache Plugin Framework adopts a flexible design, loading and executing plugins dynamically through environment variable injection and naming conventions.
This means implementing, deploying, and activating a plugin requires no modification to any LMCache code.
Plugin Loading Process
Core Process Description
- Initialization Phase
- Start: LMCache is loaded and started by vLLM
- Init Framework: Initialize the plugin framework
- Scan Plugins: Scan plugin directory to discover executable files
- Plugin Eligibility Check
- Check Eligibility: Comprehensive check of plugin activation conditions:
- Role matching (scheduler/worker)
- Worker ID matching (if specified)
- File type support
- Execute Plugin: Eligible plugins are executed as subprocesses (a minimal plugin sketch follows this list)
- Skip: Ineligible plugins are skipped
- Lifecycle Management
- Monitor: Continuously monitor LMCache status
- Terminate: Terminate plugins when LMCache stops
Plugin Naming Rules and Execution Targets
The plugin filename determines its execution target (a parsing sketch follows the examples below). The format is:
<ROLE>_<WORKER_ID>_<DESCRIPTION>.<EXTENSION>
Examples:
| Example | Execution Target |
|---|---|
| scheduler_foo_plugin.py | Execute only on the scheduler |
| worker_0_test.sh | Execute only on worker 0 |
| all_plugin.sh | Run on all nodes |
Application Scenarios
The Plugin Framework supports various application scenarios:

| Application Scenario | Implementation Method and Benefits |
|---|---|
| Monitoring Metrics Reporting | Periodically collect and report metrics to monitoring systems, enabling centralized monitoring |
| Log Collection | Capture and forward logs to logging systems in real time, facilitating centralized log analysis and querying |
| Service Discovery | Send heartbeats to a service registry, supporting dynamic service discovery and operational information collection |
| Health Checks | Implement custom health check logic to enhance system reliability (see the sketch after this table) |
lmcache_frontend: A Typical Implementation of Plugin Framework
lmcache_frontend is a monitoring and proxy service that can run independently or as a subprocess only on scheduler nodes. It provides a Web interface for cluster status visualization and an HTTP proxy service that forwards requests to the internal_api_server of each LMCache scheduler and worker process, collects the responses, and renders them in the frontend for display.
Additionally, lmcache_frontend is a complete example developed based on the Plugin Framework, demonstrating how to use the plugin system to enhance LMCache functionality.
Architecture Design
lmcache_frontend adopts a lightweight design, running only as a subprocess of the scheduler node. It implements its functionality through:
- Web Monitoring Interface: Provides visual display of cluster status
- HTTP Proxy Service: Supports request forwarding to any cluster node (a minimal forwarding sketch follows this list)
- Plugin Integration: Seamless integration through scheduler_lmc_frontend_plugin.py
Core Features
Currently, lmcache_frontend implements the following core features:
| Function Category | Function Description |
|---|---|
| Cluster Monitoring | Web-based dashboard for visualizing cluster status |
| Request Proxy | HTTP proxy forwarding to cluster nodes |
| Node Management | Support for multiple node configuration methods |
| Metrics Monitoring | Aggregated Prometheus metrics via /metrics, or metrics of a specific sub-node via /proxy2/target_node_id/metrics (see the usage sketch after this table) |
| Thread Information | Real-time thread information, including thread status and call stacks, to help understand what a node is doing; a useful troubleshooting tool |
| Log Level | Dynamic log level setting, making it easy to change a node's log level at any time and simplifying troubleshooting; debug logs give insight into internal state |
| Configuration Management | Display configuration and dynamically update configuration |
| Service Discovery | Register the current node with a service discovery center via heartbeat, providing node management functionality |
| Internal Information Display (In Development) | Display important LMCache internal information, such as current in-flight request information (request token count, entry time, hit token count), current total MemoryObj count, evict count, total storage and usage, the backends list, and async load workload and completion information |
| Dynamic Script Execution (Best for Troubleshooting) | Dynamically submit Python scripts for the interpreter to execute. You can be creative and do anything you want, but you are responsible for your actions |
Frontend Interface Overview
Node Overview Page
This page displays an overview of the current node, including the node name, host, port, and LMCache version.
Metrics Page
This page displays the currently selected node's metrics, which help you understand how the cluster is operating.
Threads Page
This page displays the thread information of the currently selected node. It shows which operations are currently executing inside the node and helps reveal time-consuming operations and whether critical threads are stuck waiting.
Log Level Page
This page is for viewing and adjusting the current node's log levels. The right pane lists all current logger levels; in the left pane, fill in the logger name to set and the new level. For example, after setting it to DEBUG as shown above and clicking the “set level” button, a prompt box pops up indicating that the setting has taken effect.
Config Page
This page displays the current node's configuration so you can verify that the effective configuration matches expectations. Dynamic configuration updates are also supported, but this feature is still being implemented.
Meta Page
This page displays the metadata of the CacheEngine in the current LMCache, which reflects the key state of the current LMCache instance.
Inference Page
The Inference page displays information about the inference engine, such as the vLLM version, vLLM configuration, model configuration, and cache configuration shown in the figure. This lets us see inference engine information alongside LMCache information.
Service Registration and Service Discovery Features
LMCache is often deployed in containers, and operations teams usually do not allow every inference engine node to expose a web port. So even though the plugin framework starts an LMCache frontend process on each inference engine node, that process can only act as a proxy for other processes on the same node; it cannot serve a web page displaying internal information to outside users.
Therefore, we can configure each LMCache frontend to send heartbeats to the LMCache Discovery Service and register its service information. We can then start an LMCache United Frontend process (which is actually just another LMCache frontend) configured to obtain all registered LMCache frontend nodes from the LMCache Discovery Service, and thereby control the entire cluster (a heartbeat sketch follows).
Extensibility and Future Plans
lmcache_frontend is designed to be extensible. The project has been contributed to the LMCache organization on GitHub as part of the LMCache ecosystem, and community contributors are welcome to build it together. Future enhancements may include:
- Authentication & Authorization: Add role-based access control
- Advanced Monitoring: Integrate Prometheus metrics export
- Alert Integration: Support integration with common alerting systems
- Performance Analysis: Built-in performance profiling and bottleneck detection
- Service Discovery: Report services to registry center
Summary and Reflections
Through implementing the LMCache Plugin Framework and lmcache_frontend, we gained an important insight: when handling scenario-specific requirements in open source projects, functional abstraction and general-purpose design are crucial.
The Plugin Framework succeeds because it does not implement each customization requirement directly; instead, it provides a flexible extension mechanism. This design philosophy enables:
- Separation of Concerns: Core functionality is decoupled from extension functionality, keeping the core system clean
- Ecosystem Building: Encourages community contribution of various plugins, forming a rich ecosystem
- Easy Maintenance: Core system remains stable while plugins can develop and update independently
Conversely, if we implemented every company's customization requirements directly in the open source project, it would quickly grow into a hard-to-maintain “hodgepodge”.
Through the Plugin Framework, LMCache strikes a balance: satisfying diverse requirements while keeping the project maintainable and extensible. The same design pattern appears in the LMCache Remote External Connector framework and the LMCache External Backend framework, and is worth applying more broadly in future development.
By defining clear extension interfaces and specifications, we enable the community to meet specific requirements without modifying core code, thus achieving long-term healthy development of the project.
We hope that as LMCache grows more powerful, it continues to develop in this healthy way.