Abstract
In large-scale language model inference scenarios, efficient memory management and KV cache optimization are crucial. LMCache, a KV cache management system designed specifically for vLLM, needs more flexible extension mechanisms to support monitoring, troubleshooting, and state inspection in complex production environments.
Rather than customizing the LMCache core directly, we introduced the LMCache Plugin Framework, a lightweight yet powerful plugin system that allows developers to run custom scripts within LMCache processes.
Based on this plugin framework, we implemented lmcache_frontend, a monitoring and proxy service that runs as a subprocess only on scheduler nodes. It provides a Web interface for cluster status visualization and implements request forwarding functionality through HTTP proxy services. This design not only facilitates deployment and management but also provides developers with an excellent plugin implementation example, demonstrating how to use the Plugin Framework to enhance system observability and control capabilities.
This article shows developers how to configure and use lmcache_frontend, and also explains how the LMCache Plugin Framework works so they can implement plugins of their own.
Overall Architecture
As shown in the diagram above, each LMCache runs within a vLLM process, and each LMCache starts an InternalApiServer that provides internal interaction through a socket path.
Specifically, the LMCache plugin launcher discovers that we have defined a plugin named scheduler_lmc_frontend_plugin.py, so it starts a subprocess in the scheduler role's LMCache. This subprocess acts as an interaction proxy between users (external systems) and the various LMCache services, hence we call it the LMCache frontend.
LMCache Plugin Framework
The LMCache Plugin Framework adopts a flexible design, loading and executing plugins dynamically through environment variable injection and naming conventions.
This means implementing, deploying, and activating a plugin requires no modification to any LMCache code.
Plugin Loading Process
Core Process Description
- Initialization Phase
- Start: LMCache is loaded and started by vLLM
- Init Framework: Initialize the plugin framework
- Scan Plugins: Scan plugin directory to discover executable files
- Plugin Eligibility Check
- Check Eligibility: Comprehensive check of plugin activation conditions:
- Role matching (scheduler/worker)
- Worker ID matching (if specified)
- File type support
- Execute Plugin: Eligible plugins are executed as subprocesses (a minimal plugin sketch follows this list)
- Skip: Ineligible plugins are skipped
- Lifecycle Management
- Monitor: Continuously monitor LMCache status
- Terminate: Terminate plugins when LMCache stops
Plugin Naming Rules and Execution Targets
The plugin filename determines its execution target (a parsing sketch follows the examples below). The format is:
<ROLE>_<WORKER_ID>_<DESCRIPTION>.<EXTENSION>
Examples:
| Example | Execution Target |
|---|---|
| scheduler_foo_plugin.py | Execute only on the scheduler |
| worker_0_test.sh | Execute only on worker 0 |
| all_plugin.sh | Run on all nodes |
Application Scenarios
The Plugin Framework supports various application scenarios:

| Application Scenario | Implementation Method and Benefits |
|---|---|
| Monitoring Metrics Reporting | Periodically collect and report metrics to monitoring systems, enabling centralized monitoring |
| Log Collection | Capture and forward logs to logging systems in real time, facilitating centralized log analysis and querying |
| Service Discovery | Send heartbeats to a service registry, supporting dynamic service discovery and operational information collection |
| Health Checks | Implement custom health check logic to enhance system reliability (see the sketch after this table) |
lmcache_frontend: A Typical Implementation of Plugin Framework
lmcache_frontend is a monitoring and proxy service that can run independently or as a subprocess only on scheduler nodes. It provides a Web interface for cluster status visualization and an HTTP proxy service that forwards requests to the internal_api_server of each LMCache scheduler and worker process, collects the responses, and renders them in the frontend for display.
Additionally, lmcache_frontend is a complete example developed based on the Plugin Framework, demonstrating how to use the plugin system to enhance LMCache functionality.
Architecture Design
lmcache_frontend adopts a lightweight design, running only as a subprocess of the scheduler node. It implements its functionality through:
- Web Monitoring Interface: Provides visual display of cluster status
- HTTP Proxy Service: Supports request forwarding to any cluster node (a minimal forwarding sketch follows this list)
- Plugin Integration: Seamless integration through scheduler_lmc_frontend_plugin.py
Core Features
Currently, lmcache_frontend implements the following core features:
| Function Category | Function Description |
|---|---|
| Cluster Monitoring | Web-based dashboard for visualizing cluster status |
| Request Proxy | HTTP proxy forwarding to cluster nodes |
| Node Management | Support for multiple node configuration methods |
| Metrics Monitoring | Aggregated Prometheus metrics via /metrics, or metrics of a specific sub-node via /proxy2/target_node_id/metrics (see the usage sketch after this table) |
| Thread Information | Real-time thread information, including thread status and call stacks, to help understand what a node is doing; a useful troubleshooting tool |
| Log Level | Dynamic log level setting, making it easy to change a node's log level at any time and simplifying troubleshooting; debug logs give insight into internal state |
| Configuration Management | Display configuration and dynamically update configuration |
| Service Discovery | Register the current node with a service discovery center via heartbeat, providing node management functionality |
| Internal Information Display (In Development) | Display important LMCache internal information, such as current in-flight request information (request token count, entry time, hit token count), current total MemoryObj count, evict count, total storage and usage, the backends list, and async load workload and completion information |
| Dynamic Script Execution (Best for Troubleshooting) | Dynamically submit Python scripts for the interpreter to execute. You can be creative and do anything you want, but you are responsible for your actions |
Frontend Interface Overview
Node Overview Page
This page displays an overview of the current node, including the node name, host, port, and LMCache version.
Metrics Page
This page displays the currently selected node's metrics, which help you understand how the cluster is operating.
Threads Page
This page displays the thread information of the currently selected node. It shows which operations are currently executing inside the node and helps reveal time-consuming operations and whether critical threads are stuck waiting.
Log Level Page
This page is for viewing and adjusting the current node's log levels. The right pane lists all current logger levels; in the left pane, fill in the logger name to set and the new level. For example, after setting it to DEBUG as shown above and clicking the “set level” button, a prompt box pops up indicating that the setting has taken effect.
Config Page
This page displays the current node's configuration so you can verify that the effective configuration matches expectations. Dynamic configuration updates are also supported, but this feature is still being implemented.
Meta Page
This page displays the metadata of the CacheEngine in the current LMCache, which reflects the key state of the current LMCache instance.
Inference Page
The Inference page displays information about the inference engine, such as the vLLM version, vLLM configuration, model configuration, and cache configuration shown in the figure. This lets us see inference engine information alongside LMCache information.
Service Registration and Service Discovery Features
LMCache is often deployed in containers, and operations teams usually do not allow every inference engine node to expose a web port. So even though the plugin framework starts an LMCache frontend process on each inference engine node, that process can only act as a proxy for other processes on the same node; it cannot serve a web page displaying internal information to outside users.
Therefore, we can configure each LMCache frontend to send heartbeats to the LMCache Discovery Service and register its service information. We can then start an LMCache United Frontend process (which is actually just another LMCache frontend) configured to obtain all registered LMCache frontend nodes from the LMCache Discovery Service, and thereby control the entire cluster (a heartbeat sketch follows).
Extensibility and Future Plans
lmcache_frontend is designed to be extensible. The project has been contributed to the LMCache organization on GitHub as part of the LMCache ecosystem, and community contributors are welcome to build it together. Future enhancements may include:
- Authentication & Authorization: Add role-based access control
- Advanced Monitoring: Integrate Prometheus metrics export
- Alert Integration: Support integration with common alerting systems
- Performance Analysis: Built-in performance profiling and bottleneck detection
- Service Discovery: Report services to registry center
Summary and Reflections
Through implementing the LMCache Plugin Framework and lmcache_frontend, we gained an important insight: when handling scenario-specific requirements in open source projects, functional abstraction and general-purpose design are crucial.
The Plugin Framework succeeds because it does not implement each customization requirement directly; instead, it provides a flexible extension mechanism. This design philosophy enables:
- Separation of Concerns: Core functionality is decoupled from extension functionality, keeping the core system clean
- Ecosystem Building: Encourages community contribution of various plugins, forming a rich ecosystem
- Easy Maintenance: Core system remains stable while plugins can develop and update independently
Conversely, if we implemented every company's customization requirements directly in the open source project, it would quickly grow into a hard-to-maintain “hodgepodge”.
Through the Plugin Framework, LMCache strikes a balance: satisfying diverse requirements while keeping the project maintainable and extensible. The same design pattern appears in the LMCache Remote External Connector framework and the LMCache External Backend framework, and is worth applying more broadly in future development.
By defining clear extension interfaces and specifications, we enable the community to meet specific requirements without modifying core code, thus achieving long-term healthy development of the project.
We hope that as LMCache grows more powerful, it continues to develop in this healthy way.