
Supporting Ascend NPUs
We’re delighted to announce that LMCache now officially supports Ascend NPUs with the release of the LMCache-Ascend plugin. LMCache-Ascend supports a broad range of Ascend compute platforms from the cloud to the edge. This major platform expansion underscores LMCache’s commitment to delivering leading performance across a diverse hardware ecosystem, enabling developers to deploy high-performance Large Language Model (LLM) services anywhere with minimal code changes.
Empowering Ascend Cloud and Edge Deployments
This integration delivers immediate benefits to Ascend Cloud users and to teams deploying Atlas servers or other edge devices:
- Ascend Cloud Users: Enterprises running LLM services on Ascend Cloud, such as Q&A, content generation, or code completion, can now deploy LMCache-Ascend with minimal effort. Their services benefit not only from the native compute power of Ascend NPUs but also from the cache's ability to reduce latency and serve more requests per second per dollar.
- Edge & On-Premises Deployments: For Atlas 200 modules, Atlas 300 cards, and Atlas 800 servers, LMCache-Ascend enables larger models in resource-constrained edge environments. By reducing steady compute demand, it lowers power consumption and response times, enabling real-time inference for scenarios such as autonomous driving, industrial inspection, and robotics.
How the Plugin Works
Under the hood, LMCache-Ascend uses runtime monkey-patching to take over key subsystems seamlessly, dynamically swapping out core LMCache components with two critical updates. First, it replaces the standard PyTorch/C++ operation bindings with versions that call into dedicated Ascend NPU kernels via the torch_npu backend and the Ascend Computing Language (ACL) APIs, unlocking native hardware performance. Second, and just as crucially, it adopts a memory management system tailored to the Ascend architecture, ensuring data is allocated and handled in the most efficient way possible for the NPU. This design delivers deep hardware-specific optimizations while maintaining 100% API compatibility and requiring zero changes to existing LMCache code.
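To make the mechanism concrete, here is a minimal sketch of the patching idea. It is illustrative only: the names `c_ops`, `copy_kv_blocks`, and `alloc_device_buffer` are hypothetical stand-ins rather than the actual LMCache-Ascend symbols, and a plain tensor copy stands in for a real ACL kernel.

```python
# Minimal sketch of runtime monkey-patching for an NPU backend.
# All names below (c_ops, copy_kv_blocks, alloc_device_buffer) are
# hypothetical stand-ins for LMCache's real compiled-op bindings.
from types import SimpleNamespace

import torch

try:
    import torch_npu  # noqa: F401  # registers the "npu" device with PyTorch
    HAS_NPU = torch.npu.is_available()
except ImportError:
    HAS_NPU = False

# Stand-in for the module whose CUDA/C++ bindings would be patched.
c_ops = SimpleNamespace(
    copy_kv_blocks=lambda src, dst: dst.copy_(src),            # original path
    alloc_device_buffer=lambda n: torch.empty(n, dtype=torch.uint8),
)


def npu_copy_kv_blocks(src: torch.Tensor, dst: torch.Tensor) -> None:
    # A real plugin would dispatch to a dedicated ACL kernel via torch_npu;
    # a device-to-device copy stands in for it here.
    dst.copy_(src, non_blocking=True)


def npu_alloc_device_buffer(nbytes: int) -> torch.Tensor:
    # Stand-in for the bespoke Ascend memory manager: allocate directly
    # on the NPU so data lives where the kernels expect it.
    return torch.empty(nbytes, dtype=torch.uint8, device="npu")


def apply_ascend_patches() -> None:
    """Swap the bindings in place; callers never notice the change."""
    if not HAS_NPU:
        return
    c_ops.copy_kv_blocks = npu_copy_kv_blocks
    c_ops.alloc_device_buffer = npu_alloc_device_buffer


apply_ascend_patches()  # importing the plugin would trigger this step
```

Because existing code reaches these operations through the patched module attributes, it transparently picks up the NPU implementations, which is how this style of plugin preserves full API compatibility without touching the caller.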
Availability
Support for Ascend NPUs in LMCache is now available as a public beta. Developers can access the Ascend-compatible version, example code, and deployment documentation on the official LMCache-Ascend GitHub repository.
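For orientation, the sketch below shows how LMCache is typically enabled through vLLM's KV-connector interface. The model name and configuration values are placeholders, and the Ascend-specific invocation may differ, so treat the repository's deployment documentation as the authority.

```python
# Hedged sketch: enabling LMCache via vLLM's KV-transfer connector.
# Model name and settings are placeholder assumptions; see the
# LMCache-Ascend docs for the Ascend-specific setup.
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",  # route KV cache through LMCache
        kv_role="kv_both",                  # both store and retrieve entries
    ),
)

outputs = llm.generate(
    ["Explain what a KV cache is in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```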
- Learn more about LMCache-Ascend: LMCache-Ascend Wiki
- Get the beta release of LMCache-Ascend and read the integration guide: LMCache-Ascend
- Learn more about Ascend: Ascend Official Website
Looking Ahead
The integration with Ascend NPUs marks a significant milestone on our path to a comprehensive and high-performance AI acceleration platform. Going forward, we will continue to deepen our collaboration with Ascend developers and the broader Ascend ecosystem. Our roadmap includes further optimizations for prefill-decode disaggregation, support for the latest Ascend hardware, and enhanced multi-node caching capabilities. We remain committed to making LMCache the premier caching solution for AI inference, regardless of hardware, so developers can focus on building the next generation of AI applications.
Acknowledgments
This achievement would not have been possible without the support of our partners and internal team. We extend our sincere gratitude to the Huawei Euler Team and to the broader community of Ascend developers, including the following individuals, for their contributions:
