Introduction
LLM inference becomes increasingly challenging as context length grows and workloads scale. Traditional serving engines rely on prefix-based KV cache reuse, which limits opportunities for optimization, especially when processing long, repeated, or overlapping text across different requests.
LMCache addresses this challenge. It is an extension to LLM serving engines that dramatically reduces time-to-first-token (TTFT) and increases throughput, particularly in long-context scenarios. Unlike traditional prefix-based reuse, LMCache enables fine-grained KV cache reuse for any repeated text, regardless of its position or the serving instance handling the request. By storing reusable KV caches across GPU memory, CPU DRAM, and local disk, LMCache eliminates redundant computation and preserves valuable GPU cycles.
When integrated with vLLM, LMCache delivers 3-10x performance improvements on AMD Instinct™ MI300X GPUs for a wide range of community models, including Qwen3, Llama3, and Qwen-VL.
By combining LMCache with vLLM, developers achieve significant performance improvements and GPU-cycle savings in many LLM use cases, including long-document QA and multi-round QA.
Long document benchmark
To verify this, we ran the long-document QA benchmark on an AMD Instinct™ MI300X GPU with LMCache enabled and disabled, using 100 documents with a document length of 10,000 tokens.
The following results were generated with popular community models, including Qwen3, Llama3, and Qwen-VL, to demonstrate the performance boost across different model architectures.
This benchmark evaluates the performance of Llama3 (70B), the Qwen2.5 vision-language models, and the Qwen3 series (specifically the 8B and 30B parameter versions) on a long-document question-answering task, with a key focus on the impact of enabling LMCache.
The tests were conducted on an AMD Instinct MI300X server. The core methodology involved serving each model with the vLLM framework and running a client-side benchmark script (long_doc_qa.py) that processes many documents, each 10,000 tokens long, while generating 300-token outputs. The primary variable was the number of documents processed, with tests run for 100, 200, and 500 documents.
The server configuration differed between the two scenarios: one with LMCache enabled and one without.
Enabling LMCache involved specific environment variables such as PYTHONHASHSEED=0 and LMCACHE_MAX_LOCAL_CPU_SIZE, the latter tuned for different model sizes (e.g., 200 for Qwen3-8B, 180 and 150 for Qwen3-30B). The vLLM command also included the --kv-transfer-config parameter set to use the LMCacheConnectorV1 connector. The results presented in the accompanying charts demonstrate a significant performance improvement when LMCache is activated.
The conclusion that can be drawn is that LMCache effectively optimizes the inference process for long-context scenarios. By caching key-value (KV) pairs from the transformer model's attention mechanism, LMCache reduces redundant computation when processing long, similar documents. This leads to lower latency, particularly for the first token, and higher overall throughput, making the Qwen3 models more efficient and responsive for handling extensive textual data on MI300X hardware. The tuning of the LMCACHE_MAX_LOCAL_CPU_SIZE parameter for different model sizes also highlights the importance of cache configuration for optimal resource utilization.
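A quick way to observe this effect outside the benchmark scripts is to send the same long prompt twice and compare time-to-first-token. The sketch below is a minimal illustration only, not part of the benchmark suite; it assumes the Qwen/Qwen3-8B server from the reproduction steps below is listening on http://localhost:8000 and uses vLLM's OpenAI-compatible streaming completions endpoint via the requests library.

# Minimal sketch (hypothetical script, not part of the benchmark suite):
# send the same long prompt twice and print the time-to-first-token for each
# request. The second TTFT should drop sharply when the prefill KV cache is
# reused (by LMCache, and/or vLLM's own prefix cache within one instance).
import time
import requests

URL = "http://localhost:8000/v1/completions"   # served by the commands below
PROMPT = ("lorem ipsum " * 4000) + "\n\nSummarize the document above."  # long, repeated text

def time_to_first_token() -> float:
    payload = {
        "model": "Qwen/Qwen3-8B",
        "prompt": PROMPT,
        "max_tokens": 32,
        "stream": True,   # stream so we can time the first chunk
    }
    start = time.perf_counter()
    with requests.post(URL, json=payload, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:                      # first non-empty SSE chunk
                return time.perf_counter() - start
    raise RuntimeError("no tokens returned")

print(f"cold TTFT: {time_to_first_token():.3f} s")   # prefill computed from scratch
print(f"warm TTFT: {time_to_first_token():.3f} s")   # repeated prefix served from cache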
How to reproduce:
Server:
Enable LMCache:
PYTHONHASHSEED=0 \
LMCACHE_MAX_LOCAL_CPU_SIZE=200 \
vllm serve Qwen/Qwen3-8B \
--tensor-parallel-size 1 \
--kv-transfer-config \
'{"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"}' \
--gpu-memory-utilization 0.9 \
--load-format dummy \
--trust-remote-code
Disable LMCache:
PYTHONHASHSEED=0 vllm serve Qwen/Qwen3-8B \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.9 \
--load-format dummy \
--trust-remote-code
Client:
Number of documents: 100
python3 benchmarks/long_doc_qa/long_doc_qa.py \
--model Qwen/Qwen3-8B \
--num-documents 100 \
--document-length 10000 \
--output-len 300 \
--repeat-count 1 \
--repeat-mode tile \
--max-inflight-requests 4
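To cover the 100-, 200-, and 500-document runs described above, the same client command can simply be repeated with different --num-documents values. The helper below is a hypothetical convenience wrapper (not part of the repository) and assumes it is launched from the directory containing benchmarks/long_doc_qa/long_doc_qa.py.

# Hypothetical sweep helper: re-runs the client command above for each of the
# document counts used in this post (100, 200, 500) against the same server.
import subprocess

for num_docs in (100, 200, 500):
    subprocess.run(
        [
            "python3", "benchmarks/long_doc_qa/long_doc_qa.py",
            "--model", "Qwen/Qwen3-8B",
            "--num-documents", str(num_docs),
            "--document-length", "10000",
            "--output-len", "300",
            "--repeat-count", "1",
            "--repeat-mode", "tile",
            "--max-inflight-requests", "4",
        ],
        check=True,  # stop the sweep if any run fails
    )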
Multi-round QA benchmark
Multi-round QA workflow
To verify this, we ran the multi-round QA benchmark on the MI300X GPU with LMCache enabled and disabled, using 20 users and 6 rounds.
The workload simulated in these benchmarks is a multi-round QA (question answering) task with multiple users interacting with an LLM engine concurrently.
The following results were generated with popular community models, including Qwen3, Llama3, and Qwen-VL, to demonstrate the performance boost across different model architectures.
How to reproduce:
Server:
Enable LMCache:
PYTHONHASHSEED=0 \
MIOPEN_USER_DB_PATH=/app/miopen \
MIOPEN_FIND_MODE=FAST \
VLLM_USE_V1=1 \
VLLM_ROCM_USE_AITER=1 \
SAFETENSORS_FAST_GPU=1 \
vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
--tensor_parallel_size=8 \
--trust_remote_code \
--mm-encoder-tp-mode "data" \
--kv-transfer-config \
'{"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"}' \
--load-format dummy \
--gpu-memory-utilization 0.6
Disable LMCache:
PYTHONHASHSEED=0 \
MIOPEN_USER_DB_PATH=/app/miopen \
MIOPEN_FIND_MODE=FAST \
VLLM_USE_V1=1 \
VLLM_ROCM_USE_AITER=1 \
SAFETENSORS_FAST_GPU=1 \
vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
--tensor_parallel_size=8 \
--trust_remote_code \
--mm-encoder-tp-mode "data" \
--load-format dummy \
--gpu-memory-utilization 0.6
Client:
python3 multi-round-qa.py \
--num-users 20 \
--num-rounds 6 \
--qps 1 \
--shared-system-prompt 1000 \
--user-history-prompt 2000 \
--answer-len 100 \
--model Qwen/Qwen2.5-VL-72B-Instruct \
--base-url http://localhost:8000/v1
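The structure of this workload is what LMCache exploits: each round re-sends the shared system prompt plus the user's growing conversation history, so most of every request's prefix is repeated text. The sketch below is illustrative only (it is not multi-round-qa.py); it assumes the openai Python client and the Qwen/Qwen2.5-VL-72B-Instruct server started above, reachable at http://localhost:8000/v1.

# Illustrative sketch of one simulated user's conversation (not multi-round-qa.py).
# Every round resends the shared system prompt plus the accumulated history, so
# the repeated prefix can be served from LMCache instead of being recomputed.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

messages = [{"role": "system", "content": "shared system prompt " * 200}]  # stand-in for the shared system prompt
for round_idx in range(6):  # matches --num-rounds 6
    messages.append({"role": "user", "content": f"Question for round {round_idx}"})
    reply = client.chat.completions.create(
        model="Qwen/Qwen2.5-VL-72B-Instruct",
        messages=messages,
        max_tokens=100,  # matches --answer-len 100
    )
    # The answer becomes part of the repeated prefix of the next round.
    messages.append({"role": "assistant", "content": reply.choices[0].message.content})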
Summary
The LMCache long-document QA benchmark is designed to evaluate the performance of large language models when processing and answering questions about lengthy documents. It demonstrates the AMD Instinct MI300X GPU system’s ability to handle extended context windows and maintain information-retrieval accuracy across long documents, and it is a useful tool for assessing how well LLM serving systems manage memory-intensive, long-context scenarios.
The LMBenchmark suite provides a comprehensive evaluation framework for LLM serving systems through multi-round question-answering scenarios. The benchmark simulates realistic concurrent user interactions with configurable test parameters: a ShareGPT workload (real-world conversations at QPS=1.34), the number of users, and the number of rounds. By enabling systematic comparison of LLM serving systems under varying load conditions, across Llama-style models, vision-language models, and MoE models, it shows that LMCache provides essential performance improvements for large-scale language model deployments on AMD GPUs.
LMCache Roadmap
In Q1 2026, the LMCache roadmap is driven by a primary objective: stabilize core features while pioneering advanced global KV cache sharing mechanisms for scalable LLM deployments.
We are expanding our ecosystem connectivity by integrating with new serving engines such as TRTLLM and Modular, and simultaneously optimizing our storage layer with io_uring and NVMe FDP support for superior I/O performance. Internally, we are refactoring our RPC infrastructure and memory allocators to enable large-scale peer-to-peer cache sharing and disaggregated pooling.
More importantly, we are strengthening our commitment to supporting heterogeneous hardware, including AMD. We are establishing a dedicated AMD testing platform within our CI/CD pipeline to ensure seamless compatibility and peak performance on AMD hardware, solidifying LMCache’s role in the next generation of efficient LLM serving infrastructure.
For more details, please visit our Q1 roadmap at: https://github.com/LMCache/LMCache/issues/2350
Acknowledgements
We would like to thank the many talented people who have contributed to this collaboration:
- AMD: Andy Luo, Haichen Zhang, and the AMD AIG Teams.
- LMCache: Junchen Jiang, Yihua Cheng, Nick Barcet
We’re excited to keep refining and expanding our optimizations to unlock even greater capabilities in the weeks and months ahead!
Join Us
Looking for Collaborations! Calling all passionate community developers and researchers: join us in training the next-generation router model on AMD GPUs and building the future of trustworthy AI infrastructure.
Interested? Reach out to us:
- Haichen Zhang: [email protected]
- Yihua Cheng: [email protected]
