Overview of the Collaboration
The KV Cache is a memory optimization that speeds up the forward pass of Large Language Models (LLMs) by storing the Key (K) and Value (V) matrices so the model does not have to recalculate them for the entire text sequence every time a new token is generated. Maximizing the KV Cache hit rate with storage is therefore paramount for delivering high-performance LLM inference. At the same time, the KV Cache’s immense GPU memory footprint, amplified by large system prompts, is the primary bottleneck in LLM serving, directly constraining context length, concurrency, and overall system throughput. Our primary goal is to maximize the KV Cache hit ratio by effectively expanding GPU HBM with a tiered, node-local storage solution. To achieve this, we collaborated with the LMCache team (Kuntai Du, Jiayi Yao, and Yihua Cheng, who founded Tensormesh after collaborating on LMCache), leveraging their intelligent caching software to build this solution on GKE.
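To make the mechanism concrete, here is a minimal, framework-agnostic sketch of single-head attention during decoding (illustrative only, with toy dimensions and random weights; this is not vLLM or LMCache code): with a KV Cache, each new token only computes and appends its own K and V instead of re-projecting the entire sequence.

```python
import numpy as np

d = 64                                        # toy head dimension
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))

def attend(q, K, V):
    """Scaled dot-product attention for a single query vector."""
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

# Without a KV Cache: every step re-projects K/V for the *entire* sequence.
def step_no_cache(hidden_states):             # hidden_states: (seq_len, d)
    K = hidden_states @ Wk                    # work grows with sequence length
    V = hidden_states @ Wv
    q = hidden_states[-1] @ Wq
    return attend(q, K, V)

# With a KV Cache: only the newest token's K and V are computed and appended.
def step_with_cache(new_hidden, cache):       # new_hidden: (d,)
    cache["K"].append(new_hidden @ Wk)        # constant projection work per step
    cache["V"].append(new_hidden @ Wv)
    q = new_hidden @ Wq
    return attend(q, np.stack(cache["K"]), np.stack(cache["V"]))

# Both paths produce the same attention output for the latest token.
hs = np.random.randn(10, d)
cache = {"K": [], "V": []}
for t in range(hs.shape[0]):
    out_cached = step_with_cache(hs[t], cache)
assert np.allclose(step_no_cache(hs), out_cached)
```

The saving in compute is exactly what the cache trades for memory, which is why its footprint grows linearly with the number of cached tokens.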
Tiered Storage Approach
LMCache allows extending the KV Cache from the GPU’s fast HBM (Tier 1) to larger, cost-effective tiers such as CPU RAM and local SSDs. This innovative approach dramatically increases the total cache size, which improves the chance of a cache hit and boosts overall inference performance, since more data is stored and available locally on the accelerator node. For GKE users, this means accommodating models with massive context windows while maintaining excellent performance.
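Conceptually, the access pattern is that of any tiered cache: check the fastest tier first, fall back to the next tier on a miss, and promote hot entries back up. The sketch below is a toy illustration of that pattern only; it is not LMCache’s actual API, and the tier capacities are hypothetical. It shows why a larger, slower tier is still a win when the alternative on a miss is recomputing the prefill.

```python
from collections import OrderedDict
from typing import Optional

class Tier:
    """One cache tier (e.g. HBM, CPU RAM, Local SSD) with simple LRU eviction."""

    def __init__(self, name: str, capacity: int):
        self.name = name
        self.capacity = capacity        # max number of cached KV chunks (toy unit)
        self.entries = OrderedDict()    # chunk key -> serialized KV bytes

    def get(self, key: str) -> Optional[bytes]:
        if key in self.entries:
            self.entries.move_to_end(key)       # mark as most recently used
            return self.entries[key]
        return None

    def put(self, key: str, value: bytes) -> None:
        self.entries[key] = value
        self.entries.move_to_end(key)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)    # evict least recently used

class TieredKVCache:
    """Look up a KV chunk tier by tier; promote hits back to faster tiers."""

    def __init__(self):
        # Ordered fastest to slowest; capacities are arbitrary toy numbers.
        self.tiers = [Tier("HBM", 4), Tier("CPU RAM", 16), Tier("Local SSD", 64)]

    def get(self, key: str) -> Optional[bytes]:
        for i, tier in enumerate(self.tiers):
            value = tier.get(key)
            if value is not None:
                for faster in self.tiers[:i]:   # promote hit to faster tiers
                    faster.put(key, value)
                return value
        return None                             # full miss: prefill must recompute

    def put(self, key: str, value: bytes) -> None:
        for tier in self.tiers:                 # write through every tier
            tier.put(key, value)

cache = TieredKVCache()
cache.put("system-prompt-chunk-0", b"...serialized KV tensors...")
assert cache.get("system-prompt-chunk-0") is not None
```

Even a hit in the slowest tier only pays a read from local storage, which is far cheaper than re-running the prefill for a long shared prefix.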
Performance Benchmarking and Results
We designed our tests to measure the performance of a tiered KV Cache by configuring workloads with total cache sizes that would fill each storage layer (HBM, CPU RAM, Local SSD). These configurations were benchmarked using various system prompt lengths (1k, 5k, 10k, 50k, and 100k tokens) representing multiple use cases, with TTFT, input token throughput, and end-to-end latency as our primary performance indicators. Some example use cases for different system prompt lengths:
- 1k - 5k: High-Fidelity Personas and Complex Instructions
- 10k: Average user prompt (small RAG) or the tokens in a webpage or article
- 50k: Prompt Stuffing
- 100k: Tokens in a long book
The results below capture the best-performing storage setup for each KV Cache size and the improvements it provides.
Experiment Setup
We deployed a vLLM server on an A3 Mega machine, using Local SSD for ephemeral storage (through an emptyDir volume). The detailed test setup is provided below:
- Hardware: 8 × nvidia-h100-mega-80gb GPUs
- Model: Llama-3.3-70B-Instruct
- LMCache version: v0.3.3
- Cache configurations:
  - HBM only
  - HBM + CPU RAM
  - HBM + CPU RAM + Local SSD
- Storage resources: HBM: 640Gi, CPU RAM: 1Ti, Local SSD: 5Ti
- Benchmark tool: SGLang bench_serving
- Requests: The test was conducted using several system prompt lengths (1k, 5k, 10k, 50k, and 100k tokens). Each system prompt provided a shared context for a batch of 20 inference requests. While the system prompt was shared, each individual request consisted of a unique 256-token input and generated a 512-token output.
Example command:
```bash
python3 sglang/bench_serving.py \
  --host=${IP} --port=${PORT} \
  --dataset-name='generated-shared-prefix' \
  --model=$MODEL --tokenizer=$MODEL \
  --backend=vllm \
  --gsp-num-groups=80 \
  --gsp-prompts-per-group=20 \
  --gsp-system-prompt-len=1000 \
  --gsp-question-len=256 \
  --gsp-output-len=512 \
  --request-rate=800 \
  --max-concurrency=200
```
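To relate the cache sizes used in the tests below to the storage tiers above, here is a rough back-of-the-envelope estimate of the per-token KV Cache footprint. It assumes 16-bit (bf16/fp16) K/V values and the published Llama 3 70B attention configuration (80 layers, 8 KV heads via grouped-query attention, head dimension 128); actual usage depends on the serving engine’s memory management and any quantization.

```python
# Rough KV Cache sizing for Llama-3.3-70B-Instruct (assumed config: 80 layers,
# 8 KV heads, head_dim 128, 16-bit K/V values).
num_layers, num_kv_heads, head_dim, bytes_per_value = 80, 8, 128, 2

# Factor of 2 for storing both K and V at every layer.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
print(f"KV Cache per token: ~{kv_bytes_per_token / 2**10:.0f} KiB")   # ~320 KiB

GiB = 2**30
for tokens in (1.2e6, 4.2e6, 13e6):   # roughly the three test regimes below
    print(f"{tokens / 1e6:>5.1f}M tokens -> ~{tokens * kv_bytes_per_token / GiB:,.0f} GiB")
# ~1.2M tokens ->   ~366 GiB  (fits in HBM alongside the model weights)
# ~4.2M tokens -> ~1,282 GiB  (needs HBM + CPU RAM)
#  ~13M tokens -> ~3,967 GiB  (needs HBM + CPU RAM + Local SSD)
```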
Benchmark results
We ran the tests with different total KV Cache sizes. The results below capture the best-performing storage setup for each cache size and the improvement it delivers:
Test 1: The total cache of 1.1M - 1.3M tokens fits entirely within HBM.
Results: In this scenario, adding slower storage tiers offered no advantage, making an HBM-only configuration the optimal setup.
Test 2: The cache size was increased to 4.0M - 4.3M tokens, exceeding HBM’s capacity but fitting within the combined HBM + CPU RAM capacity.
Results:
| System Prompt Length (tokens) | Best-performing Storage Setup | Mean TTFT Change (%) vs. HBM-only Setup | Input Throughput Change (%) vs. HBM-only Setup | Mean End-to-End Latency Change (%) vs. HBM-only Setup |
|---|---|---|---|---|
| 1000 | HBM | 0% | 0% | 0% |
| 5000 | HBM + CPU RAM | -18% | +16% | -14% |
| 10000 | HBM + CPU RAM | -44% | +50% | -33% |
| 50000 | HBM + CPU RAM + Local SSD | -68% | +179% | -64% |
| 100000 | HBM + CPU RAM + Local SSD | -79% | +264% | -73% |
Test 3: With a large cache of 12.6M - 13.7M tokens, the workload saturated both HBM and CPU RAM, spilling over to Local SSD, while still fitting within the combined HBM + CPU RAM + Local SSD capacity.
Results:
| System Prompt Length (tokens) | Best-performing Storage Setup | Mean TTFT Change (%) vs. HBM-only Setup | Input Throughput Change (%) vs. HBM-only Setup | Mean End-to-End Latency Change (%) vs. HBM-only Setup |
|---|---|---|---|---|
| 1000 | HBM + CPU RAM | +5% | +1% | -1% |
| 5000 | HBM + CPU RAM | -6% | +27% | -21% |
| 10000 | HBM + CPU RAM + Local SSD | +121% | +23% | -19% |
| 50000 | HBM + CPU RAM + Local SSD | +48% | +69% | -41% |
| 100000 | HBM + CPU RAM + Local SSD | -3% | +130% | -57% |
Summary and Next Steps
As the results above show, the tiered storage solution improves inference performance by using node-local storage, particularly in scenarios with long system prompts that generate large KV Caches.
Improving LLM inference performance is a complex challenge where multiple infrastructure components (storage, compute, networking) must work together. Our efforts are part of a larger program to optimize the entire end-to-end inference stack—from intelligent load balancing at the Inference Gateway to advanced caching logic in the model server.
We are continuing to explore further enhancements by integrating additional remote storage solutions with LMCache.
Next Steps:
- To get started, try the same setup described above on GKE.
- Keep up to date on the LLM-D Inference Stack