
Initiated and Officially Supported by Tensormesh

LMCache on Amazon SageMaker HyperPod: Accelerating LLM Inference with Managed Tiered KV Cache

By Kunal Jha (AWS), Vinay Arora (AWS), and Ziwen Ning (AWS)

Overview

Large language model (LLM) inference performance depends heavily on how efficiently the system manages key-value (KV) cache — the stored attention states that allow the model to avoid recomputing previous tokens. As context lengths grow and concurrent users increase, the KV cache can exceed GPU memory capacity, forcing expensive recomputation that degrades latency and throughput.

LMCache addresses this challenge with a tiered caching architecture that extends KV cache beyond GPU memory into CPU RAM (L1) and remote storage (L2). In this post, we demonstrate how LMCache integrates with Amazon SageMaker HyperPod’s managed tiered storage as an L2 cache backend, delivering up to 1.67× improvement in inter-token latency (ITL) and 1.27× higher throughput for multi-round chat workloads under high concurrency compared to L1-only configurations.

The Challenge: Why L2 Cache Matters

KV caching is a well-established technique for accelerating LLM inference — by storing attention states from previous computations, the model avoids redundant work when generating new tokens. LMCache extends this with an L1 CPU cache that holds evicted KV entries from GPU memory, allowing them to be reloaded instead of recomputed.

However, L1 CPU cache has a practical limit. In production deployments with many concurrent users, the total KV cache footprint across all active sessions can easily exceed available CPU memory. When L1 fills up, it evicts older sessions. If those users return, the server must recompute their entire conversation context from scratch — a process that takes 17+ seconds for a 70B model with 12K tokens of history.
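A back-of-envelope calculation shows why L1 overflows at this scale. The sketch below uses the published Llama-3.1-70B architecture (80 layers, 8 grouped-query KV heads, head dimension 128) and assumes 16-bit KV values; actual per-token size varies with quantization and runtime overhead:

```python
# Rough KV cache footprint for a Llama-3.1-70B-style GQA model.
# Assumptions: 80 layers, 8 KV heads, head dim 128, 2-byte (fp16/bf16) values.
LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_VALUE = 80, 8, 128, 2

def kv_bytes_per_token() -> int:
    # Each token stores one K and one V vector per layer per KV head.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE

per_token = kv_bytes_per_token()            # 327,680 B = 320 KiB per token
per_session = per_token * 12_000 / 2**30    # ~3.66 GiB for a 12K-token session
all_sessions = per_session * 100            # ~366 GiB across 100 sessions

print(f"{per_token / 2**10:.0f} KiB/token")
print(f"{per_session:.2f} GiB per 12K-token session")
print(f"{all_sessions:.0f} GiB across 100 sessions")
```

Under these assumptions, a 20 GB L1 tier holds only about five such sessions, so with 100 concurrent conversations most returning sessions will already have been evicted from L1.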

This is where L2 cache becomes essential. By persisting evicted KV entries to a larger-capacity storage tier, the system can reload them when users return instead of recomputing them.

The Solution: LMCache with SageMaker HyperPod Managed Tiered Storage

Amazon SageMaker HyperPod provides managed tiered storage through its ai-toolkit daemon, which runs on every node in the cluster. This storage layer offers high-performance local access with an HTTP control plane for cache management.

LMCache’s SageMaker HyperPod connector integrates directly with this managed storage as an L2 cache tier. The architecture works as follows:

  1. L1 (CPU RAM): Fast, node-local cache for frequently accessed KV entries. Limited by available CPU memory.
  2. L2 (HyperPod Managed Tiered Storage): Larger capacity cache backed by the ai-toolkit daemon. After new KV entries are written to L1, they are asynchronously persisted to L2 by a background store controller. When a returning session misses L1 (due to eviction), the system checks L2 before falling back to full recomputation — the data is already there from the earlier async store.
[Figure: inference request flow in a HyperPod cluster — an AWS load balancer and intelligent router direct requests to Instance A and Instance B, each running a model pod with an L1 cache, both backed by a shared L2 cache.]

An intelligent router (using strategies such as prefix-aware, KV-aware, or session-based routing) directs returning conversations to the server that previously handled them, maximizing cache hit rates across both L1 and L2 tiers.
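The session-based strategy can be pictured as a stable mapping from session ID to server, so that a conversation's L1 and L2 entries stay local to one node. This is an illustrative sketch with hypothetical endpoints, not the vLLM Production Stack router's actual implementation:

```python
import hashlib

# Illustrative session-based routing: pin each session to one server so its
# KV cache stays on that server's L1 (and its L2 writes). Endpoints are
# hypothetical; a real router also handles health checks and failover.
SERVERS = ["http://instance-a:8000", "http://instance-b:8000"]

def route(session_id: str) -> str:
    # A stable hash sends every request in a conversation to the same server,
    # so returning sessions find their cached KV entries again.
    digest = hashlib.sha256(session_id.encode()).digest()
    return SERVERS[int.from_bytes(digest[:8], "big") % len(SERVERS)]
```

Prefix-aware and KV-aware strategies refine this by inspecting the request's token prefix or the servers' cache contents, but the goal is the same: maximize the chance that a request lands where its KV cache already lives.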

Benchmark Setup

We deployed a two-node inference cluster on SageMaker HyperPod to measure the impact of adding L2 managed tiered storage to an L1-only configuration.

| Component | Configuration |
| --- | --- |
| Model | Llama-3.1-70B-Instruct |
| Hardware | 2× ml.p5.48xlarge (8× NVIDIA H100 80GB per node) |
| Tensor parallelism | 8 per server |
| GPU memory utilization | 0.9 |
| LMCache version | v0.4.3 (with vLLM v0.19.0) |
| Router | vLLM Production Stack router, session-based routing |
| L1 cache | 20 GB CPU RAM per server |
| L2 cache | SageMaker HyperPod managed tiered storage (100 GiB per node) |
| Workload | Multi-round chat: 100 concurrent sessions, 12K tokens per session (2K shared system prompt + 10K chat history), 200 output tokens, QPS=8 |

We compared two configurations:

  • L1 Only: LMCache with CPU cache only. When GPU automatic prefix caching (APC) and L1 both evict a session, it must be recomputed from scratch.
  • L1 + L2: LMCache with CPU cache plus SageMaker HyperPod managed tiered storage. Evicted sessions can be loaded from L2 instead of recomputed.

Benchmarks were run using the LMCache bench engine CLI:

lmcache bench engine \
   --engine-url http://<router-endpoint>:9300 \
   --workload multi-round-chat \
   --tokens-per-gb-kvcache 3000 \
   --kv-cache-volume 400 \
   --mrc-shared-prompt-length 2000 \
   --mrc-chat-history-length 10000 \
   --mrc-user-input-length 50 \
   --mrc-output-length 200 \
   --mrc-qps 8.0 \
   --mrc-duration 180 \
   --json --no-interactive

Results

| Metric | L1 Only | L1 + L2 (HyperPod) | Improvement |
| --- | --- | --- | --- |
| TTFT P50 | 4,506 ms | 5,030 ms | 0.90× |
| TTFT P90 | 8,253 ms | 6,833 ms | 1.21× |
| ITL P50 | 63.4 ms | 38.0 ms | 1.67× |
| ITL P90 | 164.9 ms | 116.8 ms | 1.41× |
| Output throughput | 844 tok/s | 1,071 tok/s | 1.27× |
| Total requests served | 1,016 | 1,200 | 1.18× |
| Success rate | 100% | 100% | — |

With 100 concurrent multi-round chat sessions at QPS=8, the L2 tier improves ITL P50 by 1.67× and P90 by 1.41× while raising output throughput by 1.27×; the system serves 18% more requests in the same time window. Under this high-concurrency workload, GPU APC and L1 are under significant pressure: many sessions are evicted and must be either recomputed (L1-only) or loaded from L2 (L1+L2). The L2 tier turns expensive recomputation into fast shared-memory reads. This particularly benefits inter-token latency, since the decode phase can start sooner when the KV cache is loaded from L2 rather than recomputed.

Cache Flow

When a multi-round chat request arrives:

  1. The session-based router identifies which server previously handled this conversation and routes the request there.
  2. The server checks GPU memory (automatic prefix caching) for the conversation’s KV cache.
  3. If not in GPU memory, LMCache checks the L1 CPU cache.
  4. If not in L1, LMCache queries the L2 HyperPod managed tiered storage via the SageMaker HyperPod connector.
  5. If found in L2, the KV cache is loaded from HyperPod managed tiered storage and the server resumes generation from the cached state.
  6. Only if the KV cache is not found in any tier does the server perform full recomputation.

When new KV cache entries are computed, they are written to L1 immediately and then asynchronously persisted to L2 by a background store controller. This means that by the time L1 evicts an entry (using LRU eviction), the data is already available in L2 for future retrieval.
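The lookup order and write path described above can be condensed into a small sketch. This is a conceptual model of the tiering logic only — not LMCache's actual classes — and a plain dict stands in for the HyperPod connector; GPU-level automatic prefix caching (step 2) is outside its scope:

```python
import threading
from collections import OrderedDict

class TieredKVCache:
    """Conceptual sketch: LRU L1 in CPU RAM, async write-through to a larger L2."""

    def __init__(self, l1_capacity: int, l2_store: dict):
        self.l1 = OrderedDict()          # key -> KV blob, in LRU order
        self.l1_capacity = l1_capacity   # max entries held in L1
        self.l2 = l2_store               # stands in for HyperPod tiered storage

    def get(self, key):
        if key in self.l1:               # step 3: L1 CPU cache hit
            self.l1.move_to_end(key)
            return self.l1[key]
        if key in self.l2:               # steps 4-5: fetch from L2
            self._admit_l1(key, self.l2[key])
            return self.l2[key]
        return None                      # step 6: caller must recompute

    def put(self, key, kv_blob):
        self._admit_l1(key, kv_blob)     # written to L1 immediately
        # Background store controller: persist to L2 asynchronously so the
        # entry is already there when L1 later evicts it.
        threading.Thread(target=self.l2.__setitem__, args=(key, kv_blob)).start()

    def _admit_l1(self, key, kv_blob):
        self.l1[key] = kv_blob
        self.l1.move_to_end(key)
        while len(self.l1) > self.l1_capacity:
            self.l1.popitem(last=False)  # evict least recently used entry
```

The key property is that eviction from L1 never loses data: by the time the LRU policy drops an entry, the earlier asynchronous store has already placed a copy in L2.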

When L2 Cache Helps Most

The L2 managed tiered storage tier provides the greatest benefit when:

  • The model is large: Larger models have higher recomputation cost per token, making cache retrieval from L2 more valuable than recomputing. Our benchmarks used Llama-3.1-70B-Instruct; smaller models (e.g., 8B parameters) showed less benefit because recomputation is fast relative to L2 retrieval overhead.
  • Context is long: Longer conversations accumulate more KV cache entries, making recomputation increasingly expensive. At 12K tokens, recomputation takes several seconds for a 70B model.
  • Concurrency is high: When many concurrent sessions compete for GPU and CPU cache, evictions are frequent and L2 serves as a safety net that prevents costly recomputation. Our benchmarks showed the strongest benefit at QPS=8 with 100 concurrent sessions.
  • Sessions return: Multi-round chat workloads where users return to previous conversations benefit most, as the L2 cache retains their context even after L1 eviction.
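These conditions reduce to a simple break-even comparison: reloading from L2 wins whenever transferring the session's KV cache is faster than re-running prefill. The prefill throughput, per-token KV size, and storage bandwidth below are illustrative assumptions, not measured values:

```python
# Break-even sketch: recompute (prefill) vs. reload from L2.
# All rates below are illustrative assumptions, not benchmark measurements.
def recompute_seconds(context_tokens: int, prefill_tok_per_s: float) -> float:
    # Time to re-run prefill over the whole conversation history.
    return context_tokens / prefill_tok_per_s

def reload_seconds(context_tokens: int, kib_per_token: float, gib_per_s: float) -> float:
    # Time to stream the session's KV cache from the L2 tier.
    return context_tokens * kib_per_token / 2**20 / gib_per_s

ctx = 12_000
recompute = recompute_seconds(ctx, prefill_tok_per_s=700)      # assumed 70B prefill rate
reload = reload_seconds(ctx, kib_per_token=320, gib_per_s=5)   # assumed local storage bandwidth

print(f"recompute ~{recompute:.1f}s vs. reload ~{reload:.2f}s")
```

With these placeholder numbers, recomputation takes roughly 17 seconds — consistent with the cost cited earlier for a 70B model with 12K tokens of history — while the reload finishes in under a second. As prefill gets faster (smaller models) or contexts get shorter, the gap narrows, which matches the conditions listed above.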

Summary

LMCache’s integration with SageMaker HyperPod managed tiered storage provides a practical path to improving LLM inference performance for multi-round chat and long-context workloads. By adding an L2 cache tier backed by HyperPod’s managed storage infrastructure, deployments can achieve up to 1.67× reduction in ITL P50 and 1.27× higher throughput under high-concurrency workloads — without additional operational complexity, as the ai-toolkit daemon is already running on every HyperPod node.

For setup instructions, see the LMCache SageMaker HyperPod documentation. For more about SageMaker HyperPod, visit the Amazon SageMaker HyperPod documentation.
