CacheGen: Store Your KV Cache on Disk or S3—Load Blazingly Fast!

By Kuntai Du and Kobe

TL;DR: 🚀 CacheGen lets you store KV caches on disk or AWS S3 and load them far faster than recomputing them! It compresses your KV cache to up to 3× smaller than quantization alone, so you can load it blazingly fast while keeping response quality high. Stop wasting compute: use CacheGen to fully utilize your storage and get an instant first-token speedup!

CacheGen reduces KV cache loading time from disk.

Key Results 📊

System             | Mean TTFT (ms) | Mean TPOT (ms)
LMCache + CacheGen | 737            | 47.7
Naive vLLM         | 4,355          | 247.6
Fireworks          | 2,353          | 664.7
DeepInfra          | 2,949          | 79.0
Baseten            | 113,239        | 174.9

Takeaway: CacheGen cuts Time-To-First-Token (TTFT) substantially compared to the other baselines above, and reduces per-token latency (TPOT) as well.

Quick Start 🛠️

uv pip install vllm
uv pip install lmcache

# Start cache server
lmcache_server localhost 65434

# Start vLLM+LMCache server (using CacheGen)
LMCACHE_CONFIG_FILE=example.yaml CUDA_VISIBLE_DEVICES=2 vllm serve meta-llama/Llama-3.1-8B-Instruct --gpu-memory-utilization 0.8 --port 8020 --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'

example.yaml:

chunk_size: 2048                    # tokens per KV cache chunk
local_cpu: False                    # skip the local CPU cache and go straight to the remote backend
remote_url: "lm://localhost:65434"  # the lmcache_server started above
remote_serde: "cachegen"            # store KV chunks with CacheGen compression
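
As a quick sanity check, here is a minimal sketch that queries the OpenAI-compatible endpoint started above and compares time-to-first-token between a cold request and a warm one whose long shared prefix can be served from the stored KV cache. The long_context.txt file and the timing helper are illustrative assumptions; only the port and model name come from the serve command above.

# compare_ttft.py
import time
from openai import OpenAI

# Endpoint and model match the `vllm serve` command in the Quick Start.
client = OpenAI(base_url="http://localhost:8020/v1", api_key="dummy")

# Hypothetical long document that both requests share as a prompt prefix,
# so the second request can reuse the KV cache stored by LMCache.
with open("long_context.txt") as f:
    context = f.read()

def time_to_first_token(question: str) -> float:
    """Stream a chat completion and return seconds until the first chunk arrives."""
    start = time.time()
    stream = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": context + "\n\nQuestion: " + question}],
        stream=True,
        max_tokens=64,
    )
    for _ in stream:
        return time.time() - start  # first streamed chunk ≈ first token

print("cold TTFT (s):", time_to_first_token("Summarize the document."))
# The shared prefix was prefilled and stored on the first call, so the second
# call should mostly load the cache instead of recomputing it.
print("warm TTFT (s):", time_to_first_token("List the key findings."))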

Citation

If you use CacheGen in your research, please cite our paper:

@misc{liu2024cachegenkvcachecompression,
      title={CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving}, 
      author={Yuhan Liu and Hanchen Li and Yihua Cheng and Siddhant Ray and Yuyang Huang and Qizheng Zhang and Kuntai Du and Jiayi Yao and Shan Lu and Ganesh Ananthanarayanan and Michael Maire and Henry Hoffmann and Ari Holtzman and Junchen Jiang},
      year={2024},
      eprint={2310.07240},
      archivePrefix={arXiv},
      primaryClass={cs.NI},
      url={https://arxiv.org/abs/2310.07240}, 
}

Paper: CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving

Contact

CacheGen: persistent, streaming context for fast, scalable LLMs—the LMCache Lab way! 🚀
