LMCache 第一时间支持 GPT-OSS（20B/120B）

[2025年8月5日]() [Benchmark](https://blog.lmcache.ai/en/category/benchmark/), [Best practices](https://blog.lmcache.ai/en/category/best-practices/), [News](https://blog.lmcache.ai/en/category/news-en/), [benchmark](https://blog.lmcache.ai/en/tag/benchmark-en/), [gpt-oss](https://blog.lmcache.ai/en/tag/gpt-oss-en/), [OpenAI](https://blog.lmcache.ai/en/tag/openai-en/), [vLLM](https://blog.lmcache.ai/en/tag/vllm-en/)

作者：Yihua, Kobe

LMCache 现已第一时间支持 OpenAI 最新发布的 GPT-OSS 模型（200 亿与 1200 亿参数）！

本文提供完整指南，教你如何用 vLLM + LMCache 部署 GPT-OSS 模型，并通过 CPU offloading能力获得显著性能提升。

A bar chart comparing the performance of Vanilla vLLM and vLLM with LMCache in a long-document Q&A benchmark, showing average TTFT and total finish time in seconds with percentage reductions highlighted.

步骤 1：安装 vLLM GPT-OSS 版

安装

uv pip install –pre vllm==0.10.1+gptoss \

–extra-index-url https://wheels.vllm.ai/gpt-oss/ \

–extra-index-url https://download.pytorch.org/whl/nightly/cu128 \

–index-strategy unsafe-best-match

验证安装

vllm serve openai/gpt-oss-120b

–max-model-len 32768

–disable-hybrid-kv-cache-manager

url http://localhost:9000/v1/chat/completions

-H “Content-Type: application/json”

-d ‘{

“model”: “openai/gpt-oss-120b”,

“messages”: [

{

“role”: “user”,

“content”: “Hello how are you today”

}

“temperature”: 0.7

}’

步骤 2：从源码安装LMCache

为什么要从源码安装？

vLLM依赖PyTorch Nightly构建版本运行GPT模型，为了确保兼容性，我们强烈建议基于您当前虚拟环境中的Pytorch版本构建LMcache。

安装步骤

从源码安装 LMCache（因需编译 CUDA 内核，此过程可能持续几分钟）：

git clone https://github.com/LMCache/LMCache.git

cd LMCache

# In your virtual environment

ENABLE_CXX11_ABI=1 uv pip install -e . –no-build-isolation

验证安装

python3 -c “import torch; import lmcache; import lmcache.c_ops”

步骤 3：基于LMCache运行vLLM

LMCache配置

创建用于 CPU offloading的配置文件 backend_cpu.yaml：

# Create a CPU offloading buffer with 80G

chunk_size: 256

local_cpu: True

max_local_cpu_size: 80

LMCache部署vLLM

LMCACHE_CONFIG_FILE=”./backend_cpu.yaml”

LMCACHE_USE_EXPERIMENTAL=True

vllm serve

openai/gpt-oss-120b

–max-model-len 32768

–disable-log-requests

–disable-hybrid-kv-cache-manager

–kv-transfer-config

‘{“kv_connector”:”LMCacheConnectorV1″, “kv_role”:”kv_both”}’

步骤 4：Benchmark测试结果

应用场景：长文档问&答

– 输入： 20 篇不同文档，平均每篇约20k个词元

– 输出： 每次查询 50 个词元

测试流程

1. 阶段一： 将全部文档送入推理引擎以预热 KV cache；

2. 阶段二： 随机重排查询顺序再次发送，测量TTFT与总完成时间。

性能结果

第二阶段benchmark测试表明，系统性能显著提升：

| 配置方案 | 平均 TTFT (s) | 完成全部查询总时长 (s) |

|——————|————–|————————|

| 原生 vLLM | 1.20 | 15.70 |

| vLLM + LMCache | 0.39 | 7.73 |

为什么会有性能提升？

在单卡 A100/H100 上部署 GPT-120B 时，可用的KV cache GPU缓冲区通常不足 10 GB。LMCache 通过 CPU offloading缓冲区，使 vLLM 能够存储并复用更多前缀的 KV cache，所以：

– 将首词元生成时间降低 67 %；

– 将查询总完成时间缩短 51 %。

运行Benchmark

执行以下命令即可复现上述结果：

python long-doc-qa.py \

–num-documents 20 \

–document-length 20000 \

–output-len 50 \

–repeat-count 1 \

–repeat-mode random \

–shuffle-seed 0

完整的benchmark测试脚本可在以下地址获取：https://github.com/LMCache/LMCache/blob/dev/benchmarks/long-doc-qa/long-doc-qa.py.

About us

Categories

Tags

LMCache 第一时间支持 GPT-OSS（20B/120B）

赞过：

发表评论取消回复

About us

Categories

Tags

LMCache 第一时间支持 GPT-OSS（20B/120B）

赞过：

发表评论取消回复

了解 LMCache Blog 的更多信息