Get started easily: a single MacBook is all you need to develop vLLM + LMCache
For New Contributors · Covering Frontend / L1 Eviction / L2 Storage / Observability
If you ever skipped LMCache because you didn’t have a GPU on hand, this guide was written for you. LMCache’s multi-platform framework has already decoupled the GPU from most of the core data paths, so on an ordinary MacBook (Apple Silicon) you can run the full vLLM + LMCache end-to-end pipeline, modify code, run unit tests, and do end-to-end verification with ease. Ubuntu follows a mostly similar setup, so this article does not cover it in detail; for reference, see .github/workflows/cpu_device.yml.
This article covers:
① What problem LMCache solves;
② How to set up your workspace on a MacBook;
③ How to run vLLM + LMCache end to end;
④ What to do when you run out of memory, or when the model won’t download;
⑤ What specifically you can work on across four directions.

◍ 1. What problem does LMCache solve (A 60-second Overview)
During LLM inference, every request computes a segment of KV cache that supports the generation of subsequent tokens. If contexts share a common prefix (a system prompt, document retrieval, multi-turn dialogue, etc.), the KV that would otherwise be recomputed can in fact be reused directly — from another request, another machine, or from disk.
LMCache is a standalone KV cache engine: it receives KV from the inference engine (vLLM, SGLang, …), aggregates it in memory (L1), persists it in external storage (L2: disk / Redis / object storage / NIXL …), and efficiently returns matched KV to the inference engine when a new request arrives. To the layer above, it is simply a “cache layer + multi-machine coordination” capability; in scenarios with high reuse rates it can cut time-to-first-token by an order of magnitude.
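To make the prefix-reuse idea concrete, here is a minimal toy sketch, not LMCache code. It assumes KV is keyed per fixed-size chunk and that each key depends on the whole prefix up to that chunk. The names prefix_chunk_keys and ToyPrefixCache are hypothetical, the chunk size of 4 is chosen only for readability, and blake2b from the standard library stands in for whatever hash the real engine uses.

```python
import hashlib
from typing import Dict, List, Sequence


def prefix_chunk_keys(token_ids: Sequence[int], chunk_size: int = 4) -> List[str]:
    """One key per full chunk; each key covers the entire prefix so far."""
    keys: List[str] = []
    hasher = hashlib.blake2b(digest_size=16)
    full_len = len(token_ids) - len(token_ids) % chunk_size  # drop the partial tail
    for start in range(0, full_len, chunk_size):
        chunk = token_ids[start:start + chunk_size]
        hasher.update(",".join(map(str, chunk)).encode() + b";")
        keys.append(hasher.hexdigest())  # rolling digest of the prefix up to here
    return keys


class ToyPrefixCache:
    """Hypothetical in-memory stand-in for an L1 cache (strings replace KV tensors)."""

    def __init__(self, chunk_size: int = 4) -> None:
        self.chunk_size = chunk_size
        self._kv: Dict[str, str] = {}

    def store(self, token_ids: Sequence[int]) -> None:
        for index, key in enumerate(prefix_chunk_keys(token_ids, self.chunk_size)):
            self._kv.setdefault(key, f"kv-for-chunk-{index}")

    def reusable_tokens(self, token_ids: Sequence[int]) -> int:
        """Number of leading tokens whose KV could be reused instead of recomputed."""
        hit = 0
        for key in prefix_chunk_keys(token_ids, self.chunk_size):
            if key not in self._kv:
                break
            hit += self.chunk_size
        return hit


cache = ToyPrefixCache()
cache.store([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
shared_prefix_hit = cache.reusable_tokens([1, 2, 3, 4, 5, 6, 7, 8, 99, 100])
```

Because every key folds in all earlier chunks, a request that differs in its first chunk gets no hits even if later chunks happen to contain the same tokens.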
There is really only one thing you need to remember: LMCache’s core data structure is the tensor, and the way tensors are moved around is abstracted into a “Transfer Context.” A Transfer Context can travel over GPU IPC or over CPU shared memory — which is exactly why a MacBook can run the full pipeline too. See https://blog.lmcache.ai/en/2026/06/15/understanding-lmcache-mp-mode-transfer-paths-a-beginners-guide/
◍ 2. Why a MacBook is enough for development
The LMCache codebase has a Platform/Device abstraction layer that centralizes hardware-specific checks, such as “Is CUDA available?” or “Does this op exist in PyTorch?”, behind a few well-defined entry points. For the vast majority of modules such as Frontend/Eviction/Storage/Observability, the actual tensor-processing steps happen on the Python side, and data movement is hidden behind the Transfer Context. So you can absolutely get hands-on directly on a laptop, even without access to a GPU.
Two kinds of Transfer Context
- EngineDrivenTransferContext: The worker gathers the tensors and sends the full tensor payload to the LMCache server. The server then copies the received data into its cache pool with memcpy. Sub-modes include SHM (zero-copy via POSIX shm_open/mmap) and pickle (serialized data transferred over a socket).
- LMCacheDrivenTransferContext: The LMCache server directly accesses tensor memory owned by the worker through an IPC handle. On GPU, this handle is a CUDA IPC handle. On CPU, it is the name of a POSIX shared-memory segment. In both cases, both processes map the same underlying physical pages via mmap, enabling zero-copy sharing.
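The "name of a shared-memory segment as a handle" idea can be tried with nothing but the Python standard library. The sketch below is an illustration under assumptions, not LMCache's Transfer Context: it uses multiprocessing.shared_memory as a stand-in, and for brevity both the "worker" and "server" roles live in one process, whereas in practice the name would be sent to another process.

```python
from multiprocessing import shared_memory

payload = bytes(range(16))  # stand-in for a KV tensor's raw bytes

# "Worker" side: create a named segment and write the payload into it.
owner = shared_memory.SharedMemory(create=True, size=len(payload))
owner.buf[:len(payload)] = payload

# "Server" side: attach using only the segment name (the handle).
peer = shared_memory.SharedMemory(name=owner.name)
seen_by_peer = bytes(peer.buf[:len(payload)])

# A write through one mapping is visible through the other: no copy was made.
peer.buf[0] = 255
first_byte_seen_by_owner = owner.buf[0]

# Clean up: close both mappings, then remove the segment once.
peer.close()
owner.close()
owner.unlink()
```

Slicing with len(payload) matters because the operating system may round the segment up to a whole page.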
On the vLLM side, you switch between the two paths via lmcache.mp.mp_transfer_mode inside kv_connector_extra_config. For local development we recommend fixing it to engine_driven, so that your local pipeline is exactly equivalent — in terms of data path — to the GPU pipeline in production.
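As a small sketch of where that switch lives, the helper below builds the JSON string passed to vLLM's --kv-transfer-config flag. The keys and connector names are the ones used in section 4.4 of this guide; the function name build_kv_transfer_config and its defaults are hypothetical.

```python
import json

VALID_TRANSFER_MODES = ("engine_driven", "lmcache_driven")


def build_kv_transfer_config(
    transfer_mode: str = "engine_driven",
    host: str = "tcp://localhost",
    port: int = 5555,
) -> str:
    """Return the JSON payload for vLLM's --kv-transfer-config flag."""
    if transfer_mode not in VALID_TRANSFER_MODES:
        raise ValueError(f"unknown transfer mode: {transfer_mode!r}")
    config = {
        "kv_connector": "LMCacheMPConnector",
        "kv_role": "kv_both",
        "kv_connector_module_path": "lmcache.integration.vllm.lmcache_mp_connector",
        "kv_connector_extra_config": {
            "lmcache.mp.host": host,
            "lmcache.mp.port": port,
            # The only key that changes when you switch data paths.
            "lmcache.mp.mp_transfer_mode": transfer_mode,
        },
    }
    return json.dumps(config)


print(build_kv_transfer_config("engine_driven"))
```

Generating the string from a dict avoids hand-editing quoted JSON inside a shell command when you flip between the two modes.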
If you want to dig deeper into LMCache’s multi-platform architecture, look at .github/workflows/cpu_device.yml and .github/scripts/run-cpu-e2e-validation.sh. They are the authoritative reference for CPU-only end-to-end validation in CI, and the best reproducible template for your local environment.
◍ 3. Set up your MacBook Workspace in ten minutes
3.1 Create an isolated Python environment
vLLM’s v1 KV connector API requires Python ≥ 3.11; this guide was tested and verified with Python 3.12. You also need to make sure cmake is installed so that vLLM ≥ 0.4.3 can be built from source. We recommend placing the venv alongside the source code for easy switching:
# We recommend putting everything under ~/projects-test/
mkdir -p ~/projects-test
cd ~/projects-test
# Create a shared venv (vLLM + LMCache will share it)
python3 -m venv .venv-lmcache
source .venv-lmcache/bin/activate
pip install -U pip wheel setuptools
3.2 Clone the Source Repositories
cd ~/projects-test
git clone https://github.com/vllm-project/vllm.git
git clone https://github.com/LMCache/LMCache.git
3.3 Install the CPU build of vLLM (choose either method)
The PyPI wheel for vLLM is built with CUDA support, so importing it on a CPU-only laptop will fail. Below are two installation options: building from source, which is recommended and keeps you aligned with the latest code, or using the prebuilt vLLM CPU nightly wheel, which saves about 2–3 minutes of build time. For day-to-day development, we recommend Option 1. If you only want to quickly validate the end-to-end flow, Option 2 is the faster path.
Option 1: Build from source (recommended)
For exact commands, refer to the official Apple Silicon installation guide:
# Official doc:
# https://docs.vllm.ai/en/stable/getting_started/installation/cpu/#apple-silicon
cd ~/projects-test/vllm
source ~/projects-test/.venv-lmcache/bin/activate
# 1) Install CPU deps (torch CPU, transformers, etc.)
pip install uv
VIRTUAL_ENV=~/projects-test/.venv-lmcache \
uv pip install -r requirements/cpu.txt \
--index-strategy unsafe-best-match
# 2) Build vLLM from source (needs setuptools_scm; Xcode 16 is fine)
pip install setuptools_scm setuptools_rust
VIRTUAL_ENV=~/projects-test/.venv-lmcache VLLM_TARGET_DEVICE=cpu \
uv pip install -e . --no-build-isolation
# 3) Sanity check - v1 KV connector API must be available
python -c \
'import vllm; print(vllm.__version__);
from vllm.distributed.kv_transfer.kv_connector.v1.base \
import KVConnectorBase_V1; print("v1 OK")'
In our testing, on a MacBook M-series + Xcode 16 + Python 3.12, going from scratch to a successful import vllm takes about 2–3 minutes (with torch already downloaded, the vLLM source build is about 40 seconds).
Option 2: Prebuilt vLLM CPU wheel (using the CI script)
If you do not want to wait for a source build, you can use the same script that is validated in LMCache CI: .github/scripts/install_vllm_cpu.sh. The script automatically installs numpy<2 and vllm-cpu-nightly (which includes the torch 2.11 CPU build) and automatically creates the dist-info alias.
The script supports two environment variables to control its behavior:
- PIP_BIN: specifies the pip command path; defaults to pip. You can set it to uv pip to use uv, or to a full path such as ~/projects-test/.venv-lmcache/bin/pip.
- PIP_INSTALL_EXTRA_ARGS: extra arguments passed to pip install; empty by default. For example, you can add --index-url to specify a mirror source.
source ~/projects-test/.venv-lmcache/bin/activate
# Simplest usage (uses pip from the activated venv):
bash ~/projects-test/LMCache/.github/scripts/install_vllm_cpu.sh
# Or explicitly point to the venv's pip
# (use $HOME here: ~ is not expanded inside double quotes):
# PIP_BIN="$HOME/projects-test/.venv-lmcache/bin/pip" \
# bash ~/projects-test/LMCache/.github/scripts/install_vllm_cpu.sh
# If you need to pass extra pip args (e.g. mirror):
# PIP_INSTALL_EXTRA_ARGS="-i https://mirrors.tencent.com/pypi/simple" \
# bash ~/projects-test/LMCache/.github/scripts/install_vllm_cpu.sh
What does the script do internally?
① pip install numpy<2 (a hard dependency of vLLM CPU);
② pip install vllm-cpu-nightly --extra-index-url https://download.pytorch.org/whl/cpu (because vllm-cpu-nightly depends on torch 2.11, which exists only in the PyTorch CPU index; without this argument, pip won’t find a matching torch version);
③ automatically creates the vllm-<ver>+cpu.dist-info alias, because the wheel’s dist metadata is registered as vllm-cpu-nightly rather than vllm, while vLLM’s CLI and internal calls look up the version via importlib.metadata.version("vllm"); without the alias, this throws PackageNotFoundError and vllm serve cannot start.
The +cpu local label is also the condition for activating the CPU platform plugin: it greps the dist metadata for the substring “cpu” to determine that the current environment is CPU. The script is idempotent; running it again only refreshes the alias.
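The lookup that the alias repairs is easy to reproduce with the standard library. The sketch below is a diagnostic under assumptions: installed_version and looks_like_cpu_build are hypothetical helper names, and the substring check on the version string is a simplification of the plugin's metadata check described above, not its actual implementation.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it has no metadata."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # This is the failure mode seen when the dist-info alias is missing.
        return None


def looks_like_cpu_build(ver: Optional[str]) -> bool:
    """Simplified stand-in for the 'cpu' substring check on dist metadata."""
    return ver is not None and "cpu" in ver


vllm_version = installed_version("vllm")
print("vllm metadata:", vllm_version, "| cpu build:", looks_like_cpu_build(vllm_version))
```

If this prints None for vllm even though import vllm works, the module is importable but the distribution metadata is registered under another name, which is the situation the alias addresses.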
3.4 Install LMCache (key point: NO_GPU_EXT=1)
On a laptop without an NVIDIA/AMD/Intel accelerator card, the default pip install -e . will try to pull GPU vendor dependencies such as cuda runtime, nixl, and cupy, and attempt to compile the CUDA extension, failing outright. LMCache now provides three installation switches in setup.py / pyproject.toml, designed specifically for MacBook development scenarios:
- NO_GPU_EXT=1 (first choice for MacBook): skips all GPU extensions plus GPU vendor dependencies like cupy / nixl, but keeps pure C++ extensions such as lmcache_fs / lmcache_redis.
- NO_NATIVE_EXT=1: skips even the native C++ extensions, giving you a pure-Python package. Suitable if you only want to run frontend / algorithm-related tests and don’t want to install a C++ toolchain at all.
- NO_CUDA_EXT=1: a legacy alias, equivalent to NO_NATIVE_EXT, now deprecated. If you see it in older docs, use NO_GPU_EXT or NO_NATIVE_EXT consistently from now on.
On a MacBook you only need one command:
source ~/projects-test/.venv-lmcache/bin/activate
# nvtx upstream sdist misses Cython as a build dep;
# install it manually on macOS first
pip install Cython openai
# One-liner: install LMCache (skip GPU ext + GPU vendor deps)
NO_GPU_EXT=1 pip install --no-build-isolation -e ~/projects-test/LMCache
Why not NO_CUDA_EXT=1? Because it is a legacy alias equivalent to NO_NATIVE_EXT: it skips all native extensions, but the cupy/nixl entries in install_requires are still not skipped, and cupy-cuda12x has no wheel for macOS arm64, so pip cannot resolve the dependencies at all. NO_GPU_EXT=1 is tailored for the “no GPU but has a C++ toolchain” scenario, and also skips the cupy/nixl in cuda_core.txt.
3.5 Verify the installation
source ~/projects-test/.venv-lmcache/bin/activate
python -c 'import lmcache; print(lmcache.__version__)'
python -c \
'from lmcache.integration.vllm.lmcache_mp_connector \
import LMCacheMPConnector; print("connector OK")'
Seeing “StubCPUDevice” / “Skipping backend lmcache.c_ops” means the CPU backend has been correctly activated — you now have everything you need to develop LMCache.
◍ Quick feedback: verify LMCache standalone with server_bench (no vLLM needed)
Before moving to the full vLLM + LMCache end-to-end flow, there is a lighter-weight way to verify: just start the LMCache server, then run lmcache bench server --mode cpu. It needs no vLLM and no model download, only an installed LMCache. This is the fastest “change code → verify” loop.
Benefits
- Extremely fast feedback: from starting the server to seeing results takes under 20 seconds — ideal for tuning iterations.
- Zero dependencies: no vLLM, no model files, no GPU required.
- Focused on LMCache itself: it directly tests store → retrieve checksum consistency, ruling out external variables like the vLLM connector / KV tensor serialization.
- Covers both transfer modes: lmcache_driven and engine_driven, so you can verify the store/retrieve correctness of both data paths separately.
Limitations
- Doesn’t involve real inference: the bench generates random tensors rather than real KV cache, so it can’t verify token-generation consistency or semantic correctness.
- Doesn’t go through the KV connector: the bench communicates with the server directly via ZMQ/HTTP, bypassing vLLM’s LMCacheMPConnector, so it can’t be used to verify connector integration.
- Can’t test cache-hit behavior: server_bench only does a single store+retrieve checksum comparison — no consecutive requests, no cache hit/miss logic.
Just three steps to run it once
The steps below follow .github/scripts/cpu_server_bench_test.sh in CI, and support both the engine_driven and lmcache_driven modes.
# Step 1: start LMCache server (background)
source ~/projects-test/.venv-lmcache/bin/activate
lmcache server \
--port 5555 \
--http-port 8080 \
--l1-size-gb 1 \
--eviction-policy LRU &
# Step 2: wait for healthcheck
while ! curl -fsS http://127.0.0.1:8080/healthcheck 2>/dev/null; do
sleep 1
done
echo 'Server ready'
# Step 3: run bench (lmcache_driven mode)
lmcache bench server \
--rpc-url tcp://127.0.0.1:5555 \
--url http://127.0.0.1:8080 \
--mode cpu \
--transfer-mode lmcache_driven \
--num-tokens 512 \
--end 3
# Or with engine_driven mode:
# lmcache bench server \
# --rpc-url tcp://127.0.0.1:5555 \
# --url http://127.0.0.1:8080 \
# --mode cpu \
# --transfer-mode engine_driven \
# --num-tokens 512 \
# --end 3
# Stop the server when done:
# kill %1
A successful run should print “CHECKSUM MATCH OK” × 3 in the output. If you see “CHECKSUM MISMATCH”, it means the data was corrupted during store or retrieve, and your code change likely introduced a bug. This is also the minimal health check we recommend running before opening a PR.
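If you want to script that pre-PR check, a minimal sketch is to capture the bench output and count the two marker strings quoted above. Everything else here is an assumption: the helper name bench_passed is hypothetical, and the rest of the bench's output format is not relied on.

```python
def bench_passed(output: str, expected_matches: int = 3) -> bool:
    """True if no mismatch was reported and enough matches were seen."""
    mismatches = output.count("CHECKSUM MISMATCH")
    matches = output.count("CHECKSUM MATCH OK")
    return mismatches == 0 and matches >= expected_matches


# Illustrative captured output; only the marker strings come from this guide.
sample = "\n".join(["CHECKSUM MATCH OK"] * 3)
print("bench passed:", bench_passed(sample))
```

In practice you could feed it the text captured from running the bench command with subprocess.run(..., capture_output=True, text=True).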
◍ 4. Your first end-to-end run: vLLM CPU + LMCache
Once you bring vLLM into the mix, you have a complete debuggable inference + KV-cache-reuse environment. vLLM runs on the CPU device and hands KV to the LMCache server via SHM handles through the KV connector.
4.1 Memory planning (must-read, or you’ll OOM)
This is the easiest trap to fall into when running vLLM + LMCache on a MacBook. You need to reserve memory simultaneously for vLLM’s KV cache pool, LMCache’s L1 cache, and the Python process itself. Recommended configuration for a 16 GB MacBook:
- vLLM KV cache pool: controlled by VLLM_CPU_KVCACHE_SPACE; 1 GB recommended (more than enough for opt-125m). Use --kv-cache-memory-bytes for more precise control.
- LMCache L1 cache: controlled by --l1-size-gb; 1 GB recommended. Don’t set it too large, or together with vLLM’s memory the whole machine will run short.
- If your MacBook has only 8 GB of memory, set vLLM KV cache to 0.25 GB (VLLM_CPU_KVCACHE_SPACE=0.25), LMCache L1 to 0.5 GB (--l1-size-gb 0.5), and use --max-model-len 300 --max-num-seqs 1 to limit sequence length and concurrency.
# Memory budget for a 16 GB MacBook (safe defaults):
# vLLM KV pool: 1 GB (VLLM_CPU_KVCACHE_SPACE=1)
# LMCache L1: 1 GB (--l1-size-gb 1)
# torch + model weights: ~1.5 GB
# Python runtime: ~0.5 GB
# OS + apps: ~6 GB
# ------------------------------------------
# Free headroom: ~6 GB
# For an 8 GB MacBook, shrink everything:
# VLLM_CPU_KVCACHE_SPACE=0.25
# --l1-size-gb 0.5
# --max-model-len 300
# --max-num-seqs 1
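The budget is plain subtraction, so it is easy to redo for your own machine. The sketch below uses the rough figures from the 16 GB budget as defaults; the function name free_headroom_gb is hypothetical, and the 4 GB "OS + apps" figure in the 8 GB example is an assumption (it implies closing other applications), not a measurement.

```python
def free_headroom_gb(
    total_gb: float,
    vllm_kv_gb: float,
    lmcache_l1_gb: float,
    model_gb: float = 1.5,   # torch + model weights (rough figure from the budget)
    python_gb: float = 0.5,  # Python runtime
    os_gb: float = 6.0,      # OS + other apps
) -> float:
    """Memory left over after all planned consumers are accounted for."""
    used = vllm_kv_gb + lmcache_l1_gb + model_gb + python_gb + os_gb
    return total_gb - used


# 16 GB MacBook with the recommended settings.
print(free_headroom_gb(16, vllm_kv_gb=1, lmcache_l1_gb=1))  # 6.0
# 8 GB MacBook with the shrunken settings, assuming OS + apps fit in 4 GB.
print(free_headroom_gb(8, vllm_kv_gb=0.25, lmcache_l1_gb=0.5, os_gb=4.0))  # 1.25
```

A negative result means the plan does not fit and something has to shrink before you start either process.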
4.2 macOS arm64 special handling: avoiding the OMP deadlock
This issue is specific to running vLLM CPU on Apple Silicon Macs. If these three environment variables are not set, vLLM may hang indefinitely during OpenMP initialization at startup.
The root cause is a thread-binding deadlock caused by a conflict between macOS’s built-in Accelerate framework and the external OpenMP runtime. The fix is to disable OpenMP thread binding and force single-threaded execution:
# === CRITICAL: avoid vLLM OMP deadlock on macOS arm64 ===
export VLLM_CPU_OMP_THREADS_BIND=nobind
export OMP_NUM_THREADS=1
export KMP_BLOCKTIME=0
# ==========================================================
These three lines only need to be exported in the terminal where you start vLLM; they don’t affect the LMCache server. If you’re on an Intel Mac or Linux, you don’t need them, but setting them anyway won’t cause an error: you’ll just have a few harmless extra environment variables.
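If you launch vLLM from a Python script rather than a shell, the same workaround can be applied conditionally. This is a minimal sketch: the helper name omp_workaround_env is hypothetical, and the platform test assumes a native arm64 Python, where platform.machine() reports "arm64".

```python
import platform
from typing import Dict, Optional

# The three variables from this section.
OMP_WORKAROUND = {
    "VLLM_CPU_OMP_THREADS_BIND": "nobind",
    "OMP_NUM_THREADS": "1",
    "KMP_BLOCKTIME": "0",
}


def omp_workaround_env(
    system: Optional[str] = None, machine: Optional[str] = None
) -> Dict[str, str]:
    """Return the workaround variables on Apple Silicon macOS, else an empty dict."""
    system = system or platform.system()
    machine = machine or platform.machine()
    if system == "Darwin" and machine == "arm64":
        return dict(OMP_WORKAROUND)
    return {}


print(omp_workaround_env())
```

The result can be merged over os.environ and passed as the env argument of subprocess.Popen when starting vllm serve, so the variables reach only that child process.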
4.3 Start the LMCache server
The command below starts LMCache’s combined ZMQ + HTTP server. This is the new CLI entry point (previously it was python -m lmcache.v1.multiprocess.http_server). L1 is 1 GB with LRU eviction. ZMQ listens on 5555 and the HTTP frontend listens on 8080. The sign of a successful start is the log line: “LMCache ZMQ cache server is running”.
# Terminal A: start LMCache server
source ~/projects-test/.venv-lmcache/bin/activate
lmcache server \
--port 5555 \
--http-port 8080 \
--l1-size-gb 1 \
--eviction-policy LRU
After startup you should see: “accelerator available: False”, “TokenHasher initialized: chunk_size=256, hash_algorithm=blake3”, and “LMCache ZMQ cache server is running on tcp://localhost:5555”. Verify the HTTP health check with curl http://localhost:8080/healthcheck.
4.4 Start vLLM
Set the KV connector to LMCacheMPConnector (module lmcache.integration.vllm.lmcache_mp_connector), with host tcp://localhost and port 5555 to match the server. Turn off vLLM’s built-in prefix caching (--disable-hybrid-kv-cache-manager / --no-enable-prefix-caching) and hand the KV-reuse responsibility entirely to LMCache.
# Terminal B: start vLLM (macOS arm64)
source ~/projects-test/.venv-lmcache/bin/activate
# === CRITICAL: avoid vLLM OMP deadlock on macOS arm64 ===
export VLLM_CPU_OMP_THREADS_BIND=nobind
export OMP_NUM_THREADS=1
export KMP_BLOCKTIME=0
# ==========================================================
# CPU device + gloo rendezvous
export VLLM_DEVICE=cpu
export VLLM_CPU_KVCACHE_SPACE=1
export VLLM_HOST_IP=127.0.0.1
export GLOO_SOCKET_IFNAME=lo0
# You can also set "lmcache.mp.mp_transfer_mode" to "engine_driven"
vllm serve facebook/opt-125m \
--port 18000 \
--dtype bfloat16 \
--disable-hybrid-kv-cache-manager \
--no-enable-prefix-caching \
--max-model-len 2048 \
--max-num-seqs 1 \
--kv-transfer-config '{
"kv_connector": "LMCacheMPConnector",
"kv_role": "kv_both",
"kv_connector_module_path":
"lmcache.integration.vllm.lmcache_mp_connector",
"kv_connector_extra_config": {
"lmcache.mp.host": "tcp://localhost",
"lmcache.mp.port": 5555,
"lmcache.mp.mp_transfer_mode": "lmcache_driven"
}
}'
4.5 Send requests to verify the cache hit
LMCache’s smallest reuse unit is the chunk (256 tokens by default), so a prompt that is too short won’t land in the cache at all. The Python script below sends the same ~640-token prompt twice: the first cold start writes the KV into LMCache, and the second retrieves it directly from SHM.
# Terminal C: verify cache hit
source ~/projects-test/.venv-lmcache/bin/activate
cat > /tmp/test_lmcache_e2e.py <<'EOF'
import time, requests
URL = 'http://localhost:18000/v1/completions'
words = ['the','quick','brown','fox','jumps',
         'over','lazy','dog'] * 80  # ~640 tokens
prompt = ' '.join(words)
payload = dict(model='facebook/opt-125m', prompt=prompt,
               max_tokens=8, temperature=0.0)
for i in (1, 2):
    t0 = time.time()
    r = requests.post(URL, json=payload, timeout=600)
    print('round %d %d %.2fs' % (i, r.status_code, time.time()-t0),
          r.json()['usage'])
EOF
python /tmp/test_lmcache_e2e.py
Expected output (M-series MacBook, opt-125m, bf16):
round 1 200 0.71s {'prompt_tokens': 641, 'total_tokens': 649, ...}
round 2 200 0.25s {'prompt_tokens': 641, 'total_tokens': 649, ...}
In the LMCache server log in Terminal A you’ll see:
# First request: store
LMCache INFO: Stored 512 tokens in 0.013 seconds
# Second request: hit + retrieve
LMCache INFO: Prefetch request completed (L1+L2): 2/2 prefix hits (2 L1, 0 L2) in 0.4 ms
LMCache INFO: Retrieved 512 tokens in 0.007 seconds
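The 512 in those log lines is consistent with the 256-token chunk size: a 641-token prompt contains two full chunks, and the remaining 129 tokens do not fill a third. The one-liner below sketches that arithmetic, assuming (as the logs suggest) that only full chunks are stored; cacheable_tokens is a hypothetical name.

```python
def cacheable_tokens(prompt_tokens: int, chunk_size: int = 256) -> int:
    """Tokens covered by full chunks; the partial tail is not counted."""
    return (prompt_tokens // chunk_size) * chunk_size


print(cacheable_tokens(641))  # 512, i.e. 2 chunks of 256
print(cacheable_tokens(200))  # 0, prompt shorter than one chunk
```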
At this moment you have already run the complete pipeline on your MacBook: vLLM CPU worker → POSIX SHM → LMCache server (server-side memcpy) → L1 → second request hits and reuses. The only difference from the GPU case (CUDA IPC handle) is “how the peer’s KV buffer is obtained” — the business logic, scheduling, eviction, and observability are entirely equivalent.

◍ 5. What if the model won’t download?
In some network environments, the HuggingFace Hub may be unreachable or very slow. Three solutions, listed in recommended order:
5.1 Option 1: Use a HF mirror (most convenient)
Set the HF_ENDPOINT environment variable to point at a mirror site:
# Use HF mirror (hf-mirror.com is a popular option)
export HF_ENDPOINT=https://hf-mirror.com
# Then start vLLM as usual - it will download from the mirror
vllm serve facebook/opt-125m ...
# Or use snapshot_download manually to pre-download:
python -c "
from huggingface_hub import snapshot_download
snapshot_download('facebook/opt-125m')
"
If hf-mirror.com doesn’t work either, you can try other mirrors: modelscope (https://modelscope.cn) or a company-internal HF mirror. In CI we also use .github/scripts/download_model.sh to download models; it has built-in retry + exponential backoff and can be used directly:
# Use the same download script as CI (with retry)
MODEL_ID=facebook/opt-125m \
bash ~/projects-test/LMCache/.github/scripts/download_model.sh
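For readers who prefer to stay in Python, the retry-with-exponential-backoff pattern can be sketched generically as below. This is not a port of download_model.sh: the function name retry_with_backoff, the attempt count, and the delays are all illustrative assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    attempts: int = 5,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay_s * (2 ** attempt))  # 1s, 2s, 4s, ...
    raise RuntimeError("unreachable")
```

A typical use would be retry_with_backoff(lambda: snapshot_download('facebook/opt-125m')), with snapshot_download imported from huggingface_hub as in the snippet earlier in this section.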
5.2 Option 2: Download locally and specify the path
If you’ve already downloaded the model locally, you can use the local path directly in place of the HuggingFace repo id:
# Download to local first (on a machine with good network)
# pip install huggingface_hub
# huggingface-cli download facebook/opt-125m \
# --local-dir ~/models/opt-125m
# Then use the local path with vLLM:
vllm serve ~/models/opt-125m \
--port 18000 \
--dtype bfloat16 \
... # same args as section 4.4
Local paths support both absolute and relative paths, as long as the directory contains files like config.json and pytorch_model.bin. This is also the recommended approach for offline development.
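Before pointing vLLM at a local directory, a quick pre-flight check can save a confusing startup failure. The sketch below is an assumption-laden convenience, not part of LMCache or vLLM: missing_model_files is a hypothetical name, and accepting *.safetensors weights alongside pytorch_model.bin is an addition not mentioned in this guide.

```python
from pathlib import Path
from typing import List

# Weight file patterns; *.safetensors is an assumed alternative format.
WEIGHT_PATTERNS = ("pytorch_model*.bin", "*.safetensors")


def missing_model_files(model_dir: str) -> List[str]:
    """Return a list of what a local model directory appears to be missing."""
    root = Path(model_dir).expanduser()
    missing: List[str] = []
    if not (root / "config.json").is_file():
        missing.append("config.json")
    if not any(any(root.glob(pattern)) for pattern in WEIGHT_PATTERNS):
        missing.append("weights (pytorch_model*.bin or *.safetensors)")
    return missing


print(missing_model_files("~/models/opt-125m"))
```

An empty list means the directory passes this basic check; it does not verify tokenizer files or file integrity.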
5.3 Option 3: Specify the HF cache directory
If you’ve downloaded the model before, you can reuse the cache via HF_HOME or a symlink:
# Point HF cache to your existing model directory
export HF_HOME=~/my_hf_cache
# Or symlink the specific model into the default cache:
mkdir -p ~/.cache/huggingface/hub
ln -s ~/my_models/models--facebook--opt-125m \
~/.cache/huggingface/hub/models--facebook--opt-125m
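The folder name in that symlink follows a fixed pattern: "models--" followed by the repo id with "/" replaced by "--". A small sketch that derives the expected cache entry for any repo id (hf_cache_entry is a hypothetical helper name; the default hub path is the one used in the commands above):

```python
from pathlib import Path


def hf_cache_entry(repo_id: str, hub_dir: str = "~/.cache/huggingface/hub") -> Path:
    """Path where the hub cache keeps a given model repo."""
    folder = "models--" + repo_id.replace("/", "--")
    return Path(hub_dir).expanduser() / folder


print(hf_cache_entry("facebook/opt-125m"))
```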
◍ 6. Submit your first PR
- First, fork LMCache/LMCache on GitHub and clone your own fork;
- Create a new feature branch, following conventional commits (prefixes like feat / fix / refactor / docs / test);
- Set up pre-commit locally: pip install pre-commit && pre-commit install. Formatting and static checks like ruff / mypy etc. will run automatically before commits;
- Run the unit tests for the relevant module at least once; most tests in the lmcache repo don’t depend on a GPU.
- Run the end-to-end verification locally — with this guide, you can fully verify most functionality locally.
- Submit the PR to upstream main, clearly explaining “problem / motivation / solution / verification method” in the description, and attach a snippet of the logs or a metrics screenshot you produced on your own machine — this will make the reviewer very happy.
When you git commit, remember to add the -s/--signoff flag to sign off with Signed-off-by; this is a project requirement. You can do it all at once with git commit -s -m "…".
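As a quick self-check before pushing, the two commit conventions above can be tested mechanically. This sketch is not a project tool: check_commit_message is a hypothetical helper, the prefix list is only the one named in this section, and the optional "(scope)" part is an assumption based on the common conventional-commits form.

```python
import re
from typing import List

PREFIXES = ("feat", "fix", "refactor", "docs", "test")
# "<prefix>: summary" or "<prefix>(scope): summary"
HEADER_RE = re.compile(r"^(%s)(\([^)]+\))?: \S" % "|".join(PREFIXES))


def check_commit_message(message: str) -> List[str]:
    """Return a list of problems found in a commit message (empty if none)."""
    problems: List[str] = []
    lines = message.strip().splitlines()
    if not lines or not HEADER_RE.match(lines[0]):
        problems.append("header should start with one of: " + ", ".join(PREFIXES))
    if not any(line.startswith("Signed-off-by: ") for line in lines):
        problems.append("missing Signed-off-by trailer (use git commit -s)")
    return problems


good = "fix: handle empty chunk list\n\nSigned-off-by: A Dev <a.dev@example.com>"
print(check_commit_message(good))
```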
◍ 7. A few closing words
LMCache is an open-source project built in the open and shaped by the community. The reason this multi-platform abstraction exists is to make it possible for developers without accelerators, new hardware vendors, and users of different inference engines to collaborate on the same codebase.
You do not need a GPU to start contributing. Pick any one of the four directions, Frontend/L1 Eviction/L2 Storage/Observability, and your MacBook is enough to support you in writing your first piece of code that can be merged into main.
Zero GPU cost. Zero accelerator barrier. Zero waiting. That is the gift LMCache’s multi-platform work aims to give every community contributor. We would love to have you here.
◍ Appendix: CI script reference
The following are the key CPU-only-testing files in the LMCache repo. They are your best reference for local troubleshooting:
- .github/workflows/cpu_device.yml — CI workflow definition for CPU device testing, including both macOS and Ubuntu matrices.
- .github/scripts/install_vllm_cpu.sh — installs the vLLM CPU nightly wheel, including the dist-info alias fix.
- .github/scripts/install_lmcache_cpu.sh — installs LMCache in editable mode with NO_GPU_EXT=1.
- .github/scripts/download_model.sh — downloads HF models, with built-in retry + exponential backoff.
- .github/scripts/cpu_device_test.sh — unified entry point, supporting both server_bench and vllm_e2e modes.
- .github/scripts/run-cpu-e2e-validation.sh — the complete end-to-end validation script, covering installation, startup, cache-hit verification, and cross-instance restart verification.