Follow us on: X, LinkedIn
Initiated and Officially Supported by Tensormesh
A collaboration story about LMCache multiprocess mode + MooncakeStore — From 0 to 1, from functional to optimized. 1. Before We Begin Recently, the LMCache community and the Mooncake community carried out a series of valuable open-source collaborations around the Mooncake Store L2 adapter for LMCache MP (multiprocess) mode. The main contributors include: This was…
A new system stack is quietly taking shape around LLM serving. What makes it interesting is not just how quickly it is evolving, but how familiar the shape of that evolution looks if you’ve spent time studying large-scale systems like the internet or web infrastructure. These systems were never cleanly designed from first principles; they…
A practitioner’s guide to KV-cache tiering on ROCm — what works, what doesn’t, and the regime where it actually matters. Key Summary We benchmarked multi-turn agentic workloads using 739 anonymized Claude Code conversation traces from kv-cache-tester against MiniMax-M2.5 (230 GB FP8 MoE) on 2× AMD MI300X with vLLM 0.19.0 + LMCache (built from source for…
DeepSeek V4 — an open weight model that gives you the state-of-the-art intelligence, while potentially gives you much cheaper token price than its preceding model, DeepSeek V3.2. But how does DeepSeek v4 does that? Pre-requisite: attention, KV caches, and why KV cache is the key that affects token pricing To know why DeepSeek V4 can…
For years, we have referred to one of the most critical components of modern LLM inference as a “KV cache.” That name made sense once. Today, it is increasingly misleading. What began as a small, ephemeral optimization inside a single inference pass has quietly evolved into something far more important: a first-class data object with…
Overview Large language model (LLM) inference performance depends heavily on how efficiently the system manages key-value (KV) cache — the stored attention states that allow the model to avoid recomputing previous tokens. As context lengths grow and concurrent users increase, the KV cache can exceed GPU memory capacity, forcing expensive recomputation that degrades latency and…
GTC wrapped up a month ago. Our open-source KV cache management library, LMCache, was shown in Jensen Huang’s keynote, was spotlighted by NVIDIA SVP Kevin Deierling, I was invited to speak at the first-ever industry KV cache tutorial, and mentioned in talk after talk by industry partners. At one point, someone even came to our…
TL;DR: TurboQuant allows you to put 4x more context in your GPU without blowing up GPU memory or dropping AI’s intelligence. It does so by quantizing the memory of large language models, also known as KV cache, an important bottleneck mentioned by Jensen Huang multiple times at this year’s GTC. It relies on two secret…
Modern LLM serving workloads are defined by strict latency requirements, high concurrency, and rapidly growing context lengths. Applications such as multi-turn chat, AI agents, and retrieval-augmented generation continuously build on prior context, leading to substantial reuse of previously computed states. In production, systems must minimize time-to-first-token (TTFT) while maintaining stable decoding throughput under heavy concurrent…
The standard approach to reducing LLM inference costs is prefix caching, which reuses previously computed token states to avoid redundant computation. In practice, however, this approach misses significant caching opportunities in real-world agentic workloads! Caching in Agentic Workflows In agentic workloads, shared content (e.g., retrieved contexts and documents) frequently appears across requests at varied positions,…