The challenge: Scaling enterprise AI Enterprises today are racing to integrate large language models (LLMs) into their products and workflows, but doing so at scale brings challenges in performance, cost, and accuracy. Organizations need models grounded in their own data while ensuring that this information remains private. Cohere, one of the leading…

TL;DR: LMCache, the state-of-the-art KV cache layer library developed by TensorMesh and the project’s open-source community, delivers breakthrough performance improvements to modern enterprise LLM inference frameworks, including the vLLM Production Stack, KServe, and NVIDIA’s Dynamo. With fast and scalable caching of long-context KV caches, LMCache helps reduce inference costs and meet SLOs for both latency…
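The teaser above doesn’t show what “a KV cache layer under an existing inference framework” looks like in practice, so here is a minimal sketch of the vLLM integration path. The connector name (`LMCacheConnectorV1`), the `KVTransferConfig` fields, and the `LMCACHE_*` environment variables follow the public LMCache/vLLM documentation at the time of writing, but exact names and constructors vary between versions; treat this as an assumption-laden illustration rather than the canonical setup.

```python
import os

# Assumed LMCache settings (names follow the LMCache docs; may differ by version):
# cache KV in 256-token chunks and allow spilling up to 5 GB of KV cache to CPU RAM.
os.environ["LMCACHE_CHUNK_SIZE"] = "256"
os.environ["LMCACHE_LOCAL_CPU"] = "True"
os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "5.0"

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# Route vLLM's KV-cache loads and stores through the LMCache connector.
# (Older vLLM releases build this config via KVTransferConfig.from_cli(...) instead.)
ktc = KVTransferConfig(kv_connector="LMCacheConnectorV1", kv_role="kv_both")

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # any vLLM-supported model
    kv_transfer_config=ktc,
    gpu_memory_utilization=0.8,
)

out = llm.generate(
    ["Summarize the key obligations in the contract below:\n..."],
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(out[0].outputs[0].text)
```

The same connector-based setup is what the serving stacks mentioned above (vLLM Production Stack, KServe, Dynamo) wire in for you; the offline `LLM` object is used here only to keep the sketch self-contained.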

Breaking News: “CacheBlend” Receives BEST PAPER AWARD at ACM EuroSys 2025 This week at ACM EuroSys 2025, a top academic conference in computer systems, Jiayi Yao, the first author of the groundbreaking CacheBlend paper, will present our work redefining LLM efficiency, particularly in retrieval-augmented generation (RAG) applications. This paper has…

TL;DR: Your RAG can run up to 4.5× faster by pairing vLLM with LMCache. [💻 Source code] [📄 Paper] will appear at the 20th ACM EuroSys (European Conference on Computer Systems) 2025 [🎬 3-minute introduction video] The Problem: RAG is WAY TOO SLOW Retrieval-Augmented Generation (RAG) has become a key technique in…
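To see why RAG prefill is hard to cache with ordinary prefix caching (the gap CacheBlend targets), consider the toy sketch below. The chunk store and retriever are invented purely for illustration; the point is that different queries pull different chunk combinations, so consecutive prompts share almost no token prefix and a prefix-only cache cannot skip re-prefilling the retrieved text.

```python
import os.path

# Invented chunk store and retriever, purely to illustrate the caching problem.
chunks = {f"doc{i}": f"<contents of document chunk {i}> " * 40 for i in range(100)}

def retrieve(query: str, k: int = 4) -> list[str]:
    # Stand-in for a vector-store lookup: returns k chunk IDs per query.
    return [f"doc{(hash(query) + 7 * i) % 100}" for i in range(k)]

def build_prompt(query: str) -> str:
    # Each query concatenates a different set of chunks in a different order,
    # so the token prefix differs between requests.
    context = "\n\n".join(chunks[c] for c in retrieve(query))
    return f"{context}\n\nQuestion: {query}\nAnswer:"

p1 = build_prompt("What is the refund policy?")
p2 = build_prompt("How long is the warranty period?")

shared = len(os.path.commonprefix([p1, p2]))
print(f"Prompt 1: {len(p1)} chars, prompt 2: {len(p2)} chars, shared prefix: {shared} chars")
# A prefix cache can only reuse KV for that shared prefix; reusing the
# per-chunk KV caches at arbitrary positions in the prompt is what CacheBlend enables.
```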

TL;DR: LMCache turboboosts vLLM with 7× faster access to 100× more KV caches, for both multi-turn conversation and RAG. [💻 Source code] [📄 Paper 1] [📄 Paper 2] [🎬 3-minute introduction video] LLMs are ubiquitous across industries, but when using them with long documents, it takes forever for the model even to spit…
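For the multi-turn case, the pattern below shows where a KV cache layer pays off: every turn re-sends the long document plus the growing dialogue, so the prompt for turn N strictly extends the prompt of turn N−1. The model name, synthetic document, and helper function are illustrative assumptions; with a persistent KV cache layer such as LMCache behind vLLM (configured as in the earlier sketch), that shared prefix can be loaded instead of re-prefilled on each turn.

```python
from vllm import LLM, SamplingParams

# Illustrative setup: a synthetic long document and a plain vLLM engine.
# (Attach LMCache via kv_transfer_config, as in the earlier sketch, to persist
#  and reload the KV cache of this shared prefix across turns and requests.)
long_document = " ".join(f"Clause {i}: the supplier shall ..." for i in range(2000))

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", gpu_memory_utilization=0.8)
params = SamplingParams(temperature=0.0, max_tokens=64)

history = ""

def ask(question: str) -> str:
    """One chat turn: the full document plus the dialogue so far is re-sent every
    time, so the token prefix grows monotonically and is a natural caching target."""
    global history
    prompt = f"{long_document}\n{history}\nUser: {question}\nAssistant:"
    answer = llm.generate([prompt], params)[0].outputs[0].text
    history += f"\nUser: {question}\nAssistant: {answer}"
    return answer

print(ask("Which clauses cover late delivery?"))
print(ask("And what penalties do they specify?"))  # reuses the same long prefix
```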
