
TL;DR: TurboQuant lets you fit 4x more context on your GPU without blowing up memory or making the AI dumber. It does so by quantizing the memory of large language models, known as the KV cache, an important bottleneck that Jensen Huang mentioned multiple times at this year's GTC.
It relies on two secret sauces:
- Random rotation, literally: it applies a random rotation to smooth out outliers (which are surprisingly common in KV caches), making sure that every element of the data is roughly the same scale. This lets the simple strategy of spending the same number of bits on each element achieve near-optimal quantization performance.
- Reserving 1 bit to correct cosine similarity: it sets aside exactly 1 bit of the budget, not to reduce absolute error, but to align the direction of the quantized data with the original. This makes the cosine similarity derived from the quantized data zero-biased, which keeps the LLM's attention operation much more accurate (because attention relies on cosine similarity).
TurboQuant is really hot these days. Some people on X even claim it is the most significant AI breakthrough this year. But why? Why is TurboQuant so smart, and why does it matter for LLM inference? Let's break it down without math formulas.
Background: what is quantization
At a high level, quantization is basically rounding a number so you can store it using far fewer bits, at the cost of a tiny loss in precision. For example:
1.2328088 --> 1.23
30.844393 --> 30.8
The underlying mechanics can get a bit complicated (for example, OpenAI’s open source model GPT-OSS involves MXFP4 and complex algorithms), but for this blog, you can safely think of quantization as just strategic rounding.
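To make "strategic rounding" concrete, here is a minimal toy quantizer (my own illustration, not TurboQuant's actual algorithm): it snaps every value onto a small grid of 2^bits levels, then maps the grid indices back to floats.

```python
import numpy as np

def quantize(x, bits=4):
    """Toy uniform quantizer: round each value onto a grid
    of 2**bits evenly spaced levels spanning [min(x), max(x)]."""
    levels = 2 ** bits
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (levels - 1)       # step size of the grid
    codes = np.round((x - lo) / scale)     # integers in [0, levels - 1]
    return codes * scale + lo              # dequantized approximation

x = np.array([1.2328088, 30.844393, 0.51, -2.7])
x_hat = quantize(x, bits=8)
# Each element now fits in 8 bits (plus one shared scale and offset),
# and x_hat stays within half a grid step of x.
```

Storing the integer `codes` instead of the raw floats is where the memory savings come from; the shared `scale` and `lo` cost only a few extra bytes per vector.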
Why prior work fails — outliers in KV caches
A simple solution (and a really popular one) is to just round all of the LLM's numbers to a lower precision.
However, this solution does not work well for LLMs, especially for the KV cache: the LLM's internal memory of all your tokens.
This is because the KV cache is pretty spiky: most of the numbers are small, but some of them are much bigger than the others. Those big outliers stretch the rounding grid, so the rounding error on every number ends up much larger.
This has been observed empirically by multiple works; if you are interested, here are the links: [1,2,3].
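You can see the problem with a quick experiment (made-up illustrative data, using the same kind of toy uniform quantizer as above): a single outlier forces the grid to stretch across a huge range, so all the small values land far from their nearest grid point.

```python
import numpy as np

rng = np.random.default_rng(0)

def uniform_quantize(x, bits):
    """Toy uniform quantizer over the range [min(x), max(x)]."""
    levels = 2 ** bits
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (levels - 1)
    return np.round((x - lo) / scale) * scale + lo

# 999 small values plus one large outlier, mimicking a spiky KV cache channel
small = rng.normal(0.0, 0.1, size=999)
spiky = np.append(small, 50.0)

# Mean rounding error on the SMALL values, with and without the outlier present
err_spiky = np.abs(uniform_quantize(spiky, 4) - spiky)[:999].mean()
err_smooth = np.abs(uniform_quantize(small, 4) - small).mean()
# err_spiky is an order of magnitude worse: the outlier stretched the grid.
```

The outlier itself rounds fine; it is everyone else that suffers.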
How to handle large outliers
A naive solution is to use more bits for big numbers and fewer bits for small numbers. But GPUs hate this: they run much more efficiently when every number uses the same number of bits.
So TurboQuant uses a smart trick: it applies a random rotation to the original data. By shuffling and rotating the data (mathematically: multiplying the vectors by a random matrix), the massive outliers get "spread out" across the other coordinates. Suddenly, every floating point number is roughly the same scale.
You take spiky data, give it a random mathematical spin, and voila: everything now lines up perfectly for the GPU to apply its quantization magic.
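Here is the spin in action, as a rough sketch. I build a random orthogonal matrix via QR decomposition of a Gaussian matrix (a standard trick; the paper uses faster structured rotations, but the smoothing effect is the same):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256

# A spiky vector: one huge outlier coordinate, the rest tiny (KV-cache-like)
x = rng.normal(0.0, 0.1, size=d)
x[7] = 25.0

# Random orthogonal matrix: QR decomposition of a Gaussian matrix
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))

y = Q @ x
# Rotation preserves the vector's length (and inner products between vectors),
# but spreads the outlier's energy across all coordinates:
# max|y| is far smaller than max|x|, so a fixed per-element bit budget works well.
```

Because the rotation is invertible (just multiply by `Q.T`), no information is lost; the data merely lands in a friendlier coordinate system for quantization.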
Another issue: cosine similarity becomes biased
There is another massive headache with compressing LLM data: when you round the numbers, you accidentally change the direction the data is pointing. This matters because the exact angles between data points are how the LLM decides what is important and what is not; if you mess up the directions, the LLM starts to get stupid.
To solve this, TurboQuant uses a two-stage quantization trick. If the algorithm has a budget of, say, 4 bits, it uses 3 bits to get the data as close to the original as possible. Then it uses the last single bit purely as a compass that corrects the data's direction back toward the original.
With just 1 bit spent on correcting the direction, the quantized data now has zero directional bias compared to the original.
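Here is a loose sketch of the flavor of that two-stage idea, not the paper's exact algorithm: spend bits-1 bits on a coarse quantization, then spend the last bit on the sign of each residual, scaled by a single shared constant, to pull the vector's direction back toward the original.

```python
import numpy as np

rng = np.random.default_rng(0)

def uniform_quantize(x, bits):
    """Toy uniform quantizer over the range [min(x), max(x)]."""
    levels = 2 ** bits
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (levels - 1)
    return np.round((x - lo) / scale) * scale + lo

def two_stage_quantize(x, bits=4):
    """Sketch of the two-stage idea: bits-1 bits minimize absolute error,
    then 1 bit per element encodes the residual's sign (the "compass")."""
    coarse = uniform_quantize(x, bits - 1)     # stage 1: get close in value
    residual = x - coarse
    alpha = np.abs(residual).mean()            # one shared scale for stage 2
    return coarse + alpha * np.sign(residual)  # stage 2: 1 bit per element

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

x = rng.normal(size=512)
x_hat = two_stage_quantize(x, bits=4)
# cosine(x, x_hat) sits closer to 1 than the cosine you get from
# spending all bits on plain value-rounding at the same coarse level.
```

Even this crude sign-correction measurably tightens the angle between the quantized vector and the original; TurboQuant's actual second stage is designed so the resulting inner products are unbiased, which is what keeps attention accurate.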
The Real-World Impact: The Needle in the Haystack
So, what does TurboQuant actually achieve? The researchers tested it on a "Needle-In-A-Haystack" benchmark: basically asking the LLM to find one specific hidden sentence in a massive document, up to 104k words long.
TurboQuant was able to shrink the LLM's KV cache by more than 5x, while still finding the same number of needles as the unquantized baseline! This means that with TurboQuant, you can run your LLM vastly faster with a much longer context. All enabled by math!
Math really is fun.
