DeepSeek V4 is an open-weight model that delivers state-of-the-art intelligence while potentially offering a much lower token price than its predecessor, DeepSeek V3.2.
But how does DeepSeek V4 do that?
Prerequisite: attention, KV caches, and why the KV cache drives token pricing
To understand why DeepSeek V4 can potentially offer the lowest token price, let's first look at what stops large language models (LLMs) from generating massive amounts of tokens cheaply.
In an LLM, the key module that lets tokens relate to one another is the attention module; the remaining modules focus on further understanding each token individually.
To remember all the tokens it has already processed, the LLM transforms the context into a KV cache. Think of it as the LLM's internal memory of the context: with the KV cache, the LLM can process future text without re-reading the existing context.
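To make this concrete, here is a minimal numpy sketch of how a KV cache works during decoding. It is purely illustrative: single head, no batching, random vectors instead of real projections, and none of DeepSeek's actual implementation details.

```python
import numpy as np

def attend(q, K, V):
    """One query attending over every cached key/value row."""
    scores = K @ q / np.sqrt(q.shape[-1])       # similarity to each past token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax over the past
    return weights @ V                          # weighted mix of past values

d = 64
K_cache, V_cache = [], []                       # the LLM's "internal memory"

for step in range(5):                           # pretend we decode 5 tokens
    q, k, v = (np.random.randn(d) for _ in range(3))
    K_cache.append(k)                           # each new token appends one K row...
    V_cache.append(v)                           # ...and one V row; old rows are reused
    out = attend(q, np.stack(K_cache), np.stack(V_cache))
```

The important point is the append: the cache grows linearly with the context and is never recomputed.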
The KV cache, this internal memory, is so large that it becomes the key obstacle to generating tokens at massive scale (and to providers lowering token prices). The reason is simple: modern GPUs want to process as many generation requests as possible in parallel to maximize token throughput, but if we batch too many requests, GPU memory gets blown up by their KV caches.
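A quick back-of-envelope calculation shows why. All numbers below are hypothetical (I am not quoting DeepSeek's real configuration); the order of magnitude is the point.

```python
layers     = 60           # hypothetical transformer depth
kv_dim     = 512          # per-layer KV width after projection (hypothetical)
seq_len    = 128_000      # one long-context conversation
bytes_each = 2            # fp16/bf16

kv_bytes = layers * seq_len * kv_dim * 2 * bytes_each    # x2 for keys + values
print(f"{kv_bytes / 1e9:.1f} GB for a single request")    # ~15.7 GB
```

Tens of gigabytes per request means even an 80 GB GPU can only hold a handful of long-context requests at once, no matter how much compute it has to spare.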
In summary: the key to reducing token pricing is simply to make the KV cache smaller.
How does DeepSeek reduce token pricing?
To do so, DeepSeek V4 uses two types of special attention modules that shrink the KV cache.
First type: compressed sparse attention
The key novelty of this layer (compared to DeepSeek V3.2) is compressing the KV cache of a group of tokens (4 tokens, to be specific) into a single entry. A simple and effective way to shrink the KV cache.
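As a rough sketch of the idea (the paper's actual compression operator is not reproduced here; mean-pooling is just the simplest stand-in), grouping and collapsing KV rows looks like this:

```python
import numpy as np

def compress_kv(K, group=4):
    """Collapse each group of `group` consecutive KV rows into one row."""
    n, d = K.shape
    n_full = (n // group) * group                     # only complete groups are compressed
    pooled = K[:n_full].reshape(-1, group, d).mean(axis=1)
    return np.concatenate([pooled, K[n_full:]])       # keep any leftover tail as-is

K = np.random.randn(1000, 64)                         # toy cache: 1000 tokens, width 64
print(compress_kv(K).shape)                           # (250, 64): 4x fewer rows in memory
```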
This introduces a subtle but important issue: it breaks causality. In large language models, a key property is that later tokens must not affect the meaning of earlier tokens (otherwise, every time the LLM generates a new word, the meaning of all earlier tokens would change and need to be recomputed, which is far too costly). But in this attention module, tokens 1, 2, 3, and 4 share the same KV cache entry, so when the LLM queries the meaning of token 1, it uses the cache of tokens 1 through 4 (recall that we compress the KV cache of 4 tokens into one). In other words, the meaning of token 1 is now affected by token 4.
To solve this, DeepSeek V4 only compresses KV entries once they are at least 128 tokens behind the current processing position. This strictly guarantees that when we query the meaning of a token, we never use information from any future token.
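In code terms, the rule is just that the part of the cache at least 128 tokens behind the current position is eligible for compression, while the recent window stays raw. A small sketch of that bookkeeping (again illustrative, not the real kernel):

```python
LAG   = 128      # tokens closer than this to the current position stay uncompressed
GROUP = 4

def split_for_compression(cache_len, pos):
    """Return (compressible_len, raw_len) for a cache of `cache_len` rows at position `pos`."""
    cutoff = max(0, pos - LAG)          # everything before this index is "old enough"
    cutoff -= cutoff % GROUP            # align to a full group of 4
    cutoff = min(cutoff, cache_len)
    return cutoff, cache_len - cutoff

print(split_for_compression(cache_len=500, pos=500))   # (372, 128): old prefix vs raw window
```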
Second type: highly-compressed sparse attention
As the name suggests, highly-compressed sparse attention is a variant of compressed sparse attention with much more aggressive compression: it compresses the KV cache of 128 tokens into one, instead of 4. Really aggressive.
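Numerically, the only change from the earlier sketch is the group size. With mean-pooling again standing in for whatever operator the model actually uses, a 1,024-token cache collapses to just 8 rows:

```python
import numpy as np

K = np.random.randn(1024, 64)                    # toy cache: 1,024 tokens
K_coarse = K.reshape(-1, 128, 64).mean(axis=1)   # 128 tokens -> 1 KV row
print(K.shape, "->", K_coarse.shape)             # (1024, 64) -> (8, 64)
```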
But why does DeepSeek V4 use such heavy compression? What is the rationale behind it?
After digging into the paper, I found that the answer lies in a seemingly small detail: this type of layer uses full attention (meaning every token contributes to the calculation) instead of DeepSeek's signature sparse attention, which only considers the top-1024 tokens.
So the answer is: this aggressive compression is a trade. It trades the ability to accurately represent the finer details of each individual token for the ability to let all tokens (instead of only 1024) participate in the attention computation without blowing up the compute budget. This means DeepSeek V4 will have much better long-term memory than DeepSeek V3.2: it can remember your entire conversation history and let all of it shape future text generation.
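The arithmetic behind this trade is worth spelling out. With illustrative numbers (not taken from the paper), full attention over a 128x-compressed cache touches about as many rows per query as sparse attention over the top-1024 raw tokens, yet it covers the whole context:

```python
context_len  = 128_000                 # an illustrative long conversation
sparse_top_k = 1024                    # rows a sparse layer attends to
compression  = 128                     # tokens per compressed KV row

full_rows = context_len // compression          # 1000 rows cover the ENTIRE context
print(full_rows, "vs", sparse_top_k)            # roughly the same work per query
```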
End-to-end effect: KV caches are ~10x smaller → 2-3x lower token price
With all these optimizations, the KV caches of DeepSeek V4 are ~10x smaller than those of DeepSeek V3.2!

With a 10x smaller KV cache, GPUs can serve roughly 10x more requests at once, leading to 2-3x higher token generation throughput and ultimately a 2-3x cheaper token price!
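To close the loop with the earlier back-of-envelope numbers (again purely illustrative): shrinking the per-request cache by 10x multiplies how many long-context requests fit in the same GPU memory, and that is where the throughput and price gains come from.

```python
mem_for_caches_gb = 40        # hypothetical GPU memory left over for KV caches
kv_per_request_gb = 15.7      # the earlier illustrative per-request estimate

print(int(mem_for_caches_gb / kv_per_request_gb))          # ~2 concurrent requests before
print(int(mem_for_caches_gb / (kv_per_request_gb / 10)))   # ~25 with a 10x smaller cache
```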
