Speculative Decoding From Zero • Thibault Castells

Speculative decoding is a way to make LLM generation faster without asking the large model to invent every token one by one. The basic idea is to let a small, cheap mechanism guess a few future tokens, then let the large model verify those guesses in a single pass.

If the guesses are good, the large model can emit several tokens after one verification pass; if they are bad, it corrects them and generation continues. The draft model is therefore not replacing the large model. It is helping the large model avoid unnecessary sequential work while keeping the target model as the source of truth.

The useful condition to remember is this: speculative decoding helps when the draft mechanism is much cheaper than the target model and correct often enough that one target-model verification pass can replace several normal target-model decode steps.

0. The core idea

Speculative decoding turns this:

Large model generates 1 token.
Large model generates 1 token.
Large model generates 1 token.
Large model generates 1 token.

into this:

Small model guesses 4 tokens.
Large model checks those 4 guesses in one pass.
Accept maybe 0, 1, 2, 3, or 4 of them.
Then emit a correction or bonus token.
Repeat.

The whole trick is to guess several tokens cheaply, verify them with the target model, keep the valid prefix, and repeat.

1. Why speculative decoding exists

1.1 Normal autoregressive decoding

Let us start from normal LLM generation. You have a prompt:

Speculative decoding is useful because

A decoder-only LLM generates text one token at a time:

Step 1:
  input:  "Speculative decoding is useful because"
  output: "it"

Step 2:
  input:  "Speculative decoding is useful because it"
  output: "can"

Step 3:
  input:  "Speculative decoding is useful because it can"
  output: "reduce"

Step 4:
  input:  "Speculative decoding is useful because it can reduce"
  output: "latency"

This is called autoregressive decoding. Autoregressive means:

Token 4 depends on token 3.
Token 3 depends on token 2.
Token 2 depends on token 1.

So the large model cannot simply generate all future tokens independently. A naive generation loop looks like this:

current_text = prompt

repeat until enough tokens:
  next_token = target_model.predict_next(current_text)
  current_text = current_text + next_token

This is easy to understand, but it has a big problem:

The large model is called once per generated token.

For a 500-token answer, that means roughly 500 sequential large-model decoding steps. Speculative decoding tries to reduce the number of large-model steps.
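The loop above can be made concrete with a toy stand-in for the target model. This is only a sketch: `ToyTarget`, its lookup table, and the `<eos>` marker are hypothetical names invented for illustration, and a real model would return logits over a vocabulary rather than a fixed next word. The point is the call counter, which shows one sequential model call per generated token.

```python
# Hypothetical toy "model": a fixed next-token table keyed on the last token.
TOY_NEXT = {
    "because": "it",
    "it": "can",
    "can": "reduce",
    "reduce": "latency",
}


class ToyTarget:
    def __init__(self):
        self.calls = 0  # number of sequential model calls made so far

    def predict_next(self, tokens):
        self.calls += 1
        return TOY_NEXT.get(tokens[-1], "<eos>")


def generate(model, prompt_tokens, max_new_tokens):
    """Naive autoregressive loop: one model call per new token."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = model.predict_next(tokens)
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens


target = ToyTarget()
prompt = "Speculative decoding is useful because".split()
output = generate(target, prompt, max_new_tokens=4)
print(" ".join(output))
print("target calls:", target.calls)  # 4 new tokens -> 4 calls
```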

1.2 The real bottleneck

At first, you might think the model is slow because it has many FLOPs. That is partly true. But during decode, especially at small batch sizes, the GPU is usually limited by memory bandwidth rather than arithmetic: every generated token requires a new target-model pass that reads the model weights and the KV cache from memory to produce just one token. Each pass also pays for GPU kernel launches, sampling work, and scheduling overhead. The important practical fact is this:

Generating 4 tokens with 4 separate large-model decode steps
can be slower than
checking 4 proposed tokens with 1 larger target-model verification step.

Not always. But often enough to be useful. The original speculative decoding paper describes the motivation very directly: decoding K tokens normally takes K serial model runs, while speculative decoding tries to compute several tokens in parallel without changing the target model’s output distribution.

1.3 Draft and target: the first mental model

Imagine two people writing a sentence. The first person is fast but not always reliable:

Draft model:
  "I think the next words are: a fast decoding method"

The second person is slower but trusted:

Target model:
  "Let me check those words."

The trusted person does not blindly accept the draft. It checks the proposal token by token:

Draft says:   a      fast      decoding      method
Target wants: a      fast      inference     method

Accept:
  a fast

Reject:
  decoding

Correct with:
  inference

The output becomes:

a fast inference

Then the process repeats. The draft model is useful only if it guesses well enough. The target model is necessary because it decides which guesses are actually valid.

1.4 The core components

Speculative decoding has five moving parts:

Target model:
  The large model you actually want to sample from.

Draft mechanism:
  Something cheaper that guesses future tokens. This can be a smaller model, n-gram matching, extra prediction heads, early-exit layers, or another learned module.

Proposed tokens:
  The draft mechanism’s guesses.

Verifier:
  The target model checks the proposed tokens in one forward pass.

Accept/reject rule:
  The algorithm decides how many draft tokens can be kept.

The loop is:

while not done:
    draft_tokens = draft_model.generate_a_few_tokens(current_text)
    target_logits = target_model.verify(current_text, draft_tokens)
    # accepted draft prefix, plus one correction or bonus token
    new_tokens = accept_or_reject(draft_tokens, target_logits)
    current_text = current_text + new_tokens

That is speculative decoding.

2. The workflow from one request to many checked tokens

2.1 A request’s journey

Suppose the prompt is:

Explain speculative decoding simply.

Normal decoding might eventually produce:

Explain speculative decoding simply. It is a method that speeds up LLMs.

But the system does not know this full sentence in advance. It discovers it step by step. Let us walk through speculative decoding.

Iteration 1

Current text:

Explain speculative decoding simply.

The draft model proposes 4 tokens:

d1 = It
d2 = is
d3 = a
d4 = method

The target model now runs once on this candidate text:

Explain speculative decoding simply. It is a method

Because a causal Transformer produces logits at every position, this one target-model pass can answer several questions:

After "Explain speculative decoding simply.", what would the target choose?
After "Explain speculative decoding simply. It", what would the target choose?
After "Explain speculative decoding simply. It is", what would the target choose?
After "Explain speculative decoding simply. It is a", what would the target choose?

Suppose the target’s greedy choices are:

It is a method

Compare token by token:

draft token:   It      is      a      method
target wants:  It      is      a      method

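Everything matches. So we accept all draft tokens. The same pass also yields the target’s prediction after "method", which could be emitted as a bonus token (see section 3.2); this walkthrough leaves it out to keep the example short. Current text: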

Explain speculative decoding simply. It is a method

Iteration 2

Current text:

Explain speculative decoding simply. It is a method

The draft model proposes 3 tokens:

d1 = that
d2 = makes
d3 = LLMs

The target model now runs once on the proposed continuation:

Explain speculative decoding simply. It is a method that makes LLMs

This does not mean the target already knew the correct final answer. It means the target is checking the draft sequence position by position. The target pass answers:

After "It is a method", what would the target choose?
After "It is a method that", what would the target choose?
After "It is a method that makes", what would the target choose?

Suppose the target’s greedy choices are:

that speeds ...

Now compare:

draft token:   that      makes      LLMs
target wants:  that      speeds     ...

The first token matches:

Accept "that".

The second token does not match:

Reject "makes".
Emit the target token "speeds" instead.

The later draft token is discarded, because the sequence has already diverged. Current text:

Explain speculative decoding simply. It is a method that speeds

Iteration 3

Repeat from the new current text. The important point is:

The target model does not know the final answer before running.
It only computes, in one pass, what it would have predicted at each position of the draft sequence.

That is why the draft model can help without replacing the target model.

2.2 Why one target pass can check multiple tokens

This is the key technical trick. Normal generation is sequential: the model must predict token 1 before token 2, because token 2 depends on token 1. Speculative decoding changes the shape of the work by making the future tokens known candidates first. Once the draft model has proposed them, the target model can check those positions in one forward pass. Checking can be much faster than asking the target model to predict the same tokens one by one, because the proposed tokens remove the sequential dependency.

Suppose the current text is:

Speculative decoding is

The draft model proposes:

a fast method

So we run the target model on:

Speculative decoding is a fast method

A causal Transformer returns logits at every position. That means one target pass can check all three proposed tokens:

logits after "Speculative decoding is"
  -> distribution for the next token
  -> can check whether "a" was valid

logits after "Speculative decoding is a"
  -> distribution for the next token
  -> can check whether "fast" was valid

logits after "Speculative decoding is a fast"
  -> distribution for the next token
  -> can check whether "method" was valid

logits after "Speculative decoding is a fast method"
  -> distribution for the next token
  -> can produce a bonus token if all drafts were accepted

The speed gain comes from the difference between prediction and verification:

Prediction:
  token 1 must be generated before token 2
  token 2 must be generated before token 3
  each step needs a new target-model call

Verification:
  draft tokens are already available
  the target model sees the whole proposed sequence
  all proposed positions can be checked in the same pass

The target verification pass is still work, and it is usually larger than a single one-token decode step. The point is that it can replace several sequential target-model calls when enough draft tokens are accepted.
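A small sketch of that indexing, using a toy stand-in for the target. `TOY_NEXT` and `target_forward` are hypothetical: a real causal Transformer returns a logits vector per position, while this toy returns only the greedy token per position. What carries over is the layout: one call produces one prediction per input position, and the predictions that check the draft start at the last context position.

```python
# Hypothetical toy target: its greedy next token depends only on the last token.
TOY_NEXT = {"is": "a", "a": "fast", "fast": "method", "method": "for"}


def target_forward(tokens):
    """One 'forward pass': the target's greedy choice after every prefix."""
    return [TOY_NEXT.get(tok, "<unk>") for tok in tokens]


context = ["Speculative", "decoding", "is"]
draft = ["a", "fast", "method"]

preds = target_forward(context + draft)  # a single target call

# preds[i] is the prediction made after seeing tokens[: i + 1], so the
# prediction that checks draft[j] lives at index len(context) - 1 + j.
first = len(context) - 1
checks = preds[first : first + len(draft)]
bonus = preds[first + len(draft)]

print(checks)  # what the target wanted at each drafted position
print(bonus)   # what the target wants after the last draft token
```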

3. The algorithms, starting simple

3.1 Greedy speculative decoding

Let us first ignore sampling. Assume the large model always chooses:

next_token = argmax(logits)

This is greedy decoding. In greedy speculative decoding, the rule is simple:

Accept draft tokens while they match the target model's greedy choice.
At the first mismatch, use the target model's token instead.

Example:

Current text:
  A serving engine is

Draft proposes:
  a high performance inference engine

Target greedy verification:
  a high performance serving library

Compare:
  draft:  a       high       performance       inference
  target: a       high       performance       serving

Accepted:
  a high performance

Correction:
  serving

So this iteration emits:

a high performance serving

The draft token "inference" is rejected because the target would have chosen "serving". In pseudocode:

draft_tokens = draft_model.propose(current_text)
target_choices = target_model.verify(current_text + draft_tokens)

for each draft token:
  if draft token matches target choice:
    accept it
  else:
    emit the target choice
    discard the rest of the draft
    stop this iteration

if every draft token matched:
  emit one bonus token from the target model
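The pseudocode above can be turned into a runnable sketch by reusing the example from this section. The toy target (`TARGET_NEXT`, `target_forward`) is hypothetical and returns greedy tokens instead of logits; `greedy_speculative_step` is an illustrative name, not an API from any library. The sketch assumes a non-empty context.

```python
# Hypothetical toy target: greedy next token keyed on the last token only.
TARGET_NEXT = {
    "is": "a",
    "a": "high",
    "high": "performance",
    "performance": "serving",
    "serving": "library",
}


def target_forward(tokens):
    """Toy single pass: the target's greedy next token after every prefix."""
    return [TARGET_NEXT.get(tok, "<unk>") for tok in tokens]


def greedy_speculative_step(context, draft):
    """Return the tokens emitted by one speculative iteration (greedy mode)."""
    preds = target_forward(context + draft)  # one target pass
    first = len(context) - 1                 # position that checks draft[0]
    emitted = []
    for j, draft_tok in enumerate(draft):
        target_tok = preds[first + j]
        if draft_tok != target_tok:
            emitted.append(target_tok)       # correction token
            return emitted                   # discard the rest of the draft
        emitted.append(draft_tok)            # accepted draft token
    emitted.append(preds[first + len(draft)])  # bonus token
    return emitted


context = "A serving engine is".split()
draft = "a high performance inference engine".split()
emitted = greedy_speculative_step(context, draft)
print(" ".join(emitted))  # a high performance serving
```

Calling the same function with a shorter, fully correct draft such as `["a", "high"]` exercises the bonus-token branch instead of the correction branch.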

3.2 The bonus token

The bonus token is easy to miss. Suppose the draft model proposes 4 tokens:

d1 d2 d3 d4

The target model checks:

Is d1 correct?
Is d2 correct?
Is d3 correct?
Is d4 correct?
What comes after d4?

If all draft tokens are accepted, we already have the target distribution for the token after d4. So we can sample or greedily choose one more token. That means:

4 draft tokens can produce up to 5 emitted tokens.

This is one reason speculative decoding can be surprisingly effective when the draft model is accurate.

3.3 Speculative sampling

Greedy decoding is simple. Sampling is more subtle. With sampling, the target model is not saying:

The next token must be X.

It is saying:

Here is a probability distribution over tokens.

Example:

target distribution p:
  "cat": 0.50
  "dog": 0.30
  "GPU": 0.20

draft distribution q:
  "cat": 0.40
  "dog": 0.40
  "GPU": 0.20

The draft model samples a token from q. The target model wants the final output to behave as if it had sampled from p. So the question is:

How can we use draft samples from q,
but still produce samples from p?

The answer is rejection sampling.

3.4 The acceptance rule

Let:

p = target model distribution
q = draft model distribution
x = draft token

The draft token is accepted with probability:

min(1, p[x] / q[x])

This means:

If the target likes token x more than the draft does:
  accept it.

If the draft likes token x more than the target does:
  accept it only sometimes.

Example 1:

p[x] = 0.60
q[x] = 0.30

p[x] / q[x] = 2.0
accept probability = 1.0

The target likes this token more than the draft does. Keep it.

Example 2:

p[x] = 0.20
q[x] = 0.50

p[x] / q[x] = 0.4
accept probability = 0.4

The draft over-produced this token. Keep it only 40% of the time. If a draft token is rejected, we do not simply sample from p. We sample from the correction distribution:

correction[token] proportional to max(0, p[token] - q[token])

Why? Because the accepted draft samples already account for the overlap between p and q. The correction distribution fills in the missing probability mass.
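Here is a minimal sketch of the accept/reject step for a single position, using the p and q from section 3.3. The function names are made up for illustration, both distributions are assumed to be plain dicts over the same small vocabulary, and real engines do this on GPU tensors rather than in Python. The empirical check at the end draws from q, applies the rule, and measures how often each token comes out.

```python
import random
from collections import Counter


def speculative_sample_step(p, q, x, rng):
    """Accept draft token x (sampled from q) or resample from the correction
    distribution. Returns (token, was_accepted)."""
    accept_prob = min(1.0, p[x] / q[x])
    if rng.random() < accept_prob:
        return x, True
    # Correction distribution: proportional to max(0, p - q).
    residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
    tokens = list(residual)
    weights = [residual[t] for t in tokens]  # choices() normalizes weights
    return rng.choices(tokens, weights=weights, k=1)[0], False


def empirical_distribution(p, q, n, seed=0):
    """Draw n draft tokens from q, apply the rule, return output frequencies."""
    rng = random.Random(seed)
    draft_tokens = list(q)
    draft_weights = [q[t] for t in draft_tokens]
    counts = Counter()
    for _ in range(n):
        x = rng.choices(draft_tokens, weights=draft_weights, k=1)[0]
        out, _accepted = speculative_sample_step(p, q, x, rng)
        counts[out] += 1
    return {t: counts[t] / n for t in p}


p = {"cat": 0.50, "dog": 0.30, "GPU": 0.20}  # target
q = {"cat": 0.40, "dog": 0.40, "GPU": 0.20}  # draft
freqs = empirical_distribution(p, q, n=20000)
print(freqs)  # should be close to p, not to q
```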

3.5 Why the correction distribution works

For one token, the final probability of returning token z is:

probability accepted as z
+
probability produced by correction as z

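The draft proposes z with probability q[z], and z is then accepted with probability min(1, p[z] / q[z]). Multiplying the two, the accepted part contributes: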

min(p[z], q[z])

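A rejection happens with total probability 1 minus the sum of min(p[t], q[t]) over all tokens t, which equals the sum of max(p[t] - q[t], 0) over all tokens t. That is exactly the normalizer of the correction distribution, so the two cancel and the correction part contributes: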

max(p[z] - q[z], 0)

Add them:

min(p[z], q[z]) + max(p[z] - q[z], 0) = p[z]

So the final distribution is exactly the target distribution. That is the core mathematical idea.
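The identity can also be checked exactly on the example from section 3.3. The sketch below uses `fractions.Fraction` so there is no floating-point noise; `final_distribution` is an illustrative helper, not part of any library, and it assumes p and q are defined over the same vocabulary.

```python
from fractions import Fraction


def final_distribution(p, q):
    """Output distribution of one accept/reject step, computed analytically."""
    # Accepted part: q[t] * min(1, p[t] / q[t]) == min(p[t], q[t]).
    accepted = {t: min(p[t], q[t]) for t in p}
    reject_prob = 1 - sum(accepted.values())
    # Correction distribution, before normalization.
    residual = {t: max(p[t] - q[t], 0) for t in p}
    norm = sum(residual.values())
    final = {}
    for t in p:
        correction = reject_prob * residual[t] / norm if norm > 0 else 0
        final[t] = accepted[t] + correction
    return final


p = {"cat": Fraction(50, 100), "dog": Fraction(30, 100), "GPU": Fraction(20, 100)}
q = {"cat": Fraction(40, 100), "dog": Fraction(40, 100), "GPU": Fraction(20, 100)}

final = final_distribution(p, q)
print(final == p)  # True
```

In this example the draft is accepted 90% of the time, and the remaining 10% is all corrected to "cat", the only token the draft under-produced.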

3.6 Exactness and terminology

A common misunderstanding:

Speculative decoding uses a smaller model, so quality must be worse.

Not necessarily. The small model proposes. The large model verifies. For greedy decoding, the result can be exactly the same as target-model greedy decoding. For sampling, the result can preserve the target model’s sampling distribution when the rejection-sampling algorithm is implemented correctly. But there are caveats:

Exactness depends on the implementation.
Exactness depends on the decoding mode.
Exactness depends on matching tokenization and logits processing.
Exactness can be lost by approximate acceptance rules.

Production engines sometimes implement only a subset of the full algorithm. So you should always check the inference engine’s documentation.

People often use the terms loosely. A useful distinction:

Speculative decoding:
  General family of methods.

Speculative sampling:
  Sampling version with rejection-sampling correction.

Greedy speculative decoding:
  Deterministic version that accepts tokens while they match target argmax.

Draft-target speculative decoding:
  Classic setup where a smaller draft model proposes tokens
  and the larger target model verifies them.

In practice, when people say “speculative decoding,” they often mean the full family.

4. Performance intuition and tuning knobs

4.1 The speedup condition

Let:

K = number of draft tokens proposed per iteration
A = average number of accepted draft tokens per iteration

Each speculative iteration emits roughly:

A + 1 tokens

The +1 is the correction token or bonus token. Baseline decoding:

1 target call -> 1 token

Speculative decoding:

K draft calls + 1 target verification call -> A + 1 emitted tokens

So speculative decoding helps when:

cost(K draft calls + 1 target verification call)
<
cost((A + 1) normal target decode calls)

This is the most important condition. The lesson:

Speculative decoding is not automatically faster.
It is faster when the draft mechanism is cheap and often correct.
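The inequality can be written as a tiny cost model. This is a back-of-the-envelope sketch: the function name and the timings in the usage lines are hypothetical placeholders, not measurements, and the model ignores scheduler and cache overhead.

```python
def estimated_speedup(k, accepted, t_draft, t_verify, t_decode):
    """Ratio of baseline time to speculative time for the same emitted tokens.

    k         draft tokens proposed per iteration
    accepted  average accepted draft tokens per iteration (A)
    t_draft   cost of one draft step
    t_verify  cost of one target verification pass over k tokens
    t_decode  cost of one normal target decode step
    All costs must use the same unit (for example, milliseconds).
    """
    speculative_time = k * t_draft + t_verify
    baseline_time = (accepted + 1) * t_decode
    return baseline_time / speculative_time


# Hypothetical numbers: a cheap, fairly accurate draft.
good = estimated_speedup(k=4, accepted=2, t_draft=1.0, t_verify=12.0, t_decode=10.0)
# Hypothetical numbers: a slower draft that is rarely right.
bad = estimated_speedup(k=4, accepted=0.5, t_draft=3.0, t_verify=12.0, t_decode=10.0)
print(good, bad)  # a value above 1 means speculation wins
```

With these made-up timings the first setup comes out faster than baseline and the second slower, which is the whole point of the condition: the same K can help or hurt depending on draft cost and acceptance.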

4.2 Why the large model is still required

If the draft model were always correct, we could simply use the draft model on its own. But that is not the real situation. Usually:

The draft model is smaller.
The draft model is cheaper.
The draft model is less accurate.

The target model remains the source of truth. Speculative decoding is useful because the draft model only needs to be locally predictive enough. It does not need to be as good as the target model for the whole task. Example:

Current text:
  The capital of France is

Draft model guesses:
  Paris

Target model verifies:
  Paris is acceptable

That was easy. But for a harder continuation:

Current text:
  The proof relies on a subtle application of

Draft model guesses:
  Jensen's inequality

Target model may reject and correct it.

So speculative decoding uses the cheap model where it is good, and falls back to the large model where it is not.

4.3 When speculation helps, and when it hurts

It usually helps when:

The target model is large.
The batch size is low or medium.
Decode latency matters.
The workload is memory-bandwidth-bound.
The draft model is much cheaper than the target model.
The draft and target models agree often.
The generated text is predictable enough.

It can hurt when:

The draft model is too slow.
The draft model is too inaccurate.
The target server is already saturated by batching.
The verification pass is much more expensive than a normal decode step.
The acceptance rate is low.
The implementation adds too much overhead.

A practical rule:

Speculative decoding is mainly a latency optimization.
Continuous batching is mainly a throughput optimization.

They can interact, but you should not assume that enabling speculation improves every workload.

4.4 Acceptance rate

The most important metric is:

acceptance rate

There are a few related quantities:

draft acceptance rate:
  fraction of proposed draft tokens accepted

tokens per target call:
  average number of emitted tokens per target verification

speedup:
  baseline latency / speculative latency

Example:

K = 4 draft tokens

Iteration 1 accepts 4 draft tokens and emits 1 bonus:
  emitted = 5

Iteration 2 accepts 2 draft tokens and emits 1 correction:
  emitted = 3

Iteration 3 accepts 0 draft tokens and emits 1 correction:
  emitted = 1

Average emitted tokens:

(5 + 3 + 1) / 3 = 3

That means:

1 target call -> 3 emitted tokens on average

If drafting were free and a verification pass cost the same as one decode step, that would be a 3x improvement. After draft overhead, scheduler overhead, cache overhead, and the extra cost of verification, the real speedup is usually lower.
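The two metrics from this example can be computed from a per-iteration log of accepted counts. The helper below is an illustrative sketch; it assumes every iteration proposes exactly K draft tokens and emits exactly one correction or bonus token.

```python
def summarize(k, accepted_per_iteration):
    """Compute acceptance metrics from a list of accepted-draft counts."""
    iterations = len(accepted_per_iteration)
    proposed = k * iterations
    accepted = sum(accepted_per_iteration)
    emitted = accepted + iterations  # +1 correction or bonus per iteration
    return {
        "draft_acceptance_rate": accepted / proposed,
        "tokens_per_target_call": emitted / iterations,
    }


stats = summarize(k=4, accepted_per_iteration=[4, 2, 0])
print(stats)
# {'draft_acceptance_rate': 0.5, 'tokens_per_target_call': 3.0}
```

Note that the two numbers tell different stories: only half of the proposed draft tokens were accepted, yet each target call still produced three tokens on average.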

4.5 Choosing the draft model

The draft model should be:

small enough to be fast
large enough to agree with the target
compatible with the target tokenizer
cheap enough in memory
easy to serve next to the target

There is a tradeoff:

Very small draft model:
  + cheap
  - poor guesses
  - low acceptance rate

Larger draft model:
  + better guesses
  - more expensive
  - more memory

A common pattern:

Target:
  Llama-like 70B model

Draft:
  Llama-like 1B, 3B, or 8B model from the same family

But the best draft model is workload-dependent. Code completion workloads may be easier to speculate than open-ended creative writing. Summarization may benefit from prompt lookup. Chat may vary heavily by prompt type.

4.6 Choosing the draft length

The number of draft tokens is often called:

speculation length
draft length
lookahead
num_assistant_tokens
max_draft_len

If it is too small:

You do not get enough speedup.

If it is too large:

The draft model spends too much time guessing tokens that will be rejected.
The target model verifies too many useless positions.
The KV cache and scheduler do more work.

A simple heuristic:

If all draft tokens were accepted:
  try drafting more next time.

If a draft token was rejected early:
  try drafting fewer next time.

Real systems use more sophisticated policies, but the intuition is the same: increase lookahead after clean acceptances, and reduce it after early rejection.
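One way to sketch that heuristic is shown below. The step size of 1, the bounds, and the "fewer than half accepted counts as early rejection" threshold are all assumptions chosen for illustration; real engines use their own policies.

```python
def next_draft_len(current, num_accepted, min_len=1, max_len=8):
    """Pick the draft length for the next iteration.

    current       draft length used in the iteration that just finished
    num_accepted  how many of those draft tokens were accepted
    """
    if num_accepted == current:
        # Clean acceptance: try drafting one more token next time.
        return min(current + 1, max_len)
    if num_accepted < current // 2:
        # Early rejection: draft one token fewer next time.
        return max(current - 1, min_len)
    return current  # partial acceptance: keep the current length


length = 4
for accepted in [4, 5, 1, 0]:  # hypothetical acceptance counts per iteration
    length = next_draft_len(length, accepted)
    print(length)
```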

5. KV cache behavior and memory

5.1 KV cache interaction

Speculative decoding also affects the KV cache. If you want the full background on KV caching, see “KV Cache in LLMs From Zero”. During normal decode:

Generate token t.
Append K/V for token t.
Repeat.

During speculative decoding:

Draft model generates draft tokens.
Target model verifies several positions.
Accepted tokens keep their target KV cache.
Rejected draft positions must not become part of the final target sequence.

So the runtime must handle:

accepted token KV
rejected token KV
draft-model KV
target-model KV
KV cache rewind
extra temporary pages

Conceptually:

Initial target cache:
  The model is

Temporary verification cache:
  The model is very fast today

Accepted tokens:
  very fast

Final target cache after rejection handling:
  The model is very fast

The real implementation is more complex because KV cache is stored in GPU memory blocks, often with paged attention and continuous batching. But the principle is simple:

Accepted tokens become real history.
Rejected speculative tokens must disappear.
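The rewind step can be sketched with a list standing in for the cache. `ToyKVCache` is hypothetical: each list entry represents the K/V tensors of one token position, whereas real runtimes manage paged GPU memory blocks. The sketch follows the example above, where two of three draft tokens are accepted.

```python
class ToyKVCache:
    """Stand-in for a target-model KV cache: one entry per token position."""

    def __init__(self):
        self.entries = []

    def append(self, tokens):
        # A real cache would store key/value tensors; we store the token text.
        self.entries.extend(tokens)

    def rewind(self, length):
        # Drop every position at or after `length`.
        del self.entries[length:]


cache = ToyKVCache()
cache.append(["The", "model", "is"])
committed = len(cache.entries)  # positions that are real history

draft = ["very", "fast", "today"]
cache.append(draft)             # verification writes K/V for all drafts

num_accepted = 2                # "very fast" accepted, "today" rejected
cache.rewind(committed + num_accepted)
print(cache.entries)  # ['The', 'model', 'is', 'very', 'fast']
```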

5.2 Questions before using speculation

Before using speculative decoding in a real system, answer these questions:

1. How expensive is one target decode step?
2. How expensive is one target verification step?
3. How expensive is one draft step?
4. How many draft tokens are accepted on average?
5. How much extra memory does the draft model need?
6. Does speculation still help at realistic concurrency?
7. Does streaming behavior improve or get worse?

A good speculative decoding setup has:

cheap draft
high acceptance
low overhead
no quality regression
good behavior under real traffic

A bad setup has:

expensive draft
low acceptance
extra memory pressure
scheduler overhead
worse throughput under load

6. The main families of speculative methods

6.1 Prompt lookup decoding

You do not always need a neural draft model. Sometimes the prompt already contains useful future text. Example: summarization. Prompt:

Article:
NVIDIA announced a new inference library. The library improves batching,
KV cache management, and GPU utilization.

Summary:

The output may reuse phrases from the prompt:

The article says NVIDIA announced a new inference library...

Prompt lookup decoding uses n-gram matching to propose candidate tokens from the prompt itself. The idea is:

Take the last n generated tokens.
Find the same n-token phrase in the prompt.
If it appears, propose the prompt tokens that followed that phrase.

This can propose tokens without loading a second neural model. But it only works when the output overlaps with existing context. Good use cases:

summarization
document QA
copy-heavy extraction
structured generation from a source document

Bad use cases:

creative writing
open-ended chat
math reasoning with novel steps
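The n-gram matching idea described above can be sketched in a few lines. This is a simplified illustration with hypothetical names: it works on whitespace-split words rather than real tokenizer IDs, searches only the prompt, and returns the continuation of the first match it finds.

```python
def prompt_lookup(prompt_tokens, generated_tokens, ngram_size=2, num_draft=4):
    """Propose draft tokens by finding the last generated n-gram in the prompt."""
    if len(generated_tokens) < ngram_size:
        return []
    tail = generated_tokens[-ngram_size:]
    for start in range(len(prompt_tokens) - ngram_size + 1):
        if prompt_tokens[start : start + ngram_size] == tail:
            after = start + ngram_size
            follow = prompt_tokens[after : after + num_draft]
            if follow:
                return follow  # the tokens that followed the phrase
    return []  # no match: fall back to normal decoding for this step


prompt = "NVIDIA announced a new inference library .".split()
generated = "The article says NVIDIA announced".split()

proposal = prompt_lookup(prompt, generated)
print(proposal)  # ['a', 'new', 'inference', 'library']
```

The proposal is still only a guess. It goes through the same target verification as any other draft, so a bad match costs some wasted verification work but does not change the output.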

6.2 Self-speculative decoding

Classic speculative decoding uses two models:

small draft model
large target model

Self-speculative decoding uses one model in two modes:

cheap approximate mode -> draft
full mode              -> verify

For example:

Draft:
  run only some layers

Verify:
  run the full model

Mental model:

Same model, shallow pass:
  "I think the next tokens are X Y Z"

Same model, full pass:
  "Let me check X Y Z properly"

Advantages:

No separate draft model.
Potentially less extra memory.
Tokenizer always matches.

Disadvantages:

Needs model support or special training for good early exits.
Skipping layers may produce weak drafts.
Implementation is model-specific.

6.3 Medusa

Medusa removes the separate draft model in another way. Instead of a second model, it adds extra prediction heads to the target model. A normal LLM head predicts:

next token

Medusa-style heads predict:

token +1
token +2
token +3
...

Conceptually:

Backbone hidden state
  |
  +--> normal LM head predicts next token
  +--> Medusa head 1 predicts token after that
  +--> Medusa head 2 predicts token after that
  +--> Medusa head 3 predicts token after that

Why this is attractive:

No separate draft model server.
Drafting can be integrated into the target model.
The extra heads are much smaller than a full draft LLM.

Why it is not free:

The model needs extra heads.
The heads usually need training or fine-tuning.
The runtime needs tree verification support.

6.4 EAGLE

EAGLE is another important family. Instead of drafting only in token space, EAGLE-style methods use hidden features from the target model to help predict future tokens. A simplified mental model:

Target model produces hidden features.
Small EAGLE module predicts future features or tokens.
Target model verifies the candidates.

You do not need to understand every EAGLE detail to understand speculative decoding. Just remember:

Draft/target:
  separate small model proposes tokens.

Medusa:
  extra heads propose future tokens.

EAGLE:
  lightweight learned module uses target-model features to propose better drafts.

Prompt lookup:
  string or token matching proposes drafts.

Self-speculation:
  partial target model proposes drafts.

All of them keep the same high-level pattern:

propose -> verify -> accept/reject

6.5 Multi-token prediction

Some modern models are trained with multi-token prediction heads. Instead of only learning:

predict token t + 1

the model also learns:

predict token t + 2
predict token t + 3
...

At inference time, these extra predictions can be used as draft tokens. This is similar in spirit to Medusa, but it may be built into the model’s training recipe.

6.6 Practical taxonomy

Method	Draft source	Extra model?	Best intuition
Draft/target	Small assistant LLM	Yes	The classic method
Prompt lookup / n-gram	Existing prompt text	No	Copy likely continuations
Self-speculative	Earlier layers or skipped layers	No	Same model drafts cheaply
Medusa	Extra decoding heads	No full draft model	Predict multiple future tokens
EAGLE	Learned lightweight speculator	Usually small module/model	Better feature-informed drafts
MTP	Built-in multi-token heads	Usually built into model	Model was trained to speculate

The details differ, but the workflow stays the same:

guess several tokens
check them with the target
keep the valid part
repeat

7. Conclusion

The useful way to think about speculative decoding is as a shortcut around repeated target-model calls. A cheap draft mechanism moves first, then the target model checks several proposed positions in one pass. Only the prefix that survives that check becomes real output.

When the draft is wrong, the target model corrects the sequence and generation continues. When the draft is right for several tokens in a row, the target model avoids several separate decode steps.

The loop looks like this:

draft a few tokens
verify them with the target model
accept the valid prefix
emit a correction or bonus token
update the KV cache
repeat

The tradeoff is practical: the draft has to be cheap, the accepted prefix has to be long enough, and the extra cache and scheduling work has to stay smaller than the target-model calls you saved. That is the core intuition to keep.

References

Yaniv Leviathan, Matan Kalman, and Yossi Matias, “Fast Inference from Transformers via Speculative Decoding”: https://arxiv.org/abs/2211.17192
Charlie Chen et al., “Accelerating Large Language Model Decoding with Speculative Sampling”: https://arxiv.org/abs/2302.01318
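Tianle Cai et al., “Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads”: https://arxiv.org/abs/2401.10774
Yuhui Li et al., “EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty”: https://arxiv.org/abs/2401.15077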