# Final Cumulative Stack Benchmark

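This run compares the original eager inference path, which recomputes the full prefix at every decode step, with the cumulative lossless optimization stack, in which every optimization must leave the generated outputs unchanged.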

Final stack: JIT + KV cache + parallel prefill + fixed-length unrolled decode + default XLA fusion + batching.
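
To illustrate why a KV cache can be lossless, here is a minimal pure-Python toy, not the benchmarked code. The names `project`, `attend`, `generate_full_prefix`, and `generate_kv_cache` are hypothetical, the "model" is a single scalar attention head, and the prefill is a plain loop, not a parallel pass. It only shows that caching keys and values reproduces the full-prefix outputs exactly.

```python
import math

VOCAB = 16  # hypothetical toy vocabulary size


def project(tok):
    """Toy per-token (key, value) pair; stands in for real K/V projections."""
    return math.sin(tok + 1.0), math.cos(0.5 * tok)


def attend(query_tok, keys, values):
    """Softmax attention of the latest token over all positions seen so far."""
    q = math.sin(0.3 * query_tok + 0.1)
    scores = [q * k for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    out = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    return int(abs(out) * 1000) % VOCAB  # toy mapping to a next-token id


def generate_full_prefix(prompt, steps):
    """Eager-style path: recompute K/V for the whole prefix at every step."""
    toks = list(prompt)
    for _ in range(steps):
        pairs = [project(t) for t in toks]
        keys = [k for k, _ in pairs]
        values = [v for _, v in pairs]
        toks.append(attend(toks[-1], keys, values))
    return toks[len(prompt):]


def generate_kv_cache(prompt, steps):
    """Cached path: prefill once, then append one K/V pair per decode step."""
    keys, values = [], []
    for t in prompt:  # prefill (a loop here; the real stack does this in parallel)
        k, v = project(t)
        keys.append(k)
        values.append(v)
    toks = list(prompt)
    for _ in range(steps):  # fixed number of decode steps
        nxt = attend(toks[-1], keys, values)
        toks.append(nxt)
        k, v = project(nxt)
        keys.append(k)
        values.append(v)
    return toks[len(prompt):]


prompt = [3, 1, 4, 1, 5]
print(generate_full_prefix(prompt, 8) == generate_kv_cache(prompt, 8))
```

Both paths run the same arithmetic in the same order, so the outputs are identical. The cached path does constant projection work per step where the full-prefix path's work grows with the sequence length.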

| batch | eager tok/s | final tok/s | speedup | eager batch latency | final batch latency | final compile s | reference match |
|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 31.371 | 509.259 | 16.23x | 0.6057s | 0.0373s | 7.961s | 3277/3277 |
| 8 | 222.220 | 2252.309 | 10.14x | 0.6840s | 0.0675s | 8.095s | 3277/3277 |
| 32 | 594.830 | 4076.913 | 6.85x | 1.0221s | 0.1491s | 8.072s | 3277/3277 |
| 128 | 624.679 | 6225.107 | 9.97x | 3.8932s | 0.3907s | 8.493s | 3277/3277 |
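
The speedup column can be cross-checked from the two throughput columns. The sketch below hard-codes the table rows and recomputes speedup as final tok/s divided by eager tok/s. It also estimates generated tokens per request as throughput times batch latency divided by batch size; that estimate is an inference from the reported numbers, not a figure the run states.

```python
# Rows copied from the table:
# (batch, eager tok/s, final tok/s, eager batch latency s, final batch latency s)
ROWS = [
    (1, 31.371, 509.259, 0.6057, 0.0373),
    (8, 222.220, 2252.309, 0.6840, 0.0675),
    (32, 594.830, 4076.913, 1.0221, 0.1491),
    (128, 624.679, 6225.107, 3.8932, 0.3907),
]


def speedup(eager_tps, final_tps):
    """Same-batch throughput ratio of the final stack over the eager path."""
    return final_tps / eager_tps


def tokens_per_request(tps, latency_s, batch):
    """Inferred tokens generated per request in one batch (an assumption)."""
    return tps * latency_s / batch


for batch, eager_tps, final_tps, eager_lat, final_lat in ROWS:
    print(
        f"batch {batch:>3}: speedup {speedup(eager_tps, final_tps):.2f}x, "
        f"about {tokens_per_request(eager_tps, eager_lat, batch):.1f} tokens/request"
    )
```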

## Best Result

- Best throughput: `kv_cache_prefill_unrolled` batch `128` at `6225.107` generated tokens/sec.
- Same-batch speedup at batch 128: `9.97x` over the original eager path at batch 128.
- Cross-batch speedup, final batch 128 vs. original eager batch 1: `198.43x`.
- Exact arithmetic accuracy is unchanged at `0.991456` on the `3277` correctness examples.
- Reference text match vs eager: `3277/3277`.
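
The headline ratios above follow from the table values, as the sketch below shows. The count of exactly correct examples is inferred from the reported accuracy and is not stated in the run. Because the table values are rounded, the cross-batch ratio can differ from the reported `198.43x` in the last digit.

```python
# Figures copied from the report.
BEST_FINAL_TPS = 6225.107  # final stack, batch 128
EAGER_B1_TPS = 31.371      # eager path, batch 1
EAGER_B128_TPS = 624.679   # eager path, batch 128
ACCURACY = 0.991456
N_EXAMPLES = 3277
REFERENCE_MATCHES = 3277

same_batch = BEST_FINAL_TPS / EAGER_B128_TPS
vs_batch_1 = BEST_FINAL_TPS / EAGER_B1_TPS

# Inferred, not reported: number of exactly correct examples.
n_correct = round(ACCURACY * N_EXAMPLES)
match_rate = REFERENCE_MATCHES / N_EXAMPLES

print(f"same-batch speedup: {same_batch:.2f}x")
print(f"vs eager batch 1:   {vs_batch_1:.1f}x")
print(f"correct examples:   {n_correct}/{N_EXAMPLES} (inferred)")
print(f"reference match:    {match_rate:.0%}")
```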

## Caveats

- This is a CPU run on the available local backend.
- Compile time is reported separately and is not included in the steady-state throughput or latency figures; the final stack at batch 128 compiled in about 8.49 seconds.
- The model itself is not perfectly accurate (`0.991456` exact accuracy), so the speedups apply to outputs that match the original eager model's text, not to a newly perfect model.
- Large-batch throughput improves substantially, but batch latency remains higher than single-request latency: with the final stack, a batch of 128 takes `0.3907s` versus `0.0373s` at batch 1.