# Float16 Cumulative Stack Check

This check compares float16 against float32 on CPU for the cumulative KV-cache prefill and unrolled decode stack, reporting output agreement, throughput, and compile time at four batch sizes.

Correctness (float16 text matching the float32 stack): `3277/3277`.
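The report does not say how the match count was computed. As a minimal sketch, assuming the check is exact string equality between the two runs' outputs for each prompt, the count could be produced as below. The function name `count_text_matches` and the toy strings are hypothetical and are not taken from the benchmark.

```python
def count_text_matches(reference_texts, candidate_texts):
    """Return (matches, total) using exact string equality per item.

    Assumption: both lists are ordered by the same prompts.
    """
    if len(reference_texts) != len(candidate_texts):
        raise ValueError("both runs must cover the same prompts")
    matches = sum(
        ref == cand for ref, cand in zip(reference_texts, candidate_texts)
    )
    return matches, len(reference_texts)


# Toy data for illustration only; not benchmark output.
fp32_texts = ["the cat sat", "hello world", "42"]
fp16_texts = ["the cat sat", "hello world", "42"]

matches, total = count_text_matches(fp32_texts, fp16_texts)
print(f"{matches}/{total}")
```

Exact equality is the strictest text-level criterion: a single differing character in one output counts as a mismatch for that item.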

| batch | float32 tok/s | float16 tok/s | fp16/fp32 | float32 compile s | float16 compile s |
|---:|---:|---:|---:|---:|---:|
| 1 | 287.080 | 101.849 | 0.355x | 13.359 | 27.226 |
| 8 | 1253.333 | 1335.112 | 1.065x | 18.295 | 25.782 |
| 32 | 1301.142 | 1527.947 | 1.174x | 20.629 | 26.716 |
| 128 | 1751.450 | 2133.326 | 1.218x | 27.121 | 27.373 |
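The `fp16/fp32` column is the float16 throughput divided by the float32 throughput, so values above 1 mean float16 was faster. The short sketch below recomputes it from the table; the variable names are illustrative only.

```python
# Throughput in tok/s, copied from the table: batch -> (float32, float16).
throughput = {
    1: (287.080, 101.849),
    8: (1253.333, 1335.112),
    32: (1301.142, 1527.947),
    128: (1751.450, 2133.326),
}

# Ratio above 1.0 means float16 processed more tokens per second.
speed_ratio = {
    batch: fp16 / fp32 for batch, (fp32, fp16) in throughput.items()
}

for batch, ratio in speed_ratio.items():
    print(f"batch {batch:>3}: {ratio:.3f}x")
```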

## Interpretation

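- Float16 output matched float32 on all 3277 items in the cached correctness set. This is a text-level match, so it does not show that intermediate values are numerically identical.
- On CPU, float16 was about 2.8x slower than float32 at batch 1 (0.355x throughput), so it did not help single-request latency. At batch sizes 8 through 128 it improved throughput by roughly 7% to 22% in this run.
- Compile time was higher for float16 at every batch size. Float16 stayed at roughly 26 to 27 s, while float32 grew from 13.4 s at batch 1 to 27.1 s at batch 128, so the gap shrank from about 2x to near parity.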
- These results are CPU-specific and should be retested on GPU or Apple Metal if those backends become available.
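The compile-time observation can be checked with the same kind of arithmetic on the last two table columns. This is a small sketch with illustrative variable names; a ratio above 1 means float16 took longer to compile.

```python
# Compile time in seconds, copied from the table: batch -> (float32, float16).
compile_seconds = {
    1: (13.359, 27.226),
    8: (18.295, 25.782),
    32: (20.629, 26.716),
    128: (27.121, 27.373),
}

# Float16 compile time relative to float32 at each batch size.
compile_ratio = {
    batch: fp16 / fp32 for batch, (fp32, fp16) in compile_seconds.items()
}

for batch, ratio in compile_ratio.items():
    print(f"batch {batch:>3}: float16 compile took {ratio:.2f}x float32")
```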
