# Int8 Cumulative Stack Check

This checks whether int8 weight-only quantization improves the throughput of the cumulative stack (KV-cache prefill plus unrolled decode).

Correctness text match vs float32 stack: `3277/3277`.

| batch | float32 tok/s | int8 tok/s | int8/float32 | float32 compile s | int8 compile s |
|---:|---:|---:|---:|---:|---:|
| 1 | 331.583 | 343.264 | 1.035x | 12.904 | 18.270 |
| 8 | 1316.392 | 1299.127 | 0.987x | 17.985 | 19.637 |
| 32 | 2099.439 | 2415.076 | 1.150x | 17.355 | 20.949 |
| 128 | 3007.320 | 2445.481 | 0.813x | 18.611 | 28.184 |
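The `int8/float32` column is int8 throughput divided by float32 throughput, so values above 1.0 mean int8 was faster. A minimal sketch that recomputes the column from the tok/s figures in the table; the names `results` and `speedup` are illustrative and not taken from the benchmark code:

```python
# Throughput figures copied from the table above, in tokens per second.
# Keys are batch sizes; values are (float32 tok/s, int8 tok/s).
results = {
    1: (331.583, 343.264),
    8: (1316.392, 1299.127),
    32: (2099.439, 2415.076),
    128: (3007.320, 2445.481),
}


def speedup(float32_tps, int8_tps):
    """Return int8 throughput relative to float32 (above 1.0: int8 is faster)."""
    return int8_tps / float32_tps


# Round to three decimals to match the precision used in the table.
ratios = {batch: round(speedup(f32, i8), 3) for batch, (f32, i8) in results.items()}

for batch, ratio in ratios.items():
    print(f"batch {batch}: {ratio:.3f}x")
```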

## Interpretation

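- Int8 is lossless here relative to the float32 stack: the output text matches on all 3277 items of the cached correctness set.
- It is mixed for speed: faster at batch 1 (1.035x) and batch 32 (1.150x), slower at batch 8 (0.987x) and batch 128 (0.813x). Compile time was higher with int8 at every batch size.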
- It does not replace the previous final best result: the final target is maximum throughput, and at batch 128 int8 reached 2445.481 tok/s against 3007.320 tok/s for float32 in this focused run.
- This implementation is not a production int8 matmul path; it dequantizes the weights and materializes them inside the JAX graph.
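To illustrate what dequantizing inside the graph means, here is a minimal JAX sketch of weight-only int8 quantization. It is not the implementation that was benchmarked: the function names `quantize_int8` and `int8_weight_only_matmul`, the symmetric per-output-channel scaling, the float32 dequantization dtype, and the toy shapes are all assumptions made for the example.

```python
import jax
import jax.numpy as jnp


def quantize_int8(w):
    """Symmetric per-output-channel int8 quantization of an (in, out) matrix.

    Hypothetical helper for illustration; the scaling scheme is an assumption.
    """
    # One scale per output column so the largest magnitude maps to 127.
    max_abs = jnp.max(jnp.abs(w), axis=0, keepdims=True)
    scale = jnp.where(max_abs > 0, max_abs / 127.0, 1.0)
    q = jnp.clip(jnp.round(w / scale), -127, 127).astype(jnp.int8)
    return q, scale.astype(jnp.float32)


@jax.jit
def int8_weight_only_matmul(x, q, scale):
    # Dequantize inside the traced graph: the int8 weights are converted back
    # to float32 before the matmul, so the matmul itself runs in float32.
    w = q.astype(jnp.float32) * scale
    return x @ w


# Toy shapes chosen only for the example.
key_w, key_x = jax.random.split(jax.random.PRNGKey(0))
w = jax.random.normal(key_w, (64, 32), dtype=jnp.float32)
x = jax.random.normal(key_x, (4, 64), dtype=jnp.float32)

q, scale = quantize_int8(w)
y_int8 = int8_weight_only_matmul(x, q, scale)
y_f32 = x @ w
max_abs_err = float(jnp.max(jnp.abs(y_int8 - y_f32)))
print("max abs difference vs float32 matmul:", max_abs_err)
```

In a path like this the weights are stored in int8, but the matmul still consumes float weights, so any speedup has to come from reduced weight storage and traffic rather than from int8 arithmetic. That is one plausible reason for mixed results like those in the table, though this check does not isolate the cause.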