Phi-3 Medium performs strongly for a 14B-parameter model, reflecting Microsoft's research focus on training efficiency over raw scale. Its 128k-token context window lets it handle lengthy documents and extended conversations. The trade-off is that it can struggle with complex multi-step reasoning, where larger models have a clear edge; the low MuSR and GPQA Diamond scores in the table below are consistent with that.

| Benchmark | Score | Type | Recorded |
|---|---|---|---|
| MATH | 19.2 | accuracy | 23d ago |
| MMLU-Pro | 41.2 | accuracy | 23d ago |
| BBH | 48.5 | accuracy | 23d ago |
| IFEval | 60.4 | accuracy | 23d ago |
| MuSR | 11.4 | accuracy | 23d ago |
| GPQA Diamond | 11.5 | accuracy | 23d ago |