Phi-2 punches well above its weight class for a 2.8B-parameter model, showing surprisingly strong reasoning and coding ability relative to its size. It was trained with a heavy focus on "textbook-quality" data, which gives it a clean, structured way of explaining concepts. However, its small size means it can struggle with complex multi-step tasks, and it has limited world knowledge compared to larger models.
| Benchmark | Score (% accuracy) |
|---|---|
| IFEval | 27.4 |
| MATH | 2.9 |
| MuSR | 13.8 |
| MMLU-Pro | 18.1 |
| BBH | 28.0 |
| GPQA Diamond | 2.9 |