| **Category** | **Benchmark** | **Phi-3.5 Mini-Ins** | **Mistral-Nemo-12B-Ins-2407** | **Llama-3.1-8B-Ins** | **Gemma-2-9B-Ins** | **Gemini 1.5 Flash** |
|------------------------------|----------------------------|----------------------|-------------------------------|----------------------|--------------------|----------------------|
| Popular aggregated benchmark | Arena Hard | 37 | 39.4 | 25.7 | 42 | 55.2 |
| | BigBench Hard CoT (0-shot) | 69 | 60.2 | 63.4 | 63.5 | 66.7 |
| | MMLU (5-shot) | 69 | 67.2 | 68.1 | 71.3 | 78.7 |
| | MMLU-Pro (0-shot, CoT) | 47.4 | 40.7 | 44 | 50.1 | 57.2 |
| Reasoning | ARC Challenge (10-shot) | 84.6 | 84.8 | 83.1 | 89.8 | 92.8 |
| | TruthfulQA (MC2) (10-shot) | 64 | 68.1 | 69.2 | 76.6 | 76.6 |
| | WinoGrande (5-shot) | 68.5 | 70.4 | 64.7 | 74 | 74.7 |
| Multilingual | Multilingual MMLU (5-shot) | 55.4 | 58.9 | 56.2 | 63.8 | 77.2 |
| Math | GSM8K (8-shot, CoT) | 86.2 | 84.2 | 82.4 | 84.9 | 82.4 |
| | MATH (0-shot, CoT) | 48.5 | 31.2 | 47.6 | 50.9 | 38 |
| Long context | Qasper | 41.9 | 30.7 | 37.2 | 13.9 | 43.5 |
| | SQuALITY | 24.3 | 25.8 | 26.2 | 0 | 23.5 |
| Code Generation | HumanEval (0-shot) | 62.8 | 63.4 | 66.5 | 61 | 74.4 |
| | MBPP (3-shot) | 69.6 | 68.1 | 69.4 | 69.3 | 77.5 |