LLM ranking
Compare the performance of leading large language models across key benchmarks
Sorted by: GPQA (highest score first)
| # | Model | Released | GPQA | MMLU | Math | HumanEval |
|---|---|---|---|---|---|---|
| 1 | o3 | 3 months ago | 87.7% | — | — | — |
| 2 | Claude 3.7 Sonnet | 1 month ago | 84.8% | — | — | — |
| 3 | Grok-3 | 1 month ago | 84.6% | — | — | — |
| 4 | Grok-3 Mini | 1 month ago | 84.6% | — | — | — |
| 5 | Gemini 2.5 Pro Experimental | 5 days ago | 84.0% | — | — | — |
| 6 | o3-mini | 1 month ago | 79.7% | 86.9% | 97.9% | — |
| 7 | o1-pro | 3 months ago | 79.0% | — | — | — |
| 8 | o1 | 3 months ago | 78.0% | 91.8% | 96.4% | 88.1% |
| 9 | Gemini 2.0 Flash Thinking | 2 months ago | 74.2% | — | — | — |
| 10 | o1-preview | 6 months ago | 73.3% | 90.8% | 85.5% | — |
| 11 | DeepSeek-R1 | 2 months ago | 71.5% | 90.8% | — | — |
| 12 | GPT-4.5 | 1 month ago | 71.4% | — | — | — |
| 13 | Claude 3.5 Sonnet | 5 months ago | 67.2% | 90.4% | 78.3% | 93.7% |
| 14 | QwQ-32B-Preview | 4 months ago | 65.2% | — | — | — |
| 15 | Gemini 2.0 Flash | 3 months ago | 62.1% | — | 89.7% | — |
| 16 | o1-mini | 6 months ago | 60.0% | 85.2% | — | 92.4% |
| 17 | DeepSeek-V3 | 3 months ago | 59.1% | 88.5% | 61.6% | — |
| 18 | Gemini 1.5 Pro | 11 months ago | 59.1% | 85.9% | 86.5% | 84.1% |
| 19 | Phi-4 | 3 months ago | 56.1% | 84.8% | 80.4% | 82.6% |
| 20 | Grok-2 | 7 months ago | 56.0% | 87.5% | 76.1% | 88.4% |
| 21 | GPT-4o | 7 months ago | 53.6% | 88.0% | 76.6% | 90.2% |
| 22 | Gemini 1.5 Flash | 11 months ago | 51.0% | 78.9% | 77.9% | 74.3% |
| 23 | Grok-2 mini | 7 months ago | 51.0% | 86.2% | 73.0% | 85.7% |
| 24 | Llama 3.1 405B Instruct | 8 months ago | 50.7% | 87.3% | 73.8% | 89.0% |
| 25 | Llama 3.3 70B Instruct | 3 months ago | 50.5% | 86.0% | 77.0% | 88.4% |
| 26 | Claude 3 Opus | 1 year ago | 50.4% | 86.8% | 60.1% | 84.9% |
| 27 | Qwen2.5 32B Instruct | 6 months ago | 49.5% | 83.3% | 83.1% | 88.4% |
| 28 | Qwen2.5 72B Instruct | 6 months ago | 49.0% | — | 83.1% | 86.6% |
| 29 | GPT-4 Turbo | 11 months ago | 48.0% | 86.5% | 72.6% | 87.1% |
| 30 | Nova Pro | 4 months ago | 46.9% | 85.9% | 76.6% | 89.0% |
| 31 | Llama 3.2 90B Instruct | 6 months ago | 46.7% | 86.0% | 68.0% | — |
| 32 | Qwen2.5 14B Instruct | 6 months ago | 45.5% | 79.7% | 80.0% | 83.5% |
| 33 | Mistral Small 3 | 1 month ago | 45.3% | — | 70.6% | 84.8% |
| 34 | Qwen2 72B Instruct | 8 months ago | 42.4% | 82.3% | 59.7% | 86.0% |
| 35 | Nova Lite | 4 months ago | 42.0% | 80.5% | 73.3% | 85.4% |
| 36 | Llama 3.1 70B Instruct | 8 months ago | 41.7% | 83.6% | — | 80.5% |
| 37 | Claude 3.5 Haiku | 5 months ago | 41.6% | — | 69.4% | 88.1% |
| 38 | Claude 3 Sonnet | 1 year ago | 40.4% | 79.0% | 43.1% | 73.0% |
| 39 | GPT-4o mini | 8 months ago | 40.2% | 82.0% | 70.2% | 87.2% |
| 40 | Nova Micro | 4 months ago | 40.0% | 77.6% | 69.3% | 81.1% |
| 41 | Gemini 1.5 Flash 8B | 1 year ago | 38.4% | — | 58.7% | — |
| 42 | Jamba 1.5 Large | 7 months ago | 36.9% | 81.2% | — | — |
| 43 | Phi-3.5-MoE-instruct | 7 months ago | 36.8% | 78.9% | 59.5% | 70.7% |
| 44 | Qwen2.5 7B Instruct | 6 months ago | 36.4% | — | 75.5% | 84.8% |
| 45 | Grok-1.5 | 1 year ago | 35.9% | 81.3% | 50.6% | 74.1% |
| 46 | GPT-4 | 1 year ago | 35.7% | 86.4% | 42.0% | 67.0% |
| 47 | Claude 3 Haiku | 1 year ago | 33.3% | 75.2% | 38.9% | 75.9% |
| 48 | Llama 3.2 11B Instruct | 6 months ago | 32.8% | 73.0% | 51.9% | — |
| 49 | Llama 3.2 3B Instruct | 6 months ago | 32.8% | 63.4% | 48.0% | — |
| 50 | Jamba 1.5 Mini | 7 months ago | 32.3% | 69.7% | — | — |
| 51 | GPT-3.5 Turbo | 2 years ago | 30.8% | 69.8% | 43.1% | 68.0% |
| 52 | Llama 3.1 8B Instruct | 8 months ago | 30.4% | 69.4% | — | 72.6% |
| 53 | Phi-3.5-mini-instruct | 7 months ago | 30.4% | 69.0% | 48.5% | 62.8% |
| 54 | Gemini 1.0 Pro | 1 year ago | 27.9% | 71.8% | 32.6% | — |
| 55 | Qwen2 7B Instruct | 8 months ago | 25.3% | 70.5% | 49.6% | 79.9% |
| 56 | Claude 3.5 Sonnet | 9 months ago | — | — | — | — |
| 57 | Codestral-22B | 10 months ago | — | — | — | 81.1% |
| 58 | Command A | 2 weeks ago | — | 84.0% | 78.0% | — |
| 59 | Command R+ | 7 months ago | — | 75.7% | — | — |
| 60 | DeepSeek-V2.5 | 10 months ago | — | 80.4% | 74.7% | 89.0% |
| 61 | Gemma 2 27B | 9 months ago | — | 75.2% | 42.3% | 51.8% |
| 62 | Gemma 2 9B | 9 months ago | — | 71.3% | 36.6% | 40.2% |
| 63 | Gemma 3 27B | 2 weeks ago | — | 76.9% | 89.0% | 87.8% |
| 64 | GPT-4o | 10 months ago | — | — | — | — |
| 65 | Grok-1.5V | 11 months ago | — | — | — | — |
| 66 | Jamba 1.6 Large | 2 weeks ago | — | — | — | — |
| 67 | Jamba 1.6 Mini | 2 weeks ago | — | — | — | — |
| 68 | Kimi-k1.5 | 2 months ago | — | 87.4% | — | — |
| 69 | Llama 3.1 Nemotron 70B Instruct | 6 months ago | — | 80.2% | — | — |
| 70 | Ministral 8B Instruct | 5 months ago | — | 65.0% | 54.5% | 34.8% |
| 71 | Mistral Large 2 | 8 months ago | — | 84.0% | — | 92.0% |
| 72 | Mistral NeMo Instruct | 8 months ago | — | 68.0% | — | — |
| 73 | Mistral Small | 6 months ago | — | — | — | — |
| 74 | Mistral Small 3.1 24B | 1 week ago | — | 80.6% | 69.3% | 88.4% |
| 75 | Olmo 2 32B | 2 weeks ago | — | 74.9% | 49.7% | — |
| 76 | Phi-3.5-vision-instruct | 7 months ago | — | — | — | — |
| 77 | Pixtral Large | 4 months ago | — | — | — | — |
| 78 | Pixtral-12B | 6 months ago | — | 69.2% | 48.1% | 72.0% |
| 79 | QvQ-72B-Preview | 3 months ago | — | — | — | — |
| 80 | Qwen2-VL-72B-Instruct | 7 months ago | — | — | — | — |
| 81 | Qwen2.5-Coder 32B Instruct | 6 months ago | — | 75.1% | 57.2% | 92.7% |
| 82 | Qwen2.5-Coder 7B Instruct | 6 months ago | — | 67.6% | 46.6% | 88.4% |
| 83 | QwQ 32B | 3 weeks ago | — | — | — | — |
Showing all 83 models.
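The order of the table appears to follow a simple rule: models with a GPQA score come first, highest score on top, and models with no reported score follow. Where two entries share a position (equal scores, or both missing), the listing looks alphabetical by model name, ignoring case: Grok-3 precedes Grok-3 Mini at 84.6%, DeepSeek-V3 precedes Gemini 1.5 Pro at 59.1%, and Qwen2.5-Coder 7B Instruct precedes QwQ 32B among the unscored models. That tie-break is an assumption inferred from the listing, not something the page documents. The following minimal Python sketch reproduces the rule for a few rows copied from the table; `Row`, `ROWS`, `BENCHMARKS`, and `rank_by` are hypothetical names chosen for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical row type: (model name, GPQA %, MMLU %); None marks a missing score.
Row = Tuple[str, Optional[float], Optional[float]]

# A handful of rows copied from the table (only the GPQA and MMLU columns).
ROWS: List[Row] = [
    ("QwQ 32B", None, None),
    ("Grok-3 Mini", 84.6, None),
    ("o1", 78.0, 91.8),
    ("Qwen2.5-Coder 7B Instruct", None, 67.6),
    ("Grok-3", 84.6, None),
    ("Command A", None, 84.0),
    ("o3-mini", 79.7, 86.9),
]

# Hypothetical mapping from benchmark name to its position in a Row.
BENCHMARKS = {"GPQA": 1, "MMLU": 2}


def rank_by(rows: List[Row], benchmark: str) -> List[Row]:
    """Order rows by one benchmark: highest score first, missing scores last,
    ties broken by case-insensitive model name (an inferred, assumed rule)."""
    col = BENCHMARKS[benchmark]

    def key(row: Row) -> Tuple[bool, float, str]:
        score = row[col]
        missing = score is None
        # False sorts before True, so scored rows come before missing ones;
        # negating the score puts the highest score first.
        return (missing, 0.0 if missing else -score, row[0].casefold())

    return sorted(rows, key=key)


# Print the sample rows in ranked order, mirroring the table's layout.
for rank, (name, gpqa, _mmlu) in enumerate(rank_by(ROWS, "GPQA"), start=1):
    shown = "—" if gpqa is None else f"{gpqa:.1f}%"
    print(f"| {rank} | {name} | {shown} |")
```

The case-insensitive key matters: with a plain string comparison, Python sorts uppercase letters before lowercase ones, so "QwQ 32B" would land ahead of "Qwen2.5-Coder 7B Instruct", which is not the order the table shows. Passing `"MMLU"` instead of `"GPQA"` re-ranks the same rows by the other column.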