LLM ranking
Compare the performance of leading large language models across key benchmarks
Sorted by: GPQA
# | Model | Release Date | GPQA | MMLU | Math | HumanEval |
---|---|---|---|---|---|---|
1 | o3 | 3 months ago | 87.7% | — | — | — |
2 | Claude 3.7 Sonnet | 1 month ago | 84.8% | — | — | — |
3 | Grok-3 | 1 month ago | 84.6% | — | — | — |
4 | Grok-3 Mini | 1 month ago | 84.6% | — | — | — |
5 | Gemini Pro 2.5 Experimental | 5 days ago | 84.0% | — | — | — |
6 | o3-mini | 1 month ago | 79.7% | 86.9% | 97.9% | — |
7 | o1-pro | 3 months ago | 79.0% | — | — | — |
8 | o1 | 3 months ago | 78.0% | 91.8% | 96.4% | 88.1% |
9 | Gemini 2.0 Flash Thinking | 2 months ago | 74.2% | — | — | — |
10 | o1-preview | 6 months ago | 73.3% | 90.8% | 85.5% | — |
11 | DeepSeek-R1 | 2 months ago | 71.5% | 90.8% | — | — |
12 | GPT-4.5 | 1 month ago | 71.4% | — | — | — |
13 | Claude 3.5 Sonnet | 5 months ago | 67.2% | 90.4% | 78.3% | 93.7% |
14 | QwQ-32B-Preview | 4 months ago | 65.2% | — | — | — |
15 | Gemini 2.0 Flash | 3 months ago | 62.1% | — | 89.7% | — |
16 | o1-mini | 6 months ago | 60.0% | 85.2% | — | 92.4% |
17 | DeepSeek-V3 | 3 months ago | 59.1% | 88.5% | 61.6% | — |
18 | Gemini 1.5 Pro | 11 months ago | 59.1% | 85.9% | 86.5% | 84.1% |
19 | Phi-4 | 3 months ago | 56.1% | 84.8% | 80.4% | 82.6% |
20 | Grok-2 | 7 months ago | 56.0% | 87.5% | 76.1% | 88.4% |
21 | GPT-4o | 7 months ago | 53.6% | 88.0% | 76.6% | 90.2% |
22 | Gemini 1.5 Flash | 11 months ago | 51.0% | 78.9% | 77.9% | 74.3% |
23 | Grok-2 mini | 7 months ago | 51.0% | 86.2% | 73.0% | 85.7% |
24 | Llama 3.1 405B Instruct | 8 months ago | 50.7% | 87.3% | 73.8% | 89.0% |
25 | Llama 3.3 70B Instruct | 3 months ago | 50.5% | 86.0% | 77.0% | 88.4% |
26 | Claude 3 Opus | 1 year ago | 50.4% | 86.8% | 60.1% | 84.9% |
27 | Qwen2.5 32B Instruct | 6 months ago | 49.5% | 83.3% | 83.1% | 88.4% |
28 | Qwen2.5 72B Instruct | 6 months ago | 49.0% | — | 83.1% | 86.6% |
29 | GPT-4 Turbo | 11 months ago | 48.0% | 86.5% | 72.6% | 87.1% |
30 | Nova Pro | 4 months ago | 46.9% | 85.9% | 76.6% | 89.0% |
31 | Llama 3.2 90B Instruct | 6 months ago | 46.7% | 86.0% | 68.0% | — |
32 | Qwen2.5 14B Instruct | 6 months ago | 45.5% | 79.7% | 80.0% | 83.5% |
33 | Mistral Small 3 | 1 month ago | 45.3% | — | 70.6% | 84.8% |
34 | Qwen2 72B Instruct | 8 months ago | 42.4% | 82.3% | 59.7% | 86.0% |
35 | Nova Lite | 4 months ago | 42.0% | 80.5% | 73.3% | 85.4% |
36 | Llama 3.1 70B Instruct | 8 months ago | 41.7% | 83.6% | — | 80.5% |
37 | Claude 3.5 Haiku | 5 months ago | 41.6% | — | 69.4% | 88.1% |
38 | Claude 3 Sonnet | 1 year ago | 40.4% | 79.0% | 43.1% | 73.0% |
39 | GPT-4o mini | 8 months ago | 40.2% | 82.0% | 70.2% | 87.2% |
40 | Nova Micro | 4 months ago | 40.0% | 77.6% | 69.3% | 81.1% |
41 | Gemini 1.5 Flash 8B | 1 year ago | 38.4% | — | 58.7% | — |
42 | Jamba 1.5 Large | 7 months ago | 36.9% | 81.2% | — | — |
43 | Phi-3.5-MoE-instruct | 7 months ago | 36.8% | 78.9% | 59.5% | 70.7% |
44 | Qwen2.5 7B Instruct | 6 months ago | 36.4% | — | 75.5% | 84.8% |
45 | Grok-1.5 | 1 year ago | 35.9% | 81.3% | 50.6% | 74.1% |
46 | GPT-4 | 1 year ago | 35.7% | 86.4% | 42.0% | 67.0% |
47 | Claude 3 Haiku | 1 year ago | 33.3% | 75.2% | 38.9% | 75.9% |
48 | Llama 3.2 11B Instruct | 6 months ago | 32.8% | 73.0% | 51.9% | — |
49 | Llama 3.2 3B Instruct | 6 months ago | 32.8% | 63.4% | 48.0% | — |
50 | Jamba 1.5 Mini | 7 months ago | 32.3% | 69.7% | — | — |
51 | GPT-3.5 Turbo | 2 years ago | 30.8% | 69.8% | 43.1% | 68.0% |
52 | Llama 3.1 8B Instruct | 8 months ago | 30.4% | 69.4% | — | 72.6% |
53 | Phi-3.5-mini-instruct | 7 months ago | 30.4% | 69.0% | 48.5% | 62.8% |
54 | Gemini 1.0 Pro | 1 year ago | 27.9% | 71.8% | 32.6% | — |
55 | Qwen2 7B Instruct | 8 months ago | 25.3% | 70.5% | 49.6% | 79.9% |
56 | Claude 3.5 Sonnet | 9 months ago | — | — | — | — |
57 | Codestral-22B | 10 months ago | — | — | — | 81.1% |
58 | Command A | 2 weeks ago | — | 84.0% | 78.0% | — |
59 | Command R+ | 7 months ago | — | 75.7% | — | — |
60 | DeepSeek-V2.5 | 10 months ago | — | 80.4% | 74.7% | 89.0% |
61 | Gemma 2 27B | 9 months ago | — | 75.2% | 42.3% | 51.8% |
62 | Gemma 2 9B | 9 months ago | — | 71.3% | 36.6% | 40.2% |
63 | Gemma 3 27B | 2 weeks ago | — | 76.9% | 89.0% | 87.8% |
64 | GPT-4o | 10 months ago | — | — | — | — |
65 | Grok-1.5V | 11 months ago | — | — | — | — |
66 | Jamba 1.6 Large | 2 weeks ago | — | — | — | — |
67 | Jamba 1.6 Mini | 2 weeks ago | — | — | — | — |
68 | Kimi-k1.5 | 2 months ago | — | 87.4% | — | — |
69 | Llama 3.1 Nemotron 70B Instruct | 6 months ago | — | 80.2% | — | — |
70 | Ministral 8B Instruct | 5 months ago | — | 65.0% | 54.5% | 34.8% |
71 | Mistral Large 2 | 8 months ago | — | 84.0% | — | 92.0% |
72 | Mistral NeMo Instruct | 8 months ago | — | 68.0% | — | — |
73 | Mistral Small | 6 months ago | — | — | — | — |
74 | Mistral Small 3.1 24B | 1 week ago | — | 80.6% | 69.3% | 88.4% |
75 | Olmo 2 32B | 2 weeks ago | — | 74.9% | 49.7% | — |
76 | Phi-3.5-vision-instruct | 7 months ago | — | — | — | — |
77 | Pixtral Large | 4 months ago | — | — | — | — |
78 | Pixtral-12B | 6 months ago | — | 69.2% | 48.1% | 72.0% |
79 | QvQ-72B-Preview | 3 months ago | — | — | — | — |
80 | Qwen2-VL-72B-Instruct | 7 months ago | — | — | — | — |
81 | Qwen2.5-Coder 32B Instruct | 6 months ago | — | 75.1% | 57.2% | 92.7% |
82 | Qwen2.5-Coder 7B Instruct | 6 months ago | — | 67.6% | 46.6% | 88.4% |
83 | QwQ 32B | 3 weeks ago | — | — | — | — |
Showing 83 of 83 models