LLM ranking

Compare the performance of leading large language models across key benchmarks

Showing by: GPQA

#	Model	Release Date	GPQA	MMLU	Math	HumanEval
1	o3	3 months ago	87.7%	—	—	—
2	Claude 3.7 Sonnet	1 month ago	84.8%	—	—	—
3	Grok-3	1 month ago	84.6%	—	—	—
4	Grok-3 Mini	1 month ago	84.6%	—	—	—
5	Gemini Pro 2.5 Experimental	5 days ago	84.0%	—	—	—
6	o3-mini	1 month ago	79.7%	86.9%	97.9%	—
7	o1-pro	3 months ago	79.0%	—	—	—
8	o1	3 months ago	78.0%	91.8%	96.4%	88.1%
9	Gemini 2.0 Flash Thinking	2 months ago	74.2%	—	—	—
10	o1-preview	6 months ago	73.3%	90.8%	85.5%	—
11	DeepSeek-R1	2 months ago	71.5%	90.8%	—	—
12	GPT-4.5	1 month ago	71.4%	—	—	—
13	Claude 3.5 Sonnet	5 months ago	67.2%	90.4%	78.3%	93.7%
14	QwQ-32B-Preview	4 months ago	65.2%	—	—	—
15	Gemini 2.0 Flash	3 months ago	62.1%	—	89.7%	—
16	o1-mini	6 months ago	60.0%	85.2%	—	92.4%
17	DeepSeek-V3	3 months ago	59.1%	88.5%	61.6%	—
18	Gemini 1.5 Pro	11 months ago	59.1%	85.9%	86.5%	84.1%
19	Phi-4	3 months ago	56.1%	84.8%	80.4%	82.6%
20	Grok-2	7 months ago	56.0%	87.5%	76.1%	88.4%
21	GPT-4o	7 months ago	53.6%	88.0%	76.6%	90.2%
22	Gemini 1.5 Flash	11 months ago	51.0%	78.9%	77.9%	74.3%
23	Grok-2 mini	7 months ago	51.0%	86.2%	73.0%	85.7%
24	Llama 3.1 405B Instruct	8 months ago	50.7%	87.3%	73.8%	89.0%
25	Llama 3.3 70B Instruct	3 months ago	50.5%	86.0%	77.0%	88.4%
26	Claude 3 Opus	1 year ago	50.4%	86.8%	60.1%	84.9%
27	Qwen2.5 32B Instruct	6 months ago	49.5%	83.3%	83.1%	88.4%
28	Qwen2.5 72B Instruct	6 months ago	49.0%	—	83.1%	86.6%
29	GPT-4 Turbo	11 months ago	48.0%	86.5%	72.6%	87.1%
30	Nova Pro	4 months ago	46.9%	85.9%	76.6%	89.0%
31	Llama 3.2 90B Instruct	6 months ago	46.7%	86.0%	68.0%	—
32	Qwen2.5 14B Instruct	6 months ago	45.5%	79.7%	80.0%	83.5%
33	Mistral Small 3	1 month ago	45.3%	—	70.6%	84.8%
34	Qwen2 72B Instruct	8 months ago	42.4%	82.3%	59.7%	86.0%
35	Nova Lite	4 months ago	42.0%	80.5%	73.3%	85.4%
36	Llama 3.1 70B Instruct	8 months ago	41.7%	83.6%	—	80.5%
37	Claude 3.5 Haiku	5 months ago	41.6%	—	69.4%	88.1%
38	Claude 3 Sonnet	1 year ago	40.4%	79.0%	43.1%	73.0%
39	GPT-4o mini	8 months ago	40.2%	82.0%	70.2%	87.2%
40	Nova Micro	4 months ago	40.0%	77.6%	69.3%	81.1%
41	Gemini 1.5 Flash 8B	1 year ago	38.4%	—	58.7%	—
42	Jamba 1.5 Large	7 months ago	36.9%	81.2%	—	—
43	Phi-3.5-MoE-instruct	7 months ago	36.8%	78.9%	59.5%	70.7%
44	Qwen2.5 7B Instruct	6 months ago	36.4%	—	75.5%	84.8%
45	Grok-1.5	1 year ago	35.9%	81.3%	50.6%	74.1%
46	GPT-4	1 year ago	35.7%	86.4%	42.0%	67.0%
47	Claude 3 Haiku	1 year ago	33.3%	75.2%	38.9%	75.9%
48	Llama 3.2 11B Instruct	6 months ago	32.8%	73.0%	51.9%	—
49	Llama 3.2 3B Instruct	6 months ago	32.8%	63.4%	48.0%	—
50	Jamba 1.5 Mini	7 months ago	32.3%	69.7%	—	—
51	GPT-3.5 Turbo	2 years ago	30.8%	69.8%	43.1%	68.0%
52	Llama 3.1 8B Instruct	8 months ago	30.4%	69.4%	—	72.6%
53	Phi-3.5-mini-instruct	7 months ago	30.4%	69.0%	48.5%	62.8%
54	Gemini 1.0 Pro	1 year ago	27.9%	71.8%	32.6%	—
55	Qwen2 7B Instruct	8 months ago	25.3%	70.5%	49.6%	79.9%
56	Claude 3.5 Sonnet	9 months ago	—	—	—	—
57	Codestral-22B	10 months ago	—	—	—	81.1%
58	Command A	2 weeks ago	—	84.0%	78.0%	—
59	Command R+	7 months ago	—	75.7%	—	—
60	DeepSeek-V2.5	10 months ago	—	80.4%	74.7%	89.0%
61	Gemma 2 27B	9 months ago	—	75.2%	42.3%	51.8%
62	Gemma 2 9B	9 months ago	—	71.3%	36.6%	40.2%
63	Gemma 3 27B	2 weeks ago	—	76.9%	89.0%	87.8%
64	GPT-4o	10 months ago	—	—	—	—
65	Grok-1.5V	11 months ago	—	—	—	—
66	Jamba 1.6 Large	2 weeks ago	—	—	—	—
67	Jamba 1.6 Mini	2 weeks ago	—	—	—	—
68	Kimi-k1.5	2 months ago	—	87.4%	—	—
69	Llama 3.1 Nemotron 70B Instruct	6 months ago	—	80.2%	—	—
70	Ministral 8B Instruct	5 months ago	—	65.0%	54.5%	34.8%
71	Mistral Large 2	8 months ago	—	84.0%	—	92.0%
72	Mistral NeMo Instruct	8 months ago	—	68.0%	—	—
73	Mistral Small	6 months ago	—	—	—	—
74	Mistral Small 3.1 24B	1 week ago	—	80.6%	69.3%	88.4%
75	Olmo 2 32B	2 weeks ago	—	74.9%	49.7%	—
76	Phi-3.5-vision-instruct	7 months ago	—	—	—	—
77	Pixtral Large	4 months ago	—	—	—	—
78	Pixtral-12B	6 months ago	—	69.2%	48.1%	72.0%
79	QvQ-72B-Preview	3 months ago	—	—	—	—
80	Qwen2-VL-72B-Instruct	7 months ago	—	—	—	—
81	Qwen2.5-Coder 32B Instruct	6 months ago	—	75.1%	57.2%	92.7%
82	Qwen2.5-Coder 7B Instruct	6 months ago	—	67.6%	46.6%	88.4%
83	QwQ 32B	3 weeks ago	—	—	—	—

Showing 83 of 83 models
