Phi-4

Microsoft

Phi-4 is a 14.7B-parameter open-weight language model from Microsoft built for complex reasoning, coding, and knowledge-intensive applications. Its training recipe combines carefully curated synthetic data, filtered web content, and academic material, followed by supervised fine-tuning aimed at accuracy, alignment, and safety.

Model Specifications

Technical details and capabilities of Phi-4

Core Specifications

Parameters: 14.7B
Training tokens: 9.8T (9,800B)
Context window: 16K input / 16K output tokens
Knowledge cutoff: May 31, 2024
Release date: December 11, 2024
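
As a quick sanity check on the figures above, the sketch below loads the checkpoint listed under Resources below with the Hugging Face transformers library (an assumption on our part, not something this page prescribes) and reads the configured context length and parameter count; the max_position_embeddings field name is likewise an assumption about how the config exposes its context window.

    # Sketch: check Phi-4's context window and parameter count.
    # Assumes the `transformers` library (plus torch) and the microsoft/phi-4
    # checkpoint; downloading the full weights is on the order of 28 GB.
    from transformers import AutoConfig, AutoModelForCausalLM

    model_id = "microsoft/phi-4"

    config = AutoConfig.from_pretrained(model_id)
    # Many causal LM configs expose the context window as `max_position_embeddings`
    # (assumed here); for Phi-4 this should report roughly 16K.
    print("context window:", getattr(config, "max_position_embeddings", "unknown"))

    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
    print(f"parameters: {model.num_parameters() / 1e9:.1f}B")  # expected ~14.7B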

Capabilities & License

Multimodal support: Not supported
Web hydrated: No
License: MIT

Resources

Research paper: https://arxiv.org/pdf/2412.08905
API reference (Hugging Face model card): https://huggingface.co/microsoft/phi-4
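
For the Hugging Face checkpoint linked above, here is a minimal usage sketch, assuming the transformers text-generation pipeline and that the checkpoint ships a chat template (typical for instruct-tuned models); nothing on this page prescribes this exact API.

    # Sketch: chat-style generation with Phi-4 through the transformers pipeline.
    # Assumes `transformers`, `torch`, and `accelerate` (for device_map="auto").
    from transformers import pipeline

    generator = pipeline(
        "text-generation",
        model="microsoft/phi-4",
        torch_dtype="auto",
        device_map="auto",
    )

    messages = [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain in two sentences why the sky is blue."},
    ]

    # The pipeline applies the checkpoint's chat template to the message list
    # and appends the generated assistant turn to it.
    result = generator(messages, max_new_tokens=128)
    print(result[0]["generated_text"][-1]["content"])

Keeping prompts and completions within the 16K-token context window listed above is the caller's responsibility.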

Performance Insights

Check out how Phi-4 handles various AI tasks through comprehensive benchmark results.

Benchmark      Score
MMLU           84.8
HumanEval+     82.8
HumanEval      82.6
MGSM           80.6
MATH           80.4
DROP           75.5
ArenaHard      75.4
MMLU-Pro       70.4
IFEval         63.0
PhiBench       56.2
GPQA           56.1
LiveBench      47.6
SimpleQA       3.0

Model Comparison

See how Phi-4 stacks up against other leading models across key performance metrics.

Benchmark    Phi-4    Gemini 1.5 Pro    Llama 3.3 70B Instruct    Grok-2    Qwen2.5 32B Instruct    Grok-2 mini
MMLU         84.8     85.9              86.0                      87.5      83.3                    86.2
GPQA         56.1     59.1              50.5                      56.0      49.5                    51.0
MATH         80.4     86.5              77.0                      76.1      83.1                    73.0
HumanEval    82.6     84.1              88.4                      88.4      88.4                    85.7
MMLU-Pro     70.4     75.8              68.9                      75.5      69.0                    72.0

Detailed Benchmarks

Dive deeper into Phi-4's performance across specific task categories, with the average score across models shown for each benchmark.

Coding

HumanEval+ (avg across models: 62.1%)

Reasoning

DROP (avg across models: 76.3%)

Knowledge

MMLU (avg across models: 83.5%)
GPQA (avg across models: 57.0%)
MATH (avg across models: 78.5%)

Non categorized

MGSM (avg across models: 77.8%)
SimpleQA (avg across models: 26.4%)
ArenaHard (avg across models: 81.3%)
LiveBench (avg across models: 54.7%)
IFEval (avg across models: 75.4%)

Provider Pricing Coming Soon

We're working on gathering comprehensive pricing data from all major providers for Phi-4. Compare costs across platforms to find the best pricing for your use case.

