DeepSeek-R1

DeepSeek

DeepSeek-R1 is a cutting-edge reasoning model developed using DeepSeek-V3 as its foundation (671B total parameters, 37B activated per token). This first-generation model leverages extensive reinforcement learning (RL) to significantly improve its chain-of-thought processes and overall reasoning abilities. As a result, DeepSeek-R1 excels in complex tasks involving mathematics, coding, and multi-step reasoning.

Model Specifications

Technical details and capabilities of DeepSeek-R1

Core Specifications

671.0B Parameters

Model size and complexity

14800.0B Training Tokens

Amount of data used in training

131.1K / 131.1K

Input / Output tokens

January 19, 2025

Release date

Capabilities & License

Multimodal Support

Not Supported

Web Hydrated

No

License

MIT License

Resources

Research Paper

https://arxiv.org/abs/2501.12948

API Reference

https://api-docs.deepseek.com/news/news250120

Playground

https://chat.deepseek.com

Code Repository

https://github.com/deepseek-ai/DeepSeek-R1

Performance Insights

Check out how DeepSeek-R1 handles various AI tasks through comprehensive benchmark results.

100

75

50

25

0

97.3

MATH-500

97.3

(97%)

92.9

MMLU-Redux

92.9

(93%)

92.8

CLUEWSC

92.8

(93%)

92.3

ArenaHard

92.3

(92%)

92.2

DROP

92.2

(92%)

91.8

C-Eval

91.8

(92%)

90.8

MMLU

90.8

(91%)

87.6

AlpacaEval2.0

87.6

(88%)

84

MMLU-Pro

84

(84%)

83.3

IFEval

83.3

(83%)

82.5

FRAMES

82.5

(83%)

79.8

AIME 2024

79.8

(80%)

78.8

CNMO 2024

78.8

(79%)

71.5

GPQA

71.5

(72%)

70

AIME 2025

70

(70%)

65.9

LiveCodeBench

65.9

(66%)

63.7

C-SimpleQA

63.7

(64%)

56.9

Aider Polyglot

56.9

(57%)

53.3

Aider-Polyglot

53.3

(53%)

49.2

SWE-bench Verified

49.2

(49%)

30.1

SimpleQA

30.1

(30%)

8.6

Humanity's Last Exam

8.6

(9%)

MATH-500

MMLU-Redux

CLUEWSC

ArenaHard

DROP

C-Eval

MMLU

AlpacaEval2.0

MMLU-Pro

IFEval

FRAMES

AIME 2024

CNMO 2024

GPQA

AIME 2025

LiveCodeBench

C-SimpleQA

Aider Polyglot

Aider-Polyglot

SWE-bench Verified

SimpleQA

Humanity's Last Exam

Model Comparison

See how DeepSeek-R1 stacks up against other leading models across key performance metrics.

100

80

60

40

20

0

90.8

MMLU - DeepSeek-R1

90.8

(91%)

88.5

MMLU - DeepSeek-V3

88.5

(89%)

88.7

MMLU - GPT-4o

88.7

(89%)

87.3

MMLU - Llama 3.1 405B Instruct

87.3

(87%)

84.8

MMLU - Phi-4

84.8

(85%)

83.6

MMLU - Llama 3.1 70B Instruct

83.6

(84%)

84

MMLU-Pro - DeepSeek-R1

84

(84%)

75.9

MMLU-Pro - DeepSeek-V3

75.9

(76%)

72.6

MMLU-Pro - GPT-4o

72.6

(73%)

73.3

MMLU-Pro - Llama 3.1 405B Instruct

73.3

(73%)

70.4

MMLU-Pro - Phi-4

70.4

(70%)

66.4

MMLU-Pro - Llama 3.1 70B Instruct

66.4

(66%)

92.2

DROP - DeepSeek-R1

92.2

(92%)

91.6

DROP - DeepSeek-V3

91.6

(92%)

83.4

DROP - GPT-4o

83.4

(83%)

84.8

DROP - Llama 3.1 405B Instruct

84.8

(85%)

75.5

DROP - Phi-4

75.5

(76%)

79.6

DROP - Llama 3.1 70B Instruct

79.6

(80%)

83.3

IFEval - DeepSeek-R1

83.3

(83%)

86.1

IFEval - DeepSeek-V3

86.1

(86%)

84

IFEval - GPT-4o

84

(84%)

88.6

IFEval - Llama 3.1 405B Instruct

88.6

(89%)

63

IFEval - Phi-4

63

(63%)

87.5

IFEval - Llama 3.1 70B Instruct

87.5

(88%)

71.5

GPQA - DeepSeek-R1

71.5

(72%)

59.1

GPQA - DeepSeek-V3

59.1

(59%)

53.6

GPQA - GPT-4o

53.6

(54%)

50.7

GPQA - Llama 3.1 405B Instruct

50.7

(51%)

56.1

GPQA - Phi-4

56.1

(56%)

41.7

GPQA - Llama 3.1 70B Instruct

41.7

(42%)

MMLU

MMLU-Pro

DROP

IFEval

GPQA

DeepSeek-R1

DeepSeek-V3

GPT-4o

Llama 3.1 405B Instruct

Phi-4

Llama 3.1 70B Instruct

Detailed Benchmarks

Dive deeper into DeepSeek-R1's performance across specific task categories. Expand each section to see detailed metrics and comparisons.

Math

AIME 2024

96.7%

87.3%

86.0%

83.3%

Claude 3.7 Sonnet

80.0%

79.8%

77.5%

42.0%

36.7%

13.4%

Current model

Other models

Avg (68.3%)

MATH-500

97.3%

Claude 3.7 Sonnet

96.2%

96.2%

QwQ-32B-Preview

90.6%

90.2%

90.0%

Current model

Other models

Avg (93.4%)

AIME 2025

93.0%

90.3%

Gemini Pro 2.5 Experimental

86.7%

86.5%

70.0%

Claude 3.7 Sonnet

49.5%

Current model

Other models

Avg (79.3%)

Coding

LiveCodeBench

80.0%

79.0%

74.1%

Gemini Pro 2.5 Experimental

70.4%

65.9%

63.4%

62.5%

Qwen2.5 72B Instruct

55.5%

Qwen2.5-Coder 7B Instruct

18.2%

Current model

Other models

Avg (63.2%)

SWE-bench Verified

Claude 3.7 Sonnet

70.3%

Gemini Pro 2.5 Experimental

63.8%

49.3%

49.2%

Claude 3.5 Sonnet

49.0%

48.9%

42.0%

Claude 3.5 Haiku

40.6%

38.0%

Current model

Other models

Avg (50.1%)

Aider-Polyglot

53.3%

49.6%

Current model

Other models

Avg (51.4%)

Aider Polyglot

Gemini Pro 2.5 Experimental

74.0%

Gemini Pro 2.5 Experimental

72.9%

Claude 3.7 Sonnet

64.9%

61.7%

60.4%

56.9%

44.9%

27.1%

20.9%

Current model

Other models

Avg (53.7%)

Reasoning

DROP

92.2%

91.6%

Claude 3.5 Sonnet

87.1%

Claude 3.5 Sonnet

87.1%

86.0%

85.4%

Llama 3.1 405B Instruct

84.8%

83.4%

Current model

Other models

Avg (87.2%)

Knowledge

MMLU

91.8%

90.8%

90.8%

Claude 3.5 Sonnet

90.4%

Claude 3.5 Sonnet

90.4%

88.7%

88.5%

88.0%

Current model

Other models

Avg (89.9%)

GPQA

87.7%

79.0%

78.0%

Gemini 2.0 Flash Thinking

74.2%

73.3%

71.5%

71.4%

Claude 3.5 Sonnet

67.2%

QwQ-32B-Preview

65.2%

Qwen2 7B Instruct

25.3%

Current model

Other models

Avg (69.3%)

Non categorized

CLUEWSC

92.8%

91.4%

90.9%

Current model

Other models

Avg (91.7%)

MMLU-Redux

92.9%

89.1%

Qwen2.5 72B Instruct

86.8%

Qwen2.5 32B Instruct

83.9%

Qwen2.5 14B Instruct

80.0%

Qwen2.5-Coder 32B Instruct

77.5%

Qwen2.5 7B Instruct

75.4%

Qwen2.5-Coder 7B Instruct

66.6%

Current model

Other models

Avg (81.5%)

MMLU-Pro

84.0%

Claude 3.5 Sonnet

77.6%

Gemini 2.0 Flash

76.4%

Claude 3.5 Sonnet

76.1%

75.9%

75.8%

75.5%

74.7%

Current model

Other models

Avg (77.0%)

IFEval

Claude 3.7 Sonnet

93.2%

85.6%

Qwen2.5 72B Instruct

84.1%

84.0%

83.9%

83.3%

Mistral Small 3

82.9%

Llama 3.1 8B Instruct

80.4%

Llama 3.2 3B Instruct

77.4%

61.3%

Current model

Other models

Avg (81.6%)

SimpleQA

62.5%

Gemini Pro 2.5 Experimental

52.9%

43.6%

42.6%

42.4%

30.1%

24.9%

13.8%

Mistral Small 3.1 24B

10.4%

3.0%

Current model

Other models

Avg (32.6%)

FRAMES

82.5%

73.3%

Current model

Other models

Avg (77.9%)

ArenaHard

92.3%

76.2%

75.4%

Current model

Other models

Avg (81.3%)

C-Eval

91.8%

86.5%

Qwen2 72B Instruct

83.8%

Qwen2 7B Instruct

77.2%

Current model

Other models

Avg (84.8%)

C-SimpleQA

64.8%

63.7%

Current model

Other models

Avg (64.3%)

Humanity's Last Exam

Gemini Pro 2.5 Experimental

18.8%

14.0%

Claude 3.7 Sonnet

8.9%

8.6%

6.4%

Current model

Other models

Avg (11.3%)

Providers Pricing Coming Soon

We're working on gathering comprehensive pricing data from all major providers for DeepSeek-R1. Compare costs across platforms to find the best pricing for your use case.

OpenAI

Anthropic

Google

Mistral AI

Cohere

Share your feedback

Hi, I'm Charlie Palars, the founder of Deepranking.ai. I'm always looking for ways to improve the site and make it more useful for you. You can write me through this form or directly through X at @palarsio.

Your feedback helps us improve our service

Stay Ahead with AI Updates

Get insights on Gemini Pro 2.5, Sonnet 3.7 and more top AI models