o1

OpenAI

o1 is OpenAI's reasoning model, built to excel at math and logic. It is particularly strong on tasks that require extended step-by-step reasoning, such as solving competition math problems and generating code, and it delivers improved formal reasoning while still performing well across general tasks.

Model Specifications

Technical details and capabilities of o1

Core Specifications

Input / Output tokens: 200K / 100K
Knowledge cutoff date: December 31, 2023
Release date: December 16, 2024
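
The input window and output budget are enforced separately, so it can help to estimate a prompt's token count before sending it. The sketch below assumes tiktoken's o200k_base encoding approximates o1's tokenizer; treat the result as an estimate rather than an exact quota.

```python
# Rough pre-flight check that a prompt fits o1's advertised 200K-token
# input window. Assumes tiktoken's "o200k_base" encoding approximates
# o1's tokenizer (an assumption; OpenAI has not published the exact one).
import tiktoken

INPUT_LIMIT = 200_000   # input tokens, per the specs above
OUTPUT_LIMIT = 100_000  # output tokens, which include hidden reasoning tokens

def fits_input_window(prompt: str, reserve: int = 1_000) -> bool:
    """Return True if the prompt leaves `reserve` tokens of headroom."""
    enc = tiktoken.get_encoding("o200k_base")
    n_tokens = len(enc.encode(prompt))
    return n_tokens + reserve <= INPUT_LIMIT

print(fits_input_window("Prove that sqrt(2) is irrational."))  # True
```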

Capabilities & License

Multimodal Support: Not Supported
Web Hydrated: No
License: Proprietary

Resources

Research Paper: https://cdn.openai.com/o1-system-card-20240917.pdf
API Reference: https://platform.openai.com/docs/models
Announcement: https://openai.com/index/o1-and-new-tools-for-developers/
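
For orientation, here is a minimal sketch of calling o1 through the OpenAI Python SDK, following the API reference above. Parameter names reflect the documented chat completions interface for o-series models at the time of writing; consult the API reference for current values.

```python
# Minimal sketch of a chat completion request to o1 via the OpenAI SDK.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o1",
    messages=[
        {"role": "user", "content": "How many primes are there below 100?"}
    ],
    # o-series models budget output (including hidden reasoning tokens)
    # with max_completion_tokens rather than the older max_tokens field.
    max_completion_tokens=4_096,
)

print(response.choices[0].message.content)
```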

Performance Insights

See how o1 handles a variety of AI tasks, based on published benchmark results.

GSM8K: 97.1 (97%)
MATH: 96.4 (96%)
GPQA Physics: 92.8 (93%)
MMLU: 91.8 (92%)
MGSM: 89.3 (89%)
HumanEval: 88.1 (88%)
AIME 2024: 83.3 (83%)
GPQA: 78.0 (78%)
MMMU: 77.3 (77%)
TAU-bench Retail: 73.5 (74%)
MathVista: 71.0 (71%)
GPQA Biology: 69.2 (69%)
LiveBench: 67.0 (67%)
GPQA Chemistry: 64.7 (65%)
Aider Polyglot: 61.7 (62%)
TAU-bench Airline: 54.2 (54%)
SWE-bench Verified: 48.9 (49%)
Codeforces: 47.0 (47%)
SimpleQA: 42.6 (43%)
FrontierMath: 5.5 (6%)
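
The parenthesized figures above are simply the raw scores rounded to whole percents. The short sketch below (names are illustrative, not from the source) rebuilds the ranking from the published numbers.

```python
# Rebuild the benchmark listing from the published o1 scores.
# Dict name and structure are illustrative, not from the source page.
O1_SCORES = {
    "GSM8K": 97.1, "MATH": 96.4, "GPQA Physics": 92.8, "MMLU": 91.8,
    "MGSM": 89.3, "HumanEval": 88.1, "AIME 2024": 83.3, "GPQA": 78.0,
    "MMMU": 77.3, "TAU-bench Retail": 73.5, "MathVista": 71.0,
    "GPQA Biology": 69.2, "LiveBench": 67.0, "GPQA Chemistry": 64.7,
    "Aider Polyglot": 61.7, "TAU-bench Airline": 54.2,
    "SWE-bench Verified": 48.9, "Codeforces": 47.0, "SimpleQA": 42.6,
    "FrontierMath": 5.5,
}

# Print benchmarks from strongest to weakest, with the rounded percent
# shown in parentheses, matching the listing above.
for name, score in sorted(O1_SCORES.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score} ({round(score)}%)")
```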

Model Comparison

See how o1 stacks up against other leading models across key performance metrics.

All scores are percentages (higher is better).

Model                      MATH   MMLU   GSM8K   HumanEval   GPQA
o1                         96.4   91.8   97.1    88.1        78.0
Gemini 1.5 Pro             86.5   85.9   90.8    84.1        59.1
Claude 3.5 Sonnet          71.1   90.4   96.4    92.0        59.4
Qwen2.5 32B Instruct       83.1   83.3   95.9    88.4        49.5
Llama 3.1 405B Instruct    73.8   87.3   96.8    89.0        50.7
Nova Pro                   76.6   85.9   94.8    89.0        46.9
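
To see which model leads each benchmark, the table can be queried directly. The sketch below copies the scores from the table above; the variable and helper names are illustrative.

```python
# Report the leading model on each benchmark in the comparison table.
# Scores are copied verbatim from the table above.
COMPARISON = {
    "MATH":      {"o1": 96.4, "Gemini 1.5 Pro": 86.5, "Claude 3.5 Sonnet": 71.1,
                  "Qwen2.5 32B Instruct": 83.1, "Llama 3.1 405B Instruct": 73.8,
                  "Nova Pro": 76.6},
    "MMLU":      {"o1": 91.8, "Gemini 1.5 Pro": 85.9, "Claude 3.5 Sonnet": 90.4,
                  "Qwen2.5 32B Instruct": 83.3, "Llama 3.1 405B Instruct": 87.3,
                  "Nova Pro": 85.9},
    "GSM8K":     {"o1": 97.1, "Gemini 1.5 Pro": 90.8, "Claude 3.5 Sonnet": 96.4,
                  "Qwen2.5 32B Instruct": 95.9, "Llama 3.1 405B Instruct": 96.8,
                  "Nova Pro": 94.8},
    "HumanEval": {"o1": 88.1, "Gemini 1.5 Pro": 84.1, "Claude 3.5 Sonnet": 92.0,
                  "Qwen2.5 32B Instruct": 88.4, "Llama 3.1 405B Instruct": 89.0,
                  "Nova Pro": 89.0},
    "GPQA":      {"o1": 78.0, "Gemini 1.5 Pro": 59.1, "Claude 3.5 Sonnet": 59.4,
                  "Qwen2.5 32B Instruct": 49.5, "Llama 3.1 405B Instruct": 50.7,
                  "Nova Pro": 46.9},
}

for bench, scores in COMPARISON.items():
    leader = max(scores, key=scores.get)
    print(f"{bench}: {leader} leads at {scores[leader]}")
```

Run as written, this reports o1 ahead on four of the five benchmarks, with Claude 3.5 Sonnet leading HumanEval at 92.0.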

Detailed Benchmarks

Dive deeper into o1's performance across specific task categories. Each entry shows o1's score alongside the average of the compared models on that benchmark.

Math

GSM8K: 97.1% (average across compared models: 96.2%)
AIME 2024: 83.3% (average: 78.9%)

Coding

HumanEval: 88.1% (average: 83.2%)
Codeforces: 47.0% (average: 57.0%; compared models range from 11.0% to 90.0%)
SWE-bench Verified: 48.9% (average: 50.1%)
Aider Polyglot: 61.7% (average: 53.7%)

Knowledge

MATH: 96.4% (average: 88.9%)
MMLU: 91.8% (average: 89.9%)
GPQA: 78.0% (average: 73.7%)

Uncategorized

MMMU: 77.3% (average: 67.0%)
MathVista: 71.0% (average: 62.3%)
LiveBench: 67.0% (average: 54.7%)
MGSM: 89.3% (average: 85.5%)
SimpleQA: 42.6% (average: 40.4%)
TAU-bench Retail: 73.5% (average: 68.7%)
TAU-bench Airline: 54.2% (average: 45.3%)
FrontierMath: 5.5% (average: 7.3%)

Providers Pricing Coming Soon

We're gathering comprehensive pricing data for o1 from major providers so you can compare costs across platforms and find the best pricing for your use case.


