
Mistral Small 3.1 24B


Mistral AI

Mistral Small 3.1 24B Instruct, an upgraded variant of Mistral Small 3 (2501), combines 24 billion parameters, native image-text understanding, and a 128k-token context window. It delivers state-of-the-art performance for its size across reasoning (MMLU 80.62%, GPQA Diamond 45.96%), coding (HumanEval 88.41%, MBPP 74.71%), math (MATH 69.3%), and multilingual benchmarks (71.18% average across 24 languages), and it leads on vision tasks such as MathVista (68.91%), DocVQA (94.08%), and AI2D (93.72%), outperforming comparable models like Gemma 3 27B and GPT-4o mini in most categories. Released under an Apache 2.0 license and runnable on a single RTX 4090 or a 32 GB MacBook, it is well positioned for fast, private, and open deployment in applications ranging from conversational agents and function calling to document analysis and edge-device multimodal inference.
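Because the weights are open, one common deployment path is serving them locally and querying them through an OpenAI-compatible endpoint. The sketch below assumes vLLM is installed, the server was started with `vllm serve mistralai/Mistral-Small-3.1-24B-Instruct-2503 --tokenizer-mode mistral`, and it listens on localhost:8000; the Hugging Face model ID, flags, and port are assumptions to verify against the official model card, not confirmed details.

```python
# Minimal local-inference sketch: query a locally served Mistral Small 3.1
# through vLLM's OpenAI-compatible API. Model ID, port, and sampling settings
# are assumptions; check the official model card before use.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="mistralai/Mistral-Small-3.1-24B-Instruct-2503",
    messages=[
        {"role": "user", "content": "Summarize the Apache 2.0 license in two sentences."},
    ],
    temperature=0.15,
)
print(response.choices[0].message.content)
```

Since everything runs on local hardware (a single RTX 4090 or a 32 GB MacBook, per the overview), prompts and documents never leave the machine, which is the private-deployment point made above.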

Model Specifications

Technical details and capabilities of Mistral Small 3.1 24B

Core Specifications

Parameters: 24.0B (model size and complexity)

Context window: 128.0K input / 128.0K output tokens

Release date: March 16, 2025

Capabilities & License

Multimodal Support: Supported
Web Hydrated: No
License: Apache 2.0
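Multimodal support means a single request can interleave text and images. A minimal sketch against the hosted API, assuming the mistralai Python client (v1.x) and the mistral-small-latest model alias (both assumptions; see the API reference below), with a hypothetical image URL:

```python
# Mixed image + text prompt via the hosted Mistral API.
# Assumes the mistralai v1.x client and the "mistral-small-latest" alias;
# verify both against the official API reference. The image URL is hypothetical.
import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

response = client.chat.complete(
    model="mistral-small-latest",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What trend does this chart show?"},
            {"type": "image_url", "image_url": "https://example.com/chart.png"},
        ],
    }],
)
print(response.choices[0].message.content)
```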

Resources

API Reference: https://docs.mistral.ai/api/
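The overview also highlights function calling, which the API reference above documents as a tool-use flow. A minimal sketch of the first half of that round trip (declaring a tool and reading the model's tool call), again assuming the mistralai v1.x client; get_weather is a hypothetical tool used only for illustration:

```python
# Function-calling sketch: declare a tool, let the model decide to call it.
# Assumes the mistralai v1.x client and the "mistral-small-latest" alias.
import json
import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.complete(
    model="mistral-small-latest",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
)

# tool_calls can be empty if the model answers directly; arguments arrive as JSON.
call = response.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))
```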

Performance Insights

Check out how Mistral Small 3.1 24B handles various AI tasks through comprehensive benchmark results.

Benchmark scores (all values are percentages):

DocVQA: 94.1
RULER 32k: 94.0
AI2D: 93.7
HumanEval: 88.4
ChartQA: 86.2
RULER 128k: 81.2
MMLU: 80.6
TriviaQA: 80.5
Multilingual European: 75.3
MM-MT-Bench: 73
Multilingual Average: 71.2
MATH: 69.3
Multilingual East Asian: 69.2
Multilingual Middle Eastern: 69.1
MathVista: 68.9
MMLU Pro: 66.8
MMMU: 62.8
MMMU-Pro: 49.3
GPQA Diamond: 46.0
GPQA Main: 44.4
LongBench v2: 37.2
SimpleQA: 10.4

Model Comparison

See how Mistral Small 3.1 24B stacks up against other leading models across key performance metrics.

All scores are percentages.

| Benchmark | Mistral Small 3.1 24B | Grok-2 mini | Grok-2 | Claude 3.5 Sonnet | GPT-4o mini | GPT-4o |
|---|---|---|---|---|---|---|
| MMLU | 80.6 | 86.2 | 87.5 | 90.4 | 82 | 88.7 |
| HumanEval | 88.4 | 85.7 | 88.4 | 92 | 87.2 | 90.2 |
| MATH | 69.3 | 73 | 76.1 | 71.1 | 70.2 | 76.6 |
| MathVista | 68.9 | 68.1 | 69 | 67.7 | 56.7 | 63.8 |
| MMMU | 62.8 | 63.2 | 66.1 | 68.3 | 59.4 | 69.1 |

Detailed Benchmarks

Dive deeper into Mistral Small 3.1 24B's performance across specific task categories. Each benchmark is listed with the model's score and the average across compared models for context.

Knowledge

SimpleQA: 10.4 (average across compared models: 26.4%)
TriviaQA: 80.5 (average: 78.0%)

Uncategorized

MMMU-Pro: 49.3 (average: 43.4%)
MathVista: 68.9 (average: 62.9%)
MM-MT-Bench: 73 (average: 73.5%)
DocVQA: 94.1 (average: 92.9%)
AI2D: 93.7 (average: 90.5%)
LongBench v2: 37.2 (average: 42.9%)

Provider Pricing (Coming Soon)

We're working on gathering comprehensive pricing data from all major providers for Mistral Small 3.1 24B. Compare costs across platforms to find the best pricing for your use case.

OpenAI
Anthropic
Google
Mistral AI
Cohere
