Phi-3.5-vision-instruct

Microsoft

Phi-3.5-vision-instruct is an open-source multimodal model boasting 4.2 billion parameters and a large 128K context window. Designed with a focus on advanced image understanding and logical reasoning, it excels in both single-image tasks and complex multi-image analysis, including comparison, summarization, and video processing. Enhanced with safety-focused post-training, the model demonstrates improved instruction-following, alignment, and resilience when processing diverse visual and textual data. It is available for use under the permissive MIT license.
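To illustrate what the multi-image analysis described above looks like in practice, here is a minimal usage sketch following the Hugging Face Transformers pattern shown on the model card. It assumes a recent transformers release, a CUDA GPU, and trust_remote_code enabled; the image URLs are placeholders you would replace with your own.

```python
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3.5-vision-instruct"

# Load the model and processor; the model card uses trust_remote_code=True.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    _attn_implementation="eager",  # "flash_attention_2" if available
)
processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
    num_crops=4,  # model card suggests 4 for multi-frame, 16 for single-image tasks
)

# Placeholder URLs: swap in the images you want the model to compare.
image_urls = [
    "https://example.com/slide1.png",
    "https://example.com/slide2.png",
]
images = [Image.open(requests.get(url, stream=True).raw) for url in image_urls]

# Each image is referenced in the prompt by an <|image_i|> placeholder.
placeholders = "".join(f"<|image_{i + 1}|>\n" for i in range(len(images)))
messages = [
    {
        "role": "user",
        "content": placeholders + "Compare these two slides and summarize the key differences.",
    }
]

prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt, images, return_tensors="pt").to("cuda")

generate_ids = model.generate(
    **inputs,
    max_new_tokens=500,
    do_sample=False,
    eos_token_id=processor.tokenizer.eos_token_id,
)

# Strip the prompt tokens before decoding the model's answer.
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)
```

Treat the generation settings above as illustrative defaults rather than tuned values; the num_crops choice in particular should follow the single-image versus multi-frame guidance on the model card.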

Model Specifications

Technical details and capabilities of Phi-3.5-vision-instruct

Core Specifications

4.2B Parameters

Model size and complexity

500B Training Tokens

Amount of data used in training

128K / 128K

Input / Output tokens

September 30, 2023

Knowledge cutoff date

August 22, 2024

Release date

Capabilities & License

Multimodal Support
Supported
Web Hydrated
No
License
MIT

Resources

Research Paper
https://arxiv.org/abs/2404.14219
API Reference
https://huggingface.co/microsoft/Phi-3.5-vision-instruct

Performance Insights

Check out how Phi-3.5-vision-instruct handles various AI tasks through comprehensive benchmark results.

Benchmark scores (0-100 scale):

ScienceQA: 91.3
POPE: 86.1
MMBench: 81.9
ChartQA: 81.8
AI2D: 78.1
TextVQA: 72.0
MathVista: 43.9
MMMU: 43.0
InterGPS: 36.3

Detailed Benchmarks

Dive deeper into Phi-3.5-vision-instruct's performance across specific task categories. Expand each section to see detailed metrics and comparisons.

MMMU: 43.0 (average across compared models: 44.4%)
MathVista: 43.9 (average across compared models: 45.4%)
AI2D: 78.1 (average across compared models: 90.5%)
ChartQA: 81.8 (average across compared models: 83.9%)
TextVQA: 72.0 (average across compared models: 75.0%)

Providers Pricing Coming Soon

We're working on gathering comprehensive pricing data from all major providers for Phi-3.5-vision-instruct. Compare costs across platforms to find the best pricing for your use case.

