
Global AI Model Benchmarks (April 2026)

Created by Adrian Dunkley | maestrosai.com | ceo@maestrosai.com | Fair Use

This page aggregates published scores for current-generation large and small language models on the major public benchmarks. Scores come from each vendor’s official release notes and model cards, independent evaluations (Artificial Analysis, lmarena.ai, SEAL leaderboards), and peer-reviewed work where available. Reading the tables:
  • "~" means rounded to the nearest whole percent.
  • "—" means not reported by the vendor or not applicable.
  • All scores are as of April 2026.
  • Benchmarks evolve fast. Always verify against the vendor’s current card before making a procurement decision.
LAC-specific performance lives in a separate page: see lac-benchmark.md.

Benchmarks covered

| Benchmark | What it measures | Why it matters |
| --- | --- | --- |
| MMLU-Pro | Multi-task knowledge across 14 subjects | General-knowledge proxy |
| GPQA Diamond | Graduate-level physics, chemistry, biology | Hard reasoning in STEM |
| SWE-bench Verified | Real GitHub issue resolution (coding) | Software-engineering quality |
| HumanEval+ | Function-level code generation | Coding basics |
| MATH | Competition-level math | Math reasoning |
| ARC-AGI-2 | Abstract visual reasoning | Novel-problem reasoning |
| MMMU | Multimodal college-level QA | Image + text reasoning |
| Long-context (needle-in-haystack and RULER) | Information retrieval over very long inputs | Production long-doc tasks |
| HELM / Artificial Analysis Intelligence Index | Composite rank | Headline general intelligence |

Frontier large language models

| Model | MMLU-Pro | GPQA-D | SWE-bench V | HumanEval+ | MATH | ARC-AGI-2 | MMMU | Long ctx |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Claude Opus 4.7 (1M) | ~86% | ~84% | ~79% | ~92% | ~92% | ~45% | ~80% | Strong, 1M |
| Claude Sonnet 4.6 | ~81% | ~78% | ~73% | ~89% | ~88% | ~30% | ~75% | Strong, 200K |
| Claude Haiku 4.5 | ~72% | ~60% | ~55% | ~82% | ~76% | ~14% | ~64% | Strong, 200K |
| GPT-5.4 Pro | ~86% | ~83% | ~77% | ~91% | ~93% | ~42% | ~79% | Strong, 400K |
| GPT-5.4 Thinking | ~85% | ~82% | ~76% | ~91% | ~93% | ~40% | ~79% | Strong, 400K |
| GPT-5.4 | ~82% | ~78% | ~71% | ~88% | ~89% | ~32% | ~76% | Strong, 400K |
| GPT-5.3 Instant | ~76% | ~65% | ~60% | ~83% | ~79% | ~18% | ~68% | Strong, 200K |
| Gemini 3.1 Pro | ~86% | ~85% | ~77% | ~90% | ~91% | ~77% | ~82% | Strong, 1M |
| Gemini 3.1 Flash | ~78% | ~70% | ~62% | ~85% | ~82% | ~35% | ~72% | Strong, 1M |
| Gemini 3.1 Flash Lite | ~70% | ~55% | ~48% | ~78% | ~70% | ~12% | ~60% | Strong, 1M |
| Mistral Large (2026) | ~75% | ~65% | ~55% | ~84% | ~78% | ~20% | ~65% | 128K |
| DeepSeek V3 | ~77% | ~70% | ~58% | ~85% | ~82% | ~22% | — | 128K |
Sources: Anthropic model cards for Claude 4.x; OpenAI release notes for GPT-5.x; Google AI for Developers for Gemini 3.1; Mistral release notes; DeepSeek technical report. ARC-AGI-2 scores cross-validated via arcprize.org leaderboard.

Open-weight frontier and mid-size

| Model | MMLU-Pro | GPQA-D | SWE-bench V | HumanEval+ | MATH | Long ctx |
| --- | --- | --- | --- | --- | --- | --- |
| Llama 4 Maverick (17B active, 400B total MoE) | ~78% | ~72% | ~60% | ~87% | ~83% | 1M |
| Llama 4 Scout (17B active, 109B total MoE) | ~74% | ~65% | ~50% | ~82% | ~74% | 10M |
| Llama 3.3 70B | ~68% | ~55% | ~42% | ~80% | ~68% | 128K |
| Qwen 3 32B | ~73% | ~67% | ~55% | ~85% | ~81% | 128K |
| Qwen 3 14B | ~68% | ~55% | ~48% | ~80% | ~72% | 128K |
| DeepSeek R1 distill 32B | ~72% | ~68% | ~52% | ~82% | ~89% | 128K |
| Mistral NeMo 12B | ~60% | ~45% | ~30% | ~70% | ~60% | 128K |
Sources: Hugging Face model cards; Artificial Analysis leaderboards; vendor release notes.

Small Language Models (≤ 15B active parameters)

These are the most relevant models for offline, privacy-first, and on-device deployments; a local-inference sketch follows the table. See the SLM section for deployment guidance.
| Model | Size | MMLU-Pro | GPQA-D | HumanEval+ | MATH | License |
| --- | --- | --- | --- | --- | --- | --- |
| Phi-4 | 14B | ~70% | ~58% | ~80% | ~80% | MIT |
| Phi-4 Mini | 3.8B | ~56% | ~38% | ~65% | ~60% | MIT |
| Gemma 4 27B | 27B | ~72% | ~58% | ~82% | ~74% | Gemma License |
| Gemma 4 9B | 9B | ~65% | ~51% | ~75% | ~67% | Gemma License |
| Gemma 4 2B | 2B | ~48% | ~28% | ~55% | ~45% | Gemma License |
| Qwen 3 7B | 7B | ~62% | ~50% | ~72% | ~65% | Apache 2.0 |
| Mistral 7B / Small | 7B | ~55% | ~38% | ~65% | ~50% | Apache 2.0 |
| Llama 3.3 8B | 8B | ~56% | ~40% | ~66% | ~52% | Llama Community |
| IBM Granite 3 8B | 8B | ~58% | ~40% | ~65% | ~55% | Apache 2.0 |
| SmolLM 2 1.7B | 1.7B | ~38% | ~20% | ~42% | ~30% | Apache 2.0 |
Sources: vendor model cards on Hugging Face; Artificial Analysis SLM leaderboard; Open LLM Leaderboard (April 2026 refresh).
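
For teams evaluating these SLMs offline, the sketch below shows one way to load a small model for local inference with Hugging Face transformers. It is a minimal sketch, not a recommendation: the model id and prompt are placeholders, so substitute whatever checkpoint you actually deploy.

```python
# Minimal local-inference sketch for a small language model.
# Assumes `pip install transformers torch`; the model id is a placeholder.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="microsoft/phi-4",  # placeholder id; any small causal LM works
    device_map="auto",        # uses a GPU if available, otherwise CPU
)

result = generator(
    "Summarize our refund policy in two sentences:",
    max_new_tokens=128,
    do_sample=False,  # deterministic output, easier to compare across runs
)
print(result[0]["generated_text"])
```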

Multimodal standings

| Model | MMMU | MathVista | ChartQA | OCR-bench |
| --- | --- | --- | --- | --- |
| Gemini 3.1 Pro | ~82% | ~75% | ~89% | ~88% |
| GPT-5.4 | ~76% | ~72% | ~87% | ~85% |
| Claude Opus 4.7 | ~80% | ~74% | ~86% | ~86% |
| Llama 4 Maverick | ~70% | ~60% | ~78% | ~76% |
| Gemma 4 9B | ~58% | ~48% | ~70% | ~66% |
| Phi-4 multimodal (preview) | ~55% | ~45% | ~68% | ~65% |

Long-context reliability (RULER average at 128K)

| Model | Context window | RULER avg @ 128K |
| --- | --- | --- |
| Gemini 3.1 Pro | 1M | ~92% |
| Claude Opus 4.7 | 1M | ~91% |
| GPT-5.4 | 400K | ~89% |
| Claude Sonnet 4.6 | 200K | ~88% |
| Llama 4 Scout | 10M | ~85% |
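
If you want to sanity-check long-context recall yourself, a toy needle-in-a-haystack probe is easy to build. The sketch below assumes a hypothetical `ask_model(prompt) -> str` client you supply; it is a simplified stand-in for the full RULER suite, not a replacement for it.

```python
# Toy needle-in-a-haystack probe: bury one known fact at a chosen depth in
# filler text, then check whether the model's answer recalls it.
# `ask_model(prompt) -> str` is a hypothetical client you write yourself.
FILLER = "The market opened on time and the morning passed without incident. "
NEEDLE = "The vault passcode is 4417. "

def build_prompt(depth: float, target_chars: int = 200_000) -> str:
    """Place NEEDLE at relative position `depth` (0.0 = start, 1.0 = end)."""
    body = FILLER * (target_chars // len(FILLER))
    cut = int(len(body) * depth)
    return body[:cut] + NEEDLE + body[cut:] + "\n\nWhat is the vault passcode?"

def recalled(answer: str) -> bool:
    return "4417" in answer

# Sweep insertion depths; a reliable long-context model passes at every depth:
# for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
#     print(depth, recalled(ask_model(build_prompt(depth))))
```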

Cost per million tokens (April 2026, in USD)

Listed prices are approximations for SMB planning, and the monthly column blends input and output usage; a cost-estimation sketch follows the table. Always verify current pricing with the vendor.
| Model | Input / output (USD per 1M tokens) | Practical monthly cost at 5M tokens |
| --- | --- | --- |
| Claude Opus 4.7 | $15 / $75 | $225-375 |
| Claude Sonnet 4.6 | $3 / $15 | $45-75 |
| Claude Haiku 4.5 | $0.80 / $4 | $12-20 |
| GPT-5.4 Pro | ~$15 / $75 | $225-375 |
| GPT-5.4 | ~$3 / $15 | $45-75 |
| GPT-5.3 Instant | ~$0.80 / $4 | $12-20 |
| Gemini 3.1 Pro | ~$5 / $25 | $75-125 |
| Gemini 3.1 Flash | ~$0.30 / $2.50 | $7-12 |
| Gemini 3.1 Flash Lite | ~$0.10 / $0.40 | $2-3 |
| Llama 4 Maverick (via Groq, Together) | ~$0.50 / $0.80 | $5-10 |
| Llama 4 Scout (via Groq, Together) | ~$0.20 / $0.40 | $2-3 |
| Mistral Large | ~$2 / $6 | $20-30 |
| Phi-4 (self-hosted) | Hardware only | $0 marginal |
| Gemma 4 9B (self-hosted) | Hardware only | $0 marginal |
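
To turn per-token prices into a budget, multiply the expected monthly input and output volumes by the respective rates. The sketch below is a back-of-the-envelope helper; the prices passed in are illustrative, not quotes.

```python
# Back-of-the-envelope monthly cost from token volumes and per-1M prices.
def monthly_cost(input_m: float, output_m: float,
                 in_price: float, out_price: float) -> float:
    """Token volumes in millions of tokens; prices in USD per 1M tokens."""
    return input_m * in_price + output_m * out_price

# Example: 4M input + 1M output tokens on a $3-in / $15-out model:
print(monthly_cost(4, 1, in_price=3.00, out_price=15.00))  # -> 27.0 USD
```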

Notes on benchmark interpretation

  • Benchmark gaming is real. Small score gaps (1-3 points) rarely translate to visible differences in production.
  • Reasoning modes inflate scores. Models with “thinking” modes often publish their best numbers with extended reasoning on. Budget accordingly: extended reasoning consumes far more output tokens.
  • Tool-use benchmarks (SWE-bench) are the closest proxy for real agent performance.
  • ARC-AGI-2 is the current gold standard for novel-problem reasoning. Gemini 3.1 Pro’s lead here is the notable 2026 story.
  • No single benchmark captures LAC usefulness. See lac-benchmark.md for that.

How to use this page

  1. Identify the dimension that matters most for your task (coding, long context, math, multimodal, cost).
  2. Pick the top 2 or 3 models in that column.
  3. Run your own eval on 20 real queries from your business (a minimal harness sketch follows this list). The winner on your data is rarely the global leader.
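
One way to run step 3 is a small harness that scores every candidate on the same query set. The sketch below assumes a hypothetical `call_model(model, prompt) -> str` adapter around each vendor's SDK and a simple exact-substring pass criterion; the model names and file name are placeholders, so adapt both to your task.

```python
# Minimal eval harness: score candidate models on your own queries.
# `call_model(model, prompt) -> str` is a hypothetical adapter you write
# around each vendor's SDK; model names below are placeholders.
import json

def run_eval(queries_path: str, models: list[str]) -> dict[str, float]:
    """queries_path: JSONL file, one {"prompt": ..., "expected": ...} per line."""
    with open(queries_path) as f:
        cases = [json.loads(line) for line in f]
    scores = {}
    for model in models:
        hits = sum(
            case["expected"].lower() in call_model(model, case["prompt"]).lower()
            for case in cases
        )
        scores[model] = hits / len(cases)  # fraction answered correctly
    return scores

# scores = run_eval("my_20_queries.jsonl", ["model-a", "model-b"])
# print(max(scores, key=scores.get))
```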


Created by Adrian Dunkley | MaestrosAI | maestrosai.com | ceo@maestrosai.com | Fair Use, Educational Resource | April 2026 | Licensed under Creative Commons BY 4.0.