
Global AI Model Benchmarks (April 2026)

Created by Adrian Dunkley | maestrosai.com | ceo@maestrosai.com | Fair Use

This page aggregates published scores for current-generation large and small language models on the major public benchmarks. Scores come from each vendor’s official release notes and model cards, independent evaluations (Artificial Analysis, lmarena.ai, SEAL leaderboards), and peer-reviewed work where available. Reading the tables:
  • "~" means rounded to the nearest whole percent.
  • "—" means not reported by the vendor or not applicable.
  • All scores are as of April 2026.
  • Benchmarks evolve fast. Always verify against the vendor’s current card before making a procurement decision.
LAC-specific performance lives in a separate page: see lac-benchmark.md.

Benchmarks covered

| Benchmark | What it measures | Why it matters |
| --- | --- | --- |
| MMLU-Pro | Multi-task knowledge across 14 subjects | General-knowledge proxy |
| GPQA Diamond | Graduate-level physics, chemistry, biology | Hard reasoning in STEM |
| SWE-bench Verified | Real GitHub issue resolution (coding) | Software-engineering quality |
| HumanEval+ | Function-level code generation | Coding basics |
| MATH | Competition-level math | Math reasoning |
| ARC-AGI-2 | Abstract visual reasoning | Novel-problem reasoning |
| MMMU | Multimodal college-level QA | Image + text reasoning |
| Long-context (needle-in-haystack and RULER) | Information retrieval over very long inputs | Production long-doc tasks |
| HELM / Artificial Analysis Intelligence Index | Composite rank | Headline general intelligence |

Frontier large language models

| Model | MMLU-Pro | GPQA-D | SWE-bench V | HumanEval+ | MATH | ARC-AGI-2 | MMMU | Long ctx |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Claude Opus 4.7 (1M) | ~86% | ~84% | ~79% | ~92% | ~92% | ~45% | ~80% | Strong, 1M |
| Claude Sonnet 4.6 | ~81% | ~78% | ~73% | ~89% | ~88% | ~30% | ~75% | Strong, 200K |
| Claude Haiku 4.5 | ~72% | ~60% | ~55% | ~82% | ~76% | ~14% | ~64% | Strong, 200K |
| GPT-5.4 Pro | ~86% | ~83% | ~77% | ~91% | ~93% | ~42% | ~79% | Strong, 400K |
| GPT-5.4 Thinking | ~85% | ~82% | ~76% | ~91% | ~93% | ~40% | ~79% | Strong, 400K |
| GPT-5.4 | ~82% | ~78% | ~71% | ~88% | ~89% | ~32% | ~76% | Strong, 400K |
| GPT-5.3 Instant | ~76% | ~65% | ~60% | ~83% | ~79% | ~18% | ~68% | Strong, 200K |
| Gemini 3.1 Pro | ~86% | ~85% | ~77% | ~90% | ~91% | ~77% | ~82% | Strong, 1M |
| Gemini 3.1 Flash | ~78% | ~70% | ~62% | ~85% | ~82% | ~35% | ~72% | Strong, 1M |
| Gemini 3.1 Flash Lite | ~70% | ~55% | ~48% | ~78% | ~70% | ~12% | ~60% | Strong, 1M |
| Mistral Large (2026) | ~75% | ~65% | ~55% | ~84% | ~78% | ~20% | ~65% | 128K |
| DeepSeek V3 | ~77% | ~70% | ~58% | ~85% | ~82% | ~22% | — | 128K |
Sources: Anthropic model cards for Claude 4.x; OpenAI release notes for GPT-5.x; Google AI for Developers for Gemini 3.1; Mistral release notes; DeepSeek technical report. ARC-AGI-2 scores cross-validated via arcprize.org leaderboard.

Open-weight frontier and mid-size

| Model | MMLU-Pro | GPQA-D | SWE-bench V | HumanEval+ | MATH | Long ctx |
| --- | --- | --- | --- | --- | --- | --- |
| Llama 4 Maverick (17B active, 400B total MoE) | ~78% | ~72% | ~60% | ~87% | ~83% | 1M |
| Llama 4 Scout (17B active, 109B total MoE) | ~74% | ~65% | ~50% | ~82% | ~74% | 10M |
| Llama 3.3 70B | ~68% | ~55% | ~42% | ~80% | ~68% | 128K |
| Qwen 3 32B | ~73% | ~67% | ~55% | ~85% | ~81% | 128K |
| Qwen 3 14B | ~68% | ~55% | ~48% | ~80% | ~72% | 128K |
| DeepSeek R1 distill 32B | ~72% | ~68% | ~52% | ~82% | ~89% | 128K |
| Mistral NeMo 12B | ~60% | ~45% | ~30% | ~70% | ~60% | 128K |
Sources: Hugging Face model cards; Artificial Analysis leaderboards; vendor release notes.

Small Language Models (≤ 15B active parameters)

These are the most relevant models for offline, privacy-first, and on-device deployments; a local-inference sketch follows the table. See the SLM section for deployment guidance.
| Model | Size | MMLU-Pro | GPQA-D | HumanEval+ | MATH | License |
| --- | --- | --- | --- | --- | --- | --- |
| Phi-4 | 14B | ~70% | ~58% | ~80% | ~80% | MIT |
| Phi-4 Mini | 3.8B | ~56% | ~38% | ~65% | ~60% | MIT |
| Gemma 4 27B | 27B | ~72% | ~58% | ~82% | ~74% | Gemma License |
| Gemma 4 9B | 9B | ~65% | ~51% | ~75% | ~67% | Gemma License |
| Gemma 4 2B | 2B | ~48% | ~28% | ~55% | ~45% | Gemma License |
| Qwen 3 7B | 7B | ~62% | ~50% | ~72% | ~65% | Apache 2.0 |
| Mistral 7B / Small | 7B | ~55% | ~38% | ~65% | ~50% | Apache 2.0 |
| Llama 3.3 8B | 8B | ~56% | ~40% | ~66% | ~52% | Llama Community |
| IBM Granite 3 8B | 8B | ~58% | ~40% | ~65% | ~55% | Apache 2.0 |
| SmolLM 2 1.7B | 1.7B | ~38% | ~20% | ~42% | ~30% | Apache 2.0 |
Sources: vendor model cards on Hugging Face; Artificial Analysis SLM leaderboard; Open LLM Leaderboard (April 2026 refresh).
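
For teams evaluating these SLMs offline, the sketch below shows one way to load a small model for local inference with Hugging Face transformers. It is a minimal sketch, not a recommendation: the model id and prompt are placeholders, so substitute whatever checkpoint you actually deploy.

```python
# Minimal local-inference sketch for a small language model.
# Assumes `pip install transformers torch`; the model id is a placeholder.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="microsoft/phi-4",  # placeholder id; any small causal LM works
    device_map="auto",        # uses a GPU if available, otherwise CPU
)

result = generator(
    "Summarize our refund policy in two sentences:",
    max_new_tokens=128,
    do_sample=False,  # deterministic output, easier to compare across runs
)
print(result[0]["generated_text"])
```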

Multimodal standings

| Model | MMMU | MathVista | ChartQA | OCR-bench |
| --- | --- | --- | --- | --- |
| Gemini 3.1 Pro | ~82% | ~75% | ~89% | ~88% |
| GPT-5.4 | ~76% | ~72% | ~87% | ~85% |
| Claude Opus 4.7 | ~80% | ~74% | ~86% | ~86% |
| Llama 4 Maverick | ~70% | ~60% | ~78% | ~76% |
| Gemma 4 9B | ~58% | ~48% | ~70% | ~66% |
| Phi-4 multimodal (preview) | ~55% | ~45% | ~68% | ~65% |

Long-context reliability (RULER average at 128K)

| Model | Context window | RULER avg @ 128K |
| --- | --- | --- |
| Gemini 3.1 Pro | 1M | ~92% |
| Claude Opus 4.7 | 1M | ~91% |
| GPT-5.4 | 400K | ~89% |
| Claude Sonnet 4.6 | 200K | ~88% |
| Llama 4 Scout | 10M | ~85% |
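
If you want to sanity-check long-context recall yourself, a toy needle-in-a-haystack probe is easy to build. The sketch below assumes a hypothetical `ask_model(prompt) -> str` client you supply; it is a simplified stand-in for the full RULER suite, not a replacement for it.

```python
# Toy needle-in-a-haystack probe: bury one known fact at a chosen depth in
# filler text, then check whether the model's answer recalls it.
# `ask_model(prompt) -> str` is a hypothetical client you write yourself.
FILLER = "The market opened on time and the morning passed without incident. "
NEEDLE = "The vault passcode is 4417. "

def build_prompt(depth: float, target_chars: int = 200_000) -> str:
    """Place NEEDLE at relative position `depth` (0.0 = start, 1.0 = end)."""
    body = FILLER * (target_chars // len(FILLER))
    cut = int(len(body) * depth)
    return body[:cut] + NEEDLE + body[cut:] + "\n\nWhat is the vault passcode?"

def recalled(answer: str) -> bool:
    return "4417" in answer

# Sweep insertion depths; a reliable long-context model passes at every depth:
# for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
#     print(depth, recalled(ask_model(build_prompt(depth))))
```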

Cost per million tokens (April 2026, in USD)

Listed prices are approximations for SMB planning, and the monthly column blends input and output usage; a cost-estimation sketch follows the table. Always verify current pricing with the vendor.
| Model | Input / output (USD per 1M tokens) | Practical monthly cost at 5M tokens |
| --- | --- | --- |
| Claude Opus 4.7 | $15 / $75 | $225-375 |
| Claude Sonnet 4.6 | $3 / $15 | $45-75 |
| Claude Haiku 4.5 | $0.80 / $4 | $12-20 |
| GPT-5.4 Pro | ~$15 / $75 | $225-375 |
| GPT-5.4 | ~$3 / $15 | $45-75 |
| GPT-5.3 Instant | ~$0.80 / $4 | $12-20 |
| Gemini 3.1 Pro | ~$5 / $25 | $75-125 |
| Gemini 3.1 Flash | ~$0.30 / $2.50 | $7-12 |
| Gemini 3.1 Flash Lite | ~$0.10 / $0.40 | $2-3 |
| Llama 4 Maverick (via Groq, Together) | ~$0.50 / $0.80 | $5-10 |
| Llama 4 Scout (via Groq, Together) | ~$0.20 / $0.40 | $2-3 |
| Mistral Large | ~$2 / $6 | $20-30 |
| Phi-4 (self-hosted) | Hardware only | $0 marginal |
| Gemma 4 9B (self-hosted) | Hardware only | $0 marginal |
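
To turn per-token prices into a budget, multiply the expected monthly input and output volumes by the respective rates. The sketch below is a back-of-the-envelope helper; the prices passed in are illustrative, not quotes.

```python
# Back-of-the-envelope monthly cost from token volumes and per-1M prices.
def monthly_cost(input_m: float, output_m: float,
                 in_price: float, out_price: float) -> float:
    """Token volumes in millions of tokens; prices in USD per 1M tokens."""
    return input_m * in_price + output_m * out_price

# Example: 4M input + 1M output tokens on a $3-in / $15-out model:
print(monthly_cost(4, 1, in_price=3.00, out_price=15.00))  # -> 27.0 USD
```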

Notes on benchmark interpretation

  • Benchmark gaming is real. Small score gaps (1-3 points) rarely translate to visible differences in production.
  • Reasoning modes inflate scores. Models with “thinking” modes often publish their best numbers with extended reasoning on. Budget accordingly: extended reasoning consumes far more output tokens.
  • Tool-use benchmarks (SWE-bench) are the closest proxy for real agent performance.
  • ARC-AGI-2 is the current gold standard for novel-problem reasoning. Gemini 3.1 Pro’s lead here is the notable 2026 story.
  • No single benchmark captures LAC usefulness. See lac-benchmark.md for that.

How to use this page

  1. Identify the dimension that matters most for your task (coding, long context, math, multimodal, cost).
  2. Pick the top 2 or 3 models in that column.
  3. Run your own eval on 20 real queries from your business (a minimal harness sketch follows this list). The winner on your data is rarely the global leader.
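
One way to run step 3 is a small harness that scores every candidate on the same query set. The sketch below assumes a hypothetical `call_model(model, prompt) -> str` adapter around each vendor's SDK and a simple exact-substring pass criterion; the model names and file name are placeholders, so adapt both to your task.

```python
# Minimal eval harness: score candidate models on your own queries.
# `call_model(model, prompt) -> str` is a hypothetical adapter you write
# around each vendor's SDK; model names below are placeholders.
import json

def run_eval(queries_path: str, models: list[str]) -> dict[str, float]:
    """queries_path: JSONL file, one {"prompt": ..., "expected": ...} per line."""
    with open(queries_path) as f:
        cases = [json.loads(line) for line in f]
    scores = {}
    for model in models:
        hits = sum(
            case["expected"].lower() in call_model(model, case["prompt"]).lower()
            for case in cases
        )
        scores[model] = hits / len(cases)  # fraction answered correctly
    return scores

# scores = run_eval("my_20_queries.jsonl", ["model-a", "model-b"])
# print(max(scores, key=scores.get))
```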


Created by Adrian Dunkley | MaestrosAI | maestrosai.com | ceo@maestrosai.com | Fair Use, Educational Resource | April 2026 | Licensed under Creative Commons BY 4.0.