
# Global AI Model Benchmarks (April 2026)

> **Created by Adrian Dunkley** | [maestrosai.com](https://maestrosai.com) | [ceo@maestrosai.com](mailto:ceo@maestrosai.com) | Fair Use

***

This page aggregates published scores for current-generation large and small language models on the major public benchmarks. Scores come from each vendor's official release notes and model cards, independent evaluations (Artificial Analysis, lmarena.ai, SEAL leaderboards), and peer-reviewed work where available.

**Reading this table:**

* "\~" means rounded to the nearest whole percent.
* "—" means not reported by the vendor or not applicable.
* All scores are as of **April 2026**.
* Benchmarks evolve fast. Always verify against the vendor's current card before making a procurement decision.

**LAC-specific performance** lives in a separate page: see [lac-benchmark.md](lac-benchmark.md).

***

## Benchmarks covered

| Benchmark                                         | What it measures                            | Why it matters                |
| ------------------------------------------------- | ------------------------------------------- | ----------------------------- |
| **MMLU-Pro**                                      | Multi-task knowledge across 14 subjects     | General-knowledge proxy       |
| **GPQA Diamond**                                  | Graduate-level physics, chemistry, biology  | Hard reasoning in STEM        |
| **SWE-bench Verified**                            | Real GitHub issue resolution (coding)       | Software-engineering quality  |
| **HumanEval+**                                    | Function-level code generation              | Coding basics                 |
| **MATH**                                          | Competition-level math                      | Math reasoning                |
| **ARC-AGI-2**                                     | Abstract visual reasoning                   | Novel-problem reasoning       |
| **MMMU**                                          | Multimodal college-level QA                 | Image + text reasoning        |
| **Long-context (needle-in-haystack and RULER)**   | Information retrieval over very long inputs | Production long-doc tasks     |
| **HELM / Artificial Analysis Intelligence Index** | Composite rank                              | Headline general intelligence |

***

## Frontier large language models

| Model                     | MMLU-Pro | GPQA-D | SWE-bench V | HumanEval+ | MATH  | ARC-AGI-2 | MMMU  | Long ctx     |
| ------------------------- | -------- | ------ | ----------- | ---------- | ----- | --------- | ----- | ------------ |
| **Claude Opus 4.7 (1M)**  | \~86%    | \~84%  | \~79%       | \~92%      | \~92% | \~45%     | \~80% | Strong, 1M   |
| **Claude Sonnet 4.6**     | \~81%    | \~78%  | \~73%       | \~89%      | \~88% | \~30%     | \~75% | Strong, 200K |
| **Claude Haiku 4.5**      | \~72%    | \~60%  | \~55%       | \~82%      | \~76% | \~14%     | \~64% | Strong, 200K |
| **GPT-5.4 Pro**           | \~86%    | \~83%  | \~77%       | \~91%      | \~93% | \~42%     | \~79% | Strong, 400K |
| **GPT-5.4 Thinking**      | \~85%    | \~82%  | \~76%       | \~91%      | \~93% | \~40%     | \~79% | Strong, 400K |
| **GPT-5.4**               | \~82%    | \~78%  | \~71%       | \~88%      | \~89% | \~32%     | \~76% | Strong, 400K |
| **GPT-5.3 Instant**       | \~76%    | \~65%  | \~60%       | \~83%      | \~79% | \~18%     | \~68% | Strong, 200K |
| **Gemini 3.1 Pro**        | \~86%    | \~85%  | \~77%       | \~90%      | \~91% | **\~77%** | \~82% | Strong, 1M   |
| **Gemini 3.1 Flash**      | \~78%    | \~70%  | \~62%       | \~85%      | \~82% | \~35%     | \~72% | Strong, 1M   |
| **Gemini 3.1 Flash Lite** | \~70%    | \~55%  | \~48%       | \~78%      | \~70% | \~12%     | \~60% | Strong, 1M   |
| **Mistral Large (2026)**  | \~75%    | \~65%  | \~55%       | \~84%      | \~78% | \~20%     | \~65% | 128K         |
| **DeepSeek V3**           | \~77%    | \~70%  | \~58%       | \~85%      | \~82% | \~22%     | —     | 128K         |

**Sources**: Anthropic model cards for Claude 4.x; OpenAI release notes for GPT-5.x; Google AI for Developers for Gemini 3.1; Mistral release notes; DeepSeek technical report. ARC-AGI-2 scores cross-validated via [arcprize.org](https://arcprize.org) leaderboard.

***

## Open-weight frontier and mid-size

| Model                                             | MMLU-Pro | GPQA-D | SWE-bench V | HumanEval+ | MATH  | Long ctx |
| ------------------------------------------------- | -------- | ------ | ----------- | ---------- | ----- | -------- |
| **Llama 4 Maverick** (17B active, 400B total MoE) | \~78%    | \~72%  | \~60%       | \~87%      | \~83% | 1M       |
| **Llama 4 Scout** (17B active, 109B total MoE)    | \~74%    | \~65%  | \~50%       | \~82%      | \~74% | **10M**  |
| **Llama 3.3 70B**                                 | \~68%    | \~55%  | \~42%       | \~80%      | \~68% | 128K     |
| **Qwen 3 32B**                                    | \~73%    | \~67%  | \~55%       | \~85%      | \~81% | 128K     |
| **Qwen 3 14B**                                    | \~68%    | \~55%  | \~48%       | \~80%      | \~72% | 128K     |
| **DeepSeek R1 distill 32B**                       | \~72%    | \~68%  | \~52%       | \~82%      | \~89% | 128K     |
| **Mistral NeMo 12B**                              | \~60%    | \~45%  | \~30%       | \~70%      | \~60% | 128K     |

**Sources**: Hugging Face model cards; Artificial Analysis leaderboards; vendor release notes.

***

## Small Language Models (≤ 15B active parameters)

These are the most relevant for offline, privacy-first, and on-device deployments. See [SLM section](../slm/README.md) for deployment guidance.

| Model                  | Size | MMLU-Pro | GPQA-D | HumanEval+ | MATH  | License         |
| ---------------------- | ---- | -------- | ------ | ---------- | ----- | --------------- |
| **Phi-4**              | 14B  | \~70%    | \~58%  | \~80%      | \~80% | MIT             |
| **Phi-4 Mini**         | 3.8B | \~56%    | \~38%  | \~65%      | \~60% | MIT             |
| **Gemma 4 27B**        | 27B  | \~72%    | \~58%  | \~82%      | \~74% | Gemma License   |
| **Gemma 4 9B**         | 9B   | \~65%    | \~51%  | \~75%      | \~67% | Gemma License   |
| **Gemma 4 2B**         | 2B   | \~48%    | \~28%  | \~55%      | \~45% | Gemma License   |
| **Qwen 3 7B**          | 7B   | \~62%    | \~50%  | \~72%      | \~65% | Apache 2.0      |
| **Mistral 7B / Small** | 7B   | \~55%    | \~38%  | \~65%      | \~50% | Apache 2.0      |
| **Llama 3.1 8B**       | 8B   | \~56%    | \~40%  | \~66%      | \~52% | Llama Community |
| **IBM Granite 3 8B**   | 8B   | \~58%    | \~40%  | \~65%      | \~55% | Apache 2.0      |
| **SmolLM 2 1.7B**      | 1.7B | \~38%    | \~20%  | \~42%      | \~30% | Apache 2.0      |

**Sources**: vendor model cards on Hugging Face; Artificial Analysis SLM leaderboard; Open LLM Leaderboard (April 2026 refresh).

***

## Multimodal standings

| Model                      | MMMU  | MathVista | ChartQA | OCR-bench |
| -------------------------- | ----- | --------- | ------- | --------- |
| Gemini 3.1 Pro             | \~82% | \~75%     | \~89%   | \~88%     |
| GPT-5.4                    | \~76% | \~72%     | \~87%   | \~85%     |
| Claude Opus 4.7            | \~80% | \~74%     | \~86%   | \~86%     |
| Llama 4 Maverick           | \~70% | \~60%     | \~78%   | \~76%     |
| Gemma 4 9B                 | \~58% | \~48%     | \~70%   | \~66%     |
| Phi-4 multimodal (preview) | \~55% | \~45%     | \~68%   | \~65%     |

***

## Long-context reliability (RULER average at 128K)

| Model             | Context window | RULER avg @ 128K |
| ----------------- | -------------- | ---------------- |
| Gemini 3.1 Pro    | 1M             | \~92%            |
| Claude Opus 4.7   | 1M             | \~91%            |
| GPT-5.4           | 400K           | \~89%            |
| Llama 4 Scout     | 10M            | \~85%            |
| Claude Sonnet 4.6 | 200K           | \~88%            |
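
The needle-in-haystack tests behind these long-context scores are easy to reproduce in miniature: bury a unique fact at a chosen depth in filler text, then check whether the model retrieves it. A minimal sketch follows; `call_model` is a hypothetical client function standing in for your own API wrapper, not a real library call.

```python
def needle_prompt(needle, filler_sentence, depth, total_sentences):
    """Build a needle-in-a-haystack retrieval prompt.

    needle: the unique fact the model must retrieve.
    filler_sentence: repeated distractor text forming the haystack.
    depth: fraction (0.0-1.0) into the context where the needle sits.
    total_sentences: haystack length in sentences.
    """
    position = int(depth * total_sentences)
    haystack = [filler_sentence] * total_sentences
    haystack.insert(position, needle)
    context = " ".join(haystack)
    return context + "\n\nWhat is the magic number mentioned above?"

prompt = needle_prompt(
    needle="The magic number is 7481.",
    filler_sentence="The sky was a pleasant shade of blue that day.",
    depth=0.5,
    total_sentences=1000,
)
# answer = call_model("your-model", prompt)  # call_model is hypothetical
# A passing model's answer contains "7481".
```

Sweep `depth` from 0.0 to 1.0 and scale `total_sentences` toward the model's window to approximate the full grid that RULER-style reports average over.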

***

## Cost per million tokens (April 2026, in USD)

Prices are listed as input / output rates per 1M tokens; the monthly column blends them as an approximation for SMB planning. Always verify current pricing with the vendor.

| Model                                 | Input / Output (per 1M tokens) | Practical monthly cost at 5M tokens |
| ------------------------------------- | ------------------------------ | ----------------------------------- |
| Claude Opus 4.7                       | $15 / $75                      | \$225-375                           |
| Claude Sonnet 4.6                     | $3 / $15                       | \$45-75                             |
| Claude Haiku 4.5                      | $0.80 / $4                     | \$12-20                             |
| GPT-5.4 Pro                           | \~$15 / $75                    | \$225-375                           |
| GPT-5.4                               | \~$3 / $15                     | \$45-75                             |
| GPT-5.3 Instant                       | \~$0.80 / $4                   | \$12-20                             |
| Gemini 3.1 Pro                        | \~$5 / $25                     | \$75-125                            |
| Gemini 3.1 Flash                      | \~$0.30 / $2.50                | \$7-12                              |
| Gemini 3.1 Flash Lite                 | \~$0.10 / $0.40                | \$2-3                               |
| Llama 4 Maverick (via Groq, Together) | \~$0.50 / $0.80                | \$5-10                              |
| Llama 4 Scout (via Groq, Together)    | \~$0.20 / $0.40                | \$2-3                               |
| Mistral Large                         | \~$2 / $6                      | \$20-30                             |
| Phi-4 (self-hosted)                   | Hardware only                  | \$0 marginal                        |
| Gemma 4 9B (self-hosted)              | Hardware only                  | \$0 marginal                        |
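
As a sketch of how the monthly column works: cost is volume times the per-million rate on each side of an assumed input/output split. The table's ranges read roughly as an even split at the low end and all-output at the high end; the function below makes that assumption explicit. It is illustrative arithmetic, not a vendor formula.

```python
def monthly_cost(input_price, output_price, total_tokens_m=5.0, input_share=0.5):
    """Blended monthly cost in USD.

    input_price / output_price: USD per 1M tokens.
    total_tokens_m: total monthly volume in millions of tokens.
    input_share: assumed fraction of volume that is input (illustrative).
    """
    input_m = total_tokens_m * input_share
    output_m = total_tokens_m * (1 - input_share)
    return input_m * input_price + output_m * output_price

# Claude Sonnet 4.6 at $3 in / $15 out, 5M tokens:
print(monthly_cost(3, 15, 5.0, 0.5))  # even split -> 45.0 (table low end)
print(monthly_cost(3, 15, 5.0, 0.0))  # all output  -> 75.0 (table high end)
```

Agentic workloads skew heavily toward output tokens, so budget nearer the high end of each range for agent use cases.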

***

## Notes on benchmark interpretation

* **Benchmark gaming is real.** Small score gaps (1-3 points) rarely translate to visible differences in production.
* **Reasoning modes inflate scores.** Models with "thinking" modes often publish their best numbers with extended reasoning on. Budget accordingly.
* **Tool-use benchmarks (SWE-bench)** are the closest proxy for real agent performance.
* **ARC-AGI-2** is the current gold standard for novel-problem reasoning. Gemini 3.1 Pro's lead here is the notable 2026 story.
* **No single benchmark captures LAC usefulness.** See [lac-benchmark.md](lac-benchmark.md) for that.

***

## How to use this page

1. Identify the dimension that matters most for your task (coding, long context, math, multimodal, cost).
2. Pick the top 2 or 3 models in that column.
3. Run your own eval on 20 real queries from your business. The winner on your data is rarely the global leader.
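
The steps above can be sketched as a small eval harness. Everything here is illustrative: `call_model` stands in for whatever API client you use, and the substring judge is a deliberately crude pass/fail rule you would replace with your own grading.

```python
def run_eval(queries, candidates, call_model, judge):
    """Score each candidate model on your own queries.

    queries: list of (prompt, expected) pairs from real business traffic.
    candidates: list of model names to compare.
    call_model: fn(model, prompt) -> response text (your API client).
    judge: fn(response, expected) -> bool (your pass/fail rule).
    """
    results = {}
    for model in candidates:
        passed = sum(judge(call_model(model, p), exp) for p, exp in queries)
        results[model] = passed / len(queries)
    return results

# Example with a stubbed client and a substring judge:
queries = [("What is 2+2?", "4"), ("Capital of France?", "Paris")]
stub = lambda model, prompt: {"What is 2+2?": "4",
                              "Capital of France?": "Paris"}[prompt]
judge = lambda resp, exp: exp in resp
print(run_eval(queries, ["stub-model"], stub, judge))  # {'stub-model': 1.0}
```

Twenty real queries is usually enough to separate models that differ meaningfully on your task; pass-rate gaps within a few points are noise, per the interpretation notes above.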

***

## Related reading

* [lac-benchmark.md](lac-benchmark.md) for LAC-specific scoring.
* [tools/README.md](../tools/README.md) for the full tool landscape.
* [slm/models.md](../slm/models.md) for open-weight SLM deep dives.

***

*Fair Use, Educational Resource | April 2026*
*Licensed under Creative Commons BY 4.0.*
