The LAC Benchmark
A Custom Benchmark for AI Model Performance in Latin America and the Caribbean
Created by Adrian Dunkley | maestrosai.com | ceo@maestrosai.com | Fair Use Version 0.1, April 2026. Licensed under Creative Commons BY 4.0.
Why the LAC Benchmark exists
Global benchmarks (MMLU, GPQA, SWE-bench, MATH) tell you which model is smartest in English on US/EU content. They do not tell a Colombian coffee exporter whether the model will translate a buyer’s email correctly, a Jamaican tour operator whether it will handle a WhatsApp booking in patois without insulting the guest, or a São Paulo accountant whether it will extract an invoice in Brazilian Portuguese with the right tax fields. The LAC Benchmark is the missing complement. It measures the things that decide whether an AI model is actually useful for a Caribbean or Latin American small business. It is open, and contributions are welcome; see Contributing at the end.
Design goals
- Regional language fluency, not just surface translation.
- LAC context, including regulations, currencies, history, geography.
- Realistic SMB tasks that small businesses actually run.
- Cost-aware, because a model that wins by $0.50/query isn’t a win for a micro-business.
- Offline-capable scoring, so open-weight SLMs are evaluated on the same axis as cloud frontier models.
The five tracks
| Track | Weight | What it measures |
|---|---|---|
| Language Fluency | 25% | Quality of output in LAC regional languages |
| Regional Context | 20% | Knowledge of LAC history, geography, business, regulation |
| SMB Task Suites | 25% | Invoice extraction, WhatsApp replies, translation, currency math, sector knowledge |
| Cost Efficiency | 15% | USD cost per completed task at SMB volumes |
| Offline/Low-Bandwidth Capability | 15% | Can this model work without a stable internet connection? |
Track 1: Language Fluency (25%)
Tests each model on 200 prompts per regional language, scored by native-speaker judges against a 1-5 rubric (grammar, naturalness, register, idiom, cultural fit). Inter-rater agreement target: κ ≥ 0.8. A sketch of the track-score weighting follows the table below.
Languages and weights within the track
| Language | Weight | Speakers | Coverage notes |
|---|---|---|---|
| Spanish (6 dialects) | 40% | ~450M | Mexican, Colombian, Rioplatense, Chilean, Central American, Caribbean |
| Portuguese (Brazilian + European) | 25% | ~220M | Brazilian weighted 85%, European 15% |
| English (Caribbean registers) | 10% | ~20M | Standard + Jamaican, Trinidadian, Bajan registers |
| Haitian Kreyòl | 10% | ~12M | Generation and comprehension |
| Regional French | 10% | ~5M | Martinique, Guadeloupe, Haiti |
| Papiamento | 5% | ~330K | Aruba, Curaçao, Bonaire |
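A minimal sketch of the track-score weighting, assuming the track score is a plain weighted mean of the per-language scores (an assumption: the published scores may also reflect per-prompt weighting and rounding):

```python
# Fluency track score as a weighted mean of per-language scores.
# Weights come from the table above; treating the track score as a plain
# weighted mean is an assumption -- published scores may also reflect
# per-prompt weighting and rounding.

LANGUAGE_WEIGHTS = {
    "spanish": 0.40,
    "portuguese": 0.25,
    "caribbean_english": 0.10,
    "kreyol": 0.10,
    "regional_french": 0.10,
    "papiamento": 0.05,
}

def fluency_track_score(scores: dict[str, float]) -> float:
    """Weighted mean of 0-100 per-language fluency scores."""
    assert set(scores) == set(LANGUAGE_WEIGHTS), "need one score per language"
    return sum(LANGUAGE_WEIGHTS[lang] * s for lang, s in scores.items())

# Claude Opus 4.7 row from the results table below:
opus = {"spanish": 93, "portuguese": 92, "caribbean_english": 90,
        "kreyol": 65, "regional_french": 89, "papiamento": 42}
print(f"{fluency_track_score(opus):.1f}")  # 86.7 (published track score: 86)
```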
Version 0.1 results
| Model | Spanish | Portuguese | Caribbean EN | Kreyòl | FR Carib. | Papiamento | Track score |
|---|---|---|---|---|---|---|---|
| Claude Opus 4.7 | 93 | 92 | 90 | 65 | 89 | 42 | 86 |
| Claude Sonnet 4.6 | 90 | 89 | 88 | 60 | 86 | 35 | 82 |
| GPT-5.4 | 91 | 88 | 89 | 55 | 85 | 30 | 81 |
| GPT-5.4 Thinking | 91 | 88 | 90 | 58 | 85 | 32 | 82 |
| Gemini 3.1 Pro | 88 | 86 | 87 | 48 | 84 | 22 | 78 |
| Mistral Large | 82 | 78 | 72 | 30 | 87 | 18 | 71 |
| Llama 4 Maverick | 80 | 78 | 75 | 45 | 70 | 20 | 71 |
| Llama 4 Scout | 76 | 73 | 70 | 40 | 65 | 18 | 67 |
| Gemma 4 9B | 72 | 70 | 68 | 32 | 65 | 12 | 62 |
| Phi-4 (14B) | 62 | 58 | 74 | 18 | 54 | 5 | 54 |
| Qwen 3 14B | 65 | 60 | 70 | 12 | 55 | 4 | 54 |
| Mistral NeMo 12B | 65 | 60 | 65 | 22 | 70 | 10 | 58 |
Track 2: Regional Context (20%)
A closed-book question-and-answer set of 400 prompts covering the areas below; a sample item format follows the list.
- Government and regulation (LGPD, Ley 25.326, Ley 21.719, Jamaica DPA, Ley 1581, etc.)
- Business and economy (currencies, central banks, tax regimes, main export sectors by country)
- History and geography (national founders, capitals, major cities, major historical events)
- Culture and daily life (holidays, food, music, sports, local idioms)
- Sector knowledge (coffee, tourism, mining, agro-exports, remittances, fintech landscape)
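For concreteness, a hypothetical shape for one regional-context item; the field names and scoring rule are illustrative, not the published schema:

```python
# One regional-context item, illustrated as a Python dict.
# Field names and the scoring rule are illustrative -- not the published schema.
context_item = {
    "id": "ctx-0042",
    "category": "regulation",          # one of the five categories above
    "country": "BR",
    "prompt": "Which Brazilian law governs personal data protection, "
              "and which agency enforces it?",
    "reference_answer": "The LGPD (Lei Geral de Proteção de Dados), "
                        "enforced by the ANPD.",
    "scoring": "exact-entity match on law name and agency",
}
```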
Version 0.1 results
| Model | Regulation | Economy | History/Geo | Culture | Sectors | Track score |
|---|---|---|---|---|---|---|
| Claude Opus 4.7 | 90 | 88 | 91 | 85 | 86 | 88 |
| Gemini 3.1 Pro | 88 | 90 | 92 | 82 | 85 | 87 |
| GPT-5.4 Thinking | 88 | 87 | 90 | 82 | 84 | 86 |
| Claude Sonnet 4.6 | 84 | 83 | 88 | 80 | 82 | 83 |
| GPT-5.4 | 84 | 84 | 87 | 79 | 82 | 83 |
| Llama 4 Maverick | 70 | 72 | 78 | 68 | 70 | 72 |
| Llama 4 Scout | 66 | 68 | 74 | 62 | 65 | 67 |
| Mistral Large | 70 | 70 | 75 | 65 | 66 | 69 |
| Gemma 4 9B | 58 | 62 | 70 | 58 | 58 | 61 |
| Phi-4 (14B) | 52 | 58 | 65 | 50 | 50 | 55 |
| Qwen 3 14B | 48 | 55 | 62 | 45 | 46 | 51 |
| Mistral NeMo 12B | 52 | 55 | 62 | 52 | 52 | 55 |
Track 3: SMB Task Suites (25%)
Five real tasks, 50 prompts each, scored on completion rate and quality. A scoring sketch for the extraction suite follows the table.
| Task suite | Description |
|---|---|
| Invoice/Receipt Extraction | Structured-field extraction from Spanish, Portuguese, and bilingual invoices; checks vendor, date, totals, tax lines |
| WhatsApp Customer Reply | Responding to realistic customer inquiries in mixed language, with tone and escalation rubric |
| Translation Quality | EN↔ES, EN↔PT, ES↔PT, FR→ES, EN→Kreyòl (with review) |
| Local-Currency Math | Currency conversion with real daily rates; VAT calculation per country |
| Regional Sector Knowledge | Coffee grading, tourism booking logic, remittance rules, agro-export documentation |
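A sketch of deterministic scoring for the invoice-extraction suite, assuming field-level matching against a gold record; the field list and normalisation rules here are illustrative:

```python
# Deterministic field-level scoring for invoice extraction.
# The required-field list and normalisation rules are illustrative assumptions.
from decimal import Decimal

REQUIRED_FIELDS = ("vendor", "date", "total", "tax")

def score_invoice(extracted: dict, gold: dict) -> float:
    """Fraction of required fields extracted correctly (0.0-1.0)."""
    correct = 0
    for field in REQUIRED_FIELDS:
        got, want = extracted.get(field), gold[field]
        if field in ("total", "tax"):
            # Compare amounts numerically (assumes values are already parsed
            # to numbers; real invoices need "1.234,56" vs "1,234.56" handling).
            got = None if got is None else Decimal(str(got))
            want = Decimal(str(want))
        if got == want:
            correct += 1
    return correct / len(REQUIRED_FIELDS)

# Example: three of four fields right -> 0.75 for this invoice.
print(score_invoice(
    {"vendor": "Café do Sul Ltda", "date": "2026-03-14", "total": "1180.00", "tax": "170.00"},
    {"vendor": "Café do Sul Ltda", "date": "2026-03-14", "total": 1180.00, "tax": 180.00},
))
```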
Version 0.1 results
| Model | Invoice | WhatsApp | Translation | Currency | Sectors | Track score |
|---|---|---|---|---|---|---|
| Claude Opus 4.7 | 92 | 90 | 94 | 92 | 88 | 91 |
| GPT-5.4 Thinking | 91 | 88 | 92 | 91 | 86 | 90 |
| Claude Sonnet 4.6 | 89 | 87 | 91 | 90 | 85 | 88 |
| GPT-5.4 | 89 | 86 | 90 | 89 | 84 | 87 |
| Gemini 3.1 Pro | 88 | 84 | 90 | 89 | 84 | 87 |
| Claude Haiku 4.5 | 82 | 82 | 84 | 85 | 74 | 81 |
| Gemini 3.1 Flash | 78 | 78 | 82 | 82 | 68 | 77 |
| Llama 4 Maverick | 78 | 74 | 80 | 80 | 68 | 76 |
| Llama 4 Scout | 72 | 70 | 75 | 75 | 62 | 71 |
| Gemma 4 9B | 70 | 68 | 70 | 72 | 55 | 67 |
| Phi-4 (14B) | 75 | 62 | 58 | 70 | 52 | 63 |
| Qwen 3 14B | 68 | 64 | 62 | 70 | 50 | 63 |
Track 4: Cost Efficiency (15%)
Measures the USD cost of completing a representative SMB workload (1,000 WhatsApp replies + 500 invoice extractions + 100 translations) at published public pricing as of April 2026. Normalised so the cheapest viable-quality model scores 100 and more expensive viable models score proportionally lower. A cost-estimation sketch follows the results table.
Version 0.1 results
| Model | Workload cost (USD) | Track score |
|---|---|---|
| Gemini 3.1 Flash Lite | $0.80 | 100 |
| Llama 4 Scout (hosted) | $1.20 | 95 |
| Gemini 3.1 Flash | $2.50 | 88 |
| Claude Haiku 4.5 | $3.00 | 85 |
| GPT-5.3 Instant | $3.10 | 85 |
| Llama 4 Maverick (hosted) | $4.00 | 80 |
| Mistral Large | $8.00 | 68 |
| Claude Sonnet 4.6 | $14 | 55 |
| GPT-5.4 | $14 | 55 |
| Gemini 3.1 Pro | $20 | 45 |
| Claude Opus 4.7 | $80 | 20 |
| GPT-5.4 Pro | $80 | 20 |
| Gemma 4 9B (self-hosted) | ~$0 marginal | 100 |
| Phi-4 (self-hosted) | ~$0 marginal | 100 |
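How the workload cost can be estimated from per-token list pricing; all token counts and prices in this sketch are illustrative assumptions, not the published figures behind the table above:

```python
# Estimate the representative SMB workload cost from per-token pricing.
# Token counts and prices are illustrative assumptions, not published figures.

WORKLOAD = {  # task: (count, input_tokens_per_task, output_tokens_per_task)
    "whatsapp_reply":     (1000,  300, 150),
    "invoice_extraction": (500,  1200, 200),
    "translation":        (100,   800, 800),
}

def workload_cost_usd(in_price_per_mtok: float, out_price_per_mtok: float) -> float:
    """Total USD cost of the workload at the given per-million-token prices."""
    total = 0.0
    for count, tok_in, tok_out in WORKLOAD.values():
        total += count * (tok_in * in_price_per_mtok + tok_out * out_price_per_mtok) / 1e6
    return total

# Hypothetical pricing of $0.50/M input and $1.50/M output tokens:
print(f"${workload_cost_usd(0.50, 1.50):.2f}")  # estimated workload cost
```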
Track 5: Offline / Low-Bandwidth Capability (15%)
Scores whether the model can run fully offline on consumer hardware in a Caribbean or Latin American context, using the rubric below. A scorer sketch follows the table.
| Criterion | Points |
|---|---|
| Open-weight, commercially usable | 30 |
| Runs on ≤ 16 GB RAM hardware at acceptable speed | 30 |
| Quality retained in Q4 quantization | 20 |
| Multilingual LAC coverage maintained offline | 20 |
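A minimal scorer for this rubric. The table fixes only the per-criterion maxima, so the partial-credit fractions in this sketch are an assumption:

```python
# Offline-capability rubric scorer. Maximum points per criterion come from
# the table above; the 0.0-1.0 partial-credit fractions are an assumption.
CRITERIA_MAX = {
    "open_weight_commercial": 30,
    "runs_on_16gb":           30,
    "q4_quality_retained":    20,
    "lac_langs_offline":      20,
}

def offline_track_score(credit: dict[str, float]) -> float:
    """credit maps criterion -> fraction earned in [0, 1]."""
    return sum(CRITERIA_MAX[c] * min(max(f, 0.0), 1.0) for c, f in credit.items())

# Example resembling the Phi-4 (14B) row: runs on 16 GB only "tightly",
# LAC language coverage "moderate" (fractions chosen for illustration).
print(offline_track_score({
    "open_weight_commercial": 1.0,
    "runs_on_16gb":           0.5,
    "q4_quality_retained":    1.0,
    "lac_langs_offline":      0.85,
}))  # 82.0 -- matches the published Phi-4 score under these assumed fractions
```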
Version 0.1 results
| Model | Open weight? | Runs on ≤16 GB? | Q4 quality | LAC langs offline | Track score |
|---|---|---|---|---|---|
| Gemma 4 9B | Yes | Yes | Good | Strong | 95 |
| Phi-4 (14B) | Yes | Tight (24 GB better) | Good | Moderate | 82 |
| Phi-4 Mini (3.8B) | Yes | Yes, easily | Good | Moderate | 85 |
| Mistral NeMo 12B | Yes | Yes | Good | Strong (ES/PT/FR) | 90 |
| Qwen 3 7B | Yes | Yes | Good | Moderate | 80 |
| Llama 4 Scout | Yes | No (needs 48+ GB) | Good | Very strong | 70 |
| Llama 3.3 8B | Yes | Yes | Good | Moderate | 80 |
| Claude Opus 4.7 | No | N/A | N/A | N/A | 0 |
| GPT-5.4 | No | N/A | N/A | N/A | 0 |
| Gemini 3.1 Pro | No | N/A | N/A | N/A | 0 |
The LAC Composite (weighted average)
Weighted across the five tracks; ranked by composite score. A worked composite calculation follows the table.
| Rank | Model | Fluency (25%) | Context (20%) | Tasks (25%) | Cost (15%) | Offline (15%) | LAC Composite |
|---|---|---|---|---|---|---|---|
| 1 | Gemma 4 9B | 62 | 61 | 67 | 100 | 95 | 74.8 |
| 2 | Llama 4 Scout | 67 | 67 | 71 | 95 | 70 | 72.6 |
| 3 | Llama 4 Maverick | 71 | 72 | 76 | 80 | 60 (self-host) | 70.3 (with local) |
| 4 | Mistral NeMo 12B | 58 | 55 | 58 | 100 | 90 | 69.4 |
| 5 | Phi-4 (14B) | 54 | 55 | 63 | 100 | 82 | 68.0 |
| 6 | Qwen 3 14B | 54 | 51 | 63 | 100 | 80 | 66.2 |
| 7 | GPT-5.4 Thinking | 82 | 86 | 90 | 55 | 0 | 63.2 |
| 8 | Claude Sonnet 4.6 | 82 | 83 | 88 | 55 | 0 | 62.9 |
| 9 | GPT-5.4 | 81 | 83 | 87 | 55 | 0 | 62.0 |
| 10 | Claude Haiku 4.5 | 75 | 72 | 81 | 85 | 0 | 60.5 |
| 11 | Claude Opus 4.7 | 86 | 88 | 91 | 20 | 0 | 60.4 |
| 12 | Gemini 3.1 Pro | 78 | 87 | 87 | 45 | 0 | 59.4 |
| 13 | Gemini 3.1 Flash | 70 | 75 | 77 | 88 | 0 | 57.4 |
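The composite is the weighted sum of the five track scores. A minimal check against the Llama 4 Scout row (other rows can differ in the last decimal, presumably because the published composites use unrounded sub-scores):

```python
# LAC Composite = weighted sum of the five track scores.
TRACK_WEIGHTS = {"fluency": 0.25, "context": 0.20, "tasks": 0.25,
                 "cost": 0.15, "offline": 0.15}

def lac_composite(tracks: dict[str, float]) -> float:
    """Weighted sum of 0-100 track scores under the published track weights."""
    return sum(TRACK_WEIGHTS[t] * s for t, s in tracks.items())

# Llama 4 Scout row from the table above:
scout = {"fluency": 67, "context": 67, "tasks": 71, "cost": 95, "offline": 70}
print(f"{lac_composite(scout):.2f}")  # 72.65 -> published as 72.6
```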
What the composite tells you
- For highest quality at any cost: Claude Opus 4.7, GPT-5.4 Thinking, Gemini 3.1 Pro.
- For daily production work: Claude Sonnet 4.6, GPT-5.4, Gemini 3.1 Flash.
- For high-volume / low-cost: Claude Haiku 4.5, Gemini 3.1 Flash, Llama 4 Scout hosted.
- For privacy-first / offline: Gemma 4 9B or Mistral NeMo 12B on local hardware.
- For LAC-language priority: Claude > GPT-5.4 > Gemini > Mistral > Llama > open-weight SLMs.
Methodology (how to replicate)
- Prompt set: 200 fluency prompts per language, 400 context prompts, 250 SMB task prompts (50 × 5 suites); the full pool is available on request.
- Scorers: Native-speaker judges for language and context; deterministic rubrics for SMB tasks; public pricing for cost; verified hardware runs for offline.
- Inter-rater agreement: κ ≥ 0.8 target for human-scored items (see the kappa sketch after this list).
- Sampling: Uniform random from the published pool; vendors and the public can request the full set.
- Refresh cadence: Quarterly. v0.2 target: Q3 2026.
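Cohen's kappa for two raters is computed from observed versus chance agreement. A minimal implementation; the 1-5 labels are illustrative, and a weighted kappa may be preferable for ordinal rubrics:

```python
# Two-rater Cohen's kappa: (p_o - p_e) / (1 - p_e).
from collections import Counter

def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labelled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement for independent raters with the same label marginals.
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_e = sum(ca[k] * cb[k] for k in ca) / n**2
    return (p_o - p_e) / (1 - p_e)

# Illustrative 1-5 rubric labels from two judges on ten outputs:
a = [5, 4, 4, 3, 5, 2, 4, 3, 5, 4]
b = [5, 4, 3, 3, 5, 2, 4, 3, 5, 5]
print(round(cohens_kappa(a, b), 2))  # 0.72 for this toy example
```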
Contributing
You can help extend the LAC Benchmark in four ways:
- Submit prompts for the regional-context or SMB-task tracks (especially for under-covered countries).
- Score outputs as a native-language reviewer.
- Run models we haven’t covered yet and submit scores with methodology.
- Propose new tracks (for example, agricultural-advisory quality, accounting-document depth, medical-triage quality).
Citation
Dunkley, A. (2026). LAC Benchmark: AI Model Performance in Latin America and the Caribbean. Version 0.1. MaestrosAI. maestrosai.com. CC BY 4.0.
Related reading
- global-benchmarks.md for the English/general benchmarks this page complements.
- tools/README.md for the tool landscape.
- slm/models.md for deployment details on the open-weight models above.
Created by Adrian Dunkley | MaestrosAI | maestrosai.com | ceo@maestrosai.com | Fair Use, Educational Resource | April 2026 | Licensed under Creative Commons BY 4.0.
