
The LAC Benchmark

A Custom Benchmark for AI Model Performance in Latin America and the Caribbean

Created by Adrian Dunkley | maestrosai.com | ceo@maestrosai.com | Fair Use Version 0.1, April 2026. Licensed under Creative Commons BY 4.0.

Why the LAC Benchmark exists

Global benchmarks (MMLU, GPQA, SWE-bench, MATH) tell you which model is smartest in English on US/EU content. They do not tell a Colombian coffee exporter whether the model will translate a buyer’s email correctly, a Jamaican tour operator whether it will handle a WhatsApp booking in patois without insulting the guest, or a São Paulo accountant whether it will extract an invoice in Brazilian Portuguese with the right tax fields. The LAC Benchmark is the missing complement. It measures the things that decide whether an AI model is actually useful for a Caribbean or Latin American small business. It is open. Contributions are welcome. See Contributing at the end.

Design goals

  1. Regional language fluency, not just surface translation.
  2. LAC context, including regulations, currencies, history, geography.
  3. Realistic SMB tasks that small businesses actually run.
  4. Cost-aware, because a model that tops the quality rankings at $0.50 per query isn’t a win for a micro-business.
  5. Offline-capable scoring, so open-weight SLMs are evaluated on the same axis as cloud frontier models.

The five tracks

Track | Weight | What it measures
Language Fluency | 25% | Quality of output in LAC regional languages
Regional Context | 20% | Knowledge of LAC history, geography, business, regulation
SMB Task Suites | 25% | Invoice extraction, WhatsApp replies, translation, currency math, sector knowledge
Cost Efficiency | 15% | USD cost per completed task at SMB volumes
Offline/Low-Bandwidth Capability | 15% | Can this model work without a stable internet connection?
Scores in each track are on a 0 to 100 scale. The LAC Composite is the weighted average.
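
In practice the composite is a plain weighted sum of the five track scores. A minimal sketch of the calculation (the track scores passed in below are placeholders, not benchmark results):

    # LAC Composite: weighted average of the five track scores (each 0-100).
    WEIGHTS = {
        "fluency": 0.25,
        "context": 0.20,
        "tasks":   0.25,
        "cost":    0.15,
        "offline": 0.15,
    }

    def lac_composite(track_scores: dict) -> float:
        """Return the weighted LAC Composite for one model."""
        return sum(WEIGHTS[track] * track_scores[track] for track in WEIGHTS)

    # Example with placeholder scores only:
    print(lac_composite({"fluency": 80, "context": 75, "tasks": 82, "cost": 60, "offline": 0}))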

Track 1: Language Fluency (25%)

Tests the models on 200 prompts per regional language, scored by native-speaker judges against a 1-5 rubric (grammar, naturalness, register, idiom, cultural fit). Inter-rater agreement target: κ ≥ 0.8.
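
For replication, agreement between two judges can be checked with Cohen’s kappa. A minimal sketch using scikit-learn; the ratings below are illustrative, and the quadratic weighting is an assumption, since the benchmark does not state whether κ is weighted:

    # Inter-rater agreement check for two native-speaker judges (1-5 rubric ratings).
    from sklearn.metrics import cohen_kappa_score

    judge_a = [5, 4, 4, 3, 5, 2, 4, 5]   # illustrative ratings only
    judge_b = [5, 4, 3, 3, 5, 2, 4, 4]

    # Quadratic weighting credits near-misses on the ordinal 1-5 scale (assumed choice).
    kappa = cohen_kappa_score(judge_a, judge_b, weights="quadratic")
    print(f"kappa = {kappa:.2f}  (target: >= 0.8)")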

Languages and weights within the track

Language | Weight | Speakers | Coverage notes
Spanish (6 dialects) | 40% | ~450M | Mexican, Colombian, Rioplatense, Chilean, Central American, Caribbean
Portuguese (Brazilian + European) | 25% | ~220M | Brazilian weighted 85%, European 15%
English (Caribbean registers) | 10% | ~20M | Standard + Jamaican, Trinidadian, Bajan registers
Haitian Kreyòl | 10% | ~12M | Generation and comprehension
Regional French | 10% | ~5M | Martinique, Guadeloupe, Haiti
Papiamento | 5% | ~330K | Aruba, Curaçao, Bonaire

Version 0.1 results

Model | Spanish | Portuguese | Caribbean EN | Kreyòl | FR Carib. | Papiamento | Track score
Claude Opus 4.7 | 93 | 92 | 90 | 65 | 89 | 42 | 86
Claude Sonnet 4.6 | 90 | 89 | 88 | 60 | 86 | 35 | 82
GPT-5.4 | 91 | 88 | 89 | 55 | 85 | 30 | 81
GPT-5.4 Thinking | 91 | 88 | 90 | 58 | 85 | 32 | 82
Gemini 3.1 Pro | 88 | 86 | 87 | 48 | 84 | 22 | 78
Mistral Large | 82 | 78 | 72 | 30 | 87 | 18 | 71
Llama 4 Maverick | 80 | 78 | 75 | 45 | 70 | 20 | 71
Llama 4 Scout | 76 | 73 | 70 | 40 | 65 | 18 | 67
Gemma 4 9B | 72 | 70 | 68 | 32 | 65 | 12 | 62
Phi-4 (14B) | 62 | 58 | 74 | 18 | 54 | 5 | 54
Qwen 3 14B | 65 | 60 | 70 | 12 | 55 | 4 | 54
Mistral NeMo 12B | 65 | 60 | 65 | 22 | 70 | 10 | 58
v0.1 scores are based on native-speaker grading of 200 prompts per language by MaestrosAI’s regional review team and community contributors (March–April 2026). Papiamento scores are indicative; pool size is small.

Track 2: Regional Context (20%)

A closed-book question-and-answer set of 400 prompts covering:
  • Government and regulation (LGPD, Ley 25.326, Ley 21.719, Jamaica DPA, Ley 1581, etc.)
  • Business and economy (currencies, central banks, tax regimes, main export sectors by country)
  • History and geography (national founders, capitals, major cities, major historical events)
  • Culture and daily life (holidays, food, music, sports, local idioms)
  • Sector knowledge (coffee, tourism, mining, agro-exports, remittances, fintech landscape)
Scored by exact match for short answers and by human judges for free-form responses.
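
The exact-match rule is not spelled out here; one reasonable implementation, sketched below, normalises case, accents, and stray punctuation before comparing. The normalisation choices are an assumption, not the benchmark’s published rule:

    # Lenient exact-match for short closed-book answers (normalisation is assumed).
    import unicodedata

    def normalise(text: str) -> str:
        # Strip accents, lower-case, collapse whitespace, trim edge punctuation.
        text = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in text if not unicodedata.combining(ch))
        return " ".join(text.lower().split()).strip(" .,;:")

    def exact_match(prediction: str, reference: str) -> bool:
        return normalise(prediction) == normalise(reference)

    print(exact_match("Bogotá", "bogota"))                  # True
    print(exact_match("La Ley 1581 de 2012", "Ley 1581"))   # False -> routed to human judges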

Version 0.1 results

Model | Regulation | Economy | History/Geo | Culture | Sectors | Track score
Claude Opus 4.7 | 90 | 88 | 91 | 85 | 86 | 88
Gemini 3.1 Pro | 88 | 90 | 92 | 82 | 85 | 87
GPT-5.4 Thinking | 88 | 87 | 90 | 82 | 84 | 86
Claude Sonnet 4.6 | 84 | 83 | 88 | 80 | 82 | 83
GPT-5.4 | 84 | 84 | 87 | 79 | 82 | 83
Llama 4 Maverick | 70 | 72 | 78 | 68 | 70 | 72
Llama 4 Scout | 66 | 68 | 74 | 62 | 65 | 67
Mistral Large | 70 | 70 | 75 | 65 | 66 | 69
Gemma 4 9B | 58 | 62 | 70 | 58 | 58 | 61
Phi-4 (14B) | 52 | 58 | 65 | 50 | 50 | 55
Qwen 3 14B | 48 | 55 | 62 | 45 | 46 | 51
Mistral NeMo 12B | 52 | 55 | 62 | 52 | 52 | 55

Track 3: SMB Task Suites (25%)

Five real tasks, 50 prompts each, scored on completion rate and quality.
Task suite | Description
Invoice/Receipt Extraction | Structured-field extraction from Spanish, Portuguese, and bilingual invoices; checks vendor, date, totals, tax lines
WhatsApp Customer Reply | Responding to realistic customer inquiries in mixed language, with tone and escalation rubric
Translation Quality | EN↔ES, EN↔PT, ES↔PT, FR→ES, EN→Kreyòl (with review)
Local-Currency Math | Currency conversion with real daily rates; VAT calculation per country
Regional Sector Knowledge | Coffee grading, tourism booking logic, remittance rules, agro-export documentation
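
As an illustration of the Invoice/Receipt Extraction suite above, a minimal sketch of the kind of structured output being graded. The field names and schema here are illustrative, not the benchmark’s published schema:

    # Illustrative target structure for invoice extraction (schema is an assumption).
    from dataclasses import dataclass, field

    @dataclass
    class ExtractedInvoice:
        vendor: str                    # razão social / razón social as printed
        date: str                      # ISO 8601, e.g. "2026-03-15"
        currency: str                  # "BRL", "COP", "MXN", ...
        subtotal: float
        tax_lines: dict = field(default_factory=dict)   # e.g. {"ICMS": 120.00} or {"IVA 19%": 45.00}
        total: float = 0.0

    example = ExtractedInvoice(
        vendor="Café do Vale Ltda.",
        date="2026-03-15",
        currency="BRL",
        subtotal=1000.00,
        tax_lines={"ICMS": 120.00},
        total=1120.00,
    )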

Version 0.1 results

Model | Invoice | WhatsApp | Translation | Currency | Sectors | Track score
Claude Opus 4.7 | 92 | 90 | 94 | 92 | 88 | 91
GPT-5.4 Thinking | 91 | 88 | 92 | 91 | 86 | 90
Claude Sonnet 4.6 | 89 | 87 | 91 | 90 | 85 | 88
GPT-5.4 | 89 | 86 | 90 | 89 | 84 | 87
Gemini 3.1 Pro | 88 | 84 | 90 | 89 | 84 | 87
Claude Haiku 4.5 | 82 | 82 | 84 | 85 | 74 | 81
Gemini 3.1 Flash | 78 | 78 | 82 | 82 | 68 | 77
Llama 4 Maverick | 78 | 74 | 80 | 80 | 68 | 76
Llama 4 Scout | 72 | 70 | 75 | 75 | 62 | 71
Gemma 4 9B | 70 | 68 | 70 | 72 | 55 | 67
Phi-4 (14B) | 75 | 62 | 58 | 70 | 52 | 63
Qwen 3 14B | 68 | 64 | 62 | 70 | 50 | 63
The v0.1 task set is community-sourced and pending expansion. Methodology, prompts, and grading rubrics are published openly so others can replicate and contribute.

Track 4: Cost Efficiency (15%)

Measures the USD cost of completing a representative SMB workload (1,000 WhatsApp replies + 500 invoice extractions + 100 translations) at published public pricing as of April 2026. Normalised so the cheapest viable-quality model scores 100 and the most expensive viable model scores proportionally lower.
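
The workload cost can be approximated from public per-million-token pricing and rough token counts per task. A minimal sketch; the token estimates and prices below are placeholders, not the figures used for v0.1:

    # Rough SMB workload cost from per-million-token pricing (all numbers illustrative).
    WORKLOAD = {
        # task: (count, avg input tokens, avg output tokens)
        "whatsapp_reply":  (1000, 400, 150),
        "invoice_extract": (500, 1200, 300),
        "translation":     (100, 800, 800),
    }

    def workload_cost_usd(price_in_per_mtok: float, price_out_per_mtok: float) -> float:
        total = 0.0
        for count, tok_in, tok_out in WORKLOAD.values():
            total += count * (tok_in * price_in_per_mtok + tok_out * price_out_per_mtok) / 1_000_000
        return round(total, 2)

    # Placeholder pricing, not any vendor's actual rate:
    print(workload_cost_usd(price_in_per_mtok=0.50, price_out_per_mtok=1.50))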

Version 0.1 results

Model | Workload cost (USD) | Track score
Gemini 3.1 Flash Lite | $0.80 | 100
Llama 4 Scout (hosted) | $1.20 | 95
Gemini 3.1 Flash | $2.50 | 88
Claude Haiku 4.5 | $3.00 | 85
GPT-5.3 Instant | $3.10 | 85
Llama 4 Maverick (hosted) | $4.00 | 80
Mistral Large | $8.00 | 68
Claude Sonnet 4.6 | $14 | 55
GPT-5.4 | $14 | 55
Gemini 3.1 Pro | $20 | 45
Claude Opus 4.7 | $80 | 20
GPT-5.4 Pro | $80 | 20
Gemma 4 9B (self-hosted) | ~$0 marginal | 100
Phi-4 (self-hosted) | ~$0 marginal | 100

Track 5: Offline / Low-Bandwidth Capability (15%)

A single-axis score: can the model be run fully offline on consumer hardware in a Caribbean or Latin American context?
Criterion | Points
Open-weight, commercially usable | 30
Runs on ≤ 16 GB RAM hardware at acceptable speed | 30
Quality retained in Q4 quantization | 20
Multilingual LAC coverage maintained offline | 20
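
The track score is the sum of the points a model earns on these four criteria. A minimal sketch with partial credit expressed as a 0-1 fraction per criterion; the partial-credit scheme is an assumption (the v0.1 scores below, such as Phi-4’s 82, suggest graders do award partial points):

    # Offline/low-bandwidth track: weighted criteria, fractions give partial credit.
    CRITERIA = {
        "open_weight_commercial": 30,
        "runs_on_16gb":           30,
        "q4_quality_retained":    20,
        "lac_langs_offline":      20,
    }

    def offline_score(fractions: dict) -> float:
        """fractions maps each criterion to a 0.0-1.0 degree of satisfaction."""
        return sum(points * fractions.get(name, 0.0) for name, points in CRITERIA.items())

    # Illustrative only -- not the official v0.1 grading of any model:
    print(offline_score({"open_weight_commercial": 1.0, "runs_on_16gb": 0.6,
                         "q4_quality_retained": 1.0, "lac_langs_offline": 0.6}))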

Version 0.1 results

Model | Open weight? | Runs on ≤16 GB? | Q4 quality | LAC langs offline | Track score
Gemma 4 9B | Yes | Yes | Good | Strong | 95
Phi-4 (14B) | Yes | Tight (24 GB better) | Good | Moderate | 82
Phi-4 Mini (3.8B) | Yes | Yes, easily | Good | Moderate | 85
Mistral NeMo 12B | Yes | Yes | Good | Strong (ES/PT/FR) | 90
Qwen 3 7B | Yes | Yes | Good | Moderate | 80
Llama 4 Scout | Yes | No (needs 48+ GB) | Good | Very strong | 70
Llama 3.3 8B | Yes | Yes | Good | Moderate | 80
Claude Opus 4.7 | No | N/A | N/A | N/A | 0
GPT-5.4 | No | N/A | N/A | N/A | 0
Gemini 3.1 Pro | No | N/A | N/A | N/A | 0

The LAC Composite (weighted average)

Weighted across the five tracks.
Rank | Model | Fluency (25%) | Context (20%) | Tasks (25%) | Cost (15%) | Offline (15%) | LAC Composite
1 | Claude Opus 4.7 | 86 | 88 | 91 | 20 | 0 | 60.4
2 | GPT-5.4 Thinking | 82 | 86 | 90 | 55 | 0 | 63.2
3 | Claude Sonnet 4.6 | 82 | 83 | 88 | 55 | 0 | 62.9
4 | Gemini 3.1 Pro | 78 | 87 | 87 | 45 | 0 | 59.4
5 | GPT-5.4 | 81 | 83 | 87 | 55 | 0 | 62.0
6 | Claude Haiku 4.5 | 75 | 72 | 81 | 85 | 0 | 60.5
7 | Gemini 3.1 Flash | 70 | 75 | 77 | 88 | 0 | 57.4
8 | Llama 4 Maverick | 71 | 72 | 76 | 80 | 60 (self-host) | 70.3 (with local)
9 | Llama 4 Scout | 67 | 67 | 71 | 95 | 70 | 72.6
10 | Gemma 4 9B | 62 | 61 | 67 | 100 | 95 | 74.8
11 | Mistral NeMo 12B | 58 | 55 | 58 | 100 | 90 | 69.4
12 | Phi-4 (14B) | 54 | 55 | 63 | 100 | 82 | 68.0
13 | Qwen 3 14B | 54 | 51 | 63 | 100 | 80 | 66.2
Open-weight models gain cost and offline points when self-hosted. “Cost” for self-hosted models is marginal cost after hardware; amortised cost varies with volume.

What the composite tells you

  • For highest quality at any cost: Claude Opus 4.7, GPT-5.4 Thinking, Gemini 3.1 Pro.
  • For daily production work: Claude Sonnet 4.6, GPT-5.4, Gemini 3.1 Flash.
  • For high-volume / low-cost: Claude Haiku 4.5, Gemini 3.1 Flash, Llama 4 Scout hosted.
  • For privacy-first / offline: Gemma 4 9B or Mistral NeMo 12B on local hardware.
  • For LAC-language priority: Claude > GPT-5.4 > Gemini > Mistral > Llama > open-weight SLMs.

Methodology (how to replicate)

  1. Prompt set: 200 fluency prompts per language, 400 context prompts, 250 SMB task prompts (50 × 5 suites). Published openly on request.
  2. Scorers: Native-speaker judges for language and context; deterministic rubrics for SMB tasks; public pricing for cost; verified hardware runs for offline.
  3. Inter-rater agreement: κ ≥ 0.8 target for human-scored items.
  4. Sampling: Uniform random from the published pool; vendors and the public can request the full set.
  5. Refresh cadence: Quarterly. v0.2 target: Q3 2026.

Contributing

You can help extend the LAC Benchmark in four ways:
  1. Submit prompts for the regional-context or SMB-task tracks (especially for under-covered countries).
  2. Score outputs as a native-language reviewer.
  3. Run models we haven’t covered yet and submit scores with methodology.
  4. Propose new tracks (for example, agricultural-advisory quality, accounting-document depth, medical-triage quality).
Contact: ceo@maestrosai.com with subject “LAC Benchmark”. All contributions are released under Creative Commons BY 4.0 and credited to contributors.

Citation

Dunkley, A. (2026). LAC Benchmark: AI Model Performance in Latin America and the Caribbean. Version 0.1. MaestrosAI. maestrosai.com. CC BY 4.0.

