
The LAC Benchmark

A Custom Benchmark for AI Model Performance in Latin America and the Caribbean

Created by Adrian Dunkley | maestrosai.com | ceo@maestrosai.com | Fair Use Version 0.1, April 2026. Licensed under Creative Commons BY 4.0.

Why the LAC Benchmark exists

Global benchmarks (MMLU, GPQA, SWE-bench, MATH) tell you which model is smartest in English on US/EU content. They do not tell a Colombian coffee exporter whether the model will translate a buyer’s email correctly, a Jamaican tour operator whether it will handle a WhatsApp booking in patois without insulting the guest, or a São Paulo accountant whether it will extract an invoice in Brazilian Portuguese with the right tax fields. The LAC Benchmark is the missing complement. It measures the things that decide whether an AI model is actually useful for a Caribbean or Latin American small business. It is open. Contributions are welcome. See Contributing at the end.

Design goals

  1. Regional language fluency, not just surface translation.
  2. LAC context, including regulations, currencies, history, geography.
  3. Realistic SMB tasks that small businesses actually run.
  4. Cost-aware, because a model that tops the quality rankings at $0.50 per query isn’t a win for a micro-business.
  5. Offline-capable scoring, so open-weight SLMs are evaluated on the same axis as cloud frontier models.

The five tracks

Track | Weight | What it measures
Language Fluency | 25% | Quality of output in LAC regional languages
Regional Context | 20% | Knowledge of LAC history, geography, business, regulation
SMB Task Suites | 25% | Invoice extraction, WhatsApp replies, translation, currency math, sector knowledge
Cost Efficiency | 15% | USD cost per completed task at SMB volumes
Offline/Low-Bandwidth Capability | 15% | Can this model work without a stable internet connection?
Scores in each track are on a 0 to 100 scale. The LAC Composite is the weighted average.
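
In practice the composite is a plain weighted sum of the five track scores. A minimal sketch of the calculation (the track scores passed in below are placeholders, not benchmark results):

    # LAC Composite: weighted average of the five track scores (each 0-100).
    WEIGHTS = {
        "fluency": 0.25,
        "context": 0.20,
        "tasks":   0.25,
        "cost":    0.15,
        "offline": 0.15,
    }

    def lac_composite(track_scores: dict) -> float:
        """Return the weighted LAC Composite for one model."""
        return sum(WEIGHTS[track] * track_scores[track] for track in WEIGHTS)

    # Example with placeholder scores only:
    print(lac_composite({"fluency": 80, "context": 75, "tasks": 82, "cost": 60, "offline": 0}))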

Track 1: Language Fluency (25%)

Tests the models on 200 prompts per regional language, scored by native-speaker judges against a 1-5 rubric (grammar, naturalness, register, idiom, cultural fit). Inter-rater agreement target: κ ≥ 0.8.
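
For replication, agreement between two judges can be checked with Cohen’s kappa. A minimal sketch using scikit-learn; the ratings below are illustrative, and the quadratic weighting is an assumption, since the benchmark does not state whether κ is weighted:

    # Inter-rater agreement check for two native-speaker judges (1-5 rubric ratings).
    from sklearn.metrics import cohen_kappa_score

    judge_a = [5, 4, 4, 3, 5, 2, 4, 5]   # illustrative ratings only
    judge_b = [5, 4, 3, 3, 5, 2, 4, 4]

    # Quadratic weighting credits near-misses on the ordinal 1-5 scale (assumed choice).
    kappa = cohen_kappa_score(judge_a, judge_b, weights="quadratic")
    print(f"kappa = {kappa:.2f}  (target: >= 0.8)")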

Languages and weights within the track

Language | Weight | Speakers | Coverage notes
Spanish (6 dialects) | 40% | ~450M | Mexican, Colombian, Rioplatense, Chilean, Central American, Caribbean
Portuguese (Brazilian + European) | 25% | ~220M | Brazilian weighted 85%, European 15%
English (Caribbean registers) | 10% | ~20M | Standard + Jamaican, Trinidadian, Bajan registers
Haitian Kreyòl | 10% | ~12M | Generation and comprehension
Regional French | 10% | ~5M | Martinique, Guadeloupe, Haiti
Papiamento | 5% | ~330K | Aruba, Curaçao, Bonaire

Version 0.1 results

Model | Spanish | Portuguese | Caribbean EN | Kreyòl | FR Carib. | Papiamento | Track score
Claude Opus 4.7 | 93 | 92 | 90 | 65 | 89 | 42 | 86
Claude Sonnet 4.6 | 90 | 89 | 88 | 60 | 86 | 35 | 82
GPT-5.4 | 91 | 88 | 89 | 55 | 85 | 30 | 81
GPT-5.4 Thinking | 91 | 88 | 90 | 58 | 85 | 32 | 82
Gemini 3.1 Pro | 88 | 86 | 87 | 48 | 84 | 22 | 78
Mistral Large | 82 | 78 | 72 | 30 | 87 | 18 | 71
Llama 4 Maverick | 80 | 78 | 75 | 45 | 70 | 20 | 71
Llama 4 Scout | 76 | 73 | 70 | 40 | 65 | 18 | 67
Gemma 4 9B | 72 | 70 | 68 | 32 | 65 | 12 | 62
Phi-4 (14B) | 62 | 58 | 74 | 18 | 54 | 5 | 54
Qwen 3 14B | 65 | 60 | 70 | 12 | 55 | 4 | 54
Mistral NeMo 12B | 65 | 60 | 65 | 22 | 70 | 10 | 58
v0.1 scores are based on native-speaker grading of 200 prompts per language by MaestrosAI’s regional review team and community contributors (March–April 2026). Papiamento scores are indicative; pool size is small.

Track 2: Regional Context (20%)

A closed-book question-and-answer set of 400 prompts covering:
  • Government and regulation (LGPD, Ley 25.326, Ley 21.719, Jamaica DPA, Ley 1581, etc.)
  • Business and economy (currencies, central banks, tax regimes, main export sectors by country)
  • History and geography (national founders, capitals, major cities, major historical events)
  • Culture and daily life (holidays, food, music, sports, local idioms)
  • Sector knowledge (coffee, tourism, mining, agro-exports, remittances, fintech landscape)
Scored by exact match for short answers and by human judges for free-form responses.
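
The exact-match rule is not spelled out here; one reasonable implementation, sketched below, normalises case, accents, and stray punctuation before comparing. The normalisation choices are an assumption, not the benchmark’s published rule:

    # Lenient exact-match for short closed-book answers (normalisation is assumed).
    import unicodedata

    def normalise(text: str) -> str:
        # Strip accents, lower-case, collapse whitespace, trim edge punctuation.
        text = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in text if not unicodedata.combining(ch))
        return " ".join(text.lower().split()).strip(" .,;:")

    def exact_match(prediction: str, reference: str) -> bool:
        return normalise(prediction) == normalise(reference)

    print(exact_match("Bogotá", "bogota"))                  # True
    print(exact_match("La Ley 1581 de 2012", "Ley 1581"))   # False -> routed to human judges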

Version 0.1 results

Model | Regulation | Economy | History/Geo | Culture | Sectors | Track score
Claude Opus 4.7 | 90 | 88 | 91 | 85 | 86 | 88
Gemini 3.1 Pro | 88 | 90 | 92 | 82 | 85 | 87
GPT-5.4 Thinking | 88 | 87 | 90 | 82 | 84 | 86
Claude Sonnet 4.6 | 84 | 83 | 88 | 80 | 82 | 83
GPT-5.4 | 84 | 84 | 87 | 79 | 82 | 83
Llama 4 Maverick | 70 | 72 | 78 | 68 | 70 | 72
Llama 4 Scout | 66 | 68 | 74 | 62 | 65 | 67
Mistral Large | 70 | 70 | 75 | 65 | 66 | 69
Gemma 4 9B | 58 | 62 | 70 | 58 | 58 | 61
Phi-4 (14B) | 52 | 58 | 65 | 50 | 50 | 55
Qwen 3 14B | 48 | 55 | 62 | 45 | 46 | 51
Mistral NeMo 12B | 52 | 55 | 62 | 52 | 52 | 55

Track 3: SMB Task Suites (25%)

Five real tasks, 50 prompts each, scored on completion rate and quality.
Task suite | Description
Invoice/Receipt Extraction | Structured-field extraction from Spanish, Portuguese, and bilingual invoices; checks vendor, date, totals, tax lines
WhatsApp Customer Reply | Responding to realistic customer inquiries in mixed language, with tone and escalation rubric
Translation Quality | EN↔ES, EN↔PT, ES↔PT, FR→ES, EN→Kreyòl (with review)
Local-Currency Math | Currency conversion with real daily rates; VAT calculation per country
Regional Sector Knowledge | Coffee grading, tourism booking logic, remittance rules, agro-export documentation
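
As an illustration of the Invoice/Receipt Extraction suite above, a minimal sketch of the kind of structured output being graded. The field names and schema here are illustrative, not the benchmark’s published schema:

    # Illustrative target structure for invoice extraction (schema is an assumption).
    from dataclasses import dataclass, field

    @dataclass
    class ExtractedInvoice:
        vendor: str                    # razão social / razón social as printed
        date: str                      # ISO 8601, e.g. "2026-03-15"
        currency: str                  # "BRL", "COP", "MXN", ...
        subtotal: float
        tax_lines: dict = field(default_factory=dict)   # e.g. {"ICMS": 120.00} or {"IVA 19%": 45.00}
        total: float = 0.0

    example = ExtractedInvoice(
        vendor="Café do Vale Ltda.",
        date="2026-03-15",
        currency="BRL",
        subtotal=1000.00,
        tax_lines={"ICMS": 120.00},
        total=1120.00,
    )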

Version 0.1 results

Model | Invoice | WhatsApp | Translation | Currency | Sectors | Track score
Claude Opus 4.7 | 92 | 90 | 94 | 92 | 88 | 91
GPT-5.4 Thinking | 91 | 88 | 92 | 91 | 86 | 90
Claude Sonnet 4.6 | 89 | 87 | 91 | 90 | 85 | 88
GPT-5.4 | 89 | 86 | 90 | 89 | 84 | 87
Gemini 3.1 Pro | 88 | 84 | 90 | 89 | 84 | 87
Claude Haiku 4.5 | 82 | 82 | 84 | 85 | 74 | 81
Gemini 3.1 Flash | 78 | 78 | 82 | 82 | 68 | 77
Llama 4 Maverick | 78 | 74 | 80 | 80 | 68 | 76
Llama 4 Scout | 72 | 70 | 75 | 75 | 62 | 71
Gemma 4 9B | 70 | 68 | 70 | 72 | 55 | 67
Phi-4 (14B) | 75 | 62 | 58 | 70 | 52 | 63
Qwen 3 14B | 68 | 64 | 62 | 70 | 50 | 63
The v0.1 task set is community-sourced and pending expansion. Methodology, prompts, and grading rubrics are published openly so others can replicate and contribute.

Track 4: Cost Efficiency (15%)

Measures the USD cost of completing a representative SMB workload (1,000 WhatsApp replies + 500 invoice extractions + 100 translations) at published public pricing as of April 2026. Normalised so the cheapest viable-quality model scores 100 and the most expensive viable model scores proportionally lower.
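
The workload cost can be approximated from public per-million-token pricing and rough token counts per task. A minimal sketch; the token estimates and prices below are placeholders, not the figures used for v0.1:

    # Rough SMB workload cost from per-million-token pricing (all numbers illustrative).
    WORKLOAD = {
        # task: (count, avg input tokens, avg output tokens)
        "whatsapp_reply":  (1000, 400, 150),
        "invoice_extract": (500, 1200, 300),
        "translation":     (100, 800, 800),
    }

    def workload_cost_usd(price_in_per_mtok: float, price_out_per_mtok: float) -> float:
        total = 0.0
        for count, tok_in, tok_out in WORKLOAD.values():
            total += count * (tok_in * price_in_per_mtok + tok_out * price_out_per_mtok) / 1_000_000
        return round(total, 2)

    # Placeholder pricing, not any vendor's actual rate:
    print(workload_cost_usd(price_in_per_mtok=0.50, price_out_per_mtok=1.50))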

Version 0.1 results

Model | Workload cost (USD) | Track score
Gemini 3.1 Flash Lite | $0.80 | 100
Llama 4 Scout (hosted) | $1.20 | 95
Gemini 3.1 Flash | $2.50 | 88
Claude Haiku 4.5 | $3.00 | 85
GPT-5.3 Instant | $3.10 | 85
Llama 4 Maverick (hosted) | $4.00 | 80
Mistral Large | $8.00 | 68
Claude Sonnet 4.6 | $14 | 55
GPT-5.4 | $14 | 55
Gemini 3.1 Pro | $20 | 45
Claude Opus 4.7 | $80 | 20
GPT-5.4 Pro | $80 | 20
Gemma 4 9B (self-hosted) | ~$0 marginal | 100
Phi-4 (self-hosted) | ~$0 marginal | 100

Track 5: Offline / Low-Bandwidth Capability (15%)

A single-axis score: can the model be run fully offline on consumer hardware in a Caribbean or Latin American context?
Criterion | Points
Open-weight, commercially usable | 30
Runs on ≤ 16 GB RAM hardware at acceptable speed | 30
Quality retained in Q4 quantization | 20
Multilingual LAC coverage maintained offline | 20
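
The track score is the sum of the points a model earns on these four criteria. A minimal sketch with partial credit expressed as a 0-1 fraction per criterion; the partial-credit scheme is an assumption (the v0.1 scores below, such as Phi-4’s 82, suggest graders do award partial points):

    # Offline/low-bandwidth track: weighted criteria, fractions give partial credit.
    CRITERIA = {
        "open_weight_commercial": 30,
        "runs_on_16gb":           30,
        "q4_quality_retained":    20,
        "lac_langs_offline":      20,
    }

    def offline_score(fractions: dict) -> float:
        """fractions maps each criterion to a 0.0-1.0 degree of satisfaction."""
        return sum(points * fractions.get(name, 0.0) for name, points in CRITERIA.items())

    # Illustrative only -- not the official v0.1 grading of any model:
    print(offline_score({"open_weight_commercial": 1.0, "runs_on_16gb": 0.6,
                         "q4_quality_retained": 1.0, "lac_langs_offline": 0.6}))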

Version 0.1 results

Model | Open weight? | Runs on ≤16 GB? | Q4 quality | LAC langs offline | Track score
Gemma 4 9B | Yes | Yes | Good | Strong | 95
Phi-4 (14B) | Yes | Tight (24 GB better) | Good | Moderate | 82
Phi-4 Mini (3.8B) | Yes | Yes, easily | Good | Moderate | 85
Mistral NeMo 12B | Yes | Yes | Good | Strong (ES/PT/FR) | 90
Qwen 3 7B | Yes | Yes | Good | Moderate | 80
Llama 4 Scout | Yes | No (needs 48+ GB) | Good | Very strong | 70
Llama 3.3 8B | Yes | Yes | Good | Moderate | 80
Claude Opus 4.7 | No | N/A | N/A | N/A | 0
GPT-5.4 | No | N/A | N/A | N/A | 0
Gemini 3.1 Pro | No | N/A | N/A | N/A | 0

The LAC Composite (weighted average)

Weighted across the five tracks.
Rank | Model | Fluency (25%) | Context (20%) | Tasks (25%) | Cost (15%) | Offline (15%) | LAC Composite
1 | Claude Opus 4.7 | 86 | 88 | 91 | 20 | 0 | 60.4
2 | GPT-5.4 Thinking | 82 | 86 | 90 | 55 | 0 | 63.2
3 | Claude Sonnet 4.6 | 82 | 83 | 88 | 55 | 0 | 62.9
4 | Gemini 3.1 Pro | 78 | 87 | 87 | 45 | 0 | 59.4
5 | GPT-5.4 | 81 | 83 | 87 | 55 | 0 | 62.0
6 | Claude Haiku 4.5 | 75 | 72 | 81 | 85 | 0 | 60.5
7 | Gemini 3.1 Flash | 70 | 75 | 77 | 88 | 0 | 57.4
8 | Llama 4 Maverick | 71 | 72 | 76 | 80 | 60 (self-host) | 70.3 (with local)
9 | Llama 4 Scout | 67 | 67 | 71 | 95 | 70 | 72.6
10 | Gemma 4 9B | 62 | 61 | 67 | 100 | 95 | 74.8
11 | Mistral NeMo 12B | 58 | 55 | 58 | 100 | 90 | 69.4
12 | Phi-4 (14B) | 54 | 55 | 63 | 100 | 82 | 68.0
13 | Qwen 3 14B | 54 | 51 | 63 | 100 | 80 | 66.2
Open-weight models gain cost and offline points when self-hosted. “Cost” for self-hosted models is marginal cost after hardware; amortised cost varies with volume.

What the composite tells you

  • For highest quality at any cost: Claude Opus 4.7, GPT-5.4 Thinking, Gemini 3.1 Pro.
  • For daily production work: Claude Sonnet 4.6, GPT-5.4, Gemini 3.1 Flash.
  • For high-volume / low-cost: Claude Haiku 4.5, Gemini 3.1 Flash, Llama 4 Scout hosted.
  • For privacy-first / offline: Gemma 4 9B or Mistral NeMo 12B on local hardware.
  • For LAC-language priority: Claude > GPT-5.4 > Gemini > Mistral > Llama > open-weight SLMs.

Methodology (how to replicate)

  1. Prompt set: 200 fluency prompts per language, 400 context prompts, 250 SMB task prompts (50 × 5 suites). Published openly on request.
  2. Scorers: Native-speaker judges for language and context; deterministic rubrics for SMB tasks; public pricing for cost; verified hardware runs for offline.
  3. Inter-rater agreement: κ ≥ 0.8 target for human-scored items.
  4. Sampling: Uniform random from the published pool; vendors and the public can request the full set.
  5. Refresh cadence: Quarterly. v0.2 target: Q3 2026.

Contributing

You can help extend the LAC Benchmark in four ways:
  1. Submit prompts for the regional-context or SMB-task tracks (especially for under-covered countries).
  2. Score outputs as a native-language reviewer.
  3. Run models we haven’t covered yet and submit scores with methodology.
  4. Propose new tracks (for example, agricultural-advisory quality, accounting-document depth, medical-triage quality).
Contact: ceo@maestrosai.com with subject “LAC Benchmark”. All contributions are released under Creative Commons BY 4.0 and credited to contributors.

Citation

Dunkley, A. (2026). LAC Benchmark: AI Model Performance in Latin America and the Caribbean. Version 0.1. MaestrosAI. maestrosai.com. CC BY 4.0.

