
# The LAC Benchmark

## A Custom Benchmark for AI Model Performance in Latin America and the Caribbean

> **Created by Adrian Dunkley** | [maestrosai.com](https://maestrosai.com) | [ceo@maestrosai.com](mailto:ceo@maestrosai.com) | Fair Use
> *Version 0.1, April 2026. Licensed under Creative Commons BY 4.0.*

***

## Why the LAC Benchmark exists

Global benchmarks (MMLU, GPQA, SWE-bench, MATH) tell you which model is smartest *in English on US/EU content*. They do not tell a Colombian coffee exporter whether the model will translate a buyer's email correctly, a Jamaican tour operator whether it will handle a WhatsApp booking in patois without insulting the guest, or a São Paulo accountant whether it will extract an invoice in Brazilian Portuguese with the right tax fields.

The LAC Benchmark is the missing complement. It measures **the things that decide whether an AI model is actually useful for a Caribbean or Latin American small business**.

It is open. Contributions are welcome. See *Contributing* at the end.

***

## Design goals

1. **Regional language fluency**, not just surface translation.
2. **LAC context**, including regulations, currencies, history, geography.
3. **Realistic SMB tasks** that small businesses actually run.
4. **Cost-aware**, because a model that wins on quality but costs \$0.50 per query isn't a win for a micro-business.
5. **Offline-capable scoring**, so open-weight SLMs are evaluated on the same axis as cloud frontier models.

***

## The five tracks

| Track                                | Weight | What it measures                                                                   |
| ------------------------------------ | ------ | ---------------------------------------------------------------------------------- |
| **Language Fluency**                 | 25%    | Quality of output in LAC regional languages                                        |
| **Regional Context**                 | 20%    | Knowledge of LAC history, geography, business, regulation                          |
| **SMB Task Suites**                  | 25%    | Invoice extraction, WhatsApp replies, translation, currency math, sector knowledge |
| **Cost Efficiency**                  | 15%    | USD cost per completed task at SMB volumes                                         |
| **Offline/Low-Bandwidth Capability** | 15%    | Can this model work without a stable internet connection?                          |

Scores in each track are on a 0 to 100 scale. The LAC Composite is the weighted average.
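That weighted average can be sketched in a few lines. The weights come from the track table above; the scores in the example are hypothetical placeholders, not benchmark results.

```python
# Track weights from the table above. The example scores are hypothetical.
WEIGHTS = {
    "fluency": 0.25,
    "context": 0.20,
    "tasks": 0.25,
    "cost": 0.15,
    "offline": 0.15,
}

def lac_composite(track_scores: dict[str, float]) -> float:
    """Weighted average of 0-100 track scores; the weights sum to 1.0."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return round(sum(WEIGHTS[t] * track_scores[t] for t in WEIGHTS), 1)

hypothetical = {"fluency": 80, "context": 70, "tasks": 76, "cost": 90, "offline": 60}
print(lac_composite(hypothetical))  # 20 + 14 + 19 + 13.5 + 9 = 75.5
```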

***

## Track 1: Language Fluency (25%)

Each model is tested on 200 prompts per regional language, with outputs scored by native-speaker judges against a 1-5 rubric (grammar, naturalness, register, idiom, cultural fit). Inter-rater agreement target: κ ≥ 0.8.
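For the two-rater case, agreement against that κ target can be checked with Cohen's kappa (assuming Cohen's variant; the judge scores below are made up for illustration). A minimal dependency-free sketch:

```python
from collections import Counter

def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Cohen's kappa for two raters scoring the same items on a 1-5 rubric."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label at random.
    expected = sum((counts_a[label] / n) * (counts_b[label] / n)
                   for label in set(counts_a) | set(counts_b))
    return (observed - expected) / (1 - expected)

judge_1 = [5, 4, 4, 3, 5, 2, 4, 5]  # hypothetical rubric scores
judge_2 = [5, 4, 3, 3, 5, 2, 4, 4]
print(round(cohens_kappa(judge_1, judge_2), 2))  # 0.65 here, below the 0.8 target
```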

### Languages and weights within the track

| Language                              | Weight | Speakers | Coverage notes                                                        |
| ------------------------------------- | ------ | -------- | --------------------------------------------------------------------- |
| **Spanish (6 dialects)**              | 40%    | \~450M   | Mexican, Colombian, Rioplatense, Chilean, Central American, Caribbean |
| **Portuguese (Brazilian + European)** | 25%    | \~220M   | Brazilian weighted 85%, European 15%                                  |
| **English (Caribbean registers)**     | 10%    | \~20M    | Standard + Jamaican, Trinidadian, Bajan registers                     |
| **Haitian Kreyòl**                    | 10%    | \~12M    | Generation and comprehension                                          |
| **Regional French**                   | 10%    | \~5M     | Martinique, Guadeloupe, Haiti                                         |
| **Papiamento**                        | 5%     | \~330K   | Aruba, Curaçao, Bonaire                                               |

### Version 0.1 results

| Model                 | Spanish | Portuguese | Caribbean EN | Kreyòl | FR Carib. | Papiamento | Track score |
| --------------------- | ------- | ---------- | ------------ | ------ | --------- | ---------- | ----------- |
| **Claude Opus 4.7**   | 93      | 92         | 90           | 65     | 89        | 42         | **86**      |
| **Claude Sonnet 4.6** | 90      | 89         | 88           | 60     | 86        | 35         | 82          |
| **GPT-5.4**           | 91      | 88         | 89           | 55     | 85        | 30         | 81          |
| **GPT-5.4 Thinking**  | 91      | 88         | 90           | 58     | 85        | 32         | 82          |
| **Gemini 3.1 Pro**    | 88      | 86         | 87           | 48     | 84        | 22         | 78          |
| **Mistral Large**     | 82      | 78         | 72           | 30     | 87        | 18         | 71          |
| **Llama 4 Maverick**  | 80      | 78         | 75           | 45     | 70        | 20         | 71          |
| **Llama 4 Scout**     | 76      | 73         | 70           | 40     | 65        | 18         | 67          |
| **Gemma 4 9B**        | 72      | 70         | 68           | 32     | 65        | 12         | 62          |
| **Phi-4 (14B)**       | 62      | 58         | 74           | 18     | 54        | 5          | 54          |
| **Qwen 3 14B**        | 65      | 60         | 70           | 12     | 55        | 4          | 54          |
| **Mistral NeMo 12B**  | 65      | 60         | 65           | 22     | 70        | 10         | 58          |

*v0.1 scores are based on native-speaker grading of 200 prompts per language by MaestrosAI's regional review team and community contributors (March–April 2026). Papiamento scores are indicative; pool size is small.*

***

## Track 2: Regional Context (20%)

A closed-book question-and-answer set of 400 prompts covering:

* **Government and regulation** (LGPD, Ley 25.326, Ley 21.719, Jamaica DPA, Ley 1581, etc.)
* **Business and economy** (currencies, central banks, tax regimes, main export sectors by country)
* **History and geography** (national founders, capitals, major cities, major historical events)
* **Culture and daily life** (holidays, food, music, sports, local idioms)
* **Sector knowledge** (coffee, tourism, mining, agro-exports, remittances, fintech landscape)

Scored by exact match for short-answer items and by human judges for free-form responses.
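For the exact-match portion, a tolerant comparison avoids penalizing accents and casing; the normalization choices below are assumptions for illustration, not the published grading rules.

```python
import unicodedata

def normalize(answer: str) -> str:
    """Trim, case-fold, and strip accents so 'Bogotá ' matches 'bogota'."""
    decomposed = unicodedata.normalize("NFD", answer.strip().casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def exact_match_rate(predictions: list[str], references: list[str]) -> float:
    """Share of predictions (0-100) that match their reference exactly."""
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100 * hits / len(references)

preds = ["Bogotá", "Lempira", "1822"]
refs = ["bogota", "lempira", "1821"]
print(exact_match_rate(preds, refs))  # two of three match
```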

### Version 0.1 results

| Model                 | Regulation | Economy | History/Geo | Culture | Sectors | Track score |
| --------------------- | ---------- | ------- | ----------- | ------- | ------- | ----------- |
| **Claude Opus 4.7**   | 90         | 88      | 91          | 85      | 86      | **88**      |
| **Gemini 3.1 Pro**    | 88         | 90      | 92          | 82      | 85      | 87          |
| **GPT-5.4 Thinking**  | 88         | 87      | 90          | 82      | 84      | 86          |
| **Claude Sonnet 4.6** | 84         | 83      | 88          | 80      | 82      | 83          |
| **GPT-5.4**           | 84         | 84      | 87          | 79      | 82      | 83          |
| **Llama 4 Maverick**  | 70         | 72      | 78          | 68      | 70      | 72          |
| **Llama 4 Scout**     | 66         | 68      | 74          | 62      | 65      | 67          |
| **Mistral Large**     | 70         | 70      | 75          | 65      | 66      | 69          |
| **Gemma 4 9B**        | 58         | 62      | 70          | 58      | 58      | 61          |
| **Phi-4 (14B)**       | 52         | 58      | 65          | 50      | 50      | 55          |
| **Qwen 3 14B**        | 48         | 55      | 62          | 45      | 46      | 51          |
| **Mistral NeMo 12B**  | 52         | 55      | 62          | 52      | 52      | 55          |

***

## Track 3: SMB Task Suites (25%)

Five real tasks, 50 prompts each, scored on completion rate and quality.

| Task suite                     | Description                                                                                                          |
| ------------------------------ | -------------------------------------------------------------------------------------------------------------------- |
| **Invoice/Receipt Extraction** | Structured-field extraction from Spanish, Portuguese, and bilingual invoices; checks vendor, date, totals, tax lines |
| **WhatsApp Customer Reply**    | Responding to realistic customer inquiries in mixed language, with tone and escalation rubric                        |
| **Translation Quality**        | EN↔ES, EN↔PT, ES↔PT, FR→ES, EN→Kreyòl (with review)                                                                  |
| **Local-Currency Math**        | Currency conversion with real daily rates; VAT calculation per country                                               |
| **Regional Sector Knowledge**  | Coffee grading, tourism booking logic, remittance rules, agro-export documentation                                   |
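The Local-Currency Math suite boils down to conversions and VAT arithmetic like the following. The 19% rate is Colombia's standard IVA; the 4,000 COP/USD rate is an illustrative assumption, not a real daily rate.

```python
def total_with_vat(net_amount: float, vat_rate: float) -> float:
    """Amount payable after adding VAT at the given rate."""
    return round(net_amount * (1 + vat_rate), 2)

def convert(amount: float, rate: float) -> float:
    """Convert at a daily rate expressed as target units per source unit."""
    return round(amount * rate, 2)

# A 1,000,000 COP invoice with 19% IVA, converted at an assumed 4,000 COP/USD.
total_cop = total_with_vat(1_000_000, 0.19)
total_usd = convert(total_cop, 1 / 4000)
print(total_cop, total_usd)  # 1190000.0 297.5
```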

### Version 0.1 results

| Model                 | Invoice | WhatsApp | Translation | Currency | Sectors | Track score |
| --------------------- | ------- | -------- | ----------- | -------- | ------- | ----------- |
| **Claude Opus 4.7**   | 92      | 90       | 94          | 92       | 88      | **91**      |
| **GPT-5.4 Thinking**  | 91      | 88       | 92          | 91       | 86      | 90          |
| **Claude Sonnet 4.6** | 89      | 87       | 91          | 90       | 85      | 88          |
| **GPT-5.4**           | 89      | 86       | 90          | 89       | 84      | 87          |
| **Gemini 3.1 Pro**    | 88      | 84       | 90          | 89       | 84      | 87          |
| **Claude Haiku 4.5**  | 82      | 82       | 84          | 85       | 74      | 81          |
| **Gemini 3.1 Flash**  | 78      | 78       | 82          | 82       | 68      | 77          |
| **Llama 4 Maverick**  | 78      | 74       | 80          | 80       | 68      | 76          |
| **Llama 4 Scout**     | 72      | 70       | 75          | 75       | 62      | 71          |
| **Gemma 4 9B**        | 70      | 68       | 70          | 72       | 55      | 67          |
| **Phi-4 (14B)**       | 75      | 62       | 58          | 70       | 52      | 63          |
| **Qwen 3 14B**        | 68      | 64       | 62          | 70       | 50      | 63          |

*v0.1 results are community-sourced and pending expansion. Methodology, prompts, and grading rubrics are published openly so others can replicate and contribute.*

***

## Track 4: Cost Efficiency (15%)

Measures the USD cost of completing a representative SMB workload (1,000 WhatsApp replies + 500 invoice extractions + 100 translations) at published public pricing as of April 2026.

Scores are normalized so the cheapest viable-quality model scores 100 and more expensive viable models score proportionally lower.
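The exact normalization curve isn't published here. One plausible scheme, an assumption chosen only to illustrate the shape, docks a fixed number of points per doubling of workload cost:

```python
import math

def cost_score(workload_cost: float, cheapest_viable: float) -> int:
    """Illustrative log-scale normalization: the cheapest viable model
    scores 100, and each doubling of cost docks a fixed number of points."""
    points_per_doubling = 12  # assumed steepness, not the benchmark's actual curve
    raw = 100 - points_per_doubling * math.log2(workload_cost / cheapest_viable)
    return max(0, round(raw))

print(cost_score(0.80, 0.80))  # cheapest viable model scores 100
print(cost_score(3.20, 0.80))  # two doublings: 100 - 24 = 76
```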

### Version 0.1 results

| Model                         | Workload cost (USD) | Track score |
| ----------------------------- | ------------------- | ----------- |
| **Gemma 4 9B (self-hosted)**  | \~\$0 marginal      | **100**     |
| **Phi-4 (self-hosted)**       | \~\$0 marginal      | **100**     |
| **Gemini 3.1 Flash Lite**     | \$0.80              | **100**     |
| **Llama 4 Scout (hosted)**    | \$1.20              | 95          |
| **Gemini 3.1 Flash**          | \$2.50              | 88          |
| **Claude Haiku 4.5**          | \$3.00              | 85          |
| **GPT-5.3 Instant**           | \$3.10              | 85          |
| **Llama 4 Maverick (hosted)** | \$4.00              | 80          |
| **Mistral Large**             | \$8.00              | 68          |
| **Claude Sonnet 4.6**         | \$14                | 55          |
| **GPT-5.4**                   | \$14                | 55          |
| **Gemini 3.1 Pro**            | \$20                | 45          |
| **Claude Opus 4.7**           | \$80                | 20          |
| **GPT-5.4 Pro**               | \$80                | 20          |

***

## Track 5: Offline / Low-Bandwidth Capability (15%)

A single combined score, built from the rubric below, answering one question: can the model run fully offline on consumer hardware in a Caribbean or Latin American context?

| Criterion                                        | Points |
| ------------------------------------------------ | ------ |
| Open-weight, commercially usable                 | 30     |
| Runs on ≤ 16 GB RAM hardware at acceptable speed | 30     |
| Quality retained in Q4 quantization              | 20     |
| Multilingual LAC coverage maintained offline     | 20     |
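Scoring against this rubric is mechanical. In the sketch below, the fractional credit for the two 20-point quality criteria is an assumption, and closed-weight models are gated to 0 to match the results table:

```python
def offline_score(open_weight: bool, runs_on_16gb: bool,
                  q4_quality: float, lac_coverage: float) -> int:
    """Rubric points; q4_quality and lac_coverage are fractions in [0, 1]."""
    if not open_weight:
        return 0  # closed-weight API models cannot run offline at all
    score = 30                           # open-weight, commercially usable
    score += 30 if runs_on_16gb else 0   # runs on <= 16 GB RAM
    score += round(20 * q4_quality)      # quality retained at Q4 quantization
    score += round(20 * lac_coverage)    # LAC languages maintained offline
    return score

print(offline_score(True, True, 1.0, 0.8))   # 30 + 30 + 20 + 16 = 96
print(offline_score(False, True, 1.0, 1.0))  # closed weights: 0
```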

### Version 0.1 results

| Model                 | Open weight? | Runs on ≤16 GB?      | Q4 quality | LAC langs offline | Track score |
| --------------------- | ------------ | -------------------- | ---------- | ----------------- | ----------- |
| **Gemma 4 9B**        | Yes          | Yes                  | Good       | Strong            | **95**      |
| **Phi-4 (14B)**       | Yes          | Tight (24 GB better) | Good       | Moderate          | 82          |
| **Phi-4 Mini (3.8B)** | Yes          | Yes, easily          | Good       | Moderate          | 85          |
| **Mistral NeMo 12B**  | Yes          | Yes                  | Good       | Strong (ES/PT/FR) | 90          |
| **Qwen 3 7B**         | Yes          | Yes                  | Good       | Moderate          | 80          |
| **Llama 4 Scout**     | Yes          | No (needs 48+ GB)    | Good       | Very strong       | 70          |
| **Llama 3.3 8B**      | Yes          | Yes                  | Good       | Moderate          | 80          |
| **Claude Opus 4.7**   | No           | N/A                  | N/A        | N/A               | **0**       |
| **GPT-5.4**           | No           | N/A                  | N/A        | N/A               | **0**       |
| **Gemini 3.1 Pro**    | No           | N/A                  | N/A        | N/A               | **0**       |

***

## The LAC Composite (weighted average)

Weighted across the five tracks.

| Rank | Model                 | Fluency (25%) | Context (20%) | Tasks (25%) | Cost (15%) | Offline (15%)  | **LAC Composite**     |
| ---- | --------------------- | ------------- | ------------- | ----------- | ---------- | -------------- | --------------------- |
| 1    | **Gemma 4 9B**        | 62            | 61            | 67          | 100        | 95             | **73.7**              |
| 2    | **Llama 4 Scout**     | 67            | 67            | 71          | 95         | 70             | **72.7**              |
| 3    | **Llama 4 Maverick**  | 71            | 72            | 76          | 80         | 60 (self-host) | **72.2** (with local) |
| 4    | **Mistral NeMo 12B**  | 58            | 55            | 58          | 100        | 90             | 68.5                  |
| 5    | **GPT-5.4 Thinking**  | 82            | 86            | 90          | 55         | 0              | 68.5                  |
| 6    | **Phi-4 (14B)**       | 54            | 55            | 63          | 100        | 82             | 67.6                  |
| 7    | **Claude Sonnet 4.6** | 82            | 83            | 88          | 55         | 0              | 67.4                  |
| 8    | **GPT-5.4**           | 81            | 83            | 87          | 55         | 0              | 66.9                  |
| 9    | **Qwen 3 14B**        | 54            | 51            | 63          | 100        | 80             | 66.5                  |
| 10   | **Claude Haiku 4.5**  | 75            | 72            | 81          | 85         | 0              | 66.2                  |
| 11   | **Gemini 3.1 Pro**    | 78            | 87            | 87          | 45         | 0              | 65.4                  |
| 12   | **Gemini 3.1 Flash**  | 70            | 75            | 77          | 88         | 0              | 65.0                  |
| 13   | **Claude Opus 4.7**   | 86            | 88            | 91          | 20         | 0              | 64.9                  |

*Open-weight models gain cost and offline points when self-hosted. "Cost" for self-hosted models is marginal cost after hardware; amortised cost varies with volume.*

### What the composite tells you

* For **highest quality** at any cost: Claude Opus 4.7, GPT-5.4 Thinking, Gemini 3.1 Pro.
* For **daily production** work: Claude Sonnet 4.6, GPT-5.4, Gemini 3.1 Flash.
* For **high-volume / low-cost**: Claude Haiku 4.5, Gemini 3.1 Flash, Llama 4 Scout hosted.
* For **privacy-first / offline**: Gemma 4 9B or Mistral NeMo 12B on local hardware.
* For **LAC-language priority**: Claude > GPT-5.4 > Gemini > Mistral > Llama > open-weight SLMs.

***

## Methodology (how to replicate)

1. **Prompt set**: 200 fluency prompts per language, 400 context prompts, 250 SMB task prompts (50 × 5 suites). Available openly on request.
2. **Scorers**: Native-speaker judges for language and context; deterministic rubrics for SMB tasks; public pricing for cost; verified hardware runs for offline.
3. **Inter-rater agreement**: κ ≥ 0.8 target for human-scored items.
4. **Sampling**: Uniform random from the published pool; vendors and the public can request the full set.
5. **Refresh cadence**: Quarterly. v0.2 target: Q3 2026.
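Step 4's uniform sampling is easiest to replicate with a seeded RNG; the pool below is a stand-in for the published prompt set, and the seed value is arbitrary.

```python
import random

def sample_prompts(pool: list[str], k: int, seed: int = 2026) -> list[str]:
    """Uniform random sample without replacement, seeded so runs replicate."""
    return random.Random(seed).sample(pool, k)

pool = [f"context-prompt-{i:03d}" for i in range(400)]  # stand-in pool
subset = sample_prompts(pool, 50)
print(len(subset), len(set(subset)))  # 50 distinct prompts
```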

***

## Contributing

You can help extend the LAC Benchmark in four ways:

1. **Submit prompts** for the regional-context or SMB-task tracks (especially for under-covered countries).
2. **Score outputs** as a native-language reviewer.
3. **Run models** we haven't covered yet and submit scores with methodology.
4. **Propose new tracks** (for example, agricultural-advisory quality, accounting-document depth, medical-triage quality).

Contact: **[ceo@maestrosai.com](mailto:ceo@maestrosai.com)** with subject "LAC Benchmark".

All contributions released under **Creative Commons BY 4.0** and credited to contributors.

***

## Citation

> Dunkley, A. (2026). *LAC Benchmark: AI Model Performance in Latin America and the Caribbean.* Version 0.1. MaestrosAI. maestrosai.com. CC BY 4.0.

***

## Related reading

* [global-benchmarks.md](global-benchmarks.md) for the English/general benchmarks this page complements.
* [tools/README.md](../tools/README.md) for the tool landscape.
* [slm/models.md](../slm/models.md) for deployment details on the open-weight models above.

