

# Deploying a Small Language Model: A Practical Guide

> **Created by Adrian Dunkley** | [maestrosai.com](https://maestrosai.com) | [ceo@maestrosai.com](mailto:ceo@maestrosai.com) | Fair Use

***

This page takes you from "I want to try this" to a working SLM on your own machine in about an hour. It covers the three mainstream runtimes (Ollama, LM Studio, llama.cpp), realistic cost math, and how local SLMs fit into LAC data-protection rules.

If you haven't picked a model, read [models.md](models.md) first.

***

## The three runtimes

| Runtime       | Who it's for                     | Ease      | Control | Cost              |
| ------------- | -------------------------------- | --------- | ------- | ----------------- |
| **Ollama**    | Beginners, developers            | Very easy | High    | Free, open source |
| **LM Studio** | Non-developers, visual users     | Easy      | Medium  | Free              |
| **llama.cpp** | Advanced, low-level optimization | Medium    | Maximum | Free, open source |

For 90% of LAC SMBs, **Ollama** is the right answer.

***

## Ollama: the 30-minute setup

### Install

```bash theme={null}
# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download installer from https://ollama.com/download
```

### Pull a model

Pick based on your hardware (see [models.md](models.md)):

```bash theme={null}
ollama pull gemma3:9b          # Good general starting point, 16 GB RAM
ollama pull phi4:14b           # Strong reasoning, 24 GB RAM
ollama pull phi4-mini:3.8b     # Tiny, works on 8 GB RAM
ollama pull llama4-scout:17b   # Long context, needs 48+ GB unified memory
ollama pull qwen3:7b           # Fast, reasoning-focused
ollama pull mistral:7b         # Good French Caribbean support
```

### Chat with it

```bash theme={null}
ollama run gemma3:9b
>>> Hola, ¿puedes ayudarme a redactar una respuesta a un cliente de Cartagena?
```

That's it. The model runs locally; nothing leaves your machine.

### Use it from a script

```python theme={null}
import requests

# Ollama's native REST API; POST a prompt, get a completion back.
r = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma3:9b",
        "prompt": "Escribe un correo en español profesional...",
        "stream": False,
    },
    timeout=120,  # local generation can take a while on modest hardware
)
print(r.json()["response"])
```

### Use it as a drop-in OpenAI-compatible endpoint

Ollama exposes an OpenAI-compatible endpoint at `http://localhost:11434/v1`. Most LangChain, n8n, and Claude-Agent-SDK-style stacks can point at it by changing the base URL and skipping the API key.
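For example, here is a minimal stdlib sketch against that endpoint (model tag and default port as above; the response shape follows the standard OpenAI chat-completions format, and the live call is commented out so the snippet loads even when Ollama isn't running):

```python theme={null}
import json
import urllib.request

BASE = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible base URL

def build_chat_request(model: str, user_message: str) -> dict:
    """Build an OpenAI-style chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": False,
    }

def chat(payload: dict) -> str:
    """POST to the local endpoint and return the reply text. No API key needed."""
    req = urllib.request.Request(
        f"{BASE}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Requires a running Ollama with the model pulled:
# print(chat(build_chat_request("gemma3:9b", "Say hello in Spanish.")))
```

Because the payload is standard, swapping in a real OpenAI-client library later means changing only the base URL back.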

***

## LM Studio: the no-code path

[LM Studio](https://lmstudio.ai) is a desktop app. You install it, browse models through a UI, download, and chat. Good for:

* Staff who don't want to touch a terminal.
* Quick experimentation with different models.
* Small businesses with one or two users.

It includes the same OpenAI-compatible endpoint, so the integration story matches Ollama.

***

## llama.cpp: when you need the last 20% of performance

[llama.cpp](https://github.com/ggerganov/llama.cpp) is the low-level engine that powers much of the ecosystem. Use it directly when:

* You need custom quantization levels.
* You want to squeeze an SLM onto an edge device (Raspberry Pi 5 with AI HAT, Jetson Nano, older laptops).
* You're shipping an embedded product.

Most LAC SMBs never need to touch llama.cpp directly. Ollama and LM Studio both use it under the hood.

***

## Quantization: the one concept you must understand

Language models store weights as floating-point numbers. **Quantization** is compression: replacing those numbers with lower-precision ones, so the model uses less memory and runs faster, at the cost of a small quality drop.

Common levels:

| Label    | Precision    | Size (Gemma 3 9B) | Quality          | When to use                    |
| -------- | ------------ | ----------------- | ---------------- | ------------------------------ |
| FP16     | 16-bit float | \~18 GB           | Full             | Best, if you have the hardware |
| Q8\_0    | 8-bit        | \~9 GB            | Near-full        | High-end laptops               |
| Q6\_K    | 6-bit        | \~7 GB            | Strong           | Default for serious work       |
| Q4\_K\_M | 4-bit        | \~5 GB            | Good             | Default for most SMBs          |
| Q3\_K\_M | 3-bit        | \~4 GB            | Noticeable loss  | Low-RAM fallback               |
| Q2\_K    | 2-bit        | \~3 GB            | Significant loss | Emergency only                 |

Rule of thumb: start with **Q4\_K\_M**. It's the price-to-performance sweet spot. Move up to Q6\_K if quality feels thin; drop to Q3 only if you run out of memory.

Ollama tags usually default to Q4\_K\_M unless the tag says otherwise.
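You can reproduce the rough sizes in the table with back-of-envelope math. The effective bits-per-weight figures below are approximations (K-quants mix precisions across tensors), so treat the output as an estimate, not an exact download size:

```python theme={null}
# Approximate effective bits per weight for common GGUF quantization levels.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q4_K_M": 4.5,
    "Q3_K_M": 3.9,
    "Q2_K": 2.6,
}

def model_size_gb(params_billion: float, quant: str) -> float:
    """Estimate model file size in GB: parameters x bytes per weight."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8

# Sizes for a 9B-parameter model, matching the table above:
for quant in BITS_PER_WEIGHT:
    print(f"{quant:7s} ~{model_size_gb(9, quant):.1f} GB")
```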

***

## Hardware buying guide, 2026

For a small business ready to commit:

| Budget        | Hardware                               | Runs                                           |
| ------------- | -------------------------------------- | ---------------------------------------------- |
| \$600-900     | Mac Mini M4, 16 GB                     | Phi-4 Mini, Gemma 3 2B, Mistral 7B Q4          |
| \$900-1,400   | Mac Mini M4 Pro, 24 GB                 | Gemma 3 9B, Phi-4 14B Q4                       |
| \$1,800-2,400 | Mac Studio M4 Max, 64 GB               | Llama 4 Scout Q4, Qwen 3 32B                   |
| \$2,500+      | Linux + RTX 4090 (used: \$1,200-1,500) | Anything practical up to Llama 4 Scout in FP16 |
| \$6,000+      | Workstation + H100                     | Everything including light Llama 4 Maverick    |

In 2026, for a serious LAC SMB, a **Mac Mini M4 Pro with 24 GB** or a **Linux desktop with a used RTX 4090** is the sweet-spot recommendation. Either runs Gemma 3 9B or Phi-4 well and leaves headroom to scale up.

***

## Cost of ownership: real math

Cloud model (for comparison):

* Claude Haiku 4.5 at SMB volume (\~5,000 chat turns/mo): \~\$25-50/mo.
* Claude Sonnet 4.6 at the same volume: \~\$75-200/mo.
* GPT-5.4 at the same volume: similar band.
* After 36 months: **\$2,700-7,200** in cumulative fees.

Local SLM:

* Mac Mini M4 Pro: \$1,200 one-time.
* Electricity in LAC: \$5-15/mo.
* Maintenance: 1-2 hours/mo.
* After 36 months: **\$1,200 + \$360 electricity = \$1,560** total.

Crossover is typically 9-14 months for medium volumes and under 6 months for high-volume tasks.
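That crossover claim follows directly from the figures above. A quick sanity check (hardware price, cloud fees, and electricity are the estimates from this section; \$250/mo stands in for a higher-volume cloud bill):

```python theme={null}
import math

def crossover_months(hardware: float, cloud_monthly: float, local_monthly: float) -> int:
    """Months until one-time hardware spend is paid back by cloud savings."""
    return math.ceil(hardware / (cloud_monthly - local_monthly))

# Mac Mini M4 Pro ($1,200), $10/mo electricity:
print(crossover_months(1200, 100, 10))  # medium volume, ~$100/mo cloud → 14
print(crossover_months(1200, 250, 10))  # high volume, ~$250/mo cloud → 5
```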

**Caveat**: a frontier cloud model will outperform an SLM on hard reasoning and long-context tasks. Use both: SLM for volume, cloud for hard cases. Expected total cost is still well below cloud-only.

***

## Network and privacy architecture

Three patterns, pick the one that matches your data sensitivity.

### Pattern A: Fully local

Nothing leaves the machine. Best for healthcare, legal, financial records, HR, anything under strict LAC data-protection rules (see [governance/README.md](../governance/README.md)).

```
Customer phone / laptop
   ↓
SLM on local server
   ↓
Local database (SQLite / Postgres)
```

### Pattern B: Local SLM + cloud for hard cases

Default to local; call a cloud API only for high-difficulty queries. Best for most SMBs.

```
Customer → local SLM
                ↓ if uncertain
            → cloud frontier model
                ↓
            local log
```
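A sketch of that router, assuming one common escalation heuristic: instruct the local SLM to emit a sentinel token when unsure. `ask_local` and `ask_cloud` are hypothetical stand-ins for your Ollama and cloud API calls, and the stub models below exist only to show the control flow:

```python theme={null}
ESCALATE = "ESCALATE"  # sentinel the local SLM is told to emit when unsure

SYSTEM = (
    "Answer the customer's question. If you are not confident in your answer, "
    f"reply with exactly {ESCALATE} and nothing else."
)

def route(question: str, ask_local, ask_cloud) -> dict:
    """Pattern B: local SLM first; escalate to the cloud only on the sentinel.

    ask_local / ask_cloud are caller-supplied functions that take a prompt
    string and return the model's reply text.
    """
    reply = ask_local(f"{SYSTEM}\n\nQuestion: {question}")
    if reply.strip() == ESCALATE:
        return {"source": "cloud", "answer": ask_cloud(question)}
    return {"source": "local", "answer": reply}

# Stub models (no network) to show the control flow:
local = lambda p: ESCALATE if "tax law" in p else "We open at 9am."
cloud = lambda p: "Detailed answer from the frontier model."
print(route("What time do you open?", local, cloud)["source"])           # → local
print(route("How does the new tax law affect me?", local, cloud)["source"])  # → cloud
```

In production you would also write the routing decision to a local log, per the diagram.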

### Pattern C: Local SLM with regional cloud backup

Same as B, but the cloud provider is São Paulo, Santiago, or another LAC region. Keeps data in-region.

***

## Fine-tuning on your business data

Most LAC SMBs don't need to fine-tune. A good RAG (retrieval-augmented generation) setup over your documents gets 80% of the benefit with 10% of the effort.
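To make "RAG" concrete, here is a deliberately tiny sketch: retrieval is plain word overlap (real setups use embedding search, but the shape is identical), and the document snippets are invented:

```python theme={null}
import re

def tokens(text: str) -> set:
    """Lowercase word set (toy tokenizer)."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query: str, docs: list, k: int = 1) -> list:
    """Return the k documents sharing the most words with the query."""
    return sorted(docs, key=lambda d: len(tokens(query) & tokens(d)), reverse=True)[:k]

def build_rag_prompt(query: str, docs: list) -> str:
    """Prepend retrieved context so the model answers from your documents."""
    context = "\n".join(retrieve(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "Refund policy: refunds within 30 days with receipt.",
    "Shipping: orders ship from Bogotá within 2 business days.",
]
print(build_rag_prompt("What is the refund policy?", docs))
```

Feed the resulting prompt to any of the runtimes above; the model now answers from your facts instead of guessing.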

If you do need to fine-tune:

* **Data volume**: 500-5,000 well-curated examples is enough for style and terminology. Under 500, RAG will beat fine-tuning.
* **Tools**: Unsloth, Axolotl, or Hugging Face's `trl` library for Gemma, Llama, Phi, and Mistral.
* **Time**: A Gemma 3 9B LoRA fine-tune takes 4-8 hours on a single 24 GB GPU.
* **Cost**: $0 on your own hardware; $10-40 on a rented cloud GPU for the duration.

Fine-tuned SLMs can sound exactly like your business. It's a powerful move for established brands with a clear voice.
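Whichever tool you choose, step one is the same: convert your curated examples into chat-format JSONL. The `messages` schema below is the common convention these tools accept, but check your tool's docs for its exact field names; the example pairs are invented:

```python theme={null}
import json

def to_chat_jsonl(examples, path: str, system: str = "") -> None:
    """Write (prompt, response) pairs as one chat-format JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for prompt, response in examples:
            messages = [{"role": "system", "content": system}] if system else []
            messages += [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": response},
            ]
            f.write(json.dumps({"messages": messages}, ensure_ascii=False) + "\n")

examples = [
    ("Do you deliver to Medellín?", "Yes, delivery to Medellín takes 2-3 days."),
    ("What payment methods do you accept?", "Cards, PSE, and cash on delivery."),
]
to_chat_jsonl(examples, "train.jsonl", system="You are our shop's support assistant.")
```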

***

## Running multiple models on one box

Ollama can hold many models and swap them as needed. For a small business:

* `gemma3:9b` for customer chat (ES/PT/EN/FR).
* `phi4:14b` for document processing and invoice extraction.
* `mistral:7b` for French Caribbean content.

Ollama loads and unloads them from memory on demand. With 24-64 GB of unified memory, you can keep two resident and swap in a third when needed.
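That split can be written down as a small routing table. The model tags are the ones pulled earlier; the task names are invented for illustration; the payload targets Ollama's `/api/generate` endpoint shown above:

```python theme={null}
# Map each business task to the model best suited for it.
MODEL_FOR_TASK = {
    "customer_chat": "gemma3:9b",
    "document_processing": "phi4:14b",
    "french_content": "mistral:7b",
}

def pick_model(task: str) -> str:
    """Return the model tag for a task, falling back to the chat model."""
    return MODEL_FOR_TASK.get(task, MODEL_FOR_TASK["customer_chat"])

def generate_payload(task: str, prompt: str) -> dict:
    """Payload for POST http://localhost:11434/api/generate."""
    return {"model": pick_model(task), "prompt": prompt, "stream": False}

print(generate_payload("document_processing", "Extract the totals from this invoice.")["model"])  # → phi4:14b
```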

***

## Troubleshooting

**"My model is slow."** Check RAM usage. If the model exceeds your RAM, it swaps to disk and crawls. Use a smaller quantization or a smaller model.

**"My model says something wrong."** Local SLMs are weaker than frontier cloud models on reasoning. Give more context in the prompt, use RAG for facts, or escalate hard cases to cloud.

**"My model forgets."** SLMs have shorter effective context windows than cloud models. Summarize older turns into a running "state" variable and put it in the system prompt.
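One way to build that running state, sketched with a placeholder summarizer (it just keeps each old turn's first sentence; in practice you'd ask the SLM itself to write the summary):

```python theme={null}
def compress_history(turns: list, keep_last: int = 4):
    """Split history into (summary_of_older_turns, recent_turns_verbatim)."""
    old, recent = turns[:-keep_last], turns[-keep_last:]
    # Placeholder summarizer: first sentence of each old turn.
    summary = " ".join(t.split(".")[0] + "." for t in old)
    return summary, recent

def build_context_prompt(system: str, turns: list) -> str:
    """System prompt carries the summary; only recent turns stay verbatim."""
    summary, recent = compress_history(turns)
    parts = [system]
    if summary:
        parts.append(f"Earlier conversation, summarized: {summary}")
    parts += recent
    return "\n".join(parts)
```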

**"My fine-tune got worse."** Usually caused by too few examples, too many epochs, or inconsistent data. Start with 1,000 high-quality examples and 2-3 epochs.

**"I can't run Llama 4 Scout on my Mac."** Even Q4 needs 48-64 GB of unified memory. Drop to Gemma 3 9B or a smaller Qwen 3.

***

## Related reading

* [models.md](models.md): pick the right model.
* [use-cases.md](use-cases.md): concrete scenarios.
* [agents/build.md](../agents/build.md): code-path agent building with local models.
* [governance/README.md](../governance/README.md): compliance fit.
* [rankings/global-benchmarks.md](../rankings/global-benchmarks.md): benchmark scores for the models above.

***

*Fair Use, Educational Resource | April 2026*
