Deploying a Small Language Model: A Practical Guide

Created by Adrian Dunkley | maestrosai.com | ceo@maestrosai.com | Fair Use

This page takes you from “I want to try this” to a working SLM on your own machine in about an hour. It covers the three mainstream runtimes (Ollama, LM Studio, llama.cpp), realistic cost math, and how local SLMs fit into LAC data-protection rules. If you haven’t picked a model, read models.md first.

The three runtimes

| Runtime | Who it's for | Ease | Control | Cost |
|---|---|---|---|---|
| Ollama | Beginners, developers | Very easy | High | Free, open source |
| LM Studio | Non-developers, visual users | Easy | Medium | Free |
| llama.cpp | Advanced, low-level optimization | Medium | Maximum | Free, open source |
For 90% of LAC SMBs, Ollama is the right answer.

Ollama: the 30-minute setup

Install

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download installer from https://ollama.com/download

Pull a model

Pick based on your hardware (see models.md):
ollama pull gemma3:9b          # Good general starting point, 16 GB RAM
ollama pull phi4:14b           # Strong reasoning, 24 GB RAM
ollama pull phi4-mini:3.8b     # Tiny, works on 8 GB RAM
ollama pull llama4-scout:17b   # Long context, needs 48+ GB unified memory
ollama pull qwen3:7b           # Fast, reasoning-focused
ollama pull mistral:7b         # Good French Caribbean support

Chat with it

ollama run gemma3:9b
>>> Hola, ¿puedes ayudarme a redactar una respuesta a un cliente de Cartagena?
That’s it. The prompt above asks, in Spanish, “Hi, can you help me draft a reply to a customer in Cartagena?” The model runs locally; nothing leaves your machine.

Use it from a script

import requests

# Ollama's native generate endpoint; stream=False returns one JSON object
# instead of a stream of partial tokens.
r = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma3:9b",
        "prompt": "Escribe un correo en español profesional...",
        "stream": False,
    },
    timeout=120,  # local generation can take a while on modest hardware
)
r.raise_for_status()
print(r.json()["response"])

Use it as a drop-in OpenAI-compatible endpoint

Ollama exposes an OpenAI-compatible endpoint at http://localhost:11434/v1. Most LangChain, n8n, and Claude-Agent-SDK-style stacks can point at it by changing the base URL and skipping the API key.
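For example, a minimal chat-completions call using only the standard library — the base URL and model tag are the Ollama defaults from above; adjust if yours differ:

```python
import json
import urllib.request

OLLAMA_BASE = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible base URL


def build_payload(model, messages):
    """Assemble an OpenAI-style chat.completions request body."""
    return {"model": model, "messages": messages, "stream": False}


def chat(model, messages):
    """POST to the local endpoint and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{OLLAMA_BASE}/chat/completions",
        data=json.dumps(build_payload(model, messages)).encode(),
        headers={"Content-Type": "application/json"},  # no API key needed locally
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]


# With Ollama running, a call looks like:
# chat("gemma3:9b", [{"role": "user", "content": "Say hello in Spanish."}])
```

Because the wire format matches OpenAI's, the official `openai` client works too: point its `base_url` at the address above and pass any placeholder API key.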

LM Studio: the no-code path

LM Studio is a desktop app. You install it, browse models through a UI, download, and chat. Good for:
  • Staff who don’t want to touch a terminal.
  • Quick experimentation with different models.
  • Small businesses with one or two users.
It includes the same OpenAI-compatible endpoint, so the integration story matches Ollama.

llama.cpp: when you need the last 20% of performance

llama.cpp is the low-level engine that powers much of the ecosystem. Use it directly when:
  • You need custom quantization levels.
  • You want to squeeze an SLM onto an edge device (Raspberry Pi 5 with AI HAT, Jetson Nano, older laptops).
  • You’re shipping an embedded product.
Most LAC SMBs never need to touch llama.cpp directly. Ollama and LM Studio both use it under the hood.

Quantization: the one concept you must understand

Language models store weights as floating-point numbers. Quantization is compression: replacing those numbers with lower-precision ones, so the model uses less memory and runs faster, at the cost of a small quality drop. Common levels:
| Label | Precision | Size (Gemma 3 9B) | Quality | When to use |
|---|---|---|---|---|
| FP16 | 16-bit float | ~18 GB | Full | Best, if you have the hardware |
| Q8_0 | 8-bit | ~9 GB | Near-full | High-end laptops |
| Q6_K | 6-bit | ~7 GB | Strong | Default for serious work |
| Q4_K_M | 4-bit | ~5 GB | Good | Default for most SMBs |
| Q3_K_M | 3-bit | ~4 GB | Noticeable loss | Low-RAM fallback |
| Q2_K | 2-bit | ~3 GB | Significant loss | Emergency only |
Rule of thumb: start with Q4_K_M. It’s the best price-to-performance sweet spot. Move up to Q6_K if quality feels thin, down to Q3 only if you run out of memory. Ollama tags usually default to Q4_K_M unless the tag says otherwise.
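The sizes in the table are essentially parameters × bits, plus some overhead. A back-of-the-envelope estimator (the helper name is ours, not an Ollama API):

```python
def estimated_weight_gb(params_billions: float, bits: int) -> float:
    """Approximate size of quantized weights: params x bits / 8 bytes.
    Real GGUF files run slightly larger because K-quants keep some tensors
    at higher precision, and the KV cache needs extra RAM at runtime."""
    return params_billions * bits / 8


# A 9B model, roughly matching the table above:
for label, bits in [("FP16", 16), ("Q8_0", 8), ("Q6_K", 6), ("Q4_K_M", 4)]:
    print(label, estimated_weight_gb(9, bits), "GB")
# FP16 -> 18.0, Q8_0 -> 9.0, Q6_K -> 6.75, Q4_K_M -> 4.5 (~5 GB with overhead)
```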

Hardware buying guide, 2026

For a small business ready to commit:
| Budget | Hardware | Runs |
|---|---|---|
| $600-900 | Mac Mini M4, 16 GB | Phi-4 Mini, Gemma 3 2B, Mistral 7B Q4 |
| $900-1,400 | Mac Mini M4 Pro, 24 GB | Gemma 3 9B, Phi-4 14B Q4 |
| $1,800-2,400 | Mac Studio M4 Max, 64 GB | Llama 4 Scout Q4, Qwen 3 32B |
| $2,500+ | Linux + RTX 4090 (used: $1,200-1,500) | Anything practical up to Llama 4 Scout in FP16 |
| $6,000+ | Workstation + H100 | Everything including light Llama 4 Maverick |
In 2026, for a serious LAC SMB, a Mac Mini M4 Pro with 24 GB or a Linux desktop with a used RTX 4090 is the sweet-spot recommendation. Either runs Gemma 3 9B or Phi-4 well and leaves headroom to scale up.

Cost of ownership: real math

Cloud model (for comparison):
  • Claude Haiku 4.5 at SMB volume (~5,000 chat turns/mo): ~$25-50/mo.
  • Claude Sonnet 4.6 at the same volume: ~$75-200/mo.
  • GPT-5.4 at the same volume: similar band.
  • After 36 months: $2,700-7,200 in cumulative fees.
Local SLM:
  • Mac Mini M4 Pro: $1,200 one-time.
  • Electricity in LAC: $5-15/mo.
  • Maintenance: 1-2 hours/mo.
  • After 36 months: $1,200 hardware + ~$360 electricity ≈ $1,560 total.
Crossover is typically 9-14 months for medium volumes and under 6 months for high-volume tasks. Caveat: a frontier cloud model will outperform an SLM on hard reasoning and long-context tasks. Use both: SLM for volume, cloud for hard cases. Expected total cost is still well below cloud-only.
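The crossover claim is simple arithmetic: the one-time hardware cost divided by the monthly fees you stop paying. A sketch with the numbers above:

```python
def breakeven_months(hardware_cost, local_monthly, cloud_monthly):
    """Months until the one-time hardware cost is repaid by avoided cloud fees."""
    monthly_savings = cloud_monthly - local_monthly
    if monthly_savings <= 0:
        return None  # local never pays for itself at this volume
    return hardware_cost / monthly_savings


# $1,200 Mac Mini, ~$10/mo electricity, versus a ~$100/mo cloud bill:
print(breakeven_months(1200, 10, 100))  # ~13.3 months
```

At the low end of the cloud band ($25/mo) the payback stretches past six years, which is why very low-volume users should stay on cloud APIs.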

Network and privacy architecture

Three patterns, pick the one that matches your data sensitivity.

Pattern A: Fully local

Nothing leaves the machine. Best for healthcare, legal, financial records, HR, anything under strict LAC data-protection rules (see governance/README.md).
Customer phone / laptop
        ↓
SLM on local server
        ↓
Local database (SQLite / Postgres)

Pattern B: Local SLM + cloud for hard cases

Default to local; call a cloud API only for high-difficulty queries. Best for most SMBs.
Customer → local SLM → local log
              ↓ if uncertain
          cloud frontier model
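One way to implement the “if uncertain” branch is a cheap pre-check before the local call. The thresholds and keyword list below are illustrative, not from any library — in practice you would tune them to your own traffic:

```python
HARD_SIGNALS = ("legal", "contract", "tax", "dispute", "compare")  # tune per business


def should_escalate(query: str, max_local_words: int = 150) -> bool:
    """Route long or high-stakes queries to the cloud model; default to local."""
    words = query.lower().split()
    if len(words) > max_local_words:  # long context strains an SLM
        return True
    return any(signal in words for signal in HARD_SIGNALS)


print(should_escalate("draft a reply to a customer"))  # False -> local SLM
print(should_escalate("review this contract clause"))  # True  -> cloud model
```

More sophisticated routers ask the SLM itself to rate its confidence, but a keyword-and-length heuristic is a reasonable first cut and costs nothing.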

Pattern C: Local SLM with regional cloud backup

Same as B, but the cloud provider is São Paulo, Santiago, or another LAC region. Keeps data in-region.

Fine-tuning on your business data

Most LAC SMBs don’t need to fine-tune. A good RAG (retrieval-augmented generation) setup over your documents gets 80% of the benefit with 10% of the effort. If you do need to fine-tune:
  • Data volume: 500-5,000 well-curated examples is enough for style and terminology. Under 500, RAG will beat fine-tuning.
  • Tools: Unsloth, Axolotl, or Hugging Face’s trl library for Gemma, Llama, Phi, and Mistral.
  • Time: A Gemma 3 9B LoRA fine-tune takes 4-8 hours on a single 24 GB GPU.
  • Cost: $0 on your own hardware; $10-40 on a rented cloud GPU for the duration.
Fine-tuned SLMs can sound exactly like your business. It’s a powerful move for established brands with a clear voice.
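Whatever trainer you pick, the input is usually chat-style JSONL. A sketch of converting raw question/answer pairs — the `messages` field layout is a common convention in trl-style tools, but check your trainer's docs for its exact expected format:

```python
import json


def to_chat_jsonl(pairs, path):
    """Write (question, ideal_answer) pairs as one chat example per line."""
    with open(path, "w", encoding="utf-8") as f:
        for question, answer in pairs:
            example = {"messages": [
                {"role": "user", "content": question},
                {"role": "assistant", "content": answer},
            ]}
            f.write(json.dumps(example, ensure_ascii=False) + "\n")


pairs = [("What are your store hours?", "We are open 9-6, Monday to Saturday.")]
to_chat_jsonl(pairs, "train.jsonl")
```

Curating the pairs is the real work: consistent tone, correct facts, and no near-duplicates matter far more than the conversion script.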

Running multiple models on one box

Ollama can hold many models and swap them as needed. For a small business:
  • gemma3:9b for customer chat (ES/PT/EN/FR).
  • phi4:14b for document processing and invoice extraction.
  • mistral:7b for French Caribbean content.
The OS swaps them in and out of memory. With 24-64 GB unified memory, you keep two resident and swap to a third on demand.
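In the glue script that calls the API, model-per-task routing can be as simple as a lookup table (the task names here are illustrative):

```python
MODEL_FOR_TASK = {
    "chat": "gemma3:9b",      # customer chat (ES/PT/EN/FR)
    "documents": "phi4:14b",  # document processing, invoice extraction
    "french": "mistral:7b",   # French Caribbean content
}


def pick_model(task: str) -> str:
    """Fall back to the general chat model for unknown task types."""
    return MODEL_FOR_TASK.get(task, MODEL_FOR_TASK["chat"])


print(pick_model("documents"))  # phi4:14b
print(pick_model("unknown"))    # gemma3:9b
```

The chosen name then goes into the `"model"` field of the API request; Ollama loads and unloads the weights for you.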

Troubleshooting

  • “My model is slow.” Check RAM usage. If the model exceeds your RAM, it swaps to disk and crawls. Use a smaller quantization or a smaller model.
  • “My model says something wrong.” Local SLMs are weaker than frontier cloud models on reasoning. Give more context in the prompt, use RAG for facts, or escalate hard cases to cloud.
  • “My model forgets.” SLMs have shorter effective context windows than cloud models. Summarise older turns into a running “state” variable and put it in the system prompt.
  • “My fine-tune got worse.” Usually caused by too few examples, too many epochs, or inconsistent data. Start with 1,000 high-quality examples and 2-3 epochs.
  • “I can’t run Llama 4 Scout on my Mac.” Even Q4 needs 48-64 GB of unified memory. Drop to Gemma 3 9B or a smaller Qwen 3.
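The running-“state” trick for short context windows can be sketched as: keep the last few turns verbatim and compress everything older into a summary string. Here a naive truncation stands in for a real summarisation call (in practice you would ask the SLM itself to write the summary):

```python
def compact_history(turns, keep_last=4, max_summary_chars=500):
    """Split history into (summary_of_older_turns, recent_turns).
    Plain truncation keeps this sketch self-contained; swap in a real
    summarisation call for production use."""
    older, recent = turns[:-keep_last], turns[-keep_last:]
    summary = " ".join(t["content"] for t in older)[:max_summary_chars]
    return summary, recent


turns = [{"role": "user", "content": f"turn {i}"} for i in range(10)]
summary, recent = compact_history(turns)
print(len(recent))  # 4 verbatim turns; the rest live in `summary`
```

The summary then goes into the system prompt on every request, so the model “remembers” without the full transcript eating its context window.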

Created by Adrian Dunkley | MaestrosAI | maestrosai.com | ceo@maestrosai.com | Fair Use, Educational Resource | April 2026