Deploying a Small Language Model: A Practical Guide

Created by Adrian Dunkley | maestrosai.com | ceo@maestrosai.com | Fair Use

This page takes you from “I want to try this” to a working SLM on your own machine in about an hour. It covers the three mainstream runtimes (Ollama, LM Studio, llama.cpp), realistic cost math, and how local SLMs fit into LAC data-protection rules. If you haven’t picked a model, read models.md first.

The three runtimes

| Runtime | Who it's for | Ease | Control | Cost |
|---|---|---|---|---|
| Ollama | Beginners, developers | Very easy | High | Free, open source |
| LM Studio | Non-developers, visual users | Easy | Medium | Free |
| llama.cpp | Advanced, low-level optimization | Medium | Maximum | Free, open source |
For 90% of LAC SMBs, Ollama is the right answer.

Ollama: the 30-minute setup

Install

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download installer from https://ollama.com/download

Pull a model

Pick based on your hardware (see models.md):
ollama pull gemma3:9b          # Good general starting point, 16 GB RAM
ollama pull phi4:14b           # Strong reasoning, 24 GB RAM
ollama pull phi4-mini:3.8b     # Tiny, works on 8 GB RAM
ollama pull llama4-scout:17b   # Long context, needs 48+ GB unified memory
ollama pull qwen3:7b           # Fast, reasoning-focused
ollama pull mistral:7b         # Good French Caribbean support

Chat with it

ollama run gemma3:9b
>>> Hola, ¿puedes ayudarme a redactar una respuesta a un cliente de Cartagena?
That’s it. The prompt above asks, in Spanish, “Hi, can you help me draft a reply to a customer in Cartagena?” The model runs locally; nothing leaves your machine.

Use it from a script

import requests

# Ollama's native generate endpoint; stream=False returns one JSON object
# instead of a stream of partial tokens.
r = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma3:9b",
        "prompt": "Escribe un correo en español profesional...",
        "stream": False,
    },
    timeout=120,  # local generation can take a while on modest hardware
)
r.raise_for_status()
print(r.json()["response"])

Use it as a drop-in OpenAI-compatible endpoint

Ollama exposes an OpenAI-compatible endpoint at http://localhost:11434/v1. Most LangChain, n8n, and Claude-Agent-SDK-style stacks can point at it by changing the base URL and skipping the API key.
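For example, a minimal chat-completions call using only the standard library — the base URL and model tag are the Ollama defaults from above; adjust if yours differ:

```python
import json
import urllib.request

OLLAMA_BASE = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible base URL


def build_payload(model, messages):
    """Assemble an OpenAI-style chat.completions request body."""
    return {"model": model, "messages": messages, "stream": False}


def chat(model, messages):
    """POST to the local endpoint and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{OLLAMA_BASE}/chat/completions",
        data=json.dumps(build_payload(model, messages)).encode(),
        headers={"Content-Type": "application/json"},  # no API key needed locally
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]


# With Ollama running, a call looks like:
# chat("gemma3:9b", [{"role": "user", "content": "Say hello in Spanish."}])
```

Because the wire format matches OpenAI's, the official `openai` client works too: point its `base_url` at the address above and pass any placeholder API key.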

LM Studio: the no-code path

LM Studio is a desktop app. You install it, browse models through a UI, download, and chat. Good for:
  • Staff who don’t want to touch a terminal.
  • Quick experimentation with different models.
  • Small businesses with one or two users.
It includes the same OpenAI-compatible endpoint, so the integration story matches Ollama.

llama.cpp: when you need the last 20% of performance

llama.cpp is the low-level engine that powers much of the ecosystem. Use it directly when:
  • You need custom quantization levels.
  • You want to squeeze an SLM onto an edge device (Raspberry Pi 5 with AI HAT, Jetson Nano, older laptops).
  • You’re shipping an embedded product.
Most LAC SMBs never need to touch llama.cpp directly. Ollama and LM Studio both use it under the hood.

Quantization: the one concept you must understand

Language models store weights as floating-point numbers. Quantization is compression: replacing those numbers with lower-precision ones, so the model uses less memory and runs faster, at the cost of a small quality drop. Common levels:
| Label | Precision | Size (Gemma 3 9B) | Quality | When to use |
|---|---|---|---|---|
| FP16 | 16-bit float | ~18 GB | Full | Best, if you have the hardware |
| Q8_0 | 8-bit | ~9 GB | Near-full | High-end laptops |
| Q6_K | 6-bit | ~7 GB | Strong | Default for serious work |
| Q4_K_M | 4-bit | ~5 GB | Good | Default for most SMBs |
| Q3_K_M | 3-bit | ~4 GB | Noticeable loss | Low-RAM fallback |
| Q2_K | 2-bit | ~3 GB | Significant loss | Emergency only |
Rule of thumb: start with Q4_K_M. It’s the best price-to-performance sweet spot. Move up to Q6_K if quality feels thin, down to Q3 only if you run out of memory. Ollama tags usually default to Q4_K_M unless the tag says otherwise.
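The sizes in the table are essentially parameters × bits, plus some overhead. A back-of-the-envelope estimator (the helper name is ours, not an Ollama API):

```python
def estimated_weight_gb(params_billions: float, bits: int) -> float:
    """Approximate size of quantized weights: params x bits / 8 bytes.
    Real GGUF files run slightly larger because K-quants keep some tensors
    at higher precision, and the KV cache needs extra RAM at runtime."""
    return params_billions * bits / 8


# A 9B model, roughly matching the table above:
for label, bits in [("FP16", 16), ("Q8_0", 8), ("Q6_K", 6), ("Q4_K_M", 4)]:
    print(label, estimated_weight_gb(9, bits), "GB")
# FP16 -> 18.0, Q8_0 -> 9.0, Q6_K -> 6.75, Q4_K_M -> 4.5 (~5 GB with overhead)
```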

Hardware buying guide, 2026

For a small business ready to commit:
| Budget | Hardware | Runs |
|---|---|---|
| $600-900 | Mac Mini M4, 16 GB | Phi-4 Mini, Gemma 3 2B, Mistral 7B Q4 |
| $900-1,400 | Mac Mini M4 Pro, 24 GB | Gemma 3 9B, Phi-4 14B Q4 |
| $1,800-2,400 | Mac Studio M4 Max, 64 GB | Llama 4 Scout Q4, Qwen 3 32B |
| $2,500+ | Linux + RTX 4090 (used: $1,200-1,500) | Anything practical up to Llama 4 Scout in FP16 |
| $6,000+ | Workstation + H100 | Everything including light Llama 4 Maverick |
In 2026, for a serious LAC SMB, a Mac Mini M4 Pro with 24 GB or a Linux desktop with a used RTX 4090 is the sweet-spot recommendation. Either runs Gemma 3 9B or Phi-4 well and leaves headroom to scale up.

Cost of ownership: real math

Cloud model (for comparison):
  • Claude Haiku 4.5 at SMB volume (~5,000 chat turns/mo): ~$25-50/mo.
  • Claude Sonnet 4.6 at the same volume: ~$75-200/mo.
  • GPT-5.4 at the same volume: similar band.
  • After 36 months: $2,700-7,200 in cumulative fees.
Local SLM:
  • Mac Mini M4 Pro: $1,200 one-time.
  • Electricity in LAC: $5-15/mo.
  • Maintenance: 1-2 hours/mo.
  • After 36 months: $1,200 hardware + ~$360 electricity ≈ $1,560 total.
Crossover is typically 9-14 months for medium volumes and under 6 months for high-volume tasks. Caveat: a frontier cloud model will outperform an SLM on hard reasoning and long-context tasks. Use both: SLM for volume, cloud for hard cases. Expected total cost is still well below cloud-only.
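The crossover claim is simple arithmetic: the one-time hardware cost divided by the monthly fees you stop paying. A sketch with the numbers above:

```python
def breakeven_months(hardware_cost, local_monthly, cloud_monthly):
    """Months until the one-time hardware cost is repaid by avoided cloud fees."""
    monthly_savings = cloud_monthly - local_monthly
    if monthly_savings <= 0:
        return None  # local never pays for itself at this volume
    return hardware_cost / monthly_savings


# $1,200 Mac Mini, ~$10/mo electricity, versus a ~$100/mo cloud bill:
print(breakeven_months(1200, 10, 100))  # ~13.3 months
```

At the low end of the cloud band ($25/mo) the payback stretches past six years, which is why very low-volume users should stay on cloud APIs.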

Network and privacy architecture

Three patterns, pick the one that matches your data sensitivity.

Pattern A: Fully local

Nothing leaves the machine. Best for healthcare, legal, financial records, HR, anything under strict LAC data-protection rules (see governance/README.md).
Customer phone / laptop
        ↓
SLM on local server
        ↓
Local database (SQLite / Postgres)

Pattern B: Local SLM + cloud for hard cases

Default to local; call a cloud API only for high-difficulty queries. Best for most SMBs.
Customer → local SLM → local log
              ↓ if uncertain
          cloud frontier model
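One way to implement the “if uncertain” branch is a cheap pre-check before the local call. The thresholds and keyword list below are illustrative, not from any library — in practice you would tune them to your own traffic:

```python
HARD_SIGNALS = ("legal", "contract", "tax", "dispute", "compare")  # tune per business


def should_escalate(query: str, max_local_words: int = 150) -> bool:
    """Route long or high-stakes queries to the cloud model; default to local."""
    words = query.lower().split()
    if len(words) > max_local_words:  # long context strains an SLM
        return True
    return any(signal in words for signal in HARD_SIGNALS)


print(should_escalate("draft a reply to a customer"))  # False -> local SLM
print(should_escalate("review this contract clause"))  # True  -> cloud model
```

More sophisticated routers ask the SLM itself to rate its confidence, but a keyword-and-length heuristic is a reasonable first cut and costs nothing.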

Pattern C: Local SLM with regional cloud backup

Same as B, but the cloud provider is São Paulo, Santiago, or another LAC region. Keeps data in-region.

Fine-tuning on your business data

Most LAC SMBs don’t need to fine-tune. A good RAG (retrieval-augmented generation) setup over your documents gets 80% of the benefit with 10% of the effort. If you do need to fine-tune:
  • Data volume: 500-5,000 well-curated examples is enough for style and terminology. Under 500, RAG will beat fine-tuning.
  • Tools: Unsloth, Axolotl, or Hugging Face’s trl library for Gemma, Llama, Phi, and Mistral.
  • Time: A Gemma 3 9B LoRA fine-tune takes 4-8 hours on a single 24 GB GPU.
  • Cost: $0 on your own hardware; $10-40 on a rented cloud GPU for the duration.
Fine-tuned SLMs can sound exactly like your business. It’s a powerful move for established brands with a clear voice.
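Whatever trainer you pick, the input is usually chat-style JSONL. A sketch of converting raw question/answer pairs — the `messages` field layout is a common convention in trl-style tools, but check your trainer's docs for its exact expected format:

```python
import json


def to_chat_jsonl(pairs, path):
    """Write (question, ideal_answer) pairs as one chat example per line."""
    with open(path, "w", encoding="utf-8") as f:
        for question, answer in pairs:
            example = {"messages": [
                {"role": "user", "content": question},
                {"role": "assistant", "content": answer},
            ]}
            f.write(json.dumps(example, ensure_ascii=False) + "\n")


pairs = [("What are your store hours?", "We are open 9-6, Monday to Saturday.")]
to_chat_jsonl(pairs, "train.jsonl")
```

Curating the pairs is the real work: consistent tone, correct facts, and no near-duplicates matter far more than the conversion script.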

Running multiple models on one box

Ollama can hold many models and swap them as needed. For a small business:
  • gemma3:9b for customer chat (ES/PT/EN/FR).
  • phi4:14b for document processing and invoice extraction.
  • mistral:7b for French Caribbean content.
The OS swaps them in and out of memory. With 24-64 GB unified memory, you keep two resident and swap to a third on demand.
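In the glue script that calls the API, model-per-task routing can be as simple as a lookup table (the task names here are illustrative):

```python
MODEL_FOR_TASK = {
    "chat": "gemma3:9b",      # customer chat (ES/PT/EN/FR)
    "documents": "phi4:14b",  # document processing, invoice extraction
    "french": "mistral:7b",   # French Caribbean content
}


def pick_model(task: str) -> str:
    """Fall back to the general chat model for unknown task types."""
    return MODEL_FOR_TASK.get(task, MODEL_FOR_TASK["chat"])


print(pick_model("documents"))  # phi4:14b
print(pick_model("unknown"))    # gemma3:9b
```

The chosen name then goes into the `"model"` field of the API request; Ollama loads and unloads the weights for you.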

Troubleshooting

  • “My model is slow.” Check RAM usage. If the model exceeds your RAM, it swaps to disk and crawls. Use a smaller quantization or a smaller model.
  • “My model says something wrong.” Local SLMs are weaker than frontier cloud models on reasoning. Give more context in the prompt, use RAG for facts, or escalate hard cases to cloud.
  • “My model forgets.” SLMs have shorter effective context windows than cloud models. Summarise older turns into a running “state” variable and put it in the system prompt.
  • “My fine-tune got worse.” Usually caused by too few examples, too many epochs, or inconsistent data. Start with 1,000 high-quality examples and 2-3 epochs.
  • “I can’t run Llama 4 Scout on my Mac.” Even Q4 needs 48-64 GB of unified memory. Drop to Gemma 3 9B or a smaller Qwen 3.
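The running-“state” trick for short context windows can be sketched as: keep the last few turns verbatim and compress everything older into a summary string. Here a naive truncation stands in for a real summarisation call (in practice you would ask the SLM itself to write the summary):

```python
def compact_history(turns, keep_last=4, max_summary_chars=500):
    """Split history into (summary_of_older_turns, recent_turns).
    Plain truncation keeps this sketch self-contained; swap in a real
    summarisation call for production use."""
    older, recent = turns[:-keep_last], turns[-keep_last:]
    summary = " ".join(t["content"] for t in older)[:max_summary_chars]
    return summary, recent


turns = [{"role": "user", "content": f"turn {i}"} for i in range(10)]
summary, recent = compact_history(turns)
print(len(recent))  # 4 verbatim turns; the rest live in `summary`
```

The summary then goes into the system prompt on every request, so the model “remembers” without the full transcript eating its context window.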

Created by Adrian Dunkley | MaestrosAI | maestrosai.com | ceo@maestrosai.com | Fair Use, Educational Resource | April 2026