Deploying a Small Language Model: A Practical Guide
Created by Adrian Dunkley | maestrosai.com | ceo@maestrosai.com | Fair Use
This page takes you from “I want to try this” to a working SLM on your own machine in about an hour. It covers the three mainstream runtimes (Ollama, LM Studio, llama.cpp), realistic cost math, and how local SLMs fit into LAC data-protection rules. If you haven’t picked a model, read models.md first.
The three runtimes
| Runtime | Who it’s for | Ease | Control | Cost |
|---|---|---|---|---|
| Ollama | Beginners, developers | Very easy | High | Free, open source |
| LM Studio | Non-developers, visual users | Easy | Medium | Free |
| llama.cpp | Advanced, low-level optimization | Medium | Maximum | Free, open source |
Ollama: the 30-minute setup
Install
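Ollama publishes a one-line installer for Linux; on macOS and Windows you download the app from ollama.com instead. A typical Linux install:

```shell
# Install Ollama on Linux (macOS/Windows: grab the installer from ollama.com)
curl -fsSL https://ollama.com/install.sh | sh

# Confirm the CLI and background service are working
ollama --version
```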
Pull a model
Pick based on your hardware (see models.md).
Chat with it
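Once you've chosen a model, pulling and chatting are two commands. The gemma3:9b tag below mirrors the examples later in this guide; check the Ollama library for the exact tags available to you:

```shell
# Download the model weights (several GB on first pull)
ollama pull gemma3:9b

# Open an interactive chat session in the terminal (Ctrl+D to exit)
ollama run gemma3:9b
```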
Use it from a script
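As a sketch, Ollama's native REST API (POST /api/generate on the default port) can be called with nothing but the Python standard library; the model tag is taken from this guide and is an assumption:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default port

def build_request(model: str, prompt: str) -> dict:
    """JSON body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """Send a single prompt and return the model's full response text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_request(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Usage (requires a running Ollama instance):
# print(generate("gemma3:9b", "Summarise this invoice in one line: ..."))
```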
Use it as a drop-in OpenAI-compatible endpoint
Ollama exposes an OpenAI-compatible endpoint at http://localhost:11434/v1. Most LangChain, n8n, and Claude-Agent-SDK-style stacks can point at it by changing the base URL and skipping the API key.
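A minimal sketch of pointing an OpenAI-style client at Ollama, here using the standard library instead of a vendor SDK; any non-empty API key is accepted:

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible endpoint

def chat_body(model: str, user_msg: str) -> dict:
    """OpenAI-style chat.completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
    }

def chat(model: str, user_msg: str) -> str:
    """Call the /v1/chat/completions route and return the reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(chat_body(model, user_msg)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer ollama",  # placeholder; no real key needed
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# chat("gemma3:9b", "Hola, ¿en qué me puedes ayudar?")
```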
LM Studio: the no-code path
LM Studio is a desktop app. You install it, browse models through a UI, download, and chat. Good for:
- Staff who don’t want to touch a terminal.
- Quick experimentation with different models.
- Small businesses with one or two users.
llama.cpp: when you need the last 20% of performance
llama.cpp is the low-level engine that powers much of the ecosystem. Use it directly when:
- You need custom quantization levels.
- You want to squeeze an SLM onto an edge device (Raspberry Pi 5 with AI HAT, Jetson Nano, older laptops).
- You’re shipping an embedded product.
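A sketch of a from-source build, assuming the current upstream repository location and binary names (llama-cli); the GGUF filename is a placeholder:

```shell
# Clone and build llama.cpp (Linux/macOS; needs git, cmake, a C++ compiler)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release

# Run a quantized GGUF model file directly (placeholder filename)
./build/bin/llama-cli -m ./models/gemma-9b-Q4_K_M.gguf -p "Hello" -n 128
```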
Quantization: the one concept you must understand
Language models store weights as floating-point numbers. Quantization is compression: replacing those numbers with lower-precision ones so the model uses less memory and runs faster, at the cost of a small quality drop. Common levels:
| Label | Precision | Size (Gemma 4 9B) | Quality | When to use |
|---|---|---|---|---|
| FP16 | 16-bit float | ~18 GB | Full | Best, if you have the hardware |
| Q8_0 | 8-bit | ~9 GB | Near-full | High-end laptops |
| Q6_K | 6-bit | ~7 GB | Strong | Default for serious work |
| Q4_K_M | 4-bit | ~5 GB | Good | Default for most SMBs |
| Q3_K_M | 3-bit | ~4 GB | Noticeable loss | Low-RAM fallback |
| Q2_K | 2-bit | ~3 GB | Significant loss | Emergency only |
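The sizes in the table follow from simple arithmetic: parameters times bits per weight. A back-of-envelope helper (weights only; leave extra headroom for the KV cache and runtime):

```python
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage: params x bits / 8, in decimal GB."""
    return round(params_billion * 1e9 * bits_per_weight / 8 / 1e9, 1)

# A 9B model: FP16 -> 18.0 GB; Q4_K_M at ~4.5 effective bits -> ~5.1 GB,
# in line with the table above.
```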
Hardware buying guide, 2026
For a small business ready to commit:
| Budget | Hardware | Runs |
|---|---|---|
| $600-900 | Mac Mini M4, 16 GB | Phi-4 Mini, Gemma 3 2B, Mistral 7B Q4 |
| $900-1,400 | Mac Mini M4 Pro, 24 GB | Gemma 4 9B, Phi-4 14B Q4 |
| $1,800-2,400 | Mac Studio M4 Max, 64 GB | Llama 4 Scout Q4, Qwen 3 32B |
| $2,500+ | Linux + RTX 4090 (used: $1,200-1,500) | Anything practical up to Llama 4 Scout in FP16 |
| $6,000+ | Workstation + H100 | Everything including light Llama 4 Maverick |
Cost of ownership: real math
Cloud model (for comparison):
- Claude Haiku 4.5 at SMB volume (~5,000 chat turns/mo): ~$25-50/mo.
- Claude Sonnet 4.6 at the same volume: ~$75-200/mo.
- GPT-5.4 at the same volume: similar band.
- After 36 months: $2,700-7,200 in cumulative fees.
Local model:
- Mac Mini M4 Pro: $1,200 one-time.
- Electricity in LAC: $5-15/mo.
- Maintenance: 1-2 hours/mo.
- After 36 months: $1,200 + $360 electricity = $1,560 total.
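The break-even arithmetic generalises; a small helper, using mid-range figures from the lists above as illustrative inputs:

```python
def breakeven_months(hw_cost: float, cloud_monthly: float,
                     local_monthly: float) -> float:
    """Months until the one-time hardware cost is repaid by the gap
    between cloud fees and local running costs."""
    return hw_cost / (cloud_monthly - local_monthly)

def total_cost_local(hw_cost: float, monthly: float, months: int) -> int:
    """Cumulative local cost: hardware plus running costs over `months`."""
    return int(hw_cost + monthly * months)

# $1,200 Mac Mini vs ~$100/mo cloud, $10/mo electricity:
# break-even after ~13 months; 36-month local total is $1,560.
```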
Network and privacy architecture
Three patterns; pick the one that matches your data sensitivity.
Pattern A: Fully local
Nothing leaves the machine. Best for healthcare, legal, financial records, HR, and anything under strict LAC data-protection rules (see governance/README.md).
Pattern B: Local SLM + cloud for hard cases
Default to local; call a cloud API only for high-difficulty queries. Best for most SMBs.
Pattern C: Local SLM with regional cloud backup
Same as B, but the cloud provider is São Paulo, Santiago, or another LAC region. Keeps data in-region.
Fine-tuning on your business data
Most LAC SMBs don’t need to fine-tune. A good RAG (retrieval-augmented generation) setup over your documents gets 80% of the benefit with 10% of the effort. If you do need to fine-tune:
- Data volume: 500-5,000 well-curated examples is enough for style and terminology. Under 500, RAG will beat fine-tuning.
- Tools: Unsloth, Axolotl, or Hugging Face’s trl library for Gemma, Llama, Phi, and Mistral.
- Time: A Gemma 4 9B LoRA fine-tune takes 4-8 hours on a single 24 GB GPU.
- Cost: $10-40 on a rented cloud GPU for the duration.
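A sketch of what the LoRA side of such a fine-tune might look like with Hugging Face's peft library; every hyperparameter here is an illustrative default, not a tuned value, and target module names vary by model family:

```python
from peft import LoraConfig  # pip install peft

# Illustrative LoRA configuration for a causal-LM fine-tune (assumed values)
lora = LoraConfig(
    r=16,                  # rank of the low-rank adapter matrices
    lora_alpha=32,         # scaling factor for the adapter updates
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; model-dependent
    task_type="CAUSAL_LM",
)
# Pass peft_config=lora to trl's SFTTrainer along with your 500-5,000
# curated examples; 2-3 epochs is a sensible starting point.
```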
Running multiple models on one box
Ollama can hold many models and swap them as needed. For a small business:
- gemma3:9b for customer chat (ES/PT/EN/FR).
- phi4:14b for document processing and invoice extraction.
- mistral:7b for French Caribbean content.
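Picking one model per task can be as simple as a lookup table; the tags mirror the assignments above, and the routing keys are hypothetical names for illustration:

```python
# Hypothetical task -> Ollama model-tag routing table
MODEL_FOR_TASK = {
    "customer_chat": "gemma3:9b",       # ES/PT/EN/FR chat
    "document_processing": "phi4:14b",  # invoices, extraction
    "french_caribbean": "mistral:7b",   # French-language content
}

def pick_model(task: str, default: str = "gemma3:9b") -> str:
    """Return the model tag for a task; Ollama loads/swaps it on demand."""
    return MODEL_FOR_TASK.get(task, default)

# Use the returned tag in the "model" field of any Ollama API call.
```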
Troubleshooting
“My model is slow.” Check RAM usage. If the model exceeds your RAM, it swaps to disk and crawls. Use a smaller quantization or a smaller model.
“My model says something wrong.” Local SLMs are weaker than frontier cloud models on reasoning. Give more context in the prompt, use RAG for facts, or escalate hard cases to cloud.
“My model forgets.” SLMs have shorter effective context windows than cloud models. Summarise older turns into a running “state” variable and put it in the system prompt.
“My fine-tune got worse.” Usually caused by too few examples, too many epochs, or inconsistent data. Start with 1,000 high-quality examples and 2-3 epochs.
“I can’t run Llama 4 Scout on my Mac.” Even Q4 needs 48-64 GB of unified memory. Drop to Gemma 4 9B or a smaller Qwen 3.
Related reading
- models.md: pick the right model.
- use-cases.md: concrete scenarios.
- agents/build.md: code-path agent building with local models.
- governance/README.md: compliance fit.
- rankings/global-benchmarks.md: benchmark scores for the models above.
Created by Adrian Dunkley | MaestrosAI | maestrosai.com | ceo@maestrosai.com | Fair Use, Educational Resource | April 2026
