Running AI models locally gives you privacy, no API costs, and the ability to experiment freely. Here’s how to set it up on a Mac in under 10 minutes.
What You’ll Install
- Ollama — the model runtime (downloads and runs models, exposes a local API)
- Open WebUI — a ChatGPT-style web interface that connects to Ollama
Step 1: Install Ollama
# Download and install from ollama.com, or via Homebrew:
brew install ollama
# Start the Ollama server
ollama serve
Ollama runs a local HTTP server at localhost:11434. Leave this running.
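You can confirm the server is actually listening before moving on. This sketch polls Ollama's /api/tags endpoint (which lists installed models); the helper name and timeout are my own choices:

```python
import urllib.request
import urllib.error

def ollama_up(base_url="http://localhost:11434"):
    """Return True if an Ollama server answers at base_url."""
    try:
        # /api/tags lists installed models; any 200 means the server is up
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

if ollama_up():
    print("Ollama is running")
else:
    print("Ollama not reachable; is `ollama serve` running?")
```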
Step 2: Pull a Model
# Gemma4:e4b — best quality/size ratio for 16GB Macs (9.6GB)
ollama pull gemma4:e4b
# Other good options:
ollama pull llama3.2:3b # Fast, 2GB, good for quick tasks
ollama pull mistral:7b # 4.1GB, solid general purpose
ollama pull gemma3:27b # 17GB, high quality (needs 24GB RAM)
To see what’s installed: ollama list
Step 3: Test That It Works
ollama run gemma4:e4b "Explain transformers in one paragraph"
You should see a response stream in your terminal. If so, Ollama is working.
Step 4: Install Open WebUI (Optional but Recommended)
Open WebUI gives you a proper chat interface with conversation history, model switching, and file uploads.
Option A: With Docker (recommended)
# Install Docker Desktop or Colima (lighter weight)
brew install colima docker
colima start
# Run Open WebUI
docker run -d \
  -p 3000:8080 \
  -v open-webui:/app/backend/data \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
Option B: Via pip (no Docker)
# Requires Python 3.11 (check Open WebUI’s docs for current requirements)
pip install open-webui
open-webui serve
Open localhost:3000 in your browser (with the pip install, the default port is 8080 instead). Create an account (stored locally only), then start chatting.
Choosing the Right Model for Your RAM
| RAM | Best Model | Size | Speed |
|---|---|---|---|
| 8GB | llama3.2:3b | 2GB | Very fast |
| 16GB | gemma4:e4b | 9.6GB | Fast |
| 24GB | gemma3:27b or gemma4:26b | 17-20GB | Good |
| 32GB+ | mixtral:8x7b | 26GB | Fast (MoE) |
The model must fit in RAM with headroom. If your system starts heavy disk swapping, you’ve exceeded what fits.
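As a sanity check on the sizes in the table, here is a back-of-the-envelope estimate: weights take parameters × bits per weight, plus some runtime overhead. The 1.2× overhead factor is my own rough assumption, not an Ollama figure:

```python
def model_memory_gb(params_billion, bits_per_weight=4, overhead=1.2):
    """Rough resident-memory estimate for a quantized model.

    bits_per_weight=4 matches Ollama's default Q4 quantization;
    overhead (assumed 1.2x) covers KV cache and runtime buffers.
    """
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes / 2**30 * overhead

# A 7B model at Q4: ~3.9 GB, close to mistral:7b's 4.1GB above
print(f"{model_memory_gb(7):.1f} GB")
# A 27B model at Q4: ~15 GB, hence the 24GB RAM row
print(f"{model_memory_gb(27):.1f} GB")
```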
Python API
Ollama has a Python library for building scripts and benchmarks:
import ollama

response = ollama.chat(
    model="gemma4:e4b",
    messages=[{"role": "user", "content": "What is 2+2?"}]
)
print(response["message"]["content"])
Install it: pip install ollama
Useful Commands
ollama list # Show installed models
ollama ps # Show running models
ollama rm gemma3:27b # Remove a model
ollama pull gemma4:e4b # Update to latest version
Performance Tips
- Close other apps — unified memory means everything shares the same pool
- Use Q4 quantization (Ollama’s default) — weights take a quarter of FP16’s memory, and since generation is memory-bandwidth bound that works out to roughly 4× faster decoding with minimal quality loss
- Shorter context = faster — the KV cache grows with context length, eating memory bandwidth
- Monitor Activity Monitor → Memory tab to see how much RAM your model is using
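To make the KV-cache point concrete: its size grows linearly with context length, with two tensors (K and V) per layer, per token. The architecture numbers below are a hypothetical 7B-class config, not any specific model’s:

```python
def kv_cache_gb(layers, kv_heads, head_dim, context_len, bytes_per_elem=2):
    """KV cache size: K and V tensors per layer, per token, fp16 elements."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem / 2**30

# Hypothetical config: 32 layers, 8 KV heads, head_dim 128
print(f"{kv_cache_gb(32, 8, 128, 8192):.2f} GB at 8k context")
print(f"{kv_cache_gb(32, 8, 128, 32768):.2f} GB at 32k context")
```

Quadrupling the context quadruples the cache, which is why trimming long conversations noticeably speeds things up.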
With Gemma4:e4b on an M4 Pro Mac Mini, you’ll get 13-18 tokens/second — fast enough for real-time chat. That’s comparable to cloud API speeds for typical conversational use.
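Those numbers are consistent with generation being memory-bandwidth bound: each new token reads every weight once, so bandwidth divided by model size gives a hard ceiling. A sketch using Apple’s quoted 273 GB/s for the M4 Pro (real throughput lands below the ceiling due to compute and cache overhead):

```python
def tokens_per_sec_ceiling(model_gb, bandwidth_gb_s):
    """Upper bound: every generated token streams all weights from memory once."""
    return bandwidth_gb_s / model_gb

# 9.6GB model on an M4 Pro (~273 GB/s): ceiling of ~28 tok/s
print(f"{tokens_per_sec_ceiling(9.6, 273):.0f} tok/s")
```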