Running AI models locally gives you privacy, no API costs, and the ability to experiment freely. Here’s how to set it up on a Mac in under 10 minutes.
What You’ll Install
- Ollama — the model runtime (downloads and runs models, exposes a local API)
- Open WebUI — a ChatGPT-style web interface that connects to Ollama
Step 1: Install Ollama
# Download and install from ollama.com, or via Homebrew:
brew install ollama
# Start the Ollama server
ollama serve
Ollama runs a local HTTP server at localhost:11434. Leave this running.
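You can confirm the server is actually listening before moving on. This sketch polls Ollama's /api/tags endpoint (which lists installed models); the helper name and timeout are my own choices:

```python
import urllib.request
import urllib.error

def ollama_up(base_url="http://localhost:11434"):
    """Return True if an Ollama server answers at base_url."""
    try:
        # /api/tags lists installed models; any 200 means the server is up
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

if ollama_up():
    print("Ollama is running")
else:
    print("Ollama not reachable; is `ollama serve` running?")
```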
Step 2: Pull a Model
# Gemma4:e4b — best quality/size ratio for 16GB Macs (9.6GB)
ollama pull gemma4:e4b
# Other good options:
ollama pull llama3.2:3b # Fast, 2GB, good for quick tasks
ollama pull mistral:7b # 4.1GB, solid general purpose
ollama pull gemma3:27b # 17GB, high quality (needs 24GB RAM)
To see what’s installed: ollama list
Step 3: Test That It Works
ollama run gemma4:e4b "Explain transformers in one paragraph"
You should see a response stream in your terminal. If so, Ollama is working.
Step 4: Install Open WebUI (Optional but Recommended)
Open WebUI gives you a proper chat interface with conversation history, model switching, and file uploads.
Option A: With Docker (recommended)
# Install Docker Desktop or Colima (lighter weight)
brew install colima docker
colima start
# Run Open WebUI
docker run -d \
  -p 3000:8080 \
  -v open-webui:/app/backend/data \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
Option B: Via pip (no Docker)
# Requires Python 3.11 (check Open WebUI’s docs for current requirements)
pip install open-webui
open-webui serve
Open localhost:3000 in your browser (with the pip install, the default port is 8080 instead). Create an account (stored locally only), then start chatting.
Choosing the Right Model for Your RAM
| RAM | Best Model | Size | Speed |
|---|---|---|---|
| 8GB | llama3.2:3b | 2GB | Very fast |
| 16GB | gemma4:e4b | 9.6GB | Fast |
| 24GB | gemma3:27b or gemma4:26b | 17-20GB | Good |
| 32GB+ | mixtral:8x7b | 26GB | Fast (MoE) |
The model must fit in RAM with headroom. If your system starts heavy disk swapping, you’ve exceeded what fits.
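As a sanity check on the sizes in the table, here is a back-of-the-envelope estimate: weights take parameters × bits per weight, plus some runtime overhead. The 1.2× overhead factor is my own rough assumption, not an Ollama figure:

```python
def model_memory_gb(params_billion, bits_per_weight=4, overhead=1.2):
    """Rough resident-memory estimate for a quantized model.

    bits_per_weight=4 matches Ollama's default Q4 quantization;
    overhead (assumed 1.2x) covers KV cache and runtime buffers.
    """
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes / 2**30 * overhead

# A 7B model at Q4: ~3.9 GB, close to mistral:7b's 4.1GB above
print(f"{model_memory_gb(7):.1f} GB")
# A 27B model at Q4: ~15 GB, hence the 24GB RAM row
print(f"{model_memory_gb(27):.1f} GB")
```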
Python API
Ollama has a Python library for building scripts and benchmarks:
import ollama

response = ollama.chat(
    model="gemma4:e4b",
    messages=[{"role": "user", "content": "What is 2+2?"}]
)
print(response["message"]["content"])
Install it: pip install ollama
Useful Commands
ollama list # Show installed models
ollama ps # Show running models
ollama rm gemma3:27b # Remove a model
ollama pull gemma4:e4b # Update to latest version
Performance Tips
- Close other apps — unified memory means everything shares the same pool
- Use Q4 quantization (Ollama’s default) — weights take a quarter of FP16’s memory, and since generation is memory-bandwidth bound that works out to roughly 4× faster decoding with minimal quality loss
- Shorter context = faster — the KV cache grows with context length, eating memory bandwidth
- Monitor Activity Monitor → Memory tab to see how much RAM your model is using
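To make the KV-cache point concrete: its size grows linearly with context length, with two tensors (K and V) per layer, per token. The architecture numbers below are a hypothetical 7B-class config, not any specific model’s:

```python
def kv_cache_gb(layers, kv_heads, head_dim, context_len, bytes_per_elem=2):
    """KV cache size: K and V tensors per layer, per token, fp16 elements."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem / 2**30

# Hypothetical config: 32 layers, 8 KV heads, head_dim 128
print(f"{kv_cache_gb(32, 8, 128, 8192):.2f} GB at 8k context")
print(f"{kv_cache_gb(32, 8, 128, 32768):.2f} GB at 32k context")
```

Quadrupling the context quadruples the cache, which is why trimming long conversations noticeably speeds things up.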
With Gemma4:e4b on an M4 Pro Mac Mini, you’ll get 13-18 tokens/second — fast enough for real-time chat. That’s comparable to cloud API speeds for typical conversational use.
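Those numbers are consistent with generation being memory-bandwidth bound: each new token reads every weight once, so bandwidth divided by model size gives a hard ceiling. A sketch using Apple’s quoted 273 GB/s for the M4 Pro (real throughput lands below the ceiling due to compute and cache overhead):

```python
def tokens_per_sec_ceiling(model_gb, bandwidth_gb_s):
    """Upper bound: every generated token streams all weights from memory once."""
    return bandwidth_gb_s / model_gb

# 9.6GB model on an M4 Pro (~273 GB/s): ceiling of ~28 tok/s
print(f"{tokens_per_sec_ceiling(9.6, 273):.0f} tok/s")
```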