When people ask why an Apple M4 Pro Mac Mini runs large language models faster than a gaming laptop with a dedicated GPU, the answer usually surprises them: it’s not about TFLOPS. It’s about memory bandwidth.
The Inference Bottleneck Is Not Compute
Generating a single token requires loading the model’s weights — billions of floating-point numbers — from memory into the processor, doing some math, then repeating for the next token. The math itself takes microseconds. The bottleneck is how fast you can read those weights from RAM.
This is called being memory-bandwidth-bound.
For a 9.6GB model like Gemma4:e4b running at Q4 quantization, a simple roofline estimate (every weight streamed from memory once per token) gives:
Bytes read per token ≈ 9.6 GB
Memory BW = 273 GB/s
Max theoretical speed = 273 / 9.6 ≈ 28 tokens/sec
Actual measured speed: ~13-18 tokens/sec, roughly half to two-thirds of that theoretical ceiling. The gap comes from attention and KV-cache reads, dequantization, and framework overhead, which is typical for real inference stacks. The limiting factor is still bandwidth, not compute.
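The back-of-envelope math above can be sketched as a tiny roofline function. The `reads_per_token` parameter is an assumption for illustration: ideally the weights are streamed once per token, but a framework that touches some tensors more than once lowers the ceiling accordingly.

```python
def roofline_tokens_per_sec(model_gb: float, bandwidth_gb_s: float,
                            reads_per_token: float = 1.0) -> float:
    """Upper bound on decode speed for a memory-bandwidth-bound model:
    bandwidth divided by bytes streamed from memory per token."""
    return bandwidth_gb_s / (model_gb * reads_per_token)

# 9.6GB Q4 model on an M4 Pro (273 GB/s)
print(roofline_tokens_per_sec(9.6, 273))       # ≈ 28.4 tok/s, weights read once
print(roofline_tokens_per_sec(9.6, 273, 2.0))  # ≈ 14.2 tok/s if weights were read twice
```

Measured speeds of 13-18 tok/s then correspond to roughly 50-65% of the single-read ceiling.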
Why Apple Silicon Wins
On a conventional PC, your GPU has its own VRAM connected via PCIe. The CPU has system RAM connected via a separate memory controller. They don’t share bandwidth.
On Apple Silicon, the CPU, GPU, and Neural Engine all share a single unified memory pool with a single high-bandwidth bus. There’s no PCIe bottleneck. No memory copying between CPU and GPU address spaces.
| Hardware | Memory BW | Max LLM Model Size |
|---|---|---|
| M4 Pro (24GB) | 273 GB/s | ~20GB (leaves headroom) |
| RTX 4090 (24GB VRAM) | 1,008 GB/s | ~20GB |
| RTX 3090 (24GB VRAM) | 936 GB/s | ~20GB |
| Gaming laptop 16GB | 68 GB/s | ~12GB |
| M1 MacBook Air | 68 GB/s | ~12GB |
The RTX 4090 has 3.7× the bandwidth, but the card alone costs several times as much as the whole Mac Mini and needs a full tower PC built around it.
What 273 GB/s Means in Practice
The M4 Pro’s 273 GB/s comes from LPDDR5X RAM running at extremely high clock speeds across a wide 256-bit bus. Apple designed this specifically for the Neural Engine and GPU workloads.
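The 273 GB/s figure falls straight out of the bus arithmetic. Assuming LPDDR5X at an 8533 MT/s data rate, which is consistent with Apple's published bandwidth for the 256-bit M4 Pro bus:

```python
# Peak bandwidth = bus width (bytes) × data rate (transfers/sec).
bus_bits = 256                 # M4 Pro memory bus width
transfers_per_sec = 8533e6     # LPDDR5X-8533 (assumed data rate)

bandwidth_gb_s = (bus_bits / 8) * transfers_per_sec / 1e9
print(f"{bandwidth_gb_s:.0f} GB/s")  # prints "273 GB/s"
```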
For comparison:
- Your laptop’s system RAM: ~40-68 GB/s
- Server DDR5: ~100-150 GB/s
- Apple M4 Pro: 273 GB/s
- High-end GPU VRAM: 500-1,000 GB/s
So the M4 Pro sits between standard server RAM and GPU VRAM — while being a consumer desktop chip that costs $700.
The KV Cache Factor
Every transformer model maintains a key-value (KV) cache — it stores the attention keys and values for every previous token so it doesn’t need to recompute them. This cache grows linearly with context length.
For a 128K context window at full precision:
KV cache size = 2 (keys and values) × layers × kv_heads × head_dim × context_len × sizeof(float16)
= ~8-16GB for large models at 128K tokens
(Models using grouped-query attention have fewer KV heads than attention heads, which shrinks the cache substantially.)
This eats into your available RAM — another reason why 24GB unified memory matters. A 16GB Mac can run the model but may struggle to maintain large context windows.
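The formula above is easy to sanity-check in code. The layer count, KV head count, and head dimension below are illustrative placeholders for an 8B-class dense model, not the real Gemma configuration:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Size of the KV cache: 2 tensors (K and V) per layer,
    each of shape [kv_heads, context_len, head_dim], float16 = 2 bytes."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem

# Illustrative config: 32 layers, 8 KV heads (GQA), head_dim 128, 128K context.
size = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, context_len=131072)
print(f"{size / 1e9:.1f} GB")  # ≈ 17.2 GB at full 128K context
```

Halve the context and the cache halves with it, which is why trimming the context window is the first lever to pull on a RAM-constrained machine.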
Quantization and Bandwidth
Quantization reduces the bits-per-weight, making the model smaller and faster to read:
| Format | Bits/weight | Speed | Quality loss |
|---|---|---|---|
| BF16 | 16 | 1× baseline | None |
| Q8 | 8 | ~2× faster | Negligible |
| Q4 | 4 | ~4× faster | Minor |
| Q2 | 2 | ~8× faster | Significant |
Ollama defaults to Q4 for most models — the sweet spot of speed and quality. This is why Gemma4:e4b at Q4 runs comfortably at 13-18 tok/s despite being a multi-billion parameter model.
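The table's scaling follows directly from bits-per-weight. A quick sketch, using a hypothetical 8B-parameter model (real GGUF quants carry a little extra per-block scale metadata, ignored here):

```python
def quantized_size_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight-file size: parameters × bits, converted to GB.
    Ignores per-block scale/zero-point overhead in real quant formats."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# Hypothetical 8B-parameter model at each format from the table above,
# with the M4 Pro's 273 GB/s as the bandwidth ceiling.
for fmt, bits in [("BF16", 16), ("Q8", 8), ("Q4", 4), ("Q2", 2)]:
    gb = quantized_size_gb(8, bits)
    print(f"{fmt}: {gb:.0f} GB, roofline ~{273 / gb:.1f} tok/s on an M4 Pro")
```

Halving the bits halves the bytes streamed per token, which is exactly why each quantization step roughly doubles decode speed on a bandwidth-bound machine.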
The Practical Takeaway
If you’re running AI models locally:
- Memory bandwidth > raw compute for inference
- Unified memory (Apple Silicon) is more efficient than separate GPU VRAM for models that fit in RAM
- 24GB unified comfortably runs up to ~20GB models with room for KV cache
- A discrete GPU with 16GB VRAM will generate tokens faster than Apple Silicon for models that fit entirely in VRAM, but larger models spill over PCIe into system RAM and slow dramatically
- Cloud APIs sidestep all of this — but cost money and have latency
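Putting the takeaways together, a rough "will it fit" check. The 75% usable fraction is an assumption about OS and framework overhead, not a measured Apple limit, and the sizes are illustrative:

```python
def fits_in_memory(model_gb: float, kv_cache_gb: float,
                   total_ram_gb: float, usable_fraction: float = 0.75) -> bool:
    """Rough check: model weights plus KV cache must fit in the share of
    unified memory the OS realistically leaves to an inference runtime."""
    return model_gb + kv_cache_gb <= total_ram_gb * usable_fraction

print(fits_in_memory(9.6, 4.0, 24))   # True: Q4 model + moderate context on 24GB
print(fits_in_memory(9.6, 4.0, 16))   # False: the same workload is tight on 16GB
```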
For a Mac Mini M4 Pro at $700, running Gemma4:e4b at 15+ tok/s while passing 6/6 long-context tests is genuinely impressive.