When people ask why an Apple M4 Pro Mac Mini runs large language models faster than a gaming laptop with a dedicated GPU, the answer usually surprises them: it’s not about TFLOPS. It’s about memory bandwidth.
The Inference Bottleneck Is Not Compute
Generating a single token requires loading the model’s weights — billions of floating-point numbers — from memory into the processor, doing some math, then repeating for the next token. The math itself takes microseconds. The bottleneck is how fast you can read those weights from RAM.
This is called being memory-bandwidth-bound.
For a 9.6GB model like Gemma4:e4b running at Q4 quantization, a simple roofline estimate (every weight streamed from memory once per token) gives:
Bytes read per token ≈ 9.6 GB
Memory BW = 273 GB/s
Max theoretical speed = 273 / 9.6 ≈ 28 tokens/sec
Actual measured speed: ~13-18 tokens/sec, roughly half to two-thirds of that theoretical ceiling. The gap comes from attention and KV-cache reads, dequantization, and framework overhead, which is typical for real inference stacks. The limiting factor is still bandwidth, not compute.
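The back-of-envelope math above can be sketched as a tiny roofline function. The `reads_per_token` parameter is an assumption for illustration: ideally the weights are streamed once per token, but a framework that touches some tensors more than once lowers the ceiling accordingly.

```python
def roofline_tokens_per_sec(model_gb: float, bandwidth_gb_s: float,
                            reads_per_token: float = 1.0) -> float:
    """Upper bound on decode speed for a memory-bandwidth-bound model:
    bandwidth divided by bytes streamed from memory per token."""
    return bandwidth_gb_s / (model_gb * reads_per_token)

# 9.6GB Q4 model on an M4 Pro (273 GB/s)
print(roofline_tokens_per_sec(9.6, 273))       # ≈ 28.4 tok/s, weights read once
print(roofline_tokens_per_sec(9.6, 273, 2.0))  # ≈ 14.2 tok/s if weights were read twice
```

Measured speeds of 13-18 tok/s then correspond to roughly 50-65% of the single-read ceiling.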
Why Apple Silicon Wins
On a conventional PC, your GPU has its own VRAM connected via PCIe. The CPU has system RAM connected via a separate memory controller. They don’t share bandwidth.
On Apple Silicon, the CPU, GPU, and Neural Engine all share a single unified memory pool with a single high-bandwidth bus. There’s no PCIe bottleneck. No memory copying between CPU and GPU address spaces.
| Hardware | Memory BW | Max LLM Model Size |
|---|---|---|
| M4 Pro (24GB) | 273 GB/s | ~20GB (leaves headroom) |
| RTX 4090 (24GB VRAM) | 1,008 GB/s | ~20GB |
| RTX 3090 (24GB VRAM) | 936 GB/s | ~20GB |
| Gaming laptop 16GB | 68 GB/s | ~12GB |
| M1 MacBook Air | 68 GB/s | ~12GB |
The RTX 4090 has 3.7× the bandwidth, but the card alone costs several times as much as the whole Mac Mini and needs a full tower PC built around it.
What 273 GB/s Means in Practice
The M4 Pro’s 273 GB/s comes from LPDDR5X RAM running at extremely high clock speeds across a wide 256-bit bus. Apple designed this specifically for the Neural Engine and GPU workloads.
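The 273 GB/s figure falls straight out of the bus arithmetic. Assuming LPDDR5X at an 8533 MT/s data rate, which is consistent with Apple's published bandwidth for the 256-bit M4 Pro bus:

```python
# Peak bandwidth = bus width (bytes) × data rate (transfers/sec).
bus_bits = 256                 # M4 Pro memory bus width
transfers_per_sec = 8533e6     # LPDDR5X-8533 (assumed data rate)

bandwidth_gb_s = (bus_bits / 8) * transfers_per_sec / 1e9
print(f"{bandwidth_gb_s:.0f} GB/s")  # prints "273 GB/s"
```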
For comparison:
- Your laptop’s system RAM: ~40-68 GB/s
- Server DDR5: ~100-150 GB/s
- Apple M4 Pro: 273 GB/s
- High-end GPU VRAM: 500-1,000 GB/s
So the M4 Pro sits between standard server RAM and GPU VRAM — while being a consumer desktop chip that costs $700.
The KV Cache Factor
Every transformer model maintains a key-value (KV) cache — it stores the attention keys and values for every previous token so it doesn’t need to recompute them. This cache grows linearly with context length.
For a 128K context window at full precision:
KV cache size = 2 (keys and values) × layers × kv_heads × head_dim × context_len × sizeof(float16)
= ~8-16GB for large models at 128K tokens
(Models using grouped-query attention have fewer KV heads than attention heads, which shrinks the cache substantially.)
This eats into your available RAM — another reason why 24GB unified memory matters. A 16GB Mac can run the model but may struggle to maintain large context windows.
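The formula above is easy to sanity-check in code. The layer count, KV head count, and head dimension below are illustrative placeholders for an 8B-class dense model, not the real Gemma configuration:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Size of the KV cache: 2 tensors (K and V) per layer,
    each of shape [kv_heads, context_len, head_dim], float16 = 2 bytes."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem

# Illustrative config: 32 layers, 8 KV heads (GQA), head_dim 128, 128K context.
size = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, context_len=131072)
print(f"{size / 1e9:.1f} GB")  # ≈ 17.2 GB at full 128K context
```

Halve the context and the cache halves with it, which is why trimming the context window is the first lever to pull on a RAM-constrained machine.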
Quantization and Bandwidth
Quantization reduces the bits-per-weight, making the model smaller and faster to read:
| Format | Bits/weight | Speed | Quality loss |
|---|---|---|---|
| BF16 | 16 | 1× baseline | None |
| Q8 | 8 | ~2× faster | Negligible |
| Q4 | 4 | ~4× faster | Minor |
| Q2 | 2 | ~8× faster | Significant |
Ollama defaults to Q4 for most models — the sweet spot of speed and quality. This is why Gemma4:e4b at Q4 runs comfortably at 13-18 tok/s despite being a multi-billion parameter model.
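The table's scaling follows directly from bits-per-weight. A quick sketch, using a hypothetical 8B-parameter model (real GGUF quants carry a little extra per-block scale metadata, ignored here):

```python
def quantized_size_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight-file size: parameters × bits, converted to GB.
    Ignores per-block scale/zero-point overhead in real quant formats."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# Hypothetical 8B-parameter model at each format from the table above,
# with the M4 Pro's 273 GB/s as the bandwidth ceiling.
for fmt, bits in [("BF16", 16), ("Q8", 8), ("Q4", 4), ("Q2", 2)]:
    gb = quantized_size_gb(8, bits)
    print(f"{fmt}: {gb:.0f} GB, roofline ~{273 / gb:.1f} tok/s on an M4 Pro")
```

Halving the bits halves the bytes streamed per token, which is exactly why each quantization step roughly doubles decode speed on a bandwidth-bound machine.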
The Practical Takeaway
If you’re running AI models locally:
- Memory bandwidth > raw compute for inference
- Unified memory (Apple Silicon) is more efficient than separate GPU VRAM for models that fit in RAM
- 24GB unified comfortably runs up to ~20GB models with room for KV cache
- A discrete GPU with 16GB VRAM will generate tokens faster than Apple Silicon for models that fit entirely in VRAM, but larger models spill over PCIe into system RAM and slow dramatically
- Cloud APIs sidestep all of this — but cost money and have latency
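Putting the takeaways together, a rough "will it fit" check. The 75% usable fraction is an assumption about OS and framework overhead, not a measured Apple limit, and the sizes are illustrative:

```python
def fits_in_memory(model_gb: float, kv_cache_gb: float,
                   total_ram_gb: float, usable_fraction: float = 0.75) -> bool:
    """Rough check: model weights plus KV cache must fit in the share of
    unified memory the OS realistically leaves to an inference runtime."""
    return model_gb + kv_cache_gb <= total_ram_gb * usable_fraction

print(fits_in_memory(9.6, 4.0, 24))   # True: Q4 model + moderate context on 24GB
print(fits_in_memory(9.6, 4.0, 16))   # False: the same workload is tight on 16GB
```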
For a Mac Mini M4 Pro at $700, running Gemma4:e4b at 15+ tok/s while passing 6/6 long-context tests is genuinely impressive.