AIPulse
Hardware 6 min read April 10, 2026

Why Your Mac's 273 GB/s Memory Bandwidth Is the Secret to Fast AI Inference

LLM speed isn't limited by compute — it's limited by how fast you can move model weights from memory to the processor. Here's the physics.


When people ask why an Apple M4 Pro Mac Mini runs large language models faster than a gaming laptop with a dedicated GPU, the answer usually surprises them: it’s not about TFLOPS. It’s about memory bandwidth.

The Inference Bottleneck Is Not Compute

Generating a single token requires loading the model’s weights — billions of floating-point numbers — from memory into the processor, doing some math, then repeating for the next token. The math itself takes microseconds. The bottleneck is how fast you can read those weights from RAM.

This is called being memory-bandwidth-bound.

For a 9.6 GB model like Gemma4:e4b at Q4 quantization:

Bytes read per token ≈ 9.6 GB (every weight is read once per generated token)
Memory BW = 273 GB/s
Max theoretical speed = 273 / 9.6 ≈ 28 tokens/sec

Actual measured speed: ~13-18 tokens/sec, roughly half to two-thirds of that ceiling. The gap is normal once KV-cache reads, attention compute, and scheduling overhead are counted; the ceiling itself is set by bandwidth, not compute.
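The relationship is easy to check with a back-of-the-envelope roofline calculator. This is a sketch, not a benchmark: it uses the common simplification that each weight is read once per generated token, with the model size and bandwidth figures from this article, and real throughput always lands below the ceiling.

```python
def roofline_tps(model_size_gb: float, bandwidth_gbs: float) -> float:
    """Upper bound on decode speed: if every weight is read once per
    token, tokens/sec cannot exceed bandwidth / model size."""
    return bandwidth_gbs / model_size_gb

# Figures from this article: a 9.6 GB Q4 model on 273 GB/s unified memory.
ceiling = roofline_tps(9.6, 273)
print(f"roofline: {ceiling:.1f} tok/s")  # ~28.4 tok/s upper bound
```

Swap in your own model size and bandwidth to get the ceiling for any machine; assumptions that add extra re-reads per token lower the estimate proportionally.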

Why Apple Silicon Wins

On a conventional PC, your GPU has its own VRAM connected via PCIe. The CPU has system RAM connected via a separate memory controller. They don’t share bandwidth.

On Apple Silicon, the CPU, GPU, and Neural Engine all share a single unified memory pool with a single high-bandwidth bus. There’s no PCIe bottleneck. No memory copying between CPU and GPU address spaces.

Hardware                  Memory BW     Max LLM model size
M4 Pro (24GB)             273 GB/s      ~20GB (leaves headroom)
RTX 4090 (24GB VRAM)      1,008 GB/s    ~20GB
RTX 3090 (24GB VRAM)      936 GB/s      ~20GB
Gaming laptop (16GB)      68 GB/s       ~12GB
M1 MacBook Air            68 GB/s       ~12GB

The RTX 4090 has 3.7× the bandwidth — but the card alone costs as much as an entire Mac Mini and needs a full tower PC around it.

What 273 GB/s Means in Practice

The M4 Pro’s 273 GB/s comes from LPDDR5X RAM running at extremely high clock speeds across a wide 256-bit bus. Apple designed this specifically for the Neural Engine and GPU workloads.
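The headline figure is just bus width times transfer rate. A quick sanity check, assuming (as is widely reported for the M4 Pro) LPDDR5X running at 8533 MT/s on that 256-bit bus:

```python
bus_width_bits = 256        # assumed: 256-bit unified memory bus
transfers_per_sec = 8533e6  # assumed: LPDDR5X-8533 (8,533 MT/s)

bytes_per_transfer = bus_width_bits / 8  # 32 bytes move per transfer
bandwidth_gbs = bytes_per_transfer * transfers_per_sec / 1e9
print(f"{bandwidth_gbs:.0f} GB/s")  # ≈ 273 GB/s
```

The same arithmetic explains the other numbers below: a typical laptop's dual-channel 128-bit bus at lower clocks lands in the tens of GB/s.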

For comparison:

  • Your laptop’s system RAM: ~40-68 GB/s
  • Server DDR5: ~100-150 GB/s
  • Apple M4 Pro: 273 GB/s
  • High-end GPU VRAM: 500-1,000 GB/s

So the M4 Pro sits between standard server RAM and GPU VRAM — while shipping in a consumer desktop that starts around $1,400.

The KV Cache Factor

Every transformer model maintains a key-value (KV) cache — it stores the attention keys and values for every previous token so it doesn’t need to recompute them. This cache grows linearly with context length.

For a 128K context window at full precision:

KV cache size = 2 (keys + values) × layers × kv_heads × head_dim × context_len × sizeof(float16)
             ≈ ~8-16GB for large models at 128K tokens

(Models with grouped-query attention keep kv_heads much smaller than the full attention head count, which is what makes 128K contexts feasible at all.)

This eats into your available RAM — another reason why 24GB unified memory matters. A 16GB Mac can run the model but may struggle to maintain large context windows.
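Plugging numbers into the formula above makes the scale concrete. The configuration here is illustrative, not the specs of any particular model: a 32-layer network using grouped-query attention with 8 KV heads of dimension 128, at float16.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """KV cache = keys + values (the factor of 2) for every layer,
    KV head, head dimension, and context position, at float16 (2 bytes)."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem

# Illustrative config: 32 layers, 8 KV heads, head_dim 128, 128K context.
size = kv_cache_bytes(32, 8, 128, 128 * 1024)
print(f"{size / 2**30:.0f} GiB")  # 16 GiB
```

Halve the context and the cache halves with it, which is why trimming the context window is the first lever to pull on a RAM-constrained machine.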

Quantization and Bandwidth

Quantization reduces the bits-per-weight, making the model smaller and faster to read:

Format    Bits/weight    Speed          Quality loss
BF16      16             1× baseline    None
Q8        8              ~2× faster     Negligible
Q4        4              ~4× faster     Minor
Q2        2              ~8× faster     Significant

Ollama defaults to Q4 for most models — the sweet spot of speed and quality. This is why Gemma4:e4b at Q4 runs comfortably at 13-18 tok/s despite being a multi-billion parameter model.
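Because decode speed scales with bytes read per token, halving bits-per-weight roughly doubles throughput. A sketch of that scaling, using an illustrative 8-billion-parameter model and ignoring the small per-block metadata real quantization formats carry:

```python
def model_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight-file size; ignores quantization metadata."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

params = 8.0                  # illustrative 8B-parameter model
base = model_gb(params, 16)   # BF16 baseline
for fmt, bits in [("BF16", 16), ("Q8", 8), ("Q4", 4), ("Q2", 2)]:
    size = model_gb(params, bits)
    print(f"{fmt}: {size:.1f} GB, ~{base / size:.0f}x the read speed of BF16")
```

The quality column is the part this arithmetic can't capture — Q4 keeps most of it, Q2 usually doesn't.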

The Practical Takeaway

If you’re running AI models locally:

  1. Memory bandwidth > raw compute for inference
  2. Unified memory (Apple Silicon) is more efficient than separate GPU VRAM for models that fit in RAM
  3. 24GB unified comfortably runs up to ~20GB models with room for KV cache
  4. A gaming laptop GPU with 16GB VRAM will generate tokens faster for models that fit in VRAM, but can't run larger models at all
  5. Cloud APIs sidestep all of this — but cost money and have latency
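The checklist above can be folded into a quick feasibility check. This is a sketch for unified-memory machines, where weights, KV cache, and the OS all share one pool; the 6 GB OS/overhead reserve is a rough assumption, not a measured figure.

```python
def fits(ram_gb: float, model_gb: float, kv_cache_gb: float,
         os_reserve_gb: float = 6.0) -> bool:
    """On unified memory, weights + KV cache + OS overhead must all fit
    in the single pool; there is no separate VRAM to spill into."""
    return model_gb + kv_cache_gb + os_reserve_gb <= ram_gb

print(fits(24, 9.6, 4))  # True:  9.6 GB model + 4 GB cache fits in 24 GB
print(fits(16, 9.6, 4))  # False: the same workload is tight on 16 GB
```

It is deliberately conservative — a machine that fails the check may still limp along with swapping, but token speed falls off a cliff once weights page to disk.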

For a Mac Mini M4 Pro, running Gemma4:e4b at 15+ tok/s while handling long contexts is genuinely impressive.