AIPulse

Model Benchmarks

5 models tested across reasoning, hallucination resistance, coding, and long-context recall.

Last updated: April 11, 2026 · Hardware: Apple M4 Pro (24GB unified memory, 273 GB/s bandwidth)

Model            | Score | Long Context | Avg Speed | Size  | Type      | Cost
Gemma3:27B       | 9/10  | 4/6          | 93s       | 17GB  | Local     | Free
GPT-4o-mini      | 9/10  | 4/6          | 7s        | Cloud | Cloud API | Paid
Gemini 2.5 Flash | 9/10  | 5/6          | 12s       | Cloud | Cloud API | Free tier
Gemma4:e4b       | 8/10  | 6/6          | 20s       | 9.6GB | Local     | Free
Gemma4:26B (API) | 9/10  | 5/6          | 23s       | Cloud | Cloud API | Free tier

[Charts: Accuracy Score (out of 10) · Avg Response Time (seconds, lower is better) · Overall Capabilities (radar)]

Methodology

Reasoning (3 tests)

  • Logic puzzle (seat arrangement)
  • Ethical dilemma analysis
  • Math + code verification

Hallucination (4 tests)

  • False premise rejection
  • Fake person detection
  • Trick math question
  • Pluto reclassification

Coding (3 tests)

  • Debug broken Python code
  • Algorithm design (O(n) pair sum)
  • SQL query with joins + aggregates
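
For reference, the O(n) pair sum task in the coding list is commonly solved with a single pass and a hash set. The sketch below is one such solution, not the benchmark's actual prompt or reference answer; the function name `find_pair_with_sum` and the choice to return the pair (rather than indices or a boolean) are assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple


def find_pair_with_sum(nums: Sequence[int], target: int) -> Optional[Tuple[int, int]]:
    """Return the first pair of values that adds up to target, or None.

    Hypothetical helper: one pass over nums, with O(1) average-case
    set lookups, so the whole search is O(n) time and O(n) extra space.
    """
    seen = set()  # values visited so far
    for value in nums:
        complement = target - value
        if complement in seen:
            # An earlier value plus the current one reaches the target.
            return (complement, value)
        seen.add(value)
    return None


print(find_pair_with_sum([2, 7, 11, 15], 9))  # (2, 7)
print(find_pair_with_sum([1, 2, 3], 7))       # None
```

Checking the complement before adding the current value means a single element is never paired with itself, while two equal elements (for example `[3, 3]` with target 6) still match.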

Long Context (6 tests)

  • 3,600-word tech trends document
  • Specific fact retrieval
  • Cross-section math calculations
  • Buried detail (appendix)
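
The reasoning, hallucination, and coding test counts (3 + 4 + 3) add up to 10, which lines up with the "out of 10" accuracy score, while long context is reported separately out of 6. The page does not state the scoring rule, so the sketch below assumes one point per pass/fail test. The function name `format_scores` and the pass counts in the usage example are hypothetical and are not any model's actual per-category breakdown.

```python
from typing import Dict, Tuple

# Test counts per category, taken from the methodology lists.
CATEGORY_TESTS: Dict[str, int] = {"reasoning": 3, "hallucination": 4, "coding": 3}
LONG_CONTEXT_TESTS = 6


def format_scores(passed: Dict[str, int], long_context_passed: int) -> Tuple[str, str]:
    """Return (accuracy, long_context) strings such as ("9/10", "5/6").

    Assumption: every test is pass/fail and worth one point.
    """
    for category, total in CATEGORY_TESTS.items():
        count = passed.get(category, 0)
        if not 0 <= count <= total:
            raise ValueError(f"{category}: {count} passes outside 0..{total}")
    if not 0 <= long_context_passed <= LONG_CONTEXT_TESTS:
        raise ValueError(f"long context: {long_context_passed} outside 0..{LONG_CONTEXT_TESTS}")

    accuracy = sum(passed.get(category, 0) for category in CATEGORY_TESTS)
    max_accuracy = sum(CATEGORY_TESTS.values())  # 3 + 4 + 3 = 10
    return f"{accuracy}/{max_accuracy}", f"{long_context_passed}/{LONG_CONTEXT_TESTS}"


# Hypothetical pass counts, for illustration only.
example = {"reasoning": 3, "hallucination": 3, "coding": 3}
print(format_scores(example, 5))  # ('9/10', '5/6')
```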