# Model Benchmarks

Five models tested across reasoning, hallucination resistance, coding, and long-context recall.
Last updated: April 11, 2026 · Hardware: Apple M4 Pro (24GB unified memory, 273 GB/s bandwidth)
| Model | Accuracy (/10) | Long Context (/6) | Avg Response | Size | Type | Cost |
|---|---|---|---|---|---|---|
| Gemma3:27B | 9/10 | 4/6 | 93s | 17GB | Local | Free |
| GPT-4o-mini | 9/10 | 4/6 | 7s | — | Cloud API | Paid |
| Gemini 2.5 Flash | 9/10 | 5/6 | 12s | — | Cloud API | Free tier |
| Gemma4:e4b | 8/10 | 6/6 | 20s | 9.6GB | Local | Free |
| Gemma4:26B (API) | 9/10 | 5/6 | 23s | — | Cloud API | Free tier |
(Charts omitted: accuracy score out of 10, average response time in seconds (lower is better), and an overall-capabilities radar.)
## Methodology

### Reasoning (3 tests)
- Logic puzzle (seat arrangement)
- Ethical dilemma analysis
- Math + code verification
### Hallucination (4 tests)
- False premise rejection
- Fake person detection
- Trick math question
- Pluto reclassification
### Coding (3 tests)
- Debug broken Python code
- Algorithm design (O(n) pair sum)
- SQL query with joins + aggregates
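For reference, the O(n) pair-sum task above has a canonical hash-set solution. This is a sketch of the expected answer, not the exact grading rubric; the function name and signature are assumptions:

```python
def has_pair_with_sum(nums, target):
    """Return True if any two distinct elements of nums sum to target.

    Runs in O(n): a single pass with constant-time set lookups,
    versus O(n^2) for the naive nested-loop comparison.
    """
    seen = set()
    for x in nums:
        if target - x in seen:  # have we already seen x's complement?
            return True
        seen.add(x)
    return False

print(has_pair_with_sum([3, 8, 5, 1], 9))  # True (8 + 1)
print(has_pair_with_sum([2, 4, 6], 5))     # False
```

Models were expected to produce a single-pass solution of roughly this shape rather than the quadratic brute-force approach.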
### Long Context (6 tests)
Six questions over a 3,600-word tech trends document, including:
- Specific fact retrieval
- Cross-section math calculations
- Buried detail (appendix)