We tested five models head-to-head across two rounds: 10 standard tests covering reasoning, hallucination, and coding, followed by 4 harder tests added after community feedback. Models tested: Gemma3:27B (local), GPT-4o-mini (OpenAI API), Gemini 2.5 Flash (Google API), Gemma4:e4b (local), and Gemma4:26B (Google AI Studio API).
Hardware: Apple M4 Pro Mac Mini, 24GB unified memory, 273 GB/s memory bandwidth. Local models ran via Ollama.
Final Scorecard (14 tests)
| Model | Standard (10) | Hard (4) | Total | Long Context | Avg Response Time |
|---|---|---|---|---|---|
| Gemini 2.5 Flash | 9/10 | 4/4 | 13/14 | 5/6 | 12s |
| Gemma4:26B (API) | 9/10 | 4/4 | 13/14 | 5/6 | 23s |
| GPT-4o-mini | 9/10 | 3/4 | 12/14 | 4/6 | 7s |
| Gemma4:e4b | 8/10 | 3/4 | 11/14 | 6/6 | 20s |
| Gemma3:27B | 9/10 | — | 9/10 | 4/6 | 93s |
Round 1 — Standard Tests (10 tests)
Reasoning
Q1 — Logic Puzzle (seat arrangement): All models passed. Gemma4:e4b used formal mathematical notation with set-theoretic constraints — the most precise approach.
Q2 — Ethical Dilemma (trolley problem variant): All passed. Gemma4:e4b gave the most structured analysis, explicitly labeling utilitarian, deontological, and virtue ethics frameworks.
Q3 — Snail / Math + Code: All passed. The key insight is that the snail reaches the top during the daytime climb on day 28, before the nightly slip applies; a model that misses this gets the wrong answer. None did.
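The exact numbers in the snail prompt are not reproduced in this post. As a minimal sketch, assume the classic version of the puzzle (a 30-foot well, 3 feet up per day, 2 feet back per night), which is consistent with the day-28 answer. The function name and parameters are illustrative, not taken from the benchmark scripts.

```python
def days_to_escape(depth, climb, slip):
    """Return the day on which the snail first reaches the top."""
    if climb <= 0:
        raise ValueError("climb must be positive")
    if climb < depth and climb <= slip:
        raise ValueError("the snail never makes net progress")
    height = 0
    day = 0
    while True:
        day += 1
        height += climb        # daytime climb
        if height >= depth:    # check BEFORE the night slip
            return day
        height -= slip         # nighttime slip


# Assumed parameters (classic puzzle, not stated in the post).
print(days_to_escape(depth=30, climb=3, slip=2))  # 28
```

Dividing the depth by the net gain of 1 foot per day gives 30, which is the wrong answer the test is designed to catch.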
Hallucination Tests
H1 — False Premise (Mars 1987): All models correctly rejected the premise.
H2 — Fake Person (Dr. Harold Finster): All models correctly identified that this person doesn’t exist and gave the real winners of the 2003 Nobel Prize in Physics (Abrikosov, Ginzburg, Leggett).
H3 — Trick Math (roosters don’t lay eggs): All passed. Gemma4:26B was the fastest and most concise: “Roosters don’t lay eggs.”
H4 — Pluto Trap: The question asks the model to “include Pluto in your count.” Gemini 2.5 Flash failed: it answered 9 with no correction. All others answered 9 but noted that the IAU reclassified Pluto as a dwarf planet in 2006.
Coding Tests
C1 — Debug Broken Python: The script contained four bugs: a missing colon, =+ instead of +=, a results/result variable name mismatch, and a division-by-zero edge case on an empty list. All models found the first three; only Gemma3:27B and GPT-4o-mini also flagged the edge case.
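The broken script itself is not shown in this post, so the following is a hypothetical reconstruction of what a fully corrected version might look like, assuming the function computes an average. The name average and the choice to return None on empty input are assumptions for illustration.

```python
def average(numbers):
    """Return the arithmetic mean of numbers, or None for an empty list."""
    if not numbers:               # bug 4: guard against division by zero
        return None
    result = 0
    for n in numbers:             # bug 1: a colon was missing on a line like this
        result += n               # bug 2: "=+" reassigns instead of accumulating
    return result / len(numbers)  # bug 3: "results" would be an undefined name


print(average([2, 4, 6]))  # 4.0
print(average([]))         # None
```

The =+ bug is the subtle one: result =+ n is valid Python (it parses as result = +n), so it runs without error and silently returns the wrong value.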
C2 — Pair Sum Algorithm: All produced correct O(n) hash-set solutions.
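For reference, a hash-set solution of the kind described typically looks like the sketch below. The exact task wording is not given here, so this assumes the goal is to return one pair of values that adds up to a target; find_pair is an illustrative name.

```python
def find_pair(numbers, target):
    """Return one (a, b) pair with a + b == target, or None if there is none."""
    seen = set()
    for n in numbers:
        complement = target - n
        if complement in seen:   # average O(1) membership test
            return (complement, n)
        seen.add(n)
    return None


print(find_pair([2, 7, 11, 15], 9))  # (2, 7)
print(find_pair([1, 2, 3], 100))     # None
```

Each element is visited once and each set operation is O(1) on average, which gives the O(n) running time, compared with O(n²) for checking every pair.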
C3 — SQL Challenge: All produced correct queries with JOIN, GROUP BY, SUM, COUNT, date filtering, and LIMIT 3.
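The schema and prompt for the SQL challenge are not included in this post. The sketch below assumes a hypothetical customers/orders schema and a "top 3 customers by spend in 2024" question, and runs a query with the listed features against an in-memory SQLite database. All table, column, and function names are assumptions.

```python
import sqlite3

# Hypothetical query: top 3 customers by total spend during 2024.
QUERY = """
SELECT c.name,
       COUNT(o.id)   AS order_count,
       SUM(o.amount) AS total_spent
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.id
WHERE o.order_date >= '2024-01-01' AND o.order_date < '2025-01-01'
GROUP BY c.id, c.name
ORDER BY total_spent DESC
LIMIT 3;
"""


def demo():
    """Build a tiny assumed dataset in memory and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER,
            amount REAL,
            order_date TEXT
        );
        INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy'), (4, 'Di');
        INSERT INTO orders VALUES
            (1, 1,  50.0, '2024-02-01'),
            (2, 1,  25.0, '2024-03-01'),
            (3, 2, 100.0, '2024-05-05'),
            (4, 3,  10.0, '2024-06-01'),
            (5, 4, 500.0, '2023-12-31'),
            (6, 4,   5.0, '2024-01-15');
    """)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


print(demo())  # [('Ben', 1, 100.0), ('Ana', 2, 75.0), ('Cy', 1, 10.0)]
```

Di's 500.0 order falls outside the date filter, so Di drops to fourth place and is cut by LIMIT 3. A query that forgets the date filter would rank Di first.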
Round 2 — Harder Tests (4 tests)
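These tests were added after community feedback that the standard tests weren’t hard enough to separate the models. Gemma3:27B was not run in this round, so the tables below cover the other four models.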
H1 — Multi-hop Logic Puzzle
Five people, five pets, five constraints requiring chaining multiple deductions. The correct answer: Alice=Hamster, Bob=Bird, Carol=Fish, Dave=Dog, Eve=Cat.
| Model | Result | Notes |
|---|---|---|
| Gemma4:e4b | ❌ | Got confused at constraint intersection, wrong final answer |
| GPT-4o-mini | ✅ | Correct, clean step-by-step |
| Gemma4:26B | ✅ | Correct with full working |
| Gemini 2.5 Flash | ✅ | Correct, most structured breakdown |
H2 — Competition Math (AIME-style)
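“Find the number of integers n, 1 ≤ n ≤ 1000, such that n² - n is divisible by 5.” Answer: 400.
All 4 models passed with correct working: factor n² - n as n(n - 1); since 5 is prime it must divide n or n - 1, so n ≡ 0 or 1 (mod 5); each residue class contains 200 values in the range, giving 200 + 200 = 400. No separation here: all four handled this competition-style number theory problem.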
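The answer is easy to confirm by brute force. This short check counts the qualifying integers directly and also counts the two residue classes used in the argument.

```python
# Count n in 1..1000 with n^2 - n divisible by 5.
count = sum(1 for n in range(1, 1001) if (n * n - n) % 5 == 0)

# Count each residue class separately: n ≡ 0 and n ≡ 1 (mod 5).
by_residue = {r: sum(1 for n in range(1, 1001) if n % 5 == r) for r in (0, 1)}

print(count)       # 400
print(by_residue)  # {0: 200, 1: 200}
```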
H3 — Adversarial Hallucination (Fake Theorem)
Asked about “the Einzel-Hoffmann theorem in graph theory” — a completely fabricated theorem.
| Model | Result | Notes |
|---|---|---|
| Gemma4:e4b | ✅ | Immediately said “no recognized theorem by this name exists” |
| GPT-4o-mini | ❌ | Hallucinated — confidently explained the fake theorem with applications |
| Gemma4:26B | ✅ | Correctly said it doesn’t exist, offered real Hoffman-related theorems |
| Gemini 2.5 Flash | ✅ | Correctly rejected, noted “Einzel” is German for “individual/single” |
This is the most important finding of the harder tests. GPT-4o-mini fabricated a detailed, plausible-sounding explanation for a theorem that doesn’t exist. The other three all correctly flagged it.
H4 — Subtle Code Edge Cases
A second_largest() function that crashes on empty lists, single-element lists, and all-duplicate lists.
All 4 models passed: each identified all three edge cases and provided a correct fix. Gemma4:e4b returned None for edge cases; GPT-4o-mini and Gemini 2.5 Flash raised ValueError with descriptive messages (arguably better practice, since a silent None can be mistaken for a valid result).
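The original buggy function is not reproduced here. As a sketch of the ValueError style of fix, the version below assumes "second largest" means the second-largest distinct value, which is what makes an all-duplicate list an error case.

```python
def second_largest(values):
    """Return the second-largest distinct value in values.

    Raises ValueError for empty, single-element, and all-duplicate inputs,
    since none of those contain two distinct values.
    """
    distinct = set(values)
    if len(distinct) < 2:
        raise ValueError(
            f"need at least two distinct values, got {len(distinct)}"
        )
    distinct.remove(max(distinct))  # drop the largest
    return max(distinct)            # the largest of what remains


print(second_largest([3, 1, 4, 4]))  # 3
```

A single check on the number of distinct values covers all three edge cases at once, so there is no need for separate branches for empty and one-element lists.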
Long Context (6 tests on a 3,600-word document)
Gemma4:e4b remains the only model to score 6/6. It was the only one to find “Dr. Sarah Chen, PhD (MIT, Stanford AI Lab alumna)”, a detail buried in the appendix that all cloud models missed in the first round.
| Model | Score |
|---|---|
| Gemma4:e4b | 6/6 |
| Gemini 2.5 Flash | 5/6 |
| Gemma4:26B | 5/6 |
| Gemma3:27B | 4/6 |
| GPT-4o-mini | 4/6 |
Key Takeaways
- Gemini 2.5 Flash and Gemma4:26B tie at 13/14: best overall accuracy, but Gemini is roughly 2× faster via API (12s vs 23s average)
- GPT-4o-mini’s hallucination failure is significant: confidently fabricating a fake theorem is a real risk in production use
- Gemma4:e4b wins long context at 6/6: the only model to do so, running free and locally at 9.6GB
- Gemma4:e4b struggles with multi-hop logic: fine for most tasks, but chained constraint reasoning is a weakness
- Gemma3:27B is outclassed: its accuracy is comparable to the much smaller Gemma4:e4b (9/10 vs 8/10 on the standard tests, 4/6 vs 6/6 on long context), but at 17GB and a 93s average (vs 20s) it is hard to justify over Gemma4:e4b
- Gemma4:26B via API was very slow on H1 (363s): API latency varies significantly
Methodology
All benchmark scripts are open source on GitHub. Every test is reproducible. Local models ran on Ollama; cloud models used their standard Python SDKs.
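The actual scripts live in the GitHub repository and are not reproduced here. As a rough illustration of how a single local test could be timed, the sketch below calls Ollama's /api/generate endpoint on its default local port using only the standard library. The helper names and the timing approach are assumptions, not the author's harness.

```python
import json
import time
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_payload(model, prompt):
    """Build a non-streaming request body for /api/generate."""
    return {"model": model, "prompt": prompt, "stream": False}


def ask_ollama(model, prompt, url=OLLAMA_URL):
    """Send one prompt and return (answer_text, wall_clock_seconds)."""
    body = json.dumps(build_payload(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    elapsed = time.perf_counter() - start
    return reply["response"], elapsed


def average_seconds(timings):
    """Mean of a non-empty list of per-test timings in seconds."""
    return sum(timings) / len(timings)
```

With an Ollama server running and the model pulled, a call such as ask_ollama("gemma4:e4b", "Which planet is closest to the sun?") returns the answer text and the elapsed time. Wall-clock timing like this includes model load time on the first request, which can inflate averages.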
Updated: April 13, 2026 — added Round 2 harder tests after community feedback on r/LocalLLaMA.