Gemma4 vs GPT-4o-mini vs Gemini 2.5 Flash: Full Benchmark Results (Updated)

We tested 5 models head-to-head across two rounds: 10 standard tests covering reasoning, hallucination, and coding, then 4 harder tests added after community feedback. Models tested: Gemma3:27B (local), GPT-4o-mini (OpenAI API), Gemini 2.5 Flash (Google API), Gemma4:e4b (local), and Gemma4:26B (Google AI Studio API). Gemma3:27B was run on the standard tests only.

Hardware: Apple M4 Pro Mac Mini, 24GB unified memory, 273 GB/s bandwidth. Local models ran via Ollama.

Final Scorecard (14 tests)

Model	Easy (10)	Hard (4)	Total	Long Context (6)	Avg Response Time
Gemini 2.5 Flash	9/10	4/4	13/14	5/6	12s
Gemma4:26B (API)	9/10	4/4	13/14	5/6	23s
GPT-4o-mini	9/10	3/4	12/14	4/6	7s
Gemma4:e4b	8/10	3/4	11/14	6/6	20s
Gemma3:27B	9/10	—	9/10	4/6	93s

Round 1 — Standard Tests (10 tests)

Reasoning

Q1 — Logic Puzzle (seat arrangement): All models passed. Gemma4:e4b used formal mathematical notation with set-theoretic constraints — the most precise approach.

Q2 — Ethical Dilemma (trolley problem variant): All passed. Gemma4:e4b gave the most structured analysis, explicitly labeling utilitarian, deontological, and virtue ethics frameworks.

Q3 — Snail / Math + Code: All passed. The key insight is that the snail reaches the top on day 28 during the daytime climb — models that miss this get the wrong answer. None did.
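
The post does not quote the puzzle's numbers, so the sketch below assumes the classic version (a 30-foot well, 3 feet climbed by day, 2 feet slipped at night), which is consistent with the day-28 answer. The function name and parameters are illustrative, not taken from the benchmark scripts. It simulates day by day and checks for escape before applying the night-time slip, which is the step a naive 30 / (3 - 2) = 30 calculation skips.

```python
def days_to_escape(depth: int, climb: int, slip: int) -> int:
    """Return the day on which the snail first reaches the top of the well."""
    if climb <= 0:
        raise ValueError("climb must be positive")
    if climb < depth and climb <= slip:
        raise ValueError("the snail never makes net progress")

    height = 0
    day = 0
    while True:
        day += 1
        height += climb          # daytime climb
        if height >= depth:      # check BEFORE the night-time slip
            return day
        height -= slip           # night-time slip


print(days_to_escape(30, 3, 2))  # 28 under the assumed parameters
```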

Hallucination Tests

H1 — False Premise (Mars 1987): All models correctly rejected the premise.

H2 — Fake Person (Dr. Harold Finster): All models correctly identified this person doesn’t exist and gave the real 2003 Nobel Prize winners (Abrikosov, Ginzburg, Leggett).

H3 — Trick Math (roosters don’t lay eggs): All passed. Gemma4:26B fastest and most concise: “Roosters don’t lay eggs.”

H4 — Pluto Trap: The question asks to “include Pluto in your count.” Gemini 2.5 Flash failed — it just answered 9 with no correction. All others answered 9 but noted Pluto is a dwarf planet reclassified by the IAU in 2006.

Coding Tests

C1 — Debug Broken Python: 4 bugs: missing colon, =+ instead of +=, results vs result, division-by-zero edge case on empty list. All models found the first three; Gemma3:27B and GPT-4o-mini also flagged the edge case.
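
The broken function itself is not reproduced in this post. As an illustration only, here is a hypothetical `average` function showing the same four bug classes in their fixed form, with a comment marking where each one would have been. Returning 0.0 for an empty list is one possible choice; raising an exception is equally defensible.

```python
def average(numbers):                 # bug 1: the colon was missing on this line
    if not numbers:                   # bug 4: without this guard, len(numbers) == 0
        return 0.0                    #        causes ZeroDivisionError below
    result = 0
    for n in numbers:
        result += n                   # bug 2: `result =+ n` parses as `result = (+n)`,
                                      #        overwriting instead of accumulating
    return result / len(numbers)      # bug 3: `results` here would be a NameError


print(average([1, 2, 3]))
print(average([]))
```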

C2 — Pair Sum Algorithm: All produced correct O(n) hash-set solutions.
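
The exact prompt is not quoted here, so the following is a minimal sketch of the hash-set approach, assuming the task is "return any two values from the list that sum to a target". The name `find_pair` and its return convention are assumptions. Each element is visited once, and set membership is O(1) on average, which gives the O(n) bound.

```python
from typing import Optional


def find_pair(nums: list[int], target: int) -> Optional[tuple[int, int]]:
    """Return a pair of values from nums that sums to target, or None."""
    seen: set[int] = set()
    for x in nums:
        complement = target - x
        if complement in seen:        # an earlier element completes the pair
            return (complement, x)
        seen.add(x)                   # add after the check so x is not paired with itself
    return None


print(find_pair([2, 7, 11, 15], 9))   # (2, 7)
```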

C3 — SQL Challenge: All produced correct queries with JOIN, GROUP BY, SUM, COUNT, date filtering, and LIMIT 3.

Round 2 — Harder Tests (4 tests)

Added after community feedback that the standard tests weren’t hard enough to separate models.

H1 — Multi-hop Logic Puzzle

Five people, five pets, five constraints requiring chaining multiple deductions. The correct answer: Alice=Hamster, Bob=Bird, Carol=Fish, Dave=Dog, Eve=Cat.

Model	Result	Notes
Gemma4:e4b	❌	Got confused at constraint intersection, wrong final answer
GPT-4o-mini	✅	Correct, clean step-by-step
Gemma4:26B	✅	Correct with full working
Gemini 2.5 Flash	✅	Correct, most structured breakdown

H2 — Competition Math (AIME-style)

“Find the number of integers n, 1 ≤ n ≤ 1000, such that n²-n is divisible by 5.” Answer: 400.

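All 4 models passed with correct working: factor n²-n as n(n-1); since 5 is prime, it must divide n or n-1, so n ≡ 0 or 1 (mod 5); each residue class contributes 200 values in the range, giving 200+200=400. No separation here: all four models handled this competition-style number theory problem.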
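
The answer is easy to confirm by brute force. This short check is not part of the benchmark; it simply counts the qualifying integers and compares the result with the residue-class argument.

```python
# Count n in [1, 1000] with n^2 - n divisible by 5.
brute_force = sum(1 for n in range(1, 1001) if (n * n - n) % 5 == 0)

# Residue-class argument: n must be congruent to 0 or 1 modulo 5.
by_residue = sum(1 for n in range(1, 1001) if n % 5 in (0, 1))

print(brute_force, by_residue)  # 400 400
```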

H3 — Adversarial Hallucination (Fake Theorem)

Asked about “the Einzel-Hoffmann theorem in graph theory” — a completely fabricated theorem.

Model	Result	Notes
Gemma4:e4b	✅	Immediately said “no recognized theorem by this name exists”
GPT-4o-mini	❌	Hallucinated — confidently explained the fake theorem with applications
Gemma4:26B	✅	Correctly said it doesn’t exist, offered real Hoffman-related theorems
Gemini 2.5 Flash	✅	Correctly rejected, noted “Einzel” is German for “individual/single”

This is the most important finding of the harder tests. GPT-4o-mini fabricated a detailed, plausible-sounding explanation for a theorem that doesn’t exist. The other three all correctly flagged it.

H4 — Subtle Code Edge Cases

A second_largest() function that crashes on empty lists, single-element lists, and all-duplicate lists.

All 4 models passed — identified all three edge cases and provided correct fixes. Gemma4:e4b returned None for edge cases; GPT-4o-mini and Gemini raised ValueError with descriptive messages (arguably better practice).
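
A minimal sketch of a fixed function covering both conventions, assuming "second largest" means the second largest distinct value (this is what makes an all-duplicate list an edge case). The `strict` flag is an illustrative addition and does not appear in any model's answer.

```python
def second_largest(values, strict=True):
    """Return the second largest distinct value.

    With fewer than two distinct values (empty, single-element, or
    all-duplicate input), raise ValueError if strict, else return None.
    """
    distinct = set(values)
    if len(distinct) < 2:
        if strict:
            raise ValueError("need at least two distinct values")
        return None
    distinct.remove(max(distinct))    # drop the largest value
    return max(distinct)              # the largest remaining is the second largest


print(second_largest([3, 1, 4, 4]))          # 3
print(second_largest([7, 7], strict=False))  # None
```

Raising makes the failure visible at the call site, while returning None pushes the check onto every caller.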

Long Context (6 tests on a 3,600-word document)

Gemma4:e4b remains the only model to score 6/6, including finding “Dr. Sarah Chen, PhD (MIT, Stanford AI Lab alumna)” buried in the appendix — which all cloud models missed in the first round.

Model	Score
Gemma4:e4b	6/6
Gemini 2.5 Flash	5/6
Gemma4:26B	5/6
Gemma3:27B	4/6
GPT-4o-mini	4/6

Key Takeaways

Gemini 2.5 Flash and Gemma4:26B tie at 13/14, the best overall accuracy, but Gemini is roughly 2× faster (12s vs. 23s average), with both accessed via API
GPT-4o-mini’s hallucination failure is significant — confidently fabricating a fake theorem is a real risk in production use
Gemma4:e4b wins long context at 6/6 — the only model to do so, running free and locally on 9.6GB
Gemma4:e4b struggles with multi-hop logic — fine for most tasks, but chained constraint reasoning is a weakness
Gemma3:27B is outclassed — same accuracy as smaller models at 17GB and 93s average, hard to justify over Gemma4:e4b
Gemma4:26B via API was very slow on the Round 2 H1 multi-hop logic puzzle (363s); API latency varies significantly

Methodology

All benchmark scripts are open source on GitHub. Every test is reproducible. Local models ran on Ollama; cloud models used their standard Python SDKs.
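
For readers who want to time a local model themselves, here is a rough sketch of a timing wrapper. It is not the benchmark's actual script. `ask_ollama` assumes Ollama is serving on its default local port and that the model tag matches the one used in this post; `timed` works with any callable that maps a prompt to an answer, so a cloud SDK call could be swapped in.

```python
import json
import time
import urllib.request
from typing import Callable

# Assumption: Ollama's default local endpoint for non-chat generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def ask_ollama(prompt: str, model: str = "gemma4:e4b") -> str:
    """Send one prompt to a local Ollama server and return the full response text."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


def timed(ask: Callable[[str], str], prompt: str) -> tuple[str, float]:
    """Call ask(prompt) and return (answer, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    answer = ask(prompt)
    return answer, time.perf_counter() - start


# Example with a stand-in callable, so no server is needed to try the wrapper:
answer, seconds = timed(lambda p: p.upper(), "how many planets?")
print(answer, f"{seconds:.4f}s")
```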

Updated: April 13, 2026 — added Round 2 harder tests after community feedback on r/LocalLLaMA.