AIPulse
Benchmarks · 8 min read · April 11, 2026

Gemma4 vs GPT-4o-mini vs Gemini 2.5 Flash: Full Benchmark Results

We ran 5 AI models through 16 tests covering reasoning, hallucination resistance, coding, and long-context recall. Here are the raw results.


We tested five models head-to-head: Gemma3:27B (local), GPT-4o-mini (OpenAI API), Gemini 2.5 Flash (Google API), Gemma4:e4b (local), and Gemma4:26B (Google AI Studio API). Cloud models were called via their respective APIs; local models ran on an Apple M4 Pro Mac Mini with 24GB of unified memory.

The Scorecard

Model             Score  Long Context  Avg Speed  Size
Gemma3:27B        9/10   4/6           93s        17GB local
GPT-4o-mini       9/10   4/6           7s         Cloud
Gemini 2.5 Flash  9/10   5/6           12s        Cloud
Gemma4:e4b        8/10   6/6           20s        9.6GB local
Gemma4:26B (API)  9/10   5/6           23s        Cloud API

Reasoning Tests

Q1 — Logic Puzzle (seat arrangement): All models passed. Gemma4:e4b used formal mathematical notation (set-theoretic constraints) which was the most precise approach.

Q2 — Ethical Dilemma (trolley problem variant): All passed. Gemma4:e4b gave the most structured analysis, explicitly labeling utilitarian, deontological, and virtue ethics frameworks.

Q3 — Snail / Math + Code: All passed. The key insight is that the snail reaches the top on day 28 during the daytime climb, not at the end of the day; a model that only checked the snail's height after the nightly slip would report a later day, but none did.
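The article doesn't reproduce the exact puzzle parameters, so this sketch assumes the classic setup (a 30m well, 3m climbed each day, 2m slipped each night) to show why checking mid-day matters:

```python
# Snail puzzle sketch. The well depth, climb, and slip values are
# assumptions (the classic 30/3/2 version); the article only gives the
# answer, day 28.
def snail_days(depth=30, climb=3, slip=2):
    height, day = 0, 0
    while True:
        day += 1
        height += climb       # daytime climb
        if height >= depth:   # reaches the top mid-day: no night slip
            return day
        height -= slip        # nighttime slip

print(snail_days())  # -> 28
```

Checking the height only after the slip (end of day) would give a later day, which is exactly the mistake the test probes for.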

Hallucination Tests

H1 — False Premise (Mars 1987): All models correctly rejected the premise. Nobody has walked on Mars.

H2 — Fake Person (Dr. Harold Finster): All models correctly identified this person doesn’t exist and provided the real 2003 Nobel Prize in Physics winners (Abrikosov, Ginzburg, Leggett for superconductivity/superfluidity).

H3 — Trick Math (roosters don’t lay eggs): All passed instantly. Gemma4:26B gave the fastest and most concise answer: “Roosters don’t lay eggs.”

H4 — Pluto Trap: This is where it gets interesting. The question asks to "include Pluto in your count," a trick to see whether models mindlessly comply. Gemini 2.5 Flash failed: it answered 9 with no correction. The other four models also answered 9 as instructed, but noted that Pluto is a dwarf planet, reclassified by the IAU in 2006.

Coding Tests

C1 — Debug Broken Python: The snippet contained four bugs: a missing colon in the for loop, =+ instead of +=, a results/result variable-name mismatch, and a division-by-zero edge case on an empty list. All models found the first three; only Gemma3:27B and GPT-4o-mini also flagged the edge case.
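The article lists the four bugs but not the test snippet itself, so this average() function is a hypothetical reconstruction showing all four fixes in place:

```python
# Hypothetical reconstruction of the C1 test (the real snippet isn't
# shown in the article). Bugs fixed below:
#   1. missing colon on the for loop
#   2. 'total =+ n' (assigns +n) changed to 'total += n'
#   3. 'results' / 'result' name mismatch unified
#   4. division-by-zero guard for an empty list (the edge case only
#      Gemma3:27B and GPT-4o-mini flagged)
def average(numbers):
    if not numbers:           # bug 4: len([]) == 0 would divide by zero
        return 0.0
    total = 0
    for n in numbers:         # bug 1: colon restored
        total += n            # bug 2: '+=' accumulates correctly
    result = total / len(numbers)
    return result             # bug 3: consistent variable name

print(average([2, 4, 6]))  # -> 4.0
print(average([]))         # -> 0.0
```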

C2 — Pair Sum Algorithm: All models produced correct O(n) hash-set solutions. Gemini 2.5 Flash gave the clearest explanation of why the naive O(n²) approach is worse.
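The prompt isn't reproduced in the article, but the O(n) hash-set approach all five models converged on looks like this (function name and test values are illustrative):

```python
def has_pair_sum(nums, target):
    """Return True if two distinct elements sum to target.

    One pass with a hash set: for each n, check whether its complement
    (target - n) was already seen. O(n) time vs. the naive O(n^2)
    double loop over all pairs.
    """
    seen = set()
    for n in nums:
        if target - n in seen:  # complement already visited
            return True
        seen.add(n)
    return False

print(has_pair_sum([1, 4, 7, 2], 9))  # -> True  (7 + 2)
print(has_pair_sum([1, 2, 3], 7))     # -> False
```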

C3 — SQL Challenge: All produced correct queries with JOIN, GROUP BY, SUM, COUNT, WHERE for date filtering, and LIMIT 3.
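The article names the required clauses but not the schema, so the orders/customers tables below are assumptions; the query shape (JOIN, GROUP BY, SUM, COUNT, a date-filtering WHERE, LIMIT 3) matches what the test asked for, run here against an in-memory SQLite database:

```python
# Hypothetical C3-style query -- the schema and data are illustrative,
# not the benchmark's actual tables.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                         amount REAL, order_date TEXT);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Ben'), (3, 'Cy');
    INSERT INTO orders VALUES
        (1, 1, 120.0, '2026-01-05'),
        (2, 1,  80.0, '2026-02-10'),
        (3, 2, 300.0, '2026-01-20'),
        (4, 3,  50.0, '2025-12-31');  -- excluded by the date filter
""")

rows = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total, COUNT(o.id) AS n_orders
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    WHERE o.order_date >= '2026-01-01'   -- date filtering
    GROUP BY c.id
    ORDER BY total DESC
    LIMIT 3
""").fetchall()

print(rows)  # -> [('Ben', 300.0, 1), ('Ada', 200.0, 2)]
```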

Long Context (6 tests on a 3,600-word document)

This is where Gemma4:e4b stood out — it was the only model to score 6/6, including correctly finding “Dr. Sarah Chen, PhD (MIT, Stanford AI Lab alumna)” buried in the appendix.

  • Gemma4:e4b: 6/6 ✓
  • Gemini 2.5 Flash: 5/6
  • Gemma4:26B: 5/6
  • Gemma3:27B / GPT-4o-mini: 4/6 (both missed the buried researcher detail)

The one question that tripped models was Q5 (largest market + highest CAGR). The answer requires reading Chapter 7’s cross-reference table and understanding that “largest” refers to Space ($546B) and “highest CAGR” refers to Quantum (48.7%) — they’re different sectors. Gemma4:e4b correctly identified both as separate answers.

Key Takeaways

  1. Gemma4:e4b is the best local model — 9.6GB footprint, 6/6 long context (best overall), 20s average response
  2. Gemini 2.5 Flash is the best cloud option on a free tier at 12s average, though it was the only model to miss the Pluto correction
  3. GPT-4o-mini is the fastest at 7s average but costs money
  4. Gemma3:27B is 17GB and 93s — same accuracy as smaller models, just slower
  5. Long context is where local models surprise — Gemma4:e4b beat every cloud model

Reproducibility

All benchmark scripts are on GitHub, so you can reproduce every test exactly. The scripts use the ollama Python library for local models and the standard API clients for cloud models.
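The "Avg Speed" column comes from timing each call; a minimal sketch of that pattern, with a placeholder standing in for the real model calls (ollama.chat locally, the OpenAI/Google clients for cloud — the actual harness lives in the repo):

```python
import time

def timed(fn, *args):
    """Run fn and return (result, elapsed_seconds) -- the pattern behind
    the scorecard's 'Avg Speed' column."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Placeholder: in the real scripts this is ollama.chat(...) for local
# models or an API client request for cloud models.
def fake_model_call(prompt):
    return f"answer to: {prompt}"

answer, elapsed = timed(fake_model_call, "H3 trick math")
print(answer)  # -> answer to: H3 trick math
```

Averaging `elapsed` over all 16 tests per model yields the per-model speed figures.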