I Ran a 7,500-Token Architecture Spec Through 4 Models. The Cheapest One Caught Everything the Flagship Did.
This is a submission for the Gemma 4 Challenge: Write About Gemma 4
TL;DR. Gemma 4 31B (open weights, $0.12 / $0.37 per million tokens) was benchmarked against Gemini 3.1 Pro Preview, DeepSeek V4 Pro, and Claude Opus 4.7. The task: read a 7,500-token architecture spec, apply it to design a 4-module trading bot, then adversarially critique the spec itself.
Three results worth a developer's time.
One. Gemma 4 31B agreed with Claude Opus 4.7 and Gemini 3.1 Pro on the layer assignment of all four modules: the same structural call as the $12-per-million flagship, at roughly 1/14 the per-call cost on this task.
Two. Gemma 4 31B caught every one of the four major architectural flaws that all four models converged on, including the most subtle one, with the shortest output of the four.
Three. The full reusable setup is on GitHub. Total OpenRouter cost: $0.05.
If you are a solo developer auditing your own architecture documents, Gemma 4 31B is the model you can afford to run on every iteration.
Why Gemma 4 was the model I most wanted to test
For solo developers and small teams in 2026 the most expensive line item in an LLM-assisted workflow is not the model. It is the willingness to skip a step because the model is expensive.
If you have to think twice before running a $0.06 critique pass on every draft, you will skip it most of the time, and the architecture quality of your output will reflect that.
Gemma 4 31B Dense is priced at $0.12 input and $0.37 output per million tokens on OpenRouter. For a single 7,500-token system prompt plus a 2,000-token user task, with 4,000 tokens of structured output back, that works out to roughly a quarter of a US cent per critique. That is a price you do not have to think about.
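To make the arithmetic concrete, here it is as a few lines of Python, using the OpenRouter list prices above and the approximate token counts from this test:

```python
# Per-critique cost at Gemma 4 31B's OpenRouter list prices.
input_tokens = 7_500 + 2_000      # architecture spec + user task
output_tokens = 4_000             # structured critique coming back

price_in_per_m = 0.12             # USD per million input tokens
price_out_per_m = 0.37            # USD per million output tokens

cost = (input_tokens * price_in_per_m + output_tokens * price_out_per_m) / 1_000_000
print(f"${cost:.4f} per critique")  # -> $0.0026, about a quarter of a US cent
```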
The question I cared about: does a 31B-parameter open-weights model read dense Markdown architecture specs at the same depth as a frontier closed model? If yes, the entire economics of solo architecture work changes.
So I picked an architecture spec that was deliberately hard: 7,500 tokens of mixed Korean and English Markdown, four named strata (P0 / P1 / P2 / P3), eight hard-locked principles, five cross-domain invariants, and four domain mappings. The kind of document where surface-skim model reading misses the actual constraints.
Then I gave the same spec, with identical prompts, to Gemma 4 31B, Gemini 3.1 Pro Preview, DeepSeek V4 Pro, and Claude Opus 4.7.
This is what came back.
The setup
All four models got the same system prompt, with the architecture spec loaded directly, and the same user prompt asking for two outputs.
PART 1. Apply the spec to a real domain: design a crypto trading bot as four modules (signal / risk / executor / state), assign each module to one of the four strata, specify hard-locked invariants, sketch the Python class, and write the test assertions.
PART 2. Adversarially critique the spec. Flag spurious universal claims, list missing layers, name the over-abstractions, and say if you would actually use this.
No retries, no cherry-picking, temperature 0.3 for the three OpenRouter calls. Claude Opus 4.7 ran in a clean subagent context for fairness.
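For reference, a minimal sketch of what one of the three OpenRouter calls looks like. The model slug and the user-prompt file name are illustrative; the actual caller is run_round2_v5_spec.py, linked at the end.

```python
# Minimal sketch of one OpenRouter call from the protocol.
# Model slug and the user-prompt file name are illustrative placeholders.
import os
import requests

spec = open("EFA_Universal_Architecture.md", encoding="utf-8").read()  # ~7,500-token system prompt
task = open("round2_user_prompt.md", encoding="utf-8").read()          # PART 1 + PART 2 user prompt

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "google/gemma-4-31b",   # swapped for the other two slugs per run
        "temperature": 0.3,              # fixed for all three OpenRouter calls
        "messages": [
            {"role": "system", "content": spec},
            {"role": "user", "content": task},
        ],
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```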
| Model | Provider | Context | $/M in | $/M out | This call cost |
|---|---|---|---|---|---|
| Gemma 4 31B Dense | Google open | 262K | $0.12 | $0.37 | ~$0.003 |
| Gemini 3.1 Pro Preview | Google flagship | 1M | $2.00 | $12.00 | ~$0.043 |
| DeepSeek V4 Pro (MoE 1.6T) | DeepSeek | 1M | $0.435 | $0.87 | ~$0.012 |
| Claude Opus 4.7 | Anthropic | 1M | n/a | n/a | Max session, marginal $0 |
Gemma 4 31B's per-call cost was roughly 1/14 of Gemini 3.1 Pro Preview's. Closer in price to a chat message than to an audit pass.
The full prompts and the four parsed JSON responses are on GitHub. Linked at the end.
What I evaluated, concretely, was three things.
- Layer assignment table. Same architecture should yield similar mapping. Where does each model put the four modules?
- Critique parity. Did each model catch the same architectural flaws? Where did they diverge?
- The verdict. Asked to say if they would actually use the spec. What did each model conclude?
Result 1. Layer assignment. Gemma 4 31B matched the flagship.
Four modules. Four models. Four layers each. 16 cells.
| Module | Gemma 4 31B | Gemini 3.1 Pro | Claude Opus 4.7 | DeepSeek V4 Pro |
|---|---|---|---|---|
| signal | P2 | P2 | P2 | P2 |
| risk | P3 | P3 | P3 | P0 |
| executor | P1 | P1 | P1 | P1 |
| state | P0 | P0 | P0 | P0 |
Gemma 4 31B agreed with Claude Opus 4.7 and Gemini 3.1 Pro on every module, making the same structural call as the $12-per-million flagship for roughly 1/14 the per-call inference cost on this task.
The outlier was not Gemma 4. It was DeepSeek V4 Pro. The reasoning-heavy 1.6 trillion parameter MoE model put risk in P0 (always-on safety) rather than P3 (validation). Read charitably this is defensible. Circuit breakers are conceptually always-on. The other three models read the spec more literally and kept risk in the validation stratum where the spec puts it.
I am going to use the strict reading. The point for this article is that on a 7,500-token Markdown spec that asks for careful semantic placement, Gemma 4 31B matched the flagship reading. The cheap model did not give a sloppy answer.
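For concreteness, the converged mapping reads like the test assertion PART 1 asked each model to write. A minimal sketch; the dict literal simply mirrors the table above.

```python
# The converged module-to-stratum mapping from the table above,
# written as the kind of test assertion PART 1 asked each model for.
EXPECTED_STRATA = {
    "signal":   "P2",   # all four models
    "risk":     "P3",   # Gemma 4, Gemini 3.1 Pro, Claude Opus 4.7 (DeepSeek said P0)
    "executor": "P1",   # all four models
    "state":    "P0",   # all four models
}

def test_layer_assignment(modules: dict[str, str]) -> None:
    for name, stratum in EXPECTED_STRATA.items():
        assert modules[name] == stratum, f"{name} should live in {stratum}, got {modules[name]}"
```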
For a solo developer auditing their own document, this means you can run Gemma 4 31B as the primary reader on every revision and reserve frontier models for the moments you genuinely want a second opinion.
Result 2. Critique parity. Gemma 4 31B caught every flaw the flagships caught.
I built PART 2 specifically to surface model divergence. To watch them attack the architecture from different angles. The opposite happened.
On the four most important critiques, all four models agreed, Gemma 4 31B included.
Critique 1. "Self-correction is silent" is dangerous in audited domains.
The architecture spec's Cross-domain Invariant 2 says the system should heal its own drift quietly. No user-facing output. This is sane for an LLM wandering off-tone in a content draft. It is dangerous in trading, or legal, or any regulated domain, where silent recovery hides state mutations that real money depends on.
Gemma 4 31B's wording: "In financial and trading systems silent recovery of state drift is an anti-pattern. State anomalies must be highly observable and often require halting. Not silent internal masking." Direct and short. Same structural diagnosis as the longer outputs from Claude and DeepSeek.
Critique 2. "Plain Text only" is content-system thinking.
Core Principle 7 in the spec says all generation is Plain Text. Format conversion to HTML or JSON happens only at the very last step. The principle is correct for content pipelines. It is meaningless for a trading bot where the "generation" is a Python dataclass and the "conversion" is a ccxt API payload.
All four models flagged this. Three of them proposed the same fix. Replace "Plain Text" with "Schema-Validated Intermediate Representation". Same idea. Gemma 4 31B was among the three that proposed it.
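A rough sketch of what that fix means in the trading-bot reading of the spec: generation produces a validated dataclass, and the exchange-shaped payload only exists at the very last step. Field names and the validation rules are illustrative, not from the spec.

```python
# Sketch of "Schema-Validated Intermediate Representation" for the trading domain.
# The dataclass is the "generation"; an API-shaped dict exists only at the last step.
from dataclasses import dataclass

@dataclass(frozen=True)
class OrderIntent:
    symbol: str        # e.g. "BTC/USDT"
    side: str          # "buy" or "sell"
    amount: float      # position size decided upstream (P3 sizing)

    def __post_init__(self):
        if self.side not in ("buy", "sell"):
            raise ValueError(f"invalid side: {self.side}")
        if self.amount <= 0:
            raise ValueError("amount must be positive")

def to_exchange_payload(intent: OrderIntent) -> dict:
    """Last-step conversion: the only place an exchange-API-shaped dict is built."""
    return {"symbol": intent.symbol, "type": "market", "side": intent.side, "amount": intent.amount}
```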
Critique 3. "Brand/tone injection" stretches in non-content domains.
P3 in the original spec was where you injected the brand voice. Tone only. No structure changes. The spec maps this to "sizing" in the trading bot domain. Position size is structural. Not tonal.
All four caught the strain. Gemini 3.1 Pro Preview said it most directly. Gemma 4 31B said the same thing in fewer words: "P3 Brand Injection. While useful for LLMs as a general architectural term it is too vague to be actionable for non-content domains. In Trading it is just Sizing."
Critique 4. The "universal" claim is too strong.
The spec is titled "Universal Layered Architecture". Section 5 walks it back with a "this is a thinking pattern not a framework" disclaimer. All four models noticed the tension between the title and the disclaimer.
This convergence matters, with a caveat: with 4 models the sample is small, and the article cannot make strong "this is signal not noise" claims from N=4. What the convergence does show: each of the four models found the same four flaws independently, Gemma 4 31B included. A solo developer running Gemma 4 31B alone would surface these same four issues without paying for the other three calls.
Result 3. Gemma 4 31B's signature critique. The retag I would not have found alone.
Convergence on the obvious flaws is reassuring. Divergence on what to do about it is where each model's training depth shows. Each of the four models produced a distinct signature suggestion.
Gemma 4 31B's signature was the most actionable structural fix in the entire audit.
The spec calls P0 a "stratum" in a vertical four-layer model. Gemma 4 31B observed that P0 is actually a cross-cutting concern in software engineering terms. A decorator. Middleware. An interceptor wrapping the other layers. Not a stratum sitting at the top of them.
Reclassifying P0 from layer to interceptor changes how the architecture maps to concrete code. If you treat P0 as a stratum you spend energy figuring out where the always-on watchdog fits in the vertical ordering. If you treat P0 as an interceptor you wrap the existing P1-P2-P3 flow with the watchdog. The implementation is simpler. The mental model is cleaner.
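A minimal sketch of the interceptor reading, with hypothetical function names standing in for the pipeline stages: P0 wraps the flow instead of sitting on top of it.

```python
# P0 as an interceptor: a decorator wrapping the P1 -> P2 -> P3 flow,
# not a fourth stratum stacked above it. Function names are hypothetical.
import functools

def p0_watchdog(check):
    """P0 as a cross-cutting interceptor: wrap any stage with the always-on check."""
    def decorator(stage):
        @functools.wraps(stage)
        def wrapped(state):
            check(state)              # runs before the stage
            new_state = stage(state)
            check(new_state)          # and on whatever the stage produced
            return new_state
        return wrapped
    return decorator

def halt_on_limit_breach(state: dict) -> None:
    # Hypothetical hard-locked invariant; stands in for SAFE-12's -$3 daily loss cap.
    if state.get("daily_loss", 0.0) <= -3.0:
        raise RuntimeError("P0 halt: daily loss limit breached")

@p0_watchdog(halt_on_limit_breach)
def p1_execute(state: dict) -> dict:
    # The P1 executor stage; its internals are unchanged by the retag.
    return {**state, "order_sent": True}

# The same decorator wraps the P2 (signal) and P3 (sizing) stages.
```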
This is the kind of fix an experienced engineer would propose, and it came from Gemma 4 31B's 9KB JSON output, the shortest of any model in the test. Brevity did not compromise depth.
The other models' signatures.
- Claude Opus 4.7 caught a category error in the spec. "Hard-locked invariants are byte-exact" uses text-world language for numeric thresholds like SAFE-12's -$3 daily loss limit. The actual property is "changed only via versioned patch": a config-management property, not a byte property (sketched after this list). Sharp observation. Did not propose a structural fix.
- Gemini 3.1 Pro Preview caught that the spec's three-reviewer AND-gate makes no sense for deterministic logic. It is theater when applied to a ccxt order payload. Direct. Honest. Did not propose a structural fix beyond "make this optional".
- DeepSeek V4 Pro identified five missing layers (observability, persistence, deployment, data pipeline, authz) and drew a full mermaid sequence diagram. Exhaustive coverage. Highest-volume output.
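To make Claude's point concrete, here is what "changed only via versioned patch" might look like as a config-management property rather than a byte property. The structure is illustrative; only the SAFE-12 identifier and the -$3 limit come from the spec.

```python
# Claude's observation, made concrete: the hard-locked invariant is a versioned
# numeric threshold, not a byte-exact string. Structure is illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class HardLockedInvariant:
    id: str
    threshold: float
    unit: str
    version: int          # bumped only via a reviewed, versioned patch
    changelog: tuple      # (version, reason) pairs, append-only

SAFE_12 = HardLockedInvariant(
    id="SAFE-12",
    threshold=-3.0,       # daily loss limit in USD, from the spec
    unit="USD/day",
    version=1,
    changelog=((1, "initial limit"),),
)
```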
Out of all the proposals across the four models, Gemma 4 31B's P0-as-interceptor suggestion is the single fix I will implement first. It is also the one I am least likely to have found on my own.
Result 4. Gemma 4 31B's verdict on the spec was the most surgical.
The final question I asked each model, after it had applied the spec and then critiqued it: would you actually use this architecture?
Gemma 4 31B's answer.
"Conditional. Yes for LLM-orchestrated workflows where reliability and hallucination-proofing are more critical than latency. No for pure high-performance software where the overhead of multi-stage validation is a bottleneck."
Two clean cases, a clear bright line, no diplomatic hedging, no "it depends on many factors" filler.
This is the kind of answer that compresses well into a decision rule. If your task is LLM-orchestrated and reliability matters more than speed, use the spec. Otherwise do not. That is a usable heuristic.
Compare to the other answers.
- Claude Opus 4.7: "Conditional. Yes for V5.0 specifically because the layer ordering captures the module separation I wanted anyway. No as a general universal architecture." More precise about the specific use case. Less generalizable.
- DeepSeek V4 Pro: "Conditional. For LLM-based content yes. For trading bot adapt the safety concept but discard the plain text and trigger and brand injection layers." More elaborate. Higher reading cost.
- Gemini 3.1 Pro Preview: "Conditional. I would absolutely use this for the Content and Legal domains but discard it for the Trading Bot domain." Most direct rejection. But Gemini also produced a full trading bot spec following the same architecture in the same response. A literal contradiction between the spec it produced and its closing verdict. I find that contradiction useful as a data point but I cannot tell whether it reflects model honesty or sampling variance.
For a solo developer who needs a fast decision rule, Gemma 4 31B's answer is the most directly usable.
What I changed in the spec
After reading the four critiques together I edited the architecture document. Three changes. All driven by Gemma 4 31B's findings either alone or in convergence with the other models.
- Cross-domain Invariant 2 is no longer "Self-correction is silent". It is a four-tier escalation contract: silent for tone, logged for state, surfaced for policy, blocked for safety (sketched after this list). Driven by all four models, Gemma 4 31B included.
- Core Principle 7 is no longer "Plain Text only". It is "Schema-Validated Intermediate Representation". Driven by Gemma 4 31B (most concise version of the proposal). Confirmed by Gemini 3.1 Pro and DeepSeek V4 Pro.
- P0 is being reclassified from "stratum" to "interceptor". Single-source attribution. Gemma 4 31B's signature contribution. The other three models converged on the flaw but did not propose this specific fix.
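A sketch of the escalation contract from the first edit. The tier names come from the revised invariant; the mapping from drift type to tier is my own illustration.

```python
# The four-tier escalation contract that replaced "self-correction is silent".
# Tier names come from the revised invariant; the mapping below is illustrative.
from enum import Enum

class Escalation(Enum):
    SILENT = "silent"        # tone drift: heal quietly, no user-facing output
    LOGGED = "logged"        # state drift: heal, but leave an audit trail
    SURFACED = "surfaced"    # policy drift: notify the operator, keep running
    BLOCKED = "blocked"      # safety drift: halt, never silently mask

ESCALATION_BY_DRIFT = {
    "tone": Escalation.SILENT,
    "state": Escalation.LOGGED,
    "policy": Escalation.SURFACED,
    "safety": Escalation.BLOCKED,
}
```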
Three edits. The third one came from the cheapest model in the lineup. The one I could afford to run on every iteration.
What this does not say
I want to be specific about what the data supports and what it does not.
Four models is a small sample. I tested Gemma 4 31B alongside three other models, two of which (Claude and Gemini) were trained by labs that may share critique heuristics with parts of Gemma's training corpus. To claim "convergence is signal not noise" would require 7 to 10 models, including non-Google and non-Anthropic lineages, with multiple temperatures and seeds, plus a control prompt (no architecture spec) to check base-rate critique overlap. I did not run those controls.
What the data does show: on one architecture document with one prompt structure at temperature 0.3, Gemma 4 31B produced layer assignments matching the two flagship models on all four modules, caught every flaw the others caught, and contributed the most actionable structural fix in the set.
The $0.05 figure is the inference cost. Not the cost of the audit. The architecture improvements required me to read four JSON outputs. Reconcile them. Edit the document. The inference was an input to my work. Not the work itself.
The Gemini paradox (same model producing the spec and saying not to use it) is a literal contradiction in its output. Whether it reflects "honesty" or "instruction-following two sub-tasks separately" or sampling variance. I cannot tell from a single run. I noted it because the contradiction is itself informative regardless of what causes it.
I am the operator who wrote the prompt and reads the outputs. There is a real risk that I am pattern-matching the four responses against my own preferences. Acknowledging the risk does not eliminate it.
Reproduce this
The setup is small and runs in under five minutes.
- `EFA_Universal_Architecture.md` is the system prompt. ~7,500 tokens. Dense Markdown with strata definitions and domain mappings. Available on the GitHub repo.
- `run_round2_v5_spec.py` is the OpenRouter caller for the three open and semi-open models. Uses the standard chat completions API.
- The Claude Opus 4.7 call was made via a clean subagent in Claude Code. Same prompt content. Independent context.
- The four parsed JSON responses are in `results/round_2_v5_spec/`.
Minimum-cost reproduction path. Run only Gemma 4 31B. Skip the other three. Total cost ~$0.003. You will get the same four converging flaws and a usable structural suggestion. If you want a second opinion add DeepSeek V4 Pro for missing-layer coverage at ~$0.01 more.
The two-model setup (Gemma 4 + DeepSeek V4) covers the convergence layer plus the deep critique layer for roughly a quarter of the cost of running the full four-model set. For solo developers auditing their own blueprints this is the path I would actually recommend.
If you run this protocol on your own architecture document and the results diverge from mine, I would like to see the comparison. Counter-experiments welcome.
Jack. wildeconforce.com