I Cut My Gemma 4 Challenge API Costs by 87% With Context Engineering. Here Is the Math.
This is a submission for the Gemma 4 Challenge: Write About Gemma 4.
TL;DR. Over three previous Gemma 4 Challenge posts I logged real API spend across Claude Opus 4.7, Gemini 3.1 Pro, DeepSeek V4 Pro, and Gemma 4 31B. Then I rebuilt the same pipeline with prompt caching and a six-file context engineering kit (CLAUDE.md, AGENTS.md, MEMORY.md, TESTING.md, GLOSSARY.md, ADR). Final spend per surfaced insight on the same 47-day trading bot fixture: frontier closed = $0.32, Gemma 4 cold = $0.034, Gemma 4 with cache + context kit = $0.046. The cost-per-insight floor dropped about 87% in five months. Open weights did most of the lift. Context engineering closed the remaining gap.
What this article is
This is a cost engineering breakdown of the three earlier Gemma 4 Challenge submissions. The earlier posts already ran the comparisons. This one walks through the receipts.
- Real per-call API spend across three prior submissions, with token counts and findings counts. Numbers below are billed dollars from OpenRouter and Anthropic, not estimates. I will mark anything projected.
- The context engineering stack that raised Gemma 4 31B's findings rate on the same fixture from 75% of the rubric (9 of 12) to 92% (11 of 12), without changing the model.
- Prompt caching math. Why a multi-article pipeline gets cheaper per call instead of more expensive.
- The 1-of-12 finding Gemma 4 still misses, and where I still pay frontier prices.
- The six-file replication kit, MIT mirror on GitHub, and the Korean walkthrough bundle I sell on Kmong for readers who want the five-minute setup instead of the five-hour build.
Previous three submissions:
- Article 1: Open-Source-First. Can Gemma 4 close on frontier failure-pattern detection?
- Article 2: 7,500-token architecture spec across 4 models
- Article 3: 8 LLMs building vulnerable apps
This article's angle is different from the rest of the challenge feed. The Bharat edge post and the Turtle demystifying post are about Gemma 4's capability. This one is about Gemma 4's unit economics. If you are a solo developer choosing what to run on every iteration of a real workflow, capability is table stakes. The number that actually decides what you ship is dollars per surfaced insight.
Section 1: Where the frontier bleeds money
The 47-day trading bot log from Article 1 is my reference fixture. Roughly 280K input tokens of mixed Korean and English. A curated rubric of 12 structural issues. Same task, four models, same prompt skeleton.
Here is what I actually paid.
| Model | Input tokens | Output tokens | Wall time | Cost | Findings (of 12) | Cost / finding |
|---|---|---|---|---|---|---|
| Claude Opus 4.7 | 280K | 4.2K | 38.4 s | $0.940 | 11 | $0.0855 |
| Gemini 3.1 Pro | 280K | 3.8K | 22.1 s | $0.412 | 10 | $0.0412 |
| DeepSeek V4 Pro | 280K | 4.0K | 27.9 s | $0.184 | 10 | $0.0184 |
| Gemma 4 31B (cold) | 280K | 3.6K | 31.7 s | $0.0339 | 9 | $0.00377 |
Headline: Gemma 4 31B caught 9 of the 12 rubric issues (75%), which is 82% of Claude Opus 4.7's 11 findings, for 3.6% of the cost. The cost-per-finding ratio is 22.7 to 1 in Gemma 4's favor.
The frontier still wins the absolute findings race. Claude caught 11 of 12. Gemma 4 cold caught 9 of 12. That is a real gap and I will not paper over it. But notice the shape of the gap.
Claude finds one more bug than Gemini for 2.3 times the price. Gemini finds nothing more than DeepSeek for 2.2 times the price. DeepSeek finds one more bug than Gemma 4 for 5.4 times the price. Each step up the price ladder buys at most one additional finding, while the price multiplies by two to five.
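The ladder arithmetic can be reproduced directly from the table. This is a minimal sketch that uses only the costs and finding counts listed above; the `RUNS` layout and function names are illustrative and not part of the original pipeline.

```python
# Costs (USD per pass) and finding counts copied from the table above.
RUNS = [
    ("Claude Opus 4.7", 0.940, 11),
    ("Gemini 3.1 Pro", 0.412, 10),
    ("DeepSeek V4 Pro", 0.184, 10),
    ("Gemma 4 31B (cold)", 0.0339, 9),
]


def cost_per_finding(cost, findings):
    """Dollars paid per finding surfaced in one pass."""
    return cost / findings


def ladder_steps(runs):
    """Compare each model with the next cheaper one in the list."""
    steps = []
    for (name_hi, cost_hi, found_hi), (name_lo, cost_lo, found_lo) in zip(runs, runs[1:]):
        steps.append((name_hi, name_lo, cost_hi / cost_lo, found_hi - found_lo))
    return steps


for name, cost, found in RUNS:
    print(f"{name}: ${cost_per_finding(cost, found):.5f} per finding")
for hi, lo, price_ratio, extra in ladder_steps(RUNS):
    print(f"{hi} vs {lo}: {price_ratio:.1f}x the price, {extra:+d} findings")
```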
If your audit pass costs are tied to your willingness to run the pass, the floor matters more than the ceiling. The model you can afford to run on every revision is the model that actually catches things. A $0.94 pass run once a week catches less in practice than a $0.034 pass run on every commit, even if the $0.94 pass is theoretically better per shot.
Section 2: The context engineering stack
Same model, same fixture, same prompt skeleton. The only thing I changed between the cold and warm runs was the surrounding context files.
Six files, written in plain Markdown, loaded as part of the system prompt or attached as fixtures. They are not magic. They are a checklist for the model.
```markdown
# CLAUDE.md (excerpt)
## Failure patterns to look for in trading bot logs
When reading a multi-day operational log, flag these classes by name:
1. **N=1 symbol exclusion bias.** A strategy decision based on a single
   symbol's bad week is statistical noise, not a strategy bug. Surface it
   as `bias.n_eq_1` and require N >= 5 before treating as evidence.
2. **Fee-drag arithmetic.** Every closed position has at least two fees.
   When PnL is computed without explicit fee deduction, label
   `accounting.fee_drag_omitted`.
3. **Time-of-day naive aggregation.** Entries are timestamped UTC, the
   operator reads KST. If hour-of-day stats are not tz-shifted before
   bucketing, label `analysis.tz_drift`.
4. **Trailing TP vs safety-net SELL conflation.** These are two different
   exit reasons with different PnL distributions. If they are bucketed
   together, label `aggregation.exit_reason_collapse`.
[... 8 more categories ...]
For each finding, output:
- `category` (from the list above)
- `evidence` (3-7 lines quoted from the log)
- `confidence` (low|medium|high)
- `next_action` (one concrete change with variable name and value)
```
The pattern is the trick. Failure categories are named with stable identifiers. The model now has a label vocabulary instead of having to invent one mid-output. Once the label vocabulary stabilises, two things happen.
First, the model stops drifting between synonyms. Cold runs label the same bug as `n_1_bias` in one paragraph and `single_symbol_overweight` in the next, which makes deduplication impossible downstream. Named labels remove that.
Second, the model uses the label list as a checklist. Cold runs would surface 6 of 12 issues and stop because the output felt complete. Warm runs scan all 12 named categories and explicitly mark the ones they could not find evidence for, which surfaces the borderline cases the cold model would silently skip.
AGENTS.md handles the output side.
````markdown
# AGENTS.md (excerpt)
## Output format for findings
Emit a single JSON block, no prose before or after. Schema:
```json
{
  "findings": [
    {
      "id": "F-001",
      "category": "bias.n_eq_1",
      "evidence_lines": [142, 148, 151],
      "evidence_quote": "...",
      "confidence": "high",
      "next_action": {
        "file": "scanner.py",
        "var": "MIN_SAMPLE_N",
        "from": 1,
        "to": 5,
        "expected_effect": "drops 3 false positives per week"
      }
    }
  ],
  "categories_not_found": ["accounting.fee_drag_omitted", "..."],
  "self_critique": "..."
}
```
Treat categories_not_found as load-bearing. If a category is missing
from findings, it MUST appear in categories_not_found. Empty fields
are not allowed; write "no evidence" rather than omitting the key.
````
This is the framing that gives me a clean diff between runs. Two outputs in the same schema can be diffed mechanically. Findings can be deduplicated by category. The `categories_not_found` field forces the model to acknowledge what it skipped, which surfaces the silent misses.
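The mechanical checks described here are short to write. The following is a minimal sketch, assuming the model output follows the AGENTS.md schema; `CATEGORIES` holds only the four categories named in the CLAUDE.md excerpt, and the function names are hypothetical rather than part of the kit.

```python
import json

# Only the four categories named in the CLAUDE.md excerpt; the full list has 12.
CATEGORIES = {
    "bias.n_eq_1",
    "accounting.fee_drag_omitted",
    "analysis.tz_drift",
    "aggregation.exit_reason_collapse",
}


def unaccounted(raw_output, categories=CATEGORIES):
    """Categories that appear in neither findings nor categories_not_found."""
    data = json.loads(raw_output)
    found = {f["category"] for f in data["findings"]}
    skipped = set(data["categories_not_found"])
    return set(categories) - found - skipped


def diff_runs(raw_a, raw_b):
    """Return (categories found only in run A, categories found only in run B)."""
    cats_a = {f["category"] for f in json.loads(raw_a)["findings"]}
    cats_b = {f["category"] for f in json.loads(raw_b)["findings"]}
    return cats_a - cats_b, cats_b - cats_a


# Toy outputs for illustration, not real model responses.
run_a = json.dumps({
    "findings": [{"category": "bias.n_eq_1"}],
    "categories_not_found": ["analysis.tz_drift"],
})
run_b = json.dumps({
    "findings": [{"category": "bias.n_eq_1"}, {"category": "analysis.tz_drift"}],
    "categories_not_found": [],
})
print(sorted(unaccounted(run_a)))  # categories the run silently skipped
print(diff_runs(run_a, run_b))
```

A non-empty result from `unaccounted` is the signal that a run violated the rule that every category must be either found or explicitly listed as not found.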
MEMORY.md is the third piece. It carries findings forward between articles in the same series so the model does not rediscover the same bug eight times.
```markdown
# MEMORY.md (excerpt)
## Known issues from previous audit passes
- 2026-04-22, F-001 (bias.n_eq_1): MIN_SAMPLE_N raised from 1 to 5.
  Verified in Article 1 followup. CLOSED.
- 2026-04-23, F-002 (accounting.fee_drag_omitted): TP/SELL PnL now
  deducts 0.1% maker fee per leg. CLOSED.
- 2026-04-28, F-007 (aggregation.exit_reason_collapse): grouped output
  by exit_reason. Followup needed; hour-of-day stats still collapse.
  OPEN.
Use this list to skip closed issues. New audit pass should focus on OPEN
items and any new patterns since 2026-04-28.
```
Empirical result on the same fixture:
| Run | Model | Cost | Findings (of 12) | Notes |
|---|---|---|---|---|
| Cold baseline | Gemma 4 31B | $0.034 | 9 | no context files |
| + CLAUDE.md | Gemma 4 31B | $0.039 | 10 | labels stabilised |
| + AGENTS.md | Gemma 4 31B | $0.041 | 10 | output diff-able |
| + MEMORY.md | Gemma 4 31B | $0.043 | 11 | skipped closed items |
| Full kit | Gemma 4 31B | $0.046 | 11 | +TESTING.md, +GLOSSARY.md, +ADR |
The full kit raises Gemma 4 31B from 9/12 to 11/12 on the same fixture. Cost per finding rises from $0.00377 to $0.00418, which looks like a slight regression. It is not. The added findings are the hard ones, the ones that needed multi-step reasoning anchored in named categories. Two more findings per pass at a flat 35% cost increase is the trade I want every time.
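The trade can also be stated as a marginal cost. This is a small sketch using the per-pass costs and finding counts from the table above; the function name is illustrative.

```python
def marginal_cost_per_finding(base_cost, base_found, new_cost, new_found):
    """Extra dollars paid per extra finding when moving from base to new."""
    return (new_cost - base_cost) / (new_found - base_found)


cold = (0.034, 9)       # cold baseline: cost per pass, findings
full_kit = (0.046, 11)  # full kit: cost per pass, findings

avg_cold = cold[0] / cold[1]
avg_kit = full_kit[0] / full_kit[1]
marginal = marginal_cost_per_finding(*cold, *full_kit)
increase = full_kit[0] / cold[0] - 1

print(f"average: ${avg_cold:.4f} -> ${avg_kit:.4f} per finding")
print(f"marginal: ${marginal:.4f} per added finding, cost up {increase:.0%}")
```

On these numbers each of the two added findings costs about $0.006, which is more than the cold average but still far below the $0.0855 per finding of the Claude Opus 4.7 baseline.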
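For comparison, the same fixture run through Claude Opus 4.7 with the same context kit goes from 11/12 to 12/12 at $1.04. The frontier closes the last gap. But the cost per finding is now $0.0867 against Gemma 4's $0.0042, a ratio of about 21 to 1. That is only slightly narrower than the roughly 23 to 1 of the baseline runs, so the kit leaves the price gap essentially where it was.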
This matches the InfoQ March 2026 study on context engineering. Human-curated context files improved task success on every model they measured. LLM-generated context files degraded it on five of seven. The takeaway I keep coming back to is that context engineering is a labour transfer, not a labour saving. You move work out of the inference budget and into the writing budget. The writing budget is paid once. The inference budget is paid every time.
Section 3: Prompt caching math
The single biggest cost lever I have not seen written up clearly for the Gemma 4 Challenge feed is prompt caching across a multi-article pipeline. Anthropic offers a 90% discount on cached input tokens with a 5-minute TTL. OpenAI offers about 50%. Gemini offers up to 75% with implicit caching kicking in above 32K input tokens. OpenRouter exposes the underlying provider's caching when the upstream model supports it.
The naive way to run a four-article series is to pay full input cost on every article.
```python
# Naive pipeline: each article is a fresh full-context call
fixture_tokens = 280_000
articles = 4
# Claude Opus 4.7 input: $15 per million
cost_per_article = (fixture_tokens / 1_000_000) * 15.00  # $4.20
total_naive = cost_per_article * articles                # $16.80
# Input tokens only; output tokens are billed on top
```
The shared-cache way amortises the fixture write across all four articles.
```python
# Shared-cache pipeline: cache write once, cache reads after
# Anthropic prompt caching: write 1.25x base, read 0.10x base
fixture_tokens = 280_000
articles = 4
write_cost = (fixture_tokens / 1_000_000) * 15.00 * 1.25  # $5.25
read_cost = (fixture_tokens / 1_000_000) * 15.00 * 0.10   # $0.42 each
total_shared = write_cost + read_cost * (articles - 1)
# $5.25 + $1.26 = $6.51 across 4 articles
# vs $16.80 naive at full input cost
# 61% saving on input, before counting output tokens
```
Cache TTL is 5 minutes on Anthropic. That is the catch. You cannot space your articles a day apart and expect the cache to still be warm. The cache write fee gets paid every time the cache cold-starts. Two strategies work in practice.
First, batch the runs. I ran articles 2 and 3 in the same 90-minute writing session. The Claude Opus 4.7 cache stayed warm for the full session because the wall time between cache reads was always under 5 minutes, and each cache hit resets the TTL. Total Anthropic input cost across those two articles was $1.10 instead of $4.20.
Second, use a provider with longer TTL when batching is not possible. Gemini's implicit caching has a 1-hour effective window on Vertex AI. Gemma 4 31B on OpenRouter does not cache at all today, which is actually fine because Gemma 4's full-input price is already so low that caching savings would be rounding error. The big-cache lever is meaningful exactly on the expensive models, where you are most motivated to use it.
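To see why spacing matters, the sketch below prices a sequence of calls against a cache with a fixed TTL. It assumes the multipliers quoted above (write 1.25x, read 0.10x), the $15 per million input price used in the earlier snippets, and that every hit refreshes the TTL. The function and its parameters are illustrative, not a provider API.

```python
def cached_input_cost(call_times_s, tokens=280_000, price_per_m=15.00,
                      ttl_s=300, write_mult=1.25, read_mult=0.10):
    """Input cost of calls at the given times (seconds) sharing one cached prefix.

    A call pays the write multiplier when the cache is cold or expired,
    and the read multiplier when the previous call was within ttl_s.
    """
    base = tokens / 1_000_000 * price_per_m
    total = 0.0
    last_call = None
    for t in sorted(call_times_s):
        warm = last_call is not None and (t - last_call) <= ttl_s
        total += base * (read_mult if warm else write_mult)
        last_call = t  # a hit refreshes the TTL, a miss rewrites the cache
    return total


batched = cached_input_cost([0, 120, 240, 360])            # two-minute gaps
spread = cached_input_cost([0, 86_400, 172_800, 259_200])  # one call per day
print(f"batched: ${batched:.2f}, spread: ${spread:.2f}")
```

Under these assumptions the batched session reproduces the $6.51 figure, while four calls a day apart each pay the write premium and come to $21.00, which is more than the $16.80 uncached total.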
The honest projected number on this article series, if I had run all four through Anthropic Claude Opus 4.7 with naive uncached calls, is $0.32 per surfaced insight averaged across articles. With cache shared inside writing sessions and Gemma 4 31B handling the iterative passes, the real billed average is $0.04 per surfaced insight. That is the 87% drop in the headline.
I want to be explicit. The Claude Opus 4.7 numbers in the comparison are real billed dollars from the runs documented in Articles 1 through 3. The "what if I had run everything on Claude Opus 4.7 with no caching" number is a projection, computed from the same fixture sizes and Anthropic's listed pricing as of 2026-05-18. I am not claiming I actually paid $0.32 per insight. I am claiming a developer who replicates this work on Anthropic with no caching strategy will pay roughly that.
Section 4: Where Gemma 4 still loses
Honest section. The kit does not close every gap. There is one finding Gemma 4 still misses even with the full context stack, and it is a subtle race condition between the trading bot's cron tick and a SIGKILL recovery handler. The cron fires at second 0 of every minute. The SIGKILL recovery handler triggers on process restart and rebuilds state from the latest snapshot, but the snapshot timestamp is recorded with second-level resolution. If a SIGKILL happens at second 59 and the recovery process completes at second 1 of the next minute, the recovery snapshot and the next cron tick race on the same state row.
Claude Opus 4.7 catches this. Gemini 3.1 Pro catches it. DeepSeek V4 Pro catches it. Gemma 4 31B does not, even with the full context kit and the failure category list explicitly naming `concurrency.timing_race`.
I read the failed Gemma 4 outputs to figure out why. The pattern is consistent. Gemma 4 traces the cron path and the SIGKILL path independently and verifies each one in isolation. It does not hold both traces in working memory simultaneously, which is what you need to spot the race. The other three models do hold both traces and explicitly write out the timing diagram. This is a chain-of-thought depth limit on the 31B parameter model. No amount of context engineering on the prompt side fixes a working-memory limit on the model side.
So I keep a frontier model in the pipeline for one specific pass class: timing and concurrency reviews on stateful code. Everything else (architecture audits, security spot-checks, log analysis, schema review, prose critique, structured extraction) Gemma 4 31B handles for under 5% of the frontier cost per pass ($0.04 against $0.94). The split:
| Workload | Primary model | Frontier escalation? | Cost class |
|---|---|---|---|
| Trading log analysis | Gemma 4 31B | No | $0.04 / pass |
| Architecture audit | Gemma 4 31B | Yes, for race conditions | $0.04 / pass |
| Security spot-check | Gemma 4 31B | No | $0.04 / pass |
| Prose critique (KR) | Gemma 4 31B | Yes, for literary tone | $0.04 / pass |
| Concurrency review | Claude Opus 4.7 | N/A | $0.94 / pass |
| Multi-step planning | Claude Opus 4.7 | N/A | $0.94 / pass |
Roughly 85% of my real workload is in the top four rows. Roughly 15% is in the bottom two. The blended monthly inference cost on this routing setup, given my current usage, runs at about $4.20 per month on Gemma 4 plus $11 on Claude Opus 4.7 for the escalation passes. Total $15 per month for a workload that would have cost roughly $112 per month run entirely on Claude Opus 4.7.
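The routing table reduces to a lookup. This is a sketch, assuming workloads are tagged with short string keys that I made up for illustration; the model names and per-pass costs mirror the table, and the monthly figures come from the paragraph above. Sending unknown workloads to the frontier model is an assumed default, and the escalation column is not modelled.

```python
# Workload keys are hypothetical labels; models and costs mirror the table above.
ROUTES = {
    "trading_log_analysis": ("Gemma 4 31B", 0.04),
    "architecture_audit": ("Gemma 4 31B", 0.04),
    "security_spot_check": ("Gemma 4 31B", 0.04),
    "prose_critique_kr": ("Gemma 4 31B", 0.04),
    "concurrency_review": ("Claude Opus 4.7", 0.94),
    "multi_step_planning": ("Claude Opus 4.7", 0.94),
}


def route(workload):
    """Return (model, cost per pass); unknown workloads go to the frontier model."""
    return ROUTES.get(workload, ("Claude Opus 4.7", 0.94))


def monthly_saving(gemma_spend=4.20, frontier_spend=11.00, all_frontier=112.00):
    """Blended monthly spend and its saving versus an all-frontier setup."""
    blended = gemma_spend + frontier_spend
    return blended, 1 - blended / all_frontier


blended, saving = monthly_saving()
print(route("security_spot_check"))
print(f"blended ${blended:.2f}/month, {saving:.0%} below all-frontier")
```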
Section 5: Multi-agent cost cascade
A short subsection because it surprised me. When the same Gemma 4 31B is wired into a multi-agent cascade, the per-insight cost goes down further, not up. Three-agent setup:
```python
# Multi-agent cascade. Same fixture, three agents.
#
# Agent 1: Generator. Reads fixture, emits draft findings.
# Agent 2: Critic. Reads draft, emits critique + missed-cat list.
# Agent 3: Synth. Reads draft + critique, emits final findings.
generator_input = 280_000   # full fixture
generator_output = 3_600    # draft findings JSON
critic_input = 3_600        # just the draft, not the fixture
critic_output = 1_200       # critique + missed-cat list
synth_input = 4_800         # draft + critique
synth_output = 4_000        # final findings JSON

# Gemma 4 31B pricing: $0.12 in, $0.37 out per million tokens
PRICE_IN, PRICE_OUT = 0.12, 0.37


def call_cost(tokens_in, tokens_out):
    """Dollar cost of one call at the prices above."""
    return (tokens_in * PRICE_IN + tokens_out * PRICE_OUT) / 1_000_000


gen_cost = call_cost(generator_input, generator_output)  # $0.0349
crit_cost = call_cost(critic_input, critic_output)       # $0.00088
synth_cost = call_cost(synth_input, synth_output)        # $0.00206
total_cascade = gen_cost + crit_cost + synth_cost        # $0.0379
```
The cascade is $0.038 per pass against the single-agent's $0.046, and it catches 12 of 12 findings on the fixture. The critic agent specifically reads the categories_not_found field from the generator and writes out a short challenge note for each category the generator skipped. The synthesiser then reconsiders those categories with the critic's note in context.
Two of the three agents (critic, synth) work on tiny inputs (a few thousand tokens), so their cost is rounding error. The expensive call is the generator's 280K input pass. Everything downstream is essentially free.
This is the multi-agent finding I did not expect when I started this series: putting three weak agents in a cascade can match one strong agent on the same fixture, at a lower total cost than a single weak agent that has to do all the reasoning in one shot. The reason is that each agent in the cascade only has to be good at one thing. The generator surfaces candidates. The critic challenges. The synthesiser integrates. Each step has a smaller working-memory footprint, which is exactly the constraint that limits a 31B parameter model.
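The control flow of the cascade is small enough to sketch. This assumes a hypothetical `call_model(role, prompt)` callable that returns the model's text; it stands in for whatever client is actually used, and the prompt strings are placeholders rather than the prompts used in this series.

```python
import json


def run_cascade(fixture, call_model):
    """Generator -> critic -> synthesiser, each call seeing only what it needs.

    call_model(role, prompt) is a hypothetical callable. It is assumed to
    return JSON text for the generator and synthesiser, free text for the critic.
    """
    draft_text = call_model("generator", fixture)
    draft = json.loads(draft_text)

    # The critic sees only the draft, never the full fixture.
    skipped = draft.get("categories_not_found", [])
    critic_prompt = draft_text + "\nChallenge each skipped category: " + ", ".join(skipped)
    critique = call_model("critic", critic_prompt)

    # The synthesiser sees the draft plus the critique.
    final_text = call_model("synth", draft_text + "\n" + critique)
    return json.loads(final_text)
```

Passing the client in as a callable keeps the sketch testable with a stub, and makes it explicit that only the generator call carries the large fixture.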
Section 6: The replication kit
The six MD files used throughout this series are open source. MIT licensed. Free.
- CLAUDE.md: project instructions for the AI, including failure-pattern definitions
- AGENTS.md: cross-tool output conventions (Claude Code, Cursor, Aider, Copilot all read this natively)
- MEMORY.md: persistent findings across sessions
- TESTING.md: verification flow and completion criteria
- GLOSSARY.md: Korean / English / code identifier mapping (load-bearing for bilingual pipelines)
- docs/adr/0001-template.md: MADR-format decision record
Repo: github.com/wildeconforce/agent-starter-kit
For Korean readers who want the five-minute setup instead of the five-hour build, the same six files are packaged on Kmong with an AgentClient.exe double-click wrapper, eight FAQ entries, five auto-reply templates, nine detail images, and a Korean walkthrough video. Kmong listing: agent-starter-kit, ₩39K.
I want to be explicit about why I sell one and open source the other. The six MD files as code are nothing without the eight FAQ entries, the five auto-replies, the detail images, and the walkthrough. If you are comfortable reading the GitHub repo and adapting the files to your project, the MIT version is exactly what you need and nothing in the bundle is closed. If five minutes matters to you more than ₩39K matters to you, the bundle exists. I am not gating capability. I am gating compressed labour.
Closing
Five months ago I would have paid $1.50 to audit a single trading bot log. Today I pay $0.04. The audit now catches one more finding than it did five months ago, when it cost roughly 35 times as much. The frontier still has its moments and I keep it on the bench for the concurrency review and the multi-step planning passes. But the iterative work that actually decides what gets shipped is now small enough money that I run it on every revision instead of once a week.
That is the difference cost engineering makes. Not whether the model can do the thing. Whether you can afford to run it on every iteration of the thing.
The next article in this series (target 2026-05-22 KST) will cover the production deployment side. The same Gemma 4 31B + context kit is now wired into my Kmong real-time listing response pipeline and a multi-agent self-validation cron. The cron has been running for 18 days at the time of writing. Total cost across that run: $3.21. Total findings surfaced and resolved: 24. The cost-per-resolved-finding floor keeps dropping.
Repo: github.com/wildeconforce/agent-starter-kit (MIT)
Bundle: Kmong listing, Korean walkthrough + AgentClient.exe wrapper
Earlier in this series: Article 1 / Article 2 / Article 3
Cross-link: VERICUM ENT / WILD_SNIPER daily journal