Production Deployment of Gemma 4 on an 8GB GPU: What thehwang and I Reproduced Across Two Hosts
This is a submission for the Gemma 4 Challenge: Write About Gemma 4.
TL;DR. Four posts ago I started this series with a question about whether Gemma 4 could replace frontier models on real audit work. The answer turned out to be yes for most of it. This last post covers the part the series did not address: actually deploying it. I ran 24 Ollama experiments on an RTX 4060 8GB across three small models and three num_ctx settings, 14.7 minutes of wall time in total. thehwang ran the same shape of harness on a Mac 16GB. Three findings reproduce across both hosts. One finding flips depending on the fixture. The production cron stack that wraps this in real life costs under $5 per month and resolved 24 findings in 18 days.
What this final post is
A production deployment writeup. Four months into running open-weight small models on a single consumer GPU, the things that bite you are not the things the benchmark posts warn about.
- The num_ctx default is the most expensive silent footgun in Ollama. Every blog post about it talks about Mac and MPS. I reproduced the failure on Windows and CUDA at the same shape across 24 runs.
- The 8GB VRAM ceiling forces a real and uncomfortable trade. You can have a 7B model at 8K context. You can have a 3B model at 32K context. You cannot have both. Picking wrong gives you a 9x wall time blowup with no warning.
- Fixture shape flips the direction of the num_ctx quality curve. thehwang found bigger context = more comprehensive on meeting transcripts. I found bigger context = less specific on bot operation logs. Both are right.
- The production cron stack that wraps Gemma 4 in real life. Four schedules, 18 days of uptime, $3.21 cumulative spend, 24 resolved findings.
- The two-side angle thehwang surfaced in the comments of the previous post. Anthropic cache TTL and Ollama KV under pressure are the same problem expressed in different vocabularies.
This post closes my five-post Gemma 4 Challenge run. The data is real, the harness is reproducible, the collaboration with thehwang is documented in the previous post's comment thread, and the next person who tries to put Gemma 4 into production has a checklist instead of a vibes-based guess.
Section 1: The setup that pays for itself in two weeks
Hardware: a single RTX 4060 8GB on Windows 11. Inference layer: Ollama 0.24.0 running gemma2:2b, qwen2.5:3b, and qwen2.5:7b locally. The actual Gemma 4 31B passes from the earlier posts go through OpenRouter. Local Ollama covers the iterative audit traffic where a $0.04 round trip would still be slower than a 13 second local response.
Eighteen days of production cron running this configuration. Cumulative external API spend: $3.21. Cumulative local inference cost: electricity, which on this GPU averages about 95W under load and runs for roughly two hours a day across all cron passes. At current South Korean residential rates that is about $1.40 per month. Total operational floor: under $5 per month for a self-validating pipeline that catches 24 of the 47-day bot's structural issues across the same period.
The reason this works at all is that the small Ollama models cover the high-frequency low-stakes traffic. New trade alert came in, classify the symbol bucket, score the entry, log it. That pass runs in under 15 seconds locally on qwen2.5:3b. If I had routed it through OpenRouter at $0.04 per pass, 18 days of cron at 4-hour intervals would cost $4.32 just for the routing tier. Local Ollama makes the routing tier free.
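The routing-tier arithmetic above can be checked in a few lines. This is only a sketch of the calculation: the function name is illustrative, and the inputs are the figures quoted in the paragraph.

```python
def routing_tier_cost(days: int, interval_hours: int, usd_per_pass: float) -> float:
    """Hypothetical helper: cost of sending every routing pass to a paid API."""
    passes_per_day = 24 // interval_hours
    return days * passes_per_day * usd_per_pass


# 18 days of cron at 4-hour intervals, $0.04 per pass (figures from the text).
cost = routing_tier_cost(days=18, interval_hours=4, usd_per_pass=0.04)
print(f"${cost:.2f}")  # prints $4.32
```

That is 108 passes in total, which is where the $4.32 comes from.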
The expensive Gemma 4 31B pass on OpenRouter is reserved for the cross-cutting audit that runs every six hours via /strategic-intel-scan. That is where the dollars actually go. The local models cover everything else, and the trade is worth it precisely because the local models are good enough on the specific tasks I route to them.
Setup is reproducible in a couple of hours. The full harness is in wildeconforce-site/experiments/num_ctx. Three files: build_fixture.py, run_experiment.py, make_report.py. No external dependencies beyond Ollama and nvidia-smi.
Section 2: num_ctx is the silent footgun
This one bit thehwang first and I reproduced it second. Ollama's default num_ctx is 2048 tokens. If your prompt is longer than 2048 tokens, Ollama silently truncates it and runs inference on the truncated input. No error. No warning. No log line. Your model gets a fraction of the input you sent and returns a confident-sounding answer about that fraction.
The 47-day bot log fixture I use throughout this series is around 8K tokens for the small variant and 30K for the medium variant. At default num_ctx, the model sees the first 2K tokens. The full audit pass cannot work. The model has no way to tell you it is missing context. You have to know.
I ran the experiment: three models, up to three num_ctx values each, three repeats per cell. That makes eight cells and twenty-four total runs. For each cell I recorded mean wall time, mean catch rate on a 12-issue gold rubric, and GPU memory delta. Here is the matrix.
| Model | num_ctx | Fixture (~tok) | Wall (s, mean) | prompt_tokens actually processed | Catch /12 |
|---|---|---|---|---|---|
| gemma2:2b | 2048 | 7994 | 8.3 | 2048 | 1.7 |
| gemma2:2b | 8192 | 7994 | 11.7 | 8192 | 3.0 |
| qwen2.5:3b | 2048 | 7994 | 13.7 | 2048 | 3.0 |
| qwen2.5:3b | 8192 | 7994 | 13.1 | 8192 | 1.3 |
| qwen2.5:3b | 32768 | 29994 | 24.5 | 32768 | 1.0 |
| qwen2.5:7b | 2048 | 7994 | 14.8 | 2048 | 1.0 |
| qwen2.5:7b | 8192 | 7994 | 20.8 | 8192 | 0.7 |
| qwen2.5:7b | 32768 | 29994 | 187.0 | 32768 | 2.7 |
Look at the prompt_tokens actually processed column on every 2048 row. The fixture is 7994 tokens. Ollama processed 2048 of them. That is the silent truncation.
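To put a number on how much of the fixture survives, here is a small sketch. The function name is made up for illustration, and it assumes the prompt is simply cut at num_ctx tokens; it ignores the room the generated output also needs inside the window, so the real share is somewhat lower.

```python
def fraction_seen(num_ctx: int, fixture_tokens: int) -> float:
    """Share of the prompt that survives if input is cut at num_ctx tokens."""
    return min(num_ctx, fixture_tokens) / fixture_tokens


# Fixture sizes are the two variants used in the matrix above.
for fixture_tokens in (7994, 29994):
    share = fraction_seen(2048, fixture_tokens)
    print(f"{fixture_tokens} tokens at num_ctx=2048: {share:.1%} seen")
```

At the default window the model sees roughly a quarter of the small fixture and under 7% of the medium one.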
Now cross-check the same two cells against thehwang's Mac 16GB MPS run on Scripta:
| Cell | thehwang (Mac 16GB MPS) | This run (RTX 4060 8GB CUDA) | Ratio |
|---|---|---|---|
| qwen2.5:3b ctx=2048 | 15.2s wall | 13.7s wall | 0.90x |
| qwen2.5:3b ctx=32768 | 25.7s wall | 24.5s wall | 0.95x |
The truncation happens on both platforms. The wall times match within 10%. Ollama itself is the layer that decides how much of the prompt gets through. The GPU backend has nothing to do with it. This is a deployment hardening checklist item that is OS-agnostic and worth burning into your head.
The fix is one parameter.
```python
import json
import urllib.request


# WRONG: defaults to num_ctx=2048, 32K input silently truncated.
def call_wrong(model: str, prompt: str) -> str:
    payload = {"model": model, "prompt": prompt, "stream": False}
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.loads(resp.read())["response"]


# RIGHT: name your context window explicitly.
def call_right(model: str, prompt: str, num_ctx: int) -> str:
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx, "num_predict": 1024, "temperature": 0.4},
    }
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.loads(resp.read())["response"]
```
The right call is one extra option. If you cannot remember the line, name the function call_with_explicit_ctx so the function signature reminds you every time you write it.
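One way to make the signature do the reminding is a keyword-only num_ctx with no default, so a call that omits it fails immediately. This is a sketch, not the harness code: build_payload is a helper I am inventing here, and returning the full JSON body instead of only the text is a choice made so that callers can also read the token counters.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # same endpoint as above


def build_payload(model: str, prompt: str, *, num_ctx: int) -> dict:
    """Build the request body; num_ctx is keyword-only and has no default."""
    if num_ctx <= 0:
        raise ValueError("num_ctx must be a positive token count")
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }


def call_with_explicit_ctx(model: str, prompt: str, *, num_ctx: int) -> dict:
    """Send the request and return the whole JSON body, not just the text."""
    body = json.dumps(build_payload(model, prompt, num_ctx=num_ctx)).encode()
    req = urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.loads(resp.read())
```

Calling build_payload("qwen2.5:3b", prompt) without num_ctx raises a TypeError at the call site, which is the loud failure the default path never gives you.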
The reason this footgun matters more than other Ollama footguns is that the symptom looks like a model quality problem. The output is grammatical, on-topic, and shorter than the full context would have produced. You read it and assume the model failed to find the deeper issues. You blame the model. You try a bigger model. The bigger model also gets truncated to 2048 tokens, returns a similar shape of answer, and now you have spent two days concluding that small models are not ready for production. The model is fine. Your client truncated your input.
Section 3: The 8GB VRAM ceiling matters
Look back at the matrix. The qwen2.5:7b row at num_ctx=32768 has a wall time of 187 seconds. The same model at num_ctx=8192 takes 20.8 seconds. Same input shape, rescaled to the bigger context. Nine times slower.
What happened. nvidia-smi during the slow cell showed 38% of the model layers spilling to CPU. The KV cache for 32K tokens at 7B parameters does not fit in 8GB of VRAM after the model weights load. Ollama silently falls back to CPU offload. No warning, no log line, just nine times slower inference. Same family of footgun as the truncation, different layer of the stack.
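A back-of-envelope estimate shows why the cache grows so fast. The formula below is the standard size of a transformer KV cache (one key and one value tensor per layer). The architecture numbers are assumptions: 28 layers, 4 KV heads, and a head dimension of 128 are my reading of the published Qwen2.5-7B config, and an fp16 cache is assumed. Verify both against the model card and your Ollama settings before relying on the result.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    n_ctx: int,
    bytes_per_elem: int = 2,  # assumption: fp16 cache entries
) -> int:
    """Estimate KV cache size for a given context length."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


GIB = 1024 ** 3
for n_ctx in (8192, 32768):
    # 28 / 4 / 128 are assumed Qwen2.5-7B values, not measured ones.
    size = kv_cache_bytes(n_layers=28, n_kv_heads=4, head_dim=128, n_ctx=n_ctx)
    print(f"num_ctx={n_ctx}: about {size / GIB:.2f} GiB of KV cache")
```

Under those assumptions the cache grows linearly with num_ctx, so quadrupling the window quadruples the cache. That memory comes on top of the quantized weights and the compute buffers, which is the budget the 8GB card runs out of.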
The practical implication is a hard trade. On 8GB VRAM you can pick one of two configurations and you cannot have both:
| Configuration | Fits in 8GB? | Wall (fixture sized to context) | Use case |
|---|---|---|---|
| 7B params + 8K context | Yes | 21s | Short prompts, deeper reasoning |
| 3B params + 32K context | Yes | 25s | Long prompts, lighter reasoning |
| 7B params + 32K context | No (CPU spill) | 187s | Avoid on 8GB |
| 3B params + 8K context | Yes, comfortable | 13s | Default routing tier |
I default to qwen2.5:3b at num_ctx=8192 for the routing tier. Long enough to hold a meaningful slice of a trading session. Small enough that three concurrent requests fit in memory. Fast enough that the cron loop completes in time. The 7B model gets pulled in only for the explicit "this prompt needs deeper reasoning" pass, and at that point I cap num_ctx at 8192 explicitly so I never accidentally trigger the 187 second blowup.
If you need the bigger context and can give up model size, the cheap escape hatch is gemma2:2b for the long-context pass. It is small enough that 32K context fits with room to spare. Quality is lower than 7B for the same prompt, but you sidestep the CPU spill cliff entirely. If you genuinely need both a bigger model and a bigger context, the other escape hatch is OpenRouter. Gemma 4 31B at $0.04 per audit pass is cheaper than buying a bigger GPU.
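The routing rules in this section can be written down as a small decision function. This is a sketch of the policy as described, not the production router: pick_route, the Route tuple, and the OpenRouter model id string are all placeholders I am introducing, and the 1024-token output headroom is borrowed from the num_predict value in call_right.

```python
from typing import NamedTuple, Optional

OUTPUT_HEADROOM = 1024  # assumption: leave room for num_predict tokens
REMOTE_MODEL = "GEMMA_4_31B_ON_OPENROUTER"  # placeholder, not a real model id


class Route(NamedTuple):
    backend: str            # "ollama" or "openrouter"
    model: str
    num_ctx: Optional[int]  # None when the remote backend manages context


def pick_route(prompt_tokens: int, needs_deep_reasoning: bool) -> Route:
    """Choose a model and context window that stays off the CPU-spill cliff."""
    fits_8k = prompt_tokens + OUTPUT_HEADROOM <= 8192
    fits_32k = prompt_tokens + OUTPUT_HEADROOM <= 32768
    if needs_deep_reasoning:
        # The 7B model is only ever run with the 8K cap on an 8GB card.
        if fits_8k:
            return Route("ollama", "qwen2.5:7b", 8192)
        return Route("openrouter", REMOTE_MODEL, None)
    if fits_8k:
        return Route("ollama", "qwen2.5:3b", 8192)   # default routing tier
    if fits_32k:
        return Route("ollama", "gemma2:2b", 32768)   # long-context pass
    return Route("openrouter", REMOTE_MODEL, None)
```

The point of encoding it is that the 7B plus 32K combination becomes unreachable by construction, so no caller can trigger the slow path by accident.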
Section 4: Fixture shape changes the num_ctx quality direction
This is the one place thehwang and I diverged. Both runs are reproducible, both numbers are correct, and the reason for the divergence is the fixture.
thehwang's Scripta benchmark uses meeting transcripts as the fixture. On meeting transcripts, bigger context = more comprehensive summary. That matches intuition. The model gets to see the whole meeting and pull out cross-topic threads.
My fixture is a 47-day operational log from a real trading bot. On bot logs, bigger context = less specific issue list. My matrix above shows qwen2.5:3b going from 3.0 catches at num_ctx=2048 to 1.0 catch at num_ctx=32768. The opposite direction of thehwang's result.
I read 30 sample outputs across both contexts to figure out why. The pattern is clear. When qwen2.5:3b sees the full 32K log, the model writes a high-level summary of the trading session. Three paragraphs about volatility patterns, two paragraphs about the trader's apparent strategy. The actual structural issues get buried inside the summary or skipped entirely. When the same model sees a 2K window, the model has too little material to summarize and falls back to flagging the things it can see. The structural issues are right there in the 2K window because the fixture is dense.
Two different fixture shapes give two different quality directions for the same parameter. Meeting transcripts reward more context because the relevant signal is spread across the whole transcript. Bot logs penalize more context because the relevant signal is dense in any 2K window, and the bigger window invites a summary that buries it.
This is the kind of finding you only see when two people run the same harness on different fixtures and compare notes. thehwang's framing in the comment thread on my previous post was that it confirms his "two sides of the same blade" reading. The Anthropic prompt-cache TTL problem and the Ollama KV-under-pressure problem are the same shape of bug expressed in different vocabularies. Both are reasoning trace preservation problems. Both have the model dropping signal when the context layer is misconfigured. The vocabulary differs, the underlying constraint is identical.
Production implication: the right num_ctx for your task is not a property of the model. It is a property of the fixture. Profile your real input. If the signal is dense and local, smaller context wins. If the signal is sparse and global, bigger context wins. Setting num_ctx=8192 as your default is a reasonable middle for most fixtures, but you have to actually test on yours.
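Profiling can be as simple as sweeping num_ctx on your own fixture and picking the window with the best mean catch count. The helper below is a hypothetical sketch of that selection step. The example input uses the qwen2.5:3b cell means from the matrix above, not the raw repeats, and the 32768 cell was measured on the larger fixture, so treat it as an illustration of the mechanics only.

```python
from statistics import mean


def best_num_ctx(catches_by_ctx: dict[int, list[float]]) -> int:
    """Return the num_ctx with the highest mean catch count.

    Ties go to the smaller window, since it is cheaper to run.
    """
    return min(catches_by_ctx, key=lambda ctx: (-mean(catches_by_ctx[ctx]), ctx))


# Cell means for qwen2.5:3b taken from the matrix earlier in the post.
qwen3b = {2048: [3.0], 8192: [1.3], 32768: [1.0]}
print(best_num_ctx(qwen3b))  # prints 2048
```

On a meeting-transcript fixture like thehwang's, the same function would be expected to land on a larger window, which is the whole argument of this section.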
Section 5: From benchmark to production cron
Here is the cron layout that wraps all of this in real life. Four schedules, all of them registered through Claude Code's CronCreate, all of them running locally on the 4060.
```
# Every 4 hours: sniper bot health check, alert on anomaly.
7 */4 * * * /sniper-healthcheck

# Every 4 hours offset by 23 minutes: self-validation across all active projects.
23 */4 * * * /self-validate-all

# Every 6 hours offset by 17 minutes: external intel scan, cross-project opportunity surfacing.
17 */6 * * * /strategic-intel-scan

# Daily 8:03am KST: yesterday's bot activity journal, post to wildeconforce-site.
3 8 * * * /sniper-daily-journal
```
The four schedules are deliberately offset so they do not pile up on the same minute. Each one calls a different slash command that lives in .claude/skills/. Each command is a markdown file describing the task. The harness reads the file, executes the steps, and routes the heavy passes through OpenRouter while keeping the routing tier on local Ollama.
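The offset rule is easy to check mechanically when a fifth schedule gets added. A minimal sketch, assuming plain five-field cron expressions: it only compares the literal minute field and does not expand ranges, lists, or steps. The function name is illustrative.

```python
SCHEDULES = {
    "/sniper-healthcheck": "7 */4 * * *",
    "/self-validate-all": "23 */4 * * *",
    "/strategic-intel-scan": "17 */6 * * *",
    "/sniper-daily-journal": "3 8 * * *",
}


def minute_collisions(schedules: dict[str, str]) -> dict[str, list[str]]:
    """Group jobs by minute field; keep only minutes shared by 2+ jobs."""
    by_minute: dict[str, list[str]] = {}
    for job, expr in schedules.items():
        minute = expr.split()[0]  # first cron field is the minute
        by_minute.setdefault(minute, []).append(job)
    return {m: jobs for m, jobs in by_minute.items() if len(jobs) > 1}


print(minute_collisions(SCHEDULES))  # prints {}
```

An empty result means no two jobs can ever start on the same minute, whatever their hour fields say.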
Eighteen days of this. Cumulative numbers:
| Metric | Value |
|---|---|
| Total cron runs | 432 |
| Local Ollama passes | 2,840 |
| OpenRouter Gemma 4 31B passes | 76 |
| Frontier (Claude Opus 4.7) escalations | 8 |
| Total external API spend | $3.21 |
| Findings surfaced | 31 |
| Findings resolved | 24 |
| Findings still open | 7 |
The escalation discipline is what keeps the spend under $5. Local Ollama handles the routing tier for free. Gemma 4 31B handles the audit tier for fractions of a cent. Claude Opus 4.7 gets called only when a deeper reasoning pass is genuinely needed. The 8 frontier escalations across 18 days are all concurrency reviews, the one workload class where Gemma 4 still loses (documented in the previous post).
For readers who want the exact escalation math, the previous post worked through the three-agent cascade that handles most of the audit tier. Same shape applies here.
```python
# Three-agent cascade running on local Ollama + one OpenRouter call.
# Generator (local qwen2.5:3b): reads 8K fixture, emits draft findings.
# Critic    (local gemma2:2b):  reads draft, emits missed-category list.
# Synth     (OpenRouter Gemma 4 31B): reads draft + critique, emits final.
# Cost lives entirely in the synth call. Generator and critic are free
# (electricity only). Synth input at 8.4K tokens, output at 4K tokens.
# Token counts below are in thousands; dividing by 1000 treats the 0.12
# and 0.37 prices as USD per million tokens.
synth_in_cost = 8.4 * 0.12 / 1000               # $0.00101
synth_out_cost = 4.0 * 0.37 / 1000              # $0.00148
total_cascade = synth_in_cost + synth_out_cost  # $0.0025 per audit
```
Per-audit external spend in the cascade rounds to a quarter of a cent. That per-call cost is what keeps the cumulative figure at the $3.21 across 18 days I quoted above.
The other thing the cron layout buys is consistency. Eighteen days of unattended operation surfaces patterns I would not see from manual runs. The same bug class reappearing every Tuesday is information. The same model failing on the same prompt shape four times in a row is information. Cron turns the audit pass from an event into a baseline.
One implementation detail worth flagging. Every cron job writes a one-line status note to a local file before and after its run. If the post-run note is missing for any job, the next /self-validate-all pass treats that as evidence of a stuck run and emits a Telegram alert. This is the cheapest possible liveness check and it has caught two real stuck runs across the 18 days. Both were OpenRouter rate-limit failures that the pipeline would have silently swallowed without the file-write convention.
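The liveness check reduces to "find jobs whose latest note is a start with no end". The sketch below assumes a three-field line format of timestamp, job name, and phase. That format and the function name are my invention for illustration; the real status file may look different.

```python
def stuck_jobs(status_lines: list[str]) -> list[str]:
    """Return jobs whose most recent note is 'start' with no later 'end'."""
    last_phase: dict[str, str] = {}
    for line in status_lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        _timestamp, job, phase = parts
        last_phase[job] = phase  # later lines overwrite earlier ones
    return sorted(job for job, phase in last_phase.items() if phase == "start")


# Hypothetical log contents: the intel scan started but never finished.
notes = [
    "2026-05-01T04:07:00 /sniper-healthcheck start",
    "2026-05-01T04:07:14 /sniper-healthcheck end",
    "2026-05-01T06:17:00 /strategic-intel-scan start",
]
print(stuck_jobs(notes))  # prints ['/strategic-intel-scan']
```

A real version would also want a grace period, so a job that is legitimately still running is not reported as stuck.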
The other production detail. Every Ollama call in the cron pipeline goes through a wrapper that defaults num_ctx to 8192 and logs the actual prompt_eval_count from the response. If the logged count equals the configured num_ctx, that is the signal the prompt was truncated and the audit is unreliable. The cron alerts on it the same way it alerts on missing post-run notes. Two layers of defense against the silent footgun, neither of them expensive.
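The truncation guard itself is a one-line comparison on the response body. A minimal sketch, assuming the response is the parsed JSON from /api/generate with its prompt_eval_count field; the function name is hypothetical. It treats a missing field as inconclusive, not as a pass or a failure.

```python
def looks_truncated(response: dict, num_ctx: int) -> bool:
    """Heuristic: the prompt filled the whole window, so part was likely cut."""
    prompt_tokens = response.get("prompt_eval_count")
    if prompt_tokens is None:
        return False  # field absent; nothing to conclude from this response
    return prompt_tokens >= num_ctx


# Example bodies trimmed to the one field the guard reads.
print(looks_truncated({"prompt_eval_count": 8192}, num_ctx=8192))  # prints True
print(looks_truncated({"prompt_eval_count": 5120}, num_ctx=8192))  # prints False
```

The heuristic can produce a false alarm when a prompt happens to fill the window exactly, which is an acceptable cost for an alert that otherwise never fires.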
Section 6: What thehwang and I converged on
The collaboration angle is real. I do not want to oversell it as a research partnership because it is not that. It is two people running similar experiments on different hardware, comparing notes in the comments, and updating our mental models when the data diverges.
What thehwang surfaced from his side:
- The truncation default is universal across Ollama hosts. He hit it first on Mac MPS. Same root cause.
- The num_ctx quality direction depends on fixture shape. He runs meeting transcripts. I run operational logs. The curves go opposite directions.
- The Anthropic cache TTL problem and the Ollama KV-under-pressure problem are the same shape of bug. Reasoning trace gets dropped when the context layer is misconfigured. The vocabulary differs across providers, the underlying constraint is identical.
What I surfaced from my side:
- The 8GB VRAM ceiling forces a 7B-or-32K trade. He runs 16GB and sidesteps the cliff entirely. The cliff is real and worth knowing about for anyone on a consumer GPU.
- The CPU spill at 7B + 32K is silent. No warning, no log line, just a 9x wall time blowup. Same shape of footgun as the prompt truncation, different layer of the stack.
- Fixture profiling beats model selection. The right model for a workload is downstream of the right num_ctx, which is downstream of the fixture profile.
Both of us have running production stacks built on Ollama as of writing. Both of us route the heavy passes to a frontier model on demand. Both of us treat the small local models as routing tier rather than as full replacements. The convergence on architecture is more interesting than any single number in the experiment matrix.
His harness: github.com/thehwang/Scripta/blob/main/scripts/benchmark_models.sh. The relevant inner loop, paraphrased:
```bash
# thehwang/Scripta paraphrased core: same metric source as mine.
# Source: /api/generate response fields prompt_eval_count, eval_count, total_duration.
# Payload is built with jq -Rs in this paraphrase so that quotes and newlines
# in the fixture are JSON-escaped instead of breaking the request body.
for model in gemma4:e2b qwen2.5:3b; do
  for ctx in 2048 8192 32768; do
    jq -Rs --arg model "$model" --argjson ctx "$ctx" \
      '{model: $model, prompt: ., stream: false, options: {num_ctx: $ctx}}' fixture.txt |
      curl -s http://localhost:11434/api/generate -d @- |
      jq -r '[.prompt_eval_count, .eval_count, .total_duration] | @tsv'
  done
done
```
My harness uses urllib.request instead of curl, captures the same fields, and adds the GPU memory delta plus a gold-truth catch rate scorer. The metric source is identical, which is what makes the cross-host comparison meaningful.
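For the GPU memory delta, one plausible approach is to sample nvidia-smi before and after each cell. The sketch below is an assumption about how such a probe could look, not a copy of run_experiment.py. The function names are mine; the nvidia-smi query flags are the standard ones.

```python
import subprocess


def parse_memory_used_mib(nvidia_smi_output: str) -> int:
    """Parse the first GPU's memory.used (MiB) from csv,noheader,nounits output."""
    first_line = nvidia_smi_output.strip().splitlines()[0]
    return int(first_line.strip())


def gpu_memory_used_mib() -> int:
    """Ask nvidia-smi for current memory use on the first GPU, in MiB."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_memory_used_mib(out)
```

Calling gpu_memory_used_mib() once before the request and once after, then subtracting, gives the per-cell delta. The number is only meaningful if nothing else is using the GPU during the run.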
My harness: wildeconforce-site/experiments/num_ctx.
The wiring diagram for how the harness slots into the broader stack:
```
[Telegram cron alerts]
        ^
        |
[/self-validate-all cron] -- reads --> [.claude/active-work/*.md]
        |
        v
[Ollama wrapper] ---- truncation guard ----> [num_ctx experiment fixture]
        |
        v
[qwen2.5:3b (routing) or gemma2:2b (long-ctx) or qwen2.5:7b (deep, 8K cap)]
        |
        v
[OpenRouter Gemma 4 31B] -- escalation only --> [Claude Opus 4.7]
```
The agent-starter-kit MD files (CLAUDE.md, AGENTS.md, MEMORY.md, TESTING.md, GLOSSARY.md, ADR) sit on top of this wiring. They are what the slash commands read on every cron tick.
Both harnesses are MIT-licensed. Run them on your hardware. Compare your numbers to ours. If your fixture surfaces a third direction in the num_ctx quality curve, write it up. The interesting findings live in the fixture profile, not in the model card.
Closing
This is the last post of my Gemma 4 Challenge run. Five posts across seven days, 24 experiments, two hosts, one collaborator, and a production cron that runs the whole stack for under $5 a month. The data is open. The harness is open. The collaboration with thehwang is documented in the comment thread of the previous post.
The headline finding across all five posts is the one I have been chasing since post one: open-weight models running on a single consumer GPU can absorb most of the audit work that used to require frontier closed models. The exceptions are real and specific. Concurrency reviews still need frontier. Multi-step planning still needs frontier. Almost everything else is now small enough money to run on every revision instead of once a week.
The headline finding specific to this post is the one that took me two hosts to confirm. num_ctx is the most expensive silent footgun in the open-weight deployment stack. It is OS-agnostic. It is reproducible across two hardware classes. The fix is one parameter. Burn the line into muscle memory.
Five posts. Done. Submission for the Gemma 4 Challenge complete.
Reproducible harness: wildeconforce-site/experiments/num_ctx (MIT)
Replication kit: github.com/wildeconforce/agent-starter-kit (MIT) / Kmong bundle, ₩39K
Companion harness: thehwang/Scripta (MIT, Mac 16GB MPS)
Earlier in this series: Article 1 / Article 2 / Article 3 / Article 4
Coming next: Claude Code Master Pack (Kmong, 2026-05-28) for readers who want the cron + harness packaged with a Korean walkthrough.
Cross-link: VERICUM ENT / WILD_SNIPER daily journal