WildEconForce — AI Build Journal

TL;DR. I tried to drop the self-critique literature into my one-person stack and most of it did not fit. MetaCrit needs four agents. MAR needs a multi-persona debate. PR-CoT needs an external orchestrator. Reflexion needs a reward signal I do not have a budget for. Self-Reflection is the closest, but it is a two-step loop and does not include a stage that separates fake weaknesses from real ones. So I adapted the pattern down to what runs on a single 8GB GPU in a single agent session. Three stages. Negative-self → self-audit → mind-change. I'm calling it MINDCHANGE and shipping the spec as a seventh MD axis in the context-engineering kit. This post explains the adaptation, names the existing lines it borrows from, presents a 5-model experiment design (Claude Opus 4.7 + Gemma 4 31B + Gemini 3.5 Flash + DeepSeek V4 Pro + Qwen 3.6 Max preview (proxy for Qwen 3.7-Max, not yet on OpenRouter at publish time)), and proposes a direct orthogonal combination with thehwang's num_ctx harness.

Why the existing lines did not fit my stack

The self-critique literature is rich. Reading through it over the past two weeks I kept hitting the same wall. The papers assume infrastructure I do not have.

MetaCrit (arxiv 2507.15015) is a four-agent metacognitive framework grounded in the Nelson-Narens model. An object-level agent generates the initial response. A monitoring agent assesses validity. A control agent critiques logic. A meta-level synthesizer reconciles all three. Cleanly designed. Also four model calls per pass. On my routing tier that is 4x the cost of a single-shot. On a self-hosted 8GB GPU it is four times the wall time. For workloads I run hundreds of times a week through cron, the math kills it.

MAR (Multi-Agent Reflexion) (arxiv 2512.20845) replaces single-agent self-critique with structured debate among persona-based critics. The goal is to dodge self-bias by importing multiple external perspectives. Same scaling problem. Now you have a debate panel to maintain. And the personas need to be authored and tuned. For a solo builder maintaining 18 active projects, that maintenance cost is real.

MyGO PR-CoT (arxiv 2601.07780) is a poly-reflective chain-of-thought. The model self-evaluates across four pre-defined angles. Closer to single-agent but still needs an external orchestrator to enforce the four angles per pass. Doable. Still extra plumbing.

Reflect-Retry-Reward (arxiv 2505.24726) is reinforcement-learning based self-improvement. Requires a reward signal. I do not have a labeled reward dataset for the audits my cron pipeline runs. Cannot use it as-is.

PopuLoRA (Co-Evolving LLM Populations for Reasoning Self-Play) (HN announcement, 2026-05) is on the opposite axis: it evolves multiple LLM populations together through reasoning self-play. Strong line for population-level evolution. Orthogonal to MINDCHANGE — PopuLoRA improves the population over time, MINDCHANGE improves a single model's output within a single session through a personality sequence. They could compose in principle, though I have not tested it.

Self-Reflection is the most generic pattern. First answer → critique → refine. Closest to what a single-agent, single-session setup can support. But it is two stages. There is no stage that asks "is this critique even real or did the model just complain to look thorough?" That missing third stage is what causes self-reflection in practice to either bounce off real weaknesses (negative spiral) or rewrite a perfectly good answer into something worse (over-edit).

So I needed something that:

Runs in a single model call sequence (single agent, single session, no orchestrator)
Includes a stage that separates real weaknesses from fake ones (the missing third stage)
Costs in the 2-4x range of a single-shot, not 4-8x
Sits inside an MD file alongside the existing context-engineering kit, not in a framework

That is the adaptation work. The pattern I landed on is what I am calling MINDCHANGE.

The MINDCHANGE pattern

Three stages. Personality transitions inside one model session. The transitions are explicit in the prompt.

Stage 1. Negative-self

The model is told to look at its own previous output as if a stranger wrote it, then find weaknesses in four named categories.

You are now a *critical reviewer*. The output above is yours,
but treat it as if a stranger produced it. Find weaknesses
in these four categories:

(1) Factual accuracy: are quoted numbers, dates, sources correct?
(2) Logical consistency: are claim-evidence chains broken anywhere?
(3) Vague phrasing: any "well / appropriately / sufficiently"
    predicates with no concrete definition?
(4) Missing counter-arguments: has the author preempted reasonable
    objections, or skipped them?

Find a minimum of 2 and a maximum of 5 in each category.
If a category genuinely has none, say so explicitly.
Be sharp. No sycophancy.

Four design choices in this prompt that matter:

"You are now" pins the personality inside the user prompt, not the system prompt. This keeps it portable across models that have weak system-prompt adherence (small open models often do).
The four categories give the model a task scope. Without scope, "find weaknesses" returns either nothing or surface noise.
The 2-minimum cuts the sycophancy escape. The 5-maximum cuts the negative spiral escape. Both bounds matter.
The "if none, say so" line forces the model to commit to a position, not hedge with "could not find any."

Stage 2. Self-audit

The critique from stage 1 is handed back to the model. The model now switches personality from critical reviewer to self-auditor. For each critique item, the model assesses whether it is a real weakness (Yes / No / Unclear) and gives a one-line reason.

Critique list from Stage 1 received. Switch personality:
you are now a *self-auditor*, not a critic. For each item:

(a) Is this a real weakness an external reader would agree with?
    Yes / No / Unclear.
(b) If Yes, one-line fix recommendation.
(c) If No or Unclear, one-line reason.

Then report what percentage of items were classified as real weaknesses
(example: 7 of 12 items were real). The classification criterion is
"would an external reader agree." That phrase exists to dodge self-bias.

This is the stage missing from generic Self-Reflection. The model is forced to grade its own critique, which means the over-eager critic from Stage 1 has to defend its claims to a different personality inside the same session. The three-way classification (Yes / No / Unclear) gives the model an honest escape if a critique was fake. The "external reader" framing is the explicit anti-self-bias prompt.

Stage 3. Mind-change

The real weaknesses from Stage 2 go to a third personality: the original author returning to the work. Only the weaknesses get fixed. Strong parts are preserved.

List of items classified as *real weaknesses* received. Switch
personality back to *original author*. Rewrite the original output:

(a) Apply fixes to all real-weakness items.
(b) Keep strong parts unchanged. No over-editing.
(c) Maintain original flow, tone, length.

Output the rewrite only. No fix-explanation commentary.

The third personality switch matters. By the time the model gets to Stage 3 it has been a critic, then an auditor. If the prompt does not return it to "author" mode, it tends to keep critiquing in the rewrite. Naming the personality is cheap and works.

The rewrite-only output (no fix-explanation) keeps the artifact clean. Downstream tooling parses the rewrite directly without needing to strip meta-commentary.

Comparison table

How MINDCHANGE differs from the five existing lines.

Dimension	MetaCrit	MAR	PR-CoT	Reflect-Retry-Reward	Self-Reflection	MINDCHANGE
Agent count	4	Multi-persona	1 + orchestrator	1 + reward	1	1
Session boundary	Across agents	Across personas	Across passes	Across episodes	Within session	Within session
Stage count	4	N (debate length)	4	Continuous	2	3
Personality transitions	Implicit (different agents)	Explicit personas	None inside agent	None	None	Explicit, inside one agent
External reward needed	No	No	No	Yes	No	No
External orchestrator	Yes	Yes	Yes	Yes	No	No
Marginal cost	4x	N x	4x	Training pass	2x	2-4x
Fits in MD file	No	No	No	No	Partial	Yes (seventh axis)

The honest framing: MINDCHANGE borrows the personality-transition idea from MAR, the staged-evaluation idea from MetaCrit, the same-session constraint from Self-Reflection, and the no-reward constraint from PR-CoT. None of it is novel as research. The adaptation is the contribution. It runs.

5-model experiment design

The MINDCHANGE pattern is testable. The experiment I am running over the next 5-7 days is:

Hypothesis. Adding the MINDCHANGE 3-stage prompt sequence to a single-pass model call improves output quality by a measurable lift across most model classes, at a cost penalty of 2-4x wall time and tokens. The lift will be larger for models with strong self-bias (small open models) than for models with weaker self-bias (frontier closed models).

Models (5):

Claude Opus 4.7 (frontier closed, baseline)
Gemma 4 31B (open weights, mid-size)
Gemini 3.5 Flash (frontier closed, fast tier)
DeepSeek V4 Pro (open weights, frontier-competitive)
Qwen 3.6 Max preview (proxy for Qwen 3.7-Max, not yet on OpenRouter at publish time) (new release, HN 553 points, agent-focused)

Conditions (2): MINDCHANGE on / off.

Task fixture. Same 47-day Sniper trading bot log fixture used in the cost-engineering and production-deployment posts. The audit task: surface 12 named structural issues. Gold-truth catch rate scored against ground-truth list (which has been blind-judged by Claude Opus 4.7 in single-shot mode previously).

Runs. 3 per cell = 30 total runs. Cost estimated at $1-3 total (mostly Claude Opus 4.7 and Gemini 3.5 Flash for the large input passes; local Ollama free for Gemma / DeepSeek if running self-hosted).

Metrics:

Catch rate (out of 12 issues)
Wall time (seconds)
Token cost (input + output)
Negative spiral rate (rewrite is same quality as original or worse)
Real-weakness rate (Stage 2 reported %)

Expected results (hypothesis, not measurement yet):

Frontier models (Claude / Gemini): smaller lift (+0.5 to +1.5 of 12), cost penalty 2-3x
Open mid-size (Gemma 4, DeepSeek): larger lift (+1.5 to +3.0 of 12), cost penalty 2-4x
Qwen 3.6 Max preview (proxy for Qwen 3.7-Max, not yet on OpenRouter at publish time): unknown. Agent-focused training might make Stage 2 self-audit unusually strong. Or it might make the model resistant to switching personalities. This is the most interesting unknown in the matrix.

If the lift is real and the cost stays under 4x, MINDCHANGE earns its spot as a seventh axis. If not, it gets ablated and the post-mortem ships as a separate writeup.

The results post is targeted for ~7 days from now (early June). I will write it whether the pattern works or fails. Both outcomes are informative.

Orthogonal combination with thehwang's num_ctx harness

The previous post in this series documented thehwang's harness (Scripta) for measuring how num_ctx (Ollama context window parameter) shapes output quality. The cross-replication on RTX 4060 8GB confirmed his Mac 16GB findings, and one of our findings inverted depending on fixture shape.

The MINDCHANGE pattern lives on a different axis from num_ctx. The hypothesis worth testing in a follow-up:

num_ctx controls how much input the model sees per call
MINDCHANGE controls what personality sequence the model goes through across calls

These are orthogonal in the cleanest sense. They address different failure modes. num_ctx addresses "the model missed a structural issue because the input was silently truncated." MINDCHANGE addresses "the model saw the input but did not push back on its own output." Stacking both should produce additive lift, not redundant lift, since the gaps they close are non-overlapping.

A 2x2 matrix on the same task fixture would be the cleanest experiment:

                    num_ctx=2048   num_ctx=32768
MINDCHANGE off      cell A         cell B
MINDCHANGE on       cell C         cell D

Hypothesis: D > B > C > A, with the lift from B → D smaller than from A → C (because B already has the input-shape lift, so the personality-sequence lift adds less). The interesting unknown is whether the two lifts compose linearly or with diminishing returns.

That follow-up experiment is wave 3 of this series. Wave 2 is the 5-model MINDCHANGE matrix above. Wave 3 is the 2x2 combination with thehwang's harness. Both will publish as standalone posts.

Implementation note

MINDCHANGE ships as MINDCHANGE.md in the agent-starter-kit templates folder, alongside the existing six axes (CLAUDE.md, AGENTS.md, MEMORY.md, TESTING.md, GLOSSARY.md, ADR). MIT licensed.

The kit usage pattern is:

Drop the six axes (or seven, with MINDCHANGE) into a project root
The first six define content (project conventions, output schemas, memory, tests, vocabulary, decisions)
MINDCHANGE defines sequence (how to walk a model through the content axes over a personality transition)

The seventh axis sits on top of the other six rather than alongside them. That layering matters for the comparison table above: MINDCHANGE is not a competing axis to MetaCrit or MAR, it is a composition layer.

What I am running next

Wave 2 (target ~5-7 days): 5-model MINDCHANGE matrix, results post.
Wave 3 (target ~14-21 days): 2x2 combination with thehwang's num_ctx harness on the same fixture, joint results post.
Wave 4 (target ~30 days): MINDCHANGE adoption in the agent-starter-kit Kmong bundle for paying users + a Korean-language walkthrough for the claude-code-masterpack 5/28 release.

The kit and the axis are MIT. The cron pipeline that runs the experiments is the same one documented in the production-deployment post. The fixture is the same 47-day Sniper log used across the series.

If you test MINDCHANGE on your own workload, the comparison I would most like to see is the 2x2: kit-only context engineering on/off, crossed with MINDCHANGE on/off. Same task. Same model. Counter-experiments welcome.

Footer

This post follows the Gemma 4 Challenge production-deployment post which closed out the 5-piece challenge series. MINDCHANGE is the first axis of the next-stack series.

MINDCHANGE.md axis spec (MIT, 9.5KB)
agent-starter-kit (MIT) / Kmong bundle ₩39K
thehwang's Scripta harness (MIT)

Jack. wildeconforce.com