Context Kit vs Forge Guardrails: Two Ways to Pull a Small Model Up to Frontier Reliability

TL;DR. Forge (CAIS 2026) wraps a small self-hosted model in runtime guardrails (retry nudges, step enforcement, error recovery, context compaction, VRAM budgeting) and reports an 8B model going from 53 percent to 99 percent on agentic workflows. My own context engineering kit (six Markdown files: CLAUDE.md, AGENTS.md, MEMORY.md, TESTING.md, GLOSSARY.md, ADR template) took Gemma 4 31B from 9 out of 12 findings to 11 out of 12 on a real architecture audit, roughly 75 to 92 percent of Claude Opus 4.7 parity. Same problem space. Different mechanism. Different cost line. This post walks through both, where they collide, and how a hypothetical combination would look.

The problem both approaches solve

If you have run a small open-weights model on anything more involved than a single chat turn, you have probably noticed the same thing. Single-step accuracy looks fine. Multi-step agent loops fall apart.

A model can answer a question correctly 95 percent of the time and still ship a broken five-step workflow. The math is brutal. Five chained steps at 95 percent gives you 77 percent end-to-end completion. Nine steps gives you 63 percent. That is the compounding reliability problem, and it is the reason "frontier closed model" has been the default answer for any agentic task that has to actually finish.
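
The arithmetic behind those numbers is a single exponent. A minimal sketch, assuming every step succeeds independently at the same rate (real agent steps are correlated, so treat this as a rough model; `end_to_end_success` is an illustrative name):

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming independent steps."""
    return per_step ** steps


for n in (1, 5, 9):
    # 0.95 ** 5 is about 0.77 and 0.95 ** 9 is about 0.63
    print(f"{n} steps at 95%: {end_to_end_success(0.95, n):.0%}")
```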

Two recent pieces of work attack the same gap from opposite ends.

One is Forge, a framework presented at ACM CAIS 2026 by Antoine Zambelli (Texas Instruments). Forge sits at runtime. It watches the agent loop, catches partial failures, nudges retries, enforces step ordering, compacts context when it bloats, and budgets VRAM on consumer hardware. The headline from the conference write-up: an 8-billion-parameter model with Forge reaches 99 percent on agentic workflows. Without the guardrails, frontier API models themselves drop into the 49 to 87 percent range. The Hacker News thread that surfaced the project (106 points, 35 comments at time of writing) quoted the framing as taking an 8B model from 53 percent to 99 percent.

The other is the line of work I have been publishing on this blog for the past two weeks. The thesis there is different. Instead of intercepting the model at runtime, I rewrite the input frame before the model even sees the task. A six-file context kit (CLAUDE.md for project conventions, AGENTS.md for output schemas, MEMORY.md for persistent findings, TESTING.md for assertions, GLOSSARY.md for vocabulary, and an ADR template for decisions) loads named failure patterns, structured output contracts, and prior-finding memory into the system prompt. The result on a real architecture audit: Gemma 4 31B caught 11 of 12 findings against Claude Opus 4.7's 12 of 12. The same model on the same task without the kit caught 9 of 12.

Both lines aim at the same metric: small open model reaching close-enough-to-frontier reliability for production. The mechanisms are completely different. The cost profile is completely different. The combination, as far as I can tell, has not been tested by either side.

Approach 1. Context Kit: reshape the input frame

The context kit lives entirely on the prompt side. Six Markdown files, loaded once into the system prompt at the start of a session. No runtime callbacks, no retry loop, no agent harness. The model reads the kit, then reads the task, then writes its answer.

What goes in each file:

# CLAUDE.md (excerpt)

## Failure patterns we have seen on this codebase
- "silent self-correction" anti-pattern: model heals
  internal state drift without surfacing the change.
  Acceptable for tone. Not acceptable for state or money.
- "plain text only" anti-pattern: forces every
  intermediate representation to be a string, breaks
  for structured workloads.
- "universal claim with disclaimer" smell: section
  title promises generality, last subsection walks
  it back. Flag these.

## Domain vocabulary
- P0/P1/P2/P3 = the four strata in our spec, see
  GLOSSARY.md for canonical definitions.
- "Stratum" vs "interceptor": stratum = ordered layer
  in vertical model. Interceptor = cross-cutting
  wrapper. Important not to conflate.

# AGENTS.md (excerpt). Output schema for critique passes.
output_schema:
  findings:
    - id: F-{n}
      severity: [info, warn, error, critical]
      principle_violated: <name from CLAUDE.md>
      evidence: <span quoted from input>
      proposed_fix: <one sentence>
      confidence: <0.0-1.0>
  signature_insight:
    single_most_actionable_fix: <string>
    rationale: <one paragraph>

The kit does three things at once. It names the failure patterns the model should be looking for, so the model is not inventing a taxonomy from scratch on every call. It pins the output schema, so downstream tooling can parse the response deterministically. And it carries forward memory of prior findings, so the model does not re-discover the same flaw on every iteration.
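
To make "parse the response deterministically" concrete, here is a minimal validator sketch for the AGENTS.md schema above. It assumes the model is asked to emit the findings as JSON with the same field names. The function name `validate_findings` and the error strings are mine, not part of the kit.

```python
import json

SEVERITIES = {"info", "warn", "error", "critical"}
REQUIRED_KEYS = ("id", "severity", "principle_violated",
                 "evidence", "proposed_fix", "confidence")


def validate_findings(raw: str) -> list[str]:
    """Return a list of schema problems; an empty list means the response is usable."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(doc, dict) or not isinstance(doc.get("findings"), list):
        return ["top level must be an object with a 'findings' list"]

    problems = []
    for i, finding in enumerate(doc["findings"]):
        if not isinstance(finding, dict):
            problems.append(f"findings[{i}]: not an object")
            continue
        missing = [key for key in REQUIRED_KEYS if key not in finding]
        if missing:
            problems.append(f"findings[{i}]: missing {', '.join(missing)}")
        sev = finding.get("severity")
        if "severity" in finding and not (isinstance(sev, str) and sev in SEVERITIES):
            problems.append(f"findings[{i}]: unknown severity {sev!r}")
        conf = finding.get("confidence")
        if "confidence" in finding and not (
            isinstance(conf, (int, float))
            and not isinstance(conf, bool)
            and 0.0 <= conf <= 1.0
        ):
            problems.append(f"findings[{i}]: confidence must be a number in [0.0, 1.0]")

    insight = doc.get("signature_insight")
    if not isinstance(insight, dict) or "single_most_actionable_fix" not in insight:
        problems.append("signature_insight.single_most_actionable_fix is missing")
    return problems
```

An empty return value means downstream tooling can consume the response as-is. Anything else is a concrete list of what to reject or re-ask for.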

The cost line for this approach lives in two places. Writing the kit is real work. The six files are roughly 2,500 tokens combined for a project of moderate complexity. Maintaining them is a discipline. Every time a new failure pattern shows up in production, it goes into CLAUDE.md. Every architectural decision goes into the ADR folder. The kit is alive.

The inference cost is the second place. Prompt caching makes this near-free on the input side after the first call. Anthropic's prompt caching (5-minute default TTL) and OpenRouter's caching support drop repeated input tokens to 10 percent of list price on cache hits. On a Gemma 4 31B call at $0.12/$0.37 per million input/output tokens, a 7,500-token system prompt plus a 2,000-token task plus a 4,000-token output costs roughly $0.003 per audit on a cold cache and about $0.002 once the prefix is cached. The full four-model audit I ran cost $0.05 total inference. The numbers come from the four-piece series linked at the end.
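
The per-audit figure can be reproduced from the prices and token counts in that paragraph. This is a sketch under stated assumptions: prices are dollars per million tokens, only the system-prompt prefix is cacheable, and a cache hit bills that prefix at 10 percent of the input price. The function and parameter names are illustrative.

```python
def audit_cost(prefix_tokens: int, task_tokens: int, output_tokens: int,
               in_price: float = 0.12, out_price: float = 0.37,
               cache_discount: float = 0.10, cached: bool = True) -> float:
    """Dollar cost of one call; prices are per million tokens."""
    # Only the cached prefix gets the discount; the task and the output pay list price.
    prefix_rate = in_price * cache_discount if cached else in_price
    total = (prefix_tokens * prefix_rate
             + task_tokens * in_price
             + output_tokens * out_price)
    return total / 1_000_000


cold = audit_cost(7_500, 2_000, 4_000, cached=False)
warm = audit_cost(7_500, 2_000, 4_000, cached=True)
print(f"cold cache: ${cold:.5f}  cache hit: ${warm:.5f}")
```

With these inputs the cold call comes to about $0.0026 and a cache hit to about $0.0018. Output tokens dominate either way, which is why caching the prefix helps but does not make the call free.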

The findings rate moved from 9 of 12 to 11 of 12 on the architecture audit when the kit was loaded. That is the 75 to 92 percent number. It is one task, one prompt structure, one temperature setting (0.3). N=1 in benchmark terms. Treat it as a directional signal, not a peer-reviewed result.

The mechanism sits entirely in front of the inference call. At inference time, nothing runs except a single model call. There is no agent loop to interrupt. There is no retry budget. There is no harness.

Approach 2. Forge: intercept at runtime

Forge is the opposite shape. It assumes you already have a self-hosted model on consumer hardware (8 to 14 GB VRAM territory) and a tool-using agent loop that is failing at step 3 of 7. Forge wraps the loop and intervenes when the model misfires.

From the CAIS 2026 demo page, the guardrail stack is described as:

Retry nudges, step enforcement, error recovery, context compaction, and hardware-aware VRAM budgeting.

A reasonable reconstruction of what each component does (the exact code is not on the public page, so this is informed inference from the named mechanisms):

# Conceptual reconstruction of a Forge-style guardrail wrapper.
# Names match the published mechanism; bodies are illustrative.

class GuardrailedAgent:
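    def __init__(self, model, tools, system_prompt="", max_steps=10, vram_budget_gb=8):
        self.model = model
        self.tools = tools
        # Optional preloaded system prompt (for example, a context kit).
        self.system_prompt = system_prompt
        self.max_steps = max_steps
        self.vram_budget_gb = vram_budget_gb
        self.context = []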

    def step(self, task):
        for i in range(self.max_steps):
            self.compact_if_over_budget()
            response = self.model.generate(self.context, task)

            if self.is_malformed(response):
                # retry nudge: re-inject the tool schema
                self.context.append(self.retry_nudge(response))
                continue

            if not self.respects_step_order(response):
                # step enforcement: reject out-of-order tool call
                self.context.append(self.order_violation_msg(response))
                continue

            tool_result = self.execute(response)

            if tool_result.is_error():
                # error recovery: structured retry with the
                # error message folded back into context
                self.context.append(self.error_recovery_prompt(tool_result))
                continue

            return tool_result

        return self.fallback()

The key property is that the guardrails are tool-agnostic. They do not know what the agent is doing. They know what malformed JSON looks like, what an out-of-order tool call looks like, what a context that is about to bust the VRAM budget looks like. The interventions are local, mechanical, and cheap.
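
To show how mechanical those checks can be, here is a sketch of two of the predicates the wrapper above calls, written as standalone functions. This is my assumption about their shape, not Forge's code. The tool-call format (`{"tool": ..., "args": {...}}`) and the idea of a fixed `required_order` list are both hypothetical.

```python
import json


def is_malformed(response: str, tool_names: set[str]) -> bool:
    """True unless the response is a JSON object naming a known tool with dict args."""
    try:
        call = json.loads(response)
    except json.JSONDecodeError:
        return True
    if not isinstance(call, dict):
        return True
    tool = call.get("tool")
    return not (isinstance(tool, str)
                and tool in tool_names
                and isinstance(call.get("args"), dict))


def respects_step_order(tool: str, completed: list[str],
                        required_order: list[str]) -> bool:
    """True if `tool` is the next pending ordered step, or is not order-constrained.

    Repeating an ordered step that has already completed counts as a violation.
    """
    if tool not in required_order:
        return True
    pending = [t for t in required_order if t not in completed]
    return bool(pending) and pending[0] == tool
```

Neither function knows what the workflow is for. They need only the tool names and the declared ordering, which is what keeps this kind of check cheap to run on every step.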

The reported result is that an 8B model under Forge hits 99 percent completion on agentic workflows. The Hacker News framing of "53 percent to 99 percent" is the headline number. The CAIS 2026 page itself reports the without-guardrails baseline as a range (49 to 87 percent for frontier APIs), so the exact "53 percent" likely comes from a specific 8B baseline configuration in the paper that I have not been able to verify against a public PDF at time of writing. The qualitative shape of the claim is well-supported: small model plus guardrails beats frontier model without guardrails on multi-step tasks.

The cost line for Forge sits at runtime. Each guardrail intervention costs an additional model call (the retry, the corrected step, the recovered error). The eval harness in the paper ran 50 trials for each of 9 scenarios across 50+ model and backend configurations, which is a lot of calls. On consumer hardware those calls are essentially free in dollar terms but have a real latency and throughput cost. On API-hosted small models the per-intervention cost adds up. A run that needs three retries to complete pays for four generations instead of one.
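
A toy model ties this cost back to the compounding math from the opening section. It is a sketch, not Forge's evaluation method. It assumes the guardrail detects every failed step, each retry succeeds independently with the same per-step probability, and a step gets a fixed retry budget. All names are illustrative.

```python
def step_success(p: float, retries: int) -> float:
    """Chance a step eventually succeeds within 1 + retries attempts."""
    return 1 - (1 - p) ** (retries + 1)


def expected_generations(p: float, retries: int) -> float:
    """Expected model calls spent on one step: E[min(Geometric(p), retries + 1)]."""
    return step_success(p, retries) / p


def workflow(p: float, steps: int, retries: int) -> tuple[float, float]:
    """Return (completion probability, expected calls) for a chained workflow."""
    completion = step_success(p, retries) ** steps
    # Slight overestimate: counts every step even if an earlier one exhausted its budget.
    calls = steps * expected_generations(p, retries)
    return completion, calls


for retries in (0, 2):
    completion, calls = workflow(0.95, steps=9, retries=retries)
    print(f"retries={retries}: completion {completion:.1%}, about {calls:.2f} calls")
```

Under these assumptions a nine-step workflow at 95 percent per step goes from about 63 percent completion to above 99 percent with two retries per step. The call count rises from 9 to roughly 9.5. Real failures are correlated, so the true gain will be smaller.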

The setup work is also runtime infrastructure. You need to integrate Forge into your agent harness, define your tool schemas in a way the step-enforcement layer can read, and tune the VRAM budgeter for your specific GPU. The CLAUDE.md side of the work happens before any call goes out. The Forge side of the work happens around every call that goes out.

Where they differ

The cleanest framing I can put on the contrast is that the two approaches live at different layers of the same stack.

Dimension	Context Kit	Forge Guardrails
Intervention point	Pre-inference (input frame)	At inference (runtime loop)
Mechanism	Failure-pattern naming, schema pinning, memory carry-forward	Retry nudges, step enforcement, error recovery, context compaction, VRAM budgeting
Where the work lives	Writing time (six MD files)	Runtime (guardrail wrapper around every call)
Marginal cost per call	Near-zero with prompt cache	One extra call per intervention
Failure mode it targets	Model not understanding the domain or output contract	Model misfiring inside a multi-step loop
Tool-aware?	Yes (domain vocabulary embedded)	No (tool-agnostic by design)
Persistence across sessions	Yes (files on disk)	No (live process state)
Setup effort	High once, low ongoing	Low once if framework exists, ongoing tuning per workload
Best fit task	Single-shot critique, audit, structured-output drafting	Multi-step tool-using agent loops
Reported lift	9/12 to 11/12 findings on architecture audit (one task, N=1)	53 to 99 percent on 9 agentic scenarios (50 trials each, from paper)

The most useful way I have found to think about the difference is labour transfer. The context kit shifts work from the inference budget to the writing budget. You pay once to author the six files. You pay near-nothing on each subsequent inference call. Forge does the opposite. It accepts that the small model will misfire in the loop and pays for the correction at inference time, but only when correction is needed.

If your workload is "I need to audit one document very carefully, once," the context kit is the right shape. The audit is a single call. There is no loop to guardrail.

If your workload is "I need to run a 7-step browser automation agent 200 times a day," Forge is the right shape. The writing budget for a context kit that covers every possible browser-automation failure is unbounded. The runtime guardrails that catch malformed JSON and out-of-order clicks are tractable.

Most real workloads are mixed. Which is what makes the combination interesting.

Hypothetical combination: both layers, same workload

Neither line of work tests the combination. The framing below is a hypothesis, not a result. I am writing it out partly to make the hypothesis concrete and partly because I want to actually run this experiment over the next month.

The thesis: the two interventions attack non-overlapping failure modes, so the gains should be roughly additive rather than redundant.

# Hypothesis: stack the two layers.
# Context kit shapes the input. Forge wraps the loop.
# Failure modes addressed should be largely disjoint.

context_kit = load_context_kit([
    "CLAUDE.md",       # failure patterns + domain vocab
    "AGENTS.md",       # output schemas
    "MEMORY.md",       # prior findings
    "TESTING.md",      # assertion patterns
    "GLOSSARY.md",     # named terms
    "docs/adr/0001.md" # decision records
])

agent = GuardrailedAgent(
    model=Gemma4_31B,
    tools=[browser, file_io, search],
    system_prompt=context_kit,
    max_steps=10,
    vram_budget_gb=8,
)

# At inference time:
# - The model knows the domain vocabulary (context kit).
# - The model knows what malformed output looks like at its
#   own level (context kit AGENTS.md schema).
# - The harness catches step ordering and retries (Forge).
# - The harness manages VRAM bloat over long loops (Forge).
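
The `load_context_kit` helper in that sketch is not defined anywhere in this post. One way it could look, assuming the kit files sit under a single project root and are simply concatenated in a fixed order (the HTML-comment delimiters are my own convention, not part of the kit):

```python
from pathlib import Path


def load_context_kit(paths: list[str], root: str = ".") -> str:
    """Concatenate kit files into one system prompt, in the order given."""
    sections = []
    for rel in paths:
        path = Path(root) / rel
        if not path.is_file():
            # Fail loudly: a silently missing file changes model behaviour.
            raise FileNotFoundError(f"context kit file missing: {path}")
        text = path.read_text(encoding="utf-8").strip()
        # Delimit each file so the model (and a human) can see where it starts and ends.
        sections.append(f"<!-- BEGIN {rel} -->\n{text}\n<!-- END {rel} -->")
    return "\n\n".join(sections)
```

Keeping the order fixed matters for cost as well as readability. A prefix that is byte-identical across calls is what lets a prompt cache hit.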

The reason I expect the gains to be roughly additive, not multiplicative or redundant:

Context-kit failure modes are mostly "the model does not know what good output looks like for this domain." Naming the failure patterns and pinning the schema fixes those. The model still occasionally produces malformed JSON, drifts off the schema, or asks for the wrong tool. Those are runtime symptoms.

Forge failure modes are mostly "the model produced something that does not parse or does not advance the workflow, and we need to recover." The retry nudge and step enforcement catch those. But Forge cannot fix a model that has the wrong concept of what the task is. A model that thinks "audit" means "summarize" will retry into the same wrong answer ten times.

The two layers are addressing different categories of mistake. Stacked together, the prediction is:

Context kit alone: 75 → 92 percent (observed, N=1).
Forge alone on 8B model: 53 → 99 percent (reported, paper).
Both together: somewhere in the 95 to 99 percent band, with the floor higher than either alone because the input quality is better and the runtime recovery still catches what slips through.

The honest version of this is that I do not know. The two sources measure different things on different tasks. Cross-applying their numbers is exactly the kind of move I would call out as sloppy if someone else did it. The right next step is a single experiment that holds the task constant and toggles each layer on and off. That is a project for June.

When to use which

A short decision rule based on workload shape.

Use the context kit when:

The task is single-shot or near-single-shot. Audits, critiques, structured drafting.
The output contract matters more than the loop. You need parseable JSON, not robust 7-step browser navigation.
You are working with a model that respects long system prompts well. Gemma 4 31B does. Smaller models may not.
You expect to run the same task shape repeatedly. Writing the kit pays off across calls.
Your bottleneck is "the model does not understand my domain."

Use Forge-style guardrails when:

The task is multi-step with real tools. Browser agents, file-system agents, multi-API workflows.
You are running a self-hosted small model on consumer hardware and the alternative is paying frontier API rates.
Step ordering matters and the model has been observed to call tools out of order.
Context bloat over the loop is breaking the model. Compaction matters.
Your bottleneck is "the model misfires in the loop and the run aborts."

Consider both when:

The workload is multi-step AND domain-specific. Most real production workloads.
You have one source of truth for failure patterns (CLAUDE.md) that the runtime guardrails can reference.
You are running the workload at volume and the cost of a single retry call is starting to matter.

Pick neither and pay frontier rates when:

The workload is irregular and short-lived. The setup cost of either approach is not worth it for a one-off script.
You have no time to maintain the kit and no infrastructure to host the model.
The cost of a wrong answer is high enough that you want a single shot at maximum capability and you can afford it.
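
The lists above collapse into a small function if you reduce the workload to three yes/no questions. This is a deliberate simplification, and the three flags are my own compression of the criteria rather than a complete rule.

```python
def recommend(multi_step: bool, domain_specific: bool, one_off: bool) -> str:
    """Map a workload shape to one of the four options discussed above."""
    if one_off:
        # Setup cost of either layer is not worth it for a short-lived job.
        return "frontier API, no extra layer"
    if multi_step and domain_specific:
        return "context kit + guardrails"
    if multi_step:
        return "guardrails"
    # Single-shot or near-single-shot work: audits, critiques, structured drafting.
    return "context kit"


print(recommend(multi_step=True, domain_specific=True, one_off=False))
```

It ignores hosting constraints and the cost of a wrong answer, both of which can override the shape-based answer.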

What I am running next

The directly testable hypothesis from this post is that stacking context kit and runtime guardrails on the same workload produces roughly additive gains. The cheapest version of that experiment is:

Hold the task constant. Use the architecture audit task from the earlier post (12 ground-truth findings).
Pick a small open model that runs on consumer hardware. Gemma 4 31B works. Llama 3.1 8B is closer to the Forge paper baseline.
Toggle two binary variables: context kit on/off, guardrail wrapper on/off.
Run each cell 20 times. Measure findings rate and per-run cost.
Compare against frontier baseline (Claude Opus 4.7) without either layer.

The 2x2 design is small enough that a solo developer can run it in a weekend. The result, regardless of which way it lands, would tell us whether the two layers compose or interfere. I will write it up either way.
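
A sketch of the harness for that grid, so the "additive or not" question has a number attached. `run_audit` is a hypothetical callable you would supply. It takes the two toggles and returns how many of the 12 ground-truth findings one run caught. The interaction term is the standard 2x2 contrast, not something either source defines.

```python
from itertools import product
from statistics import mean
from typing import Callable

Cell = tuple[bool, bool]  # (context kit on?, guardrails on?)


def run_grid(run_audit: Callable[[bool, bool], int],
             runs_per_cell: int = 20,
             total_findings: int = 12) -> dict[Cell, float]:
    """Mean findings rate for each of the four (kit, guardrails) cells."""
    rates = {}
    for kit, guard in product((False, True), repeat=2):
        found = [run_audit(kit, guard) for _ in range(runs_per_cell)]
        rates[(kit, guard)] = mean(found) / total_findings
    return rates


def interaction(rates: dict[Cell, float]) -> float:
    """Kit lift with guardrails on, minus kit lift with guardrails off.

    Near zero: the layers are additive. Positive: they reinforce each other.
    Negative: they overlap and part of the gain is redundant.
    """
    lift_with_guard = rates[(True, True)] - rates[(False, True)]
    lift_without_guard = rates[(True, False)] - rates[(False, False)]
    return lift_with_guard - lift_without_guard
```

With 20 runs per cell the estimate will be noisy, so a small nonzero interaction should not be over-read.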

Footer

This is the fifth post in a series on context engineering for small open-weights models. The earlier four covered the math and the audit results in detail.

The cost engineering math: I cut my Gemma 4 API costs 87 percent with context engineering. Here is the math.
The architecture audit: I ran a 7,500-token architecture spec through 4 models.
The defense pass: Can Gemma 4 defend what it builds?

Reference for Forge: Antoine Zambelli, Forge: Closing the Agentic Reliability Gap Between Self-Hosted and Frontier Language Models, ACM CAIS 2026.

The full six-file context engineering kit is open source under MIT on GitHub (agent-starter-kit), and packaged as a paid template on Kmong for users who want the curated version with the case-study writeups included. Both links live on the repo README.

If you try this stack on your own workload, the comparison number I would most like to see is the 2x2: kit on/off crossed with guardrail wrapper on/off, same task, same model. Counter-experiments welcome.

Jack. wildeconforce.com