V2 Build Plan: Advanced Modules for the Agent Context Layer

Self-contained roadmap for the next five modules after the V1 context, memory, receipts, and replay substrate

Canonical V2 source of truth. Earlier V2 module notes have been consolidated into this document.

Date: April 18, 2026

1. What this document is

This document assumes V1 is complete: the compiler core, memory system, policy layer, provider adapters, flight recorder, replay of exact runs, local sidecar, local dashboard, and basic eval hook are already in place.

V2 turns the product from a passive memory-and-receipts layer into an active learning system. The five modules below are not random features; they form the loop that makes the platform progressively more useful every time an agent runs.

The V2 platform remains general-purpose infrastructure for agents. Coding-agent examples should continue to be the first polished wedge, but module schemas and package boundaries should stay domain-neutral.

Do not build all modules in parallel. Build them in sequence so each module can reuse the schemas, fixtures, and UI surfaces created by the previous one.

2. Guiding principles

Keep the product positioned as memory + receipts for agents, not as another orchestration framework.
Make every new capability visible in the recorder so the system remains explainable.
Prefer deterministic artifacts over clever heuristics when the two are in tension.
Everything promoted from a run must keep provenance, scope, TTL, and sensitivity metadata.
Every module must work in local mode first; team and hosted capabilities can extend it later.

3. Recommended module sequence

Order	Module	Why now
1	Memory Distiller	Turns raw traces into reusable memory so the agent actually learns.
2	Counterfactual Replay	Lets developers answer whether different context or scoring would have changed the outcome.
3	Auto-Regression Generation	Converts failures and near-misses into repeatable test cases and CI gates.
4	Handoff Contracts	Makes agent-to-agent transfers structured, typed, and debuggable.
5	Policy Envelopes	Adds controlled visibility, permissions, redaction, and approval for broader adoption.

4. Module 1: Memory Distiller

What it is: Convert finished runs into durable, scoped memory that can improve future compilations.

Problem to solve: V1 records what happened, but the knowledge remains trapped inside old traces. The product becomes far more valuable once successful runs can produce reusable facts, procedures, warnings, and preferences.

User-visible outcome

A coding agent remembers repository conventions, preferred commands, recurring fixes, and known dead ends.
Developers can see exactly which old run created a memory and why it was reused.
Memories expire, can be pinned, and never bypass the provenance model.

Core types

TraceEpisode — a clustered slice of a run that can be distilled.
MemoryCandidate — a proposed memory extracted from an episode.
MemoryArtifact — an accepted memory item; types include fact, procedure, preference, warning, and summary.
PromotionDecision — accept, reject, merge, expire, or quarantine.
MemoryLink — a record of when a memory was injected into a compilation.
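
The core types above could be sketched as plain data classes. This is a minimal illustration, not the project's schema: the field names, the string-valued `scope` and `sensitivity` fields, and the rule that a pinned memory is exempt from TTL expiry are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum


class MemoryType(str, Enum):
    FACT = "fact"
    PROCEDURE = "procedure"
    PREFERENCE = "preference"
    WARNING = "warning"
    SUMMARY = "summary"


@dataclass(frozen=True)
class MemoryArtifact:
    id: str
    type: MemoryType
    content: str
    source_run_id: str    # provenance: the run that produced this memory
    scope: str            # hypothetical scope key, e.g. "repo:example"
    sensitivity: str      # hypothetical class label, e.g. "internal"
    created_at: datetime
    ttl: timedelta
    pinned: bool = False

    def is_expired(self, now: datetime) -> bool:
        # Assumption: pinning exempts a memory from TTL-based expiry.
        if self.pinned:
            return False
        return now >= self.created_at + self.ttl


created = datetime(2026, 1, 1, tzinfo=timezone.utc)
memory = MemoryArtifact(
    id="m1",
    type=MemoryType.PROCEDURE,
    content="Run the unit tests before opening a pull request.",
    source_run_id="run-1",
    scope="repo:example",
    sensitivity="internal",
    created_at=created,
    ttl=timedelta(days=30),
)
print(memory.is_expired(created + timedelta(days=31)))  # True
```

Keeping provenance, scope, TTL, and sensitivity as required constructor arguments means an artifact cannot be created without them, which matches the guiding principle that promoted items always carry that metadata.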

Repository changes

packages/memory/ for extraction, scoring, promotion, retrieval, and GC.
apps/dashboard/src/pages/memories/ for candidate review, accepted memory, and usage receipts.
Database tables: trace_episodes, memory_candidates, memories, memory_links, memory_feedback.

Build in this order

Create the schemas, migrations, and serialization tests.
Build an episode segmenter that groups steps by task intent, tool phase, and outcome boundary.
Implement candidate extraction for facts, procedures, warnings, and user preferences.
Add a promotion engine that scores candidates by confidence, reuse potential, recency, and sensitivity.
Add dedupe and merge logic so similar memories collapse into one artifact instead of polluting the store.
Hook retrieval into the compiler as another context source, with strict scope and TTL enforcement.
Add dashboard screens for review, promotion history, decay, and memory usage receipts.
Add background jobs for decay, quarantine review, and garbage collection.
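
One way the promotion engine from the build steps above might combine its signals is a weighted score with two thresholds. The weights, thresholds, and the choice to route sensitive or borderline candidates to quarantine for human review are hypothetical; the plan only names the inputs (confidence, reuse potential, recency, sensitivity) and the possible decisions.

```python
from dataclasses import dataclass

# Hypothetical weights and thresholds; real values would be tuned on review data.
WEIGHTS = {"confidence": 0.5, "reuse_potential": 0.3, "recency": 0.2}
ACCEPT_THRESHOLD = 0.75
REJECT_THRESHOLD = 0.40


@dataclass(frozen=True)
class CandidateSignals:
    confidence: float       # 0..1, extractor confidence
    reuse_potential: float  # 0..1, how likely the memory applies again
    recency: float          # 0..1, newer evidence scores higher
    sensitive: bool         # True if the content falls in a sensitive class


def promotion_score(signals: CandidateSignals) -> float:
    return (
        WEIGHTS["confidence"] * signals.confidence
        + WEIGHTS["reuse_potential"] * signals.reuse_potential
        + WEIGHTS["recency"] * signals.recency
    )


def decide(signals: CandidateSignals) -> str:
    # Assumption: sensitive candidates are never auto-accepted.
    if signals.sensitive:
        return "quarantine"
    score = promotion_score(signals)
    if score >= ACCEPT_THRESHOLD:
        return "accept"
    if score < REJECT_THRESHOLD:
        return "reject"
    # Borderline candidates wait for human review.
    return "quarantine"
```

A fixed linear score is easy to show in a promotion receipt, which fits the principle of preferring deterministic artifacts over clever heuristics.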

API surface

POST /v2/memory/distill/run/{run_id} — distill one completed run.
GET /v2/memory/search — search accepted memories by task, repo, agent, user, or scope.
POST /v2/memory/{id}/promote — manually promote or pin a candidate.
POST /v2/memory/{id}/expire — force expiration.
GET /v2/memory/{id}/usage — show every compile that used the memory.

Dashboard requirements

Candidate queue with confidence, scope, TTL, and originating run.
Accepted memory view with provenance chain and usage count.
Memory-in-compile receipt that shows why the memory was selected.
Controls for pin, reject, expire, quarantine, and sensitivity edits.

Done when

At least one accepted memory from a prior run is automatically reused in a later similar task.
The compiler can explain why a memory was included or rejected.
Stale memories decay automatically instead of lingering indefinitely.
A developer can trace every memory back to the source run and original evidence.

Key risks and mitigations

Memory pollution from weak extractions. Mitigate with strict promotion thresholds, confidence scores, and human review for early versions.
Privacy leaks. Enforce sensitivity classes and scope boundaries before retrieval, not after selection.
Overfitting. Prefer short, typed memories over giant freeform summaries.
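
The privacy mitigation above (enforce sensitivity and scope before retrieval, not after selection) can be illustrated with a filter that runs ahead of ranking. The three-level sensitivity ordering, the `StoredMemory` shape, and the `relevance` field are assumptions for this sketch.

```python
from dataclasses import dataclass
from typing import List, Set

# Hypothetical sensitivity classes, ordered from least to most restricted.
SENSITIVITY_ORDER = ["public", "internal", "restricted"]


@dataclass(frozen=True)
class StoredMemory:
    id: str
    scope: str
    sensitivity: str
    relevance: float  # assumed to be precomputed for the current task


def eligible(memory: StoredMemory, allowed_scopes: Set[str], max_sensitivity: str) -> bool:
    within_scope = memory.scope in allowed_scopes
    within_class = (
        SENSITIVITY_ORDER.index(memory.sensitivity)
        <= SENSITIVITY_ORDER.index(max_sensitivity)
    )
    return within_scope and within_class


def retrieve(
    memories: List[StoredMemory],
    allowed_scopes: Set[str],
    max_sensitivity: str,
    limit: int,
) -> List[StoredMemory]:
    # Filter first, so ineligible memories never reach the ranking step.
    visible = [m for m in memories if eligible(m, allowed_scopes, max_sensitivity)]
    return sorted(visible, key=lambda m: m.relevance, reverse=True)[:limit]
```

Because the filter runs first, a highly relevant but out-of-scope memory cannot displace an eligible one or leak through a later truncation step.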

5. Module 2: Counterfactual Replay

What it is: Replay a run with targeted mutations to understand whether context decisions caused the outcome.

Problem to solve: Exact replay is useful, but it does not tell the developer what would have happened if the compiler had chosen different evidence, a larger budget, a different memory set, or a different policy. Counterfactual replay turns the recorder into a debugging and research tool.

User-visible outcome

A developer can rerun a failed step while dropping one retrieval doc, increasing the token budget, or disabling one memory class.
The dashboard can compare baseline vs mutated runs side by side.
Teams can answer whether a fix belongs in scoring, retrieval, policy, or the agent prompt.

Core types

ReplayScenario — one replay job anchored to an original run or step.
Mutation — a controlled change such as drop item, change weight, change budget, swap summary vs raw file, or disable a policy rule.
OutcomeMetric — structural success, tool path, latency, cost, diff quality, or custom rubric.
ReplayResult — compiled context, provider output, tool trace, and metric deltas.
ScenarioBatch — a set of related mutations executed together for comparison.
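
A mutation model over normalized context items might look like the sketch below, where each mutation is a pure function from one compile input to another. `ContextItem`, `CompileInput`, and the three mutation helpers are hypothetical names chosen for the example, covering only a subset of the mutation kinds listed above.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class ContextItem:
    id: str
    kind: str      # e.g. "retrieval_doc" or "memory"
    weight: float


@dataclass(frozen=True)
class CompileInput:
    items: Tuple[ContextItem, ...]
    token_budget: int


# A mutation takes a compile input and returns a changed copy.
Mutation = Callable[[CompileInput], CompileInput]


def drop_item(item_id: str) -> Mutation:
    return lambda ci: replace(ci, items=tuple(i for i in ci.items if i.id != item_id))


def change_budget(token_budget: int) -> Mutation:
    return lambda ci: replace(ci, token_budget=token_budget)


def disable_kind(kind: str) -> Mutation:
    return lambda ci: replace(ci, items=tuple(i for i in ci.items if i.kind != kind))


def apply_mutations(baseline: CompileInput, mutations: Iterable[Mutation]) -> CompileInput:
    mutated = baseline
    for mutate in mutations:
        mutated = mutate(mutated)
    return mutated
```

Frozen data classes keep the baseline untouched, so a batch of scenarios can share one baseline object and each result can be diffed against it.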

Repository changes

packages/replay/ for scenario orchestration, mutation application, evaluation, and comparison.
apps/dashboard/src/pages/replays/ for scenario setup and side-by-side analysis.
Database tables: replay_scenarios, replay_mutations, replay_results, replay_metrics.

Build in this order

Reuse the V1 exact replay engine as the baseline executor.
Define a mutation model that operates on normalized context items and compile settings, not on raw provider payloads.
Implement a dry-run mode and a stubbed-tool mode so replays do not create accidental side effects.
Add a metrics layer with structural assertions, token deltas, tool-path comparison, and optional LLM judge hooks behind a feature flag.
Build scenario batches so multiple mutations can run against one baseline.
Add dashboard diff views for baseline vs replayed compiled context, provider output, and outcome score.
Add export to regression-case creation so useful counterfactuals can become permanent tests.

API surface

POST /v2/replays/scenarios — create a replay scenario from a run or step.
POST /v2/replays/scenarios/{id}/execute — execute one scenario.
POST /v2/replays/batches — run multiple mutations together.
GET /v2/replays/{id} — fetch results, metrics, and diffs.

Dashboard requirements

Scenario builder with common mutation templates.
Baseline vs replay diff for included items, exclusions, and token allocation.
Outcome summary with win/loss/neutral labels against selected metrics.
Promotion action that turns a useful replay into a regression case.

Done when

A user can rerun a recorded failure with at least five mutation types without hand-editing payloads.
The system can compare baseline and replayed outcomes in a single view.
Replay results can be exported directly into the eval/regression pipeline.

Key risks and mitigations

Provider nondeterminism. Reduce its impact with stubbed tools, pinned models, and structural metrics rather than exact text comparison.
Unsafe side effects. Default to dry-run or stubbed execution unless a scenario explicitly opts in.
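
The stubbed-execution default described above could be sketched as a tool runner that answers from recorded results and refuses anything else unless the scenario opts in. The class name, the key format, and the `live_tools` opt-in parameter are assumptions for illustration.

```python
import json
from typing import Any, Callable, Dict, Optional


class SideEffectBlocked(RuntimeError):
    """Raised when a replay calls a tool that has no recorded result."""


def call_key(tool: str, args: Dict[str, Any]) -> str:
    # Canonical key so that argument order does not affect lookup.
    return tool + ":" + json.dumps(args, sort_keys=True)


class StubbedToolRunner:
    def __init__(
        self,
        recorded: Dict[str, Any],
        live_tools: Optional[Dict[str, Callable[..., Any]]] = None,
    ) -> None:
        self._recorded = recorded
        # Empty unless a scenario explicitly opts in to live execution.
        self._live_tools = live_tools or {}

    def call(self, tool: str, args: Dict[str, Any]) -> Any:
        key = call_key(tool, args)
        if key in self._recorded:
            return self._recorded[key]
        if tool in self._live_tools:
            return self._live_tools[tool](**args)
        raise SideEffectBlocked(f"no recorded result for {key}")
```

Failing loudly on an unrecorded call makes it visible when a mutation has pushed the agent onto a tool path the original run never took, which is itself a useful replay signal.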

6. Module 3: Auto-Regression Generation

What it is: Turn failed or high-value runs into repeatable tests that guard future changes.

Problem to solve: Without a regression layer, the same failures reappear after scoring changes, memory promotion, prompt edits, or provider upgrades. The product should let teams create test cases directly from production traces instead of writing them from scratch.

User-visible outcome

One bad run becomes a reusable fixture with assertions.
The CI pipeline can fail when a new version regresses on known cases.
The dashboard shows when a compiler change improved or broke historical tasks.

Core types

RegressionCase — a frozen scenario with fixture inputs and assertions.
FixtureBundle — redacted context, tool stubs, repo snapshots, and expected artifacts.
AssertionSet — structural, behavioral, output, or policy assertions.
RegressionRun — one execution of a case against a target version.
FailureSignature — a reusable tag that clusters similar failures.

Repository changes

packages/regressions/ for case creation, fixture bundling, assertion evaluation, and CI output.
apps/dashboard/src/pages/regressions/ for case review and historical trend charts.
Database tables: regression_cases, regression_case_versions, regression_runs, regression_assertions.

Build in this order

Create a fixture-bundling format that snapshots the minimal data needed to rerun a case safely.
Implement assertion templates for compile shape, tool behavior, output structure, policy decisions, and artifact quality.
Add one-click creation from runs, steps, or replay scenarios.
Build a local CLI and CI mode that executes a suite of regression cases and emits machine-readable results.
Add trend dashboards for pass rate, flaky cases, and case coverage by agent or task type.
Support case versioning so teams can evolve expectations intentionally instead of rewriting history.
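
As a rough sketch of the assertion templates and machine-readable suite output mentioned in the steps above, a case can hold named structural checks and a suite runner can emit a JSON-friendly report. The output keys (`included_items`, `tool_path`) and the report shape are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# An assertion inspects one run output and returns True when it holds.
Assertion = Callable[[Dict[str, Any]], bool]


@dataclass
class RegressionCase:
    id: str
    assertions: Dict[str, Assertion] = field(default_factory=dict)


def includes_item(item_id: str) -> Assertion:
    # Structural assertion on the compiled context.
    return lambda out: item_id in out.get("included_items", [])


def tool_path_equals(path: List[str]) -> Assertion:
    # Behavioral assertion on the order of tool calls.
    return lambda out: out.get("tool_path") == path


def run_suite(cases: List[RegressionCase], outputs: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    results = []
    for case in cases:
        out = outputs[case.id]
        failed = sorted(name for name, check in case.assertions.items() if not check(out))
        results.append({"case_id": case.id, "passed": not failed, "failed_assertions": failed})
    return {"passed": all(r["passed"] for r in results), "results": results}
```

A CI wrapper could serialize the report with `json.dumps` and exit non-zero when `passed` is false; that wiring is left out here because the plan does not specify the CLI.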

API surface

POST /v2/regressions/from-run/{run_id} — create a case from a run.
POST /v2/regressions/from-replay/{replay_id} — create a case from a replay result.
POST /v2/regressions/{id}/execute — run one case.
POST /v2/regressions/suites/execute — run a suite.
GET /v2/regressions/{id}/history — show performance over time.

Dashboard requirements

Case builder with fixture preview and assertion editor.
Suite view showing pass rate, drift, and flaky cases.
Integration docs and badges for local CLI and CI environments.

Done when

A user can create a regression case from a run in one action.
A suite can be executed locally and in CI without cloud dependencies.
The system can highlight which compiler change caused a case to start failing.

Key risks and mitigations

Brittle exact-match assertions. Prefer structural and policy-level assertions first.
Oversized fixtures. Snapshot the minimum viable bundle and redact aggressively.

7. Module 4: Handoff Contracts

What it is: Replace loose text handoffs between agents with typed contracts that preserve goals, evidence, and constraints.

Problem to solve: As soon as users chain agents together, handoffs become a major source of hidden failure. Requirements get dropped, evidence is lost, and downstream agents act on incomplete or stale context. Handoff contracts make multi-agent execution legible and debuggable.

User-visible outcome

An upstream agent hands off a structured packet instead of a paragraph blob.
The downstream agent knows the goal, constraints, required artifacts, and success criteria.
The recorder can show where information was dropped or corrupted between agents.

Core types

HandoffContract — the schema that defines a typed transfer.
GoalPacket — the current task objective and expected output.
EvidenceBundle — the exact supporting sources approved for transfer.
ConstraintSet — hard rules, blocked actions, deadlines, and style requirements.
ConsumerAck — the downstream agent's acceptance, refusal, or clarification state.

Repository changes

packages/handoffs/ for schemas, validators, templates, and contract rendering.
apps/dashboard/src/pages/handoffs/ for handoff graphs and validation errors.
Database tables: handoff_contracts, handoff_events, handoff_artifacts, handoff_acks.

Build in this order

Define a typed contract schema with versioning and task templates.
Add validators that reject incomplete contracts before a downstream agent executes.
Add compiler integration so only the scoped evidence bundle crosses the boundary.
Link producer and consumer traces in the recorder so handoffs appear as a graph, not isolated runs.
Create templates for the first three common patterns: code patch review, research-to-implementation, and planner-to-executor.
Add dashboard views for missing requirements, dropped evidence, and contract violations.
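
The validator step above (reject incomplete contracts before a downstream agent executes) might be sketched as follows. The contract fields and the specific completeness rules are assumptions; a real schema would also carry versioning and template identifiers.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class HandoffContract:
    goal: str
    expected_output: str
    evidence_ids: List[str] = field(default_factory=list)
    constraints: List[str] = field(default_factory=list)
    success_criteria: List[str] = field(default_factory=list)


def validate(contract: HandoffContract) -> List[str]:
    # Collect every problem so the producer can fix them in one pass.
    errors = []
    if not contract.goal.strip():
        errors.append("goal is empty")
    if not contract.expected_output.strip():
        errors.append("expected_output is empty")
    if not contract.evidence_ids:
        errors.append("evidence bundle is empty")
    if not contract.success_criteria:
        errors.append("no success criteria")
    return errors


def acknowledge(contract: HandoffContract) -> Dict[str, Any]:
    # Hypothetical ack record; rejection happens before any work begins.
    errors = validate(contract)
    return {"state": "rejected" if errors else "accepted", "errors": errors}
```

Returning the full error list, rather than stopping at the first failure, gives the dashboard enough detail to show missing requirements per contract.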

API surface

POST /v2/handoffs/contracts — create a contract.
POST /v2/handoffs/contracts/{id}/ack — accept, reject, or request clarification.
GET /v2/handoffs/contracts/{id} — view schema, evidence, and lifecycle.
GET /v2/handoffs/graphs/{run_id} — render the handoff graph for a run.

Dashboard requirements

Handoff graph view linking producer and consumer runs.
Contract inspector with goal, evidence, constraints, required outputs, and validation state.
Violation alerts when downstream output ignores or conflicts with the contract.

Done when

At least one multi-agent workflow can pass a typed contract end to end.
Downstream agents can reject invalid or incomplete handoffs before work begins.
The recorder can explain exactly what crossed the boundary and why.

Key risks and mitigations

Too much rigidity for trivial tasks. Use optional templates and allow lightweight contracts for small flows.
Schema bloat. Keep the core contract compact and use extension fields for niche cases.

8. Module 5: Policy Envelopes

What it is: Give each agent task a scoped permission envelope for data, memory, tools, models, and approvals.

Problem to solve: The product can stay developer-first for a while, but broad adoption eventually requires hard boundaries around who can see what, which tools may run, which memories may be reused, and when a human approval gate is required. Policy envelopes turn the compiler into an enforceable control point.

User-visible outcome

Teams can define which repos, docs, memory classes, and tools an agent may access for a task.
Denied actions are visible, explainable, and replayable.
The same compiler can serve individual developers and stricter enterprise environments.

Core types

PolicyEnvelope — the top-level policy attached to a run, agent, or task.
DataRule — what information may be seen, redacted, or blocked.
ToolRule — allowed, denied, rate-limited, or approval-gated tools.
ModelRule — approved providers, models, or reasoning modes.
ApprovalRule — when a human must approve before execution continues.
EnvelopeDecision — the recorder artifact that explains allow/deny/redact outcomes.

Repository changes

packages/policy/ for envelope evaluation, inheritance, redaction, and approvals.
apps/dashboard/src/pages/policies/ for versions, dry runs, and audit history.
Database tables: policy_envelopes, policy_versions, policy_decisions, approval_events.

Build in this order

Define an envelope schema with inheritance and versioning.
Add compile-time enforcement for data visibility, memory eligibility, and model/tool permissions.
Add tool-execution checks so the envelope is enforced twice: before the context is rendered and before a tool action executes.
Add approval checkpoints for selected risky tool calls or write actions.
Build a dry-run simulator that shows what the envelope would allow, deny, or redact on a sample run.
Add dashboard views for version diffs, decision receipts, and approval history.
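
Envelope inheritance and decision receipts from the steps above could work roughly like this sketch, restricted to tool rules. The nearest-envelope-wins lookup, the default-deny fallback, and the outcome labels are assumptions; the plan does not state how inherited rules are resolved.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class PolicyEnvelope:
    name: str
    # tool name -> "allow" | "deny" | "approval" (hypothetical labels)
    tool_rules: Dict[str, str] = field(default_factory=dict)
    parent: Optional["PolicyEnvelope"] = None


def decide_tool(envelope: PolicyEnvelope, tool: str) -> Dict[str, str]:
    # Walk from the most specific envelope up through its ancestors.
    node: Optional[PolicyEnvelope] = envelope
    while node is not None:
        if tool in node.tool_rules:
            return {"tool": tool, "outcome": node.tool_rules[tool], "decided_by": node.name}
        node = node.parent
    # Assumption: tools with no matching rule are denied.
    return {"tool": tool, "outcome": "deny", "decided_by": "default-deny"}


org = PolicyEnvelope("org", {"shell": "deny", "read_file": "allow"})
task = PolicyEnvelope("task", {"shell": "approval"}, parent=org)
print(decide_tool(task, "shell"))
# {'tool': 'shell', 'outcome': 'approval', 'decided_by': 'task'}
```

Note that nearest-wins lets a child envelope loosen a parent's rule, as in the usage above. A stricter design would take the most restrictive outcome along the chain; which behavior is intended is a product decision this sketch does not settle.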

API surface

POST /v2/policies/envelopes — create or update a policy envelope.
POST /v2/policies/envelopes/{id}/simulate — run dry-run evaluation against a recorded run.
GET /v2/policies/decisions/{run_id} — fetch all allow/deny/redact decisions for a run.
POST /v2/policies/approvals/{id} — approve or reject a gated action.

Dashboard requirements

Policy editor with inheritance preview and version diff.
Dry-run simulator that overlays decisions on a recorded run.
Approval queue for gated actions and audit history for completed approvals.

Done when

A policy envelope can prevent unauthorized memory retrieval, data exposure, and tool execution.
A developer can simulate the effect of a new policy against an existing run before enforcing it.
Every allow, deny, and redact decision is preserved as a recorder artifact.

Key risks and mitigations

Policy sprawl. Use templates and inheritance instead of hundreds of standalone rules.
Hidden complexity. Keep the decision receipts readable and first-class in the UI.

9. Cross-module architecture and release plan

Shared infrastructure to build once

A background job runner for distillation, replay batches, and regression suites.
A fixture store that can persist redacted snapshots for replays and regression cases.
A reusable evaluator interface so replay scoring and regression assertions share the same engine.
A provenance query layer that can answer 'where did this come from?' across runs, memories, handoffs, and policies.
A common scope model used by memory retrieval, handoff evidence bundles, and policy envelopes.
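
The reusable evaluator interface listed above might be expressed as a small protocol that both replay scoring and regression assertions call. The `Evaluator` protocol, the example evaluator, and the `tokens_used` output key are hypothetical.

```python
from typing import Any, Dict, List, Protocol


class Evaluator(Protocol):
    name: str

    def evaluate(self, output: Dict[str, Any]) -> float:
        """Return a score in the range 0..1 for one run output."""
        ...


class TokenBudgetEvaluator:
    name = "within_token_budget"

    def __init__(self, budget: int) -> None:
        self.budget = budget

    def evaluate(self, output: Dict[str, Any]) -> float:
        return 1.0 if output.get("tokens_used", 0) <= self.budget else 0.0


def score(output: Dict[str, Any], evaluators: List[Evaluator]) -> Dict[str, float]:
    return {e.name: e.evaluate(output) for e in evaluators}


def metric_deltas(
    baseline: Dict[str, Any], replayed: Dict[str, Any], evaluators: List[Evaluator]
) -> Dict[str, float]:
    # Replay use: how much each metric moved relative to the baseline.
    before, after = score(baseline, evaluators), score(replayed, evaluators)
    return {name: after[name] - before[name] for name in before}


def all_pass(output: Dict[str, Any], evaluators: List[Evaluator]) -> bool:
    # Regression use: every evaluator must return a full score.
    return all(value >= 1.0 for value in score(output, evaluators).values())
```

With one interface, a metric written for counterfactual comparison can be reused unchanged as a regression gate, which is the sharing this section asks for.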

Recommended repository expansion

Start from the V1 monorepo shape:

repo/
  README.md
  docs/
  packages/
    context_core/
    sdk/
    provider_openai/
    provider_anthropic/
    recorder/
    mcp_bridge/
    otel_export/
  services/
    api/
  apps/
    dashboard/
  examples/
    openai_context_demo/
    anthropic_coding_agent/
    langgraph_wrapper/
    mcp_ingest_demo/
  tests/
    unit/
    integration/
    fixtures/

Then extend it for V2:

packages/
  memory/
  replay/
  regressions/
  handoffs/
  policy/

apps/
  dashboard/
    routes for memories, replays, regressions, handoffs, policies

tests/
  regression_fixtures/

Notes

In V1, memory, replay, and policy may live inside context_core/ as cohesive interfaces and implementations.
In V2, split memory/, replay/, and policy/ into first-class package boundaries when their modules become large enough to justify it.
Add regressions/ and handoffs/ as new package boundaries.
Add one migration package or migration module that keeps schema changes versioned and reversible.

Recommended build sequence

Build Memory Distiller first so the system starts learning from traces.
Build Counterfactual Replay second so the team can debug and iterate on compiler choices intelligently.
Build Auto-Regression Generation third so replay insights become permanent safeguards.
Build Handoff Contracts fourth when multi-agent workflows become a real part of the product.
Build Policy Envelopes fifth to widen adoption without breaking the developer-first experience.

Open-source vs hosted split

Open-source core: memory extraction and retrieval, replay engine, regression case format and local CLI, handoff contract schema and validator.
Hosted or enterprise layer: team review queues, batch replay orchestration, long-term analytics, shared memory governance, policy approvals, and org-wide dashboards.
Do not lock local mode behind the hosted service; the local product is the adoption wedge.

10. Final acceptance criteria

A coding agent can receive this document and implement the modules in sequence without guessing the product boundaries.
Each module has its own schemas, package boundary, API surface, UI surface, and definition of done.
All modules preserve the main product thesis: memory + receipts for agents.
The roadmap avoids building a second orchestration framework and instead deepens the under-owned layer around context, learning, replay, handoff, and control.

11. One-line positioning to preserve

The learning, replay, and control layer for AI coding agents.

Keep the messaging anchored to memory, receipts, and trustworthy action. Do not drift into the language of generic orchestration.