V1 Build Plan: General-Purpose Context, Memory, and Receipts Layer for Agents¶

Self-contained product and implementation brief¶

Canonical V1 source of truth. The older context-compiler-specific build plan has been consolidated into this document.

This document is a complete V1 build brief. It does not assume any prior plan or conversation. A team or coding agent should be able to use this document alone to understand what to build and how to implement it from scratch.

1. Product definition¶

Build a general-purpose infrastructure layer for AI agents that does four jobs:

Compiles the right context for each model call.
Maintains working and durable memory across steps and sessions.
Records receipts for what the agent saw, why it saw it, what it did, and how it changed state.
Makes failures replayable and explainable.

This product is not another orchestrator, workflow graph editor, or vector database. It is the context, memory, policy, and replay substrate that sits under agent runtimes.

Short positioning¶

The memory and receipts layer for AI agents.

What the product must support¶

The core must be generic enough to support: - coding agents - research agents - support agents - internal knowledge agents - workflow / operations agents - multi-agent systems

What “coding-agent-first” means¶

The product is for agents in general.
The first launch wedge is coding agents, because they are the easiest category in which to demonstrate value.

That means: - the core architecture stays generic - the first polished demo and integration target is a repo-local coding agent workflow - the first adapters should make it easy to use with a local workspace, files, diffs, tests, and shell tools

Do not bake code-specific assumptions into the core data model.

2. The problem this solves¶

Agent systems fail for recurring reasons: - they see the wrong context - they lose track of constraints - they include stale or conflicting evidence - they forget decisions made earlier in the run - they compact long sessions badly - they call tools without clear provenance - they are hard to debug after the fact - teams cannot answer “what did the agent know when it made that decision?”

Today, developers often solve this with ad hoc prompt construction, one-off memory hacks, logging, and framework-specific state management. The result is fragile, opaque, and hard to reproduce.

This product fixes that by making context selection, memory persistence, and execution receipts explicit and inspectable.

3. V1 goals¶

V1 should deliver a usable product with local-first deployment.

Primary goals¶

Provide a context compiler that selects, compresses, and renders context for each model call.
Provide a flight recorder that stores what was included, excluded, changed, and executed.
Provide a working memory layer and a durable memory layer.
Provide a policy layer that scopes what the agent may see or do.
Provide replay for exact reruns of prior steps.
Provide provider adapters for at least two major LLM providers.
Provide a developer-facing dashboard for receipts, diffs, and replay.
Launch with a coding-agent wedge: polished examples for repo-local agents.

V1 success criteria¶

A developer should be able to: - integrate the package into an agent in under 30 minutes - inspect exactly what context was sent for each step - see why each context item was included, excluded, compressed, or redacted - replay a failed step exactly - persist and reuse memory across sessions - use the product locally without any hosted dependency

4. Non-goals¶

Do not build these in V1: - a visual workflow/DAG builder - a new orchestration framework - a hosted multi-tenant SaaS control plane - enterprise billing and account management - a vector database - autonomous memory learning that writes to long-term memory without developer control - a cross-provider eval platform - a generic IDE - a replacement for existing agent runtimes

The product should integrate with runtimes, not replace them.

5. Product shape¶

Ship V1 as three pieces:

Python package
the main adoption wedge
includes compiler, memory, policy, recorder, adapters, and SDK
Local sidecar daemon
optional but recommended
exposes HTTP and local IPC for integrations
centralizes state, replay, and dashboard data
Local dashboard
shows runs, steps, diffs, receipts, memory, and replay controls

Why this shape¶

A Python package makes it easy to adopt in agent codebases.
A sidecar keeps the product usable by non-Python entry points.
A dashboard provides the “why did the agent do that?” moment.

Distribution¶

publish the package to PyPI
ship the sidecar and dashboard via a simple local dev launcher
keep cloud deployment optional and out of scope for V1

6. Architecture overview¶

High-level runtime model¶

Agent runtime / CLI / app
        |
        v
Context layer SDK or local sidecar
  - context compiler
  - memory manager
  - policy engine
  - recorder
  - replay engine
        |
        v
Provider adapters + tool adapters
        |
        v
LLM provider, tools, files, MCP servers, retrieval systems

Two usage modes¶

Mode A: Embedded SDK mode¶

A developer imports the Python package directly into their agent code.

Best for: - custom agent backends - server-side apps - research and ops agents - custom coding agents

Mode B: Sidecar mode¶

A runtime talks to a local daemon over HTTP or IPC.

Best for: - CLI tools - local coding agent workflows - non-Python runtimes - hook-based integrations

7. Core design principles¶

Generic core, specific launch wedge
the platform must work for any agent domain
only the first demo/integration is coding-focused
Local-first
all critical capabilities work without a cloud service
Determinism
same inputs and config should produce the same compile result
Receipts over mystery
every inclusion and exclusion needs a reason
Provenance everywhere
every context item must preserve its source
Policy before convenience
scope and redaction must happen before rendering
Replayability
any important step should be rerunnable
Adapters, not lock-in
integrate with providers and runtimes through clear boundaries

8. Core capabilities¶

V1 includes six subsystems:

Context compiler
Recorder / flight recorder
Memory system
Policy engine
Provider and tool adapters
Dashboard and replay UI

Each subsystem is described below.

9. Core data model¶

Use Pydantic models for every core type. The core schema must be provider-neutral and domain-neutral.

9.1 ContextItem¶

A single possible unit of context.

class ContextItem(BaseModel):
    id: str
    kind: Literal[
        "system",
        "policy",
        "user_msg",
        "assistant_msg",
        "task",
        "constraint",
        "plan",
        "memory",
        "retrieval_doc",
        "tool_schema",
        "tool_result",
        "artifact",
        "file",
        "code_diff",
        "summary",
        "handoff",
        "other",
    ]
    content: dict | str
    source_ref: "SourceRef | None"
    created_at: datetime
    freshness_ts: datetime | None
    ttl_seconds: int | None
    trust_score: float
    importance_score: float
    sensitivity: Literal["public", "internal", "restricted", "secret"]
    scope: dict
    dependencies: list[str]
    token_estimate: int | None
    stable_hash: str
    tags: list[str]
    metadata: dict

9.2 SourceRef¶

Where a context item came from.

class SourceRef(BaseModel):
    source_type: Literal[
        "user",
        "app_state",
        "memory",
        "retrieval",
        "tool",
        "file",
        "mcp_resource",
        "policy",
        "human_approval",
        "external_api",
        "other",
    ]
    uri: str | None
    title: str | None
    version: str | None
    checksum: str | None
    retrieval_query: str | None
    positions: list[str] = []
    author: str | None
    metadata: dict = {}

9.3 MemoryRecord¶

A stored memory unit.

class MemoryRecord(BaseModel):
    id: str
    memory_type: Literal["working", "durable", "session_summary", "fact", "preference", "constraint", "artifact_index"]
    subject: str
    content: dict | str
    source_run_id: str | None
    source_step_ids: list[str]
    confidence: float
    freshness_ts: datetime | None
    expires_at: datetime | None
    tags: list[str]
    scope: dict
    metadata: dict

9.4 PolicyRule¶

class PolicyRule(BaseModel):
    id: str
    name: str
    description: str
    applies_to: dict
    effect: Literal["allow", "deny", "redact", "require_approval", "require_receipt"]
    conditions: dict
    priority: int
    metadata: dict

9.5 CompilationRequest¶

class CompilationRequest(BaseModel):
    request_id: str
    run_id: str
    step_id: str
    provider: str
    model: str
    task_summary: str
    context_items: list[ContextItem]
    tool_specs: list[dict]
    memory_records: list[MemoryRecord]
    policy_rules: list[PolicyRule]
    token_budget: int
    response_reserve: int
    compile_config: dict

9.6 CompilationDecision¶

class CompilationDecision(BaseModel):
    item_id: str
    decision: Literal["include", "exclude", "compress", "redact"]
    pass_name: str
    reason_code: str
    score_breakdown: dict
    original_tokens: int | None
    final_tokens: int | None
    notes: str | None

9.7 CompiledContext¶

class CompiledContext(BaseModel):
    id: str
    request_id: str
    provider: str
    model: str
    selected_items: list[str]
    excluded_items: list[str]
    decisions: list[CompilationDecision]
    token_allocation: dict
    stable_prefix_hash: str | None
    volatile_suffix_hash: str | None
    rendered_payload: dict

9.8 RunRecord and StepRecord¶

class RunRecord(BaseModel):
    run_id: str
    agent_name: str
    domain: str
    user_goal: str
    started_at: datetime
    ended_at: datetime | None
    status: Literal["running", "completed", "failed", "cancelled"]
    metadata: dict

class StepRecord(BaseModel):
    step_id: str
    run_id: str
    step_index: int
    step_kind: Literal["model_call", "tool_call", "memory_write", "policy_check", "human_approval", "other"]
    started_at: datetime
    ended_at: datetime | None
    status: Literal["running", "completed", "failed", "cancelled"]
    metadata: dict

9.9 ReplayRecipe¶

class ReplayRecipe(BaseModel):
    replay_id: str
    source_run_id: str
    source_step_id: str
    provider_override: str | None
    model_override: str | None
    compile_overrides: dict
    context_mutations: list[dict]
    notes: str | None

10. Context compiler¶

The compiler is the heart of the product. Its job is to turn a messy set of possible inputs into the best possible provider request.

10.1 Compiler inputs¶

Inputs may include: - latest user instruction - conversation history - task plan - retrieved documents - prior tool results - workspace files or artifacts - stored memory - policies and constraints - tool definitions - approvals and handoffs

10.2 Compiler outputs¶

The compiler must produce: - the rendered provider payload - inclusion/exclusion decisions - token allocation - stable/volatile layout metadata - receipts explaining every important decision

10.3 Compiler passes¶

Implement the compiler as explicit passes in this order:

normalize
coerce all inputs into ContextItem
validate
ensure all required fields exist
ensure items have ids, kind, and stable hash
dedupe
remove exact duplicates by stable hash
keep the highest-trust copy
dependency expansion
if an included item depends on another item, ensure the dependency is considered
conflict detection
detect contradictory items
prefer the higher trust + fresher item
keep losing items for traceability
scoring
compute a final score per item using:
- relevance to task
- freshness
- trust
- explicit pinning
- dependency pressure
- policy priority
budgeting
allocate available tokens by class
compression
summarize or compress low-priority overflow items
preserve provenance
policy
deny, redact, or require approval before render
layout
create stable and volatile sections for caching and consistency
render
produce provider-specific request payload
record
emit receipts and persist the compile result

10.4 Scoring formula¶

Use a simple configurable weighted score in V1:

final_score =
  w_relevance   * relevance_score
+ w_freshness   * freshness_score
+ w_trust       * trust_score
+ w_importance  * importance_score
+ w_dependency  * dependency_score
+ w_pin         * pin_score
- w_staleness   * staleness_penalty
- w_risk        * risk_penalty

Start with defaults in config, not hardcoded constants.

10.5 Budgeting¶

Reserve budget in this order: 1. hard constraints and latest user request 2. active task state and tool results 3. policy items 4. working memory 5. retrieval and artifacts 6. lower-priority summaries

The compiler must never drop: - the latest user request - hard constraints - unresolved blocking tool results - required policy items

10.6 Compression rules¶

Compression is allowed for: - long retrieval docs - stale conversation history - large artifact summaries - old tool results

Compression is not allowed for: - latest user instruction - active policy constraints - approvals - exact code diffs currently under discussion - raw evidence required for a decision

10.7 Redaction rules¶

Redaction happens before provider rendering.
Redacted items remain visible in the recorder with reason metadata.

Examples: - remove secret tokens or credentials - hide restricted sources from lower-trust sub-agents - strip PII before external calls

10.8 Stable vs volatile layout¶

Structure rendered context into: - stable prefix - task context - volatile suffix

Stable prefix should contain: - static system instructions - tool schemas - stable policies - persistent profile or workspace summary

Volatile suffix should contain: - latest user request - latest tool results - newly changed artifacts - ephemeral run state

This improves reproducibility and can later improve cache friendliness.

11. Memory system¶

V1 must include both working memory and durable memory.

11.1 Working memory¶

Short-lived state used during a run: - active task summary - current plan - open questions - temporary facts - recent tool outputs - current artifact map

Scope: - per run - per step chain - easy to overwrite

11.2 Durable memory¶

Longer-lived state across sessions: - user preferences - stable constraints - recurring environment facts - recurring workspace summaries - approved project conventions

Scope: - namespace by project / user / agent - explicitly writable - readable via filters

11.3 Memory read path¶

Before compile: - fetch candidate memory records - filter by scope and freshness - convert them into ContextItem - let the compiler decide whether they are included

11.4 Memory write path¶

In V1, memory writes should be controlled and explicit.

Allowed writers: - application code - approved tool outputs - human-approved summaries - explicit post-step summarizers

Do not implement unconstrained autonomous self-writing memory in V1.

11.5 Memory retention¶

Each memory record should support: - freshness timestamp - expiry time - confidence - tags - scope

Implement TTL-based cleanup and manual invalidation.

12. Policy engine¶

The policy engine determines what the agent may see or do.

12.1 Policy capabilities in V1¶

Support these effects: - allow - deny - redact - require approval - require receipt

12.2 Policy dimensions¶

Policies may depend on: - agent identity - task type - model provider - source sensitivity - tool class - environment - user or project scope

12.3 Example policies¶

A support agent may not see secrets.
A research agent may use web tools but not write to production systems.
A coding agent may run read-only shell commands automatically, but destructive commands require approval.
A low-trust summarizer sub-agent may see summaries but not raw restricted documents.

12.4 Approval flow¶

If a rule says require_approval, the step should transition into a blocked state and emit: - requested action - reason - relevant receipts - approval token when granted

V1 approval can be local/manual only.

13. Recorder / flight recorder¶

The recorder is a first-class subsystem, not a log file.

13.1 What to persist¶

Persist: - run metadata - step metadata - normalized inputs - memory reads and writes - compile requests - compile decisions - rendered provider payload - provider responses - tool calls and tool results - policy checks - approvals - errors - replay recipes

13.2 Required recorder guarantees¶

every important step has a unique id
every compile can be reconstructed
every included item has provenance
every excluded item has a reason
replay can reconstruct the original payload
sensitive payload retention is configurable

13.3 Retention modes¶

Support: - metadata_only - redacted_payloads - full_payloads

Default local mode: redacted_payloads.

13.4 Diff engine¶

The recorder must support diffs between: - step N and step N+1 - original compile and replay compile - pre- and post-compaction state

Diffs should show: - added items - removed items - compressed items - score changes - policy-caused changes - token budget changes

14. Replay engine¶

Replay is a product feature, not a test-only feature.

14.1 Exact replay¶

Use the stored compile artifact to rerun the same model call.

14.2 Mutated replay¶

Allow controlled overrides: - model change - provider change - different budget - removed memory source - changed scoring weights - changed policy

14.3 Replay outputs¶

Replay should produce: - new run/step ids - comparison to original - output diff - compile diff - notes about what changed

15. Provider adapters¶

The core must be provider-agnostic. Build adapters at the edges.

15.1 V1 provider goals¶

Support at least two major providers: - one adapter for an OpenAI-style API - one adapter for an Anthropic-style API

Do not hard-code request logic into the compiler. The compiler emits a provider-neutral intermediate result, and the adapter renders it.

15.2 Adapter interface¶

class ProviderAdapter(Protocol):
    def count_tokens(self, compiled: CompiledContext) -> int: ...
    def render_request(self, compiled: CompiledContext) -> dict: ...
    def send(self, rendered_request: dict) -> dict: ...
    def parse_response(self, response: dict) -> dict: ...

15.3 Tool handling¶

Tool definitions and results should remain provider-neutral in core types. Adapters are responsible for rendering tool schemas and parsing tool calls.

15.4 OpenAI-style adapter¶

Target the Responses API first.

Requirements: - render instructions and input cleanly from the provider-neutral compiled context - support developer-provided tools - pass through provider built-in tools when supplied - record request usage, response usage, and cache metrics when available - expose stable prefix and volatile suffix hashes in recorder metadata - support conversation continuation - keep long-run compaction helpers isolated from the core compiler

15.5 Anthropic-style adapter¶

Target the Messages API first.

Requirements: - render system, messages, and tools from the provider-neutral compiled context - preflight requests with token counting - support cache breakpoints for reusable prefix content - support streaming - validate tool-result adjacency before sending the request - keep extended-thinking support out of the default V1 path unless isolated behind a feature flag

16. Tool and integration surfaces¶

V1 should expose three integration surfaces.

16.1 Python SDK¶

Use for: - custom agent backends - server-side integrations - embedded use in applications

Core API sketch:

ctx = agentcore.start_run(agent_name="my-agent", domain="research", user_goal="...")
compiled = agentcore.compile_step(run_id=..., step_id=..., inputs=...)
response = agentcore.call_model(compiled)
agentcore.record_tool_result(...)
agentcore.end_run(...)

16.2 Sidecar daemon¶

Expose HTTP endpoints for: - start run - append step inputs - compile - call model through adapter - record tool call/result - fetch memory - write memory - replay step - query recorder

16.3 MCP bridge¶

Provide an MCP server so other runtimes can access: - memory search - context receipt lookup - replay trigger - artifact summary - policy check

Treat MCP as a thin integration bridge in V1: - resources become context candidates - tools become callable tool specs - prompts become reusable prompt assets - roots define workspace boundaries

Keep sampling, elicitation, and richer workflow behavior out of V1 unless they are needed by the golden-path demo.

17. Coding-agent-first launch wedge¶

This section defines the first launch wedge without changing the generic core.

17.1 What the wedge is¶

The first polished user experience should target repo-local agents.

Examples: - a custom coding agent built with the Python SDK - a local sidecar used by a CLI-based coding workflow - a hook-based adapter to a coding agent tool that already has a lifecycle model - MCP-based integration for an IDE or code assistant

17.2 What value should be obvious in the first demo¶

A coding-agent user should immediately get: - repo memory across sessions - better context selection across files, diffs, and test results - receipts for why files or instructions were included - replay when the agent makes a bad code change - policy checks for risky commands or write operations

17.3 Important boundary¶

Do not define the core product as “only for coding agents.”
Use coding agents to prove the product, not to define its architecture.

17.4 Coding-agent demo scenario¶

Ship one end-to-end example:

open a local repo
ask the agent to implement a feature
read files and create a plan
run tests
inspect diff
persist a workspace summary and task receipts
replay a failed step with different compile settings

That is enough to make the product legible.

17.5 Golden-path demo spec¶

Use this demo to drive the first implementation slice. It is narrower than full V1 and should be runnable before the entire platform is complete.

Demo name: repo-local failing task replay

Purpose: prove that a developer can wrap a coding agent, inspect exactly what the model saw, understand why context was included or excluded, and replay the bad step with changed compile settings.

Fixture repo: a small Python project with: - one failing test - at least three source files - one irrelevant large file or document that should be excluded - one repo convention file that should become workspace memory - one simple command for tests, such as pytest

Starting user request:

Make the failing test pass. Do not edit tests. Preserve the public API.

Required recorded inputs: - latest user request - repo summary - relevant source files - failing test output - current diff, if any - workspace convention or project note - hard constraints from the user request - tool schemas for file reads, shell commands, and patch application

Required compiler behavior: - always include the latest user request - always include the hard constraints - include the failing test output and directly relevant source files - include the workspace convention if it applies - exclude unrelated files with machine-readable reasons - preserve raw source items even when excluded or compressed - produce stable hashes for the compiled payload sections

Required recorder behavior: - store the normalized context items - store inclusion, exclusion, compression, and redaction decisions - store the rendered provider payload - store tool calls, test output, diff, token usage, and cache metadata when available - store enough information to replay the model step without reconstructing state from logs

Required dashboard path: 1. Runs list shows the demo run, status, model, token usage, and timestamp. 2. Step detail shows included and excluded context items with reasons. 3. Diff view shows what changed between the planning step and patch step. 4. Replay view reruns the failed or pre-patch step with at least one changed compile setting, such as token budget or removed context source.

Done when: - a fresh checkout can run the demo from one command - the sidecar and dashboard start locally with no cloud dependency - the dashboard can answer "what did the model know and why?" for the patch step - exact replay works for the recorded step - one mutated replay changes the compiled context and records a comparison - the demo leaves behind a durable workspace summary or convention with provenance - the implementation remains generic outside the demo adapter and example files

Treat this as the first build milestone. Later V1 phases should deepen the same flow instead of adding unrelated features first.

18. Repository layout¶

18.1 Implementation stack defaults¶

Use: - Python for the compiler, memory, policy, recorder, adapters, SDK, and examples - Pydantic for core schemas - Pytest for deterministic unit and integration tests - FastAPI for the local sidecar API - SQLite plus local JSON/JSONL files for local storage - React or Next.js for the local dashboard - Postgres and object storage only behind team-ready storage interfaces - OTLP/OpenTelemetry export as an adapter, not as the recorder's source of truth

Optimize first for clarity, determinism, and debuggability.

18.2 Monorepo shape¶

Use a monorepo.

This is the preserved baseline repository shape for the first implementation:

repo/
  README.md
  docs/
  packages/
    context_core/
    sdk/
    provider_openai/
    provider_anthropic/
    recorder/
    mcp_bridge/
    otel_export/
  services/
    api/
  apps/
    dashboard/
  examples/
    openai_context_demo/
    anthropic_coding_agent/
    langgraph_wrapper/
    mcp_ingest_demo/
  tests/
    unit/
    integration/
    fixtures/

18.3 Package responsibilities¶

context_core/: provider-neutral schemas, compiler passes, scoring, budgeting, cache layout, memory interfaces, policy interfaces, and replay recipes.
sdk/: customer-facing Python SDK and installable CLI that wrap the internal packages into the supported adoption surface.
provider_openai/: OpenAI-style adapter, rendering, usage capture, cache metadata, and example integration helpers.
provider_anthropic/: Anthropic-style adapter, rendering, token preflight, cache breakpoints, and tool-order validation.
recorder/: run and step persistence, receipts, payload snapshots, local SQLite/file storage, diffs, and replay artifacts.
mcp_bridge/: MCP resources, tools, prompts, and roots mapped into context-layer primitives.
otel_export/: optional exporter that maps internal recorder events to OTLP/OpenTelemetry without making OTLP the internal source of truth.
services/api/: FastAPI sidecar for compile, record, memory, policy, replay, and dashboard data access.
apps/dashboard/: local dashboard for runs, steps, receipts, diffs, memory, and replay.
examples/openai_context_demo/: first one-shot OpenAI context demo.
examples/anthropic_coding_agent/: Anthropic-style coding-agent demo.
examples/langgraph_wrapper/: wrapper example for an existing agent runtime.
examples/mcp_ingest_demo/: MCP ingestion demo.

18.4 Later package split¶

The baseline shape keeps the core cohesive while the product is young. If context_core/ becomes too large, split it later along these boundaries:

packages/
  core/
  compiler/
  memory/
  policy/
  replay/

Do not start with this split unless the codebase pressure is real. The initial goal is a legible, shippable compiler/recorder package, not maximum package granularity.

19. Storage design¶

Support two storage modes.

19.1 Local mode¶

Use: - SQLite for structured metadata - local JSON or JSONL blobs for payload snapshots - local filesystem for artifacts

19.2 Team-ready mode¶

Design the interfaces so they can later support: - Postgres - object storage - Redis or queue for jobs

Do not implement hosted team mode in V1, but keep interfaces clean.

19.3 Required tables / collections¶

At minimum persist: - runs - steps - context_items - memory_records - compilation_requests - compilation_results - tool_events - policy_events - replay_runs

20. Sidecar API¶

Use FastAPI.

Endpoints¶

POST /v1/runs/start
POST /v1/runs/{run_id}/steps
POST /v1/compile
POST /v1/model/call
POST /v1/tools/result
POST /v1/memory/read
POST /v1/memory/write
POST /v1/policy/check
GET /v1/runs
GET /v1/runs/{run_id}
GET /v1/steps/{step_id}
GET /v1/compilations/{id}
GET /v1/diff/steps/{left}/{right}
POST /v1/replay
GET /v1/replay/{replay_id}

Keep payloads simple JSON with explicit schema versions.

21. Dashboard requirements¶

The dashboard is local-first and must answer: - what did the agent know? - why did it know that? - what changed between steps? - what memory was used or written? - what tool action happened? - how can I replay this?

Screens¶

21.1 Runs list¶

Show: - run id - agent - domain - status - start time - end time - step count

21.2 Run detail¶

Show: - timeline of steps - tool actions - memory writes - failures - compile count

21.3 Step detail¶

Show: - included items - excluded items - scores - reasons - token allocation - rendered request preview - response preview

21.4 Diff view¶

Show: - additions/removals - compressed items - redactions - policy changes - budget changes

21.5 Memory explorer¶

Show: - working memory - durable memory - source runs - confidence and freshness - invalidation controls

21.6 Replay view¶

Allow: - exact replay - replay with modified config - compare outputs

22. Build order¶

Build in this order.

Phase 0: bootstrap¶

repo setup
packaging
formatting
typing
CI
test harness

Phase 1: core schemas and utilities¶

Pydantic models
stable hashing
ids and timestamps
serialization tests

Phase 2: compiler v1¶

normalize
dedupe
conflict detection
scoring
budgeting
compression
policy pass
provider-neutral output

Phase 3: recorder and storage¶

run and step persistence
compile event storage
payload snapshots
diff engine

Phase 4: memory system¶

working memory
durable memory
read/write APIs
TTL cleanup

Phase 5: provider adapters¶

adapter 1
adapter 2
token counting hooks
tool rendering

Phase 6: sidecar API¶

FastAPI service
local storage integration
replay endpoints

Phase 7: dashboard¶

runs view
step view
diff view
replay view
memory explorer

Phase 8: examples¶

generic agent example
coding-agent repo example
research agent example

Phase 9: MCP bridge¶

expose memory read
expose receipts
expose replay trigger

23. Test plan¶

Write tests early and keep them deterministic.

Unit tests¶

context item hashing
dedupe
conflict resolution
score calculation
budget allocation
compression rules
redaction rules
policy rule evaluation
memory read/write
replay recipe construction

Integration tests¶

compile -> render -> provider adapter request
exact replay of a stored step
diff output between steps
sidecar API roundtrip
dashboard data queries
coding-agent example workflow
research-agent example workflow

Invariant tests¶

latest user request is always included
hard constraints are never dropped
every excluded item has a reason
every included item has provenance
same input + config => same compile hash

24. Acceptance criteria¶

V1 is done when all of the following are true:

the core works for at least two agent domains in examples
the coding-agent demo is polished and easy to understand
the data model remains generic and domain-neutral
every compile produces receipts
every stored step can be replayed
working and durable memory both function
policy checks can deny, redact, or require approval
the sidecar runs locally with no cloud dependency
the dashboard answers “what did the agent know and why?”
package installation and first run are documented clearly

25. Documentation requirements¶

Ship these docs: - quickstart - architecture overview - data model reference - sidecar API reference - writing provider adapters - writing memory policies - building a custom agent with the SDK - coding-agent demo walkthrough - research-agent demo walkthrough

The docs must make the general-vs-wedge distinction explicit.

26. Risks and mitigations¶

Risk: product feels too abstract¶

Mitigation: - ship the coding-agent demo early - use concrete before/after examples

Risk: product gets mistaken for another orchestration framework¶

Mitigation: - keep runtime integration simple - position the product as substrate, not orchestrator

Risk: memory becomes unreliable¶

Mitigation: - keep writes explicit in V1 - preserve provenance and confidence - avoid hidden autonomous learning

Risk: dashboards become the whole product¶

Mitigation: - keep the SDK and sidecar as the core - treat UI as a debugging surface, not the primary value

Risk: coding-agent wedge pollutes the generic core¶

Mitigation: - enforce domain-neutral names in core packages - keep code-specific logic in examples and adapters

27. What not to postpone by accident¶

Do not leave these as “later”: - provenance on every context item - explicit compile decisions - replay - durable memory with invalidation - policy checks - step diffs

These are the product, not polish.

28. V1 one-sentence mandate¶

Build a general-purpose context, memory, and receipts layer for AI agents, shipped as a Python package plus local sidecar and dashboard, with the first polished go-to-market wedge focused on repo-local coding agents but with a generic core that works across agent domains.