V1 Build Plan: General-Purpose Context, Memory, and Receipts Layer for Agents

Self-contained product and implementation brief

Canonical V1 source of truth. The older context-compiler-specific build plan has been consolidated into this document.

This document is a complete V1 build brief. It does not assume any prior plan or conversation. A team or coding agent should be able to use this document alone to understand what to build and how to implement it from scratch.


1. Product definition

Build a general-purpose infrastructure layer for AI agents that does four jobs:

  1. Compiles the right context for each model call.
  2. Maintains working and durable memory across steps and sessions.
  3. Records receipts for what the agent saw, why it saw it, what it did, and how it changed state.
  4. Makes failures replayable and explainable.

This product is not another orchestrator, workflow graph editor, or vector database. It is the context, memory, policy, and replay substrate that sits under agent runtimes.

Short positioning

The memory and receipts layer for AI agents.

What the product must support

The core must be generic enough to support:

  • coding agents
  • research agents
  • support agents
  • internal knowledge agents
  • workflow / operations agents
  • multi-agent systems

What “coding-agent-first” means

The product is for agents in general.
The first launch wedge is coding agents, because they are the easiest category in which to demonstrate value.

That means:

  • the core architecture stays generic
  • the first polished demo and integration target is a repo-local coding agent workflow
  • the first adapters should make it easy to use with a local workspace, files, diffs, tests, and shell tools

Do not bake code-specific assumptions into the core data model.


2. The problem this solves

Agent systems fail for recurring reasons:

  • they see the wrong context
  • they lose track of constraints
  • they include stale or conflicting evidence
  • they forget decisions made earlier in the run
  • they compact long sessions badly
  • they call tools without clear provenance
  • they are hard to debug after the fact
  • teams cannot answer “what did the agent know when it made that decision?”

Today, developers often solve this with ad hoc prompt construction, one-off memory hacks, logging, and framework-specific state management. The result is fragile, opaque, and hard to reproduce.

This product fixes that by making context selection, memory persistence, and execution receipts explicit and inspectable.


3. V1 goals

V1 should deliver a usable product with local-first deployment.

Primary goals

  • Provide a context compiler that selects, compresses, and renders context for each model call.
  • Provide a flight recorder that stores what was included, excluded, changed, and executed.
  • Provide a working memory layer and a durable memory layer.
  • Provide a policy layer that scopes what the agent may see or do.
  • Provide replay for exact reruns of prior steps.
  • Provide provider adapters for at least two major LLM providers.
  • Provide a developer-facing dashboard for receipts, diffs, and replay.
  • Launch with a coding-agent wedge: polished examples for repo-local agents.

V1 success criteria

A developer should be able to:

  • integrate the package into an agent in under 30 minutes
  • inspect exactly what context was sent for each step
  • see why each context item was included, excluded, compressed, or redacted
  • replay a failed step exactly
  • persist and reuse memory across sessions
  • use the product locally without any hosted dependency


4. Non-goals

Do not build these in V1:

  • a visual workflow/DAG builder
  • a new orchestration framework
  • a hosted multi-tenant SaaS control plane
  • enterprise billing and account management
  • a vector database
  • autonomous memory learning that writes to long-term memory without developer control
  • a cross-provider eval platform
  • a generic IDE
  • a replacement for existing agent runtimes

The product should integrate with runtimes, not replace them.


5. Product shape

Ship V1 as three pieces:

  1. Python package
    • the main adoption wedge
    • includes compiler, memory, policy, recorder, adapters, and SDK

  2. Local sidecar daemon
    • optional but recommended
    • exposes HTTP and local IPC for integrations
    • centralizes state, replay, and dashboard data

  3. Local dashboard
    • shows runs, steps, diffs, receipts, memory, and replay controls

Why this shape

  • A Python package makes it easy to adopt in agent codebases.
  • A sidecar keeps the product usable by non-Python entry points.
  • A dashboard provides the “why did the agent do that?” moment.

Distribution

  • publish the package to PyPI
  • ship the sidecar and dashboard via a simple local dev launcher
  • keep cloud deployment optional and out of scope for V1

6. Architecture overview

High-level runtime model

Agent runtime / CLI / app
        |
        v
Context layer SDK or local sidecar
  - context compiler
  - memory manager
  - policy engine
  - recorder
  - replay engine
        |
        v
Provider adapters + tool adapters
        |
        v
LLM provider, tools, files, MCP servers, retrieval systems

Two usage modes

Mode A: Embedded SDK mode

A developer imports the Python package directly into their agent code.

Best for:

  • custom agent backends
  • server-side apps
  • research and ops agents
  • custom coding agents

Mode B: Sidecar mode

A runtime talks to a local daemon over HTTP or IPC.

Best for:

  • CLI tools
  • local coding agent workflows
  • non-Python runtimes
  • hook-based integrations


7. Core design principles

  1. Generic core, specific launch wedge
    • the platform must work for any agent domain
    • only the first demo/integration is coding-focused

  2. Local-first
    • all critical capabilities work without a cloud service

  3. Determinism
    • same inputs and config should produce the same compile result

  4. Receipts over mystery
    • every inclusion and exclusion needs a reason

  5. Provenance everywhere
    • every context item must preserve its source

  6. Policy before convenience
    • scope and redaction must happen before rendering

  7. Replayability
    • any important step should be rerunnable

  8. Adapters, not lock-in
    • integrate with providers and runtimes through clear boundaries

8. Core capabilities

V1 includes six subsystems:

  1. Context compiler
  2. Recorder / flight recorder
  3. Memory system
  4. Policy engine
  5. Provider and tool adapters
  6. Dashboard and replay UI

Each subsystem is described below.


9. Core data model

Use Pydantic models for every core type. The core schema must be provider-neutral and domain-neutral.

9.1 ContextItem

A single possible unit of context.

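# Imports shared by the schema sketches in this section.
from datetime import datetime
from typing import Literal

from pydantic import BaseModel


class ContextItem(BaseModel):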
    id: str
    kind: Literal[
        "system",
        "policy",
        "user_msg",
        "assistant_msg",
        "task",
        "constraint",
        "plan",
        "memory",
        "retrieval_doc",
        "tool_schema",
        "tool_result",
        "artifact",
        "file",
        "code_diff",
        "summary",
        "handoff",
        "other",
    ]
    content: dict | str
    source_ref: "SourceRef | None"
    created_at: datetime
    freshness_ts: datetime | None
    ttl_seconds: int | None
    trust_score: float
    importance_score: float
    sensitivity: Literal["public", "internal", "restricted", "secret"]
    scope: dict
    dependencies: list[str]
    token_estimate: int | None
    stable_hash: str
    tags: list[str]
    metadata: dict
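
The stable_hash field drives dedupe and deterministic compiles, so it helps to see one way it could be derived. The sketch below is only an illustration: the function name compute_stable_hash and the choice of which fields feed the hash (kind, content, and source URI) are assumptions, not something this brief specifies.

```python
import hashlib
import json
from typing import Optional, Union


def compute_stable_hash(
    kind: str,
    content: Union[dict, str],
    source_uri: Optional[str] = None,
) -> str:
    """Hash the identity-bearing fields of an item (the field choice is an assumption)."""
    canonical = json.dumps(
        {"kind": kind, "content": content, "source_uri": source_uri},
        sort_keys=True,         # key order must not change the hash
        separators=(",", ":"),  # no whitespace variation between runs
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


first = compute_stable_hash("file", {"path": "src/app.py", "text": "print('hi')"})
reordered = compute_stable_hash("file", {"text": "print('hi')", "path": "src/app.py"})
edited = compute_stable_hash("file", {"path": "src/app.py", "text": "print('bye')"})
```

Here first and reordered are equal because the JSON is canonicalized, while edited differs because the content changed. Volatile fields such as created_at are deliberately left out of the hash in this sketch.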

9.2 SourceRef

Where a context item came from.

class SourceRef(BaseModel):
    source_type: Literal[
        "user",
        "app_state",
        "memory",
        "retrieval",
        "tool",
        "file",
        "mcp_resource",
        "policy",
        "human_approval",
        "external_api",
        "other",
    ]
    uri: str | None
    title: str | None
    version: str | None
    checksum: str | None
    retrieval_query: str | None
    positions: list[str] = []
    author: str | None
    metadata: dict = {}

9.3 MemoryRecord

A stored memory unit.

class MemoryRecord(BaseModel):
    id: str
    memory_type: Literal["working", "durable", "session_summary", "fact", "preference", "constraint", "artifact_index"]
    subject: str
    content: dict | str
    source_run_id: str | None
    source_step_ids: list[str]
    confidence: float
    freshness_ts: datetime | None
    expires_at: datetime | None
    tags: list[str]
    scope: dict
    metadata: dict

9.4 PolicyRule

class PolicyRule(BaseModel):
    id: str
    name: str
    description: str
    applies_to: dict
    effect: Literal["allow", "deny", "redact", "require_approval", "require_receipt"]
    conditions: dict
    priority: int
    metadata: dict

9.5 CompilationRequest

class CompilationRequest(BaseModel):
    request_id: str
    run_id: str
    step_id: str
    provider: str
    model: str
    task_summary: str
    context_items: list[ContextItem]
    tool_specs: list[dict]
    memory_records: list[MemoryRecord]
    policy_rules: list[PolicyRule]
    token_budget: int
    response_reserve: int
    compile_config: dict

9.6 CompilationDecision

class CompilationDecision(BaseModel):
    item_id: str
    decision: Literal["include", "exclude", "compress", "redact"]
    pass_name: str
    reason_code: str
    score_breakdown: dict
    original_tokens: int | None
    final_tokens: int | None
    notes: str | None

9.7 CompiledContext

class CompiledContext(BaseModel):
    id: str
    request_id: str
    provider: str
    model: str
    selected_items: list[str]
    excluded_items: list[str]
    decisions: list[CompilationDecision]
    token_allocation: dict
    stable_prefix_hash: str | None
    volatile_suffix_hash: str | None
    rendered_payload: dict

9.8 RunRecord and StepRecord

class RunRecord(BaseModel):
    run_id: str
    agent_name: str
    domain: str
    user_goal: str
    started_at: datetime
    ended_at: datetime | None
    status: Literal["running", "completed", "failed", "cancelled"]
    metadata: dict

class StepRecord(BaseModel):
    step_id: str
    run_id: str
    step_index: int
    step_kind: Literal["model_call", "tool_call", "memory_write", "policy_check", "human_approval", "other"]
    started_at: datetime
    ended_at: datetime | None
    status: Literal["running", "completed", "failed", "cancelled"]
    metadata: dict

9.9 ReplayRecipe

class ReplayRecipe(BaseModel):
    replay_id: str
    source_run_id: str
    source_step_id: str
    provider_override: str | None
    model_override: str | None
    compile_overrides: dict
    context_mutations: list[dict]
    notes: str | None

10. Context compiler

The compiler is the heart of the product. Its job is to turn a messy set of possible inputs into the best possible provider request.

10.1 Compiler inputs

Inputs may include:

  • latest user instruction
  • conversation history
  • task plan
  • retrieved documents
  • prior tool results
  • workspace files or artifacts
  • stored memory
  • policies and constraints
  • tool definitions
  • approvals and handoffs

10.2 Compiler outputs

The compiler must produce:

  • the rendered provider payload
  • inclusion/exclusion decisions
  • token allocation
  • stable/volatile layout metadata
  • receipts explaining every important decision

10.3 Compiler passes

Implement the compiler as explicit passes in this order:

  1. normalize
    • coerce all inputs into ContextItem

  2. validate
    • ensure all required fields exist
    • ensure items have ids, kind, and stable hash

  3. dedupe
    • remove exact duplicates by stable hash
    • keep the highest-trust copy

  4. dependency expansion
    • if an included item depends on another item, ensure the dependency is considered

  5. conflict detection
    • detect contradictory items
    • prefer the higher trust + fresher item
    • keep losing items for traceability

  6. scoring
    • compute a final score per item using:
      • relevance to task
      • freshness
      • trust
      • explicit pinning
      • dependency pressure
      • policy priority

  7. budgeting
    • allocate available tokens by class

  8. compression
    • summarize or compress low-priority overflow items
    • preserve provenance

  9. policy
    • deny, redact, or require approval before render

  10. layout
    • create stable and volatile sections for caching and consistency

  11. render
    • produce provider-specific request payload

  12. record
    • emit receipts and persist the compile result
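
One way to keep the passes explicit is to model each pass as a function that takes the current items plus a shared decision log. The sketch below shows only the dedupe pass and a tiny pass runner. Item, Decision, run_passes, and the reason code DUPLICATE_LOWER_TRUST are hypothetical stand-ins for illustration, and plain dataclasses are used instead of the Pydantic models to keep the example dependency-free.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    id: str
    stable_hash: str
    trust_score: float


@dataclass
class Decision:
    item_id: str
    decision: str
    pass_name: str
    reason_code: str


def dedupe(items, decisions):
    """Keep one item per stable_hash, preferring the highest trust_score."""
    best = {}
    for item in items:
        current = best.get(item.stable_hash)
        # Strict comparison: on a trust tie the earliest copy wins.
        if current is None or item.trust_score > current.trust_score:
            best[item.stable_hash] = item
    kept_ids = {item.id for item in best.values()}
    for item in items:
        if item.id not in kept_ids:
            decisions.append(
                Decision(item.id, "exclude", "dedupe", "DUPLICATE_LOWER_TRUST")
            )
    # Preserve input order so the pass stays deterministic.
    return [item for item in items if item.id in kept_ids]


def run_passes(items, passes):
    """Run passes in order; every pass appends its receipts to one log."""
    decisions = []
    for compiler_pass in passes:
        items = compiler_pass(items, decisions)
    return items, decisions


items = [Item("a", "h1", 0.4), Item("b", "h1", 0.9), Item("c", "h2", 0.5)]
kept, decisions = run_passes(items, [dedupe])
```

In this run, item a loses to item b because both share hash h1 and b has higher trust. The excluded item still produces a decision record, which is what lets the recorder explain the exclusion later.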

10.4 Scoring formula

Use a simple configurable weighted score in V1:

final_score =
  w_relevance   * relevance_score
+ w_freshness   * freshness_score
+ w_trust       * trust_score
+ w_importance  * importance_score
+ w_dependency  * dependency_score
+ w_pin         * pin_score
- w_staleness   * staleness_penalty
- w_risk        * risk_penalty

Start with defaults in config, not hardcoded constants.
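
A minimal sketch of the weighted score, assuming every signal is already normalized to the range 0 to 1. The weight values below are placeholders chosen for illustration only; the brief does not prescribe defaults. The function also returns the per-term breakdown, which could populate score_breakdown on a CompilationDecision.

```python
# Placeholder weights; real defaults would live in config.
DEFAULT_WEIGHTS = {
    "relevance": 0.35,
    "freshness": 0.15,
    "trust": 0.15,
    "importance": 0.15,
    "dependency": 0.10,
    "pin": 0.10,
    "staleness": 0.10,
    "risk": 0.10,
}

# Terms that are subtracted in the formula above.
PENALTY_TERMS = {"staleness", "risk"}


def final_score(signals, weights):
    """Return (score, breakdown); missing signals count as 0.0."""
    breakdown = {}
    for name, weight in weights.items():
        sign = -1.0 if name in PENALTY_TERMS else 1.0
        breakdown[name] = sign * weight * signals.get(name, 0.0)
    return sum(breakdown.values()), breakdown


score, breakdown = final_score(
    {"relevance": 1.0, "trust": 0.5, "staleness": 0.5},
    DEFAULT_WEIGHTS,
)
```

With these placeholder weights, the example combines a relevance term of 0.35, a trust term of 0.075, and a staleness penalty of 0.05.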

10.5 Budgeting

Reserve budget in this order:

  1. hard constraints and latest user request
  2. active task state and tool results
  3. policy items
  4. working memory
  5. retrieval and artifacts
  6. lower-priority summaries

The compiler must never drop:

  • the latest user request
  • hard constraints
  • unresolved blocking tool results
  • required policy items
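
The reservation order and the never-drop rule can be sketched as a greedy allocator. This is an illustration under stated assumptions: the class names in RESERVATION_ORDER, the Candidate type, the required flag, and BudgetError are all hypothetical, and a real budgeter would likely hand overflow to the compression pass instead of excluding it outright.

```python
from dataclasses import dataclass

# Hypothetical class names mirroring the reservation order above.
RESERVATION_ORDER = [
    "constraints_and_user_request",
    "task_state_and_tool_results",
    "policy",
    "working_memory",
    "retrieval_and_artifacts",
    "summaries",
]


@dataclass(frozen=True)
class Candidate:
    id: str
    budget_class: str
    tokens: int
    required: bool = False  # never-drop items, e.g. the latest user request


class BudgetError(Exception):
    """Raised when a never-drop item cannot fit in the budget."""


def allocate(candidates, token_budget, response_reserve):
    remaining = token_budget - response_reserve
    included, excluded = [], []
    # Required items are reserved first; failing to fit one is an error, not a drop.
    for candidate in candidates:
        if candidate.required:
            if candidate.tokens > remaining:
                raise BudgetError(f"required item {candidate.id} does not fit")
            remaining -= candidate.tokens
            included.append(candidate.id)
    rank = {name: index for index, name in enumerate(RESERVATION_ORDER)}
    optional = sorted(
        (c for c in candidates if not c.required),
        key=lambda c: rank[c.budget_class],  # stable sort keeps input order per class
    )
    for candidate in optional:
        if candidate.tokens <= remaining:
            remaining -= candidate.tokens
            included.append(candidate.id)
        else:
            excluded.append(candidate.id)
    return included, excluded, remaining


candidates = [
    Candidate("user", "constraints_and_user_request", 50, required=True),
    Candidate("tests", "task_state_and_tool_results", 300),
    Candidate("policy", "policy", 100, required=True),
    Candidate("doc", "retrieval_and_artifacts", 800),
    Candidate("summary", "summaries", 100),
]
included, excluded, remaining = allocate(candidates, token_budget=1000, response_reserve=400)
```

In this example the large retrieval document is excluded while the smaller summary still fits. Raising an error for an oversized required item is a design choice in this sketch: it surfaces the problem instead of silently violating the never-drop rule.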

10.6 Compression rules

Compression is allowed for:

  • long retrieval docs
  • stale conversation history
  • large artifact summaries
  • old tool results

Compression is not allowed for:

  • latest user instruction
  • active policy constraints
  • approvals
  • exact code diffs currently under discussion
  • raw evidence required for a decision

10.7 Redaction rules

Redaction happens before provider rendering.
Redacted items remain visible in the recorder with reason metadata.

Examples:

  • remove secret tokens or credentials
  • hide restricted sources from lower-trust sub-agents
  • strip PII before external calls
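
A minimal sketch of a redaction step that runs before rendering and emits reason metadata for the recorder. The two regular expressions are illustrative only and are not a vetted secret or PII detector; the reason codes and the redact function name are assumptions.

```python
import re

# Illustrative patterns only; a real deployment needs a vetted detector.
REDACTION_PATTERNS = {
    "API_KEY": re.compile(r"\b(?:sk|pk)-[A-Za-z0-9]{16,}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def redact(text):
    """Return (redacted_text, receipts) without mutating the caller's original."""
    receipts = []
    for reason_code, pattern in REDACTION_PATTERNS.items():
        text, count = pattern.subn(f"[REDACTED:{reason_code}]", text)
        if count:
            receipts.append(
                {"decision": "redact", "reason_code": reason_code, "count": count}
            )
    return text, receipts


original = "token sk-abcdefghijklmnop1234 owner dev@example.com"
rendered, receipts = redact(original)
```

The caller keeps original for the recorder, subject to the configured retention mode, and sends only rendered to the provider adapter.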

10.8 Stable vs volatile layout

Structure rendered context into:

  • stable prefix
  • task context
  • volatile suffix

Stable prefix should contain:

  • static system instructions
  • tool schemas
  • stable policies
  • persistent profile or workspace summary

Volatile suffix should contain:

  • latest user request
  • latest tool results
  • newly changed artifacts
  • ephemeral run state

This improves reproducibility and can later improve cache friendliness.
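
The layout pass can be sketched as a partition followed by one hash per section. The mapping from item kind to section below is a simplifying assumption: a real layout pass would also need to tell the latest tool result apart from older ones, which kind alone cannot do.

```python
import hashlib
import json

# Assumed kind-to-section mapping, for illustration only.
STABLE_KINDS = {"system", "tool_schema", "policy"}
VOLATILE_KINDS = {"user_msg", "tool_result"}


def section_hash(items):
    """Hash a section from canonical JSON so equal content gives equal hashes."""
    payload = json.dumps(items, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def layout(items):
    stable = [i for i in items if i["kind"] in STABLE_KINDS]
    volatile = [i for i in items if i["kind"] in VOLATILE_KINDS]
    task = [
        i for i in items
        if i["kind"] not in STABLE_KINDS and i["kind"] not in VOLATILE_KINDS
    ]
    return {
        "sections": stable + task + volatile,
        "stable_prefix_hash": section_hash(stable),
        "volatile_suffix_hash": section_hash(volatile),
    }


shared = [
    {"kind": "system", "content": "You are a careful coding agent."},
    {"kind": "tool_schema", "content": {"name": "read_file"}},
    {"kind": "file", "content": "def add(a, b): return a + b"},
]
step_one = layout(shared + [{"kind": "user_msg", "content": "Make the test pass."}])
step_two = layout(shared + [{"kind": "user_msg", "content": "Now add a docstring."}])
```

Across the two steps the stable prefix hash is unchanged while the volatile suffix hash differs, which is the property a prefix cache or a replay diff would rely on.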


11. Memory system

V1 must include both working memory and durable memory.

11.1 Working memory

Short-lived state used during a run:

  • active task summary
  • current plan
  • open questions
  • temporary facts
  • recent tool outputs
  • current artifact map

Scope:

  • per run
  • per step chain
  • easy to overwrite

11.2 Durable memory

Longer-lived state across sessions:

  • user preferences
  • stable constraints
  • recurring environment facts
  • recurring workspace summaries
  • approved project conventions

Scope:

  • namespace by project / user / agent
  • explicitly writable
  • readable via filters

11.3 Memory read path

Before compile:

  • fetch candidate memory records
  • filter by scope and freshness
  • convert them into ContextItem
  • let the compiler decide whether they are included
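
The read path above can be sketched as a filter followed by a conversion step. Records are plain dicts here to keep the example dependency-free. The exact-match scope rule, the max_age parameter, the memory:// URI scheme, and reusing confidence as the item's trust_score are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def scope_matches(record_scope, query_scope):
    """Assumed rule: every key in the query must match the record exactly."""
    return all(record_scope.get(key) == value for key, value in query_scope.items())


def memory_candidates(records, query_scope, now, max_age):
    """Filter by scope, expiry, and freshness, then shape records as context items."""
    candidates = []
    for record in records:
        if not scope_matches(record["scope"], query_scope):
            continue
        expires_at = record.get("expires_at")
        if expires_at is not None and expires_at <= now:
            continue
        freshness_ts = record.get("freshness_ts")
        if freshness_ts is not None and now - freshness_ts > max_age:
            continue
        candidates.append({
            "id": f"mem:{record['id']}",
            "kind": "memory",
            "content": record["content"],
            "trust_score": record["confidence"],  # assumed mapping
            "source_ref": {"source_type": "memory", "uri": f"memory://{record['id']}"},
        })
    return candidates


now = datetime(2025, 1, 10, tzinfo=timezone.utc)
records = [
    {"id": "m1", "scope": {"project": "demo"}, "content": "Run tests with pytest",
     "confidence": 0.9, "freshness_ts": now - timedelta(days=1), "expires_at": None},
    {"id": "m2", "scope": {"project": "other"}, "content": "Unrelated note",
     "confidence": 0.8, "freshness_ts": now, "expires_at": None},
    {"id": "m3", "scope": {"project": "demo"}, "content": "Expired temporary fact",
     "confidence": 0.7, "freshness_ts": now, "expires_at": now - timedelta(hours=1)},
]
candidates = memory_candidates(records, {"project": "demo"}, now, timedelta(days=30))
```

Only m1 survives the filters. The function returns candidates and nothing more: whether a memory item is actually included remains the compiler's decision.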

11.4 Memory write path

In V1, memory writes should be controlled and explicit.

Allowed writers:

  • application code
  • approved tool outputs
  • human-approved summaries
  • explicit post-step summarizers

Do not implement unconstrained autonomous self-writing memory in V1.

11.5 Memory retention

Each memory record should support:

  • freshness timestamp
  • expiry time
  • confidence
  • tags
  • scope

Implement TTL-based cleanup and manual invalidation.
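
A minimal sketch of TTL cleanup and manual invalidation against local SQLite storage. The column layout is an assumption, and timestamps are stored as UTC ISO 8601 strings in one fixed format so that string comparison matches time order. Hard deletion is used here for brevity; a real implementation might prefer a soft-delete marker so that old receipts can still resolve the memory they cite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal columns; a real table would carry the full MemoryRecord.
conn.execute(
    "CREATE TABLE memory_records (id TEXT PRIMARY KEY, subject TEXT NOT NULL, expires_at TEXT)"
)


def cleanup_expired(connection, now_iso):
    """Delete records whose expiry time has passed; returns the number removed."""
    cursor = connection.execute(
        "DELETE FROM memory_records WHERE expires_at IS NOT NULL AND expires_at <= ?",
        (now_iso,),
    )
    connection.commit()
    return cursor.rowcount


def invalidate(connection, record_id):
    """Manually remove one record; returns True if it existed."""
    cursor = connection.execute("DELETE FROM memory_records WHERE id = ?", (record_id,))
    connection.commit()
    return cursor.rowcount == 1


conn.executemany(
    "INSERT INTO memory_records VALUES (?, ?, ?)",
    [
        ("m1", "test command", None),                       # never expires
        ("m2", "temporary fact", "2025-01-01T00:00:00Z"),   # already expired
        ("m3", "stale convention", "2025-06-01T00:00:00Z"), # invalidated by hand
    ],
)
removed = cleanup_expired(conn, "2025-03-01T00:00:00Z")
invalidated = invalidate(conn, "m3")
remaining = [row[0] for row in conn.execute("SELECT id FROM memory_records ORDER BY id")]
```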


12. Policy engine

The policy engine determines what the agent may see or do.

12.1 Policy capabilities in V1

Support these effects:

  • allow
  • deny
  • redact
  • require approval
  • require receipt

12.2 Policy dimensions

Policies may depend on:

  • agent identity
  • task type
  • model provider
  • source sensitivity
  • tool class
  • environment
  • user or project scope

12.3 Example policies

  • A support agent may not see secrets.
  • A research agent may use web tools but not write to production systems.
  • A coding agent may run read-only shell commands automatically, but destructive commands require approval.
  • A low-trust summarizer sub-agent may see summaries but not raw restricted documents.
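
The coding-agent shell example above can be sketched as a small rule evaluator. Several choices here are assumptions the brief does not make: conditions are exact-match key/value pairs, a larger priority number wins, and the default when no rule matches is deny. The Rule type is a pared-down stand-in for PolicyRule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    id: str
    effect: str
    conditions: dict
    priority: int


def evaluate(rules, attributes, default="deny"):
    """Return (effect, rule_id) for the highest-priority matching rule."""
    matches = [
        rule for rule in rules
        if all(attributes.get(key) == value for key, value in rule.conditions.items())
    ]
    if not matches:
        return default, None  # assumed fail-closed default
    winner = max(matches, key=lambda rule: rule.priority)
    return winner.effect, winner.id


rules = [
    Rule("shell-auto", "allow", {"agent": "coding", "tool_class": "shell"}, priority=10),
    Rule("shell-destructive", "require_approval",
         {"tool_class": "shell", "destructive": True}, priority=20),
]

read_only = evaluate(rules, {"agent": "coding", "tool_class": "shell", "destructive": False})
destructive = evaluate(rules, {"agent": "coding", "tool_class": "shell", "destructive": True})
unmatched = evaluate(rules, {"agent": "support", "tool_class": "web"})
```

Returning the winning rule id alongside the effect gives the recorder something concrete to store as the reason for a policy decision.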

12.4 Approval flow

If a rule says require_approval, the step should transition into a blocked state and emit:

  • requested action
  • reason
  • relevant receipts
  • approval token when granted

V1 approval can be local/manual only.


13. Recorder / flight recorder

The recorder is a first-class subsystem, not a log file.

13.1 What to persist

Persist:

  • run metadata
  • step metadata
  • normalized inputs
  • memory reads and writes
  • compile requests
  • compile decisions
  • rendered provider payload
  • provider responses
  • tool calls and tool results
  • policy checks
  • approvals
  • errors
  • replay recipes

13.2 Required recorder guarantees

  • every important step has a unique id
  • every compile can be reconstructed
  • every included item has provenance
  • every excluded item has a reason
  • replay can reconstruct the original payload
  • sensitive payload retention is configurable

13.3 Retention modes

Support:

  • metadata_only
  • redacted_payloads
  • full_payloads

Default local mode: redacted_payloads.

13.4 Diff engine

The recorder must support diffs between:

  • step N and step N+1
  • original compile and replay compile
  • pre- and post-compaction state

Diffs should show:

  • added items
  • removed items
  • compressed items
  • score changes
  • policy-caused changes
  • token budget changes
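
A compile diff can be sketched as set arithmetic over per-item decisions. The input shape (item id mapped to a decision and final token count) is an assumption that loosely mirrors CompilationDecision, and the sketch covers only added, removed, and newly compressed items plus the token delta.

```python
def diff_compiles(left, right):
    """Compare two compiles given {item_id: {"decision": ..., "final_tokens": ...}}."""

    def in_payload(decisions):
        # Anything not excluded reaches the rendered payload in some form.
        return {item_id for item_id, d in decisions.items() if d["decision"] != "exclude"}

    left_in, right_in = in_payload(left), in_payload(right)
    shared = left_in & right_in
    return {
        "added": sorted(right_in - left_in),
        "removed": sorted(left_in - right_in),
        "newly_compressed": sorted(
            item_id for item_id in shared
            if right[item_id]["decision"] == "compress"
            and left[item_id]["decision"] != "compress"
        ),
        "token_delta": (
            sum(right[i]["final_tokens"] for i in right_in)
            - sum(left[i]["final_tokens"] for i in left_in)
        ),
    }


original = {
    "user": {"decision": "include", "final_tokens": 40},
    "doc": {"decision": "include", "final_tokens": 900},
    "notes": {"decision": "exclude", "final_tokens": 0},
}
replayed = {
    "user": {"decision": "include", "final_tokens": 40},
    "doc": {"decision": "compress", "final_tokens": 200},
    "notes": {"decision": "include", "final_tokens": 120},
}
result = diff_compiles(original, replayed)
```

In this example the replay compresses doc, which frees enough budget to include notes, and the payload shrinks by 580 tokens overall.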


14. Replay engine

Replay is a product feature, not a test-only feature.

14.1 Exact replay

Use the stored compile artifact to rerun the same model call.

14.2 Mutated replay

Allow controlled overrides:

  • model change
  • provider change
  • different budget
  • removed memory source
  • changed scoring weights
  • changed policy

14.3 Replay outputs

Replay should produce:

  • new run/step ids
  • comparison to original
  • output diff
  • compile diff
  • notes about what changed
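
A mutated replay can be sketched as a pure function from a stored compile request and a replay recipe to a new request. The brief does not define the shape of context_mutations, so the {"op": "remove_item", "item_id": ...} form below is hypothetical, as is the replay- prefix on the new request id.

```python
import copy
import uuid


def build_replay_request(stored_request, recipe):
    """Apply recipe overrides to a deep copy so the stored artifact stays intact."""
    request = copy.deepcopy(stored_request)
    if recipe.get("provider_override"):
        request["provider"] = recipe["provider_override"]
    if recipe.get("model_override"):
        request["model"] = recipe["model_override"]
    request["compile_config"].update(recipe.get("compile_overrides", {}))
    for mutation in recipe.get("context_mutations", []):
        if mutation["op"] == "remove_item":  # hypothetical mutation format
            request["context_items"] = [
                item for item in request["context_items"]
                if item["id"] != mutation["item_id"]
            ]
        else:
            raise ValueError(f"unsupported mutation: {mutation['op']}")
    request["request_id"] = f"replay-{uuid.uuid4().hex}"
    return request


stored = {
    "request_id": "req-1",
    "provider": "provider-a",
    "model": "model-small",
    "compile_config": {"w_relevance": 0.35},
    "context_items": [{"id": "user"}, {"id": "mem:m1"}],
}
recipe = {
    "model_override": "model-large",
    "compile_overrides": {"w_relevance": 0.5},
    "context_mutations": [{"op": "remove_item", "item_id": "mem:m1"}],
}
replay = build_replay_request(stored, recipe)
```

Because the stored request is never mutated, the original and the replay can both be fed to the diff engine afterwards.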


15. Provider adapters

The core must be provider-agnostic. Build adapters at the edges.

15.1 V1 provider goals

Support at least two major providers:

  • one adapter for an OpenAI-style API
  • one adapter for an Anthropic-style API

Do not hard-code request logic into the compiler. The compiler emits a provider-neutral intermediate result, and the adapter renders it.

15.2 Adapter interface

from typing import Protocol


class ProviderAdapter(Protocol):
    def count_tokens(self, compiled: CompiledContext) -> int: ...
    def render_request(self, compiled: CompiledContext) -> dict: ...
    def send(self, rendered_request: dict) -> dict: ...
    def parse_response(self, response: dict) -> dict: ...
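
To show how the interface might be exercised without a network call, here is a sketch of a fake adapter suitable for deterministic tests. CompiledStub is a hypothetical stand-in for CompiledContext, the whitespace-based token count is a crude placeholder for a real tokenizer, and EchoAdapter and call_model are illustrative names that are not part of the planned SDK.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class CompiledStub:
    """Hypothetical stand-in for CompiledContext."""
    model: str
    sections: tuple


class ProviderAdapter(Protocol):
    def count_tokens(self, compiled: CompiledStub) -> int: ...
    def render_request(self, compiled: CompiledStub) -> dict: ...
    def send(self, rendered_request: dict) -> dict: ...
    def parse_response(self, response: dict) -> dict: ...


class EchoAdapter:
    """Fake adapter: no network, fully deterministic."""

    def count_tokens(self, compiled):
        # Placeholder estimate; a real adapter would use the provider's tokenizer.
        return sum(len(section.split()) for section in compiled.sections)

    def render_request(self, compiled):
        return {"model": compiled.model, "input": "\n\n".join(compiled.sections)}

    def send(self, rendered_request):
        return {"output_text": f"received {len(rendered_request['input'])} chars"}

    def parse_response(self, response):
        return {"text": response["output_text"]}


def call_model(adapter: ProviderAdapter, compiled: CompiledStub, token_budget: int) -> dict:
    """Preflight the token count, then render, send, and parse."""
    if adapter.count_tokens(compiled) > token_budget:
        raise ValueError("compiled context exceeds token budget")
    request = adapter.render_request(compiled)
    return adapter.parse_response(adapter.send(request))


compiled = CompiledStub("fake-model", ("You are a helper.", "Fix the failing test."))
result = call_model(EchoAdapter(), compiled, token_budget=100)
```

EchoAdapter satisfies the protocol structurally, without inheriting from it, which is what allows provider packages to stay decoupled from the core.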

15.3 Tool handling

Tool definitions and results should remain provider-neutral in core types. Adapters are responsible for rendering tool schemas and parsing tool calls.

15.4 OpenAI-style adapter

Target the Responses API first.

Requirements:

  • render instructions and input cleanly from the provider-neutral compiled context
  • support developer-provided tools
  • pass through provider built-in tools when supplied
  • record request usage, response usage, and cache metrics when available
  • expose stable prefix and volatile suffix hashes in recorder metadata
  • support conversation continuation
  • keep long-run compaction helpers isolated from the core compiler

15.5 Anthropic-style adapter

Target the Messages API first.

Requirements:

  • render system, messages, and tools from the provider-neutral compiled context
  • preflight requests with token counting
  • support cache breakpoints for reusable prefix content
  • support streaming
  • validate tool-result adjacency before sending the request
  • keep extended-thinking support out of the default V1 path unless isolated behind a feature flag


16. Tool and integration surfaces

V1 should expose three integration surfaces.

16.1 Python SDK

Use for:

  • custom agent backends
  • server-side integrations
  • embedded use in applications

Core API sketch:

ctx = agentcore.start_run(agent_name="my-agent", domain="research", user_goal="...")
compiled = agentcore.compile_step(run_id=..., step_id=..., inputs=...)
response = agentcore.call_model(compiled)
agentcore.record_tool_result(...)
agentcore.end_run(...)

16.2 Sidecar daemon

Expose HTTP endpoints for:

  • start run
  • append step inputs
  • compile
  • call model through adapter
  • record tool call/result
  • fetch memory
  • write memory
  • replay step
  • query recorder

16.3 MCP bridge

Provide an MCP server so other runtimes can access:

  • memory search
  • context receipt lookup
  • replay trigger
  • artifact summary
  • policy check

Treat MCP as a thin integration bridge in V1:

  • resources become context candidates
  • tools become callable tool specs
  • prompts become reusable prompt assets
  • roots define workspace boundaries

Keep sampling, elicitation, and richer workflow behavior out of V1 unless they are needed by the golden-path demo.


17. Coding-agent-first launch wedge

This section defines the first launch wedge without changing the generic core.

17.1 What the wedge is

The first polished user experience should target repo-local agents.

Examples:

  • a custom coding agent built with the Python SDK
  • a local sidecar used by a CLI-based coding workflow
  • a hook-based adapter to a coding agent tool that already has a lifecycle model
  • MCP-based integration for an IDE or code assistant

17.2 What value should be obvious in the first demo

A coding-agent user should immediately get:

  • repo memory across sessions
  • better context selection across files, diffs, and test results
  • receipts for why files or instructions were included
  • replay when the agent makes a bad code change
  • policy checks for risky commands or write operations

17.3 Important boundary

Do not define the core product as “only for coding agents.”
Use coding agents to prove the product, not to define its architecture.

17.4 Coding-agent demo scenario

Ship one end-to-end example:

  1. open a local repo
  2. ask the agent to implement a feature
  3. read files and create a plan
  4. run tests
  5. inspect diff
  6. persist a workspace summary and task receipts
  7. replay a failed step with different compile settings

That is enough to make the product legible.

17.5 Golden-path demo spec

Use this demo to drive the first implementation slice. It is narrower than full V1 and should be runnable before the entire platform is complete.

Demo name: repo-local failing task replay

Purpose: prove that a developer can wrap a coding agent, inspect exactly what the model saw, understand why context was included or excluded, and replay the bad step with changed compile settings.

Fixture repo: a small Python project with:

  • one failing test
  • at least three source files
  • one irrelevant large file or document that should be excluded
  • one repo convention file that should become workspace memory
  • one simple command for tests, such as pytest

Starting user request:

Make the failing test pass. Do not edit tests. Preserve the public API.

Required recorded inputs:

  • latest user request
  • repo summary
  • relevant source files
  • failing test output
  • current diff, if any
  • workspace convention or project note
  • hard constraints from the user request
  • tool schemas for file reads, shell commands, and patch application

Required compiler behavior:

  • always include the latest user request
  • always include the hard constraints
  • include the failing test output and directly relevant source files
  • include the workspace convention if it applies
  • exclude unrelated files with machine-readable reasons
  • preserve raw source items even when excluded or compressed
  • produce stable hashes for the compiled payload sections

Required recorder behavior:

  • store the normalized context items
  • store inclusion, exclusion, compression, and redaction decisions
  • store the rendered provider payload
  • store tool calls, test output, diff, token usage, and cache metadata when available
  • store enough information to replay the model step without reconstructing state from logs

Required dashboard path:

  1. Runs list shows the demo run, status, model, token usage, and timestamp.
  2. Step detail shows included and excluded context items with reasons.
  3. Diff view shows what changed between the planning step and patch step.
  4. Replay view reruns the failed or pre-patch step with at least one changed compile setting, such as token budget or removed context source.

Done when:

  • a fresh checkout can run the demo from one command
  • the sidecar and dashboard start locally with no cloud dependency
  • the dashboard can answer "what did the model know and why?" for the patch step
  • exact replay works for the recorded step
  • one mutated replay changes the compiled context and records a comparison
  • the demo leaves behind a durable workspace summary or convention with provenance
  • the implementation remains generic outside the demo adapter and example files

Treat this as the first build milestone. Later V1 phases should deepen the same flow instead of adding unrelated features first.


18. Repository layout

18.1 Implementation stack defaults

Use:

  • Python for the compiler, memory, policy, recorder, adapters, SDK, and examples
  • Pydantic for core schemas
  • Pytest for deterministic unit and integration tests
  • FastAPI for the local sidecar API
  • SQLite plus local JSON/JSONL files for local storage
  • React or Next.js for the local dashboard
  • Postgres and object storage only behind team-ready storage interfaces
  • OTLP/OpenTelemetry export as an adapter, not as the recorder's source of truth

Optimize first for clarity, determinism, and debuggability.

18.2 Monorepo shape

Use a monorepo.

This is the preserved baseline repository shape for the first implementation:

repo/
  README.md
  docs/
  packages/
    context_core/
    sdk/
    provider_openai/
    provider_anthropic/
    recorder/
    mcp_bridge/
    otel_export/
  services/
    api/
  apps/
    dashboard/
  examples/
    openai_context_demo/
    anthropic_coding_agent/
    langgraph_wrapper/
    mcp_ingest_demo/
  tests/
    unit/
    integration/
    fixtures/

18.3 Package responsibilities

  • context_core/: provider-neutral schemas, compiler passes, scoring, budgeting, cache layout, memory interfaces, policy interfaces, and replay recipes.
  • sdk/: customer-facing Python SDK and installable CLI that wrap the internal packages into the supported adoption surface.
  • provider_openai/: OpenAI-style adapter, rendering, usage capture, cache metadata, and example integration helpers.
  • provider_anthropic/: Anthropic-style adapter, rendering, token preflight, cache breakpoints, and tool-order validation.
  • recorder/: run and step persistence, receipts, payload snapshots, local SQLite/file storage, diffs, and replay artifacts.
  • mcp_bridge/: MCP resources, tools, prompts, and roots mapped into context-layer primitives.
  • otel_export/: optional exporter that maps internal recorder events to OTLP/OpenTelemetry without making OTLP the internal source of truth.
  • services/api/: FastAPI sidecar for compile, record, memory, policy, replay, and dashboard data access.
  • apps/dashboard/: local dashboard for runs, steps, receipts, diffs, memory, and replay.
  • examples/openai_context_demo/: first one-shot OpenAI context demo.
  • examples/anthropic_coding_agent/: Anthropic-style coding-agent demo.
  • examples/langgraph_wrapper/: wrapper example for an existing agent runtime.
  • examples/mcp_ingest_demo/: MCP ingestion demo.

18.4 Later package split

The baseline shape keeps the core cohesive while the product is young. If context_core/ becomes too large, split it later along these boundaries:

packages/
  core/
  compiler/
  memory/
  policy/
  replay/

Do not start with this split unless the codebase pressure is real. The initial goal is a legible, shippable compiler/recorder package, not maximum package granularity.


19. Storage design

Support two storage modes.

19.1 Local mode

Use:

  • SQLite for structured metadata
  • local JSON or JSONL blobs for payload snapshots
  • local filesystem for artifacts

19.2 Team-ready mode

Design the interfaces so they can later support:

  • Postgres
  • object storage
  • Redis or queue for jobs

Do not implement hosted team mode in V1, but keep interfaces clean.

19.3 Required tables / collections

At minimum persist:

  • runs
  • steps
  • context_items
  • memory_records
  • compilation_requests
  • compilation_results
  • tool_events
  • policy_events
  • replay_runs
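
As a starting point for local mode, the runs and steps tables could look like the sketch below. Column names follow RunRecord and StepRecord; the constraints, the TEXT encoding of timestamps, and storing metadata as a JSON string are assumptions. The remaining tables would follow the same pattern.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS runs (
    run_id     TEXT PRIMARY KEY,
    agent_name TEXT NOT NULL,
    domain     TEXT NOT NULL,
    user_goal  TEXT NOT NULL,
    started_at TEXT NOT NULL,
    ended_at   TEXT,
    status     TEXT NOT NULL
               CHECK (status IN ('running', 'completed', 'failed', 'cancelled')),
    metadata   TEXT NOT NULL DEFAULT '{}'
);

CREATE TABLE IF NOT EXISTS steps (
    step_id    TEXT PRIMARY KEY,
    run_id     TEXT NOT NULL REFERENCES runs(run_id),
    step_index INTEGER NOT NULL,
    step_kind  TEXT NOT NULL,
    started_at TEXT NOT NULL,
    ended_at   TEXT,
    status     TEXT NOT NULL,
    metadata   TEXT NOT NULL DEFAULT '{}',
    UNIQUE (run_id, step_index)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

conn.execute(
    "INSERT INTO runs (run_id, agent_name, domain, user_goal, started_at, status) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    ("run-1", "my-agent", "coding", "Make the failing test pass",
     "2025-01-10T12:00:00Z", "running"),
)
conn.execute(
    "INSERT INTO steps (step_id, run_id, step_index, step_kind, started_at, status) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    ("step-1", "run-1", 0, "model_call", "2025-01-10T12:00:01Z", "running"),
)
conn.commit()
step_count = conn.execute(
    "SELECT COUNT(*) FROM steps WHERE run_id = ?", ("run-1",)
).fetchone()[0]
```

The UNIQUE constraint on (run_id, step_index) is one cheap way to back the recorder guarantee that every step in a run is uniquely addressable.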


20. Sidecar API

Use FastAPI.

Endpoints

  • POST /v1/runs/start
  • POST /v1/runs/{run_id}/steps
  • POST /v1/compile
  • POST /v1/model/call
  • POST /v1/tools/result
  • POST /v1/memory/read
  • POST /v1/memory/write
  • POST /v1/policy/check
  • GET /v1/runs
  • GET /v1/runs/{run_id}
  • GET /v1/steps/{step_id}
  • GET /v1/compilations/{id}
  • GET /v1/diff/steps/{left}/{right}
  • POST /v1/replay
  • GET /v1/replay/{replay_id}

Keep payloads simple JSON with explicit schema versions.


21. Dashboard requirements

The dashboard is local-first and must answer:

  • what did the agent know?
  • why did it know that?
  • what changed between steps?
  • what memory was used or written?
  • what tool action happened?
  • how can I replay this?

Screens

21.1 Runs list

Show:

  • run id
  • agent
  • domain
  • status
  • start time
  • end time
  • step count

21.2 Run detail

Show:

  • timeline of steps
  • tool actions
  • memory writes
  • failures
  • compile count

21.3 Step detail

Show:

  • included items
  • excluded items
  • scores
  • reasons
  • token allocation
  • rendered request preview
  • response preview

21.4 Diff view

Show:

  • additions/removals
  • compressed items
  • redactions
  • policy changes
  • budget changes

21.5 Memory explorer

Show:

  • working memory
  • durable memory
  • source runs
  • confidence and freshness
  • invalidation controls

21.6 Replay view

Allow:

  • exact replay
  • replay with modified config
  • compare outputs


22. Build order

Build in this order.

Phase 0: bootstrap

  • repo setup
  • packaging
  • formatting
  • typing
  • CI
  • test harness

Phase 1: core schemas and utilities

  • Pydantic models
  • stable hashing
  • ids and timestamps
  • serialization tests

Phase 2: compiler v1

  • normalize
  • dedupe
  • conflict detection
  • scoring
  • budgeting
  • compression
  • policy pass
  • provider-neutral output

Phase 3: recorder and storage

  • run and step persistence
  • compile event storage
  • payload snapshots
  • diff engine

Phase 4: memory system

  • working memory
  • durable memory
  • read/write APIs
  • TTL cleanup

Phase 5: provider adapters

  • adapter 1
  • adapter 2
  • token counting hooks
  • tool rendering

Phase 6: sidecar API

  • FastAPI service
  • local storage integration
  • replay endpoints

Phase 7: dashboard

  • runs view
  • step view
  • diff view
  • replay view
  • memory explorer

Phase 8: examples

  • generic agent example
  • coding-agent repo example
  • research agent example

Phase 9: MCP bridge

  • expose memory read
  • expose receipts
  • expose replay trigger

23. Test plan

Write tests early and keep them deterministic.

Unit tests

  • context item hashing
  • dedupe
  • conflict resolution
  • score calculation
  • budget allocation
  • compression rules
  • redaction rules
  • policy rule evaluation
  • memory read/write
  • replay recipe construction

Integration tests

  • compile -> render -> provider adapter request
  • exact replay of a stored step
  • diff output between steps
  • sidecar API roundtrip
  • dashboard data queries
  • coding-agent example workflow
  • research-agent example workflow

Invariant tests

  • latest user request is always included
  • hard constraints are never dropped
  • every excluded item has a reason
  • every included item has provenance
  • same input + config => same compile hash
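
The determinism invariant can be sketched as a pytest-style test. toy_compile below is a deliberately tiny stand-in for the real compiler, not its planned implementation, and the test checks a slightly stronger property than the one listed above: the compile hash should not depend on the order in which candidate items arrive.

```python
import hashlib
import json
import random


def toy_compile(items, config):
    """Stand-in compiler: greedy selection by score with an id tie-break."""
    ordered = sorted(items, key=lambda item: (-item["score"], item["id"]))
    selected, used = [], 0
    for item in ordered:
        if used + item["tokens"] <= config["token_budget"]:
            selected.append(item["id"])
            used += item["tokens"]
    return {"selected_items": selected, "tokens_used": used}


def compile_hash(result):
    payload = json.dumps(result, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


ITEMS = [
    {"id": "user", "score": 1.0, "tokens": 40},
    {"id": "test_output", "score": 0.8, "tokens": 50},
    {"id": "readme", "score": 0.8, "tokens": 30},
    {"id": "big_doc", "score": 0.2, "tokens": 500},
]
CONFIG = {"token_budget": 100}


def test_same_input_and_config_give_same_compile_hash():
    baseline = compile_hash(toy_compile(ITEMS, CONFIG))
    rng = random.Random(0)  # seeded so the test itself is deterministic
    for _ in range(20):
        shuffled = ITEMS[:]
        rng.shuffle(shuffled)
        assert compile_hash(toy_compile(shuffled, CONFIG)) == baseline


test_same_input_and_config_give_same_compile_hash()
```

The explicit id tie-break in the sort key is what makes equal-score items resolve the same way on every run. Without it, the result would depend on input order.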

24. Acceptance criteria

V1 is done when all of the following are true:

  • the core works for at least two agent domains in examples
  • the coding-agent demo is polished and easy to understand
  • the data model remains generic and domain-neutral
  • every compile produces receipts
  • every stored step can be replayed
  • working and durable memory both function
  • policy checks can deny, redact, or require approval
  • the sidecar runs locally with no cloud dependency
  • the dashboard answers “what did the agent know and why?”
  • package installation and first run are documented clearly

25. Documentation requirements

Ship these docs:

  • quickstart
  • architecture overview
  • data model reference
  • sidecar API reference
  • writing provider adapters
  • writing memory policies
  • building a custom agent with the SDK
  • coding-agent demo walkthrough
  • research-agent demo walkthrough

The docs must make the general-vs-wedge distinction explicit.


26. Risks and mitigations

Risk: product feels too abstract

Mitigation:

  • ship the coding-agent demo early
  • use concrete before/after examples

Risk: product gets mistaken for another orchestration framework

Mitigation:

  • keep runtime integration simple
  • position the product as substrate, not orchestrator

Risk: memory becomes unreliable

Mitigation:

  • keep writes explicit in V1
  • preserve provenance and confidence
  • avoid hidden autonomous learning

Risk: dashboards become the whole product

Mitigation:

  • keep the SDK and sidecar as the core
  • treat UI as a debugging surface, not the primary value

Risk: coding-agent wedge pollutes the generic core

Mitigation:

  • enforce domain-neutral names in core packages
  • keep code-specific logic in examples and adapters


27. What not to postpone by accident

Do not leave these as “later”:

  • provenance on every context item
  • explicit compile decisions
  • replay
  • durable memory with invalidation
  • policy checks
  • step diffs

These are the product, not polish.


28. V1 one-sentence mandate

Build a general-purpose context, memory, and receipts layer for AI agents, shipped as a Python package plus local sidecar and dashboard, with the first polished go-to-market wedge focused on repo-local coding agents but with a generic core that works across agent domains.