V1 Build Plan: General-Purpose Context, Memory, and Receipts Layer for Agents¶
Self-contained product and implementation brief¶
Canonical V1 source of truth. The older context-compiler-specific build plan has been consolidated into this document.
This document is a complete V1 build brief. It does not assume any prior plan or conversation. A team or coding agent should be able to use this document alone to understand what to build and how to implement it from scratch.
1. Product definition¶
Build a general-purpose infrastructure layer for AI agents that does four jobs:
- Compiles the right context for each model call.
- Maintains working and durable memory across steps and sessions.
- Records receipts for what the agent saw, why it saw it, what it did, and how it changed state.
- Makes failures replayable and explainable.
This product is not another orchestrator, workflow graph editor, or vector database. It is the context, memory, policy, and replay substrate that sits under agent runtimes.
Short positioning¶
The memory and receipts layer for AI agents.
What the product must support¶
The core must be generic enough to support: - coding agents - research agents - support agents - internal knowledge agents - workflow / operations agents - multi-agent systems
What “coding-agent-first” means¶
The product is for agents in general.
The first launch wedge is coding agents, because they are the easiest category in which to demonstrate value.
That means: - the core architecture stays generic - the first polished demo and integration target is a repo-local coding agent workflow - the first adapters should make it easy to use with a local workspace, files, diffs, tests, and shell tools
Do not bake code-specific assumptions into the core data model.
2. The problem this solves¶
Agent systems fail for recurring reasons: - they see the wrong context - they lose track of constraints - they include stale or conflicting evidence - they forget decisions made earlier in the run - they compact long sessions badly - they call tools without clear provenance - they are hard to debug after the fact - teams cannot answer “what did the agent know when it made that decision?”
Today, developers often solve this with ad hoc prompt construction, one-off memory hacks, logging, and framework-specific state management. The result is fragile, opaque, and hard to reproduce.
This product fixes that by making context selection, memory persistence, and execution receipts explicit and inspectable.
3. V1 goals¶
V1 should deliver a usable product with local-first deployment.
Primary goals¶
- Provide a context compiler that selects, compresses, and renders context for each model call.
- Provide a flight recorder that stores what was included, excluded, changed, and executed.
- Provide a working memory layer and a durable memory layer.
- Provide a policy layer that scopes what the agent may see or do.
- Provide replay for exact reruns of prior steps.
- Provide provider adapters for at least two major LLM providers.
- Provide a developer-facing dashboard for receipts, diffs, and replay.
- Launch with a coding-agent wedge: polished examples for repo-local agents.
V1 success criteria¶
A developer should be able to: - integrate the package into an agent in under 30 minutes - inspect exactly what context was sent for each step - see why each context item was included, excluded, compressed, or redacted - replay a failed step exactly - persist and reuse memory across sessions - use the product locally without any hosted dependency
4. Non-goals¶
Do not build these in V1: - a visual workflow/DAG builder - a new orchestration framework - a hosted multi-tenant SaaS control plane - enterprise billing and account management - a vector database - autonomous memory learning that writes to long-term memory without developer control - a cross-provider eval platform - a generic IDE - a replacement for existing agent runtimes
The product should integrate with runtimes, not replace them.
5. Product shape¶
Ship V1 as three pieces:
- Python package
- the main adoption wedge
-
includes compiler, memory, policy, recorder, adapters, and SDK
-
Local sidecar daemon
- optional but recommended
- exposes HTTP and local IPC for integrations
-
centralizes state, replay, and dashboard data
-
Local dashboard
- shows runs, steps, diffs, receipts, memory, and replay controls
Why this shape¶
- A Python package makes it easy to adopt in agent codebases.
- A sidecar keeps the product usable by non-Python entry points.
- A dashboard provides the “why did the agent do that?” moment.
Distribution¶
- publish the package to PyPI
- ship the sidecar and dashboard via a simple local dev launcher
- keep cloud deployment optional and out of scope for V1
6. Architecture overview¶
High-level runtime model¶
Agent runtime / CLI / app
|
v
Context layer SDK or local sidecar
- context compiler
- memory manager
- policy engine
- recorder
- replay engine
|
v
Provider adapters + tool adapters
|
v
LLM provider, tools, files, MCP servers, retrieval systems
Two usage modes¶
Mode A: Embedded SDK mode¶
A developer imports the Python package directly into their agent code.
Best for: - custom agent backends - server-side apps - research and ops agents - custom coding agents
Mode B: Sidecar mode¶
A runtime talks to a local daemon over HTTP or IPC.
Best for: - CLI tools - local coding agent workflows - non-Python runtimes - hook-based integrations
7. Core design principles¶
- Generic core, specific launch wedge
- the platform must work for any agent domain
-
only the first demo/integration is coding-focused
-
Local-first
-
all critical capabilities work without a cloud service
-
Determinism
-
same inputs and config should produce the same compile result
-
Receipts over mystery
-
every inclusion and exclusion needs a reason
-
Provenance everywhere
-
every context item must preserve its source
-
Policy before convenience
-
scope and redaction must happen before rendering
-
Replayability
-
any important step should be rerunnable
-
Adapters, not lock-in
- integrate with providers and runtimes through clear boundaries
8. Core capabilities¶
V1 includes six subsystems:
- Context compiler
- Recorder / flight recorder
- Memory system
- Policy engine
- Provider and tool adapters
- Dashboard and replay UI
Each subsystem is described below.
9. Core data model¶
Use Pydantic models for every core type. The core schema must be provider-neutral and domain-neutral.
9.1 ContextItem¶
A single possible unit of context.
class ContextItem(BaseModel):
id: str
kind: Literal[
"system",
"policy",
"user_msg",
"assistant_msg",
"task",
"constraint",
"plan",
"memory",
"retrieval_doc",
"tool_schema",
"tool_result",
"artifact",
"file",
"code_diff",
"summary",
"handoff",
"other",
]
content: dict | str
source_ref: "SourceRef | None"
created_at: datetime
freshness_ts: datetime | None
ttl_seconds: int | None
trust_score: float
importance_score: float
sensitivity: Literal["public", "internal", "restricted", "secret"]
scope: dict
dependencies: list[str]
token_estimate: int | None
stable_hash: str
tags: list[str]
metadata: dict
9.2 SourceRef¶
Where a context item came from.
class SourceRef(BaseModel):
source_type: Literal[
"user",
"app_state",
"memory",
"retrieval",
"tool",
"file",
"mcp_resource",
"policy",
"human_approval",
"external_api",
"other",
]
uri: str | None
title: str | None
version: str | None
checksum: str | None
retrieval_query: str | None
positions: list[str] = []
author: str | None
metadata: dict = {}
9.3 MemoryRecord¶
A stored memory unit.
class MemoryRecord(BaseModel):
id: str
memory_type: Literal["working", "durable", "session_summary", "fact", "preference", "constraint", "artifact_index"]
subject: str
content: dict | str
source_run_id: str | None
source_step_ids: list[str]
confidence: float
freshness_ts: datetime | None
expires_at: datetime | None
tags: list[str]
scope: dict
metadata: dict
9.4 PolicyRule¶
class PolicyRule(BaseModel):
id: str
name: str
description: str
applies_to: dict
effect: Literal["allow", "deny", "redact", "require_approval", "require_receipt"]
conditions: dict
priority: int
metadata: dict
9.5 CompilationRequest¶
class CompilationRequest(BaseModel):
request_id: str
run_id: str
step_id: str
provider: str
model: str
task_summary: str
context_items: list[ContextItem]
tool_specs: list[dict]
memory_records: list[MemoryRecord]
policy_rules: list[PolicyRule]
token_budget: int
response_reserve: int
compile_config: dict
9.6 CompilationDecision¶
class CompilationDecision(BaseModel):
item_id: str
decision: Literal["include", "exclude", "compress", "redact"]
pass_name: str
reason_code: str
score_breakdown: dict
original_tokens: int | None
final_tokens: int | None
notes: str | None
9.7 CompiledContext¶
class CompiledContext(BaseModel):
id: str
request_id: str
provider: str
model: str
selected_items: list[str]
excluded_items: list[str]
decisions: list[CompilationDecision]
token_allocation: dict
stable_prefix_hash: str | None
volatile_suffix_hash: str | None
rendered_payload: dict
9.8 RunRecord and StepRecord¶
class RunRecord(BaseModel):
run_id: str
agent_name: str
domain: str
user_goal: str
started_at: datetime
ended_at: datetime | None
status: Literal["running", "completed", "failed", "cancelled"]
metadata: dict
class StepRecord(BaseModel):
step_id: str
run_id: str
step_index: int
step_kind: Literal["model_call", "tool_call", "memory_write", "policy_check", "human_approval", "other"]
started_at: datetime
ended_at: datetime | None
status: Literal["running", "completed", "failed", "cancelled"]
metadata: dict
9.9 ReplayRecipe¶
class ReplayRecipe(BaseModel):
replay_id: str
source_run_id: str
source_step_id: str
provider_override: str | None
model_override: str | None
compile_overrides: dict
context_mutations: list[dict]
notes: str | None
10. Context compiler¶
The compiler is the heart of the product. Its job is to turn a messy set of possible inputs into the best possible provider request.
10.1 Compiler inputs¶
Inputs may include: - latest user instruction - conversation history - task plan - retrieved documents - prior tool results - workspace files or artifacts - stored memory - policies and constraints - tool definitions - approvals and handoffs
10.2 Compiler outputs¶
The compiler must produce: - the rendered provider payload - inclusion/exclusion decisions - token allocation - stable/volatile layout metadata - receipts explaining every important decision
10.3 Compiler passes¶
Implement the compiler as explicit passes in this order:
- normalize
-
coerce all inputs into
ContextItem -
validate
- ensure all required fields exist
-
ensure items have ids, kind, and stable hash
-
dedupe
- remove exact duplicates by stable hash
-
keep the highest-trust copy
-
dependency expansion
-
if an included item depends on another item, ensure the dependency is considered
-
conflict detection
- detect contradictory items
- prefer the higher trust + fresher item
-
keep losing items for traceability
-
scoring
-
compute a final score per item using:
- relevance to task
- freshness
- trust
- explicit pinning
- dependency pressure
- policy priority
-
budgeting
-
allocate available tokens by class
-
compression
- summarize or compress low-priority overflow items
-
preserve provenance
-
policy
-
deny, redact, or require approval before render
-
layout
-
create stable and volatile sections for caching and consistency
-
render
-
produce provider-specific request payload
-
record
- emit receipts and persist the compile result
10.4 Scoring formula¶
Use a simple configurable weighted score in V1:
final_score =
w_relevance * relevance_score
+ w_freshness * freshness_score
+ w_trust * trust_score
+ w_importance * importance_score
+ w_dependency * dependency_score
+ w_pin * pin_score
- w_staleness * staleness_penalty
- w_risk * risk_penalty
Start with defaults in config, not hardcoded constants.
10.5 Budgeting¶
Reserve budget in this order: 1. hard constraints and latest user request 2. active task state and tool results 3. policy items 4. working memory 5. retrieval and artifacts 6. lower-priority summaries
The compiler must never drop: - the latest user request - hard constraints - unresolved blocking tool results - required policy items
10.6 Compression rules¶
Compression is allowed for: - long retrieval docs - stale conversation history - large artifact summaries - old tool results
Compression is not allowed for: - latest user instruction - active policy constraints - approvals - exact code diffs currently under discussion - raw evidence required for a decision
10.7 Redaction rules¶
Redaction happens before provider rendering.
Redacted items remain visible in the recorder with reason metadata.
Examples: - remove secret tokens or credentials - hide restricted sources from lower-trust sub-agents - strip PII before external calls
10.8 Stable vs volatile layout¶
Structure rendered context into: - stable prefix - task context - volatile suffix
Stable prefix should contain: - static system instructions - tool schemas - stable policies - persistent profile or workspace summary
Volatile suffix should contain: - latest user request - latest tool results - newly changed artifacts - ephemeral run state
This improves reproducibility and can later improve cache friendliness.
11. Memory system¶
V1 must include both working memory and durable memory.
11.1 Working memory¶
Short-lived state used during a run: - active task summary - current plan - open questions - temporary facts - recent tool outputs - current artifact map
Scope: - per run - per step chain - easy to overwrite
11.2 Durable memory¶
Longer-lived state across sessions: - user preferences - stable constraints - recurring environment facts - recurring workspace summaries - approved project conventions
Scope: - namespace by project / user / agent - explicitly writable - readable via filters
11.3 Memory read path¶
Before compile:
- fetch candidate memory records
- filter by scope and freshness
- convert them into ContextItem
- let the compiler decide whether they are included
11.4 Memory write path¶
In V1, memory writes should be controlled and explicit.
Allowed writers: - application code - approved tool outputs - human-approved summaries - explicit post-step summarizers
Do not implement unconstrained autonomous self-writing memory in V1.
11.5 Memory retention¶
Each memory record should support: - freshness timestamp - expiry time - confidence - tags - scope
Implement TTL-based cleanup and manual invalidation.
12. Policy engine¶
The policy engine determines what the agent may see or do.
12.1 Policy capabilities in V1¶
Support these effects: - allow - deny - redact - require approval - require receipt
12.2 Policy dimensions¶
Policies may depend on: - agent identity - task type - model provider - source sensitivity - tool class - environment - user or project scope
12.3 Example policies¶
- A support agent may not see secrets.
- A research agent may use web tools but not write to production systems.
- A coding agent may run read-only shell commands automatically, but destructive commands require approval.
- A low-trust summarizer sub-agent may see summaries but not raw restricted documents.
12.4 Approval flow¶
If a rule says require_approval, the step should transition into a blocked state and emit:
- requested action
- reason
- relevant receipts
- approval token when granted
V1 approval can be local/manual only.
13. Recorder / flight recorder¶
The recorder is a first-class subsystem, not a log file.
13.1 What to persist¶
Persist: - run metadata - step metadata - normalized inputs - memory reads and writes - compile requests - compile decisions - rendered provider payload - provider responses - tool calls and tool results - policy checks - approvals - errors - replay recipes
13.2 Required recorder guarantees¶
- every important step has a unique id
- every compile can be reconstructed
- every included item has provenance
- every excluded item has a reason
- replay can reconstruct the original payload
- sensitive payload retention is configurable
13.3 Retention modes¶
Support:
- metadata_only
- redacted_payloads
- full_payloads
Default local mode: redacted_payloads.
13.4 Diff engine¶
The recorder must support diffs between: - step N and step N+1 - original compile and replay compile - pre- and post-compaction state
Diffs should show: - added items - removed items - compressed items - score changes - policy-caused changes - token budget changes
14. Replay engine¶
Replay is a product feature, not a test-only feature.
14.1 Exact replay¶
Use the stored compile artifact to rerun the same model call.
14.2 Mutated replay¶
Allow controlled overrides: - model change - provider change - different budget - removed memory source - changed scoring weights - changed policy
14.3 Replay outputs¶
Replay should produce: - new run/step ids - comparison to original - output diff - compile diff - notes about what changed
15. Provider adapters¶
The core must be provider-agnostic. Build adapters at the edges.
15.1 V1 provider goals¶
Support at least two major providers: - one adapter for an OpenAI-style API - one adapter for an Anthropic-style API
Do not hard-code request logic into the compiler. The compiler emits a provider-neutral intermediate result, and the adapter renders it.
15.2 Adapter interface¶
class ProviderAdapter(Protocol):
def count_tokens(self, compiled: CompiledContext) -> int: ...
def render_request(self, compiled: CompiledContext) -> dict: ...
def send(self, rendered_request: dict) -> dict: ...
def parse_response(self, response: dict) -> dict: ...
15.3 Tool handling¶
Tool definitions and results should remain provider-neutral in core types. Adapters are responsible for rendering tool schemas and parsing tool calls.
15.4 OpenAI-style adapter¶
Target the Responses API first.
Requirements: - render instructions and input cleanly from the provider-neutral compiled context - support developer-provided tools - pass through provider built-in tools when supplied - record request usage, response usage, and cache metrics when available - expose stable prefix and volatile suffix hashes in recorder metadata - support conversation continuation - keep long-run compaction helpers isolated from the core compiler
15.5 Anthropic-style adapter¶
Target the Messages API first.
Requirements: - render system, messages, and tools from the provider-neutral compiled context - preflight requests with token counting - support cache breakpoints for reusable prefix content - support streaming - validate tool-result adjacency before sending the request - keep extended-thinking support out of the default V1 path unless isolated behind a feature flag
16. Tool and integration surfaces¶
V1 should expose three integration surfaces.
16.1 Python SDK¶
Use for: - custom agent backends - server-side integrations - embedded use in applications
Core API sketch:
ctx = agentcore.start_run(agent_name="my-agent", domain="research", user_goal="...")
compiled = agentcore.compile_step(run_id=..., step_id=..., inputs=...)
response = agentcore.call_model(compiled)
agentcore.record_tool_result(...)
agentcore.end_run(...)
16.2 Sidecar daemon¶
Expose HTTP endpoints for: - start run - append step inputs - compile - call model through adapter - record tool call/result - fetch memory - write memory - replay step - query recorder
16.3 MCP bridge¶
Provide an MCP server so other runtimes can access: - memory search - context receipt lookup - replay trigger - artifact summary - policy check
Treat MCP as a thin integration bridge in V1: - resources become context candidates - tools become callable tool specs - prompts become reusable prompt assets - roots define workspace boundaries
Keep sampling, elicitation, and richer workflow behavior out of V1 unless they are needed by the golden-path demo.
17. Coding-agent-first launch wedge¶
This section defines the first launch wedge without changing the generic core.
17.1 What the wedge is¶
The first polished user experience should target repo-local agents.
Examples: - a custom coding agent built with the Python SDK - a local sidecar used by a CLI-based coding workflow - a hook-based adapter to a coding agent tool that already has a lifecycle model - MCP-based integration for an IDE or code assistant
17.2 What value should be obvious in the first demo¶
A coding-agent user should immediately get: - repo memory across sessions - better context selection across files, diffs, and test results - receipts for why files or instructions were included - replay when the agent makes a bad code change - policy checks for risky commands or write operations
17.3 Important boundary¶
Do not define the core product as “only for coding agents.”
Use coding agents to prove the product, not to define its architecture.
17.4 Coding-agent demo scenario¶
Ship one end-to-end example:
- open a local repo
- ask the agent to implement a feature
- read files and create a plan
- run tests
- inspect diff
- persist a workspace summary and task receipts
- replay a failed step with different compile settings
That is enough to make the product legible.
17.5 Golden-path demo spec¶
Use this demo to drive the first implementation slice. It is narrower than full V1 and should be runnable before the entire platform is complete.
Demo name: repo-local failing task replay
Purpose: prove that a developer can wrap a coding agent, inspect exactly what the model saw, understand why context was included or excluded, and replay the bad step with changed compile settings.
Fixture repo: a small Python project with:
- one failing test
- at least three source files
- one irrelevant large file or document that should be excluded
- one repo convention file that should become workspace memory
- one simple command for tests, such as pytest
Starting user request:
Required recorded inputs: - latest user request - repo summary - relevant source files - failing test output - current diff, if any - workspace convention or project note - hard constraints from the user request - tool schemas for file reads, shell commands, and patch application
Required compiler behavior: - always include the latest user request - always include the hard constraints - include the failing test output and directly relevant source files - include the workspace convention if it applies - exclude unrelated files with machine-readable reasons - preserve raw source items even when excluded or compressed - produce stable hashes for the compiled payload sections
Required recorder behavior: - store the normalized context items - store inclusion, exclusion, compression, and redaction decisions - store the rendered provider payload - store tool calls, test output, diff, token usage, and cache metadata when available - store enough information to replay the model step without reconstructing state from logs
Required dashboard path: 1. Runs list shows the demo run, status, model, token usage, and timestamp. 2. Step detail shows included and excluded context items with reasons. 3. Diff view shows what changed between the planning step and patch step. 4. Replay view reruns the failed or pre-patch step with at least one changed compile setting, such as token budget or removed context source.
Done when: - a fresh checkout can run the demo from one command - the sidecar and dashboard start locally with no cloud dependency - the dashboard can answer "what did the model know and why?" for the patch step - exact replay works for the recorded step - one mutated replay changes the compiled context and records a comparison - the demo leaves behind a durable workspace summary or convention with provenance - the implementation remains generic outside the demo adapter and example files
Treat this as the first build milestone. Later V1 phases should deepen the same flow instead of adding unrelated features first.
18. Repository layout¶
18.1 Implementation stack defaults¶
Use: - Python for the compiler, memory, policy, recorder, adapters, SDK, and examples - Pydantic for core schemas - Pytest for deterministic unit and integration tests - FastAPI for the local sidecar API - SQLite plus local JSON/JSONL files for local storage - React or Next.js for the local dashboard - Postgres and object storage only behind team-ready storage interfaces - OTLP/OpenTelemetry export as an adapter, not as the recorder's source of truth
Optimize first for clarity, determinism, and debuggability.
18.2 Monorepo shape¶
Use a monorepo.
This is the preserved baseline repository shape for the first implementation:
repo/
README.md
docs/
packages/
context_core/
sdk/
provider_openai/
provider_anthropic/
recorder/
mcp_bridge/
otel_export/
services/
api/
apps/
dashboard/
examples/
openai_context_demo/
anthropic_coding_agent/
langgraph_wrapper/
mcp_ingest_demo/
tests/
unit/
integration/
fixtures/
18.3 Package responsibilities¶
context_core/: provider-neutral schemas, compiler passes, scoring, budgeting, cache layout, memory interfaces, policy interfaces, and replay recipes.sdk/: customer-facing Python SDK and installable CLI that wrap the internal packages into the supported adoption surface.provider_openai/: OpenAI-style adapter, rendering, usage capture, cache metadata, and example integration helpers.provider_anthropic/: Anthropic-style adapter, rendering, token preflight, cache breakpoints, and tool-order validation.recorder/: run and step persistence, receipts, payload snapshots, local SQLite/file storage, diffs, and replay artifacts.mcp_bridge/: MCP resources, tools, prompts, and roots mapped into context-layer primitives.otel_export/: optional exporter that maps internal recorder events to OTLP/OpenTelemetry without making OTLP the internal source of truth.services/api/: FastAPI sidecar for compile, record, memory, policy, replay, and dashboard data access.apps/dashboard/: local dashboard for runs, steps, receipts, diffs, memory, and replay.examples/openai_context_demo/: first one-shot OpenAI context demo.examples/anthropic_coding_agent/: Anthropic-style coding-agent demo.examples/langgraph_wrapper/: wrapper example for an existing agent runtime.examples/mcp_ingest_demo/: MCP ingestion demo.
18.4 Later package split¶
The baseline shape keeps the core cohesive while the product is young. If context_core/ becomes too large, split it later along these boundaries:
Do not start with this split unless the codebase pressure is real. The initial goal is a legible, shippable compiler/recorder package, not maximum package granularity.
19. Storage design¶
Support two storage modes.
19.1 Local mode¶
Use: - SQLite for structured metadata - local JSON or JSONL blobs for payload snapshots - local filesystem for artifacts
19.2 Team-ready mode¶
Design the interfaces so they can later support: - Postgres - object storage - Redis or queue for jobs
Do not implement hosted team mode in V1, but keep interfaces clean.
19.3 Required tables / collections¶
At minimum persist: - runs - steps - context_items - memory_records - compilation_requests - compilation_results - tool_events - policy_events - replay_runs
20. Sidecar API¶
Use FastAPI.
Endpoints¶
POST /v1/runs/startPOST /v1/runs/{run_id}/stepsPOST /v1/compilePOST /v1/model/callPOST /v1/tools/resultPOST /v1/memory/readPOST /v1/memory/writePOST /v1/policy/checkGET /v1/runsGET /v1/runs/{run_id}GET /v1/steps/{step_id}GET /v1/compilations/{id}GET /v1/diff/steps/{left}/{right}POST /v1/replayGET /v1/replay/{replay_id}
Keep payloads simple JSON with explicit schema versions.
21. Dashboard requirements¶
The dashboard is local-first and must answer: - what did the agent know? - why did it know that? - what changed between steps? - what memory was used or written? - what tool action happened? - how can I replay this?
Screens¶
21.1 Runs list¶
Show: - run id - agent - domain - status - start time - end time - step count
21.2 Run detail¶
Show: - timeline of steps - tool actions - memory writes - failures - compile count
21.3 Step detail¶
Show: - included items - excluded items - scores - reasons - token allocation - rendered request preview - response preview
21.4 Diff view¶
Show: - additions/removals - compressed items - redactions - policy changes - budget changes
21.5 Memory explorer¶
Show: - working memory - durable memory - source runs - confidence and freshness - invalidation controls
21.6 Replay view¶
Allow: - exact replay - replay with modified config - compare outputs
22. Build order¶
Build in this order.
Phase 0: bootstrap¶
- repo setup
- packaging
- formatting
- typing
- CI
- test harness
Phase 1: core schemas and utilities¶
- Pydantic models
- stable hashing
- ids and timestamps
- serialization tests
Phase 2: compiler v1¶
- normalize
- dedupe
- conflict detection
- scoring
- budgeting
- compression
- policy pass
- provider-neutral output
Phase 3: recorder and storage¶
- run and step persistence
- compile event storage
- payload snapshots
- diff engine
Phase 4: memory system¶
- working memory
- durable memory
- read/write APIs
- TTL cleanup
Phase 5: provider adapters¶
- adapter 1
- adapter 2
- token counting hooks
- tool rendering
Phase 6: sidecar API¶
- FastAPI service
- local storage integration
- replay endpoints
Phase 7: dashboard¶
- runs view
- step view
- diff view
- replay view
- memory explorer
Phase 8: examples¶
- generic agent example
- coding-agent repo example
- research agent example
Phase 9: MCP bridge¶
- expose memory read
- expose receipts
- expose replay trigger
23. Test plan¶
Write tests early and keep them deterministic.
Unit tests¶
- context item hashing
- dedupe
- conflict resolution
- score calculation
- budget allocation
- compression rules
- redaction rules
- policy rule evaluation
- memory read/write
- replay recipe construction
Integration tests¶
- compile -> render -> provider adapter request
- exact replay of a stored step
- diff output between steps
- sidecar API roundtrip
- dashboard data queries
- coding-agent example workflow
- research-agent example workflow
Invariant tests¶
- latest user request is always included
- hard constraints are never dropped
- every excluded item has a reason
- every included item has provenance
- same input + config => same compile hash
24. Acceptance criteria¶
V1 is done when all of the following are true:
- the core works for at least two agent domains in examples
- the coding-agent demo is polished and easy to understand
- the data model remains generic and domain-neutral
- every compile produces receipts
- every stored step can be replayed
- working and durable memory both function
- policy checks can deny, redact, or require approval
- the sidecar runs locally with no cloud dependency
- the dashboard answers “what did the agent know and why?”
- package installation and first run are documented clearly
25. Documentation requirements¶
Ship these docs: - quickstart - architecture overview - data model reference - sidecar API reference - writing provider adapters - writing memory policies - building a custom agent with the SDK - coding-agent demo walkthrough - research-agent demo walkthrough
The docs must make the general-vs-wedge distinction explicit.
26. Risks and mitigations¶
Risk: product feels too abstract¶
Mitigation: - ship the coding-agent demo early - use concrete before/after examples
Risk: product gets mistaken for another orchestration framework¶
Mitigation: - keep runtime integration simple - position the product as substrate, not orchestrator
Risk: memory becomes unreliable¶
Mitigation: - keep writes explicit in V1 - preserve provenance and confidence - avoid hidden autonomous learning
Risk: dashboards become the whole product¶
Mitigation: - keep the SDK and sidecar as the core - treat UI as a debugging surface, not the primary value
Risk: coding-agent wedge pollutes the generic core¶
Mitigation: - enforce domain-neutral names in core packages - keep code-specific logic in examples and adapters
27. What not to postpone by accident¶
Do not leave these as “later”: - provenance on every context item - explicit compile decisions - replay - durable memory with invalidation - policy checks - step diffs
These are the product, not polish.
28. V1 one-sentence mandate¶
Build a general-purpose context, memory, and receipts layer for AI agents, shipped as a Python package plus local sidecar and dashboard, with the first polished go-to-market wedge focused on repo-local coding agents but with a generic core that works across agent domains.