Coding-Continuation Evaluation

What this eval measures

This harness compares continuation-context quality across three conditions:

transcript_replay
compact_summary
stateplane

It scores whether the next model context preserves the right operational state:

hard constraints
active failures
resolved failures handled as resolved
quarantined risky content
important decisions
useful procedures
relevant touched files
token usage relative to transcript replay

Why this eval matters

StatePlane claims that structured, replayable operational state is a better continuation substrate than raw transcript replay or a generic compact summary.

This harness tests that claim directly. It does not run a live model. It scores the exact continuation context each condition would send.

If you also want outcome-level validation, use External Coding Benchmarks. That layer measures task outcomes rather than continuation-context fidelity.

If you want to know which state groups actually earn their token cost inside the continuation snapshot, use State Value Lab. That layer runs deterministic group ablations on top of this same backend.

If you want to validate and train the first learned selector on top of those ablations, use State Selector v0. That layer still uses this same offline backend for evaluation and promotion gates.

Conditions

`transcript_replay`

Renders prior events as chronological transcript-style context without applying lifecycle semantics.
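As a rough illustration of this condition, the sketch below renders events in order with no filtering. It is not the harness's renderer: the `render_transcript` function and the `kind` and `text` event fields are hypothetical names chosen for the example.

```python
from typing import Iterable


def render_transcript(events: Iterable[dict]) -> str:
    """Render every event in order, one line each, with no lifecycle filtering."""
    lines = []
    for index, event in enumerate(events, start=1):
        # "kind" and "text" are assumed field names, not the real fixture schema.
        lines.append(f"[{index}] {event['kind']}: {event['text']}")
    return "\n".join(lines)


events = [
    {"kind": "failure_opened", "text": "test_login fails with HTTP 500"},
    {"kind": "failure_resolved", "text": "test_login fixed by patching auth.py"},
]
transcript = render_transcript(events)
print(transcript)
```

Note that the resolved failure still appears as an earlier "opened" line. A reader of this context has to infer from later lines that the failure is no longer active.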

`compact_summary`

Builds a deterministic compact summary from the same fixture events, but without using StatePlane reducers or replay logic.
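One way a deterministic summary of this kind could work is to group events by kind and keep only the most recent few per group. The sketch below shows that idea only. The grouping rule, the limits, and the `kind` and `text` field names are assumptions, not the harness's actual summarizer.

```python
def compact_summary(events, max_items_per_kind=2, max_chars=80):
    """Group events by kind (first-seen order) and keep the latest few per kind."""
    grouped = {}
    for event in events:
        grouped.setdefault(event["kind"], []).append(event["text"])
    lines = []
    for kind, texts in grouped.items():
        # Keep only the most recent entries and truncate long text.
        for text in texts[-max_items_per_kind:]:
            lines.append(f"- {kind}: {text[:max_chars]}")
    return "\n".join(lines)


events = [
    {"kind": "constraint", "text": "never push to main"},
    {"kind": "failure", "text": "test_a fails"},
    {"kind": "failure", "text": "test_b fails"},
    {"kind": "failure", "text": "test_c fails"},
]
summary = compact_summary(events)
print(summary)
```

The output depends only on the input events, so repeated runs give identical text. Nothing in it tracks whether a failure was later resolved, which is the gap the `stateplane` condition is meant to close.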

`stateplane`

Uses a temporary local StatePlaneSession, replays the fixture events through the real public APIs, and scores the actual StatePlane model-visible context produced by the continuation policy.

StatePlane now separates:

a compact model-visible context block
a richer audit receipt used for replay artifacts and debugging

The coding-continuation eval uses context_policy="continuation" with the compact render profile. It therefore scores only the compact model-visible block: the question is whether StatePlane carries forward the right operational state, without receipt metadata adding to the token cost of the model prompt.
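The lifecycle semantics that separate this condition from `transcript_replay` can be illustrated with a toy reducer. This is deliberately not StatePlane's reducer or API. The event kinds, the `id` field, and the `reduce_failures` function are all hypothetical and exist only to show what "resolved failures handled as resolved" means in practice.

```python
def reduce_failures(events):
    """Fold open/resolve events into separate active and resolved mappings."""
    active = {}
    resolved = {}
    for event in events:
        if event["kind"] == "failure_opened":
            active[event["id"]] = event["text"]
        elif event["kind"] == "failure_resolved":
            # Move the failure out of the active set if it was open.
            text = active.pop(event["id"], None)
            if text is not None:
                resolved[event["id"]] = text
    return active, resolved


events = [
    {"kind": "failure_opened", "id": "f1", "text": "test_login fails"},
    {"kind": "failure_opened", "id": "f2", "text": "test_export times out"},
    {"kind": "failure_resolved", "id": "f1", "text": "fixed in auth.py"},
]
active, resolved = reduce_failures(events)
print("active:", active)
print("resolved:", resolved)
```

A context built from `active` would present only `f2` as open work, while a plain replay of the same events would show `f1` as failing before showing it as fixed.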

Fixture design

Fixtures are committed JSON files under:

evals/coding_continuation/fixtures/
  smoke/
  publish/

Each fixture includes:

prior structured events
the current continuation request
explicit ground truth for what should remain active
explicit ground truth for what should be excluded or quarantined
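To make the four parts concrete, the sketch below parses an invented fixture and checks that each part is present. The key names (`id`, `events`, `request`, `ground_truth`, `must_include`, `must_exclude`) are assumptions made for illustration. Consult the committed fixtures for the real schema.

```python
import json

# Invented example; the real fixture schema may use different keys.
fixture_json = """
{
  "id": "example-fixture",
  "events": [
    {"kind": "constraint", "text": "Do not modify the public API."}
  ],
  "request": "Continue fixing the failing test.",
  "ground_truth": {
    "must_include": ["public API"],
    "must_exclude": ["ignore previous instructions"]
  }
}
"""

fixture = json.loads(fixture_json)

# Hypothetical required keys, mirroring the four parts listed above.
REQUIRED_KEYS = {"id", "events", "request", "ground_truth"}
missing = REQUIRED_KEYS - fixture.keys()
print("missing keys:", sorted(missing))
```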

The primary suites are:

smoke: fast CI-style regression coverage
publish: larger benchmark-style report generation

Metrics

Primary metrics are deterministic and transparent:

constraint recall
open failure recall
resolved failure handling
quarantine correctness
decision recall
procedure recall
touched file recall
forbidden actionable phrase rate
token estimate
token savings versus transcript replay
weighted composite fidelity

The primary eval is offline and deterministic. It does not call an LLM. It does not prove end-to-end coding-agent success. It measures whether the continuation context preserves the right operational state.
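The following sketch shows one plausible reading of several of these metrics. It is not the harness's scoring code. Substring matching, the whitespace-based token estimate, and the weights are all assumptions, and the function names are hypothetical.

```python
def recall(expected, context):
    """Fraction of expected phrases found in the context (case-insensitive)."""
    if not expected:
        return 1.0
    lowered = context.lower()
    return sum(1 for p in expected if p.lower() in lowered) / len(expected)


def forbidden_phrase_rate(forbidden, context):
    """Fraction of forbidden phrases that leak into the context."""
    if not forbidden:
        return 0.0
    lowered = context.lower()
    return sum(1 for p in forbidden if p.lower() in lowered) / len(forbidden)


def token_estimate(text):
    # Crude stand-in for a tokenizer: count whitespace-separated words.
    return len(text.split())


def token_savings(candidate, baseline):
    """Relative reduction in estimated tokens versus the baseline context."""
    base = token_estimate(baseline)
    return 0.0 if base == 0 else 1.0 - token_estimate(candidate) / base


def weighted_composite(scores, weights):
    """Weighted mean of per-metric scores; the weights here are invented."""
    return sum(scores[k] * w for k, w in weights.items()) / sum(weights.values())


context = "Constraint: never push to main. Open failure: test_b fails."
constraint_recall = recall(["never push to main"], context)
failure_recall = recall(["test_b fails", "test_c fails"], context)
forbidden_rate = forbidden_phrase_rate(["rerun test_a"], context)
savings = token_savings("a b c d", "a b c d e f g h i j")
composite = weighted_composite(
    {"constraint_recall": constraint_recall, "open_failure_recall": failure_recall},
    {"constraint_recall": 2.0, "open_failure_recall": 1.0},
)
print(constraint_recall, failure_recall, forbidden_rate, savings, composite)
```

Every function is a pure computation over strings, which is what lets the eval stay deterministic without calling a model.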

Run the smoke suite

uv run python -m evals.coding_continuation.run \
  --suite smoke \
  --conditions transcript_replay,compact_summary,stateplane \
  --out evals/coding_continuation/reports/smoke \
  --json \
  --markdown \
  --seed 0

Run the publish suite

uv run python -m evals.coding_continuation.run \
  --suite publish \
  --conditions transcript_replay,compact_summary,stateplane \
  --out evals/coding_continuation/reports/publish \
  --json \
  --markdown \
  --csv \
  --seed 0 \
  --bootstrap-samples 2000
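This page does not describe how `--bootstrap-samples` and `--seed` are used internally. A common use of such options is a seeded percentile bootstrap over per-fixture scores, sketched below under that assumption. The `bootstrap_mean_ci` function and the score values are invented for illustration.

```python
import random
import statistics


def bootstrap_mean_ci(values, samples=2000, seed=0, alpha=0.05):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # a fixed seed keeps the interval reproducible
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(samples)
    )
    lower = means[int((alpha / 2) * samples)]
    upper = means[min(samples - 1, int((1 - alpha / 2) * samples))]
    return lower, upper


# Invented per-fixture composite scores for one condition.
scores = [0.6, 0.8, 1.0, 0.7, 0.9]
low, high = bootstrap_mean_ci(scores)
print(f"mean={statistics.fmean(scores):.3f} ci=({low:.3f}, {high:.3f})")
```

With few fixtures the interval is wide, which is one reason small suites support only tentative conclusions.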

Read the report

Each run writes artifacts such as:

<out>/
  scorecard.json
  scorecard.csv
  report.md
  by_task.json
  metadata.json
  candidates/
    <fixture_id>/
      transcript_replay.txt
      compact_summary.txt
      stateplane.txt

report.md is the publishable summary. scorecard.json and by_task.json keep every fixture and condition visible, so an individual failure cannot disappear silently behind an aggregate score.
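A quick way to sanity-check a run is to confirm that the top-level artifacts exist. The sketch below uses only the file names listed above. The `missing_artifacts` helper is hypothetical, and it assumes that scorecard.csv is written only when `--csv` is passed.

```python
import tempfile
from pathlib import Path

# File names taken from the artifact listing above.
EXPECTED = ["scorecard.json", "scorecard.csv", "report.md", "by_task.json", "metadata.json"]


def missing_artifacts(out_dir, expect_csv=True):
    """Return the expected top-level artifact names that are absent from out_dir."""
    out = Path(out_dir)
    names = [n for n in EXPECTED if expect_csv or not n.endswith(".csv")]
    return [name for name in names if not (out / name).is_file()]


# Demonstration against a temporary directory holding only report.md.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "report.md").write_text("# report\n")
    missing = missing_artifacts(tmp)
    missing_without_csv = missing_artifacts(tmp, expect_csv=False)

print(missing)
```

Against a real run, call `missing_artifacts` with the path given to `--out`.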

Interpreting results

Use the harness to answer questions like:

Does StatePlane keep hard constraints more reliably than transcript replay?
Does it avoid replaying resolved failures as active work?
Does it label quarantined risk without replaying raw malicious text?
How much continuation context does it save relative to transcript replay?
Does the compact model-visible block still preserve the right state after relevance filtering?

If the selected suite has fewer than 20 fixtures, the report notes that the results are directional: they indicate a trend but rest on too few fixtures to support firm claims.

Limitations

This is an offline continuation-context benchmark.
It does not measure end-to-end coding-agent pass rates.
The fixtures are curated and partially synthetic.
State fidelity is necessary for good continuation behavior, but it is not the only factor in real agent success.

Extending the benchmark

To add a new fixture:

add a JSON file under fixtures/smoke/ or fixtures/publish/
keep the event kinds within the supported structured set
add explicit ground truth patterns
rerun the smoke suite
rerun the publish suite before updating any public claims

Future live-agent evaluation

The current harness deliberately stops at offline continuation-context quality.

A later live evaluation can reuse the same fixture suites to drive coding-agent or live-model runs and measure end-to-end pass rates separately.

The first concrete outcome-oriented layer now lives in External Coding Benchmarks.