Coding-Continuation Evaluation

What this eval measures

This harness compares continuation context quality across three conditions:

  • transcript_replay
  • compact_summary
  • stateplane

It scores whether the next model context preserves the right operational state:

  • hard constraints
  • active failures
  • resolved failures handled as resolved
  • quarantined risky content
  • important decisions
  • useful procedures
  • relevant touched files
  • token usage relative to transcript replay

Why this eval matters

StatePlane claims that structured, replayable operational state is a better continuation substrate than raw transcript replay or a generic compact summary.

This harness tests that claim directly. It does not run a live model. Instead, it scores the exact continuation context each condition would send to the model.

If you also want outcome-level validation, use External Coding Benchmarks. That layer measures task outcomes rather than continuation-context fidelity.

If you want to know which state groups actually earn their token cost inside the continuation snapshot, use State Value Lab. That layer runs deterministic group ablations on top of this same backend.

If you want to validate and train the first learned selector on top of those ablations, use State Selector v0. That layer still uses this same offline backend for evaluation and promotion gates.

Conditions

transcript_replay

Renders prior events as chronological transcript-style context without applying lifecycle semantics.

compact_summary

Builds a deterministic compact summary from the same fixture events, but without using StatePlane reducers or replay logic.
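
To make the contrast between the first two conditions concrete, here is a minimal sketch of how each baseline might render the same events. The event records, field names (kind, text), and both render functions are hypothetical illustrations, not the harness's actual code or fixture schema.

```python
from collections import defaultdict

# Hypothetical event records; the real fixture schema may differ.
EVENTS = [
    {"kind": "constraint", "text": "Do not modify the public API."},
    {"kind": "failure", "text": "test_parse fails on empty input."},
    {"kind": "failure_resolved", "text": "test_parse fixed by guarding empty input."},
]


def render_transcript(events):
    """Render events in order, one line each, with no lifecycle handling."""
    return "\n".join(
        f"[{i}] {e['kind']}: {e['text']}" for i, e in enumerate(events, start=1)
    )


def render_compact_summary(events, max_chars=60):
    """Group events by kind and truncate text; no reducers, no replay logic."""
    grouped = defaultdict(list)
    for e in events:
        grouped[e["kind"]].append(e["text"][:max_chars])
    # Sorting the kinds keeps the output deterministic across runs.
    return "\n".join(f"{kind}: " + " | ".join(grouped[kind]) for kind in sorted(grouped))


if __name__ == "__main__":
    print(render_transcript(EVENTS))
    print(render_compact_summary(EVENTS))
```

In this sketch both baselines still show the original failure next to its resolution, because neither one knows that the second event closes the first. That is the gap the stateplane condition is meant to cover.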

stateplane

Uses a temporary local StatePlaneSession, replays the fixture events through the real public APIs, and scores the actual StatePlane model-visible context produced by the continuation policy.

StatePlane now separates:

  • a compact model-visible context block
  • a richer audit receipt used for replay artifacts and debugging

The coding-continuation eval uses context_policy="continuation" with the compact render profile, so it still measures whether StatePlane carries forward the right operational state without paying for receipt metadata in the model prompt.

Fixture design

Fixtures are committed JSON files under:

evals/coding_continuation/fixtures/
  smoke/
  publish/

Each fixture includes:

  • prior structured events
  • the current continuation request
  • explicit ground truth for what should remain active
  • explicit ground truth for what should be excluded or quarantined
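
As an illustration of those four parts, a fixture might be shaped roughly like the structure below. Every field name and value here is an assumption made for the example; check a committed fixture for the real schema.

```python
import json

# Hypothetical fixture layout. All keys and values are illustrative only.
fixture = {
    "fixture_id": "example_fixture",
    # 1. prior structured events
    "prior_events": [
        {"kind": "constraint", "text": "Do not modify the public API."},
        {"kind": "failure", "text": "test_parse fails on empty input."},
    ],
    # 2. the current continuation request
    "continuation_request": "Continue fixing the parser tests.",
    "ground_truth": {
        # 3. what should remain active in the continuation context
        "must_remain_active": ["public API", "test_parse"],
        # 4. what should be excluded or quarantined
        "must_be_excluded": ["ignore previous instructions"],
    },
}

if __name__ == "__main__":
    # Fixtures are committed as JSON, so the structure must serialize cleanly.
    print(json.dumps(fixture, indent=2))
```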

The primary suites are:

  • smoke: fast CI-style regression coverage
  • publish: larger benchmark-style report generation

Metrics

Primary metrics are deterministic and transparent:

  • constraint recall
  • open failure recall
  • resolved failure handling
  • quarantine correctness
  • decision recall
  • procedure recall
  • touched file recall
  • forbidden actionable phrase rate
  • token estimate
  • token savings versus transcript replay
  • weighted composite fidelity

The primary eval is offline and deterministic. It does not call an LLM. It does not prove end-to-end coding-agent success. It measures whether the continuation context preserves the right operational state.
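
The metric names above do not spell out formulas, so the following is only a sketch of how deterministic metrics of this kind could be computed. The substring matching, the four-characters-per-token estimate, and all function names are assumptions, not the harness's implementation.

```python
def recall(context, patterns):
    """Fraction of ground-truth patterns found in the context (case-insensitive)."""
    if not patterns:
        return 1.0  # nothing to recall counts as fully satisfied
    lowered = context.lower()
    return sum(1 for p in patterns if p.lower() in lowered) / len(patterns)


def forbidden_phrase_rate(context, forbidden):
    """Fraction of forbidden phrases that leaked into the context."""
    if not forbidden:
        return 0.0
    lowered = context.lower()
    return sum(1 for p in forbidden if p.lower() in lowered) / len(forbidden)


def estimate_tokens(text):
    """Rough token estimate: about four characters per token (an assumption)."""
    return max(1, len(text) // 4)


def token_savings(candidate_tokens, replay_tokens):
    """Share of tokens saved relative to transcript replay."""
    return 1.0 - candidate_tokens / replay_tokens


def weighted_composite(scores, weights):
    """Weighted average over the metrics that have a weight."""
    total = sum(weights.values())
    return sum(scores[name] * weight for name, weight in weights.items()) / total
```

For example, a candidate context of 250 estimated tokens against a 1000-token transcript replay gives token_savings(250, 1000) == 0.75.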

Run the smoke suite

uv run python -m evals.coding_continuation.run \
  --suite smoke \
  --conditions transcript_replay,compact_summary,stateplane \
  --out evals/coding_continuation/reports/smoke \
  --json \
  --markdown \
  --seed 0

Run the publish suite

uv run python -m evals.coding_continuation.run \
  --suite publish \
  --conditions transcript_replay,compact_summary,stateplane \
  --out evals/coding_continuation/reports/publish \
  --json \
  --markdown \
  --csv \
  --seed 0 \
  --bootstrap-samples 2000
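
The --bootstrap-samples and --seed flags suggest seeded resampling over per-fixture scores. The exact procedure is not described here, so the sketch below is an assumption: a standard percentile bootstrap for the mean, with a hypothetical function name and defaults.

```python
import random
from statistics import mean


def bootstrap_mean_ci(scores, samples=2000, seed=0, alpha=0.05):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)  # a fixed seed makes the interval reproducible
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(samples))
    lo = means[int((alpha / 2) * samples)]
    hi = means[min(samples - 1, int((1 - alpha / 2) * samples))]
    return lo, hi


if __name__ == "__main__":
    per_fixture = [0.9, 0.8, 1.0, 0.7, 0.95]  # made-up scores for illustration
    print(bootstrap_mean_ci(per_fixture))
```

With only a handful of fixtures the interval will be wide, which is consistent with treating small suites as directional.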

Read the report

Each run writes artifacts such as:

<out>/
  scorecard.json
  scorecard.csv
  report.md
  by_task.json
  metadata.json
  candidates/
    <fixture_id>/
      transcript_replay.txt
      compact_summary.txt
      stateplane.txt

report.md is the publishable summary. scorecard.json and by_task.json list every fixture and condition, so an individual failure cannot be hidden by aggregate scores.
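
A quick way to confirm a run produced the layout above is to check the output directory for the listed files. This is a minimal sketch: the helper name is hypothetical, and treating scorecard.csv as optional (expected only when --csv was passed) is an assumption.

```python
from pathlib import Path

# File names taken from the artifact listing above.
CORE_FILES = ("scorecard.json", "report.md", "by_task.json", "metadata.json")


def missing_artifacts(out_dir, conditions, expect_csv=False):
    """Return relative paths of expected artifacts that are absent from out_dir."""
    out = Path(out_dir)
    expected = [out / name for name in CORE_FILES]
    if expect_csv:
        expected.append(out / "scorecard.csv")
    candidates = out / "candidates"
    if candidates.is_dir():
        # One rendered context per condition for every fixture directory.
        for fixture_dir in sorted(p for p in candidates.iterdir() if p.is_dir()):
            expected.extend(fixture_dir / f"{c}.txt" for c in conditions)
    else:
        expected.append(candidates)
    return [p.relative_to(out).as_posix() for p in expected if not p.exists()]
```

Calling missing_artifacts("evals/coding_continuation/reports/smoke", ["transcript_replay", "compact_summary", "stateplane"]) returns an empty list when everything is present.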

Interpreting results

Use the harness to answer questions like:

  • Does StatePlane keep hard constraints more reliably than transcript replay?
  • Does it avoid replaying resolved failures as active work?
  • Does it label quarantined risk without replaying raw malicious text?
  • How much continuation context does it save relative to transcript replay?
  • Does the compact model-visible block still preserve the right state after relevance filtering?

If the selected suite has fewer than 20 fixtures, the report notes that the results are directional rather than conclusive.

Limitations

  • This is an offline continuation-context benchmark.
  • It does not measure end-to-end coding-agent pass rates.
  • The fixtures are curated and partially synthetic.
  • State fidelity is necessary for good continuation behavior, but it is not the only factor in real agent success.

Extending the benchmark

To add a new fixture:

  1. add a JSON file under fixtures/smoke/ or fixtures/publish/
  2. keep the event kinds within the supported structured set
  3. add explicit ground truth patterns
  4. rerun the smoke suite
  5. rerun the publish suite before updating any public claims
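
Before rerunning the suites, steps 2 and 3 can be checked with a small pre-flight script. The sketch below is hypothetical: the field names (prior_events, kind, ground_truth) and the set of supported kinds are assumptions supplied by the caller, not the harness's real schema.

```python
def validate_fixture(fixture, supported_kinds):
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    # Step 2: every event kind must be in the supported structured set.
    for index, event in enumerate(fixture.get("prior_events", [])):
        kind = event.get("kind")
        if kind not in supported_kinds:
            problems.append(f"event {index}: unsupported kind {kind!r}")
    # Step 3: ground truth must contain at least one explicit pattern.
    ground_truth = fixture.get("ground_truth", {})
    if not any(ground_truth.values()):
        problems.append("ground_truth has no patterns")
    return problems


if __name__ == "__main__":
    example = {
        "prior_events": [{"kind": "constraint", "text": "Keep the API stable."}],
        "ground_truth": {"must_remain_active": ["API stable"]},
    }
    # The kinds passed here are placeholders, not the real supported set.
    print(validate_fixture(example, {"constraint", "failure"}))
```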

Future live-agent evaluation

The current harness deliberately stops at offline continuation-context quality.

A later live evaluation can reuse the same fixture suites to drive coding-agent or live-model runs and measure end-to-end pass rates separately.

The first concrete outcome-oriented layer now lives in External Coding Benchmarks.