External Coding Benchmarks

Why external benchmarks are needed

The offline coding-continuation eval measures state fidelity:

does the continuation context preserve hard constraints?
does it keep active failures while deactivating resolved ones?
does it quarantine risky content?
does it reduce token load?

That is useful, but it is still an offline proxy.

External coding benchmarks answer a different question:

does the same agent solve more benchmark tasks when its continuation or task context is packaged with StatePlane?

Both layers matter.

To find out which StatePlane groups are worth carrying into the model before you run live external benchmarks, use State Value Lab. It uses the deterministic continuation backend to measure group usefulness and token tradeoffs without live model calls.

The latest checked-in harness-evaluated snapshot is in External Coding Benchmark Results.

Selector work is validated on the offline continuation backend first. See State Selector v0 for dataset validation, training, evaluation, and promotion of the first empirical selector.

Relationship to the coding-continuation eval

The in-repo continuation eval measures context quality. External coding benchmarks measure task outcomes.

A StatePlane context can look better offline and still fail to improve live task outcomes. Conversely, a live benchmark can improve for reasons that are not specific to state fidelity.

Use both reports together.

Supported benchmarks

V1 implements one adapter:

swebench

Reserved for future work:

swe_lancer (placeholder only in this release)

SWE-bench-family adapter

The SWE-bench adapter supports three task-loading modes:

checked-in smoke fixtures
local --tasks-json
optional lazy dataset loading when benchmark dependencies are installed

The adapter writes one predictions JSONL per condition and can optionally shell out to the SWE-bench harness.
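As a rough illustration of the per-condition predictions files, the sketch below writes one JSONL file per condition using the three fields that SWE-bench prediction records carry (instance_id, model_name_or_path, model_patch). The function name, the predictions/<condition>.jsonl layout, and the model label are assumptions made for illustration; they are not taken from the adapter's code.

```python
import json
from pathlib import Path


def write_predictions(out_dir, condition, patches, model_name="example-agent"):
    """Write one SWE-bench-style predictions file for a single condition.

    `patches` maps instance_id -> unified diff text.
    The file layout and `model_name` default are hypothetical.
    """
    pred_dir = Path(out_dir) / "predictions"
    pred_dir.mkdir(parents=True, exist_ok=True)
    path = pred_dir / f"{condition}.jsonl"  # hypothetical file name

    with path.open("w", encoding="utf-8") as fh:
        # Sort by instance id so repeated runs produce identical files.
        for instance_id, patch in sorted(patches.items()):
            record = {
                "instance_id": instance_id,
                "model_name_or_path": f"{model_name}-{condition}",
                "model_patch": patch,
            }
            fh.write(json.dumps(record) + "\n")
    return path
```

Keeping one file per condition means each condition can be handed to the harness separately while the task set stays identical.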

Dry-run pipeline mode

Dry run remains the default validation mode for tests and local smoke checks.

It:

loads synthetic SWE-bench-like tasks
builds prompts for all conditions
runs a deterministic fake agent
writes patch files and predictions JSONL
generates a report
does not call a model, Docker, or the network

Run it with:

uv run python -m evals.external_coding.run \
  --benchmark swebench \
  --suite smoke \
  --conditions transcript_replay,compact_summary,stateplane \
  --dry-run \
  --out /tmp/stateplane-external-smoke \
  --json \
  --markdown
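To make "deterministic fake agent" concrete, here is a minimal sketch of one way such a stand-in could behave: it derives a placeholder diff purely from the task id and condition, so repeated dry runs produce byte-identical patch files. The function name, the DRY_RUN.txt file, and the hashing scheme are all hypothetical and are not the repository's implementation.

```python
import hashlib


def fake_agent_patch(instance_id: str, condition: str) -> str:
    """Return a placeholder unified diff that depends only on the inputs.

    No model, Docker, or network access is involved.
    """
    token = hashlib.sha256(
        f"{instance_id}\n{condition}".encode("utf-8")
    ).hexdigest()[:12]
    return (
        "diff --git a/DRY_RUN.txt b/DRY_RUN.txt\n"
        "new file mode 100644\n"
        "--- /dev/null\n"
        "+++ b/DRY_RUN.txt\n"
        "@@ -0,0 +1 @@\n"
        f"+dry-run placeholder {token}\n"
    )
```

Because the output is a pure function of its inputs, a test can compare patch files across runs without any tolerance for nondeterminism.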

Live patch-generation mode

The primary live path is --agent codex. A generic --agent command path is still available for other runners, but Codex is the agent this adapter was designed to benchmark first.

Set up a clean benchmark Codex home if your default ~/.codex/config.toml contains unrelated MCP or local tool configuration:

mkdir -p /tmp/stateplane-codex-home
cp ~/.codex/auth.json /tmp/stateplane-codex-home/auth.json

Then run live patch generation on the pinned SWE-bench Lite subset:

uv run python -m evals.external_coding.run \
  --benchmark swebench \
  --dataset_name princeton-nlp/SWE-bench_Lite \
  --split test \
  --instance-ids evals/external_coding/fixtures/live/swebench_lite_smoke_instance_ids.json \
  --conditions transcript_replay,compact_summary,stateplane \
  --agent codex \
  --codex-home /tmp/stateplane-codex-home \
  --codex-model gpt-5.4-mini \
  --out /tmp/stateplane-external-live \
  --json \
  --markdown \
  --csv

This mode generates real patches in checked-out task worktrees, but it does not support resolved-rate claims unless the external harness is also run.

Harness-evaluated mode

If optional dependencies are installed, the adapter can also run a SWE-bench-compatible harness. This is the required mode for any real benchmark claim.

Example:

uv run python -m evals.external_coding.run \
  --benchmark swebench \
  --dataset_name princeton-nlp/SWE-bench_Lite \
  --split test \
  --instance-ids evals/external_coding/fixtures/live/swebench_lite_smoke_instance_ids.json \
  --conditions transcript_replay,compact_summary,stateplane \
  --agent codex \
  --codex-home /tmp/stateplane-codex-home \
  --codex-model gpt-5.4-mini \
  --run-harness \
  --max-workers 1 \
  --out /tmp/stateplane-external-harness \
  --json \
  --markdown \
  --csv

Environment prerequisites for this mode:

a dedicated repo venv with datasets and swebench installed
Docker running locally
a valid Codex CLI auth file
a Codex model that is supported by your Codex account plan
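A quick preflight check along these lines can catch most of the prerequisites above before a long run starts. This is a sketch under stated assumptions: the function name is hypothetical, it assumes the auth file is named auth.json inside the Codex home (as in the setup commands earlier), and it only checks that a docker executable is on PATH, not that the daemon is running or that your account supports the chosen model.

```python
import importlib.util
import shutil
from pathlib import Path


def missing_prerequisites(codex_home, modules=("datasets", "swebench")):
    """Return a list of human-readable problems; empty means nothing was found."""
    problems = []

    # Optional benchmark dependencies must be importable in the active venv.
    for name in modules:
        if importlib.util.find_spec(name) is None:
            problems.append(f"Python package not importable: {name}")

    # Presence on PATH only; this does not prove the daemon is running.
    if shutil.which("docker") is None:
        problems.append("docker executable not found on PATH")

    # Assumed location of the Codex CLI auth file.
    if not (Path(codex_home) / "auth.json").is_file():
        problems.append(f"auth.json not found in {codex_home}")

    return problems
```

For example, calling missing_prerequisites("/tmp/stateplane-codex-home") before passing --run-harness and printing any returned problems gives an early, cheap failure.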

Conditions compared

The external runner compares:

transcript_replay
compact_summary
stateplane

The same task set, agent command, and benchmark harness are used across conditions. Only the prompt or state packaging changes.
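The like-for-like requirement can be enforced mechanically before any comparison is reported. The sketch below assumes a hypothetical in-memory shape, a mapping from condition name to per-instance resolved flags taken from a harness-evaluated run; it is not the schema of scorecard.json. It refuses to compute rates if the conditions did not cover the same instance ids.

```python
from typing import Dict


def resolved_rates(results: Dict[str, Dict[str, bool]]) -> Dict[str, float]:
    """Compute per-condition resolved rates over a shared task set.

    `results` maps condition -> {instance_id: resolved}. Raises ValueError
    if the conditions were not evaluated on identical instance ids.
    """
    task_sets = {frozenset(by_task) for by_task in results.values()}
    if len(task_sets) > 1:
        raise ValueError("conditions were run on different task sets")

    return {
        condition: sum(by_task.values()) / len(by_task)
        for condition, by_task in results.items()
        if by_task  # skip empty conditions to avoid dividing by zero
    }
```

Failing loudly on a mismatched task set is deliberate: a rate computed over different instances per condition would not isolate the effect of prompt or state packaging.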

StatePlane now separates model context from audit receipts:

the model sees a compact, relevance-ranked StatePlane context block
artifacts still keep the richer receipt, exclusions, and replay details

For single-shot external tasks, the default StatePlane policy is single_shot. If no useful prior state exists, StatePlane injects no model-visible snapshot at all. That is intentional.
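The "inject nothing when there is nothing useful" behavior can be pictured with a small sketch. Everything here is illustrative: the function names, the block header text, and the definition of "useful" as "non-blank" are assumptions, not StatePlane's actual selection logic.

```python
from typing import Optional, Sequence


def render_snapshot(state_items: Sequence[str]) -> Optional[str]:
    """Return a model-visible context block, or None if no item is useful.

    "Useful" is approximated here as "non-blank after stripping".
    """
    useful = [item.strip() for item in state_items if item.strip()]
    if not useful:
        return None
    return "StatePlane context:\n" + "\n".join(f"- {item}" for item in useful)


def build_prompt(task_text: str, state_items: Sequence[str] = ()) -> str:
    """Prepend the snapshot only when one exists; otherwise pass the task through."""
    snapshot = render_snapshot(state_items)
    if snapshot is None:
        return task_text
    return f"{snapshot}\n\n{task_text}"
```

With no prior state, build_prompt returns the task text unchanged, so a single-shot StatePlane prompt carries no empty placeholder block that would add tokens without adding information.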

You can override the StatePlane packaging policy explicitly with:

--stateplane-policy auto|single_shot|continuation|debug|eval
--stateplane-profile none|minimal|compact|full|receipt

Run a smoke dry run

uv run python -m evals.external_coding.run \
  --benchmark swebench \
  --suite smoke \
  --conditions transcript_replay,compact_summary,stateplane \
  --dry-run \
  --out /tmp/stateplane-external-smoke \
  --json \
  --markdown

Run against SWE-bench Lite

uv run python -m evals.external_coding.run \
  --benchmark swebench \
  --dataset_name princeton-nlp/SWE-bench_Lite \
  --split test \
  --instance-ids evals/external_coding/fixtures/live/swebench_lite_smoke_instance_ids.json \
  --conditions transcript_replay,compact_summary,stateplane \
  --agent codex \
  --codex-home /tmp/stateplane-codex-home \
  --codex-model gpt-5.4-mini \
  --run-harness \
  --out /tmp/stateplane-external-harness \
  --json \
  --markdown \
  --csv

Read the report

Each run writes artifacts such as:

<out>/
  metadata.json
  tasks.json
  scorecard.json
  report.md
  report.csv
  predictions/
  prompts/
  patches/
  stateplane/
  harness/

report.md is the human-readable summary. scorecard.json records the result for every task-condition pair. When a Codex task fails, inspect agent_logs/ and worktrees/ first. snapshot.md is the model-visible StatePlane block. receipt.md is the human-readable audit receipt, written when StatePlane artifacts are exported.

What claims are valid

dry_run_pipeline_only: validates the pipeline only, not coding performance
patch_generation_only: patches were generated, but resolved rates are unavailable
harness_evaluated: resolved-rate claims are allowed

Do not publish performance claims from dry-run reports.
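The three claim levels follow directly from how a run was executed, which makes them easy to guard in code. The sketch below is a hypothetical helper, not part of the runner; it maps the --dry-run and --run-harness choices to the labels above and gates resolved-rate reporting on the harness-evaluated level.

```python
def claim_level(dry_run: bool, harness_ran: bool) -> str:
    """Map how a run was executed to the strongest claim it supports."""
    if dry_run:
        # A dry run never supports performance claims, whatever else was set.
        return "dry_run_pipeline_only"
    if harness_ran:
        return "harness_evaluated"
    return "patch_generation_only"


def may_publish_resolved_rate(level: str) -> bool:
    """Only harness-evaluated runs support resolved-rate claims."""
    return level == "harness_evaluated"
```

Checking may_publish_resolved_rate before rendering a resolved-rate column is one way to keep dry-run or patch-only numbers out of published reports.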

Limitations

SWE-bench-style tasks are primarily issue-resolution tasks, not always interrupted continuation tasks.
StatePlane's strongest thesis is continuation across evolving agent state.
The external benchmark adapter is therefore a task-outcome validation layer, not a replacement for the continuation eval.
Optional dataset loading and harness execution are lazy and may require additional local setup.
Some Codex accounts support only a subset of the available Codex CLI models, so the benchmark configuration must use a model that your account can actually execute.

Future work: SWE-Lancer and real continuation traces

The next useful expansions are:

a real swe_lancer adapter
continuation-transformed external tasks
evaluation against real local coding-agent traces
attribution of external outcome deltas back to StatePlane state groups after the offline State Value Lab stabilizes