External Coding Benchmarks¶
Why external benchmarks are needed¶
The offline coding-continuation eval measures state fidelity:
- does the continuation context preserve hard constraints?
- does it keep active failures while deactivating resolved ones?
- does it quarantine risky content?
- does it reduce token load?
That is useful, but it is still an offline proxy.
External coding benchmarks answer a different question:
- does the same agent solve more benchmark tasks when its continuation or task context is packaged with StatePlane?
Both layers matter.
If you need to understand which StatePlane groups are worth carrying into the model before you run live external benchmarks, use State Value Lab. That lab uses the deterministic continuation backend to measure group usefulness and token tradeoffs without live model calls.
The latest checked-in harness-evaluated snapshot is in External Coding Benchmark Results.
Selector work is currently validated on the offline continuation backend first. See State Selector v0 for dataset validation, training, evaluation, and promotion of the first empirical selector.
Relationship to the coding-continuation eval¶
The in-repo continuation eval measures context quality. External coding benchmarks measure task outcomes.
A StatePlane context can look better offline and still fail to improve live task outcomes. Conversely, a live benchmark can improve for reasons that are not specific to state fidelity.
Use both reports together.
Supported benchmarks¶
V1 implements one adapter:
swebench
Future work is reserved for:
swe_lancer (placeholder only in this release)
SWE-bench-family adapter¶
The SWE-bench adapter supports three task-loading modes:
- checked-in smoke fixtures
- local --tasks-json
- optional lazy dataset loading when benchmark dependencies are installed
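The lazy-loading mode could be sketched roughly as follows. This is an illustration, not the adapter's actual code: `load_tasks_lazily` and `filter_by_instance_ids` are hypothetical names, and the instance-ID file is assumed to be a JSON array of strings. The `datasets` import sits inside the function, so the module still imports cleanly when the optional dependency is absent.

```python
import json
from pathlib import Path


def filter_by_instance_ids(rows, instance_ids):
    """Keep only rows whose instance_id is in the pinned set, preserving row order."""
    wanted = set(instance_ids)
    return [row for row in rows if row["instance_id"] in wanted]


def load_tasks_lazily(dataset_name, split, instance_ids_path):
    """Load benchmark tasks, importing `datasets` only when this mode is used."""
    try:
        from datasets import load_dataset  # optional dependency
    except ImportError as exc:
        raise RuntimeError(
            "Lazy dataset loading needs the optional `datasets` package."
        ) from exc

    # Assumption: the file holds a JSON array of instance-ID strings.
    instance_ids = json.loads(Path(instance_ids_path).read_text(encoding="utf-8"))
    rows = load_dataset(dataset_name, split=split)
    return filter_by_instance_ids(rows, instance_ids)
```

Keeping the filter separate from the network-bound load makes the pinning logic testable without downloading a dataset.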
The adapter writes one predictions JSONL per condition and can optionally shell out to the SWE-bench harness.
Dry-run pipeline mode¶
Dry run remains the default validation mode for tests and local smoke checks.
It:
- loads synthetic SWE-bench-like tasks
- builds prompts for all conditions
- runs a deterministic fake agent
- writes patch files and predictions JSONL
- generates a report
- does not call a model, Docker, or the network
Run it with:
uv run python -m evals.external_coding.run \
--benchmark swebench \
--suite smoke \
--conditions transcript_replay,compact_summary,stateplane \
--dry-run \
--out /tmp/stateplane-external-smoke \
--json \
--markdown
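The dry-run steps above can be pictured with a small Python sketch of a deterministic fake agent and a per-condition predictions file. The function names, the placeholder patch, and the `predictions/<condition>.jsonl` layout are assumptions made for illustration. The three record keys (`instance_id`, `model_name_or_path`, `model_patch`) are the ones the SWE-bench harness reads from a predictions file.

```python
import hashlib
import json
from pathlib import Path


def fake_patch(instance_id, condition):
    """Return a placeholder diff that depends only on the inputs (no model call)."""
    digest = hashlib.sha256(f"{instance_id}:{condition}".encode()).hexdigest()[:12]
    return (
        "--- a/PLACEHOLDER.txt\n"
        "+++ b/PLACEHOLDER.txt\n"
        "@@ -0,0 +1 @@\n"
        f"+dry-run {digest}\n"
    )


def write_predictions(out_dir, condition, instance_ids):
    """Write one JSONL file for a condition, one record per task."""
    # Hypothetical layout: <out>/predictions/<condition>.jsonl
    path = Path(out_dir) / "predictions" / f"{condition}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        for instance_id in instance_ids:
            record = {
                "instance_id": instance_id,
                "model_name_or_path": f"dry-run-{condition}",
                "model_patch": fake_patch(instance_id, condition),
            }
            handle.write(json.dumps(record) + "\n")
    return path
```

Because the patch is derived from a hash of its inputs, repeated runs produce byte-identical files, which is what makes the dry run suitable for tests.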
Live patch-generation mode¶
The primary live path is --agent codex. A generic --agent command path is still available for
other runners, but Codex is the benchmark surface this adapter is designed to compare first.
Set up a clean benchmark Codex home if your default ~/.codex/config.toml contains unrelated MCP
or local tool configuration:
Then run live patch generation on the pinned SWE-bench Lite subset:
uv run python -m evals.external_coding.run \
--benchmark swebench \
--dataset_name princeton-nlp/SWE-bench_Lite \
--split test \
--instance-ids evals/external_coding/fixtures/live/swebench_lite_smoke_instance_ids.json \
--conditions transcript_replay,compact_summary,stateplane \
--agent codex \
--codex-home /tmp/stateplane-codex-home \
--codex-model gpt-5.4-mini \
--out /tmp/stateplane-external-live \
--json \
--markdown \
--csv
This mode generates real patches in checked-out task worktrees, but it cannot support resolved-rate claims unless the external harness is also run.
Harness-evaluated mode¶
If optional dependencies are installed, the adapter can also run a SWE-bench-compatible harness. This is the required mode for any real benchmark claim.
Example:
uv run python -m evals.external_coding.run \
--benchmark swebench \
--dataset_name princeton-nlp/SWE-bench_Lite \
--split test \
--instance-ids evals/external_coding/fixtures/live/swebench_lite_smoke_instance_ids.json \
--conditions transcript_replay,compact_summary,stateplane \
--agent codex \
--codex-home /tmp/stateplane-codex-home \
--codex-model gpt-5.4-mini \
--run-harness \
--max-workers 1 \
--out /tmp/stateplane-external-harness \
--json \
--markdown \
--csv
Environment prerequisites for this mode:
- a dedicated repo venv with datasets and swebench installed
- Docker running locally
- a valid Codex CLI auth file
- a Codex model that is supported by your Codex account plan
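A preflight check for some of these prerequisites might look like the sketch below. It is not part of the adapter: `missing_prerequisites` is a hypothetical helper, and the auth file location (`auth.json` under the Codex home) is an assumption. The check only confirms that a `docker` executable is on the PATH, not that the daemon is running, and it cannot tell whether your account supports the chosen model.

```python
import importlib.util
import shutil
from pathlib import Path


def missing_prerequisites(codex_home, which=shutil.which, find_spec=importlib.util.find_spec):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    for package in ("datasets", "swebench"):
        if find_spec(package) is None:
            problems.append(f"Python package not installed: {package}")
    if which("docker") is None:
        problems.append("docker executable not found on PATH")
    # Assumption: the Codex CLI auth file is <codex_home>/auth.json.
    if not (Path(codex_home) / "auth.json").is_file():
        problems.append(f"no auth.json under {codex_home}")
    return problems
```

The `which` and `find_spec` parameters exist only so the probes can be replaced in tests.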
Conditions compared¶
The external runner compares:
- transcript_replay
- compact_summary
- stateplane
The same task set, agent command, and benchmark harness are used across conditions. Only the prompt or state packaging changes.
StatePlane now separates model context from audit receipts:
- the model sees a compact, relevance-ranked StatePlane context block
- artifacts still keep the richer receipt, exclusions, and replay details
For single-shot external tasks, the default StatePlane policy is single_shot. If no useful prior
state exists, StatePlane injects no model-visible snapshot at all. That is intentional.
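The "inject nothing when there is nothing useful" behavior can be illustrated with a short sketch. The function names and the rendered block format here are hypothetical and are not StatePlane's real output. The point is only that an empty prior state leaves the prompt equal to the plain task text.

```python
def render_single_shot_context(prior_state_items):
    """Render a context block, or return "" when no useful prior state exists."""
    useful = [item.strip() for item in prior_state_items if item.strip()]
    if not useful:
        return ""  # nothing to inject: the model sees no snapshot at all
    # Hypothetical block format, for illustration only.
    lines = ["StatePlane context:"] + [f"- {item}" for item in useful]
    return "\n".join(lines)


def build_prompt(task_text, prior_state_items):
    """Prepend the context block only when one was rendered."""
    context = render_single_shot_context(prior_state_items)
    return f"{context}\n\n{task_text}" if context else task_text
```

For example, `build_prompt("Fix the bug", [])` returns the task text unchanged, so the condition reduces to the bare task prompt when no prior state is available.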
You can override the StatePlane packaging policy explicitly with:
- --stateplane-policy auto|single_shot|continuation|debug|eval
- --stateplane-profile none|minimal|compact|full|receipt
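For readers wiring similar options into their own runner, the two flags map naturally onto `argparse` choices. This is a sketch under assumptions: `build_parser` is a hypothetical name, and `default=None` is used here to mean "no explicit override", which may not match the real runner's defaults.

```python
import argparse

POLICIES = ("auto", "single_shot", "continuation", "debug", "eval")
PROFILES = ("none", "minimal", "compact", "full", "receipt")


def build_parser():
    """Build a parser that rejects any value outside the documented choices."""
    parser = argparse.ArgumentParser(prog="evals.external_coding.run")
    # Assumption: None stands for "no override"; the real defaults may differ.
    parser.add_argument("--stateplane-policy", choices=POLICIES, default=None)
    parser.add_argument("--stateplane-profile", choices=PROFILES, default=None)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.stateplane_policy, args.stateplane_profile)
```

With `choices` set, a misspelled policy fails at argument parsing instead of silently falling back to a default.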
Run a smoke dry run¶
uv run python -m evals.external_coding.run \
--benchmark swebench \
--suite smoke \
--conditions transcript_replay,compact_summary,stateplane \
--dry-run \
--out /tmp/stateplane-external-smoke \
--json \
--markdown
Run against SWE-bench Lite¶
uv run python -m evals.external_coding.run \
--benchmark swebench \
--dataset_name princeton-nlp/SWE-bench_Lite \
--split test \
--instance-ids evals/external_coding/fixtures/live/swebench_lite_smoke_instance_ids.json \
--conditions transcript_replay,compact_summary,stateplane \
--agent codex \
--codex-home /tmp/stateplane-codex-home \
--codex-model gpt-5.4-mini \
--run-harness \
--out /tmp/stateplane-external-harness \
--json \
--markdown \
--csv
Read the report¶
Each run writes artifacts such as:
<out>/
  metadata.json
  tasks.json
  scorecard.json
  report.md
  report.csv
  predictions/
  prompts/
  patches/
  stateplane/
  harness/
report.md is the human-readable summary.
scorecard.json keeps every task-condition result visible.
agent_logs/ and worktrees/ are the first places to inspect when a Codex task fails.
snapshot.md is the model-visible StatePlane block. receipt.md is the human-readable audit
receipt when StatePlane artifacts are exported.
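If you want to post-process scorecard.json yourself, a per-condition summary could be computed along these lines. The schema is an assumption: this sketch supposes a top-level `results` list whose rows carry `condition` and `resolved` fields, with `resolved` set to null when the harness was not run. Check the real file before relying on these names.

```python
import json
from collections import defaultdict
from pathlib import Path


def resolved_rate_by_condition(results):
    """Return {condition: resolved / evaluated}, skipping rows without a verdict."""
    totals = defaultdict(lambda: [0, 0])  # condition -> [resolved, evaluated]
    for row in results:
        if row.get("resolved") is None:
            continue  # not harness-evaluated, so no rate can be computed
        counts = totals[row["condition"]]
        counts[0] += int(bool(row["resolved"]))
        counts[1] += 1
    return {cond: resolved / evaluated for cond, (resolved, evaluated) in totals.items()}


def load_results(out_dir):
    """Read the hypothetical `results` list from <out>/scorecard.json."""
    data = json.loads((Path(out_dir) / "scorecard.json").read_text(encoding="utf-8"))
    return data["results"]
```

Conditions with no harness verdicts are omitted from the result, so an unevaluated condition is never reported as a rate of zero.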
What claims are valid¶
- dry_run_pipeline_only: validates the pipeline only, not coding performance
- patch_generation_only: patches were generated, but resolved rates are unavailable
- harness_evaluated: resolved-rate claims are allowed
Do not publish performance claims from dry-run reports.
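The claim rules can be encoded as a small guard so that reporting code cannot publish a resolved rate by accident. The three mode strings come from this page. The enum, `classify_run`, and `resolved_rate_claims_allowed` are hypothetical names used only for this sketch.

```python
from enum import Enum


class EvaluationMode(str, Enum):
    DRY_RUN_PIPELINE_ONLY = "dry_run_pipeline_only"
    PATCH_GENERATION_ONLY = "patch_generation_only"
    HARNESS_EVALUATED = "harness_evaluated"


def classify_run(dry_run, ran_harness):
    """Map run flags to a mode; a dry run never counts as harness-evaluated."""
    if dry_run:
        return EvaluationMode.DRY_RUN_PIPELINE_ONLY
    if ran_harness:
        return EvaluationMode.HARNESS_EVALUATED
    return EvaluationMode.PATCH_GENERATION_ONLY


def resolved_rate_claims_allowed(mode):
    """Only harness-evaluated runs may back a resolved-rate claim."""
    return EvaluationMode(mode) is EvaluationMode.HARNESS_EVALUATED
```

Passing an unknown mode string raises `ValueError`, so a typo in metadata fails loudly instead of being treated as a valid claim level.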
Limitations¶
- SWE-bench-style tasks are primarily issue-resolution tasks, not always interrupted continuation tasks.
- StatePlane's strongest thesis is continuation across evolving agent state.
- The external benchmark adapter is therefore a task-outcome validation layer, not a replacement for the continuation eval.
- Optional dataset loading and harness execution are lazy and may require additional local setup.
- Some Codex accounts support only a subset of the available Codex CLI models, so the benchmark configuration must use a model that your account can actually execute.
Future work: SWE-Lancer and real continuation traces¶
The next useful expansions are:
- a real swe_lancer adapter
- continuation-transformed external tasks
- evaluation against real local coding-agent traces
- attribution of external outcome deltas back to StatePlane state groups after the offline State Value Lab stabilizes