External Coding Benchmark Results
Reproduction command
mkdir -p /tmp/stateplane-codex-home
cp ~/.codex/auth.json /tmp/stateplane-codex-home/auth.json
uv run python -m evals.external_coding.run \
--benchmark swebench \
--dataset_name princeton-nlp/SWE-bench_Lite \
--split test \
--instance-ids evals/external_coding/fixtures/live/swebench_lite_smoke_instance_ids.json \
--conditions transcript_replay,compact_summary,stateplane \
--agent codex \
--codex-home /tmp/stateplane-codex-home \
--codex-model gpt-5.4-mini \
--run-harness \
--max-workers 1 \
--out /tmp/stateplane-external-live-five \
--json \
--markdown \
--csv
Benchmark and dataset
- Benchmark: swebench
- Dataset: princeton-nlp/SWE-bench_Lite
- Split: test
- Pinned instance file: evals/external_coding/fixtures/live/swebench_lite_smoke_instance_ids.json
- Git SHA: f3d22beaa57970538aeed547e1af3138c545e319
- Repo-local summary artifacts: evals/external_coding/reports/swebench-lite-live-latest/
Agent configuration
- Agent surface: codex exec
- Codex model: gpt-5.4-mini
- Codex home: /tmp/stateplane-codex-home
- Codex full-auto: true
- Harness: python -m swebench.harness.run_evaluation ... --report_dir {report_dir}
Conditions
- transcript_replay
- compact_summary
- stateplane
Results
| Condition | Resolved rate ↑ | Mean prompt tokens ↓ | Mean patch size | Mean elapsed s ↓ |
|---|---|---|---|---|
| transcript_replay | 0.40 | 887.40 | 1360.00 | 23.17 |
| compact_summary | 0.40 | 914.60 | 1387.20 | 20.34 |
| stateplane | 0.40 | 1736.20 | 1372.60 | 26.22 |
Resolved tasks for every condition:
- astropy__astropy-14995
- astropy__astropy-6938
Unresolved tasks for every condition:
- astropy__astropy-12907
- astropy__astropy-14182
- astropy__astropy-14365
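The resolved rate and the win / tie / loss tallies reported in this document follow directly from these per-task outcomes. The sketch below recomputes them under one assumption that the source does not spell out: a "win" is a task the candidate condition resolved and the baseline did not, and a "loss" is the reverse. The function names are illustrative and are not part of the eval harness.

```python
# Per-task outcomes copied from the lists above.
TASKS = [
    "astropy__astropy-14995",
    "astropy__astropy-6938",
    "astropy__astropy-12907",
    "astropy__astropy-14182",
    "astropy__astropy-14365",
]
RESOLVED = {"astropy__astropy-14995", "astropy__astropy-6938"}

# Every condition resolved the same two tasks in this run.
resolved_by_condition = {
    name: set(RESOLVED)
    for name in ("transcript_replay", "compact_summary", "stateplane")
}


def resolved_rate(resolved, tasks):
    """Fraction of the pinned tasks that were resolved."""
    return len(resolved & set(tasks)) / len(tasks)


def win_tie_loss(candidate, baseline, tasks):
    """Assumed tally: win = only candidate resolved, loss = only baseline resolved."""
    wins = sum(1 for t in tasks if t in candidate and t not in baseline)
    losses = sum(1 for t in tasks if t not in candidate and t in baseline)
    return wins, len(tasks) - wins - losses, losses


stateplane_rate = resolved_rate(resolved_by_condition["stateplane"], TASKS)
vs_replay = win_tie_loss(
    resolved_by_condition["stateplane"],
    resolved_by_condition["transcript_replay"],
    TASKS,
)
print(stateplane_rate, vs_replay)
```

With identical resolved sets, every pairwise comparison is five ties, which matches the 0 / 5 / 0 tallies reported for this run.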
StatePlane vs transcript replay
- Resolved-rate delta: 0.00
- Win / tie / loss: 0 / 5 / 0
- Mean prompt-token delta: +848.80 for StatePlane
StatePlane vs compact summary
- Resolved-rate delta: 0.00
- Win / tie / loss: 0 / 5 / 0
- Mean prompt-token delta: +821.60 for StatePlane
Follow-up: lean model-context policy
The live result above was the trigger for the next product change. StatePlane was carrying useful state, but it was also paying too much prompt cost for single-shot benchmark tasks.
The lean-context follow-up changed the product policy, not the benchmark:
- model-visible context is now separate from the richer audit receipt
- single-shot tasks default to stateplane_policy=single_shot
- StatePlane can intentionally inject no model-visible snapshot when no useful prior state exists
- raw quarantined text stays out of model-visible context
- model-visible snapshots are token-budgeted and relevance-ranked
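To make the last three policy points concrete, here is a minimal sketch of one way a token-budgeted, relevance-ranked snapshot could be assembled. It is not StatePlane's implementation: the StateItem type, its fields, the select_snapshot function, and the greedy selection strategy are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateItem:
    """Hypothetical unit of prior state; not a StatePlane type."""
    text: str
    relevance: float      # assumed score, higher means more relevant
    tokens: int           # assumed precomputed token count
    quarantined: bool = False


def select_snapshot(items, token_budget, min_relevance=0.0):
    """Greedy sketch: keep the most relevant items that fit the budget.

    Quarantined items are never model-visible. An empty result means
    no snapshot is injected at all.
    """
    eligible = [
        item for item in items
        if not item.quarantined and item.relevance > min_relevance
    ]
    eligible.sort(key=lambda item: item.relevance, reverse=True)

    chosen, used = [], 0
    for item in eligible:
        if used + item.tokens <= token_budget:
            chosen.append(item)
            used += item.tokens
    return chosen


items = [
    StateItem("open task: fix mask propagation", relevance=0.9, tokens=50),
    StateItem("long prior transcript excerpt", relevance=0.8, tokens=80),
    StateItem("repo uses pytest", relevance=0.5, tokens=40),
    StateItem("untrusted pasted text", relevance=1.0, tokens=10, quarantined=True),
]
snapshot = select_snapshot(items, token_budget=100)
print([item.text for item in snapshot])
```

In this sketch the richer audit receipt would still record every item, including the skipped and quarantined ones; only the selected subset reaches the model.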
The lean-context rerun used the same pinned 5-task subset and the same gpt-5.4-mini configuration, with repo-local artifacts written to:
evals/external_coding/reports/swebench-lite-live-lean-context/
The rerun command for that follow-up was:
mkdir -p /tmp/stateplane-codex-home
cp ~/.codex/auth.json /tmp/stateplane-codex-home/auth.json
uv run python -m evals.external_coding.run \
--benchmark swebench \
--dataset_name princeton-nlp/SWE-bench_Lite \
--split test \
--instance-ids evals/external_coding/fixtures/live/swebench_lite_smoke_instance_ids.json \
--conditions transcript_replay,compact_summary,stateplane \
--agent codex \
--codex-home /tmp/stateplane-codex-home \
--codex-model gpt-5.4-mini \
--run-harness \
--max-workers 1 \
--out evals/external_coding/reports/swebench-lite-live-lean-context \
--json \
--markdown \
--csv
Before vs after:
| Metric | Previous live | Lean-context live | Change |
|---|---|---|---|
| StatePlane resolved rate | 0.40 | 0.40 | 0.00 |
| StatePlane mean prompt tokens | 1736.20 | 916.00 | -820.20 |
| StatePlane mean patch size | 1372.60 | 1247.20 | -125.40 |
| StatePlane mean elapsed seconds | 26.22 | 22.81 | -3.41 |
The main product result is prompt-cost reduction:
- StatePlane reduced mean prompt tokens by 820.20, or about 47.2%, on the same live subset.
- StatePlane still did not improve resolved rate on this subset; it remained tied at 0.40.
- Against compact_summary, StatePlane moved from +821.60 mean prompt-token overhead to +1.40.
- Against transcript_replay, StatePlane moved from +848.80 mean prompt-token overhead to +28.60.
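The figures in this list can be rechecked with a few lines of arithmetic. The sketch below assumes the baseline means for the rerun equal those in the first results table; the source does not restate them, but the reported +1.40 and +28.60 overheads are consistent with that assumption. The helper names are illustrative.

```python
# Mean prompt tokens taken from the tables in this document.
PREVIOUS_STATEPLANE = 1736.20
LEAN_STATEPLANE = 916.00
# Assumption: baseline means are unchanged between the two live runs.
BASELINES = {"transcript_replay": 887.40, "compact_summary": 914.60}


def token_change(before, after):
    """Return (absolute delta, percent change relative to before)."""
    delta = after - before
    return round(delta, 2), round(100 * delta / before, 1)


def overhead(stateplane_mean, baseline_mean):
    """Mean prompt-token overhead of StatePlane over a baseline."""
    return round(stateplane_mean - baseline_mean, 2)


delta, percent = token_change(PREVIOUS_STATEPLANE, LEAN_STATEPLANE)
print(f"change: {delta:+.2f} tokens ({percent:+.1f}%)")

for name, baseline in BASELINES.items():
    before = overhead(PREVIOUS_STATEPLANE, baseline)
    after = overhead(LEAN_STATEPLANE, baseline)
    print(f"{name}: {before:+.2f} -> {after:+.2f}")
```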
Pairwise follow-up results (win / tie / loss):
- StatePlane vs transcript replay:
  - previous: 0 / 5 / 0
  - lean-context: 1 / 3 / 1
- StatePlane vs compact summary:
  - previous: 0 / 5 / 0
  - lean-context: 0 / 5 / 0
What remained true from the lean-policy smoke guardrails:
- coding-continuation smoke remained at 1.00 composite for StatePlane
- external dry-run smoke prompt tokens dropped to 210.33
- external dry-run smoke prompt tokens are now below both baselines on that smoke set
That is real product progress: the token tax on single-shot benchmark tasks was substantially reduced without losing resolved-rate performance on this subset. It is not yet evidence that StatePlane improves resolved rate on SWE-bench Lite.
Failure cases
No task-condition execution failures were recorded in the harness-evaluated run.
The main negative result is not a runtime failure but a benchmark outcome:
- On this 5-task SWE-bench Lite subset, StatePlane did not improve resolved rate over either baseline.
- After the lean-context follow-up, StatePlane still used slightly more prompt tokens than transcript replay and essentially matched compact summary on prompt cost.
Limitations
- This is a very small pinned subset, so the result is directional rather than definitive.
- SWE-bench-style tasks are issue-resolution tasks, not interrupted continuation traces.
- The result should be read together with the offline continuation-context eval, not as a replacement for it.
- These numbers are specific to codex exec with gpt-5.4-mini on this subset and should not be generalized beyond that configuration.