External Coding Benchmark Results

Reproduction command

mkdir -p /tmp/stateplane-codex-home
cp ~/.codex/auth.json /tmp/stateplane-codex-home/auth.json

uv run python -m evals.external_coding.run \
  --benchmark swebench \
  --dataset_name princeton-nlp/SWE-bench_Lite \
  --split test \
  --instance-ids evals/external_coding/fixtures/live/swebench_lite_smoke_instance_ids.json \
  --conditions transcript_replay,compact_summary,stateplane \
  --agent codex \
  --codex-home /tmp/stateplane-codex-home \
  --codex-model gpt-5.4-mini \
  --run-harness \
  --max-workers 1 \
  --out /tmp/stateplane-external-live-five \
  --json \
  --markdown \
  --csv

Benchmark and dataset

Benchmark: swebench
Dataset: princeton-nlp/SWE-bench_Lite
Split: test
Pinned instance file: evals/external_coding/fixtures/live/swebench_lite_smoke_instance_ids.json
Git SHA: f3d22beaa57970538aeed547e1af3138c545e319
Repo-local summary artifacts: evals/external_coding/reports/swebench-lite-live-latest/

Agent configuration

Agent surface: codex exec
Codex model: gpt-5.4-mini
Codex home: /tmp/stateplane-codex-home
Codex full-auto: true
Harness: python -m swebench.harness.run_evaluation ... --report_dir {report_dir}

Conditions

transcript_replay
compact_summary
stateplane

Results

Condition	Resolved rate ↑	Mean prompt tokens ↓	Mean patch size	Mean elapsed (s) ↓
transcript_replay	0.40	887.40	1360.00	23.17
compact_summary	0.40	914.60	1387.20	20.34
stateplane	0.40	1736.20	1372.60	26.22

Tasks resolved under all three conditions:

astropy__astropy-14995
astropy__astropy-6938

Tasks unresolved under all three conditions:

astropy__astropy-12907
astropy__astropy-14182
astropy__astropy-14365

StatePlane vs transcript replay

Resolved-rate delta: 0.00
Win / tie / loss: 0 / 5 / 0
Mean prompt-token delta: +848.80 for StatePlane

StatePlane vs compact summary

Resolved-rate delta: 0.00
Win / tie / loss: 0 / 5 / 0
Mean prompt-token delta: +821.60 for StatePlane
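
The resolved rates and the win / tie / loss counts in the two comparisons above can be recomputed from the per-task outcomes listed under Results. The sketch below is illustrative only: it assumes a win means StatePlane resolved a task the baseline did not, a loss is the reverse, and everything else is a tie. The harness may define these differently, and the names `resolved_by_condition`, `resolved_rate`, and `win_tie_loss` are hypothetical, not part of the eval code.

```python
# Instance IDs copied from the Results section of this report.
ALL_TASKS = [
    "astropy__astropy-12907",
    "astropy__astropy-14182",
    "astropy__astropy-14365",
    "astropy__astropy-14995",
    "astropy__astropy-6938",
]
RESOLVED = {"astropy__astropy-14995", "astropy__astropy-6938"}

# In the first live run, every condition resolved the same two tasks.
resolved_by_condition = {
    "transcript_replay": set(RESOLVED),
    "compact_summary": set(RESOLVED),
    "stateplane": set(RESOLVED),
}


def resolved_rate(condition):
    """Fraction of the pinned tasks that the condition resolved."""
    return len(resolved_by_condition[condition]) / len(ALL_TASKS)


def win_tie_loss(candidate, baseline):
    """Per-task comparison on resolved status only (an assumed definition)."""
    wins = ties = losses = 0
    for task in ALL_TASKS:
        cand_ok = task in resolved_by_condition[candidate]
        base_ok = task in resolved_by_condition[baseline]
        if cand_ok and not base_ok:
            wins += 1
        elif base_ok and not cand_ok:
            losses += 1
        else:
            ties += 1
    return wins, ties, losses


for baseline in ("transcript_replay", "compact_summary"):
    delta = resolved_rate("stateplane") - resolved_rate(baseline)
    print(baseline, f"delta={delta:.2f}", win_tie_loss("stateplane", baseline))
```

Under this definition, identical resolved sets always give 0 wins, 5 ties, and 0 losses, which matches both comparisons above.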

Follow-up: lean model-context policy

The live result above triggered the next product change. StatePlane was carrying useful state, but it spent roughly 820 to 850 more mean prompt tokens than either baseline on single-shot benchmark tasks, with no gain in resolved rate.

The lean-context follow-up changed the product policy, not the benchmark:

model-visible context is now separate from the richer audit receipt
single-shot tasks default to stateplane_policy=single_shot
StatePlane can intentionally inject no model-visible snapshot when no useful prior state exists
raw quarantined text stays out of model-visible context
model-visible snapshots are token-budgeted and relevance-ranked
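
One way to picture the policy points above is a greedy selector that ranks candidate state items by relevance, drops quarantined or low-relevance items, and stops at a token budget. This is a minimal sketch under stated assumptions, not StatePlane's implementation: `StateItem`, `select_snapshot`, the `min_relevance` threshold, and the example scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateItem:
    text: str
    relevance: float           # hypothetical relevance score in [0, 1]
    tokens: int                # hypothetical precomputed token count
    quarantined: bool = False  # quarantined text must never reach the model


def select_snapshot(items, token_budget, min_relevance=0.5):
    """Pick the most relevant non-quarantined items that fit in the budget.

    Returns an empty list when nothing clears the relevance threshold,
    which models the "inject no model-visible snapshot" case.
    """
    eligible = [
        item for item in items
        if not item.quarantined and item.relevance >= min_relevance
    ]
    eligible.sort(key=lambda item: item.relevance, reverse=True)

    chosen, used = [], 0
    for item in eligible:
        if used + item.tokens <= token_budget:
            chosen.append(item)
            used += item.tokens
    return chosen


candidates = [
    StateItem("failing test name and traceback", relevance=0.9, tokens=120),
    StateItem("earlier refactor notes", relevance=0.7, tokens=200),
    StateItem("raw untrusted issue comment", relevance=0.95, tokens=50, quarantined=True),
    StateItem("unrelated setup log", relevance=0.2, tokens=30),
]

snapshot = select_snapshot(candidates, token_budget=250)
print([item.text for item in snapshot])
```

With a 250-token budget, only the first item is selected: the quarantined and low-relevance items are filtered out, and the second eligible item would exceed the budget. The richer audit receipt would keep the full candidate list regardless of what the model sees.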

The lean-context rerun used the same pinned 5-task subset and the same gpt-5.4-mini configuration, with repo-local artifacts written to:

evals/external_coding/reports/swebench-lite-live-lean-context/

The rerun command for that follow-up was:

mkdir -p /tmp/stateplane-codex-home
cp ~/.codex/auth.json /tmp/stateplane-codex-home/auth.json

uv run python -m evals.external_coding.run \
  --benchmark swebench \
  --dataset_name princeton-nlp/SWE-bench_Lite \
  --split test \
  --instance-ids evals/external_coding/fixtures/live/swebench_lite_smoke_instance_ids.json \
  --conditions transcript_replay,compact_summary,stateplane \
  --agent codex \
  --codex-home /tmp/stateplane-codex-home \
  --codex-model gpt-5.4-mini \
  --run-harness \
  --max-workers 1 \
  --out evals/external_coding/reports/swebench-lite-live-lean-context \
  --json \
  --markdown \
  --csv

Before vs after:

Metric	Previous live	Lean-context live	Change
StatePlane resolved rate	0.40	0.40	0.00
StatePlane mean prompt tokens	1736.20	916.00	-820.20
StatePlane mean patch size	1372.60	1247.20	-125.40
StatePlane mean elapsed seconds	26.22	22.81	-3.41

The main product result is prompt-cost reduction:

The lean-context policy reduced StatePlane's mean prompt tokens by 820.20 (from 1736.20 to 916.00), or about 47.2%, on the same live subset.
StatePlane still did not improve resolved rate on this subset; it remained tied at 0.40.
Against compact_summary, StatePlane moved from +821.60 mean prompt-token overhead to +1.40.
Against transcript_replay, StatePlane moved from +848.80 mean prompt-token overhead to +28.60.
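
The figures in the list above follow from the two results tables. The short check below reproduces them, assuming the baseline means in the rerun equal those in the first Results table (887.40 and 914.60), which is what the reported +28.60 and +1.40 imply. The variable names are illustrative.

```python
# Mean prompt tokens from the first live run (Results table).
previous = {
    "transcript_replay": 887.40,
    "compact_summary": 914.60,
    "stateplane": 1736.20,
}
# StatePlane mean prompt tokens from the lean-context rerun.
lean_stateplane = 916.00

reduction = previous["stateplane"] - lean_stateplane
percent_reduction = 100 * reduction / previous["stateplane"]

# Overhead of StatePlane relative to each baseline, before and after.
overhead_before = {
    name: previous["stateplane"] - previous[name]
    for name in ("transcript_replay", "compact_summary")
}
overhead_after = {
    name: lean_stateplane - previous[name]
    for name in ("transcript_replay", "compact_summary")
}

print(f"reduction: {reduction:.2f} tokens ({percent_reduction:.1f}%)")
for name in overhead_before:
    print(f"{name}: {overhead_before[name]:+.2f} -> {overhead_after[name]:+.2f}")
```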

Pairwise follow-up results (win / tie / loss):

StatePlane vs transcript replay:
  previous: 0 / 5 / 0
  lean-context: 1 / 3 / 1
StatePlane vs compact summary:
  previous: 0 / 5 / 0
  lean-context: 0 / 5 / 0

The lean-policy smoke guardrails still held:

coding-continuation smoke remained at 1.00 composite for StatePlane
external dry-run smoke prompt tokens dropped to 210.33
external dry-run smoke prompt tokens are now below both baselines on that smoke set

That is real product progress: the token tax on single-shot benchmark tasks was substantially reduced without losing resolved-rate performance on this subset. It is not yet evidence that StatePlane improves resolved rate on SWE-bench Lite.

Failure cases

No task-condition execution failures were recorded in the harness-evaluated run.

The main negative result is not a runtime failure but a benchmark outcome:

On this 5-task SWE-bench Lite subset, StatePlane did not improve resolved rate over either baseline.
After the lean-context follow-up, StatePlane still used slightly more mean prompt tokens than transcript replay (+28.60) and essentially matched compact summary (+1.40).

Limitations

This is a very small pinned subset, so the result is directional rather than definitive.
SWE-bench-style tasks are issue-resolution tasks, not interrupted continuation traces.
The result should be read together with the offline continuation-context eval, not as a replacement for it.
These numbers are specific to codex exec with gpt-5.4-mini on this subset and should not be generalized beyond that configuration.
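
To put a number on the first limitation, the sketch below computes a 95% Wilson score interval for a resolved rate of 2 out of 5. This is an illustrative calculation added here, not part of the eval harness, and the function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(successes=2, n=5)
print(f"resolved rate 0.40, 95% interval roughly {low:.2f} to {high:.2f}")
```

The interval spans roughly 0.12 to 0.77, so a tie at 0.40 on five tasks cannot distinguish between conditions whose true resolved rates differ substantially.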