External Coding Benchmark Results

Reproduction command

mkdir -p /tmp/stateplane-codex-home
cp ~/.codex/auth.json /tmp/stateplane-codex-home/auth.json

uv run python -m evals.external_coding.run \
  --benchmark swebench \
  --dataset_name princeton-nlp/SWE-bench_Lite \
  --split test \
  --instance-ids evals/external_coding/fixtures/live/swebench_lite_smoke_instance_ids.json \
  --conditions transcript_replay,compact_summary,stateplane \
  --agent codex \
  --codex-home /tmp/stateplane-codex-home \
  --codex-model gpt-5.4-mini \
  --run-harness \
  --max-workers 1 \
  --out /tmp/stateplane-external-live-five \
  --json \
  --markdown \
  --csv

Benchmark and dataset

  • Benchmark: swebench
  • Dataset: princeton-nlp/SWE-bench_Lite
  • Split: test
  • Pinned instance file: evals/external_coding/fixtures/live/swebench_lite_smoke_instance_ids.json
  • Git SHA: f3d22beaa57970538aeed547e1af3138c545e319
  • Repo-local summary artifacts: evals/external_coding/reports/swebench-lite-live-latest/

Agent configuration

  • Agent surface: codex exec
  • Codex model: gpt-5.4-mini
  • Codex home: /tmp/stateplane-codex-home
  • Codex full-auto: true
  • Harness: python -m swebench.harness.run_evaluation ... --report_dir {report_dir}

Conditions

  • transcript_replay
  • compact_summary
  • stateplane

Results

Condition         | Resolved rate ↑ | Mean prompt tokens ↓ | Mean patch size | Mean elapsed (s) ↓
transcript_replay | 0.40            | 887.40               | 1360.00         | 23.17
compact_summary   | 0.40            | 914.60               | 1387.20         | 20.34
stateplane        | 0.40            | 1736.20              | 1372.60         | 26.22

Resolved tasks for every condition:

  • astropy__astropy-14995
  • astropy__astropy-6938

Unresolved tasks for every condition:

  • astropy__astropy-12907
  • astropy__astropy-14182
  • astropy__astropy-14365

StatePlane vs transcript replay

  • Resolved-rate delta: 0.00
  • Win / tie / loss: 0 / 5 / 0
  • Mean prompt-token delta: +848.80 for StatePlane

StatePlane vs compact summary

  • Resolved-rate delta: 0.00
  • Win / tie / loss: 0 / 5 / 0
  • Mean prompt-token delta: +821.60 for StatePlane
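
The pairwise deltas above follow directly from the Results table. As a minimal sketch, the arithmetic can be reproduced as below; the dictionary layout and the `deltas` helper are illustrative and are not part of the eval harness.

```python
# Values copied from the Results table; the structure itself is hypothetical.
results = {
    "transcript_replay": {"resolved_rate": 0.40, "mean_prompt_tokens": 887.40},
    "compact_summary": {"resolved_rate": 0.40, "mean_prompt_tokens": 914.60},
    "stateplane": {"resolved_rate": 0.40, "mean_prompt_tokens": 1736.20},
}


def deltas(candidate, baseline):
    """Return candidate-minus-baseline deltas, rounded to two decimals."""
    c, b = results[candidate], results[baseline]
    return {
        "resolved_rate_delta": round(c["resolved_rate"] - b["resolved_rate"], 2),
        "prompt_token_delta": round(
            c["mean_prompt_tokens"] - b["mean_prompt_tokens"], 2
        ),
    }


for baseline in ("transcript_replay", "compact_summary"):
    print(baseline, deltas("stateplane", baseline))
```

A positive `prompt_token_delta` means StatePlane used more prompt tokens than the baseline.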

Follow-up: lean model-context policy

The live result above triggered the next product change. StatePlane was carrying useful state, but on these single-shot benchmark tasks it used roughly 1.9 times the mean prompt tokens of either baseline (1736.20 versus 887.40 and 914.60).

The lean-context follow-up changed the product policy, not the benchmark:

  • model-visible context is now separate from the richer audit receipt
  • single-shot tasks default to stateplane_policy=single_shot
  • StatePlane can intentionally inject no model-visible snapshot when no useful prior state exists
  • raw quarantined text stays out of model-visible context
  • model-visible snapshots are token-budgeted and relevance-ranked
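
To make the policy above concrete, here is a minimal sketch of how a token-budgeted, relevance-ranked selection could work. It is not StatePlane's implementation: `StateItem`, `select_snapshot`, the `min_relevance` threshold, and the word-count token estimate are all assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class StateItem:
    text: str
    relevance: float           # hypothetical score in [0, 1]
    quarantined: bool = False  # quarantined text must never reach the model


def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: count whitespace-separated words.
    return len(text.split())


def select_snapshot(items, token_budget, min_relevance=0.5):
    """Pick the items to show the model; the result may be empty."""
    # Drop quarantined items and anything below the relevance threshold.
    candidates = [
        item for item in items
        if not item.quarantined and item.relevance >= min_relevance
    ]
    # Most relevant first.
    candidates.sort(key=lambda item: item.relevance, reverse=True)

    chosen, used = [], 0
    for item in candidates:
        cost = estimate_tokens(item.text)
        # Greedy fill: skip an item that does not fit, keep trying smaller ones.
        if used + cost <= token_budget:
            chosen.append(item)
            used += cost
    return chosen


items = [
    StateItem("failing test is test_mask_propagation", relevance=0.9),
    StateItem("raw tool output", relevance=0.95, quarantined=True),
    StateItem("unrelated note from an earlier session", relevance=0.2),
]
snapshot = select_snapshot(items, token_budget=50)
print([item.text for item in snapshot])
```

When nothing passes the filters, `select_snapshot` returns an empty list, which corresponds to injecting no model-visible snapshot at all.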

The lean-context rerun used the same pinned 5-task subset and the same gpt-5.4-mini configuration, with repo-local artifacts written to:

  • evals/external_coding/reports/swebench-lite-live-lean-context/

The rerun command for that follow-up was:

mkdir -p /tmp/stateplane-codex-home
cp ~/.codex/auth.json /tmp/stateplane-codex-home/auth.json

uv run python -m evals.external_coding.run \
  --benchmark swebench \
  --dataset_name princeton-nlp/SWE-bench_Lite \
  --split test \
  --instance-ids evals/external_coding/fixtures/live/swebench_lite_smoke_instance_ids.json \
  --conditions transcript_replay,compact_summary,stateplane \
  --agent codex \
  --codex-home /tmp/stateplane-codex-home \
  --codex-model gpt-5.4-mini \
  --run-harness \
  --max-workers 1 \
  --out evals/external_coding/reports/swebench-lite-live-lean-context \
  --json \
  --markdown \
  --csv

Before vs after:

Metric                          | Previous live | Lean-context live | Change
StatePlane resolved rate        | 0.40          | 0.40              | 0.00
StatePlane mean prompt tokens   | 1736.20       | 916.00            | -820.20
StatePlane mean patch size      | 1372.60       | 1247.20           | -125.40
StatePlane mean elapsed seconds | 26.22         | 22.81             | -3.41

The main product result is prompt-cost reduction:

  • StatePlane reduced mean prompt tokens by 820.20, or about 47.2%, on the same live subset.
  • StatePlane still did not improve resolved rate on this subset; it remained tied at 0.40.
  • Against compact_summary, StatePlane moved from +821.60 mean prompt-token overhead to +1.40.
  • Against transcript_replay, StatePlane moved from +848.80 mean prompt-token overhead to +28.60.
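
The figures in this list can be checked with a few lines of arithmetic. The sketch below assumes the baseline means in the rerun equal those of the first run (887.40 and 914.60), which is what the stated overheads imply; the variable names are illustrative.

```python
# StatePlane mean prompt tokens before and after the lean-context change.
previous_tokens = 1736.20
lean_tokens = 916.00

# Baseline means, assumed unchanged between the two runs.
baselines = {"compact_summary": 914.60, "transcript_replay": 887.40}

reduction = previous_tokens - lean_tokens
percent = 100 * reduction / previous_tokens
print(f"reduction: {reduction:.2f} tokens ({percent:.1f}%)")

for name, tokens in baselines.items():
    before = previous_tokens - tokens
    after = lean_tokens - tokens
    print(f"overhead vs {name}: {before:+.2f} -> {after:+.2f}")
```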

Pairwise follow-up results:

  • StatePlane vs transcript replay (win / tie / loss):
      • previous: 0 / 5 / 0
      • lean-context: 1 / 3 / 1
  • StatePlane vs compact summary (win / tie / loss):
      • previous: 0 / 5 / 0
      • lean-context: 0 / 5 / 0
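
As a sketch of how a win / tie / loss tally can be derived from per-task outcomes, the snippet below assumes a "win" means StatePlane resolved a task the baseline did not, and a "loss" the reverse. That definition and the `win_tie_loss` helper are assumptions, not the harness's code. The example uses the first run, where every condition resolved the same two tasks; the per-task outcomes of the lean-context rerun are not listed here, so they are not reproduced.

```python
def win_tie_loss(candidate_resolved, baseline_resolved, tasks):
    """Count tasks where the candidate does better, the same, or worse."""
    wins = ties = losses = 0
    for task in tasks:
        candidate_ok = task in candidate_resolved
        baseline_ok = task in baseline_resolved
        if candidate_ok and not baseline_ok:
            wins += 1
        elif baseline_ok and not candidate_ok:
            losses += 1
        else:
            ties += 1
    return wins, ties, losses


# The five pinned task IDs from the first live run.
tasks = [
    "astropy__astropy-14995",
    "astropy__astropy-6938",
    "astropy__astropy-12907",
    "astropy__astropy-14182",
    "astropy__astropy-14365",
]
# Every condition resolved the same two tasks in that run.
resolved = {"astropy__astropy-14995", "astropy__astropy-6938"}

print(win_tie_loss(resolved, resolved, tasks))  # (0, 5, 0)
```

Under this definition, a 1 / 3 / 1 tally has equal wins and losses, so it is consistent with an unchanged resolved rate.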

Status of the lean-policy smoke guardrails after the change:

  • coding-continuation smoke remained at 1.00 composite for StatePlane
  • external dry-run smoke prompt tokens dropped to 210.33
  • external dry-run smoke prompt tokens are now below both baselines on that smoke set

This is measurable progress: the prompt-token overhead on single-shot benchmark tasks fell by about 47% with no change in resolved rate on this subset. It is not yet evidence that StatePlane improves resolved rate on SWE-bench Lite.

Failure cases

No task-condition execution failures were recorded in the harness-evaluated run.

The main negative result is not a runtime failure but a benchmark outcome:

  • On this 5-task SWE-bench Lite subset, StatePlane did not improve resolved rate over either baseline.
  • After the lean-context follow-up, StatePlane still used slightly more mean prompt tokens than transcript replay (+28.60) and essentially matched compact summary (+1.40).

Limitations

  • This is a very small pinned subset, so the result is directional rather than definitive.
  • SWE-bench-style tasks are issue-resolution tasks, not interrupted continuation traces.
  • The result should be read together with the offline continuation-context eval, not as a replacement for it.
  • These numbers are specific to codex exec with gpt-5.4-mini on this subset and should not be generalized beyond that configuration.
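
To illustrate why a 5-task subset is only directional, the sketch below computes a 95% Wilson score interval for 2 resolved tasks out of 5. The choice of the Wilson interval is an assumption made for illustration; the eval itself does not report confidence intervals.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(successes=2, n=5)
print(f"resolved rate 0.40, 95% interval roughly {low:.2f} to {high:.2f}")
```

The interval runs from roughly 0.12 to 0.77, so differences of one or two tasks between conditions on this subset cannot be distinguished from noise.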