Selector v1 Results

Reproduction command

Main evaluation:

uv run python -m evals.state_selector.evaluate \
  --selector evals/state_selector/artifacts/selector-v1/selector.json \
  --baseline-selector evals/state_selector/artifacts/selector-v0/selector.json \
  --suite selector_v1 \
  --backend coding_continuation \
  --out evals/state_selector/reports/selector-v1 \
  --compare heuristic,selector_v0,selector_v1,none,minimal,compact,constraints_open_failures \
  --json \
  --markdown \
  --csv \
  --seed 0

Promotion:

uv run python -m evals.state_selector.promote \
  --selector evals/state_selector/artifacts/selector-v1/selector.json \
  --eval-report evals/state_selector/reports/selector-v1/metrics.json \
  --baseline-selector evals/state_selector/artifacts/selector-v0/selector.json \
  --baseline-report evals/state_selector/reports/selector-v0-publish/metrics.json \
  --out evals/state_selector/artifacts/promoted-v1 \
  --min-composite-delta-vs-v0 0.00 \
  --min-token-reduction-vs-v0 0.05 \
  --max-forbidden-rate-delta 0.00 \
  --require-no-safety-regression \
  --require-family-split-validation \
  --require-target-v2-validation

Dataset

  • Suite: selector_v1
  • Fixtures: 120
  • Conditions evaluated: 7
  • Selector-v1 evaluation rows: 840
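
The row count follows from the fixture and condition counts. As a quick sanity check, the sketch below uses only the figures above and the condition names passed to `--compare` in the evaluation command:

```python
# Conditions exactly as listed in the --compare flag of the main evaluation command.
conditions = [
    "heuristic",
    "selector_v0",
    "selector_v1",
    "none",
    "minimal",
    "compact",
    "constraints_open_failures",
]
fixtures = 120

# One evaluation row per (fixture, condition) pair.
expected_rows = fixtures * len(conditions)
print(len(conditions), expected_rows)  # prints: 7 840
```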

Artifacts:

  • Evaluation report: evals/state_selector/reports/selector-v1/report.md
  • Per-category report: evals/state_selector/reports/selector-v1-leave-category-out/report.md
  • Feature ablation: evals/state_selector/reports/selector-v1-feature-ablation/report.md
  • Target ablation: evals/state_selector/reports/selector-v1-target-ablation/report.md
  • Promotion report: evals/state_selector/artifacts/promoted-v1/promotion_report.md

Selector artifacts

  • Selector v1: evals/state_selector/artifacts/selector-v1/selector.json
  • Selector v0: evals/state_selector/artifacts/selector-v0/selector.json

Results

Headline result:

  • Selector v1 tied heuristic on offline continuation quality.
  • Selector v1 tied Selector v0 on offline continuation quality.
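  • Selector v1 used slightly more tokens than Selector v0 (+1.425 mean tokens, about 0.7%).
  • Selector v1 strongly outperformed constraints_open_failures on offline continuation quality, at roughly three times the token cost (200.100 vs 64.000 mean tokens).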

Selector v1 vs Selector v0

  • Mean composite delta: 0.000
  • Mean token delta: +1.425
  • Token reduction vs Selector v0: -0.007 (negative: Selector v1 used more tokens)
  • Win/tie/loss: 0 / 120 / 0

Selector v1 did not beat Selector v0 on the current suite. It matched quality and used slightly more tokens.
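
For readers unfamiliar with the win/tie/loss tally, here is a minimal sketch of how such a count can be computed from per-fixture composite scores. The function name `win_tie_loss`, the tie tolerance `tol`, and the toy scores are all assumptions for illustration; the evaluation module's actual tie rule is not documented here.

```python
from typing import Sequence, Tuple


def win_tie_loss(
    candidate: Sequence[float],
    baseline: Sequence[float],
    tol: float = 1e-9,  # hypothetical tolerance; scores within tol count as a tie
) -> Tuple[int, int, int]:
    """Count fixtures where the candidate scores higher, equal, or lower."""
    if len(candidate) != len(baseline):
        raise ValueError("score lists must cover the same fixtures")
    wins = ties = losses = 0
    for c, b in zip(candidate, baseline):
        diff = c - b
        if abs(diff) <= tol:
            ties += 1
        elif diff > 0:
            wins += 1
        else:
            losses += 1
    return wins, ties, losses


# Toy scores for illustration only; these are not taken from the report.
toy_v1 = [0.80, 0.65, 0.90, 0.70]
toy_v0 = [0.80, 0.60, 0.95, 0.70]
print(win_tie_loss(toy_v1, toy_v0))  # prints: (1, 2, 1)
```

Under this definition, two conditions with identical scores on all 120 fixtures produce 0 / 120 / 0, which is the pattern reported above.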

Selector v1 vs heuristic

  • Mean composite delta: 0.000
  • Mean token delta: 0.000
  • Win/tie/loss: 0 / 120 / 0

On the current selector-v1 suite, Selector v1 is indistinguishable from heuristic on the reported metrics: all 120 fixtures tie on composite score, and mean token use is identical.

Token efficiency

  • Mean tokens, heuristic: 200.100
  • Mean tokens, selector_v0: 198.675
  • Mean tokens, selector_v1: 200.100
  • Mean tokens, constraints_open_failures: 64.000

Selector v1 matches heuristic on token use and is slightly worse than Selector v0 (+1.425 mean tokens).
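
The token delta and token reduction reported earlier can be reproduced from these means. The sketch below assumes token reduction is defined as (baseline - candidate) / baseline computed on the mean token counts; the report does not state the formula, but this definition matches the reported -0.007. The helper names are hypothetical.

```python
# Mean tokens per condition, copied from the list above.
mean_tokens = {
    "heuristic": 200.100,
    "selector_v0": 198.675,
    "selector_v1": 200.100,
    "constraints_open_failures": 64.000,
}


def token_delta(candidate: float, baseline: float) -> float:
    """Positive when the candidate uses more tokens than the baseline."""
    return candidate - baseline


def token_reduction(candidate: float, baseline: float) -> float:
    """Fraction of baseline tokens saved (assumed definition); negative means more tokens."""
    return (baseline - candidate) / baseline


v1, v0 = mean_tokens["selector_v1"], mean_tokens["selector_v0"]
print(round(token_delta(v1, v0), 3))      # prints: 1.425
print(round(token_reduction(v1, v0), 3))  # prints: -0.007
print(round(token_delta(v1, mean_tokens["heuristic"]), 3))  # prints: 0.0
```

A reduction of -0.007 is well short of the 0.05 minimum passed to the promotion command via `--min-token-reduction-vs-v0`.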

Safety

  • Raw quarantined model-visible count, Selector v1: 0
  • Forbidden actionable phrase rate delta vs Selector v0: 0.000
  • Quarantine correctness delta vs Selector v0: 0.000
  • Resolved-failure-active count, Selector v1: 35

The promotion gate rejected Selector v1 in part because the evaluation output still counts resolved failures as active work: the resolved-failure-active count is 35, and the gate requires zero.

Failure cases

The main failure surface is concentrated in:

  • prompt_injection_quarantine
  • resolved_vs_open_failure

Those categories drive the non-zero resolved_failure_active_count in both heuristic and selector conditions on the current suite.

Promotion decision

Decision: rejected

Failed gates:

  • token reduction vs same-run selector_v0
  • resolved-failure-active count must be zero

Passed gates:

  • no composite regression vs same-run selector_v0
  • no constraint recall regression
  • no open-failure recall regression
  • no quarantine correctness regression
  • no forbidden-rate regression
  • family-split validation passed
  • target-quality validation passed
  • no severe per-category regression
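
The two failed gates can be reproduced from the numbers in this report. The following is a minimal sketch, not the actual logic of `evals.state_selector.promote`: the gate names, dictionary keys, and comparison rules are assumptions, the thresholds come from the promotion command flags, and only four of the ten gates listed above are modeled.

```python
from typing import Dict, List, Tuple

# Measured values as reported in the sections above.
measured: Dict[str, float] = {
    "composite_delta_vs_v0": 0.000,
    "token_reduction_vs_v0": -0.007,
    "forbidden_rate_delta": 0.000,
    "resolved_failure_active_count": 35,
}

# Thresholds taken from the promotion command flags.
thresholds: Dict[str, float] = {
    "min_composite_delta_vs_v0": 0.00,
    "min_token_reduction_vs_v0": 0.05,
    "max_forbidden_rate_delta": 0.00,
}


def evaluate_gates(m: Dict[str, float], t: Dict[str, float]) -> List[Tuple[str, bool]]:
    """Return (gate name, passed) pairs. Gate names are hypothetical labels."""
    return [
        ("composite_delta_vs_v0",
         m["composite_delta_vs_v0"] >= t["min_composite_delta_vs_v0"]),
        ("token_reduction_vs_v0",
         m["token_reduction_vs_v0"] >= t["min_token_reduction_vs_v0"]),
        ("forbidden_rate_delta",
         m["forbidden_rate_delta"] <= t["max_forbidden_rate_delta"]),
        ("resolved_failure_active_count_zero",
         m["resolved_failure_active_count"] == 0),
    ]


results = evaluate_gates(measured, thresholds)
failed = [name for name, passed in results if not passed]
decision = "promoted" if not failed else "rejected"
print(decision, failed)
# prints: rejected ['token_reduction_vs_v0', 'resolved_failure_active_count_zero']
```

Because every gate must pass, a single failure is enough to reject; here both the token-reduction gate and the resolved-failure-active gate fail independently.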

Limitations

  • These are offline continuation metrics, not live coding-agent outcomes.
  • Per-category evaluation reports the existing trained artifact category by category; it does not retrain the selector with each category left out.
  • Feature and target ablations were flat on the current dataset, so they do not yet isolate a strong learning signal.
  • External live benchmark reruns were not performed in this milestone.