Selector v1 Results¶
Reproduction command¶
Main evaluation:
uv run python -m evals.state_selector.evaluate \
--selector evals/state_selector/artifacts/selector-v1/selector.json \
--baseline-selector evals/state_selector/artifacts/selector-v0/selector.json \
--suite selector_v1 \
--backend coding_continuation \
--out evals/state_selector/reports/selector-v1 \
--compare heuristic,selector_v0,selector_v1,none,minimal,compact,constraints_open_failures \
--json \
--markdown \
--csv \
--seed 0
Promotion:
uv run python -m evals.state_selector.promote \
--selector evals/state_selector/artifacts/selector-v1/selector.json \
--eval-report evals/state_selector/reports/selector-v1/metrics.json \
--baseline-selector evals/state_selector/artifacts/selector-v0/selector.json \
--baseline-report evals/state_selector/reports/selector-v0-publish/metrics.json \
--out evals/state_selector/artifacts/promoted-v1 \
--min-composite-delta-vs-v0 0.00 \
--min-token-reduction-vs-v0 0.05 \
--max-forbidden-rate-delta 0.00 \
--require-no-safety-regression \
--require-family-split-validation \
--require-target-v2-validation
Dataset¶
- Suite: selector_v1
- Fixtures: 120
- Conditions evaluated: 7
- Selector-v1 evaluation rows: 840
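The row count is consistent with one evaluation row per fixture per condition. The short check below assumes that layout (it is an inference from the numbers, not something the report states) and takes the condition names from the `--compare` argument of the main evaluation command.

```python
# Condition names copied from the --compare argument of the evaluation command.
compare_arg = (
    "heuristic,selector_v0,selector_v1,none,minimal,compact,"
    "constraints_open_failures"
)
conditions = compare_arg.split(",")

fixtures = 120  # fixture count reported for the selector_v1 suite

# Assumption: one evaluation row per (fixture, condition) pair.
rows = fixtures * len(conditions)

print(len(conditions), rows)
```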
Artifacts:
- Evaluation report: evals/state_selector/reports/selector-v1/report.md
- Per-category report: evals/state_selector/reports/selector-v1-leave-category-out/report.md
- Feature ablation: evals/state_selector/reports/selector-v1-feature-ablation/report.md
- Target ablation: evals/state_selector/reports/selector-v1-target-ablation/report.md
- Promotion report: evals/state_selector/artifacts/promoted-v1/promotion_report.md
Selector artifacts¶
- Selector v1: evals/state_selector/artifacts/selector-v1/selector.json
- Selector v0: evals/state_selector/artifacts/selector-v0/selector.json
Results¶
Headline result:
- Selector v1 tied heuristic on offline continuation quality.
- Selector v1 tied Selector v0 on offline continuation quality.
- Selector v1 used slightly more tokens than Selector v0.
- Selector v1 beat constraints_open_failures strongly.
Selector v1 vs Selector v0¶
- Mean composite delta: 0.000
- Mean token delta: +1.425
- Token reduction vs Selector v0: -0.007
- Win/tie/loss: 0 / 120 / 0
Selector v1 did not beat Selector v0 on the current suite. It matched quality and used slightly more tokens.
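The token figures above can be reproduced from the mean token counts reported under Token efficiency. The sketch below assumes that "token reduction" means the fractional decrease relative to the Selector v0 mean. That definition is an assumption, although it does match the reported -0.007.

```python
# Mean tokens per condition, copied from the Token efficiency section.
mean_tokens_v0 = 198.675
mean_tokens_v1 = 200.100

# Absolute difference in mean tokens (positive means v1 uses more).
token_delta = mean_tokens_v1 - mean_tokens_v0

# Assumed definition: fractional reduction relative to the v0 baseline.
# A negative value means v1 uses more tokens than v0.
token_reduction = (mean_tokens_v0 - mean_tokens_v1) / mean_tokens_v0

print(f"mean token delta: {token_delta:+.3f}")
print(f"token reduction vs v0: {token_reduction:.3f}")
```

Under this definition, Selector v1 would need a mean of roughly 188.7 tokens or fewer to clear the 0.05 reduction threshold passed to the promotion command.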
Selector v1 vs heuristic¶
- Mean composite delta: 0.000
- Mean token delta: 0.000
- Win/tie/loss: 0 / 120 / 0
On the current selector-v1 suite, Selector v1 ties the heuristic on all 120 fixtures and has the same mean token count, so it effectively reproduces heuristic behavior.
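For readers unfamiliar with win/tie/loss counts, the sketch below shows one way such a tally could be computed from per-fixture composite scores. The function name `win_tie_loss`, the tolerance, and the example scores are all hypothetical. The actual evaluation code may define ties differently.

```python
def win_tie_loss(candidate, baseline, tol=1e-9):
    """Count fixtures where the candidate scores above, equal to, or below the baseline.

    Both arguments map fixture id -> composite score. Differences within
    `tol` count as ties (an assumed convention).
    """
    wins = ties = losses = 0
    for fixture_id, candidate_score in candidate.items():
        diff = candidate_score - baseline[fixture_id]
        if diff > tol:
            wins += 1
        elif diff < -tol:
            losses += 1
        else:
            ties += 1
    return wins, ties, losses


# Hypothetical scores for illustration only, not taken from the report.
scores_v1 = {"fixture_a": 0.80, "fixture_b": 0.65, "fixture_c": 0.90}
scores_heuristic = {"fixture_a": 0.80, "fixture_b": 0.65, "fixture_c": 0.90}

print(win_tie_loss(scores_v1, scores_heuristic))
```

With identical scores on every fixture, the tally is all ties, which is the pattern the report shows as 0 / 120 / 0.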
Token efficiency¶
- Mean tokens, heuristic: 200.100
- Mean tokens, selector_v0: 198.675
- Mean tokens, selector_v1: 200.100
- Mean tokens, constraints_open_failures: 64.000
Selector v1 is flat vs heuristic and slightly worse than Selector v0 on token use.
Safety¶
- Raw quarantined model-visible count, Selector v1: 0
- Forbidden actionable phrase rate delta vs Selector v0: 0.000
- Quarantine correctness delta vs Selector v0: 0.000
- Resolved-failure-active count, Selector v1: 35
The promotion gate rejected Selector v1 in part because its resolved-failure-active count is 35 rather than zero: resolved failures are still being counted as active work in the evaluation output.
Failure cases¶
The main failure surface is concentrated in:
- prompt_injection_quarantine
- resolved_vs_open_failure
Those categories drive the non-zero resolved_failure_active_count in both heuristic and selector conditions on the current suite.
Promotion decision¶
Decision: rejected
Failed gates:
- token reduction vs same-run selector_v0
- resolved-failure-active count must be zero
Passed gates:
- no composite regression vs same-run selector_v0
- no constraint recall regression
- no open-failure recall regression
- no quarantine correctness regression
- no forbidden-rate regression
- family-split validation passed
- target-quality validation passed
- no severe per-category regression
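The decision can be traced with a minimal sketch that applies the numeric thresholds from the promotion command to the metrics reported above. The dictionary keys and gate labels are hypothetical names chosen for illustration, and the sketch covers only the four gates whose inputs appear in this report. It is not the logic of `evals.state_selector.promote`.

```python
# Metrics copied from this report. The key names are hypothetical.
metrics = {
    "composite_delta_vs_v0": 0.000,
    "token_reduction_vs_v0": -0.007,
    "forbidden_rate_delta": 0.000,
    "resolved_failure_active_count": 35,
}

# Thresholds copied from the promotion command's flags.
MIN_COMPOSITE_DELTA = 0.00
MIN_TOKEN_REDUCTION = 0.05
MAX_FORBIDDEN_RATE_DELTA = 0.00

# Each gate maps a label to whether it passed.
gates = {
    "no composite regression": metrics["composite_delta_vs_v0"] >= MIN_COMPOSITE_DELTA,
    "token reduction": metrics["token_reduction_vs_v0"] >= MIN_TOKEN_REDUCTION,
    "no forbidden-rate regression": metrics["forbidden_rate_delta"] <= MAX_FORBIDDEN_RATE_DELTA,
    "resolved-failure-active count is zero": metrics["resolved_failure_active_count"] == 0,
}

failed = [name for name, passed in gates.items() if not passed]
decision = "rejected" if failed else "promoted"

print(decision, failed)
```

A single failed gate is enough to reject in this sketch. Here two fail, matching the failed gates listed above.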
Limitations¶
- These are offline continuation metrics, not live coding-agent outcomes.
- Per-category evaluation is holdout-style with the trained artifact as-is, not per-category retraining.
- Feature and target ablations were flat on the current dataset, so they do not yet isolate a strong learning signal.
- External live benchmarks were not rerun in this milestone.