Selector v1
What Selector v1 is
Selector v1 is the schema-v2 empirical selector trained on the family-aware selector-v1 dataset.
It stays within the same product constraints as Selector v0:
- deterministic
- dependency-free
- inspectable
- local-first
- hard-invariant-preserving
Selector v1 is not a generative model. It is a small empirical utility selector with:
- feature schema v2
- target schema v2
- family-aware train/validation/test splits
- support-aware backoff
- visibility priors
- profile priors stored as diagnostics
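As a rough illustration of what support-aware backoff can look like, the sketch below looks up an empirical utility under progressively coarser keys and accepts the first one with enough training support. The `Estimate` type, the key layout, the `min_support` threshold, and the sample table are all assumptions made for this example; they are not taken from the actual selector artifact.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    utility: float  # mean observed utility for this key
    support: int    # number of training examples behind that mean


def score_with_backoff(table, keys, prior, min_support=5):
    """Return (utility, backoff_level) from the most specific supported key.

    `keys` runs from most specific to most general. Level 0 means no backoff;
    level len(keys) means every key lacked support and the prior was used.
    """
    for level, key in enumerate(keys):
        estimate = table.get(key)
        if estimate is not None and estimate.support >= min_support:
            return estimate.utility, level
    return prior, len(keys)


# Hypothetical statistics table keyed by progressively coarser feature tuples.
table = {
    ("constraint", "repo_a", "recent"): Estimate(utility=0.9, support=2),
    ("constraint", "repo_a"): Estimate(utility=0.7, support=12),
    ("constraint",): Estimate(utility=0.6, support=140),
}
keys = [
    ("constraint", "repo_a", "recent"),
    ("constraint", "repo_a"),
    ("constraint",),
]
utility, level = score_with_backoff(table, keys, prior=0.5)
print(utility, level)  # 0.7 1
```

The lookup is a plain dictionary walk, so it stays deterministic and dependency-free, and the returned backoff level is the kind of value that can be surfaced in telemetry.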
What Milestone C evaluates
Milestone C is the proof step.
It answers whether Selector v1 improves the token-value tradeoff enough to justify promotion.
Milestone C adds:
- offline evaluation on the selector_v1 suite
- same-run comparison against selector_v0, heuristic, and fixed-profile baselines
- per-category holdout-style reporting
- feature-family ablations
- target-component ablations
- promotion gating
- runtime selector_v1, hybrid, and shadow modes
- selector telemetry with support, backoff, visibility, and fallback details
Conditions compared
Main offline evaluation compares:
- heuristic
- selector_v0
- selector_v1
- none
- minimal
- compact
- constraints_open_failures
Promotion uses the same-run selector_v0 condition on the selector_v1 suite as the gate
baseline. The historical selector-v0-publish report remains reference-only.
Runtime modes
Runtime accepts:
- heuristic
- empirical
- selector_v0
- selector_v1
- hybrid
- shadow
heuristic remains the default.
Shadow mode
Shadow mode keeps the heuristic model-visible context unchanged.
Selector v1 still scores optional candidates, and the telemetry/receipt records:
- selector scores
- support counts
- backoff levels
- visibility actions
- heuristic vs selector vs final decisions
Shadow mode is the safe place to inspect selector-v1 behavior before rollout.
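The shadow-mode behavior described above can be sketched as follows. This is a minimal illustration, not the project's runtime: `shadow_select`, the receipt field names, and the score threshold are hypothetical. The point it shows is that the selector is evaluated and logged while the final decision always follows the heuristic.

```python
def shadow_select(candidates, heuristic_keep, selector_score, threshold=0.5):
    """Return (visible, receipt); visible follows the heuristic only."""
    visible, receipt = [], []
    for cand in candidates:
        keep_heuristic = heuristic_keep(cand)
        score = selector_score(cand)
        keep_selector = score >= threshold
        if keep_heuristic:
            visible.append(cand)
        receipt.append({
            "candidate": cand,
            "selector_score": score,
            "heuristic": keep_heuristic,
            "selector": keep_selector,
            "final": keep_heuristic,  # shadow mode never changes the outcome
        })
    return visible, receipt


# Made-up candidates and scores, purely for illustration.
scores = {"note_a": 0.8, "note_b": 0.2, "note_c": 0.9}
visible, receipt = shadow_select(
    candidates=["note_a", "note_b", "note_c"],
    heuristic_keep=lambda c: c != "note_c",
    selector_score=scores.__getitem__,
)
# Candidates where the selector would have decided differently.
disagreements = [r["candidate"] for r in receipt if r["heuristic"] != r["selector"]]
```

Listing the disagreements from the receipt is one way to review selector-v1 behavior without changing what the model sees.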
Hybrid mode
Hybrid mode preserves hard invariants and heuristic-required records first.
Selector v1 scores optional candidates, but low-support cases fall back to the heuristic decision.
Hybrid mode is intended for controlled experiments, not default rollout.
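A minimal sketch of the hybrid decision order, assuming a per-candidate support count and a fixed score threshold. The `Candidate` fields, `min_support`, `threshold`, and the returned reason strings are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    record_id: str
    required: bool        # hard-invariant or heuristic-required record
    heuristic_keep: bool  # what the heuristic would decide
    score: float          # selector utility score
    support: int          # training support behind that score


def hybrid_decide(cand, min_support=5, threshold=0.5):
    """Return (keep, reason) following the hybrid-mode precedence."""
    if cand.required:
        return True, "required"
    if cand.support < min_support:
        # Low support: do not trust the learned score.
        return cand.heuristic_keep, "heuristic_fallback"
    return cand.score >= threshold, "selector"
```

For example, `hybrid_decide(Candidate("c2", False, True, 0.1, 2))` returns `(True, "heuristic_fallback")` because the support of 2 is below the assumed minimum, so the low selector score is ignored.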
Promotion gates
Promotion requires:
- composite(selector_v1) >= composite(selector_v0) - tolerance
- token reduction vs same-run selector_v0
- no regression in constraint recall
- no regression in open-failure recall
- no regression in quarantine correctness
- no regression in resolved-failure handling
- no increase in forbidden actionable phrase rate
- zero raw-quarantined model-visible leaks
- zero resolved-failure-active violations
- passing family-split validation
- passing target-quality validation
- no severe per-category regression
Promotion is reversible. Selector v1 is not made default automatically.
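The metric comparisons among these gates could be checked roughly as in the sketch below. It is an assumption-laden illustration: the metric key names are invented, token reduction is assumed to be a relative fraction of the selector_v0 token count, and the validation and per-category gates are left out. It does not describe how `evals.state_selector.promote` is implemented.

```python
# Metrics where selector_v1 must not score lower than selector_v0 (hypothetical keys).
NO_REGRESSION = (
    "constraint_recall",
    "open_failure_recall",
    "quarantine_correctness",
    "resolved_failure_handling",
)
# Counters that must be exactly zero for selector_v1 (hypothetical keys).
MUST_BE_ZERO = ("quarantine_leaks", "resolved_active_violations")


def promotion_failures(v1, v0, tolerance=0.0, min_token_reduction=0.05):
    """Return the names of failed gates; an empty list means all passed."""
    failures = []
    if v1["composite"] < v0["composite"] - tolerance:
        failures.append("composite")
    # Assumption: reduction is measured relative to the same-run v0 token count.
    reduction = 1.0 - v1["tokens"] / v0["tokens"]
    if reduction < min_token_reduction:
        failures.append("token_reduction")
    failures += [name for name in NO_REGRESSION if v1[name] < v0[name]]
    if v1["forbidden_phrase_rate"] > v0["forbidden_phrase_rate"]:
        failures.append("forbidden_phrase_rate")
    failures += [name for name in MUST_BE_ZERO if v1[name] != 0]
    return failures
```

Returning the list of failed gate names, rather than a single boolean, keeps the outcome inspectable.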
Safety invariants
All selector modes preserve these hard invariants:
- raw quarantined text is never model-visible
- hard user constraints remain eligible
- resolved failures are not shown as active work
- tool output is not promoted to instruction
- token budget is still enforced
Learned selector scores never override these rules.
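One way to picture "scores never override these rules" is a final filter that runs after the selector has chosen its records. The sketch below is hypothetical: the record fields (`kind`, `quarantined`, `resolved`, `tokens`) are invented, it treats hard constraints as always wanted (a stronger reading than "remain eligible"), and it does not model the tool-output rule.

```python
def enforce_invariants(records, selected_ids, token_budget):
    """Return ids that may become model-visible, in inclusion order."""
    visible, used = [], 0
    # Stable sort: hard constraints first so the budget cannot crowd them out.
    ordered = sorted(records, key=lambda r: r["kind"] != "hard_constraint")
    for rec in ordered:
        if rec["quarantined"]:
            continue  # raw quarantined text is never model-visible
        if rec["kind"] == "failure" and rec["resolved"]:
            continue  # resolved failures are not shown as active work
        # Assumption: hard constraints are included regardless of selector choice.
        wanted = rec["kind"] == "hard_constraint" or rec["id"] in selected_ids
        if wanted and used + rec["tokens"] <= token_budget:
            visible.append(rec["id"])
            used += rec["tokens"]
    return visible
```

Because the filter sees only record metadata and the budget, a high selector score has no path to re-admit a quarantined or resolved record.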
Feature and target ablations
Milestone C ablations retrain from cached selector-v1 inputs.
Feature-family ablations:
- relational
- token_economics
- redundancy_conflict
- provenance_trust
- trajectory_session
- lexical
support_confidence is reported but skipped as a raw-feature ablation because it is artifact-side
metadata, not a base feature family.
Target-component ablations:
- full_target_v2
- legacy_labels_only
- no_marginal_utility
- no_value_per_token
- no_visibility_targets
- no_profile_targets
- no_token_penalty
- no_safety_penalty
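A feature-family ablation can be pictured as removing one family's features from the cached inputs before retraining. The sketch below assumes, purely for illustration, that feature names carry a `<family>.` prefix; the real feature naming and the retraining step are not shown here and may differ.

```python
FEATURE_FAMILIES = (
    "relational",
    "token_economics",
    "redundancy_conflict",
    "provenance_trust",
    "trajectory_session",
    "lexical",
)


def ablation_plan(spec):
    """Expand 'all' into one ablation run per base feature family."""
    return list(FEATURE_FAMILIES) if spec == "all" else [spec]


def ablate_family(features, family):
    """Return a copy of `features` without the given family's entries."""
    if family not in FEATURE_FAMILIES:
        raise ValueError(f"unknown feature family: {family}")
    prefix = family + "."  # assumed naming convention
    return {k: v for k, v in features.items() if not k.startswith(prefix)}


# Hypothetical cached feature row.
row = {"lexical.overlap": 0.4, "relational.depth": 2, "token_economics.cost": 37}
without_lexical = ablate_family(row, "lexical")
```

Each family in `ablation_plan("all")` would then get its own retrain and report, so the drop in the composite score can be attributed to that family.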
How to reproduce
Main evaluation:
uv run python -m evals.state_selector.evaluate \
--selector evals/state_selector/artifacts/selector-v1/selector.json \
--baseline-selector evals/state_selector/artifacts/selector-v0/selector.json \
--suite selector_v1 \
--backend coding_continuation \
--out evals/state_selector/reports/selector-v1 \
--compare heuristic,selector_v0,selector_v1,none,minimal,compact,constraints_open_failures \
--json \
--markdown \
--csv \
--seed 0
Per-category holdout-style evaluation:
uv run python -m evals.state_selector.evaluate \
--selector evals/state_selector/artifacts/selector-v1/selector.json \
--baseline-selector evals/state_selector/artifacts/selector-v0/selector.json \
--suite selector_v1 \
--backend coding_continuation \
--out evals/state_selector/reports/selector-v1-leave-category-out \
--compare heuristic,selector_v0,selector_v1 \
--split-mode leave_category_out \
--json \
--markdown \
--csv \
--seed 0
Feature ablation:
uv run python -m evals.state_selector.evaluate \
--selector evals/state_selector/artifacts/selector-v1/selector.json \
--suite selector_v1 \
--backend coding_continuation \
--out evals/state_selector/reports/selector-v1-feature-ablation \
--feature-ablation all \
--json \
--markdown \
--csv \
--seed 0
Target ablation:
uv run python -m evals.state_selector.evaluate \
--selector evals/state_selector/artifacts/selector-v1/selector.json \
--suite selector_v1 \
--backend coding_continuation \
--out evals/state_selector/reports/selector-v1-target-ablation \
--target-ablation all \
--json \
--markdown \
--csv \
--seed 0
Promotion:
uv run python -m evals.state_selector.promote \
--selector evals/state_selector/artifacts/selector-v1/selector.json \
--eval-report evals/state_selector/reports/selector-v1/metrics.json \
--baseline-selector evals/state_selector/artifacts/selector-v0/selector.json \
--baseline-report evals/state_selector/reports/selector-v0-publish/metrics.json \
--out evals/state_selector/artifacts/promoted-v1 \
--min-composite-delta-vs-v0 0.00 \
--min-token-reduction-vs-v0 0.05 \
--max-forbidden-rate-delta 0.00 \
--require-no-safety-regression \
--require-family-split-validation \
--require-target-v2-validation
Limitations
Milestone C proves only the current offline continuation-suite behavior.
It does not prove:
- external live benchmark improvement
- production runtime improvement
- that learned profile choice should be enabled
Current selector-v1 training data is still profile-degenerate, so profile priors remain stats-only. External live benchmark reruns remain optional and out of CI.