Selector v1

What Selector v1 is

Selector v1 is the schema-v2 empirical selector trained on the family-aware selector-v1 dataset.

It stays within the same product constraints as Selector v0:

deterministic
dependency-free
inspectable
local-first
hard-invariant-preserving

Selector v1 is not a generative model. It is a small empirical utility selector with:

feature schema v2
target schema v2
family-aware train/validation/test splits
support-aware backoff
visibility priors
profile priors stored as diagnostics
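
Support-aware backoff can be pictured with a minimal sketch. It assumes per-key statistics with a support count and a chain of keys from most to least specific; `backoff_estimate`, the key shapes, and the `min_support` value are hypothetical and are not taken from the selector artifact.

```python
def backoff_estimate(stats, keys, global_prior, min_support=5):
    """Return (estimate, backoff_level, support) for one candidate.

    stats maps a key to (mean_utility, support_count).
    keys is ordered from most specific to least specific (hypothetical layout).
    """
    for level, key in enumerate(keys):
        mean, support = stats.get(key, (0.0, 0))
        if support >= min_support:
            # Enough observations at this level: use its estimate.
            return mean, level, support
    # No level had enough support: fall back to the global prior.
    return global_prior, len(keys), 0


stats = {("docs", "constraint"): (0.9, 2), ("docs",): (0.6, 10)}
print(backoff_estimate(stats, [("docs", "constraint"), ("docs",)], 0.5))
```

The returned level and support count are the kind of values that telemetry could record alongside the score.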

What Milestone C evaluates

Milestone C is the proof step.

It answers whether Selector v1 improves the token-value tradeoff enough to justify promotion.

Milestone C adds:

offline evaluation on the selector_v1 suite
same-run comparison against selector_v0, heuristic, and fixed-profile baselines
per-category holdout-style reporting
feature-family ablations
target-component ablations
promotion gating
runtime selector_v1, hybrid, and shadow modes
selector telemetry with support, backoff, visibility, and fallback details

Conditions compared

Main offline evaluation compares:

heuristic
selector_v0
selector_v1
none
minimal
compact
constraints_open_failures

Promotion uses the same-run selector_v0 condition on the selector_v1 suite as the gate baseline. The historical selector-v0-publish report remains reference-only.

Runtime modes

The runtime accepts these selector modes:

heuristic
empirical
selector_v0
selector_v1
hybrid
shadow

heuristic remains the default.

Shadow mode

Shadow mode keeps the heuristic model-visible context unchanged.

Selector v1 still scores optional candidates, and the telemetry/receipt records:

selector scores
support counts
backoff levels
visibility actions
heuristic vs selector vs final decisions

Shadow mode is the safe place to inspect selector-v1 behavior before rollout.
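
The shadow-mode contract can be sketched as follows. This is an illustration under stated assumptions, not the runtime's code: `ShadowRecord`, `shadow_decide`, and the 0.5 threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShadowRecord:
    """Hypothetical telemetry row for one optional candidate."""
    candidate_id: str
    selector_score: float
    support_count: int
    heuristic_decision: bool
    selector_decision: bool
    final_decision: bool


def shadow_decide(candidate_id, heuristic_decision, selector_score,
                  support_count, threshold=0.5):
    # The selector's opinion is computed and recorded for inspection.
    selector_decision = selector_score >= threshold
    # The final decision always follows the heuristic in shadow mode.
    return ShadowRecord(
        candidate_id=candidate_id,
        selector_score=selector_score,
        support_count=support_count,
        heuristic_decision=heuristic_decision,
        selector_decision=selector_decision,
        final_decision=heuristic_decision,
    )
```

Rows where `selector_decision` differs from `final_decision` are the disagreements worth reviewing before any rollout.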

Hybrid mode

Hybrid mode preserves hard invariants and heuristic-required records first.

Selector v1 scores optional candidates, but low-support cases fall back to the heuristic decision.

Hybrid mode is intended for controlled experiments, not default rollout.
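
A minimal sketch of the hybrid ordering, assuming hard-invariant filtering has already run on the candidate. The function name, the `min_support` cutoff, and the score threshold are hypothetical.

```python
def hybrid_decide(heuristic_required, heuristic_decision,
                  selector_score, support_count,
                  min_support=5, threshold=0.5):
    """Return (include, reason) for one candidate in hybrid mode."""
    if heuristic_required:
        # Heuristic-required records are kept before any scoring applies.
        return True, "heuristic_required"
    if support_count < min_support:
        # Low support: defer to the heuristic decision.
        return heuristic_decision, "low_support_fallback"
    # Well-supported optional candidate: the selector score decides.
    return selector_score >= threshold, "selector"
```

Returning the reason next to the decision keeps the fallback visible in telemetry.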

Promotion gates

Promotion requires:

composite(selector_v1) >= composite(selector_v0) - tolerance
token reduction vs same-run selector_v0
no regression in constraint recall
no regression in open-failure recall
no regression in quarantine correctness
no regression in resolved-failure handling
no increase in forbidden actionable phrase rate
zero raw-quarantined model-visible leaks
zero resolved-failure-active violations
passing family-split validation
passing target-quality validation
no severe per-category regression

Promotion is reversible. Selector v1 is not made default automatically.
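
A few of these gates can be sketched as a single check. The metric keys, the relative definition of token reduction, and the function name are assumptions made for illustration; the default thresholds mirror the values used in the promotion command under "How to reproduce".

```python
def check_promotion(v1, v0, min_composite_delta=0.0,
                    min_token_reduction=0.05,
                    max_forbidden_rate_delta=0.0):
    """Return the names of failed gates; an empty list means these gates pass.

    v1 and v0 are metric dicts with hypothetical keys, both taken from the same run.
    """
    failures = []
    if v1["composite"] - v0["composite"] < min_composite_delta:
        failures.append("composite")
    # Assumed definition: relative token saving versus selector_v0.
    token_reduction = 1.0 - v1["tokens"] / v0["tokens"]
    if token_reduction < min_token_reduction:
        failures.append("token_reduction")
    for key in ("constraint_recall", "open_failure_recall",
                "quarantine_correctness"):
        if v1[key] < v0[key]:
            failures.append(key)  # no regression allowed
    if v1["forbidden_rate"] - v0["forbidden_rate"] > max_forbidden_rate_delta:
        failures.append("forbidden_rate")
    if v1["quarantine_leaks"] != 0:
        failures.append("quarantine_leaks")  # must be exactly zero
    return failures
```

The sketch covers only a subset of the list above; family-split validation, target-quality validation, and per-category checks are left out.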

Safety invariants

All selector modes preserve these hard invariants:

raw quarantined text is never model-visible
hard user constraints remain eligible
resolved failures are not shown as active work
tool output is not promoted to instruction
token budget is still enforced

Learned selector scores never override these rules.
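
One way to picture this precedence is a filter in which the invariant checks run before the learned score is consulted. The `Candidate` fields and the threshold are hypothetical, and the sketch simplifies in three ways: resolved failures are dropped rather than re-labelled, hard constraints are always included rather than merely kept eligible, and token-budget enforcement is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate record."""
    kind: str  # e.g. "constraint", "failure", "note"
    score: float
    quarantined: bool = False
    resolved: bool = False
    hard_constraint: bool = False


def apply_invariants(candidates, threshold=0.5):
    visible = []
    for c in candidates:
        if c.quarantined:
            continue  # raw quarantined text is never model-visible
        if c.kind == "failure" and c.resolved:
            continue  # resolved failures are not shown as active work
        # A low score cannot remove a hard user constraint.
        if c.hard_constraint or c.score >= threshold:
            visible.append(c)
    return visible
```

A high score on a quarantined candidate changes nothing here, because the score is never read for it.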

Feature and target ablations

Milestone C ablations retrain from cached selector-v1 inputs.

Feature-family ablations:

relational
token_economics
redundancy_conflict
provenance_trust
trajectory_session
lexical

support_confidence is reported but skipped as a raw-feature ablation because it is artifact-side metadata, not a base feature family.
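
A feature-family ablation amounts to removing one family's features from the cached inputs before retraining. The sketch below assumes each cached row is a dict of feature values and that a feature-to-family mapping is available; both the row layout and `ablate_family` are hypothetical.

```python
# Family names come from the list above; the mapping passed in is hypothetical.
FEATURE_FAMILIES = (
    "relational", "token_economics", "redundancy_conflict",
    "provenance_trust", "trajectory_session", "lexical",
)


def ablate_family(rows, family_of, family):
    """Return copies of the rows with every feature of one family removed."""
    if family not in FEATURE_FAMILIES:
        # support_confidence lands here: it is not a base feature family.
        raise ValueError(f"not an ablatable feature family: {family}")
    return [
        {name: value for name, value in row.items()
         if family_of.get(name) != family}
        for row in rows
    ]
```

The input rows are left unmodified, so the same cached inputs can be reused for each family in turn.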

Target-component ablations:

full_target_v2
legacy_labels_only
no_marginal_utility
no_value_per_token
no_visibility_targets
no_profile_targets
no_token_penalty
no_safety_penalty

How to reproduce

Main evaluation:

uv run python -m evals.state_selector.evaluate \
  --selector evals/state_selector/artifacts/selector-v1/selector.json \
  --baseline-selector evals/state_selector/artifacts/selector-v0/selector.json \
  --suite selector_v1 \
  --backend coding_continuation \
  --out evals/state_selector/reports/selector-v1 \
  --compare heuristic,selector_v0,selector_v1,none,minimal,compact,constraints_open_failures \
  --json \
  --markdown \
  --csv \
  --seed 0

Per-category holdout-style evaluation:

uv run python -m evals.state_selector.evaluate \
  --selector evals/state_selector/artifacts/selector-v1/selector.json \
  --baseline-selector evals/state_selector/artifacts/selector-v0/selector.json \
  --suite selector_v1 \
  --backend coding_continuation \
  --out evals/state_selector/reports/selector-v1-leave-category-out \
  --compare heuristic,selector_v0,selector_v1 \
  --split-mode leave_category_out \
  --json \
  --markdown \
  --csv \
  --seed 0

Feature ablation:

uv run python -m evals.state_selector.evaluate \
  --selector evals/state_selector/artifacts/selector-v1/selector.json \
  --suite selector_v1 \
  --backend coding_continuation \
  --out evals/state_selector/reports/selector-v1-feature-ablation \
  --feature-ablation all \
  --json \
  --markdown \
  --csv \
  --seed 0

Target ablation:

uv run python -m evals.state_selector.evaluate \
  --selector evals/state_selector/artifacts/selector-v1/selector.json \
  --suite selector_v1 \
  --backend coding_continuation \
  --out evals/state_selector/reports/selector-v1-target-ablation \
  --target-ablation all \
  --json \
  --markdown \
  --csv \
  --seed 0

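Promotion (the gate baseline is the same-run selector_v0 condition in the evaluation report; the selector-v0-publish report passed as --baseline-report is reference-only):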

uv run python -m evals.state_selector.promote \
  --selector evals/state_selector/artifacts/selector-v1/selector.json \
  --eval-report evals/state_selector/reports/selector-v1/metrics.json \
  --baseline-selector evals/state_selector/artifacts/selector-v0/selector.json \
  --baseline-report evals/state_selector/reports/selector-v0-publish/metrics.json \
  --out evals/state_selector/artifacts/promoted-v1 \
  --min-composite-delta-vs-v0 0.00 \
  --min-token-reduction-vs-v0 0.05 \
  --max-forbidden-rate-delta 0.00 \
  --require-no-safety-regression \
  --require-family-split-validation \
  --require-target-v2-validation

Limitations

Milestone C proves only the current offline continuation-suite behavior.

It does not prove:

external live benchmark improvement
production runtime improvement
that learned profile choice should be enabled

Current selector-v1 training data is still profile-degenerate, so profile priors remain stats-only. External live benchmark reruns remain optional and out of CI.