Selector v1

What Selector v1 is

Selector v1 is the schema-v2 empirical selector trained on the family-aware selector-v1 dataset.

It stays within the same product constraints as Selector v0:

  • deterministic
  • dependency-free
  • inspectable
  • local-first
  • hard-invariant-preserving

Selector v1 is not a generative model. It is a small empirical utility selector with:

  • feature schema v2
  • target schema v2
  • family-aware train/validation/test splits
  • support-aware backoff
  • visibility priors
  • profile priors stored as diagnostics
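
Support-aware backoff can be pictured roughly as follows. This is only a sketch under stated assumptions, not the project's implementation: the bucket key layout, the `MIN_SUPPORT` threshold, and the `estimate_utility` helper are hypothetical names chosen for illustration.

```python
from typing import Dict, Tuple

# Hypothetical threshold: minimum number of training examples a bucket
# needs before its empirical mean is trusted.
MIN_SUPPORT = 5

# Hypothetical statistics table: key -> (support count, mean utility).
Stats = Dict[Tuple[str, ...], Tuple[int, float]]


def estimate_utility(stats: Stats, key: Tuple[str, ...], prior: float = 0.0):
    """Return (utility, backoff_level, support) for a feature key.

    Tries the full key first, then drops the most specific trailing
    feature one at a time until a bucket has enough support. Falls back
    to the prior when no bucket qualifies.
    """
    for level in range(len(key) + 1):
        bucket = key[: len(key) - level]
        support, mean = stats.get(bucket, (0, prior))
        if support >= MIN_SUPPORT:
            return mean, level, support
    return prior, len(key) + 1, 0


# Illustrative placeholder values, not measured statistics.
stats: Stats = {
    ("constraint", "recent", "short"): (2, 0.9),  # too sparse to trust
    ("constraint", "recent"): (12, 0.7),          # enough support
    ("constraint",): (40, 0.5),
}
result = estimate_utility(stats, ("constraint", "recent", "short"))
print(result)  # (0.7, 1, 12)
```

Because the lookup is a plain dictionary walk, the result is deterministic and the backoff level can be logged alongside the score, which fits the inspectable, dependency-free constraints listed above.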

What Milestone C evaluates

Milestone C is the proof step: it determines whether Selector v1 improves the token-value tradeoff enough to justify promotion.

Milestone C adds:

  • offline evaluation on the selector_v1 suite
  • same-run comparison against selector_v0, heuristic, and fixed-profile baselines
  • per-category holdout-style reporting
  • feature-family ablations
  • target-component ablations
  • promotion gating
  • runtime selector_v1, hybrid, and shadow modes
  • selector telemetry with support, backoff, visibility, and fallback details

Conditions compared

Main offline evaluation compares:

  • heuristic
  • selector_v0
  • selector_v1
  • none
  • minimal
  • compact
  • constraints_open_failures

Promotion uses the same-run selector_v0 condition on the selector_v1 suite as the gate baseline. The historical selector-v0-publish report remains reference-only.

Runtime modes

Runtime accepts:

  • heuristic
  • empirical
  • selector_v0
  • selector_v1
  • hybrid
  • shadow

heuristic remains the default.

Shadow mode

In shadow mode, the model-visible context is exactly what the heuristic produces; Selector v1 does not change it.

Selector v1 still scores optional candidates, and the telemetry/receipt records:

  • selector scores
  • support counts
  • backoff levels
  • visibility actions
  • heuristic vs selector vs final decisions

Shadow mode is the safe place to inspect selector-v1 behavior before rollout.
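
A shadow-mode decision for a single optional candidate might look like the sketch below. The `ShadowRecord` fields mirror the telemetry items listed above, but the class, the `shadow_decide` function, and the score threshold are hypothetical and are not taken from the runtime.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ShadowRecord:
    """Hypothetical telemetry entry for one optional candidate."""
    candidate_id: str
    selector_score: float
    support: int
    backoff_level: int
    visibility_action: str
    heuristic_decision: bool
    selector_decision: bool
    final_decision: bool


def shadow_decide(candidate_id, heuristic_decision, selector_score,
                  support, backoff_level, visibility_action, threshold=0.5):
    """Score with the selector, but always ship the heuristic decision."""
    selector_decision = selector_score >= threshold
    return ShadowRecord(
        candidate_id=candidate_id,
        selector_score=selector_score,
        support=support,
        backoff_level=backoff_level,
        visibility_action=visibility_action,
        heuristic_decision=heuristic_decision,
        selector_decision=selector_decision,
        final_decision=heuristic_decision,  # shadow mode never changes context
    )


record = shadow_decide("rec-1", heuristic_decision=True, selector_score=0.2,
                       support=8, backoff_level=0, visibility_action="hide")
print(asdict(record)["final_decision"])  # True
```

Here the selector would have dropped the candidate, yet the final decision still follows the heuristic; the disagreement is visible only in the record.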

Hybrid mode

Hybrid mode preserves hard invariants and heuristic-required records first.

Selector v1 scores optional candidates, but low-support cases fall back to the heuristic decision.

Hybrid mode is intended for controlled experiments, not default rollout.
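
The hybrid decision order described above can be sketched as a small function. The function name, the `min_support` and `threshold` defaults, and the source labels are assumptions made for illustration; the actual runtime values are not stated in this document.

```python
def hybrid_decide(required_by_heuristic: bool, heuristic_decision: bool,
                  selector_score: float, support: int,
                  min_support: int = 5, threshold: float = 0.5):
    """Return (include, source) for one candidate under hybrid mode."""
    # Heuristic-required records are kept before any learned score is consulted.
    if required_by_heuristic:
        return True, "heuristic_required"
    # Low-support cases fall back to whatever the heuristic decided.
    if support < min_support:
        return heuristic_decision, "heuristic_fallback"
    # Otherwise the selector score decides the optional candidate.
    return selector_score >= threshold, "selector_v1"


print(hybrid_decide(False, True, selector_score=0.1, support=2))
# (True, 'heuristic_fallback')
print(hybrid_decide(False, True, selector_score=0.1, support=9))
# (False, 'selector_v1')
```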

Promotion gates

Promotion requires:

  • composite(selector_v1) >= composite(selector_v0) - tolerance
  • token reduction vs same-run selector_v0
  • no regression in constraint recall
  • no regression in open-failure recall
  • no regression in quarantine correctness
  • no regression in resolved-failure handling
  • no increase in forbidden actionable phrase rate
  • zero raw-quarantined model-visible leaks
  • zero resolved-failure-active violations
  • passing family-split validation
  • passing target-quality validation
  • no severe per-category regression
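
A subset of these gates can be expressed as a single check over two metric dictionaries. This is a sketch only: the metric key names are hypothetical (the layout of `metrics.json` is not described here), token reduction is assumed to be relative to the selector_v0 token count, and the family-split, target-quality, and per-category gates are left out.

```python
def check_promotion(v1: dict, v0: dict, tolerance: float = 0.0,
                    min_token_reduction: float = 0.05) -> list:
    """Return the names of failed gates (an empty list means promotable)."""
    failures = []
    if v1["composite"] < v0["composite"] - tolerance:
        failures.append("composite")
    # Assumed definition: relative reduction against same-run selector_v0.
    reduction = (v0["tokens"] - v1["tokens"]) / v0["tokens"]
    if reduction < min_token_reduction:
        failures.append("token_reduction")
    # Higher is better for these metrics, so any drop is a regression.
    for name in ("constraint_recall", "open_failure_recall",
                 "quarantine_correctness", "resolved_failure_handling"):
        if v1[name] < v0[name]:
            failures.append(name)
    if v1["forbidden_phrase_rate"] > v0["forbidden_phrase_rate"]:
        failures.append("forbidden_phrase_rate")
    # These counts must be exactly zero regardless of the baseline.
    for name in ("raw_quarantined_leaks", "resolved_failure_active_violations"):
        if v1[name] != 0:
            failures.append(name)
    return failures


# Illustrative placeholder numbers, not results from any report.
v0 = {"composite": 0.70, "tokens": 1000, "constraint_recall": 0.95,
      "open_failure_recall": 0.90, "quarantine_correctness": 1.0,
      "resolved_failure_handling": 1.0, "forbidden_phrase_rate": 0.01,
      "raw_quarantined_leaks": 0, "resolved_failure_active_violations": 0}
v1 = dict(v0, composite=0.72, tokens=900)
print(check_promotion(v1, v0))  # []
```

Returning the list of failed gates rather than a single boolean keeps the outcome inspectable, which matters because promotion is reversible and manual.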

Promotion is reversible. Selector v1 is not made default automatically.

Safety invariants

All selector modes preserve these hard invariants:

  • raw quarantined text is never model-visible
  • hard user constraints remain eligible
  • resolved failures are not shown as active work
  • tool output is not promoted to instruction
  • token budget is still enforced

Learned selector scores never override these rules.
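
One way to picture how the invariants bound the learned scores is a filter that runs regardless of score. The candidate fields and the `select_visible` function below are hypothetical. As simplifications, resolved failures are dropped entirely rather than shown as resolved, hard constraints are placed ahead of scored candidates to keep them eligible, and the tool-output invariant is not modeled.

```python
def select_visible(candidates, token_budget):
    """Pick candidate ids for the model-visible context."""
    # Invariants first: a high score cannot rescue an excluded candidate.
    eligible = [c for c in candidates
                if not c.get("raw_quarantined") and not c.get("resolved_failure")]
    # Hard constraints are considered first; the rest are ordered by
    # descending score, then by id so the result is deterministic.
    eligible.sort(key=lambda c: (not c.get("hard_constraint", False),
                                 -c["score"], c["id"]))
    chosen, used = [], 0
    for c in eligible:
        # The token budget is enforced for every candidate.
        if used + c["tokens"] <= token_budget:
            chosen.append(c["id"])
            used += c["tokens"]
    return chosen


candidates = [
    {"id": "a", "score": 0.99, "tokens": 10, "raw_quarantined": True},
    {"id": "b", "score": 0.10, "tokens": 10, "hard_constraint": True},
    {"id": "c", "score": 0.80, "tokens": 10},
    {"id": "d", "score": 0.70, "tokens": 10},
]
print(select_visible(candidates, token_budget=20))  # ['b', 'c']
```

The highest-scoring candidate is excluded because it is raw quarantined text, and the low-scoring hard constraint is still selected ahead of the others.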

Feature and target ablations

Milestone C ablations retrain from cached selector-v1 inputs.

Feature-family ablations:

  • relational
  • token_economics
  • redundancy_conflict
  • provenance_trust
  • trajectory_session
  • lexical

support_confidence is reported but skipped as a raw-feature ablation because it is artifact-side metadata, not a base feature family.

Target-component ablations:

  • full_target_v2
  • legacy_labels_only
  • no_marginal_utility
  • no_value_per_token
  • no_visibility_targets
  • no_profile_targets
  • no_token_penalty
  • no_safety_penalty

How to reproduce

Main evaluation:

uv run python -m evals.state_selector.evaluate \
  --selector evals/state_selector/artifacts/selector-v1/selector.json \
  --baseline-selector evals/state_selector/artifacts/selector-v0/selector.json \
  --suite selector_v1 \
  --backend coding_continuation \
  --out evals/state_selector/reports/selector-v1 \
  --compare heuristic,selector_v0,selector_v1,none,minimal,compact,constraints_open_failures \
  --json \
  --markdown \
  --csv \
  --seed 0

Per-category holdout-style evaluation:

uv run python -m evals.state_selector.evaluate \
  --selector evals/state_selector/artifacts/selector-v1/selector.json \
  --baseline-selector evals/state_selector/artifacts/selector-v0/selector.json \
  --suite selector_v1 \
  --backend coding_continuation \
  --out evals/state_selector/reports/selector-v1-leave-category-out \
  --compare heuristic,selector_v0,selector_v1 \
  --split-mode leave_category_out \
  --json \
  --markdown \
  --csv \
  --seed 0

Feature ablation:

uv run python -m evals.state_selector.evaluate \
  --selector evals/state_selector/artifacts/selector-v1/selector.json \
  --suite selector_v1 \
  --backend coding_continuation \
  --out evals/state_selector/reports/selector-v1-feature-ablation \
  --feature-ablation all \
  --json \
  --markdown \
  --csv \
  --seed 0

Target ablation:

uv run python -m evals.state_selector.evaluate \
  --selector evals/state_selector/artifacts/selector-v1/selector.json \
  --suite selector_v1 \
  --backend coding_continuation \
  --out evals/state_selector/reports/selector-v1-target-ablation \
  --target-ablation all \
  --json \
  --markdown \
  --csv \
  --seed 0

Promotion (gating uses the same-run selector_v0 condition; the selector-v0-publish report is reference-only):

uv run python -m evals.state_selector.promote \
  --selector evals/state_selector/artifacts/selector-v1/selector.json \
  --eval-report evals/state_selector/reports/selector-v1/metrics.json \
  --baseline-selector evals/state_selector/artifacts/selector-v0/selector.json \
  --baseline-report evals/state_selector/reports/selector-v0-publish/metrics.json \
  --out evals/state_selector/artifacts/promoted-v1 \
  --min-composite-delta-vs-v0 0.00 \
  --min-token-reduction-vs-v0 0.05 \
  --max-forbidden-rate-delta 0.00 \
  --require-no-safety-regression \
  --require-family-split-validation \
  --require-target-v2-validation

Limitations

Milestone C validates Selector v1 only on the current offline continuation suite.

It does not establish:

  • external live benchmark improvement
  • production runtime improvement
  • that learned profile choice should be enabled

Current selector-v1 training data is still profile-degenerate, so profile priors remain stats-only. External live benchmark reruns remain optional and out of CI.