State Selector v0

What this is

State Selector v0 is the first learned selector for StatePlane.

It is not a generative model.

It is a small, deterministic empirical utility model trained from State Value Lab ablation outputs.

Why StatePlane needs a selector

StatePlane should not be valuable because it stores more state.

It should become valuable because it learns which state is worth carrying forward into the next model call.

Selector v0 is the first step toward that goal:

State Value Lab data
  -> dataset validation
  -> empirical utility model
  -> evaluation against heuristic selection
  -> promotion gate
  -> shadow or hybrid runtime mode

Input data

Selector v0 trains from checked-in State Value Lab artifacts such as:

evals/state_value/reports/publish/
  training_rows.jsonl
  group_value.json
  telemetry.jsonl
  metadata.json

The selector consumes:

  • candidate-level training rows
  • group-level usefulness labels
  • selection telemetry
  • suite metadata and fixture coverage

Dataset validation

Before training, run the dataset validator:

uv run python -m evals.state_selector.validate \
  --input evals/state_value/reports/publish/training_rows.jsonl \
  --group-value evals/state_value/reports/publish/group_value.json \
  --telemetry evals/state_value/reports/publish/telemetry.jsonl \
  --out evals/state_selector/reports/validation \
  --json \
  --markdown

Validation checks:

  • schema completeness
  • label distribution
  • group coverage
  • feature stability
  • leakage keys
  • duplicate rows
  • fixture-level split readiness
  • cross-file consistency across training rows, group values, telemetry, and metadata

The validator recommends one of:

  • ready_for_training
  • train_with_caution
  • not_ready

Training refuses to run when the validator recommends not_ready, unless --force is passed.
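
The gate described above can be sketched as follows. This is a minimal illustration, not the trainer's actual code: the `Recommendation` enum and the `may_train` helper are hypothetical names, and only the three recommendation strings and the `--force` behavior come from this document.

```python
from enum import Enum


class Recommendation(str, Enum):
    """The three validator recommendations listed in this document."""
    READY = "ready_for_training"
    CAUTION = "train_with_caution"
    NOT_READY = "not_ready"


def may_train(recommendation: str, force: bool = False) -> bool:
    """Return True if training should proceed (hypothetical helper)."""
    # Unknown strings raise ValueError instead of silently passing the gate.
    rec = Recommendation(recommendation)
    if rec is Recommendation.NOT_READY:
        # Only an explicit --force lets training run on a not_ready dataset.
        return force
    return True
```

In this sketch train_with_caution proceeds without --force; whether the real trainer also emits a warning in that case is not specified here.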

Selector model

Selector v0 uses an empirical utility model with deterministic backoff:

exact feature bucket
  -> group + kind + status + trust + source
  -> group + kind
  -> group
  -> global prior

It learns smoothed expected utility from State Value Lab outputs and ranks optional state by expected usefulness under token pressure.
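
One way to picture the backoff chain is the sketch below. It rests on several assumptions: the row fields, the `tokens_bucket` feature standing in for the "exact feature bucket", and the additive shrinkage formula are all illustrative, and the artifact format in selector.json may differ. The `smoothing` and `min_bucket_count` parameters mirror the `--smoothing` and `--min-bucket-count` training flags, but how the trainer actually applies them is not stated in this document.

```python
from collections import defaultdict

# Backoff levels, most specific first. Feature names follow the chain above;
# "tokens_bucket" is a hypothetical extra feature for the exact bucket.
LEVELS = [
    ("group", "kind", "status", "trust", "source", "tokens_bucket"),
    ("group", "kind", "status", "trust", "source"),
    ("group", "kind"),
    ("group",),
]


def fit(rows, smoothing=5.0):
    """Collect per-bucket [count, utility sum] and the global mean utility."""
    stats = defaultdict(lambda: [0, 0.0])
    total = 0.0
    for row in rows:
        total += row["utility"]
        for level in LEVELS:
            key = (level, tuple(row[name] for name in level))
            stats[key][0] += 1
            stats[key][1] += row["utility"]
    prior = total / len(rows) if rows else 0.0
    return {"stats": dict(stats), "prior": prior, "smoothing": smoothing}


def expected_utility(model, candidate, min_bucket_count=3):
    """Walk the chain and return (estimate, level used)."""
    k = model["smoothing"]
    for level in LEVELS:
        key = (level, tuple(candidate.get(name) for name in level))
        count, total = model["stats"].get(key, (0, 0.0))
        if count >= min_bucket_count:
            # Shrink the bucket mean toward the global prior (assumed formula).
            return (total + k * model["prior"]) / (count + k), level
    # No bucket has enough support: fall back to the global prior.
    return model["prior"], ()
```

Because the lookup is a fixed walk over counted buckets, the same training rows always yield the same score for a candidate, which matches the deterministic behavior described above.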

It also includes a small empirical profile chooser for:

  • none
  • minimal
  • compact

Profile choice remains mostly rule-based in v0, with empirical priors used only as tie-breakers.

Hard invariants

The learned selector never overrides the hard invariants.

These remain rule-based:

  • raw quarantined content is never model-visible
  • hard user constraints remain eligible for inclusion
  • resolved failures are never shown as active work
  • stale or contradicted records stay out of model-visible context
  • token budget is always enforced
  • human receipts can stay richer than model context

The runtime order is:

hard invariant filter
  -> heuristic required-record preservation
  -> empirical scoring of optional records
  -> token-budgeted admission
  -> model-visible rendering
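
The ordering above might look roughly like this in code. The record fields (`quarantined`, `stale`, `contradicted`, `required`, `tokens`) and the `select_context` function are hypothetical, chosen only to make the stages concrete. The sketch also assumes that a required record which does not fit the budget is skipped, since the document says the token budget is always enforced. The final rendering step is left out.

```python
def select_context(records, score, token_budget):
    """Return the records admitted to model-visible context, in order."""
    # 1. Hard invariant filter: rule-based, applied before any scoring.
    visible = [
        r for r in records
        if not r.get("quarantined")
        and not r.get("stale")
        and not r.get("contradicted")
    ]
    # 2. Heuristic required-record preservation (e.g. hard user constraints).
    required = [r for r in visible if r.get("required")]
    optional = [r for r in visible if not r.get("required")]
    # 3. Empirical scoring of optional records only, best first.
    optional = sorted(optional, key=score, reverse=True)
    # 4. Token-budgeted admission: required records first, then by score.
    admitted, used = [], 0
    for record in required + optional:
        if used + record["tokens"] <= token_budget:
            admitted.append(record)
            used += record["tokens"]
    return admitted
```

The point of the ordering is that the learned score only ever reorders records that have already passed the rule-based filter, so a high score cannot bring quarantined or stale content back into context.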

Training

Train selector v0 with:

uv run python -m evals.state_selector.train \
  --training-rows evals/state_value/reports/publish/training_rows.jsonl \
  --group-value evals/state_value/reports/publish/group_value.json \
  --telemetry evals/state_value/reports/publish/telemetry.jsonl \
  --out evals/state_selector/artifacts/selector-v0 \
  --model empirical \
  --seed 0 \
  --split-mode category_stratified_by_fixture \
  --smoothing 5.0 \
  --min-bucket-count 3 \
  --token-penalty-weight 0.10 \
  --json \
  --markdown

This writes:

  • selector.json
  • split artifacts
  • validation report
  • training report
  • feature schema
  • split metrics

Evaluation

Evaluate selector v0 on the coding-continuation backend with:

uv run python -m evals.state_selector.evaluate \
  --selector evals/state_selector/artifacts/selector-v0/selector.json \
  --suite smoke \
  --backend coding_continuation \
  --out evals/state_selector/reports/selector-v0-smoke \
  --compare heuristic,empirical,none,minimal,compact \
  --json \
  --markdown \
  --csv \
  --seed 0

The evaluation report compares:

  • continuation-state quality deltas vs heuristic
  • token efficiency deltas vs heuristic
  • safety metrics
  • win/tie/loss by fixture
  • category breakdowns

Promotion

Promotion is a separate gate:

uv run python -m evals.state_selector.promote \
  --selector evals/state_selector/artifacts/selector-v0/selector.json \
  --eval-report evals/state_selector/reports/selector-v0-publish/metrics.json \
  --out evals/state_selector/artifacts/promoted \
  --min-composite-delta -0.01 \
  --min-token-reduction 0.10 \
  --max-forbidden-rate-delta 0.00 \
  --require-no-safety-regression

Promotion checks:

  • composite score does not regress materially
  • token usage improves enough to matter
  • constraint recall does not regress
  • open-failure recall does not regress materially
  • quarantine correctness does not regress
  • forbidden actionable phrase rate does not increase
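
As a rough illustration, the checks above could be combined as shown below. The parameter defaults mirror the flags in the promote command, but the `deltas` dictionary, its key names, and the `promotion_decision` function are hypothetical; the real metrics.json schema is not shown in this document. The open-failure recall check is omitted because its tolerance for "materially" is not specified.

```python
def promotion_decision(
    deltas,
    min_composite_delta=-0.01,
    min_token_reduction=0.10,
    max_forbidden_rate_delta=0.00,
    require_no_safety_regression=True,
):
    """Return (promote, reasons) given selector-minus-heuristic deltas."""
    reasons = []
    if deltas["composite"] < min_composite_delta:
        reasons.append("composite score regressed materially")
    if deltas["token_reduction"] < min_token_reduction:
        reasons.append("token reduction too small")
    if deltas["forbidden_rate"] > max_forbidden_rate_delta:
        reasons.append("forbidden actionable phrase rate increased")
    if require_no_safety_regression:
        # Assumed safety metrics; any drop at all blocks promotion.
        for name in ("constraint_recall", "quarantine_correctness"):
            if deltas[name] < 0:
                reasons.append(f"{name} regressed")
    return (not reasons), reasons
```

Returning the list of reasons alongside the boolean keeps a failed promotion inspectable, in line with the document's emphasis on a small, auditable selector.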

Selector v0 remains the baseline for selector-v1 promotion: on the selector-v1 suite, selector-v1 is gated against the selector_v0 condition from the same run. Historical publish reports are still useful reference points, but they are not the baseline for that gate.

Runtime modes

Selector-aware runtime policy supports:

  • heuristic
  • empirical
  • selector_v0
  • selector_v1
  • hybrid
  • shadow

Use them through StatePlaneContextPolicy(...).

heuristic remains the default until a selector has passed promotion and is explicitly enabled.

Shadow mode

Shadow mode runs the empirical selector and logs its decisions, but it still uses the heuristic output for the actual model-visible context.

This is the safest way to inspect selector behavior before using empirical or hybrid.
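
A minimal sketch of the shadow pattern, assuming records carry an `id` field and that both selectors are plain callables. The `shadow_select` function and the log record layout are hypothetical, not the StatePlaneContextPolicy API.

```python
def shadow_select(records, heuristic_select, empirical_select, log):
    """Serve the heuristic output; log what the empirical selector would do."""
    served = heuristic_select(records)
    shadow = empirical_select(records)
    served_ids = {r["id"] for r in served}
    shadow_ids = {r["id"] for r in shadow}
    log({
        "mode": "shadow",
        "served": sorted(served_ids),
        "shadow": sorted(shadow_ids),
        # The disagreements are usually the interesting part to inspect.
        "only_heuristic": sorted(served_ids - shadow_ids),
        "only_empirical": sorted(shadow_ids - served_ids),
    })
    # The model-visible context is always the heuristic output.
    return served
```

For example, passing `log=events.append` with a list named `events` collects one comparison record per call, which can then be reviewed offline before switching to empirical or hybrid.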

Limitations

  • Selector v0 is trained from deterministic offline continuation metrics, not live coding outcomes.
  • It does not replace the external coding benchmark.
  • It does not override hard invariants.
  • It is intentionally small and inspectable, not a complex ML stack.
  • Heuristic selection remains the default runtime policy.

Future work

Future selector work can build on this foundation by:

  • expanding the publish dataset
  • improving candidate features
  • adding better optional-record ranking
  • evaluating against external live outcomes once attribution is mature
  • replacing the empirical model only if a better local-first selector proves its value

Milestone B for Selector v1 is now documented separately under Selector v1. It adds feature schema v2, target-aware empirical training, and artifact schema v2, but it does not yet claim runtime improvement over Selector v0.