State Selector v0

What this is

State Selector v0 is the first learned selector for StatePlane.

It is not a generative model.

It is a small, deterministic empirical utility model trained from State Value Lab ablation outputs.

Why StatePlane needs a selector

StatePlane should not derive its value from storing more state.

It should become valuable because it learns which state is worth carrying forward into the next model call.

Selector v0 is the first step toward that goal:

State Value Lab data
  -> dataset validation
  -> empirical utility model
  -> evaluation against heuristic selection
  -> promotion gate
  -> shadow or hybrid runtime mode

Input data

Selector v0 trains from checked-in State Value Lab artifacts such as:

evals/state_value/reports/publish/
  training_rows.jsonl
  group_value.json
  telemetry.jsonl
  metadata.json

The selector consumes:

candidate-level training rows
group-level usefulness labels
selection telemetry
suite metadata and fixture coverage

Dataset validation

Before training, run the dataset validator:

uv run python -m evals.state_selector.validate \
  --input evals/state_value/reports/publish/training_rows.jsonl \
  --group-value evals/state_value/reports/publish/group_value.json \
  --telemetry evals/state_value/reports/publish/telemetry.jsonl \
  --out evals/state_selector/reports/validation \
  --json \
  --markdown

Validation checks:

schema completeness
label distribution
group coverage
feature stability
leakage keys
duplicate rows
fixture-level split readiness
cross-file consistency across training rows, group values, telemetry, and metadata
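Two of the checks above, duplicate rows and leakage keys, can be sketched as follows. This is only an illustration: the `features` field and the names in `LEAKAGE_KEYS` are assumptions, not the validator's actual schema.

```python
import json
from collections import Counter

# Hypothetical names of label-like fields that must not appear among features.
LEAKAGE_KEYS = {"label", "utility", "outcome"}


def count_duplicate_rows(rows):
    """Count rows that repeat an earlier row exactly (order of keys ignored)."""
    counts = Counter(json.dumps(row, sort_keys=True) for row in rows)
    return sum(count - 1 for count in counts.values() if count > 1)


def find_leakage_keys(rows, feature_field="features"):
    """Return the sorted leakage keys found inside any row's feature dict."""
    leaked = set()
    for row in rows:
        leaked |= LEAKAGE_KEYS & set(row.get(feature_field, {}))
    return sorted(leaked)
```

A validator built this way would likely downgrade its recommendation when either function reports a non-empty result.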

The validator recommends one of:

ready_for_training
train_with_caution
not_ready

Training refuses to run when the validator recommends not_ready unless --force is passed.
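The gate described here amounts to a small check on the validator's recommendation. The sketch below assumes the recommendation arrives as a plain string; the function name `should_train` is hypothetical.

```python
RECOMMENDATIONS = {"ready_for_training", "train_with_caution", "not_ready"}


def should_train(recommendation, force=False):
    """Return True when training may proceed for the given recommendation."""
    if recommendation not in RECOMMENDATIONS:
        raise ValueError(f"unknown recommendation: {recommendation!r}")
    # Only not_ready blocks training, and only when --force is absent.
    return recommendation != "not_ready" or force
```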

Selector model

Selector v0 uses an empirical utility model with deterministic backoff:

exact feature bucket
  -> group + kind + status + trust + source
  -> group + kind
  -> group
  -> global prior

It learns smoothed expected utility from State Value Lab outputs and ranks optional state by expected usefulness under token pressure.
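One way to picture the backoff chain is a lookup that walks from the most specific bucket to the global prior and returns a smoothed mean from the first bucket with enough support. The sketch below is an assumption-laden illustration, not the shipped model: the `age_bucket` feature stands in for whatever makes the exact bucket exact, and the way `smoothing` and `min_bucket_count` are applied is a guess based on the training flags.

```python
from collections import defaultdict

# Backoff levels, most specific first. "age_bucket" is a hypothetical feature.
LEVELS = [
    ("group", "kind", "status", "trust", "source", "age_bucket"),
    ("group", "kind", "status", "trust", "source"),
    ("group", "kind"),
    ("group",),
    (),  # global prior
]


def _key(level, features):
    return (level, tuple(features.get(name) for name in level))


def fit(rows):
    """rows: iterable of (features dict, observed utility) pairs."""
    stats = defaultdict(lambda: [0.0, 0])  # key -> [utility sum, count]
    for features, utility in rows:
        for level in LEVELS:
            entry = stats[_key(level, features)]
            entry[0] += utility
            entry[1] += 1
    return dict(stats)


def expected_utility(stats, features, smoothing=5.0, min_bucket_count=3):
    """Smoothed mean from the first bucket with enough rows, else the prior."""
    total, count = stats.get(_key((), {}), (0.0, 0))
    prior = total / count if count else 0.0
    for level in LEVELS[:-1]:
        bucket = stats.get(_key(level, features))
        if bucket and bucket[1] >= min_bucket_count:
            bucket_sum, bucket_count = bucket
            # Shrink the bucket mean toward the global prior.
            return (bucket_sum + smoothing * prior) / (bucket_count + smoothing)
    return prior
```

Because every step is a dictionary lookup over counted buckets, the same inputs always produce the same score, which keeps the model deterministic and easy to inspect.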

It also includes a small empirical profile chooser for:

none
minimal
compact

Profile choice remains mostly rule-based in v0; empirical priors are used only as tie-breakers.

Hard invariants

The learned selector never overrides the hard invariants.

These remain rule-based:

raw quarantined content is never model-visible
hard user constraints remain eligible for inclusion
resolved failures are never shown as active work
stale or contradicted records stay out of model-visible context
token budget is always enforced
human receipts can stay richer than model context

The runtime order is:

hard invariant filter
  -> heuristic required-record preservation
  -> empirical scoring of optional records
  -> token-budgeted admission
  -> model-visible rendering
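The ordering above can be sketched as a single function in which each stage only sees what the previous stage let through. The `Record` fields, the scoring formula, and the choice to admit required records before optional ones are illustrative assumptions; only the stage order comes from the text.

```python
from dataclasses import dataclass


@dataclass
class Record:
    record_id: str
    tokens: int
    required: bool = False     # kept by heuristic rules, not scored
    quarantined: bool = False  # hard invariant: never model-visible
    stale: bool = False        # hard invariant: stays out of context
    utility: float = 0.0       # expected utility from the empirical model


def select(records, token_budget, token_penalty_weight=0.10):
    """Return the records admitted to model-visible context, in order."""
    # 1. Hard invariant filter: runs first, so scoring cannot override it.
    eligible = [r for r in records if not (r.quarantined or r.stale)]
    # 2. Heuristic required-record preservation.
    required = [r for r in eligible if r.required]
    optional = [r for r in eligible if not r.required]

    # 3. Empirical scoring of optional records (assumed form: utility minus
    #    a penalty proportional to the share of the budget consumed).
    def score(record):
        return record.utility - token_penalty_weight * record.tokens / token_budget

    optional.sort(key=score, reverse=True)
    # 4. Token-budgeted admission: the budget binds for every record.
    admitted, used = [], 0
    for record in required + optional:
        if used + record.tokens <= token_budget:
            admitted.append(record)
            used += record.tokens
    # 5. Rendering would consume `admitted`; it is omitted here.
    return admitted
```

Placing the invariant filter ahead of scoring means a high expected utility can never pull a quarantined or stale record back into context.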

Training

Train selector v0 with:

uv run python -m evals.state_selector.train \
  --training-rows evals/state_value/reports/publish/training_rows.jsonl \
  --group-value evals/state_value/reports/publish/group_value.json \
  --telemetry evals/state_value/reports/publish/telemetry.jsonl \
  --out evals/state_selector/artifacts/selector-v0 \
  --model empirical \
  --seed 0 \
  --split-mode category_stratified_by_fixture \
  --smoothing 5.0 \
  --min-bucket-count 3 \
  --token-penalty-weight 0.10 \
  --json \
  --markdown

This writes:

selector.json
split artifacts
validation report
training report
feature schema
split metrics
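The `--split-mode category_stratified_by_fixture` flag suggests that whole fixtures, not individual rows, are assigned to a split, with each category represented in the holdout. A minimal sketch of that idea follows; the `category` and `fixture_id` field names, the holdout fraction, and the assumption that each fixture belongs to one category are all hypothetical.

```python
import random
from collections import defaultdict


def split_by_fixture(rows, seed=0, holdout_fraction=0.2):
    """Split rows into (train, holdout) without splitting any fixture."""
    fixtures_by_category = defaultdict(set)
    for row in rows:
        fixtures_by_category[row["category"]].add(row["fixture_id"])

    rng = random.Random(seed)  # seeded so the split is reproducible
    holdout_fixtures = set()
    for category in sorted(fixtures_by_category):
        fixtures = sorted(fixtures_by_category[category])
        rng.shuffle(fixtures)
        if len(fixtures) > 1:
            n_holdout = max(1, round(len(fixtures) * holdout_fraction))
            holdout_fixtures.update(fixtures[:n_holdout])

    train = [r for r in rows if r["fixture_id"] not in holdout_fixtures]
    holdout = [r for r in rows if r["fixture_id"] in holdout_fixtures]
    return train, holdout
```

Splitting at the fixture level keeps rows from the same fixture out of both sides of the split, which is the kind of leakage the validator's split-readiness check is meant to catch.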

Evaluation

Evaluate selector v0 on the coding-continuation backend with:

uv run python -m evals.state_selector.evaluate \
  --selector evals/state_selector/artifacts/selector-v0/selector.json \
  --suite smoke \
  --backend coding_continuation \
  --out evals/state_selector/reports/selector-v0-smoke \
  --compare heuristic,empirical,none,minimal,compact \
  --json \
  --markdown \
  --csv \
  --seed 0

The evaluation report includes:

continuation-state quality deltas vs heuristic
token efficiency deltas vs heuristic
safety metrics
win/tie/loss by fixture
category breakdowns
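As an illustration of the win/tie/loss breakdown, the tally can be computed from per-fixture scores for a candidate condition and the heuristic baseline. The input shape (a dict of fixture name to score) and the tie tolerance are assumptions made for this sketch.

```python
def win_tie_loss(candidate_scores, baseline_scores, tolerance=1e-9):
    """Tally fixtures where the candidate beats, matches, or trails the baseline."""
    tally = {"win": 0, "tie": 0, "loss": 0}
    for fixture, baseline in baseline_scores.items():
        delta = candidate_scores[fixture] - baseline
        if delta > tolerance:
            tally["win"] += 1
        elif delta < -tolerance:
            tally["loss"] += 1
        else:
            tally["tie"] += 1
    return tally
```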

Promotion

Promotion is a separate gate:

uv run python -m evals.state_selector.promote \
  --selector evals/state_selector/artifacts/selector-v0/selector.json \
  --eval-report evals/state_selector/reports/selector-v0-publish/metrics.json \
  --out evals/state_selector/artifacts/promoted \
  --min-composite-delta -0.01 \
  --min-token-reduction 0.10 \
  --max-forbidden-rate-delta 0.00 \
  --require-no-safety-regression

Promotion checks:

composite score does not regress materially
token usage improves enough to matter
constraint recall does not regress
open-failure recall does not regress materially
quarantine correctness does not regress
forbidden actionable phrase rate does not increase
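The checks above can be read as a conjunction of threshold tests over deltas against the heuristic baseline. In the sketch below the default thresholds mirror the flags in the command, while the metric key names and the `open_failure_tolerance` parameter are hypothetical.

```python
def promotion_decision(
    deltas,
    min_composite_delta=-0.01,
    min_token_reduction=0.10,
    max_forbidden_rate_delta=0.0,
    open_failure_tolerance=0.01,  # hypothetical reading of "materially"
):
    """Return whether to promote and which checks failed."""
    checks = {
        "composite": deltas["composite_delta"] >= min_composite_delta,
        "token_reduction": deltas["token_reduction"] >= min_token_reduction,
        "constraint_recall": deltas["constraint_recall_delta"] >= 0.0,
        "open_failure_recall": (
            deltas["open_failure_recall_delta"] >= -open_failure_tolerance
        ),
        "quarantine_correctness": deltas["quarantine_correctness_delta"] >= 0.0,
        "forbidden_rate": deltas["forbidden_rate_delta"] <= max_forbidden_rate_delta,
    }
    failed = [name for name, passed in checks.items() if not passed]
    return {"promote": not failed, "failed_checks": failed}
```

Returning the list of failed checks, rather than a bare boolean, makes a rejected promotion easier to diagnose.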

Selector v0 remains the baseline for selector-v1 promotion: selector-v1 is gated against the selector_v0 condition evaluated in the same run on the selector-v1 suite. Historical publish reports remain useful reference points, but they are not the gate.

Runtime modes

Selector-aware runtime policy supports:

heuristic
empirical
selector_v0
selector_v1
hybrid
shadow

Select a mode through StatePlaneContextPolicy(...).

heuristic remains the default until a selector has passed promotion and is explicitly enabled.

Shadow mode

Shadow mode runs the empirical selector and logs its decisions, but it still uses the heuristic output for the actual model-visible context.

This is the safest way to inspect selector behavior before using empirical or hybrid.
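The shadow pattern can be sketched as a wrapper that calls both selectors, logs how they differ, and returns only the heuristic result. The function names, the logger name, and the selector call signature are assumptions; this is not the StatePlane runtime API.

```python
import logging

logger = logging.getLogger("stateplane.selector.shadow")  # hypothetical name


def diff_selections(heuristic_ids, empirical_ids):
    """Summarize where the two selections agree and differ."""
    h, e = set(heuristic_ids), set(empirical_ids)
    return {
        "agreed": sorted(h & e),
        "heuristic_only": sorted(h - e),
        "empirical_only": sorted(e - h),
    }


def select_with_shadow(records, heuristic_select, empirical_select):
    """Return the heuristic selection; run the empirical one for logging only."""
    chosen = heuristic_select(records)
    try:
        shadow = empirical_select(records)
        logger.info("shadow diff: %s", diff_selections(chosen, shadow))
    except Exception:
        # A failing shadow selector must not affect the real context.
        logger.exception("shadow selector failed; heuristic output is unaffected")
    return chosen
```

Catching errors from the shadow path is a design choice in this sketch: the empirical selector is under observation, so its failures should show up in logs and leave the model-visible context untouched.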

Limitations

Selector v0 is trained from deterministic offline continuation metrics, not live coding outcomes.
It does not replace the external coding benchmark.
It does not override hard invariants.
It is intentionally small and inspectable, not a complex ML stack.
Heuristic selection remains the default runtime policy.

Future work

Future selector work can build on this foundation by:

expanding the publish dataset
improving candidate features
adding better optional-record ranking
evaluating against external live outcomes once attribution is mature
replacing the empirical model only if a better local-first selector proves its value

Milestone B for Selector v1 is now documented separately under Selector v1. It adds feature schema v2, target-aware empirical training, and artifact schema v2, but it does not yet claim runtime improvement over Selector v0.