State Selector v0¶
What this is¶
State Selector v0 is the first learned selector for StatePlane.
It is not a generative model.
It is a small, deterministic empirical utility model trained from State Value Lab ablation outputs.
Why StatePlane needs a selector¶
StatePlane should not be valuable because it stores more state.
It should become valuable because it learns which state is worth carrying forward into the next model call.
Selector v0 is the first step toward that goal:
State Value Lab data
-> dataset validation
-> empirical utility model
-> evaluation against heuristic selection
-> promotion gate
-> shadow or hybrid runtime mode
Input data¶
Selector v0 trains from checked-in State Value Lab artifacts such as:
evals/state_value/reports/publish/
training_rows.jsonl
group_value.json
telemetry.jsonl
metadata.json
The selector consumes:
- candidate-level training rows
- group-level usefulness labels
- selection telemetry
- suite metadata and fixture coverage
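As a rough illustration of how the JSONL artifacts can be inspected, the sketch below reads one JSON object per line. The helper name `iter_jsonl` is hypothetical and is not part of the State Value Lab tooling; it only assumes that the `.jsonl` files follow the usual one-object-per-line convention.

```python
import json
from pathlib import Path
from typing import Any, Iterator


def iter_jsonl(path: Path) -> Iterator[dict[str, Any]]:
    """Yield one parsed object per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report the file and line so a bad row is easy to locate.
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc
```

For example, `sum(1 for _ in iter_jsonl(Path("evals/state_value/reports/publish/training_rows.jsonl")))` would count the candidate-level rows.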
Dataset validation¶
Before training, run the dataset validator:
uv run python -m evals.state_selector.validate \
--input evals/state_value/reports/publish/training_rows.jsonl \
--group-value evals/state_value/reports/publish/group_value.json \
--telemetry evals/state_value/reports/publish/telemetry.jsonl \
--out evals/state_selector/reports/validation \
--json \
--markdown
Validation checks:
- schema completeness
- label distribution
- group coverage
- feature stability
- leakage keys
- duplicate rows
- fixture-level split readiness
- cross-file consistency across training rows, group values, telemetry, and metadata
The validator recommends one of:
- ready_for_training
- train_with_caution
- not_ready
Training refuses to run on not_ready unless --force is passed.
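A minimal sketch of three of these ideas (duplicate detection, label distribution, and the `not_ready` gate) is shown below. It is not the validator's implementation: the function names, the `label` field, and the choice of key fields are assumptions made for illustration.

```python
from collections import Counter
from typing import Any, Iterable, Mapping, Sequence

# The three recommendations named in this document.
RECOMMENDATIONS = {"ready_for_training", "train_with_caution", "not_ready"}


def find_duplicates(rows: Iterable[Mapping[str, Any]], key_fields: Sequence[str]) -> list[tuple]:
    """Return every key (built from key_fields) that occurs more than once."""
    counts = Counter(tuple(row.get(name) for name in key_fields) for row in rows)
    return [key for key, n in counts.items() if n > 1]


def label_distribution(rows: Iterable[Mapping[str, Any]], label_field: str = "label") -> dict[Any, float]:
    """Return the share of rows carrying each label (hypothetical field name)."""
    counts = Counter(row[label_field] for row in rows)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def may_train(recommendation: str, force: bool = False) -> bool:
    """Mirror the documented rule: not_ready blocks training unless forced."""
    if recommendation not in RECOMMENDATIONS:
        raise ValueError(f"unknown recommendation: {recommendation!r}")
    return recommendation != "not_ready" or force
```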
Selector model¶
Selector v0 uses an empirical utility model with deterministic backoff:
exact feature bucket
-> group + kind + status + trust + source
-> group + kind
-> group
-> global prior
It learns smoothed expected utility from State Value Lab outputs and ranks optional state by expected usefulness under token pressure.
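The backoff chain can be sketched as a list of progressively coarser feature tuples, tried in order. The code below is an illustrative model, not the contents of `selector.json`: the class name, the `utility` field, and the choice to shrink each bucket mean toward the global prior with additive smoothing are all assumptions. The defaults mirror the `--smoothing 5.0` and `--min-bucket-count 3` values from the training command in this document.

```python
from collections import defaultdict
from typing import Any, Iterable, Mapping, Sequence


class BackoffUtilityModel:
    """Smoothed mean utility per bucket, with backoff to coarser buckets."""

    def __init__(self, levels: Sequence[Sequence[str]], smoothing: float = 5.0,
                 min_bucket_count: int = 3) -> None:
        # Levels are ordered from most specific to least specific.
        self.levels = [tuple(level) for level in levels]
        self.smoothing = smoothing
        self.min_bucket_count = min_bucket_count
        self.global_prior = 0.0
        # One table per level: bucket key -> [row count, summed utility].
        self.tables = [defaultdict(lambda: [0, 0.0]) for _ in self.levels]

    def fit(self, rows: Iterable[Mapping[str, Any]]) -> "BackoffUtilityModel":
        total, count = 0.0, 0
        for row in rows:
            utility = float(row["utility"])  # hypothetical label field
            total += utility
            count += 1
            for level, table in zip(self.levels, self.tables):
                stats = table[tuple(row.get(name) for name in level)]
                stats[0] += 1
                stats[1] += utility
        self.global_prior = total / count if count else 0.0
        return self

    def predict(self, features: Mapping[str, Any]) -> float:
        for level, table in zip(self.levels, self.tables):
            stats = table.get(tuple(features.get(name) for name in level))
            # Skip buckets with too little evidence and back off further.
            if stats is not None and stats[0] >= self.min_bucket_count:
                n, utility_sum = stats
                return (utility_sum + self.smoothing * self.global_prior) / (n + self.smoothing)
        return self.global_prior  # final fallback: global prior
```

Under these assumptions, the documented chain would be passed as `levels`, for example `[("group", "kind", "status", "trust", "source"), ("group", "kind"), ("group",)]`, preceded by whatever tuple defines the exact feature bucket.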
It also includes a small empirical profile chooser for:
- none
- minimal
- compact
Profile choice remains mostly rule-based in v0; empirical priors are used only as tie-breakers.
Hard invariants¶
The learned selector never overrides the hard invariants.
These remain rule-based:
- raw quarantined content is never model-visible
- hard user constraints remain eligible for inclusion
- resolved failures are never shown as active work
- stale or contradicted records stay out of model-visible context
- token budget is always enforced
- human receipts can stay richer than model context
The runtime order is:
hard invariant filter
-> heuristic required-record preservation
-> empirical scoring of optional records
-> token-budgeted admission
-> model-visible rendering
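The ordering above can be sketched as a single function in which the learned score only ever ranks records that have already passed the rule-based stages. The `Record` fields and the function names are hypothetical, and the handling of a required record that does not fit the budget (it is skipped) is an assumption rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Record:
    record_id: str
    text: str
    tokens: int
    required: bool = False     # preserved by heuristic rules
    quarantined: bool = False  # never model-visible
    stale: bool = False        # stale or contradicted


def select_context(records: Iterable[Record], score: Callable[[Record], float],
                   token_budget: int) -> list[Record]:
    # 1. Hard invariant filter: runs before any learned scoring.
    eligible = [r for r in records if not r.quarantined and not r.stale]

    admitted: list[Record] = []
    used = 0

    # 2. Heuristic required-record preservation (still within budget).
    for record in eligible:
        if record.required and used + record.tokens <= token_budget:
            admitted.append(record)
            used += record.tokens

    # 3. Empirical scoring of optional records, highest score first.
    optional = sorted((r for r in eligible if not r.required), key=score, reverse=True)

    # 4. Token-budgeted admission: skip anything that would exceed the budget.
    for record in optional:
        if used + record.tokens <= token_budget:
            admitted.append(record)
            used += record.tokens
    return admitted


def render(records: Iterable[Record]) -> str:
    # 5. Model-visible rendering (here: plain concatenation).
    return "\n".join(record.text for record in records)
```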
Training¶
Train selector v0 with:
uv run python -m evals.state_selector.train \
--training-rows evals/state_value/reports/publish/training_rows.jsonl \
--group-value evals/state_value/reports/publish/group_value.json \
--telemetry evals/state_value/reports/publish/telemetry.jsonl \
--out evals/state_selector/artifacts/selector-v0 \
--model empirical \
--seed 0 \
--split-mode category_stratified_by_fixture \
--smoothing 5.0 \
--min-bucket-count 3 \
--token-penalty-weight 0.10 \
--json \
--markdown
This writes:
- selector.json
- split artifacts
- validation report
- training report
- feature schema
- split metrics
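The split mode name `category_stratified_by_fixture` suggests that whole fixtures, not individual rows, are assigned to a split, with each category represented on both sides. The sketch below shows one way such a split could work; the function name, the `holdout_fraction` parameter, and the "train"/"holdout" labels are assumptions, and the actual trainer may differ.

```python
import random
from collections import defaultdict
from typing import Mapping


def split_by_fixture(fixture_categories: Mapping[str, str],
                     holdout_fraction: float = 0.25, seed: int = 0) -> dict[str, str]:
    """Assign each fixture to 'train' or 'holdout', stratified by category."""
    by_category: dict[str, list[str]] = defaultdict(list)
    for fixture_id, category in fixture_categories.items():
        by_category[category].append(fixture_id)

    rng = random.Random(seed)  # a fixed seed keeps the split reproducible
    assignment: dict[str, str] = {}
    for category in sorted(by_category):
        fixtures = sorted(by_category[category])  # sort first for determinism
        rng.shuffle(fixtures)
        # Hold out at least one fixture per category when there is more than one.
        n_holdout = max(1, int(len(fixtures) * holdout_fraction)) if len(fixtures) > 1 else 0
        for index, fixture_id in enumerate(fixtures):
            assignment[fixture_id] = "holdout" if index < n_holdout else "train"
    return assignment
```

Because every row inherits the split of its fixture, rows from one fixture never appear on both sides, which is the property the validator's fixture-level split readiness check is concerned with.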
Evaluation¶
Evaluate selector v0 on the coding-continuation backend with:
uv run python -m evals.state_selector.evaluate \
--selector evals/state_selector/artifacts/selector-v0/selector.json \
--suite smoke \
--backend coding_continuation \
--out evals/state_selector/reports/selector-v0-smoke \
--compare heuristic,empirical,none,minimal,compact \
--json \
--markdown \
--csv \
--seed 0
The evaluation report compares:
- continuation-state quality deltas vs heuristic
- token efficiency deltas vs heuristic
- safety metrics
- win/tie/loss by fixture
- category breakdowns
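A win/tie/loss tally by fixture can be computed from two per-fixture score maps. This is a sketch under stated assumptions: the function name, the input shape (fixture id to score), and the tie tolerance are hypothetical and not taken from the evaluation report schema.

```python
from typing import Mapping


def win_tie_loss(candidate: Mapping[str, float], baseline: Mapping[str, float],
                 tolerance: float = 1e-9) -> dict[str, int]:
    """Count fixtures where the candidate beats, matches, or trails the baseline."""
    tally = {"win": 0, "tie": 0, "loss": 0}
    # Only fixtures scored under both conditions are comparable.
    for fixture_id in sorted(candidate.keys() & baseline.keys()):
        delta = candidate[fixture_id] - baseline[fixture_id]
        if abs(delta) <= tolerance:
            tally["tie"] += 1
        elif delta > 0:
            tally["win"] += 1
        else:
            tally["loss"] += 1
    return tally
```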
Promotion¶
Promotion is a separate gate:
uv run python -m evals.state_selector.promote \
--selector evals/state_selector/artifacts/selector-v0/selector.json \
--eval-report evals/state_selector/reports/selector-v0-publish/metrics.json \
--out evals/state_selector/artifacts/promoted \
--min-composite-delta -0.01 \
--min-token-reduction 0.10 \
--max-forbidden-rate-delta 0.00 \
--require-no-safety-regression
Promotion checks:
- composite score does not regress materially
- token usage improves enough to matter
- constraint recall does not regress
- open-failure recall does not regress materially
- quarantine correctness does not regress
- forbidden actionable phrase rate does not increase
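The gate can be thought of as a list of threshold checks that must all pass. The sketch below uses the threshold values from the command above as defaults; the `EvalDeltas` field names are hypothetical stand-ins for whatever `metrics.json` actually contains, and the open-failure recall check is omitted because its tolerance is not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalDeltas:
    """Candidate-minus-heuristic deltas (hypothetical field names)."""
    composite_delta: float
    token_reduction: float            # fraction of tokens saved vs heuristic
    forbidden_rate_delta: float
    constraint_recall_delta: float
    quarantine_correctness_delta: float


def promotion_failures(deltas: EvalDeltas, *, min_composite_delta: float = -0.01,
                       min_token_reduction: float = 0.10,
                       max_forbidden_rate_delta: float = 0.0,
                       require_no_safety_regression: bool = True) -> list[str]:
    """Return the reasons promotion should be refused; empty means pass."""
    failures: list[str] = []
    if deltas.composite_delta < min_composite_delta:
        failures.append("composite score regressed beyond tolerance")
    if deltas.token_reduction < min_token_reduction:
        failures.append("token reduction below minimum")
    if deltas.forbidden_rate_delta > max_forbidden_rate_delta:
        failures.append("forbidden actionable phrase rate increased")
    if require_no_safety_regression:
        if deltas.constraint_recall_delta < 0:
            failures.append("constraint recall regressed")
        if deltas.quarantine_correctness_delta < 0:
            failures.append("quarantine correctness regressed")
    return failures
```

Returning the list of reasons, rather than a single boolean, keeps a refused promotion easy to diagnose.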
Selector v0 also serves as the same-run baseline for selector-v1 promotion on the selector-v1 suite. Historical publish reports remain useful reference points, but the gate compares selector-v1 against the selector_v0 condition from the same run.
Runtime modes¶
Selector-aware runtime policy supports:
- heuristic
- empirical
- selector_v0
- selector_v1
- hybrid
- shadow
Use them through StatePlaneContextPolicy(...).
heuristic remains the default until a selector has passed promotion and is explicitly enabled.
Shadow mode¶
Shadow mode runs the empirical selector and logs its decisions, but it still uses the heuristic output for the actual model-visible context.
This is the safest way to inspect selector behavior before using empirical or hybrid.
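The shape of shadow mode can be sketched as a wrapper that always returns the heuristic result and only logs what the empirical selector would have done. This does not use `StatePlaneContextPolicy`; the function name, the selector signature, and the log record layout are assumptions, as is the choice to swallow empirical selector errors so that they cannot affect the served context.

```python
from typing import Any, Callable, Sequence

Selector = Callable[[Sequence[str]], list[str]]


def select_in_shadow(records: Sequence[str], heuristic: Selector, empirical: Selector,
                     log: Callable[[dict[str, Any]], None]) -> list[str]:
    """Serve the heuristic selection; record the empirical one for inspection."""
    served = heuristic(records)
    try:
        shadow = empirical(records)
    except Exception as exc:
        # A failing shadow selector must never change what the model sees.
        log({"mode": "shadow", "error": repr(exc)})
        return served
    log({
        "mode": "shadow",
        "served": served,
        "shadow": shadow,
        "only_in_shadow": [r for r in shadow if r not in served],
        "only_in_served": [r for r in served if r not in shadow],
    })
    return served
```

The two difference lists make it straightforward to review where the selectors disagree before switching to empirical or hybrid.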
Limitations¶
- Selector v0 is trained from deterministic offline continuation metrics, not live coding outcomes.
- It does not replace the external coding benchmark.
- It does not override hard invariants.
- It is intentionally small and inspectable, not a complex ML stack.
- Heuristic selection remains the default runtime policy.
Future work¶
Future selector work can build on this foundation by:
- expanding the publish dataset
- improving candidate features
- adding better optional-record ranking
- evaluating against external live outcomes once attribution is mature
- replacing the empirical model only if a better local-first selector proves its value
Milestone B for Selector v1 is now documented separately in Selector v1. It adds feature schema v2, target-aware empirical training, and artifact schema v2, but it does not yet claim a runtime improvement over Selector v0.