State Value Lab
What this measures
State Value Lab measures which StatePlane state groups help, hurt, or waste tokens on the deterministic coding-continuation backend.
It answers questions like:
- Which groups are required for continuation fidelity?
- Which groups are neutral overall?
- Which groups are too expensive for the value they add?
- Which groups become harmful when they carry stale or distracting state?
Why this matters
StatePlane should not only record state well.
It should learn which state is worth carrying forward.
State Value Lab is the measurement loop for that problem:
candidate state
-> interventions
-> deterministic outcome deltas
-> group usefulness labels
-> candidate-level training rows
Relationship to StatePlane context policy
State Value Lab uses the real StatePlane selection and rendering pipeline.
It does not fake a selector inside the eval.
The only extra control surface is the public group filtering and token-budget policy already exposed
through StatePlaneContextPolicy.
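The two control surfaces named above, group filtering and a token budget, can be pictured with a small sketch. This is not the StatePlaneContextPolicy API: `SketchPolicy`, `Candidate`, and `apply_policy` are hypothetical names invented for illustration, and the greedy in-order budget rule is an assumption rather than the pipeline's documented behavior.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    """One piece of candidate state (hypothetical shape)."""
    group: str
    text: str
    tokens: int


@dataclass(frozen=True)
class SketchPolicy:
    """Hypothetical stand-in, NOT the real StatePlaneContextPolicy."""
    excluded_groups: frozenset = frozenset()
    max_tokens: Optional[int] = None


def apply_policy(candidates: List[Candidate], policy: SketchPolicy) -> List[Candidate]:
    """Drop excluded groups, then keep items in order while they fit the budget."""
    kept = []
    used = 0
    for cand in candidates:
        if cand.group in policy.excluded_groups:
            continue
        if policy.max_tokens is not None and used + cand.tokens > policy.max_tokens:
            continue
        kept.append(cand)
        used += cand.tokens
    return kept


# Toy candidates, made up for this example.
candidates = [
    Candidate("constraints", "Do not edit generated files.", 8),
    Candidate("decisions", "Use the existing retry helper.", 7),
    Candidate("touched_files", "src/app.py", 3),
]
no_decisions = SketchPolicy(excluded_groups=frozenset({"decisions"}))
tight_budget = SketchPolicy(max_tokens=10)

print([c.group for c in apply_policy(candidates, no_decisions)])
print([c.group for c in apply_policy(candidates, tight_budget)])
```

Keeping the policy as plain data is what lets an eval vary one knob at a time without replacing the selector itself.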
Relationship to the coding-continuation eval
The first backend for State Value Lab is the deterministic coding-continuation eval.
That means:
- no live model calls
- no Docker
- no network
- no LLM judge
State Value Lab reuses the same fixtures and scoring logic, but it runs multiple StatePlane
interventions instead of only the default stateplane condition.
What an intervention is
An intervention is one controlled way of changing the model-visible StatePlane snapshot.
Examples:
- compact
- minimal
- none
- constraints_open_failures
- no_decisions
- no_procedures
- no_touched_files
- max_300_tokens
The lab compares each intervention against the compact baseline.
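A minimal sketch of that comparison, assuming each intervention yields a flat mapping of metric names to numbers. The metric names `fidelity` and `tokens` and every value below are made up for illustration; the real report schema is not described here.

```python
BASELINE = "compact"


def deltas_vs_baseline(scores, baseline=BASELINE):
    """Return per-intervention metric deltas relative to the baseline run."""
    base = scores[baseline]
    return {
        name: {
            metric: round(value - base[metric], 6)
            for metric, value in metrics.items()
        }
        for name, metrics in scores.items()
        if name != baseline
    }


# Toy scores, not real lab output.
scores = {
    "compact": {"fidelity": 0.75, "tokens": 400},
    "no_decisions": {"fidelity": 0.5, "tokens": 350},
    "none": {"fidelity": 0.25, "tokens": 0},
}
deltas = deltas_vs_baseline(scores)
for name, delta in deltas.items():
    print(name, delta)
```

A negative fidelity delta for an ablation such as no_decisions suggests the removed group was carrying useful state in that scenario.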
State groups
The current lab tracks these groups:
- constraints
- active_goals
- open_failures
- resolved_failure_notes
- decisions
- procedures
- touched_files
- tool_evidence
- safety_notes
- risks
- generic_facts
- episodes
Labels
The lab emits two label layers:
- group-level usefulness labels
- candidate-level export rows for future selector training
Group labels are:
- helpful
- harmful
- neutral
- too_expensive
- unknown
These labels are derived from deterministic metric deltas and token deltas.
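One way such a rule could look is sketched below. The thresholds, the argument names, and the order of the checks are all assumptions made for illustration; the lab's actual labeling rules are not specified in this page.

```python
from typing import Optional


def label_group(
    metric_gain: Optional[float],
    token_cost: Optional[int],
    min_effect: float = 0.02,       # hypothetical threshold
    max_neutral_tokens: int = 100,  # hypothetical threshold
) -> str:
    """Map a metric delta and a token delta to a usefulness label.

    metric_gain: metric with the group minus metric without it.
    token_cost:  tokens the group adds to the rendered context.
    """
    if metric_gain is None or token_cost is None:
        return "unknown"
    if metric_gain <= -min_effect:
        return "harmful"
    if metric_gain >= min_effect:
        return "helpful"
    # No measurable effect: the group only matters through its token cost.
    if token_cost > max_neutral_tokens:
        return "too_expensive"
    return "neutral"


print(label_group(0.10, 50))
print(label_group(0.00, 500))
```

Because the inputs are deterministic deltas, the same inputs always produce the same label, which keeps labels reproducible across runs.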
Run the smoke lab
uv run python -m evals.state_value.run \
  --suite smoke \
  --backend coding_continuation \
  --interventions all \
  --out /tmp/stateplane-state-value-smoke \
  --json \
  --markdown \
  --csv \
  --export-training-rows
Run the publish lab
uv run python -m evals.state_value.run \
  --suite publish \
  --backend coding_continuation \
  --interventions all \
  --out evals/state_value/reports/publish \
  --json \
  --markdown \
  --csv \
  --export-training-rows \
  --seed 0
Read the report
Each run writes artifacts such as:
<out>/
metadata.json
report.md
state_value_summary.json
group_value.json
group_value.csv
intervention_scores.json
intervention_scores.csv
training_rows.jsonl
telemetry.jsonl
contexts/
failures.json
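A quick sanity check after a run is to confirm that the expected files exist. The sketch below assumes the listed files sit directly under the output directory and that all output flags were passed. It leaves out contexts/ and failures.json because their exact position in the tree is not clear from the listing. `missing_artifacts` is a hypothetical helper, not part of the lab.

```python
import tempfile
from pathlib import Path

# File names taken from the listing above; the flat layout is an assumption.
EXPECTED = (
    "metadata.json",
    "report.md",
    "state_value_summary.json",
    "group_value.json",
    "group_value.csv",
    "intervention_scores.json",
    "intervention_scores.csv",
    "training_rows.jsonl",
    "telemetry.jsonl",
)


def missing_artifacts(out_dir):
    """Return the expected file names that are absent from out_dir."""
    root = Path(out_dir)
    return [name for name in EXPECTED if not (root / name).is_file()]


# Demo against a throwaway directory containing a single artifact.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "report.md").write_text("# demo\n", encoding="utf-8")
    demo_missing = missing_artifacts(tmp)

print(demo_missing)
```

Against a real run, pass the same path that was given to --out.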
Training data export
training_rows.jsonl is the future-selector dataset.
Each row stores:
- deterministic candidate features
- selected-in-base and selected-in-best flags
- group value label
- token cost
- delta metrics
This dataset is intended for a future learned selector.
State Value Lab does not train a model.
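For a first look at the export, the rows can be loaded as JSON Lines and tallied by any field. The sketch below assumes only that each non-empty line is one JSON object. The key `group_label` used in the demo is hypothetical, so check a real row for the actual field names.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def load_rows(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    rows = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows


def count_by(rows, key):
    """Tally rows by the value stored under key."""
    return Counter(row.get(key, "<missing>") for row in rows)


# Demo on a throwaway file; "group_label" is a made-up field name.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "training_rows.jsonl"
    demo_path.write_text(
        '{"group_label": "helpful"}\n'
        '{"group_label": "neutral"}\n'
        "\n"
        '{"group_label": "helpful"}\n',
        encoding="utf-8",
    )
    demo_counts = count_by(load_rows(demo_path), "group_label")

print(demo_counts)
```

A heavily skewed label distribution is worth noticing before any selector is trained on the rows.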
Selector v1 now consumes this export together with Milestone A targets_v2.jsonl and telemetry to
build a joined, family-aware training table for schema-v2 empirical training.
Milestone C reuses the same cached exports for feature and target ablations; it does not fabricate
new targets or labels at evaluation time.
What this does not prove
The first backend uses deterministic offline continuation metrics.
It does not prove live coding-agent success.
It measures which state groups improve or harm state-fidelity metrics under controlled ablations.
External and real-trace outcome attribution should be added later.
Future: learned selector
State Value Lab itself stops at telemetry, labels, and export rows.
Training a small selector model on those exported rows is a separate step, and that work now lives in State Selector v0.
State Value Lab remains the data-generation and ablation layer. State Selector v0 consumes those checked-in outputs, validates them, trains a tiny empirical model, and evaluates it against the heuristic selector. Selector v1 extends that path with feature schema v2 and target schema v2. See Selector v1.