State Value Lab

What this measures

State Value Lab measures which StatePlane state groups help, hurt, or waste tokens on the deterministic coding-continuation backend.

It answers questions like:

Which groups are required for continuation fidelity?
Which groups are neutral overall?
Which groups are too expensive for the value they add?
Which groups become harmful when they carry stale or distracting state?

Why this matters

StatePlane should do more than record state well. It should also learn which state is worth carrying forward.

State Value Lab is the measurement loop for that problem:

candidate state
  -> interventions
  -> deterministic outcome deltas
  -> group usefulness labels
  -> candidate-level training rows

Relationship to StatePlane context policy

State Value Lab uses the real StatePlane selection and rendering pipeline.

It does not fake a selector inside the eval.

The only extra control surface is the public group filtering and token-budget policy already exposed through StatePlaneContextPolicy.
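
As a rough illustration of that control surface, the sketch below models a policy that combines group filtering with a token budget. The names `StateItem`, `SketchContextPolicy`, and `apply` are hypothetical stand-ins and are not the actual StatePlaneContextPolicy API; per-item token counts are assumed to be precomputed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StateItem:
    """One candidate piece of state (hypothetical shape)."""
    group: str
    text: str
    tokens: int


@dataclass(frozen=True)
class SketchContextPolicy:
    """Hypothetical stand-in for a group-filter plus token-budget policy."""
    include_groups: Optional[frozenset] = None  # None means "all groups"
    exclude_groups: frozenset = frozenset()
    max_tokens: Optional[int] = None  # None means "no budget"

    def allows(self, group: str) -> bool:
        """Return True if items from this group may be rendered."""
        if group in self.exclude_groups:
            return False
        return self.include_groups is None or group in self.include_groups

    def apply(self, items):
        """Keep allowed items in order, skipping any that would exceed the budget."""
        kept, used = [], 0
        for item in items:
            if not self.allows(item.group):
                continue
            if self.max_tokens is not None and used + item.tokens > self.max_tokens:
                continue
            kept.append(item)
            used += item.tokens
        return kept
```

Under these assumptions, an ablation such as no_decisions would correspond to `exclude_groups=frozenset({"decisions"})`, and max_300_tokens to `max_tokens=300`.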

Relationship to the coding-continuation eval

The first backend for State Value Lab is the deterministic coding-continuation eval.

That means:

no live model calls
no Docker
no network
no LLM judge

State Value Lab reuses the same fixtures and scoring logic, but it runs multiple StatePlane interventions instead of only the default stateplane condition.

What an intervention is

An intervention is one controlled way of changing the model-visible StatePlane snapshot.

Examples:

compact
minimal
none
constraints_open_failures
no_decisions
no_procedures
no_touched_files
max_300_tokens

The lab compares each intervention against the compact baseline.
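
The comparison against the baseline can be sketched as a per-metric subtraction. This is a minimal illustration only: the function name `deltas_vs_baseline`, the metric names `fidelity` and `tokens`, and the numbers are all made up for the example and are not taken from the lab's real reports.

```python
from typing import Dict

BASELINE = "compact"


def deltas_vs_baseline(scores: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Subtract the baseline's metrics from every other intervention's metrics."""
    base = scores[BASELINE]
    return {
        name: {metric: value - base[metric] for metric, value in metrics.items()}
        for name, metrics in scores.items()
        if name != BASELINE
    }


# Made-up numbers, purely to show the shape of the comparison.
example_scores = {
    "compact": {"fidelity": 0.75, "tokens": 400.0},
    "no_decisions": {"fidelity": 0.5, "tokens": 300.0},
    "max_300_tokens": {"fidelity": 0.75, "tokens": 300.0},
}
example_deltas = deltas_vs_baseline(example_scores)
```

In this toy data, a negative `fidelity` delta for no_decisions would suggest that removing the decisions group hurts, while a zero delta with fewer tokens for max_300_tokens would suggest the trimmed state was not needed.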

State groups

The current lab tracks these groups:

constraints
active_goals
open_failures
resolved_failure_notes
decisions
procedures
touched_files
tool_evidence
safety_notes
risks
generic_facts
episodes

Labels

The lab emits two label layers:

group-level usefulness labels
candidate-level export rows for future selector training

Group labels are:

helpful
harmful
neutral
too_expensive
unknown

These labels are derived from deterministic metric deltas and token deltas.
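
One way such a derivation could look is sketched below. The thresholds, the rule order, and the function name `label_group` are assumptions chosen for illustration; the lab's actual labeling rules are not specified here.

```python
from typing import Optional


def label_group(
    value_delta: Optional[float],
    token_cost: int,
    metric_eps: float = 0.01,
    min_gain_per_100_tokens: float = 0.02,
) -> str:
    """Label a group from its metric delta and token cost (hypothetical rules).

    value_delta: score with the group minus score without it, or None if unmeasured.
    token_cost: extra tokens spent when the group is included.
    """
    if value_delta is None:
        return "unknown"
    if value_delta < -metric_eps:
        return "harmful"
    if value_delta <= metric_eps:
        return "neutral"
    # The group helps; check whether the gain justifies the tokens it costs.
    if token_cost > 0 and (value_delta / token_cost) * 100 < min_gain_per_100_tokens:
        return "too_expensive"
    return "helpful"
```

Because both inputs are deterministic, the same run always yields the same label, which keeps the labels reproducible across reruns.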

Run the smoke lab

uv run python -m evals.state_value.run \
  --suite smoke \
  --backend coding_continuation \
  --interventions all \
  --out /tmp/stateplane-state-value-smoke \
  --json \
  --markdown \
  --csv \
  --export-training-rows

Run the publish lab

uv run python -m evals.state_value.run \
  --suite publish \
  --backend coding_continuation \
  --interventions all \
  --out evals/state_value/reports/publish \
  --json \
  --markdown \
  --csv \
  --export-training-rows \
  --seed 0

Read the report

Each run writes artifacts such as:

<out>/
  metadata.json
  report.md
  state_value_summary.json
  group_value.json
  group_value.csv
  intervention_scores.json
  intervention_scores.csv
  training_rows.jsonl
  telemetry.jsonl
  contexts/
  failures.json
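
A small loader for these artifacts might look like the following sketch. It assumes the file formats match their extensions (JSON objects, CSV files with a header row, one JSON object per line in the .jsonl files) and makes no assumption about the keys or columns inside them; `load_run` is a hypothetical helper, not part of the lab.

```python
import csv
import json
from pathlib import Path

JSON_FILES = (
    "metadata.json",
    "state_value_summary.json",
    "group_value.json",
    "intervention_scores.json",
    "failures.json",
)
CSV_FILES = ("group_value.csv", "intervention_scores.csv")
JSONL_FILES = ("training_rows.jsonl", "telemetry.jsonl")


def load_run(out_dir):
    """Load whichever of the listed artifacts exist under out_dir."""
    out = Path(out_dir)
    run = {}
    for name in JSON_FILES:
        path = out / name
        if path.exists():
            run[name] = json.loads(path.read_text(encoding="utf-8"))
    for name in CSV_FILES:
        path = out / name
        if path.exists():
            with path.open(newline="", encoding="utf-8") as handle:
                run[name] = list(csv.DictReader(handle))
    for name in JSONL_FILES:
        path = out / name
        if path.exists():
            with path.open(encoding="utf-8") as handle:
                # Skip blank lines so a trailing newline does not break parsing.
                run[name] = [json.loads(line) for line in handle if line.strip()]
    return run
```

For example, `load_run("/tmp/stateplane-state-value-smoke")` would return a dictionary keyed by file name for the smoke run shown earlier, with missing files simply absent.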

Training data export

training_rows.jsonl is the dataset exported for selector training.

Each row stores:

deterministic candidate features
selected-in-base and selected-in-best flags
group value label
token cost
delta metrics
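
To make the row contents concrete, here is a sketch of what one row could look like in code. The class `TrainingRow`, its field names, and the helper `label_counts` are hypothetical and only mirror the list above; the real export schema may differ.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass, field
from typing import Dict


@dataclass
class TrainingRow:
    """Hypothetical row shape mirroring the fields listed above."""
    features: Dict[str, float]
    selected_in_base: bool
    selected_in_best: bool
    group_label: str
    token_cost: int
    delta_metrics: Dict[str, float] = field(default_factory=dict)

    def to_jsonl(self) -> str:
        """Serialize the row as one JSON line with stable key order."""
        return json.dumps(asdict(self), sort_keys=True)


def label_counts(lines):
    """Count group labels across JSONL lines, skipping blank lines."""
    return Counter(json.loads(line)["group_label"] for line in lines if line.strip())
```

Checking the label distribution before training is a cheap way to spot a suite where almost every row ends up neutral or unknown.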

State Value Lab only produces this dataset for a learned selector. It does not train a model itself.

Selector v1 now consumes this export together with Milestone A targets_v2.jsonl and telemetry to build a joined, family-aware training table for schema-v2 empirical training. Milestone C reuses the same cached exports for feature and target ablations; it does not fabricate new targets or labels at evaluation time.

What this does not prove

The first backend uses deterministic offline continuation metrics.

It does not prove live coding-agent success.

It measures which state groups improve or harm state-fidelity metrics under controlled ablations.

Outcome attribution from external and real traces is not covered yet and should be added later.

Future: learned selector

State Value Lab itself stops at telemetry, labels, and export rows.

Training a small selector model on those exported rows is the next step, and that work now lives in State Selector v0.

State Value Lab remains the data-generation and ablation layer. State Selector v0 consumes those checked-in outputs, validates them, trains a tiny empirical model, and evaluates it against the heuristic selector. Selector v1 extends that path with feature schema v2 and target schema v2. See Selector v1.