State Value Lab

What this measures

State Value Lab measures which StatePlane state groups help, hurt, or waste tokens on the deterministic coding-continuation backend.

It answers questions like:

  • Which groups are required for continuation fidelity?
  • Which groups are neutral overall?
  • Which groups are too expensive for the value they add?
  • Which groups become harmful when they carry stale or distracting state?

Why this matters

StatePlane should not only record state well; it should also learn which state is worth carrying forward.

State Value Lab is the measurement loop for that problem:

candidate state
  -> interventions
  -> deterministic outcome deltas
  -> group usefulness labels
  -> candidate-level training rows

Relationship to StatePlane context policy

State Value Lab uses the real StatePlane selection and rendering pipeline. It does not substitute a stand-in selector inside the eval.

The only extra control surface is the public group filtering and token-budget policy already exposed through StatePlaneContextPolicy.

Relationship to the coding-continuation eval

The first backend for State Value Lab is the deterministic coding-continuation eval.

That means:

  • no live model calls
  • no Docker
  • no network
  • no LLM judge

State Value Lab reuses the same fixtures and scoring logic, but it runs multiple StatePlane interventions instead of only the default stateplane condition.

What an intervention is

An intervention is one controlled way of changing the model-visible StatePlane snapshot.

Examples:

  • compact
  • minimal
  • none
  • constraints_open_failures
  • no_decisions
  • no_procedures
  • no_touched_files
  • max_300_tokens

The lab compares each intervention against the compact baseline.
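
The comparison against the compact baseline can be pictured as a per-metric subtraction. The following is a minimal sketch, not the lab's implementation: the `InterventionScore` type, its `fidelity` and `tokens` fields, and every number are assumptions invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InterventionScore:
    """Hypothetical per-intervention result; field names are assumptions."""
    name: str
    fidelity: float  # illustrative deterministic metric, higher is better
    tokens: int      # illustrative size of the rendered snapshot


def deltas_vs_baseline(scores, baseline="compact"):
    """Return {intervention: (fidelity_delta, token_delta)} relative to the baseline."""
    by_name = {s.name: s for s in scores}
    base = by_name[baseline]
    return {
        s.name: (round(s.fidelity - base.fidelity, 6), s.tokens - base.tokens)
        for s in scores
        if s.name != baseline
    }


# Made-up numbers, for illustration only.
scores = [
    InterventionScore("compact", 0.90, 420),
    InterventionScore("no_decisions", 0.75, 360),
    InterventionScore("max_300_tokens", 0.90, 300),
]
print(deltas_vs_baseline(scores))
```

A negative fidelity delta means the intervention scored worse than compact; a negative token delta means it rendered a smaller snapshot.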

State groups

The current lab tracks these groups:

  • constraints
  • active_goals
  • open_failures
  • resolved_failure_notes
  • decisions
  • procedures
  • touched_files
  • tool_evidence
  • safety_notes
  • risks
  • generic_facts
  • episodes
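
The ablation-style intervention names read as group filters over this list. The sketch below shows one plausible mapping, under two assumptions that the source does not confirm: that `no_<group>` drops exactly that group, and that `none` carries no groups at all. The `groups_for` helper is hypothetical and is not part of StatePlaneContextPolicy.

```python
# Group names copied from the list above.
GROUPS = (
    "constraints", "active_goals", "open_failures", "resolved_failure_notes",
    "decisions", "procedures", "touched_files", "tool_evidence",
    "safety_notes", "risks", "generic_facts", "episodes",
)


def groups_for(intervention):
    """Return the groups kept by a pure group-filter intervention (assumed semantics)."""
    if intervention == "none":
        return frozenset()  # assumption: no state groups are rendered
    if intervention.startswith("no_"):
        dropped = intervention[len("no_"):]
        if dropped not in GROUPS:
            raise ValueError(f"unknown group: {dropped}")
        return frozenset(GROUPS) - {dropped}
    # Names such as compact, minimal, or max_300_tokens are not pure group filters.
    raise ValueError(f"not a group-filter intervention: {intervention}")


print(sorted(groups_for("no_decisions")))
```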

Labels

The lab emits two label layers:

  1. group-level usefulness labels
  2. candidate-level export rows for future selector training

Group labels are:

  • helpful
  • harmful
  • neutral
  • too_expensive
  • unknown

These labels are derived from deterministic metric deltas and token deltas.
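
One way such a rule could look is sketched below. It is an illustration only: the thresholds `eps` and `min_tokens_saved`, the function name, and the exact decision order are assumptions, and the lab's real labeling rules may differ. Deltas are taken as (run with the group removed) minus (compact baseline).

```python
def label_group(fidelity_delta, token_delta, eps=0.01, min_tokens_saved=50):
    """Label a group from the deltas observed when that group is removed.

    Thresholds are illustrative assumptions, not values used by the lab.
    """
    if fidelity_delta is None or token_delta is None:
        return "unknown"        # no usable measurement
    if fidelity_delta <= -eps:
        return "helpful"        # removing the group hurt the metric
    if fidelity_delta >= eps:
        return "harmful"        # removing the group improved the metric
    if -token_delta >= min_tokens_saved:
        return "too_expensive"  # metric unchanged, but many tokens saved
    return "neutral"


print(label_group(-0.15, -60))
print(label_group(0.0, -120))
```

Checking the metric delta before the token delta means a group that clearly changes the metric is never labeled too_expensive, whatever its token cost.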

Run the smoke lab

uv run python -m evals.state_value.run \
  --suite smoke \
  --backend coding_continuation \
  --interventions all \
  --out /tmp/stateplane-state-value-smoke \
  --json \
  --markdown \
  --csv \
  --export-training-rows

Run the publish lab

uv run python -m evals.state_value.run \
  --suite publish \
  --backend coding_continuation \
  --interventions all \
  --out evals/state_value/reports/publish \
  --json \
  --markdown \
  --csv \
  --export-training-rows \
  --seed 0

Read the report

Each run writes artifacts such as:

<out>/
  metadata.json
  report.md
  state_value_summary.json
  group_value.json
  group_value.csv
  intervention_scores.json
  intervention_scores.csv
  training_rows.jsonl
  telemetry.jsonl
  contexts/
  failures.json
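
Before reading a report, it can help to confirm that the expected artifacts were written. The helper below is a small sketch: the file names come from the listing above, but `missing_artifacts` is a hypothetical function, the listing may not be exhaustive, and some files may only appear when the matching output flag is passed.

```python
from pathlib import Path

# Names taken from the listing above; treat the list as a starting point.
EXPECTED_FILES = [
    "metadata.json",
    "report.md",
    "state_value_summary.json",
    "group_value.json",
    "group_value.csv",
    "intervention_scores.json",
    "intervention_scores.csv",
    "training_rows.jsonl",
    "telemetry.jsonl",
    "failures.json",
]
EXPECTED_DIRS = ["contexts"]


def missing_artifacts(out_dir):
    """Return the expected files and directories that are absent from out_dir."""
    root = Path(out_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    missing += [name + "/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    return missing


print(missing_artifacts("/tmp/stateplane-state-value-smoke"))
```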

Training data export

training_rows.jsonl is the future-selector dataset.

Each row stores:

  • deterministic candidate features
  • selected-in-base and selected-in-best flags
  • group value label
  • token cost
  • delta metrics
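
A quick sanity check on the export is to tally rows per label. The sketch below assumes one JSON object per line and a label key named `group_value_label`; that key name is a guess for illustration, so check a real row for the actual field.

```python
import json
from collections import Counter
from pathlib import Path


def label_counts(path, label_field="group_value_label"):
    """Count rows per label in a JSON Lines file; `label_field` is an assumed key."""
    counts = Counter()
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            row = json.loads(line)
            counts[row.get(label_field, "unknown")] += 1
    return counts
```

For example, `label_counts("evals/state_value/reports/publish/training_rows.jsonl")` would return a `Counter` keyed by label, with rows lacking the assumed key counted under unknown.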

State Value Lab only exports this dataset; it does not train a model.

Selector v1 now consumes this export together with Milestone A targets_v2.jsonl and telemetry to build a joined, family-aware training table for schema-v2 empirical training. Milestone C reuses the same cached exports for feature and target ablations; it does not fabricate new targets or labels at evaluation time.

What this does not prove

The first backend uses deterministic offline continuation metrics.

It does not prove live coding-agent success.

It measures which state groups improve or harm state-fidelity metrics under controlled ablations.

External and real-trace outcome attribution should be added later.

Future: learned selector

State Value Lab itself stops at telemetry, labels, and export rows.

Training a small selector model on those exported rows is handled separately, in State Selector v0.

State Value Lab remains the data-generation and ablation layer. State Selector v0 consumes those checked-in outputs, validates them, trains a tiny empirical model, and evaluates it against the heuristic selector. Selector v1 extends that path with feature schema v2 and target schema v2. See Selector v1.