Selector v1 Dataset

Goal

Selector v1 Milestone A does not train a new selector yet.

It establishes a cleaner data contract for future selector work:

fixture families
  -> deterministic selector-v1 fixture scaling
  -> family-aware splits
  -> selector-v1 State Value outputs
  -> target schema v2
  -> target-quality validation

The objective is not raw row count by itself.

The objective is a larger, family-aware dataset that can be validated without leaking variants across train and test splits.

Fixture families

Selector v1 introduces flat string metadata on each continuation fixture:

fixture_family_id
variant_id
variant_type
source_fixture_id
fixture_source

These fields stay within the existing fixture contract, metadata: dict[str, str], so every value is a plain string.
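As an illustration of the flat string contract, the sketch below checks that a metadata dict carries the five family fields and that every value is a string. The field names come from the list above; the helper name `check_family_metadata` and all example values are hypothetical and are not taken from the repository.

```python
# Field names come from the list above; the helper and the values are illustrative only.
REQUIRED_FAMILY_FIELDS = (
    "fixture_family_id",
    "variant_id",
    "variant_type",
    "source_fixture_id",
    "fixture_source",
)


def check_family_metadata(metadata: dict) -> list[str]:
    """Return a list of problems; an empty list means the metadata fits the flat contract."""
    problems = []
    for field in REQUIRED_FAMILY_FIELDS:
        if field not in metadata:
            problems.append(f"missing field: {field}")
    for key, value in metadata.items():
        if not isinstance(value, str):
            problems.append(f"non-string value for {key}: {type(value).__name__}")
    return problems


# Hypothetical metadata for one generated variant.
example_metadata = {
    "fixture_family_id": "family-001",
    "variant_id": "family-001-request_reworded",
    "variant_type": "request_reworded",
    "source_fixture_id": "fixture-001",
    "fixture_source": "generated",
}

print(check_family_metadata(example_metadata))
```

Nested provenance (for example a dict of generation parameters) would fail this check, which is consistent with keeping the richer provenance in the manifest rather than in the fixture metadata.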

The richer provenance for generated variants lives in:

evals/coding_continuation/fixtures/selector_v1/manifest.json

Generated variants

Selector v1 keeps the runnable suite flat:

evals/coding_continuation/fixtures/selector_v1/*.json

Milestone A does not widen the continuation loader to nested directories.

The initial deterministic variant set is intentionally conservative:

base
request_reworded
budget_pressure
stale_log_noise

These variants enlarge fixture coverage. They do not create new candidate-level labels and should not be counted as such.

Dataset scaling

Build the selector-v1 suite with:

uv run python -m evals.state_selector.dataset_scaling \
  --source evals/coding_continuation/fixtures/publish \
  --out-suite-dir evals/coding_continuation/fixtures/selector_v1 \
  --manifest evals/coding_continuation/fixtures/selector_v1/manifest.json \
  --report-out evals/state_selector/reports/selector-v1-dataset \
  --target-fixtures 120 \
  --seed 0 \
  --json \
  --markdown

This writes:

the flat selector-v1 suite
manifest.json
evals/state_selector/reports/selector-v1-dataset/dataset_report.md

Family-aware splits

Selector v1 Milestone A adds:

category_stratified_by_fixture_family

This split mode groups variants by fixture_family_id so generated variants from the same base scenario cannot leak across train, validation, and test.
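The grouping idea can be sketched as follows. This is a simplified illustration, not the repository's implementation: it assigns whole families to splits with a seeded shuffle and omits the category stratification that category_stratified_by_fixture_family implies. The function name `split_by_family`, the fixture shape (a dict with a `metadata` key), and the ratios are all assumptions.

```python
import random
from collections import defaultdict


def split_by_family(fixtures, seed=0, ratios=(0.8, 0.1, 0.1)):
    """Assign whole fixture families to train/validation/test (no stratification)."""
    # Group fixtures by family so all variants of a family move together.
    families = defaultdict(list)
    for fixture in fixtures:
        families[fixture["metadata"]["fixture_family_id"]].append(fixture)

    # Sort before shuffling so the result depends only on the seed, not input order.
    family_ids = sorted(families)
    random.Random(seed).shuffle(family_ids)

    train_end = int(len(family_ids) * ratios[0])
    validation_end = train_end + int(len(family_ids) * ratios[1])

    splits = {"train": [], "validation": [], "test": []}
    for index, family_id in enumerate(family_ids):
        if index < train_end:
            name = "train"
        elif index < validation_end:
            name = "validation"
        else:
            name = "test"
        splits[name].extend(families[family_id])
    return splits


# Hypothetical suite: 10 families, each with the 4 variant types listed earlier.
variant_types = ["base", "request_reworded", "budget_pressure", "stale_log_noise"]
fixtures = [
    {"metadata": {"fixture_family_id": f"family-{i:03d}", "variant_type": v}}
    for i in range(10)
    for v in variant_types
]
splits = split_by_family(fixtures, seed=0)
```

Because the unit of assignment is the family rather than the fixture, split sizes are only approximately proportional to the requested ratios when families differ in size.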

Older split modes continue to work:

category_stratified_by_fixture
random_by_fixture
leave_category_out

Leakage risks

Milestone A keeps the same hard leakage bar as selector v0:

no ground-truth keys in training features
no expected_inclusions or expected_exclusions
no forbidden phrase lists as features
no solution patches
no benchmark answer leakage
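One way to enforce part of this bar is a recursive scan of each training row for forbidden keys. The sketch below is an assumption about how such a check might look; it only covers the two key names mentioned above, and the helper name `find_forbidden_keys` and the row layout are hypothetical.

```python
# Keys named in the list above; a real deny-list would likely be longer (assumption).
FORBIDDEN_FEATURE_KEYS = frozenset({"expected_inclusions", "expected_exclusions"})


def find_forbidden_keys(row, path=""):
    """Recursively collect dotted paths of forbidden keys inside a feature row."""
    hits = []
    if isinstance(row, dict):
        for key, value in row.items():
            child = f"{path}.{key}" if path else str(key)
            if key in FORBIDDEN_FEATURE_KEYS:
                hits.append(child)
            # Keep descending: a forbidden key may also hide deeper in the row.
            hits.extend(find_forbidden_keys(value, child))
    elif isinstance(row, list):
        for index, item in enumerate(row):
            hits.extend(find_forbidden_keys(item, f"{path}[{index}]"))
    return hits


# Hypothetical rows, for illustration only.
clean_row = {"features": {"token_count": 12}}
leaky_row = {
    "features": {"expected_inclusions": ["x"]},
    "items": [{"expected_exclusions": []}],
}
print(find_forbidden_keys(leaky_row))
```

A key-name scan catches only the structural cases. Solution patches or benchmark answers embedded in free text would need content-level checks.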

The new family-aware split closes the other major risk:

generated variants from one family cannot be split across train and test
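A check for this risk could look like the following sketch, which reports every family that appears in more than one split. The function name `families_in_multiple_splits` and the input shape (split name mapped to a list of fixture_family_id values) are assumptions made for illustration.

```python
from collections import defaultdict


def families_in_multiple_splits(split_to_family_ids):
    """Map each leaking family id to the sorted list of splits it appears in."""
    seen = defaultdict(set)
    for split_name, family_ids in split_to_family_ids.items():
        for family_id in family_ids:
            seen[family_id].add(split_name)
    return {
        family_id: sorted(split_names)
        for family_id, split_names in seen.items()
        if len(split_names) > 1
    }


# Hypothetical assignment in which family-002 leaks from train into test.
assignment = {
    "train": ["family-001", "family-001", "family-002"],
    "validation": ["family-003"],
    "test": ["family-002", "family-004"],
}
leaks = families_in_multiple_splits(assignment)
```

An empty result means no family is split across train, validation, and test.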

Validation

Run the extended validator with selector-v1 targets:

uv run python -m evals.state_selector.validate \
  --input evals/state_value/reports/selector-v1/training_rows.jsonl \
  --group-value evals/state_value/reports/selector-v1/group_value.json \
  --telemetry evals/state_value/reports/selector-v1/telemetry.jsonl \
  --targets evals/state_selector/reports/selector-v1-targets/targets_v2.jsonl \
  --pairwise-targets evals/state_selector/reports/selector-v1-targets/pairwise_targets.jsonl \
  --target-schema-version 2 \
  --min-families 30 \
  --out evals/state_selector/reports/validation-v1 \
  --json \
  --markdown

Milestone A intentionally treats size thresholds as warnings rather than blocking errors, at least initially.

The validator should block on:

leakage
impossible hard-invariant targets
raw quarantined content marked model-visible
resolved failures marked as active model-visible work
all utilities missing
all visibility targets unknown
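The warn-versus-block policy can be sketched as a small classification step. The check identifiers below are hypothetical labels derived from the list above, not names used by the validator, and `summarize_findings` is an illustrative helper.

```python
# Hypothetical identifiers for the blocking conditions listed above.
BLOCKING_CHECKS = frozenset({
    "leakage",
    "impossible_hard_invariant_targets",
    "raw_quarantined_content_model_visible",
    "resolved_failures_active_model_visible",
    "all_utilities_missing",
    "all_visibility_targets_unknown",
})


def summarize_findings(findings):
    """Split (check_id, message) findings into blocking errors and warnings."""
    blocking = [f for f in findings if f[0] in BLOCKING_CHECKS]
    warnings = [f for f in findings if f[0] not in BLOCKING_CHECKS]
    return {"ok": not blocking, "blocking": blocking, "warnings": warnings}


# A size shortfall alone produces a warning, not a failure (illustrative data).
size_only = summarize_findings([("min_families", "fewer families than requested")])
with_leak = summarize_findings([
    ("min_families", "fewer families than requested"),
    ("leakage", "forbidden key found in a training row"),
])
```

Under this policy a run with only size warnings still passes, while any single blocking finding fails the run regardless of dataset size.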

Limitations

Selector v1 Milestone A does not train selector v1 yet.
Candidate-level marginal utility is still unavailable when the lab only emits group ablations.
External trace-derived fixtures are deferred.
This milestone optimizes for data-contract honesty, not maximum synthetic scale.