Selector v1 Dataset

Goal

Selector v1 Milestone A does not train a new selector yet.

It establishes a cleaner data contract for future selector work:

fixture families
  -> deterministic selector-v1 fixture scaling
  -> family-aware splits
  -> selector-v1 State Value outputs
  -> target schema v2
  -> target-quality validation

The objective is not raw row count by itself.

The objective is a larger, family-aware dataset that can be validated without leaking variants across train and test splits.

Fixture families

Selector v1 introduces flat string metadata on each continuation fixture:

  • fixture_family_id
  • variant_id
  • variant_type
  • source_fixture_id
  • fixture_source

These fields stay inside the existing metadata: dict[str, str] fixture contract.
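As a rough illustration of that contract, the sketch below checks that a fixture's metadata carries the five family fields and that every entry is a flat string. The function name `check_family_metadata` and all example values are hypothetical; only the five key names come from this document.

```python
# The five family fields listed above.
REQUIRED_FAMILY_KEYS = (
    "fixture_family_id",
    "variant_id",
    "variant_type",
    "source_fixture_id",
    "fixture_source",
)


def check_family_metadata(metadata: dict) -> list[str]:
    """Return a list of problems; an empty list means the metadata looks fine."""
    problems = []
    for key in REQUIRED_FAMILY_KEYS:
        if key not in metadata:
            problems.append(f"missing key: {key}")
    for key, value in metadata.items():
        # The fixture contract is dict[str, str], so nested values are rejected.
        if not isinstance(key, str) or not isinstance(value, str):
            problems.append(f"non-string entry: {key!r}")
    return problems


# Hypothetical example values, for illustration only.
example_metadata = {
    "fixture_family_id": "family-001",
    "variant_id": "family-001__base",
    "variant_type": "base",
    "source_fixture_id": "fixture-001",
    "fixture_source": "example-source",
}
```

Calling `check_family_metadata(example_metadata)` returns an empty list. Dropping a key or storing a nested dict would each add one entry to the result.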

The richer provenance for generated variants lives in:

evals/coding_continuation/fixtures/selector_v1/manifest.json

Generated variants

Selector v1 keeps the runnable suite flat:

evals/coding_continuation/fixtures/selector_v1/*.json

Milestone A does not widen the continuation loader to nested directories.

The initial deterministic variant set is intentionally conservative:

  • base
  • request_reworded
  • budget_pressure
  • stale_log_noise

These variants enlarge fixture coverage without pretending they create new candidate-level labels.
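One way to picture deterministic expansion is the sketch below. It is not the `dataset_scaling` implementation: the function `plan_variants`, the `variant_id` naming scheme, and the choice to reuse the source fixture id as the family id are all assumptions made for illustration. It leaves out `fixture_source` because this document does not list its allowed values.

```python
# The four variant types named above, in a fixed order.
VARIANT_TYPES = ("base", "request_reworded", "budget_pressure", "stale_log_noise")


def plan_variants(source_fixture_ids, target_fixtures):
    """Plan up to target_fixtures variants in a fixed, reproducible order.

    Iterating variant types in the outer loop means every family gets its
    base fixture before any family gets a second variant.
    """
    plan = []
    for variant_type in VARIANT_TYPES:
        for source_id in sorted(source_fixture_ids):
            if len(plan) >= target_fixtures:
                return plan
            plan.append(
                {
                    # Assumption: one family per source fixture.
                    "fixture_family_id": source_id,
                    # Hypothetical naming scheme.
                    "variant_id": f"{source_id}__{variant_type}",
                    "variant_type": variant_type,
                    "source_fixture_id": source_id,
                }
            )
    return plan
```

Because the plan depends only on sorted ids and a fixed variant order, the same inputs always produce the same plan, regardless of the order in which source fixtures are discovered.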

Dataset scaling

Build the selector-v1 suite with:

uv run python -m evals.state_selector.dataset_scaling \
  --source evals/coding_continuation/fixtures/publish \
  --out-suite-dir evals/coding_continuation/fixtures/selector_v1 \
  --manifest evals/coding_continuation/fixtures/selector_v1/manifest.json \
  --report-out evals/state_selector/reports/selector-v1-dataset \
  --target-fixtures 120 \
  --seed 0 \
  --json \
  --markdown

This writes:

  • the flat selector-v1 suite
  • manifest.json
  • evals/state_selector/reports/selector-v1-dataset/dataset_report.md

Family-aware splits

Selector v1 Milestone A adds:

category_stratified_by_fixture_family

This split mode groups variants by fixture_family_id so generated variants from the same base scenario cannot leak across train, validation, and test.
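The grouping idea can be sketched as follows. This is a minimal illustration, not the project's split code: the function name, the `category` field, the split fractions, and the hash-based ordering are assumptions. The key property is that the split is assigned per family, so every variant inherits its family's split.

```python
import hashlib
from collections import defaultdict


def _stable_key(family_id: str, seed: int) -> str:
    # Hash-based ordering is stable across runs, unlike Python's built-in hash().
    return hashlib.sha256(f"{seed}:{family_id}".encode("utf-8")).hexdigest()


def assign_family_splits(fixtures, seed=0, fractions=(0.7, 0.15)):
    """Map each fixture_family_id to 'train', 'validation', or 'test'.

    fixtures: iterable of dicts with a 'category' string (assumed field)
    and a 'metadata' dict containing 'fixture_family_id'.
    fractions: (train, validation) shares; the remainder goes to test.
    """
    families_by_category = defaultdict(set)
    for fixture in fixtures:
        family_id = fixture["metadata"]["fixture_family_id"]
        families_by_category[fixture["category"]].add(family_id)

    assignment = {}
    for category in sorted(families_by_category):
        ordered = sorted(
            families_by_category[category], key=lambda f: _stable_key(f, seed)
        )
        n_train = int(len(ordered) * fractions[0])
        n_validation = int(len(ordered) * fractions[1])
        for index, family_id in enumerate(ordered):
            if index < n_train:
                assignment[family_id] = "train"
            elif index < n_train + n_validation:
                assignment[family_id] = "validation"
            else:
                assignment[family_id] = "test"
    return assignment
```

A fixture's split is then `assignment[fixture["metadata"]["fixture_family_id"]]`, which makes it impossible for two variants of the same family to land in different splits.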

Older split modes continue to work:

  • category_stratified_by_fixture
  • random_by_fixture
  • leave_category_out

Leakage risks

Milestone A keeps the same hard leakage bar as selector v0:

  • no ground-truth keys in training features
  • no expected_inclusions or expected_exclusions
  • no forbidden phrase lists as features
  • no solution patches
  • no benchmark answer leakage
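A check along these lines could scan feature dictionaries for leaky keys. It is a sketch under assumptions: `expected_inclusions` and `expected_exclusions` are named in the list above, while the name fragments used for the other items are hypothetical placeholders that would need to match the real schema.

```python
# Keys named explicitly in the leakage list above.
FORBIDDEN_KEYS = {"expected_inclusions", "expected_exclusions"}
# Hypothetical name fragments standing in for the other list items.
FORBIDDEN_FRAGMENTS = ("ground_truth", "forbidden_phrase", "solution_patch")


def find_leaky_keys(features: dict, prefix: str = "") -> list[str]:
    """Return dotted paths of feature keys that look like label leakage."""
    leaks = []
    for key, value in features.items():
        path = f"{prefix}{key}"
        lowered = key.lower()
        if lowered in FORBIDDEN_KEYS or any(
            fragment in lowered for fragment in FORBIDDEN_FRAGMENTS
        ):
            leaks.append(path)
        # Recurse so leaks nested under harmless-looking keys are still found.
        if isinstance(value, dict):
            leaks.extend(find_leaky_keys(value, prefix=f"{path}."))
    return leaks
```

A key-name scan like this only catches leakage that is visible in field names; it cannot detect answer content copied into an otherwise innocuous field.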

The new family-aware split closes the other major risk:

  • generated variants from one family cannot be split across train and test
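That invariant is easy to verify after the fact. The sketch below assumes each row is a dict with `fixture_family_id` and a `split` label; the `split` field name and the function name are hypothetical.

```python
from collections import defaultdict


def families_spanning_splits(rows):
    """Return {family_id: sorted split names} for families seen in more than one split."""
    splits_by_family = defaultdict(set)
    for row in rows:
        splits_by_family[row["fixture_family_id"]].add(row["split"])
    return {
        family_id: sorted(splits)
        for family_id, splits in splits_by_family.items()
        if len(splits) > 1
    }
```

An empty result means no family crosses a split boundary. Any non-empty result names the offending families and the splits they appear in.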

Validation

Run the extended validator with selector-v1 targets:

uv run python -m evals.state_selector.validate \
  --input evals/state_value/reports/selector-v1/training_rows.jsonl \
  --group-value evals/state_value/reports/selector-v1/group_value.json \
  --telemetry evals/state_value/reports/selector-v1/telemetry.jsonl \
  --targets evals/state_selector/reports/selector-v1-targets/targets_v2.jsonl \
  --pairwise-targets evals/state_selector/reports/selector-v1-targets/pairwise_targets.jsonl \
  --target-schema-version 2 \
  --min-families 30 \
  --out evals/state_selector/reports/validation-v1 \
  --json \
  --markdown

Milestone A intentionally reports size-threshold shortfalls as warnings for now, not as blocking failures.

The validator should block on:

  • leakage
  • impossible hard-invariant targets
  • raw quarantined content marked model-visible
  • resolved failures marked as active model-visible work
  • all utilities missing
  • all visibility targets unknown
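The warn-versus-block policy can be pictured with the sketch below. It does not describe the validator's internals: the `Finding` type, the check identifiers, and the exit-code convention are hypothetical, chosen only to mirror the blocking list above.

```python
from dataclasses import dataclass

# Hypothetical identifiers, one per blocking condition listed above.
BLOCKING_CHECKS = frozenset(
    {
        "leakage",
        "impossible_hard_invariant_target",
        "quarantined_content_model_visible",
        "resolved_failure_marked_active",
        "all_utilities_missing",
        "all_visibility_targets_unknown",
    }
)


@dataclass(frozen=True)
class Finding:
    check: str    # identifier of the check that fired
    message: str  # human-readable detail


def partition_findings(findings):
    """Split findings into (blocking, warnings) based on BLOCKING_CHECKS."""
    blocking = [f for f in findings if f.check in BLOCKING_CHECKS]
    warnings = [f for f in findings if f.check not in BLOCKING_CHECKS]
    return blocking, warnings


def exit_code(findings) -> int:
    """Return 1 if any blocking finding is present, otherwise 0."""
    blocking, _ = partition_findings(findings)
    return 1 if blocking else 0
```

Under this scheme a size shortfall, for example a finding with a hypothetical `min_families` check id, would appear in the report but leave the exit code at 0.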

Limitations

  • Selector v1 Milestone A does not train selector v1 yet.
  • Candidate-level marginal utility is still unavailable when the lab only emits group ablations.
  • External trace-derived fixtures are deferred.
  • This milestone optimizes for data-contract honesty, not maximum synthetic scale.