Selector v1 Dataset
Goal
Selector v1 Milestone A does not train a new selector yet.
It establishes a cleaner data contract for future selector work:
fixture families
-> deterministic selector-v1 fixture scaling
-> family-aware splits
-> selector-v1 State Value outputs
-> target schema v2
-> target-quality validation
The objective is not raw row count by itself.
The objective is a larger, family-aware dataset that can be validated without leaking variants across train and test splits.
Fixture families
Selector v1 introduces flat string metadata on each continuation fixture:
- fixture_family_id
- variant_id
- variant_type
- source_fixture_id
- fixture_source
These fields stay inside the existing metadata: dict[str, str] fixture contract.
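As a minimal sketch of what this contract implies, the check below verifies that a fixture's metadata is a flat `dict[str, str]` and carries the five family fields. The helper name `check_family_metadata` and the assumption that all five fields are required on every fixture are illustrative; neither is taken from the selector-v1 code.

```python
FAMILY_FIELDS = (
    "fixture_family_id",
    "variant_id",
    "variant_type",
    "source_fixture_id",
    "fixture_source",
)


def check_family_metadata(metadata: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the metadata passes."""
    problems = []
    for key, value in metadata.items():
        # The fixture contract is flat: string keys mapping to string values.
        if not isinstance(key, str) or not isinstance(value, str):
            problems.append(f"non-string entry: {key!r}")
    for field in FAMILY_FIELDS:
        # Assumption: every selector-v1 fixture carries all five fields.
        if not metadata.get(field):
            problems.append(f"missing or empty field: {field}")
    return problems
```

A nested value such as `{"provenance": {"seed": "0"}}` would be reported as a non-string entry, which is the kind of drift the flat contract is meant to prevent.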
The richer provenance for generated variants lives in:
Generated variants
Selector v1 keeps the runnable suite flat:
Milestone A does not widen the continuation loader to nested directories.
The initial deterministic variant set is intentionally conservative:
- base
- request_reworded
- budget_pressure
- stale_log_noise
These variants enlarge fixture coverage without pretending they create new candidate-level labels.
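To illustrate how one source fixture could fan out into a family, here is a rough sketch that emits one metadata record per variant type. The function name `expand_family`, the `variant_id` naming scheme, the choice to reuse the source fixture ID as the family ID, and the `fixture_source` values are all hypothetical; only the four variant types and the five field names come from this page.

```python
VARIANT_TYPES = ("base", "request_reworded", "budget_pressure", "stale_log_noise")


def expand_family(source_fixture_id: str) -> list[dict[str, str]]:
    """Build one metadata record per variant type for a single source fixture."""
    records = []
    for variant_type in VARIANT_TYPES:
        records.append(
            {
                # Hypothetical convention: the family is named after its source fixture.
                "fixture_family_id": source_fixture_id,
                # Hypothetical ID scheme; iterating a fixed tuple keeps output deterministic.
                "variant_id": f"{source_fixture_id}__{variant_type}",
                "variant_type": variant_type,
                "source_fixture_id": source_fixture_id,
                # Hypothetical values; the real fixture_source vocabulary is not shown here.
                "fixture_source": "publish" if variant_type == "base" else "generated",
            }
        )
    return records
```

Every record in the returned list shares one `fixture_family_id`, which is what lets a later split keep the whole family together.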
Dataset scaling
Build the selector-v1 suite with:
uv run python -m evals.state_selector.dataset_scaling \
  --source evals/coding_continuation/fixtures/publish \
  --out-suite-dir evals/coding_continuation/fixtures/selector_v1 \
  --manifest evals/coding_continuation/fixtures/selector_v1/manifest.json \
  --report-out evals/state_selector/reports/selector-v1-dataset \
  --target-fixtures 120 \
  --seed 0 \
  --json \
  --markdown
This writes:
- the flat selector-v1 suite
- manifest.json
- evals/state_selector/reports/selector-v1-dataset/dataset_report.md
Family-aware splits
Selector v1 Milestone A adds:
This split mode groups variants by fixture_family_id so generated variants from the same base
scenario cannot leak across train, validation, and test.
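One way to get this grouping behaviour is to assign the split per family rather than per fixture. The sketch below does that with a stable hash of `fixture_family_id`; it is an assumption-laden illustration, not the project's splitter. The function names, the 70/15/15 proportions, and the hashing approach itself (the real split mode may use a seeded shuffle instead) are all hypothetical.

```python
import hashlib


def assign_split(fixture_family_id: str, seed: int = 0) -> str:
    """Map a family to train/validation/test using a stable hash of its ID."""
    digest = hashlib.sha256(f"{seed}:{fixture_family_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    # Assumed 70/15/15 proportions; the real ratios are not stated on this page.
    if bucket < 70:
        return "train"
    if bucket < 85:
        return "validation"
    return "test"


def split_fixtures(
    fixtures: list[dict[str, str]], seed: int = 0
) -> dict[str, list[dict[str, str]]]:
    """Group fixture metadata records into splits, keeping each family together."""
    splits: dict[str, list[dict[str, str]]] = {"train": [], "validation": [], "test": []}
    for metadata in fixtures:
        # The split depends only on the family ID, so all variants follow it.
        splits[assign_split(metadata["fixture_family_id"], seed)].append(metadata)
    return splits
```

Because the split is a pure function of the family ID and the seed, rerunning with the same seed reproduces the same assignment, and no variant can end up in a different split from its siblings.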
Older split modes continue to work:
- category_stratified_by_fixture
- random_by_fixture
- leave_category_out
Leakage risks
Milestone A keeps the same hard leakage bar as selector v0:
- no ground-truth keys in training features
- no expected_inclusions or expected_exclusions
- no forbidden phrase lists as features
- no solution patches
- no benchmark answer leakage
The new family-aware split closes the other major risk:
- generated variants from one family cannot be split across train and test
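Two of the checks listed above are mechanical enough to sketch. The helpers below flag families whose fixtures land in more than one split and flag ground-truth keys that appear among training features. The function names and the input shapes (plain dicts keyed by fixture ID) are assumptions for illustration; only `expected_inclusions` and `expected_exclusions` are taken from this page.

```python
FORBIDDEN_FEATURE_KEYS = frozenset({"expected_inclusions", "expected_exclusions"})


def families_crossing_splits(
    split_by_fixture: dict[str, str], family_by_fixture: dict[str, str]
) -> set[str]:
    """Return family IDs whose fixtures land in more than one split."""
    splits_seen: dict[str, set[str]] = {}
    for fixture_id, split in split_by_fixture.items():
        family = family_by_fixture[fixture_id]
        splits_seen.setdefault(family, set()).add(split)
    return {family for family, splits in splits_seen.items() if len(splits) > 1}


def forbidden_feature_keys(features: dict[str, object]) -> set[str]:
    """Return any ground-truth keys that appear in a training feature row."""
    return set(features) & FORBIDDEN_FEATURE_KEYS
```

An empty result from both helpers is necessary but not sufficient: solution patches and benchmark answers can leak through free-text fields that a key check will not see.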
Validation
Run the extended validator with selector-v1 targets:
uv run python -m evals.state_selector.validate \
  --input evals/state_value/reports/selector-v1/training_rows.jsonl \
  --group-value evals/state_value/reports/selector-v1/group_value.json \
  --telemetry evals/state_value/reports/selector-v1/telemetry.jsonl \
  --targets evals/state_selector/reports/selector-v1-targets/targets_v2.jsonl \
  --pairwise-targets evals/state_selector/reports/selector-v1-targets/pairwise_targets.jsonl \
  --target-schema-version 2 \
  --min-families 30 \
  --out evals/state_selector/reports/validation-v1 \
  --json \
  --markdown
Milestone A intentionally treats size thresholds as warnings for now, not as blocking failures.
The validator should block on:
- leakage
- impossible hard-invariant targets
- raw quarantined content marked model-visible
- resolved failures marked as active model-visible work
- all utilities missing
- all visibility targets unknown
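The warning-versus-blocking policy above can be sketched as a small severity partition. Everything named here is hypothetical: the check identifiers, the `Finding` type, and the exit-code convention are illustrative stand-ins, not the validator's actual interface.

```python
from dataclasses import dataclass

# Hypothetical identifiers for the blocking conditions listed above.
BLOCKING_CHECKS = frozenset(
    {
        "leakage",
        "impossible_hard_invariant_target",
        "quarantined_content_model_visible",
        "resolved_failure_marked_active",
        "all_utilities_missing",
        "all_visibility_targets_unknown",
    }
)


@dataclass(frozen=True)
class Finding:
    check: str
    message: str


def partition_findings(findings: list[Finding]) -> tuple[list[Finding], list[Finding]]:
    """Split findings into (errors, warnings) based on the blocking set."""
    errors = [f for f in findings if f.check in BLOCKING_CHECKS]
    # Anything outside the blocking set, such as a size threshold, only warns.
    warnings = [f for f in findings if f.check not in BLOCKING_CHECKS]
    return errors, warnings


def exit_code(findings: list[Finding]) -> int:
    """Return 1 if any blocking finding is present, otherwise 0."""
    errors, _ = partition_findings(findings)
    return 1 if errors else 0
```

Under this sketch, a run that only falls short of a size threshold still exits 0 and reports the shortfall, while a single leakage finding fails the run.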
Limitations
- Selector v1 Milestone A does not train selector v1 yet.
- Candidate-level marginal utility is still unavailable when the lab only emits group ablations.
- External trace-derived fixtures are deferred.
- This milestone optimizes for data-contract honesty, not maximum synthetic scale.