Add tiny dataset generation to pipeline by nwoodruff-co · Pull Request #289 · PolicyEngine/policyengine-uk-data

nwoodruff-co · 2026-03-12T13:24:06Z

Summary

Produces frs_2023_24_tiny.h5 and enhanced_frs_2023_24_tiny.h5 (1,000 households each) at the end of the dataset creation pipeline
Improves subsample_dataset so that it samples households with probability proportional to weight and rescales the weights to preserve population totals
Enables faster testing and development without needing the full ~20k-household datasets
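
The weighted subsampling described above can be sketched roughly as follows. This is not the actual subsample_dataset code: the function name subsample_households, the flat household table, and the choice to rescale the sampled rows' own weights by a single factor are all assumptions made for illustration. Only the household_weight column name and the 1,000-household target come from this PR.

```python
import numpy as np
import pandas as pd


def subsample_households(households: pd.DataFrame, n: int, seed: int = 0) -> pd.DataFrame:
    """Hypothetical helper: draw n households with probability proportional to weight."""
    rng = np.random.default_rng(seed)
    weights = households["household_weight"].to_numpy(dtype=float)
    total = weights.sum()
    # Each household's selection probability is its share of the total weight.
    picked = rng.choice(len(households), size=n, replace=False, p=weights / total)
    sample = households.iloc[picked].copy()
    # Assumed rescaling: one common factor so the subsample sums to the full total.
    sample["household_weight"] *= total / sample["household_weight"].sum()
    return sample


# Synthetic stand-in for a ~20k-household dataset (values are made up).
full = pd.DataFrame({
    "household_id": np.arange(20_000),
    "household_weight": np.random.default_rng(1).uniform(500, 3_000, 20_000),
})
tiny = subsample_households(full, n=1_000)
```

With this rescaling, the sum of household_weight in the tiny table equals the sum in the full table up to floating-point error, which is the property the test plan checks.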

Test plan

Verify tiny datasets load correctly with UKSingleYearDataset
Check that household_weight sums match the full dataset
Run a basic Microsimulation on a tiny dataset to confirm it works end-to-end

nwoodruff-co added 3 commits March 12, 2026 13:23

Add tiny dataset generation (n=1000 households) to pipeline

265f7f3

Produces frs_2023_24_tiny.h5 and enhanced_frs_2023_24_tiny.h5 by subsampling with probability proportional to weight and rescaling weights to preserve population totals.

Fix subsample NaN crash and formatting

fa4905f

Handle zero/NaN weights by falling back to uniform sampling. This fixes the crash in impute_income, which passes a zero-weight dataset.
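
A minimal sketch of such a fallback, assuming the sampling probabilities are computed in a separate step. The helper name sampling_probabilities is hypothetical, and treating individual NaN or non-positive weights as zero is an assumption; the commit only says that zero/NaN weights fall back to uniform sampling.

```python
import numpy as np


def sampling_probabilities(weights) -> np.ndarray:
    """Hypothetical helper: probabilities proportional to weight, else uniform."""
    w = np.asarray(weights, dtype=float)
    # Assumption: NaN, infinite, or non-positive weights contribute nothing.
    w = np.where(np.isfinite(w) & (w > 0), w, 0.0)
    total = w.sum()
    if total <= 0:
        # No usable weight at all (e.g. a zero-weight dataset): sample uniformly.
        return np.full(len(w), 1.0 / len(w))
    return w / total


uniform = sampling_probabilities([0.0, 0.0, 0.0, 0.0])
weighted = sampling_probabilities([1.0, 3.0])
```

Without a guard like this, dividing by a zero or NaN total yields NaN probabilities, which a weighted sampler will refuse.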

Fix weighted SPI sampling with replace=True

d02d0b3

Pandas rejects weighted sampling without replacement when weights are large. Use replace=True instead, since the sample is used for training data.
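
For illustration, a weighted draw with replacement using pandas' DataFrame.sample might look like the sketch below. The table, column names, and sample size are invented for the example and are not taken from the SPI code in this repository.

```python
import numpy as np
import pandas as pd

# Synthetic records where one row carries most of the total weight.
records = pd.DataFrame({
    "income": np.linspace(10_000, 500_000, 1_000),
    "weight": np.r_[9_000_000.0, np.full(999, 1_000.0)],
})

# replace=True lets heavily weighted rows be drawn more than once,
# so the draw does not depend on how concentrated the weights are.
training = records.sample(n=100, weights="weight", replace=True, random_state=0)
```

The trade-off is that the sample can contain duplicate rows, which the commit message treats as acceptable for training data.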

nwoodruff-co merged commit cf2d4d7 into main Mar 12, 2026
3 checks passed

nwoodruff-co deleted the feat/tiny-datasets branch March 12, 2026 14:22

nikhilwoodruff mentioned this pull request Mar 13, 2026

Upload tiny datasets in CI pipeline #290

Merged
