Add tiny dataset generation to pipeline #289

Merged
nwoodruff-co merged 3 commits into main from feat/tiny-datasets on Mar 12, 2026

@nwoodruff-co
Collaborator

Summary

  • Produces frs_2023_24_tiny.h5 and enhanced_frs_2023_24_tiny.h5 (1,000 households each) at the end of the dataset creation pipeline
  • Improves subsample_dataset to sample with probability proportional to weight and to rescale the sampled weights so that population totals are preserved
  • Useful for faster testing and development without needing the full ~20k-household datasets
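
For illustration only, the weight-proportional sampling and rescaling described above could look roughly like the sketch below. It is not the actual subsample_dataset implementation: the function name, the plain-list inputs, and the choice to rescale by a single constant factor are all assumptions made for the example.

```python
import random


def subsample_with_rescaled_weights(weights, n, seed=0):
    """Hypothetical helper, not the real subsample_dataset.

    Draws n indices with probability proportional to weight (with
    replacement), then rescales the drawn weights by one constant factor
    so that they sum to the original weight total.
    """
    rng = random.Random(seed)
    total = sum(weights)
    # random.choices samples with replacement, proportional to `weights`.
    indices = rng.choices(range(len(weights)), weights=weights, k=n)
    sampled_total = sum(weights[i] for i in indices)
    scale = total / sampled_total
    return indices, [weights[i] * scale for i in indices]


# Five made-up household weights standing in for a full dataset.
full_weights = [1500.0, 2500.0, 800.0, 3200.0, 2000.0]
indices, tiny_weights = subsample_with_rescaled_weights(full_weights, n=3)
```

After the call, the sum of tiny_weights equals the sum of full_weights up to floating-point error, which is the property the summary refers to as preserving population totals.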

Test plan

  • Verify tiny datasets load correctly with UKSingleYearDataset
  • Check that household_weight sums match the full dataset
  • Run a basic Microsimulation on a tiny dataset to confirm it works end-to-end
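
The weight-sum check in the test plan could be written along these lines. This is a sketch under assumptions: the helper name and tolerance are hypothetical, and the weights are passed in as plain sequences rather than read from the real dataset objects.

```python
import math


def weight_totals_match(full_weights, tiny_weights, rel_tol=1e-6):
    """Hypothetical check: do the two weight columns sum to the same total?"""
    return math.isclose(sum(full_weights), sum(tiny_weights), rel_tol=rel_tol)


# Made-up values: the tiny weights are rescaled to the same total (10,000).
full = [1500.0, 2500.0, 800.0, 3200.0, 2000.0]
tiny_ok = [4000.0, 6000.0]
tiny_bad = [1500.0, 2500.0]

matches = weight_totals_match(full, tiny_ok)
mismatch = weight_totals_match(full, tiny_bad)
```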

Produces frs_2023_24_tiny.h5 and enhanced_frs_2023_24_tiny.h5 by
subsampling with probability proportional to weight and rescaling
weights to preserve population totals.

Handle zero/NaN weights by falling back to uniform sampling, fixing
the crash in impute_income, which passes a zero-weight dataset.

Pandas rejects weighted sampling without replacement when weights are
large. Use replace=True instead, since the sample is used for training data.
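
A minimal sketch of the fallback described in these commit messages, written in plain Python rather than pandas. The function names are hypothetical, and the exact rule used here (treat NaN as zero, and sample uniformly when no positive weight remains) is an assumption about what "falling back to uniform sampling" means, not the code from this PR.

```python
import math
import random


def usable_weights_or_none(weights):
    """Hypothetical helper: clean the weights, or return None for uniform.

    NaN weights are treated as zero. If nothing positive remains,
    None is returned to signal that uniform sampling should be used.
    """
    cleaned = [0.0 if math.isnan(w) else float(w) for w in weights]
    if sum(cleaned) <= 0:
        return None
    return cleaned


def sample_indices(weights, n, seed=0):
    """Draw n indices with replacement, falling back to uniform sampling."""
    rng = random.Random(seed)
    usable = usable_weights_or_none(weights)
    # random.choices always samples with replacement; weights=None is uniform.
    return rng.choices(range(len(weights)), weights=usable, k=n)


# A zero-weight input, like the one said to come from impute_income.
uniform_draws = sample_indices([0.0, 0.0, 0.0], n=5)
# Only index 1 has a usable positive weight here.
weighted_draws = sample_indices([0.0, 2.0, float("nan")], n=10)
```

Sampling with replacement means the same row can appear more than once in the result, which the commit message accepts because the sample is used as training data.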
@nwoodruff-co merged commit cf2d4d7 into main on Mar 12, 2026
3 checks passed
@nwoodruff-co deleted the feat/tiny-datasets branch on March 12, 2026 14:22