Skip to content

QRF-impute CPS-only variables for PUF clone half#589

Open
MaxGhenis wants to merge 11 commits intomainfrom
impute-cps-only-vars-puf-clone
Open

QRF-impute CPS-only variables for PUF clone half#589
MaxGhenis wants to merge 11 commits intomainfrom
impute-cps-only-vars-puf-clone

Conversation

@MaxGhenis
Copy link
Contributor

Summary

Fixes #561

The extended CPS doubles the dataset — CPS records on the left, PUF clones on the right. Income variables are already QRF-imputed from PUF data, but ~60 other CPS-only variables (retirement distributions, transfers, hours, medical expenses, Social Security sub-components, SPM variables) were naively duplicated from the CPS donor regardless of the PUF clone's imputed income.

This PR adds a second-stage QRF imputation so the PUF clone half gets plausible values for these CPS-only variables that are consistent with its imputed income profile:

  • New CPS_ONLY_IMPUTED_VARIABLES list (~60 variables): retirement distributions/contributions, Social Security sub-components, transfer income, SPM variables, medical expenses, hours/employment, and previous-year income
  • New impute_cps_only_variables() function: trains a QRF on CPS person-level data where predictors = demographics (age, sex, filing status, dependents) + key income variables (employment, self-employment, Social Security), and outputs = the CPS-only variables. For PUF clone prediction, the income predictor columns are swapped with PUF-imputed values from the first stage.
  • Updated concatenation loop: CPS-only imputed variables now use original CPS values for the CPS half and second-stage QRF predictions for the PUF clone half (same pattern as existing IMPUTED_VARIABLES)

Uses the same batching, memory management, and entity-mapping patterns as the existing QRF code.

Test plan

  • Verify the extended CPS generates without errors (make data)
  • Spot-check that PUF clone half values for CPS-only variables differ from CPS donor values and correlate with imputed income
  • Confirm CPS half values are unchanged

MaxGhenis and others added 3 commits March 10, 2026 05:35
Instead of naively duplicating CPS donor values for ~60 income-
correlated variables (retirement distributions, transfers, hours,
medical expenses, SS sub-components, SPM variables), train a second-
stage QRF on CPS data using demographics + PUF-imputed income as
predictors and predict plausible values for PUF clones.

Fixes #561

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Verify no variable appears in multiple imputation lists
- Verify stage-2 income predictors come from stage-1 imputation
- Verify sequential QRF preserves y1-y2 correlation (synthetic data)
- Verify single sequential call beats separate calls (the old batched
  approach) on inter-variable correlation

All tests use synthetic data — no real PUF/CPS needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Extract _fit_and_predict_qrf() to eliminate duplication between
  impute_income_variables and impute_cps_only_variables
- Extract _to_entity() to deduplicate entity mapping in concat loop
- Replace CPS_STAGE2_DEMOGRAPHIC_PREDICTORS with shared
  DEMOGRAPHIC_PREDICTORS + STAGE1_EXTRA_PREDICTORS
- Convert variable lists to sets for O(1) lookup in concat loop
- Extract _QRF_SAMPLE_SIZE and _QRF_RANDOM_STATE constants
- Pre-compute training/test DataFrames in generate() to avoid
  redundant calculate_dataframe() calls
- Remove unused MagicMock import from tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@MaxGhenis MaxGhenis force-pushed the impute-cps-only-vars-puf-clone branch from b890507 to 71e9aec Compare March 10, 2026 09:37
MaxGhenis and others added 4 commits March 10, 2026 05:39
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Some variables in CPS_ONLY_IMPUTED_VARIABLES may not exist in the
current policyengine-us package. Filter them out before calling
calculate_dataframe to avoid ValueError.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use CPS Microsimulation to calculate demographic predictors (is_male,
  tax_unit_is_joint, etc.) since they aren't stored in the h5 file
- Fix entity half-length calculation in splice function: spm_unit
  variables need spm_unit_id count, not person_id count

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Collaborator

@baogorek baogorek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MaxGhenis , The more I review this, the more I realize that imputation is just as complex as calibration (if not more) and probably should have its own dashboards, etc. But for now, my review is in two inline comments plus other comments from Claude below.

Other CC comments:

 SS sub-components (line 39): Same overwrite issue as the retirement contributions. reconcile_ss_subcomponents() in  
  puf_clone_dataset() ensures these 4 shares sum to 1 and match the imputed social_security total. Stage 2's
  unconstrained QRF replaces them with values that have no summing guarantee.                                         
                                                            
  Pension leaf variables (line 31): taxable_private_pension_income and tax_exempt_private_pension_income get silently 
  discarded. _rename_imputed_to_inputs() runs after Stage 2 and overwrites them with the PUF-imputed values. Wasted
  QRF effort.

  Formula variables (line 75): weekly_hours_worked and employment_income_last_year have formulas/adds in
  policyengine-us, so _drop_formula_variables() removes them after Stage 2. Predictions are wasted unless they're
  added to _KEEP_FORMULA_VARS.

  child_support_expense (line 73): Listed under # Medical expenses but it's not a medical expense.

  Test gap (test file line 33): Should also check overlap with CPS_RETIREMENT_VARIABLES and SS_SUBCOMPONENTS from
  puf_impute.py — those have specialized imputation that Stage 2 would overwrite.

"taxable_private_pension_income",
"tax_exempt_private_pension_income",
# Retirement contributions
"traditional_401k_contributions",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first I was thinking that "cloning" was referring to the stage 2 stuff, but there are "cloning" operations here as well. CC is saying that:

  • traditional_401k_contributions
  • roth_401k_contributions
  • roth_ira_contributions
  • self_employed_pension_contributions

"are already imputed by _impute_retirement_contributions() in puf_clone_dataset() with domain-specific constraints — IRS contribution limits by year, age-based catch-up provisions, and income requirements (e.g., 401k zeroed when employment_income = 0).Stage 2's unconstrained QRF overwrites these with values that violate legal limits. "

● 1. puf_clone_dataset() imputes retirement contributions for the PUF clone half with explicit IRS constraints (e.g., 
  401k capped at $23,000 + catch-up, zeroed when no employment income)
  2. _impute_cps_only_variables() trains a generic QRF that also predicts those same 4 retirement variables — with no 
  constraints                                               
  3. _splice_cps_only_predictions() overwrites the constrained values from step 1 with the unconstrained values from  
  step 2                                                    

Okay, and later it says's that these variables are not even in policyengine-us!

  1. 3 variables don't exist in policyengine-us

roth_ira_distributions, regular_ira_distributions, other_type_retirement_account_distributions are not in the TBS.
They'll always be zero-filled. The code handles this gracefully via the valid_outputs filter, but consider removing
them or adding a comment.

I know they've been "impossible targets" in calibration.

"social_security_retirement",
"social_security_disability",
"social_security_dependents",
"social_security_survivors",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CC:
All 4 SS sub-components are already handled by reconcile_ss_subcomponents() (puf_impute.py:376), which predicts
shares via QRF, normalizes them to sum to 1, and scales to match the PUF-imputed total. Stage 2's generic QRF would
overwrite with unconstrained values that likely don't sum to social_security.

Overlapping variables:

  • social_security_retirement
  • social_security_disability
  • social_security_dependents
  • social_security_survivors

MaxGhenis added a commit to MaxGhenis/policyengine-us-data that referenced this pull request Mar 11, 2026
Move all CPS-only variable imputation (retirement contributions, SS
sub-components) exclusively to Stage 2 QRF in extended_cps.py, with
post-processing constraints:

- Clean CPS_ONLY_IMPUTED_VARIABLES: remove 3 non-existent vars, 2 vars
  handled by Stage 1 rename, add traditional_ira_contributions
- Add apply_retirement_constraints() with explicit variable-to-limit
  mapping (IRS 401k/IRA caps, SE pension rate cap, income-based zeroing)
- Add reconcile_ss_subcomponents() to normalize sub-component shares
  to match imputed social_security total
- Extract get_se_pension_limits() into retirement_limits.py (cached)
- Add SE pension params to imputation_parameters.yaml
- Add 12 new tests covering variable lists, retirement caps, SS
  normalization

Addresses Ben Ogorek's review on PR PolicyEngine#589.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
MaxGhenis and others added 2 commits March 11, 2026 11:19
- Clean CPS_ONLY_IMPUTED_VARIABLES: remove 3 non-existent vars, 2 vars
  handled by Stage 1 rename, add traditional_ira_contributions
- Add apply_retirement_constraints() with explicit variable-to-limit
  mapping (IRS 401k/IRA caps, SE pension rate cap, income-based zeroing)
- Add reconcile_ss_subcomponents() to normalize sub-component shares
  to match imputed social_security total
- Extract get_se_pension_limits() into retirement_limits.py (cached)
- Add 12 new tests covering variable lists, retirement caps, SS
  normalization

Addresses Ben Ogorek's review on PR #589.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@baogorek baogorek self-requested a review March 11, 2026 20:21
Copy link
Collaborator

@baogorek baogorek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MaxGhenis I 'm going to give you a preemptive check mark since I know you're working through the feedback and I've got to leave the house for a few hours.

Also, I'm eager to see the full publication process happen again.

@baogorek
Copy link
Collaborator

Merged main into this branch to catch up after PR #538 (calibration pipeline improvements). No conflicts — merged cleanly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Train models for CPS-only variables on PUF clone half

2 participants