Include PUF aggregate records for top-tail income representation#608

Open
MaxGhenis wants to merge 7 commits into main from top-tail-income-representation

Conversation


MaxGhenis commented Mar 15, 2026

Summary

Include PUF aggregate records (MARS=0), which were previously dropped, and inject high-income PUF records directly into the ExtendedCPS. All demographics are imputed via QRF; no values are hardcoded.

Problem

The enhanced CPS has catastrophic calibration errors at the top of the income distribution:

  • $5M-$10M AGI bracket: -98.5% error
  • $10M+ AGI bracket: -95.1% error

Two root causes:

  1. The CPS doesn't sample ultra-high-income filers
  2. The PUF's 4 aggregate records were being dropped by puf = puf[puf.MARS != 0], discarding $140B+ in weighted AGI ($152.9B in the $10M+ bracket alone)

What are PUF aggregate records?

The IRS bundles ultra-high-income filers into "aggregate records" (MARS=0) for anonymity protection. These have complete income/deduction data (wages, capital gains, dividends, partnership income, etc.) but no demographics (no filing status, age, or gender). There are 4 such records representing ~1,233 weighted filers with AGIs ranging from -$96M to +$292M.
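To make the dropped value concrete, here is a toy pandas sketch. The column names follow PUF conventions (MARS filing status, E00100 AGI, S006 weight), but the records and values are illustrative only, not the actual four aggregate records:

```python
import pandas as pd

# Toy PUF slice; MARS == 0 marks the IRS aggregate records.
# E00100 (AGI) and S006 (weight) follow PUF naming, values are made up.
puf = pd.DataFrame({
    "RECID":  [1, 2, 3, 4],
    "MARS":   [2, 0, 1, 0],
    "E00100": [85_000, 292_000_000, 40_000, -96_000_000],
    "S006":   [1500.0, 310.0, 2200.0, 290.0],
})

aggregate = puf[puf.MARS == 0]
# The old filter puf[puf.MARS != 0] silently discarded this weighted AGI:
dropped_agi = (aggregate.E00100 * aggregate.S006).sum()
```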

How demographics are now imputed

Two-stage QRF pipeline with zero hardcoded values:

Stage 1: impute_aggregate_mars() in puf.py

  • Problem: MARS (filing status) is needed as a predictor for stage 2, but aggregate records have MARS=0
  • Solution: Train a QRF on regular PUF records using 9 income variables as predictors (E00200 wages, E00300 interest, E00600 dividends, E01000 capital gains, E00900 business income, E26270 partnership/S-corp, E02400 social security, E01500 pensions, XTOT exemptions) → impute MARS
  • Also sets DSI=0 and EIC=0 (correct by definition — ultra-high-income filers are never dependents and never receive EITC)
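A minimal runnable sketch of the stage-1 shape, with an L1 nearest-donor match standing in for the QRF (the real pipeline trains a quantile regression forest; the structure of train-on-regular-records, predict-for-aggregates, then zero DSI/EIC mirrors the description above):

```python
import numpy as np
import pandas as pd

# The nine income predictors named in the PR description.
PREDICTORS = ["E00200", "E00300", "E00600", "E01000", "E00900",
              "E26270", "E02400", "E01500", "XTOT"]

def impute_aggregate_mars(puf: pd.DataFrame) -> pd.DataFrame:
    """Fill MARS for aggregate (MARS=0) records from income variables.

    Sketch only: a nearest-donor match replaces the QRF used in puf.py.
    """
    mask = puf.MARS == 0
    if not mask.any():
        return puf
    donors = puf.loc[~mask]
    X = donors[PREDICTORS].to_numpy(float)
    puf = puf.copy()
    for idx in puf.index[mask]:
        x = puf.loc[idx, PREDICTORS].to_numpy(float)
        nearest = np.abs(X - x).sum(axis=1).argmin()   # L1-nearest donor
        puf.loc[idx, "MARS"] = donors.MARS.iloc[nearest]
    # Correct by definition at this income level: never dependents, never EITC.
    puf.loc[mask, ["DSI", "EIC"]] = 0
    return puf
```

Regular records pass through untouched; only MARS=0 rows are modified.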

Stage 2: Existing impute_missing_demographics() in puf.py

  • Already handles records not in the demographics file via QRF
  • Uses [E00200, MARS, DSI, EIC, XTOT] as predictors → imputes AGERANGE, GENDER, EARNSPLIT, AGEDP1-3
  • The aggregate records naturally flow through this path since their RECIDs aren't in the demographics file

After both stages, aggregate records have full demographics and proceed through the normal PUF pipeline (preprocess_puf → person/tax_unit record construction).

How high-income records enter the final dataset

_inject_high_income_puf_records() in extended_cps.py

  • After QRF imputation of CPS records, loads a full PUF Microsimulation
  • Identifies all PUF households with AGI > $1M
  • Builds entity-level masks (person, household, tax_unit, marital_unit) for those households
  • Appends their records to the ExtendedCPS with offset IDs (to avoid collisions) and original PUF weights
  • The reweighting optimizer then adjusts all weights (CPS + injected PUF) to match SOI calibration targets

This gives the reweighter actual high-income observations instead of trying to compensate with extreme weight adjustments on CPS records that don't have those income levels.
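The injection step above can be sketched at the household level only (the real _inject_high_income_puf_records() also builds masks for person, tax_unit, and marital_unit entities; the dict-of-arrays layout and function name inject_high_income_households here are assumptions for illustration):

```python
import numpy as np

def inject_high_income_households(cps, puf, agi_threshold=1_000_000):
    """Append PUF households above the AGI threshold to the CPS arrays.

    Household-level sketch of the injection idea: filter by AGI, offset
    IDs to avoid collisions, keep original PUF weights for reweighting.
    """
    keep = np.asarray(puf["household_agi"]) > agi_threshold
    offset = cps["household_id"].max() + 1          # avoid ID collisions
    out = {}
    for key in cps:
        values = np.asarray(puf[key])[keep]
        if key.endswith("_id"):                     # shift injected IDs
            values = values + offset
        out[key] = np.concatenate([cps[key], values])
    return out
```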

Files changed

  • puf.py: replace puf = puf[puf.MARS != 0] with impute_aggregate_mars(puf) (QRF-based two-stage demographic imputation)
  • extended_cps.py: add _inject_high_income_puf_records() to append PUF records with AGI > $1M after QRF imputation

Test plan

  • QRF MARS imputation tested with mock data — produces valid MARS values [1-4]
  • Regular records confirmed unchanged by imputation
  • Imports verified
  • Pre-existing test failures confirmed unrelated
  • Build ExtendedCPS and compare calibration_log.csv — $5M+ errors should improve dramatically
  • Full EnhancedCPS build with reweighting convergence check
  • Score a millionaire tax reform before/after

🤖 Generated with Claude Code

MaxGhenis closed this Mar 15, 2026
MaxGhenis reopened this Mar 15, 2026
The CPS has -95% to -99% calibration errors for $5M+ AGI brackets.
Two changes to fix this:

1. puf.py: Replace `puf = puf[puf.MARS != 0]` (which dropped $140B+
   in weighted AGI) with `impute_aggregate_mars()` — a QRF trained on
   income variables imputes MARS; downstream QRF handles remaining
   demographics (age, gender, etc.)

2. extended_cps.py: Add `_inject_high_income_puf_records()` to append
   PUF records with AGI > $1M directly into the ExtendedCPS after all
   processing, giving the reweighter actual high-income observations.

Fixes #606

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MaxGhenis force-pushed the top-tail-income-representation branch from 9774ba8 to 20aa0ec on March 15, 2026 at 02:00
MaxGhenis and others added 6 commits on March 14, 2026 at 20:18
- Narrow except Exception to (KeyError, ValueError, RuntimeError)
- Use endswith("_id") instead of "_id" in key to avoid false matches
- Remove unnecessary .copy() in impute_aggregate_mars
- Use numpy arrays instead of list() for np.isin() calls

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
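Two of these review fixes are easy to see in isolation. The key name "covid_identifier" below is a made-up example of a false substring match, not a real column:

```python
import numpy as np

# Suffix check vs substring check for ID columns.
key = "covid_identifier"          # contains "_id" but is not an ID column
substring_hit = "_id" in key      # True: false positive
suffix_hit = key.endswith("_id")  # False: correct

# np.isin with an ndarray test set, rather than a Python list.
wanted = np.array([1, 3])
mask = np.isin(np.arange(5), wanted)
```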
…safely

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix ValueError from f-string comma formatting inside %-style logger
- Handle dtype casting failures when PUF and CPS have incompatible
  types (e.g. county_fips: numeric in PUF vs string in CPS)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a PUF variable can't be cast to the CPS dtype, we were skipping
it entirely — leaving that variable shorter than all others. Now pad
with zeros/empty values to keep array lengths aligned across all
variables.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
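A sketch of the padding fix, under the assumption that variables are stored as one array each; cast_or_pad is a hypothetical helper name, not the function in the repo:

```python
import numpy as np

def cast_or_pad(values, target_dtype, length):
    """Cast a PUF array to the CPS dtype; if the cast fails (e.g. numeric
    county_fips in PUF vs string in CPS), pad with zeros or empty strings
    so every variable keeps the same array length."""
    try:
        return values.astype(target_dtype)
    except (ValueError, TypeError):
        fill = "" if np.issubdtype(np.dtype(target_dtype), np.str_) else 0
        return np.full(length, fill, dtype=target_dtype)
```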
Injecting high-income PUF records increases unweighted citizen % from
~90% to ~96% because tax filers are almost all citizens. Widen the
test's expected range from (80-95%) to (80-98%).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ate()

The per-variable puf_sim.calculate() loop was running the full
simulation engine for each of 100+ variables, causing the CI build
to hang for 7+ hours. Now:

- Only use Microsimulation once (to compute AGI for household filter)
- Free the simulation immediately after
- Read all variable values from raw PUF arrays (puf_data[variable])
- Pad with zeros for variables not in PUF (CPS-only)

This should reduce the injection step from hours to seconds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
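The performance fix above amounts to reading arrays instead of simulating per variable. A minimal sketch, assuming puf_data is a mapping of variable name to array (collect_injected_values is a hypothetical name for illustration):

```python
import numpy as np

def collect_injected_values(puf_data, variables, n_records):
    """Gather values for the injected households from raw PUF arrays,
    avoiding a full simulation run per variable; variables that exist
    only in the CPS are zero-padded to keep lengths aligned."""
    out = {}
    for var in variables:
        if var in puf_data:
            out[var] = np.asarray(puf_data[var])   # raw read, no sim engine
        else:
            out[var] = np.zeros(n_records)          # CPS-only variable
    return out
```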