Include PUF aggregate records for top-tail income representation#608

Open
MaxGhenis wants to merge 7 commits into main from top-tail-income-representation

Conversation


MaxGhenis commented Mar 15, 2026

Summary

Include PUF aggregate records (MARS=0), which were previously dropped, and inject high-income PUF records directly into the ExtendedCPS. All demographics are imputed via QRF; no values are hardcoded.

Problem

The enhanced CPS has catastrophic calibration errors at the top of the income distribution:

  • $5M-$10M AGI bracket: -98.5% error
  • $10M+ AGI bracket: -95.1% error

Two root causes:

  1. The CPS doesn't sample ultra-high-income filers
  2. The PUF's 4 aggregate records were being dropped by puf = puf[puf.MARS != 0], discarding $140B+ in weighted AGI ($152.9B in the $10M+ bracket alone)

What are PUF aggregate records?

The IRS bundles ultra-high-income filers into "aggregate records" (MARS=0) for anonymity protection. These have complete income/deduction data (wages, capital gains, dividends, partnership income, etc.) but no demographics (no filing status, age, or gender). There are 4 such records representing ~1,233 weighted filers with AGIs ranging from -$96M to +$292M.
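To make the dropped value concrete, here is a toy pandas sketch. The column names follow PUF conventions (MARS filing status, E00100 AGI, S006 weight), but the records and values are illustrative only, not the actual four aggregate records:

```python
import pandas as pd

# Toy PUF slice; MARS == 0 marks the IRS aggregate records.
# E00100 (AGI) and S006 (weight) follow PUF naming, values are made up.
puf = pd.DataFrame({
    "RECID":  [1, 2, 3, 4],
    "MARS":   [2, 0, 1, 0],
    "E00100": [85_000, 292_000_000, 40_000, -96_000_000],
    "S006":   [1500.0, 310.0, 2200.0, 290.0],
})

aggregate = puf[puf.MARS == 0]
# The old filter puf[puf.MARS != 0] silently discarded this weighted AGI:
dropped_agi = (aggregate.E00100 * aggregate.S006).sum()
```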

How demographics are now imputed

Two-stage QRF pipeline with zero hardcoded values:

Stage 1: impute_aggregate_mars() in puf.py

  • Problem: MARS (filing status) is needed as a predictor for stage 2, but aggregate records have MARS=0
  • Solution: Train a QRF on regular PUF records using 9 income variables as predictors (E00200 wages, E00300 interest, E00600 dividends, E01000 capital gains, E00900 business income, E26270 partnership/S-corp, E02400 social security, E01500 pensions, XTOT exemptions) → impute MARS
  • Also sets DSI=0 and EIC=0 (correct by definition — ultra-high-income filers are never dependents and never receive EITC)
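A minimal runnable sketch of the stage-1 shape, with an L1 nearest-donor match standing in for the QRF (the real pipeline trains a quantile regression forest; the structure of train-on-regular-records, predict-for-aggregates, then zero DSI/EIC mirrors the description above):

```python
import numpy as np
import pandas as pd

# The nine income predictors named in the PR description.
PREDICTORS = ["E00200", "E00300", "E00600", "E01000", "E00900",
              "E26270", "E02400", "E01500", "XTOT"]

def impute_aggregate_mars(puf: pd.DataFrame) -> pd.DataFrame:
    """Fill MARS for aggregate (MARS=0) records from income variables.

    Sketch only: a nearest-donor match replaces the QRF used in puf.py.
    """
    mask = puf.MARS == 0
    if not mask.any():
        return puf
    donors = puf.loc[~mask]
    X = donors[PREDICTORS].to_numpy(float)
    puf = puf.copy()
    for idx in puf.index[mask]:
        x = puf.loc[idx, PREDICTORS].to_numpy(float)
        nearest = np.abs(X - x).sum(axis=1).argmin()   # L1-nearest donor
        puf.loc[idx, "MARS"] = donors.MARS.iloc[nearest]
    # Correct by definition at this income level: never dependents, never EITC.
    puf.loc[mask, ["DSI", "EIC"]] = 0
    return puf
```

Regular records pass through untouched; only MARS=0 rows are modified.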

Stage 2: Existing impute_missing_demographics() in puf.py

  • Already handles records not in the demographics file via QRF
  • Uses [E00200, MARS, DSI, EIC, XTOT] as predictors → imputes AGERANGE, GENDER, EARNSPLIT, AGEDP1-3
  • The aggregate records naturally flow through this path since their RECIDs aren't in the demographics file

After both stages, aggregate records have full demographics and proceed through the normal PUF pipeline (preprocess_puf → person/tax_unit record construction).

How high-income records enter the final dataset

_inject_high_income_puf_records() in extended_cps.py

  • After QRF imputation of CPS records, loads a full PUF Microsimulation
  • Identifies all PUF households with AGI > $1M
  • Builds entity-level masks (person, household, tax_unit, marital_unit) for those households
  • Appends their records to the ExtendedCPS with offset IDs (to avoid collisions) and original PUF weights
  • The reweighting optimizer then adjusts all weights (CPS + injected PUF) to match SOI calibration targets

This gives the reweighter actual high-income observations instead of trying to compensate with extreme weight adjustments on CPS records that don't have those income levels.
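The injection step above can be sketched at the household level only (the real _inject_high_income_puf_records() also builds masks for person, tax_unit, and marital_unit entities; the dict-of-arrays layout and function name inject_high_income_households here are assumptions for illustration):

```python
import numpy as np

def inject_high_income_households(cps, puf, agi_threshold=1_000_000):
    """Append PUF households above the AGI threshold to the CPS arrays.

    Household-level sketch of the injection idea: filter by AGI, offset
    IDs to avoid collisions, keep original PUF weights for reweighting.
    """
    keep = np.asarray(puf["household_agi"]) > agi_threshold
    offset = cps["household_id"].max() + 1          # avoid ID collisions
    out = {}
    for key in cps:
        values = np.asarray(puf[key])[keep]
        if key.endswith("_id"):                     # shift injected IDs
            values = values + offset
        out[key] = np.concatenate([cps[key], values])
    return out
```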

Files changed

  • puf.py: replace puf = puf[puf.MARS != 0] with impute_aggregate_mars(puf) (QRF-based two-stage demographic imputation)
  • extended_cps.py: add _inject_high_income_puf_records() to append PUF records with AGI > $1M after QRF imputation

Test plan

  • QRF MARS imputation tested with mock data — produces valid MARS values [1-4]
  • Regular records confirmed unchanged by imputation
  • Imports verified
  • Pre-existing test failures confirmed unrelated
  • Build ExtendedCPS and compare calibration_log.csv — $5M+ errors should improve dramatically
  • Full EnhancedCPS build with reweighting convergence check
  • Score a millionaire tax reform before/after

🤖 Generated with Claude Code

MaxGhenis closed this Mar 15, 2026
MaxGhenis reopened this Mar 15, 2026
The CPS has -95% to -99% calibration errors for $5M+ AGI brackets.
Two changes to fix this:

1. puf.py: Replace `puf = puf[puf.MARS != 0]` (which dropped $140B+
   in weighted AGI) with `impute_aggregate_mars()` — a QRF trained on
   income variables imputes MARS; downstream QRF handles remaining
   demographics (age, gender, etc.)

2. extended_cps.py: Add `_inject_high_income_puf_records()` to append
   PUF records with AGI > $1M directly into the ExtendedCPS after all
   processing, giving the reweighter actual high-income observations.

Fixes #606

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MaxGhenis force-pushed the top-tail-income-representation branch from 9774ba8 to 20aa0ec on March 15, 2026 at 02:00
MaxGhenis and others added 6 commits on March 14, 2026 at 20:18
- Narrow except Exception to (KeyError, ValueError, RuntimeError)
- Use endswith("_id") instead of "_id" in key to avoid false matches
- Remove unnecessary .copy() in impute_aggregate_mars
- Use numpy arrays instead of list() for np.isin() calls

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
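Two of these review fixes are easy to see in isolation. The key name "covid_identifier" below is a made-up example of a false substring match, not a real column:

```python
import numpy as np

# Suffix check vs substring check for ID columns.
key = "covid_identifier"          # contains "_id" but is not an ID column
substring_hit = "_id" in key      # True: false positive
suffix_hit = key.endswith("_id")  # False: correct

# np.isin with an ndarray test set, rather than a Python list.
wanted = np.array([1, 3])
mask = np.isin(np.arange(5), wanted)
```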
…safely

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix ValueError from f-string comma formatting inside %-style logger
- Handle dtype casting failures when PUF and CPS have incompatible
  types (e.g. county_fips: numeric in PUF vs string in CPS)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a PUF variable can't be cast to the CPS dtype, we were skipping
it entirely — leaving that variable shorter than all others. Now pad
with zeros/empty values to keep array lengths aligned across all
variables.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
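A sketch of the padding fix, under the assumption that variables are stored as one array each; cast_or_pad is a hypothetical helper name, not the function in the repo:

```python
import numpy as np

def cast_or_pad(values, target_dtype, length):
    """Cast a PUF array to the CPS dtype; if the cast fails (e.g. numeric
    county_fips in PUF vs string in CPS), pad with zeros or empty strings
    so every variable keeps the same array length."""
    try:
        return values.astype(target_dtype)
    except (ValueError, TypeError):
        fill = "" if np.issubdtype(np.dtype(target_dtype), np.str_) else 0
        return np.full(length, fill, dtype=target_dtype)
```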
Injecting high-income PUF records increases unweighted citizen % from
~90% to ~96% because tax filers are almost all citizens. Widen the
test's expected range from (80-95%) to (80-98%).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ate()

The per-variable puf_sim.calculate() loop was running the full
simulation engine for each of 100+ variables, causing the CI build
to hang for 7+ hours. Now:

- Only use Microsimulation once (to compute AGI for household filter)
- Free the simulation immediately after
- Read all variable values from raw PUF arrays (puf_data[variable])
- Pad with zeros for variables not in PUF (CPS-only)

This should reduce the injection step from hours to seconds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
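The performance fix above amounts to reading arrays instead of simulating per variable. A minimal sketch, assuming puf_data is a mapping of variable name to array (collect_injected_values is a hypothetical name for illustration):

```python
import numpy as np

def collect_injected_values(puf_data, variables, n_records):
    """Gather values for the injected households from raw PUF arrays,
    avoiding a full simulation run per variable; variables that exist
    only in the CPS are zero-padded to keep lengths aligned."""
    out = {}
    for var in variables:
        if var in puf_data:
            out[var] = np.asarray(puf_data[var])   # raw read, no sim engine
        else:
            out[var] = np.zeros(n_records)          # CPS-only variable
    return out
```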