Include PUF aggregate records for top-tail income representation#608
Open
The CPS has -95% to -99% calibration errors for $5M+ AGI brackets. Two changes to fix this:

1. puf.py: Replace `puf = puf[puf.MARS != 0]` (which dropped $140B+ in weighted AGI) with `impute_aggregate_mars()` — a QRF trained on income variables imputes MARS; the downstream QRF handles remaining demographics (age, gender, etc.)
2. extended_cps.py: Add `_inject_high_income_puf_records()` to append PUF records with AGI > $1M directly into the ExtendedCPS after all processing, giving the reweighter actual high-income observations.

Fixes #606

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from 9774ba8 to 20aa0ec
- Narrow except Exception to (KeyError, ValueError, RuntimeError)
- Use endswith("_id") instead of "_id" in key to avoid false matches
- Remove unnecessary .copy() in impute_aggregate_mars
- Use numpy arrays instead of list() for np.isin() calls
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
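The key-matching and `np.isin` changes above can be sketched as follows (a minimal illustration with hypothetical column names, not the actual repository code):

```python
import numpy as np

def is_id_column(key: str) -> bool:
    # endswith("_id") only matches true ID columns; a substring test
    # ('"_id" in key') would also wrongly accept names like "tax_id_type"
    return key.endswith("_id")

# np.isin accepts any array-like, but passing numpy arrays directly
# avoids an unnecessary list() conversion on large index sets
household_ids = np.array([1, 2, 3, 5, 8])
keep_ids = np.array([2, 5])
mask = np.isin(household_ids, keep_ids)  # → [False, True, False, True, False]
```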
…safely

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix ValueError from f-string comma formatting inside %-style logger
- Handle dtype casting failures when PUF and CPS have incompatible types (e.g. county_fips: numeric in PUF vs string in CPS)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a PUF variable can't be cast to the CPS dtype, we were skipping it entirely — leaving that variable shorter than all others. Now pad with zeros/empty values to keep array lengths aligned across all variables. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
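The cast-or-pad behavior described above can be sketched with a hypothetical helper (the name and signature are illustrative, not the PR's actual code):

```python
import numpy as np
import pandas as pd

def cast_or_pad(values: pd.Series, target_dtype, n: int) -> np.ndarray:
    """Try to cast PUF values to the CPS dtype; if the cast fails
    (e.g. numeric county_fips vs string), fall back to a zero/empty
    fill of length n so every variable keeps the same array length."""
    try:
        return values.to_numpy().astype(target_dtype)
    except (ValueError, TypeError):
        # np.zeros yields 0 for numeric dtypes and "" for string dtypes,
        # keeping this variable aligned with all the others
        return np.zeros(n, dtype=target_dtype)
```

Padding instead of skipping matters because downstream code assumes every variable's array has the same number of records.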
Injecting high-income PUF records increases unweighted citizen % from ~90% to ~96% because tax filers are almost all citizens. Widen the test's expected range from (80-95%) to (80-98%). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ate()

The per-variable `puf_sim.calculate()` loop was running the full simulation engine for each of 100+ variables, causing the CI build to hang for 7+ hours. Now:

- Only use Microsimulation once (to compute AGI for the household filter)
- Free the simulation immediately after
- Read all variable values from raw PUF arrays (`puf_data[variable]`)
- Pad with zeros for variables not in PUF (CPS-only)

This should reduce the injection step from hours to seconds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
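The raw-array read path can be sketched as below (hypothetical function and argument names; `puf_data` stands in for whatever mapping of variable name to array the pipeline holds):

```python
import numpy as np

def gather_values(puf_data: dict, variables: list, n_records: int) -> dict:
    """Read each variable straight from the raw PUF arrays rather than
    running the simulation engine once per variable; variables absent
    from the PUF (CPS-only) are padded with zeros so all arrays stay
    the same length."""
    out = {}
    for var in variables:
        if var in puf_data:
            out[var] = np.asarray(puf_data[var])
        else:
            out[var] = np.zeros(n_records)
    return out
```

A dictionary lookup per variable is O(1), which is why this drops the injection step from hours to seconds compared with re-running the engine 100+ times.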
Summary
Include PUF aggregate records (MARS=0) — previously dropped — and inject high-income PUF records directly into the ExtendedCPS. All demographics are imputed via QRF, no hardcoded values.
Problem
The enhanced CPS has catastrophic calibration errors at the top of the income distribution:
Two root causes:
1. puf.py dropped aggregate records via `puf = puf[puf.MARS != 0]`, discarding $140B+ in weighted AGI ($152.9B in the $10M+ bracket alone)
2. The ExtendedCPS had no actual high-income observations, so the reweighter could only compensate with extreme weight adjustments on CPS records that lack those income levels

What are PUF aggregate records?
The IRS bundles ultra-high-income filers into "aggregate records" (MARS=0) for anonymity protection. These have complete income/deduction data (wages, capital gains, dividends, partnership income, etc.) but no demographics (no filing status, age, or gender). There are 4 such records representing ~1,233 weighted filers with AGIs ranging from -$96M to +$292M.
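A minimal sketch of what these records look like and how they are identified (toy data; the column names follow SOI PUF conventions but are illustrative here):

```python
import pandas as pd

# Toy PUF: MARS == 0 marks IRS aggregate records, which carry full
# income data but no demographics. E00100 is the PUF's AGI field.
puf = pd.DataFrame({
    "MARS": [0, 2, 1, 0],
    "E00100": [292e6, 85e3, 40e3, -96e6],
})

# Previously these rows were dropped; now they are kept for imputation
aggregate = puf[puf.MARS == 0]
```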
How demographics are now imputed
Two-stage QRF pipeline with zero hardcoded values:
Stage 1: `impute_aggregate_mars()` in puf.py
Stage 2: Existing `impute_missing_demographics()` in puf.py

After both stages, aggregate records have full demographics and proceed through the normal PUF pipeline (preprocess_puf → person/tax_unit record construction).
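A minimal sketch of stage 1, showing the structural change from dropping to imputing (the function body below is a hypothetical stand-in; the real `impute_aggregate_mars()` predicts MARS with a QRF trained on income variables):

```python
import pandas as pd

def impute_aggregate_mars(puf: pd.DataFrame) -> pd.DataFrame:
    """Stand-in for the QRF imputer: assign a valid filing status to
    MARS == 0 rows instead of dropping them. The placeholder below
    writes a constant; the real version predicts from income data."""
    puf = puf.copy()
    puf.loc[puf.MARS == 0, "MARS"] = 1  # placeholder prediction
    return puf

puf = pd.DataFrame({"MARS": [0, 2, 1]})
# Old: puf = puf[puf.MARS != 0]   # dropped $140B+ in weighted AGI
puf = impute_aggregate_mars(puf)  # New: every record is retained
```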
How high-income records enter the final dataset
`_inject_high_income_puf_records()` in extended_cps.py

This gives the reweighter actual high-income observations instead of trying to compensate with extreme weight adjustments on CPS records that don't have those income levels.
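The injection step can be sketched as follows (hypothetical function and column names; the real `_inject_high_income_puf_records()` operates on the full ExtendedCPS variable arrays rather than a single DataFrame):

```python
import pandas as pd

def inject_high_income_records(cps: pd.DataFrame, puf: pd.DataFrame,
                               agi_threshold: float = 1_000_000) -> pd.DataFrame:
    """Append PUF records above the AGI threshold to the extended CPS,
    so the reweighter sees real high-income observations instead of
    stretching weights on CPS records that lack those income levels."""
    high_income = puf[puf["agi"] > agi_threshold]
    return pd.concat([cps, high_income], ignore_index=True)
```

Because the records are appended after all other processing, the reweighter can assign them weights directly rather than inflating weights on unrepresentative CPS rows.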
Files changed
puf.py
- Replace `puf = puf[puf.MARS != 0]` with `impute_aggregate_mars(puf)` — QRF-based two-stage demographic imputation

extended_cps.py
- Add `_inject_high_income_puf_records()` — append PUF records with AGI > $1M after QRF imputation

Test plan
🤖 Generated with Claude Code