## Summary
The matrix builder's precomputation strategy uses a cartesian product approach: for each geography level, it creates a simulation with all ~12K base households set to that geography and computes all relevant variables. This works for 51 states but is already expensive for 3,143 counties (~45 minutes for a single variable). It will not scale to finer geography levels like zip codes (~33,000), state legislative districts (~7,400), or arbitrary compositions of census blocks.
## How precomputation works today
- **Geography assignment:** Each of the ~12K base households is cloned ~430 times (~5.2M total). Each clone is assigned to a census block, which implies a state, county, and congressional district.
- **State precomputation (cartesian product):** For each of the 51 states, a sim is created with all ~12K households. `state_fips` is overridden to that state for ALL households, and all target variables are computed. This produces a lookup table: "if household H lived in state S, variable V would be X." Cost: 51 sim runs.
- **County precomputation (cartesian product):** Same approach for ~3,143 counties, but only for variables in the hardcoded `COUNTY_DEPENDENT_VARS` set (currently just `aca_ptc`). For each county, `county` is set for all households and county-dependent variables are recomputed. Cost: 3,143 sim runs per variable.
- **Assembly:** For each clone, the census block assignment determines which precomputed state/county values to slot into the sparse calibration matrix.
The sim always operates on all ~12K households — it overrides a geography input for everyone and recomputes. There is no subsetting. This means the cost is proportional to the number of distinct geographies in the cartesian product, regardless of how many clones actually land in each geography.
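The cartesian-product loop described above can be sketched as follows. This is a minimal illustration, not the actual builder code: `run_sim` is a stand-in for a real simulation run, and all names are hypothetical.

```python
# Minimal sketch of the cartesian-product state precomputation.
# All names (run_sim, precompute_states) are hypothetical, not the real API.
import numpy as np

N_HOUSEHOLDS = 12_000  # ~12K base households in the sim


def run_sim(state_fips, variables):
    """Stand-in for a full simulation run with state_fips overridden
    to the same value for ALL households."""
    rng = np.random.default_rng(state_fips)
    return {v: rng.random(N_HOUSEHOLDS) for v in variables}


def precompute_states(state_fips_codes, variables):
    # One full sim per state: "if household H lived in state S,
    # variable V would be X."
    lookup = {}
    for state in state_fips_codes:
        results = run_sim(state, variables)
        for v in variables:
            lookup[(state, v)] = results[v]
    return lookup


lookup = precompute_states(range(1, 52), ["snap"])
print(len(lookup), lookup[(1, "snap")].shape)  # 51 (12000,)
```

Note that the cost is 51 full sim runs regardless of how many clones actually land in each state; the county loop is identical in shape but iterates ~3,143 times.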
## Why it doesn't scale
Calibration targets will eventually exist at the level of arbitrary compositions of census blocks — zip codes, state legislative districts, school districts, etc. The cartesian product approach would require:
- 51 states × 3,143 counties × ~33,000 zip codes × ~7,400 state legislative districts × ...
Each additional level multiplies the number of sim runs. Even a single extra level at the zip code scale would require ~33,000 sim runs per variable, roughly 10x the county precomputation time (~45 minutes for one variable).
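The back-of-envelope arithmetic behind that 10x figure, assuming cost is linear in the number of sim runs:

```python
# Cost scales with sim runs per geography level (linear-cost assumption).
county_runs = 3_143
county_minutes = 45.0                     # observed: ~45 min for one variable
minutes_per_run = county_minutes / county_runs

zip_runs = 33_000
zip_minutes = zip_runs * minutes_per_run  # ~472 min per variable
print(f"~{zip_minutes / 60:.1f} hours per variable at the zip level "
      f"({zip_runs / county_runs:.1f}x the county cost)")
```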
## Proposed alternative: assignment-driven precomputation
Instead of precomputing all possible geographies, start from the actual clone assignments:
- Each of the 5.2M clone-households is assigned to a census block. That block implies a specific state, county, zip, legislative district, etc.
- Group clones by their unique geography tuple (state, county, zip, district, ...).
- Run one sim per unique combination that actually appears in the assignments.
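The grouping step above amounts to a dictionary keyed by geography tuple. A minimal sketch, with made-up tuples and no real sim call:

```python
# Sketch of assignment-driven grouping (example data, hypothetical names).
from collections import defaultdict

# Each clone's census block implies a full geography tuple:
# (state, county, zip, state legislative district).
clone_geographies = [
    ("06", "06037", "90001", "CA-SD-30"),
    ("06", "06037", "90001", "CA-SD-30"),
    ("36", "36061", "10001", "NY-SD-27"),
]

# Group clone indices by unique geography tuple.
groups = defaultdict(list)
for i, geo in enumerate(clone_geographies):
    groups[geo].append(i)

# One sim per unique combination that actually appears in the
# assignments -- not one per cell of the full cartesian product.
for geo, clone_ids in groups.items():
    print(geo, clone_ids)
```

Here 3 clones collapse to 2 unique combinations, so 2 sim runs suffice; at full scale the run count is bounded by the number of distinct census blocks sampled into.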
The number of unique combinations is bounded by the number of distinct census blocks sampled into, which is much smaller than the full cartesian product. This approach:
- Scales naturally with any number of geography levels
- Only computes values for geographies that actually appear in the data
- Eliminates the need for `COUNTY_DEPENDENT_VARS` or any hardcoded variable-to-geography mapping
- Trades the decoupling between precomputation and geography assignment for computational feasibility
## Current workaround
The hardcoded `COUNTY_DEPENDENT_VARS` set gates which variables go through per-county iteration. Currently only `aca_ptc` is listed, and it's disabled in the target config YAML — meaning the county precomputation runs (~45 min) and produces values that are never used in the final matrix.
## Related
- Stacked builder assigns wrong block to multi-clone households sharing the same block #597 — Stacked builder block collision issue (separate but related to geography assignment)