
Matrix builder precomputation doesn't scale beyond state/county geography levels #598

@baogorek

Description

Summary

The matrix builder's precomputation strategy uses a cartesian product approach: for each geography level, it creates a simulation with all ~12K base households set to that geography and computes all relevant variables. This works for 51 states but is already expensive for 3,143 counties (~45 minutes for a single variable). It will not scale to finer geography levels like zip codes (~33,000), state legislative districts (~7,400), or arbitrary compositions of census blocks.

How precomputation works today

  1. Geography assignment: Each of the ~12K base households is cloned ~430 times (5.2M total). Each clone is assigned to a census block, which implies a state, county, and congressional district.

  2. State precomputation (cartesian product): For each of the 51 states, a sim is created with all ~12K households. state_fips is overridden to that state for ALL households, and all target variables are computed. This produces a lookup table: "if household H lived in state S, variable V would be X." Cost: 51 sim runs.

  3. County precomputation (cartesian product): Same approach for ~3,143 counties, but only for variables in the hardcoded COUNTY_DEPENDENT_VARS set (currently just aca_ptc). For each county, county is set for all households and county-dependent variables are recomputed. Cost: 3,143 sim runs per variable.

  4. Assembly: For each clone, the census block assignment determines which precomputed state/county values to slot into the sparse calibration matrix.

The sim always operates on all ~12K households — it overrides a geography input for everyone and recomputes. There is no subsetting. This means the cost is proportional to the number of distinct geographies in the cartesian product, regardless of how many clones actually land in each geography.
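The looping structure of steps 2–3 can be sketched as follows. This is a toy stand-in, not the matrix builder's actual API: the `Sim` class and its benefit formula are invented for illustration, and only the cartesian-product shape of the loop mirrors the behavior described above.

```python
class Sim:
    """Toy simulation: computes a geography-dependent variable for all households."""

    def __init__(self, n_households):
        self.n = n_households
        self.inputs = {}

    def set_input(self, name, value):
        # Overrides the input for ALL households -- there is no subsetting.
        self.inputs[name] = value

    def calculate(self, var):
        # Stand-in formula: the value depends only on the geography override.
        g = self.inputs.get("state_fips", 0)
        return [g * 100 + h for h in range(self.n)]


def precompute_states(n_households, states, target_vars):
    """Cartesian-product precomputation: one full sim run per state."""
    lookup, runs = {}, 0
    for fips in states:                    # one sim run per geography
        sim = Sim(n_households)
        sim.set_input("state_fips", fips)  # every household "moves" to this state
        runs += 1
        for var in target_vars:
            lookup[(fips, var)] = sim.calculate(var)
    return lookup, runs


lookup, runs = precompute_states(
    n_households=4, states=[6, 36, 48], target_vars=["income_tax"]
)
print(runs)                       # 3 -- proportional to distinct geographies
print(lookup[(6, "income_tax")])  # [600, 601, 602, 603]
```

The cost driver is visible in `runs`: it scales with the number of distinct geographies, never with how many clones actually land in each one.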

Why it doesn't scale

Calibration targets will eventually exist at the level of arbitrary compositions of census blocks — zip codes, state legislative districts, school districts, etc. The cartesian product approach would require:

  • 51 states × 3,143 counties × ~33,000 zip codes × ~7,400 state legislative districts × ...

Each additional level contributes its full geography count in sim runs. Even a single new level at the zip-code scale would require ~33,000 sim runs per variable — roughly 10× the county precomputation (~45 minutes for one variable), or around 8 hours per variable.
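The back-of-envelope arithmetic, using only the figures already quoted above (3,143 county runs taking ~45 minutes per variable):

```python
# Per-run cost implied by the county precomputation figures.
county_runs = 3_143
county_minutes = 45
sec_per_run = county_minutes * 60 / county_runs   # ~0.86 s per sim run

# Extrapolate to a zip-code-level pass.
zip_runs = 33_000
zip_minutes = zip_runs * sec_per_run / 60
print(round(sec_per_run, 2))  # 0.86
print(round(zip_minutes))     # 472 -- i.e. roughly 8 hours per variable
```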

Proposed alternative: assignment-driven precomputation

Instead of precomputing all possible geographies, start from the actual clone assignments:

  1. Each of the 5.2M clone-households is assigned to a census block. That block implies a specific state, county, zip, legislative district, etc.
  2. Group clones by their unique geography tuple (state, county, zip, district, ...).
  3. Run one sim per unique combination that actually appears in the assignments.

The number of unique combinations is bounded by the number of distinct census blocks that actually receive clones, which is far smaller than the full cartesian product. This approach:

  • Scales naturally with any number of geography levels
  • Only computes values for geographies that actually appear in the data
  • Eliminates the need for COUNTY_DEPENDENT_VARS or any hardcoded variable-to-geography mapping
  • Trades the decoupling between precomputation and geography assignment for computational feasibility
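The grouping step in the proposal can be sketched as below. The block-to-geography mapping, the `compute` callback, and all data are hypothetical stand-ins; the point is that the number of runs equals the number of geography tuples that actually appear, not the size of the cartesian product.

```python
from collections import defaultdict


def precompute_from_assignments(clone_blocks, block_to_geo, compute):
    """Run one computation per geography tuple that actually appears in the
    clone assignments, instead of one per element of the cartesian product."""
    # 1. Group clones by their unique (state, county, zip, ...) tuple.
    groups = defaultdict(list)
    for clone_id, block in enumerate(clone_blocks):
        groups[block_to_geo[block]].append(clone_id)
    # 2. One sim run per unique combination that actually occurs.
    results = {geo: compute(geo) for geo in groups}
    return dict(groups), results


# Toy data: 6 clones landing in 3 census blocks covering 2 unique tuples.
block_to_geo = {
    "b1": ("06", "037", "90001"),  # (state, county, zip)
    "b2": ("06", "037", "90001"),  # different block, same geography tuple
    "b3": ("36", "061", "10001"),
}
clone_blocks = ["b1", "b2", "b2", "b3", "b1", "b3"]

groups, results = precompute_from_assignments(
    clone_blocks, block_to_geo, compute=lambda geo: f"values@{geo}"
)
print(len(results))  # 2 runs, not 2 states x 2 counties x 2 zips = 8
```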

Current workaround

The COUNTY_DEPENDENT_VARS hardcoded set gates which variables go through per-county iteration. Currently only aca_ptc is listed, and it's disabled in the target config YAML — meaning the county precomputation runs (~45 min) and produces values that are never used in the final matrix.
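A minimal illustration of the gating just described, with hypothetical variable names (only `COUNTY_DEPENDENT_VARS` and `aca_ptc` come from the source): the county loop runs for every variable in the set, but a variable disabled in the target config never reaches the final matrix.

```python
# Hardcoded gate: only these variables go through the 3,143-run county loop.
COUNTY_DEPENDENT_VARS = {"aca_ptc"}  # currently the only entry
enabled_targets = set()              # aca_ptc is disabled in the target YAML

county_vars_to_run = COUNTY_DEPENDENT_VARS       # ~45 min of precomputation...
used_in_matrix = county_vars_to_run & enabled_targets

print(sorted(county_vars_to_run))  # ['aca_ptc']
print(sorted(used_in_matrix))      # [] -- computed but never used
```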
