
Matrix builder precomputation doesn't scale beyond state/county geography levels #598

@baogorek

Description

Summary

The matrix builder's precomputation strategy uses a cartesian product approach: for each geography level, it creates a simulation with all ~12K base households set to that geography and computes all relevant variables. This works for 51 states but is already expensive for 3,143 counties (~45 minutes for a single variable). It will not scale to finer geography levels like zip codes (~33,000), state legislative districts (~7,400), or arbitrary compositions of census blocks.

How precomputation works today

  1. Geography assignment: Each of the ~12K base households is cloned ~430 times (5.2M total). Each clone is assigned to a census block, which implies a state, county, and congressional district.

  2. State precomputation (cartesian product): For each of the 51 states, a sim is created with all ~12K households. state_fips is overridden to that state for ALL households, and all target variables are computed. This produces a lookup table: "if household H lived in state S, variable V would be X." Cost: 51 sim runs.

  3. County precomputation (cartesian product): Same approach for ~3,143 counties, but only for variables in the hardcoded COUNTY_DEPENDENT_VARS set (currently just aca_ptc). For each county, county is set for all households and county-dependent variables are recomputed. Cost: 3,143 sim runs per variable.

  4. Assembly: For each clone, the census block assignment determines which precomputed state/county values to slot into the sparse calibration matrix.

The sim always operates on all ~12K households — it overrides a geography input for everyone and recomputes. There is no subsetting. This means the cost is proportional to the number of distinct geographies in the cartesian product, regardless of how many clones actually land in each geography.
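The looping structure of steps 2–3 can be sketched as follows. This is a toy stand-in, not the matrix builder's actual API: the `Sim` class and its benefit formula are invented for illustration, and only the cartesian-product shape of the loop mirrors the behavior described above.

```python
class Sim:
    """Toy simulation: computes a geography-dependent variable for all households."""

    def __init__(self, n_households):
        self.n = n_households
        self.inputs = {}

    def set_input(self, name, value):
        # Overrides the input for ALL households -- there is no subsetting.
        self.inputs[name] = value

    def calculate(self, var):
        # Stand-in formula: the value depends only on the geography override.
        g = self.inputs.get("state_fips", 0)
        return [g * 100 + h for h in range(self.n)]


def precompute_states(n_households, states, target_vars):
    """Cartesian-product precomputation: one full sim run per state."""
    lookup, runs = {}, 0
    for fips in states:                    # one sim run per geography
        sim = Sim(n_households)
        sim.set_input("state_fips", fips)  # every household "moves" to this state
        runs += 1
        for var in target_vars:
            lookup[(fips, var)] = sim.calculate(var)
    return lookup, runs


lookup, runs = precompute_states(
    n_households=4, states=[6, 36, 48], target_vars=["income_tax"]
)
print(runs)                       # 3 -- proportional to distinct geographies
print(lookup[(6, "income_tax")])  # [600, 601, 602, 603]
```

The cost driver is visible in `runs`: it scales with the number of distinct geographies, never with how many clones actually land in each one.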

Why it doesn't scale

Calibration targets will eventually exist at the level of arbitrary compositions of census blocks — zip codes, state legislative districts, school districts, etc. The cartesian product approach would require:

  • 51 states × 3,143 counties × ~33,000 zip codes × ~7,400 state legislative districts × ...

Each additional level contributes its full geography count in sim runs. Even a single new level at the zip-code scale would require ~33,000 sim runs per variable — roughly 10× the county precomputation (~45 minutes for one variable), or around 8 hours per variable.
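The back-of-envelope arithmetic, using only the figures already quoted above (3,143 county runs taking ~45 minutes per variable):

```python
# Per-run cost implied by the county precomputation figures.
county_runs = 3_143
county_minutes = 45
sec_per_run = county_minutes * 60 / county_runs   # ~0.86 s per sim run

# Extrapolate to a zip-code-level pass.
zip_runs = 33_000
zip_minutes = zip_runs * sec_per_run / 60
print(round(sec_per_run, 2))  # 0.86
print(round(zip_minutes))     # 472 -- i.e. roughly 8 hours per variable
```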

Proposed alternative: assignment-driven precomputation

Instead of precomputing all possible geographies, start from the actual clone assignments:

  1. Each of the 5.2M clone-households is assigned to a census block. That block implies a specific state, county, zip, legislative district, etc.
  2. Group clones by their unique geography tuple (state, county, zip, district, ...).
  3. Run one sim per unique combination that actually appears in the assignments.

The number of unique combinations is bounded by the number of distinct census blocks that actually receive clones, which is far smaller than the full cartesian product. This approach:

  • Scales naturally with any number of geography levels
  • Only computes values for geographies that actually appear in the data
  • Eliminates the need for COUNTY_DEPENDENT_VARS or any hardcoded variable-to-geography mapping
  • Trades the decoupling between precomputation and geography assignment for computational feasibility
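The grouping step in the proposal can be sketched as below. The block-to-geography mapping, the `compute` callback, and all data are hypothetical stand-ins; the point is that the number of runs equals the number of geography tuples that actually appear, not the size of the cartesian product.

```python
from collections import defaultdict


def precompute_from_assignments(clone_blocks, block_to_geo, compute):
    """Run one computation per geography tuple that actually appears in the
    clone assignments, instead of one per element of the cartesian product."""
    # 1. Group clones by their unique (state, county, zip, ...) tuple.
    groups = defaultdict(list)
    for clone_id, block in enumerate(clone_blocks):
        groups[block_to_geo[block]].append(clone_id)
    # 2. One sim run per unique combination that actually occurs.
    results = {geo: compute(geo) for geo in groups}
    return dict(groups), results


# Toy data: 6 clones landing in 3 census blocks covering 2 unique tuples.
block_to_geo = {
    "b1": ("06", "037", "90001"),  # (state, county, zip)
    "b2": ("06", "037", "90001"),  # different block, same geography tuple
    "b3": ("36", "061", "10001"),
}
clone_blocks = ["b1", "b2", "b2", "b3", "b1", "b3"]

groups, results = precompute_from_assignments(
    clone_blocks, block_to_geo, compute=lambda geo: f"values@{geo}"
)
print(len(results))  # 2 runs, not 2 states x 2 counties x 2 zips = 8
```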

Current workaround

The COUNTY_DEPENDENT_VARS hardcoded set gates which variables go through per-county iteration. Currently only aca_ptc is listed, and it's disabled in the target config YAML — meaning the county precomputation runs (~45 min) and produces values that are never used in the final matrix.
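A minimal illustration of the gating just described, with hypothetical variable names (only `COUNTY_DEPENDENT_VARS` and `aca_ptc` come from the source): the county loop runs for every variable in the set, but a variable disabled in the target config never reaches the final matrix.

```python
# Hardcoded gate: only these variables go through the 3,143-run county loop.
COUNTY_DEPENDENT_VARS = {"aca_ptc"}  # currently the only entry
enabled_targets = set()              # aca_ptc is disabled in the target YAML

county_vars_to_run = COUNTY_DEPENDENT_VARS       # ~45 min of precomputation...
used_in_matrix = county_vars_to_run & enabled_targets

print(sorted(county_vars_to_run))  # ['aca_ptc']
print(sorted(used_in_matrix))      # [] -- computed but never used
```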
