Uncapped PUF incomes + calibration weights produce ~19x inflated state-level aggregates

## Summary

State-level datasets built from the Feb 20 calibration inputs (`stratified_extended_cps.h5` + `w_district_calibration.npy`) produce ~19x inflated income aggregates. These datasets are **live on GCS production** (`gs://policyengine-us-data/states/*.h5`) and are being served by the production API.

**Example:** Louisiana baseline `household_net_income.sum()` = **$3,147B** on production GCS vs **$166B** in HuggingFace v1.62.0 (~19x). National weighted employment income = **$59T** vs an expected ~$11T (~5.4x).

## Root Cause

PR #537 correctly removed the ~$6.26M AGI ceiling from PUF imputation (fixing #530), but the calibration weights were not re-tuned to account for the much wider income range. The result is that a handful of ultra-high-income PUF records get calibration weights that massively inflate national totals.

### Evidence from the new calibration inputs

**Base dataset (`stratified_extended_cps.h5`):**
- `employment_income_before_lsr` max: $2.8M (old) → **$132.6M** (new)
- `long_term_capital_gains` max: $2.1M → **$164.3M**
- 30+ financial variables inflated 5x–4,270x in the upper tail (see the table below)

**Calibration weights (`w_district_calibration.npy`):**
- Overall weight sums are similar (ratio 1.03x) — the problem isn't the total weight mass
- But extreme-income records get substantial weights across many CDs:

| Income | Total calibrated weight | CDs with nonzero weight | Weighted contribution |
|--------|------------------------|------------------------|----------------------|
| $132.6M | 546 | 114 | $72.4B |
| $106.5M | **3,601** | **194** | **$383.4B** |
| $87.2M | 2,292 | 176 | $199.9B |

**Top 20 earners alone contribute $901B** of national weighted employment income. Total national weighted employment income = $59T vs the correct ~$11T.

For comparison, in the old (capped) data, the highest earner had an income of $2.78M with weight 15,725, contributing $43.8B: still large, but roughly 9x smaller than the largest new contribution ($383.4B).
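
The per-record figures in the table above are income multiplied by the record's total weight across congressional districts. A minimal sketch of that calculation on synthetic data follows; the `(n_districts, n_records)` weight layout and the helper name `weighted_contributions` are assumptions for illustration, not the actual layout of `w_district_calibration.npy`.

```python
import numpy as np


def weighted_contributions(income, weights):
    """Summarize each record's contribution to the national weighted total.

    income:  shape (n_records,), dollars per record
    weights: shape (n_districts, n_records); this layout is an assumption
    """
    total_weight = weights.sum(axis=0)        # weight summed over districts
    nonzero_cds = (weights > 0).sum(axis=0)   # districts that use the record
    contribution = income * total_weight      # dollars added to the national total
    return total_weight, nonzero_cds, contribution


# Synthetic example: 3 districts, 4 records, one extreme earner (record 3)
income = np.array([40_000.0, 85_000.0, 250_000.0, 106_500_000.0])
weights = np.array([
    [1200.0, 900.0, 50.0, 10.0],
    [1100.0, 0.0, 75.0, 0.0],
    [1300.0, 800.0, 0.0, 20.0],
])

total_weight, nonzero_cds, contribution = weighted_contributions(income, weights)
share = contribution / contribution.sum()

# Print records from largest to smallest contribution
for i in np.argsort(contribution)[::-1]:
    print(
        f"record {i}: income=${income[i]:,.0f} weight={total_weight[i]:,.0f} "
        f"CDs={nonzero_cds[i]} contribution=${contribution[i]:,.0f} "
        f"share={share[i]:.1%}"
    )
```

In this toy data the extreme record holds well under 1% of the total weight yet supplies about 91% of the weighted income, which mirrors the pattern above: total weight mass looks normal while the weighted aggregate does not.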

### Variables with >5x inflation in new base dataset

| Variable | Old abs sum | New abs sum | Ratio |
|----------|------------|------------|-------|
| `estate_income` | $2.6M | $7,514M | 2,862x |
| `general_business_credit` | $0.1M | $373M | 4,270x |
| `foreign_tax_credit` | $1.9M | $2,941M | 1,580x |
| `unadjusted_basis_qualified_property` | $54M | $23,871M | 440x |
| `unrecaptured_section_1250_gain` | $5.4M | $2,381M | 444x |
| `long_term_capital_gains` | $477M | $128,251M | 269x |
| `amt_foreign_tax_credit` | $1.2M | $302M | 247x |
| `miscellaneous_income` | $2.4M | $560M | 231x |
| `salt_refund_income` | $8.1M | $1,349M | 166x |
| `charitable_non_cash_donations` | $12M | $2,221M | 185x |
| `charitable_cash_donations` | $51M | $5,680M | 113x |
| `partnership_s_corp_income` | $135M | $14,260M | 106x |
| `qualified_dividend_income` | $50M | $5,150M | 104x |
| `domestic_production_ald` | $10M | $571M | 55x |
| `non_qualified_dividend_income` | $68M | $2,506M | 37x |
| `rental_income` | $32M | $1,044M | 32x |
| `employment_income_before_lsr` | $1,408M | $19,375M | 14x |

CPS-native variables (age, household_weight, disability, rent, etc.) are all unchanged (ratio ~1.0).
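
One way to produce a comparison like the table above is to take the ratio of absolute sums per variable and flag anything over a threshold. The sketch below uses small in-memory arrays as stand-ins for the two base datasets; the helper `inflation_ratios` and the sample values are hypothetical.

```python
import numpy as np


def inflation_ratios(old, new, threshold=5.0):
    """Return {variable: ratio} where sum(|new|) / sum(|old|) exceeds threshold.

    old, new: dicts mapping variable name -> 1-D array of record values.
    """
    flagged = {}
    for name in sorted(old.keys() & new.keys()):
        old_sum = np.abs(old[name]).sum()
        new_sum = np.abs(new[name]).sum()
        if old_sum == 0:
            continue  # ratio undefined; skip rather than divide by zero
        ratio = new_sum / old_sum
        if ratio > threshold:
            flagged[name] = ratio
    return flagged


# Synthetic stand-ins for the old and new base datasets (illustrative values)
old = {
    "age": np.array([30.0, 45.0, 60.0]),
    "rental_income": np.array([0.0, -5_000.0, 20_000.0]),
}
new = {
    "age": np.array([30.0, 45.0, 60.0]),
    "rental_income": np.array([0.0, -5_000.0, 795_000.0]),
}

for name, ratio in inflation_ratios(old, new).items():
    print(f"{name}: {ratio:,.0f}x")
```

Against the real files, each array would presumably be read with the same `ds[name]["2024"][:]` access pattern used in the Reproduction section.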

## Production Impact

The `upload_to_staging()` function in `modal_app/local_area.py` uploads files **directly to GCS production paths** before staging on HuggingFace. The v1.69.3 state files went to GCS on ~Feb 20 but the Promote workflow was never run, so:

- **GCS (production):** v1.69.3 state files with inflated incomes ← **live, broken**
- **HuggingFace production:** v1.62.0 state files (correct)
- **HuggingFace staging/:** v1.69.3 files (7.43 GB, unpromoted)

The production API (`policyengine-api`) calls `get_default_dataset()` which returns `gs://policyengine-us-data/states/{STATE}.h5` with `data_version=None`, so it always gets the latest GCS blob — the broken v1.69.3 data.
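
To illustrate why `data_version=None` is risky, here is a minimal sketch of the difference between an unpinned and a pinned dataset reference. `DatasetRef`, `state_dataset`, and the `require_pin` flag are hypothetical names for illustration only; this is not the `policyengine-api` implementation.

```python
from dataclasses import dataclass
from typing import Optional

BUCKET_PREFIX = "gs://policyengine-us-data/states"


@dataclass(frozen=True)
class DatasetRef:
    """Hypothetical reference to a state dataset; not the real API's type."""
    uri: str
    data_version: Optional[str]  # None means "whatever blob is currently live"


def state_dataset(state: str, data_version: Optional[str] = None,
                  require_pin: bool = False) -> DatasetRef:
    """Build a dataset reference, optionally refusing unpinned versions."""
    if require_pin and data_version is None:
        raise ValueError(f"{state}: data_version must be pinned explicitly")
    return DatasetRef(f"{BUCKET_PREFIX}/{state}.h5", data_version)


# Unpinned: silently tracks whatever was last uploaded to the bucket
print(state_dataset("LA"))

# Pinned: a new upload does not change what this reference means
print(state_dataset("LA", data_version="1.62.0"))
```

With a guard like `require_pin=True` in the serving path, an unreviewed upload could not change results without an explicit version bump.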

## Suggested Fixes

1. **Immediate:** Roll back GCS state files to the v1.62.0 data to restore correct production behavior
2. **Calibration:** Add constraints to the L0 optimizer to prevent extreme-income records from getting weights that inflate national totals beyond known aggregates (e.g., cap per-record weighted income contribution, or add an income-total constraint)
3. **Pipeline:** `upload_to_staging()` should not write to GCS production paths directly — this defeats the staging/promote safety pattern
4. **Versioning:** Add dataset version pinning in the API so state datasets can't be silently updated
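
Fixes 2 and 3 both suggest a guard that compares weighted totals against known aggregates before anything is published. A minimal sketch of such a gate follows; `check_aggregate`, the 25% tolerance, and the passing value of $11.4T are assumptions chosen for illustration, while the $11T and $59T figures come from this issue.

```python
def check_aggregate(name, weighted_total, target, rel_tol=0.25):
    """Raise if a weighted total strays too far from a known aggregate.

    rel_tol is an illustrative threshold, not a value from the pipeline.
    """
    ratio = weighted_total / target
    if abs(ratio - 1.0) > rel_tol:
        raise ValueError(
            f"{name}: weighted total ${weighted_total:,.0f} is {ratio:.1f}x "
            f"the target ${target:,.0f}; refusing to publish"
        )
    return ratio


# Expected national employment income, per this issue (~$11T)
TARGET = 11e12

# A total close to the target passes and returns its ratio
print(check_aggregate("employment_income", 11.4e12, TARGET))

# The $59T total observed in the new data is rejected
try:
    check_aggregate("employment_income", 59e12, TARGET)
except ValueError as err:
    print(err)
```

Running a check like this inside the upload step, before any write to a production path, would have blocked the v1.69.3 state files.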

## Reproduction

```python
import h5py
import numpy as np
from huggingface_hub import hf_hub_download

REPO = "policyengine/policyengine-us-data"
OLD_REVISION = "1c91d3b"  # revision holding the old inputs, for comparison


def fetch(filename, revision=None):
    """Download a calibration input from HuggingFace; return its local path."""
    return hf_hub_download(
        REPO, f"calibration/{filename}", repo_type="model", revision=revision
    )


def load_employment_income(revision=None):
    """Read 2024 employment income from the base dataset at a revision."""
    with h5py.File(fetch("stratified_extended_cps.h5", revision), "r") as ds:
        return ds["employment_income_before_lsr"]["2024"][:].astype(float)


# Calibration weights: new (latest revision) and old
w = np.load(fetch("w_district_calibration.npy"))
w_old = np.load(fetch("w_district_calibration.npy", revision=OLD_REVISION))

# Base dataset incomes: new and old
emp_new = load_employment_income()
emp_old = load_employment_income(revision=OLD_REVISION)
print(f"Old max income: ${emp_old.max():,.0f}")  # $2,783,732
print(f"New max income: ${emp_new.max():,.0f}")  # $132,596,760
```

## Related

- #530 — Original issue about CPS top-coding (correctly identified)
- #537 — PR that removed the AGI ceiling (correct intent, but calibration wasn't adjusted)
- #489 — SparseMatrixBuilder overhaul (merged Feb 12, preceded the new calibration)
