Summary
State-level datasets built from the Feb 20 calibration inputs (stratified_extended_cps.h5 + w_district_calibration.npy) produce ~19x inflated income aggregates. These datasets are live on GCS production (gs://policyengine-us-data/states/*.h5) and are being served by the production API.
Example: Louisiana baseline household_net_income.sum() = $3,147B (production/GCS) vs $166B (HuggingFace v1.62.0). National weighted employment income = $59T (should be ~$11T).
Root Cause
PR #537 correctly removed the ~$6.26M AGI ceiling from PUF imputation (fixing #530), but the calibration weights were not re-tuned to account for the much wider income range. The result is that a handful of ultra-high-income PUF records get calibration weights that massively inflate national totals.
Evidence from the new calibration inputs
Base dataset (stratified_extended_cps.h5):
- employment_income_before_lsr max: $2.8M (old) → $132.6M (new)
- long_term_capital_gains max: $2.1M → $164.3M
- 30+ financial variables inflated 5x–4,270x in the upper tail (full list below)
Calibration weights (w_district_calibration.npy):
- Overall weight sums are similar (ratio 1.03x) — the problem isn't the total weight mass
- But extreme-income records get substantial weights across many CDs:
| Income | Total calibrated weight | CDs with nonzero weight | Weighted contribution |
|---|---|---|---|
| $132.6M | 546 | 114 | $72.4B |
| $106.5M | 3,601 | 194 | $383.4B |
| $87.2M | 2,292 | 176 | $199.9B |
Top 20 earners alone contribute $901B of national weighted employment income. Total national weighted employment income = $59T vs the correct ~$11T.
For comparison, in the old (capped) data, the highest earner had income of $2.78M with weight 15,725, contributing $43.8B. That is still large, but roughly 9x smaller than the largest new single-record contribution ($383.4B).
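The per-record figures in the table above can be derived along the following lines. This is a minimal sketch, not the script used for the issue: it assumes the weights are arranged as a `(n_districts, n_records)` matrix, which may not match the actual layout of `w_district_calibration.npy`, and it runs on toy numbers chosen to echo the first two table rows. The function name `top_weighted_contributors` is hypothetical.

```python
import numpy as np

def top_weighted_contributors(income, weights, k=3):
    """Rank records by income * total calibrated weight.

    income:  shape (n_records,)
    weights: shape (n_districts, n_records)  # assumed layout
    """
    total_weight = weights.sum(axis=0)        # weight summed over all CDs
    nonzero_cds = (weights > 0).sum(axis=0)   # CDs where the record is used
    contribution = income * total_weight      # weighted income contribution
    order = np.argsort(contribution)[::-1][:k]
    return [
        {
            "income": float(income[i]),
            "total_weight": float(total_weight[i]),
            "cds": int(nonzero_cds[i]),
            "contribution": float(contribution[i]),
        }
        for i in order
    ]

# Toy data: three records spread over four districts.
income = np.array([132.6e6, 106.5e6, 50_000.0])
weights = np.array([
    [300.0, 1_000.0, 10_000.0],
    [246.0, 2_000.0, 20_000.0],
    [0.0,     601.0,      0.0],
    [0.0,       0.0,  5_000.0],
])

rows = top_weighted_contributors(income, weights, k=2)
for row in rows:
    print(f"${row['income'] / 1e6:.1f}M  weight={row['total_weight']:,.0f}  "
          f"CDs={row['cds']}  contribution=${row['contribution'] / 1e9:.1f}B")
```

Ranking by contribution rather than by income is the point: a record with a smaller income can dominate the national total if it picks up more weight across districts.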
Variables with >5x inflation in new base dataset
| Variable | Old abs sum | New abs sum | Ratio |
|---|---|---|---|
| estate_income | $2.6M | $7,514M | 2,862x |
| general_business_credit | $0.1M | $373M | 4,270x |
| foreign_tax_credit | $1.9M | $2,941M | 1,580x |
| unadjusted_basis_qualified_property | $54M | $23,871M | 440x |
| unrecaptured_section_1250_gain | $5.4M | $2,381M | 444x |
| long_term_capital_gains | $477M | $128,251M | 269x |
| amt_foreign_tax_credit | $1.2M | $302M | 247x |
| miscellaneous_income | $2.4M | $560M | 231x |
| salt_refund_income | $8.1M | $1,349M | 166x |
| charitable_non_cash_donations | $12M | $2,221M | 185x |
| charitable_cash_donations | $51M | $5,680M | 113x |
| partnership_s_corp_income | $135M | $14,260M | 106x |
| qualified_dividend_income | $50M | $5,150M | 104x |
| domestic_production_ald | $10M | $571M | 55x |
| non_qualified_dividend_income | $68M | $2,506M | 37x |
| rental_income | $32M | $1,044M | 32x |
| employment_income_before_lsr | $1,408M | $19,375M | 14x |
CPS-native variables (age, household_weight, disability, rent, etc.) are all unchanged (ratio ~1.0).
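A comparison like the one in the table can be sketched as follows. This is an illustrative reconstruction under assumptions: it takes each dataset as a plain dict mapping variable names to NumPy arrays (for example, arrays read out of the two `.h5` files), uses unweighted sums of absolute values, and the helper name `inflated_variables` is hypothetical.

```python
import numpy as np

def inflated_variables(old, new, threshold=5.0):
    """Return {name: ratio} for variables whose absolute sum grew by more than `threshold`."""
    flagged = {}
    for name in old.keys() & new.keys():
        old_sum = np.abs(old[name]).sum()
        new_sum = np.abs(new[name]).sum()
        # Skip variables that were all zero before; the ratio is undefined there.
        if old_sum > 0 and new_sum / old_sum > threshold:
            flagged[name] = float(new_sum / old_sum)
    # Largest ratio first.
    return dict(sorted(flagged.items(), key=lambda kv: kv[1], reverse=True))

# Toy data: one unchanged CPS-style variable and two inflated financial variables.
old = {
    "age": np.array([30.0, 40.0, 50.0]),
    "rental_income": np.array([1.0, -1.0, 0.0]),
    "estate_income": np.array([0.0, 0.0, 1.0]),
}
new = {
    "age": np.array([30.0, 40.0, 50.0]),
    "rental_income": np.array([32.0, -32.0, 0.0]),
    "estate_income": np.array([0.0, 0.0, 100.0]),
}

flagged = inflated_variables(old, new)
print(flagged)
```

Using absolute values keeps variables that can be negative (such as rental or partnership income) from cancelling out and hiding tail growth.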
Production Impact
The upload_to_staging() function in modal_app/local_area.py uploads files directly to GCS production paths before staging on HuggingFace. The v1.69.3 state files went to GCS on ~Feb 20 but the Promote workflow was never run, so:
- GCS (production): v1.69.3 state files with inflated incomes ← live, broken
- HuggingFace production: v1.62.0 state files (correct)
- HuggingFace staging/: v1.69.3 files (7.43 GB, unpromoted)
The production API (policyengine-api) calls get_default_dataset() which returns gs://policyengine-us-data/states/{STATE}.h5 with data_version=None, so it always gets the latest GCS blob — the broken v1.69.3 data.
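One way to avoid silently picking up the latest blob is to make the version an explicit, required input. The sketch below is purely illustrative: `resolve_state_dataset` is a hypothetical helper, not part of policyengine-api, and the versioned path layout is an assumption (the current bucket uses `states/{STATE}.h5` with no version segment).

```python
def resolve_state_dataset(state, data_version):
    """Hypothetical resolver that refuses unpinned dataset requests."""
    if not data_version:
        raise ValueError(
            "data_version must be set explicitly; refusing to serve the latest blob"
        )
    # Assumed versioned layout, for illustration only.
    return f"gs://policyengine-us-data/states/{data_version}/{state}.h5"

pinned_uri = resolve_state_dataset("LA", "1.62.0")
print(pinned_uri)

try:
    resolve_state_dataset("LA", None)
except ValueError as err:
    print(f"rejected: {err}")
```

Failing loudly on `data_version=None` turns an unnoticed data swap into an immediate, visible error.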
Suggested Fixes
- Immediate: Roll back GCS state files to the v1.62.0 data to restore correct production behavior
- Calibration: Add constraints to the L0 optimizer to prevent extreme-income records from getting weights that inflate national totals beyond known aggregates (e.g., cap per-record weighted income contribution, or add an income-total constraint)
- Pipeline: upload_to_staging() should not write to GCS production paths directly; this defeats the staging/promote safety pattern
- Versioning: Add dataset version pinning in the API so state datasets can't be silently updated
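The income-total constraint suggested above could also run as a cheap gate before any upload. The following is a minimal sketch under assumptions: `check_weighted_total` is a hypothetical helper, the 10% tolerance is arbitrary, and the two-record dataset is synthetic, with numbers picked only to mimic the $11T vs $59T gap described in this issue.

```python
import numpy as np

def check_weighted_total(values, weights, target, rel_tol=0.1):
    """Return (ok, ratio), where ratio = weighted total / known aggregate."""
    total = float(np.dot(values, weights))
    ratio = total / target
    return abs(ratio - 1.0) <= rel_tol, ratio

TARGET_EMPLOYMENT_INCOME = 11e12  # ~$11T, the reference figure cited in this issue

# Synthetic records: one typical earner and one extreme-income record.
values = np.array([50_000.0, 100e6])

good_weights = np.array([220e6, 0.0])        # extreme record unused
bad_weights = np.array([220e6, 480_000.0])   # extreme record heavily weighted

ok_good, ratio_good = check_weighted_total(values, good_weights, TARGET_EMPLOYMENT_INCOME)
ok_bad, ratio_bad = check_weighted_total(values, bad_weights, TARGET_EMPLOYMENT_INCOME)

print(f"good: ok={ok_good}, ratio={ratio_good:.2f}")
print(f"bad:  ok={ok_bad}, ratio={ratio_bad:.2f}")
```

A check of this kind would not fix the calibration itself, but it would stop a dataset whose national total is several times the known aggregate from reaching a production path.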
Reproduction
```python
import h5py
import numpy as np
from huggingface_hub import hf_hub_download

# Load new calibration inputs from HuggingFace
w = np.load(hf_hub_download("policyengine/policyengine-us-data",
    "calibration/w_district_calibration.npy", repo_type="model"))

# Load old for comparison
w_old = np.load(hf_hub_download("policyengine/policyengine-us-data",
    "calibration/w_district_calibration.npy", repo_type="model",
    revision="1c91d3b"))

ds_new = h5py.File(hf_hub_download("policyengine/policyengine-us-data",
    "calibration/stratified_extended_cps.h5", repo_type="model"), "r")
ds_old = h5py.File(hf_hub_download("policyengine/policyengine-us-data",
    "calibration/stratified_extended_cps.h5", repo_type="model",
    revision="1c91d3b"), "r")

emp_new = ds_new["employment_income_before_lsr"]["2024"][:].astype(float)
emp_old = ds_old["employment_income_before_lsr"]["2024"][:].astype(float)
print(f"Old max income: ${emp_old.max():,.0f}")  # $2,783,732
print(f"New max income: ${emp_new.max():,.0f}")  # $132,596,760
```
Related
- CPS top-coding caps AGI at $6.26M — zero observations above $10M in any state #530 — Original issue about CPS top-coding (correctly identified)
- Add PUF + source impute modules, fix AGI ceiling (issue #530) #537 — PR that removed the AGI ceiling (correct intent, but calibration wasn't adjusted)
- Supporting all calibration targets with SparseMatrixBuilder #489 — SparseMatrixBuilder overhaul (merged Feb 12, preceded the new calibration)