Uncapped PUF incomes + calibration weights produce ~19x inflated state-level aggregates #555

@PavelMakarchuk

Summary

State-level datasets built from the Feb 20 calibration inputs (stratified_extended_cps.h5 + w_district_calibration.npy) produce ~19x inflated income aggregates. These datasets are live on GCS production (gs://policyengine-us-data/states/*.h5) and are being served by the production API.

Example: Louisiana baseline household_net_income.sum() = $3,147B (production/GCS) vs $166B (HuggingFace v1.62.0). National weighted employment income = $59T (should be ~$11T).

Root Cause

PR #537 correctly removed the ~$6.26M AGI ceiling from PUF imputation (fixing #530), but the calibration weights were not re-tuned to account for the much wider income range. The result is that a handful of ultra-high-income PUF records get calibration weights that massively inflate national totals.

Evidence from the new calibration inputs

Base dataset (stratified_extended_cps.h5):

  • employment_income_before_lsr max: $2.8M (old) → $132.6M (new)
  • long_term_capital_gains max: $2.1M → $164.3M
  • 30+ financial variables inflated 5x–4,270x in the upper tail (see the table of >5x variables below)

Calibration weights (w_district_calibration.npy):

  • Overall weight sums are similar (ratio 1.03x) — the problem isn't the total weight mass
  • But extreme-income records get substantial weights across many CDs:
| Income | Total calibrated weight | CDs with nonzero weight | Weighted contribution |
|---|---|---|---|
| $132.6M | 546 | 114 | $72.4B |
| $106.5M | 3,601 | 194 | $383.4B |
| $87.2M | 2,292 | 176 | $199.9B |

Top 20 earners alone contribute $901B of national weighted employment income. Total national weighted employment income = $59T vs the correct ~$11T.

For comparison, in the old (capped) data, the highest earner had an income of $2.78M with weight 15,725, contributing $43.8B: still large, but roughly 9x smaller than the largest new single-record contribution ($383.4B).
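The per-record figures above are income multiplied by total calibrated weight. A minimal sketch of that calculation follows, assuming the weights can be arranged as a (districts × records) matrix; the actual layout of w_district_calibration.npy is not stated in this issue, and `record_contributions` and the toy arrays are hypothetical.

```python
import numpy as np

def record_contributions(weights, income):
    """Per-record totals from a (n_districts, n_records) weight matrix.

    The matrix layout is an assumption made for illustration only.
    """
    total_weight = weights.sum(axis=0)          # weight summed over districts
    n_districts = (weights > 0).sum(axis=0)     # districts with nonzero weight
    contribution = total_weight * income        # weighted income contribution
    return total_weight, n_districts, contribution

# Toy data: three records, three districts (not real values).
income = np.array([50_000.0, 2_000_000.0, 100_000_000.0])
weights = np.array([
    [1_000.0, 10.0, 0.0],
    [2_000.0, 0.0, 5.0],
    [1_500.0, 20.0, 15.0],
])

total_weight, n_districts, contribution = record_contributions(weights, income)
top_share = contribution.max() / contribution.sum()
print(f"Top record share of weighted income: {top_share:.1%}")

# Cross-check of the first table row: $132.6M x 546 is about $72.4B.
first_row_billions = 132.6e6 * 546 / 1e9
print(f"${first_row_billions:.1f}B")
```

In the toy data, the single $100M record carries a total weight of only 20 yet dominates the weighted total, which is the same pattern the table shows.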

Variables with >5x inflation in new base dataset

| Variable | Old abs sum | New abs sum | Ratio |
|---|---|---|---|
| estate_income | $2.6M | $7,514M | 2,862x |
| general_business_credit | $0.1M | $373M | 4,270x |
| foreign_tax_credit | $1.9M | $2,941M | 1,580x |
| unadjusted_basis_qualified_property | $54M | $23,871M | 440x |
| unrecaptured_section_1250_gain | $5.4M | $2,381M | 444x |
| long_term_capital_gains | $477M | $128,251M | 269x |
| amt_foreign_tax_credit | $1.2M | $302M | 247x |
| miscellaneous_income | $2.4M | $560M | 231x |
| salt_refund_income | $8.1M | $1,349M | 166x |
| charitable_non_cash_donations | $12M | $2,221M | 185x |
| charitable_cash_donations | $51M | $5,680M | 113x |
| partnership_s_corp_income | $135M | $14,260M | 106x |
| qualified_dividend_income | $50M | $5,150M | 104x |
| domestic_production_ald | $10M | $571M | 55x |
| non_qualified_dividend_income | $68M | $2,506M | 37x |
| rental_income | $32M | $1,044M | 32x |
| employment_income_before_lsr | $1,408M | $19,375M | 14x |

CPS-native variables (age, household_weight, disability, rent, etc.) are all unchanged (ratio ~1.0).
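A comparison like the one in the table can be sketched as a scan over variables present in both datasets. This is an illustration only: it assumes each variable has already been loaded as a 1-D array and that "abs sum" means the unweighted sum of absolute values; `inflation_ratios` and the toy inputs are hypothetical.

```python
import numpy as np

def inflation_ratios(old, new, threshold=5.0):
    """Return {variable: new_abs_sum / old_abs_sum} for ratios above threshold.

    `old` and `new` map variable names to 1-D arrays (assumed layout).
    """
    flagged = {}
    for name in old.keys() & new.keys():
        old_sum = np.abs(old[name]).sum()
        new_sum = np.abs(new[name]).sum()
        if old_sum == 0:
            continue  # ratio undefined; skip rather than divide by zero
        ratio = new_sum / old_sum
        if ratio > threshold:
            flagged[name] = ratio
    # Largest inflation first
    return dict(sorted(flagged.items(), key=lambda kv: kv[1], reverse=True))

# Toy data (not real values): one unchanged variable, one inflated tail.
old = {
    "age": np.array([30.0, 45.0, 60.0]),
    "long_term_capital_gains": np.array([0.0, 1_000.0, 2_000.0]),
}
new = {
    "age": np.array([30.0, 45.0, 60.0]),
    "long_term_capital_gains": np.array([0.0, 1_000.0, 800_000.0]),
}

flagged = inflation_ratios(old, new)
for name, ratio in flagged.items():
    print(f"{name}: {ratio:,.0f}x")
```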

Production Impact

The upload_to_staging() function in modal_app/local_area.py uploads files directly to GCS production paths before staging on HuggingFace. The v1.69.3 state files went to GCS on ~Feb 20 but the Promote workflow was never run, so:

  • GCS (production): v1.69.3 state files with inflated incomes ← live, broken
  • HuggingFace production: v1.62.0 state files (correct)
  • HuggingFace staging/: v1.69.3 files (7.43 GB, unpromoted)

The production API (policyengine-api) calls get_default_dataset() which returns gs://policyengine-us-data/states/{STATE}.h5 with data_version=None, so it always gets the latest GCS blob — the broken v1.69.3 data.
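One way to avoid silently picking up the latest blob is to make the version a required part of the dataset reference. The sketch below is hypothetical and is not the policyengine-api implementation: `DatasetRef` and `pinned_state_dataset` are illustrative names, and how a version string would map to a specific GCS object is left open.

```python
from dataclasses import dataclass

BUCKET_PREFIX = "gs://policyengine-us-data/states"

@dataclass(frozen=True)
class DatasetRef:
    """A dataset location plus the version the caller expects (hypothetical)."""
    uri: str
    data_version: str

def pinned_state_dataset(state, data_version):
    """Build a state dataset reference, refusing an unpinned version."""
    if not data_version:
        raise ValueError("data_version must be set; refusing to use latest blob")
    return DatasetRef(
        uri=f"{BUCKET_PREFIX}/{state.upper()}.h5",
        data_version=data_version,
    )

ref = pinned_state_dataset("la", "1.62.0")
print(ref)
```

Failing loudly when `data_version` is missing turns an unnoticed data swap into an explicit error at request time.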

Suggested Fixes

  1. Immediate: Roll back GCS state files to the v1.62.0 data to restore correct production behavior
  2. Calibration: Add constraints to the L0 optimizer to prevent extreme-income records from getting weights that inflate national totals beyond known aggregates (e.g., cap per-record weighted income contribution, or add an income-total constraint)
  3. Pipeline: upload_to_staging() should not write to GCS production paths directly — this defeats the staging/promote safety pattern
  4. Versioning: Add dataset version pinning in the API so state datasets can't be silently updated
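The idea behind fix 2 can be illustrated with a post-hoc clip plus a total check. This is not the L0 optimizer and not a proposed implementation: it is a simplified stand-in that caps each record's weighted income contribution and then compares the total against a known aggregate. The function names, cap, target, and tolerance are all hypothetical.

```python
import numpy as np

def cap_weighted_contribution(income, weights, max_contribution):
    """Clip weights so that income * weight <= max_contribution per record."""
    income = np.asarray(income, dtype=float)
    weights = np.asarray(weights, dtype=float)
    capped = weights.copy()
    positive = income > 0  # records with zero or negative income are left as-is
    limit = max_contribution / income[positive]
    capped[positive] = np.minimum(weights[positive], limit)
    return capped

def within_tolerance(total, target, rel_tol=0.1):
    """True if total is within rel_tol (relative) of the target aggregate."""
    return abs(total - target) <= rel_tol * target

# Toy data (not real values).
income = np.array([50_000.0, 100_000_000.0])
weights = np.array([1_000.0, 500.0])
target_total = 1.0e9

capped = cap_weighted_contribution(income, weights, max_contribution=1.0e9)
total_before = float((income * weights).sum())
total_after = float((income * capped).sum())

print(within_tolerance(total_before, target_total))
print(within_tolerance(total_after, target_total))
```

A real fix would more likely add the income-total target to the calibration loss itself, since clipping after the fact changes the weight mass that the other calibration targets rely on.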

Reproduction

import h5py
import numpy as np
from huggingface_hub import hf_hub_download

REPO = "policyengine/policyengine-us-data"
OLD_REVISION = "1c91d3b"  # revision holding the old inputs, for comparison

def fetch(filename, revision=None):
    # revision=None resolves to the repo's current default revision (the new inputs)
    return hf_hub_download(REPO, filename, repo_type="model", revision=revision)

# Calibration weights: new vs. old
w = np.load(fetch("calibration/w_district_calibration.npy"))
w_old = np.load(fetch("calibration/w_district_calibration.npy", OLD_REVISION))
print(f"Weight sum ratio (new/old): {w.sum() / w_old.sum():.2f}")  # reported above as ~1.03

# Base dataset: compare the maximum employment income, new vs. old
with h5py.File(fetch("calibration/stratified_extended_cps.h5"), "r") as ds_new, \
     h5py.File(fetch("calibration/stratified_extended_cps.h5", OLD_REVISION), "r") as ds_old:
    emp_new = ds_new["employment_income_before_lsr"]["2024"][:].astype(float)
    emp_old = ds_old["employment_income_before_lsr"]["2024"][:].astype(float)

print(f"Old max income: ${emp_old.max():,.0f}")  # $2,783,732
print(f"New max income: ${emp_new.max():,.0f}")  # $132,596,760

Related
