
Dev to Main — March 2026 #603

Open
maxachis wants to merge 46 commits into main from dev

maxachis (Collaborator) commented Mar 9, 2026

Summary

Merges dev into main, collecting 8 PRs' worth of features, fixes, and infrastructure improvements.

PRs Included

Issues Resolved

Overall Changes

Performance

  • Materialized url_annotation_count_view and url_annotation_flags views with unique indexes, yielding up to 92% improvement on GET /annotate/all at scale
  • Added pytest-benchmark infrastructure with 10k-URL scale seeder, pyinstrument profiling, and CI job summary reporting

New feature

  • Internet Archive collector that crawls the Wayback Machine CDX API to discover archived police data URLs, with filtering, preprocessing, and full test coverage

Security & auth

  • Auth guards added to sensitive mutating endpoints (delete/put operations)
  • Internet Archive save task now authorizes status polling requests and validates async save status before marking success

Developer experience

  • Replaced ReviewDog with a prek-based pre-commit lint hook
  • Benchmark CI workflow triggers on PRs to main
  • Removed hardcoded dev DB password from docs in favor of docker-compose reference

Data model cleanup

  • Removed agency_described_not_in_database from data source metadata
  • Federal agencies can now be linked only to the US national location

🤖 Generated with Claude Code

labradorite-dev and others added 30 commits February 18, 2026 07:03
Implement a new collector that uses the Internet Archive CDX API to
discover archived URLs on domains PDAP already knows about. Users provide
seed URLs, domains are extracted, and the Wayback Machine is searched for
all archived pages with filtering for mime types, URL patterns, and dedup.
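
The flow this commit describes (seed URL, extracted domain, CDX search, mime filter, dedup) can be sketched roughly as below. This is an illustration, not the collector's code: the function names are hypothetical, and the query uses the public CDX server parameters `url`, `matchType`, `output`, `filter`, and `collapse`. No network call is made; the sample rows only mimic the shape of the CDX JSON output, whose first row is a header.

```python
from urllib.parse import urlencode, urlparse

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def extract_domain(seed_url: str) -> str:
    """Return the lowercase host of a seed URL, without a leading 'www.'."""
    host = urlparse(seed_url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def build_cdx_query(domain: str, mime_type: str = "text/html") -> str:
    """Build a CDX query covering a domain and its subdomains."""
    params = [
        ("url", domain),
        ("matchType", "domain"),             # include subdomains
        ("output", "json"),                  # first row of the result is a header
        ("filter", f"mimetype:{mime_type}"), # server-side mime filter
        ("collapse", "urlkey"),              # one capture per canonical URL
    ]
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_rows(rows: list) -> list:
    """Turn CDX JSON rows into unique original URLs, preserving order."""
    if not rows:
        return []
    header, *captures = rows
    idx = header.index("original")
    seen = {}
    for row in captures:
        seen.setdefault(row[idx], None)
    return list(seen)


# Hand-written sample in the CDX JSON shape (not real capture data).
sample = [
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["gov,example)/a", "20200101000000", "https://example.gov/a", "text/html", "200", "X", "1"],
    ["gov,example)/a", "20210101000000", "https://example.gov/a", "text/html", "200", "Y", "1"],
    ["gov,example)/b", "20210101000000", "https://example.gov/b", "text/html", "200", "Z", "1"],
]
unique_urls = parse_cdx_rows(sample)
```

Deduplicating client-side as well as with `collapse` is a belt-and-braces choice for the sketch; the actual collector may rely on only one of the two.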
…e crawler

Add mocked integration tests (happy path, empty domain, API error) and a
manual lifecycle test hitting the live CDX API. Also fix missing
'internet_archive' value in batch_strategy DB enum and SQLAlchemy model.
The mime_type_allowlist already filters out non-HTML content, making
the static asset file extension patterns unnecessary.
Add missing module, class, and method docstrings (D100-D107) and
type annotations (ANN101, ANN001, ANN201, ANN204) to all Internet
Archive collector files to satisfy flake8 linting requirements.
Update alembic migration down_revision to chain off latest dev head and
fix renamed get_access_info -> get_admin_access_info in IA route.
…ternet-archive-crawler

feat: Internet Archive Collector
- Add pytest-benchmark~=4.0 dev dependency
- Add ContextVar-based timing collector (timing.py) for zero-cost
  production instrumentation via _phase() context managers
- Instrument extract_and_format_get_annotation_result() with per-phase
  timing: format_s, agency_suggestions_s, location_suggestions_s,
  name_suggestions_s, batch_info_s
- Instrument GetNextURLForAllAnnotationQueryBuilder.run() with
  main_query_s timing
- Add benchmark test suite under tests/automated/integration/benchmark/
  with HTTP round-trip and per-phase breakdown tests
- Add GHA workflow (.github/workflows/benchmark.yml) that runs on
  workflow_dispatch or PR to dev, uploads JSON artifact per commit SHA
- Add README with Mermaid sequence diagram of the measured call chain
…566)

Add scale_seed.py with create_scale_data() that seeds 10k URLs plus
geographic hierarchy, agencies, name/agency/location suggestions to
exercise the query planner under realistic load. Add scale_seeder
fixture (module-scoped) and two new benchmark tests that mirror the
existing small-data pair, printing per-phase averages for direct
comparison.
)

The sequential agency_auto_suggestions / add_location_suggestion loops
(4 DB calls × 5k iterations each) blew through the 300s pytest timeout
during fixture setup. Replace with 3 round-trips per suggestion type:
initiate_task → bulk_insert subtasks (return_ids) → bulk_insert suggestions.
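
The round-trip arithmetic behind this change can be illustrated with a counting stand-in for the database. Everything here is hypothetical except the three step names in the comments, which come from the commit message; the commit does not say what the four per-iteration calls were, so they appear as placeholders.

```python
class CountingDB:
    """Hypothetical stand-in for a DB client; it only counts round-trips."""

    def __init__(self) -> None:
        self.round_trips = 0
        self._next_id = 1

    def insert_one(self, row: dict) -> int:
        """One round-trip that stores one row and returns its new id."""
        self.round_trips += 1
        self._next_id += 1
        return self._next_id - 1

    def bulk_insert(self, rows: list) -> list:
        """One round-trip that stores many rows and returns their ids."""
        self.round_trips += 1
        first = self._next_id
        self._next_id += len(rows)
        return list(range(first, self._next_id))


def seed_sequentially(db: CountingDB, url_ids: list) -> None:
    # The commit only says "4 DB calls" per iteration; these rows are placeholders.
    for url_id in url_ids:
        for step in range(4):
            db.insert_one({"url_id": url_id, "step": step})


def seed_in_bulk(db: CountingDB, url_ids: list) -> None:
    task_id = db.insert_one({"kind": "task"})  # initiate_task
    subtask_ids = db.bulk_insert(              # bulk_insert subtasks (return ids)
        [{"task_id": task_id, "url_id": u} for u in url_ids]
    )
    db.bulk_insert(                            # bulk_insert suggestions
        [{"subtask_id": s, "url_id": u} for s, u in zip(subtask_ids, url_ids)]
    )


slow, fast = CountingDB(), CountingDB()
seed_sequentially(slow, list(range(5000)))
seed_in_bulk(fast, list(range(5000)))
```

With 5,000 items the sequential version makes 20,000 round-trips per suggestion type, while the bulk version makes 3 regardless of size.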
SuggestionModel.robo_confidence is typed int; Pydantic rejects float
values with fractional parts (e.g. 0.9). Use 1.0 which coerces cleanly.
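
The coercion rule this commit relies on can be imitated with the standard library alone. The helper below is a hypothetical stand-in that mirrors the described behavior (whole-valued floats pass, fractional ones are rejected); it does not call Pydantic and is not its implementation.

```python
def coerce_to_int(value: float) -> int:
    """Accept a float only when it carries no fractional part."""
    if not float(value).is_integer():
        raise ValueError(f"{value!r} has a fractional part and cannot become an int")
    return int(value)


accepted = coerce_to_int(1.0)  # whole-valued float coerces cleanly

try:
    coerce_to_int(0.9)
    rejected = False
except ValueError:
    rejected = True  # fractional value is refused rather than truncated
```

Refusing instead of truncating matters here: silently turning a confidence of 0.9 into 0 would change the meaning of the seeded data.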
…ags (#566)

Convert both annotation views from regular views to MATERIALIZED VIEWs
with unique indexes on url_id. Add CONCURRENTLY refresh calls to
refresh_materialized_views(). Drops main_query_s from ~177ms to ~0.64ms
at 10k-URL scale (~276x improvement).
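
As a rough illustration of the two pieces this commit pairs together, the sketch below generates the index and refresh statements for a view and checks the quoted speedup. PostgreSQL only allows `REFRESH MATERIALIZED VIEW CONCURRENTLY` when the view has a unique index, which is why the two go together. The helper and the index naming scheme are assumptions; the migration's actual SQL is not shown in this PR page.

```python
def materialized_view_statements(view: str, key: str = "url_id") -> dict:
    """Return the unique-index and concurrent-refresh SQL for one view."""
    index_name = f"{view}_{key}_uidx"  # hypothetical naming scheme
    return {
        "index": f"CREATE UNIQUE INDEX {index_name} ON {view} ({key});",
        "refresh": f"REFRESH MATERIALIZED VIEW CONCURRENTLY {view};",
    }


statements = {
    view: materialized_view_statements(view)
    for view in ("url_annotation_count_view", "url_annotation_flags")
}

# Speedup implied by the figures in the commit message.
before_ms, after_ms = 177.0, 0.64
speedup = before_ms / after_ms  # about 276.6
```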
…test (#566)

With materialized views, URLs added between refreshes have no row in the
view. Inner joins excluded them entirely, making new URLs invisible to
annotators. Switch to LEFT OUTER JOINs so URLs without a view row still
appear with NULL/0 counts.
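
The visibility problem described here can be reproduced in miniature with SQLite. An ordinary table stands in for the stale materialized view, and the `total` column name is an assumption; the point is only the difference between the two join types.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE urls (id INTEGER PRIMARY KEY, url TEXT);
    -- Stand-in for the materialized view: a snapshot taken before URL 3 existed.
    CREATE TABLE url_annotation_count_view (url_id INTEGER PRIMARY KEY, total INTEGER);
    INSERT INTO urls VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO url_annotation_count_view VALUES (1, 4), (2, 1);
    """
)

# Inner join: the URL added after the last refresh disappears.
inner_rows = conn.execute(
    "SELECT u.id, v.total FROM urls u "
    "JOIN url_annotation_count_view v ON v.url_id = u.id ORDER BY u.id"
).fetchall()

# Left outer join: the new URL is kept, with its missing count read as 0.
outer_rows = conn.execute(
    "SELECT u.id, COALESCE(v.total, 0) FROM urls u "
    "LEFT OUTER JOIN url_annotation_count_view v ON v.url_id = u.id ORDER BY u.id"
).fetchall()
conn.close()
```

Here `inner_rows` holds two rows and `outer_rows` holds three, the third being URL 3 with a count of 0.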

The sort test explicitly verifies annotation-count-driven ordering, so it
needs a manual refresh_materialized_views() call after data setup to
reflect counts before querying.
- Migration: Union[str, None] -> Optional[str], add docstrings to upgrade/downgrade
- async_.py: add missing newline at end of file
… heads (#566)

Upstream dev merged 1fb2286a016c (add_internet_archive_to_batch_strategy)
which also chains from 759ce7d0772b, creating two Alembic heads.
Update our migration's down_revision to sit after upstream's.
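
The two-heads situation can be sketched as a small graph check: a head is any revision that no other revision names as its `down_revision`. The two hexadecimal revision ids come from the commit message; `ours` is a placeholder for this branch's migration id, and 759ce7d0772b is treated as the base of the graph only for the purposes of the sketch.

```python
from typing import Dict, Optional, Set


def find_heads(down_revisions: Dict[str, Optional[str]]) -> Set[str]:
    """Return revisions that nothing else points to as its down_revision."""
    parents = {parent for parent in down_revisions.values() if parent is not None}
    return set(down_revisions) - parents


# Before the fix: both migrations chain from the same parent, giving two heads.
before = {
    "759ce7d0772b": None,
    "1fb2286a016c": "759ce7d0772b",
    "ours": "759ce7d0772b",  # placeholder id for this branch's migration
}

# After the fix: our migration sits after upstream's, leaving a single head.
after = dict(before, ours="1fb2286a016c")

heads_before = find_heads(before)
heads_after = find_heads(after)
```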
labradorite-dev and others added 16 commits February 28, 2026 14:29
…o docker-compose

GitGuardian flagged the literal default password added to docs/development.md.
Replaced with a pointer to local_database/docker-compose.yml where it is already defined.
- docs(benchmark): define 'phase' clearly in README
- ci(benchmark): restrict benchmark workflow to workflow_dispatch only
- fix(migration): restore internet_archive migration to match dev (style-only drift from lint passes)
- fix(annotate): remove isouter from mat view joins; inner join is correct since missing rows are unusable
…_save_status_polling_fix

fix(ia-save): validate save job status before recording success
…4-remove-agency-described-flag

Remove agency_described_not_in_database from data source metadata
…2-remove-reviewdog

Remove ReviewDog and add lint directives to CLAUDE.md
…5-federal-agencies-national-location

Ensure federal agencies are linked only to US national location
Keep reviewdog approach since .flake8 config centralizes ignore settings,
replacing the inline --ignore flags from dev.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…7-add-prek-lint-hook

Add prek-based pre-commit lint hook
…4-sensitive-endpoint-auth

Add auth guards to sensitive mutating endpoints
…otation-load-time

Resolve uv.lock conflict by regenerating lockfile.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three migrations branched from 1fb2286a016c after merging dev.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…566-optimize-annotation-load-time

feat(db): improve `GET /annotate/all` performance

gitguardian bot commented Mar 9, 2026

⚠️ GitGuardian has uncovered 3 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secrets in your pull request
| GitGuardian id | GitGuardian status | Secret | Commit | Filename |
| --- | --- | --- | --- | --- |
| 15086721 | Triggered | Generic Password | baf7550 | .github/workflows/benchmark.yml |
| 15086721 | Triggered | Generic Password | ee09ea1 | .github/workflows/benchmark.yml |
| 15086721 | Triggered | Generic Password | ee09ea1 | .github/workflows/benchmark.yml |
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secrets safely, following secret-management best practices.
  3. Revoke and rotate these secrets.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.


@maxachis changed the title from "Dev" to "Release: Internet Archive Collector" on Mar 9, 2026
@maxachis changed the title from "Release: Internet Archive Collector" to "Dev to Main — March 2026" on Mar 9, 2026