Implement a new collector that uses the Internet Archive CDX API to discover archived URLs on domains PDAP already knows about. Users provide seed URLs, domains are extracted, and the Wayback Machine is searched for all archived pages with filtering for mime types, URL patterns, and dedup.
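The flow described in this commit (seed URLs, then domains, then a filtered CDX search) can be sketched roughly as below. This is a minimal illustration and not the collector's actual code: `extract_domains` and `build_cdx_query` are hypothetical names, and the chosen CDX parameters (`matchType=domain`, a `mimetype` filter, `collapse=urlkey` for dedup) are assumptions about how such a query might be shaped. No network request is made.

```python
from urllib.parse import urlencode, urlparse

# Public CDX search endpoint of the Wayback Machine.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def extract_domains(seed_urls):
    """Return unique hostnames from seed URLs, in first-seen order (hypothetical helper)."""
    seen = set()
    domains = []
    for url in seed_urls:
        host = urlparse(url).hostname  # lowercased; None if the URL has no netloc
        if not host:
            continue
        if host.startswith("www."):
            host = host[4:]
        if host not in seen:
            seen.add(host)
            domains.append(host)
    return domains


def build_cdx_query(domain, mime_type="text/html"):
    """Build a CDX query URL for every archived page under a domain (assumed parameters)."""
    params = [
        ("url", domain),
        ("matchType", "domain"),              # include subdomains
        ("output", "json"),
        ("fl", "original,timestamp,mimetype"),
        ("filter", f"mimetype:{mime_type}"),  # server-side mime type filter
        ("collapse", "urlkey"),               # one row per distinct URL
    ]
    return f"{CDX_ENDPOINT}?{urlencode(params)}"
```

Filtering and collapsing on the server side keeps the response small; any URL-pattern filtering would still have to happen client-side in this sketch.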
…e crawler Add mocked integration tests (happy path, empty domain, API error) and a manual lifecycle test hitting the live CDX API. Also fix missing 'internet_archive' value in batch_strategy DB enum and SQLAlchemy model.
The mime_type_allowlist already filters out non-HTML content, making the static asset file extension patterns unnecessary.
Add missing module, class, and method docstrings (D100-D107) and type annotations (ANN101, ANN001, ANN201, ANN204) to all Internet Archive collector files to satisfy flake8 linting requirements.
Update alembic migration down_revision to chain off latest dev head and fix renamed get_access_info -> get_admin_access_info in IA route.
…ternet-archive-crawler feat: Internet Archive Collector
- Add pytest-benchmark~=4.0 dev dependency
- Add ContextVar-based timing collector (timing.py) for zero-cost production instrumentation via _phase() context managers
- Instrument extract_and_format_get_annotation_result() with per-phase timing: format_s, agency_suggestions_s, location_suggestions_s, name_suggestions_s, batch_info_s
- Instrument GetNextURLForAllAnnotationQueryBuilder.run() with main_query_s timing
- Add benchmark test suite under tests/automated/integration/benchmark/ with HTTP round-trip and per-phase breakdown tests
- Add GHA workflow (.github/workflows/benchmark.yml) that runs on workflow_dispatch or PR to dev, uploads JSON artifact per commit SHA
- Add README with Mermaid sequence diagram of the measured call chain
…566) Add scale_seed.py with create_scale_data() that seeds 10k URLs plus geographic hierarchy, agencies, name/agency/location suggestions to exercise the query planner under realistic load. Add scale_seeder fixture (module-scoped) and two new benchmark tests that mirror the existing small-data pair, printing per-phase averages for direct comparison.
SuggestionModel.robo_confidence is typed int; Pydantic rejects float values with fractional parts (e.g. 0.9). Use 1.0 which coerces cleanly.
…ags (#566) Convert both annotation views from regular views to MATERIALIZED VIEWs with unique indexes on url_id. Add CONCURRENTLY refresh calls to refresh_materialized_views(). Drops main_query_s from ~177ms to ~0.64ms at 10k-URL scale (~276x improvement).
…test (#566) With materialized views, URLs added between refreshes have no row in the view. Inner joins excluded them entirely, making new URLs invisible to annotators. Switch to LEFT OUTER JOINs so URLs without a view row still appear with NULL/0 counts. The sort test explicitly verifies annotation-count-driven ordering, so it needs a manual refresh_materialized_views() call after data setup to reflect counts before querying.
- Migration: Union[str, None] -> Optional[str], add docstrings to upgrade/downgrade
- async_.py: add missing newline at end of file
… heads (#566) Upstream dev merged 1fb2286a016c (add_internet_archive_to_batch_strategy) which also chains from 759ce7d0772b, creating two Alembic heads. Update our migration's down_revision to sit after upstream's.
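To illustrate why two migrations chaining from the same parent produce two heads, here is a small plain-Python sketch that models revisions as a mapping from revision to `down_revision`. It does not use Alembic itself; `find_heads` and the `our_migration` identifier are hypothetical, and treating 759ce7d0772b as the root is a simplification for the example.

```python
def find_heads(down_revisions):
    """Return revisions that no other revision names as its down_revision."""
    parents = {parent for parent in down_revisions.values() if parent is not None}
    return sorted(rev for rev in down_revisions if rev not in parents)


# Before the fix: both migrations chain from 759ce7d0772b, so there are two heads.
before = {
    "759ce7d0772b": None,            # treated as the root here (simplification)
    "1fb2286a016c": "759ce7d0772b",  # upstream: add_internet_archive_to_batch_strategy
    "our_migration": "759ce7d0772b", # hypothetical id for this branch's migration
}

# After the fix: our migration sits after upstream's, leaving a single head.
after = dict(before, our_migration="1fb2286a016c")

heads_before = find_heads(before)
heads_after = find_heads(after)
```

In this model the fix is exactly the one-line change the commit describes: pointing the branch migration's `down_revision` at the upstream revision.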
…o docker-compose GitGuardian flagged the literal default password added to docs/development.md. Replaced with a pointer to local_database/docker-compose.yml where it is already defined.
- docs(benchmark): define 'phase' clearly in README
- ci(benchmark): restrict benchmark workflow to workflow_dispatch only
- fix(migration): restore internet_archive migration to match dev (style-only drift from lint passes)
- fix(annotate): remove isouter from mat view joins; inner join is correct since missing rows are unusable
…_save_status_polling_fix fix(ia-save): validate save job status before recording success
…4-remove-agency-described-flag Remove agency_described_not_in_database from data source metadata
…2-remove-reviewdog Remove ReviewDog and add lint directives to CLAUDE.md
…5-federal-agencies-national-location Ensure federal agencies are linked only to US national location
Keep reviewdog approach since .flake8 config centralizes ignore settings, replacing the inline --ignore flags from dev. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…7-add-prek-lint-hook Add prek-based pre-commit lint hook
…4-sensitive-endpoint-auth Add auth guards to sensitive mutating endpoints
…otation-load-time Resolve uv.lock conflict by regenerating lockfile. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three migrations branched from 1fb2286a016c after merging dev. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…566-optimize-annotation-load-time feat(db): improve `GET /annotate/all` performance
| GitGuardian id | GitGuardian status | Secret | Commit | Filename | |
|---|---|---|---|---|---|
| 15086721 | Triggered | Generic Password | baf7550 | .github/workflows/benchmark.yml | View secret |
| 15086721 | Triggered | Generic Password | ee09ea1 | .github/workflows/benchmark.yml | View secret |
| 15086721 | Triggered | Generic Password | ee09ea1 | .github/workflows/benchmark.yml | View secret |
🛠 Guidelines to remediate hardcoded secrets
- Understand the implications of revoking this secret by investigating where it is used in your code.
- Replace and store your secrets safely. Learn here the best practices.
- Revoke and rotate these secrets.
- If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.
To avoid such incidents in the future consider
- following these best practices for managing and storing secrets including API keys and other credentials
- install secret detection on pre-commit to catch secret before it leaves your machine and ease remediation.
🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.
Summary
Merges `dev` into `main`, collecting 8 PRs' worth of features, fixes, and infrastructure improvements.
PRs Included
- #601 - feat(db): improve `GET /annotate/all` performance
- Remove `agency_described_not_in_database` from data source metadata
Issues Resolved
- #514 - Remove `agency_described_not_in_database` property from data sources (`url_optional_data_source_metadata.agency_described_not_in_database`)
Overall Changes
Performance
- Materialized `url_annotation_count_view` and `url_annotation_flags` views with unique indexes, yielding up to 92% improvement on `GET /annotate/all` at scale
- `pytest-benchmark` infrastructure with 10k-URL scale seeder, pyinstrument profiling, and CI job summary reporting
New feature
Security & auth
Developer experience
Data model cleanup
- Remove `agency_described_not_in_database` from data source metadata
🤖 Generated with Claude Code