Skip to content

fix: PSv2 follow-up fixes from integration tests#1135

Draft
mihow wants to merge 5 commits intomainfrom
fix/nats-connection-safety
Draft

fix: PSv2 follow-up fixes from integration tests#1135
mihow wants to merge 5 commits intomainfrom
fix/nats-connection-safety

Conversation

@mihow
Copy link
Collaborator

@mihow mihow commented Feb 17, 2026

Summary

The PSv2 seems to work sometimes, but some events cause no tasks to be returned, NATS to timeout or Django to become unresponsive.

Here are some findings from my session 2 out of 3. More to come!


Fixes discovered during PSv2 integration testing where stale ADC workers overwhelmed Django by flooding /tasks with concurrent NATS reserve requests.

  • NATS get_connection() now uses connect_timeout=5, allow_reconnect=False to prevent leaked reconnection loops from blocking Django's async event loop
  • /tasks endpoint returns empty for terminal-status jobs instead of attempting NATS reserve
  • IncompleteJobFilter excludes jobs by top-level status (not just progress JSON stages), so workers don't pick up manually-failed jobs
  • Integration test script kills stale ADC workers before starting and uses AMI_NUM_WORKERS=0

Context

During integration testing, a stale ADC worker from a previous test run consumed all 20 NATS messages before the new worker could fetch them. The stale worker spawned multiple DataLoader subprocesses, generating 147 concurrent /tasks requests against Django's single uvicorn worker thread. Combined with NATS connections that defaulted to allow_reconnect=True, this blocked the entire event loop and made Django unresponsive.

TODO & TO-TRY

Test plan

  • Run scripts/psv2_integration_test.sh 20 with clean state (no stale workers)
  • Verify Django stays responsive during worker task fetching
  • Verify terminal-status jobs return empty tasks without hitting NATS
  • Verify incomplete_only=1 excludes FAILURE/SUCCESS/REVOKED jobs

Full findings: docs/claude/planning/nats-flooding-prevention.md

mihow and others added 2 commits February 16, 2026 22:43
- Add connect_timeout=5, allow_reconnect=False to NATS connections to
  prevent leaked reconnection loops from blocking Django's event loop
- Guard /tasks endpoint against terminal-status jobs (return empty tasks
  instead of attempting NATS reserve)
- IncompleteJobFilter now excludes jobs by top-level status in addition
  to progress JSON stages
- Add stale worker cleanup to integration test script

Found during PSv2 integration testing where stale ADC workers with
default DataLoader parallelism overwhelmed the single uvicorn worker
thread by flooding /tasks with concurrent NATS reserve requests.

Co-Authored-By: Claude <noreply@anthropic.com>
Session notes from 2026-02-16 integration test including root cause
analysis of stale worker task competition and NATS connection issues.
Findings doc tracks applied fixes and remaining TODOs with priorities.

Co-Authored-By: Claude <noreply@anthropic.com>
@netlify
Copy link

netlify bot commented Feb 17, 2026

Deploy Preview for antenna-preview canceled.

Name Link
🔨 Latest commit 17d4c6b
🔍 Latest deploy log https://app.netlify.com/projects/antenna-preview/deploys/699416cc04a9c9000829012d

@netlify
Copy link

netlify bot commented Feb 17, 2026

Deploy Preview for antenna-ssec canceled.

Name Link
🔨 Latest commit 17d4c6b
🔍 Latest deploy log https://app.netlify.com/projects/antenna-ssec/deploys/699416cc6107f90008b6fcef

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 17, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/nats-connection-safety

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

mihow and others added 3 commits February 16, 2026 23:04
PSv2 integration test passed end-to-end (job 1380, 20/20 images).
Identified ack_wait=300s as cause of ~5min idle time when GPU
processes race for NATS tasks.

Co-Authored-By: Claude <noreply@anthropic.com>
Replace N×1 reserve_task() calls with single reserve_tasks() batch
fetch. The previous implementation created a new pull subscription per
message (320 NATS round trips for batch=64), causing the /tasks endpoint
to exceed HTTP client timeouts. The new approach uses one psub.fetch()
call for the entire batch.

Co-Authored-By: Claude <noreply@anthropic.com>
@mihow mihow changed the title fix: prevent NATS connection flooding and stale job task fetching fix: PSv2 follow-up fixes from integration tests Feb 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant

Comments