
Conversation

@sirtimid
Contributor

@sirtimid sirtimid commented Dec 18, 2025

Closes #660

  • Add connection limit (default 100 concurrent connections)
  • Add message size limit (default 1MB per message)
  • Add stale peer cleanup (removes data for peers disconnected >1 hour)
  • Make all limits configurable via RemoteCommsOptions
  • Add ResourceLimitError for limit violations
  • Add comprehensive tests for all resource limits

This prevents memory exhaustion and manages system resources by:

  • Rejecting new connections when limit is reached
  • Rejecting messages exceeding size limit
  • Periodically cleaning up stale peer data
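As a rough illustration, resolving these options against the stated defaults might look like the following sketch (option names come from the PR notes; the real RemoteCommsOptions shape in the package may differ):

```typescript
// Hypothetical sketch: option names from the PR description; the actual
// RemoteCommsOptions type may differ.
type RemoteCommsOptions = {
  maxConcurrentConnections?: number;
  maxMessageSizeBytes?: number;
  cleanupIntervalMs?: number;
  stalePeerTimeoutMs?: number;
};

const defaultLimits: Required<RemoteCommsOptions> = {
  maxConcurrentConnections: 100, // reject new connections past this
  maxMessageSizeBytes: 1024 * 1024, // 1 MB per message
  cleanupIntervalMs: 15 * 60 * 1000, // sweep stale peers every 15 minutes
  stalePeerTimeoutMs: 60 * 60 * 1000, // drop peers disconnected > 1 hour
};

function resolveLimits(
  options: RemoteCommsOptions = {},
): Required<RemoteCommsOptions> {
  return { ...defaultLimits, ...options };
}
```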

Note

Implements resource management for remote communications and a dedicated error type for limit violations.

  • Adds ResourceLimitError (code RESOURCE_LIMIT_ERROR) with marshal/unmarshal support; exports wired into errorClasses and package index
  • Enforces limits in remotes/network.ts with new options: maxConcurrentConnections, maxMessageSizeBytes, cleanupIntervalMs, stalePeerTimeoutMs; validates message size, blocks/tears down excess connections (outbound and inbound), and periodically cleans stale peer data
  • Refactors connection handling: registerChannel, reuseOrReturnChannel, improved reconnection loop (race handling, intentional closes, queue flushing, timestamp updates)
  • Introduces ConnectionFactory.closeChannel to explicitly close/abort underlying streams; used when replacing or rejecting channels
  • Extends RemoteCommsOptions to include new limits; updates tests extensively for limits, reconnection reuse, cleanup behavior, and race conditions; adjusts coverage thresholds

Written by Cursor Bugbot for commit 982f9b8. This will update automatically on new commits.

@sirtimid sirtimid requested a review from a team as a code owner December 18, 2025 14:38
@sirtimid sirtimid force-pushed the sirtimid/remote-comms-resource-limits branch from a40d3eb to 96f26e2 on December 18, 2025 15:05
@sirtimid sirtimid force-pushed the sirtimid/remote-comms-resource-limits branch from b6c23e0 to 25e81ac on December 19, 2025 19:55
@sirtimid sirtimid force-pushed the sirtimid/remote-comms-resource-limits branch from 8e04986 to ce1e774 on January 5, 2026 15:35
@rekmarks
Member

rekmarks commented Jan 5, 2026

@cursor review

@sirtimid sirtimid force-pushed the sirtimid/remote-comms-resource-limits branch from ce1e774 to 8323837 on January 5, 2026 20:51
@sirtimid sirtimid requested a review from FUDCo January 5, 2026 20:51
Contributor

@FUDCo FUDCo left a comment


This looks fine for what it is.

One lingering question I have is that as we get more careful about detecting error conditions (or proclaiming them, in the case of configurable resource limits), are we potentially setting ourselves up for situations where an error bubbles up to user code at some location other than the location that is actually responsible for causing it? In other words, will a message transmission error always find its way back to the actual send operation that triggered it?

More generally, could user code find itself in an unrecoverable error state (by that I don't mean a state where there's an error that you can't get rid of -- things can always break unfixably, e.g., a remote host dies forever -- but rather a state where you don't actually know you're stuck). It's entirely plausible to me that everything is fine, but I can't tell from reading the tests whether our tests give us reason to believe we are ok on this score.

@sirtimid
Contributor Author

sirtimid commented Jan 6, 2026

One lingering question I have is that as we get more careful about detecting error conditions (or proclaiming them, in the case of configurable resource limits), are we potentially setting ourselves up for situations where an error bubbles up to user code at some location other than the location that is actually responsible for causing it? In other words, will a message transmission error always find its way back to the actual send operation that triggered it?

More generally, could user code find itself in an unrecoverable error state (by that I don't mean a state where there's an error that you can't get rid of -- things can always break unfixably, e.g., a remote host dies forever -- but rather a state where you don't actually know you're stuck). It's entirely plausible to me that everything is fine, but I can't tell from reading the tests whether our tests give us reason to believe we are ok on this score.

@FUDCo Two parts to your question:

Error attribution: Yeah, transmission errors are always caught within the specific sendRemoteMessage() call that triggered them. Each call is independent with its own try/catch, so errors won't bubble up to the wrong location. The only errors that surface to callers are synchronous validation errors (ResourceLimitError, intentional close), which also come from the correct call site.

Stuck without knowing: Currently sendRemoteMessage is fire-and-forget—messages queue during reconnection, but if retries are exhausted, they're silently lost. The caller never knows. This is exactly what the Message Acknowledgment work addresses: sendRemoteMessage will resolve on ACK or reject after retries, so user code will know definitively.
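For context, the ACK-based behavior described above could look roughly like this sketch (the names sendWithAck, handleAck, and the pending-ack map are hypothetical; the actual Message Acknowledgment design may differ):

```typescript
// Hypothetical sketch of an ACK-resolved send; not the PR's implementation.
const pendingAcks = new Map<
  string,
  { resolve: () => void; reject: (error: Error) => void }
>();

function sendWithAck(
  id: string,
  transmit: (id: string) => void,
  timeoutMs: number,
): Promise<void> {
  return new Promise((resolve, reject) => {
    // Reject if no ACK arrives in time, so the caller is never stuck unknowingly.
    const timer = setTimeout(() => {
      pendingAcks.delete(id);
      reject(new Error(`no ACK for message ${id}`));
    }, timeoutMs);
    pendingAcks.set(id, {
      resolve: () => {
        clearTimeout(timer);
        resolve();
      },
      reject,
    });
    transmit(id);
  });
}

function handleAck(id: string): void {
  pendingAcks.get(id)?.resolve();
  pendingAcks.delete(id);
}
```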

@sirtimid sirtimid enabled auto-merge (squash) January 6, 2026 19:48
@sirtimid sirtimid disabled auto-merge January 7, 2026 08:12
@sirtimid sirtimid force-pushed the sirtimid/remote-comms-resource-limits branch from 71fe98b to b9f908b on January 7, 2026 15:10
@claude

claude bot commented Jan 7, 2026

PR Review: Resource Limits for Remote Communications

I've completed a comprehensive review of this PR. Overall, this is high-quality work with excellent attention to detail on resource management and edge cases. The implementation addresses critical DoS vulnerabilities and includes thorough test coverage. Below are my findings:


✅ Strengths

1. Excellent Error Design

  • ResourceLimitError follows the existing error pattern perfectly with marshal/unmarshal support
  • Includes structured data field with limitType, current, and limit for detailed diagnostics
  • Properly uses harden() for object capability security
  • Comprehensive test coverage (233 lines) with parameterized tests
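For reference, the diagnostic shape described above could be sketched roughly as follows (a simplified stand-in; the real class also wires up marshal/unmarshal support and harden()):

```typescript
// Simplified stand-in for ResourceLimitError; the real implementation
// lives in the errors package with marshal/unmarshal and hardening.
class ResourceLimitError extends Error {
  readonly code = 'RESOURCE_LIMIT_ERROR';

  constructor(
    message: string,
    readonly data: { limitType: string; current: number; limit: number },
  ) {
    super(message);
    this.name = 'ResourceLimitError';
  }
}

const error = new ResourceLimitError('connection limit exceeded', {
  limitType: 'maxConcurrentConnections',
  current: 101,
  limit: 100,
});
```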

2. Robust Resource Limit Enforcement

The implementation is thorough and handles numerous edge cases:

  • Connection limit: Checked before dial, after dial (for race conditions), on inbound connections
  • Message size limit: Validated before queuing, before dialing, with proper byte-accurate measurement
  • Stale peer cleanup: Periodic removal of disconnected peers (1 hour timeout, 15-minute intervals)
  • Race condition protection: Multiple checks to handle concurrent dials and inbound connections

3. Channel Lifecycle Management

  • New closeChannel() method properly attempts close() then falls back to abort()
  • Explicit channel closing prevents resource leaks throughout the codebase
  • Smart channel reuse logic in reuseOrReturnChannel()
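The close-then-abort fallback can be sketched in isolation like this (the Closeable interface here is a hypothetical minimal stand-in; the real method operates on the factory's channel type and a peer id):

```typescript
// Hypothetical minimal interface; real channels come from ConnectionFactory.
type Closeable = {
  close(): Promise<void>;
  abort(reason?: Error): void;
};

async function closeChannel(channel: Closeable): Promise<void> {
  try {
    await channel.close(); // attempt graceful close first
  } catch (error) {
    // close() failed (e.g. stream already broken): hard-abort instead
    channel.abort(error as Error);
  }
}
```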

4. Comprehensive Test Coverage

  • 779 new lines of tests for network.ts alone
  • Tests cover limits, cleanup, race conditions, channel closing, and reconnection scenarios
  • Proper use of it.each() for parameterized tests
  • Good use of vi.waitFor() for async race condition testing

5. Security Considerations

  • Prevents memory exhaustion attacks via connection/message limits
  • Rejects connections from intentionally closed peers
  • Proper validation before queuing to prevent queue overflow
  • Stale peer cleanup prevents indefinite memory growth

🔍 Issues & Recommendations

Critical

None - No critical issues found.

High Priority

1. Error Propagation Concern (addresses @FUDCo's comment)

The current implementation has a potential issue where resource limit errors may not always propagate back to the caller:

Location: network.ts:762-773

try {
  checkConnectionLimit();
} catch {
  // Connection limit reached - close the dialed channel and queue the message
  logger.log(
    `${targetPeerId}:: connection limit reached after dial, queueing message`,
  );
  await connectionFactory.closeChannel(channel, targetPeerId);
  currentQueue.enqueue(message);
  handleConnectionLoss(targetPeerId, channel);
  return; // ⚠️ Returns silently, caller doesn't know message was queued
}

Problem: When the connection limit is reached after dial, the message is silently queued and reconnection is triggered. The caller receives a successful Promise resolution but the message wasn't actually sent.

Similar issue at: network.ts:377-392 (reconnection loop)

Recommendation: Consider one of these approaches:

  1. Throw ResourceLimitError back to the caller so they know the send failed
  2. Return a status object: { sent: boolean, queued: boolean, reason?: string }
  3. Add a callback/event for "message queued due to limit" scenarios

This ensures user code knows when messages are delayed vs. successfully sent, avoiding the "unrecoverable error state" concern.


2. TextEncoder Creation Efficiency

Location: network.ts:490

function validateMessageSize(message: string): void {
  const messageSizeBytes = new TextEncoder().encode(message).length;
  // ...
}

Issue: Creates a new TextEncoder instance for every message validation. This is called on the hot path for every message send.

Recommendation:

const messageEncoder = new TextEncoder(); // Create once at module level

function validateMessageSize(message: string): void {
  const messageSizeBytes = messageEncoder.encode(message).length;
  // ...
}

Medium Priority

3. Race Condition in Cleanup

Location: network.ts:626-667

The cleanupStalePeers() function could have a subtle race condition:

for (const peerId of stalePeers) {
  lastConnectionTime.delete(peerId);
  messageQueues.delete(peerId);
  // ...
}

Issue: If a connection is established between when stalePeers array is built and when cleanup executes, we might delete a newly-active peer's data.

Recommendation: Add a final check before deletion:

for (const peerId of stalePeers) {
  // Re-check that peer is still stale (not reconnected since we checked)
  if (channels.has(peerId) || reconnectionManager.isReconnecting(peerId)) {
    continue; // Skip cleanup, peer is active again
  }
  lastConnectionTime.delete(peerId);
  // ...
}

4. Missing TypeDoc Documentation

Location: types.ts:30-67

The new RemoteCommsOptions fields lack TypeDoc comments explaining units and behavior:

maxConcurrentConnections?: number | undefined; // Missing: what happens when reached?
maxMessageSizeBytes?: number | undefined;      // Missing: exact behavior on exceed
stalePeerTimeoutMs?: number | undefined;       // Missing: what data is cleaned up?

Recommendation: Add detailed TypeDoc comments per CLAUDE.md conventions:

/**
 * Maximum message size in bytes (default: 1MB).
 * Messages exceeding this limit will be immediately rejected with ResourceLimitError.
 */
maxMessageSizeBytes?: number | undefined;

5. Potential Integer Overflow in Test

Location: network.test.ts (message size tests, from diff)

When testing message size limits with values like 1024 * 1024 + 1, ensure tests also cover boundary cases around 32-bit integer limits if maxMessageSizeBytes could be configured that high.

Recommendation: Add a test comment documenting the max safe value or add explicit validation in the options.
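One possible guard, sketched here as an assumption (the PR may or may not validate option values this way):

```typescript
// Hypothetical guard: reject configured byte limits that are not positive
// safe integers, so size comparisons cannot silently lose precision.
function assertValidByteLimit(value: number): void {
  if (!Number.isSafeInteger(value) || value <= 0) {
    throw new RangeError(
      `maxMessageSizeBytes must be a positive safe integer, got ${value}`,
    );
  }
}
```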


Low Priority

6. Code Style: Error Handling Pattern

Location: Multiple locations use empty catch {}

Per CLAUDE.md, prefer explicit error handling. Several places catch and ignore errors:

} catch {
  logger.log(`${targetPeerId}:: connection limit reached...`);
  // ...
}

Recommendation: Make error variable explicit:

} catch (limitError) {
  logger.log(`${targetPeerId}:: connection limit reached...`);
  // Could log limitError.message for better diagnostics
}

7. Magic Number for SCTP Code

Location: network.ts:183

const SCTP_USER_INITIATED_ABORT = 12; // RFC 4960

Note: Good use of a constant! Consider extracting to a shared constants file if used elsewhere, but current location is fine for now.


📊 Performance & Scale

Good:

  • Connection limit prevents resource exhaustion
  • Message size limit prevents large message attacks
  • Stale peer cleanup prevents memory leaks
  • Efficient Map-based lookups for channel/queue management

Consider:

  • At 100 connections with frequent messages, the validateMessageSize() overhead could be noticeable
  • Cleanup interval of 15 minutes is reasonable, but consider making it adaptive based on connection count
  • The double-check pattern for connection limits adds complexity but is necessary for correctness

🔒 Security Assessment

✅ Well Protected:

  • DoS via connection flood: Limited to 100 concurrent connections
  • DoS via large messages: Limited to 1MB per message
  • Memory exhaustion: Stale peer cleanup after 1 hour
  • Intentional close bypass: Properly tracked and enforced

⚠️ Considerations:

  1. Queue exhaustion per peer: Already protected (200 messages/peer by default)
  2. Rapid connect/disconnect: Could trigger cleanup/reconnection storms (consider rate limiting in future)
  3. Hint flooding: registerLocationHints() merges hints unboundedly - consider max hints per peer
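A bounded merge could be as simple as the following sketch (the cap value and function name are assumptions for illustration, not from the PR):

```typescript
// Hypothetical cap; the PR currently merges hints without a bound.
const MAX_HINTS_PER_PEER = 20; // assumed value for illustration

function mergeLocationHints(
  existing: string[],
  incoming: string[],
): string[] {
  const merged = [...new Set([...existing, ...incoming])]; // dedupe, keep order
  return merged.slice(0, MAX_HINTS_PER_PEER);
}
```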

📝 Code Quality

Adherence to CLAUDE.md: ✅ Excellent

  • ✅ Uses TypeScript with explicit return types
  • ✅ Uses harden() for ocap security
  • ✅ Uses @metamask/superstruct for validation
  • ✅ Proper TypeDoc documentation (minor gaps noted)
  • ✅ camelCase for functions, PascalCase for classes
  • ✅ Test titles use verb forms without "should"
  • ✅ Proper use of vi.fn() and mocking

Minor Style Issues:

  • Some functions exceed 50 lines (e.g., attemptReconnection, sendRemoteMessage) but complexity is justified
  • Consider extracting some nested logic into helper functions for readability

🧪 Test Coverage

Excellent Coverage:

  • Connection limits (at boundary, exceeded, custom limits)
  • Message size limits (exact boundary, exceeded, during reconnection)
  • Stale peer cleanup (timeout, active peers excluded)
  • Race conditions (concurrent dials, inbound during reconnection)
  • Channel lifecycle (close, abort, fallback)
  • Integration scenarios

Potential Gaps:

  1. Cleanup during reconnection: Test that cleanup doesn't interfere with active reconnection
  2. Multiple peers hitting limit simultaneously: Stress test with N+1 peers where N=limit
  3. Hint flooding: Test registerLocationHints() with many hints

🎯 Recommendations Summary

Must Address:

  1. Error propagation - Ensure callers know when messages are queued vs. sent (High Priority #1)

Should Address:

  1. TextEncoder efficiency - Move to module level (High Priority #2)
  2. Cleanup race condition - Add active-check before deletion (Medium Priority #3)

Nice to Have:

  1. TypeDoc for new options
  2. Explicit error handling in catch blocks
  3. Consider max hints per peer

✅ Final Verdict

This PR is approved with minor recommendations. The resource limit implementation is solid, well-tested, and addresses real security concerns. The code quality is high and follows project conventions.

Recommended Actions:

  1. Address High Priority #1 (error propagation) to resolve @FUDCo's concern
  2. Consider High Priority #2 (TextEncoder) for performance
  3. Others can be addressed in follow-up PRs if desired

Great work! 🚀

@sirtimid sirtimid force-pushed the sirtimid/remote-comms-resource-limits branch from b9c906e to 982f9b8 on January 8, 2026 16:06
@rekmarks rekmarks requested a review from FUDCo January 8, 2026 18:46

Development

Successfully merging this pull request may close these issues.

Remote comms: Implement Resource Limits
