
Conversation

@sirtimid
Contributor

@sirtimid sirtimid commented Dec 18, 2025

Closes #660

  • Add connection limit (default 100 concurrent connections)
  • Add message size limit (default 1MB per message)
  • Add stale peer cleanup (removes data for peers disconnected >1 hour)
  • Make all limits configurable via RemoteCommsOptions
  • Add ResourceLimitError for limit violations
  • Add comprehensive tests for all resource limits

This prevents memory exhaustion and manages system resources by:

  • Rejecting new connections when limit is reached
  • Rejecting messages exceeding size limit
  • Periodically cleaning up stale peer data
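As a rough illustration, resolving these options against the stated defaults might look like the following sketch (option names come from the PR notes; the real RemoteCommsOptions shape in the package may differ):

```typescript
// Hypothetical sketch: option names from the PR description; the actual
// RemoteCommsOptions type may differ.
type RemoteCommsOptions = {
  maxConcurrentConnections?: number;
  maxMessageSizeBytes?: number;
  cleanupIntervalMs?: number;
  stalePeerTimeoutMs?: number;
};

const defaultLimits: Required<RemoteCommsOptions> = {
  maxConcurrentConnections: 100, // reject new connections past this
  maxMessageSizeBytes: 1024 * 1024, // 1 MB per message
  cleanupIntervalMs: 15 * 60 * 1000, // sweep stale peers every 15 minutes
  stalePeerTimeoutMs: 60 * 60 * 1000, // drop peers disconnected > 1 hour
};

function resolveLimits(
  options: RemoteCommsOptions = {},
): Required<RemoteCommsOptions> {
  return { ...defaultLimits, ...options };
}
```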

Note

Implements resource management for remote communications and a dedicated error type for limit violations.

  • Adds ResourceLimitError (code RESOURCE_LIMIT_ERROR) with marshal/unmarshal support; exports wired into errorClasses and package index
  • Enforces limits in remotes/network.ts with new options: maxConcurrentConnections, maxMessageSizeBytes, cleanupIntervalMs, stalePeerTimeoutMs; validates message size, blocks/tears down excess connections (outbound and inbound), and periodically cleans stale peer data
  • Refactors connection handling: registerChannel, reuseOrReturnChannel, improved reconnection loop (race handling, intentional closes, queue flushing, timestamp updates)
  • Introduces ConnectionFactory.closeChannel to explicitly close/abort underlying streams; used when replacing or rejecting channels
  • Extends RemoteCommsOptions to include new limits; updates tests extensively for limits, reconnection reuse, cleanup behavior, and race conditions; adjusts coverage thresholds

Written by Cursor Bugbot for commit 982f9b8. This will update automatically on new commits.

@sirtimid sirtimid requested a review from a team as a code owner December 18, 2025 14:38
@sirtimid sirtimid force-pushed the sirtimid/remote-comms-resource-limits branch from a40d3eb to 96f26e2 on December 18, 2025 15:05
@sirtimid sirtimid force-pushed the sirtimid/remote-comms-resource-limits branch from b6c23e0 to 25e81ac on December 19, 2025 19:55
@sirtimid sirtimid force-pushed the sirtimid/remote-comms-resource-limits branch from 8e04986 to ce1e774 on January 5, 2026 15:35
@rekmarks
Member

rekmarks commented Jan 5, 2026

@cursor review

@sirtimid sirtimid force-pushed the sirtimid/remote-comms-resource-limits branch from ce1e774 to 8323837 on January 5, 2026 20:51
@sirtimid sirtimid requested a review from FUDCo January 5, 2026 20:51
Contributor

@FUDCo FUDCo left a comment


This looks fine for what it is.

One lingering question I have is that as we get more careful about detecting error conditions (or proclaiming them, in the case of configurable resource limits), are we potentially setting ourselves up for situations where an error bubbles up to user code at some location other than the location that is actually responsible for causing it? In other words, will a message transmission error always find its way back to the actual send operation that triggered it?

More generally, could user code find itself in an unrecoverable error state (by that I don't mean a state where there's an error that you can't get rid of -- things can always break unfixably, e.g., a remote host dies forever -- but rather a state where you don't actually know you're stuck). It's entirely plausible to me that everything is fine, but I can't tell from reading the tests whether our tests give us reason to believe we are ok on this score.

@sirtimid
Contributor Author

sirtimid commented Jan 6, 2026

One lingering question I have is that as we get more careful about detecting error conditions (or proclaiming them, in the case of configurable resource limits), are we potentially setting ourselves up for situations where an error bubbles up to user code at some location other than the location that is actually responsible for causing it? In other words, will a message transmission error always find its way back to the actual send operation that triggered it?

More generally, could user code find itself in an unrecoverable error state (by that I don't mean a state where there's an error that you can't get rid of -- things can always break unfixably, e.g., a remote host dies forever -- but rather a state where you don't actually know you're stuck). It's entirely plausible to me that everything is fine, but I can't tell from reading the tests whether our tests give us reason to believe we are ok on this score.

@FUDCo Two parts to your question:

Error attribution: Yeah, transmission errors are always caught within the specific sendRemoteMessage() call that triggered them. Each call is independent with its own try/catch, so errors won't bubble up to the wrong location. The only errors that surface to callers are synchronous validation errors (ResourceLimitError, intentional close), which also come from the correct call site.

Stuck without knowing: Currently sendRemoteMessage is fire-and-forget—messages queue during reconnection, but if retries are exhausted, they're silently lost. The caller never knows. This is exactly what the Message Acknowledgment work addresses: sendRemoteMessage will resolve on ACK or reject after retries, so user code will know definitively.
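For context, the ACK-based behavior described above could look roughly like this sketch (the names sendWithAck, handleAck, and the pending-ack map are hypothetical; the actual Message Acknowledgment design may differ):

```typescript
// Hypothetical sketch of an ACK-resolved send; not the PR's implementation.
const pendingAcks = new Map<
  string,
  { resolve: () => void; reject: (error: Error) => void }
>();

function sendWithAck(
  id: string,
  transmit: (id: string) => void,
  timeoutMs: number,
): Promise<void> {
  return new Promise((resolve, reject) => {
    // Reject if no ACK arrives in time, so the caller is never stuck unknowingly.
    const timer = setTimeout(() => {
      pendingAcks.delete(id);
      reject(new Error(`no ACK for message ${id}`));
    }, timeoutMs);
    pendingAcks.set(id, {
      resolve: () => {
        clearTimeout(timer);
        resolve();
      },
      reject,
    });
    transmit(id);
  });
}

function handleAck(id: string): void {
  pendingAcks.get(id)?.resolve();
  pendingAcks.delete(id);
}
```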

@sirtimid sirtimid enabled auto-merge (squash) January 6, 2026 19:48
@sirtimid sirtimid disabled auto-merge January 7, 2026 08:12
@sirtimid sirtimid force-pushed the sirtimid/remote-comms-resource-limits branch from 71fe98b to b9f908b on January 7, 2026 15:10
@claude

claude bot commented Jan 7, 2026

PR Review: Resource Limits for Remote Communications

I've completed a comprehensive review of this PR. Overall, this is high-quality work with excellent attention to detail on resource management and edge cases. The implementation addresses critical DoS vulnerabilities and includes thorough test coverage. Below are my findings:


✅ Strengths

1. Excellent Error Design

  • ResourceLimitError follows the existing error pattern perfectly with marshal/unmarshal support
  • Includes structured data field with limitType, current, and limit for detailed diagnostics
  • Properly uses harden() for object capability security
  • Comprehensive test coverage (233 lines) with parameterized tests
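For reference, the diagnostic shape described above could be sketched roughly as follows (a simplified stand-in; the real class also wires up marshal/unmarshal support and harden()):

```typescript
// Simplified stand-in for ResourceLimitError; the real implementation
// lives in the errors package with marshal/unmarshal and hardening.
class ResourceLimitError extends Error {
  readonly code = 'RESOURCE_LIMIT_ERROR';

  constructor(
    message: string,
    readonly data: { limitType: string; current: number; limit: number },
  ) {
    super(message);
    this.name = 'ResourceLimitError';
  }
}

const error = new ResourceLimitError('connection limit exceeded', {
  limitType: 'maxConcurrentConnections',
  current: 101,
  limit: 100,
});
```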

2. Robust Resource Limit Enforcement

The implementation is thorough and handles numerous edge cases:

  • Connection limit: Checked before dial, after dial (for race conditions), on inbound connections
  • Message size limit: Validated before queuing, before dialing, with proper byte-accurate measurement
  • Stale peer cleanup: Periodic removal of disconnected peers (1 hour timeout, 15-minute intervals)
  • Race condition protection: Multiple checks to handle concurrent dials and inbound connections

3. Channel Lifecycle Management

  • New closeChannel() method properly attempts close() then falls back to abort()
  • Explicit channel closing prevents resource leaks throughout the codebase
  • Smart channel reuse logic in reuseOrReturnChannel()
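The close-then-abort fallback can be sketched in isolation like this (the Closeable interface here is a hypothetical minimal stand-in; the real method operates on the factory's channel type and a peer id):

```typescript
// Hypothetical minimal interface; real channels come from ConnectionFactory.
type Closeable = {
  close(): Promise<void>;
  abort(reason?: Error): void;
};

async function closeChannel(channel: Closeable): Promise<void> {
  try {
    await channel.close(); // attempt graceful close first
  } catch (error) {
    // close() failed (e.g. stream already broken): hard-abort instead
    channel.abort(error as Error);
  }
}
```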

4. Comprehensive Test Coverage

  • 779 new lines of tests for network.ts alone
  • Tests cover limits, cleanup, race conditions, channel closing, and reconnection scenarios
  • Proper use of it.each() for parameterized tests
  • Good use of vi.waitFor() for async race condition testing

5. Security Considerations

  • Prevents memory exhaustion attacks via connection/message limits
  • Rejects connections from intentionally closed peers
  • Proper validation before queuing to prevent queue overflow
  • Stale peer cleanup prevents indefinite memory growth

🔍 Issues & Recommendations

Critical

None - No critical issues found.

High Priority

1. Error Propagation Concern (addresses @FUDCo's comment)

The current implementation has a potential issue where resource limit errors may not always propagate back to the caller:

Location: network.ts:762-773

try {
  checkConnectionLimit();
} catch {
  // Connection limit reached - close the dialed channel and queue the message
  logger.log(
    `${targetPeerId}:: connection limit reached after dial, queueing message`,
  );
  await connectionFactory.closeChannel(channel, targetPeerId);
  currentQueue.enqueue(message);
  handleConnectionLoss(targetPeerId, channel);
  return; // ⚠️ Returns silently, caller doesn't know message was queued
}

Problem: When the connection limit is reached after dial, the message is silently queued and reconnection is triggered. The caller receives a successful Promise resolution but the message wasn't actually sent.

Similar issue at: network.ts:377-392 (reconnection loop)

Recommendation: Consider one of these approaches:

  1. Throw ResourceLimitError back to the caller so they know the send failed
  2. Return a status object: { sent: boolean, queued: boolean, reason?: string }
  3. Add a callback/event for "message queued due to limit" scenarios

This ensures user code knows when messages are delayed vs. successfully sent, avoiding the "unrecoverable error state" concern.


2. TextEncoder Creation Efficiency

Location: network.ts:490

function validateMessageSize(message: string): void {
  const messageSizeBytes = new TextEncoder().encode(message).length;
  // ...
}

Issue: Creates a new TextEncoder instance for every message validation. This is called on the hot path for every message send.

Recommendation:

const messageEncoder = new TextEncoder(); // Create once at module level

function validateMessageSize(message: string): void {
  const messageSizeBytes = messageEncoder.encode(message).length;
  // ...
}

Medium Priority

3. Race Condition in Cleanup

Location: network.ts:626-667

The cleanupStalePeers() function could have a subtle race condition:

for (const peerId of stalePeers) {
  lastConnectionTime.delete(peerId);
  messageQueues.delete(peerId);
  // ...
}

Issue: If a connection is established between when stalePeers array is built and when cleanup executes, we might delete a newly-active peer's data.

Recommendation: Add a final check before deletion:

for (const peerId of stalePeers) {
  // Re-check that peer is still stale (not reconnected since we checked)
  if (channels.has(peerId) || reconnectionManager.isReconnecting(peerId)) {
    continue; // Skip cleanup, peer is active again
  }
  lastConnectionTime.delete(peerId);
  // ...
}

4. Missing TypeDoc Documentation

Location: types.ts:30-67

The new RemoteCommsOptions fields lack TypeDoc comments explaining units and behavior:

maxConcurrentConnections?: number | undefined; // Missing: what happens when reached?
maxMessageSizeBytes?: number | undefined;      // Missing: exact behavior on exceed
stalePeerTimeoutMs?: number | undefined;       // Missing: what data is cleaned up?

Recommendation: Add detailed TypeDoc comments per CLAUDE.md conventions:

/**
 * Maximum message size in bytes (default: 1MB).
 * Messages exceeding this limit will be immediately rejected with ResourceLimitError.
 */
maxMessageSizeBytes?: number | undefined;

5. Potential Integer Overflow in Test

Location: network.test.ts (message size tests, from diff)

When testing message size limits with values like 1024 * 1024 + 1, ensure tests also cover boundary cases around 32-bit integer limits if maxMessageSizeBytes could be configured that high.

Recommendation: Add a test comment documenting the max safe value or add explicit validation in the options.
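One possible guard, sketched here as an assumption (the PR may or may not validate option values this way):

```typescript
// Hypothetical guard: reject configured byte limits that are not positive
// safe integers, so size comparisons cannot silently lose precision.
function assertValidByteLimit(value: number): void {
  if (!Number.isSafeInteger(value) || value <= 0) {
    throw new RangeError(
      `maxMessageSizeBytes must be a positive safe integer, got ${value}`,
    );
  }
}
```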


Low Priority

6. Code Style: Error Handling Pattern

Location: Multiple locations use empty catch {}

Per CLAUDE.md, prefer explicit error handling. Several places catch and ignore errors:

} catch {
  logger.log(`${targetPeerId}:: connection limit reached...`);
  // ...
}

Recommendation: Make error variable explicit:

} catch (limitError) {
  logger.log(`${targetPeerId}:: connection limit reached...`);
  // Could log limitError.message for better diagnostics
}

7. Magic Number for SCTP Code

Location: network.ts:183

const SCTP_USER_INITIATED_ABORT = 12; // RFC 4960

Note: Good use of a constant! Consider extracting to a shared constants file if used elsewhere, but current location is fine for now.


📊 Performance & Scale

Good:

  • Connection limit prevents resource exhaustion
  • Message size limit prevents large message attacks
  • Stale peer cleanup prevents memory leaks
  • Efficient Map-based lookups for channel/queue management

Consider:

  • At 100 connections with frequent messages, the validateMessageSize() overhead could be noticeable
  • Cleanup interval of 15 minutes is reasonable, but consider making it adaptive based on connection count
  • The double-check pattern for connection limits adds complexity but is necessary for correctness

🔒 Security Assessment

✅ Well Protected:

  • DoS via connection flood: Limited to 100 concurrent connections
  • DoS via large messages: Limited to 1MB per message
  • Memory exhaustion: Stale peer cleanup after 1 hour
  • Intentional close bypass: Properly tracked and enforced

⚠️ Considerations:

  1. Queue exhaustion per peer: Already protected (200 messages/peer by default)
  2. Rapid connect/disconnect: Could trigger cleanup/reconnection storms (consider rate limiting in future)
  3. Hint flooding: registerLocationHints() merges hints unboundedly - consider max hints per peer
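A bounded merge could be as simple as the following sketch (the cap value and function name are assumptions for illustration, not from the PR):

```typescript
// Hypothetical cap; the PR currently merges hints without a bound.
const MAX_HINTS_PER_PEER = 20; // assumed value for illustration

function mergeLocationHints(
  existing: string[],
  incoming: string[],
): string[] {
  const merged = [...new Set([...existing, ...incoming])]; // dedupe, keep order
  return merged.slice(0, MAX_HINTS_PER_PEER);
}
```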

📝 Code Quality

Adherence to CLAUDE.md: ✅ Excellent

  • ✅ Uses TypeScript with explicit return types
  • ✅ Uses harden() for ocap security
  • ✅ Uses @metamask/superstruct for validation
  • ✅ Proper TypeDoc documentation (minor gaps noted)
  • ✅ camelCase for functions, PascalCase for classes
  • ✅ Test titles use verb forms without "should"
  • ✅ Proper use of vi.fn() and mocking

Minor Style Issues:

  • Some functions exceed 50 lines (e.g., attemptReconnection, sendRemoteMessage) but complexity is justified
  • Consider extracting some nested logic into helper functions for readability

🧪 Test Coverage

Excellent Coverage:

  • Connection limits (at boundary, exceeded, custom limits)
  • Message size limits (exact boundary, exceeded, during reconnection)
  • Stale peer cleanup (timeout, active peers excluded)
  • Race conditions (concurrent dials, inbound during reconnection)
  • Channel lifecycle (close, abort, fallback)
  • Integration scenarios

Potential Gaps:

  1. Cleanup during reconnection: Test that cleanup doesn't interfere with active reconnection
  2. Multiple peers hitting limit simultaneously: Stress test with N+1 peers where N=limit
  3. Hint flooding: Test registerLocationHints() with many hints

🎯 Recommendations Summary

Must Address:

  1. Error propagation - Ensure callers know when messages are queued vs. sent (High Priority #1)

Should Address:

  1. TextEncoder efficiency - Move to module level (High Priority #2)
  2. Cleanup race condition - Add active-check before deletion (Medium Priority #3)

Nice to Have:

  1. TypeDoc for new options
  2. Explicit error handling in catch blocks
  3. Consider max hints per peer

✅ Final Verdict

This PR is approved with minor recommendations. The resource limit implementation is solid, well-tested, and addresses real security concerns. The code quality is high and follows project conventions.

Recommended Actions:

  1. Address High Priority #1 (error propagation) to resolve @FUDCo's concern
  2. Consider High Priority #2 (TextEncoder) for performance
  3. Others can be addressed in follow-up PRs if desired

Great work! 🚀

@sirtimid sirtimid force-pushed the sirtimid/remote-comms-resource-limits branch from b9c906e to 982f9b8 on January 8, 2026 16:06
@rekmarks rekmarks requested a review from FUDCo January 8, 2026 18:46

Development

Successfully merging this pull request may close these issues.

Remote comms: Implement Resource Limits
