feat(ocap-kernel): add resource limits for remote communications #714
Conversation
@cursor review
FUDCo left a comment:
This looks fine for what it is.
One lingering question: as we get more careful about detecting error conditions (or proclaiming them, in the case of configurable resource limits), are we potentially setting ourselves up for situations where an error bubbles up to user code at some location other than the one actually responsible for causing it? In other words, will a message transmission error always find its way back to the actual send operation that triggered it?

More generally, could user code find itself in an unrecoverable error state? By that I don't mean a state where there's an error you can't get rid of -- things can always break unfixably, e.g., a remote host dies forever -- but rather a state where you don't actually know you're stuck. It's entirely plausible to me that everything is fine, but I can't tell from reading the tests whether they give us reason to believe we're OK on this score.
@FUDCo Two parts to your question:

Error attribution: Yeah, transmission errors are always caught within the specific …

Stuck without knowing: Currently …
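As a minimal illustration of the property FUDCo is asking about (a hypothetical sketch, not the PR's code; `sendRemoteMessage` is a stand-in for whatever send API the kernel exposes): a transmission failure should reject the promise returned by the originating send, so the error surfaces at the call site that caused it.

```ts
// Hypothetical stand-in for the kernel's send API.
declare function sendRemoteMessage(peerId: string, message: string): Promise<void>;

async function notifyPeer(peerId: string): Promise<void> {
  try {
    // If transmission fails, the rejection should surface *here*,
    // at the send that triggered it, not at some unrelated await later.
    await sendRemoteMessage(peerId, JSON.stringify({ type: 'ping' }));
  } catch (error) {
    // The caller can attribute the failure to this specific send.
    console.error(`send to ${peerId} failed:`, error);
  }
}
```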
## PR Review: Resource Limits for Remote Communications

I've completed a comprehensive review of this PR. Overall, this is high-quality work with excellent attention to detail on resource management and edge cases. The implementation addresses critical DoS vulnerabilities and includes thorough test coverage. Below are my findings:

### ✅ Strengths

1. Excellent Error Design (sketched below)
2. Robust Resource Limit Enforcement - the implementation is thorough and handles numerous edge cases
3. Channel Lifecycle Management
4. Comprehensive Test Coverage
5. Security Considerations
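As a rough sketch of the error design praised in item 1, based only on the names in the PR description (`ResourceLimitError` with code `RESOURCE_LIMIT_ERROR` and marshal/unmarshal support); the actual class, fields, and marshal format in the PR may differ:

```ts
// Sketch only; everything except the error name and code is an assumption.
const RESOURCE_LIMIT_ERROR = 'RESOURCE_LIMIT_ERROR';

class ResourceLimitError extends Error {
  readonly code = RESOURCE_LIMIT_ERROR;

  constructor(message: string) {
    super(message);
    this.name = 'ResourceLimitError';
  }

  // Marshal to a plain object so the error can cross a message boundary.
  marshal(): { code: string; message: string } {
    return { code: this.code, message: this.message };
  }

  static unmarshal(data: { message: string }): ResourceLimitError {
    return new ResourceLimitError(data.message);
  }
}
```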
🔍 Issues & RecommendationsCriticalNone - No critical issues found. High Priority1. Error Propagation Concern (addresses @FUDCo's comment) The current implementation has a potential issue where resource limit errors may not always propagate back to the caller: Location: try {
checkConnectionLimit();
} catch {
// Connection limit reached - close the dialed channel and queue the message
logger.log(
`${targetPeerId}:: connection limit reached after dial, queueing message`,
);
await connectionFactory.closeChannel(channel, targetPeerId);
currentQueue.enqueue(message);
handleConnectionLoss(targetPeerId, channel);
return; // ⚠️ Returns silently, caller doesn't know message was queued
}Problem: When the connection limit is reached after dial, the message is silently queued and reconnection is triggered. The caller receives a successful Promise resolution but the message wasn't actually sent. Similar issue at: Recommendation: Consider one of these approaches:
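For example, one such approach (a sketch only; `SendResult` and `trySend` are hypothetical names, and the remaining identifiers are taken from the snippet above) would surface the queued state to the caller instead of returning silently:

```ts
// Declarations so the sketch type-checks; these names come from the snippet above.
declare function checkConnectionLimit(): void;
declare function handleConnectionLoss(peerId: string, channel: unknown): void;
declare const connectionFactory: {
  closeChannel(channel: unknown, peerId: string): Promise<void>;
};
declare const currentQueue: { enqueue(message: string): void };

// Hypothetical result type so callers can distinguish "sent" from "deferred".
type SendResult = { status: 'sent' } | { status: 'queued'; reason: string };

async function trySend(
  channel: unknown,
  targetPeerId: string,
  message: string,
  transmit: (message: string) => Promise<void>,
): Promise<SendResult> {
  try {
    checkConnectionLimit();
  } catch {
    // Same recovery as before, but the caller now learns the message
    // was queued for later delivery rather than sent.
    await connectionFactory.closeChannel(channel, targetPeerId);
    currentQueue.enqueue(message);
    handleConnectionLoss(targetPeerId, channel);
    return { status: 'queued', reason: 'connection limit reached' };
  }
  await transmit(message);
  return { status: 'sent' };
}
```

An alternative would be to reject with the same `ResourceLimitError` so the failure surfaces at the call site.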
This ensures user code knows when messages are delayed vs. successfully sent, avoiding the "unrecoverable error state" concern.

**2. TextEncoder Creation Efficiency**

Location: …

```ts
function validateMessageSize(message: string): void {
  const messageSizeBytes = new TextEncoder().encode(message).length;
  // ...
}
```

Issue: Creates a new `TextEncoder` on every call.

Recommendation:

```ts
const messageEncoder = new TextEncoder(); // Create once at module level

function validateMessageSize(message: string): void {
  const messageSizeBytes = messageEncoder.encode(message).length;
  // ...
}
```

#### Medium Priority

**3. Race Condition in Cleanup**

Location: the stale-peer cleanup loop:

```ts
for (const peerId of stalePeers) {
  lastConnectionTime.delete(peerId);
  messageQueues.delete(peerId);
  // ...
}
```

Issue: If a connection is established between when the stale peers are identified and when their data is deleted, the cleanup removes state for a peer that is active again.

Recommendation: Add a final check before deletion:

```ts
for (const peerId of stalePeers) {
  // Re-check that peer is still stale (not reconnected since we checked)
  if (channels.has(peerId) || reconnectionManager.isReconnecting(peerId)) {
    continue; // Skip cleanup, peer is active again
  }
  lastConnectionTime.delete(peerId);
  // ...
}
```

**4. Missing TypeDoc Documentation**

Location: the new limit options:

```ts
maxConcurrentConnections?: number | undefined; // Missing: what happens when reached?
maxMessageSizeBytes?: number | undefined; // Missing: exact behavior on exceed
stalePeerTimeoutMs?: number | undefined; // Missing: what data is cleaned up?
```

Recommendation: Add detailed TypeDoc comments per CLAUDE.md conventions:

```ts
/**
 * Maximum message size in bytes (default: 1MB).
 * Messages exceeding this limit will be immediately rejected with ResourceLimitError.
 */
maxMessageSizeBytes?: number | undefined;
```
**5. Potential Integer Overflow in Test**

Location: …

When testing message size limits with values like …

Recommendation: Add a test comment documenting the max safe value or add explicit validation in the options (see the sketch after item 6 below).

#### Low Priority

**6. Code Style: Error Handling Pattern**

Location: Multiple locations use empty `catch` clauses. Per CLAUDE.md, prefer explicit error handling. Several places catch and ignore errors:

```ts
} catch {
  logger.log(`${targetPeerId}:: connection limit reached...`);
  // ...
}
```

Recommendation: Make the error variable explicit:

```ts
} catch (limitError) {
  logger.log(`${targetPeerId}:: connection limit reached...`);
  // Could log limitError.message for better diagnostics
}
```
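Returning to item 5's recommendation, explicit option validation might look roughly like this (a sketch; `validateLimitOption` is hypothetical, while the option name comes from the PR):

```ts
// Hypothetical helper enforcing safe-integer limits (see item 5 above).
declare const options: { maxMessageSizeBytes?: number | undefined };

function validateLimitOption(name: string, value: number | undefined): void {
  if (value === undefined) {
    return; // Omitted option falls back to its default.
  }
  if (!Number.isSafeInteger(value) || value <= 0) {
    throw new RangeError(`${name} must be a positive safe integer, got ${value}`);
  }
}

validateLimitOption('maxMessageSizeBytes', options.maxMessageSizeBytes);
```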
**7. Magic Number for SCTP Code**

Location:

```ts
const SCTP_USER_INITIATED_ABORT = 12; // RFC 4960
```

Note: Good use of a constant! Consider extracting to a shared constants file if used elsewhere, but the current location is fine for now.

### 📊 Performance & Scale

Good: …
Consider: …

### 🔒 Security Assessment

✅ Well Protected: …
- Add connection limit (default 100 concurrent connections)
- Add message size limit (default 1MB per message)
- Add stale peer cleanup (removes data for peers disconnected >1 hour)
- Make all limits configurable via RemoteCommsOptions
- Add ResourceLimitError for limit violations
- Add comprehensive tests for all resource limits

This prevents memory exhaustion and manages system resources by:

- Rejecting new connections when limit is reached
- Rejecting messages exceeding size limit
- Periodically cleaning up stale peer data
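To make the configuration surface concrete, usage might look like the following (a sketch: the option names and defaults come from this PR's commit message, but the import path and `initRemoteComms` entry point are hypothetical):

```ts
// Hypothetical entry point; the option names come from this PR, the
// import path and function name do not.
import { initRemoteComms } from '@ocap/kernel';

// All limits are optional; omitted values fall back to the defaults
// described in the commit message.
const remoteComms = await initRemoteComms({
  maxConcurrentConnections: 50, // default: 100
  maxMessageSizeBytes: 512 * 1024, // default: 1MB
  stalePeerTimeoutMs: 30 * 60 * 1000, // default: 1 hour
  cleanupIntervalMs: 5 * 60 * 1000, // how often stale peer data is swept
});
```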
Closes #660
This prevents memory exhaustion and manages system resources by:

- Rejecting new connections when the limit is reached
- Rejecting messages exceeding the size limit
- Periodically cleaning up stale peer data
Note

Implements resource management for remote communications and a dedicated error type for limit violations.

- `ResourceLimitError` (code `RESOURCE_LIMIT_ERROR`) with marshal/unmarshal support; exports wired into `errorClasses` and the package index
- `remotes/network.ts` gains new options: `maxConcurrentConnections`, `maxMessageSizeBytes`, `cleanupIntervalMs`, `stalePeerTimeoutMs`; validates message size, blocks/tears down excess connections (outbound and inbound), and periodically cleans stale peer data
- `registerChannel`/`reuseOrReturnChannel` and an improved reconnection loop (race handling, intentional closes, queue flushing, timestamp updates)
- `ConnectionFactory.closeChannel` explicitly closes/aborts underlying streams; used when replacing or rejecting channels
- `RemoteCommsOptions` extended to include the new limits; tests updated extensively for limits, reconnection reuse, cleanup behavior, and race conditions; coverage thresholds adjusted

Written by Cursor Bugbot for commit 982f9b8. This will update automatically on new commits.