feat: add observability for security agent by jeanduplessis · Pull Request #58 · Kilo-Org/cloud

jeanduplessis · 2026-02-07T16:42:03Z

Addresses a shortcoming in the security agent's operational observability.

Addresses Finding #15 (HIGH: No Operational Observability) from the security agent production readiness review. Lays out a 5-phase plan covering correlation IDs, structured logging, LLM call timing and token tracking, cron heartbeats, sync metrics, pipeline instrumentation, and degradation detection, all built on existing codebase infrastructure (emitApiMetrics, sentryLogger, Sentry spans, BetterStack heartbeats). https://claude.ai/code/session_01H6HahwjayzdFFZXbpE9Hg7

…kflows

Implements all 5 phases of the observability plan (Finding #15):

Phase 1 - Correlation ID & Structured Logging:
- Generate correlationId (UUID) at analysis start, thread through all tiers
- Store correlationId in SecurityFindingAnalysis JSONB for queryability
- Replace ~76 console.log/error calls with sentryLogger (dual console+Sentry)
- Wrap startSecurityAnalysis in Sentry withScope for tag propagation

Phase 2 - LLM Call Timing & Token Tracking:
- Wrap triage and extraction LLM calls in Sentry startSpan (op: ai.inference)
- Extract token usage from sendProxiedChatCompletion responses
- Emit metrics via emitApiMetrics with mode security-agent-triage/extraction
- Track input/output tokens as span attributes

Phase 3 - Cron Heartbeats & Sync Metrics:
- Add BetterStack heartbeat support to both cron jobs (env-configurable URLs)
- Send /fail heartbeat on sync errors
- Add per-repository sync timing in syncDependabotAlertsForRepo
- Track GitHub API rate limits via x-ratelimit-remaining headers

Phase 4 - Pipeline Timing & R2 Retry Instrumentation:
- Wrap processAnalysisStream in Sentry span (op: ai.pipeline)
- Track stream duration, R2 retry attempts, and retry wait time
- Log tier transition timing (Tier 1 duration)
- Record stream outcome status on span attributes

Phase 5 - Outcome Distribution & Degradation Detection:
- Add Sentry breadcrumbs for triage/extraction outcomes with isFallback flag
- Track auto-dismiss decisions with correlationId and source
- Add stale analysis anomaly detection (warn when count > threshold)
- Log bulk auto-dismiss summaries

https://claude.ai/code/session_01H6HahwjayzdFFZXbpE9Hg7
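The Phase 1 and Phase 2 ideas (one correlation ID per analysis, plus timing and token counts per LLM call) can be sketched without any Sentry dependency. This is a minimal sketch, not the PR's code: it assumes an OpenAI-compatible `usage` object on the completion response, and `timedLlmCall`, `LlmCallMetric`, and the `emit` callback are hypothetical stand-ins for the span attributes and `emitApiMetrics` call described above.

```typescript
import { randomUUID } from "node:crypto";

// Assumed OpenAI-compatible response shape; the real
// sendProxiedChatCompletion return type may differ.
interface ChatCompletionLike {
  usage?: { prompt_tokens?: number; completion_tokens?: number };
}

// Hypothetical metric record: what would be set as span attributes
// and forwarded to a metrics emitter such as emitApiMetrics.
interface LlmCallMetric {
  correlationId: string;
  mode: string; // e.g. "security-agent-triage" or "security-agent-extraction"
  durationMs: number;
  inputTokens: number;
  outputTokens: number;
  ok: boolean;
}

interface TimedCallOptions {
  correlationId: string;
  mode: string;
  emit: (metric: LlmCallMetric) => void;
}

// Times one LLM call and emits a metric on both success and failure,
// so failed calls still appear in latency data.
async function timedLlmCall<T extends ChatCompletionLike>(
  options: TimedCallOptions,
  call: () => Promise<T>,
): Promise<T> {
  const start = performance.now();
  const base = { correlationId: options.correlationId, mode: options.mode };
  try {
    const response = await call();
    options.emit({
      ...base,
      durationMs: performance.now() - start,
      inputTokens: response.usage?.prompt_tokens ?? 0,
      outputTokens: response.usage?.completion_tokens ?? 0,
      ok: true,
    });
    return response;
  } catch (error) {
    options.emit({
      ...base,
      durationMs: performance.now() - start,
      inputTokens: 0,
      outputTokens: 0,
      ok: false,
    });
    throw error;
  }
}

// One correlation ID per analysis run, passed to every tier.
const correlationId = randomUUID();
```

Passing the same `correlationId` into every tier, and storing it on the analysis record, is what lets logs, spans, and database rows for one analysis be joined later.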

…ementation

- Fix withScope propagation: move withScope inside processAnalysisStream where background work actually runs instead of startSecurityAnalysis
- Fix span exception handling: move try/catch inside startSpan callback so span attributes are available on error paths
- Refactor triage, extraction, and auto-dismiss to use options objects instead of growing positional argument lists
- Guard emitApiMetrics calls with O11Y_KILO_GATEWAY_CLIENT_SECRET check to prevent sending metrics with empty client secret
- Derive toolsUsed from actual LLM response tool_calls instead of hardcoding before validation
- Remove unused warn variable in triage-service
- Add try/catch and failure heartbeat to cleanup-stale-analyses cron
- Use consistent performance.now() in sync-service runFullSync
- Use 'cron' source tag for auth warnings in cron routes for consistent Sentry alert routing

https://claude.ai/code/session_01H6HahwjayzdFFZXbpE9Hg7
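Two of the fixes above reduce to small pure functions: skipping metric emission when the client secret is empty, and deriving `toolsUsed` from the tool calls the model actually returned. A sketch under stated assumptions: the response shape follows the OpenAI-style `choices[].message.tool_calls[].function.name` layout, `deriveToolsUsed` and `emitIfConfigured` are hypothetical names, and the secret is passed in, not read from `O11Y_KILO_GATEWAY_CLIENT_SECRET`, so the functions stay testable.

```typescript
// Assumed OpenAI-style chat completion shape (only the fields used here).
interface ToolCallLike {
  function: { name: string };
}
interface CompletionWithTools {
  choices?: Array<{ message?: { tool_calls?: ToolCallLike[] } }>;
}

// Derive the tools list from what the model actually called,
// rather than hardcoding it before the response is validated.
function deriveToolsUsed(response: CompletionWithTools): string[] {
  const calls = response.choices?.[0]?.message?.tool_calls ?? [];
  return calls.map((call) => call.function.name);
}

// Forward a metric only when a non-empty client secret is configured.
// Returns whether the metric was sent.
function emitIfConfigured<M>(
  clientSecret: string | undefined,
  send: (metric: M) => void,
  metric: M,
): boolean {
  if (!clientSecret) {
    // Skipping avoids sending metrics with an empty client secret.
    return false;
  }
  send(metric);
  return true;
}
```

Both functions treat a missing value as a normal case rather than an error, which matches the intent that observability code should never break the analysis path it instruments.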

https://claude.ai/code/session_01H6HahwjayzdFFZXbpE9Hg7

kiloconnect · 2026-02-07T19:45:55Z

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Files Reviewed (10 files)

src/app/api/cron/cleanup-stale-analyses/route.ts
src/app/api/cron/sync-security-alerts/route.ts
src/lib/config.server.ts
src/lib/security-agent/core/types.ts
src/lib/security-agent/github/dependabot-api.ts
src/lib/security-agent/services/analysis-service.ts
src/lib/security-agent/services/auto-dismiss-service.ts
src/lib/security-agent/services/extraction-service.ts
src/lib/security-agent/services/sync-service.ts
src/lib/security-agent/services/triage-service.ts

The observability refactor removed console.error statements from parseTriageResult and parseExtractionResult without replacing them, losing visibility into which field validation failed and what the invalid value was. Restore logging using sentryLogger (logError) so failures surface in both console and Sentry. https://claude.ai/code/session_01H6HahwjayzdFFZXbpE9Hg7
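The restored validation logging can be sketched as a parser that, on failure, logs which field was invalid and the offending value under named keys. The field names (`suggestedAction`, `confidence`), their allowed values, and the `LogError` signature below are hypothetical placeholders, not the real `parseTriageResult` schema.

```typescript
// Hypothetical logger signature: message plus a structured extra object.
type LogError = (message: string, extra: Record<string, unknown>) => void;

// Hypothetical allowed values for one enum-like field.
const CONFIDENCE_VALUES = ["high", "medium", "low"] as const;
type Confidence = (typeof CONFIDENCE_VALUES)[number];

interface ParsedTriage {
  suggestedAction: string;
  confidence: Confidence;
}

function isConfidence(value: unknown): value is Confidence {
  return typeof value === "string" && (CONFIDENCE_VALUES as readonly string[]).includes(value);
}

// Returns null on failure, logging exactly which field was invalid and its value.
function parseTriageArgs(raw: unknown, logError: LogError): ParsedTriage | null {
  if (typeof raw !== "object" || raw === null) {
    logError("Triage result is not an object", { value: raw });
    return null;
  }
  const record = raw as Record<string, unknown>;
  if (typeof record.suggestedAction !== "string") {
    logError("Invalid triage result field", { field: "suggestedAction", value: record.suggestedAction });
    return null;
  }
  if (!isConfidence(record.confidence)) {
    logError("Invalid triage result field", { field: "confidence", value: record.confidence });
    return null;
  }
  return { suggestedAction: record.suggestedAction, confidence: record.confidence };
}
```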

src/lib/security-agent/services/sync-service.ts

…raction

The observability refactor removed console.log/console.error calls for response validation failures (no choice, no tool call, unexpected tool) and success logging (triage/extraction complete) without replacing them. Restore using sentryLogger so these events surface in both console and Sentry. https://claude.ai/code/session_01H6HahwjayzdFFZXbpE9Hg7

The observability refactor replaced the truncated reasoning excerpts with a redundant source field. Restore the reasoning.slice(0, 100) so dismiss logs show *why* the finding was dismissed without needing to look up the full analysis. https://claude.ai/code/session_01H6HahwjayzdFFZXbpE9Hg7

Restore console statements that were removed without replacement:

analysis-service.ts:
- R2 message fetch debug info (messageCount, lastFewTypes)
- Which message type was selected (completion_result, text, fallback) with messageIndex and contentLength

sync-service.ts:
- Alert count after GitHub fetch
- Finding count after parsing

These are useful for diagnosing pipeline issues (e.g. why an analysis returned no result, or how many alerts a repo actually has). https://claude.ai/code/session_01H6HahwjayzdFFZXbpE9Hg7
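The restored analysis-service logs are easier to interpret next to the selection logic they describe. The sketch below is hypothetical: the message shape, the preference order (latest `completion_result`, then latest non-empty `text`, then a fallback), the size of the `lastFewTypes` window, and `selectResultMessage` itself are assumptions for illustration. Only the logged field names come from the commit.

```typescript
// Hypothetical message shape for the stored sandbox transcript.
interface StreamMessage {
  type: string;
  content?: string;
}

type Selection = "completion_result" | "text" | "fallback";

interface SelectedResult {
  selectedType: Selection;
  messageIndex: number; // -1 when falling back
  content: string;
}

type LogInfo = (message: string, extra: Record<string, unknown>) => void;

// Assumed preference order: the latest completion_result, else the latest
// non-empty text message, else a fallback with empty content.
function selectResultMessage(messages: StreamMessage[], logInfo: LogInfo): SelectedResult {
  logInfo("Fetched messages", {
    messageCount: messages.length,
    lastFewTypes: messages.slice(-5).map((m) => m.type),
  });

  const findLast = (predicate: (m: StreamMessage) => boolean): number => {
    for (let i = messages.length - 1; i >= 0; i--) {
      if (predicate(messages[i])) return i;
    }
    return -1;
  };

  let selectedType: Selection = "completion_result";
  let messageIndex = findLast((m) => m.type === "completion_result");
  if (messageIndex === -1) {
    selectedType = "text";
    messageIndex = findLast((m) => m.type === "text" && (m.content ?? "").length > 0);
  }
  if (messageIndex === -1) {
    selectedType = "fallback";
  }

  const content = messageIndex >= 0 ? messages[messageIndex].content ?? "" : "";
  logInfo("Selected result message", { selectedType, messageIndex, contentLength: content.length });
  return { selectedType, messageIndex, content };
}
```

Logging both the input summary and the chosen branch means an empty result can be traced to "no messages", "no usable message type", or "selected message was empty" without re-running the analysis.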

Three logError calls passed raw `error` as a positional arg instead of a structured object. sentryLogger puts args into `extra.args[]`, so raw errors end up as `args[0]` with no key — losing context in Sentry. Consistently use `{ error }` (and other relevant fields) so Sentry extra data has named keys. https://claude.ai/code/session_01H6HahwjayzdFFZXbpE9Hg7
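To make the `extra.args[]` point concrete, here is a hypothetical stand-in that records arguments the way this commit describes `sentryLogger` doing it; it is not the real module, and the log messages and `correlationId` value are placeholders. A raw error lands at `args[0]` with no key, while `{ error, ...context }` gives every value a name.

```typescript
interface CapturedEvent {
  message: string;
  extra: { args: unknown[] };
}

// Hypothetical logger mirroring the described behavior: every positional
// argument after the message is stored, unnamed, in extra.args.
function makeCapturingLogger(sink: CapturedEvent[]) {
  return (message: string, ...args: unknown[]): void => {
    sink.push({ message, extra: { args } });
  };
}

const events: CapturedEvent[] = [];
const logError = makeCapturingLogger(events);
const failure = new Error("R2 fetch failed");

// Before: the error is an anonymous positional value (extra.args[0]).
logError("Failed to fetch analysis messages", failure);

// After: named keys survive into the extra data.
logError("Failed to fetch analysis messages", { error: failure, correlationId: "example-id" });
```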

src/app/api/cron/cleanup-stale-analyses/route.ts

The heartbeat fetch calls are awaited — if BetterStack is slow or unreachable, the cron handler stalls until the platform kills it. Add AbortSignal.timeout(5000) so heartbeats are truly best-effort and never block the response. https://claude.ai/code/session_01H6HahwjayzdFFZXbpE9Hg7
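A minimal sketch of a best-effort heartbeat, assuming a BetterStack heartbeat URL where appending `/fail` reports a failure (as the Phase 3 notes describe). `AbortSignal.timeout(5000)` bounds the wait, and any error, including the timeout, is swallowed so the cron response is never blocked. `sendHeartbeat` and the injectable `fetchImpl` parameter are hypothetical; in production the global `fetch` would be passed in.

```typescript
// Minimal fetch-compatible signature so the sketch can be tested without network access.
type FetchLike = (url: string, init: { signal: AbortSignal }) => Promise<{ ok: boolean }>;

const HEARTBEAT_TIMEOUT_MS = 5000;

// Best-effort heartbeat: resolves to whether the ping succeeded and never throws.
async function sendHeartbeat(
  heartbeatUrl: string | undefined,
  outcome: "success" | "fail",
  fetchImpl: FetchLike,
): Promise<boolean> {
  if (!heartbeatUrl) return false; // heartbeat not configured for this environment
  const target =
    outcome === "fail" ? `${heartbeatUrl.replace(/\/+$/, "")}/fail` : heartbeatUrl;
  try {
    const response = await fetchImpl(target, {
      signal: AbortSignal.timeout(HEARTBEAT_TIMEOUT_MS),
    });
    return response.ok;
  } catch {
    // Timeouts (the signal aborts the request) and network errors are ignored.
    return false;
  }
}
```

Returning a boolean instead of throwing keeps the call site a one-liner in both the success and error branches of the cron handler.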

@param

The options-object refactor dropped the @param documentation from triageSecurityFinding, extractSandboxAnalysis, and maybeAutoDismissAnalysis. Restore them using options.field notation. https://claude.ai/code/session_01H6HahwjayzdFFZXbpE9Hg7

…ar access

Centralizes SECURITY_SYNC_BETTERSTACK_HEARTBEAT_URL and SECURITY_CLEANUP_BETTERSTACK_HEARTBEAT_URL in @/lib/config.server instead of reading process.env directly in route files. https://claude.ai/code/session_01H6HahwjayzdFFZXbpE9Hg7
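A sketch of the centralization pattern: read the optional heartbeat URLs once in a server-only config module and treat blank values as unset, so route files import typed values instead of touching `process.env`. The `optionalEnv` helper and `loadHeartbeatConfig` function are hypothetical; only the two environment variable names come from this PR.

```typescript
// Treat missing or whitespace-only values as "not configured".
function optionalEnv(env: Record<string, string | undefined>, name: string): string | undefined {
  const value = env[name]?.trim();
  return value ? value : undefined;
}

interface HeartbeatConfig {
  syncHeartbeatUrl: string | undefined;
  cleanupHeartbeatUrl: string | undefined;
}

// Taking env as a parameter keeps this testable; a config module would
// call it once with process.env and export the results.
function loadHeartbeatConfig(env: Record<string, string | undefined>): HeartbeatConfig {
  return {
    syncHeartbeatUrl: optionalEnv(env, "SECURITY_SYNC_BETTERSTACK_HEARTBEAT_URL"),
    cleanupHeartbeatUrl: optionalEnv(env, "SECURITY_CLEANUP_BETTERSTACK_HEARTBEAT_URL"),
  };
}
```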

claude added 4 commits February 7, 2026 16:02

chore: remove security-agent-observability-plan.md

602ff0f

https://claude.ai/code/session_01H6HahwjayzdFFZXbpE9Hg7

kiloconnect bot reviewed Feb 7, 2026


src/lib/security-agent/services/sync-service.ts (outdated, resolved)

claude added 4 commits February 7, 2026 19:52

jeanduplessis changed the title ~~docs: add observability implementation plan for security agent~~ feat: add observability for security agent Feb 7, 2026

kiloconnect bot reviewed Feb 7, 2026


src/app/api/cron/cleanup-stale-analyses/route.ts (outdated, resolved)

claude and others added 5 commits February 7, 2026 20:09

Merge branch 'main' into claude/plan-security-agent-observability-07PgC

10f6db9

Merge branch 'main' into claude/plan-security-agent-observability-07PgC

6775f80

eshurakov approved these changes Feb 9, 2026


jeanduplessis merged commit f191c0c into main Feb 9, 2026
11 checks passed

jeanduplessis deleted the claude/plan-security-agent-observability-07PgC branch February 9, 2026 16:38


jeanduplessis merged 14 commits into main from claude/plan-security-agent-observability-07PgC


3 participants
