feat: add observability for security agent #58

Merged
jeanduplessis merged 14 commits into main from claude/plan-security-agent-observability-07PgC
Feb 9, 2026

Conversation

@jeanduplessis jeanduplessis commented Feb 7, 2026

Addresses the lack of operational observability in the security agent.

Addresses Finding #15 (HIGH: No Operational Observability) from the
security agent production readiness review. Lays out a 5-phase plan
covering correlation IDs, structured logging, LLM call timing/token
tracking, cron heartbeats, sync metrics, pipeline instrumentation,
and degradation detection — all using existing codebase infrastructure
(emitApiMetrics, sentryLogger, Sentry spans, BetterStack heartbeats).

https://claude.ai/code/session_01H6HahwjayzdFFZXbpE9Hg7
…kflows

Implements all 5 phases of the observability plan (Finding #15):

Phase 1 - Correlation ID & Structured Logging:
- Generate correlationId (UUID) at analysis start, thread through all tiers
- Store correlationId in SecurityFindingAnalysis JSONB for queryability
- Replace ~76 console.log/error calls with sentryLogger (dual console+Sentry)
- Wrap startSecurityAnalysis in Sentry withScope for tag propagation
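
The correlation ID threading in Phase 1 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: `AnalysisContext`, `createAnalysisContext`, and `logForTier` are hypothetical names, and the real `SecurityFindingAnalysis` shape is not shown here.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical context object; the real types live in core/types.ts.
interface AnalysisContext {
  correlationId: string;
  findingId: string;
}

interface LogEntry {
  message: string;
  tier: string;
  correlationId: string;
}

// Generate the ID once, at analysis start.
function createAnalysisContext(findingId: string): AnalysisContext {
  return { correlationId: randomUUID(), findingId };
}

// Every tier builds its log entries from the shared context,
// so the same ID appears on each entry.
function logForTier(ctx: AnalysisContext, tier: string, message: string): LogEntry {
  return { message, tier, correlationId: ctx.correlationId };
}

const ctx = createAnalysisContext("finding-123");
const triageLog = logForTier(ctx, "triage", "triage started");
const extractionLog = logForTier(ctx, "extraction", "extraction started");
```

Passing one context object down, rather than regenerating an ID per tier, is what lets logs from different tiers be joined on a single value.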

Phase 2 - LLM Call Timing & Token Tracking:
- Wrap triage and extraction LLM calls in Sentry startSpan (op: ai.inference)
- Extract token usage from sendProxiedChatCompletion responses
- Emit metrics via emitApiMetrics with mode security-agent-triage/extraction
- Track input/output tokens as span attributes
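
A sketch of the token extraction step in Phase 2 might look like the following. It assumes an OpenAI-style `usage` object on the response; the actual return type of `sendProxiedChatCompletion` is not shown in this PR, and the attribute keys below are hypothetical.

```typescript
// Assumed OpenAI-style usage fields (an assumption, not confirmed by the PR).
interface ChatCompletionLike {
  usage?: { prompt_tokens?: number; completion_tokens?: number };
}

interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

// Missing usage data falls back to zero instead of throwing,
// so metrics emission never breaks the analysis itself.
function extractTokenUsage(response: ChatCompletionLike): TokenUsage {
  return {
    inputTokens: response.usage?.prompt_tokens ?? 0,
    outputTokens: response.usage?.completion_tokens ?? 0,
  };
}

// Hypothetical attribute keys for illustration only.
function toSpanAttributes(usage: TokenUsage, durationMs: number): Record<string, number> {
  return {
    "llm.input_tokens": usage.inputTokens,
    "llm.output_tokens": usage.outputTokens,
    "llm.duration_ms": durationMs,
  };
}

const usage = extractTokenUsage({ usage: { prompt_tokens: 120, completion_tokens: 35 } });
const attrs = toSpanAttributes(usage, 842);
```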

Phase 3 - Cron Heartbeats & Sync Metrics:
- Add BetterStack heartbeat support to both cron jobs (env-configurable URLs)
- Send /fail heartbeat on sync errors
- Add per-repository sync timing in syncDependabotAlertsForRepo
- Track GitHub API rate limits via x-ratelimit-remaining headers
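
Reading the rate limit header mentioned in Phase 3 could be sketched like this. `readRateLimitRemaining`, `isRateLimitLow`, and the threshold of 100 are hypothetical; only the `x-ratelimit-remaining` header name comes from the commit message.

```typescript
// Minimal structural type so the sketch accepts fetch's Headers or a test double.
interface HeaderReader {
  get(name: string): string | null;
}

// Returns the remaining request budget, or null when the header is absent or malformed.
function readRateLimitRemaining(headers: HeaderReader): number | null {
  const raw = headers.get("x-ratelimit-remaining");
  if (raw === null) return null;
  const parsed = Number.parseInt(raw, 10);
  return Number.isNaN(parsed) ? null : parsed;
}

// Hypothetical threshold; tune to the sync volume.
function isRateLimitLow(remaining: number | null, threshold = 100): boolean {
  return remaining !== null && remaining < threshold;
}

const fakeHeaders: HeaderReader = {
  get: (name) => (name === "x-ratelimit-remaining" ? "42" : null),
};
const remaining = readRateLimitRemaining(fakeHeaders);
```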

Phase 4 - Pipeline Timing & R2 Retry Instrumentation:
- Wrap processAnalysisStream in Sentry span (op: ai.pipeline)
- Track stream duration, R2 retry attempts, and retry wait time
- Log tier transition timing (Tier 1 duration)
- Record stream outcome status on span attributes
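
The R2 retry tracking in Phase 4 can be illustrated with a small accumulator. This is a sketch under assumptions: `RetryStats` and `fetchWithRetry` are hypothetical names, and the retry count and fixed wait are placeholders rather than the values used in `analysis-service.ts`.

```typescript
// Accumulates retry counts and total wait time for later use as span attributes.
class RetryStats {
  retryAttempts = 0;
  retryWaitMs = 0;

  recordRetry(waitMs: number): void {
    this.retryAttempts += 1;
    this.retryWaitMs += waitMs;
  }
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Calls fetchOnce until it yields a value or the retries are exhausted.
async function fetchWithRetry<T>(
  fetchOnce: () => Promise<T | null>,
  stats: RetryStats,
  maxRetries = 3,
  waitMs = 500,
): Promise<T | null> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const value = await fetchOnce();
    if (value !== null) return value;
    if (attempt < maxRetries) {
      // Record before sleeping so the stats survive an abort during the wait.
      stats.recordRetry(waitMs);
      await sleep(waitMs);
    }
  }
  return null;
}
```

Keeping the counters in a separate object means the caller can attach them to a span or log line regardless of whether the fetch eventually succeeded.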

Phase 5 - Outcome Distribution & Degradation Detection:
- Add Sentry breadcrumbs for triage/extraction outcomes with isFallback flag
- Track auto-dismiss decisions with correlationId and source
- Add stale analysis anomaly detection (warn when count > threshold)
- Log bulk auto-dismiss summaries
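
The stale analysis check in Phase 5 reduces to a threshold comparison. A minimal sketch, assuming a hypothetical `classifyStaleCount` helper and a caller-supplied threshold:

```typescript
type LogLevel = "info" | "warn";

interface StaleCheckResult {
  level: LogLevel;
  staleCount: number;
  threshold: number;
}

// Warn only when the count strictly exceeds the threshold, as the commit message describes.
function classifyStaleCount(staleCount: number, threshold: number): StaleCheckResult {
  return { level: staleCount > threshold ? "warn" : "info", staleCount, threshold };
}

const normalRun = classifyStaleCount(3, 10);
const anomalousRun = classifyStaleCount(25, 10);
```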

https://claude.ai/code/session_01H6HahwjayzdFFZXbpE9Hg7
…ementation

- Fix withScope propagation: move withScope inside processAnalysisStream
  where background work actually runs instead of startSecurityAnalysis
- Fix span exception handling: move try/catch inside startSpan callback
  so span attributes are available on error paths
- Refactor triage, extraction, and auto-dismiss to use options objects
  instead of growing positional argument lists
- Guard emitApiMetrics calls with O11Y_KILO_GATEWAY_CLIENT_SECRET check
  to prevent sending metrics with empty client secret
- Derive toolsUsed from actual LLM response tool_calls instead of
  hardcoding before validation
- Remove unused warn variable in triage-service
- Add try/catch and failure heartbeat to cleanup-stale-analyses cron
- Use consistent performance.now() in sync-service runFullSync
- Use 'cron' source tag for auth warnings in cron routes for
  consistent Sentry alert routing
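
The span exception handling fix above can be illustrated as follows. `withSpanStandIn` is a local stand-in written for this sketch, not Sentry's `startSpan`; it only mimics the shape of a callback that receives a span. The attribute names are hypothetical.

```typescript
type AttributeValue = string | number | boolean;

interface SpanLike {
  setAttribute(key: string, value: AttributeValue): void;
}

// Local stand-in for a tracing SDK's span wrapper (NOT Sentry's API).
function withSpanStandIn<T>(
  callback: (span: SpanLike) => T,
): { result: T; attributes: Record<string, AttributeValue> } {
  const attributes: Record<string, AttributeValue> = {};
  const span: SpanLike = {
    setAttribute: (key, value) => {
      attributes[key] = value;
    },
  };
  return { result: callback(span), attributes };
}

function runTriage(shouldFail: boolean) {
  return withSpanStandIn((span): string | null => {
    // try/catch lives inside the callback, so `span` is in scope on the error path.
    try {
      if (shouldFail) throw new Error("LLM call failed");
      span.setAttribute("outcome", "success");
      return "triaged";
    } catch (error) {
      span.setAttribute("outcome", "error");
      span.setAttribute("error.message", error instanceof Error ? error.message : String(error));
      return null;
    }
  });
}

const failed = runTriage(true);
const succeeded = runTriage(false);
```

With the try/catch outside the wrapper, the catch block would have no handle on the span and could not record the failure on it.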

https://claude.ai/code/session_01H6HahwjayzdFFZXbpE9Hg7
kiloconnect bot commented Feb 7, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Files Reviewed (10 files)
  • src/app/api/cron/cleanup-stale-analyses/route.ts
  • src/app/api/cron/sync-security-alerts/route.ts
  • src/lib/config.server.ts
  • src/lib/security-agent/core/types.ts
  • src/lib/security-agent/github/dependabot-api.ts
  • src/lib/security-agent/services/analysis-service.ts
  • src/lib/security-agent/services/auto-dismiss-service.ts
  • src/lib/security-agent/services/extraction-service.ts
  • src/lib/security-agent/services/sync-service.ts
  • src/lib/security-agent/services/triage-service.ts

The observability refactor removed console.error statements from
parseTriageResult and parseExtractionResult without replacing them,
losing visibility into which field validation failed and what the
invalid value was. Restore logging using sentryLogger (logError)
so failures surface in both console and Sentry.

https://claude.ai/code/session_01H6HahwjayzdFFZXbpE9Hg7
…raction

The observability refactor removed console.log/console.error calls for
response validation failures (no choice, no tool call, unexpected tool)
and success logging (triage/extraction complete) without replacing them.
Restore using sentryLogger so these events surface in both console and
Sentry.

https://claude.ai/code/session_01H6HahwjayzdFFZXbpE9Hg7

The observability refactor replaced the truncated reasoning excerpts
with a redundant source field. Restore the reasoning.slice(0, 100)
so dismiss logs show *why* the finding was dismissed without needing
to look up the full analysis.

https://claude.ai/code/session_01H6HahwjayzdFFZXbpE9Hg7

Restore console statements that were removed without replacement:

analysis-service.ts:
- R2 message fetch debug info (messageCount, lastFewTypes)
- Which message type was selected (completion_result, text, fallback)
  with messageIndex and contentLength

sync-service.ts:
- Alert count after GitHub fetch
- Finding count after parsing

These are useful for diagnosing pipeline issues (e.g. why an analysis
returned no result, or how many alerts a repo actually has).

https://claude.ai/code/session_01H6HahwjayzdFFZXbpE9Hg7

Three logError calls passed raw `error` as a positional arg instead of
a structured object. sentryLogger puts args into `extra.args[]`, so raw
errors end up as `args[0]` with no key — losing context in Sentry.
Consistently use `{ error }` (and other relevant fields) so Sentry
extra data has named keys.
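
The difference between the two call styles can be sketched with a stand-in logger. `logErrorStandIn` only mimics the behavior described above (arguments collected into `extra.args[]`); it is not the real `sentryLogger`, and `findingId` is an illustrative field.

```typescript
interface CapturedLog {
  message: string;
  extra: { args: unknown[] };
}

// Stand-in that mimics collecting trailing arguments into extra.args[].
function logErrorStandIn(message: string, ...args: unknown[]): CapturedLog {
  return { message, extra: { args } };
}

// True when the first argument is an object carrying the given named key.
function hasNamedKey(entry: CapturedLog, key: string): boolean {
  const first = entry.extra.args[0];
  return typeof first === "object" && first !== null && key in first;
}

const error = new Error("R2 fetch failed");

// Positional: the error lands at args[0] with no key describing what it is.
const positional = logErrorStandIn("Analysis failed", error);

// Structured: the error and related fields arrive under named keys.
const structured = logErrorStandIn("Analysis failed", { error, findingId: "finding-123" });
```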

https://claude.ai/code/session_01H6HahwjayzdFFZXbpE9Hg7
@jeanduplessis jeanduplessis changed the title from "docs: add observability implementation plan for security agent" to "feat: add observability for security agent" Feb 7, 2026
claude and others added 5 commits February 7, 2026 20:09
The heartbeat fetch calls are awaited — if BetterStack is slow or
unreachable, the cron handler stalls until the platform kills it.
Add AbortSignal.timeout(5000) so heartbeats are truly best-effort
and never block the response.
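
A best-effort heartbeat along these lines might look like the sketch below. `heartbeatUrl` and `sendHeartbeat` are hypothetical helper names; the `/fail` suffix and the 5000 ms timeout come from the commit messages in this PR, and `AbortSignal.timeout` is the standard API available in recent Node.js versions.

```typescript
// Appends /fail for failure reports, tolerating a trailing slash on the base URL.
function heartbeatUrl(baseUrl: string, ok: boolean): string {
  const trimmed = baseUrl.replace(/\/+$/, "");
  return ok ? trimmed : `${trimmed}/fail`;
}

// Resolves to true only when the heartbeat was delivered; never throws.
async function sendHeartbeat(baseUrl: string | undefined, ok: boolean): Promise<boolean> {
  if (!baseUrl) return false; // heartbeat URL not configured
  try {
    const response = await fetch(heartbeatUrl(baseUrl, ok), {
      // Abort after 5 seconds so a slow monitor cannot stall the cron handler.
      signal: AbortSignal.timeout(5000),
    });
    return response.ok;
  } catch {
    // Timeout or network error: swallow it, since the heartbeat is best-effort.
    return false;
  }
}
```

Returning a boolean instead of throwing keeps the heartbeat from ever changing the outcome of the cron job it reports on.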

https://claude.ai/code/session_01H6HahwjayzdFFZXbpE9Hg7
The options-object refactor dropped the @param documentation from
triageSecurityFinding, extractSandboxAnalysis, and
maybeAutoDismissAnalysis. Restore them using options.field notation.

https://claude.ai/code/session_01H6HahwjayzdFFZXbpE9Hg7
…ar access

Centralizes SECURITY_SYNC_BETTERSTACK_HEARTBEAT_URL and
SECURITY_CLEANUP_BETTERSTACK_HEARTBEAT_URL in @/lib/config.server
instead of reading process.env directly in route files.

https://claude.ai/code/session_01H6HahwjayzdFFZXbpE9Hg7
@jeanduplessis jeanduplessis merged commit f191c0c into main Feb 9, 2026
11 checks passed
@jeanduplessis jeanduplessis deleted the claude/plan-security-agent-observability-07PgC branch February 9, 2026 16:38

3 participants