Skip to content

Conversation

@stefanhipfel
Copy link
Contributor

@stefanhipfel stefanhipfel commented Oct 13, 2025

Proposed Changes

Subscription to redfish events/ metric reports and add those to the operators prometheus metrics.

Summary by CodeRabbit

  • New Features

    • Event subscription management added to create/delete subscriptions for BMCs.
    • BMC status now exposes links to event and metrics-report subscriptions.
    • Lightweight event server added to receive, aggregate, and expose alerts and metric data.
  • Configuration

    • New runtime flags to configure the event server endpoint (URL, port, protocol) and automatic startup behavior.
  • Tests & Docs

    • Tests updated to assert new status fields; API docs updated to document the new status links.

✏️ Tip: You can customize this high-level summary in your review settings.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Oct 29, 2025
@stefanhipfel stefanhipfel marked this pull request as ready for review November 13, 2025 16:08
@stefanhipfel stefanhipfel requested a review from a team as a code owner December 29, 2025 09:43
@github-actions github-actions bot added size/XXL and removed size/XL labels Dec 29, 2025
@coderabbitai
Copy link

coderabbitai bot commented Dec 29, 2025

Walkthrough

Adds event subscription lifecycle: BMC interface methods and Redfish client implementations to create/delete subscriptions, controller logic to create/cleanup subscriptions and patch BMC status, a lightweight HTTP event receiver (internal/serverevents) with subscription helpers, mock server POST/DELETE support, and test tooling.

Changes

Cohort / File(s) Summary
API & CRD
api/v1alpha1/bmc_types.go, config/crd/bases/metal.ironcore.dev_bmcs.yaml, docs/api-reference/api.md
Added optional BMC status fields MetricsReportSubscriptionLink and EventsSubscriptionLink (JSON: metricsReportSubscriptionLink, eventsSubscriptionLink) and updated CRD schema and docs.
BMC Interface
bmc/bmc.go
Extended BMC interface with CreateEventSubscription(ctx, destination, eventType, protocol) (string, error) and DeleteEventSubscription(ctx, uri) error.
Redfish Implementation
bmc/redfish.go
Implemented CreateEventSubscription and DeleteEventSubscription on RedfishBMC: validate EventService, build/post payload, normalize Location header, and delete subscriptions by URI.
Mock Server Data & HTTP Handler
bmc/mock/server/data/Managers/BMC/index.json, bmc/mock/server/server.go
Added Manufacturer to mock manager JSON. Mock server: added Collection/Member types, POST handling for collection resource creation (overrides, generated @odata.id), and DELETE handling to remove collection members.
Controller: BMC
internal/controller/bmc_controller.go, internal/controller/bmc_controller_test.go
Added EventURL field to reconcilers; added handleEventSubscriptions and deleteEventSubscription; reconcile now subscribes and patches status links; deletion attempts subscription cleanup; tests assert new status fields.
Controller: Server & Tests
internal/controller/server_controller.go, internal/controller/suite_test.go
Initialized EventURL in test manager setup; minor formatting.
Serverevents Package
internal/serverevents/server.go, internal/serverevents/subscription.go
New HTTP event receiver exposing /serverevents/alerts and /serverevents/metricsreport with Prometheus metrics, and helpers SubscribeMetricsReport / SubscribeEvents that call BMC client to create subscriptions.
Manager & Runtime Wiring
cmd/manager/main.go
Added CLI flags event-url, event-port, event-protocol; infer EventURL from EVENT_ADDRESS; wire EventURL into reconciler and start event server runnable when configured.
Test Agent Binary
test/serverevents/main.go
New test program to run an event server agent for local testing.
Misc (go.mod)
go.mod
Module/manifest updated (dependency adjustments implied).

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Reconciler as BMC Reconciler
    participant BMCClient as RedfishBMC
    participant RedfishAPI as Redfish Service
    participant EventSrv as Event Server

    Note over Reconciler,BMCClient: Reconciler requests subscription creation
    Reconciler->>BMCClient: CreateEventSubscription(ctx, callbackURL, format, retry)
    activate BMCClient
    BMCClient->>RedfishAPI: GET /EventService
    RedfishAPI-->>BMCClient: EventService (Enabled)
    BMCClient->>RedfishAPI: POST /EventService/Subscriptions (payload)
    RedfishAPI-->>BMCClient: 201 Created + Location
    BMCClient-->>Reconciler: subscription link (normalized)
    deactivate BMCClient

    Reconciler->>Reconciler: Patch BMC status with subscription link

    Note over RedfishAPI,EventSrv: Later, Redfish sends events to event server
    RedfishAPI->>EventSrv: POST /serverevents/alerts or /metricsreport
    activate EventSrv
    EventSrv->>EventSrv: Parse payload → update Prometheus metrics
    deactivate EventSrv

    Note over Reconciler,BMCClient: On deletion, reconciler requests deletion
    Reconciler->>BMCClient: DeleteEventSubscription(ctx, subscriptionURI)
    activate BMCClient
    BMCClient->>RedfishAPI: GET /EventService/Subscriptions/{id}
    RedfishAPI-->>BMCClient: Subscription (found) / 404 (not found)
    alt found
        BMCClient->>RedfishAPI: DELETE /EventService/Subscriptions/{id}
        RedfishAPI-->>BMCClient: 204 No Content
    end
    BMCClient-->>Reconciler: done
    deactivate BMCClient
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Description check ❓ Inconclusive The PR description is too vague and incomplete. It lacks proper structure and specific details about the changes being made. Expand the description to cover key changes: Redfish interface methods, BMC controller updates, server events package implementation, and how metrics are integrated into Prometheus.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Add event subscription management for server events and metrics' accurately summarizes the main changes: adding Redfish event subscription management with integration into Prometheus metrics.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch redfish_events

📜 Recent review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f41134a and 654d26f.

📒 Files selected for processing (2)
  • cmd/manager/main.go
  • internal/controller/server_controller.go
💤 Files with no reviewable changes (1)
  • internal/controller/server_controller.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • cmd/manager/main.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Run test chart
  • GitHub Check: Run tests
  • GitHub Check: Run test e2e

Comment @coderabbitai help to get the list of available commands and usage tips.

… documentation for SubscribeMetricsReport and SubscribeEvents functions.
@github-actions github-actions bot added size/XL and removed size/XXL labels Dec 29, 2025
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 18

🧹 Nitpick comments (2)
test/serverevents/main.go (1)

50-55: Hardcoded server address :8888.

The server address is hardcoded. Consider making this configurable via a flag for flexibility in testing environments.

internal/controller/bmc_controller.go (1)

108-114: Silent failure when BMC client unavailable during deletion.

If GetBMCClientFromBMC fails, the code silently skips subscription deletion. While this may be intentional (BMC unreachable = can't delete subscriptions), the error is discarded. Consider logging a warning so operators know subscriptions may be orphaned on the BMC.

🔎 Suggested improvement
 	bmcClient, err := bmcutils.GetBMCClientFromBMC(ctx, r.Client, bmcObj, r.Insecure, r.BMCOptions)
-	if err == nil {
+	if err != nil {
+		log.V(1).Info("Unable to get BMC client for subscription cleanup, subscriptions may be orphaned", "error", err)
+	} else {
 		defer bmcClient.Logout()
 		if err := r.deleteEventSubscription(ctx, log, bmcClient, bmcObj); err != nil {
 			return ctrl.Result{}, fmt.Errorf("failed to delete event subscriptions: %w", err)
 		}
 	}
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 37bd373 and dbe2b7f.

⛔ Files ignored due to path filters (1)
  • dist/chart/templates/crd/metal.ironcore.dev_bmcs.yaml is excluded by !**/dist/**
📒 Files selected for processing (15)
  • api/v1alpha1/bmc_types.go
  • bmc/bmc.go
  • bmc/mock/server/data/Managers/BMC/index.json
  • bmc/mock/server/server.go
  • bmc/redfish.go
  • cmd/manager/main.go
  • config/crd/bases/metal.ironcore.dev_bmcs.yaml
  • docs/api-reference/api.md
  • internal/controller/bmc_controller.go
  • internal/controller/bmc_controller_test.go
  • internal/controller/server_controller.go
  • internal/controller/suite_test.go
  • internal/serverevents/server.go
  • internal/serverevents/subscription.go
  • test/serverevents/main.go
🧰 Additional context used
🧬 Code graph analysis (4)
bmc/redfish.go (1)
api/v1alpha1/bmc_types.go (1)
  • Protocol (118-129)
internal/serverevents/subscription.go (1)
bmc/bmc.go (1)
  • BMC (36-129)
cmd/manager/main.go (2)
internal/serverevents/server.go (1)
  • NewServer (66-75)
internal/registry/server.go (1)
  • NewServer (34-45)
test/serverevents/main.go (1)
internal/serverevents/server.go (1)
  • NewServer (66-75)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Run test e2e
  • GitHub Check: Run test chart
  • GitHub Check: Run tests
  • GitHub Check: Run linter
🔇 Additional comments (7)
bmc/mock/server/data/Managers/BMC/index.json (1)

7-7: LGTM!

The Manufacturer field addition is consistent with the existing "Contoso BMC" description and aligns with the BMC manager metadata requirements for the subscription feature.

config/crd/bases/metal.ironcore.dev_bmcs.yaml (1)

258-261: LGTM!

The new status fields eventsSubscriptionLink and metricsReportSubscriptionLink are correctly defined as optional strings with appropriate descriptions. They align with the corresponding Go struct fields in BMCStatus.

Also applies to: 285-288

api/v1alpha1/bmc_types.go (1)

213-219: LGTM!

The new status fields are correctly defined with appropriate documentation, kubebuilder markers, and JSON tags. The optional semantics with omitempty ensure backward compatibility.

internal/controller/bmc_controller_test.go (1)

64-65: LGTM!

The new assertions correctly validate that subscription links are populated in the BMC status. The hardcoded subscription IDs (/5 and /6) are consistent with the mock server's collection member ID generation logic.

Also applies to: 148-149

internal/controller/suite_test.go (2)

17-17: Minor formatting change.

The blank line addition in imports has no functional impact.


175-175: Align EventURL port with production configuration or clarify test intent.

The EventURL is hardcoded to port 8008 in the test, but production uses port 10001 as the default for the event server (set via the event-port flag in cmd/manager/main.go). Either update the test to use 10001 to match production, or document why 8008 is used as a test-specific value. Additionally, verify that an event server is actually started on this port during test execution if subscription logic is being tested.

internal/controller/server_controller.go (1)

86-86: Remove unused EventURL field from ServerReconciler.

The EventURL field is added to ServerReconciler but is never referenced in any of its methods or reconciliation logic. Event subscription handling is implemented only in BMCReconciler (via handleEventSubscriptions() and deleteEventSubscription()), not in the Server controller. This field should be removed to avoid dead code.

Likely an incorrect or invalid review comment.

Comment on lines +123 to +128

// CreateEventSubscription creates an event subscription for the manager.q
CreateEventSubscription(ctx context.Context, destination string, eventType redfish.EventFormatType, protocol redfish.DeliveryRetryPolicy) (string, error)

// DeleteEventSubscription deletes an event subscription for the manager.
DeleteEventSubscription(ctx context.Context, uri string) error
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Minor typo in comment.

Line 124 has a trailing 'q' in the comment: "manager.q" should be "manager."

🔎 Proposed fix
-	// CreateEventSubscription creates an event subscription for the manager.q
+	// CreateEventSubscription creates an event subscription for the manager.
 	CreateEventSubscription(ctx context.Context, destination string, eventType redfish.EventFormatType, protocol redfish.DeliveryRetryPolicy) (string, error)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// CreateEventSubscription creates an event subscription for the manager.q
CreateEventSubscription(ctx context.Context, destination string, eventType redfish.EventFormatType, protocol redfish.DeliveryRetryPolicy) (string, error)
// DeleteEventSubscription deletes an event subscription for the manager.
DeleteEventSubscription(ctx context.Context, uri string) error
// CreateEventSubscription creates an event subscription for the manager.
CreateEventSubscription(ctx context.Context, destination string, eventType redfish.EventFormatType, protocol redfish.DeliveryRetryPolicy) (string, error)
// DeleteEventSubscription deletes an event subscription for the manager.
DeleteEventSubscription(ctx context.Context, uri string) error
🤖 Prompt for AI Agents
In bmc/bmc.go around lines 123 to 128, the comment for CreateEventSubscription
contains a stray trailing 'q' ("manager.q"); remove the 'q' so the comment reads
"manager." Update the comment text only, preserving punctuation and spacing.

Comment on lines +197 to +198
s.mu.RLock()
defer s.mu.Unlock()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Critical: Lock mismatch causes deadlock.

s.mu.RLock() is paired with s.mu.Unlock() instead of s.mu.RUnlock(). This will cause a deadlock because RLock must be released with RUnlock, not Unlock.

Additionally, this function modifies s.overrides (lines 202, 230), which requires a write lock, not a read lock.

🔎 Proposed fix
 func (s *MockServer) handleRedfishDelete(w http.ResponseWriter, r *http.Request) {
 	s.log.Info("Received request", "method", r.Method, "path", r.URL.Path)

 	urlPath := resolvePath(r.URL.Path)
-	s.mu.RLock()
-	defer s.mu.Unlock()
+	s.mu.Lock()
+	defer s.mu.Unlock()
 	_, hasOverride := s.overrides[urlPath]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
s.mu.RLock()
defer s.mu.Unlock()
s.mu.Lock()
defer s.mu.Unlock()
🤖 Prompt for AI Agents
In bmc/mock/server/server.go around lines 197-198, the code currently calls
s.mu.RLock() but defers s.mu.Unlock(), which is a lock/unlock mismatch and also
insufficient because this function mutates s.overrides (lines ~202 and ~230).
Replace the read lock with a full write lock: call s.mu.Lock() and keep defer
s.mu.Unlock() so modifications are protected; if any read-only paths exist,
refactor them separately to use RLock/RUnlock, but for this function use
Lock/Unlock to fix the deadlock and ensure safe writes to s.overrides.

Comment on lines +206 to +220
cached, hasOverride := s.overrides[collectionPath]
var collection Collection
if hasOverride {
collection = cached.(Collection)
} else {
data, err := dataFS.ReadFile(collectionPath)
if err != nil {
http.NotFound(w, r)
return
}
if err := json.Unmarshal(data, &collection); err != nil {
http.Error(w, "Corrupt embedded JSON", http.StatusInternalServerError)
return
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Unsafe type assertion may cause panic.

Line 209 performs a type assertion cached.(Collection) without an ok-check. If the cached value is not a Collection (e.g., it's a map[string]any), this will panic at runtime.

🔎 Proposed fix
 	cached, hasOverride := s.overrides[collectionPath]
 	var collection Collection
 	if hasOverride {
-		collection = cached.(Collection)
+		var ok bool
+		collection, ok = cached.(Collection)
+		if !ok {
+			http.Error(w, "Invalid collection type", http.StatusInternalServerError)
+			return
+		}
 	} else {

Comment on lines +369 to +384
if hasOverride {
s.log.Info("Using overridden data for POST", "path", urlPath)
base = cached.(Collection)
} else {
s.log.Info("Using embedded data for POST", "path", urlPath)
data, err := dataFS.ReadFile(urlPath)
if err != nil {
s.log.Error(err, "Failed to read embedded data for POST", "path", urlPath)
http.NotFound(w, r)
return
}
if err := json.Unmarshal(data, &base); err != nil {
http.Error(w, "Corrupt embedded JSON", http.StatusInternalServerError)
return
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Unsafe type assertion may cause panic.

Line 371 performs cached.(Collection) without an ok-check. This can panic if the cached value is a different type (e.g., map[string]any).

🔎 Proposed fix
 	if hasOverride {
 		s.log.Info("Using overridden data for POST", "path", urlPath)
-		base = cached.(Collection)
+		var ok bool
+		base, ok = cached.(Collection)
+		if !ok {
+			http.Error(w, "Resource is not a collection", http.StatusBadRequest)
+			return
+		}
 	} else {
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if hasOverride {
s.log.Info("Using overridden data for POST", "path", urlPath)
base = cached.(Collection)
} else {
s.log.Info("Using embedded data for POST", "path", urlPath)
data, err := dataFS.ReadFile(urlPath)
if err != nil {
s.log.Error(err, "Failed to read embedded data for POST", "path", urlPath)
http.NotFound(w, r)
return
}
if err := json.Unmarshal(data, &base); err != nil {
http.Error(w, "Corrupt embedded JSON", http.StatusInternalServerError)
return
}
}
if hasOverride {
s.log.Info("Using overridden data for POST", "path", urlPath)
var ok bool
base, ok = cached.(Collection)
if !ok {
http.Error(w, "Resource is not a collection", http.StatusBadRequest)
return
}
} else {
s.log.Info("Using embedded data for POST", "path", urlPath)
data, err := dataFS.ReadFile(urlPath)
if err != nil {
s.log.Error(err, "Failed to read embedded data for POST", "path", urlPath)
http.NotFound(w, r)
return
}
if err := json.Unmarshal(data, &base); err != nil {
http.Error(w, "Corrupt embedded JSON", http.StatusInternalServerError)
return
}
}
🤖 Prompt for AI Agents
In bmc/mock/server/server.go around lines 369 to 384, the code uses a type
assertion cached.(Collection) without an ok-check which can panic if cached
holds a different type; modify this to perform the comma-ok form (v, ok :=
cached.(Collection)), and if ok is true use v as base, otherwise log an error
(including type information), avoid panicking and either fall back to reading
the embedded data or return an HTTP error response (e.g., 500 or 404) depending
on desired behavior; ensure any error path returns after writing the HTTP
response so execution doesn't continue with an invalid base.

Comment on lines +399 to +411
} else {
base.Members = make([]Member, 0)
location := r.URL.JoinPath("1").String()
base.Members = []Member{
{
OdataID: r.URL.JoinPath("1").String(),
},
}
s.overrides[urlPath] = base
if strings.HasSuffix(r.URL.Path, "/Subscriptions") {
w.Header().Set("Location", location)
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Resource data not stored for empty collections.

When the collection is initially empty (line 399 else branch), the code stores the collection with the new member (line 407) but does not store the update data for the new resource itself. Compare with lines 397-398 in the non-empty case where both are stored.

🔎 Proposed fix
 	} else {
-		base.Members = make([]Member, 0)
 		location := r.URL.JoinPath("1").String()
+		newMemberPath := resolvePath(location)
 		base.Members = []Member{
 			{
 				OdataID: r.URL.JoinPath("1").String(),
 			},
 		}
 		s.overrides[urlPath] = base
+		s.overrides[newMemberPath] = update
 		if strings.HasSuffix(r.URL.Path, "/Subscriptions") {
 			w.Header().Set("Location", location)
 		}
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
} else {
base.Members = make([]Member, 0)
location := r.URL.JoinPath("1").String()
base.Members = []Member{
{
OdataID: r.URL.JoinPath("1").String(),
},
}
s.overrides[urlPath] = base
if strings.HasSuffix(r.URL.Path, "/Subscriptions") {
w.Header().Set("Location", location)
}
}
} else {
location := r.URL.JoinPath("1").String()
newMemberPath := resolvePath(location)
base.Members = []Member{
{
OdataID: r.URL.JoinPath("1").String(),
},
}
s.overrides[urlPath] = base
s.overrides[newMemberPath] = update
if strings.HasSuffix(r.URL.Path, "/Subscriptions") {
w.Header().Set("Location", location)
}
}
🤖 Prompt for AI Agents
In bmc/mock/server/server.go around lines 399 to 411, when the collection is
empty the code creates and stores the collection with the new Member but fails
to store the resource's update data for the new member; add the same storage
step used in the non-empty branch by writing the update payload for the new
member URL (r.URL.JoinPath("1").String()) into s.overrides (or the appropriate
map) so the newly created resource has its data persisted, and keep setting the
Location header for Subscriptions as already done.

Comment on lines +121 to +123
if err := json.NewDecoder(r.Body).Decode(&metricsReport); err != nil {
http.Error(w, err.Error(), http.StatusBadRequest)
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Missing return after error response.

After calling http.Error, the handler continues execution instead of returning. This will cause the handler to continue processing with invalid data.

🔎 Suggested fix
 	if err := json.NewDecoder(r.Body).Decode(&metricsReport); err != nil {
 		http.Error(w, err.Error(), http.StatusBadRequest)
+		return
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if err := json.NewDecoder(r.Body).Decode(&metricsReport); err != nil {
http.Error(w, err.Error(), http.StatusBadRequest)
}
if err := json.NewDecoder(r.Body).Decode(&metricsReport); err != nil {
http.Error(w, err.Error(), http.StatusBadRequest)
return
}
🤖 Prompt for AI Agents
In internal/serverevents/server.go around lines 121 to 123, the handler calls
http.Error on JSON decode failure but does not return, causing the function to
continue with invalid data; fix by immediately returning after writing the error
response (i.e., call http.Error(...) then return) so the handler stops further
processing when decoding fails.

Comment on lines +124 to +135
for _, mv := range metricsReport.MetricsValues {
if _, ok := metricsReportCollectors[mv.MetricId]; !ok {
gauge := prometheus.NewGaugeVec(
prometheus.GaugeOpts{
Name: mv.MetricProperty,
Help: "Metric with ID " + mv.MetricId,
},
[]string{"hostname", "vendor"},
)
metricsReportCollectors[mv.MetricId] = gauge
metrics.Registry.MustRegister(gauge)
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Race condition: concurrent map access without synchronization.

metricsReportCollectors is accessed concurrently from multiple HTTP handler goroutines without synchronization. This will cause data races and potential panics.

Additionally, metrics.Registry.MustRegister (line 134) will panic if the same metric is registered twice, which can happen with concurrent requests for the same MetricId.

🔎 Suggested fix: use sync.Mutex or sync.Map
+var metricsReportMu sync.Mutex

 func (s *Server) metricsreportHandler(w http.ResponseWriter, r *http.Request) {
 	// ... existing validation code ...
 	
 	for _, mv := range metricsReport.MetricsValues {
+		metricsReportMu.Lock()
 		if _, ok := metricsReportCollectors[mv.MetricId]; !ok {
 			gauge := prometheus.NewGaugeVec(
 				prometheus.GaugeOpts{
 					Name: mv.MetricProperty,
 					Help: "Metric with ID " + mv.MetricId,
 				},
 				[]string{"hostname", "vendor"},
 			)
 			metricsReportCollectors[mv.MetricId] = gauge
-			metrics.Registry.MustRegister(gauge)
+			if err := metrics.Registry.Register(gauge); err != nil {
+				s.log.Error(err, "failed to register metric", "metricId", mv.MetricId)
+			}
 		}
+		collector := metricsReportCollectors[mv.MetricId]
+		metricsReportMu.Unlock()
 		// ... rest of processing ...
+		collector.WithLabelValues(hostname, vendor).Set(floatVal)
 	}
 }

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
internal/serverevents/server.go around lines 124 to 135: metricsReportCollectors
is read and written by concurrent HTTP handlers causing data races and potential
panics when registering the same metric; protect access with synchronization
(use a package-level sync.Mutex or a sync.RWMutex, or convert
metricsReportCollectors to a sync.Map), ensure you check for the metric and
create/store it while holding the lock, then perform prometheus registration in
a way that avoids panics (use prometheus.Register and handle
prometheus.AlreadyRegisteredError or register while still holding the lock and
only once), and release the lock promptly to avoid contention.

Comment on lines 14 to 25
func SubscribeMetricsReport(ctx context.Context, url, hostname string, bmcClient bmc.BMC) (string, error) {
link, err := bmcClient.CreateEventSubscription(
ctx,
fmt.Sprintf("%s/%s/metrics", url, hostname),
redfish.MetricReportEventFormatType,
redfish.TerminateAfterRetriesDeliveryRetryPolicy,
)
if err != nil {
return link, fmt.Errorf("failed to create event subscription: %w", err)
}
return link, nil
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

URL path mismatch: subscription destination won't match server routes.

The subscription creates a destination URL with path <url>/<hostname>/metrics, but the server (in internal/serverevents/server.go lines 77-80) registers routes at /serverevents/alerts and /serverevents/metricsreport.

The generated URL pattern <url>/<hostname>/metrics won't match /serverevents/metricsreport, causing 404 errors when the BMC attempts to post metrics.

🔎 Suggested fix: align URL paths with server routes
 func SubscribeMetricsReport(ctx context.Context, url, hostname string, bmcClient bmc.BMC) (string, error) {
 	link, err := bmcClient.CreateEventSubscription(
 		ctx,
-		fmt.Sprintf("%s/%s/metrics", url, hostname),
+		fmt.Sprintf("%s/serverevents/metricsreport/%s", url, hostname),
 		redfish.MetricReportEventFormatType,
 		redfish.TerminateAfterRetriesDeliveryRetryPolicy,
 	)

Note: The server handlers also need to be updated to extract hostname from the correct path position, or the URL structure needs to be reconsidered.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In internal/serverevents/subscription.go around lines 14 to 25, the subscription
destination path is generated as "<url>/<hostname>/metrics" which doesn't match
the server's registered route "/serverevents/metricsreport", causing 404s;
update the generated destination to match the server route (for example
"<url>/serverevents/metricsreport" and include hostname in a consistent position
such as "<url>/serverevents/{hostname}/metricsreport" or as a query param) and
then update the server handlers to extract the hostname from that same path
position, ensuring both subscription creation and server routing use the
identical URL format.

Comment on lines +27 to +38
func SubscribeEvents(ctx context.Context, url, hostname string, bmcClient bmc.BMC) (string, error) {
link, err := bmcClient.CreateEventSubscription(
ctx,
fmt.Sprintf("%s/%s/alerts", url, hostname),
redfish.EventEventFormatType,
redfish.TerminateAfterRetriesDeliveryRetryPolicy,
)
if err != nil {
return "", fmt.Errorf("failed to create alert subscription: %w", err)
}
return link, nil
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Inconsistent error return value and URL path issue.

  1. Inconsistent return on error: SubscribeMetricsReport returns link on error (line 22), while SubscribeEvents returns "" (line 35). This should be consistent.

  2. URL path mismatch: Similar to the metrics subscription, the destination <url>/<hostname>/alerts won't match the server route /serverevents/alerts.

🔎 Suggested fix
 func SubscribeEvents(ctx context.Context, url, hostname string, bmcClient bmc.BMC) (string, error) {
 	link, err := bmcClient.CreateEventSubscription(
 		ctx,
-		fmt.Sprintf("%s/%s/alerts", url, hostname),
+		fmt.Sprintf("%s/serverevents/alerts/%s", url, hostname),
 		redfish.EventEventFormatType,
 		redfish.TerminateAfterRetriesDeliveryRetryPolicy,
 	)
 	if err != nil {
-		return "", fmt.Errorf("failed to create alert subscription: %w", err)
+		return link, fmt.Errorf("failed to create alert subscription: %w", err)
 	}
 	return link, nil
 }
🤖 Prompt for AI Agents
In internal/serverevents/subscription.go around lines 27 to 38, make two fixes:
1) make the error return consistent with SubscribeMetricsReport by returning the
created link along with the error instead of an empty string (i.e., return link,
fmt.Errorf(...)); 2) correct the destination path to match the server route by
changing the formatted path from "%s/%s/alerts" to "%s/%s/serverevents/alerts"
so the subscription posts to /serverevents/alerts.

Comment on lines +21 to +27
var registryURL string
var serverUUID string
var duration time.Duration

flag.StringVar(&registryURL, "registry-url", "", "Registry URL where the probe will register itself.")
flag.StringVar(&serverUUID, "server-uuid", "", "Agent UUID to register with the registry.")
flag.DurationVar(&duration, "duration", 5*time.Second, "Duration of time to wait between checks.")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Unused variables: registryURL and duration are defined but never used.

The registryURL and duration flags are parsed and validated but never actually used in the program logic. This appears to be dead code or incomplete implementation.

🔎 Suggested fix: remove unused variables or implement their usage
 func main() {
-	var registryURL string
-	var serverUUID string
-	var duration time.Duration
-
-	flag.StringVar(&registryURL, "registry-url", "", "Registry URL where the probe will register itself.")
-	flag.StringVar(&serverUUID, "server-uuid", "", "Agent UUID to register with the registry.")
-	flag.DurationVar(&duration, "duration", 5*time.Second, "Duration of time to wait between checks.")
+	var serverUUID string
+
+	flag.StringVar(&serverUUID, "server-uuid", "", "Agent UUID to register with the registry.")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
var registryURL string
var serverUUID string
var duration time.Duration
flag.StringVar(&registryURL, "registry-url", "", "Registry URL where the probe will register itself.")
flag.StringVar(&serverUUID, "server-uuid", "", "Agent UUID to register with the registry.")
flag.DurationVar(&duration, "duration", 5*time.Second, "Duration of time to wait between checks.")
var serverUUID string
flag.StringVar(&serverUUID, "server-uuid", "", "Agent UUID to register with the registry.")
🤖 Prompt for AI Agents
In test/serverevents/main.go around lines 21 to 27, the flags registryURL and
duration are declared and parsed but never used; either remove these unused
variables and their flag definitions, or wire them into the program: pass
registryURL into the registration logic where the agent registers with the
registry, and use duration as the interval (instead of a hardcoded value) for
the time between checks/loop sleeps; update any related variable names/usages
and tests/comments accordingly so there are no unused declarations left.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api-change area/metal-automation documentation Improvements or additions to documentation size/XL

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

3 participants