Contributor

@obelix74 obelix74 commented Jan 24, 2026

Add new schema version 4 with tables for storing scan and commit metrics reports as first-class entities.

New tables:

  • scan_metrics_report: Stores scan metrics with trace correlation
  • scan_metrics_report_roles: Junction table for principal roles
  • commit_metrics_report: Stores commit metrics with trace correlation
  • commit_metrics_report_roles: Junction table for principal roles

Key design decisions:

  • PRIMARY KEY (realm_id, report_id) for multi-tenancy
  • Junction tables with CASCADE DELETE for roles
  • Timestamp index for retention cleanup
  • JSONB metadata column for extensibility (Postgres), TEXT for H2
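
For illustration, the design decisions above could combine roughly as follows in the Postgres variant. This is only a sketch: apart from `realm_id`, `report_id` and the table names, every column name, the index name, the index column choice and the literal values are assumptions, not the PR's actual DDL.

```sql
-- Sketch only: columns other than realm_id and report_id are assumed for illustration.
CREATE TABLE IF NOT EXISTS scan_metrics_report (
    realm_id     TEXT   NOT NULL,
    report_id    TEXT   NOT NULL,
    timestamp_ms BIGINT NOT NULL,             -- assumed: report time in epoch millis
    trace_id     TEXT,                        -- assumed: trace correlation id
    metadata     JSONB  DEFAULT '{}'::JSONB,  -- would be TEXT in the H2 variant
    PRIMARY KEY (realm_id, report_id)         -- realm-scoped key for multi-tenancy
);

-- Timestamp index so retention cleanup can range-scan per realm (index name assumed).
CREATE INDEX IF NOT EXISTS idx_scan_metrics_report_ts
    ON scan_metrics_report (realm_id, timestamp_ms);

-- Retention cleanup: matching rows in scan_metrics_report_roles are removed
-- by the junction table's ON DELETE CASCADE foreign key.
DELETE FROM scan_metrics_report
WHERE realm_id = 'my-realm'
  AND timestamp_ms < 1767225600000;  -- example cutoff: 2026-01-01T00:00:00Z
```

With the junction-table approach, a single `DELETE` on the parent table is enough for retention; no separate cleanup of the roles rows is needed.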

Checklist

  • 🛡️ Don't disclose security issues! (contact [email protected])
  • 🔗 Clearly explained why the changes are needed, or linked related issues: Fixes #
  • 🧪 Added/updated tests with good coverage, or manually tested (and explained how)
  • 💡 Added comments for complex logic
  • 🧾 Updated CHANGELOG.md (if needed)
  • 📚 Updated documentation in site/content/in-dev/unreleased (if needed)

Schema is self-contained (includes all tables from v1-v3) to support
fresh installs.
Comment on lines +342 to +348
CREATE TABLE IF NOT EXISTS commit_metrics_report_roles (
    realm_id TEXT NOT NULL,
    report_id TEXT NOT NULL,
    role_name TEXT NOT NULL,
    PRIMARY KEY (realm_id, report_id, role_name),
    FOREIGN KEY (realm_id, report_id) REFERENCES commit_metrics_report(realm_id, report_id) ON DELETE CASCADE
);
Contributor

Porting my comment from the previous PR here: do we need a new table just for this? @dimas-b, thoughts?

Contributor Author

Something like this?

   CREATE TABLE scan_metrics_report (
       ...
       roles JSONB DEFAULT '[]'::JSONB,  -- or TEXT for H2
       ...
   );

   Pros:
     • No extra tables
     • Single INSERT per report
     • Works with both Postgres (JSONB) and H2 (TEXT)
     • Can query with JSON operators: WHERE roles @> '["analyst"]'

   Cons:
     • Harder to index efficiently
     • JSON parsing overhead
     • Less type-safe

Contributor

I believe the debate, then, is between adding the roles directly to the scan_metrics_report schema and normalizing the roles info into a separate table. I believe it's fine to have role_name if we already have the principal name in the schema; if we want to be cautious, we can just move principal_name and the role name into the additional properties.

H2 is just for tests; for PG we can create indexes on the JSONB columns.
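
As a rough sketch of what that could look like in Postgres, assuming the `roles JSONB` column from the snippet above (the index name, the realm value and the role value are hypothetical):

```sql
-- Assumes scan_metrics_report has a roles JSONB column holding an array of role names.
CREATE INDEX IF NOT EXISTS idx_scan_metrics_report_roles_gin
    ON scan_metrics_report USING GIN (roles);

-- Containment query of the kind the default jsonb GIN operator class supports.
SELECT report_id
FROM scan_metrics_report
WHERE realm_id = 'my-realm'
  AND roles @> '["analyst"]'::JSONB;
```

Whether the planner actually uses the index depends on table size and selectivity, so this would need checking against real data.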

Contributor Author

Sounds good @singhpk234. @dimas-b please let us know your thoughts.

Contributor

TBH, I lost context on the RDBMS schema for metrics 😅

Do we have a description (somewhere) of anticipated query patterns in Polaris and whether we expect external (independent) queries against this schema?

@singhpk234 singhpk234 requested a review from dimas-b January 24, 2026 00:23
PRIMARY KEY (event_id)
);

-- Idempotency records (from v3)
Contributor

This is no longer in v3, per #3386.

Contributor Author

I have merged my PR with the latest from main. My PR now only contains the metrics tables I added, and a potential fix for the correct version.

Contributor

dimas-b commented Jan 26, 2026

@obelix74 @singhpk234 : WDYT about starting an RFC doc + dev thread on this? I believe a structured overview of this feature would be good to set the stage for PRs :) (apologies if I missed it)

@singhpk234
Contributor

@dimas-b there is a dev thread already, please refer to: https://lists.apache.org/thread/c83jnkvlwc2k3swm65cmvl4t0mt7p799
Thanks @obelix74 for writing this up!

@obelix74
Contributor Author

> @obelix74 @singhpk234 : WDYT about starting an RFC doc + dev thread on this? I believe a structured overview of this feature would be good to set the stage for PRs :) (apologies if I missed it)

I am trying to solve two sets of asks from my product folks with this.

  1. Metrics - what tables were accessed by a client principal
  2. Auditing - which user accessed what data and why

From the metrics perspective, today, with 1.3.0, I want to be able to report on table metrics.

For scan report queries, track table scan operations:

  • by table
  • by snapshot
  • by time range
  • by realm
  • by user principal
  • by engine

For commit report queries:

  • by operation type
  • data growth
  • file churn
  • storage analysis

There are also many operational dashboards, with filtering by user, realm, engine name, version, etc.
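
To make the query patterns concrete, here is a sketch of two of them. Only the table names and `realm_id` come from this PR; `table_name`, `principal_name`, `timestamp_ms`, `operation`, `added_data_files` and `added_files_size_bytes` are column names assumed for illustration, as are the literal values.

```sql
-- Scans per table and user principal within a time range, for one realm.
SELECT table_name, principal_name, COUNT(*) AS scan_count
FROM scan_metrics_report
WHERE realm_id = 'my-realm'
  AND timestamp_ms >= 1767225600000   -- 2026-01-01T00:00:00Z
  AND timestamp_ms <  1769904000000   -- 2026-02-01T00:00:00Z
GROUP BY table_name, principal_name
ORDER BY scan_count DESC;

-- Data growth and file churn per table, broken down by commit operation type.
SELECT table_name,
       operation,
       SUM(added_data_files)       AS files_added,
       SUM(added_files_size_bytes) AS bytes_added
FROM commit_metrics_report
WHERE realm_id = 'my-realm'
GROUP BY table_name, operation;
```

Both queries filter on `realm_id` first, which lines up with the realm-scoped primary key and keeps tenants isolated.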

I have not thought about roles in this flow at all; perhaps they will be useful. @singhpk234 recommended adding roles, so I added them. I normalized the roles tables from an RDBMS perspective, but I didn't realize there are other similar fields already stored as JSON.

  MERGE INTO version (version_key, version_value)
    KEY (version_key)
-   VALUES ('version', 3);
+   VALUES ('version', 4);
Contributor

Thanks for catching this @obelix74 !
