fix(preprocessing): get merged group_ids after intersections #105385

thetruecpaul · 2025-12-22T17:14:14Z

We support users merging / unmerging / reprocessing their groups. Each of these operations requires the group's occurrences to point to the new group ID. Clickhouse-based data stores are immutable-ish, so we do that through the power of (1) tracking Group Redirects on one of these move-events, and (2) preprocessing queries to expand group_id filters to include all the relevant groups.

Currently, the preprocessing apparatus aims for correctness — it gets all merged group_ids for each filter before intersecting them. This is technically correct, but (1) unlikely to be important (since users usually just use the newest group_id) and (2) breaks end-user expectations (since users would expect if g_id = 1 AND g_id = 2 to generally not return any results, even if one was merged into the other).

This PR changes the behavior so that we run all the intersections before expanding through redirects. This has two main benefits:

Intersections will necessarily have equal-or-fewer group_ids than each filter. Passing fewer IDs into the expansion will be faster and make us less likely to hit the threshold (& miss results).
Only running the expansion twice instead of once for each filter means that we'll cut down on the number of expensive DB queries, which should speed up queries.

We support users merging / unmerging / reprocessing their groups. Each of these operations requires the group's occurrences to point to the new group ID. Clickhouse-based data stores are immutable-ish, so we do that through the power of (1) tracking Group Redirects on one of these move-events, and (2) preprocessing queries to expand group_id filters to include all the relevant groups. Currently, the preprocessing apparatus aims for _correctness_ — it gets all merged group_ids for each filter before intersecting them. This is technically correct, but (1) unlikely to be important (since users usually just use the newest group_id) and (2) breaks end-user expectations (since users would expect `if g_id = 1 AND g_id = 2` to generally not return any results, even if one was merged into the other). This PR changes the behavior so that we run all the intersections before expanding through redirects. This has two main benefits: * Intersections will necessarily have equal-or-fewer group_ids than each filter. Passing fewer IDs into the expansion will be faster and make us less likely to hit the threshold (& miss results). * Only running the expansion twice instead of once for each filter means that we'll cut down on the number of expensive DB queries, which should speed up queries.

cursor · 2025-12-22T17:19:49Z

src/sentry/utils/snuba.py

        if in_groups is not None:
            in_groups.difference_update(out_groups)
-            triple = ["group_id", "IN", in_groups]
+            triple = ["group_id", "IN", get_all_merged_group_ids(in_groups)]


Expanded exclusions can be reintroduced after late expansion

The difference_update on line 976 removes expanded out_groups from raw in_groups, but then line 977 expands in_groups which can re-introduce group IDs that were in out_groups. For example, if group 1 redirects to 10 and group 100 also redirects to 10, a query group_id = 1 AND group_id != 100 would have out_groups = {100, 10} but in_groups = {1}. After difference_update, in_groups stays {1}, but after expansion becomes {1, 10}, incorrectly including 10 which the user tried to exclude.

codecov · 2025-12-22T17:27:37Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.

Additional details and impacted files

@@             Coverage Diff             @@
##           master   #105385      +/-   ##
===========================================
- Coverage   80.57%    80.57%   -0.01%     
===========================================
  Files        9440      9440              
  Lines      404731    404731              
  Branches    25681     25681              
===========================================
- Hits       326124    326106      -18     
- Misses      78138     78156      +18     
  Partials      469       469

This was originally set to 2500 semi-arbitrarily to help avoid Clickhouse ingestion size limits. We're seeing folks hit the threshold; it seems like increasing to a threshold of 4000 would eliminate most of (though not all of) the problems. See: [Widget](https://sentry.sentry.io/organizations/sentry/dashboard/167228/widget/5/?statsPeriod=14d) Notably, we already gave ourselves a lot of breathing room below the Clickhouse limit and we now have the ability to up that limit if needed. Let's bump the threshold to avoid problems. The ideal number might change after #105385; we should keep an eye on that.

thetruecpaul requested a review from a team December 22, 2025 17:14

thetruecpaul requested review from a team as code owners December 22, 2025 17:14

github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Dec 22, 2025

vercel bot deployed to Preview December 22, 2025 17:16 View deployment

cursor bot reviewed Dec 22, 2025

View reviewed changes

shashjar approved these changes Dec 22, 2025

View reviewed changes

thetruecpaul mentioned this pull request Dec 22, 2025

fix(preprocessing): bump redirect-preprocess threshold #105391

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

fix(preprocessing): get merged group_ids after intersections #105385

fix(preprocessing): get merged group_ids after intersections #105385

thetruecpaul commented Dec 22, 2025

Uh oh!

cursor bot Dec 22, 2025

Uh oh!

codecov bot commented Dec 22, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

fix(preprocessing): get merged group_ids after intersections #105385

Are you sure you want to change the base?

fix(preprocessing): get merged group_ids after intersections #105385

Conversation

thetruecpaul commented Dec 22, 2025

Uh oh!

cursor bot Dec 22, 2025

Choose a reason for hiding this comment

Expanded exclusions can be reintroduced after late expansion

Uh oh!

codecov bot commented Dec 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov bot commented Dec 22, 2025 •

edited

Loading