Skip to content

Conversation

@thetruecpaul
Copy link
Contributor

We support users merging / unmerging / reprocessing their groups. Each of these operations requires the group's occurrences to point to the new group ID. Clickhouse-based data stores are immutable-ish, so we do that through the power of (1) tracking Group Redirects on one of these move-events, and (2) preprocessing queries to expand group_id filters to include all the relevant groups.

Currently, the preprocessing apparatus aims for correctness — it gets all merged group_ids for each filter before intersecting them. This is technically correct, but (1) unlikely to be important (since users usually just use the newest group_id) and (2) breaks end-user expectations (since users would expect if g_id = 1 AND g_id = 2 to generally not return any results, even if one was merged into the other).

This PR changes the behavior so that we run all the intersections before expanding through redirects. This has two main benefits:

  • Intersections will necessarily have equal-or-fewer group_ids than each filter. Passing fewer IDs into the expansion will be faster and make us less likely to hit the threshold (& miss results).
  • Only running the expansion twice instead of once for each filter means that we'll cut down on the number of expensive DB queries, which should speed up queries.

We support users merging / unmerging / reprocessing their groups. Each of these operations requires the group's occurrences to point to the new group ID. Clickhouse-based data stores are immutable-ish, so we do that through the power of (1) tracking Group Redirects on one of these move-events, and (2) preprocessing queries to expand group_id filters to include all the relevant groups.

Currently, the preprocessing apparatus aims for _correctness_ — it gets all merged group_ids for each filter before intersecting them. This is technically correct, but (1) unlikely to be important (since users usually just use the newest group_id) and (2) breaks end-user expectations (since users would expect `if g_id = 1 AND g_id = 2` to generally not return any results, even if one was merged into the other).

This PR changes the behavior so that we run all the intersections before expanding through redirects. This has two main benefits:
* Intersections will necessarily have equal-or-fewer group_ids than each filter. Passing fewer IDs into the expansion will be faster and make us less likely to hit the threshold (& miss results).
* Only running the expansion twice instead of once for each filter means that we'll cut down on the number of expensive DB queries, which should speed up queries.
@thetruecpaul thetruecpaul requested a review from a team December 22, 2025 17:14
@thetruecpaul thetruecpaul requested review from a team as code owners December 22, 2025 17:14
@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Dec 22, 2025
if in_groups is not None:
in_groups.difference_update(out_groups)
triple = ["group_id", "IN", in_groups]
triple = ["group_id", "IN", get_all_merged_group_ids(in_groups)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expanded exclusions can be reintroduced after late expansion

The difference_update on line 976 removes expanded out_groups from raw in_groups, but then line 977 expands in_groups which can re-introduce group IDs that were in out_groups. For example, if group 1 redirects to 10 and group 100 also redirects to 10, a query group_id = 1 AND group_id != 100 would have out_groups = {100, 10} but in_groups = {1}. After difference_update, in_groups stays {1}, but after expansion becomes {1, 10}, incorrectly including 10 which the user tried to exclude.

Fix in Cursor Fix in Web

@codecov
Copy link

codecov bot commented Dec 22, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@             Coverage Diff             @@
##           master   #105385      +/-   ##
===========================================
- Coverage   80.57%    80.57%   -0.01%     
===========================================
  Files        9440      9440              
  Lines      404731    404731              
  Branches    25681     25681              
===========================================
- Hits       326124    326106      -18     
- Misses      78138     78156      +18     
  Partials      469       469              

thetruecpaul added a commit that referenced this pull request Dec 22, 2025
This was originally set to 2500 semi-arbitrarily to help avoid Clickhouse ingestion size limits. We're seeing folks hit the threshold; it seems like increasing to a threshold of 4000 would eliminate most of (though not all of) the problems.

See: [Widget](https://sentry.sentry.io/organizations/sentry/dashboard/167228/widget/5/?statsPeriod=14d)

Notably, we already gave ourselves a lot of breathing room below the Clickhouse limit and we now have the ability to up that limit if needed. Let's bump the threshold to avoid problems.

The ideal number might change after #105385; we should keep an eye on that.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Scope: Backend Automatically applied to PRs that change backend components

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants