-
-
Notifications
You must be signed in to change notification settings - Fork 4.5k
fix(preprocessing): get merged group_ids after intersections #105385
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
We support users merging / unmerging / reprocessing their groups. Each of these operations requires the group's occurrences to point to the new group ID. Clickhouse-based data stores are immutable-ish, so we do that through the power of (1) tracking Group Redirects on one of these move-events, and (2) preprocessing queries to expand group_id filters to include all the relevant groups. Currently, the preprocessing apparatus aims for _correctness_ — it gets all merged group_ids for each filter before intersecting them. This is technically correct, but (1) unlikely to be important (since users usually just use the newest group_id) and (2) breaks end-user expectations (since users would expect `if g_id = 1 AND g_id = 2` to generally not return any results, even if one was merged into the other). This PR changes the behavior so that we run all the intersections before expanding through redirects. This has two main benefits: * Intersections will necessarily have equal-or-fewer group_ids than each filter. Passing fewer IDs into the expansion will be faster and make us less likely to hit the threshold (& miss results). * Only running the expansion twice instead of once for each filter means that we'll cut down on the number of expensive DB queries, which should speed up queries.
| if in_groups is not None: | ||
| in_groups.difference_update(out_groups) | ||
| triple = ["group_id", "IN", in_groups] | ||
| triple = ["group_id", "IN", get_all_merged_group_ids(in_groups)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Expanded exclusions can be reintroduced after late expansion
The difference_update on line 976 removes expanded out_groups from raw in_groups, but then line 977 expands in_groups which can re-introduce group IDs that were in out_groups. For example, if group 1 redirects to 10 and group 100 also redirects to 10, a query group_id = 1 AND group_id != 100 would have out_groups = {100, 10} but in_groups = {1}. After difference_update, in_groups stays {1}, but after expansion becomes {1, 10}, incorrectly including 10 which the user tried to exclude.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #105385 +/- ##
===========================================
- Coverage 80.57% 80.57% -0.01%
===========================================
Files 9440 9440
Lines 404731 404731
Branches 25681 25681
===========================================
- Hits 326124 326106 -18
- Misses 78138 78156 +18
Partials 469 469 |
This was originally set to 2500 semi-arbitrarily to help avoid Clickhouse ingestion size limits. We're seeing folks hit the threshold; it seems like increasing to a threshold of 4000 would eliminate most of (though not all of) the problems. See: [Widget](https://sentry.sentry.io/organizations/sentry/dashboard/167228/widget/5/?statsPeriod=14d) Notably, we already gave ourselves a lot of breathing room below the Clickhouse limit and we now have the ability to up that limit if needed. Let's bump the threshold to avoid problems. The ideal number might change after #105385; we should keep an eye on that.
We support users merging / unmerging / reprocessing their groups. Each of these operations requires the group's occurrences to point to the new group ID. Clickhouse-based data stores are immutable-ish, so we do that through the power of (1) tracking Group Redirects on one of these move-events, and (2) preprocessing queries to expand group_id filters to include all the relevant groups.
Currently, the preprocessing apparatus aims for correctness — it gets all merged group_ids for each filter before intersecting them. This is technically correct, but (1) unlikely to be important (since users usually just use the newest group_id) and (2) breaks end-user expectations (since users would expect
if g_id = 1 AND g_id = 2to generally not return any results, even if one was merged into the other).This PR changes the behavior so that we run all the intersections before expanding through redirects. This has two main benefits: