Skip to content

Conversation

@Indhumathi27
Copy link
Contributor

@Indhumathi27 Indhumathi27 commented Nov 18, 2025

What changes were proposed in this pull request?

This PR updates TopNKeyProcessor to skip creating TopNKeyOperator when ReduceSinkDesc.topN is already set by LIMIT pushdown for ORDER BY LIMIT case. This prevents TopNKey from overriding pushdown and ensures the Reduce sink TopNkey filtering is used.

Why are the changes needed?

Currently, when a query includes ORDER BY + LIMIT, LIMIT pushdown is generated during planning but is effectively overridden by the subsequent TopNKey rewrite. As a result, TopNKey operator receives full input rather than a reduced data set, leading to worse performance (e.g., 16M rows forwarded to reducer instead of a few). In cases where global ordering uses a single reducer, LIMIT pushdown is sufficient and far more efficient. This fix prevents unnecessary TopNKey creation so that pushdown can reduce shuffle and significantly improve execution time.

Test reports:

For query: select * from table order by h limit 100;
Total num of rows: 67764224

with fix:
Screenshot 2025-12-01 at 11 45 37 PM

without fix:
Screenshot 2025-12-01 at 11 49 39 PM

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manual testing + existing testcases

@Indhumathi27 Indhumathi27 changed the title Draft: HIVE-29322: Avoid TopNKeyOperator When Map-Side LIMIT Pushdown Provides Better Pruning HIVE-29322: Avoid TopNKeyOperator When Map-Side LIMIT Pushdown Provides Better Pruning Nov 27, 2025
@Indhumathi27
Copy link
Contributor Author

@kasakrisz @deniskuzZ Can you help to review this PR. Thanks

@Indhumathi27 Indhumathi27 force-pushed the disable_topn branch 4 times, most recently from db493f2 to a003d03 Compare December 1, 2025 18:13
@Indhumathi27 Indhumathi27 changed the title HIVE-29322: Avoid TopNKeyOperator When Map-Side LIMIT Pushdown Provides Better Pruning HIVE-29322: Avoid TopNKeyOperator When Map-Side LIMIT Pushdown Provides Better Pruning for ORDER BY LIMIT Queries Dec 1, 2025
@Indhumathi27
Copy link
Contributor Author

@ayushtkn @zabetak @kasakrisz @deniskuzZ Can you help to review the PR. thanks

@zabetak
Copy link
Member

zabetak commented Dec 8, 2025

I left some questions under the JIRA ticket to better understand the issue that we are trying to solve here.

@Indhumathi27
Copy link
Contributor Author

@zabetak @okumin Added the analysis in the jira. Please check

Copy link
Member

@zabetak zabetak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went over the plan changes and they look reasonable based on the goal of this PR. In terms of changes in the optimizer I left a few comments that could make the code a bit easier to follow.

sort order: ++
Statistics: Num rows: 1 Data size: 178 Basic stats: COMPLETE Column stats: COMPLETE
value expressions: _col1 (type: string)
TopN Hash Memory Usage: 0.1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the plan contains a Limit operator all subsequent TopN optimizations in the same mapper/reducer seem redundant. It's out of scope of the current PR but it may be worth logging a separate JIRA ticket as a potential improvement.

// Skip the current optimization when a simple global ORDER BY...LIMIT is present
// (topN > -1 and hasOnlyOrderByLimit()).
// This plan structure is handled more efficiently by the specialized 'TopN In Reducer' optimization.
if (reduceSinkDesc.getTopN() > -1 && reduceSinkDesc.hasOnlyOrderByLimit()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that we don't want to introduce a TopNKeyOperator if the path between TS and RS does not contain a JOIN or GBY operator. We can add such checks at some point here or maybe better modify the respective RuleRegExp to only trigger when there is a GBY or JOIN in the path.

Copy link
Contributor Author

@Indhumathi27 Indhumathi27 Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zabetak I have handled in TopNKeyProcessor to disable TopNkey for cases without JOIN/GROUP BY. can you look at the changes. Thanks

@Indhumathi27 Indhumathi27 changed the title HIVE-29322: Avoid TopNKeyOperator When Map-Side LIMIT Pushdown Provides Better Pruning for ORDER BY LIMIT Queries HIVE-29322: Avoid TopNKeyOperator When ReduceSink TopNkey Filtering Provides Better Pruning for ORDER BY LIMIT Queries Jan 3, 2026
@sonarqubecloud
Copy link

sonarqubecloud bot commented Jan 4, 2026

* This is used to disable TopNKey for pure ORDER BY LIMIT queries where
* LIMIT pushdown must take precedence.
*/
public static boolean isOrderByLimitPath(ReduceSinkOperator rs) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a feeling that the traversal logic here could be skipped by properly setting up the RuleRegExp in org.apache.hadoop.hive.ql.parse.TezCompiler#runTopNKeyOptimization.

Something like:

new RuleRegExp("Top n key optimization", "(GBY%|JOIN%).*RS%")

The expression above is not tested but I have the impression that with some tuning we can get the desired matching scope.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants