For broadcast alias, insert bcast domain at the position corresponding to bcast op by liqiangxl · Pull Request #5944 · NVIDIA/Fuser

liqiangxl · 2026-02-09T15:17:44Z

This PR disables broadcast aliasing for the reduction + broadcast pattern to prevent unnecessary allocation-domain changes on broadcast tensors.
When segmenting normalization kernels, this chagne leads to a reduction + pointwise decomposition. Before this PR, reduction + transpose is used due to the unnecessary allocation domain set on the broadcast tensor.
This is an optimization following PR #5884 and issue #5883

liqiangxl · 2026-02-09T15:17:57Z

!test --diff

github-actions · 2026-02-09T15:18:40Z

Review updated until commit 3fbef38

Description

Insert broadcast dimensions at logical positions in allocation domain
Keep allocation domain close to logical domain for better performance
Add test case for broadcast with reordered layout subset
Enable reduction+pointwise segmentation instead of reduction+transpose

Changes walkthrough

Relevant files

Enhancement

alias_analysis.cpp `Position broadcast dimensions at logical indices` csrc/alias_analysis.cpp Change broadcast dimension insertion from end to logical positions Insert dimensions at corresponding indices in allocation domain Maintain contiguity information for new broadcast dimensions Keep allocation domain aligned with logical domain structure	+4/-3

Tests

test_alias_analysis.cpp `Add test for broadcast with reordered layout` tests/cpp/test_alias_analysis.cpp Add test for broadcast with reordered layout subset Verify allocation domain maintains logical domain permutation Test preferred layout behavior with non-contiguous dimensions Ensure broadcast dims inserted between reordered logical dims	+31/-0
test_persistent_buffer.cpp `Validate scheduler types in persistent buffer test` tests/cpp/test_persistent_buffer.cpp Add scheduler type validation for segmented fusion Expect both PointWise and Reduction scheduler types Verify segmentation produces expected scheduler combination Confirm optimization enables proper kernel decomposition	+7/-0

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review
Logic Correctness The broadcast dimension insertion logic correctly uses `insert(out_allocation.begin() + i, out_logical[i])` to place broadcast dimensions at their corresponding logical positions rather than appending them. This ensures allocation domain stays close to logical domain as intended. out_allocation.insert(out_allocation.begin() + i, out_logical[i]); out_contiguity.insert(out_contiguity.begin() + i, std::nullopt);

liqiangxl · 2026-02-10T14:28:24Z

!test --diff

…o llu/bcast_alias

liqiangxl · 2026-02-10T20:56:15Z

!test --diff

liqiangxl · 2026-02-11T14:42:22Z

code diff is from the revised test and is expected.

liqiangxl · 2026-02-11T14:42:52Z

distributed test failure in LowerGatherTest.InMesh_1_2_OutMesh is not related to this PR.

greptile-apps · 2026-02-11T14:52:08Z

Greptile Overview

Greptile Summary

This PR optimizes broadcast aliasing by inserting broadcast dimensions at their logical positions in the allocation domain rather than appending them to the end. This keeps the allocation domain closer to the logical domain structure.

Changes AliasFinder::handle(const BroadcastOp*) to use insert() at position i instead of push_back() for broadcast dimensions
For reduction+broadcast patterns, this prevents unnecessary allocation-domain changes that previously forced segmentation into reduction+transpose
Now enables reduction+pointwise decomposition for normalization kernels, improving performance
New test Broadcast_OutLayoutReorderOfSubsetOfOutLogical validates the behavior with reordered input layouts
Updated persistent buffer test confirms segmentation now uses reduction+pointwise instead of reduction+transpose

The implementation correctly handles multiple broadcast dimensions by inserting them sequentially at their logical positions, which naturally accounts for array shifts from previous insertions.

Confidence Score: 4/5

This PR is safe to merge with minor considerations
The logic change is sound and well-tested. The algorithm correctly inserts broadcast dimensions at logical positions by leveraging natural array shifts. However, there's one area that warrants verification: the previous review thread raised a contiguity concern that should be validated through additional testing or code inspection to ensure mapInLayoutToOutRoot properly preserves input contiguity for non-broadcast dimensions
Pay close attention to csrc/alias_analysis.cpp to ensure the contiguity preservation mentioned in previous review comments is handled correctly by mapInLayoutToOutRoot

Important Files Changed

Filename	Overview
csrc/alias_analysis.cpp	Changes broadcast dimension insertion from appending to the end to inserting at logical positions, preserving allocation domain structure
tests/cpp/test_alias_analysis.cpp	Adds test verifying broadcast dims are inserted at logical positions when input has reordered layout
tests/cpp/test_persistent_buffer.cpp	Adds assertion that reduction+broadcast pattern now segments into reduction+pointwise instead of reduction+transpose

Flowchart

flowchart TD
    A[Input Tensor<br/>Logical: i0, i1<br/>Allocation: i1, i0] --> B[mapInLayoutToOutRoot]
    B --> C[Mapped Layout<br/>Allocation: i1_out, i0_out]
    C --> D{Iterate through<br/>output logical dims}
    D -->|i=0: not broadcast| E[Skip]
    D -->|i=1: IS broadcast| F[Insert b at position 1<br/>Allocation: i1_out, b, i0_out]
    D -->|i=2: not broadcast| G[Skip]
    F --> H[Final Output<br/>Logical: i0, b, i1<br/>Allocation: i1, b, i0]
    E --> D
    G --> H
    
    style F fill:#90EE90
    style H fill:#87CEEB

_{Last reviewed commit: 3fbef38}

greptile-apps

_{2 files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

wujingyue

Thanks for trying to provide the context! I still don't follow why reduction+broadcast is bad. Is that something alias analysis could fix instead of dodging?

liqiangxl · 2026-02-11T16:58:32Z

Thanks for trying to provide the context! I still don't follow why reduction+broadcast is bad. Is that something alias analysis could fix instead of dodging?

Let me add more details from a test case:
Dumped from NVFUSER_DUMP=pre_segmenter_logging ./test_nvfuser --gtest_filter=*SmemPersistentNotSupportedIn3DReduction

Fusion is a simple normalization:

Inputs:
  T0_g_float[iS0{i0}, iS1{i2}, iS2{i3}, iS3{i4}]
Outputs:
  T3_g_float[iS12{i0}, iS13{i2}, iS14{i3}, iS15{i4}]

%kernel_math {
T1_l_float[rS4{i0}, iS5{i2}, rS6{i3}, rS7{i4}]
   = reduction( T0_g_float[iS0{i0}, iS1{i2}, iS2{i3}, iS3{i4}], op = add, initial value = float(0), allreduce = false )
T2_l_float[bS8{1}, iS9{i2}, bS10{1}, bS11{1}]
   = broadcast( T1_l_float[rS4{i0}, iS5{i2}, rS6{i3}, rS7{i4}], flags = {true, false, true, true} )
T3_g_float[iS12{i0}, iS13{i2}, iS14{i3}, iS15{i4}]
   = T0_g_float[iS0{i0}, iS1{i2}, iS2{i3}, iS3{i4}]
   / T2_l_float[bS8{1}, iS9{i2}, bS10{1}, bS11{1}];
} // %kernel_math

T2's allocation domain is set to : [iS9{i2}, bS8{1}, bS10{1}, bS11{1}]


========================================
Alias analysis result:
  Potential aliases:
    T2 is an alias of T1 if its layout is <allocation=[iS9{i2}, bS8{1}, bS10{1}, bS11{1}], contiguity=[t n n n]>
  Finalized aliases:
    T2 of allocation domain [] and logical domain [bS8{1}, iS9{i2}, bS10{1}, bS11{1}] is a transitive alias of T1

Set the layout of T2 to <allocation=[iS9{i2}, bS8{1}, bS10{1}, bS11{1}], contiguity=[t n n n]>

Fusion is segmented due to large reduction size, then the following segment is picked up by transpose scheduler due to T2's allocation domain can't be mapped with T0's.

T3_g_float[iS12{i0}, iS13{i2}, iS14{i3}, iS15{i4}]
   = T0_g_float[iS0{i0}, iS1{i2}, iS2{i3}, iS3{i4}]
   / T2_g_float[bS8{1}, iS9{i2}, bS10{1}, bS11{1}];

We want to use pointwise scheduler for this segmented fusion. So I disabled the broadcast aliasing for the reduction + broadcast pattern. Transpose scheduler will reject this segment if the allocation domains are mapped, see logic at here.

wujingyue · 2026-02-11T22:47:29Z

the following segment is picked up by transpose scheduler due to T2's allocation domain can't be mapped with T0's.

In other words, the following segment can't be picked up by pointwise because T2's allocation domain can't be mapped with T0's.

What stops T2 and T0 from being mapped?

T0_g_float[iS0{i0}, iS1{i2}, iS2{i3}, iS3{i4}]
T2_l_float[bS8{1}, iS9{i2}, bS10{1}, bS11{1}] with allocation=[iS9{i2}, bS8{1}, bS10{1}, bS11{1}], contiguity=[t n n n]>

IIUC, two better approaches are:

Fix

Fuser/csrc/alias_analysis.cpp

Lines 308 to 313 in 035cf88

    
           for (const auto i : arange(out_logical.size())) { 
        
             if (bcast->isBroadcastDim(i)) { 
        
               out_allocation.push_back(out_logical[i]); 
        
               out_contiguity.push_back(std::nullopt); 
        
             } 
        
           }

so it tries to follow out_logical as much as possible.

Fix pointwise so T0 and T2 can be mapped. The allocation order of broadcast dimensions shouldn't matter at all.

liqiangxl · 2026-02-12T14:03:17Z

the following segment is picked up by transpose scheduler due to T2's allocation domain can't be mapped with T0's.

In other words, the following segment can't be picked up by pointwise because T2's allocation domain can't be mapped with T0's.

What stops T2 and T0 from being mapped?
T0_g_float[iS0{i0}, iS1{i2}, iS2{i3}, iS3{i4}]
T2_l_float[bS8{1}, iS9{i2}, bS10{1}, bS11{1}] with allocation=[iS9{i2}, bS8{1}, bS10{1}, bS11{1}], contiguity=[t n n n]>
IIUC, two better approaches are:

Fix

Fuser/csrc/alias_analysis.cpp

Lines 308 to 313 in 035cf88

for (const auto i : arange(out_logical.size())) {

if (bcast->isBroadcastDim(i)) {

out_allocation.push_back(out_logical[i]);

out_contiguity.push_back(std::nullopt);

}

}

so it tries to follow out_logical as much as possible.

Fix pointwise so T0 and T2 can be mapped. The allocation order of broadcast dimensions shouldn't matter at all.

Approach-1 sounds good to me. I revised to set allocation domain to logical.

liqiangxl · 2026-02-12T14:03:33Z

!test

greptile-apps

_{2 files reviewed, 1 comment}

_{Edit Code Review Agent Settings | Greptile}

greptile-apps · 2026-02-12T14:06:46Z

csrc/alias_analysis.cpp

+  std::vector<IterDomain*> out_allocation(out_logical);
+  std::vector<std::optional<bool>> out_contiguity;
  for (const auto i : arange(out_logical.size())) {
    if (bcast->isBroadcastDim(i)) {
-      out_allocation.push_back(out_logical[i]);
      out_contiguity.push_back(std::nullopt);
+    } else {
+      out_contiguity.push_back(true);


hardcoding contiguity to true for all non-broadcast dimensions ignores the input tensor's actual contiguity. if the input has a non-contiguous layout (e.g., from a previous transpose), this will create a mismatch.

consider whether this should be out_contiguity.push_back(out_layout->contiguity(i)) for non-broadcast dims to preserve input contiguity, or verify that aliasIfCompliant will reject incompatible cases

liqiangxl · 2026-02-12T16:08:48Z

!test

greptile-apps

_{2 files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

liqiangxl · 2026-02-12T17:33:14Z

!test

greptile-apps

_{2 files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

This reverts commit 0377924.

liqiangxl · 2026-02-12T19:33:15Z

!test

greptile-apps

_{2 files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

wujingyue · 2026-02-12T22:34:50Z

Approach-1 sounds good to me.

Thanks for trying this!

I revised to set allocation domain to logical.

That's probably incorrect. You still want to start with out_layout (which could be a reorder of a subset of out_logical) but somehow add broadcast dimensions cleverly so it's as close to logical as possible.

liqiangxl · 2026-02-13T14:08:57Z

Approach-1 sounds good to me.

Thanks for trying this!

I revised to set allocation domain to logical.

That's probably incorrect. You still want to start with out_layout (which could be a reorder of a subset of out_logical) but somehow add broadcast dimensions cleverly so it's as close to logical as possible.

OK, I added a new test where out_layout/allocation is a reordered subtset of out_logical. The new algorithm is simply insert bcast domain at the corresponding bcast axis.

liqiangxl · 2026-02-13T14:09:25Z

!test

greptile-apps

_{3 files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

liqiangxl · 2026-02-13T22:52:37Z

!test

greptile-apps

_{3 files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

wujingyue · 2026-02-15T02:23:56Z

csrc/alias_analysis.cpp

    if (bcast->isBroadcastDim(i)) {
-      out_allocation.push_back(out_logical[i]);
-      out_contiguity.push_back(std::nullopt);
+      out_allocation.insert(out_allocation.begin() + i, out_logical[i]);


Cool! Almost there...

Can you avoid insertion to the middle of a vector? Below is a small trick that doesn't complicate the code much at all.

auto j = std::ssize(out_allocation); out_allocation.resize(out_logical.size()); out_contiguity... for (auto i : arange(std::ssize(out_allocation)) | std::views::reverse) { if is broadcast { out_allocation[i] = ...; out_contiguity... } else { j--; out_allocation[i] = out_allocation[j]; out_contiguity... } }

limit bcast aliasing to io tensors

95d225f

liqiangxl and others added 2 commits February 10, 2026 06:27

revise

2663513

Merge branch 'main' into llu/bcast_alias

d608387

liqiangxl and others added 3 commits February 10, 2026 12:54

only skip reduction+bcast

61c70c1

Merge branch 'llu/bcast_alias' of https://github.com/nvidia/fuser int…

482f391

…o llu/bcast_alias

Merge branch 'main' into llu/bcast_alias

7804d64

liqiangxl marked this pull request as ready for review February 11, 2026 14:43

liqiangxl requested a review from wujingyue February 11, 2026 14:48

liqiangxl changed the title ~~limit bcast aliasing to io tensors~~ disable broadcast aliasing for the reduction + broadcast pattern Feb 11, 2026

greptile-apps bot reviewed Feb 11, 2026

View reviewed changes

wujingyue reviewed Feb 11, 2026

View reviewed changes

liqiangxl added 2 commits February 12, 2026 05:31

Merge branch 'main' into llu/bcast_alias

127939b

bcast alias, alloc domain follow logical domain

1766073

greptile-apps bot reviewed Feb 12, 2026

View reviewed changes

fix

0180570

greptile-apps bot reviewed Feb 12, 2026

View reviewed changes

clean

0377924

greptile-apps bot reviewed Feb 12, 2026

View reviewed changes

liqiangxl added 2 commits February 12, 2026 11:29

Revert "clean"

02b844f

This reverts commit 0377924.

remove preferredLayout

3596088

greptile-apps bot reviewed Feb 12, 2026

View reviewed changes

liqiangxl added 2 commits February 13, 2026 06:04

insert bcast domain at the position corresponding to bcast op

50c3a63

merge

147a3d1

liqiangxl changed the title ~~disable broadcast aliasing for the reduction + broadcast pattern~~ For broadcast alias, insert bcast domain at the position corresponding to bcast op Feb 13, 2026

Merge branch 'main' into llu/bcast_alias

8d07d2f

greptile-apps bot reviewed Feb 13, 2026

View reviewed changes

Merge branch 'main' into llu/bcast_alias

3fbef38

greptile-apps bot reviewed Feb 13, 2026

View reviewed changes

wujingyue approved these changes Feb 15, 2026

View reviewed changes

Conversation

liqiangxl commented Feb 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

liqiangxl commented Feb 9, 2026

Uh oh!

github-actions bot commented Feb 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Changes walkthrough

PR Reviewer Guide

Uh oh!

liqiangxl commented Feb 10, 2026

Uh oh!

liqiangxl commented Feb 10, 2026

Uh oh!

liqiangxl commented Feb 11, 2026

Uh oh!

liqiangxl commented Feb 11, 2026

Uh oh!

greptile-apps bot commented Feb 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Overview

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

wujingyue left a comment

Choose a reason for hiding this comment

Uh oh!

liqiangxl commented Feb 11, 2026

Uh oh!

wujingyue commented Feb 11, 2026

Uh oh!

liqiangxl commented Feb 12, 2026

Uh oh!

liqiangxl commented Feb 12, 2026

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

liqiangxl commented Feb 12, 2026

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

liqiangxl commented Feb 12, 2026

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

liqiangxl commented Feb 12, 2026

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

wujingyue commented Feb 12, 2026

Uh oh!

liqiangxl commented Feb 13, 2026

Uh oh!

liqiangxl commented Feb 13, 2026

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

liqiangxl commented Feb 13, 2026

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

wujingyue Feb 15, 2026

Choose a reason for hiding this comment

Uh oh!

liqiangxl commented Feb 9, 2026 •

edited

Loading

github-actions bot commented Feb 9, 2026 •

edited

Loading

greptile-apps bot commented Feb 11, 2026 •

edited

Loading