Skip to content

For broadcast alias, insert bcast domain at the position corresponding to bcast op#5944

Open
liqiangxl wants to merge 16 commits intomainfrom
llu/bcast_alias
Open

For broadcast alias, insert bcast domain at the position corresponding to bcast op#5944
liqiangxl wants to merge 16 commits intomainfrom
llu/bcast_alias

Conversation

@liqiangxl
Copy link
Collaborator

@liqiangxl liqiangxl commented Feb 9, 2026

This PR disables broadcast aliasing for the reduction + broadcast pattern to prevent unnecessary allocation-domain changes on broadcast tensors.
When segmenting normalization kernels, this chagne leads to a reduction + pointwise decomposition. Before this PR, reduction + transpose is used due to the unnecessary allocation domain set on the broadcast tensor.
This is an optimization following PR #5884 and issue #5883

@liqiangxl
Copy link
Collaborator Author

!test --diff

@github-actions
Copy link

github-actions bot commented Feb 9, 2026

Review updated until commit 3fbef38

Description

  • Insert broadcast dimensions at logical positions in allocation domain

  • Keep allocation domain close to logical domain for better performance

  • Add test case for broadcast with reordered layout subset

  • Enable reduction+pointwise segmentation instead of reduction+transpose

Changes walkthrough

Relevant files
Enhancement
alias_analysis.cpp
Position broadcast dimensions at logical indices                 

csrc/alias_analysis.cpp

  • Change broadcast dimension insertion from end to logical positions
  • Insert dimensions at corresponding indices in allocation domain
  • Maintain contiguity information for new broadcast dimensions
  • Keep allocation domain aligned with logical domain structure
  • +4/-3     
    Tests
    test_alias_analysis.cpp
    Add test for broadcast with reordered layout                         

    tests/cpp/test_alias_analysis.cpp

  • Add test for broadcast with reordered layout subset
  • Verify allocation domain maintains logical domain permutation
  • Test preferred layout behavior with non-contiguous dimensions
  • Ensure broadcast dims inserted between reordered logical dims
  • +31/-0   
    test_persistent_buffer.cpp
    Validate scheduler types in persistent buffer test             

    tests/cpp/test_persistent_buffer.cpp

  • Add scheduler type validation for segmented fusion
  • Expect both PointWise and Reduction scheduler types
  • Verify segmentation produces expected scheduler combination
  • Confirm optimization enables proper kernel decomposition
  • +7/-0     

    PR Reviewer Guide

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review
    Logic Correctness

    The broadcast dimension insertion logic correctly uses insert(out_allocation.begin() + i, out_logical[i]) to place broadcast dimensions at their corresponding logical positions rather than appending them. This ensures allocation domain stays close to logical domain as intended.

    out_allocation.insert(out_allocation.begin() + i, out_logical[i]);
    out_contiguity.insert(out_contiguity.begin() + i, std::nullopt);

    @liqiangxl
    Copy link
    Collaborator Author

    !test --diff

    @liqiangxl
    Copy link
    Collaborator Author

    !test --diff

    @liqiangxl
    Copy link
    Collaborator Author

    code diff is from the revised test and is expected.

    @liqiangxl
    Copy link
    Collaborator Author

    distributed test failure in LowerGatherTest.InMesh_1_2_OutMesh is not related to this PR.

    @liqiangxl liqiangxl marked this pull request as ready for review February 11, 2026 14:43
    @liqiangxl liqiangxl requested a review from wujingyue February 11, 2026 14:48
    @liqiangxl liqiangxl changed the title limit bcast aliasing to io tensors disable broadcast aliasing for the reduction + broadcast pattern Feb 11, 2026
    @greptile-apps
    Copy link
    Contributor

    greptile-apps bot commented Feb 11, 2026

    Greptile Overview

    Greptile Summary

    This PR optimizes broadcast aliasing by inserting broadcast dimensions at their logical positions in the allocation domain rather than appending them to the end. This keeps the allocation domain closer to the logical domain structure.

    • Changes AliasFinder::handle(const BroadcastOp*) to use insert() at position i instead of push_back() for broadcast dimensions
    • For reduction+broadcast patterns, this prevents unnecessary allocation-domain changes that previously forced segmentation into reduction+transpose
    • Now enables reduction+pointwise decomposition for normalization kernels, improving performance
    • New test Broadcast_OutLayoutReorderOfSubsetOfOutLogical validates the behavior with reordered input layouts
    • Updated persistent buffer test confirms segmentation now uses reduction+pointwise instead of reduction+transpose

    The implementation correctly handles multiple broadcast dimensions by inserting them sequentially at their logical positions, which naturally accounts for array shifts from previous insertions.

    Confidence Score: 4/5

    • This PR is safe to merge with minor considerations
    • The logic change is sound and well-tested. The algorithm correctly inserts broadcast dimensions at logical positions by leveraging natural array shifts. However, there's one area that warrants verification: the previous review thread raised a contiguity concern that should be validated through additional testing or code inspection to ensure mapInLayoutToOutRoot properly preserves input contiguity for non-broadcast dimensions
    • Pay close attention to csrc/alias_analysis.cpp to ensure the contiguity preservation mentioned in previous review comments is handled correctly by mapInLayoutToOutRoot

    Important Files Changed

    Filename Overview
    csrc/alias_analysis.cpp Changes broadcast dimension insertion from appending to the end to inserting at logical positions, preserving allocation domain structure
    tests/cpp/test_alias_analysis.cpp Adds test verifying broadcast dims are inserted at logical positions when input has reordered layout
    tests/cpp/test_persistent_buffer.cpp Adds assertion that reduction+broadcast pattern now segments into reduction+pointwise instead of reduction+transpose

    Flowchart

    flowchart TD
        A[Input Tensor<br/>Logical: i0, i1<br/>Allocation: i1, i0] --> B[mapInLayoutToOutRoot]
        B --> C[Mapped Layout<br/>Allocation: i1_out, i0_out]
        C --> D{Iterate through<br/>output logical dims}
        D -->|i=0: not broadcast| E[Skip]
        D -->|i=1: IS broadcast| F[Insert b at position 1<br/>Allocation: i1_out, b, i0_out]
        D -->|i=2: not broadcast| G[Skip]
        F --> H[Final Output<br/>Logical: i0, b, i1<br/>Allocation: i1, b, i0]
        E --> D
        G --> H
        
        style F fill:#90EE90
        style H fill:#87CEEB
    
    Loading

    Last reviewed commit: 3fbef38

    Copy link
    Contributor

    @greptile-apps greptile-apps bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    2 files reviewed, no comments

    Edit Code Review Agent Settings | Greptile

    Copy link
    Collaborator

    @wujingyue wujingyue left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    Thanks for trying to provide the context! I still don't follow why reduction+broadcast is bad. Is that something alias analysis could fix instead of dodging?

    @liqiangxl
    Copy link
    Collaborator Author

    Thanks for trying to provide the context! I still don't follow why reduction+broadcast is bad. Is that something alias analysis could fix instead of dodging?

    Let me add more details from a test case:
    Dumped from NVFUSER_DUMP=pre_segmenter_logging ./test_nvfuser --gtest_filter=*SmemPersistentNotSupportedIn3DReduction

    Fusion is a simple normalization:

    Inputs:
      T0_g_float[iS0{i0}, iS1{i2}, iS2{i3}, iS3{i4}]
    Outputs:
      T3_g_float[iS12{i0}, iS13{i2}, iS14{i3}, iS15{i4}]
    
    %kernel_math {
    T1_l_float[rS4{i0}, iS5{i2}, rS6{i3}, rS7{i4}]
       = reduction( T0_g_float[iS0{i0}, iS1{i2}, iS2{i3}, iS3{i4}], op = add, initial value = float(0), allreduce = false )
    T2_l_float[bS8{1}, iS9{i2}, bS10{1}, bS11{1}]
       = broadcast( T1_l_float[rS4{i0}, iS5{i2}, rS6{i3}, rS7{i4}], flags = {true, false, true, true} )
    T3_g_float[iS12{i0}, iS13{i2}, iS14{i3}, iS15{i4}]
       = T0_g_float[iS0{i0}, iS1{i2}, iS2{i3}, iS3{i4}]
       / T2_l_float[bS8{1}, iS9{i2}, bS10{1}, bS11{1}];
    } // %kernel_math 
    

    T2's allocation domain is set to : [iS9{i2}, bS8{1}, bS10{1}, bS11{1}]

    
    ========================================
    Alias analysis result:
      Potential aliases:
        T2 is an alias of T1 if its layout is <allocation=[iS9{i2}, bS8{1}, bS10{1}, bS11{1}], contiguity=[t n n n]>
      Finalized aliases:
        T2 of allocation domain [] and logical domain [bS8{1}, iS9{i2}, bS10{1}, bS11{1}] is a transitive alias of T1
    
    Set the layout of T2 to <allocation=[iS9{i2}, bS8{1}, bS10{1}, bS11{1}], contiguity=[t n n n]>
    
    

    Fusion is segmented due to large reduction size, then the following segment is picked up by transpose scheduler due to T2's allocation domain can't be mapped with T0's.

    T3_g_float[iS12{i0}, iS13{i2}, iS14{i3}, iS15{i4}]
       = T0_g_float[iS0{i0}, iS1{i2}, iS2{i3}, iS3{i4}]
       / T2_g_float[bS8{1}, iS9{i2}, bS10{1}, bS11{1}];
    

    We want to use pointwise scheduler for this segmented fusion. So I disabled the broadcast aliasing for the reduction + broadcast pattern. Transpose scheduler will reject this segment if the allocation domains are mapped, see logic at here.

    @wujingyue
    Copy link
    Collaborator

    the following segment is picked up by transpose scheduler due to T2's allocation domain can't be mapped with T0's.

    In other words, the following segment can't be picked up by pointwise because T2's allocation domain can't be mapped with T0's.

    What stops T2 and T0 from being mapped?

    T0_g_float[iS0{i0}, iS1{i2}, iS2{i3}, iS3{i4}]
    T2_l_float[bS8{1}, iS9{i2}, bS10{1}, bS11{1}] with allocation=[iS9{i2}, bS8{1}, bS10{1}, bS11{1}], contiguity=[t n n n]>
    

    IIUC, two better approaches are:

    1. Fix
      for (const auto i : arange(out_logical.size())) {
      if (bcast->isBroadcastDim(i)) {
      out_allocation.push_back(out_logical[i]);
      out_contiguity.push_back(std::nullopt);
      }
      }
      so it tries to follow out_logical as much as possible.
    2. Fix pointwise so T0 and T2 can be mapped. The allocation order of broadcast dimensions shouldn't matter at all.

    @liqiangxl
    Copy link
    Collaborator Author

    the following segment is picked up by transpose scheduler due to T2's allocation domain can't be mapped with T0's.

    In other words, the following segment can't be picked up by pointwise because T2's allocation domain can't be mapped with T0's.

    What stops T2 and T0 from being mapped?

    T0_g_float[iS0{i0}, iS1{i2}, iS2{i3}, iS3{i4}]
    T2_l_float[bS8{1}, iS9{i2}, bS10{1}, bS11{1}] with allocation=[iS9{i2}, bS8{1}, bS10{1}, bS11{1}], contiguity=[t n n n]>
    

    IIUC, two better approaches are:

    1. Fix
      for (const auto i : arange(out_logical.size())) {
      if (bcast->isBroadcastDim(i)) {
      out_allocation.push_back(out_logical[i]);
      out_contiguity.push_back(std::nullopt);
      }
      }

      so it tries to follow out_logical as much as possible.
    2. Fix pointwise so T0 and T2 can be mapped. The allocation order of broadcast dimensions shouldn't matter at all.

    Approach-1 sounds good to me. I revised to set allocation domain to logical.

    @liqiangxl
    Copy link
    Collaborator Author

    !test

    Copy link
    Contributor

    @greptile-apps greptile-apps bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    2 files reviewed, 1 comment

    Edit Code Review Agent Settings | Greptile

    Comment on lines 308 to 314
    std::vector<IterDomain*> out_allocation(out_logical);
    std::vector<std::optional<bool>> out_contiguity;
    for (const auto i : arange(out_logical.size())) {
    if (bcast->isBroadcastDim(i)) {
    out_allocation.push_back(out_logical[i]);
    out_contiguity.push_back(std::nullopt);
    } else {
    out_contiguity.push_back(true);
    Copy link
    Contributor

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    hardcoding contiguity to true for all non-broadcast dimensions ignores the input tensor's actual contiguity. if the input has a non-contiguous layout (e.g., from a previous transpose), this will create a mismatch.

    consider whether this should be out_contiguity.push_back(out_layout->contiguity(i)) for non-broadcast dims to preserve input contiguity, or verify that aliasIfCompliant will reject incompatible cases

    @liqiangxl
    Copy link
    Collaborator Author

    !test

    Copy link
    Contributor

    @greptile-apps greptile-apps bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    2 files reviewed, no comments

    Edit Code Review Agent Settings | Greptile

    @liqiangxl
    Copy link
    Collaborator Author

    !test

    Copy link
    Contributor

    @greptile-apps greptile-apps bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    2 files reviewed, no comments

    Edit Code Review Agent Settings | Greptile

    @liqiangxl
    Copy link
    Collaborator Author

    !test

    Copy link
    Contributor

    @greptile-apps greptile-apps bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    2 files reviewed, no comments

    Edit Code Review Agent Settings | Greptile

    @wujingyue
    Copy link
    Collaborator

    Approach-1 sounds good to me.

    Thanks for trying this!

    I revised to set allocation domain to logical.

    That's probably incorrect. You still want to start with out_layout (which could be a reorder of a subset of out_logical) but somehow add broadcast dimensions cleverly so it's as close to logical as possible.

    @liqiangxl liqiangxl changed the title disable broadcast aliasing for the reduction + broadcast pattern For broadcast alias, insert bcast domain at the position corresponding to bcast op Feb 13, 2026
    @liqiangxl
    Copy link
    Collaborator Author

    Approach-1 sounds good to me.

    Thanks for trying this!

    I revised to set allocation domain to logical.

    That's probably incorrect. You still want to start with out_layout (which could be a reorder of a subset of out_logical) but somehow add broadcast dimensions cleverly so it's as close to logical as possible.

    OK, I added a new test where out_layout/allocation is a reordered subtset of out_logical. The new algorithm is simply insert bcast domain at the corresponding bcast axis.

    @liqiangxl
    Copy link
    Collaborator Author

    !test

    Copy link
    Contributor

    @greptile-apps greptile-apps bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    3 files reviewed, no comments

    Edit Code Review Agent Settings | Greptile

    @liqiangxl
    Copy link
    Collaborator Author

    !test

    Copy link
    Contributor

    @greptile-apps greptile-apps bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    3 files reviewed, no comments

    Edit Code Review Agent Settings | Greptile

    if (bcast->isBroadcastDim(i)) {
    out_allocation.push_back(out_logical[i]);
    out_contiguity.push_back(std::nullopt);
    out_allocation.insert(out_allocation.begin() + i, out_logical[i]);
    Copy link
    Collaborator

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    Cool! Almost there...

    Can you avoid insertion to the middle of a vector? Below is a small trick that doesn't complicate the code much at all.

    auto j = std::ssize(out_allocation);
    out_allocation.resize(out_logical.size());
    out_contiguity...
    for (auto i : arange(std::ssize(out_allocation)) | std::views::reverse) {
      if is broadcast {
        out_allocation[i] = ...;
        out_contiguity...
      } else {
        j--;
        out_allocation[i] = out_allocation[j];
        out_contiguity...
      }
    }
    

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

    Labels

    None yet

    Projects

    None yet

    Development

    Successfully merging this pull request may close these issues.

    2 participants