[LV] Don't simplify wide binops to constants if non-uniform #121898


Closed
wants to merge 2 commits

Conversation

lukel97
Contributor

@lukel97 lukel97 commented Jan 7, 2025

After 6d6eea9, we started simplifying more operands of widened binops when they were known to be constant via SCEV.

However in the example in #119173, we were simplifying a reduction phi that was constant in the original IR:

<x1> vector loop: {
  vector.body:
    WIDEN-REDUCTION-PHI ir<%add45> = phi ir<5>, ir<%add> ; <5, 0, 0, 0>
    WIDEN ir<%add> = add ir<0>, ir<%add45> ; <5, 0, 0, 0>
  No successors
}

-->

<x1> vector loop: {
  vector.body:
    WIDEN-REDUCTION-PHI ir<%add45> = phi ir<5>, ir<%add> ; <5, 0, 0, 0>
    WIDEN ir<%add> = add ir<0>, ir<5> ; <5, 5, 5, 5>
  No successors
}

Whilst the underlying value is constant, the widened reduction PHI isn't uniform, so we can't simplify it.

This fixes #119173 by checking that the operand is known to be uniform. The same fix is also needed in the legacy cost model, in order to avoid re-triggering the cost-model mismatch assertion from #107015.

@llvmbot
Member

llvmbot commented Jan 7, 2025

@llvm/pr-subscribers-llvm-transforms

Author: Luke Lau (lukel97)

Changes

After 6d6eea9 we started simplifying more operands of binops when they were known constant via SCEV.

However in the example in #119173, we were simplifying a reduction phi that was constant in the original IR:

<x1> vector loop: {
  vector.body:
    WIDEN-REDUCTION-PHI ir<%add45> = phi ir<5>, ir<%add> ; <5, 0, 0, 0>
    WIDEN ir<%add> = add ir<0>, ir<%add45> ; <5, 0, 0, 0>
  No successors
}

-->

<x1> vector loop: {
  vector.body:
    WIDEN-REDUCTION-PHI ir<%add45> = phi ir<5>, ir<%add> ; <5, 0, 0, 0>
    WIDEN ir<%add> = add ir<0>, ir<5> ; <5, 5, 5, 5>
  No successors
}

Whilst the underlying value is constant, the widened reduction PHI isn't uniform so we can't simplify it.

This fixes #119173 by checking if the operand is known to be uniform, but also requires doing the same fix in the legacy cost model as well in order to avoid the cost-model mismatch assertion from #107015 again.


Full diff: https://github.com/llvm/llvm-project/pull/121898.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+11-2)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/mul-simplification.ll (+55-18)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 0797100b182cb1..dadad7a4d81287 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -6716,9 +6716,16 @@ LoopVectorizationCostModel::getInstructionCost(Instruction *I,
     // If we're speculating on the stride being 1, the multiplication may
     // fold away.  We can generalize this for all operations using the notion
     // of neutral elements.  (TODO)
+    auto IsAlwaysOne = [this, VF](Value *V) {
+      // Reduction phi SCEVs may be constant when scalar, but non-uniform when
+      // vectorized and unfoldable.
+      if (auto *I = dyn_cast<Instruction>(V);
+          I && !isUniformAfterVectorization(I, VF))
+        return false;
+      return PSE.getSCEV(V)->isOne();
+    };
     if (I->getOpcode() == Instruction::Mul &&
-        (PSE.getSCEV(I->getOperand(0))->isOne() ||
-         PSE.getSCEV(I->getOperand(1))->isOne()))
+        (IsAlwaysOne(I->getOperand(0)) || IsAlwaysOne(I->getOperand(1))))
       return 0;
 
     // Detect reduction patterns
@@ -8632,6 +8639,8 @@ VPWidenRecipe *VPRecipeBuilder::tryToWiden(Instruction *I,
       // to replace operands with constants.
       ScalarEvolution &SE = *PSE.getSE();
       auto GetConstantViaSCEV = [this, &SE](VPValue *Op) {
+        if (!vputils::isUniformAfterVectorization(Op))
+          return Op;
         Value *V = Op->getUnderlyingValue();
         if (isa<Constant>(V) || !SE.isSCEVable(V->getType()))
           return Op;
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/mul-simplification.ll b/llvm/test/Transforms/LoopVectorize/AArch64/mul-simplification.ll
index 0ff98d2abe776c..a8d44421a3c37b 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/mul-simplification.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/mul-simplification.ll
@@ -7,24 +7,10 @@ target triple = "arm64-apple-macosx"
 define i64 @mul_select_operand_known_1_via_scev() {
 ; CHECK-LABEL: define i64 @mul_select_operand_known_1_via_scev() {
 ; CHECK-NEXT:  [[ENTRY:.*]]:
-; CHECK-NEXT:    br i1 false, label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
-; CHECK:       [[VECTOR_PH]]:
-; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
-; CHECK:       [[VECTOR_BODY]]:
-; CHECK-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
-; CHECK-NEXT:    [[VEC_PHI:%.*]] = phi <2 x i64> [ <i64 12, i64 1>, %[[VECTOR_PH]] ], [ [[VEC_PHI]], %[[VECTOR_BODY]] ]
-; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 2
-; CHECK-NEXT:    br i1 true, label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
-; CHECK:       [[MIDDLE_BLOCK]]:
-; CHECK-NEXT:    [[TMP0:%.*]] = call i64 @llvm.vector.reduce.mul.v2i64(<2 x i64> [[VEC_PHI]])
-; CHECK-NEXT:    br i1 true, label %[[EXIT:.*]], label %[[SCALAR_PH]]
-; CHECK:       [[SCALAR_PH]]:
-; CHECK-NEXT:    [[BC_MERGE_RDX:%.*]] = phi i64 [ [[TMP0]], %[[MIDDLE_BLOCK]] ], [ 12, %[[ENTRY]] ]
-; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i32 [ 2, %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
 ; CHECK-NEXT:    br label %[[LOOP:.*]]
 ; CHECK:       [[LOOP]]:
-; CHECK-NEXT:    [[RED:%.*]] = phi i64 [ [[BC_MERGE_RDX]], %[[SCALAR_PH]] ], [ [[RED_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[IV:%.*]] = phi i32 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[RED:%.*]] = phi i64 [ 12, %[[ENTRY]] ], [ [[RED_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[IV:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[TMP1:%.*]] = and i32 [[IV]], 1
 ; CHECK-NEXT:    [[CMP1_I:%.*]] = icmp eq i32 [[TMP1]], 0
 ; CHECK-NEXT:    [[NARROW_I:%.*]] = select i1 [[CMP1_I]], i32 1, i32 [[IV]]
@@ -32,9 +18,9 @@ define i64 @mul_select_operand_known_1_via_scev() {
 ; CHECK-NEXT:    [[RED_NEXT]] = mul nsw i64 [[RED]], [[MUL]]
 ; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i32 [[IV]], 1
 ; CHECK-NEXT:    [[EC:%.*]] = icmp eq i32 [[IV]], 1
-; CHECK-NEXT:    br i1 [[EC]], label %[[EXIT]], label %[[LOOP]], !llvm.loop [[LOOP3:![0-9]+]]
+; CHECK-NEXT:    br i1 [[EC]], label %[[EXIT:.*]], label %[[LOOP]]
 ; CHECK:       [[EXIT]]:
-; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[RED_NEXT]], %[[LOOP]] ], [ [[TMP0]], %[[MIDDLE_BLOCK]] ]
+; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[RED_NEXT]], %[[LOOP]] ]
 ; CHECK-NEXT:    ret i64 [[RES]]
 ;
 entry:
@@ -56,6 +42,57 @@ exit:
   %res = phi i64 [ %red.next, %loop ]
   ret i64 %res
 }
+
+define i32 @add_reduction_select_operand_constant_but_non_uniform() {
+; CHECK-LABEL: define i32 @add_reduction_select_operand_constant_but_non_uniform() {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    br i1 false, label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI:%.*]] = phi <4 x i32> [ <i32 42, i32 0, i32 0, i32 0>, %[[VECTOR_PH]] ], [ [[TMP2:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI1:%.*]] = phi <4 x i32> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP1:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[TMP2]] = add <4 x i32> zeroinitializer, [[VEC_PHI]]
+; CHECK-NEXT:    [[TMP1]] = add <4 x i32> zeroinitializer, [[VEC_PHI1]]
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 8
+; CHECK-NEXT:    [[TMP0:%.*]] = icmp eq i32 [[INDEX_NEXT]], 64
+; CHECK-NEXT:    br i1 [[TMP0]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    [[BIN_RDX:%.*]] = add <4 x i32> [[TMP1]], [[TMP2]]
+; CHECK-NEXT:    [[TMP3:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[BIN_RDX]])
+; CHECK-NEXT:    br i1 true, label %[[EXIT:.*]], label %[[SCALAR_PH]]
+; CHECK:       [[SCALAR_PH]]:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i32 [ 64, %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX:%.*]] = phi i32 [ [[TMP3]], %[[MIDDLE_BLOCK]] ], [ 42, %[[ENTRY]] ]
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i32 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[ADD2_REASS:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[RDX:%.*]] = phi i32 [ [[BC_MERGE_RDX]], %[[SCALAR_PH]] ], [ [[RDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[ADD2_REASS]] = add i32 [[IV]], 1
+; CHECK-NEXT:    [[RDX_NEXT]] = add i32 0, [[RDX]]
+; CHECK-NEXT:    [[CMP:%.*]] = icmp ult i32 [[ADD2_REASS]], 64
+; CHECK-NEXT:    br i1 [[CMP]], label %[[LOOP]], label %[[EXIT]], !llvm.loop [[LOOP3:![0-9]+]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    [[ADD_LCSSA:%.*]] = phi i32 [ [[RDX_NEXT]], %[[LOOP]] ], [ [[TMP3]], %[[MIDDLE_BLOCK]] ]
+; CHECK-NEXT:    ret i32 [[ADD_LCSSA]]
+;
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i32 [ 0, %entry ], [ %iv.next, %loop ]
+  %rdx = phi i32 [ 42, %entry ], [ %rdx.next, %loop ]
+
+  %iv.next = add i32 %iv, 1
+  %rdx.next = add i32 0, %rdx
+
+  %cmp = icmp ult i32 %iv.next, 64
+  br i1 %cmp, label %loop, label %exit
+
+exit:
+  ret i32 %rdx.next
+}
 ;.
 ; CHECK: [[LOOP0]] = distinct !{[[LOOP0]], [[META1:![0-9]+]], [[META2:![0-9]+]]}
 ; CHECK: [[META1]] = !{!"llvm.loop.isvectorized", i32 1}

@llvmbot
Member

llvmbot commented Jan 7, 2025

@llvm/pr-subscribers-vectorizers


Contributor

@antoniofrighetto antoniofrighetto left a comment


Thanks for fixing it, this makes sense to me. Let's wait for Florian's feedback as well.

@arcbbb
Copy link
Contributor

arcbbb commented Jan 10, 2025

I was pondering this: if we replace the reduction phi with a constant, the vector operation wouldn't be necessary.
Something like this: https://github.com/arcbbb/llvm-project/commit/fix-reduc-opt/

@fhahn
Copy link
Contributor

fhahn commented Jan 16, 2025

Would it be possible to add the test case first? I've been thinking about some potential alternatives, but it seems like none of the options are really great :(

@lukel97
Copy link
Contributor Author

lukel97 commented Jan 17, 2025

Would it be possible to add the test case first? I've been thinking about some potential alternatives, but it seems like none of the options are really great :(

I committed the test case in e83e0c3. This isn't an area of code that I'm super familiar with so I'd be happy to close this PR and let you take over if needed!

@ayalz
Copy link
Collaborator

ayalz commented Jan 20, 2025

... I've been thinking about some potential alternatives, but it seems like none of the options are really great :(

An add Reduction with a constant addend is an Induction, by definition, but a zero addend leads SCEV (rightfully) to consider the Phi an invariant rather than an AddRec?

Such a redundant invariant Phi chain best be RAUW'd with its start value? Better apply such a scalar preliminary cleanup before reaching LV, rather than within it?

Contributor

@fhahn fhahn left a comment


Sorry this dropped off my radar.

I think it would probably be better to only use SCEV for live-ins for now, to avoid introducing another divergence in case LoopVectorizationCostModel::isUniform... and the vputils version disagree. I tried something like this in #125436, and it looks like it also only impacts the tests as here.

We should be able to do @ayal's suggestion building on top of #124432 eventually, i.e. a very rough version like fhahn@b8e0f63

@lukel97
Copy link
Contributor Author

lukel97 commented Feb 3, 2025

I think it would probably be better to only use SCEV for live-ins for now, to avoid introducing another divergence in case LoopVectorizationCostModel::isUniform... and the vputils version disagree. I tried something like this in #125436, and it looks like it also only impacts the tests as here.

Agreed that makes sense, closing this PR

@lukel97 lukel97 closed this Feb 3, 2025

Successfully merging this pull request may close these issues.

[clang] Miscompilation at -O3
6 participants