[InstCombine] Combine interleaved PHI reduction chains. #143878

base: main

Conversation
@llvm/pr-subscribers-llvm-transforms

Author: Ricardo Jesus (rj-jesus)

Changes

Combine sequences such as:

```llvm
%pn1 = phi [init1, %BB1], [%op1, %BB2]
%pn2 = phi [init2, %BB1], [%op2, %BB2]
%op1 = binop %pn1, constant1
%op2 = binop %pn2, constant2
%rdx = binop %op1, %op2
```

Into:

```llvm
%phi_combined = phi [init_combined, %BB1], [%op_combined, %BB2]
%rdx_combined = binop %phi_combined, constant_combined
```

This allows us to simplify interleaved reductions, for example as introduced by the loop vectorizer.

The anecdotal example for this is the loop below (https://godbolt.org/z/hYhMW6x35):

```c
float foo() {
  float q = 1.f;
  for (int i = 0; i < 1000; ++i)
    q *= .99f;
  return q;
}
```

Which currently gets lowered as an explicit loop such as (on AArch64, interleaved by four):

```gas
.LBB0_1:
  fmul v0.4s, v0.4s, v1.4s
  fmul v2.4s, v2.4s, v1.4s
  fmul v3.4s, v3.4s, v1.4s
  fmul v4.4s, v4.4s, v1.4s
  subs w8, w8, #32
  b.ne .LBB0_1
```

But with this patch lowers trivially:

```gas
foo:
  mov w8, #5028
  movk w8, #14389, lsl #16
  fmov s0, w8
  ret
```

Currently, we require init1, init2, constant1 and constant2 to be foldable constants, but this can be relaxed in the future.

Please let me know if this is better implemented elsewhere, or if there's a better way of doing this!

Patch is 56.12 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/143878.diff

3 Files Affected:
diff --git a/llvm/lib/Transforms/InstCombine/InstCombineInternal.h b/llvm/lib/Transforms/InstCombine/InstCombineInternal.h
index 334462d715f95..248092e21f109 100644
--- a/llvm/lib/Transforms/InstCombine/InstCombineInternal.h
+++ b/llvm/lib/Transforms/InstCombine/InstCombineInternal.h
@@ -652,6 +652,9 @@ class LLVM_LIBRARY_VISIBILITY InstCombinerImpl final
Instruction *foldPHIArgZextsIntoPHI(PHINode &PN);
Instruction *foldPHIArgIntToPtrToPHI(PHINode &PN);
+ /// Try to fold interleaved PHI reductions to a single PHI.
+ Instruction *foldPHIReduction(PHINode &PN);
+
/// If the phi is within a phi web, which is formed by the def-use chain
/// of phis and all the phis in the web are only used in the other phis.
/// In this case, these phis are dead and we will remove all of them.
diff --git a/llvm/lib/Transforms/InstCombine/InstCombinePHI.cpp b/llvm/lib/Transforms/InstCombine/InstCombinePHI.cpp
index a842a5edcb8a3..7fecb213cb0f6 100644
--- a/llvm/lib/Transforms/InstCombine/InstCombinePHI.cpp
+++ b/llvm/lib/Transforms/InstCombine/InstCombinePHI.cpp
@@ -36,6 +36,7 @@ STATISTIC(NumPHIsOfInsertValues,
STATISTIC(NumPHIsOfExtractValues,
"Number of phi-of-extractvalue turned into extractvalue-of-phi");
STATISTIC(NumPHICSEs, "Number of PHI's that got CSE'd");
+STATISTIC(NumPHIsInterleaved, "Number of interleaved PHI's combined");
/// The PHI arguments will be folded into a single operation with a PHI node
/// as input. The debug location of the single operation will be the merged
@@ -989,6 +990,165 @@ Instruction *InstCombinerImpl::foldPHIArgOpIntoPHI(PHINode &PN) {
return NewCI;
}
+/// Try to fold reduction ops interleaved through two PHIs to a single PHI.
+///
+/// For example, combine:
+/// %phi1 = phi [init1, %BB1], [%op1, %BB2]
+/// %phi2 = phi [init2, %BB1], [%op2, %BB2]
+/// %op1 = binop %phi1, constant1
+/// %op2 = binop %phi2, constant2
+/// %rdx = binop %op1, %op2
+/// =>
+/// %phi_combined = phi [init_combined, %BB1], [%op_combined, %BB2]
+/// %rdx_combined = binop %phi_combined, constant_combined
+///
+/// For now, we require init1, init2, constant1 and constant2 to be constants.
+Instruction *InstCombinerImpl::foldPHIReduction(PHINode &PN) {
+ // For now, only handle PHIs with one use and exactly two incoming values.
+ if (!PN.hasOneUse() || PN.getNumIncomingValues() != 2)
+ return nullptr;
+
+ // Find the binop that uses PN and ensure it can be reassociated.
+ auto *BO1 = dyn_cast<BinaryOperator>(PN.user_back());
+ if (!BO1 || !BO1->hasNUses(2) || !BO1->isAssociative())
+ return nullptr;
+
+ // Ensure PN has an incoming value for BO1.
+ if (PN.getIncomingValue(0) != BO1 && PN.getIncomingValue(1) != BO1)
+ return nullptr;
+
+ // Find the initial value of PN.
+ auto *Init1 =
+ dyn_cast<Constant>(PN.getIncomingValue(PN.getIncomingValue(0) == BO1));
+ if (!Init1)
+ return nullptr;
+
+ // Find the constant operand of BO1.
+ assert((BO1->getOperand(0) == &PN || BO1->getOperand(1) == &PN) &&
+ "Unexpected operand!");
+ auto *C1 = dyn_cast<Constant>(BO1->getOperand(BO1->getOperand(0) == &PN));
+ if (!C1)
+ return nullptr;
+
+ // Find the reduction operation.
+ auto Opc = BO1->getOpcode();
+ BinaryOperator *Rdx = nullptr;
+ for (User *U : BO1->users())
+ if (U != &PN) {
+ Rdx = dyn_cast<BinaryOperator>(U);
+ break;
+ }
+ if (!Rdx || Rdx->getOpcode() != Opc || !Rdx->isAssociative())
+ return nullptr;
+
+ // Find the interleaved binop.
+ assert((Rdx->getOperand(0) == BO1 || Rdx->getOperand(1) == BO1) &&
+ "Unexpected operand!");
+ auto *BO2 =
+ dyn_cast<BinaryOperator>(Rdx->getOperand(Rdx->getOperand(0) == BO1));
+ if (!BO2 || !BO2->hasNUses(2) || !BO2->isAssociative() ||
+ BO2->getOpcode() != Opc || BO2->getParent() != BO1->getParent())
+ return nullptr;
+
+ // Find the interleaved PHI and recurrence constant.
+ auto *PN2 = dyn_cast<PHINode>(BO2->getOperand(0));
+ auto *C2 = dyn_cast<Constant>(BO2->getOperand(1));
+ if (!PN2 && !C2) {
+ PN2 = dyn_cast<PHINode>(BO2->getOperand(1));
+ C2 = dyn_cast<Constant>(BO2->getOperand(0));
+ }
+ if (!PN2 || !C2 || !PN2->hasOneUse() || PN2->getParent() != PN.getParent())
+ return nullptr;
+ assert(PN2->getNumIncomingValues() == PN.getNumIncomingValues() &&
+ "Expected PHIs with the same number of incoming values!");
+
+ // Ensure PN2 has an incoming value for BO2.
+ if (PN2->getIncomingValue(0) != BO2 && PN2->getIncomingValue(1) != BO2)
+ return nullptr;
+
+ // Find the initial value of PN2.
+ auto *Init2 = dyn_cast<Constant>(
+ PN2->getIncomingValue(PN2->getIncomingValue(0) == BO2));
+ if (!Init2)
+ return nullptr;
+
+ assert(BO1->isCommutative() && BO2->isCommutative() && Rdx->isCommutative() &&
+ "Expected commutative instructions!");
+
+ // If we've got this far, we can transform:
+ // pn = phi [init1; op1]
+ // pn2 = phi [init2; op2]
+ // op1 = binop (pn, c1)
+ // op2 = binop (pn2, c2)
+ // rdx = binop (op1, op2)
+ // Into:
+ // pn = phi [binop (init1, init2); rdx]
+ // rdx = binop (pn, binop (c1, c2))
+
+ // Attempt to fold the constants.
+ auto *Init = llvm::ConstantFoldBinaryInstruction(Opc, Init1, Init2);
+ auto *C = llvm::ConstantFoldBinaryInstruction(Opc, C1, C2);
+ if (!Init || !C)
+ return nullptr;
+
+ LLVM_DEBUG(dbgs() << " Combining " << PN << "\n " << *BO1
+ << "\n with " << *PN2 << "\n " << *BO2
+ << '\n');
+ ++NumPHIsInterleaved;
+
+ // Create the new PHI.
+ auto *NewPN = PHINode::Create(PN.getType(), PN.getNumIncomingValues());
+
+ // Create the new binary op.
+ auto *NewOp = BinaryOperator::Create(Opc, NewPN, C);
+ if (Opc == Instruction::FAdd || Opc == Instruction::FMul) {
+ // Intersect FMF flags for FADD and FMUL.
+ FastMathFlags Intersect = BO1->getFastMathFlags() &
+ BO2->getFastMathFlags() & Rdx->getFastMathFlags();
+ NewOp->setFastMathFlags(Intersect);
+ } else {
+ OverflowTracking Flags;
+ Flags.AllKnownNonNegative = false;
+ Flags.AllKnownNonZero = false;
+ Flags.mergeFlags(*BO1);
+ Flags.mergeFlags(*BO2);
+ Flags.mergeFlags(*Rdx);
+ Flags.applyFlags(*NewOp);
+ }
+ InsertNewInstWith(NewOp, BO1->getIterator());
+ replaceInstUsesWith(*Rdx, NewOp);
+
+ for (unsigned I = 0, E = PN.getNumIncomingValues(); I != E; ++I) {
+ auto *V = PN.getIncomingValue(I);
+ auto *BB = PN.getIncomingBlock(I);
+ if (V == Init1) {
+ assert(((PN2->getIncomingValue(0) == Init2 &&
+ PN2->getIncomingBlock(0) == BB) ||
+ (PN2->getIncomingValue(1) == Init2 &&
+ PN2->getIncomingBlock(1) == BB)) &&
+ "Invalid incoming block!");
+ NewPN->addIncoming(Init, BB);
+ } else if (V == BO1) {
+ assert(((PN2->getIncomingValue(0) == BO2 &&
+ PN2->getIncomingBlock(0) == BB) ||
+ (PN2->getIncomingValue(1) == BO2 &&
+ PN2->getIncomingBlock(1) == BB)) &&
+ "Invalid incoming block!");
+ NewPN->addIncoming(NewOp, BB);
+ } else
+ llvm_unreachable("Unexpected incoming value!");
+ }
+
+ // Remove dead instructions. BO1/2 are replaced with poison to clean up their
+ // uses.
+ eraseInstFromFunction(*Rdx);
+ eraseInstFromFunction(*replaceInstUsesWith(*BO1, BO1));
+ eraseInstFromFunction(*replaceInstUsesWith(*BO2, BO2));
+ eraseInstFromFunction(*PN2);
+
+ return NewPN;
+}
+
/// Return true if this phi node is always equal to NonPhiInVal.
/// This happens with mutually cyclic phi nodes like:
/// z = some value; x = phi (y, z); y = phi (x, z)
@@ -1448,6 +1608,10 @@ Instruction *InstCombinerImpl::visitPHINode(PHINode &PN) {
if (Instruction *Result = foldPHIArgOpIntoPHI(PN))
return Result;
+ // Try to fold interleaved PHI reductions to a single PHI.
+ if (Instruction *Result = foldPHIReduction(PN))
+ return Result;
+
// If the incoming values are pointer casts of the same original value,
// replace the phi with a single cast iff we can insert a non-PHI instruction.
if (PN.getType()->isPointerTy() &&
diff --git a/llvm/test/Transforms/InstCombine/phi-reduction-chain.ll b/llvm/test/Transforms/InstCombine/phi-reduction-chain.ll
new file mode 100644
index 0000000000000..26d3cd56f158f
--- /dev/null
+++ b/llvm/test/Transforms/InstCombine/phi-reduction-chain.ll
@@ -0,0 +1,1408 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -S -passes=instcombine < %s | FileCheck %s
+
+; Reassociate add.
+define i8 @add_reassoc(i32 %n) {
+; CHECK-LABEL: define i8 @add_reassoc(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT: [[ENTRY:.*]]:
+; CHECK-NEXT: br label %[[BODY:.*]]
+; CHECK: [[BODY]]:
+; CHECK-NEXT: [[I:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[I_NEXT:%.*]], %[[BODY]] ]
+; CHECK-NEXT: [[PN:%.*]] = phi i8 [ 1, %[[ENTRY]] ], [ [[RDX:%.*]], %[[BODY]] ]
+; CHECK-NEXT: [[RDX]] = add i8 [[PN]], 5
+; CHECK-NEXT: [[I_NEXT]] = add nuw nsw i32 [[I]], 1
+; CHECK-NEXT: [[CMP:%.*]] = icmp eq i32 [[I_NEXT]], [[N]]
+; CHECK-NEXT: br i1 [[CMP]], label %[[EXIT:.*]], label %[[BODY]]
+; CHECK: [[EXIT]]:
+; CHECK-NEXT: ret i8 [[RDX]]
+;
+entry:
+ br label %body
+
+body:
+ %i = phi i32 [ 0, %entry ], [ %i.next, %body ]
+ %pn = phi i8 [ 0, %entry ], [ %op1, %body ]
+ %pn2 = phi i8 [ 1, %entry ], [ %op2, %body ]
+ %op1 = add i8 %pn, 2
+ %op2 = add i8 %pn2, 3
+ %i.next = add nuw nsw i32 %i, 1
+ %cmp = icmp eq i32 %i.next, %n
+ br i1 %cmp, label %exit, label %body
+
+exit:
+ %rdx = add i8 %op2, %op1
+ ret i8 %rdx
+}
+
+; Reassociate add, and maintain nuw if all ops have it.
+define i8 @add_nuw(i32 %n) {
+; CHECK-LABEL: define i8 @add_nuw(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT: [[ENTRY:.*]]:
+; CHECK-NEXT: br label %[[BODY:.*]]
+; CHECK: [[BODY]]:
+; CHECK-NEXT: [[I:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[I_NEXT:%.*]], %[[BODY]] ]
+; CHECK-NEXT: [[PN:%.*]] = phi i8 [ 1, %[[ENTRY]] ], [ [[RDX:%.*]], %[[BODY]] ]
+; CHECK-NEXT: [[RDX]] = add nuw i8 [[PN]], 5
+; CHECK-NEXT: [[I_NEXT]] = add nuw nsw i32 [[I]], 1
+; CHECK-NEXT: [[CMP:%.*]] = icmp eq i32 [[I_NEXT]], [[N]]
+; CHECK-NEXT: br i1 [[CMP]], label %[[EXIT:.*]], label %[[BODY]]
+; CHECK: [[EXIT]]:
+; CHECK-NEXT: ret i8 [[RDX]]
+;
+entry:
+ br label %body
+
+body:
+ %i = phi i32 [ 0, %entry ], [ %i.next, %body ]
+ %pn = phi i8 [ 0, %entry ], [ %op1, %body ]
+ %pn2 = phi i8 [ 1, %entry ], [ %op2, %body ]
+ %op1 = add nuw i8 %pn, 2
+ %op2 = add nuw i8 %pn2, 3
+ %i.next = add nuw nsw i32 %i, 1
+ %cmp = icmp eq i32 %i.next, %n
+ br i1 %cmp, label %exit, label %body
+
+exit:
+ %rdx = add nuw i8 %op2, %op1
+ ret i8 %rdx
+}
+
+; Reassociate add, drop nuw if op1 doesn't have it.
+define i8 @add_op1_no_nuw(i32 %n) {
+; CHECK-LABEL: define i8 @add_op1_no_nuw(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT: [[ENTRY:.*]]:
+; CHECK-NEXT: br label %[[BODY:.*]]
+; CHECK: [[BODY]]:
+; CHECK-NEXT: [[I:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[I_NEXT:%.*]], %[[BODY]] ]
+; CHECK-NEXT: [[PN:%.*]] = phi i8 [ 1, %[[ENTRY]] ], [ [[RDX:%.*]], %[[BODY]] ]
+; CHECK-NEXT: [[RDX]] = add i8 [[PN]], 5
+; CHECK-NEXT: [[I_NEXT]] = add nuw nsw i32 [[I]], 1
+; CHECK-NEXT: [[CMP:%.*]] = icmp eq i32 [[I_NEXT]], [[N]]
+; CHECK-NEXT: br i1 [[CMP]], label %[[EXIT:.*]], label %[[BODY]]
+; CHECK: [[EXIT]]:
+; CHECK-NEXT: ret i8 [[RDX]]
+;
+entry:
+ br label %body
+
+body:
+ %i = phi i32 [ 0, %entry ], [ %i.next, %body ]
+ %pn = phi i8 [ 0, %entry ], [ %op1, %body ]
+ %pn2 = phi i8 [ 1, %entry ], [ %op2, %body ]
+ %op1 = add i8 %pn, 2
+ %op2 = add nuw i8 %pn2, 3
+ %i.next = add nuw nsw i32 %i, 1
+ %cmp = icmp eq i32 %i.next, %n
+ br i1 %cmp, label %exit, label %body
+
+exit:
+ %rdx = add nuw i8 %op2, %op1
+ ret i8 %rdx
+}
+
+; Reassociate add, drop nuw if op2 doesn't have it.
+define i8 @add_op2_no_nuw(i32 %n) {
+; CHECK-LABEL: define i8 @add_op2_no_nuw(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT: [[ENTRY:.*]]:
+; CHECK-NEXT: br label %[[BODY:.*]]
+; CHECK: [[BODY]]:
+; CHECK-NEXT: [[I:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[I_NEXT:%.*]], %[[BODY]] ]
+; CHECK-NEXT: [[PN:%.*]] = phi i8 [ 1, %[[ENTRY]] ], [ [[RDX:%.*]], %[[BODY]] ]
+; CHECK-NEXT: [[RDX]] = add i8 [[PN]], 5
+; CHECK-NEXT: [[I_NEXT]] = add nuw nsw i32 [[I]], 1
+; CHECK-NEXT: [[CMP:%.*]] = icmp eq i32 [[I_NEXT]], [[N]]
+; CHECK-NEXT: br i1 [[CMP]], label %[[EXIT:.*]], label %[[BODY]]
+; CHECK: [[EXIT]]:
+; CHECK-NEXT: ret i8 [[RDX]]
+;
+entry:
+ br label %body
+
+body:
+ %i = phi i32 [ 0, %entry ], [ %i.next, %body ]
+ %pn = phi i8 [ 0, %entry ], [ %op1, %body ]
+ %pn2 = phi i8 [ 1, %entry ], [ %op2, %body ]
+ %op1 = add nuw i8 %pn, 2
+ %op2 = add i8 %pn2, 3
+ %i.next = add nuw nsw i32 %i, 1
+ %cmp = icmp eq i32 %i.next, %n
+ br i1 %cmp, label %exit, label %body
+
+exit:
+ %rdx = add nuw i8 %op2, %op1
+ ret i8 %rdx
+}
+
+; Reassociate add, drop nuw if rdx doesn't have it.
+define i8 @add_rdx_no_nuw(i32 %n) {
+; CHECK-LABEL: define i8 @add_rdx_no_nuw(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT: [[ENTRY:.*]]:
+; CHECK-NEXT: br label %[[BODY:.*]]
+; CHECK: [[BODY]]:
+; CHECK-NEXT: [[I:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[I_NEXT:%.*]], %[[BODY]] ]
+; CHECK-NEXT: [[PN:%.*]] = phi i8 [ 1, %[[ENTRY]] ], [ [[RDX:%.*]], %[[BODY]] ]
+; CHECK-NEXT: [[RDX]] = add i8 [[PN]], 5
+; CHECK-NEXT: [[I_NEXT]] = add nuw nsw i32 [[I]], 1
+; CHECK-NEXT: [[CMP:%.*]] = icmp eq i32 [[I_NEXT]], [[N]]
+; CHECK-NEXT: br i1 [[CMP]], label %[[EXIT:.*]], label %[[BODY]]
+; CHECK: [[EXIT]]:
+; CHECK-NEXT: ret i8 [[RDX]]
+;
+entry:
+ br label %body
+
+body:
+ %i = phi i32 [ 0, %entry ], [ %i.next, %body ]
+ %pn = phi i8 [ 0, %entry ], [ %op1, %body ]
+ %pn2 = phi i8 [ 1, %entry ], [ %op2, %body ]
+ %op1 = add nuw i8 %pn, 2
+ %op2 = add nuw i8 %pn2, 3
+ %i.next = add nuw nsw i32 %i, 1
+ %cmp = icmp eq i32 %i.next, %n
+ br i1 %cmp, label %exit, label %body
+
+exit:
+ %rdx = add i8 %op2, %op1
+ ret i8 %rdx
+}
+
+; Reassociate add, drop nsw even if all ops have it.
+define i8 @add_no_nsw(i32 %n) {
+; CHECK-LABEL: define i8 @add_no_nsw(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT: [[ENTRY:.*]]:
+; CHECK-NEXT: br label %[[BODY:.*]]
+; CHECK: [[BODY]]:
+; CHECK-NEXT: [[I:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[I_NEXT:%.*]], %[[BODY]] ]
+; CHECK-NEXT: [[PN:%.*]] = phi i8 [ 1, %[[ENTRY]] ], [ [[RDX:%.*]], %[[BODY]] ]
+; CHECK-NEXT: [[RDX]] = add i8 [[PN]], 5
+; CHECK-NEXT: [[I_NEXT]] = add nuw nsw i32 [[I]], 1
+; CHECK-NEXT: [[CMP:%.*]] = icmp eq i32 [[I_NEXT]], [[N]]
+; CHECK-NEXT: br i1 [[CMP]], label %[[EXIT:.*]], label %[[BODY]]
+; CHECK: [[EXIT]]:
+; CHECK-NEXT: ret i8 [[RDX]]
+;
+entry:
+ br label %body
+
+body:
+ %i = phi i32 [ 0, %entry ], [ %i.next, %body ]
+ %pn = phi i8 [ 0, %entry ], [ %op1, %body ]
+ %pn2 = phi i8 [ 1, %entry ], [ %op2, %body ]
+ %op1 = add nsw i8 %pn, 2
+ %op2 = add nsw i8 %pn2, 3
+ %i.next = add nuw nsw i32 %i, 1
+ %cmp = icmp eq i32 %i.next, %n
+ br i1 %cmp, label %exit, label %body
+
+exit:
+ %rdx = add nsw i8 %op2, %op1
+ ret i8 %rdx
+}
+
+; Reassociate add, keep nuw/nsw if all ops have them.
+define i8 @add_nuw_nsw(i32 %n) {
+; CHECK-LABEL: define i8 @add_nuw_nsw(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT: [[ENTRY:.*]]:
+; CHECK-NEXT: br label %[[BODY:.*]]
+; CHECK: [[BODY]]:
+; CHECK-NEXT: [[I:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[I_NEXT:%.*]], %[[BODY]] ]
+; CHECK-NEXT: [[PN:%.*]] = phi i8 [ 1, %[[ENTRY]] ], [ [[RDX:%.*]], %[[BODY]] ]
+; CHECK-NEXT: [[RDX]] = add nuw nsw i8 [[PN]], 5
+; CHECK-NEXT: [[I_NEXT]] = add nuw nsw i32 [[I]], 1
+; CHECK-NEXT: [[CMP:%.*]] = icmp eq i32 [[I_NEXT]], [[N]]
+; CHECK-NEXT: br i1 [[CMP]], label %[[EXIT:.*]], label %[[BODY]]
+; CHECK: [[EXIT]]:
+; CHECK-NEXT: ret i8 [[RDX]]
+;
+entry:
+ br label %body
+
+body:
+ %i = phi i32 [ 0, %entry ], [ %i.next, %body ]
+ %pn = phi i8 [ 0, %entry ], [ %op1, %body ]
+ %pn2 = phi i8 [ 1, %entry ], [ %op2, %body ]
+ %op1 = add nuw nsw i8 %pn, 2
+ %op2 = add nuw nsw i8 %pn2, 3
+ %i.next = add nuw nsw i32 %i, 1
+ %cmp = icmp eq i32 %i.next, %n
+ br i1 %cmp, label %exit, label %body
+
+exit:
+ %rdx = add nuw nsw i8 %op2, %op1
+ ret i8 %rdx
+}
+
+; Reassociate fixed-length vector operands.
+define <16 x i8> @add_v16i8(i32 %n) {
+; CHECK-LABEL: define <16 x i8> @add_v16i8(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT: [[ENTRY:.*]]:
+; CHECK-NEXT: br label %[[BODY:.*]]
+; CHECK: [[BODY]]:
+; CHECK-NEXT: [[I:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[I_NEXT:%.*]], %[[BODY]] ]
+; CHECK-NEXT: [[PN:%.*]] = phi <16 x i8> [ splat (i8 1), %[[ENTRY]] ], [ [[RDX:%.*]], %[[BODY]] ]
+; CHECK-NEXT: [[RDX]] = add <16 x i8> [[PN]], splat (i8 5)
+; CHECK-NEXT: [[I_NEXT]] = add nuw nsw i32 [[I]], 1
+; CHECK-NEXT: [[CMP:%.*]] = icmp eq i32 [[I_NEXT]], [[N]]
+; CHECK-NEXT: br i1 [[CMP]], label %[[EXIT:.*]], label %[[BODY]]
+; CHECK: [[EXIT]]:
+; CHECK-NEXT: ret <16 x i8> [[RDX]]
+;
+entry:
+ br label %body
+
+body:
+ %i = phi i32 [ 0, %entry ], [ %i.next, %body ]
+ %pn = phi <16 x i8> [ splat (i8 0), %entry ], [ %op1, %body ]
+ %pn2 = phi <16 x i8> [ splat (i8 1), %entry ], [ %op2, %body ]
+ %op1 = add <16 x i8> %pn, splat (i8 2)
+ %op2 = add <16 x i8> %pn2, splat (i8 3)
+ %i.next = add nuw nsw i32 %i, 1
+ %cmp = icmp eq i32 %i.next, %n
+ br i1 %cmp, label %exit, label %body
+
+exit:
+ %rdx = add <16 x i8> %op2, %op1
+ ret <16 x i8> %rdx
+}
+
+; Reassociate scalable vector operands.
+define <vscale x 16 x i8> @add_nxv16i8(i32 %n) {
+; CHECK-LABEL: define <vscale x 16 x i8> @add_nxv16i8(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT: [[ENTRY:.*]]:
+; CHECK-NEXT: br label %[[BODY:.*]]
+; CHECK: [[BODY]]:
+; CHECK-NEXT: [[I:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[I_NEXT:%.*]], %[[BODY]] ]
+; CHECK-NEXT: [[PN:%.*]] = phi <vscale x 16 x i8> [ splat (i8 1), %[[ENTRY]] ], [ [[RDX:%.*]], %[[BODY]] ]
+; CHECK-NEXT: [[RDX]] = add <vscale x 16 x i8> [[PN]], splat (i8 5)
+; CHECK-NEXT: [[I_NEXT]] = add nuw nsw i32 [[I]], 1
+; CHECK-NEXT: [[CMP:%.*]] = icmp eq i32 [[I_NEXT]], [[N]]
+; CHECK-NEXT: br i1 [[CMP]], label %[[EXIT:.*]], label %[[BODY]]
+; CHECK: [[EXIT]]:
+; CHECK-NEXT: ret <vscale x 16 x i8> [[RDX]]
+;
+entry:
+ br label %body
+
+body:
+ %i = phi i32 [ 0, %entry ], [ %i.next, %body ]
+ %pn = phi <vscale x 16 x i8> [ splat (i8 0), %entry ], [ %op1, %body ]
+ %pn2 = phi <vscale x 16 x i8> [ splat (i8 1), %entry ], [ %op2, %body ]
+ %op1 = add <vscale x 16 x i8> %pn, splat (i8 2)
+ %op2 = add <vscale x 16 x i8> %pn2, splat (i8 3)
+ %i.next = add nuw nsw i32 %i, 1
+ %cmp = icmp eq i32 %i.next, %n
+ br i1 %cmp, label %exit, label %body
+
+exit:
+ %rdx = add <vscale x 16 x i8> %op2, %op1
+ ret <vscale x 16 x i8> %rdx
+}
+
+; Check other opcodes.
+
+; Reassociate mul.
+define i8 @mul_reassoc(i32 %n) {
+; CHECK-LABEL: define i8 @mul_reassoc(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT: [[ENTRY:.*]]:
+; CHECK-NEXT: br label %[[BODY:.*]]
+; CHECK: [[BODY]]:
+; CHECK-NEXT: [[I:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[I_NEXT:%.*]], %[[BODY]] ]
+; CHECK-NEXT: [[PN:%.*]] = phi i8 [ 2, %[[ENTRY]] ], [ [[RDX:%.*]], %[[BODY]] ]
+; CHECK-NEXT: [[RDX]] = mul i8 [[PN]], 12
+; CHECK-NEXT: [[I_NEXT]] = add nuw nsw i32 [[I]], 1
+; CHECK-NEXT: [[CMP:%.*]] = icmp eq i32 [[I_NEXT]], [[N]]
+; CHECK-NEXT: br i1 [[CMP]], label %[[EXIT:.*]], label %[[BODY]]
+; CHECK: [[EXIT]]:
+; CHECK-NEXT: ret i8 [[RDX]]
+;
+entry:
+ br label %body
+
+body:
+ %i = phi i32 [ 0, %entry ], [ %i.next, %body ]
+ %pn = phi i8 [ 1, %entry ], [ %op1, %body ]
+ %pn2 = phi i8 [ 2, %entry ], [ %op2, %body ]
+ %op1 = mul i8 %pn, 3
+ %op2 = mul i8 %pn2, 4
+ %i.next = add nuw nsw i32 %i, 1
+ %cmp = icmp eq i32 %i.next, %n
+ br i1 %cmp, label %exit, label %body
+
+exit:
+ %rdx = mul i8 %op2, %op1
+ ret i8 %rdx
+}
+
+; Reassociate mul, don't expect any flags to be propagated.
+define i8 @mul_reassoc_no_nuw_no_nsw(i32 %n) {
+; CHECK-LABEL: define i8 @mul_reassoc_no_nuw_no_nsw(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT: [[ENTRY:.*]]:
+; CHECK-NEXT: br label %[[BODY:.*]]
+; CHECK: [[BO...
[truncated]
Note: I'm not sure how this can be tested, but llvm#143878 is affected by this change.
61f06ce to 61ef013
eraseInstFromFunction(*replaceInstUsesWith(*BO1, BO1));
eraseInstFromFunction(*replaceInstUsesWith(*BO2, BO2));

Suggested change:

eraseInstFromFunction(*replaceInstUsesWith(*BO1, PoisonValue::get(BO1->getType())));
eraseInstFromFunction(*replaceInstUsesWith(*BO2, PoisonValue::get(BO2->getType())));

Thanks, done.
I guess it is intended to fulfill the pipeline? Imagine the CPU has multiple ports/pipelines executing the same kind of instructions (load/fadd/fmul).

Generally I agree, but in the cases this patch tries to capture, we don't have loads/stores or other such ops, and the binary ops we have can be collapsed to one. In these cases, I can't see how executing the multiple ops interleaved can be beneficial. For example, consider:

```llvm
%pn1 = phi [1.0, %BB1], [%op1, %BB2]
%pn2 = phi [1.0, %BB1], [%op2, %BB2]
%op1 = fmul %pn1, 0.9
%op2 = fmul %pn2, 0.9
%res = fmul %op1, %op2
```

Which can be folded to:

```llvm
%pn = phi [1.0, %BB1], [%res, %BB2]
%res = fmul %pn, 0.81
```

Assuming the constants can be materialised similarly, the second version requires strictly fewer instructions to effect the same computations. Do you see what I mean?
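(If it helps, here is a self-contained version of the snippet above that can be fed to `opt -passes=instcombine` directly. It is only an illustration: the function name, the loop framing, and the exactly representable constants 0.5/0.25 are mine rather than part of the patch, and the reassoc/nsz flags are there because the fold requires the fmuls to be associative.)

```llvm
; Two interleaved fmul recurrences; the new fold should collapse them into a
; single recurrence that multiplies by 0.25 (= 0.5 * 0.5) per iteration.
define float @interleaved_product(i32 %n) {
entry:
  br label %loop

loop:
  %i   = phi i32   [ 0, %entry ], [ %i.next, %loop ]
  %pn1 = phi float [ 1.0, %entry ], [ %op1, %loop ]
  %pn2 = phi float [ 1.0, %entry ], [ %op2, %loop ]
  %op1 = fmul reassoc nsz float %pn1, 0.5
  %op2 = fmul reassoc nsz float %pn2, 0.5
  %i.next = add nuw nsw i32 %i, 1
  %cmp = icmp eq i32 %i.next, %n
  br i1 %cmp, label %exit, label %loop

exit:
  %res = fmul reassoc nsz float %op1, %op2
  ret float %res
}
```

This mirrors the shape of the add/mul tests in phi-reduction-chain.ll, just with fmul.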
I know. Obviously the combined version is faster. I just wonder if we can avoid introducing this pattern in LoopVectorizer by making some adjustments to cost modeling...
Thanks for the review, @antoniofrighetto - I believe I should have addressed everything. Please let me know if you have any other comments. :)
@dtcxzyw I see, thanks, I think that should also work. I think if we go down that route, we just need to be careful with loops such as this where interleaving is beneficial, since these loops are currently not optimised by indvars or other such passes (as far as I'm aware). So, in these cases, interleaving and folding away the reduction chains separately seemed to be the most straightforward approach.
LGTM. Please wait for additional approval from other reviewers.
Thanks very much, I will do. :)
/// %phi2 = phi [init2, %BB1], [%op2, %BB2]
/// %op1 = binop %phi1, constant1
/// %op2 = binop %phi2, constant2
/// %rdx = binop %op1, %op2
Why does the pattern start matching at one of the phi nodes rather than at %rdx? This is pretty unusual for InstCombine, and it's not immediately obvious to me why it is necessary.
We have a similar transform in InstCombinerImpl::foldBinopWithPhiOperands. But it looks hard to refactor the code :(
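(For readers who have not seen that helper: as far as I understand it, it matches a binop whose two operands are single-use phis in the same block, where one predecessor feeds both phis constants; the binop is then recreated in the other predecessor and the constant pair is folded. The sketch below is illustrative only and glosses over the exact preconditions.)

```llvm
; Hypothetical example of the rough shape foldBinopWithPhiOperands targets.
define i32 @binop_of_phis(i1 %c, i32 %x, i32 %y) {
entry:
  br i1 %c, label %a, label %b

a:                                      ; non-constant values arrive from here
  br label %join

b:                                      ; both phis receive constants from here
  br label %join

join:
  %p0 = phi i32 [ %x, %a ], [ 40, %b ]
  %p1 = phi i32 [ %y, %a ], [ 2, %b ]
  %r = add i32 %p0, %p1                 ; expected: an add of %x, %y hoisted into %a,
  ret i32 %r                            ; plus a phi of that add and the folded 42
}
```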
Hi @nikic, @dtcxzyw, I apologise for the delayed response. I was out of the office for most of last week.

> Why does the pattern start matching at one of the phi nodes rather than at %rdx?

Given that I was specifically looking for sequences of interleaved recurrences, it seemed that starting at one of the phi nodes and working down from there would potentially allow bailing out sooner than if the match had started from %rdx. But I'm not precious about it; I'm happy to start the match from %rdx and/or move the pattern elsewhere if that's more appropriate.

Would it be preferable to move the pattern to a method similar to foldBinopWithPhiOperands and start the match from %rdx? Or do you have something else in mind? :)