[InstCombine] Combine interleaved PHI reduction chains. #143878

base: main

Conversation
@llvm/pr-subscribers-llvm-transforms

Author: Ricardo Jesus (rj-jesus)

Changes

Combine sequences such as:

```llvm
%pn1 = phi [init1, %BB1], [%op1, %BB2]
%pn2 = phi [init2, %BB1], [%op2, %BB2]
%op1 = binop %pn1, constant1
%op2 = binop %pn2, constant2
%rdx = binop %op1, %op2
```

Into:

```llvm
%phi_combined = phi [init_combined, %BB1], [%op_combined, %BB2]
%rdx_combined = binop %phi_combined, constant_combined
```

This allows us to simplify interleaved reductions, for example as introduced by the loop vectorizer.

The anecdotal example for this is the loop below (https://godbolt.org/z/hYhMW6x35):

```c
float foo() {
  float q = 1.f;
  for (int i = 0; i < 1000; ++i)
    q *= .99f;
  return q;
}
```

Which currently gets lowered as an explicit loop such as (on AArch64, interleaved by four):

```gas
.LBB0_1:
  fmul v0.4s, v0.4s, v1.4s
  fmul v2.4s, v2.4s, v1.4s
  fmul v3.4s, v3.4s, v1.4s
  fmul v4.4s, v4.4s, v1.4s
  subs w8, w8, #32
  b.ne .LBB0_1
```

But with this patch lowers trivially:

```gas
foo:
  mov w8, #5028
  movk w8, #14389, lsl #16
  fmov s0, w8
  ret
```

Currently, we require init1, init2, constant1 and constant2 to be foldable constants, but this can be relaxed in the future.

Please let me know if this is better implemented elsewhere, or if there's a better way of doing this!

Patch is 56.12 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/143878.diff

3 Files Affected:
diff --git a/llvm/lib/Transforms/InstCombine/InstCombineInternal.h b/llvm/lib/Transforms/InstCombine/InstCombineInternal.h
index 334462d715f95..248092e21f109 100644
--- a/llvm/lib/Transforms/InstCombine/InstCombineInternal.h
+++ b/llvm/lib/Transforms/InstCombine/InstCombineInternal.h
@@ -652,6 +652,9 @@ class LLVM_LIBRARY_VISIBILITY InstCombinerImpl final
Instruction *foldPHIArgZextsIntoPHI(PHINode &PN);
Instruction *foldPHIArgIntToPtrToPHI(PHINode &PN);
+ /// Try to fold interleaved PHI reductions to a single PHI.
+ Instruction *foldPHIReduction(PHINode &PN);
+
/// If the phi is within a phi web, which is formed by the def-use chain
/// of phis and all the phis in the web are only used in the other phis.
/// In this case, these phis are dead and we will remove all of them.
diff --git a/llvm/lib/Transforms/InstCombine/InstCombinePHI.cpp b/llvm/lib/Transforms/InstCombine/InstCombinePHI.cpp
index a842a5edcb8a3..7fecb213cb0f6 100644
--- a/llvm/lib/Transforms/InstCombine/InstCombinePHI.cpp
+++ b/llvm/lib/Transforms/InstCombine/InstCombinePHI.cpp
@@ -36,6 +36,7 @@ STATISTIC(NumPHIsOfInsertValues,
STATISTIC(NumPHIsOfExtractValues,
"Number of phi-of-extractvalue turned into extractvalue-of-phi");
STATISTIC(NumPHICSEs, "Number of PHI's that got CSE'd");
+STATISTIC(NumPHIsInterleaved, "Number of interleaved PHI's combined");
/// The PHI arguments will be folded into a single operation with a PHI node
/// as input. The debug location of the single operation will be the merged
@@ -989,6 +990,165 @@ Instruction *InstCombinerImpl::foldPHIArgOpIntoPHI(PHINode &PN) {
return NewCI;
}
+/// Try to fold reduction ops interleaved through two PHIs to a single PHI.
+///
+/// For example, combine:
+/// %phi1 = phi [init1, %BB1], [%op1, %BB2]
+/// %phi2 = phi [init2, %BB1], [%op2, %BB2]
+/// %op1 = binop %phi1, constant1
+/// %op2 = binop %phi2, constant2
+/// %rdx = binop %op1, %op2
+/// =>
+/// %phi_combined = phi [init_combined, %BB1], [%op_combined, %BB2]
+/// %rdx_combined = binop %phi_combined, constant_combined
+///
+/// For now, we require init1, init2, constant1 and constant2 to be constants.
+Instruction *InstCombinerImpl::foldPHIReduction(PHINode &PN) {
+ // For now, only handle PHIs with one use and exactly two incoming values.
+ if (!PN.hasOneUse() || PN.getNumIncomingValues() != 2)
+ return nullptr;
+
+ // Find the binop that uses PN and ensure it can be reassociated.
+ auto *BO1 = dyn_cast<BinaryOperator>(PN.user_back());
+ if (!BO1 || !BO1->hasNUses(2) || !BO1->isAssociative())
+ return nullptr;
+
+ // Ensure PN has an incoming value for BO1.
+ if (PN.getIncomingValue(0) != BO1 && PN.getIncomingValue(1) != BO1)
+ return nullptr;
+
+ // Find the initial value of PN.
+ auto *Init1 =
+ dyn_cast<Constant>(PN.getIncomingValue(PN.getIncomingValue(0) == BO1));
+ if (!Init1)
+ return nullptr;
+
+ // Find the constant operand of BO1.
+ assert((BO1->getOperand(0) == &PN || BO1->getOperand(1) == &PN) &&
+ "Unexpected operand!");
+ auto *C1 = dyn_cast<Constant>(BO1->getOperand(BO1->getOperand(0) == &PN));
+ if (!C1)
+ return nullptr;
+
+ // Find the reduction operation.
+ auto Opc = BO1->getOpcode();
+ BinaryOperator *Rdx = nullptr;
+ for (User *U : BO1->users())
+ if (U != &PN) {
+ Rdx = dyn_cast<BinaryOperator>(U);
+ break;
+ }
+ if (!Rdx || Rdx->getOpcode() != Opc || !Rdx->isAssociative())
+ return nullptr;
+
+ // Find the interleaved binop.
+ assert((Rdx->getOperand(0) == BO1 || Rdx->getOperand(1) == BO1) &&
+ "Unexpected operand!");
+ auto *BO2 =
+ dyn_cast<BinaryOperator>(Rdx->getOperand(Rdx->getOperand(0) == BO1));
+ if (!BO2 || !BO2->hasNUses(2) || !BO2->isAssociative() ||
+ BO2->getOpcode() != Opc || BO2->getParent() != BO1->getParent())
+ return nullptr;
+
+ // Find the interleaved PHI and recurrence constant.
+ auto *PN2 = dyn_cast<PHINode>(BO2->getOperand(0));
+ auto *C2 = dyn_cast<Constant>(BO2->getOperand(1));
+ if (!PN2 && !C2) {
+ PN2 = dyn_cast<PHINode>(BO2->getOperand(1));
+ C2 = dyn_cast<Constant>(BO2->getOperand(0));
+ }
+ if (!PN2 || !C2 || !PN2->hasOneUse() || PN2->getParent() != PN.getParent())
+ return nullptr;
+ assert(PN2->getNumIncomingValues() == PN.getNumIncomingValues() &&
+ "Expected PHIs with the same number of incoming values!");
+
+ // Ensure PN2 has an incoming value for BO2.
+ if (PN2->getIncomingValue(0) != BO2 && PN2->getIncomingValue(1) != BO2)
+ return nullptr;
+
+ // Find the initial value of PN2.
+ auto *Init2 = dyn_cast<Constant>(
+ PN2->getIncomingValue(PN2->getIncomingValue(0) == BO2));
+ if (!Init2)
+ return nullptr;
+
+ assert(BO1->isCommutative() && BO2->isCommutative() && Rdx->isCommutative() &&
+ "Expected commutative instructions!");
+
+ // If we've got this far, we can transform:
+ // pn = phi [init1; op1]
+ // pn2 = phi [init2; op2]
+ // op1 = binop (pn, c1)
+ // op2 = binop (pn2, c2)
+ // rdx = binop (op1, op2)
+ // Into:
+ // pn = phi [binop (init1, init2); rdx]
+ // rdx = binop (pn, binop (c1, c2))
+
+ // Attempt to fold the constants.
+ auto *Init = llvm::ConstantFoldBinaryInstruction(Opc, Init1, Init2);
+ auto *C = llvm::ConstantFoldBinaryInstruction(Opc, C1, C2);
+ if (!Init || !C)
+ return nullptr;
+
+ LLVM_DEBUG(dbgs() << " Combining " << PN << "\n " << *BO1
+ << "\n with " << *PN2 << "\n " << *BO2
+ << '\n');
+ ++NumPHIsInterleaved;
+
+ // Create the new PHI.
+ auto *NewPN = PHINode::Create(PN.getType(), PN.getNumIncomingValues());
+
+ // Create the new binary op.
+ auto *NewOp = BinaryOperator::Create(Opc, NewPN, C);
+ if (Opc == Instruction::FAdd || Opc == Instruction::FMul) {
+ // Intersect FMF flags for FADD and FMUL.
+ FastMathFlags Intersect = BO1->getFastMathFlags() &
+ BO2->getFastMathFlags() & Rdx->getFastMathFlags();
+ NewOp->setFastMathFlags(Intersect);
+ } else {
+ OverflowTracking Flags;
+ Flags.AllKnownNonNegative = false;
+ Flags.AllKnownNonZero = false;
+ Flags.mergeFlags(*BO1);
+ Flags.mergeFlags(*BO2);
+ Flags.mergeFlags(*Rdx);
+ Flags.applyFlags(*NewOp);
+ }
+ InsertNewInstWith(NewOp, BO1->getIterator());
+ replaceInstUsesWith(*Rdx, NewOp);
+
+ for (unsigned I = 0, E = PN.getNumIncomingValues(); I != E; ++I) {
+ auto *V = PN.getIncomingValue(I);
+ auto *BB = PN.getIncomingBlock(I);
+ if (V == Init1) {
+ assert(((PN2->getIncomingValue(0) == Init2 &&
+ PN2->getIncomingBlock(0) == BB) ||
+ (PN2->getIncomingValue(1) == Init2 &&
+ PN2->getIncomingBlock(1) == BB)) &&
+ "Invalid incoming block!");
+ NewPN->addIncoming(Init, BB);
+ } else if (V == BO1) {
+ assert(((PN2->getIncomingValue(0) == BO2 &&
+ PN2->getIncomingBlock(0) == BB) ||
+ (PN2->getIncomingValue(1) == BO2 &&
+ PN2->getIncomingBlock(1) == BB)) &&
+ "Invalid incoming block!");
+ NewPN->addIncoming(NewOp, BB);
+ } else
+ llvm_unreachable("Unexpected incoming value!");
+ }
+
+ // Remove dead instructions. BO1/2 are replaced with poison to clean up their
+ // uses.
+ eraseInstFromFunction(*Rdx);
+ eraseInstFromFunction(*replaceInstUsesWith(*BO1, BO1));
+ eraseInstFromFunction(*replaceInstUsesWith(*BO2, BO2));
+ eraseInstFromFunction(*PN2);
+
+ return NewPN;
+}
+
/// Return true if this phi node is always equal to NonPhiInVal.
/// This happens with mutually cyclic phi nodes like:
/// z = some value; x = phi (y, z); y = phi (x, z)
@@ -1448,6 +1608,10 @@ Instruction *InstCombinerImpl::visitPHINode(PHINode &PN) {
if (Instruction *Result = foldPHIArgOpIntoPHI(PN))
return Result;
+ // Try to fold interleaved PHI reductions to a single PHI.
+ if (Instruction *Result = foldPHIReduction(PN))
+ return Result;
+
// If the incoming values are pointer casts of the same original value,
// replace the phi with a single cast iff we can insert a non-PHI instruction.
if (PN.getType()->isPointerTy() &&
diff --git a/llvm/test/Transforms/InstCombine/phi-reduction-chain.ll b/llvm/test/Transforms/InstCombine/phi-reduction-chain.ll
new file mode 100644
index 0000000000000..26d3cd56f158f
--- /dev/null
+++ b/llvm/test/Transforms/InstCombine/phi-reduction-chain.ll
@@ -0,0 +1,1408 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -S -passes=instcombine < %s | FileCheck %s
+
+; Reassociate add.
+define i8 @add_reassoc(i32 %n) {
+; CHECK-LABEL: define i8 @add_reassoc(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT: [[ENTRY:.*]]:
+; CHECK-NEXT: br label %[[BODY:.*]]
+; CHECK: [[BODY]]:
+; CHECK-NEXT: [[I:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[I_NEXT:%.*]], %[[BODY]] ]
+; CHECK-NEXT: [[PN:%.*]] = phi i8 [ 1, %[[ENTRY]] ], [ [[RDX:%.*]], %[[BODY]] ]
+; CHECK-NEXT: [[RDX]] = add i8 [[PN]], 5
+; CHECK-NEXT: [[I_NEXT]] = add nuw nsw i32 [[I]], 1
+; CHECK-NEXT: [[CMP:%.*]] = icmp eq i32 [[I_NEXT]], [[N]]
+; CHECK-NEXT: br i1 [[CMP]], label %[[EXIT:.*]], label %[[BODY]]
+; CHECK: [[EXIT]]:
+; CHECK-NEXT: ret i8 [[RDX]]
+;
+entry:
+ br label %body
+
+body:
+ %i = phi i32 [ 0, %entry ], [ %i.next, %body ]
+ %pn = phi i8 [ 0, %entry ], [ %op1, %body ]
+ %pn2 = phi i8 [ 1, %entry ], [ %op2, %body ]
+ %op1 = add i8 %pn, 2
+ %op2 = add i8 %pn2, 3
+ %i.next = add nuw nsw i32 %i, 1
+ %cmp = icmp eq i32 %i.next, %n
+ br i1 %cmp, label %exit, label %body
+
+exit:
+ %rdx = add i8 %op2, %op1
+ ret i8 %rdx
+}
+
+; Reassociate add, and maintain nuw if all ops have it.
+define i8 @add_nuw(i32 %n) {
+; CHECK-LABEL: define i8 @add_nuw(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT: [[ENTRY:.*]]:
+; CHECK-NEXT: br label %[[BODY:.*]]
+; CHECK: [[BODY]]:
+; CHECK-NEXT: [[I:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[I_NEXT:%.*]], %[[BODY]] ]
+; CHECK-NEXT: [[PN:%.*]] = phi i8 [ 1, %[[ENTRY]] ], [ [[RDX:%.*]], %[[BODY]] ]
+; CHECK-NEXT: [[RDX]] = add nuw i8 [[PN]], 5
+; CHECK-NEXT: [[I_NEXT]] = add nuw nsw i32 [[I]], 1
+; CHECK-NEXT: [[CMP:%.*]] = icmp eq i32 [[I_NEXT]], [[N]]
+; CHECK-NEXT: br i1 [[CMP]], label %[[EXIT:.*]], label %[[BODY]]
+; CHECK: [[EXIT]]:
+; CHECK-NEXT: ret i8 [[RDX]]
+;
+entry:
+ br label %body
+
+body:
+ %i = phi i32 [ 0, %entry ], [ %i.next, %body ]
+ %pn = phi i8 [ 0, %entry ], [ %op1, %body ]
+ %pn2 = phi i8 [ 1, %entry ], [ %op2, %body ]
+ %op1 = add nuw i8 %pn, 2
+ %op2 = add nuw i8 %pn2, 3
+ %i.next = add nuw nsw i32 %i, 1
+ %cmp = icmp eq i32 %i.next, %n
+ br i1 %cmp, label %exit, label %body
+
+exit:
+ %rdx = add nuw i8 %op2, %op1
+ ret i8 %rdx
+}
+
+; Reassociate add, drop nuw if op1 doesn't have it.
+define i8 @add_op1_no_nuw(i32 %n) {
+; CHECK-LABEL: define i8 @add_op1_no_nuw(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT: [[ENTRY:.*]]:
+; CHECK-NEXT: br label %[[BODY:.*]]
+; CHECK: [[BODY]]:
+; CHECK-NEXT: [[I:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[I_NEXT:%.*]], %[[BODY]] ]
+; CHECK-NEXT: [[PN:%.*]] = phi i8 [ 1, %[[ENTRY]] ], [ [[RDX:%.*]], %[[BODY]] ]
+; CHECK-NEXT: [[RDX]] = add i8 [[PN]], 5
+; CHECK-NEXT: [[I_NEXT]] = add nuw nsw i32 [[I]], 1
+; CHECK-NEXT: [[CMP:%.*]] = icmp eq i32 [[I_NEXT]], [[N]]
+; CHECK-NEXT: br i1 [[CMP]], label %[[EXIT:.*]], label %[[BODY]]
+; CHECK: [[EXIT]]:
+; CHECK-NEXT: ret i8 [[RDX]]
+;
+entry:
+ br label %body
+
+body:
+ %i = phi i32 [ 0, %entry ], [ %i.next, %body ]
+ %pn = phi i8 [ 0, %entry ], [ %op1, %body ]
+ %pn2 = phi i8 [ 1, %entry ], [ %op2, %body ]
+ %op1 = add i8 %pn, 2
+ %op2 = add nuw i8 %pn2, 3
+ %i.next = add nuw nsw i32 %i, 1
+ %cmp = icmp eq i32 %i.next, %n
+ br i1 %cmp, label %exit, label %body
+
+exit:
+ %rdx = add nuw i8 %op2, %op1
+ ret i8 %rdx
+}
+
+; Reassociate add, drop nuw if op2 doesn't have it.
+define i8 @add_op2_no_nuw(i32 %n) {
+; CHECK-LABEL: define i8 @add_op2_no_nuw(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT: [[ENTRY:.*]]:
+; CHECK-NEXT: br label %[[BODY:.*]]
+; CHECK: [[BODY]]:
+; CHECK-NEXT: [[I:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[I_NEXT:%.*]], %[[BODY]] ]
+; CHECK-NEXT: [[PN:%.*]] = phi i8 [ 1, %[[ENTRY]] ], [ [[RDX:%.*]], %[[BODY]] ]
+; CHECK-NEXT: [[RDX]] = add i8 [[PN]], 5
+; CHECK-NEXT: [[I_NEXT]] = add nuw nsw i32 [[I]], 1
+; CHECK-NEXT: [[CMP:%.*]] = icmp eq i32 [[I_NEXT]], [[N]]
+; CHECK-NEXT: br i1 [[CMP]], label %[[EXIT:.*]], label %[[BODY]]
+; CHECK: [[EXIT]]:
+; CHECK-NEXT: ret i8 [[RDX]]
+;
+entry:
+ br label %body
+
+body:
+ %i = phi i32 [ 0, %entry ], [ %i.next, %body ]
+ %pn = phi i8 [ 0, %entry ], [ %op1, %body ]
+ %pn2 = phi i8 [ 1, %entry ], [ %op2, %body ]
+ %op1 = add nuw i8 %pn, 2
+ %op2 = add i8 %pn2, 3
+ %i.next = add nuw nsw i32 %i, 1
+ %cmp = icmp eq i32 %i.next, %n
+ br i1 %cmp, label %exit, label %body
+
+exit:
+ %rdx = add nuw i8 %op2, %op1
+ ret i8 %rdx
+}
+
+; Reassociate add, drop nuw if rdx doesn't have it.
+define i8 @add_rdx_no_nuw(i32 %n) {
+; CHECK-LABEL: define i8 @add_rdx_no_nuw(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT: [[ENTRY:.*]]:
+; CHECK-NEXT: br label %[[BODY:.*]]
+; CHECK: [[BODY]]:
+; CHECK-NEXT: [[I:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[I_NEXT:%.*]], %[[BODY]] ]
+; CHECK-NEXT: [[PN:%.*]] = phi i8 [ 1, %[[ENTRY]] ], [ [[RDX:%.*]], %[[BODY]] ]
+; CHECK-NEXT: [[RDX]] = add i8 [[PN]], 5
+; CHECK-NEXT: [[I_NEXT]] = add nuw nsw i32 [[I]], 1
+; CHECK-NEXT: [[CMP:%.*]] = icmp eq i32 [[I_NEXT]], [[N]]
+; CHECK-NEXT: br i1 [[CMP]], label %[[EXIT:.*]], label %[[BODY]]
+; CHECK: [[EXIT]]:
+; CHECK-NEXT: ret i8 [[RDX]]
+;
+entry:
+ br label %body
+
+body:
+ %i = phi i32 [ 0, %entry ], [ %i.next, %body ]
+ %pn = phi i8 [ 0, %entry ], [ %op1, %body ]
+ %pn2 = phi i8 [ 1, %entry ], [ %op2, %body ]
+ %op1 = add nuw i8 %pn, 2
+ %op2 = add nuw i8 %pn2, 3
+ %i.next = add nuw nsw i32 %i, 1
+ %cmp = icmp eq i32 %i.next, %n
+ br i1 %cmp, label %exit, label %body
+
+exit:
+ %rdx = add i8 %op2, %op1
+ ret i8 %rdx
+}
+
+; Reassociate add, drop nsw even if all ops have it.
+define i8 @add_no_nsw(i32 %n) {
+; CHECK-LABEL: define i8 @add_no_nsw(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT: [[ENTRY:.*]]:
+; CHECK-NEXT: br label %[[BODY:.*]]
+; CHECK: [[BODY]]:
+; CHECK-NEXT: [[I:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[I_NEXT:%.*]], %[[BODY]] ]
+; CHECK-NEXT: [[PN:%.*]] = phi i8 [ 1, %[[ENTRY]] ], [ [[RDX:%.*]], %[[BODY]] ]
+; CHECK-NEXT: [[RDX]] = add i8 [[PN]], 5
+; CHECK-NEXT: [[I_NEXT]] = add nuw nsw i32 [[I]], 1
+; CHECK-NEXT: [[CMP:%.*]] = icmp eq i32 [[I_NEXT]], [[N]]
+; CHECK-NEXT: br i1 [[CMP]], label %[[EXIT:.*]], label %[[BODY]]
+; CHECK: [[EXIT]]:
+; CHECK-NEXT: ret i8 [[RDX]]
+;
+entry:
+ br label %body
+
+body:
+ %i = phi i32 [ 0, %entry ], [ %i.next, %body ]
+ %pn = phi i8 [ 0, %entry ], [ %op1, %body ]
+ %pn2 = phi i8 [ 1, %entry ], [ %op2, %body ]
+ %op1 = add nsw i8 %pn, 2
+ %op2 = add nsw i8 %pn2, 3
+ %i.next = add nuw nsw i32 %i, 1
+ %cmp = icmp eq i32 %i.next, %n
+ br i1 %cmp, label %exit, label %body
+
+exit:
+ %rdx = add nsw i8 %op2, %op1
+ ret i8 %rdx
+}
+
+; Reassociate add, keep nuw/nsw if all ops have them.
+define i8 @add_nuw_nsw(i32 %n) {
+; CHECK-LABEL: define i8 @add_nuw_nsw(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT: [[ENTRY:.*]]:
+; CHECK-NEXT: br label %[[BODY:.*]]
+; CHECK: [[BODY]]:
+; CHECK-NEXT: [[I:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[I_NEXT:%.*]], %[[BODY]] ]
+; CHECK-NEXT: [[PN:%.*]] = phi i8 [ 1, %[[ENTRY]] ], [ [[RDX:%.*]], %[[BODY]] ]
+; CHECK-NEXT: [[RDX]] = add nuw nsw i8 [[PN]], 5
+; CHECK-NEXT: [[I_NEXT]] = add nuw nsw i32 [[I]], 1
+; CHECK-NEXT: [[CMP:%.*]] = icmp eq i32 [[I_NEXT]], [[N]]
+; CHECK-NEXT: br i1 [[CMP]], label %[[EXIT:.*]], label %[[BODY]]
+; CHECK: [[EXIT]]:
+; CHECK-NEXT: ret i8 [[RDX]]
+;
+entry:
+ br label %body
+
+body:
+ %i = phi i32 [ 0, %entry ], [ %i.next, %body ]
+ %pn = phi i8 [ 0, %entry ], [ %op1, %body ]
+ %pn2 = phi i8 [ 1, %entry ], [ %op2, %body ]
+ %op1 = add nuw nsw i8 %pn, 2
+ %op2 = add nuw nsw i8 %pn2, 3
+ %i.next = add nuw nsw i32 %i, 1
+ %cmp = icmp eq i32 %i.next, %n
+ br i1 %cmp, label %exit, label %body
+
+exit:
+ %rdx = add nuw nsw i8 %op2, %op1
+ ret i8 %rdx
+}
+
+; Reassociate fixed-length vector operands.
+define <16 x i8> @add_v16i8(i32 %n) {
+; CHECK-LABEL: define <16 x i8> @add_v16i8(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT: [[ENTRY:.*]]:
+; CHECK-NEXT: br label %[[BODY:.*]]
+; CHECK: [[BODY]]:
+; CHECK-NEXT: [[I:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[I_NEXT:%.*]], %[[BODY]] ]
+; CHECK-NEXT: [[PN:%.*]] = phi <16 x i8> [ splat (i8 1), %[[ENTRY]] ], [ [[RDX:%.*]], %[[BODY]] ]
+; CHECK-NEXT: [[RDX]] = add <16 x i8> [[PN]], splat (i8 5)
+; CHECK-NEXT: [[I_NEXT]] = add nuw nsw i32 [[I]], 1
+; CHECK-NEXT: [[CMP:%.*]] = icmp eq i32 [[I_NEXT]], [[N]]
+; CHECK-NEXT: br i1 [[CMP]], label %[[EXIT:.*]], label %[[BODY]]
+; CHECK: [[EXIT]]:
+; CHECK-NEXT: ret <16 x i8> [[RDX]]
+;
+entry:
+ br label %body
+
+body:
+ %i = phi i32 [ 0, %entry ], [ %i.next, %body ]
+ %pn = phi <16 x i8> [ splat (i8 0), %entry ], [ %op1, %body ]
+ %pn2 = phi <16 x i8> [ splat (i8 1), %entry ], [ %op2, %body ]
+ %op1 = add <16 x i8> %pn, splat (i8 2)
+ %op2 = add <16 x i8> %pn2, splat (i8 3)
+ %i.next = add nuw nsw i32 %i, 1
+ %cmp = icmp eq i32 %i.next, %n
+ br i1 %cmp, label %exit, label %body
+
+exit:
+ %rdx = add <16 x i8> %op2, %op1
+ ret <16 x i8> %rdx
+}
+
+; Reassociate scalable vector operands.
+define <vscale x 16 x i8> @add_nxv16i8(i32 %n) {
+; CHECK-LABEL: define <vscale x 16 x i8> @add_nxv16i8(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT: [[ENTRY:.*]]:
+; CHECK-NEXT: br label %[[BODY:.*]]
+; CHECK: [[BODY]]:
+; CHECK-NEXT: [[I:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[I_NEXT:%.*]], %[[BODY]] ]
+; CHECK-NEXT: [[PN:%.*]] = phi <vscale x 16 x i8> [ splat (i8 1), %[[ENTRY]] ], [ [[RDX:%.*]], %[[BODY]] ]
+; CHECK-NEXT: [[RDX]] = add <vscale x 16 x i8> [[PN]], splat (i8 5)
+; CHECK-NEXT: [[I_NEXT]] = add nuw nsw i32 [[I]], 1
+; CHECK-NEXT: [[CMP:%.*]] = icmp eq i32 [[I_NEXT]], [[N]]
+; CHECK-NEXT: br i1 [[CMP]], label %[[EXIT:.*]], label %[[BODY]]
+; CHECK: [[EXIT]]:
+; CHECK-NEXT: ret <vscale x 16 x i8> [[RDX]]
+;
+entry:
+ br label %body
+
+body:
+ %i = phi i32 [ 0, %entry ], [ %i.next, %body ]
+ %pn = phi <vscale x 16 x i8> [ splat (i8 0), %entry ], [ %op1, %body ]
+ %pn2 = phi <vscale x 16 x i8> [ splat (i8 1), %entry ], [ %op2, %body ]
+ %op1 = add <vscale x 16 x i8> %pn, splat (i8 2)
+ %op2 = add <vscale x 16 x i8> %pn2, splat (i8 3)
+ %i.next = add nuw nsw i32 %i, 1
+ %cmp = icmp eq i32 %i.next, %n
+ br i1 %cmp, label %exit, label %body
+
+exit:
+ %rdx = add <vscale x 16 x i8> %op2, %op1
+ ret <vscale x 16 x i8> %rdx
+}
+
+; Check other opcodes.
+
+; Reassociate mul.
+define i8 @mul_reassoc(i32 %n) {
+; CHECK-LABEL: define i8 @mul_reassoc(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT: [[ENTRY:.*]]:
+; CHECK-NEXT: br label %[[BODY:.*]]
+; CHECK: [[BODY]]:
+; CHECK-NEXT: [[I:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[I_NEXT:%.*]], %[[BODY]] ]
+; CHECK-NEXT: [[PN:%.*]] = phi i8 [ 2, %[[ENTRY]] ], [ [[RDX:%.*]], %[[BODY]] ]
+; CHECK-NEXT: [[RDX]] = mul i8 [[PN]], 12
+; CHECK-NEXT: [[I_NEXT]] = add nuw nsw i32 [[I]], 1
+; CHECK-NEXT: [[CMP:%.*]] = icmp eq i32 [[I_NEXT]], [[N]]
+; CHECK-NEXT: br i1 [[CMP]], label %[[EXIT:.*]], label %[[BODY]]
+; CHECK: [[EXIT]]:
+; CHECK-NEXT: ret i8 [[RDX]]
+;
+entry:
+ br label %body
+
+body:
+ %i = phi i32 [ 0, %entry ], [ %i.next, %body ]
+ %pn = phi i8 [ 1, %entry ], [ %op1, %body ]
+ %pn2 = phi i8 [ 2, %entry ], [ %op2, %body ]
+ %op1 = mul i8 %pn, 3
+ %op2 = mul i8 %pn2, 4
+ %i.next = add nuw nsw i32 %i, 1
+ %cmp = icmp eq i32 %i.next, %n
+ br i1 %cmp, label %exit, label %body
+
+exit:
+ %rdx = mul i8 %op2, %op1
+ ret i8 %rdx
+}
+
+; Reassociate mul, don't expect any flags to be propagated.
+define i8 @mul_reassoc_no_nuw_no_nsw(i32 %n) {
+; CHECK-LABEL: define i8 @mul_reassoc_no_nuw_no_nsw(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT: [[ENTRY:.*]]:
+; CHECK-NEXT: br label %[[BODY:.*]]
+; CHECK: [[BO...
[truncated]
Note: I'm not sure how this can be tested, but llvm#143878 is affected by this change.
61f06ce to 61ef013
eraseInstFromFunction(*replaceInstUsesWith(*BO1, BO1));
eraseInstFromFunction(*replaceInstUsesWith(*BO2, BO2));

Suggested change:

eraseInstFromFunction(*replaceInstUsesWith(*BO1, PoisonValue::get(BO1->getType())));
eraseInstFromFunction(*replaceInstUsesWith(*BO2, PoisonValue::get(BO2->getType())));

Thanks, done.
I guess it is intended to fulfill the pipeline? Imagine the CPU has multiple ports/pipelines executing the same kind of instructions (load/fadd/fmul).

Generally I agree, but in the cases this patch tries to capture, we don't have loads/stores or other such ops, and the binary ops we have can be collapsed to one. In these cases, I can't see how executing the multiple ops interleaved can be beneficial. For example, consider:

```llvm
%pn1 = phi [1.0, %BB1], [%op1, %BB2]
%pn2 = phi [1.0, %BB1], [%op2, %BB2]
%op1 = fmul %pn1, 0.9
%op2 = fmul %pn2, 0.9
%res = fmul %op1, %op2
```

Which can be folded to:

```llvm
%pn = phi [1.0, %BB1], [%res, %BB2]
%res = fmul %pn, 0.81
```

Assuming the constants can be materialised similarly, the second version requires strictly fewer instructions to effect the same computations. Do you see what I mean?
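(If it helps, here is a self-contained version of the snippet above that can be fed to `opt -passes=instcombine` directly. It is only an illustration: the function name, the loop framing, and the exactly representable constants 0.5/0.25 are mine rather than part of the patch, and the reassoc/nsz flags are there because the fold requires the fmuls to be associative.)

```llvm
; Two interleaved fmul recurrences; the new fold should collapse them into a
; single recurrence that multiplies by 0.25 (= 0.5 * 0.5) per iteration.
define float @interleaved_product(i32 %n) {
entry:
  br label %loop

loop:
  %i   = phi i32   [ 0, %entry ], [ %i.next, %loop ]
  %pn1 = phi float [ 1.0, %entry ], [ %op1, %loop ]
  %pn2 = phi float [ 1.0, %entry ], [ %op2, %loop ]
  %op1 = fmul reassoc nsz float %pn1, 0.5
  %op2 = fmul reassoc nsz float %pn2, 0.5
  %i.next = add nuw nsw i32 %i, 1
  %cmp = icmp eq i32 %i.next, %n
  br i1 %cmp, label %exit, label %loop

exit:
  %res = fmul reassoc nsz float %op1, %op2
  ret float %res
}
```

This mirrors the shape of the add/mul tests in phi-reduction-chain.ll, just with fmul.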
I know. Obviously the combined version is faster. I just wonder if we can avoid introducing this pattern in LoopVectorizer by making some adjustments to cost modeling...
Thanks for the review, @antoniofrighetto - I believe I should have addressed everything. Please let me know if you have any other comments. :)
@dtcxzyw I see, thanks, I think that should also work. I think if we go down that route, we just need to be careful with loops such as this where interleaving is beneficial, since these loops are currently not optimised by indvars or other such passes (as far as I'm aware). So, in these cases, interleaving and folding away the reduction chains separately seemed to be the most straightforward approach.
LGTM. Please wait for additional approval from other reviewers.
Thanks very much, I will do. :)
/// %phi2 = phi [init2, %BB1], [%op2, %BB2]
/// %op1 = binop %phi1, constant1
/// %op2 = binop %phi2, constant2
/// %rdx = binop %op1, %op2
Why does the pattern start matching at one of the phi nodes rather than at %rdx? This is pretty unusual for InstCombine, and it's not immediately obvious to me why it is necessary.
We have a similar transform in InstCombinerImpl::foldBinopWithPhiOperands. But it looks hard to refactor the code :(
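(For readers who have not seen that helper: as far as I understand it, it matches a binop whose two operands are single-use phis in the same block, where one predecessor feeds both phis constants; the binop is then recreated in the other predecessor and the constant pair is folded. The sketch below is illustrative only and glosses over the exact preconditions.)

```llvm
; Hypothetical example of the rough shape foldBinopWithPhiOperands targets.
define i32 @binop_of_phis(i1 %c, i32 %x, i32 %y) {
entry:
  br i1 %c, label %a, label %b

a:                                      ; non-constant values arrive from here
  br label %join

b:                                      ; both phis receive constants from here
  br label %join

join:
  %p0 = phi i32 [ %x, %a ], [ 40, %b ]
  %p1 = phi i32 [ %y, %a ], [ 2, %b ]
  %r = add i32 %p0, %p1                 ; expected: an add of %x, %y hoisted into %a,
  ret i32 %r                            ; plus a phi of that add and the folded 42
}
```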
Hi @nikic, @dtcxzyw, I apologise for the delayed response. I was out of the office for most of last week.

> Why does the pattern start matching at one of the phi nodes rather than at %rdx?

Given that I was specifically looking for sequences of interleaved recurrences, it seemed that starting at one of the phi nodes and working down from there would potentially allow bailing out sooner than if the match had started from %rdx. But I'm not precious about it; I'm happy to start the match from %rdx and/or move the pattern elsewhere if that's more appropriate.

Would it be preferable to move the pattern to a method similar to foldBinopWithPhiOperands and start the match from %rdx? Or do you have something else in mind? :)