[X86,SimplifyCFG] Support hoisting load/store with conditional faulting (Part I) #96878

Merged: 19 commits, Aug 29, 2024

Conversation

KanRobert
Contributor

@KanRobert KanRobert commented Jun 27, 2024

This is simplifycfg part of #95515

In this PR, we support hoisting load/store with conditional faulting in
SimplifyCFGOpt::speculativelyExecuteBB to eliminate conditional branches.
This is for cases like

void test (int a, int *b) {
  if (a)
   *b = a;
}
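Concretely, the conditional store above can be expressed with a single-element masked store predicated on the branch condition, so the `then` block disappears. A minimal hand-written sketch (not the exact IR the pass emits; the function and value names are illustrative):

```llvm
define void @test(i32 %a, ptr %b) {
entry:
  %cond = icmp ne i32 %a, 0
  ; Predicate the store on %cond via a <1 x i1> mask instead of branching.
  %mask = bitcast i1 %cond to <1 x i1>
  %val = bitcast i32 %a to <1 x i32>
  call void @llvm.masked.store.v1i32.p0(<1 x i32> %val, ptr %b, i32 4, <1 x i1> %mask)
  ret void
}

declare void @llvm.masked.store.v1i32.p0(<1 x i32>, ptr, i32 immarg, <1 x i1>)
```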

In follow-up patches, we will support hoisting in
SimplifyCFGOpt::hoistCommonCodeFromSuccessors.
That is for cases like

void test (int a, int *c, int *d) {
  if (a)
   *c = a;
  else 
   *d = a;
}

@llvmbot
Member

llvmbot commented Jun 27, 2024

@llvm/pr-subscribers-backend-powerpc
@llvm/pr-subscribers-backend-amdgpu
@llvm/pr-subscribers-backend-aarch64
@llvm/pr-subscribers-backend-loongarch
@llvm/pr-subscribers-backend-arm
@llvm/pr-subscribers-backend-x86

@llvm/pr-subscribers-llvm-transforms

Author: Shengchen Kan (KanRobert)

Changes

Patch is 26.86 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/96878.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/Utils/SimplifyCFG.cpp (+216-5)
  • (added) llvm/test/Transforms/SimplifyCFG/X86/hoist-load-store-with-cf.ll (+460)
diff --git a/llvm/lib/Transforms/Utils/SimplifyCFG.cpp b/llvm/lib/Transforms/Utils/SimplifyCFG.cpp
index c52c4dc0b8a51..558fafd5a2652 100644
--- a/llvm/lib/Transforms/Utils/SimplifyCFG.cpp
+++ b/llvm/lib/Transforms/Utils/SimplifyCFG.cpp
@@ -131,6 +131,12 @@ static cl::opt<bool> HoistCondStores(
     "simplifycfg-hoist-cond-stores", cl::Hidden, cl::init(true),
     cl::desc("Hoist conditional stores if an unconditional store precedes"));
 
+static cl::opt<bool> HoistLoadsStoresWithCondFaulting(
+    "simplifycfg-hoist-loads-stores-with-cond-faulting", cl::Hidden,
+    cl::init(true),
+    cl::desc("Hoist loads/stores if the target supports "
+             "conditional faulting"));
+
 static cl::opt<bool> MergeCondStores(
     "simplifycfg-merge-cond-stores", cl::Hidden, cl::init(true),
     cl::desc("Hoist conditional stores even if an unconditional store does not "
@@ -275,6 +281,7 @@ class SimplifyCFGOpt {
   bool hoistSuccIdenticalTerminatorToSwitchOrIf(
       Instruction *TI, Instruction *I1,
       SmallVectorImpl<Instruction *> &OtherSuccTIs);
+  bool hoistLoadStoreWithCondFaultingFromSuccessors(BasicBlock *BB);
   bool SpeculativelyExecuteBB(BranchInst *BI, BasicBlock *ThenBB);
   bool SimplifyTerminatorOnSelect(Instruction *OldTerm, Value *Cond,
                                   BasicBlock *TrueBB, BasicBlock *FalseBB,
@@ -2960,6 +2967,199 @@ static bool validateAndCostRequiredSelects(BasicBlock *BB, BasicBlock *ThenBB,
   return HaveRewritablePHIs;
 }
 
+/// Hoist load/store instructions from the conditional successor blocks up into
+/// the block.
+///
+/// We are looking for code like the following:
+/// \code
+///   BB:
+///     ...
+///     %cond = icmp ult %x, %y
+///     br i1 %cond, label %TrueBB, label %FalseBB
+///   FalseBB:
+///     store i32 1, ptr %q, align 4
+///     ...
+///   TrueBB:
+///     %0 = load i32, ptr %b, align 4
+///     store i32 %0, ptr %p, align 4
+///     ...
+/// \endcode
+///
+/// We are going to transform this into:
+///
+/// \code
+///   BB:
+///     ...
+///     %cond = icmp ult %x, %y
+///     %0 = cload i32, ptr %b, %cond
+///     cstore i32 %0, ptr %p, %cond
+///     cstore i32 1, ptr %q, ~%cond
+///     br i1 %cond, label %TrueBB, label %FalseBB
+///   FalseBB:
+///     ...
+///   TrueBB:
+///     ...
+/// \endcode
+///
+/// where cload/cstore is represented by intrinsic like llvm.masked.load/store,
+/// e.g.
+///
+/// \code
+///   %vcond = bitcast i1 %cond to <1 x i1>
+///   %v0 = call <1 x i32> @llvm.masked.load.v1i32.p0
+///                         (ptr %b, i32 4, <1 x i1> %vcond, <1 x i32> poison)
+///   %0 = bitcast <1 x i32> %v0 to i32
+///   call void @llvm.masked.store.v1i32.p0
+///                         (<1 x i32> %v0, ptr %p, i32 4, <1 x i1> %vcond)
+///   %cond.not = xor i1 %cond, true
+///   %vcond.not = bitcast i1 %cond.not to <1 x i1>
+///   call void @llvm.masked.store.v1i32.p0
+///              (<1 x i32> <i32 1>, ptr %q, i32 4, <1 x i1> %vcond.not)
+/// \endcode
+///
+/// \returns true if any load/store is hoisted.
+///
+/// Note that this transform should be run
+/// * before SpeculativelyExecuteBB so that the latter can have more chance.
+/// * after hoistCommonCodeFromSuccessors to ensure unconditional loads/stores
+///   are handled first.
+bool SimplifyCFGOpt::hoistLoadStoreWithCondFaultingFromSuccessors(
+    BasicBlock *BB) {
+  if (!HoistLoadsStoresWithCondFaulting ||
+      !TTI.hasConditionalLoadStoreForType())
+    return false;
+
+  auto *BI = dyn_cast<BranchInst>(BB->getTerminator());
+  if (!BI || !BI->isConditional())
+    return false;
+
+  BasicBlock *IfTrueBB = BI->getSuccessor(0);
+  BasicBlock *IfFalseBB = BI->getSuccessor(1);
+
+  // If either of the blocks has its address taken, then we can't do this fold,
+  // because the code we'd hoist would no longer run when we jump into the block
+  // by its address.
+  for (auto *Succ : {IfTrueBB, IfFalseBB})
+    if (Succ->hasAddressTaken())
+      return false;
+
+  // We do not use isa<AllocaInst>(getUnderlyingObject(I.getOperand(0))) here to
+  // avoid checking that all intermediate operands dominate the branch.
+  auto IsLoadFromAlloca = [](const Instruction &I) {
+    return isa<LoadInst>(I) && isa<AllocaInst>((I.getOperand(0)));
+  };
+
+  // Collect hoisted loads/stores.
+  SmallSetVector<Instruction *, 4> HoistedInsts;
+  // Do not hoist a load/store if
+  // 1. the target does not have a corresponding conditional faulting load/store.
+  // 2. it's volatile or atomic.
+  // 3. there is a load/store that cannot be hoisted in the same bb.
+  // 4. there is a non-load/store that's not safe to speculatively execute
+  //    in the same bb.
+  // 5. any operand of it does not dominate the branch.
+  // 6. it's a store and a memory read is skipped.
+  auto HoistInstsInBB = [&](BasicBlock *BB) {
+    bool SkipMemoryRead = false;
+    // A more efficient way to check domination. An operand dominates the
+    // BranchInst if
+    // 1. it's not defined in the same bb as the instruction, or
+    // 2. it's to be hoisted.
+    //
+    // This holds because BB is the only predecessor and the BranchInst does
+    // not define any value.
+    auto OpsDominatesBranch = [&](Instruction &I) {
+      return llvm::all_of(I.operands(), [&](Value *Op) {
+        if (auto *J = dyn_cast<Instruction>(Op)) {
+          if (HoistedInsts.contains(J))
+            return true;
+          if (J->getParent() == I.getParent())
+            return false;
+        }
+        return true;
+      });
+    };
+    for (auto &I : *BB) {
+      auto *LI = dyn_cast<LoadInst>(&I);
+      auto *SI = dyn_cast<StoreInst>(&I);
+      if (LI || SI) {
+        bool IsSimple = (LI && LI->isSimple()) || (SI && SI->isSimple());
+        if (!IsSimple || !OpsDominatesBranch(I))
+          return false;
+        auto *Type = LI ? I.getType() : I.getOperand(0)->getType();
+        // a load from alloca is always safe.
+        if (!IsLoadFromAlloca(I) && !TTI.hasConditionalLoadStoreForType(Type))
+          return false;
+        // Conservative aliasing check.
+        if (SI && SkipMemoryRead)
+          return false;
+        HoistedInsts.insert(&I);
+      } else if (!I.isTerminator() && !isSafeToSpeculativelyExecute(&I))
+        return false;
+      else if (I.mayReadFromMemory())
+        SkipMemoryRead = true;
+    }
+    return true;
+  };
+
+  if (!HoistInstsInBB(IfTrueBB) || !HoistInstsInBB(IfFalseBB) ||
+      HoistedInsts.empty())
+    return false;
+
+  // Put newly added instructions before the BranchInst.
+  IRBuilder<> Builder(BI);
+  auto &Context = BB->getContext();
+  auto *VCondTy = FixedVectorType::get(Type::getInt1Ty(Context), 1);
+  auto *Cond = BI->getOperand(0);
+  auto *VCond = Builder.CreateBitCast(Cond, VCondTy);
+  Value *VCondNot = nullptr;
+  for (auto *I : HoistedInsts) {
+    // A load from an alloca only needs to be moved; it stays unconditional.
+    if (IsLoadFromAlloca(*I)) {
+      I->moveBefore(BI);
+      continue;
+    }
+
+    bool InvertCond = I->getParent() == IfFalseBB;
+    // Construct the inverted condition if needed.
+    if (InvertCond && !VCondNot)
+      VCondNot = Builder.CreateBitCast(
+          Builder.CreateXor(Cond, ConstantInt::getTrue(Context)), VCondTy);
+
+    auto *Mask = InvertCond ? VCondNot : VCond;
+    auto *Op0 = I->getOperand(0);
+    if (auto *LI = dyn_cast<LoadInst>(I)) {
+      // Load
+      auto *Ty = I->getType();
+      // NOTE: Now we assume conditional faulting load/store is supported for
+      // scalar only when creating new instructions, but it's easy to extend it
+      // for vector types in the future.
+      assert(!Ty->isVectorTy() && "not implemented");
+      auto *V0 = Builder.CreateMaskedLoad(FixedVectorType::get(Ty, 1), Op0,
+                                          LI->getAlign(), Mask);
+      auto *S0 = Builder.CreateBitCast(V0, Ty);
+      V0->copyMetadata(*I);
+      I->replaceAllUsesWith(S0);
+    } else {
+      // Store
+      assert(!Op0->getType()->isVectorTy() && "not implemented");
+      auto *StoredVal =
+          Builder.CreateBitCast(Op0, FixedVectorType::get(Op0->getType(), 1));
+      auto *VStore = Builder.CreateMaskedStore(
+          StoredVal, I->getOperand(1), cast<StoreInst>(I)->getAlign(), Mask);
+      VStore->copyMetadata(*I);
+    }
+  }
+
+  // Erase the hoisted instructions in reverse order to avoid use-before-def
+  // errors.
+  std::for_each(HoistedInsts.rbegin(), HoistedInsts.rend(), [&](auto I) {
+    if (!IsLoadFromAlloca(*I))
+      I->eraseFromParent();
+  });
+
+  return true;
+}
+
 /// Speculate a conditional basic block flattening the CFG.
 ///
 /// Note that this is a very risky transform currently. Speculating
@@ -7420,31 +7620,42 @@ bool SimplifyCFGOpt::simplifyCondBranch(BranchInst *BI, IRBuilder<> &Builder) {
     return requestResimplify();
 
   // We have a conditional branch to two blocks that are only reachable
-  // from BI.  We know that the condbr dominates the two blocks, so see if
-  // there is any identical code in the "then" and "else" blocks.  If so, we
-  // can hoist it up to the branching block.
+  // from BI.  We know that the condbr dominates the two blocks, so see
+  //
+  // * if there is any identical code in the "then" and "else" blocks.
+  // * if there is any different load/store in the "then" and "else" blocks.
+  //
+  // If so, we can hoist it up to the branching block.
   if (BI->getSuccessor(0)->getSinglePredecessor()) {
     if (BI->getSuccessor(1)->getSinglePredecessor()) {
       if (HoistCommon && hoistCommonCodeFromSuccessors(
                              BI->getParent(), !Options.HoistCommonInsts))
         return requestResimplify();
+      if (hoistLoadStoreWithCondFaultingFromSuccessors(BI->getParent()))
+        return requestResimplify();
     } else {
       // If Successor #1 has multiple preds, we may be able to conditionally
       // execute Successor #0 if it branches to Successor #1.
       Instruction *Succ0TI = BI->getSuccessor(0)->getTerminator();
       if (Succ0TI->getNumSuccessors() == 1 &&
-          Succ0TI->getSuccessor(0) == BI->getSuccessor(1))
+          Succ0TI->getSuccessor(0) == BI->getSuccessor(1)) {
+        if (hoistLoadStoreWithCondFaultingFromSuccessors(BI->getParent()))
+          return requestResimplify();
         if (SpeculativelyExecuteBB(BI, BI->getSuccessor(0)))
           return requestResimplify();
+      }
     }
   } else if (BI->getSuccessor(1)->getSinglePredecessor()) {
     // If Successor #0 has multiple preds, we may be able to conditionally
     // execute Successor #1 if it branches to Successor #0.
     Instruction *Succ1TI = BI->getSuccessor(1)->getTerminator();
     if (Succ1TI->getNumSuccessors() == 1 &&
-        Succ1TI->getSuccessor(0) == BI->getSuccessor(0))
+        Succ1TI->getSuccessor(0) == BI->getSuccessor(0)) {
+      if (hoistLoadStoreWithCondFaultingFromSuccessors(BI->getParent()))
+        return requestResimplify();
       if (SpeculativelyExecuteBB(BI, BI->getSuccessor(1)))
         return requestResimplify();
+    }
   }
 
   // If this is a branch on something for which we know the constant value in
diff --git a/llvm/test/Transforms/SimplifyCFG/X86/hoist-load-store-with-cf.ll b/llvm/test/Transforms/SimplifyCFG/X86/hoist-load-store-with-cf.ll
new file mode 100644
index 0000000000000..2fd0055cf05f9
--- /dev/null
+++ b/llvm/test/Transforms/SimplifyCFG/X86/hoist-load-store-with-cf.ll
@@ -0,0 +1,460 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
+; RUN: opt < %s -mtriple=x86_64 -mattr=+cf -passes=simplifycfg -simplifycfg-require-and-preserve-domtree=1 -S -simplifycfg-hoist-loads-stores-with-cond-faulting=true | FileCheck %s
+
+;; The redundant bitcast/insertelement will be optimized out in the instcombine pass.
+define void @basic(i32 %a, ptr %b, ptr %p, ptr %q) {
+; CHECK-LABEL: @basic(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[COND:%.*]] = icmp eq i32 [[A:%.*]], 0
+; CHECK-NEXT:    [[TMP0:%.*]] = bitcast i1 [[COND]] to <1 x i1>
+; CHECK-NEXT:    [[TMP1:%.*]] = call <1 x i32> @llvm.masked.load.v1i32.p0(ptr [[B:%.*]], i32 4, <1 x i1> [[TMP0]], <1 x i32> poison), !dbg [[DBG8:![0-9]+]]
+; CHECK-NEXT:    [[TMP2:%.*]] = bitcast <1 x i32> [[TMP1]] to i32
+; CHECK-NEXT:    [[TMP3:%.*]] = bitcast i32 [[TMP2]] to <1 x i32>
+; CHECK-NEXT:    call void @llvm.masked.store.v1i32.p0(<1 x i32> [[TMP3]], ptr [[P:%.*]], i32 4, <1 x i1> [[TMP0]])
+; CHECK-NEXT:    [[TMP4:%.*]] = xor i1 [[COND]], true
+; CHECK-NEXT:    [[TMP5:%.*]] = bitcast i1 [[TMP4]] to <1 x i1>
+; CHECK-NEXT:    call void @llvm.masked.store.v1i64.p0(<1 x i64> <i64 1>, ptr [[P]], i32 8, <1 x i1> [[TMP5]]), !dbg [[DBG12:![0-9]+]]
+; CHECK-NEXT:    call void @llvm.masked.store.v1i16.p0(<1 x i16> <i16 2>, ptr [[Q:%.*]], i32 8, <1 x i1> [[TMP5]]), !dbg [[DBG12]]
+; CHECK-NEXT:    ret void
+;
+entry:
+  %cond = icmp eq i32 %a, 0
+  br i1 %cond, label %if.true, label %if.false
+
+if.false:
+  store i64 1, ptr %p, align 8, !dbg !8
+  store i16 2, ptr %q, align 8, !dbg !8
+  br label %if.end
+
+if.true:
+  %0 = load i32, ptr %b, align 4,  !dbg !9
+  store i32 %0, ptr %p, align 4
+  br label %if.end
+
+if.end:
+  ret void
+}
+
+;; simplifycfg is run before sroa. alloca here is not optimized away yet.
+define void @alloca(ptr %p, ptr %q, i32 %a) {
+; CHECK-LABEL: @alloca(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[P_ADDR:%.*]] = alloca ptr, align 8
+; CHECK-NEXT:    [[Q_ADDR:%.*]] = alloca ptr, align 8
+; CHECK-NEXT:    [[A_ADDR:%.*]] = alloca i32, align 4
+; CHECK-NEXT:    store ptr [[P:%.*]], ptr [[P_ADDR]], align 8
+; CHECK-NEXT:    store ptr [[Q:%.*]], ptr [[Q_ADDR]], align 8
+; CHECK-NEXT:    store i32 [[A:%.*]], ptr [[A_ADDR]], align 4
+; CHECK-NEXT:    [[TMP0:%.*]] = load i32, ptr [[A_ADDR]], align 4
+; CHECK-NEXT:    [[TOBOOL:%.*]] = icmp ne i32 [[TMP0]], 0
+; CHECK-NEXT:    [[TMP1:%.*]] = bitcast i1 [[TOBOOL]] to <1 x i1>
+; CHECK-NEXT:    [[TMP2:%.*]] = load ptr, ptr [[Q_ADDR]], align 8
+; CHECK-NEXT:    [[TMP3:%.*]] = call <1 x i32> @llvm.masked.load.v1i32.p0(ptr [[TMP2]], i32 4, <1 x i1> [[TMP1]], <1 x i32> poison)
+; CHECK-NEXT:    [[TMP4:%.*]] = bitcast <1 x i32> [[TMP3]] to i32
+; CHECK-NEXT:    [[TMP5:%.*]] = load ptr, ptr [[P_ADDR]], align 8
+; CHECK-NEXT:    [[TMP6:%.*]] = bitcast i32 [[TMP4]] to <1 x i32>
+; CHECK-NEXT:    call void @llvm.masked.store.v1i32.p0(<1 x i32> [[TMP6]], ptr [[TMP5]], i32 4, <1 x i1> [[TMP1]])
+; CHECK-NEXT:    ret void
+;
+entry:
+  %p.addr = alloca ptr
+  %q.addr = alloca ptr
+  %a.addr = alloca i32
+  store ptr %p, ptr %p.addr
+  store ptr %q, ptr %q.addr
+  store i32 %a, ptr %a.addr
+  %0 = load i32, ptr %a.addr
+  %tobool = icmp ne i32 %0, 0
+  br i1 %tobool, label %if.then, label %if.end
+
+if.then:
+  %1 = load ptr, ptr %q.addr
+  %2 = load i32, ptr %1
+  %3 = load ptr, ptr %p.addr
+  store i32 %2, ptr %3
+  br label %if.end
+
+if.end:
+  ret void
+}
+
+;; successor 1 branches to successor 0.
+define void @succ1to0(ptr %p, ptr %q, i32 %a) {
+; CHECK-LABEL: @succ1to0(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[TOBOOL:%.*]] = icmp ne i32 [[A:%.*]], 0
+; CHECK-NEXT:    [[TMP0:%.*]] = bitcast i1 [[TOBOOL]] to <1 x i1>
+; CHECK-NEXT:    [[TMP1:%.*]] = xor i1 [[TOBOOL]], true
+; CHECK-NEXT:    [[TMP2:%.*]] = bitcast i1 [[TMP1]] to <1 x i1>
+; CHECK-NEXT:    [[TMP3:%.*]] = call <1 x i32> @llvm.masked.load.v1i32.p0(ptr [[Q:%.*]], i32 4, <1 x i1> [[TMP2]], <1 x i32> poison)
+; CHECK-NEXT:    [[TMP4:%.*]] = bitcast <1 x i32> [[TMP3]] to i32
+; CHECK-NEXT:    [[TMP5:%.*]] = bitcast i32 [[TMP4]] to <1 x i32>
+; CHECK-NEXT:    call void @llvm.masked.store.v1i32.p0(<1 x i32> [[TMP5]], ptr [[P:%.*]], i32 4, <1 x i1> [[TMP2]])
+; CHECK-NEXT:    ret void
+;
+entry:
+  %tobool = icmp ne i32 %a, 0
+  br i1 %tobool, label %if.end, label %if.then
+
+if.end:
+  ret void
+
+if.then:
+  %0 = load i32, ptr %q
+  store i32 %0, ptr %p
+  br label %if.end
+}
+
+;; successor 0 branches to successor 1.
+define void @succ0to1(i32 %a, ptr %b, ptr %p, ptr %q) {
+; CHECK-LABEL: @succ0to1(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[COND:%.*]] = icmp eq i32 [[A:%.*]], 0
+; CHECK-NEXT:    [[TMP0:%.*]] = bitcast i1 [[COND]] to <1 x i1>
+; CHECK-NEXT:    [[TMP1:%.*]] = call <1 x i32> @llvm.masked.load.v1i32.p0(ptr [[B:%.*]], i32 4, <1 x i1> [[TMP0]], <1 x i32> poison)
+; CHECK-NEXT:    [[TMP2:%.*]] = bitcast <1 x i32> [[TMP1]] to i32
+; CHECK-NEXT:    [[TMP3:%.*]] = bitcast i32 [[TMP2]] to <1 x i32>
+; CHECK-NEXT:    call void @llvm.masked.store.v1i32.p0(<1 x i32> [[TMP3]], ptr [[P:%.*]], i32 4, <1 x i1> [[TMP0]])
+; CHECK-NEXT:    [[TMP4:%.*]] = xor i1 [[COND]], true
+; CHECK-NEXT:    [[TMP5:%.*]] = bitcast i1 [[TMP4]] to <1 x i1>
+; CHECK-NEXT:    call void @llvm.masked.store.v1i32.p0(<1 x i32> <i32 1>, ptr [[Q:%.*]], i32 4, <1 x i1> [[TMP5]])
+; CHECK-NEXT:    ret void
+;
+entry:
+  %cond = icmp eq i32 %a, 0
+  br i1 %cond, label %if.true, label %if.false
+
+if.false:
+  store i32 1, ptr %q
+  br label %if.end
+
+if.true:
+  %0 = load i32, ptr %b
+  store i32 %0, ptr %p
+  br label %if.false
+
+if.end:
+  ret void
+}
+
+;; load after store can be hoisted.
+define i64 @load_after_store(i32 %a, ptr %b, ptr %p, ptr %q) {
+; CHECK-LABEL: @load_after_store(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[COND:%.*]] = icmp eq i32 [[A:%.*]], 0
+; CHECK-NEXT:    [[TMP0:%.*]] = bitcast i1 [[COND]] to <1 x i1>
+; CHECK-NEXT:    call void @llvm.masked.store.v1i32.p0(<1 x i32> <i32 1>, ptr [[B:%.*]], i32 4, <1 x i1> [[TMP0]])
+; CHECK-NEXT:    [[TMP1:%.*]] = call <1 x i16> @llvm.masked.load.v1i16.p0(ptr [[P:%.*]], i32 2, <1 x i1> [[TMP0]], <1 x i16> poison)
+; CHECK-NEXT:    [[TMP2:%.*]] = bitcast <1 x i16> [[TMP1]] to i16
+; CHECK-NEXT:    [[TMP3:%.*]] = call <1 x i64> @llvm.masked.load.v1i64.p0(ptr [[Q:%.*]], i32 8, <1 x i1> [[TMP0]], <1 x i64> poison)
+; CHECK-NEXT:    [[TMP4:%.*]] = bitcast <1 x i64> [[TMP3]] to i64
+; CHECK-NEXT:    [[ZEXT:%.*]] = zext i16 [[TMP2]] to i64
+; CHECK-NEXT:    [[ADD:%.*]] = add i64 [[ZEXT]], [[TMP4]]
+; CHECK-NEXT:    [[COMMON_RET_OP:%.*]] = select i1 [[COND]], i64 [[ADD]], i64 0
+; CHECK-NEXT:    ret i64 [[COMMON_RET_OP]]
+;
+entry:
+  %cond = icmp eq i32 %a, 0
+  br i1 %cond, label %if.true, label %if.end
+
+if.true:
+  store i32 1, ptr %b
+  %0 = load i16, ptr %p
+  %1 = load i64, ptr %q
+  %zext = zext i16 %0 to i64
+  %add = add i64 %zext, %1
+  ret i64 %add
+
+if.end:
+  ret i64 0
+}
+
+define i32 @load_skip_speculatable_memory_read(i32 %a, ptr %b, ptr %p, ptr %q) {
+; CHECK-LABEL: @load_skip_speculatable_memory_read(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[COND:%.*]] = icmp eq i32 [[A:%.*]], 0
+; CHECK-NEXT:    [[TMP0:%.*]] = bitcast i1 [[COND]] to <1 x i1>
+; CHECK-NEXT:    [[TMP1:%.*]] = call <1 x i32> @llvm.masked.load.v1i32.p0(ptr [[B:%.*]], i32 4, <1 x i1> [[TMP0]], <1 x i32> poison)
+; CHECK-NEXT:    [[TMP2:%.*]] = bitcast <1 x i32> [[TMP1]] to i32
+; CHECK-NEXT:    [[TMP3:%.*]] = bitcast i32 [[TMP2]] to <1 x i32>
+; CHECK-NEXT:    call void @llvm.masked.store.v1i32.p0(<1 x i32> [[TMP3]], ptr [[P:%.*]], i32 4, <1 x i1> [[TMP0]])
+; CHECK-NEXT:    [[TMP4:%.*]] = xor i1 [[COND]], true
+; CHECK-NEXT:    [[TMP5:%.*]] = bitcast i1 [[TMP4]] to <1 x i1>
+; CHECK-NEXT:    [[TMP6:%.*]] = call <1 x i32> @llvm.masked.load.v1i32.p0(ptr [[Q:%.*]], i32 4, <1 x i1> [[TMP5]], <1 x i32> poison)
+; CHECK-NEXT:    [[TMP7:%.*]] = bitcast <1 x i32> [[TMP6]] to i32
+; CHECK-NEXT:    [[READ:%.*]] = call i32 @read_memory_only()
+; CHECK-NEXT:    [[PHI:%.*]] = select i1 [[COND]], i32 0, i32 [[READ]]
+; CHECK-NEXT:    ret i32 [[PHI]]
+;
+entry:
+  %cond = icmp eq i32 %a, 0
+  br i1 %cond, label %if.true, label %if.false
+
+if.false:
+  %read = call i32 @read_memory_only()
+  %0 = load i32, ptr %q
+  br label %if.end
+
+if.true:
+  %1 = load i32, ptr %b
+  store i32 %1, ptr %p
+  br label %if.end
+
+if.end:
+  %phi = phi i32 [%read, %if.false], [0, %if.true]
+  ret i32 %phi
+}
+
+; i8 is not supported by conditional faulting
+define void @not_supported_type(i8 %a, ptr %b, ptr %p, ptr %q) {
+; CHECK-LABEL: @not_supported_type(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[COND:%.*]] = icmp eq i8 [[A:%.*]], 0
+; CHECK-NEXT:    br i1 [[COND]], label [[IF_TRUE:%.*]], label [[IF_FALSE:%.*]]
+; CHECK:       if.false:
+; CHECK-NEXT:    store i8 1, ptr [[Q:%.*]], align 1
+; CHECK-NEXT:    br label [[IF_END:%.*]]
+; CHECK:       if.true:
+; CHECK-NEXT:    [[TMP0:%.*]] = load i8, ptr [[B:%.*]], align 1
+; CHECK-NEXT:    store i8 [[TMP0]], ptr [[P:%.*]], align 1
+; CHECK-NEXT:    br label [[IF_END]]
+; CHECK:       if.end:
+; CHECK-NEXT:    ret void
+;
+ent...
[truncated]


@cyyself
Contributor

cyyself commented Jun 27, 2024

How is the performance of hoisting loads/stores being benchmarked? I don't know the specific microarchitectural implementation, but using these conditional instructions to replace a very easy-to-predict branch may hurt performance. Is there a microarchitecture simulator or a real chip implementation we can use to benchmark?

@KanRobert
Contributor Author

KanRobert commented Jun 28, 2024

How is the performance of hoisting loads/stores being benchmarked? I don't know the specific microarchitectural implementation, but using these conditional instructions to replace a very easy-to-predict branch may hurt performance. Is there a microarchitecture simulator or a real chip implementation we can use to benchmark?

I can share the code that checks predictability

  // If the branch is non-unpredictable, and is predicted to *not* branch to
  // the `then` block, then avoid speculating it.
  if (!BI->getMetadata(LLVMContext::MD_unpredictable)) {
    uint64_t TWeight, FWeight;
    if (extractBranchWeights(*BI, TWeight, FWeight) &&
        (TWeight + FWeight) != 0) {
      uint64_t EndWeight = Invert ? TWeight : FWeight;
      BranchProbability BIEndProb =
          BranchProbability::getBranchProbability(EndWeight, TWeight + FWeight);
      BranchProbability Likely = TTI.getPredictableBranchThreshold();
      if (BIEndProb >= Likely)
        return false;
    }
  }

in SpeculativelyExecuteBB with the newly added transform.

We have an internal cycle-accurate performance simulator. The real chip is not public yet, and even if I had data, I couldn't comment on the performance of future hardware.


Contributor

@nikic nikic left a comment


I don't think we want to add an extra SimplifyCFG run in the backend. Can you enable the option in the last SimplifyCFG run in the middle-end instead (at the end of the module optimization pipeline)?

@KanRobert
Contributor Author

KanRobert commented Aug 13, 2024

I don't think we want to add an extra SimplifyCFG run in the backend. Can you enable the option in the last SimplifyCFG run in the middle-end instead (at the end of the module optimization pipeline)?

Do you mean here https://github.com/llvm/llvm-project/blob/main/llvm/lib/Passes/PassBuilderPipelines.cpp#L1531 ? @nikic

@nikic
Contributor

nikic commented Aug 13, 2024

I don't think we want to add an extra SimplifyCFG run in the backend. Can you enable the option in the last SimplifyCFG run in the middle-end instead (at the end of the module optimization pipeline)?

Do you mean here https://github.com/llvm/llvm-project/blob/main/llvm/lib/Passes/PassBuilderPipelines.cpp#L1531 ? @nikic

Yes

@KanRobert
Contributor Author

@nikic @dianqk Passed validation on CPU2017/2006, llvm-test-suite, and Geekbench after moving the transform into the middle end.

Member

@dianqk dianqk left a comment


SGTM. Sorry, I haven't been keeping up with this PR. Please wait for approval from others.

However, I'm wondering if these transformations could be moved to a different pass. I think this isn't something SimplifyCFG should be handling; there's already quite a bit of similar code here. Of course, it's fine with me to merge this PR. :)

Comment on lines 3318 to 3322
if (Invert)
Mask = Builder.CreateBitCast(
Builder.CreateXor(Cond, ConstantInt::getTrue(Context)), VCondTy);
else
Mask = Builder.CreateBitCast(Cond, VCondTy);

Suggested change
if (Invert)
Mask = Builder.CreateBitCast(
Builder.CreateXor(Cond, ConstantInt::getTrue(Context)), VCondTy);
else
Mask = Builder.CreateBitCast(Cond, VCondTy);
Mask = Builder.CreateBitCast(Invert ? Builder.CreateXor(Cond, ConstantInt::getTrue(Context)) : Cond, VCondTy);

Is it possible to avoid creating a null pointer?

Contributor Author

@KanRobert KanRobert Aug 15, 2024


Done.

The variable Mask is loop-invariant, and we definitely want to define it outside the loop below. The Mask instruction should only be created when the set of CLOAD/CSTORE candidates is non-empty, so we have to assign the value inside the if clause while declaring it outside the if clause.

Of course, we could leave Mask uninitialized at the declaration site without affecting correctness, because it is never used if the CLOAD/CSTORE optimization cannot be done. But I don't think that's worthwhile: it doesn't follow best practice, and the nullptr initialization should not add any real cost.

@KanRobert
Contributor Author

I think that this isn't something SimplifyCFG should be handling.

This is the complete opposite of what I thought. Can you explain the reasoning behind your suggestion?

@KanRobert
Contributor Author

@nikic @dtcxzyw Ping :)

@dianqk
Member

dianqk commented Aug 21, 2024

I think that this isn't something SimplifyCFG should be handling.

This is the complete opposite of what I thought. Can you explain the reasoning behind your suggestion?

Hoisting instructions doesn't change the CFG, so if all the instructions can be hoisted, we could split this into two passes. I think some of the CFG analysis results could also be retained.

@KanRobert
Contributor Author

I think that this isn't something SimplifyCFG should be handling.

This is the complete opposite of what I thought. Can you explain the reasoning behind your suggestion?

Hoisting instructions doesn't change the CFG, so if all the instructions can be hoisted, we could split this into two passes. I think some of the CFG analysis results could also be retained.

According to my understanding, both hoisting and speculating change the CFG once all instructions are processed; SimplifyCFGOpt::speculativelyExecuteBB and SimplifyCFGOpt::hoistCommonCodeFromSuccessors prove this.

This patch can be seen as an enhancement to SimplifyCFGOpt::speculativelyExecuteBB for targets that support conditional faulting loads/stores.

@KanRobert KanRobert requested review from nikic and dtcxzyw August 22, 2024 01:51
@efriedma-quic efriedma-quic removed their request for review August 22, 2024 22:30
@KanRobert
Contributor Author

@phoebewang @nikic @dtcxzyw Ping, more comments?

Member

@dtcxzyw dtcxzyw left a comment


LG

Contributor

@phoebewang phoebewang left a comment


LGTM.

@KanRobert KanRobert merged commit 87c86aa into llvm:main Aug 29, 2024
8 checks passed
; CHECK-NEXT: [[TMP1:%.*]] = bitcast i1 [[TMP0]] to <1 x i1>
; CHECK-NEXT: [[TMP2:%.*]] = call <1 x i32> @llvm.masked.load.v1i32.p0(ptr [[P]], i32 4, <1 x i1> [[TMP1]], <1 x i32> poison)
; CHECK-NEXT: [[TMP3:%.*]] = bitcast <1 x i32> [[TMP2]] to i32
; CHECK-NEXT: [[SPEC_SELECT:%.*]] = select i1 [[COND]], i32 0, i32 [[TMP3]]

@KanRobert I found the transformation is not optimal. We should put the value in the passthru operand instead of creating another select, see:
Exp: https://godbolt.org/z/8hx48joWs
Current: https://godbolt.org/z/qz135f78e
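For illustration, a sketch of the suggested form (a hand-written example with invented names, not taken from either Godbolt link): since the select's constant arm is 0 in the snippet above, it can be folded into the masked load's passthru operand, which already supplies the value for the masked-off lane, so the trailing select becomes unnecessary.

```llvm
define i32 @sketch(i1 %cond, ptr %p) {
entry:
  %cond.not = xor i1 %cond, true
  %mask = bitcast i1 %cond.not to <1 x i1>
  ; Passthru is zeroinitializer: when the lane is masked off (%cond true),
  ; the call returns 0 directly, so no select against 0 is needed afterwards.
  %v = call <1 x i32> @llvm.masked.load.v1i32.p0(ptr %p, i32 4, <1 x i1> %mask, <1 x i32> zeroinitializer)
  %r = bitcast <1 x i32> %v to i32
  ret i32 %r
}

declare <1 x i32> @llvm.masked.load.v1i32.p0(ptr, i32 immarg, <1 x i1>, <1 x i32>)
```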


Candidate: #108754

phoebewang added a commit to phoebewang/llvm-project that referenced this pull request Sep 16, 2024
…ng (Part II)

This is a follow-up of llvm#96878 to support hoisting loads/stores for a diamond CFG.

```
void test (int a, int *c, int *d) {
  if (a)
   *c = a;
  else
   *d = a;
}
```
@nikic
Contributor

nikic commented Sep 29, 2024

I think that this isn't something SimplifyCFG should be handling.

This is the complete opposite of what I thought. Can you explain the reasoning behind your suggestion?

Probably worth considering this comment above the transform:

/// Note that this is a very risky transform currently. Speculating
/// instructions like this is most often not desirable. Instead, there is an MI
/// pass which can do it with full awareness of the resource constraints.
/// However, some cases are "obvious" and we should do directly. An example of
/// this is speculating a single, reasonably cheap instruction.
///
/// There is only one distinct advantage to flattening the CFG at the IR level:
/// it makes very common but simplistic optimizations such as are common in
/// instcombine and the DAG combiner more powerful by removing CFG edges and
/// modeling their effects with easier to reason about SSA value graphs.

I believe the machine passes it refers to are EarlyIfConversion and IfConversion. I do think it would be worthwhile to consider whether this transform isn't better handled there, as these passes can actually properly cost-model such transforms, unlike SimplifyCFG.

I'm not really familiar with these passes, but from a quick look at EarlyIfConversion, it seems to support two strategies, one which only does pure speculation (so no store speculation) and the other that does predication. For X86, what we want is a bit in the middle, in the sense that we only predicate load/store instructions and speculate the rest.

It's probably not so hard to extend that code with an additional policy for this case (or maybe extend the speculation policy to allow conditional load/store predication), and it may give you better mileage than trying to do it in SimplifyCFG.

@phoebewang
Contributor

I think that this isn't something SimplifyCFG should be handling.

This is the complete opposite of what I thought. Can you explain the reasoning behind your suggestion?

Probably worth considering this comment above the transform:

/// Note that this is a very risky transform currently. Speculating
/// instructions like this is most often not desirable. Instead, there is an MI
/// pass which can do it with full awareness of the resource constraints.
/// However, some cases are "obvious" and we should do directly. An example of
/// this is speculating a single, reasonably cheap instruction.
///
/// There is only one distinct advantage to flattening the CFG at the IR level:
/// it makes very common but simplistic optimizations such as are common in
/// instcombine and the DAG combiner more powerful by removing CFG edges and
/// modeling their effects with easier to reason about SSA value graphs.

I believe the machine passes it refers to are EarlyIfConversion and IfConversion. I do think it would be worthwhile to consider whether this transform isn't better handled there, as these passes can actually properly cost-model such transforms, unlike SimplifyCFG.

I'm not really familiar with these passes, but from a quick look at EarlyIfConversion, it seems to support two strategies, one which only does pure speculation (so no store speculation) and the other that does predication. For X86, what we want is a bit in the middle, in the sense that we only predicate load/store instructions and speculate the rest.

It's probably not so hard to extend that code with an additional policy for this case (or maybe extend the speculation policy to allow conditional load/store predication), and it may give you better mileage than trying to do it in SimplifyCFG.

Thanks @nikic for the suggestion! Sorry for the late response, I just got some time to investigate it.

I think HexagonEarlyIfConversion does approximately what you described, but in a separate pass. It's true we could evaluate the cost more precisely there, but there are also two drawbacks compared with the SimplifyCFG solution.

  • A machine pass is more conservative in speculating load/store instructions;
  • X86 introduced a new predicated compare instruction, which is selected during ISel and relies on a flattened CFG. It's not easy to reconstruct that in a machine pass.

As the comment mentions, it's also beneficial to instcombine and the DAG combiner, so I think it's better to use SimplifyCFG here.

phoebewang added a commit that referenced this pull request Nov 25, 2024
…ng (Part II) (#108812)

This is a follow-up of #96878 to support hoisting loads/stores from BBs that
have the same predecessor, when loads/stores are the only instructions and
the branch is unpredictable, e.g.:

```
void test (int a, int *c, int *d) {
  if (a)
   *c = a;
  else
   *d = a;
}
```
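Roughly, both conditional stores become masked stores with complementary masks, so the diamond collapses into straight-line code. A hand-written sketch of the idea (not the exact output of the follow-up patch; names are illustrative):

```llvm
define void @test(i32 %a, ptr %c, ptr %d) {
entry:
  %cond = icmp ne i32 %a, 0
  %mask = bitcast i1 %cond to <1 x i1>
  %cond.not = xor i1 %cond, true
  %mask.not = bitcast i1 %cond.not to <1 x i1>
  %val = bitcast i32 %a to <1 x i32>
  ; Exactly one of the two masked stores has its single lane enabled.
  call void @llvm.masked.store.v1i32.p0(<1 x i32> %val, ptr %c, i32 4, <1 x i1> %mask)
  call void @llvm.masked.store.v1i32.p0(<1 x i32> %val, ptr %d, i32 4, <1 x i1> %mask.not)
  ret void
}

declare void @llvm.masked.store.v1i32.p0(<1 x i32>, ptr, i32 immarg, <1 x i1>)
```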