
[RISCV] Widen i1 AnyOf reductions #134898


Merged · 5 commits into llvm:main on Apr 28, 2025

Conversation

@lukel97 (Contributor) commented Apr 8, 2025

With EVL tail folding an AnyOf reduction will end up emitting an i1 vp.merge.

Unfortunately, because RVV has no tail-undisturbed mask instructions, an i1 vp.merge gets expanded to a lengthy sequence:

  vsetvli a1, zero, e64, m1, ta, ma
  vid.v v10                        
  vmsltu.vx v10, v10, a0           
  vmand.mm v9, v9, v10             
  vmandn.mm v8, v8, v9             
  vmand.mm v9, v0, v9              
  vmor.mm v0, v9, v8               

This patch addresses the issue by matching this specific AnyOf pattern in RISCVCodeGenPrepare and widening it from i1 to i8, which ends up producing a single masked i8 vor.vi inside the loop:

loop:                                                                      
  %phi = phi <vscale x 4 x i1> [ zeroinitializer, %entry ], [ %rec, %loop ]
  %cmp = icmp ...                                                                                          
  %rec = call <vscale x 4 x i1> @llvm.vp.merge(%cmp, true, %phi, %evl)

which becomes:

loop:
  %phi = phi <vscale x 4 x i8> [ zeroinitializer, %entry ], [ %rec, %loop ]
  %cmp = icmp ...                             
  %rec = call <vscale x 4 x i8> @llvm.vp.merge(%cmp, true, %phi, %evl)     
  %trunc = trunc <vscale x 4 x i8> %rec to <vscale x 4 x i1>               

I ended up adding this in RISCVCodeGenPrepare instead of the LoopVectorizer itself since it would have required adding a target hook.

It may also be possible to generalize this to other i1 vp.merges in future.

Normally the trunc will be sunk outside of the loop. The transform also doesn't check whether all of the non-phi users of the vp.merge are outside the loop: even with in-loop users it still seems to be profitable, see the test diff in @widen_anyof_rdx_use_in_loop
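
Conceptually, once later passes sink the trunc, only a single i8-to-i1 conversion is paid outside the loop before the final reduction. A rough sketch of the resulting exit block, based on the shape of the tests below (illustrative only, not literal pass output):

```llvm
exit:
  ; %rec is the widened <vscale x 4 x i8> recurrence from the loop. Truncating
  ; it back to i1 here, outside the loop, recovers the original AnyOf mask for
  ; the final or-reduction.
  %trunc = trunc <vscale x 4 x i8> %rec to <vscale x 4 x i1>
  %res = call i1 @llvm.vector.reduce.or.nxv4i1(<vscale x 4 x i1> %trunc)
```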

Fixes #132180

@llvmbot (Member) commented Apr 8, 2025

@llvm/pr-subscribers-backend-risc-v

Author: Luke Lau (lukel97)

Changes

With EVL tail folding an AnyOf reduction will end up emitting an i1 vp.merge.

Unfortunately, because RVV has no tail-undisturbed mask instructions, an i1 vp.merge gets expanded to a lengthy sequence:

  vsetvli a1, zero, e64, m1, ta, ma
  vid.v v10                        
  vmsltu.vx v10, v10, a0           
  vmand.mm v9, v9, v10             
  vmandn.mm v8, v8, v9             
  vmand.mm v9, v0, v9              
  vmor.mm v0, v9, v8               

This patch addresses the issue by matching this specific AnyOf pattern in RISCVCodeGenPrepare and widening it from i1 to i8, which ends up producing a single masked i8 vor.vv inside the loop:

loop:                                                                      
  %phi = phi <vscale x 4 x i1> [ zeroinitializer, %entry ], [ %rec, %loop ]
  %cmp = icmp ...
  %or = or <vscale x 4 x i1> %phi, %cmp
  %rec = call <vscale x 4 x i1> @llvm.vp.merge(%mask, %or, %phi, %evl)

which becomes:

loop:
  %phi = phi <vscale x 4 x i8> [ zeroinitializer, %entry ], [ %rec, %loop ]
  %cmp = icmp ...
  %zext = zext <vscale x 4 x i1> %cmp to <vscale x 4 x i8>
  %or = or <vscale x 4 x i8> %phi, %zext
  %rec = call <vscale x 4 x i8> @llvm.vp.merge(%mask, %or, %phi, %evl)
  %trunc = trunc <vscale x 4 x i8> %rec to <vscale x 4 x i1>

I ended up adding this in RISCVCodeGenPrepare instead of the LoopVectorizer itself since it would have required adding a target hook.

It may also be possible to generalize this to other i1 vp.merges in future.

Normally the trunc will be sunk outside of the loop. The transform also doesn't check whether all of the non-phi users of the vp.merge are outside the loop: even with in-loop users it still seems to be profitable, see the test diff in @widen_anyof_rdx_use_in_loop

Fixes #132180


Full diff: https://github.com/llvm/llvm-project/pull/134898.diff

3 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp (+81)
  • (modified) llvm/test/CodeGen/RISCV/riscv-codegenprepare-asm.ll (+101-1)
  • (modified) llvm/test/CodeGen/RISCV/riscv-codegenprepare.ll (+98)
diff --git a/llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp b/llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp
index b5cb05f30fb26..77584f853283c 100644
--- a/llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp
+++ b/llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp
@@ -25,6 +25,7 @@
 #include "llvm/IR/PatternMatch.h"
 #include "llvm/InitializePasses.h"
 #include "llvm/Pass.h"
+#include "llvm/Transforms/Utils/Local.h"
 
 using namespace llvm;
 
@@ -58,6 +59,7 @@ class RISCVCodeGenPrepare : public FunctionPass,
   bool visitAnd(BinaryOperator &BO);
   bool visitIntrinsicInst(IntrinsicInst &I);
   bool expandVPStrideLoad(IntrinsicInst &I);
+  bool widenVPMerge(IntrinsicInst &I);
 };
 
 } // end anonymous namespace
@@ -103,6 +105,82 @@ bool RISCVCodeGenPrepare::visitAnd(BinaryOperator &BO) {
   return true;
 }
 
+// With EVL tail folding, an AnyOf reduction will generate an i1 vp.merge like
+// follows:
+//
+// loop:
+//   %phi = phi <vscale x 4 x i1> [ zeroinitializer, %entry ], [ %rec, %loop ]
+//   %cmp = icmp ...
+//   %or = or <vscale x 4 x i1> %phi, %cmp
+//   %rec = call <vscale x 4 x i1> @llvm.vp.merge(%mask, %or, %phi, %evl)
+//   ...
+// middle:
+//   %res = call i1 @llvm.vector.reduce.or(<vscale x 4 x i1> %rec)
+//
+// However RVV doesn't have any tail undisturbed mask instructions and so we
+// need a convoluted sequence of mask instructions to lower the i1 vp.merge: see
+// llvm/test/CodeGen/RISCV/rvv/vpmerge-sdnode.ll.
+//
+// To avoid that this widens the i1 vp.merge to an i8 vp.merge, which will
+// usually be folded into a masked vor.vv.
+//
+// loop:
+//   %phi = phi <vscale x 4 x i8> [ zeroinitializer, %entry ], [ %rec, %loop ]
+//   %cmp = icmp ...
+//   %zext = zext <vscale x 4 x i1> %cmp to <vscale x 4 x i8>
+//   %or = or <vscale x 4 x i8> %phi, %zext
+//   %rec = call <vscale x 4 x i8> @llvm.vp.merge(%mask, %or, %phi, %evl)
+//   %trunc = trunc <vscale x 4 x i8> %rec to <vscale x 4 x i1>
+//   ...
+// middle:
+//   %res = call i1 @llvm.vector.reduce.or(<vscale x 4 x i1> %rec)
+//
+// The trunc will normally be sunk outside of the loop, but even if there are
+// users inside the loop it is still profitable.
+bool RISCVCodeGenPrepare::widenVPMerge(IntrinsicInst &II) {
+  if (!II.getType()->getScalarType()->isIntegerTy(1))
+    return false;
+
+  Value *Mask, *PhiV, *Cond, *EVL;
+
+  using namespace PatternMatch;
+  if (!match(&II,
+             m_Intrinsic<Intrinsic::vp_merge>(
+                 m_Value(Mask), m_OneUse(m_c_Or(m_Value(PhiV), m_Value(Cond))),
+                 m_Deferred(PhiV), m_Value(EVL))))
+    return false;
+
+  auto *Phi = dyn_cast<PHINode>(PhiV);
+  auto *Start = dyn_cast<Constant>(Phi->getIncomingValue(0));
+  if (!Phi || Phi->getNumUses() > 2 || Phi->getNumIncomingValues() != 2 ||
+      !(Start && Start->isZeroValue()) || Phi->getIncomingValue(1) != &II)
+    return false;
+
+  Type *WideTy =
+      VectorType::get(IntegerType::getInt8Ty(II.getContext()),
+                      cast<VectorType>(II.getType())->getElementCount());
+
+  IRBuilder<> Builder(Phi);
+  PHINode *WidePhi = Builder.CreatePHI(WideTy, 2);
+  WidePhi->addIncoming(ConstantAggregateZero::get(WideTy),
+                       Phi->getIncomingBlock(0));
+  Builder.SetInsertPoint(&II);
+  Value *WideCmp = Builder.CreateZExt(Cond, WideTy);
+  Value *WideOr = Builder.CreateOr(WidePhi, WideCmp);
+  Value *WideMerge = Builder.CreateIntrinsic(Intrinsic::vp_merge, {WideTy},
+                                             {Mask, WideOr, WidePhi, EVL});
+  WidePhi->addIncoming(WideMerge, Phi->getIncomingBlock(1));
+  Value *Trunc = Builder.CreateTrunc(WideMerge, II.getType());
+
+  II.replaceAllUsesWith(Trunc);
+
+  // Break the cycle and delete the old chain.
+  Phi->setIncomingValue(1, Phi->getIncomingValue(0));
+  llvm::RecursivelyDeleteTriviallyDeadInstructions(&II);
+
+  return true;
+}
+
 // LLVM vector reduction intrinsics return a scalar result, but on RISC-V vector
 // reduction instructions write the result in the first element of a vector
 // register. So when a reduction in a loop uses a scalar phi, we end up with
@@ -138,6 +216,9 @@ bool RISCVCodeGenPrepare::visitIntrinsicInst(IntrinsicInst &I) {
   if (expandVPStrideLoad(I))
     return true;
 
+  if (widenVPMerge(I))
+    return true;
+
   if (I.getIntrinsicID() != Intrinsic::vector_reduce_fadd &&
       !isa<VPReductionIntrinsic>(&I))
     return false;
diff --git a/llvm/test/CodeGen/RISCV/riscv-codegenprepare-asm.ll b/llvm/test/CodeGen/RISCV/riscv-codegenprepare-asm.ll
index 32261ee47164e..9d7976b5b874d 100644
--- a/llvm/test/CodeGen/RISCV/riscv-codegenprepare-asm.ll
+++ b/llvm/test/CodeGen/RISCV/riscv-codegenprepare-asm.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc < %s -mtriple=riscv64 | FileCheck %s
+; RUN: llc < %s -mtriple=riscv64 -mattr=+v | FileCheck %s
 
 
 ; Make sure we don't emit a pair of shift for the zext in the preheader. We
@@ -127,3 +127,103 @@ for.body:                                         ; preds = %for.body, %for.body
   %niter.ncmp.1 = icmp eq i64 %niter.next.1, %unroll_iter
   br i1 %niter.ncmp.1, label %for.cond.cleanup.loopexit.unr-lcssa, label %for.body
 }
+
+define i1 @widen_anyof_rdx(ptr %p, i64 %n) {
+; CHECK-LABEL: widen_anyof_rdx:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    li a2, 0
+; CHECK-NEXT:    vsetvli a3, zero, e8, mf2, ta, ma
+; CHECK-NEXT:    vmv.v.i v8, 0
+; CHECK-NEXT:  .LBB2_1: # %loop
+; CHECK-NEXT:    # =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    sub a3, a1, a2
+; CHECK-NEXT:    slli a4, a2, 2
+; CHECK-NEXT:    vsetvli a3, a3, e32, m2, ta, ma
+; CHECK-NEXT:    add a4, a0, a4
+; CHECK-NEXT:    vle32.v v10, (a4)
+; CHECK-NEXT:    vmsne.vi v0, v10, 0
+; CHECK-NEXT:    add a2, a2, a3
+; CHECK-NEXT:    vsetvli zero, zero, e8, mf2, tu, mu
+; CHECK-NEXT:    vor.vi v8, v8, 1, v0.t
+; CHECK-NEXT:    blt a2, a1, .LBB2_1
+; CHECK-NEXT:  # %bb.2: # %exit
+; CHECK-NEXT:    vsetvli a0, zero, e8, mf2, ta, ma
+; CHECK-NEXT:    vand.vi v8, v8, 1
+; CHECK-NEXT:    vmsne.vi v8, v8, 0
+; CHECK-NEXT:    vcpop.m a0, v8
+; CHECK-NEXT:    snez a0, a0
+; CHECK-NEXT:    ret
+entry:
+  br label %loop
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+  %phi = phi <vscale x 4 x i1> [ zeroinitializer, %entry ], [ %rec, %loop ]
+  %avl = sub i64 %n, %iv
+  %evl = call i32 @llvm.experimental.get.vector.length(i64 %avl, i32 4, i1 true)
+
+  %gep = getelementptr i32, ptr %p, i64 %iv
+  %x = call <vscale x 4 x i32> @llvm.vp.load(ptr %gep, <vscale x 4 x i1> splat (i1 true), i32 %evl)
+  %cmp = icmp ne <vscale x 4 x i32> %x, zeroinitializer
+  %or = or <vscale x 4 x i1> %phi, %cmp
+  %rec = call <vscale x 4 x i1> @llvm.vp.merge(<vscale x 4 x i1> splat (i1 true), <vscale x 4 x i1> %or, <vscale x 4 x i1> %phi, i32 %evl)
+
+  %evl.zext = zext i32 %evl to i64
+  %iv.next = add i64 %iv, %evl.zext
+  %done = icmp sge i64 %iv.next, %n
+  br i1 %done, label %exit, label %loop
+exit:
+  %res = call i1 @llvm.vector.reduce.or(<vscale x 4 x i1> %rec)
+  ret i1 %res
+}
+
+
+define i1 @widen_anyof_rdx_use_in_loop(ptr %p, i64 %n) {
+; CHECK-LABEL: widen_anyof_rdx_use_in_loop:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    li a2, 0
+; CHECK-NEXT:    vsetvli a3, zero, e8, mf2, ta, ma
+; CHECK-NEXT:    vmv.v.i v8, 0
+; CHECK-NEXT:  .LBB3_1: # %loop
+; CHECK-NEXT:    # =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    sub a3, a1, a2
+; CHECK-NEXT:    slli a4, a2, 2
+; CHECK-NEXT:    vsetvli a3, a3, e32, m2, ta, ma
+; CHECK-NEXT:    add a4, a0, a4
+; CHECK-NEXT:    vle32.v v10, (a4)
+; CHECK-NEXT:    vmsne.vi v0, v10, 0
+; CHECK-NEXT:    vsetvli zero, zero, e8, mf2, tu, mu
+; CHECK-NEXT:    vor.vi v8, v8, 1, v0.t
+; CHECK-NEXT:    vsetvli a5, zero, e8, mf2, ta, ma
+; CHECK-NEXT:    vand.vi v9, v8, 1
+; CHECK-NEXT:    vmsne.vi v9, v9, 0
+; CHECK-NEXT:    add a2, a2, a3
+; CHECK-NEXT:    vsm.v v9, (a4)
+; CHECK-NEXT:    blt a2, a1, .LBB3_1
+; CHECK-NEXT:  # %bb.2: # %exit
+; CHECK-NEXT:    vcpop.m a0, v9
+; CHECK-NEXT:    snez a0, a0
+; CHECK-NEXT:    ret
+entry:
+  br label %loop
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+  %phi = phi <vscale x 4 x i1> [ zeroinitializer, %entry ], [ %rec, %loop ]
+  %avl = sub i64 %n, %iv
+  %evl = call i32 @llvm.experimental.get.vector.length(i64 %avl, i32 4, i1 true)
+
+  %gep = getelementptr i32, ptr %p, i64 %iv
+  %x = call <vscale x 4 x i32> @llvm.vp.load(ptr %gep, <vscale x 4 x i1> splat (i1 true), i32 %evl)
+  %cmp = icmp ne <vscale x 4 x i32> %x, zeroinitializer
+  %or = or <vscale x 4 x i1> %phi, %cmp
+  %rec = call <vscale x 4 x i1> @llvm.vp.merge(<vscale x 4 x i1> splat (i1 true), <vscale x 4 x i1> %or, <vscale x 4 x i1> %phi, i32 %evl)
+
+  store <vscale x 4 x i1> %rec, ptr %gep
+
+  %evl.zext = zext i32 %evl to i64
+  %iv.next = add i64 %iv, %evl.zext
+  %done = icmp sge i64 %iv.next, %n
+  br i1 %done, label %exit, label %loop
+exit:
+  %res = call i1 @llvm.vector.reduce.or(<vscale x 4 x i1> %rec)
+  ret i1 %res
+}
diff --git a/llvm/test/CodeGen/RISCV/riscv-codegenprepare.ll b/llvm/test/CodeGen/RISCV/riscv-codegenprepare.ll
index 2179a0d26cf98..ba2fa1d7b4001 100644
--- a/llvm/test/CodeGen/RISCV/riscv-codegenprepare.ll
+++ b/llvm/test/CodeGen/RISCV/riscv-codegenprepare.ll
@@ -103,3 +103,101 @@ define i64 @bug(i32 %x) {
   %b = and i64 %a, 4294967295
   ret i64 %b
 }
+
+define i1 @widen_anyof_rdx(ptr %p, i64 %n) {
+; CHECK-LABEL: @widen_anyof_rdx(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    br label [[LOOP:%.*]]
+; CHECK:       loop:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[IV_NEXT:%.*]], [[LOOP]] ]
+; CHECK-NEXT:    [[TMP0:%.*]] = phi <vscale x 4 x i8> [ zeroinitializer, [[ENTRY]] ], [ [[TMP3:%.*]], [[LOOP]] ]
+; CHECK-NEXT:    [[AVL:%.*]] = sub i64 [[N:%.*]], [[IV]]
+; CHECK-NEXT:    [[EVL:%.*]] = call i32 @llvm.experimental.get.vector.length.i64(i64 [[AVL]], i32 4, i1 true)
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i32, ptr [[P:%.*]], i64 [[IV]]
+; CHECK-NEXT:    [[X:%.*]] = call <vscale x 4 x i32> @llvm.vp.load.nxv4i32.p0(ptr [[GEP]], <vscale x 4 x i1> splat (i1 true), i32 [[EVL]])
+; CHECK-NEXT:    [[CMP:%.*]] = icmp ne <vscale x 4 x i32> [[X]], zeroinitializer
+; CHECK-NEXT:    [[TMP1:%.*]] = zext <vscale x 4 x i1> [[CMP]] to <vscale x 4 x i8>
+; CHECK-NEXT:    [[TMP2:%.*]] = or <vscale x 4 x i8> [[TMP0]], [[TMP1]]
+; CHECK-NEXT:    [[TMP3]] = call <vscale x 4 x i8> @llvm.vp.merge.nxv4i8(<vscale x 4 x i1> splat (i1 true), <vscale x 4 x i8> [[TMP2]], <vscale x 4 x i8> [[TMP0]], i32 [[EVL]])
+; CHECK-NEXT:    [[TMP4:%.*]] = trunc <vscale x 4 x i8> [[TMP3]] to <vscale x 4 x i1>
+; CHECK-NEXT:    [[EVL_ZEXT:%.*]] = zext i32 [[EVL]] to i64
+; CHECK-NEXT:    [[IV_NEXT]] = add i64 [[IV]], [[EVL_ZEXT]]
+; CHECK-NEXT:    [[DONE:%.*]] = icmp sge i64 [[IV_NEXT]], [[N]]
+; CHECK-NEXT:    br i1 [[DONE]], label [[EXIT:%.*]], label [[LOOP]]
+; CHECK:       exit:
+; CHECK-NEXT:    [[RES:%.*]] = call i1 @llvm.vector.reduce.or.nxv4i1(<vscale x 4 x i1> [[TMP4]])
+; CHECK-NEXT:    ret i1 [[RES]]
+;
+entry:
+  br label %loop
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+  %phi = phi <vscale x 4 x i1> [ zeroinitializer, %entry ], [ %rec, %loop ]
+  %avl = sub i64 %n, %iv
+  %evl = call i32 @llvm.experimental.get.vector.length(i64 %avl, i32 4, i1 true)
+
+  %gep = getelementptr i32, ptr %p, i64 %iv
+  %x = call <vscale x 4 x i32> @llvm.vp.load(ptr %gep, <vscale x 4 x i1> splat (i1 true), i32 %evl)
+  %cmp = icmp ne <vscale x 4 x i32> %x, zeroinitializer
+  %or = or <vscale x 4 x i1> %phi, %cmp
+  %rec = call <vscale x 4 x i1> @llvm.vp.merge(<vscale x 4 x i1> splat (i1 true), <vscale x 4 x i1> %or, <vscale x 4 x i1> %phi, i32 %evl)
+
+  %evl.zext = zext i32 %evl to i64
+  %iv.next = add i64 %iv, %evl.zext
+  %done = icmp sge i64 %iv.next, %n
+  br i1 %done, label %exit, label %loop
+exit:
+  %res = call i1 @llvm.vector.reduce.or(<vscale x 4 x i1> %rec)
+  ret i1 %res
+}
+
+
+define i1 @widen_anyof_rdx_use_in_loop(ptr %p, i64 %n) {
+; CHECK-LABEL: @widen_anyof_rdx_use_in_loop(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    br label [[LOOP:%.*]]
+; CHECK:       loop:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[IV_NEXT:%.*]], [[LOOP]] ]
+; CHECK-NEXT:    [[TMP0:%.*]] = phi <vscale x 4 x i8> [ zeroinitializer, [[ENTRY]] ], [ [[TMP3:%.*]], [[LOOP]] ]
+; CHECK-NEXT:    [[AVL:%.*]] = sub i64 [[N:%.*]], [[IV]]
+; CHECK-NEXT:    [[EVL:%.*]] = call i32 @llvm.experimental.get.vector.length.i64(i64 [[AVL]], i32 4, i1 true)
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i32, ptr [[P:%.*]], i64 [[IV]]
+; CHECK-NEXT:    [[X:%.*]] = call <vscale x 4 x i32> @llvm.vp.load.nxv4i32.p0(ptr [[GEP]], <vscale x 4 x i1> splat (i1 true), i32 [[EVL]])
+; CHECK-NEXT:    [[CMP:%.*]] = icmp ne <vscale x 4 x i32> [[X]], zeroinitializer
+; CHECK-NEXT:    [[TMP1:%.*]] = zext <vscale x 4 x i1> [[CMP]] to <vscale x 4 x i8>
+; CHECK-NEXT:    [[TMP2:%.*]] = or <vscale x 4 x i8> [[TMP0]], [[TMP1]]
+; CHECK-NEXT:    [[TMP3]] = call <vscale x 4 x i8> @llvm.vp.merge.nxv4i8(<vscale x 4 x i1> splat (i1 true), <vscale x 4 x i8> [[TMP2]], <vscale x 4 x i8> [[TMP0]], i32 [[EVL]])
+; CHECK-NEXT:    [[REC:%.*]] = trunc <vscale x 4 x i8> [[TMP3]] to <vscale x 4 x i1>
+; CHECK-NEXT:    store <vscale x 4 x i1> [[REC]], ptr [[GEP]], align 1
+; CHECK-NEXT:    [[EVL_ZEXT:%.*]] = zext i32 [[EVL]] to i64
+; CHECK-NEXT:    [[IV_NEXT]] = add i64 [[IV]], [[EVL_ZEXT]]
+; CHECK-NEXT:    [[DONE:%.*]] = icmp sge i64 [[IV_NEXT]], [[N]]
+; CHECK-NEXT:    br i1 [[DONE]], label [[EXIT:%.*]], label [[LOOP]]
+; CHECK:       exit:
+; CHECK-NEXT:    [[RES:%.*]] = call i1 @llvm.vector.reduce.or.nxv4i1(<vscale x 4 x i1> [[REC]])
+; CHECK-NEXT:    ret i1 [[RES]]
+;
+entry:
+  br label %loop
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+  %phi = phi <vscale x 4 x i1> [ zeroinitializer, %entry ], [ %rec, %loop ]
+  %avl = sub i64 %n, %iv
+  %evl = call i32 @llvm.experimental.get.vector.length(i64 %avl, i32 4, i1 true)
+
+  %gep = getelementptr i32, ptr %p, i64 %iv
+  %x = call <vscale x 4 x i32> @llvm.vp.load(ptr %gep, <vscale x 4 x i1> splat (i1 true), i32 %evl)
+  %cmp = icmp ne <vscale x 4 x i32> %x, zeroinitializer
+  %or = or <vscale x 4 x i1> %phi, %cmp
+  %rec = call <vscale x 4 x i1> @llvm.vp.merge(<vscale x 4 x i1> splat (i1 true), <vscale x 4 x i1> %or, <vscale x 4 x i1> %phi, i32 %evl)
+
+  store <vscale x 4 x i1> %rec, ptr %gep
+
+  %evl.zext = zext i32 %evl to i64
+  %iv.next = add i64 %iv, %evl.zext
+  %done = icmp sge i64 %iv.next, %n
+  br i1 %done, label %exit, label %loop
+exit:
+  %res = call i1 @llvm.vector.reduce.or(<vscale x 4 x i1> %rec)
+  ret i1 %res
+}

@preames (Collaborator) commented Apr 8, 2025

I think you have a missing simplification in your input. Can't this:

%phi = phi <vscale x 4 x i1> [ zeroinitializer, %entry ], [ %rec, %loop ]
  %cmp = icmp ...                                                          
  %or = or <vscale x 4 x i1> %phi, %cmp                                    
  %rec = call <vscale x 4 x i1> @llvm.vp.merge(TrueMask, %or, %phi, %evl)   

Be simplified to:

%phi = phi <vscale x 4 x i1> [ zeroinitializer, %entry ], [ %rec, %loop ]
  %cmp = icmp ...                                                          
  %rec = call <vscale x 4 x i1> @llvm.vp.merge(%cmp, TrueMask, %phi, %evl)   

@lukel97 (Contributor, Author) commented Apr 8, 2025

I think you have a missing simplification in your input. Can't this:

%phi = phi <vscale x 4 x i1> [ zeroinitializer, %entry ], [ %rec, %loop ]
  %cmp = icmp ...                                                          
  %or = or <vscale x 4 x i1> %phi, %cmp                                    
  %rec = call <vscale x 4 x i1> @llvm.vp.merge(TrueMask, %or, %phi, %evl)   

Be simplified to:

%phi = phi <vscale x 4 x i1> [ zeroinitializer, %entry ], [ %rec, %loop ]
  %cmp = icmp ...                                                          
  %rec = call <vscale x 4 x i1> @llvm.vp.merge(%cmp, TrueMask, %phi, %evl)   

I think so. The above is just what the loop vectorizer currently generates, and it doesn't look like there's any existing combine for this in InstCombine/VectorCombine.

We could simplify this in VPlan and get the slightly cheaper costing. I'll see if I can do that first and then return to this PR

On RISCVCodeGenPrepare.cpp (lines from the patch above):

return false;

auto *Phi = dyn_cast<PHINode>(PhiV);
auto *Start = dyn_cast<Constant>(Phi->getIncomingValue(0));
Collaborator:

I think it is possible to have a phi with no operands so I'm not sure you can assume value 0 exists.

Contributor Author:

Thanks, should be fixed in the latest version
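
For reference, a minimal sketch of one way to order those checks so the phi is validated before its operands are inspected (illustrative only; the actual fix in the updated patch may differ):

```cpp
// Validate the phi's shape before touching its operands: dyn_cast may return
// null, and a phi with fewer than two incoming values has no incoming value
// 0 or 1 to inspect.
auto *Phi = dyn_cast<PHINode>(PhiV);
if (!Phi || Phi->getNumUses() > 2 || Phi->getNumIncomingValues() != 2)
  return false;
auto *Start = dyn_cast<Constant>(Phi->getIncomingValue(0));
if (!Start || !Start->isZeroValue() || Phi->getIncomingValue(1) != &II)
  return false;
```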

lukel97 added a commit to lukel97/llvm-project that referenced this pull request Apr 9, 2025
With EVL tail folding an AnyOf reduction will emit a vp.merge like

vp.merge true, (or phi, cond), phi, evl

We can remove the or and optimise this to

vp.merge cond, true, phi, evl

Which makes it slightly easier to pattern match in llvm#134898.

This adds a pattern matcher for VPWidenIntrinsicRecipe to help match this (only 4-ary intrinsics for now; it can be extended if other users need it).

Blended AnyOf reductions will use an and, which we may also be able to simplify in a later patch.
lukel97 added a commit that referenced this pull request Apr 17, 2025
…135017)

With EVL tail folding an AnyOf reduction will emit an i1 vp.merge like

vp.merge true, (or phi, cond), phi, evl

We can remove the or and optimise this to

vp.merge cond, true, phi, evl

Which makes it slightly easier to pattern match in #134898.

This also adds a pattern matcher for calls to help match this.

Blended AnyOf reductions will use an and instead of an or, which we may
also be able to simplify in a later patch.
@lukel97 lukel97 force-pushed the riscv/widen-i1-vpmerge branch from 65c77d1 to 7b4100d on April 21, 2025 at 09:54
@lukel97 (Contributor, Author) commented Apr 21, 2025

Apologies for the delay, I've rebased this on top of #135017 now that it's landed and simplified the pattern.
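
For context, after #135017 the input no longer contains the or, so the pattern being matched is roughly of the following form. This is a sketch of how it might be expressed with PatternMatch (the exact code in the rebased patch may differ):

```cpp
using namespace PatternMatch;
Value *Cond, *PhiV, *EVL;
// Match %rec = vp.merge(%cmp, true, %phi, %evl): active lanes where the
// condition is true are set to 1, all other lanes keep the phi value.
if (!match(&II, m_Intrinsic<Intrinsic::vp_merge>(m_Value(Cond), m_AllOnes(),
                                                 m_Value(PhiV), m_Value(EVL))))
  return false;
```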

Previously, when we were matching the or, the or was also a use of the phi.
@wangpc-pp wangpc-pp (Contributor) left a comment:

LGTM.

@preames preames (Collaborator) left a comment:

LGTM

@lukel97 lukel97 merged commit 185ba02 into llvm:main Apr 28, 2025
11 checks passed
jyli0116 pushed a commit to jyli0116/llvm-project that referenced this pull request Apr 28, 2025
IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
Ankur-0429 pushed a commit to Ankur-0429/llvm-project that referenced this pull request May 9, 2025
Successfully merging this pull request may close these issues: [RISCV][EVL] Improve AnyOf reduction codegen

5 participants