[RISCV] Add DAG combine to convert (iN reduce.add (zext (vXi1 A to vXiN)) into vcpop.m #127497


Merged

skachkov-sc merged 1 commit into llvm:main from skachkov-sc:combine-reduce-add-to-vcpop on Mar 3, 2025

Conversation

skachkov-sc (Contributor)

This patch combines (iN vector.reduce.add (zext (vXi1 A to vXiN))) into the vcpop.m instruction (similarly to the bitcast + ctpop pattern). It can be useful for counting the number of set bits in scalable vector types, which can't be expressed with bitcast + ctpop (this was previously discussed here: #74294).
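
For illustration, a minimal IR example of the kind of input this combine targets (hypothetical function name; it mirrors the scalable-vector tests added below, where the whole sequence lowers to a single vcpop.m):

  define i32 @count_set_bits(<vscale x 4 x i1> %mask) {
    ; Before this patch: lowered via a vector zext plus vredsum.
    ; After this patch: lowered to vcpop.m on the mask register.
    %ext = zext <vscale x 4 x i1> %mask to <vscale x 4 x i32>
    %sum = call i32 @llvm.vector.reduce.add.nxv4i32(<vscale x 4 x i32> %ext)
    ret i32 %sum
  }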

@llvmbot (Member) commented Feb 17, 2025

@llvm/pr-subscribers-backend-risc-v

Author: Sergey Kachkov (skachkov-sc)

Changes

This patch combines (iN vector.reduce.add (zext (vXi1 A to vXiN))) into the vcpop.m instruction (similarly to the bitcast + ctpop pattern). It can be useful for counting the number of set bits in scalable vector types, which can't be expressed with bitcast + ctpop (this was previously discussed here: #74294).


Full diff: https://github.com/llvm/llvm-project/pull/127497.diff

2 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVISelLowering.cpp (+31-14)
  • (added) llvm/test/CodeGen/RISCV/rvv/combine-reduce-add-to-vcpop.ll (+344)
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index c40ab0d09bdf6..20c2785e421ca 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -1542,7 +1542,7 @@ RISCVTargetLowering::RISCVTargetLowering(const TargetMachine &TM,
          ISD::MUL,          ISD::SDIV,         ISD::UDIV,
          ISD::SREM,         ISD::UREM,         ISD::INSERT_VECTOR_ELT,
          ISD::ABS,          ISD::CTPOP,        ISD::VECTOR_SHUFFLE,
-         ISD::VSELECT});
+         ISD::VSELECT,      ISD::VECREDUCE_ADD});
 
   if (Subtarget.hasVendorXTHeadMemPair())
     setTargetDAGCombine({ISD::LOAD, ISD::STORE});
@@ -18100,25 +18100,38 @@ static SDValue combineTruncToVnclip(SDNode *N, SelectionDAG &DAG,
 //   (iX ctpop (bitcast (vXi1 A)))
 // ->
 //   (zext (vcpop.m (nxvYi1 (insert_subvec (vXi1 A)))))
+// and
+//   (iN reduce.add (zext (vXi1 A to vXiN))
+// ->
+//   (zext (vcpop.m (nxvYi1 (insert_subvec (vXi1 A)))))
 // FIXME: It's complicated to match all the variations of this after type
 // legalization so we only handle the pre-type legalization pattern, but that
 // requires the fixed vector type to be legal.
-static SDValue combineScalarCTPOPToVCPOP(SDNode *N, SelectionDAG &DAG,
-                                         const RISCVSubtarget &Subtarget) {
+static SDValue combineVCPOP(SDNode *N, SelectionDAG &DAG,
+                            const RISCVSubtarget &Subtarget) {
+  unsigned Opc = N->getOpcode();
+  assert((Opc == ISD::CTPOP || Opc == ISD::VECREDUCE_ADD) &&
+         "Unexpected opcode");
   EVT VT = N->getValueType(0);
   if (!VT.isScalarInteger())
     return SDValue();
 
   SDValue Src = N->getOperand(0);
 
-  // Peek through zero_extend. It doesn't change the count.
-  if (Src.getOpcode() == ISD::ZERO_EXTEND)
-    Src = Src.getOperand(0);
+  if (Opc == ISD::CTPOP) {
+    // Peek through zero_extend. It doesn't change the count.
+    if (Src.getOpcode() == ISD::ZERO_EXTEND)
+      Src = Src.getOperand(0);
 
-  if (Src.getOpcode() != ISD::BITCAST)
-    return SDValue();
+    if (Src.getOpcode() != ISD::BITCAST)
+      return SDValue();
+    Src = Src.getOperand(0);
+  } else if (Opc == ISD::VECREDUCE_ADD) {
+    if (Src.getOpcode() != ISD::ZERO_EXTEND)
+      return SDValue();
+    Src = Src.getOperand(0);
+  }
 
-  Src = Src.getOperand(0);
   EVT SrcEVT = Src.getValueType();
   if (!SrcEVT.isSimple())
     return SDValue();
@@ -18128,11 +18141,14 @@ static SDValue combineScalarCTPOPToVCPOP(SDNode *N, SelectionDAG &DAG,
   if (!SrcMVT.isVector() || SrcMVT.getVectorElementType() != MVT::i1)
     return SDValue();
 
-  if (!useRVVForFixedLengthVectorVT(SrcMVT, Subtarget))
-    return SDValue();
+  MVT ContainerVT = SrcMVT;
+  if (SrcMVT.isFixedLengthVector()) {
+    if (!useRVVForFixedLengthVectorVT(SrcMVT, Subtarget))
+      return SDValue();
 
-  MVT ContainerVT = getContainerForFixedLengthVector(DAG, SrcMVT, Subtarget);
-  Src = convertToScalableVector(ContainerVT, Src, DAG, Subtarget);
+    ContainerVT = getContainerForFixedLengthVector(DAG, SrcMVT, Subtarget);
+    Src = convertToScalableVector(ContainerVT, Src, DAG, Subtarget);
+  }
 
   SDLoc DL(N);
   auto [Mask, VL] = getDefaultVLOps(SrcMVT, ContainerVT, DL, DAG, Subtarget);
@@ -19214,7 +19230,8 @@ SDValue RISCVTargetLowering::PerformDAGCombine(SDNode *N,
     return SDValue();
   }
   case ISD::CTPOP:
-    if (SDValue V = combineScalarCTPOPToVCPOP(N, DAG, Subtarget))
+  case ISD::VECREDUCE_ADD:
+    if (SDValue V = combineVCPOP(N, DAG, Subtarget))
       return V;
     break;
   }
diff --git a/llvm/test/CodeGen/RISCV/rvv/combine-reduce-add-to-vcpop.ll b/llvm/test/CodeGen/RISCV/rvv/combine-reduce-add-to-vcpop.ll
new file mode 100644
index 0000000000000..7156c9faee611
--- /dev/null
+++ b/llvm/test/CodeGen/RISCV/rvv/combine-reduce-add-to-vcpop.ll
@@ -0,0 +1,344 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc < %s -mtriple=riscv32 -mattr=+v,+zbb | FileCheck %s --check-prefixes=CHECK,RV32
+; RUN: llc < %s -mtriple=riscv64 -mattr=+v,+zbb | FileCheck %s --check-prefixes=CHECK,RV64
+
+define i32 @test_nxv2i1(<vscale x 2 x i1> %x) {
+; CHECK-LABEL: test_nxv2i1:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    vsetvli a0, zero, e8, mf4, ta, ma
+; CHECK-NEXT:    vcpop.m a0, v0
+; CHECK-NEXT:    ret
+entry:
+  %a = zext <vscale x 2 x i1> %x to <vscale x 2 x i32>
+  %b = call i32 @llvm.vector.reduce.add.nxv2i32(<vscale x 2 x i32> %a)
+  ret i32 %b
+}
+
+define i32 @test_nxv4i1(<vscale x 4 x i1> %x) {
+; CHECK-LABEL: test_nxv4i1:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    vsetvli a0, zero, e8, mf2, ta, ma
+; CHECK-NEXT:    vcpop.m a0, v0
+; CHECK-NEXT:    ret
+entry:
+  %a = zext <vscale x 4 x i1> %x to <vscale x 4 x i32>
+  %b = call i32 @llvm.vector.reduce.add.nxv4i32(<vscale x 4 x i32> %a)
+  ret i32 %b
+}
+
+define i32 @test_nxv8i1(<vscale x 8 x i1> %x) {
+; CHECK-LABEL: test_nxv8i1:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    vsetvli a0, zero, e8, m1, ta, ma
+; CHECK-NEXT:    vcpop.m a0, v0
+; CHECK-NEXT:    ret
+entry:
+  %a = zext <vscale x 8 x i1> %x to <vscale x 8 x i32>
+  %b = call i32 @llvm.vector.reduce.add.nxv8i32(<vscale x 8 x i32> %a)
+  ret i32 %b
+}
+
+define i32 @test_nxv16i1(<vscale x 16 x i1> %x) {
+; CHECK-LABEL: test_nxv16i1:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    vsetvli a0, zero, e8, m2, ta, ma
+; CHECK-NEXT:    vcpop.m a0, v0
+; CHECK-NEXT:    ret
+entry:
+  %a = zext <vscale x 16 x i1> %x to <vscale x 16 x i32>
+  %b = call i32 @llvm.vector.reduce.add.nxv16i32(<vscale x 16 x i32> %a)
+  ret i32 %b
+}
+
+define i32 @test_nxv32i1(<vscale x 32 x i1> %x) {
+; CHECK-LABEL: test_nxv32i1:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    vsetvli a0, zero, e8, m4, ta, ma
+; CHECK-NEXT:    vcpop.m a0, v0
+; CHECK-NEXT:    ret
+entry:
+  %a = zext <vscale x 32 x i1> %x to <vscale x 32 x i32>
+  %b = call i32 @llvm.vector.reduce.add.nxv32i32(<vscale x 32 x i32> %a)
+  ret i32 %b
+}
+
+define i32 @test_nxv64i1(<vscale x 64 x i1> %x) {
+; CHECK-LABEL: test_nxv64i1:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    vsetvli a0, zero, e8, m8, ta, ma
+; CHECK-NEXT:    vcpop.m a0, v0
+; CHECK-NEXT:    ret
+entry:
+  %a = zext <vscale x 64 x i1> %x to <vscale x 64 x i32>
+  %b = call i32 @llvm.vector.reduce.add.nxv64i32(<vscale x 64 x i32> %a)
+  ret i32 %b
+}
+
+define i32 @test_nxv128i1(<vscale x 128 x i1> %x) {
+; CHECK-LABEL: test_nxv128i1:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    addi sp, sp, -16
+; CHECK-NEXT:    .cfi_def_cfa_offset 16
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    slli a0, a0, 3
+; CHECK-NEXT:    sub sp, sp, a0
+; CHECK-NEXT:    .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x08, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 8 * vlenb
+; CHECK-NEXT:    vsetvli a0, zero, e32, m8, ta, ma
+; CHECK-NEXT:    vmv1r.v v7, v8
+; CHECK-NEXT:    vmv.v.i v16, 0
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    vmerge.vim v8, v16, 1, v0
+; CHECK-NEXT:    addi a1, sp, 16
+; CHECK-NEXT:    vs8r.v v8, (a1) # Unknown-size Folded Spill
+; CHECK-NEXT:    srli a1, a0, 1
+; CHECK-NEXT:    srli a0, a0, 2
+; CHECK-NEXT:    vsetvli a2, zero, e8, m1, ta, ma
+; CHECK-NEXT:    vslidedown.vx v6, v7, a1
+; CHECK-NEXT:    vslidedown.vx v5, v0, a1
+; CHECK-NEXT:    vsetvli a1, zero, e8, mf2, ta, ma
+; CHECK-NEXT:    vslidedown.vx v4, v7, a0
+; CHECK-NEXT:    vslidedown.vx v3, v0, a0
+; CHECK-NEXT:    vmv1r.v v0, v5
+; CHECK-NEXT:    vmv8r.v v8, v16
+; CHECK-NEXT:    vsetvli a1, zero, e32, m8, ta, ma
+; CHECK-NEXT:    vmerge.vim v24, v16, 1, v0
+; CHECK-NEXT:    vsetvli a1, zero, e8, mf2, ta, ma
+; CHECK-NEXT:    vslidedown.vx v0, v5, a0
+; CHECK-NEXT:    vslidedown.vx v5, v6, a0
+; CHECK-NEXT:    vsetvli a0, zero, e32, m8, ta, mu
+; CHECK-NEXT:    vmerge.vim v16, v16, 1, v0
+; CHECK-NEXT:    vmv1r.v v0, v3
+; CHECK-NEXT:    vmerge.vim v8, v8, 1, v0
+; CHECK-NEXT:    vmv1r.v v0, v4
+; CHECK-NEXT:    vadd.vi v8, v8, 1, v0.t
+; CHECK-NEXT:    vmv1r.v v0, v5
+; CHECK-NEXT:    vadd.vi v16, v16, 1, v0.t
+; CHECK-NEXT:    vadd.vv v8, v8, v16
+; CHECK-NEXT:    vmv1r.v v0, v6
+; CHECK-NEXT:    vadd.vi v24, v24, 1, v0.t
+; CHECK-NEXT:    vmv1r.v v0, v7
+; CHECK-NEXT:    addi a0, sp, 16
+; CHECK-NEXT:    vl8r.v v16, (a0) # Unknown-size Folded Reload
+; CHECK-NEXT:    vadd.vi v16, v16, 1, v0.t
+; CHECK-NEXT:    vadd.vv v16, v16, v24
+; CHECK-NEXT:    vadd.vv v8, v16, v8
+; CHECK-NEXT:    vmv.s.x v16, zero
+; CHECK-NEXT:    vredsum.vs v8, v8, v16
+; CHECK-NEXT:    vmv.x.s a0, v8
+; CHECK-NEXT:    csrr a1, vlenb
+; CHECK-NEXT:    slli a1, a1, 3
+; CHECK-NEXT:    add sp, sp, a1
+; CHECK-NEXT:    .cfi_def_cfa sp, 16
+; CHECK-NEXT:    addi sp, sp, 16
+; CHECK-NEXT:    .cfi_def_cfa_offset 0
+; CHECK-NEXT:    ret
+entry:
+  %a = zext <vscale x 128 x i1> %x to <vscale x 128 x i32>
+  %b = call i32 @llvm.vector.reduce.add.nxv128i32(<vscale x 128 x i32> %a)
+  ret i32 %b
+}
+
+define i32 @test_nxv256i1(<vscale x 256 x i1> %x) {
+; CHECK-LABEL: test_nxv256i1:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    addi sp, sp, -16
+; CHECK-NEXT:    .cfi_def_cfa_offset 16
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    mv a1, a0
+; CHECK-NEXT:    slli a0, a0, 4
+; CHECK-NEXT:    add a1, a1, a0
+; CHECK-NEXT:    slli a0, a0, 1
+; CHECK-NEXT:    add a0, a0, a1
+; CHECK-NEXT:    sub sp, sp, a0
+; CHECK-NEXT:    .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x31, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 49 * vlenb
+; CHECK-NEXT:    vsetvli a0, zero, e32, m8, ta, ma
+; CHECK-NEXT:    vmv1r.v v3, v10
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    slli a0, a0, 3
+; CHECK-NEXT:    mv a1, a0
+; CHECK-NEXT:    slli a0, a0, 1
+; CHECK-NEXT:    add a0, a0, a1
+; CHECK-NEXT:    add a0, sp, a0
+; CHECK-NEXT:    addi a0, a0, 16
+; CHECK-NEXT:    vs1r.v v9, (a0) # Unknown-size Folded Spill
+; CHECK-NEXT:    vmv.v.i v24, 0
+; CHECK-NEXT:    csrr a1, vlenb
+; CHECK-NEXT:    vmerge.vim v16, v24, 1, v0
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    mv a2, a0
+; CHECK-NEXT:    slli a0, a0, 3
+; CHECK-NEXT:    add a2, a2, a0
+; CHECK-NEXT:    slli a0, a0, 2
+; CHECK-NEXT:    add a0, a0, a2
+; CHECK-NEXT:    add a0, sp, a0
+; CHECK-NEXT:    addi a0, a0, 16
+; CHECK-NEXT:    vs8r.v v16, (a0) # Unknown-size Folded Spill
+; CHECK-NEXT:    srli a0, a1, 1
+; CHECK-NEXT:    srli a1, a1, 2
+; CHECK-NEXT:    vsetvli a2, zero, e8, m1, ta, ma
+; CHECK-NEXT:    vslidedown.vx v11, v0, a0
+; CHECK-NEXT:    vslidedown.vx v12, v8, a0
+; CHECK-NEXT:    vsetvli a2, zero, e8, mf2, ta, ma
+; CHECK-NEXT:    vslidedown.vx v2, v9, a1
+; CHECK-NEXT:    vslidedown.vx v4, v0, a1
+; CHECK-NEXT:    vslidedown.vx v1, v10, a1
+; CHECK-NEXT:    vslidedown.vx v7, v8, a1
+; CHECK-NEXT:    vslidedown.vx v6, v11, a1
+; CHECK-NEXT:    vslidedown.vx v5, v12, a1
+; CHECK-NEXT:    vmv1r.v v0, v8
+; CHECK-NEXT:    vmv8r.v v16, v24
+; CHECK-NEXT:    vsetvli a2, zero, e32, m8, ta, mu
+; CHECK-NEXT:    vmerge.vim v24, v24, 1, v0
+; CHECK-NEXT:    csrr a2, vlenb
+; CHECK-NEXT:    slli a3, a2, 5
+; CHECK-NEXT:    add a2, a3, a2
+; CHECK-NEXT:    add a2, sp, a2
+; CHECK-NEXT:    addi a2, a2, 16
+; CHECK-NEXT:    vs8r.v v24, (a2) # Unknown-size Folded Spill
+; CHECK-NEXT:    vmv1r.v v0, v11
+; CHECK-NEXT:    vmerge.vim v24, v16, 1, v0
+; CHECK-NEXT:    csrr a2, vlenb
+; CHECK-NEXT:    mv a3, a2
+; CHECK-NEXT:    slli a2, a2, 3
+; CHECK-NEXT:    add a3, a3, a2
+; CHECK-NEXT:    slli a2, a2, 1
+; CHECK-NEXT:    add a2, a2, a3
+; CHECK-NEXT:    add a2, sp, a2
+; CHECK-NEXT:    addi a2, a2, 16
+; CHECK-NEXT:    vs8r.v v24, (a2) # Unknown-size Folded Spill
+; CHECK-NEXT:    vmv1r.v v0, v12
+; CHECK-NEXT:    vmerge.vim v8, v16, 1, v0
+; CHECK-NEXT:    csrr a2, vlenb
+; CHECK-NEXT:    slli a2, a2, 4
+; CHECK-NEXT:    add a2, sp, a2
+; CHECK-NEXT:    addi a2, a2, 16
+; CHECK-NEXT:    vs8r.v v8, (a2) # Unknown-size Folded Spill
+; CHECK-NEXT:    vmv1r.v v0, v6
+; CHECK-NEXT:    vmerge.vim v8, v16, 1, v0
+; CHECK-NEXT:    addi a2, sp, 16
+; CHECK-NEXT:    vs8r.v v8, (a2) # Unknown-size Folded Spill
+; CHECK-NEXT:    vmv1r.v v0, v5
+; CHECK-NEXT:    vmerge.vim v24, v16, 1, v0
+; CHECK-NEXT:    vmv8r.v v8, v16
+; CHECK-NEXT:    vmv1r.v v0, v4
+; CHECK-NEXT:    vmerge.vim v16, v16, 1, v0
+; CHECK-NEXT:    vmv1r.v v0, v7
+; CHECK-NEXT:    vmerge.vim v8, v8, 1, v0
+; CHECK-NEXT:    vmv1r.v v0, v1
+; CHECK-NEXT:    vadd.vi v8, v8, 1, v0.t
+; CHECK-NEXT:    vmv1r.v v0, v2
+; CHECK-NEXT:    vadd.vi v16, v16, 1, v0.t
+; CHECK-NEXT:    vadd.vv v8, v16, v8
+; CHECK-NEXT:    csrr a2, vlenb
+; CHECK-NEXT:    slli a2, a2, 3
+; CHECK-NEXT:    add a2, sp, a2
+; CHECK-NEXT:    addi a2, a2, 16
+; CHECK-NEXT:    vs8r.v v8, (a2) # Unknown-size Folded Spill
+; CHECK-NEXT:    csrr a2, vlenb
+; CHECK-NEXT:    slli a2, a2, 3
+; CHECK-NEXT:    mv a3, a2
+; CHECK-NEXT:    slli a2, a2, 1
+; CHECK-NEXT:    add a2, a2, a3
+; CHECK-NEXT:    add a2, sp, a2
+; CHECK-NEXT:    addi a2, a2, 16
+; CHECK-NEXT:    vl1r.v v7, (a2) # Unknown-size Folded Reload
+; CHECK-NEXT:    vsetvli a2, zero, e8, m1, ta, ma
+; CHECK-NEXT:    vslidedown.vx v8, v7, a0
+; CHECK-NEXT:    vslidedown.vx v11, v3, a0
+; CHECK-NEXT:    vsetvli a0, zero, e8, mf2, ta, ma
+; CHECK-NEXT:    vslidedown.vx v0, v11, a1
+; CHECK-NEXT:    vslidedown.vx v12, v8, a1
+; CHECK-NEXT:    vsetvli a0, zero, e32, m8, ta, mu
+; CHECK-NEXT:    vadd.vi v24, v24, 1, v0.t
+; CHECK-NEXT:    vmv1r.v v0, v12
+; CHECK-NEXT:    addi a0, sp, 16
+; CHECK-NEXT:    vl8r.v v16, (a0) # Unknown-size Folded Reload
+; CHECK-NEXT:    vadd.vi v16, v16, 1, v0.t
+; CHECK-NEXT:    vadd.vv v16, v16, v24
+; CHECK-NEXT:    vmv1r.v v0, v11
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    slli a0, a0, 4
+; CHECK-NEXT:    add a0, sp, a0
+; CHECK-NEXT:    addi a0, a0, 16
+; CHECK-NEXT:    vl8r.v v24, (a0) # Unknown-size Folded Reload
+; CHECK-NEXT:    vadd.vi v24, v24, 1, v0.t
+; CHECK-NEXT:    vmv1r.v v0, v8
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    mv a1, a0
+; CHECK-NEXT:    slli a0, a0, 3
+; CHECK-NEXT:    add a1, a1, a0
+; CHECK-NEXT:    slli a0, a0, 1
+; CHECK-NEXT:    add a0, a0, a1
+; CHECK-NEXT:    add a0, sp, a0
+; CHECK-NEXT:    addi a0, a0, 16
+; CHECK-NEXT:    vl8r.v v8, (a0) # Unknown-size Folded Reload
+; CHECK-NEXT:    vadd.vi v8, v8, 1, v0.t
+; CHECK-NEXT:    vadd.vv v8, v8, v24
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    mv a1, a0
+; CHECK-NEXT:    slli a0, a0, 3
+; CHECK-NEXT:    add a1, a1, a0
+; CHECK-NEXT:    slli a0, a0, 1
+; CHECK-NEXT:    add a0, a0, a1
+; CHECK-NEXT:    add a0, sp, a0
+; CHECK-NEXT:    addi a0, a0, 16
+; CHECK-NEXT:    vs8r.v v8, (a0) # Unknown-size Folded Spill
+; CHECK-NEXT:    vmv1r.v v0, v3
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    slli a1, a0, 5
+; CHECK-NEXT:    add a0, a1, a0
+; CHECK-NEXT:    add a0, sp, a0
+; CHECK-NEXT:    addi a0, a0, 16
+; CHECK-NEXT:    vl8r.v v8, (a0) # Unknown-size Folded Reload
+; CHECK-NEXT:    vadd.vi v8, v8, 1, v0.t
+; CHECK-NEXT:    vmv1r.v v0, v7
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    mv a1, a0
+; CHECK-NEXT:    slli a0, a0, 3
+; CHECK-NEXT:    add a1, a1, a0
+; CHECK-NEXT:    slli a0, a0, 2
+; CHECK-NEXT:    add a0, a0, a1
+; CHECK-NEXT:    add a0, sp, a0
+; CHECK-NEXT:    addi a0, a0, 16
+; CHECK-NEXT:    vl8r.v v24, (a0) # Unknown-size Folded Reload
+; CHECK-NEXT:    vadd.vi v24, v24, 1, v0.t
+; CHECK-NEXT:    vadd.vv v24, v24, v8
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    slli a0, a0, 3
+; CHECK-NEXT:    add a0, sp, a0
+; CHECK-NEXT:    addi a0, a0, 16
+; CHECK-NEXT:    vl8r.v v8, (a0) # Unknown-size Folded Reload
+; CHECK-NEXT:    vadd.vv v16, v8, v16
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    mv a1, a0
+; CHECK-NEXT:    slli a0, a0, 3
+; CHECK-NEXT:    add a1, a1, a0
+; CHECK-NEXT:    slli a0, a0, 1
+; CHECK-NEXT:    add a0, a0, a1
+; CHECK-NEXT:    add a0, sp, a0
+; CHECK-NEXT:    addi a0, a0, 16
+; CHECK-NEXT:    vl8r.v v8, (a0) # Unknown-size Folded Reload
+; CHECK-NEXT:    vadd.vv v8, v24, v8
+; CHECK-NEXT:    vadd.vv v8, v8, v16
+; CHECK-NEXT:    vmv.s.x v16, zero
+; CHECK-NEXT:    vredsum.vs v8, v8, v16
+; CHECK-NEXT:    vmv.x.s a0, v8
+; CHECK-NEXT:    csrr a1, vlenb
+; CHECK-NEXT:    mv a2, a1
+; CHECK-NEXT:    slli a1, a1, 4
+; CHECK-NEXT:    add a2, a2, a1
+; CHECK-NEXT:    slli a1, a1, 1
+; CHECK-NEXT:    add a1, a1, a2
+; CHECK-NEXT:    add sp, sp, a1
+; CHECK-NEXT:    .cfi_def_cfa sp, 16
+; CHECK-NEXT:    addi sp, sp, 16
+; CHECK-NEXT:    .cfi_def_cfa_offset 0
+; CHECK-NEXT:    ret
+entry:
+  %a = zext <vscale x 256 x i1> %x to <vscale x 256 x i32>
+  %b = call i32 @llvm.vector.reduce.add.nxv256i32(<vscale x 256 x i32> %a)
+  ret i32 %b
+}
+;; NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
+; RV32: {{.*}}
+; RV64: {{.*}}

github-actions bot commented Feb 18, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

    Src = Src.getOperand(0);
  } else if (Opc == ISD::VECREDUCE_ADD) {
    if (Src.getOpcode() != ISD::ZERO_EXTEND)
      return SDValue();
Collaborator:

There's a subtle, nasty bug here.

Consider the case where the pattern is: (iN reduce.add (zext (vXi1 A to vXi4)))

If runtime VLENB is such that the number of mask bits is greater than 16, this is not equal to the vcpop, due to the wrapping behavior of the add reduce. You need to prove that the intermediate type is sufficiently wide to hold the element count of the mask source without overflow.
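
To make the wrapping concrete, a hedged sketch (hypothetical function name and numbers, not taken from the patch): with LLVM's RISC-V mapping of vscale = VLEN/64, a <vscale x 16 x i1> mask holds 32 elements at VLEN = 128, while an i4 accumulator wraps modulo 16.

  ; For an all-ones mask of 32 elements, the add reduction yields
  ; 32 mod 16 = 0, whereas vcpop.m would return 32.
  define i4 @wrapping_case(<vscale x 16 x i1> %x) {
    %a = zext <vscale x 16 x i1> %x to <vscale x 16 x i4>
    %b = call i4 @llvm.vector.reduce.add.nxv16i4(<vscale x 16 x i4> %a)
    ret i4 %b
  }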

Contributor Author (skachkov-sc):

Good catch, thank you! I've made a fix that estimates the maximum possible number of elements in the mask and ensures that the destination type is large enough to hold it. However, I've discovered that for this particular case (zext <16 x i1> to <16 x i4> + reduce.add) vcpop.m is still generated, because the type was extended to i8 during type legalization. This promotion can also be observed here: https://godbolt.org/z/avWEdcjvW, and it's confusing: given an all-ones mask as input, we should get a zero result due to the wrapping behavior of the add reduce, but the generated code returns 16 in this case. Also, I can't find any documentation on the overflow behavior of integer reductions (intrinsics or ISD nodes)...

Collaborator:

The return type of your function is declared as i4 without any attributes. Only the lowest 4 bits of the result are required to be valid. The upper bits can be any value. If you add a zeroext attribute to the return value an andi instruction will be generated to clear the upper bits.

define zeroext i4 @test_narrow_v16i1(<16 x i1> %x) {
entry:
    %a = zext <16 x i1> %x to <16 x i4>
    %b = call i4 @llvm.vector.reduce.add.v16i4(<16 x i4> %a)
    ret i4 %b
}

Type promotion makes it the responsibility of the consumer to zero- or sign-extend the upper bits if needed. With no attributes, the consumer is an any_extend, so the upper bits don't need to be touched. With zeroext, the consumer is a zero_extend, so the bits need to be cleared to match the semantics of the unpromoted zero_extend.

Contributor Author (skachkov-sc):

Makes sense now, thanks! Considering this, producing vcpop.m in test_narrow_v16i1 seems correct: for an all-ones mask, vcpop returns 16, whose lowest 4 bits are zero, so the lowest 4 bits correctly match the wrapped i4 sum.

  // Check that destination type is large enough to hold result without
  // overflow.
  if (Opc == ISD::VECREDUCE_ADD) {
    unsigned MaxNumElts = SrcMVT.getVectorMinNumElements();
Collaborator:

Can you use the computeVLMAX idiom for the scalable case, please?


  unsigned EltSize = SrcMVT.getScalarSizeInBits();
  unsigned MinSize = SrcMVT.getSizeInBits().getKnownMinValue();
  unsigned VectorBitsMax = Subtarget.getRealMaxVLen();
  unsigned MaxVLMAX =
      RISCVTargetLowering::computeVLMAX(VectorBitsMax, EltSize, MinSize);

Contributor Author (skachkov-sc):

Done

  unsigned EltSize = SrcMVT.getScalarSizeInBits();
  unsigned MinSize = SrcMVT.getSizeInBits().getKnownMinValue();
  unsigned VectorBitsMax = Subtarget.getRealMaxVLen();
  unsigned MaxVLMAX = SrcMVT.isFixedLengthVector()
Collaborator:

Is there a test for the fixed vector case?

Contributor Author (skachkov-sc):

Added (BTW, I think it will be a rare case in practice, because InstCombine converts zext + reduce.add on fixed vectors to bitcast + ctpop)
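
Presumably the added test looks something along these lines (a hypothetical sketch, not the verbatim test from the PR): i32 comfortably holds the element count of 16, so the overflow check passes and the sequence can lower to vcpop.m.

  define i32 @test_fixed_v16i1(<16 x i1> %x) {
  entry:
    ; Fixed-length case: the combine fires only when the result type
    ; can hold the full element count (here 16) without wrapping.
    %a = zext <16 x i1> %x to <16 x i32>
    %b = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> %a)
    ret i32 %b
  }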

@skachkov-sc (Contributor Author):

Ping

@lukel97 (Contributor) left a comment:

LGTM

@preames (Collaborator) left a comment:

LGTM

p.s. Sorry for letting this drop, I'd thought I'd responded to this after your update, but apparently hadn't.

@skachkov-sc force-pushed the combine-reduce-add-to-vcpop branch from a3f0397 to 18022ac on March 3, 2025 at 12:16
@skachkov-sc skachkov-sc merged commit 3dc7991 into llvm:main Mar 3, 2025
11 checks passed
jph-13 pushed a commit to jph-13/llvm-project that referenced this pull request Mar 21, 2025
[RISCV] Add DAG combine to convert (iN reduce.add (zext (vXi1 A to vXiN)) into vcpop.m (llvm#127497)

This patch combines (iN vector.reduce.add (zext (vXi1 A to vXiN))) into
the vcpop.m instruction (similarly to the bitcast + ctpop pattern). It can be
useful for counting the number of set bits in scalable vector types, which
can't be expressed with bitcast + ctpop (this was previously discussed
here: llvm#74294).