[RISCV] Add Tune to DontSinkSplatOperands #79199

Merged
michaelmaitland merged 5 commits into llvm:main from tune-done-sink-splat on Jan 25, 2024

Conversation

michaelmaitland (Contributor) commented Jan 23, 2024

A CPU may prefer not to sink splat operands, one reason being that it could require an S2V transfer buffer to move scalars into vectors.

This commit is stacked on #79015 and will be updated to mark the p670 with this Tune once it lands.

llvmbot (Member) commented Jan 23, 2024

@llvm/pr-subscribers-backend-risc-v

Author: Michael Maitland (michaelmaitland)

Changes

A CPU may prefer not to sink splat operands, one reason being that it could require an S2V transfer buffer to move scalars into vectors.

This is a precommit for #79015.
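
For illustration only (this sketch is not part of the patch, and exact register assignments may differ from real output), the inner loop of sink_splat_add looks roughly like this with and without the new tune. By default the splat is sunk into the loop and instruction selection folds it into a .vx form, which reads the scalar register every iteration; with +dont-sink-splat-operands the splat stays hoisted and a vector-vector add is used, so the scalar is moved into a vector register only once:

; Default (splat operand sunk, folded into vadd.vx inside the loop; vsetvli setup omitted):
.LBB0_1:                                # %vector.body
        vle32.v v8, (a0)
        vadd.vx v8, v8, a1
        vse32.v v8, (a0)
        addi    a0, a0, 16
        bne     a0, a2, .LBB0_1

; With +dont-sink-splat-operands (splat stays outside the loop, as in the test below):
        vmv.v.x v8, a1
.LBB0_1:                                # %vector.body
        vle32.v v9, (a0)
        vadd.vv v9, v9, v8
        vse32.v v9, (a0)
        addi    a0, a0, 16
        bne     a0, a1, .LBB0_1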


Full diff: https://github.com/llvm/llvm-project/pull/79199.diff

3 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVFeatures.td (+7)
  • (modified) llvm/lib/Target/RISCV/RISCVISelLowering.cpp (+8)
  • (added) llvm/test/CodeGen/RISCV/rvv/dont-sink-splat-operands.ll (+353)
diff --git a/llvm/lib/Target/RISCV/RISCVFeatures.td b/llvm/lib/Target/RISCV/RISCVFeatures.td
index cbb096ba20ae67..d9522eb5e2dbc8 100644
--- a/llvm/lib/Target/RISCV/RISCVFeatures.td
+++ b/llvm/lib/Target/RISCV/RISCVFeatures.td
@@ -1082,6 +1082,13 @@ def TuneShortForwardBranchOpt
 def HasShortForwardBranchOpt : Predicate<"Subtarget->hasShortForwardBranchOpt()">;
 def NoShortForwardBranchOpt : Predicate<"!Subtarget->hasShortForwardBranchOpt()">;
 
+// Some subtargets require a S2V transfer buffer to move scalars into vectors.
+// FIXME: Forming .vx/.vf can reduce register pressure.
+def TuneDontSinkSplatOperands
+    : SubtargetFeature<"dont-sink-splat-operands", "DontSinkSplatOperands",
+                       "true", "Don't sink splat operands to enable .vx or .vf "
+                       "instructions">;
+
 def TuneConditionalCompressedMoveFusion
     : SubtargetFeature<"conditional-cmv-fusion", "HasConditionalCompressedMoveFusion",
                        "true", "Enable branch+c.mv fusion">;
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index b41e2f40dc72f0..6737f1c1623890 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -2000,6 +2000,14 @@ bool RISCVTargetLowering::shouldSinkOperands(
   if (!I->getType()->isVectorTy() || !Subtarget.hasVInstructions())
     return false;
 
+  // Don't sink splat operands if the target prefers not to. Some targets
+  // require S2V transfer buffers and we can run out of them copying the same
+  // value repeatedly.
+  // FIXME: It could still be worth doing if it would improve vector register
+  // pressure and prevent a vector spill.
+  if (Subtarget.dontSinkSplatOperands())
+    return false;
+
   for (auto OpIdx : enumerate(I->operands())) {
     if (!canSplatOperand(I, OpIdx.index()))
       continue;
diff --git a/llvm/test/CodeGen/RISCV/rvv/dont-sink-splat-operands.ll b/llvm/test/CodeGen/RISCV/rvv/dont-sink-splat-operands.ll
new file mode 100644
index 00000000000000..38c1ee6a9c71a5
--- /dev/null
+++ b/llvm/test/CodeGen/RISCV/rvv/dont-sink-splat-operands.ll
@@ -0,0 +1,353 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc < %s -mtriple=riscv64 -mattr=+m,+v,+f -target-abi=lp64f \
+; RUN:   -mattr=+dont-sink-splat-operands -riscv-v-vector-bits-min=128 | FileCheck %s
+
+; Test that we don't sink splat operands when compiling with dont-sink-splat-operands.
+; Each scalar register access requires a S2V transfer buffer entry. Using too many
+; limits performance.
+; FIXME: This is potentially bad for register pressure. Need a better heuristic.
+
+define void @sink_splat_add(i32* nocapture %a, i32 signext %x) {
+; CHECK-LABEL: sink_splat_add:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    vsetivli zero, 4, e32, m1, ta, ma
+; CHECK-NEXT:    vmv.v.x v8, a1
+; CHECK-NEXT:    lui a1, 1
+; CHECK-NEXT:    add a1, a0, a1
+; CHECK-NEXT:  .LBB0_1: # %vector.body
+; CHECK-NEXT:    # =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    vle32.v v9, (a0)
+; CHECK-NEXT:    vadd.vv v9, v9, v8
+; CHECK-NEXT:    vse32.v v9, (a0)
+; CHECK-NEXT:    addi a0, a0, 16
+; CHECK-NEXT:    bne a0, a1, .LBB0_1
+; CHECK-NEXT:  # %bb.2: # %for.cond.cleanup
+; CHECK-NEXT:    ret
+entry:
+  %broadcast.splatinsert = insertelement <4 x i32> poison, i32 %x, i32 0
+  %broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> poison, <4 x i32> zeroinitializer
+  br label %vector.body
+
+vector.body:                                      ; preds = %vector.body, %entry
+  %index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
+  %0 = getelementptr inbounds i32, i32* %a, i64 %index
+  %1 = bitcast i32* %0 to <4 x i32>*
+  %wide.load = load <4 x i32>, <4 x i32>* %1, align 4
+  %2 = add <4 x i32> %wide.load, %broadcast.splat
+  %3 = bitcast i32* %0 to <4 x i32>*
+  store <4 x i32> %2, <4 x i32>* %3, align 4
+  %index.next = add nuw i64 %index, 4
+  %4 = icmp eq i64 %index.next, 1024
+  br i1 %4, label %for.cond.cleanup, label %vector.body
+
+for.cond.cleanup:                                 ; preds = %vector.body
+  ret void
+}
+
+declare i64 @llvm.vscale.i64()
+
+define void @sink_splat_add_scalable(i32* nocapture %a, i32 signext %x) {
+; CHECK-LABEL: sink_splat_add_scalable:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    csrr a5, vlenb
+; CHECK-NEXT:    srli a2, a5, 1
+; CHECK-NEXT:    li a3, 1024
+; CHECK-NEXT:    bgeu a3, a2, .LBB1_2
+; CHECK-NEXT:  # %bb.1:
+; CHECK-NEXT:    li a3, 0
+; CHECK-NEXT:    j .LBB1_5
+; CHECK-NEXT:  .LBB1_2: # %vector.ph
+; CHECK-NEXT:    addi a3, a2, -1
+; CHECK-NEXT:    andi a4, a3, 1024
+; CHECK-NEXT:    xori a3, a4, 1024
+; CHECK-NEXT:    vsetvli a6, zero, e32, m2, ta, ma
+; CHECK-NEXT:    vmv.v.x v8, a1
+; CHECK-NEXT:    slli a5, a5, 1
+; CHECK-NEXT:    mv a6, a0
+; CHECK-NEXT:    mv a7, a3
+; CHECK-NEXT:  .LBB1_3: # %vector.body
+; CHECK-NEXT:    # =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    vl2re32.v v10, (a6)
+; CHECK-NEXT:    vadd.vv v10, v10, v8
+; CHECK-NEXT:    vs2r.v v10, (a6)
+; CHECK-NEXT:    sub a7, a7, a2
+; CHECK-NEXT:    add a6, a6, a5
+; CHECK-NEXT:    bnez a7, .LBB1_3
+; CHECK-NEXT:  # %bb.4: # %middle.block
+; CHECK-NEXT:    beqz a4, .LBB1_7
+; CHECK-NEXT:  .LBB1_5: # %for.body.preheader
+; CHECK-NEXT:    slli a2, a3, 2
+; CHECK-NEXT:    add a2, a0, a2
+; CHECK-NEXT:    lui a3, 1
+; CHECK-NEXT:    add a0, a0, a3
+; CHECK-NEXT:  .LBB1_6: # %for.body
+; CHECK-NEXT:    # =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    lw a3, 0(a2)
+; CHECK-NEXT:    add a3, a3, a1
+; CHECK-NEXT:    sw a3, 0(a2)
+; CHECK-NEXT:    addi a2, a2, 4
+; CHECK-NEXT:    bne a2, a0, .LBB1_6
+; CHECK-NEXT:  .LBB1_7: # %for.cond.cleanup
+; CHECK-NEXT:    ret
+entry:
+  %0 = call i64 @llvm.vscale.i64()
+  %1 = shl i64 %0, 2
+  %min.iters.check = icmp ugt i64 %1, 1024
+  br i1 %min.iters.check, label %for.body.preheader, label %vector.ph
+
+vector.ph:                                        ; preds = %entry
+  %2 = call i64 @llvm.vscale.i64()
+  %3 = shl i64 %2, 2
+  %n.mod.vf = urem i64 1024, %3
+  %n.vec = sub nsw i64 1024, %n.mod.vf
+  %broadcast.splatinsert = insertelement <vscale x 4 x i32> poison, i32 %x, i32 0
+  %broadcast.splat = shufflevector <vscale x 4 x i32> %broadcast.splatinsert, <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
+  %4 = call i64 @llvm.vscale.i64()
+  %5 = shl i64 %4, 2
+  br label %vector.body
+
+vector.body:                                      ; preds = %vector.body, %vector.ph
+  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
+  %6 = getelementptr inbounds i32, i32* %a, i64 %index
+  %7 = bitcast i32* %6 to <vscale x 4 x i32>*
+  %wide.load = load <vscale x 4 x i32>, <vscale x 4 x i32>* %7, align 4
+  %8 = add <vscale x 4 x i32> %wide.load, %broadcast.splat
+  %9 = bitcast i32* %6 to <vscale x 4 x i32>*
+  store <vscale x 4 x i32> %8, <vscale x 4 x i32>* %9, align 4
+  %index.next = add nuw i64 %index, %5
+  %10 = icmp eq i64 %index.next, %n.vec
+  br i1 %10, label %middle.block, label %vector.body
+
+middle.block:                                     ; preds = %vector.body
+  %cmp.n = icmp eq i64 %n.mod.vf, 0
+  br i1 %cmp.n, label %for.cond.cleanup, label %for.body.preheader
+
+for.body.preheader:                               ; preds = %entry, %middle.block
+  %indvars.iv.ph = phi i64 [ 0, %entry ], [ %n.vec, %middle.block ]
+  br label %for.body
+
+for.cond.cleanup:                                 ; preds = %for.body, %middle.block
+  ret void
+
+for.body:                                         ; preds = %for.body.preheader, %for.body
+  %indvars.iv = phi i64 [ %indvars.iv.next, %for.body ], [ %indvars.iv.ph, %for.body.preheader ]
+  %arrayidx = getelementptr inbounds i32, i32* %a, i64 %indvars.iv
+  %11 = load i32, i32* %arrayidx, align 4
+  %add = add i32 %11, %x
+  store i32 %add, i32* %arrayidx, align 4
+  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+  %cmp.not = icmp eq i64 %indvars.iv.next, 1024
+  br i1 %cmp.not, label %for.cond.cleanup, label %for.body
+}
+
+declare <4 x i32> @llvm.vp.add.v4i32(<4 x i32>, <4 x i32>, <4 x i1>, i32)
+
+define void @sink_splat_vp_add(i32* nocapture %a, i32 signext %x, <4 x i1> %m, i32 zeroext %vl) {
+; CHECK-LABEL: sink_splat_vp_add:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    vsetivli zero, 4, e32, m1, ta, ma
+; CHECK-NEXT:    vmv.v.x v8, a1
+; CHECK-NEXT:    lui a1, 1
+; CHECK-NEXT:    add a1, a0, a1
+; CHECK-NEXT:  .LBB2_1: # %vector.body
+; CHECK-NEXT:    # =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    vle32.v v9, (a0)
+; CHECK-NEXT:    vsetvli zero, a2, e32, m1, ta, ma
+; CHECK-NEXT:    vadd.vv v9, v9, v8, v0.t
+; CHECK-NEXT:    vsetivli zero, 4, e32, m1, ta, ma
+; CHECK-NEXT:    vse32.v v9, (a0)
+; CHECK-NEXT:    addi a0, a0, 16
+; CHECK-NEXT:    bne a0, a1, .LBB2_1
+; CHECK-NEXT:  # %bb.2: # %for.cond.cleanup
+; CHECK-NEXT:    ret
+entry:
+  %broadcast.splatinsert = insertelement <4 x i32> poison, i32 %x, i32 0
+  %broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> poison, <4 x i32> zeroinitializer
+  br label %vector.body
+
+vector.body:                                      ; preds = %vector.body, %entry
+  %index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
+  %0 = getelementptr inbounds i32, i32* %a, i64 %index
+  %1 = bitcast i32* %0 to <4 x i32>*
+  %wide.load = load <4 x i32>, <4 x i32>* %1, align 4
+  %2 = call <4 x i32> @llvm.vp.add.v4i32(<4 x i32> %wide.load, <4 x i32> %broadcast.splat, <4 x i1> %m, i32 %vl)
+  %3 = bitcast i32* %0 to <4 x i32>*
+  store <4 x i32> %2, <4 x i32>* %3, align 4
+  %index.next = add nuw i64 %index, 4
+  %4 = icmp eq i64 %index.next, 1024
+  br i1 %4, label %for.cond.cleanup, label %vector.body
+
+for.cond.cleanup:                                 ; preds = %vector.body
+  ret void
+}
+
+define void @sink_splat_fadd(float* nocapture %a, float %x) {
+; CHECK-LABEL: sink_splat_fadd:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    vsetivli zero, 4, e32, m1, ta, ma
+; CHECK-NEXT:    vfmv.v.f v8, fa0
+; CHECK-NEXT:    lui a1, 1
+; CHECK-NEXT:    add a1, a0, a1
+; CHECK-NEXT:  .LBB3_1: # %vector.body
+; CHECK-NEXT:    # =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    vle32.v v9, (a0)
+; CHECK-NEXT:    vfadd.vv v9, v9, v8
+; CHECK-NEXT:    vse32.v v9, (a0)
+; CHECK-NEXT:    addi a0, a0, 16
+; CHECK-NEXT:    bne a0, a1, .LBB3_1
+; CHECK-NEXT:  # %bb.2: # %for.cond.cleanup
+; CHECK-NEXT:    ret
+entry:
+  %broadcast.splatinsert = insertelement <4 x float> poison, float %x, i32 0
+  %broadcast.splat = shufflevector <4 x float> %broadcast.splatinsert, <4 x float> poison, <4 x i32> zeroinitializer
+  br label %vector.body
+
+vector.body:                                      ; preds = %vector.body, %entry
+  %index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
+  %0 = getelementptr inbounds float, float* %a, i64 %index
+  %1 = bitcast float* %0 to <4 x float>*
+  %wide.load = load <4 x float>, <4 x float>* %1, align 4
+  %2 = fadd <4 x float> %wide.load, %broadcast.splat
+  %3 = bitcast float* %0 to <4 x float>*
+  store <4 x float> %2, <4 x float>* %3, align 4
+  %index.next = add nuw i64 %index, 4
+  %4 = icmp eq i64 %index.next, 1024
+  br i1 %4, label %for.cond.cleanup, label %vector.body
+
+for.cond.cleanup:                                 ; preds = %vector.body
+  ret void
+}
+
+define void @sink_splat_fadd_scalable(float* nocapture %a, float %x) {
+; CHECK-LABEL: sink_splat_fadd_scalable:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    csrr a1, vlenb
+; CHECK-NEXT:    srli a2, a1, 2
+; CHECK-NEXT:    li a3, 1024
+; CHECK-NEXT:    bgeu a3, a2, .LBB4_2
+; CHECK-NEXT:  # %bb.1:
+; CHECK-NEXT:    li a3, 0
+; CHECK-NEXT:    j .LBB4_5
+; CHECK-NEXT:  .LBB4_2: # %vector.ph
+; CHECK-NEXT:    addi a3, a2, -1
+; CHECK-NEXT:    andi a4, a3, 1024
+; CHECK-NEXT:    xori a3, a4, 1024
+; CHECK-NEXT:    vsetvli a5, zero, e32, m1, ta, ma
+; CHECK-NEXT:    vfmv.v.f v8, fa0
+; CHECK-NEXT:    mv a5, a0
+; CHECK-NEXT:    mv a6, a3
+; CHECK-NEXT:  .LBB4_3: # %vector.body
+; CHECK-NEXT:    # =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    vl1re32.v v9, (a5)
+; CHECK-NEXT:    vfadd.vv v9, v9, v8
+; CHECK-NEXT:    vs1r.v v9, (a5)
+; CHECK-NEXT:    sub a6, a6, a2
+; CHECK-NEXT:    add a5, a5, a1
+; CHECK-NEXT:    bnez a6, .LBB4_3
+; CHECK-NEXT:  # %bb.4: # %middle.block
+; CHECK-NEXT:    beqz a4, .LBB4_7
+; CHECK-NEXT:  .LBB4_5: # %for.body.preheader
+; CHECK-NEXT:    slli a1, a3, 2
+; CHECK-NEXT:    add a1, a0, a1
+; CHECK-NEXT:    lui a2, 1
+; CHECK-NEXT:    add a0, a0, a2
+; CHECK-NEXT:  .LBB4_6: # %for.body
+; CHECK-NEXT:    # =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    flw fa5, 0(a1)
+; CHECK-NEXT:    fadd.s fa5, fa5, fa0
+; CHECK-NEXT:    fsw fa5, 0(a1)
+; CHECK-NEXT:    addi a1, a1, 4
+; CHECK-NEXT:    bne a1, a0, .LBB4_6
+; CHECK-NEXT:  .LBB4_7: # %for.cond.cleanup
+; CHECK-NEXT:    ret
+entry:
+  %0 = call i64 @llvm.vscale.i64()
+  %1 = shl i64 %0, 1
+  %min.iters.check = icmp ugt i64 %1, 1024
+  br i1 %min.iters.check, label %for.body.preheader, label %vector.ph
+
+vector.ph:                                        ; preds = %entry
+  %2 = call i64 @llvm.vscale.i64()
+  %3 = shl i64 %2, 1
+  %n.mod.vf = urem i64 1024, %3
+  %n.vec = sub nsw i64 1024, %n.mod.vf
+  %broadcast.splatinsert = insertelement <vscale x 2 x float> poison, float %x, i32 0
+  %broadcast.splat = shufflevector <vscale x 2 x float> %broadcast.splatinsert, <vscale x 2 x float> poison, <vscale x 2 x i32> zeroinitializer
+  %4 = call i64 @llvm.vscale.i64()
+  %5 = shl i64 %4, 1
+  br label %vector.body
+
+vector.body:                                      ; preds = %vector.body, %vector.ph
+  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
+  %6 = getelementptr inbounds float, float* %a, i64 %index
+  %7 = bitcast float* %6 to <vscale x 2 x float>*
+  %wide.load = load <vscale x 2 x float>, <vscale x 2 x float>* %7, align 4
+  %8 = fadd <vscale x 2 x float> %wide.load, %broadcast.splat
+  %9 = bitcast float* %6 to <vscale x 2 x float>*
+  store <vscale x 2 x float> %8, <vscale x 2 x float>* %9, align 4
+  %index.next = add nuw i64 %index, %5
+  %10 = icmp eq i64 %index.next, %n.vec
+  br i1 %10, label %middle.block, label %vector.body
+
+middle.block:                                     ; preds = %vector.body
+  %cmp.n = icmp eq i64 %n.mod.vf, 0
+  br i1 %cmp.n, label %for.cond.cleanup, label %for.body.preheader
+
+for.body.preheader:                               ; preds = %entry, %middle.block
+  %indvars.iv.ph = phi i64 [ 0, %entry ], [ %n.vec, %middle.block ]
+  br label %for.body
+
+for.cond.cleanup:                                 ; preds = %for.body, %middle.block
+  ret void
+
+for.body:                                         ; preds = %for.body.preheader, %for.body
+  %indvars.iv = phi i64 [ %indvars.iv.next, %for.body ], [ %indvars.iv.ph, %for.body.preheader ]
+  %arrayidx = getelementptr inbounds float, float* %a, i64 %indvars.iv
+  %11 = load float, float* %arrayidx, align 4
+  %mul = fadd float %11, %x
+  store float %mul, float* %arrayidx, align 4
+  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+  %cmp.not = icmp eq i64 %indvars.iv.next, 1024
+  br i1 %cmp.not, label %for.cond.cleanup, label %for.body
+}
+
+declare <4 x float> @llvm.vp.fadd.v4i32(<4 x float>, <4 x float>, <4 x i1>, i32)
+
+define void @sink_splat_vp_fadd(float* nocapture %a, float %x, <4 x i1> %m, i32 zeroext %vl) {
+; CHECK-LABEL: sink_splat_vp_fadd:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    vsetivli zero, 4, e32, m1, ta, ma
+; CHECK-NEXT:    vfmv.v.f v8, fa0
+; CHECK-NEXT:    lui a2, 1
+; CHECK-NEXT:    add a2, a0, a2
+; CHECK-NEXT:  .LBB5_1: # %vector.body
+; CHECK-NEXT:    # =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    vle32.v v9, (a0)
+; CHECK-NEXT:    vsetvli zero, a1, e32, m1, ta, ma
+; CHECK-NEXT:    vfadd.vv v9, v9, v8, v0.t
+; CHECK-NEXT:    vsetivli zero, 4, e32, m1, ta, ma
+; CHECK-NEXT:    vse32.v v9, (a0)
+; CHECK-NEXT:    addi a0, a0, 16
+; CHECK-NEXT:    bne a0, a2, .LBB5_1
+; CHECK-NEXT:  # %bb.2: # %for.cond.cleanup
+; CHECK-NEXT:    ret
+entry:
+  %broadcast.splatinsert = insertelement <4 x float> poison, float %x, i32 0
+  %broadcast.splat = shufflevector <4 x float> %broadcast.splatinsert, <4 x float> poison, <4 x i32> zeroinitializer
+  br label %vector.body
+
+vector.body:                                      ; preds = %vector.body, %entry
+  %index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
+  %0 = getelementptr inbounds float, float* %a, i64 %index
+  %1 = bitcast float* %0 to <4 x float>*
+  %wide.load = load <4 x float>, <4 x float>* %1, align 4
+  %2 = call <4 x float> @llvm.vp.fadd.v4i32(<4 x float> %wide.load, <4 x float> %broadcast.splat, <4 x i1> %m, i32 %vl)
+  %3 = bitcast float* %0 to <4 x float>*
+  store <4 x float> %2, <4 x float>* %3, align 4
+  %index.next = add nuw i64 %index, 4
+  %4 = icmp eq i64 %index.next, 1024
+  br i1 %4, label %for.cond.cleanup, label %vector.body
+
+for.cond.cleanup:                                 ; preds = %vector.body
+  ret void
+}

@@ -0,0 +1,353 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc < %s -mtriple=riscv64 -mattr=+m,+v,+f -target-abi=lp64f \
; RUN: -mattr=+dont-sink-splat-operands -riscv-v-vector-bits-min=128 | FileCheck %s
Contributor:

Add a RUN without +dont-sink-splat-operands?

// repeatedly.
// FIXME: It could still be worth doing if it would improve vector register
// pressure and prevent a vector spill.
if (Subtarget.dontSinkSplatOperands())
Contributor:

Is this the only place that controls the generation of sunk V instructions?

Collaborator:

The middle-end LICM pass will usually hoist splats of the following form out of loops.

%a = insertelement <vscale x 2 x i64> poison, i64 0, i32 0
%b = shufflevector <vscale x 2 x i64> %a, <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer

The code in this function tries to sink these splats back into loops if doing so would enable a .vx/.vf/.wx/.wf instruction to be formed. This tuning flag disables the sinking and leaves the splats outside the loop the way LICM left them.

This is a simple way to prevent many cases of .vx/.vf/.wx/.wf instructions in loops. But as noted in the FIXMEs, it is bad for register pressure. We've been trying to write an optimization pass to do this better, but we don't have anything that gives good results yet.
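
For context, a rough, hypothetical IR-level sketch of what the sinking does (simplified; the %ins.sunk/%splat.sunk names are made up for illustration). shouldSinkOperands tells CodeGenPrepare to sink copies of the splat instructions into the block of their user, so that instruction selection, which works one basic block at a time, sees the splat and its use together and can match the .vx form:

; Before sinking (as LICM leaves it):
entry:
  %ins = insertelement <vscale x 2 x i64> poison, i64 %x, i32 0
  %splat = shufflevector <vscale x 2 x i64> %ins, <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer
  br label %loop

loop:
  %v = load <vscale x 2 x i64>, <vscale x 2 x i64>* %p
  %add = add <vscale x 2 x i64> %v, %splat
  ; ...

; After sinking (splat duplicated next to its user inside the loop):
loop:
  %v = load <vscale x 2 x i64>, <vscale x 2 x i64>* %p
  %ins.sunk = insertelement <vscale x 2 x i64> poison, i64 %x, i32 0
  %splat.sunk = shufflevector <vscale x 2 x i64> %ins.sunk, <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer
  %add = add <vscale x 2 x i64> %v, %splat.sunk
  ; ...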

// FIXME: Forming .vx/.vf can reduce register pressure.
def TuneDontSinkSplatOperands
: SubtargetFeature<"dont-sink-splat-operands", "DontSinkSplatOperands",
"true", "Don't sink splat operands to enable .vx or .vf "
Contributor:

.wx and .wf?

Contributor (Author):

Updated.

wangpc-pp (Contributor) left a comment

LGTM if comment is addressed.

// Some subtargets require a S2V transfer buffer to move scalars into vectors.
// FIXME: Forming .vx/.vf/.wx/.wf can reduce register pressure.
def TuneDontSinkSplatOperands
: SubtargetFeature<"dont-sink-splat-operands", "DontSinkSplatOperands",
Contributor:

DontSinkSplatOperands -> SinkSplatOperands, and then change "true" to "false"?
This would match other SubtargetFeatures that disable features.

michaelmaitland (Contributor, Author) commented Jan 25, 2024

It looks like TuneNoDefaultUnroll, TuneNoOptimizedZeroStride, and FeatureNoRVCHints disable features and contain the No string, similar to Dont. I don't see any other SubtargetFeatures that use false without the No string. Are you sure your suggestion matches how we disable features?

I made an attempt to implement your suggestion, and here's what I found:

If we had SinkSplatOperands and false, then we would want to have

if (!Subtarget.sinkSplatOperands())
  return false;

since

if (Subtarget.sinkSplatOperands())
  return true;

is not the desired behavior: a true value for this tuning does not mean that splat operands must be sunk. Instead, a true value means the rest of the logic in that function should determine whether to sink the splat operands. The desired behavior is that if sinkSplatOperands is false, then we never sink splat operands.

Now let's consider the three command-line options:

  1. +sink-splat-operands
  2. -sink-splat-operands
  3. Tuning not specified

If we had SinkSplatOperands with a value of false, then only the "tuning not specified" case works as expected.
If we had SinkSplatOperands with a value of true, then +sink-splat-operands and -sink-splat-operands work correctly, but the "tuning not specified" case does not sink when it should.

Am I misunderstanding something? Based on my understanding, I think the tuning is in line with other tunings that try to disable features. The one refinement I can think of is that we can rename to TuneNoSinkSplatOperands since the other features use the No instead of Dont naming scheme.

topperc (Collaborator) commented Jan 25, 2024

The suggestion is something like

def TuneNoSinkSplatOperands : SubtargetFeature<"no-sink-splat-operands", "SinkSplatOperands", "false"

Notice the command-line option has "No", but the Subtarget field does not.

The use of "false" will make "SinkSplatOperands" default to true, and passing -mattr=+no-sink-splat-operands will set it to false.
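
For reference, a minimal sketch of what the suggested renaming could look like (an illustration of the suggestion only, not necessarily the exact text that landed; the description string is a guess):

def TuneNoSinkSplatOperands
    : SubtargetFeature<"no-sink-splat-operands", "SinkSplatOperands",
                       "false", "Disable sinking of splat operands that would "
                       "enable .vx, .vf, .wx, and .wf instructions">;

The check in shouldSinkOperands then uses the negated form already quoted above, i.e. if (!Subtarget.sinkSplatOperands()) return false;.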

Contributor (Author):

Thanks for the explanation. Updated.

; FIXME: This is potentially bad for register pressure. Need a better heuristic.

define void @sink_splat_add(i32* nocapture %a, i32 signext %x) {
; DONT-SINK-LABEL: sink_splat_add:
Collaborator:

DONT -> NO

@michaelmaitland michaelmaitland merged commit 594b92a into llvm:main Jan 25, 2024
@michaelmaitland michaelmaitland deleted the tune-done-sink-splat branch January 25, 2024 19:44