[RISCV] Add Tune to DontSinkSplatOperands #79199
Conversation
A CPU may prefer not to sink splat operands, one reason being that it could require an S2V transfer buffer to move scalars into vectors. This is a precommit for llvm#79015.
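For context, here is a minimal sketch (not part of the patch; names are illustrative) of the kind of loop this tune affects. After LICM, the splat of %x sits outside the loop; by default, RISCVTargetLowering::shouldSinkOperands copies it back into the loop so instruction selection can fold it into a vadd.vx that reads %x from a scalar register on each iteration. With +dont-sink-splat-operands the splat stays hoisted, a single vmv.v.x materializes it once, and the loop keeps a vector-vector add:

define void @hoisted_splat_add(i32* nocapture %a, i32 signext %x) {
entry:
  ; Splat of %x hoisted out of the loop by LICM.
  %ins = insertelement <4 x i32> poison, i32 %x, i32 0
  %splat = shufflevector <4 x i32> %ins, <4 x i32> poison, <4 x i32> zeroinitializer
  br label %loop

loop:                                             ; preds = %loop, %entry
  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
  %p = getelementptr inbounds i32, i32* %a, i64 %iv
  %vp = bitcast i32* %p to <4 x i32>*
  %v = load <4 x i32>, <4 x i32>* %vp, align 4
  %sum = add <4 x i32> %v, %splat   ; with the tune enabled this stays a vadd.vv
  store <4 x i32> %sum, <4 x i32>* %vp, align 4
  %iv.next = add nuw i64 %iv, 4
  %done = icmp eq i64 %iv.next, 1024
  br i1 %done, label %exit, label %loop

exit:                                             ; preds = %loop
  ret void
}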
@llvm/pr-subscribers-backend-risc-v
Author: Michael Maitland (michaelmaitland)
Full diff: https://github.com/llvm/llvm-project/pull/79199.diff
3 Files Affected:
diff --git a/llvm/lib/Target/RISCV/RISCVFeatures.td b/llvm/lib/Target/RISCV/RISCVFeatures.td
index cbb096ba20ae67..d9522eb5e2dbc8 100644
--- a/llvm/lib/Target/RISCV/RISCVFeatures.td
+++ b/llvm/lib/Target/RISCV/RISCVFeatures.td
@@ -1082,6 +1082,13 @@ def TuneShortForwardBranchOpt
def HasShortForwardBranchOpt : Predicate<"Subtarget->hasShortForwardBranchOpt()">;
def NoShortForwardBranchOpt : Predicate<"!Subtarget->hasShortForwardBranchOpt()">;
+// Some subtargets require a S2V transfer buffer to move scalars into vectors.
+// FIXME: Forming .vx/.vf can reduce register pressure.
+def TuneDontSinkSplatOperands
+ : SubtargetFeature<"dont-sink-splat-operands", "DontSinkSplatOperands",
+ "true", "Don't sink splat operands to enable .vx or .vf "
+ "instructions">;
+
def TuneConditionalCompressedMoveFusion
: SubtargetFeature<"conditional-cmv-fusion", "HasConditionalCompressedMoveFusion",
"true", "Enable branch+c.mv fusion">;
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index b41e2f40dc72f0..6737f1c1623890 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -2000,6 +2000,14 @@ bool RISCVTargetLowering::shouldSinkOperands(
if (!I->getType()->isVectorTy() || !Subtarget.hasVInstructions())
return false;
+  // Don't sink splat operands if the target prefers not to. Some targets require
+ // S2V transfer buffers and we can run out of them copying the same value
+ // repeatedly.
+ // FIXME: It could still be worth doing if it would improve vector register
+ // pressure and prevent a vector spill.
+ if (Subtarget.dontSinkSplatOperands())
+ return false;
+
for (auto OpIdx : enumerate(I->operands())) {
if (!canSplatOperand(I, OpIdx.index()))
continue;
diff --git a/llvm/test/CodeGen/RISCV/rvv/dont-sink-splat-operands.ll b/llvm/test/CodeGen/RISCV/rvv/dont-sink-splat-operands.ll
new file mode 100644
index 00000000000000..38c1ee6a9c71a5
--- /dev/null
+++ b/llvm/test/CodeGen/RISCV/rvv/dont-sink-splat-operands.ll
@@ -0,0 +1,353 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc < %s -mtriple=riscv64 -mattr=+m,+v,+f -target-abi=lp64f \
+; RUN: -mattr=+dont-sink-splat-operands -riscv-v-vector-bits-min=128 | FileCheck %s
+
+; Test that we don't sink splat operands when compiling with dont-sink-splat-operands.
+; Each scalar register access requires a S2V transfer buffer entry. Using too many
+; limits performance.
+; FIXME: This is potentially bad for register pressure. Need a better heuristic.
+
+define void @sink_splat_add(i32* nocapture %a, i32 signext %x) {
+; CHECK-LABEL: sink_splat_add:
+; CHECK: # %bb.0: # %entry
+; CHECK-NEXT: vsetivli zero, 4, e32, m1, ta, ma
+; CHECK-NEXT: vmv.v.x v8, a1
+; CHECK-NEXT: lui a1, 1
+; CHECK-NEXT: add a1, a0, a1
+; CHECK-NEXT: .LBB0_1: # %vector.body
+; CHECK-NEXT: # =>This Inner Loop Header: Depth=1
+; CHECK-NEXT: vle32.v v9, (a0)
+; CHECK-NEXT: vadd.vv v9, v9, v8
+; CHECK-NEXT: vse32.v v9, (a0)
+; CHECK-NEXT: addi a0, a0, 16
+; CHECK-NEXT: bne a0, a1, .LBB0_1
+; CHECK-NEXT: # %bb.2: # %for.cond.cleanup
+; CHECK-NEXT: ret
+entry:
+ %broadcast.splatinsert = insertelement <4 x i32> poison, i32 %x, i32 0
+ %broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> poison, <4 x i32> zeroinitializer
+ br label %vector.body
+
+vector.body: ; preds = %vector.body, %entry
+ %index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
+ %0 = getelementptr inbounds i32, i32* %a, i64 %index
+ %1 = bitcast i32* %0 to <4 x i32>*
+ %wide.load = load <4 x i32>, <4 x i32>* %1, align 4
+ %2 = add <4 x i32> %wide.load, %broadcast.splat
+ %3 = bitcast i32* %0 to <4 x i32>*
+ store <4 x i32> %2, <4 x i32>* %3, align 4
+ %index.next = add nuw i64 %index, 4
+ %4 = icmp eq i64 %index.next, 1024
+ br i1 %4, label %for.cond.cleanup, label %vector.body
+
+for.cond.cleanup: ; preds = %vector.body
+ ret void
+}
+
+declare i64 @llvm.vscale.i64()
+
+define void @sink_splat_add_scalable(i32* nocapture %a, i32 signext %x) {
+; CHECK-LABEL: sink_splat_add_scalable:
+; CHECK: # %bb.0: # %entry
+; CHECK-NEXT: csrr a5, vlenb
+; CHECK-NEXT: srli a2, a5, 1
+; CHECK-NEXT: li a3, 1024
+; CHECK-NEXT: bgeu a3, a2, .LBB1_2
+; CHECK-NEXT: # %bb.1:
+; CHECK-NEXT: li a3, 0
+; CHECK-NEXT: j .LBB1_5
+; CHECK-NEXT: .LBB1_2: # %vector.ph
+; CHECK-NEXT: addi a3, a2, -1
+; CHECK-NEXT: andi a4, a3, 1024
+; CHECK-NEXT: xori a3, a4, 1024
+; CHECK-NEXT: vsetvli a6, zero, e32, m2, ta, ma
+; CHECK-NEXT: vmv.v.x v8, a1
+; CHECK-NEXT: slli a5, a5, 1
+; CHECK-NEXT: mv a6, a0
+; CHECK-NEXT: mv a7, a3
+; CHECK-NEXT: .LBB1_3: # %vector.body
+; CHECK-NEXT: # =>This Inner Loop Header: Depth=1
+; CHECK-NEXT: vl2re32.v v10, (a6)
+; CHECK-NEXT: vadd.vv v10, v10, v8
+; CHECK-NEXT: vs2r.v v10, (a6)
+; CHECK-NEXT: sub a7, a7, a2
+; CHECK-NEXT: add a6, a6, a5
+; CHECK-NEXT: bnez a7, .LBB1_3
+; CHECK-NEXT: # %bb.4: # %middle.block
+; CHECK-NEXT: beqz a4, .LBB1_7
+; CHECK-NEXT: .LBB1_5: # %for.body.preheader
+; CHECK-NEXT: slli a2, a3, 2
+; CHECK-NEXT: add a2, a0, a2
+; CHECK-NEXT: lui a3, 1
+; CHECK-NEXT: add a0, a0, a3
+; CHECK-NEXT: .LBB1_6: # %for.body
+; CHECK-NEXT: # =>This Inner Loop Header: Depth=1
+; CHECK-NEXT: lw a3, 0(a2)
+; CHECK-NEXT: add a3, a3, a1
+; CHECK-NEXT: sw a3, 0(a2)
+; CHECK-NEXT: addi a2, a2, 4
+; CHECK-NEXT: bne a2, a0, .LBB1_6
+; CHECK-NEXT: .LBB1_7: # %for.cond.cleanup
+; CHECK-NEXT: ret
+entry:
+ %0 = call i64 @llvm.vscale.i64()
+ %1 = shl i64 %0, 2
+ %min.iters.check = icmp ugt i64 %1, 1024
+ br i1 %min.iters.check, label %for.body.preheader, label %vector.ph
+
+vector.ph: ; preds = %entry
+ %2 = call i64 @llvm.vscale.i64()
+ %3 = shl i64 %2, 2
+ %n.mod.vf = urem i64 1024, %3
+ %n.vec = sub nsw i64 1024, %n.mod.vf
+ %broadcast.splatinsert = insertelement <vscale x 4 x i32> poison, i32 %x, i32 0
+ %broadcast.splat = shufflevector <vscale x 4 x i32> %broadcast.splatinsert, <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
+ %4 = call i64 @llvm.vscale.i64()
+ %5 = shl i64 %4, 2
+ br label %vector.body
+
+vector.body: ; preds = %vector.body, %vector.ph
+ %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
+ %6 = getelementptr inbounds i32, i32* %a, i64 %index
+ %7 = bitcast i32* %6 to <vscale x 4 x i32>*
+ %wide.load = load <vscale x 4 x i32>, <vscale x 4 x i32>* %7, align 4
+ %8 = add <vscale x 4 x i32> %wide.load, %broadcast.splat
+ %9 = bitcast i32* %6 to <vscale x 4 x i32>*
+ store <vscale x 4 x i32> %8, <vscale x 4 x i32>* %9, align 4
+ %index.next = add nuw i64 %index, %5
+ %10 = icmp eq i64 %index.next, %n.vec
+ br i1 %10, label %middle.block, label %vector.body
+
+middle.block: ; preds = %vector.body
+ %cmp.n = icmp eq i64 %n.mod.vf, 0
+ br i1 %cmp.n, label %for.cond.cleanup, label %for.body.preheader
+
+for.body.preheader: ; preds = %entry, %middle.block
+ %indvars.iv.ph = phi i64 [ 0, %entry ], [ %n.vec, %middle.block ]
+ br label %for.body
+
+for.cond.cleanup: ; preds = %for.body, %middle.block
+ ret void
+
+for.body: ; preds = %for.body.preheader, %for.body
+ %indvars.iv = phi i64 [ %indvars.iv.next, %for.body ], [ %indvars.iv.ph, %for.body.preheader ]
+ %arrayidx = getelementptr inbounds i32, i32* %a, i64 %indvars.iv
+ %11 = load i32, i32* %arrayidx, align 4
+ %add = add i32 %11, %x
+ store i32 %add, i32* %arrayidx, align 4
+ %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+ %cmp.not = icmp eq i64 %indvars.iv.next, 1024
+ br i1 %cmp.not, label %for.cond.cleanup, label %for.body
+}
+
+declare <4 x i32> @llvm.vp.add.v4i32(<4 x i32>, <4 x i32>, <4 x i1>, i32)
+
+define void @sink_splat_vp_add(i32* nocapture %a, i32 signext %x, <4 x i1> %m, i32 zeroext %vl) {
+; CHECK-LABEL: sink_splat_vp_add:
+; CHECK: # %bb.0: # %entry
+; CHECK-NEXT: vsetivli zero, 4, e32, m1, ta, ma
+; CHECK-NEXT: vmv.v.x v8, a1
+; CHECK-NEXT: lui a1, 1
+; CHECK-NEXT: add a1, a0, a1
+; CHECK-NEXT: .LBB2_1: # %vector.body
+; CHECK-NEXT: # =>This Inner Loop Header: Depth=1
+; CHECK-NEXT: vle32.v v9, (a0)
+; CHECK-NEXT: vsetvli zero, a2, e32, m1, ta, ma
+; CHECK-NEXT: vadd.vv v9, v9, v8, v0.t
+; CHECK-NEXT: vsetivli zero, 4, e32, m1, ta, ma
+; CHECK-NEXT: vse32.v v9, (a0)
+; CHECK-NEXT: addi a0, a0, 16
+; CHECK-NEXT: bne a0, a1, .LBB2_1
+; CHECK-NEXT: # %bb.2: # %for.cond.cleanup
+; CHECK-NEXT: ret
+entry:
+ %broadcast.splatinsert = insertelement <4 x i32> poison, i32 %x, i32 0
+ %broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> poison, <4 x i32> zeroinitializer
+ br label %vector.body
+
+vector.body: ; preds = %vector.body, %entry
+ %index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
+ %0 = getelementptr inbounds i32, i32* %a, i64 %index
+ %1 = bitcast i32* %0 to <4 x i32>*
+ %wide.load = load <4 x i32>, <4 x i32>* %1, align 4
+ %2 = call <4 x i32> @llvm.vp.add.v4i32(<4 x i32> %wide.load, <4 x i32> %broadcast.splat, <4 x i1> %m, i32 %vl)
+ %3 = bitcast i32* %0 to <4 x i32>*
+ store <4 x i32> %2, <4 x i32>* %3, align 4
+ %index.next = add nuw i64 %index, 4
+ %4 = icmp eq i64 %index.next, 1024
+ br i1 %4, label %for.cond.cleanup, label %vector.body
+
+for.cond.cleanup: ; preds = %vector.body
+ ret void
+}
+
+define void @sink_splat_fadd(float* nocapture %a, float %x) {
+; CHECK-LABEL: sink_splat_fadd:
+; CHECK: # %bb.0: # %entry
+; CHECK-NEXT: vsetivli zero, 4, e32, m1, ta, ma
+; CHECK-NEXT: vfmv.v.f v8, fa0
+; CHECK-NEXT: lui a1, 1
+; CHECK-NEXT: add a1, a0, a1
+; CHECK-NEXT: .LBB3_1: # %vector.body
+; CHECK-NEXT: # =>This Inner Loop Header: Depth=1
+; CHECK-NEXT: vle32.v v9, (a0)
+; CHECK-NEXT: vfadd.vv v9, v9, v8
+; CHECK-NEXT: vse32.v v9, (a0)
+; CHECK-NEXT: addi a0, a0, 16
+; CHECK-NEXT: bne a0, a1, .LBB3_1
+; CHECK-NEXT: # %bb.2: # %for.cond.cleanup
+; CHECK-NEXT: ret
+entry:
+ %broadcast.splatinsert = insertelement <4 x float> poison, float %x, i32 0
+ %broadcast.splat = shufflevector <4 x float> %broadcast.splatinsert, <4 x float> poison, <4 x i32> zeroinitializer
+ br label %vector.body
+
+vector.body: ; preds = %vector.body, %entry
+ %index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
+ %0 = getelementptr inbounds float, float* %a, i64 %index
+ %1 = bitcast float* %0 to <4 x float>*
+ %wide.load = load <4 x float>, <4 x float>* %1, align 4
+ %2 = fadd <4 x float> %wide.load, %broadcast.splat
+ %3 = bitcast float* %0 to <4 x float>*
+ store <4 x float> %2, <4 x float>* %3, align 4
+ %index.next = add nuw i64 %index, 4
+ %4 = icmp eq i64 %index.next, 1024
+ br i1 %4, label %for.cond.cleanup, label %vector.body
+
+for.cond.cleanup: ; preds = %vector.body
+ ret void
+}
+
+define void @sink_splat_fadd_scalable(float* nocapture %a, float %x) {
+; CHECK-LABEL: sink_splat_fadd_scalable:
+; CHECK: # %bb.0: # %entry
+; CHECK-NEXT: csrr a1, vlenb
+; CHECK-NEXT: srli a2, a1, 2
+; CHECK-NEXT: li a3, 1024
+; CHECK-NEXT: bgeu a3, a2, .LBB4_2
+; CHECK-NEXT: # %bb.1:
+; CHECK-NEXT: li a3, 0
+; CHECK-NEXT: j .LBB4_5
+; CHECK-NEXT: .LBB4_2: # %vector.ph
+; CHECK-NEXT: addi a3, a2, -1
+; CHECK-NEXT: andi a4, a3, 1024
+; CHECK-NEXT: xori a3, a4, 1024
+; CHECK-NEXT: vsetvli a5, zero, e32, m1, ta, ma
+; CHECK-NEXT: vfmv.v.f v8, fa0
+; CHECK-NEXT: mv a5, a0
+; CHECK-NEXT: mv a6, a3
+; CHECK-NEXT: .LBB4_3: # %vector.body
+; CHECK-NEXT: # =>This Inner Loop Header: Depth=1
+; CHECK-NEXT: vl1re32.v v9, (a5)
+; CHECK-NEXT: vfadd.vv v9, v9, v8
+; CHECK-NEXT: vs1r.v v9, (a5)
+; CHECK-NEXT: sub a6, a6, a2
+; CHECK-NEXT: add a5, a5, a1
+; CHECK-NEXT: bnez a6, .LBB4_3
+; CHECK-NEXT: # %bb.4: # %middle.block
+; CHECK-NEXT: beqz a4, .LBB4_7
+; CHECK-NEXT: .LBB4_5: # %for.body.preheader
+; CHECK-NEXT: slli a1, a3, 2
+; CHECK-NEXT: add a1, a0, a1
+; CHECK-NEXT: lui a2, 1
+; CHECK-NEXT: add a0, a0, a2
+; CHECK-NEXT: .LBB4_6: # %for.body
+; CHECK-NEXT: # =>This Inner Loop Header: Depth=1
+; CHECK-NEXT: flw fa5, 0(a1)
+; CHECK-NEXT: fadd.s fa5, fa5, fa0
+; CHECK-NEXT: fsw fa5, 0(a1)
+; CHECK-NEXT: addi a1, a1, 4
+; CHECK-NEXT: bne a1, a0, .LBB4_6
+; CHECK-NEXT: .LBB4_7: # %for.cond.cleanup
+; CHECK-NEXT: ret
+entry:
+ %0 = call i64 @llvm.vscale.i64()
+ %1 = shl i64 %0, 1
+ %min.iters.check = icmp ugt i64 %1, 1024
+ br i1 %min.iters.check, label %for.body.preheader, label %vector.ph
+
+vector.ph: ; preds = %entry
+ %2 = call i64 @llvm.vscale.i64()
+ %3 = shl i64 %2, 1
+ %n.mod.vf = urem i64 1024, %3
+ %n.vec = sub nsw i64 1024, %n.mod.vf
+ %broadcast.splatinsert = insertelement <vscale x 2 x float> poison, float %x, i32 0
+ %broadcast.splat = shufflevector <vscale x 2 x float> %broadcast.splatinsert, <vscale x 2 x float> poison, <vscale x 2 x i32> zeroinitializer
+ %4 = call i64 @llvm.vscale.i64()
+ %5 = shl i64 %4, 1
+ br label %vector.body
+
+vector.body: ; preds = %vector.body, %vector.ph
+ %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
+ %6 = getelementptr inbounds float, float* %a, i64 %index
+ %7 = bitcast float* %6 to <vscale x 2 x float>*
+ %wide.load = load <vscale x 2 x float>, <vscale x 2 x float>* %7, align 4
+ %8 = fadd <vscale x 2 x float> %wide.load, %broadcast.splat
+ %9 = bitcast float* %6 to <vscale x 2 x float>*
+ store <vscale x 2 x float> %8, <vscale x 2 x float>* %9, align 4
+ %index.next = add nuw i64 %index, %5
+ %10 = icmp eq i64 %index.next, %n.vec
+ br i1 %10, label %middle.block, label %vector.body
+
+middle.block: ; preds = %vector.body
+ %cmp.n = icmp eq i64 %n.mod.vf, 0
+ br i1 %cmp.n, label %for.cond.cleanup, label %for.body.preheader
+
+for.body.preheader: ; preds = %entry, %middle.block
+ %indvars.iv.ph = phi i64 [ 0, %entry ], [ %n.vec, %middle.block ]
+ br label %for.body
+
+for.cond.cleanup: ; preds = %for.body, %middle.block
+ ret void
+
+for.body: ; preds = %for.body.preheader, %for.body
+ %indvars.iv = phi i64 [ %indvars.iv.next, %for.body ], [ %indvars.iv.ph, %for.body.preheader ]
+ %arrayidx = getelementptr inbounds float, float* %a, i64 %indvars.iv
+ %11 = load float, float* %arrayidx, align 4
+ %mul = fadd float %11, %x
+ store float %mul, float* %arrayidx, align 4
+ %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+ %cmp.not = icmp eq i64 %indvars.iv.next, 1024
+ br i1 %cmp.not, label %for.cond.cleanup, label %for.body
+}
+
+declare <4 x float> @llvm.vp.fadd.v4i32(<4 x float>, <4 x float>, <4 x i1>, i32)
+
+define void @sink_splat_vp_fadd(float* nocapture %a, float %x, <4 x i1> %m, i32 zeroext %vl) {
+; CHECK-LABEL: sink_splat_vp_fadd:
+; CHECK: # %bb.0: # %entry
+; CHECK-NEXT: vsetivli zero, 4, e32, m1, ta, ma
+; CHECK-NEXT: vfmv.v.f v8, fa0
+; CHECK-NEXT: lui a2, 1
+; CHECK-NEXT: add a2, a0, a2
+; CHECK-NEXT: .LBB5_1: # %vector.body
+; CHECK-NEXT: # =>This Inner Loop Header: Depth=1
+; CHECK-NEXT: vle32.v v9, (a0)
+; CHECK-NEXT: vsetvli zero, a1, e32, m1, ta, ma
+; CHECK-NEXT: vfadd.vv v9, v9, v8, v0.t
+; CHECK-NEXT: vsetivli zero, 4, e32, m1, ta, ma
+; CHECK-NEXT: vse32.v v9, (a0)
+; CHECK-NEXT: addi a0, a0, 16
+; CHECK-NEXT: bne a0, a2, .LBB5_1
+; CHECK-NEXT: # %bb.2: # %for.cond.cleanup
+; CHECK-NEXT: ret
+entry:
+ %broadcast.splatinsert = insertelement <4 x float> poison, float %x, i32 0
+ %broadcast.splat = shufflevector <4 x float> %broadcast.splatinsert, <4 x float> poison, <4 x i32> zeroinitializer
+ br label %vector.body
+
+vector.body: ; preds = %vector.body, %entry
+ %index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
+ %0 = getelementptr inbounds float, float* %a, i64 %index
+ %1 = bitcast float* %0 to <4 x float>*
+ %wide.load = load <4 x float>, <4 x float>* %1, align 4
+ %2 = call <4 x float> @llvm.vp.fadd.v4i32(<4 x float> %wide.load, <4 x float> %broadcast.splat, <4 x i1> %m, i32 %vl)
+ %3 = bitcast float* %0 to <4 x float>*
+ store <4 x float> %2, <4 x float>* %3, align 4
+ %index.next = add nuw i64 %index, 4
+ %4 = icmp eq i64 %index.next, 1024
+ br i1 %4, label %for.cond.cleanup, label %vector.body
+
+for.cond.cleanup: ; preds = %vector.body
+ ret void
+}
@@ -0,0 +1,353 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc < %s -mtriple=riscv64 -mattr=+m,+v,+f -target-abi=lp64f \
; RUN: -mattr=+dont-sink-splat-operands -riscv-v-vector-bits-min=128 | FileCheck %s
Add a RUN without +dont-sink-splat-operands?
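A second RUN line without the tune would show the baseline sinking behavior for comparison. A minimal sketch (the SINK prefix name is only illustrative, not taken from the patch):

; RUN: llc < %s -mtriple=riscv64 -mattr=+m,+v,+f -target-abi=lp64f \
; RUN:   -riscv-v-vector-bits-min=128 | FileCheck %s -check-prefix=SINK

update_llc_test_checks.py can then regenerate separate check lines for each prefix, making the difference between the two configurations visible in one file.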
// repeatedly.
// FIXME: It could still be worth doing if it would improve vector register
// pressure and prevent a vector spill.
if (Subtarget.dontSinkSplatOperands())
Is this the only place that controls the generation of sunk V instructions?
The middle end LICM pass will usually hoist splats of the following form out of loops:

%a = insertelement <vscale x 2 x i64> poison, i64 0, i32 0
%b = shufflevector <vscale x 2 x i64> %a, <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer

The code in this function tries to sink these splats back into loops if that would enable a .vx/.vf/.wx/.wf instruction to be formed. This tuning flag disables the sinking and leaves the splats outside the loop, the way LICM left them.
This is a simple way to prevent many cases of .vx/.vf/.wx/.wf instructions in loops. But as noted in the FIXMEs, it is bad for register pressure. We've been trying to write an optimization pass to do this better, but we don't have anything that gives good results yet.
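For illustration, here is a rough sketch (names and types are illustrative, not taken from the patch) of the sunk form that the default behavior produces: the splat is rematerialized inside the loop body, so instruction selection can fold it into a vadd.vx that reads the scalar from a GPR on every iteration, which costs an S2V transfer per use on affected cores.

define void @sunk_splat_add(i32* nocapture %a, i32 signext %x) {
entry:
  br label %loop

loop:                                             ; preds = %loop, %entry
  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
  %p = getelementptr inbounds i32, i32* %a, i64 %iv
  %vp = bitcast i32* %p to <4 x i32>*
  %v = load <4 x i32>, <4 x i32>* %vp, align 4
  ; The splat is recreated here (rather than left hoisted) so ISel can select vadd.vx.
  %ins = insertelement <4 x i32> poison, i32 %x, i32 0
  %splat = shufflevector <4 x i32> %ins, <4 x i32> poison, <4 x i32> zeroinitializer
  %sum = add <4 x i32> %v, %splat
  store <4 x i32> %sum, <4 x i32>* %vp, align 4
  %iv.next = add nuw i64 %iv, 4
  %done = icmp eq i64 %iv.next, 1024
  br i1 %done, label %exit, label %loop

exit:                                             ; preds = %loop
  ret void
}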
// FIXME: Forming .vx/.vf can reduce register pressure.
def TuneDontSinkSplatOperands
: SubtargetFeature<"dont-sink-splat-operands", "DontSinkSplatOperands",
"true", "Don't sink splat operands to enable .vx or .vf "
.wx and .wf?
Updated.
LGTM if comment is addressed.
// Some subtargets require a S2V transfer buffer to move scalars into vectors.
// FIXME: Forming .vx/.vf/.wx/.wf can reduce register pressure.
def TuneDontSinkSplatOperands
: SubtargetFeature<"dont-sink-splat-operands", "DontSinkSplatOperands",
DontSinkSplatOperands -> SinkSplatOperands, and then change true to false? This matches other SubtargetFeatures that disable some feature.
It looks like TuneNoDefaultUnroll, TuneNoOptimizedZeroStride, and FeatureNoRVCHints disable features and contain the No string, similar to Dont. I don't see any other SubtargetFeatures with false without the No string. Are you sure your suggestion matches how we disable features?

I made an attempt to implement your suggestion, and here's what I found:

If we had SinkSplatOperands and false, then we would want to have

  if (!Subtarget.sinkSplatOperands())
    return false;

since

  if (Subtarget.sinkSplatOperands())
    return true;

is not the desired behavior: a true value for this tuning does not imply that splat operands should be sunk. Instead, a true value implies that we should allow the rest of the logic in that function to determine whether to sink the splat operands. The desired behavior is that if sinkSplatOperands is false then we will not sink splat operands.

Now let's consider the three command line options:

- +sink-splat-operands
- -sink-splat-operands
- Tuning not specified

If we had SinkSplatOperands and false, then only the "tuning not specified" option works as expected. If we had SinkSplatOperands and true, then +sink-splat-operands and -sink-splat-operands work correctly, but "tuning not specified" does not sink when it should be sinking.

Am I misunderstanding something? Based on my understanding, I think the tuning is in line with other tunings that try to disable features. The one refinement I can think of is that we can rename to TuneNoSinkSplatOperands, since the other features use the No instead of Dont naming scheme.
The suggestion is something like
def TuneNoSinkSplatOperands : SubtargetFeature<"no-sink-splat-operands", "SinkSplatOperands", "false"
Notice the command line option has "No", but the variable in the Subtarget field does not.
The use of "false" makes SinkSplatOperands default to true, and passing -mattr=+no-sink-splat-operands sets it to false.
Thanks for the explanation. Updated.
; FIXME: This is potentially bad for register pressure. Need a better heuristic.

define void @sink_splat_add(i32* nocapture %a, i32 signext %x) {
; DONT-SINK-LABEL: sink_splat_add:
DONT -> NO
A CPU may prefer not to sink splat operands, one reason being that it could require an S2V transfer buffer to move scalars into vectors.
This commit is stacked on #79015 and will be updated to mark the p670 with this Tune once it lands.