[RISCV] Adjust RISCVVectorMaskDAGMutation to look for copy to V0 #129296

Conversation
This mutation was introduced in 01a15dc with the goal of avoiding many copies from v1-v31 to v0 immediately before a mask-consuming instruction. I noticed in a workload that it was not applying to vmv.s.x (which we use for short vector masks), because we had been relying on a whitelist of mask-producing instructions. Instead, we can directly inspect the user of the current node to see if it is a copy to V0. This isn't quite precise (the mask-producing instruction could already be scheduled fairly far from its single use), but is probably good enough. As with all scheduling changes, results are mixed: some significant improvements, some regressions.
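To make the targeted pattern concrete, here is a before/after fragment adapted from the vrgather_shuffle_vx_v4i16 diff in fixed-vectors-int-shuffles.ll further down (vsetvli setup and the earlier v0=2 slide mask are elided; the registers are the ones from that test).

Before: the mask is built in a scratch register early and copied into v0 just before its use.
  vmv.v.i v9, 3
  vslidedown.vi v8, v8, 2, v0.t
  vmv.v.i v10, 5
  vmv1r.v v0, v9
  vmerge.vvm v8, v10, v8, v0

After: the producer is kept next to its single use, so it writes v0 directly and the vmv1r.v disappears.
  vslidedown.vi v8, v8, 2, v0.t
  vmv.v.i v0, 3
  vmv.v.i v9, 5
  vmerge.vvm v8, v9, v8, v0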
@llvm/pr-subscribers-backend-risc-v
Author: Philip Reames (preames)
Changes: This mutation was introduced in 01a15dc with the goal of avoiding many copies from v1-v31 to v0 immediately before a mask-consuming instruction. I noticed in a workload that it was not applying to vmv.s.x (which we use for short vector masks), because we had been relying on a whitelist of mask-producing instructions. Instead, we can directly inspect the user of the current node to see if it is a copy to V0. This isn't quite precise (the mask-producing instruction could already be scheduled fairly far from its single use), but is probably good enough. As with all scheduling changes, results are mixed: some significant improvements, some regressions.
Patch is 120.59 KiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/129296.diff
22 Files Affected:
diff --git a/llvm/lib/Target/RISCV/RISCVVectorMaskDAGMutation.cpp b/llvm/lib/Target/RISCV/RISCVVectorMaskDAGMutation.cpp
index ee90868d252e4..d82b54b70d3ea 100644
--- a/llvm/lib/Target/RISCV/RISCVVectorMaskDAGMutation.cpp
+++ b/llvm/lib/Target/RISCV/RISCVVectorMaskDAGMutation.cpp
@@ -13,11 +13,8 @@
// The reason why we need to do this:
// 1. When tracking register pressure, we don't track physical registers.
// 2. We have a RegisterClass for mask register (which is `VMV0`), but we don't
-// use it in most RVV pseudos (only used in inline asm constraint and add/sub
-// with carry instructions). Instead, we use physical register V0 directly
-// and insert a `$v0 = COPY ...` before the use. And, there is a fundamental
-// issue in register allocator when handling RegisterClass with only one
-// physical register, so we can't simply replace V0 with VMV0.
+// use it by the time we reach scheduling. Instead, we use physical
+// register V0 directly and insert a `$v0 = COPY ...` before the use.
// 3. For mask producers, we are using VR RegisterClass (we can allocate V0-V31
// to it). So if V0 is not available, there are still 31 available registers
// out there.
@@ -43,68 +40,6 @@
namespace llvm {
-static inline bool isVectorMaskProducer(const MachineInstr *MI) {
- switch (RISCV::getRVVMCOpcode(MI->getOpcode())) {
- // Vector Mask Instructions
- case RISCV::VMAND_MM:
- case RISCV::VMNAND_MM:
- case RISCV::VMANDN_MM:
- case RISCV::VMXOR_MM:
- case RISCV::VMOR_MM:
- case RISCV::VMNOR_MM:
- case RISCV::VMORN_MM:
- case RISCV::VMXNOR_MM:
- case RISCV::VMSBF_M:
- case RISCV::VMSIF_M:
- case RISCV::VMSOF_M:
- // Vector Integer Add-with-Carry / Subtract-with-Borrow Instructions
- case RISCV::VMADC_VV:
- case RISCV::VMADC_VX:
- case RISCV::VMADC_VI:
- case RISCV::VMADC_VVM:
- case RISCV::VMADC_VXM:
- case RISCV::VMADC_VIM:
- case RISCV::VMSBC_VV:
- case RISCV::VMSBC_VX:
- case RISCV::VMSBC_VVM:
- case RISCV::VMSBC_VXM:
- // Vector Integer Compare Instructions
- case RISCV::VMSEQ_VV:
- case RISCV::VMSEQ_VX:
- case RISCV::VMSEQ_VI:
- case RISCV::VMSNE_VV:
- case RISCV::VMSNE_VX:
- case RISCV::VMSNE_VI:
- case RISCV::VMSLT_VV:
- case RISCV::VMSLT_VX:
- case RISCV::VMSLTU_VV:
- case RISCV::VMSLTU_VX:
- case RISCV::VMSLE_VV:
- case RISCV::VMSLE_VX:
- case RISCV::VMSLE_VI:
- case RISCV::VMSLEU_VV:
- case RISCV::VMSLEU_VX:
- case RISCV::VMSLEU_VI:
- case RISCV::VMSGTU_VX:
- case RISCV::VMSGTU_VI:
- case RISCV::VMSGT_VX:
- case RISCV::VMSGT_VI:
- // Vector Floating-Point Compare Instructions
- case RISCV::VMFEQ_VV:
- case RISCV::VMFEQ_VF:
- case RISCV::VMFNE_VV:
- case RISCV::VMFNE_VF:
- case RISCV::VMFLT_VV:
- case RISCV::VMFLT_VF:
- case RISCV::VMFLE_VV:
- case RISCV::VMFLE_VF:
- case RISCV::VMFGT_VF:
- case RISCV::VMFGE_VF:
- return true;
- }
- return false;
-}
-
class RISCVVectorMaskDAGMutation : public ScheduleDAGMutation {
private:
const TargetRegisterInfo *TRI;
@@ -112,6 +47,24 @@ class RISCVVectorMaskDAGMutation : public ScheduleDAGMutation {
public:
RISCVVectorMaskDAGMutation(const TargetRegisterInfo *TRI) : TRI(TRI) {}
+ bool isSoleUseCopyToV0(SUnit &SU) {
+ if (SU.Succs.size() != 1)
+ return false;
+ SDep &Dep = SU.Succs[0];
+ // Ignore dependencies other than data or strong ordering.
+ if (Dep.isWeak())
+ return false;
+
+ SUnit &DepSU = *Dep.getSUnit();
+ if (DepSU.isBoundaryNode())
+ return false;
+
+ const MachineInstr *DepMI = DepSU.getInstr();
+ return DepMI->isCopy() && DepMI->getOperand(0).getReg() == RISCV::V0 &&
+ DepMI->getOperand(1).getReg().isVirtual() &&
+ DepMI->getOperand(1).getSubReg() == RISCV::NoSubRegister;
+ }
+
void apply(ScheduleDAGInstrs *DAG) override {
SUnit *NearestUseV0SU = nullptr;
for (SUnit &SU : DAG->SUnits) {
@@ -119,7 +72,7 @@ class RISCVVectorMaskDAGMutation : public ScheduleDAGMutation {
if (MI->findRegisterUseOperand(RISCV::V0, TRI))
NearestUseV0SU = &SU;
- if (NearestUseV0SU && NearestUseV0SU != &SU && isVectorMaskProducer(MI) &&
+ if (NearestUseV0SU && NearestUseV0SU != &SU && isSoleUseCopyToV0(SU) &&
// For LMUL=8 cases, there will be more possibilities to spill.
// FIXME: We should use RegPressureTracker to do fine-grained
// controls.
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp-buildvec.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp-buildvec.ll
index 9bd1da2e53dce..eb40c133514fe 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp-buildvec.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp-buildvec.ll
@@ -46,10 +46,11 @@ define <4 x float> @hang_when_merging_stores_after_legalization(<8 x float> %x,
; CHECK-NEXT: vslideup.vi v12, v10, 2, v0.t
; CHECK-NEXT: vsetivli zero, 1, e8, mf8, ta, ma
; CHECK-NEXT: vmv.v.i v0, 2
-; CHECK-NEXT: vmv.v.i v10, 12
; CHECK-NEXT: vsetivli zero, 8, e32, m2, ta, mu
; CHECK-NEXT: vslidedown.vi v8, v8, 6, v0.t
-; CHECK-NEXT: vmv1r.v v0, v10
+; CHECK-NEXT: vsetivli zero, 1, e8, mf8, ta, ma
+; CHECK-NEXT: vmv.v.i v0, 12
+; CHECK-NEXT: vsetivli zero, 8, e32, m2, ta, ma
; CHECK-NEXT: vmerge.vvm v8, v8, v12, v0
; CHECK-NEXT: ret
%z = shufflevector <8 x float> %x, <8 x float> %y, <4 x i32> <i32 0, i32 7, i32 8, i32 15>
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-buildvec.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-buildvec.ll
index c628a0d620498..16bb2105f8680 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-buildvec.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-buildvec.ll
@@ -698,15 +698,16 @@ define void @buildvec_seq_v9i8(ptr %x) {
; CHECK: # %bb.0:
; CHECK-NEXT: li a1, 73
; CHECK-NEXT: vsetivli zero, 9, e8, m1, ta, ma
-; CHECK-NEXT: vmv.v.i v9, 3
+; CHECK-NEXT: vmv.v.i v8, 3
; CHECK-NEXT: vsetvli zero, zero, e16, m2, ta, ma
; CHECK-NEXT: vmv.s.x v0, a1
; CHECK-NEXT: li a1, 146
-; CHECK-NEXT: vmv.s.x v8, a1
; CHECK-NEXT: vsetvli zero, zero, e8, m1, ta, ma
-; CHECK-NEXT: vmerge.vim v9, v9, 1, v0
-; CHECK-NEXT: vmv1r.v v0, v8
-; CHECK-NEXT: vmerge.vim v8, v9, 2, v0
+; CHECK-NEXT: vmerge.vim v8, v8, 1, v0
+; CHECK-NEXT: vsetvli zero, zero, e16, m2, ta, ma
+; CHECK-NEXT: vmv.s.x v0, a1
+; CHECK-NEXT: vsetvli zero, zero, e8, m1, ta, ma
+; CHECK-NEXT: vmerge.vim v8, v8, 2, v0
; CHECK-NEXT: vse8.v v8, (a0)
; CHECK-NEXT: ret
store <9 x i8> <i8 1, i8 2, i8 3, i8 1, i8 2, i8 3, i8 1, i8 2, i8 3>, ptr %x
@@ -973,27 +974,27 @@ define <512 x i8> @buildvec_not_vid_v512i8_indices_overflow_2() vscale_range(16,
; RV32: # %bb.0:
; RV32-NEXT: vsetivli zero, 16, e32, mf2, ta, ma
; RV32-NEXT: vmv.v.i v0, 15
-; RV32-NEXT: vmv.v.i v9, 0
+; RV32-NEXT: vmv.v.i v8, 0
; RV32-NEXT: li a0, 512
; RV32-NEXT: li a1, 240
-; RV32-NEXT: vmv.s.x v8, a1
-; RV32-NEXT: li a1, 15
-; RV32-NEXT: vmerge.vim v10, v9, -1, v0
+; RV32-NEXT: vmerge.vim v9, v8, -1, v0
; RV32-NEXT: vsetvli zero, a0, e8, m4, ta, ma
; RV32-NEXT: vmv.v.i v12, 3
-; RV32-NEXT: slli a1, a1, 8
-; RV32-NEXT: vmv1r.v v0, v10
+; RV32-NEXT: vmv1r.v v0, v9
; RV32-NEXT: vmerge.vim v12, v12, 0, v0
-; RV32-NEXT: vmv1r.v v0, v8
+; RV32-NEXT: vsetvli zero, zero, e16, m8, ta, ma
+; RV32-NEXT: vmv.s.x v0, a1
+; RV32-NEXT: li a1, 15
; RV32-NEXT: vsetivli zero, 16, e32, mf2, ta, ma
-; RV32-NEXT: vmerge.vim v10, v9, -1, v0
-; RV32-NEXT: vmv.s.x v8, a1
-; RV32-NEXT: vmv1r.v v0, v10
+; RV32-NEXT: vmerge.vim v9, v8, -1, v0
+; RV32-NEXT: slli a1, a1, 8
+; RV32-NEXT: vmv1r.v v0, v9
; RV32-NEXT: vsetvli zero, a0, e8, m4, ta, ma
; RV32-NEXT: vmerge.vim v12, v12, 1, v0
-; RV32-NEXT: vmv1r.v v0, v8
+; RV32-NEXT: vsetvli zero, zero, e16, m8, ta, ma
+; RV32-NEXT: vmv.s.x v0, a1
; RV32-NEXT: vsetivli zero, 16, e32, mf2, ta, ma
-; RV32-NEXT: vmerge.vim v8, v9, -1, v0
+; RV32-NEXT: vmerge.vim v8, v8, -1, v0
; RV32-NEXT: vmv1r.v v0, v8
; RV32-NEXT: vsetvli zero, a0, e8, m4, ta, ma
; RV32-NEXT: vmerge.vim v8, v12, 2, v0
@@ -1003,25 +1004,23 @@ define <512 x i8> @buildvec_not_vid_v512i8_indices_overflow_2() vscale_range(16,
; RV64V: # %bb.0:
; RV64V-NEXT: vsetivli zero, 8, e64, m1, ta, ma
; RV64V-NEXT: vmv.v.i v0, 3
-; RV64V-NEXT: vmv.v.i v9, 0
+; RV64V-NEXT: vmv.v.i v8, 0
; RV64V-NEXT: li a0, 512
-; RV64V-NEXT: vmv.v.i v8, 12
-; RV64V-NEXT: li a1, 48
-; RV64V-NEXT: vmerge.vim v10, v9, -1, v0
+; RV64V-NEXT: vmerge.vim v9, v8, -1, v0
; RV64V-NEXT: vsetvli zero, a0, e8, m4, ta, ma
; RV64V-NEXT: vmv.v.i v12, 3
-; RV64V-NEXT: vmv1r.v v0, v10
+; RV64V-NEXT: vmv1r.v v0, v9
; RV64V-NEXT: vmerge.vim v12, v12, 0, v0
-; RV64V-NEXT: vmv1r.v v0, v8
; RV64V-NEXT: vsetivli zero, 8, e64, m1, ta, ma
-; RV64V-NEXT: vmerge.vim v10, v9, -1, v0
-; RV64V-NEXT: vmv.s.x v8, a1
-; RV64V-NEXT: vmv.v.v v0, v10
+; RV64V-NEXT: vmv.v.i v0, 12
+; RV64V-NEXT: vmerge.vim v9, v8, -1, v0
+; RV64V-NEXT: li a1, 48
+; RV64V-NEXT: vmv.v.v v0, v9
; RV64V-NEXT: vsetvli zero, a0, e8, m4, ta, ma
; RV64V-NEXT: vmerge.vim v12, v12, 1, v0
-; RV64V-NEXT: vmv1r.v v0, v8
+; RV64V-NEXT: vmv.s.x v0, a1
; RV64V-NEXT: vsetivli zero, 8, e64, m1, ta, ma
-; RV64V-NEXT: vmerge.vim v8, v9, -1, v0
+; RV64V-NEXT: vmerge.vim v8, v8, -1, v0
; RV64V-NEXT: vmv.v.v v0, v8
; RV64V-NEXT: vsetvli zero, a0, e8, m4, ta, ma
; RV64V-NEXT: vmerge.vim v8, v12, 2, v0
@@ -1031,27 +1030,27 @@ define <512 x i8> @buildvec_not_vid_v512i8_indices_overflow_2() vscale_range(16,
; RV64ZVE32: # %bb.0:
; RV64ZVE32-NEXT: vsetivli zero, 16, e32, m1, ta, ma
; RV64ZVE32-NEXT: vmv.v.i v0, 15
-; RV64ZVE32-NEXT: vmv.v.i v9, 0
+; RV64ZVE32-NEXT: vmv.v.i v8, 0
; RV64ZVE32-NEXT: li a0, 512
; RV64ZVE32-NEXT: li a1, 240
-; RV64ZVE32-NEXT: vmv.s.x v8, a1
-; RV64ZVE32-NEXT: li a1, 15
-; RV64ZVE32-NEXT: vmerge.vim v10, v9, -1, v0
+; RV64ZVE32-NEXT: vmerge.vim v9, v8, -1, v0
; RV64ZVE32-NEXT: vsetvli zero, a0, e8, m4, ta, ma
; RV64ZVE32-NEXT: vmv.v.i v12, 3
-; RV64ZVE32-NEXT: slli a1, a1, 8
-; RV64ZVE32-NEXT: vmv1r.v v0, v10
+; RV64ZVE32-NEXT: vmv1r.v v0, v9
; RV64ZVE32-NEXT: vmerge.vim v12, v12, 0, v0
-; RV64ZVE32-NEXT: vmv1r.v v0, v8
+; RV64ZVE32-NEXT: vsetvli zero, zero, e16, m8, ta, ma
+; RV64ZVE32-NEXT: vmv.s.x v0, a1
+; RV64ZVE32-NEXT: li a1, 15
; RV64ZVE32-NEXT: vsetivli zero, 16, e32, m1, ta, ma
-; RV64ZVE32-NEXT: vmerge.vim v10, v9, -1, v0
-; RV64ZVE32-NEXT: vmv.s.x v8, a1
-; RV64ZVE32-NEXT: vmv.v.v v0, v10
+; RV64ZVE32-NEXT: vmerge.vim v9, v8, -1, v0
+; RV64ZVE32-NEXT: slli a1, a1, 8
+; RV64ZVE32-NEXT: vmv.v.v v0, v9
; RV64ZVE32-NEXT: vsetvli zero, a0, e8, m4, ta, ma
; RV64ZVE32-NEXT: vmerge.vim v12, v12, 1, v0
-; RV64ZVE32-NEXT: vmv1r.v v0, v8
+; RV64ZVE32-NEXT: vsetvli zero, zero, e16, m8, ta, ma
+; RV64ZVE32-NEXT: vmv.s.x v0, a1
; RV64ZVE32-NEXT: vsetivli zero, 16, e32, m1, ta, ma
-; RV64ZVE32-NEXT: vmerge.vim v8, v9, -1, v0
+; RV64ZVE32-NEXT: vmerge.vim v8, v8, -1, v0
; RV64ZVE32-NEXT: vmv.v.v v0, v8
; RV64ZVE32-NEXT: vsetvli zero, a0, e8, m4, ta, ma
; RV64ZVE32-NEXT: vmerge.vim v8, v12, 2, v0
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-shuffles.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-shuffles.ll
index f307ebb422c6c..cd73dbadb2d03 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-shuffles.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-shuffles.ll
@@ -105,11 +105,10 @@ define <4 x i16> @vrgather_shuffle_vx_v4i16(<4 x i16> %x) {
; CHECK: # %bb.0:
; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, mu
; CHECK-NEXT: vmv.v.i v0, 2
-; CHECK-NEXT: vmv.v.i v9, 3
; CHECK-NEXT: vslidedown.vi v8, v8, 2, v0.t
-; CHECK-NEXT: vmv.v.i v10, 5
-; CHECK-NEXT: vmv1r.v v0, v9
-; CHECK-NEXT: vmerge.vvm v8, v10, v8, v0
+; CHECK-NEXT: vmv.v.i v0, 3
+; CHECK-NEXT: vmv.v.i v9, 5
+; CHECK-NEXT: vmerge.vvm v8, v9, v8, v0
; CHECK-NEXT: ret
%s = shufflevector <4 x i16> %x, <4 x i16> <i16 5, i16 5, i16 5, i16 5>, <4 x i32> <i32 0, i32 3, i32 6, i32 5>
ret <4 x i16> %s
@@ -971,13 +970,12 @@ define <8 x i32> @shuffle_repeat3_singlesrc_e32(<8 x i32> %v) {
; CHECK: # %bb.0:
; CHECK-NEXT: vsetivli zero, 8, e16, m1, ta, ma
; CHECK-NEXT: vmv.v.i v0, 7
-; CHECK-NEXT: vmv.v.i v10, 1
+; CHECK-NEXT: vmv.v.i v9, 1
; CHECK-NEXT: li a0, 192
-; CHECK-NEXT: vmv.s.x v9, a0
+; CHECK-NEXT: vmerge.vim v9, v9, 0, v0
+; CHECK-NEXT: vmv.s.x v0, a0
; CHECK-NEXT: csrr a0, vlenb
-; CHECK-NEXT: vmerge.vim v10, v10, 0, v0
-; CHECK-NEXT: vmv.v.v v0, v9
-; CHECK-NEXT: vmerge.vim v9, v10, 2, v0
+; CHECK-NEXT: vmerge.vim v9, v9, 2, v0
; CHECK-NEXT: srli a0, a0, 2
; CHECK-NEXT: vslidedown.vx v10, v9, a0
; CHECK-NEXT: vsetvli a0, zero, e32, m1, ta, ma
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int.ll
index 30751f8ea706b..39fd70beb9ee2 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int.ll
@@ -1046,44 +1046,45 @@ define void @mulhu_v16i8(ptr %x) {
; CHECK-LABEL: mulhu_v16i8:
; CHECK: # %bb.0:
; CHECK-NEXT: vsetivli zero, 16, e8, m1, ta, ma
-; CHECK-NEXT: vle8.v v9, (a0)
+; CHECK-NEXT: vle8.v v8, (a0)
; CHECK-NEXT: lui a1, 3
-; CHECK-NEXT: vmv.v.i v10, 0
+; CHECK-NEXT: vmv.v.i v9, 0
; CHECK-NEXT: lui a2, %hi(.LCPI65_0)
; CHECK-NEXT: addi a2, a2, %lo(.LCPI65_0)
-; CHECK-NEXT: vle8.v v11, (a2)
-; CHECK-NEXT: lui a2, 1
+; CHECK-NEXT: vle8.v v10, (a2)
+; CHECK-NEXT: li a2, -128
; CHECK-NEXT: addi a1, a1, -2044
; CHECK-NEXT: vsetvli zero, zero, e16, m2, ta, ma
; CHECK-NEXT: vmv.s.x v0, a1
-; CHECK-NEXT: addi a1, a2, 32
-; CHECK-NEXT: vmv.s.x v8, a1
-; CHECK-NEXT: li a1, -128
+; CHECK-NEXT: lui a1, 1
; CHECK-NEXT: vsetvli zero, zero, e8, m1, ta, ma
-; CHECK-NEXT: vmerge.vxm v12, v10, a1, v0
-; CHECK-NEXT: li a1, 513
-; CHECK-NEXT: vmv.v.i v13, 4
-; CHECK-NEXT: vmv1r.v v0, v8
-; CHECK-NEXT: vmerge.vim v8, v10, 1, v0
+; CHECK-NEXT: vmerge.vxm v11, v9, a2, v0
+; CHECK-NEXT: addi a2, a1, 32
; CHECK-NEXT: vsetvli zero, zero, e16, m2, ta, ma
-; CHECK-NEXT: vmv.s.x v0, a1
-; CHECK-NEXT: addi a1, a2, 78
+; CHECK-NEXT: vmv.s.x v0, a2
+; CHECK-NEXT: li a2, 513
+; CHECK-NEXT: vsetvli zero, zero, e8, m1, ta, ma
+; CHECK-NEXT: vmv.v.i v12, 4
+; CHECK-NEXT: addi a1, a1, 78
+; CHECK-NEXT: vmerge.vim v9, v9, 1, v0
+; CHECK-NEXT: vsetvli zero, zero, e16, m2, ta, ma
+; CHECK-NEXT: vmv.s.x v0, a2
; CHECK-NEXT: vsetvli zero, zero, e8, m1, ta, ma
-; CHECK-NEXT: vmerge.vim v10, v13, 1, v0
+; CHECK-NEXT: vsrl.vv v9, v8, v9
+; CHECK-NEXT: vmerge.vim v12, v12, 1, v0
; CHECK-NEXT: vsetvli zero, zero, e16, m2, ta, ma
; CHECK-NEXT: vmv.s.x v0, a1
; CHECK-NEXT: vsetvli zero, zero, e8, m1, ta, ma
-; CHECK-NEXT: vsrl.vv v8, v9, v8
-; CHECK-NEXT: vmulhu.vv v8, v8, v11
-; CHECK-NEXT: vmerge.vim v10, v10, 3, v0
+; CHECK-NEXT: vmulhu.vv v9, v9, v10
+; CHECK-NEXT: vmerge.vim v10, v12, 3, v0
; CHECK-NEXT: lui a1, 8
; CHECK-NEXT: addi a1, a1, 304
-; CHECK-NEXT: vsub.vv v9, v9, v8
-; CHECK-NEXT: vmulhu.vv v9, v9, v12
+; CHECK-NEXT: vsub.vv v8, v8, v9
+; CHECK-NEXT: vmulhu.vv v8, v8, v11
; CHECK-NEXT: vsetvli zero, zero, e16, m2, ta, ma
; CHECK-NEXT: vmv.s.x v0, a1
; CHECK-NEXT: vsetvli zero, zero, e8, m1, ta, ma
-; CHECK-NEXT: vadd.vv v8, v9, v8
+; CHECK-NEXT: vadd.vv v8, v8, v9
; CHECK-NEXT: vmerge.vim v9, v10, 2, v0
; CHECK-NEXT: vsrl.vv v8, v8, v9
; CHECK-NEXT: vse8.v v8, (a0)
@@ -3153,48 +3154,49 @@ define void @mulhu_v32i8(ptr %x) {
; CHECK-LABEL: mulhu_v32i8:
; CHECK: # %bb.0:
; CHECK-NEXT: li a1, 32
-; CHECK-NEXT: lui a2, 163907
-; CHECK-NEXT: addi a2, a2, -2044
-; CHECK-NEXT: vsetivli zero, 1, e32, m1, ta, ma
-; CHECK-NEXT: vmv.s.x v0, a2
-; CHECK-NEXT: lui a2, 66049
-; CHECK-NEXT: addi a2, a2, 32
-; CHECK-NEXT: vmv.s.x v8, a2
-; CHECK-NEXT: li a2, -128
-; CHECK-NEXT: vsetvli zero, a1, e8, m2, ta, ma
+; CHECK-NEXT: lui a2, %hi(.LCPI181_0)
+; CHECK-NEXT: addi a2, a2, %lo(.LCPI181_0)
+; CHECK-NEXT: vsetvli zero, a1, e32, m8, ta, ma
+; CHECK-NEXT: vle8.v v8, (a2)
+; CHECK-NEXT: lui a1, 163907
+; CHECK-NEXT: addi a1, a1, -2044
+; CHECK-NEXT: vmv.s.x v0, a1
+; CHECK-NEXT: li a1, -128
+; CHECK-NEXT: vsetvli zero, zero, e8, m2, ta, ma
; CHECK-NEXT: vmv.v.i v10, 0
-; CHECK-NEXT: vmerge.vxm v12, v10, a2, v0
-; CHECK-NEXT: lui a1, %hi(.LCPI181_0)
-; CHECK-NEXT: addi a1, a1, %lo(.LCPI181_0)
-; CHECK-NEXT: vle8.v v14, (a0)
-; CHECK-NEXT: vmv1r.v v0, v8
-; CHECK-NEXT: vmerge.vim v8, v10, 1, v0
-; CHECK-NEXT: vle8.v v10, (a1)
+; CHECK-NEXT: vmerge.vxm v12, v10, a1, v0
+; CHECK-NEXT: lui a1, 66049
+; CHECK-NEXT: addi a1, a1, 32
+; CHECK-NEXT: vsetvli zero, zero, e32, m8, ta, ma
+; CHECK-NEXT: vmv.s.x v0, a1
; CHECK-NEXT: lui a1, 8208
+; CHECK-NEXT: vle8.v v14, (a0)
; CHECK-NEXT: addi a1, a1, 513
-; CHECK-NEXT: vsrl.vv v8, v14, v8
-; CHECK-NEXT: vmulhu.vv v10, v8, v10
+; CHECK-NEXT: vsetvli zero, zero, e8, m2, ta, ma
+; CHECK-NEXT: vmerge.vim v10, v10, 1, v0
; CHECK-NEXT: vsetvli zero, zero, e32, m8, ta, ma
; CHECK-NEXT: vmv.s.x v0, a1
; CHECK-NEXT: lui a1, 66785
+; CHECK-NEXT: vsetvli zero, zero, e8, m2, ta, ma
+; CHECK-NEXT: vmv.v.i v16, 4
; CHECK-NEXT: addi a1, a1, 78
-; CHECK-NEXT: vmv.s.x v8, a1
-; CHECK-NEXT: lui a1, 529160
+; CHECK-NEXT: vsrl.vv v10, v14, v10
+; CHECK-NEXT: vmerge.vim v16, v16, 1, v0
+; CHECK-NEXT: vsetvli zero, zero, e32, m8, ta, ma
+; CHECK-NEXT: vmv.s.x v0, a1
; CHECK-NEXT: vsetvli zero, zero, e8, m2, ta, ma
-; CHECK-NEXT: vsub.vv v14, v14, v10
-; CHECK-NEXT: vmulhu.vv v12, v14, v12
-; CHECK-NEXT: vmv.v.i v14, 4
+; CHECK-NEXT: vmulhu.vv v8, v10, v8
+; CHECK-NEXT: vmerge.vim v10, v16, 3, v0
+; CHECK-NEXT: lui a1, 529160
; CHECK-NEXT: addi a1, a1, 304
-; CHECK-NEXT: vmerge.vim v14, v14, 1, v0
+; CHECK-NEXT: vsub.vv v14, v14, v8
+; CHECK-NEXT: vmulhu.vv v12, v14, v12
; CHECK-NEXT: vsetvli zero, zero, e32, m8, ta, ma
-; CHECK-NEXT: vmv.s.x v9, a1
-; CHECK-NEXT: vmv1r.v v0, v8
+; CHECK-NEXT: vmv.s.x v0, a1
; CHECK-NEXT: vsetvli zero, zero, e8, m2, ta, ma
-; CHECK-NEXT: vmerge.vim v14, v14, 3, v0
-; CHECK-NEXT: vadd.vv v10, v12, v10
-; CHECK-NEXT: vmv1r.v v0, v9
-; CHECK-NEXT: vmerge.vim v8, v14, 2, v0
-; CHECK-NEXT: vsrl.vv v8, v10, v8
+; CHECK-NEXT: vadd.vv v8, v12, v8
+; CHECK-NEXT: vmerge.vim v10, v10, 2, v0
+; CHECK-NEXT: vsrl.vv v8, v8, v10
; CHECK-NEXT: vse8.v v8, (a0)
; CHECK-NEXT: ret
%a = load <32 x i8>, ptr %x
@@ -3215,27 +3217,28 @@ define void @mulhu_v16i16(ptr %x) {
; RV32-NEXT: vmerge.vxm v12, v8, a1, v0
; RV32-NEXT: lui a1, 4
; RV32-NEXT: vsetvli zero, zero, e8, m1, ta, ma
-; RV32-NEXT: vmv.v.i v14, 0
+; RV32-NEXT: vmv.v.i v9, 0
; RV32-NEXT: addi a1, a1, 64
; RV32-NEXT: vsetvli zero, zero, e16, m2, ta, ma
; RV32-NEXT: vmv.s.x v8, a1
; RV32-NEXT: lui a1, 2
; RV32-NEXT: addi a1, a1, 289
-; RV32-NEXT: vmv.s.x v9, a1
+; RV32-NEXT: vmv1r.v v0, v8
+; RV32-NEXT: vsetvli zero, zero, e8, m1, ta, ma
+; RV32-NEXT: vmerge.vim v9, v9, 1, v0
+; RV32-NEXT: vsetvli zero, zero, e16, m2, ta, ma
+; RV32-NEXT: vmv.s.x v0, a1
; RV32-NEXT: lui a1, %hi(.LCPI182_0)
; RV32-NEXT: addi a1, a1, %lo(.LCPI182_0)
; RV32-NEXT: vsetvli zero, zero, e8, m1, ta, ma
-; RV32-NEXT: vmv.v.i v15, 3
-; RV32-NEXT: vmv1r.v v0, v8
-; RV32-NEXT: vmerge.vim v14, v14, 1, v0
-; RV32-NEXT: vmv1r.v v0, v9
-; RV32-NEXT: vmerge.vim v9, v15, 2, v0
+; RV32-NEXT: vmv.v.i v14, 3
+; RV32-NEXT: vmerge.vim v14, v14, 2, v0
; RV32-NEXT: vle16.v v16, (a1)
; RV32-NEXT: vmv1r.v v0, v8
-; RV32-NEXT: vmerge.vim v8, v9, 1, v0
+; RV32-NEXT: vmerge.vim v8, v14, 1, v0
; RV32-NEXT: vsetvli zero, zero, e16, m2, ta, ma
-; RV32-NEXT: vsext.vf2 v18, v14
-; RV32-NEXT: vsrl.vv v14, v10, v18
+; RV32-NEXT: vsext.vf2 v14,...
[truncated]
✅ With the latest revision this PR passed the C/C++ code formatter.
Makes sense to me but I'll defer to @wangpc-pp on this
IIRC, we planned to place …
Not yet, I'll take a look now and see what's going on.
I tried this out; I had to handle subregisters and update it to maintain LiveIntervals, but the diff is massive (lukel97@c00e2b7). I need to spend some time investigating the impact properly, but at a quick glance:
So I don't have a strong opinion on this PR, but it may be a day or two before I'm able to properly test the RISCVVMV0Elimination approach on llvm-test-suite/SPEC/etc.!
; CHECK-NEXT: vmv.v.i v9, 3
; CHECK-NEXT: vmv.v.i v8, 3
; CHECK-NEXT: vsetvli zero, zero, e16, m2, ta, ma
; CHECK-NEXT: vmv.s.x v0, a1
; CHECK-NEXT: li a1, 146
; CHECK-NEXT: vmv.s.x v8, a1
; CHECK-NEXT: vsetvli zero, zero, e8, m1, ta, ma
; CHECK-NEXT: vmerge.vim v9, v9, 1, v0
; CHECK-NEXT: vmv1r.v v0, v8
; CHECK-NEXT: vmerge.vim v8, v9, 2, v0
; CHECK-NEXT: vmerge.vim v8, v8, 1, v0
; CHECK-NEXT: vsetvli zero, zero, e16, m2, ta, ma
; CHECK-NEXT: vmv.s.x v0, a1
; CHECK-NEXT: vsetvli zero, zero, e8, m1, ta, ma
; CHECK-NEXT: vmerge.vim v8, v8, 2, v0
This case is handled by moving RISCVVMV0Elimination to after pre-RA scheduling: lukel97@c00e2b7#diff-19fd7ee1004c9fe02050701bc140c5c670f89b7b143ffae3c9a649106c92628bR700-R711
I believe the RISCVVMV0Elimination approach is probably the right long-term direction, but can we work incrementally here? This patch seems like an improvement given the current state of things, and I'd like to not have it blocked on perfection. As an aside, I suspect we're still going to want some kind of scheduling false dependency even on the register class representation. Expecting the scheduler to perfectly handle register pressure on size=1 register classes seems like a big ask. I'd suggest starting with the scheduling dependency in place (over the new representation), and then remove it in a separate change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
As an aside, I suspect we're still going to want some kind of scheduling false dependency even on the register class representation. Expecting the scheduler to perfectly handle register pressure on size=1 register classes seems like a big ask. I'd suggest starting with the scheduling dependency in place (over the new representation), and then remove it in a separate change.
I don't fully understand this; can you please explain your thinking? I am also interested in this.
I'm suggesting that even if we move to using the vmv0 register class through register allocation (as Luke suggested), we probably still want a schedule DAG mutation to bias the scheduling of mask-producing vs. mask-consuming instructions. In theory, the scheduler should do this for us based on register pressure, but having a DAG mutation is likely to be more robust. We might remove it later, or we might not.
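To spell out what that false dependency looks like in practice: the mutation orders each qualifying mask producer after the nearest earlier reader of V0 by adding an artificial (non-data) scheduling edge. Below is a minimal sketch of that shape, following the structure of apply() in the diff above and using the standard ScheduleDAGInstrs/SDep API; the LMUL_8 guard from the diff is omitted, so treat this as an illustration rather than the exact upstream source.

  // Serialize a single-use mask producer behind the most recent V0 reader so
  // the new mask's live range does not overlap an earlier use of V0 and the
  // COPY into $v0 can later be coalesced away.
  void apply(ScheduleDAGInstrs *DAG) override {
    SUnit *NearestUseV0SU = nullptr;
    for (SUnit &SU : DAG->SUnits) {
      const MachineInstr *MI = SU.getInstr();
      // Track the latest instruction (in input order) that reads $v0.
      if (MI->findRegisterUseOperand(RISCV::V0, TRI))
        NearestUseV0SU = &SU;
      // If SU's only successor is a full-register COPY into $v0, force SU to
      // be scheduled after the previous V0 reader via an artificial edge.
      if (NearestUseV0SU && NearestUseV0SU != &SU && isSoleUseCopyToV0(SU))
        DAG->addEdge(&SU, SDep(NearestUseV0SU, SDep::Artificial));
    }
  }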