
[RISCV] Adjust RISCVVectorMaskDAGMutation to look for copy to V0 #129296


Merged
merged 5 commits from preames:pr-riscv-vector-mask-schedule-mutation into llvm:main on Mar 5, 2025

Conversation

preames
Collaborator

@preames commented Feb 28, 2025

This mutation was introduced in 01a15dc with the goal of avoiding many copies from v1-v31 to v0 immediately before a mask-consuming instruction. I noticed in a workload that it was not applying to vmv.s.x (which we use for short vector masks); we'd been using a whitelist of mask-producing instructions. Instead, we can directly inspect the user of the current node to see if it's a copy to V0. This isn't quite precise (the mask-producing instruction could already be scheduled fairly far from its single use), but it's probably good enough.

As with all schedule changes, results are mixed. Some significant improvements, some regressions.
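The effect is visible in the test diffs below: once the mask producer is scheduled so its result doesn't overlap the previous mask's live range in v0, the register allocator can place the result directly in v0 and the vmv1r.v copy disappears. A simplified excerpt (vsetvli changes omitted) from the buildvec_seq_v9i8 test:

    ; Before: mask built in v8, whole-register copy into v0 at the use
    vmv.s.x    v8, a1
    ...
    vmv1r.v    v0, v8
    vmerge.vim v8, v9, 2, v0

    ; After: producer kept adjacent to its single use, so the copy folds away
    vmv.s.x    v0, a1
    vmerge.vim v8, v8, 2, v0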

@llvmbot
Member

llvmbot commented Feb 28, 2025

@llvm/pr-subscribers-backend-risc-v

Author: Philip Reames (preames)

Changes



Patch is 120.59 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/129296.diff

22 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVVectorMaskDAGMutation.cpp (+21-68)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp-buildvec.ll (+3-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-buildvec.ll (+39-40)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-shuffles.ll (+7-9)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int.ll (+65-62)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-interleaved-access.ll (+442-490)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-gather.ll (+5-6)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-load-fp.ll (+9-9)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-load-int.ll (+6-6)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-store-fp.ll (+9-9)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-store-int.ll (+6-6)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-shuffle-changes-length.ll (+20-16)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-shuffle-deinterleave.ll (+38-37)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-shuffle-exact-vlen.ll (+4-5)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-shufflevector-vnsrl.ll (+1-1)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vpgather.ll (+2-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vselect-vp.ll (+14-27)
  • (modified) llvm/test/CodeGen/RISCV/rvv/mask-reg-alloc.mir (+1-1)
  • (modified) llvm/test/CodeGen/RISCV/rvv/mgather-sdnode.ll (+8-25)
  • (modified) llvm/test/CodeGen/RISCV/rvv/mscatter-sdnode.ll (+33-72)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-fixed.ll (+23-25)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vector-deinterleave.ll (+56-57)
diff --git a/llvm/lib/Target/RISCV/RISCVVectorMaskDAGMutation.cpp b/llvm/lib/Target/RISCV/RISCVVectorMaskDAGMutation.cpp
index ee90868d252e4..d82b54b70d3ea 100644
--- a/llvm/lib/Target/RISCV/RISCVVectorMaskDAGMutation.cpp
+++ b/llvm/lib/Target/RISCV/RISCVVectorMaskDAGMutation.cpp
@@ -13,11 +13,8 @@
 // The reason why we need to do this:
 // 1. When tracking register pressure, we don't track physical registers.
 // 2. We have a RegisterClass for mask register (which is `VMV0`), but we don't
-//    use it in most RVV pseudos (only used in inline asm constraint and add/sub
-//    with carry instructions). Instead, we use physical register V0 directly
-//    and insert a `$v0 = COPY ...` before the use. And, there is a fundamental
-//    issue in register allocator when handling RegisterClass with only one
-//    physical register, so we can't simply replace V0 with VMV0.
+//    use it by the time we reach scheduling. Instead, we use physical
+//    register V0 directly and insert a `$v0 = COPY ...` before the use. 
 // 3. For mask producers, we are using VR RegisterClass (we can allocate V0-V31
 //    to it). So if V0 is not available, there are still 31 available registers
 //    out there.
@@ -43,68 +40,6 @@
 
 namespace llvm {
 
-static inline bool isVectorMaskProducer(const MachineInstr *MI) {
-  switch (RISCV::getRVVMCOpcode(MI->getOpcode())) {
-  // Vector Mask Instructions
-  case RISCV::VMAND_MM:
-  case RISCV::VMNAND_MM:
-  case RISCV::VMANDN_MM:
-  case RISCV::VMXOR_MM:
-  case RISCV::VMOR_MM:
-  case RISCV::VMNOR_MM:
-  case RISCV::VMORN_MM:
-  case RISCV::VMXNOR_MM:
-  case RISCV::VMSBF_M:
-  case RISCV::VMSIF_M:
-  case RISCV::VMSOF_M:
-  // Vector Integer Add-with-Carry / Subtract-with-Borrow Instructions
-  case RISCV::VMADC_VV:
-  case RISCV::VMADC_VX:
-  case RISCV::VMADC_VI:
-  case RISCV::VMADC_VVM:
-  case RISCV::VMADC_VXM:
-  case RISCV::VMADC_VIM:
-  case RISCV::VMSBC_VV:
-  case RISCV::VMSBC_VX:
-  case RISCV::VMSBC_VVM:
-  case RISCV::VMSBC_VXM:
-  // Vector Integer Compare Instructions
-  case RISCV::VMSEQ_VV:
-  case RISCV::VMSEQ_VX:
-  case RISCV::VMSEQ_VI:
-  case RISCV::VMSNE_VV:
-  case RISCV::VMSNE_VX:
-  case RISCV::VMSNE_VI:
-  case RISCV::VMSLT_VV:
-  case RISCV::VMSLT_VX:
-  case RISCV::VMSLTU_VV:
-  case RISCV::VMSLTU_VX:
-  case RISCV::VMSLE_VV:
-  case RISCV::VMSLE_VX:
-  case RISCV::VMSLE_VI:
-  case RISCV::VMSLEU_VV:
-  case RISCV::VMSLEU_VX:
-  case RISCV::VMSLEU_VI:
-  case RISCV::VMSGTU_VX:
-  case RISCV::VMSGTU_VI:
-  case RISCV::VMSGT_VX:
-  case RISCV::VMSGT_VI:
-  // Vector Floating-Point Compare Instructions
-  case RISCV::VMFEQ_VV:
-  case RISCV::VMFEQ_VF:
-  case RISCV::VMFNE_VV:
-  case RISCV::VMFNE_VF:
-  case RISCV::VMFLT_VV:
-  case RISCV::VMFLT_VF:
-  case RISCV::VMFLE_VV:
-  case RISCV::VMFLE_VF:
-  case RISCV::VMFGT_VF:
-  case RISCV::VMFGE_VF:
-    return true;
-  }
-  return false;
-}
-
 class RISCVVectorMaskDAGMutation : public ScheduleDAGMutation {
 private:
   const TargetRegisterInfo *TRI;
@@ -112,6 +47,24 @@ class RISCVVectorMaskDAGMutation : public ScheduleDAGMutation {
 public:
   RISCVVectorMaskDAGMutation(const TargetRegisterInfo *TRI) : TRI(TRI) {}
 
+  bool isSoleUseCopyToV0(SUnit &SU) {
+    if (SU.Succs.size() != 1)
+      return false;
+    SDep &Dep = SU.Succs[0];
+    // Ignore dependencies other than data or strong ordering.
+    if (Dep.isWeak())
+      return false;
+
+    SUnit &DepSU = *Dep.getSUnit();
+    if (DepSU.isBoundaryNode())
+      return false;
+
+    const MachineInstr *DepMI = DepSU.getInstr();
+    return DepMI->isCopy() && DepMI->getOperand(0).getReg() == RISCV::V0 &&
+           DepMI->getOperand(1).getReg().isVirtual() &&
+           DepMI->getOperand(1).getSubReg() == RISCV::NoSubRegister;
+  }
+
   void apply(ScheduleDAGInstrs *DAG) override {
     SUnit *NearestUseV0SU = nullptr;
     for (SUnit &SU : DAG->SUnits) {
@@ -119,7 +72,7 @@ class RISCVVectorMaskDAGMutation : public ScheduleDAGMutation {
       if (MI->findRegisterUseOperand(RISCV::V0, TRI))
         NearestUseV0SU = &SU;
 
-      if (NearestUseV0SU && NearestUseV0SU != &SU && isVectorMaskProducer(MI) &&
+      if (NearestUseV0SU && NearestUseV0SU != &SU && isSoleUseCopyToV0(SU) &&
           // For LMUL=8 cases, there will be more possibilities to spill.
           // FIXME: We should use RegPressureTracker to do fine-grained
           // controls.
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp-buildvec.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp-buildvec.ll
index 9bd1da2e53dce..eb40c133514fe 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp-buildvec.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp-buildvec.ll
@@ -46,10 +46,11 @@ define <4 x float> @hang_when_merging_stores_after_legalization(<8 x float> %x,
 ; CHECK-NEXT:    vslideup.vi v12, v10, 2, v0.t
 ; CHECK-NEXT:    vsetivli zero, 1, e8, mf8, ta, ma
 ; CHECK-NEXT:    vmv.v.i v0, 2
-; CHECK-NEXT:    vmv.v.i v10, 12
 ; CHECK-NEXT:    vsetivli zero, 8, e32, m2, ta, mu
 ; CHECK-NEXT:    vslidedown.vi v8, v8, 6, v0.t
-; CHECK-NEXT:    vmv1r.v v0, v10
+; CHECK-NEXT:    vsetivli zero, 1, e8, mf8, ta, ma
+; CHECK-NEXT:    vmv.v.i v0, 12
+; CHECK-NEXT:    vsetivli zero, 8, e32, m2, ta, ma
 ; CHECK-NEXT:    vmerge.vvm v8, v8, v12, v0
 ; CHECK-NEXT:    ret
   %z = shufflevector <8 x float> %x, <8 x float> %y, <4 x i32> <i32 0, i32 7, i32 8, i32 15>
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-buildvec.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-buildvec.ll
index c628a0d620498..16bb2105f8680 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-buildvec.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-buildvec.ll
@@ -698,15 +698,16 @@ define void @buildvec_seq_v9i8(ptr %x) {
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    li a1, 73
 ; CHECK-NEXT:    vsetivli zero, 9, e8, m1, ta, ma
-; CHECK-NEXT:    vmv.v.i v9, 3
+; CHECK-NEXT:    vmv.v.i v8, 3
 ; CHECK-NEXT:    vsetvli zero, zero, e16, m2, ta, ma
 ; CHECK-NEXT:    vmv.s.x v0, a1
 ; CHECK-NEXT:    li a1, 146
-; CHECK-NEXT:    vmv.s.x v8, a1
 ; CHECK-NEXT:    vsetvli zero, zero, e8, m1, ta, ma
-; CHECK-NEXT:    vmerge.vim v9, v9, 1, v0
-; CHECK-NEXT:    vmv1r.v v0, v8
-; CHECK-NEXT:    vmerge.vim v8, v9, 2, v0
+; CHECK-NEXT:    vmerge.vim v8, v8, 1, v0
+; CHECK-NEXT:    vsetvli zero, zero, e16, m2, ta, ma
+; CHECK-NEXT:    vmv.s.x v0, a1
+; CHECK-NEXT:    vsetvli zero, zero, e8, m1, ta, ma
+; CHECK-NEXT:    vmerge.vim v8, v8, 2, v0
 ; CHECK-NEXT:    vse8.v v8, (a0)
 ; CHECK-NEXT:    ret
   store <9 x i8> <i8 1, i8 2, i8 3, i8 1, i8 2, i8 3, i8 1, i8 2, i8 3>, ptr %x
@@ -973,27 +974,27 @@ define <512 x i8> @buildvec_not_vid_v512i8_indices_overflow_2() vscale_range(16,
 ; RV32:       # %bb.0:
 ; RV32-NEXT:    vsetivli zero, 16, e32, mf2, ta, ma
 ; RV32-NEXT:    vmv.v.i v0, 15
-; RV32-NEXT:    vmv.v.i v9, 0
+; RV32-NEXT:    vmv.v.i v8, 0
 ; RV32-NEXT:    li a0, 512
 ; RV32-NEXT:    li a1, 240
-; RV32-NEXT:    vmv.s.x v8, a1
-; RV32-NEXT:    li a1, 15
-; RV32-NEXT:    vmerge.vim v10, v9, -1, v0
+; RV32-NEXT:    vmerge.vim v9, v8, -1, v0
 ; RV32-NEXT:    vsetvli zero, a0, e8, m4, ta, ma
 ; RV32-NEXT:    vmv.v.i v12, 3
-; RV32-NEXT:    slli a1, a1, 8
-; RV32-NEXT:    vmv1r.v v0, v10
+; RV32-NEXT:    vmv1r.v v0, v9
 ; RV32-NEXT:    vmerge.vim v12, v12, 0, v0
-; RV32-NEXT:    vmv1r.v v0, v8
+; RV32-NEXT:    vsetvli zero, zero, e16, m8, ta, ma
+; RV32-NEXT:    vmv.s.x v0, a1
+; RV32-NEXT:    li a1, 15
 ; RV32-NEXT:    vsetivli zero, 16, e32, mf2, ta, ma
-; RV32-NEXT:    vmerge.vim v10, v9, -1, v0
-; RV32-NEXT:    vmv.s.x v8, a1
-; RV32-NEXT:    vmv1r.v v0, v10
+; RV32-NEXT:    vmerge.vim v9, v8, -1, v0
+; RV32-NEXT:    slli a1, a1, 8
+; RV32-NEXT:    vmv1r.v v0, v9
 ; RV32-NEXT:    vsetvli zero, a0, e8, m4, ta, ma
 ; RV32-NEXT:    vmerge.vim v12, v12, 1, v0
-; RV32-NEXT:    vmv1r.v v0, v8
+; RV32-NEXT:    vsetvli zero, zero, e16, m8, ta, ma
+; RV32-NEXT:    vmv.s.x v0, a1
 ; RV32-NEXT:    vsetivli zero, 16, e32, mf2, ta, ma
-; RV32-NEXT:    vmerge.vim v8, v9, -1, v0
+; RV32-NEXT:    vmerge.vim v8, v8, -1, v0
 ; RV32-NEXT:    vmv1r.v v0, v8
 ; RV32-NEXT:    vsetvli zero, a0, e8, m4, ta, ma
 ; RV32-NEXT:    vmerge.vim v8, v12, 2, v0
@@ -1003,25 +1004,23 @@ define <512 x i8> @buildvec_not_vid_v512i8_indices_overflow_2() vscale_range(16,
 ; RV64V:       # %bb.0:
 ; RV64V-NEXT:    vsetivli zero, 8, e64, m1, ta, ma
 ; RV64V-NEXT:    vmv.v.i v0, 3
-; RV64V-NEXT:    vmv.v.i v9, 0
+; RV64V-NEXT:    vmv.v.i v8, 0
 ; RV64V-NEXT:    li a0, 512
-; RV64V-NEXT:    vmv.v.i v8, 12
-; RV64V-NEXT:    li a1, 48
-; RV64V-NEXT:    vmerge.vim v10, v9, -1, v0
+; RV64V-NEXT:    vmerge.vim v9, v8, -1, v0
 ; RV64V-NEXT:    vsetvli zero, a0, e8, m4, ta, ma
 ; RV64V-NEXT:    vmv.v.i v12, 3
-; RV64V-NEXT:    vmv1r.v v0, v10
+; RV64V-NEXT:    vmv1r.v v0, v9
 ; RV64V-NEXT:    vmerge.vim v12, v12, 0, v0
-; RV64V-NEXT:    vmv1r.v v0, v8
 ; RV64V-NEXT:    vsetivli zero, 8, e64, m1, ta, ma
-; RV64V-NEXT:    vmerge.vim v10, v9, -1, v0
-; RV64V-NEXT:    vmv.s.x v8, a1
-; RV64V-NEXT:    vmv.v.v v0, v10
+; RV64V-NEXT:    vmv.v.i v0, 12
+; RV64V-NEXT:    vmerge.vim v9, v8, -1, v0
+; RV64V-NEXT:    li a1, 48
+; RV64V-NEXT:    vmv.v.v v0, v9
 ; RV64V-NEXT:    vsetvli zero, a0, e8, m4, ta, ma
 ; RV64V-NEXT:    vmerge.vim v12, v12, 1, v0
-; RV64V-NEXT:    vmv1r.v v0, v8
+; RV64V-NEXT:    vmv.s.x v0, a1
 ; RV64V-NEXT:    vsetivli zero, 8, e64, m1, ta, ma
-; RV64V-NEXT:    vmerge.vim v8, v9, -1, v0
+; RV64V-NEXT:    vmerge.vim v8, v8, -1, v0
 ; RV64V-NEXT:    vmv.v.v v0, v8
 ; RV64V-NEXT:    vsetvli zero, a0, e8, m4, ta, ma
 ; RV64V-NEXT:    vmerge.vim v8, v12, 2, v0
@@ -1031,27 +1030,27 @@ define <512 x i8> @buildvec_not_vid_v512i8_indices_overflow_2() vscale_range(16,
 ; RV64ZVE32:       # %bb.0:
 ; RV64ZVE32-NEXT:    vsetivli zero, 16, e32, m1, ta, ma
 ; RV64ZVE32-NEXT:    vmv.v.i v0, 15
-; RV64ZVE32-NEXT:    vmv.v.i v9, 0
+; RV64ZVE32-NEXT:    vmv.v.i v8, 0
 ; RV64ZVE32-NEXT:    li a0, 512
 ; RV64ZVE32-NEXT:    li a1, 240
-; RV64ZVE32-NEXT:    vmv.s.x v8, a1
-; RV64ZVE32-NEXT:    li a1, 15
-; RV64ZVE32-NEXT:    vmerge.vim v10, v9, -1, v0
+; RV64ZVE32-NEXT:    vmerge.vim v9, v8, -1, v0
 ; RV64ZVE32-NEXT:    vsetvli zero, a0, e8, m4, ta, ma
 ; RV64ZVE32-NEXT:    vmv.v.i v12, 3
-; RV64ZVE32-NEXT:    slli a1, a1, 8
-; RV64ZVE32-NEXT:    vmv1r.v v0, v10
+; RV64ZVE32-NEXT:    vmv1r.v v0, v9
 ; RV64ZVE32-NEXT:    vmerge.vim v12, v12, 0, v0
-; RV64ZVE32-NEXT:    vmv1r.v v0, v8
+; RV64ZVE32-NEXT:    vsetvli zero, zero, e16, m8, ta, ma
+; RV64ZVE32-NEXT:    vmv.s.x v0, a1
+; RV64ZVE32-NEXT:    li a1, 15
 ; RV64ZVE32-NEXT:    vsetivli zero, 16, e32, m1, ta, ma
-; RV64ZVE32-NEXT:    vmerge.vim v10, v9, -1, v0
-; RV64ZVE32-NEXT:    vmv.s.x v8, a1
-; RV64ZVE32-NEXT:    vmv.v.v v0, v10
+; RV64ZVE32-NEXT:    vmerge.vim v9, v8, -1, v0
+; RV64ZVE32-NEXT:    slli a1, a1, 8
+; RV64ZVE32-NEXT:    vmv.v.v v0, v9
 ; RV64ZVE32-NEXT:    vsetvli zero, a0, e8, m4, ta, ma
 ; RV64ZVE32-NEXT:    vmerge.vim v12, v12, 1, v0
-; RV64ZVE32-NEXT:    vmv1r.v v0, v8
+; RV64ZVE32-NEXT:    vsetvli zero, zero, e16, m8, ta, ma
+; RV64ZVE32-NEXT:    vmv.s.x v0, a1
 ; RV64ZVE32-NEXT:    vsetivli zero, 16, e32, m1, ta, ma
-; RV64ZVE32-NEXT:    vmerge.vim v8, v9, -1, v0
+; RV64ZVE32-NEXT:    vmerge.vim v8, v8, -1, v0
 ; RV64ZVE32-NEXT:    vmv.v.v v0, v8
 ; RV64ZVE32-NEXT:    vsetvli zero, a0, e8, m4, ta, ma
 ; RV64ZVE32-NEXT:    vmerge.vim v8, v12, 2, v0
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-shuffles.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-shuffles.ll
index f307ebb422c6c..cd73dbadb2d03 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-shuffles.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-shuffles.ll
@@ -105,11 +105,10 @@ define <4 x i16> @vrgather_shuffle_vx_v4i16(<4 x i16> %x) {
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetivli zero, 4, e16, mf2, ta, mu
 ; CHECK-NEXT:    vmv.v.i v0, 2
-; CHECK-NEXT:    vmv.v.i v9, 3
 ; CHECK-NEXT:    vslidedown.vi v8, v8, 2, v0.t
-; CHECK-NEXT:    vmv.v.i v10, 5
-; CHECK-NEXT:    vmv1r.v v0, v9
-; CHECK-NEXT:    vmerge.vvm v8, v10, v8, v0
+; CHECK-NEXT:    vmv.v.i v0, 3
+; CHECK-NEXT:    vmv.v.i v9, 5
+; CHECK-NEXT:    vmerge.vvm v8, v9, v8, v0
 ; CHECK-NEXT:    ret
   %s = shufflevector <4 x i16> %x, <4 x i16> <i16 5, i16 5, i16 5, i16 5>, <4 x i32> <i32 0, i32 3, i32 6, i32 5>
   ret <4 x i16> %s
@@ -971,13 +970,12 @@ define <8 x i32> @shuffle_repeat3_singlesrc_e32(<8 x i32> %v) {
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetivli zero, 8, e16, m1, ta, ma
 ; CHECK-NEXT:    vmv.v.i v0, 7
-; CHECK-NEXT:    vmv.v.i v10, 1
+; CHECK-NEXT:    vmv.v.i v9, 1
 ; CHECK-NEXT:    li a0, 192
-; CHECK-NEXT:    vmv.s.x v9, a0
+; CHECK-NEXT:    vmerge.vim v9, v9, 0, v0
+; CHECK-NEXT:    vmv.s.x v0, a0
 ; CHECK-NEXT:    csrr a0, vlenb
-; CHECK-NEXT:    vmerge.vim v10, v10, 0, v0
-; CHECK-NEXT:    vmv.v.v v0, v9
-; CHECK-NEXT:    vmerge.vim v9, v10, 2, v0
+; CHECK-NEXT:    vmerge.vim v9, v9, 2, v0
 ; CHECK-NEXT:    srli a0, a0, 2
 ; CHECK-NEXT:    vslidedown.vx v10, v9, a0
 ; CHECK-NEXT:    vsetvli a0, zero, e32, m1, ta, ma
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int.ll
index 30751f8ea706b..39fd70beb9ee2 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int.ll
@@ -1046,44 +1046,45 @@ define void @mulhu_v16i8(ptr %x) {
 ; CHECK-LABEL: mulhu_v16i8:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetivli zero, 16, e8, m1, ta, ma
-; CHECK-NEXT:    vle8.v v9, (a0)
+; CHECK-NEXT:    vle8.v v8, (a0)
 ; CHECK-NEXT:    lui a1, 3
-; CHECK-NEXT:    vmv.v.i v10, 0
+; CHECK-NEXT:    vmv.v.i v9, 0
 ; CHECK-NEXT:    lui a2, %hi(.LCPI65_0)
 ; CHECK-NEXT:    addi a2, a2, %lo(.LCPI65_0)
-; CHECK-NEXT:    vle8.v v11, (a2)
-; CHECK-NEXT:    lui a2, 1
+; CHECK-NEXT:    vle8.v v10, (a2)
+; CHECK-NEXT:    li a2, -128
 ; CHECK-NEXT:    addi a1, a1, -2044
 ; CHECK-NEXT:    vsetvli zero, zero, e16, m2, ta, ma
 ; CHECK-NEXT:    vmv.s.x v0, a1
-; CHECK-NEXT:    addi a1, a2, 32
-; CHECK-NEXT:    vmv.s.x v8, a1
-; CHECK-NEXT:    li a1, -128
+; CHECK-NEXT:    lui a1, 1
 ; CHECK-NEXT:    vsetvli zero, zero, e8, m1, ta, ma
-; CHECK-NEXT:    vmerge.vxm v12, v10, a1, v0
-; CHECK-NEXT:    li a1, 513
-; CHECK-NEXT:    vmv.v.i v13, 4
-; CHECK-NEXT:    vmv1r.v v0, v8
-; CHECK-NEXT:    vmerge.vim v8, v10, 1, v0
+; CHECK-NEXT:    vmerge.vxm v11, v9, a2, v0
+; CHECK-NEXT:    addi a2, a1, 32
 ; CHECK-NEXT:    vsetvli zero, zero, e16, m2, ta, ma
-; CHECK-NEXT:    vmv.s.x v0, a1
-; CHECK-NEXT:    addi a1, a2, 78
+; CHECK-NEXT:    vmv.s.x v0, a2
+; CHECK-NEXT:    li a2, 513
+; CHECK-NEXT:    vsetvli zero, zero, e8, m1, ta, ma
+; CHECK-NEXT:    vmv.v.i v12, 4
+; CHECK-NEXT:    addi a1, a1, 78
+; CHECK-NEXT:    vmerge.vim v9, v9, 1, v0
+; CHECK-NEXT:    vsetvli zero, zero, e16, m2, ta, ma
+; CHECK-NEXT:    vmv.s.x v0, a2
 ; CHECK-NEXT:    vsetvli zero, zero, e8, m1, ta, ma
-; CHECK-NEXT:    vmerge.vim v10, v13, 1, v0
+; CHECK-NEXT:    vsrl.vv v9, v8, v9
+; CHECK-NEXT:    vmerge.vim v12, v12, 1, v0
 ; CHECK-NEXT:    vsetvli zero, zero, e16, m2, ta, ma
 ; CHECK-NEXT:    vmv.s.x v0, a1
 ; CHECK-NEXT:    vsetvli zero, zero, e8, m1, ta, ma
-; CHECK-NEXT:    vsrl.vv v8, v9, v8
-; CHECK-NEXT:    vmulhu.vv v8, v8, v11
-; CHECK-NEXT:    vmerge.vim v10, v10, 3, v0
+; CHECK-NEXT:    vmulhu.vv v9, v9, v10
+; CHECK-NEXT:    vmerge.vim v10, v12, 3, v0
 ; CHECK-NEXT:    lui a1, 8
 ; CHECK-NEXT:    addi a1, a1, 304
-; CHECK-NEXT:    vsub.vv v9, v9, v8
-; CHECK-NEXT:    vmulhu.vv v9, v9, v12
+; CHECK-NEXT:    vsub.vv v8, v8, v9
+; CHECK-NEXT:    vmulhu.vv v8, v8, v11
 ; CHECK-NEXT:    vsetvli zero, zero, e16, m2, ta, ma
 ; CHECK-NEXT:    vmv.s.x v0, a1
 ; CHECK-NEXT:    vsetvli zero, zero, e8, m1, ta, ma
-; CHECK-NEXT:    vadd.vv v8, v9, v8
+; CHECK-NEXT:    vadd.vv v8, v8, v9
 ; CHECK-NEXT:    vmerge.vim v9, v10, 2, v0
 ; CHECK-NEXT:    vsrl.vv v8, v8, v9
 ; CHECK-NEXT:    vse8.v v8, (a0)
@@ -3153,48 +3154,49 @@ define void @mulhu_v32i8(ptr %x) {
 ; CHECK-LABEL: mulhu_v32i8:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    li a1, 32
-; CHECK-NEXT:    lui a2, 163907
-; CHECK-NEXT:    addi a2, a2, -2044
-; CHECK-NEXT:    vsetivli zero, 1, e32, m1, ta, ma
-; CHECK-NEXT:    vmv.s.x v0, a2
-; CHECK-NEXT:    lui a2, 66049
-; CHECK-NEXT:    addi a2, a2, 32
-; CHECK-NEXT:    vmv.s.x v8, a2
-; CHECK-NEXT:    li a2, -128
-; CHECK-NEXT:    vsetvli zero, a1, e8, m2, ta, ma
+; CHECK-NEXT:    lui a2, %hi(.LCPI181_0)
+; CHECK-NEXT:    addi a2, a2, %lo(.LCPI181_0)
+; CHECK-NEXT:    vsetvli zero, a1, e32, m8, ta, ma
+; CHECK-NEXT:    vle8.v v8, (a2)
+; CHECK-NEXT:    lui a1, 163907
+; CHECK-NEXT:    addi a1, a1, -2044
+; CHECK-NEXT:    vmv.s.x v0, a1
+; CHECK-NEXT:    li a1, -128
+; CHECK-NEXT:    vsetvli zero, zero, e8, m2, ta, ma
 ; CHECK-NEXT:    vmv.v.i v10, 0
-; CHECK-NEXT:    vmerge.vxm v12, v10, a2, v0
-; CHECK-NEXT:    lui a1, %hi(.LCPI181_0)
-; CHECK-NEXT:    addi a1, a1, %lo(.LCPI181_0)
-; CHECK-NEXT:    vle8.v v14, (a0)
-; CHECK-NEXT:    vmv1r.v v0, v8
-; CHECK-NEXT:    vmerge.vim v8, v10, 1, v0
-; CHECK-NEXT:    vle8.v v10, (a1)
+; CHECK-NEXT:    vmerge.vxm v12, v10, a1, v0
+; CHECK-NEXT:    lui a1, 66049
+; CHECK-NEXT:    addi a1, a1, 32
+; CHECK-NEXT:    vsetvli zero, zero, e32, m8, ta, ma
+; CHECK-NEXT:    vmv.s.x v0, a1
 ; CHECK-NEXT:    lui a1, 8208
+; CHECK-NEXT:    vle8.v v14, (a0)
 ; CHECK-NEXT:    addi a1, a1, 513
-; CHECK-NEXT:    vsrl.vv v8, v14, v8
-; CHECK-NEXT:    vmulhu.vv v10, v8, v10
+; CHECK-NEXT:    vsetvli zero, zero, e8, m2, ta, ma
+; CHECK-NEXT:    vmerge.vim v10, v10, 1, v0
 ; CHECK-NEXT:    vsetvli zero, zero, e32, m8, ta, ma
 ; CHECK-NEXT:    vmv.s.x v0, a1
 ; CHECK-NEXT:    lui a1, 66785
+; CHECK-NEXT:    vsetvli zero, zero, e8, m2, ta, ma
+; CHECK-NEXT:    vmv.v.i v16, 4
 ; CHECK-NEXT:    addi a1, a1, 78
-; CHECK-NEXT:    vmv.s.x v8, a1
-; CHECK-NEXT:    lui a1, 529160
+; CHECK-NEXT:    vsrl.vv v10, v14, v10
+; CHECK-NEXT:    vmerge.vim v16, v16, 1, v0
+; CHECK-NEXT:    vsetvli zero, zero, e32, m8, ta, ma
+; CHECK-NEXT:    vmv.s.x v0, a1
 ; CHECK-NEXT:    vsetvli zero, zero, e8, m2, ta, ma
-; CHECK-NEXT:    vsub.vv v14, v14, v10
-; CHECK-NEXT:    vmulhu.vv v12, v14, v12
-; CHECK-NEXT:    vmv.v.i v14, 4
+; CHECK-NEXT:    vmulhu.vv v8, v10, v8
+; CHECK-NEXT:    vmerge.vim v10, v16, 3, v0
+; CHECK-NEXT:    lui a1, 529160
 ; CHECK-NEXT:    addi a1, a1, 304
-; CHECK-NEXT:    vmerge.vim v14, v14, 1, v0
+; CHECK-NEXT:    vsub.vv v14, v14, v8
+; CHECK-NEXT:    vmulhu.vv v12, v14, v12
 ; CHECK-NEXT:    vsetvli zero, zero, e32, m8, ta, ma
-; CHECK-NEXT:    vmv.s.x v9, a1
-; CHECK-NEXT:    vmv1r.v v0, v8
+; CHECK-NEXT:    vmv.s.x v0, a1
 ; CHECK-NEXT:    vsetvli zero, zero, e8, m2, ta, ma
-; CHECK-NEXT:    vmerge.vim v14, v14, 3, v0
-; CHECK-NEXT:    vadd.vv v10, v12, v10
-; CHECK-NEXT:    vmv1r.v v0, v9
-; CHECK-NEXT:    vmerge.vim v8, v14, 2, v0
-; CHECK-NEXT:    vsrl.vv v8, v10, v8
+; CHECK-NEXT:    vadd.vv v8, v12, v8
+; CHECK-NEXT:    vmerge.vim v10, v10, 2, v0
+; CHECK-NEXT:    vsrl.vv v8, v8, v10
 ; CHECK-NEXT:    vse8.v v8, (a0)
 ; CHECK-NEXT:    ret
   %a = load <32 x i8>, ptr %x
@@ -3215,27 +3217,28 @@ define void @mulhu_v16i16(ptr %x) {
 ; RV32-NEXT:    vmerge.vxm v12, v8, a1, v0
 ; RV32-NEXT:    lui a1, 4
 ; RV32-NEXT:    vsetvli zero, zero, e8, m1, ta, ma
-; RV32-NEXT:    vmv.v.i v14, 0
+; RV32-NEXT:    vmv.v.i v9, 0
 ; RV32-NEXT:    addi a1, a1, 64
 ; RV32-NEXT:    vsetvli zero, zero, e16, m2, ta, ma
 ; RV32-NEXT:    vmv.s.x v8, a1
 ; RV32-NEXT:    lui a1, 2
 ; RV32-NEXT:    addi a1, a1, 289
-; RV32-NEXT:    vmv.s.x v9, a1
+; RV32-NEXT:    vmv1r.v v0, v8
+; RV32-NEXT:    vsetvli zero, zero, e8, m1, ta, ma
+; RV32-NEXT:    vmerge.vim v9, v9, 1, v0
+; RV32-NEXT:    vsetvli zero, zero, e16, m2, ta, ma
+; RV32-NEXT:    vmv.s.x v0, a1
 ; RV32-NEXT:    lui a1, %hi(.LCPI182_0)
 ; RV32-NEXT:    addi a1, a1, %lo(.LCPI182_0)
 ; RV32-NEXT:    vsetvli zero, zero, e8, m1, ta, ma
-; RV32-NEXT:    vmv.v.i v15, 3
-; RV32-NEXT:    vmv1r.v v0, v8
-; RV32-NEXT:    vmerge.vim v14, v14, 1, v0
-; RV32-NEXT:    vmv1r.v v0, v9
-; RV32-NEXT:    vmerge.vim v9, v15, 2, v0
+; RV32-NEXT:    vmv.v.i v14, 3
+; RV32-NEXT:    vmerge.vim v14, v14, 2, v0
 ; RV32-NEXT:    vle16.v v16, (a1)
 ; RV32-NEXT:    vmv1r.v v0, v8
-; RV32-NEXT:    vmerge.vim v8, v9, 1, v0
+; RV32-NEXT:    vmerge.vim v8, v14, 1, v0
 ; RV32-NEXT:    vsetvli zero, zero, e16, m2, ta, ma
-; RV32-NEXT:    vsext.vf2 v18, v14
-; RV32-NEXT:    vsrl.vv v14, v10, v18
+; RV32-NEXT:    vsext.vf2 v14,...
[truncated]


github-actions bot commented Feb 28, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

Contributor

@lukel97 left a comment

Makes sense to me but I'll defer to @wangpc-pp on this

@wangpc-pp
Contributor

Makes sense to me but I'll defer to @wangpc-pp on this

IIRC, we planned to place RISCVVMV0Elimination after pre-ra instruction scheduling and remove this mutation. Have you tried it? I just gave it a try and a lot of errors occurred.
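For context on the pass being discussed: RISCVVMV0Elimination rewrites mask operands constrained to the single-register VMV0 class into reads of physical $v0 preceded by a COPY, and today it runs before pre-ra scheduling, so the scheduler only ever sees the physical-register form (this is what the header comment in the modified file describes). A rough MIR sketch of the two forms; the opcode and register names are placeholders, not real pseudos:

    ; Physical-register form the pre-ra scheduler sees today:
    $v0 = COPY %mask:vr
    %dst:vrnov0 = PSEUDO_MASKED_OP %passthru:vrnov0, %src:vr, $v0, ...

    ; VMV0 register-class form (what the scheduler would see if the pass ran
    ; after scheduling): the mask stays virtual, constrained to the
    ; one-register VMV0 class, with no explicit copy to $v0 yet.
    %dst:vrnov0 = PSEUDO_MASKED_OP %passthru:vrnov0, %src:vr, %mask:vmv0, ...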

@lukel97
Contributor

lukel97 commented Mar 3, 2025

Makes sense to me but I'll defer to @wangpc-pp on this

IIRC, we planned to place RISCVVMV0Elimination after pre-ra instruction scheduling and remove this mutation. Have you tried it? I just gave it a try and a lot of errors occurred.

Not yet, I'll take a look now and see what's going on

@lukel97
Contributor

lukel97 commented Mar 3, 2025

Makes sense to me but I'll defer to @wangpc-pp on this

IIRC, we planned to place RISCVVMV0Elimination after pre-ra instruction scheduling and remove this mutation. Have you tried it? I just gave it a try and a lot of errors occurred.

I tried this out. I had to handle subregisters and update the pass to maintain LiveIntervals, but the diff is massive: lukel97@c00e2b7

I need to spend some time investigating the impact properly, but at a quick glance:

  • I'm not seeing any extra vmv1r.v v0, vx after removing the mutation
  • More instructions get scheduled past masked instructions

So I don't have a strong opinion on this PR, but it may be a day or two before I'm able to properly test the RISCVVMV0Elimination approach on llvm-test-suite/SPEC/etc!

Comment on lines -701 to +710
-; CHECK-NEXT:    vmv.v.i v9, 3
+; CHECK-NEXT:    vmv.v.i v8, 3
 ; CHECK-NEXT:    vsetvli zero, zero, e16, m2, ta, ma
 ; CHECK-NEXT:    vmv.s.x v0, a1
 ; CHECK-NEXT:    li a1, 146
-; CHECK-NEXT:    vmv.s.x v8, a1
 ; CHECK-NEXT:    vsetvli zero, zero, e8, m1, ta, ma
-; CHECK-NEXT:    vmerge.vim v9, v9, 1, v0
-; CHECK-NEXT:    vmv1r.v v0, v8
-; CHECK-NEXT:    vmerge.vim v8, v9, 2, v0
+; CHECK-NEXT:    vmerge.vim v8, v8, 1, v0
+; CHECK-NEXT:    vsetvli zero, zero, e16, m2, ta, ma
+; CHECK-NEXT:    vmv.s.x v0, a1
+; CHECK-NEXT:    vsetvli zero, zero, e8, m1, ta, ma
+; CHECK-NEXT:    vmerge.vim v8, v8, 2, v0
Contributor

This case is handled by moving RISCVVMV0Elimination to after pre-ra scheduling: lukel97@c00e2b7#diff-19fd7ee1004c9fe02050701bc140c5c670f89b7b143ffae3c9a649106c92628bR700-R711

@preames
Collaborator Author

preames commented Mar 4, 2025

So I don't have a strong opinion on this PR, but it may be a day or two before I'm able to properly test the RISCVVMV0Elimination approach on llvm-test-suite/SPEC/etc!

I believe the RISCVVMV0Elimination approach is probably the right long-term direction, but can we work incrementally here? This patch seems like an improvement given the current state of things, and I'd like not to have it blocked on perfection.

As an aside, I suspect we're still going to want some kind of scheduling false dependency even with the register class representation. Expecting the scheduler to perfectly handle register pressure on size=1 register classes seems like a big ask. I'd suggest starting with the scheduling dependency in place (on top of the new representation), and then removing it in a separate change.

Contributor

@wangpc-pp left a comment

LGTM.

As an aside, I suspect we're still going to want some kind of scheduling false dependency even with the register class representation. Expecting the scheduler to perfectly handle register pressure on size=1 register classes seems like a big ask. I'd suggest starting with the scheduling dependency in place (on top of the new representation), and then removing it in a separate change.

I don't fully understand this; could you please explain your thinking? I'm also interested in this.

@preames merged commit 6e749f5 into llvm:main on Mar 5, 2025
7 of 11 checks passed
@preames deleted the pr-riscv-vector-mask-schedule-mutation branch on March 5, 2025 at 16:56
@preames
Collaborator Author

preames commented Mar 5, 2025

I don't fully understand this; could you please explain your thinking? I'm also interested in this.

I'm suggesting that even if we move to using the vmv0 register class through register allocation (as Luke suggested), we probably still want a schedule DAG mutation to bias the scheduling of mask-producing vs. mask-consuming instructions. In theory, the scheduler should do this for us based on register pressure, but having a DAG mutation is likely to be more robust. We might remove it later, or we might not.
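To make the suggestion concrete, here is a minimal sketch of what such a mutation amounts to, pieced together from the apply() hunk in the diff above. The edge-adding call and the LMUL=8 guard sit outside the quoted context, so treat those details as assumptions rather than verbatim upstream code:

    // Inside RISCVVectorMaskDAGMutation: walk the scheduling region in order,
    // remember the most recent reader of $v0, and add an artificial (false)
    // ordering edge so a later mask producer whose only use is a copy into
    // $v0 cannot be hoisted above that earlier V0 user.
    void apply(ScheduleDAGInstrs *DAG) override {
      SUnit *NearestUseV0SU = nullptr;
      for (SUnit &SU : DAG->SUnits) {
        const MachineInstr *MI = SU.getInstr();
        if (MI->findRegisterUseOperand(RISCV::V0, TRI))
          NearestUseV0SU = &SU;

        if (NearestUseV0SU && NearestUseV0SU != &SU && isSoleUseCopyToV0(SU))
          // Assumed form of the dependency; upstream also skips LMUL=8 here
          // (see the FIXME in the hunk above).
          DAG->addEdge(&SU, SDep(NearestUseV0SU, SDep::Artificial));
      }
    }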

jph-13 pushed a commit to jph-13/llvm-project that referenced this pull request Mar 21, 2025