
[RISCV][VLOPT] Allow users that are passthrus if tail elements aren't demanded #124066


Merged

lukel97 merged 2 commits into llvm:main from the vloptimizer/passthru-users branch on Jan 30, 2025

Conversation

lukel97
Contributor

@lukel97 lukel97 commented Jan 23, 2025

The motivation for this is to allow reducing the VL when a user is a ternary pseudo, where the third operand is tied and also acts as a passthru.

When checking the users of an instruction, we currently bail if the instruction is used as a passthru operand, because all of its elements past VL will be used for the user's tail.

We can allow passthru users if we know the tail of their result isn't used, which we will have computed beforehand after #124530.

It's worth noting that this is all independent of the tail policy, because tail agnostic still ends up using the passthru.

I've checked that SPEC CPU 2017 + llvm-test-suite pass with this (on qemu with rvv_ta_all_1s=true)

Fixes #123760
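
For reference, the ternary case in question looks like this in IR, adapted from the optimize_ternary_use test added by this PR. The zero-extend feeds the tied third operand of the resulting vmadd.vv, and since the store only demands %vl elements, the vzext.vf2 no longer needs to run at VLMAX:

  define void @optimize_ternary_use(<vscale x 4 x i16> %a, <vscale x 4 x i32> %b, <vscale x 4 x i32> %c, ptr %p, iXLen %vl) {
    %1 = zext <vscale x 4 x i16> %a to <vscale x 4 x i32>   ; vzext.vf2, previously kept at VLMAX
    %2 = mul <vscale x 4 x i32> %b, %1
    %3 = add <vscale x 4 x i32> %2, %c                       ; mul+add select to vmadd.vv with %1 tied
    call void @llvm.riscv.vse(<vscale x 4 x i32> %3, ptr %p, iXLen %vl)   ; only %vl elements are demanded
    ret void
  }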

@llvmbot
Member

llvmbot commented Jan 23, 2025

@llvm/pr-subscribers-backend-risc-v

Author: Luke Lau (lukel97)

Changes

The motivation for this is to allow reducing the VL when a user is a ternary pseudo, where the third operand is tied and also acts as a passthru.

When checking the users of an instruction, we currently bail if the instruction is used as a passthru operand, because all of its elements past VL will be used for the user's tail.

We can allow passthru users if we know the tail of their result isn't used, and we can reuse checkUsers to check this.

It's worth noting that this is all independent of the tail policy, because tail agnostic still ends up using the passthru.

I've checked that SPEC CPU 2017 + llvm-test-suite pass with this (on qemu with rvv_ta_all_1s=true)

Fixes #123760


Full diff: https://github.com/llvm/llvm-project/pull/124066.diff

5 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVVLOptimizer.cpp (+14-8)
  • (modified) llvm/test/CodeGen/RISCV/rvv/rvv-peephole-vmerge-vops.ll (+2-1)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vl-opt.ll (+23)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vl-opt.mir (+61)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmadd-vp.ll (+12-24)
diff --git a/llvm/lib/Target/RISCV/RISCVVLOptimizer.cpp b/llvm/lib/Target/RISCV/RISCVVLOptimizer.cpp
index 5f2d4e0585a0b3..4c6ba26ba6e193 100644
--- a/llvm/lib/Target/RISCV/RISCVVLOptimizer.cpp
+++ b/llvm/lib/Target/RISCV/RISCVVLOptimizer.cpp
@@ -53,7 +53,7 @@ class RISCVVLOptimizer : public MachineFunctionPass {
   std::optional<MachineOperand> getMinimumVLForUser(MachineOperand &UserOp);
   /// Returns the largest common VL MachineOperand that may be used to optimize
   /// MI. Returns std::nullopt if it failed to find a suitable VL.
-  std::optional<MachineOperand> checkUsers(MachineInstr &MI);
+  std::optional<MachineOperand> checkUsers(const MachineInstr &MI);
   bool tryReduceVL(MachineInstr &MI);
   bool isCandidate(const MachineInstr &MI) const;
 };
@@ -1221,7 +1221,8 @@ RISCVVLOptimizer::getMinimumVLForUser(MachineOperand &UserOp) {
   return VLOp;
 }
 
-std::optional<MachineOperand> RISCVVLOptimizer::checkUsers(MachineInstr &MI) {
+std::optional<MachineOperand>
+RISCVVLOptimizer::checkUsers(const MachineInstr &MI) {
   // FIXME: Avoid visiting each user for each time we visit something on the
   // worklist, combined with an extra visit from the outer loop. Restructure
   // along lines of an instcombine style worklist which integrates the outer
@@ -1235,16 +1236,21 @@ std::optional<MachineOperand> RISCVVLOptimizer::checkUsers(MachineInstr &MI) {
       return std::nullopt;
     }
 
-    // Tied operands might pass through.
-    if (UserOp.isTied()) {
-      LLVM_DEBUG(dbgs() << "    Abort because user used as tied operand\n");
-      return std::nullopt;
-    }
-
     auto VLOp = getMinimumVLForUser(UserOp);
     if (!VLOp)
       return std::nullopt;
 
+    // If the user is a passthru, we will need to preserve it if its tail is
+    // demanded.
+    if (UserOp.isTied()) {
+      auto DemandedVL = checkUsers(UserMI);
+      if (!DemandedVL || !RISCV::isVLKnownLE(*DemandedVL, *VLOp)) {
+        LLVM_DEBUG(dbgs() << "    Abort because user is passthru in "
+                             "instruction with demanded tail\n");
+        return std::nullopt;
+      }
+    }
+
     // Use the largest VL among all the users. If we cannot determine this
     // statically, then we cannot optimize the VL.
     if (!CommonVL || RISCV::isVLKnownLE(*CommonVL, *VLOp)) {
diff --git a/llvm/test/CodeGen/RISCV/rvv/rvv-peephole-vmerge-vops.ll b/llvm/test/CodeGen/RISCV/rvv/rvv-peephole-vmerge-vops.ll
index d329979857a6b9..403cc0eb9dce1d 100644
--- a/llvm/test/CodeGen/RISCV/rvv/rvv-peephole-vmerge-vops.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/rvv-peephole-vmerge-vops.ll
@@ -894,9 +894,10 @@ define void @test_dag_loop() {
 ; CHECK:       # %bb.0: # %entry
 ; CHECK-NEXT:    vsetvli a0, zero, e8, m4, ta, ma
 ; CHECK-NEXT:    vmclr.m v0
+; CHECK-NEXT:    vsetivli zero, 0, e8, m4, ta, ma
 ; CHECK-NEXT:    vmv.v.i v8, 0
 ; CHECK-NEXT:    vmv.v.i v12, 0
-; CHECK-NEXT:    vsetivli zero, 0, e8, m4, tu, mu
+; CHECK-NEXT:    vsetvli zero, zero, e8, m4, tu, mu
 ; CHECK-NEXT:    vssubu.vx v12, v8, zero, v0.t
 ; CHECK-NEXT:    vsetvli zero, zero, e8, m4, ta, ma
 ; CHECK-NEXT:    vmseq.vv v0, v12, v8
diff --git a/llvm/test/CodeGen/RISCV/rvv/vl-opt.ll b/llvm/test/CodeGen/RISCV/rvv/vl-opt.ll
index 1cc30f077feb4a..8d44eabafcecfc 100644
--- a/llvm/test/CodeGen/RISCV/rvv/vl-opt.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/vl-opt.ll
@@ -150,3 +150,26 @@ define <vscale x 4 x i32> @dont_optimize_tied_def(<vscale x 4 x i32> %a, <vscale
   ret <vscale x 4 x i32> %2
 }
 
+define void @optimize_ternary_use(<vscale x 4 x i16> %a, <vscale x 4 x i32> %b, <vscale x 4 x i32> %c, ptr %p, iXLen %vl) {
+; NOVLOPT-LABEL: optimize_ternary_use:
+; NOVLOPT:       # %bb.0:
+; NOVLOPT-NEXT:    vsetvli a2, zero, e32, m2, ta, ma
+; NOVLOPT-NEXT:    vzext.vf2 v14, v8
+; NOVLOPT-NEXT:    vsetvli zero, a1, e32, m2, ta, ma
+; NOVLOPT-NEXT:    vmadd.vv v14, v10, v12
+; NOVLOPT-NEXT:    vse32.v v14, (a0)
+; NOVLOPT-NEXT:    ret
+;
+; VLOPT-LABEL: optimize_ternary_use:
+; VLOPT:       # %bb.0:
+; VLOPT-NEXT:    vsetvli zero, a1, e32, m2, ta, ma
+; VLOPT-NEXT:    vzext.vf2 v14, v8
+; VLOPT-NEXT:    vmadd.vv v14, v10, v12
+; VLOPT-NEXT:    vse32.v v14, (a0)
+; VLOPT-NEXT:    ret
+  %1 = zext <vscale x 4 x i16> %a to <vscale x 4 x i32>
+  %2 = mul <vscale x 4 x i32> %b, %1
+  %3 = add <vscale x 4 x i32> %2, %c
+  call void @llvm.riscv.vse(<vscale x 4 x i32> %3, ptr %p, iXLen %vl)
+  ret void
+}
diff --git a/llvm/test/CodeGen/RISCV/rvv/vl-opt.mir b/llvm/test/CodeGen/RISCV/rvv/vl-opt.mir
index 027eb8ca3c17f0..5e6af334f02f85 100644
--- a/llvm/test/CodeGen/RISCV/rvv/vl-opt.mir
+++ b/llvm/test/CodeGen/RISCV/rvv/vl-opt.mir
@@ -149,3 +149,64 @@ body: |
     ; CHECK-NEXT: early-clobber %y:vrm2 = PseudoVWADD_WV_M1_TIED $noreg, %x, 1, 3 /* e8 */, 0 /* tu, mu */
     %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */, 0 /* tu, mu */
     %y:vrm2 = PseudoVWADD_WV_M1_TIED $noreg, %x, 1, 3 /* e8 */, 0 /* tu, mu */
+...
+---
+# Can reduce %x even though %y uses it as a passthru, because %y's inactive elements aren't demanded
+name: passthru_not_demanded
+body: |
+  bb.0:
+    ; CHECK-LABEL: name: passthru_not_demanded
+    ; CHECK: %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+    ; CHECK-NEXT: %y:vr = PseudoVADD_VV_M1 %x, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+    ; CHECK-NEXT: %z:vr = PseudoVADD_VV_M1 $noreg, %y, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+    %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */, 0 /* tu, mu */
+    %y:vr = PseudoVADD_VV_M1 %x, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+    %z:vr = PseudoVADD_VV_M1 $noreg, %y, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+...
+---
+# Can't reduce %x because %y uses it as a passthru, and %y's inactive elements are demanded by %z
+name: passthru_demanded
+body: |
+  bb.0:
+    ; CHECK-LABEL: name: passthru_demanded
+    ; CHECK: %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */, 0 /* tu, mu */
+    ; CHECK-NEXT: %y:vr = PseudoVADD_VV_M1 %x, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+    ; CHECK-NEXT: %z:vr = PseudoVADD_VV_M1 $noreg, %y, $noreg, 2, 3 /* e8 */, 0 /* tu, mu */
+    %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */, 0 /* tu, mu */
+    %y:vr = PseudoVADD_VV_M1 %x, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+    %z:vr = PseudoVADD_VV_M1 $noreg, %y, $noreg, 2, 3 /* e8 */, 0 /* tu, mu */
+...
+---
+# Can reduce %x even though %y uses it as a passthru, because %y's inactive elements aren't demanded
+name: passthru_not_demanded_passthru_chain
+body: |
+  bb.0:
+    ; CHECK-LABEL: name: passthru_not_demanded_passthru_chain
+    ; CHECK: %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+    ; CHECK-NEXT: %y:vr = PseudoVADD_VV_M1 %x, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+    ; CHECK-NEXT: %z:vr = PseudoVADD_VV_M1 %y, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+    ; CHECK-NEXT: %a:vr = PseudoVADD_VV_M1 %z, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+    ; CHECK-NEXT: %b:vr = PseudoVADD_VV_M1 $noreg, %a, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+    %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */, 0 /* tu, mu */
+    %y:vr = PseudoVADD_VV_M1 %x, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+    %z:vr = PseudoVADD_VV_M1 %y, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+    %a:vr = PseudoVADD_VV_M1 %z, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+    %b:vr = PseudoVADD_VV_M1 $noreg, %a, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+...
+---
+# Can't reduce %x because %y uses it as a passthru, and %y's inactive elements are ultimately demanded in %b
+name: passthru_demanded_passthru_chain
+body: |
+  bb.0:
+    ; CHECK-LABEL: name: passthru_demanded_passthru_chain
+    ; CHECK: %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */, 0 /* tu, mu */
+    ; CHECK-NEXT: %y:vr = PseudoVADD_VV_M1 %x, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+    ; CHECK-NEXT: %z:vr = PseudoVADD_VV_M1 %y, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+    ; CHECK-NEXT: %a:vr = PseudoVADD_VV_M1 %z, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+    ; CHECK-NEXT: %b:vr = PseudoVADD_VV_M1 $noreg, %a, $noreg, 2, 3 /* e8 */, 0 /* tu, mu */
+    %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */, 0 /* tu, mu */
+    %y:vr = PseudoVADD_VV_M1 %x, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+    %z:vr = PseudoVADD_VV_M1 %y, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+    %a:vr = PseudoVADD_VV_M1 %z, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+    %b:vr = PseudoVADD_VV_M1 $noreg, %a, $noreg, 2, 3 /* e8 */, 0 /* tu, mu */
+...
diff --git a/llvm/test/CodeGen/RISCV/rvv/vmadd-vp.ll b/llvm/test/CodeGen/RISCV/rvv/vmadd-vp.ll
index eb74b238c01b95..a29af3d5b54b0f 100644
--- a/llvm/test/CodeGen/RISCV/rvv/vmadd-vp.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/vmadd-vp.ll
@@ -1638,9 +1638,8 @@ define <vscale x 1 x i64> @vmadd_vx_nxv1i64(<vscale x 1 x i64> %a, i64 %b, <vsca
 ; RV32-NEXT:    sw a0, 8(sp)
 ; RV32-NEXT:    sw a1, 12(sp)
 ; RV32-NEXT:    addi a0, sp, 8
-; RV32-NEXT:    vsetvli a1, zero, e64, m1, ta, ma
-; RV32-NEXT:    vlse64.v v10, (a0), zero
 ; RV32-NEXT:    vsetvli zero, a2, e64, m1, ta, ma
+; RV32-NEXT:    vlse64.v v10, (a0), zero
 ; RV32-NEXT:    vmadd.vv v10, v8, v9
 ; RV32-NEXT:    vsetvli zero, zero, e64, m1, tu, ma
 ; RV32-NEXT:    vmerge.vvm v8, v8, v10, v0
@@ -1669,9 +1668,8 @@ define <vscale x 1 x i64> @vmadd_vx_nxv1i64_unmasked(<vscale x 1 x i64> %a, i64
 ; RV32-NEXT:    sw a0, 8(sp)
 ; RV32-NEXT:    sw a1, 12(sp)
 ; RV32-NEXT:    addi a0, sp, 8
-; RV32-NEXT:    vsetvli a1, zero, e64, m1, ta, ma
-; RV32-NEXT:    vlse64.v v10, (a0), zero
 ; RV32-NEXT:    vsetvli zero, a2, e64, m1, ta, ma
+; RV32-NEXT:    vlse64.v v10, (a0), zero
 ; RV32-NEXT:    vmadd.vv v10, v8, v9
 ; RV32-NEXT:    vsetvli zero, zero, e64, m1, tu, ma
 ; RV32-NEXT:    vmv.v.v v8, v10
@@ -1713,9 +1711,8 @@ define <vscale x 1 x i64> @vmadd_vx_nxv1i64_ta(<vscale x 1 x i64> %a, i64 %b, <v
 ; RV32-NEXT:    sw a0, 8(sp)
 ; RV32-NEXT:    sw a1, 12(sp)
 ; RV32-NEXT:    addi a0, sp, 8
-; RV32-NEXT:    vsetvli a1, zero, e64, m1, ta, ma
-; RV32-NEXT:    vlse64.v v10, (a0), zero
 ; RV32-NEXT:    vsetvli zero, a2, e64, m1, ta, ma
+; RV32-NEXT:    vlse64.v v10, (a0), zero
 ; RV32-NEXT:    vmadd.vv v10, v8, v9
 ; RV32-NEXT:    vmerge.vvm v8, v8, v10, v0
 ; RV32-NEXT:    addi sp, sp, 16
@@ -1776,9 +1773,8 @@ define <vscale x 2 x i64> @vmadd_vx_nxv2i64(<vscale x 2 x i64> %a, i64 %b, <vsca
 ; RV32-NEXT:    sw a0, 8(sp)
 ; RV32-NEXT:    sw a1, 12(sp)
 ; RV32-NEXT:    addi a0, sp, 8
-; RV32-NEXT:    vsetvli a1, zero, e64, m2, ta, ma
-; RV32-NEXT:    vlse64.v v12, (a0), zero
 ; RV32-NEXT:    vsetvli zero, a2, e64, m2, ta, ma
+; RV32-NEXT:    vlse64.v v12, (a0), zero
 ; RV32-NEXT:    vmadd.vv v12, v8, v10
 ; RV32-NEXT:    vsetvli zero, zero, e64, m2, tu, ma
 ; RV32-NEXT:    vmerge.vvm v8, v8, v12, v0
@@ -1807,9 +1803,8 @@ define <vscale x 2 x i64> @vmadd_vx_nxv2i64_unmasked(<vscale x 2 x i64> %a, i64
 ; RV32-NEXT:    sw a0, 8(sp)
 ; RV32-NEXT:    sw a1, 12(sp)
 ; RV32-NEXT:    addi a0, sp, 8
-; RV32-NEXT:    vsetvli a1, zero, e64, m2, ta, ma
-; RV32-NEXT:    vlse64.v v12, (a0), zero
 ; RV32-NEXT:    vsetvli zero, a2, e64, m2, ta, ma
+; RV32-NEXT:    vlse64.v v12, (a0), zero
 ; RV32-NEXT:    vmadd.vv v12, v8, v10
 ; RV32-NEXT:    vsetvli zero, zero, e64, m2, tu, ma
 ; RV32-NEXT:    vmv.v.v v8, v12
@@ -1851,9 +1846,8 @@ define <vscale x 2 x i64> @vmadd_vx_nxv2i64_ta(<vscale x 2 x i64> %a, i64 %b, <v
 ; RV32-NEXT:    sw a0, 8(sp)
 ; RV32-NEXT:    sw a1, 12(sp)
 ; RV32-NEXT:    addi a0, sp, 8
-; RV32-NEXT:    vsetvli a1, zero, e64, m2, ta, ma
-; RV32-NEXT:    vlse64.v v12, (a0), zero
 ; RV32-NEXT:    vsetvli zero, a2, e64, m2, ta, ma
+; RV32-NEXT:    vlse64.v v12, (a0), zero
 ; RV32-NEXT:    vmadd.vv v12, v8, v10
 ; RV32-NEXT:    vmerge.vvm v8, v8, v12, v0
 ; RV32-NEXT:    addi sp, sp, 16
@@ -1914,9 +1908,8 @@ define <vscale x 4 x i64> @vmadd_vx_nxv4i64(<vscale x 4 x i64> %a, i64 %b, <vsca
 ; RV32-NEXT:    sw a0, 8(sp)
 ; RV32-NEXT:    sw a1, 12(sp)
 ; RV32-NEXT:    addi a0, sp, 8
-; RV32-NEXT:    vsetvli a1, zero, e64, m4, ta, ma
-; RV32-NEXT:    vlse64.v v16, (a0), zero
 ; RV32-NEXT:    vsetvli zero, a2, e64, m4, ta, ma
+; RV32-NEXT:    vlse64.v v16, (a0), zero
 ; RV32-NEXT:    vmadd.vv v16, v8, v12
 ; RV32-NEXT:    vsetvli zero, zero, e64, m4, tu, ma
 ; RV32-NEXT:    vmerge.vvm v8, v8, v16, v0
@@ -1945,9 +1938,8 @@ define <vscale x 4 x i64> @vmadd_vx_nxv4i64_unmasked(<vscale x 4 x i64> %a, i64
 ; RV32-NEXT:    sw a0, 8(sp)
 ; RV32-NEXT:    sw a1, 12(sp)
 ; RV32-NEXT:    addi a0, sp, 8
-; RV32-NEXT:    vsetvli a1, zero, e64, m4, ta, ma
-; RV32-NEXT:    vlse64.v v16, (a0), zero
 ; RV32-NEXT:    vsetvli zero, a2, e64, m4, ta, ma
+; RV32-NEXT:    vlse64.v v16, (a0), zero
 ; RV32-NEXT:    vmadd.vv v16, v8, v12
 ; RV32-NEXT:    vsetvli zero, zero, e64, m4, tu, ma
 ; RV32-NEXT:    vmv.v.v v8, v16
@@ -1989,9 +1981,8 @@ define <vscale x 4 x i64> @vmadd_vx_nxv4i64_ta(<vscale x 4 x i64> %a, i64 %b, <v
 ; RV32-NEXT:    sw a0, 8(sp)
 ; RV32-NEXT:    sw a1, 12(sp)
 ; RV32-NEXT:    addi a0, sp, 8
-; RV32-NEXT:    vsetvli a1, zero, e64, m4, ta, ma
-; RV32-NEXT:    vlse64.v v16, (a0), zero
 ; RV32-NEXT:    vsetvli zero, a2, e64, m4, ta, ma
+; RV32-NEXT:    vlse64.v v16, (a0), zero
 ; RV32-NEXT:    vmadd.vv v16, v8, v12
 ; RV32-NEXT:    vmerge.vvm v8, v8, v16, v0
 ; RV32-NEXT:    addi sp, sp, 16
@@ -2054,9 +2045,8 @@ define <vscale x 8 x i64> @vmadd_vx_nxv8i64(<vscale x 8 x i64> %a, i64 %b, <vsca
 ; RV32-NEXT:    sw a0, 8(sp)
 ; RV32-NEXT:    sw a1, 12(sp)
 ; RV32-NEXT:    addi a0, sp, 8
-; RV32-NEXT:    vsetvli a1, zero, e64, m8, ta, ma
-; RV32-NEXT:    vlse64.v v24, (a0), zero
 ; RV32-NEXT:    vsetvli zero, a2, e64, m8, ta, ma
+; RV32-NEXT:    vlse64.v v24, (a0), zero
 ; RV32-NEXT:    vmadd.vv v24, v8, v16
 ; RV32-NEXT:    vsetvli zero, zero, e64, m8, tu, ma
 ; RV32-NEXT:    vmerge.vvm v8, v8, v24, v0
@@ -2085,9 +2075,8 @@ define <vscale x 8 x i64> @vmadd_vx_nxv8i64_unmasked(<vscale x 8 x i64> %a, i64
 ; RV32-NEXT:    sw a0, 8(sp)
 ; RV32-NEXT:    sw a1, 12(sp)
 ; RV32-NEXT:    addi a0, sp, 8
-; RV32-NEXT:    vsetvli a1, zero, e64, m8, ta, ma
-; RV32-NEXT:    vlse64.v v24, (a0), zero
 ; RV32-NEXT:    vsetvli zero, a2, e64, m8, ta, ma
+; RV32-NEXT:    vlse64.v v24, (a0), zero
 ; RV32-NEXT:    vmadd.vv v24, v8, v16
 ; RV32-NEXT:    vsetvli zero, zero, e64, m8, tu, ma
 ; RV32-NEXT:    vmv.v.v v8, v24
@@ -2130,9 +2119,8 @@ define <vscale x 8 x i64> @vmadd_vx_nxv8i64_ta(<vscale x 8 x i64> %a, i64 %b, <v
 ; RV32-NEXT:    sw a0, 8(sp)
 ; RV32-NEXT:    sw a1, 12(sp)
 ; RV32-NEXT:    addi a0, sp, 8
-; RV32-NEXT:    vsetvli a1, zero, e64, m8, ta, ma
-; RV32-NEXT:    vlse64.v v24, (a0), zero
 ; RV32-NEXT:    vsetvli zero, a2, e64, m8, ta, ma
+; RV32-NEXT:    vlse64.v v24, (a0), zero
 ; RV32-NEXT:    vmadd.vv v24, v8, v16
 ; RV32-NEXT:    vmerge.vvm v8, v8, v24, v0
 ; RV32-NEXT:    addi sp, sp, 16

@@ -53,7 +53,7 @@ class RISCVVLOptimizer : public MachineFunctionPass {
   std::optional<MachineOperand> getMinimumVLForUser(MachineOperand &UserOp);
   /// Returns the largest common VL MachineOperand that may be used to optimize
   /// MI. Returns std::nullopt if it failed to find a suitable VL.
-  std::optional<MachineOperand> checkUsers(MachineInstr &MI);
+  std::optional<MachineOperand> checkUsers(const MachineInstr &MI);
@lukel97
Contributor Author


Needed to tweak the signature here since UserMI is const

Contributor

@michaelmaitland michaelmaitland left a comment


LGTM. Nice find!

@michaelmaitland
Contributor

Do we get any impact if we remove MI.getNumDefs() != 1 in isCandidate? Not sure if that would be related here.

Collaborator

@preames preames left a comment


Hold on, this is now an unbounded recursion. This can cause both stack explosion, and more likely, significant compile time regressions.

I think you can achieve the same thing by adding a transform which uses the result of checkUsers to clear a pass thru to NoRegister. (That is, convert an instruction from TA/TU to tail undefined.) When we visit that instruction, it would propagate up.

@lukel97
Contributor Author

lukel97 commented Jan 23, 2025

Hold on, this is now an unbounded recursion. This can cause both stack explosion, and more likely, significant compile time regressions.

Yeah, agreed the recursion is less than ideal, but it's only unbounded in the length of chains of passthrus feeding other passthrus, which doesn't seem to be common in practice.

I think you can achieve the same thing by adding a transform which uses the result of checkUsers to clear a pass thru to NoRegister. (That is, convert an instruction from TA/TU to tail undefined.) When we visit that instruction, it would propagate up.

Unfortunately there's no way to represent a tail undefined ternary pseudo, because the passthru is also the third operand. I wrote a bit more of a description of this in #123760.

I tried adding new _INACTIVE_UNDEF pseudos which propagate the information at isel that the tail is undef, but the duplication ended up causing a bunch of pain in other places, e.g. we need to handle new opcodes for commuting.

Given that we're checking the demanded VL of the users downwards, and we iterate upwards over each block, it might be possible to memoize the result of checkUsers? We would need to invalidate it whenever we reduce a VL though. I'll have a think about this tonight anyway to see if there's another approach.
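
(For context, a rough sketch of that precomputation idea, roughly the shape #124530 later takes with a DenseMap of demanded VLs and a single forward pass. The structure and names like runOnBlock are illustrative only, not the exact in-tree code.)

  bool RISCVVLOptimizer::runOnBlock(MachineBasicBlock &MBB) {
    // Sketch only. Phase 1: walk the block bottom-up so each instruction's
    // users have already been processed, and cache the largest VL its users
    // demand instead of recomputing it on every visit.
    DenseMap<const MachineInstr *, MachineOperand> DemandedVLs;
    for (MachineInstr &MI : reverse(MBB))
      if (isCandidate(MI))
        if (std::optional<MachineOperand> VL = checkUsers(MI))
          DemandedVLs.insert({&MI, *VL});

    // Phase 2: a single forward pass that shrinks a candidate's VL operand
    // when the cached demanded VL is known to be <= its current VL.
    bool Changed = false;
    for (MachineInstr &MI : MBB)
      if (DemandedVLs.count(&MI))
        Changed |= tryReduceVL(MI);
    return Changed;
  }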

@lukel97
Contributor Author

lukel97 commented Jan 23, 2025

Do we get any impact if we remove MI.getNumDefs() != 1 in isCandidate? Not sure if that would be related here.

Off the top of my head, vsadd etc. have extra implicit defs, but I'm not sure if it's safe to change their VL. Fault-first loads have an extra explicit def, but I don't think they're safe to reduce either

@michaelmaitland
Contributor

michaelmaitland commented Jan 23, 2025

Do we get any impact if we remove MI.getNumDefs() != 1 in isCandidate? Not sure if that would be related here.

Off the top of my head, vsadd etc. have extra implicit defs, but I'm not sure if it's safe to change their VL. Fault-first loads have an extra explicit def, but I don't think they're safe to reduce either

Fault-first loads have an extra explicit def, but I don't think they're safe to reduce either

I had played around with a patch in the past that changed this line to MI.getNumExplicitDefs() != 1 to avoid the fault-first loads. vsadd and the other instructions that write vxsat are not part of isSupportedInstr. I think we could probably remove this line; it was placed here to guard against the passthru. I think this would give you more optimization for passthru lanes that are not read.
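
(For concreteness, roughly the variant being described here, as a sketch assuming it replaces the existing MI.getNumDefs() != 1 bail-out in RISCVVLOptimizer::isCandidate:)

  // Sketch: count only explicit defs. Instructions whose extra def is implicit
  // (e.g. vsadd's vxsat) would no longer be rejected here, while
  // fault-only-first loads, which have a second explicit def for the updated
  // VL, still would be.
  if (MI.getNumExplicitDefs() != 1)
    return false;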

lukel97 added a commit to lukel97/llvm-project that referenced this pull request Jan 27, 2025
This replaces the worklist by instead computing what VL is demanded by each instruction's users first.

checkUsers essentially already did this, so it's been renamed to computeDemandedVL.

The demanded VLs are stored in a DenseMap, and then we can just do a single forward pass of tryReduceVL where we check if a candidate's demanded VL is less than its VLOp.

This means the pass should now be linear in complexity, and allows us to relax the restriction on tied operands more easily, as in llvm#124066.

Note that in order to avoid std::optional inside the DenseMap, I've removed the std::optionals and replaced them with VLMAX or 0 constant operands.
@lukel97
Contributor Author

lukel97 commented Jan 27, 2025

Given that we're checking the demanded VL of the users downwards, and we iterate upwards over each block, it might be possible to memoize the result of checkUsers? We would need to invalidate it whenever we reduce a VL though. I'll have a think about this tonight anyway to see if there's another approach.

The solution I ended up with is in #124530. It turns out we don't need to invalidate anything because the demanded VL never changes: We can compute the minimum possible VL for each instruction up front without needing to mutate anything.

I'll rebase this PR on top of it once it's landed, as well as #124549 which was split off to help explain some diffs.

lukel97 added a commit that referenced this pull request Jan 29, 2025
This replaces the worklist by instead computing what VL is demanded by
each instruction's users first, which is done via checkUsers.

The demanded VLs are stored in a DenseMap, and then we can just do a
single forward pass of tryReduceVL where we check if a candidate's
demanded VL is less than its VLOp.

This means the pass should now be linear in complexity, and allows us to
relax the restriction on tied operands more easily, as in #124066.
@lukel97 lukel97 force-pushed the vloptimizer/passthru-users branch from c6cfbe7 to f77d01d on January 29, 2025 05:09
@lukel97
Contributor Author

lukel97 commented Jan 29, 2025

I've rebased this on top of #124530, so now there's no recursion

Collaborator

@preames preames left a comment


LGTM

Only point I'm still a bit hesitant on is whether we have some instruction with "weird" passthru usage (i.e. read after VL, etc.). I can't think of any, and spot-checking existing code doesn't seem to reveal any special casing, so I think we're okay here. Both passthrus for TU/MU and FMA-style extra sources should be fine with the code as written.

I guess one other point is that I'm surprised this has so little impact on the tests. Apparently we really don't use anything other than tail undef in practice, do we?

@lukel97
Contributor Author

lukel97 commented Jan 30, 2025

Only point I'm still a bit hesitant on is if we have some instruction with "weird" passthru usage (i.e. read after VL, etc..).

vslidedowns and vrgathers can read past VL, but we guard against them already via mayReadPastVL.

Apparently we really don't use anything other than tail undef in practice do we?

It shows up in a lot more places if we stop emitting non-trapping VP intrinsics with EVL tail folding.

We also probably never see tail undef ternary pseudos, because that would mean one of the operands would have to be undef too.

@lukel97 lukel97 merged commit eb7e199 into llvm:main Jan 30, 2025
8 checks passed
lukel97 added a commit to lukel97/llvm-project that referenced this pull request Feb 10, 2025
After llvm#124066 we started allowing users that are passthrus. However, for widening/narrowing instructions we were returning the wrong operand info for passthru operands, since it originally assumed the operand would never be a passthru. This fixes it by handling it in IsMODef.
lukel97 added a commit that referenced this pull request Feb 10, 2025
…#126504)

After #124066 we started allowing users that are passthrus. However, for
widening/narrowing instructions we were returning the wrong operand info
for passthru operands, since it originally assumed the operand would
never be a passthru. This fixes it by handling it in IsMODef.

Successfully merging this pull request may close these issues.

Handling tied operands in ternary pseudos in RISCVVLOptimizer
4 participants