[RISCV][VLOPT] Allow users that are passthrus if tail elements aren't demanded #124066
Conversation
@llvm/pr-subscribers-backend-risc-v

Author: Luke Lau (lukel97)

Changes

The motivation for this is to allow reducing the vl when a user is a ternary pseudo, where the third operand is tied and also acts as a passthru.

When checking the users of an instruction, we currently bail if the user is used as a passthru, because all of its elements past vl will be used for the tail. We can allow passthru users if we know the tail of their result isn't used, and we can reuse checkUsers to check this.

It's worth noting that this is all irrespective of the tail policy, because tail agnostic still ends up using the passthru.

I've checked that SPEC CPU 2017 + llvm-test-suite pass with this (on qemu with rvv_ta_all_1s=true).

Fixes #123760

Full diff: https://github.com/llvm/llvm-project/pull/124066.diff

5 Files Affected:
diff --git a/llvm/lib/Target/RISCV/RISCVVLOptimizer.cpp b/llvm/lib/Target/RISCV/RISCVVLOptimizer.cpp
index 5f2d4e0585a0b3..4c6ba26ba6e193 100644
--- a/llvm/lib/Target/RISCV/RISCVVLOptimizer.cpp
+++ b/llvm/lib/Target/RISCV/RISCVVLOptimizer.cpp
@@ -53,7 +53,7 @@ class RISCVVLOptimizer : public MachineFunctionPass {
std::optional<MachineOperand> getMinimumVLForUser(MachineOperand &UserOp);
/// Returns the largest common VL MachineOperand that may be used to optimize
/// MI. Returns std::nullopt if it failed to find a suitable VL.
- std::optional<MachineOperand> checkUsers(MachineInstr &MI);
+ std::optional<MachineOperand> checkUsers(const MachineInstr &MI);
bool tryReduceVL(MachineInstr &MI);
bool isCandidate(const MachineInstr &MI) const;
};
@@ -1221,7 +1221,8 @@ RISCVVLOptimizer::getMinimumVLForUser(MachineOperand &UserOp) {
return VLOp;
}
-std::optional<MachineOperand> RISCVVLOptimizer::checkUsers(MachineInstr &MI) {
+std::optional<MachineOperand>
+RISCVVLOptimizer::checkUsers(const MachineInstr &MI) {
// FIXME: Avoid visiting each user for each time we visit something on the
// worklist, combined with an extra visit from the outer loop. Restructure
// along lines of an instcombine style worklist which integrates the outer
@@ -1235,16 +1236,21 @@ std::optional<MachineOperand> RISCVVLOptimizer::checkUsers(MachineInstr &MI) {
return std::nullopt;
}
- // Tied operands might pass through.
- if (UserOp.isTied()) {
- LLVM_DEBUG(dbgs() << " Abort because user used as tied operand\n");
- return std::nullopt;
- }
-
auto VLOp = getMinimumVLForUser(UserOp);
if (!VLOp)
return std::nullopt;
+ // If the user is a passthru, we will need to preserve it if its tail is
+ // demanded.
+ if (UserOp.isTied()) {
+ auto DemandedVL = checkUsers(UserMI);
+ if (!DemandedVL || !RISCV::isVLKnownLE(*DemandedVL, *VLOp)) {
+ LLVM_DEBUG(dbgs() << " Abort because user is passthru in "
+ "instruction with demanded tail\n");
+ return std::nullopt;
+ }
+ }
+
// Use the largest VL among all the users. If we cannot determine this
// statically, then we cannot optimize the VL.
if (!CommonVL || RISCV::isVLKnownLE(*CommonVL, *VLOp)) {
diff --git a/llvm/test/CodeGen/RISCV/rvv/rvv-peephole-vmerge-vops.ll b/llvm/test/CodeGen/RISCV/rvv/rvv-peephole-vmerge-vops.ll
index d329979857a6b9..403cc0eb9dce1d 100644
--- a/llvm/test/CodeGen/RISCV/rvv/rvv-peephole-vmerge-vops.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/rvv-peephole-vmerge-vops.ll
@@ -894,9 +894,10 @@ define void @test_dag_loop() {
; CHECK: # %bb.0: # %entry
; CHECK-NEXT: vsetvli a0, zero, e8, m4, ta, ma
; CHECK-NEXT: vmclr.m v0
+; CHECK-NEXT: vsetivli zero, 0, e8, m4, ta, ma
; CHECK-NEXT: vmv.v.i v8, 0
; CHECK-NEXT: vmv.v.i v12, 0
-; CHECK-NEXT: vsetivli zero, 0, e8, m4, tu, mu
+; CHECK-NEXT: vsetvli zero, zero, e8, m4, tu, mu
; CHECK-NEXT: vssubu.vx v12, v8, zero, v0.t
; CHECK-NEXT: vsetvli zero, zero, e8, m4, ta, ma
; CHECK-NEXT: vmseq.vv v0, v12, v8
diff --git a/llvm/test/CodeGen/RISCV/rvv/vl-opt.ll b/llvm/test/CodeGen/RISCV/rvv/vl-opt.ll
index 1cc30f077feb4a..8d44eabafcecfc 100644
--- a/llvm/test/CodeGen/RISCV/rvv/vl-opt.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/vl-opt.ll
@@ -150,3 +150,26 @@ define <vscale x 4 x i32> @dont_optimize_tied_def(<vscale x 4 x i32> %a, <vscale
ret <vscale x 4 x i32> %2
}
+define void @optimize_ternary_use(<vscale x 4 x i16> %a, <vscale x 4 x i32> %b, <vscale x 4 x i32> %c, ptr %p, iXLen %vl) {
+; NOVLOPT-LABEL: optimize_ternary_use:
+; NOVLOPT: # %bb.0:
+; NOVLOPT-NEXT: vsetvli a2, zero, e32, m2, ta, ma
+; NOVLOPT-NEXT: vzext.vf2 v14, v8
+; NOVLOPT-NEXT: vsetvli zero, a1, e32, m2, ta, ma
+; NOVLOPT-NEXT: vmadd.vv v14, v10, v12
+; NOVLOPT-NEXT: vse32.v v14, (a0)
+; NOVLOPT-NEXT: ret
+;
+; VLOPT-LABEL: optimize_ternary_use:
+; VLOPT: # %bb.0:
+; VLOPT-NEXT: vsetvli zero, a1, e32, m2, ta, ma
+; VLOPT-NEXT: vzext.vf2 v14, v8
+; VLOPT-NEXT: vmadd.vv v14, v10, v12
+; VLOPT-NEXT: vse32.v v14, (a0)
+; VLOPT-NEXT: ret
+ %1 = zext <vscale x 4 x i16> %a to <vscale x 4 x i32>
+ %2 = mul <vscale x 4 x i32> %b, %1
+ %3 = add <vscale x 4 x i32> %2, %c
+ call void @llvm.riscv.vse(<vscale x 4 x i32> %3, ptr %p, iXLen %vl)
+ ret void
+}
diff --git a/llvm/test/CodeGen/RISCV/rvv/vl-opt.mir b/llvm/test/CodeGen/RISCV/rvv/vl-opt.mir
index 027eb8ca3c17f0..5e6af334f02f85 100644
--- a/llvm/test/CodeGen/RISCV/rvv/vl-opt.mir
+++ b/llvm/test/CodeGen/RISCV/rvv/vl-opt.mir
@@ -149,3 +149,64 @@ body: |
; CHECK-NEXT: early-clobber %y:vrm2 = PseudoVWADD_WV_M1_TIED $noreg, %x, 1, 3 /* e8 */, 0 /* tu, mu */
%x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */, 0 /* tu, mu */
%y:vrm2 = PseudoVWADD_WV_M1_TIED $noreg, %x, 1, 3 /* e8 */, 0 /* tu, mu */
+...
+---
+# Can reduce %x even though %y uses it as a passthru, because %y's inactive elements aren't demanded
+name: passthru_not_demanded
+body: |
+ bb.0:
+ ; CHECK-LABEL: name: passthru_not_demanded
+ ; CHECK: %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+ ; CHECK-NEXT: %y:vr = PseudoVADD_VV_M1 %x, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+ ; CHECK-NEXT: %z:vr = PseudoVADD_VV_M1 $noreg, %y, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+ %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */, 0 /* tu, mu */
+ %y:vr = PseudoVADD_VV_M1 %x, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+ %z:vr = PseudoVADD_VV_M1 $noreg, %y, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+...
+---
+# Can't reduce %x because %y uses it as a passthru, and %y's inactive elements are demanded by %z
+name: passthru_demanded
+body: |
+ bb.0:
+ ; CHECK-LABEL: name: passthru_demanded
+ ; CHECK: %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */, 0 /* tu, mu */
+ ; CHECK-NEXT: %y:vr = PseudoVADD_VV_M1 %x, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+ ; CHECK-NEXT: %z:vr = PseudoVADD_VV_M1 $noreg, %y, $noreg, 2, 3 /* e8 */, 0 /* tu, mu */
+ %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */, 0 /* tu, mu */
+ %y:vr = PseudoVADD_VV_M1 %x, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+ %z:vr = PseudoVADD_VV_M1 $noreg, %y, $noreg, 2, 3 /* e8 */, 0 /* tu, mu */
+...
+---
+# Can reduce %x even though %y uses it as a passthru, because %y's inactive elements aren't demanded
+name: passthru_not_demanded_passthru_chain
+body: |
+ bb.0:
+ ; CHECK-LABEL: name: passthru_not_demanded_passthru_chain
+ ; CHECK: %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+ ; CHECK-NEXT: %y:vr = PseudoVADD_VV_M1 %x, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+ ; CHECK-NEXT: %z:vr = PseudoVADD_VV_M1 %y, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+ ; CHECK-NEXT: %a:vr = PseudoVADD_VV_M1 %z, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+ ; CHECK-NEXT: %b:vr = PseudoVADD_VV_M1 $noreg, %a, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+ %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */, 0 /* tu, mu */
+ %y:vr = PseudoVADD_VV_M1 %x, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+ %z:vr = PseudoVADD_VV_M1 %y, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+ %a:vr = PseudoVADD_VV_M1 %z, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+ %b:vr = PseudoVADD_VV_M1 $noreg, %a, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+...
+---
+# Can't reduce %x because %y uses it as a passthru, and %y's inactive elements are ultimately demanded in %b
+name: passthru_demanded_passthru_chain
+body: |
+ bb.0:
+ ; CHECK-LABEL: name: passthru_demanded_passthru_chain
+ ; CHECK: %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */, 0 /* tu, mu */
+ ; CHECK-NEXT: %y:vr = PseudoVADD_VV_M1 %x, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+ ; CHECK-NEXT: %z:vr = PseudoVADD_VV_M1 %y, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+ ; CHECK-NEXT: %a:vr = PseudoVADD_VV_M1 %z, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+ ; CHECK-NEXT: %b:vr = PseudoVADD_VV_M1 $noreg, %a, $noreg, 2, 3 /* e8 */, 0 /* tu, mu */
+ %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */, 0 /* tu, mu */
+ %y:vr = PseudoVADD_VV_M1 %x, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+ %z:vr = PseudoVADD_VV_M1 %y, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+ %a:vr = PseudoVADD_VV_M1 %z, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+ %b:vr = PseudoVADD_VV_M1 $noreg, %a, $noreg, 2, 3 /* e8 */, 0 /* tu, mu */
+...
diff --git a/llvm/test/CodeGen/RISCV/rvv/vmadd-vp.ll b/llvm/test/CodeGen/RISCV/rvv/vmadd-vp.ll
index eb74b238c01b95..a29af3d5b54b0f 100644
--- a/llvm/test/CodeGen/RISCV/rvv/vmadd-vp.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/vmadd-vp.ll
@@ -1638,9 +1638,8 @@ define <vscale x 1 x i64> @vmadd_vx_nxv1i64(<vscale x 1 x i64> %a, i64 %b, <vsca
; RV32-NEXT: sw a0, 8(sp)
; RV32-NEXT: sw a1, 12(sp)
; RV32-NEXT: addi a0, sp, 8
-; RV32-NEXT: vsetvli a1, zero, e64, m1, ta, ma
-; RV32-NEXT: vlse64.v v10, (a0), zero
; RV32-NEXT: vsetvli zero, a2, e64, m1, ta, ma
+; RV32-NEXT: vlse64.v v10, (a0), zero
; RV32-NEXT: vmadd.vv v10, v8, v9
; RV32-NEXT: vsetvli zero, zero, e64, m1, tu, ma
; RV32-NEXT: vmerge.vvm v8, v8, v10, v0
@@ -1669,9 +1668,8 @@ define <vscale x 1 x i64> @vmadd_vx_nxv1i64_unmasked(<vscale x 1 x i64> %a, i64
; RV32-NEXT: sw a0, 8(sp)
; RV32-NEXT: sw a1, 12(sp)
; RV32-NEXT: addi a0, sp, 8
-; RV32-NEXT: vsetvli a1, zero, e64, m1, ta, ma
-; RV32-NEXT: vlse64.v v10, (a0), zero
; RV32-NEXT: vsetvli zero, a2, e64, m1, ta, ma
+; RV32-NEXT: vlse64.v v10, (a0), zero
; RV32-NEXT: vmadd.vv v10, v8, v9
; RV32-NEXT: vsetvli zero, zero, e64, m1, tu, ma
; RV32-NEXT: vmv.v.v v8, v10
@@ -1713,9 +1711,8 @@ define <vscale x 1 x i64> @vmadd_vx_nxv1i64_ta(<vscale x 1 x i64> %a, i64 %b, <v
; RV32-NEXT: sw a0, 8(sp)
; RV32-NEXT: sw a1, 12(sp)
; RV32-NEXT: addi a0, sp, 8
-; RV32-NEXT: vsetvli a1, zero, e64, m1, ta, ma
-; RV32-NEXT: vlse64.v v10, (a0), zero
; RV32-NEXT: vsetvli zero, a2, e64, m1, ta, ma
+; RV32-NEXT: vlse64.v v10, (a0), zero
; RV32-NEXT: vmadd.vv v10, v8, v9
; RV32-NEXT: vmerge.vvm v8, v8, v10, v0
; RV32-NEXT: addi sp, sp, 16
@@ -1776,9 +1773,8 @@ define <vscale x 2 x i64> @vmadd_vx_nxv2i64(<vscale x 2 x i64> %a, i64 %b, <vsca
; RV32-NEXT: sw a0, 8(sp)
; RV32-NEXT: sw a1, 12(sp)
; RV32-NEXT: addi a0, sp, 8
-; RV32-NEXT: vsetvli a1, zero, e64, m2, ta, ma
-; RV32-NEXT: vlse64.v v12, (a0), zero
; RV32-NEXT: vsetvli zero, a2, e64, m2, ta, ma
+; RV32-NEXT: vlse64.v v12, (a0), zero
; RV32-NEXT: vmadd.vv v12, v8, v10
; RV32-NEXT: vsetvli zero, zero, e64, m2, tu, ma
; RV32-NEXT: vmerge.vvm v8, v8, v12, v0
@@ -1807,9 +1803,8 @@ define <vscale x 2 x i64> @vmadd_vx_nxv2i64_unmasked(<vscale x 2 x i64> %a, i64
; RV32-NEXT: sw a0, 8(sp)
; RV32-NEXT: sw a1, 12(sp)
; RV32-NEXT: addi a0, sp, 8
-; RV32-NEXT: vsetvli a1, zero, e64, m2, ta, ma
-; RV32-NEXT: vlse64.v v12, (a0), zero
; RV32-NEXT: vsetvli zero, a2, e64, m2, ta, ma
+; RV32-NEXT: vlse64.v v12, (a0), zero
; RV32-NEXT: vmadd.vv v12, v8, v10
; RV32-NEXT: vsetvli zero, zero, e64, m2, tu, ma
; RV32-NEXT: vmv.v.v v8, v12
@@ -1851,9 +1846,8 @@ define <vscale x 2 x i64> @vmadd_vx_nxv2i64_ta(<vscale x 2 x i64> %a, i64 %b, <v
; RV32-NEXT: sw a0, 8(sp)
; RV32-NEXT: sw a1, 12(sp)
; RV32-NEXT: addi a0, sp, 8
-; RV32-NEXT: vsetvli a1, zero, e64, m2, ta, ma
-; RV32-NEXT: vlse64.v v12, (a0), zero
; RV32-NEXT: vsetvli zero, a2, e64, m2, ta, ma
+; RV32-NEXT: vlse64.v v12, (a0), zero
; RV32-NEXT: vmadd.vv v12, v8, v10
; RV32-NEXT: vmerge.vvm v8, v8, v12, v0
; RV32-NEXT: addi sp, sp, 16
@@ -1914,9 +1908,8 @@ define <vscale x 4 x i64> @vmadd_vx_nxv4i64(<vscale x 4 x i64> %a, i64 %b, <vsca
; RV32-NEXT: sw a0, 8(sp)
; RV32-NEXT: sw a1, 12(sp)
; RV32-NEXT: addi a0, sp, 8
-; RV32-NEXT: vsetvli a1, zero, e64, m4, ta, ma
-; RV32-NEXT: vlse64.v v16, (a0), zero
; RV32-NEXT: vsetvli zero, a2, e64, m4, ta, ma
+; RV32-NEXT: vlse64.v v16, (a0), zero
; RV32-NEXT: vmadd.vv v16, v8, v12
; RV32-NEXT: vsetvli zero, zero, e64, m4, tu, ma
; RV32-NEXT: vmerge.vvm v8, v8, v16, v0
@@ -1945,9 +1938,8 @@ define <vscale x 4 x i64> @vmadd_vx_nxv4i64_unmasked(<vscale x 4 x i64> %a, i64
; RV32-NEXT: sw a0, 8(sp)
; RV32-NEXT: sw a1, 12(sp)
; RV32-NEXT: addi a0, sp, 8
-; RV32-NEXT: vsetvli a1, zero, e64, m4, ta, ma
-; RV32-NEXT: vlse64.v v16, (a0), zero
; RV32-NEXT: vsetvli zero, a2, e64, m4, ta, ma
+; RV32-NEXT: vlse64.v v16, (a0), zero
; RV32-NEXT: vmadd.vv v16, v8, v12
; RV32-NEXT: vsetvli zero, zero, e64, m4, tu, ma
; RV32-NEXT: vmv.v.v v8, v16
@@ -1989,9 +1981,8 @@ define <vscale x 4 x i64> @vmadd_vx_nxv4i64_ta(<vscale x 4 x i64> %a, i64 %b, <v
; RV32-NEXT: sw a0, 8(sp)
; RV32-NEXT: sw a1, 12(sp)
; RV32-NEXT: addi a0, sp, 8
-; RV32-NEXT: vsetvli a1, zero, e64, m4, ta, ma
-; RV32-NEXT: vlse64.v v16, (a0), zero
; RV32-NEXT: vsetvli zero, a2, e64, m4, ta, ma
+; RV32-NEXT: vlse64.v v16, (a0), zero
; RV32-NEXT: vmadd.vv v16, v8, v12
; RV32-NEXT: vmerge.vvm v8, v8, v16, v0
; RV32-NEXT: addi sp, sp, 16
@@ -2054,9 +2045,8 @@ define <vscale x 8 x i64> @vmadd_vx_nxv8i64(<vscale x 8 x i64> %a, i64 %b, <vsca
; RV32-NEXT: sw a0, 8(sp)
; RV32-NEXT: sw a1, 12(sp)
; RV32-NEXT: addi a0, sp, 8
-; RV32-NEXT: vsetvli a1, zero, e64, m8, ta, ma
-; RV32-NEXT: vlse64.v v24, (a0), zero
; RV32-NEXT: vsetvli zero, a2, e64, m8, ta, ma
+; RV32-NEXT: vlse64.v v24, (a0), zero
; RV32-NEXT: vmadd.vv v24, v8, v16
; RV32-NEXT: vsetvli zero, zero, e64, m8, tu, ma
; RV32-NEXT: vmerge.vvm v8, v8, v24, v0
@@ -2085,9 +2075,8 @@ define <vscale x 8 x i64> @vmadd_vx_nxv8i64_unmasked(<vscale x 8 x i64> %a, i64
; RV32-NEXT: sw a0, 8(sp)
; RV32-NEXT: sw a1, 12(sp)
; RV32-NEXT: addi a0, sp, 8
-; RV32-NEXT: vsetvli a1, zero, e64, m8, ta, ma
-; RV32-NEXT: vlse64.v v24, (a0), zero
; RV32-NEXT: vsetvli zero, a2, e64, m8, ta, ma
+; RV32-NEXT: vlse64.v v24, (a0), zero
; RV32-NEXT: vmadd.vv v24, v8, v16
; RV32-NEXT: vsetvli zero, zero, e64, m8, tu, ma
; RV32-NEXT: vmv.v.v v8, v24
@@ -2130,9 +2119,8 @@ define <vscale x 8 x i64> @vmadd_vx_nxv8i64_ta(<vscale x 8 x i64> %a, i64 %b, <v
; RV32-NEXT: sw a0, 8(sp)
; RV32-NEXT: sw a1, 12(sp)
; RV32-NEXT: addi a0, sp, 8
-; RV32-NEXT: vsetvli a1, zero, e64, m8, ta, ma
-; RV32-NEXT: vlse64.v v24, (a0), zero
; RV32-NEXT: vsetvli zero, a2, e64, m8, ta, ma
+; RV32-NEXT: vlse64.v v24, (a0), zero
; RV32-NEXT: vmadd.vv v24, v8, v16
; RV32-NEXT: vmerge.vvm v8, v8, v24, v0
; RV32-NEXT: addi sp, sp, 16
@@ -53,7 +53,7 @@ class RISCVVLOptimizer : public MachineFunctionPass {
   std::optional<MachineOperand> getMinimumVLForUser(MachineOperand &UserOp);
   /// Returns the largest common VL MachineOperand that may be used to optimize
   /// MI. Returns std::nullopt if it failed to find a suitable VL.
-  std::optional<MachineOperand> checkUsers(MachineInstr &MI);
+  std::optional<MachineOperand> checkUsers(const MachineInstr &MI);
Needed to tweak the signature here since UserMI is const
LGTM. Nice find!
Do we get any impact if we remove …?
Hold on, this is now an unbounded recursion. This can cause both stack explosion and, more likely, significant compile-time regressions.
I think you can achieve the same thing by adding a transform which uses the result of checkUsers to clear a passthru to NoRegister. (That is, convert an instruction from TA/TU to tail undefined.) When we visit that instruction, it would propagate up.
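A very rough sketch of that suggested direction, purely for illustration: clearDeadPassthru is a hypothetical helper, the passthru operand position and the policy handling are simplified assumptions, and this is not what the patch implements (the reply below explains why it can't be represented for ternary pseudos, whose passthru is also a real source operand).

```cpp
// Hypothetical sketch inside RISCVVLOptimizer: if no user demands MI's tail,
// the passthru is dead, so clear it to NoRegister (tail undefined). The
// instruction feeding the passthru then loses that use, and its own VL can
// be reduced when it is visited later ("propagate up").
void RISCVVLOptimizer::clearDeadPassthru(MachineInstr &MI) {
  MachineOperand &Passthru = MI.getOperand(1); // assumed passthru position
  if (!Passthru.isReg() || Passthru.getReg() == RISCV::NoRegister)
    return;
  MachineOperand &VLOp = MI.getOperand(RISCVII::getVLOpNum(MI.getDesc()));
  // checkUsers returns the largest VL any user demands of MI's result.
  auto DemandedVL = checkUsers(MI);
  if (DemandedVL && RISCV::isVLKnownLE(*DemandedVL, VLOp))
    Passthru.setReg(RISCV::NoRegister); // tail (and passthru) no longer needed
}
```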
Yeah, agreed the recursion is less than ideal, but it's only unbounded by the number of passthrus that are fed into passthrus, which doesn't seem to be common in practice.
Unfortunately there's no way to represent a tail undefined ternary pseudo, because the passthru is also the third operand. I wrote a bit more of a description of this in #123760. I tried adding new …
Given that we're checking the demanded VL of the users downwards, and we iterate upwards over each block, it might be possible to memoize the result of checkUsers? We would need to invalidate it whenever we reduce a VL though. I'll have a think about this tonight anyway to see if there's another approach.
Off the top of my head, vsadd etc. have extra implicit defs, but I'm not sure if it's safe to change their VL. Fault-first loads have an extra explicit def, but I don't think they're safe to reduce either.
I had played around with a patch in the past that changed this line to …
This replaces the worklist by instead computing what VL is demanded by each instruction's users first. checkUsers essentially already did this, so it's been renamed to computeDemandedVL. The demanded VLs are stored in a DenseMap, and then we can just do a single forward pass of tryReduceVL where we check if a candidate's demanded VL is less than its VLOp. This means the pass should now be linear in complexity, and allows us to relax the restriction on tied operands more easily as in llvm#124066. Note that in order to avoid std::optional inside the DenseMap, I've removed the std::optionals and replaced them with VLMAX or 0 constant operands.
The solution I ended up with is in #124530. It turns out we don't need to invalidate anything because the demanded VL never changes: we can compute the minimum possible VL for each instruction up front without needing to mutate anything. I'll rebase this PR on top of it once it's landed, as well as #124549, which was split off to help explain some diffs.
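For readers following along, here is a minimal sketch of that two-phase structure, assuming the shape described in the #124530 commit message above; the function and member names (computeDemandedVLs, DemandedVLs) are assumptions, not the verbatim patch.

```cpp
// Sketch only: phase 1 walks each block bottom-up (users before defs) and
// records the largest VL any user demands of a candidate; phase 2 does a
// single forward pass that shrinks VL operands using that map. No recursion,
// and nothing ever needs to be invalidated.
DenseMap<const MachineInstr *, MachineOperand> DemandedVLs;

void computeDemandedVLs(MachineBasicBlock &MBB) {
  for (MachineInstr &MI : llvm::reverse(MBB))
    if (isCandidate(MI))
      if (auto VL = checkUsers(MI))
        DemandedVLs.insert({&MI, *VL});
}

bool tryReduceVL(MachineInstr &MI) {
  auto It = DemandedVLs.find(&MI);
  if (It == DemandedVLs.end())
    return false;
  MachineOperand &VLOp = MI.getOperand(RISCVII::getVLOpNum(MI.getDesc()));
  // Only shrink when the demanded VL is provably smaller than the current VL.
  if (!RISCV::isVLKnownLE(It->second, VLOp) || It->second.isIdenticalTo(VLOp))
    return false;
  if (It->second.isImm())
    VLOp.ChangeToImmediate(It->second.getImm());
  else
    VLOp.ChangeToRegister(It->second.getReg(), /*isDef=*/false);
  return true;
}
```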
This replaces the worklist by instead computing what VL is demanded by each instruction's users first, which is done via checkUsers. The demanded VLs are stored in a DenseMap, and then we can just do a single forward pass of tryReduceVL where we check if a candidate's demanded VL is less than its VLOp. This means the pass should now be linear in complexity, and allows us to relax the restriction on tied operands more easily as in #124066.
Force-pushed from c6cfbe7 to f77d01d.
I've rebased this on top of #124530, so now there's no recursion.
LGTM
The only point I'm still a bit hesitant on is whether we have some instruction with "weird" passthru usage (e.g. read after VL). I can't think of any, and spot-checking existing code doesn't seem to reveal any special casing, so I think we're okay here. Both passthru for TU/MU and FMA-style extra sources should be fine with the code as written.
I guess one other point is that I'm surprised this has so little impact on the tests. Apparently we really don't use anything other than tail undef in practice, do we?
vslidedowns and vrgathers can read past VL, but we guard against them already via …
It shows up in a lot more places if we stop emitting non-trapping VP intrinsics with EVL tail folding. We also probably never see tail undef ternary pseudos, because that would mean one of the operands would have to be undef too.
…llvm#126504) After llvm#124066 we started allowing users that are passthrus. However, for widening/narrowing instructions we were returning the wrong operand info for passthru operands, since it originally assumed the operand would never be a passthru. This fixes it by handling it in IsMODef.
The motivation for this is to allow reducing the vl when a user is a ternary pseudo, where the third operand is tied and also acts as a passthru.
When checking the users of an instruction, we currently bail if the user is used as a passthru because all of its elements past vl will be used for the tail.
We can allow passthru users if we know the tail of their result isn't used, which we will have computed beforehand after #124530.
It's worth noting that this is all irrespective of the tail policy, because tail agnostic still ends up using the passthru.
I've checked that SPEC CPU 2017 + llvm-test-suite pass with this (on qemu with rvv_ta_all_1s=true).
Fixes #123760
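For completeness, a rough sketch of how the passthru check from the diff above might read once rebased on #124530: the recursive checkUsers(UserMI) call is replaced by a lookup of the user's precomputed demanded VL. DemandedVLs is an assumed name; this illustrates the idea rather than reproducing the final committed code.

```cpp
// Allow a tied (passthru) use when the user's own tail is known not to be
// demanded, using the demanded VL computed up front instead of recursing.
if (UserOp.isTied()) {
  auto It = DemandedVLs.find(&UserMI);
  if (It == DemandedVLs.end() || !RISCV::isVLKnownLE(It->second, *VLOp)) {
    LLVM_DEBUG(dbgs() << "  Abort because user is passthru in "
                         "instruction with demanded tail\n");
    return std::nullopt;
  }
}
```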