AMDGPU: Try constant fold after folding immediate #141862
@llvm/pr-subscribers-backend-amdgpu

Author: Matt Arsenault (arsenm)

Changes

This helps avoid some regressions in a future patch. The or 0 pattern appears in the division tests because reducing a 64-bit bit operation to a 32-bit one with a half identity value is only implemented for constants. We could fix that by using computeKnownBits. Additionally, the pattern disappears if I optimize the IR division expansion, so that IR should probably be emitted more optimally in the first place.

Full diff: https://github.com/llvm/llvm-project/pull/141862.diff

9 Files Affected:
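For illustration, a minimal MIR sketch of the pattern this change enables folding, adapted from the new constant-fold-imm-immreg.mir test below (the comments are explanatory and not part of the test):

    %2:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
    %3:vreg_64 = REG_SEQUENCE %2, %subreg.sub0, %0, %subreg.sub1
    %4:vgpr_32 = V_OR_B32_e64 %3.sub0, %1, implicit $exec
    ; Folding the immediate: %3.sub0 is the V_MOV_B32 0, so the use
    ; becomes V_OR_B32_e64 0, %1. With this patch, tryConstantFoldOp
    ; then runs on the updated use and reduces the OR with the
    ; identity value 0 to a plain COPY of %1.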
diff --git a/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp b/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
index 26167134652ce..67690adc57e73 100644
--- a/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
+++ b/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
@@ -1698,6 +1698,12 @@ bool SIFoldOperandsImpl::foldInstOperand(MachineInstr &MI,
LLVM_DEBUG(dbgs() << "Folded source from " << MI << " into OpNo "
<< static_cast<int>(Fold.UseOpNo) << " of "
<< *Fold.UseMI);
+
+ if (Fold.isImm() && tryConstantFoldOp(Fold.UseMI)) {
+ LLVM_DEBUG(dbgs() << "Constant folded " << *Fold.UseMI);
+ Changed = true;
+ }
+
} else if (Fold.Commuted) {
// Restoring instruction's original operand order if fold has failed.
TII->commuteInstruction(*Fold.UseMI, false);
diff --git a/llvm/test/CodeGen/AMDGPU/bit-op-reduce-width-known-bits.ll b/llvm/test/CodeGen/AMDGPU/bit-op-reduce-width-known-bits.ll
index ac5f9b6b483eb..ad26dfa7f93e8 100644
--- a/llvm/test/CodeGen/AMDGPU/bit-op-reduce-width-known-bits.ll
+++ b/llvm/test/CodeGen/AMDGPU/bit-op-reduce-width-known-bits.ll
@@ -105,9 +105,8 @@ define i64 @v_xor_i64_known_i32_from_range_use_out_of_block(i64 %x) {
; CHECK-NEXT: s_and_saveexec_b64 s[4:5], vcc
; CHECK-NEXT: ; %bb.1: ; %inc
; CHECK-NEXT: v_not_b32_e32 v2, v4
-; CHECK-NEXT: v_not_b32_e32 v3, 0
; CHECK-NEXT: v_add_co_u32_e32 v2, vcc, v0, v2
-; CHECK-NEXT: v_addc_co_u32_e32 v3, vcc, v1, v3, vcc
+; CHECK-NEXT: v_addc_co_u32_e32 v3, vcc, -1, v1, vcc
; CHECK-NEXT: ; %bb.2: ; %UnifiedReturnBlock
; CHECK-NEXT: s_or_b64 exec, exec, s[4:5]
; CHECK-NEXT: v_mov_b32_e32 v0, v2
diff --git a/llvm/test/CodeGen/AMDGPU/constant-fold-imm-immreg.mir b/llvm/test/CodeGen/AMDGPU/constant-fold-imm-immreg.mir
index fe2b0bb1ff6ae..e7177a5e7160e 100644
--- a/llvm/test/CodeGen/AMDGPU/constant-fold-imm-immreg.mir
+++ b/llvm/test/CodeGen/AMDGPU/constant-fold-imm-immreg.mir
@@ -961,3 +961,25 @@ body: |
S_ENDPGM 0, implicit %2, implicit %3
...
+
+---
+name: constant_v_or_b32_uses_subreg_or_0_regression
+tracksRegLiveness: true
+body: |
+ bb.0:
+ liveins: $vgpr0, $vgpr1
+
+ ; GCN-LABEL: name: constant_v_or_b32_uses_subreg_or_0_regression
+ ; GCN: liveins: $vgpr0, $vgpr1
+ ; GCN-NEXT: {{ $}}
+ ; GCN-NEXT: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr1
+ ; GCN-NEXT: [[COPY1:%[0-9]+]]:vgpr_32 = COPY [[COPY]]
+ ; GCN-NEXT: S_ENDPGM 0, implicit [[COPY1]]
+ %0:vgpr_32 = COPY $vgpr0
+ %1:vgpr_32 = COPY $vgpr1
+ %2:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
+ %3:vreg_64 = REG_SEQUENCE %2:vgpr_32, %subreg.sub0, %0:vgpr_32, %subreg.sub1
+ %4:vgpr_32 = V_OR_B32_e64 %3.sub0:vreg_64, %1, implicit $exec
+ S_ENDPGM 0, implicit %4
+
+...
diff --git a/llvm/test/CodeGen/AMDGPU/fold-imm-copy.mir b/llvm/test/CodeGen/AMDGPU/fold-imm-copy.mir
index 706c0d8178d70..c1fc06591bd01 100644
--- a/llvm/test/CodeGen/AMDGPU/fold-imm-copy.mir
+++ b/llvm/test/CodeGen/AMDGPU/fold-imm-copy.mir
@@ -43,8 +43,7 @@ body: |
; GCN-NEXT: [[DEF2:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
; GCN-NEXT: [[V_MOV_B32_e32_:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
; GCN-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_64 = REG_SEQUENCE killed [[DEF]], %subreg.sub0, killed [[V_MOV_B32_e32_]], %subreg.sub1
- ; GCN-NEXT: [[V_XOR_B32_e32_:%[0-9]+]]:vgpr_32 = V_XOR_B32_e32 0, [[DEF1]], implicit $exec
- ; GCN-NEXT: [[V_XOR_B32_e32_1:%[0-9]+]]:vgpr_32 = V_XOR_B32_e32 [[DEF2]], [[REG_SEQUENCE]].sub0, implicit $exec
+ ; GCN-NEXT: [[V_XOR_B32_e32_:%[0-9]+]]:vgpr_32 = V_XOR_B32_e32 [[DEF2]], [[REG_SEQUENCE]].sub0, implicit $exec
%0:vgpr_32 = IMPLICIT_DEF
%1:vgpr_32 = IMPLICIT_DEF
%2:vgpr_32 = IMPLICIT_DEF
diff --git a/llvm/test/CodeGen/AMDGPU/fold-zero-high-bits-skips-non-reg.mir b/llvm/test/CodeGen/AMDGPU/fold-zero-high-bits-skips-non-reg.mir
index b1aa88969c5bb..dc03eb74cbf11 100644
--- a/llvm/test/CodeGen/AMDGPU/fold-zero-high-bits-skips-non-reg.mir
+++ b/llvm/test/CodeGen/AMDGPU/fold-zero-high-bits-skips-non-reg.mir
@@ -8,8 +8,8 @@ body: |
; CHECK-LABEL: name: test_tryFoldZeroHighBits_skips_nonreg
; CHECK: [[V_MOV_B32_e32_:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
; CHECK-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_64 = REG_SEQUENCE [[V_MOV_B32_e32_]], %subreg.sub0, [[V_MOV_B32_e32_]], %subreg.sub1
- ; CHECK-NEXT: [[V_AND_B32_e64_:%[0-9]+]]:vgpr_32 = V_AND_B32_e64 65535, 0, implicit $exec
- ; CHECK-NEXT: S_NOP 0, implicit [[V_AND_B32_e64_]]
+ ; CHECK-NEXT: [[V_MOV_B32_e32_1:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
+ ; CHECK-NEXT: S_NOP 0, implicit [[V_MOV_B32_e32_1]]
%0:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
%1:vreg_64 = REG_SEQUENCE %0, %subreg.sub0, %0, %subreg.sub1
%2:vgpr_32 = V_AND_B32_e64 65535, %1.sub0, implicit $exec
diff --git a/llvm/test/CodeGen/AMDGPU/sdiv64.ll b/llvm/test/CodeGen/AMDGPU/sdiv64.ll
index a166c4f93462d..e6f9bb5171419 100644
--- a/llvm/test/CodeGen/AMDGPU/sdiv64.ll
+++ b/llvm/test/CodeGen/AMDGPU/sdiv64.ll
@@ -404,12 +404,11 @@ define i64 @v_test_sdiv(i64 %x, i64 %y) {
; GCN-IR-NEXT: ; %bb.2: ; %udiv-preheader
; GCN-IR-NEXT: v_add_i32_e32 v16, vcc, -1, v0
; GCN-IR-NEXT: v_addc_u32_e32 v17, vcc, -1, v1, vcc
-; GCN-IR-NEXT: v_not_b32_e32 v5, v10
+; GCN-IR-NEXT: v_not_b32_e32 v4, v10
; GCN-IR-NEXT: v_lshr_b64 v[8:9], v[6:7], v8
-; GCN-IR-NEXT: v_not_b32_e32 v4, 0
-; GCN-IR-NEXT: v_add_i32_e32 v6, vcc, v5, v11
+; GCN-IR-NEXT: v_add_i32_e32 v6, vcc, v4, v11
; GCN-IR-NEXT: v_mov_b32_e32 v10, 0
-; GCN-IR-NEXT: v_addc_u32_e32 v7, vcc, 0, v4, vcc
+; GCN-IR-NEXT: v_addc_u32_e64 v7, s[4:5], -1, 0, vcc
; GCN-IR-NEXT: s_mov_b64 s[10:11], 0
; GCN-IR-NEXT: v_mov_b32_e32 v11, 0
; GCN-IR-NEXT: v_mov_b32_e32 v5, 0
diff --git a/llvm/test/CodeGen/AMDGPU/srem64.ll b/llvm/test/CodeGen/AMDGPU/srem64.ll
index c9e5ff444f715..c3838dad436e0 100644
--- a/llvm/test/CodeGen/AMDGPU/srem64.ll
+++ b/llvm/test/CodeGen/AMDGPU/srem64.ll
@@ -380,12 +380,11 @@ define i64 @v_test_srem(i64 %x, i64 %y) {
; GCN-IR-NEXT: ; %bb.2: ; %udiv-preheader
; GCN-IR-NEXT: v_add_i32_e32 v16, vcc, -1, v2
; GCN-IR-NEXT: v_addc_u32_e32 v17, vcc, -1, v3, vcc
-; GCN-IR-NEXT: v_not_b32_e32 v7, v12
+; GCN-IR-NEXT: v_not_b32_e32 v6, v12
; GCN-IR-NEXT: v_lshr_b64 v[10:11], v[0:1], v8
-; GCN-IR-NEXT: v_not_b32_e32 v6, 0
-; GCN-IR-NEXT: v_add_i32_e32 v8, vcc, v7, v13
+; GCN-IR-NEXT: v_add_i32_e32 v8, vcc, v6, v13
; GCN-IR-NEXT: v_mov_b32_e32 v12, 0
-; GCN-IR-NEXT: v_addc_u32_e32 v9, vcc, 0, v6, vcc
+; GCN-IR-NEXT: v_addc_u32_e64 v9, s[4:5], -1, 0, vcc
; GCN-IR-NEXT: s_mov_b64 s[10:11], 0
; GCN-IR-NEXT: v_mov_b32_e32 v13, 0
; GCN-IR-NEXT: v_mov_b32_e32 v7, 0
diff --git a/llvm/test/CodeGen/AMDGPU/udiv64.ll b/llvm/test/CodeGen/AMDGPU/udiv64.ll
index 5acbb044c1057..e9017939f8a4a 100644
--- a/llvm/test/CodeGen/AMDGPU/udiv64.ll
+++ b/llvm/test/CodeGen/AMDGPU/udiv64.ll
@@ -348,10 +348,9 @@ define i64 @v_test_udiv_i64(i64 %x, i64 %y) {
; GCN-IR-NEXT: v_lshr_b64 v[8:9], v[0:1], v10
; GCN-IR-NEXT: v_addc_u32_e32 v13, vcc, -1, v3, vcc
; GCN-IR-NEXT: v_not_b32_e32 v0, v14
-; GCN-IR-NEXT: v_not_b32_e32 v1, 0
; GCN-IR-NEXT: v_add_i32_e32 v0, vcc, v0, v15
; GCN-IR-NEXT: v_mov_b32_e32 v10, 0
-; GCN-IR-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc
+; GCN-IR-NEXT: v_addc_u32_e64 v1, s[4:5], -1, 0, vcc
; GCN-IR-NEXT: s_mov_b64 s[10:11], 0
; GCN-IR-NEXT: v_mov_b32_e32 v11, 0
; GCN-IR-NEXT: v_mov_b32_e32 v7, 0
diff --git a/llvm/test/CodeGen/AMDGPU/urem64.ll b/llvm/test/CodeGen/AMDGPU/urem64.ll
index 94f1b83ea2765..6480a88d40f5a 100644
--- a/llvm/test/CodeGen/AMDGPU/urem64.ll
+++ b/llvm/test/CodeGen/AMDGPU/urem64.ll
@@ -355,12 +355,11 @@ define i64 @v_test_urem_i64(i64 %x, i64 %y) {
; GCN-IR-NEXT: ; %bb.2: ; %udiv-preheader
; GCN-IR-NEXT: v_add_i32_e32 v14, vcc, -1, v2
; GCN-IR-NEXT: v_addc_u32_e32 v15, vcc, -1, v3, vcc
-; GCN-IR-NEXT: v_not_b32_e32 v7, v12
+; GCN-IR-NEXT: v_not_b32_e32 v6, v12
; GCN-IR-NEXT: v_lshr_b64 v[10:11], v[0:1], v8
-; GCN-IR-NEXT: v_not_b32_e32 v6, 0
-; GCN-IR-NEXT: v_add_i32_e32 v8, vcc, v7, v13
+; GCN-IR-NEXT: v_add_i32_e32 v8, vcc, v6, v13
; GCN-IR-NEXT: v_mov_b32_e32 v12, 0
-; GCN-IR-NEXT: v_addc_u32_e32 v9, vcc, 0, v6, vcc
+; GCN-IR-NEXT: v_addc_u32_e64 v9, s[4:5], -1, 0, vcc
; GCN-IR-NEXT: s_mov_b64 s[10:11], 0
; GCN-IR-NEXT: v_mov_b32_e32 v13, 0
; GCN-IR-NEXT: v_mov_b32_e32 v7, 0
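As a concrete instance of the or 0 pattern described above, the division tests previously materialized ~0 at runtime. A before/after sketch taken from the udiv64.ll hunk (the comments are mine):

    ; before: the NOT of a known-zero value survives to the final code
    v_not_b32_e32 v1, 0                    ; v1 = ~0 = -1
    v_addc_u32_e32 v1, vcc, 0, v1, vcc     ; v1 = 0 + v1 + carry
    ; after: the NOT constant-folds and -1 becomes an inline immediate
    v_addc_u32_e64 v1, s[4:5], -1, 0, vcc  ; v1 = -1 + 0 + carry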