AMDGPU: Disable pattern matching "x<<32-y>>32-y" to "bfe x, 0, y" #114279
Conversation
It is not correct to lower "x<<32-y>>32-y" to "bfe x, 0, y". When y equals 32, the left-hand side is x (unchanged), but the right-hand side evaluates to 0. We may be able to keep the pattern for an immediate y within [0, 31]. However, the immediate operands of the sub (32 - y) are easily folded, and "x<<imm>>imm" will be lowered to "and x, (2^(32-imm))-1" anyway. So no bfe matching is needed.
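To make the failure mode concrete, here is a small standalone C++ model (not code from the patch; the bfe model assumes the width operand is read modulo 32, which is the behavior the description above relies on):

```cpp
#include <cstdint>
#include <cstdio>

// Shift form: x << (32 - y) >> (32 - y). For y in [1, 32] the shift
// amount stays in [0, 31]; at y == 32 both shifts are by 0 and x is
// returned unchanged.
static uint32_t shift_form(uint32_t x, uint32_t y) {
  uint32_t amt = 32 - y;
  return (x << amt) >> amt;
}

// Rough model of "v_bfe_u32 x, 0, y": assume the width operand is taken
// modulo 32, so y == 32 behaves like width 0 and the result is 0.
static uint32_t bfe_form(uint32_t x, uint32_t y) {
  uint32_t width = y & 31;
  return width ? (x & ((1u << width) - 1u)) : 0u;
}

int main() {
  uint32_t x = 0xDEADBEEFu;
  std::printf("y=8:  %08x vs %08x\n", shift_form(x, 8), bfe_form(x, 8));   // agree: 000000ef
  std::printf("y=32: %08x vs %08x\n", shift_form(x, 32), bfe_form(x, 32)); // differ: deadbeef vs 00000000
}
```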
@llvm/pr-subscribers-backend-amdgpu Author: Changpeng Fang (changpeng) Changes: It is not correct to lower "x<<32-y>>32-y" to "bfe x, 0, y". When y equals 32, the left-hand side is still x (unchanged), but the right-hand side evaluates to 0, so the transformation is not always correct. We may be able to keep the pattern for an immediate y within [0, 31]. However, the immediate operands of the sub (32 - y) are easily folded, and "x<<imm>>imm" will be lowered to "and x, (2^(32-imm))-1" anyway. So no bfe matching is needed. Full diff: https://github.com/llvm/llvm-project/pull/114279.diff 3 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SIInstructions.td b/llvm/lib/Target/AMDGPU/SIInstructions.td
index faa0b6d6c3f506..c8a46217190a1d 100644
--- a/llvm/lib/Target/AMDGPU/SIInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SIInstructions.td
@@ -3553,19 +3553,6 @@ def : AMDGPUPat <
(V_BFE_U32_e64 $src, (i32 0), $width)
>;
-// x << (bitwidth - y) >> (bitwidth - y)
-def : AMDGPUPat <
- (DivergentBinFrag<srl> (shl_oneuse i32:$src, (sub 32, i32:$width)),
- (sub 32, i32:$width)),
- (V_BFE_U32_e64 $src, (i32 0), $width)
->;
-
-def : AMDGPUPat <
- (DivergentBinFrag<sra> (shl_oneuse i32:$src, (sub 32, i32:$width)),
- (sub 32, i32:$width)),
- (V_BFE_I32_e64 $src, (i32 0), $width)
->;
-
// SHA-256 Ma patterns
// ((x & z) | (y & (x | z))) -> BFI (XOR x, y), z, y
diff --git a/llvm/test/CodeGen/AMDGPU/bfe-patterns.ll b/llvm/test/CodeGen/AMDGPU/bfe-patterns.ll
index f54ea615ca6645..c57a35aa1880db 100644
--- a/llvm/test/CodeGen/AMDGPU/bfe-patterns.ll
+++ b/llvm/test/CodeGen/AMDGPU/bfe-patterns.ll
@@ -17,7 +17,9 @@ define amdgpu_kernel void @v_ubfe_sub_i32(ptr addrspace(1) %out, ptr addrspace(1
; SI-NEXT: buffer_load_dword v3, v[0:1], s[4:7], 0 addr64 glc
; SI-NEXT: s_waitcnt vmcnt(0)
; SI-NEXT: s_mov_b64 s[2:3], s[6:7]
-; SI-NEXT: v_bfe_u32 v2, v2, 0, v3
+; SI-NEXT: v_sub_i32_e32 v3, vcc, 32, v3
+; SI-NEXT: v_lshlrev_b32_e32 v2, v3, v2
+; SI-NEXT: v_lshrrev_b32_e32 v2, v3, v2
; SI-NEXT: buffer_store_dword v2, v[0:1], s[0:3], 0 addr64
; SI-NEXT: s_endpgm
;
@@ -36,7 +38,9 @@ define amdgpu_kernel void @v_ubfe_sub_i32(ptr addrspace(1) %out, ptr addrspace(1
; VI-NEXT: v_mov_b32_e32 v1, s1
; VI-NEXT: v_add_u32_e32 v0, vcc, s0, v2
; VI-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc
-; VI-NEXT: v_bfe_u32 v2, v3, 0, v4
+; VI-NEXT: v_sub_u32_e32 v2, vcc, 32, v4
+; VI-NEXT: v_lshlrev_b32_e32 v3, v2, v3
+; VI-NEXT: v_lshrrev_b32_e32 v2, v2, v3
; VI-NEXT: flat_store_dword v[0:1], v2
; VI-NEXT: s_endpgm
%id.x = tail call i32 @llvm.amdgcn.workitem.id.x()
@@ -215,7 +219,9 @@ define amdgpu_kernel void @v_sbfe_sub_i32(ptr addrspace(1) %out, ptr addrspace(1
; SI-NEXT: buffer_load_dword v3, v[0:1], s[4:7], 0 addr64 glc
; SI-NEXT: s_waitcnt vmcnt(0)
; SI-NEXT: s_mov_b64 s[2:3], s[6:7]
-; SI-NEXT: v_bfe_i32 v2, v2, 0, v3
+; SI-NEXT: v_sub_i32_e32 v3, vcc, 32, v3
+; SI-NEXT: v_lshlrev_b32_e32 v2, v3, v2
+; SI-NEXT: v_ashrrev_i32_e32 v2, v3, v2
; SI-NEXT: buffer_store_dword v2, v[0:1], s[0:3], 0 addr64
; SI-NEXT: s_endpgm
;
@@ -234,7 +240,9 @@ define amdgpu_kernel void @v_sbfe_sub_i32(ptr addrspace(1) %out, ptr addrspace(1
; VI-NEXT: v_mov_b32_e32 v1, s1
; VI-NEXT: v_add_u32_e32 v0, vcc, s0, v2
; VI-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc
-; VI-NEXT: v_bfe_i32 v2, v3, 0, v4
+; VI-NEXT: v_sub_u32_e32 v2, vcc, 32, v4
+; VI-NEXT: v_lshlrev_b32_e32 v3, v2, v3
+; VI-NEXT: v_ashrrev_i32_e32 v2, v2, v3
; VI-NEXT: flat_store_dword v[0:1], v2
; VI-NEXT: s_endpgm
%id.x = tail call i32 @llvm.amdgcn.workitem.id.x()
diff --git a/llvm/test/CodeGen/AMDGPU/extract-lowbits.ll b/llvm/test/CodeGen/AMDGPU/extract-lowbits.ll
index 9677ec41ce268a..3d9616f02d52d1 100644
--- a/llvm/test/CodeGen/AMDGPU/extract-lowbits.ll
+++ b/llvm/test/CodeGen/AMDGPU/extract-lowbits.ll
@@ -150,11 +150,21 @@ define i32 @bzhi32_c4_commutative(i32 %val, i32 %numlowbits) nounwind {
; ---------------------------------------------------------------------------- ;
define i32 @bzhi32_d0(i32 %val, i32 %numlowbits) nounwind {
-; GCN-LABEL: bzhi32_d0:
-; GCN: ; %bb.0:
-; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GCN-NEXT: v_bfe_u32 v0, v0, 0, v1
-; GCN-NEXT: s_setpc_b64 s[30:31]
+; SI-LABEL: bzhi32_d0:
+; SI: ; %bb.0:
+; SI-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; SI-NEXT: v_sub_i32_e32 v1, vcc, 32, v1
+; SI-NEXT: v_lshlrev_b32_e32 v0, v1, v0
+; SI-NEXT: v_lshrrev_b32_e32 v0, v1, v0
+; SI-NEXT: s_setpc_b64 s[30:31]
+;
+; VI-LABEL: bzhi32_d0:
+; VI: ; %bb.0:
+; VI-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; VI-NEXT: v_sub_u32_e32 v1, vcc, 32, v1
+; VI-NEXT: v_lshlrev_b32_e32 v0, v1, v0
+; VI-NEXT: v_lshrrev_b32_e32 v0, v1, v0
+; VI-NEXT: s_setpc_b64 s[30:31]
%numhighbits = sub i32 32, %numlowbits
%highbitscleared = shl i32 %val, %numhighbits
%masked = lshr i32 %highbitscleared, %numhighbits
Playing with Alive, I don't see how this is ever correct.
"bfe x, 0, y" is to extract the lower y bits from x, and thus it is equivalent to x left shift (32 - y), then right shift (32 - y) --- to clear the upper 32 - y bits of x. |
I made a mistake with the pre-shift in the original testcase. https://alive2.llvm.org/ce/z/nyk6tV We could do a computeKnownBits-based range check.
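A rough sketch of the kind of range check that comment suggests (a hypothetical helper, not code from this patch; it assumes a SelectionDAG context where computeKnownBits is available):

```cpp
#include "llvm/CodeGen/SelectionDAG.h"
#include "llvm/Support/KnownBits.h"

using namespace llvm;

// Hypothetical guard: only form "bfe x, 0, y" from the shift pair when
// known bits prove y is in [1, 31], so the hardware's 5-bit width field
// cannot see y == 32 (which it would treat as width 0).
static bool widthIsSafeForBFE(SelectionDAG &DAG, SDValue Width) {
  KnownBits Known = DAG.computeKnownBits(Width);
  return Known.getMinValue().uge(1) && Known.getMaxValue().ule(31);
}
```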
But yes, this form doesn't appear canonically. We are missing a fold here though:
Produces:
We should be able to fold out the "and"; we know the instruction only reads the low 5 bits.
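Assuming the instruction in question is a 32-bit shift that reads only the low 5 bits of its amount (which is how I read the comment above), the redundancy can be modeled in the same standalone style as before:

```cpp
#include <cstdint>
#include <cassert>

// Model of a hardware 32-bit shift that uses only the low 5 bits of
// its shift amount.
static uint32_t hw_shl(uint32_t x, uint32_t amt) { return x << (amt & 31); }

int main() {
  // Masking the amount with 31 before such a shift never changes the
  // result, so an explicit "and amt, 31" feeding it can be folded away.
  for (uint32_t amt = 0; amt < 256; ++amt)
    assert(hw_shl(0xDEADBEEFu, amt & 31) == hw_shl(0xDEADBEEFu, amt));
}
```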
Filed #114282 for the missed optimization.
Right.
Thanks. I am going to work on it.
…vm#114279) It is not correct to lower "x<<32-y>>32-y" to "bfe x, 0, y". When y equals 32, the left-hand side is still x (unchanged), but the right-hand side evaluates to 0, so the transformation is not always correct. We may be able to keep the pattern for an immediate y within [0, 31]. However, the immediate operands of the sub (32 - y) are easily folded, and "(x << imm) >> imm" will be lowered to "and x, (2^(32-imm))-1" anyway. So no bfe matching is needed.
Enable pattern matching from "x<<32-y>>32-y" to "bfe x, 0, y" when we know y is in [0, 31]. This is the follow-up to PR llvm#114279, fixing issue llvm#114282.