AMDGPU: Disable pattern matching "x<<32-y>>32-y" to "bfe x, 0, y" #114279

changpeng · 2024-10-30T17:37:32Z

It is not correct to lower "x<<32-y>>32-y" to "bfe x, 0, y". When y equals to 32, the left-hand side is still x (unchanged), however, the right-hand side will be evaluated to 0. So it is not always correct to do such transformation.

We may be able to keep the pattern for immediate y while y is within [0, 31]. However, the immediate operands of the sub (32 - y) are easily folded, and "(x << imm) >> imm" will be lowered to "and x, (2^(32-imm))-1" anyway. So no bfe matching is needed.

It is not correct to lower "x<<32-y>>32-y" to "bfe x, 0, y". When y equals to 32, the left-hand size is x (unchanged), however, the right-hand side is 0. We may be able to keep the pattern for immediate y while y is within [0, 31]. However, the immediate operands of sub (32 - y) are easily folded, and "x<<imm>>imm" will be lowered to "and x, (2^(32-imm))-1". So no bfe matching is needed.

llvmbot · 2024-10-30T17:38:10Z

@llvm/pr-subscribers-backend-amdgpu

Author: Changpeng Fang (changpeng)

Changes

It is not correct to lower "x<<32-y>>32-y" to "bfe x, 0, y". When y equals to 32, the left-hand side is still x (unchanged), however, the right-hand side will be evaluated to 0. So it is not always correct to do such transformation.

We may be able to keep the pattern for immediate y while y is within [0, 31]. However, the immediate operands of the sub (32 - y) are easily folded, and "x<<imm>>imm" will be lowered to "and x, (2^(32-imm))-1" anyway. So no bfe matching is needed.

Full diff: https://github.com/llvm/llvm-project/pull/114279.diff

3 Files Affected:

(modified) llvm/lib/Target/AMDGPU/SIInstructions.td (-13)
(modified) llvm/test/CodeGen/AMDGPU/bfe-patterns.ll (+12-4)
(modified) llvm/test/CodeGen/AMDGPU/extract-lowbits.ll (+15-5)

diff --git a/llvm/lib/Target/AMDGPU/SIInstructions.td b/llvm/lib/Target/AMDGPU/SIInstructions.td
index faa0b6d6c3f506..c8a46217190a1d 100644
--- a/llvm/lib/Target/AMDGPU/SIInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SIInstructions.td
@@ -3553,19 +3553,6 @@ def : AMDGPUPat <
   (V_BFE_U32_e64 $src, (i32 0), $width)
 >;
 
-// x << (bitwidth - y) >> (bitwidth - y)
-def : AMDGPUPat <
-  (DivergentBinFrag<srl> (shl_oneuse i32:$src, (sub 32, i32:$width)),
-                         (sub 32, i32:$width)),
-  (V_BFE_U32_e64 $src, (i32 0), $width)
->;
-
-def : AMDGPUPat <
-  (DivergentBinFrag<sra> (shl_oneuse i32:$src, (sub 32, i32:$width)),
-                         (sub 32, i32:$width)),
-  (V_BFE_I32_e64 $src, (i32 0), $width)
->;
-
 // SHA-256 Ma patterns
 
 // ((x & z) | (y & (x | z))) -> BFI (XOR x, y), z, y
diff --git a/llvm/test/CodeGen/AMDGPU/bfe-patterns.ll b/llvm/test/CodeGen/AMDGPU/bfe-patterns.ll
index f54ea615ca6645..c57a35aa1880db 100644
--- a/llvm/test/CodeGen/AMDGPU/bfe-patterns.ll
+++ b/llvm/test/CodeGen/AMDGPU/bfe-patterns.ll
@@ -17,7 +17,9 @@ define amdgpu_kernel void @v_ubfe_sub_i32(ptr addrspace(1) %out, ptr addrspace(1
 ; SI-NEXT:    buffer_load_dword v3, v[0:1], s[4:7], 0 addr64 glc
 ; SI-NEXT:    s_waitcnt vmcnt(0)
 ; SI-NEXT:    s_mov_b64 s[2:3], s[6:7]
-; SI-NEXT:    v_bfe_u32 v2, v2, 0, v3
+; SI-NEXT:    v_sub_i32_e32 v3, vcc, 32, v3
+; SI-NEXT:    v_lshlrev_b32_e32 v2, v3, v2
+; SI-NEXT:    v_lshrrev_b32_e32 v2, v3, v2
 ; SI-NEXT:    buffer_store_dword v2, v[0:1], s[0:3], 0 addr64
 ; SI-NEXT:    s_endpgm
 ;
@@ -36,7 +38,9 @@ define amdgpu_kernel void @v_ubfe_sub_i32(ptr addrspace(1) %out, ptr addrspace(1
 ; VI-NEXT:    v_mov_b32_e32 v1, s1
 ; VI-NEXT:    v_add_u32_e32 v0, vcc, s0, v2
 ; VI-NEXT:    v_addc_u32_e32 v1, vcc, 0, v1, vcc
-; VI-NEXT:    v_bfe_u32 v2, v3, 0, v4
+; VI-NEXT:    v_sub_u32_e32 v2, vcc, 32, v4
+; VI-NEXT:    v_lshlrev_b32_e32 v3, v2, v3
+; VI-NEXT:    v_lshrrev_b32_e32 v2, v2, v3
 ; VI-NEXT:    flat_store_dword v[0:1], v2
 ; VI-NEXT:    s_endpgm
   %id.x = tail call i32 @llvm.amdgcn.workitem.id.x()
@@ -215,7 +219,9 @@ define amdgpu_kernel void @v_sbfe_sub_i32(ptr addrspace(1) %out, ptr addrspace(1
 ; SI-NEXT:    buffer_load_dword v3, v[0:1], s[4:7], 0 addr64 glc
 ; SI-NEXT:    s_waitcnt vmcnt(0)
 ; SI-NEXT:    s_mov_b64 s[2:3], s[6:7]
-; SI-NEXT:    v_bfe_i32 v2, v2, 0, v3
+; SI-NEXT:    v_sub_i32_e32 v3, vcc, 32, v3
+; SI-NEXT:    v_lshlrev_b32_e32 v2, v3, v2
+; SI-NEXT:    v_ashrrev_i32_e32 v2, v3, v2
 ; SI-NEXT:    buffer_store_dword v2, v[0:1], s[0:3], 0 addr64
 ; SI-NEXT:    s_endpgm
 ;
@@ -234,7 +240,9 @@ define amdgpu_kernel void @v_sbfe_sub_i32(ptr addrspace(1) %out, ptr addrspace(1
 ; VI-NEXT:    v_mov_b32_e32 v1, s1
 ; VI-NEXT:    v_add_u32_e32 v0, vcc, s0, v2
 ; VI-NEXT:    v_addc_u32_e32 v1, vcc, 0, v1, vcc
-; VI-NEXT:    v_bfe_i32 v2, v3, 0, v4
+; VI-NEXT:    v_sub_u32_e32 v2, vcc, 32, v4
+; VI-NEXT:    v_lshlrev_b32_e32 v3, v2, v3
+; VI-NEXT:    v_ashrrev_i32_e32 v2, v2, v3
 ; VI-NEXT:    flat_store_dword v[0:1], v2
 ; VI-NEXT:    s_endpgm
   %id.x = tail call i32 @llvm.amdgcn.workitem.id.x()
diff --git a/llvm/test/CodeGen/AMDGPU/extract-lowbits.ll b/llvm/test/CodeGen/AMDGPU/extract-lowbits.ll
index 9677ec41ce268a..3d9616f02d52d1 100644
--- a/llvm/test/CodeGen/AMDGPU/extract-lowbits.ll
+++ b/llvm/test/CodeGen/AMDGPU/extract-lowbits.ll
@@ -150,11 +150,21 @@ define i32 @bzhi32_c4_commutative(i32 %val, i32 %numlowbits) nounwind {
 ; ---------------------------------------------------------------------------- ;
 
 define i32 @bzhi32_d0(i32 %val, i32 %numlowbits) nounwind {
-; GCN-LABEL: bzhi32_d0:
-; GCN:       ; %bb.0:
-; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GCN-NEXT:    v_bfe_u32 v0, v0, 0, v1
-; GCN-NEXT:    s_setpc_b64 s[30:31]
+; SI-LABEL: bzhi32_d0:
+; SI:       ; %bb.0:
+; SI-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; SI-NEXT:    v_sub_i32_e32 v1, vcc, 32, v1
+; SI-NEXT:    v_lshlrev_b32_e32 v0, v1, v0
+; SI-NEXT:    v_lshrrev_b32_e32 v0, v1, v0
+; SI-NEXT:    s_setpc_b64 s[30:31]
+;
+; VI-LABEL: bzhi32_d0:
+; VI:       ; %bb.0:
+; VI-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; VI-NEXT:    v_sub_u32_e32 v1, vcc, 32, v1
+; VI-NEXT:    v_lshlrev_b32_e32 v0, v1, v0
+; VI-NEXT:    v_lshrrev_b32_e32 v0, v1, v0
+; VI-NEXT:    s_setpc_b64 s[30:31]
   %numhighbits = sub i32 32, %numlowbits
   %highbitscleared = shl i32 %val, %numhighbits
   %masked = lshr i32 %highbitscleared, %numhighbits

arsenm · 2024-10-30T17:42:43Z

We may be able to keep the pattern for immediate y while y is within [0, 31].

Playing with alive I don't see how this is ever correct

changpeng · 2024-10-30T17:50:57Z

We may be able to keep the pattern for immediate y while y is within [0, 31].

Playing with alive I don't see how this is ever correct

"bfe x, 0, y" is to extract the lower y bits from x, and thus it is equivalent to x left shift (32 - y), then right shift (32 - y) --- to clear the upper 32 - y bits of x.

arsenm · 2024-10-30T17:57:43Z

We may be able to keep the pattern for immediate y while y is within [0, 31].

Playing with alive I don't see how this is ever correct

"bfe x, 0, y" is to extract the lower y bits from x, and thus it is equivalent to x left shift (32 - y), then right shift (32 - y) --- to clear the upper 32 - y bits of x.

I made a mistake with the pre-shift in the original testcase.

https://alive2.llvm.org/ce/z/nyk6tV

We could do a computeKnownBits based range check

arsenm · 2024-10-30T18:00:06Z

But yes, this form doesn't appear canonically. We are missing a fold here though:


define i32 @tgt(i32 %unscaled, i32 %size) {
  %src2.clamp.i = and i32 %size, 31
  %shl.src2.i = shl i32 1, %src2.clamp.i
  %rhs.i = sub i32 %shl.src2.i, 1
  %and.i = and i32 %unscaled, %rhs.i
  ret i32 %and.i
}

Produces:

; %bb.0:
	s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	v_and_b32_e32 v1, 31, v1
	v_bfe_u32 v0, v0, 0, v1
	s_setpc_b64 s[30:31]

We should be able to fold out the and, we know the instruction only reads the low 5 bits

arsenm · 2024-10-30T18:05:01Z

Filed #114282 for the missed optimization

changpeng · 2024-10-30T18:06:48Z

Right.

Filed #114282 for the missed optimization

Thanks. I am going to work on it.

…vm#114279) It is not correct to lower "x<<32-y>>32-y" to "bfe x, 0, y". When y equals to 32, the left-hand side is still x (unchanged), however, the right-hand side will be evaluated to 0. So it is not always correct to do such transformation. We may be able to keep the pattern for immediate y while y is within [0, 31]. However, the immediate operands of the sub (32 - y) are easily folded, and "(x << imm) >> imm" will be lowered to "and x, (2^(32-imm))-1" anyway. So no bfe matching is needed.

Enable pattern matching from "x<<32-y>>32-y" to "bfe x, 0, y" when we know y is in [0,31]. This is the follow-up for the PR: llvm#114279 to fix the issue: llvm#114282

Enable pattern matching from "x<<32-y>>32-y" to "bfe x, 0, y" when we know y is in [0,31]. This is the follow-up for the PR: #114279 to fix the issue: #114282

Enable pattern matching from "x<<32-y>>32-y" to "bfe x, 0, y" when we know y is in [0,31]. This is the follow-up for the PR: llvm#114279 to fix the issue: llvm#114282

changpeng requested a review from arsenm October 30, 2024 17:37

llvmbot added the backend:AMDGPU label Oct 30, 2024

changpeng requested a review from RKSimon October 30, 2024 17:37

changpeng requested a review from rampitec October 30, 2024 17:38

arsenm approved these changes Oct 30, 2024

View reviewed changes

changpeng merged commit ca1154d into llvm:main Oct 30, 2024
6 of 7 checks passed

changpeng mentioned this pull request Nov 7, 2024

AMDGPU: Don't avoid clamp of bit shift in BFE pattern #115372

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

AMDGPU: Disable pattern matching "x<<32-y>>32-y" to "bfe x, 0, y" #114279

AMDGPU: Disable pattern matching "x<<32-y>>32-y" to "bfe x, 0, y" #114279

Uh oh!

changpeng commented Oct 30, 2024 •

edited

Loading

Uh oh!

llvmbot commented Oct 30, 2024

Uh oh!

arsenm commented Oct 30, 2024

Uh oh!

changpeng commented Oct 30, 2024

Uh oh!

arsenm commented Oct 30, 2024

Uh oh!

arsenm commented Oct 30, 2024

Uh oh!

arsenm commented Oct 30, 2024

Uh oh!

changpeng commented Oct 30, 2024

Uh oh!

Uh oh!

Uh oh!

AMDGPU: Disable pattern matching "x<<32-y>>32-y" to "bfe x, 0, y" #114279

AMDGPU: Disable pattern matching "x<<32-y>>32-y" to "bfe x, 0, y" #114279

Uh oh!

Conversation

changpeng commented Oct 30, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

llvmbot commented Oct 30, 2024

Uh oh!

arsenm commented Oct 30, 2024

Uh oh!

changpeng commented Oct 30, 2024

Uh oh!

arsenm commented Oct 30, 2024

Uh oh!

arsenm commented Oct 30, 2024

Uh oh!

arsenm commented Oct 30, 2024

Uh oh!

changpeng commented Oct 30, 2024

Uh oh!

Uh oh!

Uh oh!

changpeng commented Oct 30, 2024 •

edited

Loading