Skip to content

AMDGPU: Disable pattern matching "x<<32-y>>32-y" to "bfe x, 0, y" #114279

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Oct 30, 2024

Conversation

changpeng
Copy link
Contributor

@changpeng changpeng commented Oct 30, 2024

It is not correct to lower "x<<32-y>>32-y" to "bfe x, 0, y". When y equals to 32, the left-hand side is still x (unchanged), however, the right-hand side will be evaluated to 0. So it is not always correct to do such transformation.

We may be able to keep the pattern for immediate y while y is within [0, 31]. However, the immediate operands of the sub (32 - y) are easily folded, and "(x << imm) >> imm" will be lowered to "and x, (2^(32-imm))-1" anyway. So no bfe matching is needed.

  It is not correct to lower "x<<32-y>>32-y" to "bfe x, 0, y".
When y equals to 32, the left-hand size is x (unchanged), however,
the right-hand side is 0.

  We may be able to keep the pattern for immediate y while y is within
[0, 31]. However, the immediate operands of sub (32 - y) are easily folded,
and  "x<<imm>>imm" will be lowered to "and x, (2^(32-imm))-1". So no bfe
matching is needed.
@llvmbot
Copy link
Member

llvmbot commented Oct 30, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Changpeng Fang (changpeng)

Changes

It is not correct to lower "x<<32-y>>32-y" to "bfe x, 0, y". When y equals to 32, the left-hand side is still x (unchanged), however, the right-hand side will be evaluated to 0. So it is not always correct to do such transformation.

We may be able to keep the pattern for immediate y while y is within [0, 31]. However, the immediate operands of the sub (32 - y) are easily folded, and "x<<imm>>imm" will be lowered to "and x, (2^(32-imm))-1" anyway. So no bfe matching is needed.


Full diff: https://github.com/llvm/llvm-project/pull/114279.diff

3 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIInstructions.td (-13)
  • (modified) llvm/test/CodeGen/AMDGPU/bfe-patterns.ll (+12-4)
  • (modified) llvm/test/CodeGen/AMDGPU/extract-lowbits.ll (+15-5)
diff --git a/llvm/lib/Target/AMDGPU/SIInstructions.td b/llvm/lib/Target/AMDGPU/SIInstructions.td
index faa0b6d6c3f506..c8a46217190a1d 100644
--- a/llvm/lib/Target/AMDGPU/SIInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SIInstructions.td
@@ -3553,19 +3553,6 @@ def : AMDGPUPat <
   (V_BFE_U32_e64 $src, (i32 0), $width)
 >;
 
-// x << (bitwidth - y) >> (bitwidth - y)
-def : AMDGPUPat <
-  (DivergentBinFrag<srl> (shl_oneuse i32:$src, (sub 32, i32:$width)),
-                         (sub 32, i32:$width)),
-  (V_BFE_U32_e64 $src, (i32 0), $width)
->;
-
-def : AMDGPUPat <
-  (DivergentBinFrag<sra> (shl_oneuse i32:$src, (sub 32, i32:$width)),
-                         (sub 32, i32:$width)),
-  (V_BFE_I32_e64 $src, (i32 0), $width)
->;
-
 // SHA-256 Ma patterns
 
 // ((x & z) | (y & (x | z))) -> BFI (XOR x, y), z, y
diff --git a/llvm/test/CodeGen/AMDGPU/bfe-patterns.ll b/llvm/test/CodeGen/AMDGPU/bfe-patterns.ll
index f54ea615ca6645..c57a35aa1880db 100644
--- a/llvm/test/CodeGen/AMDGPU/bfe-patterns.ll
+++ b/llvm/test/CodeGen/AMDGPU/bfe-patterns.ll
@@ -17,7 +17,9 @@ define amdgpu_kernel void @v_ubfe_sub_i32(ptr addrspace(1) %out, ptr addrspace(1
 ; SI-NEXT:    buffer_load_dword v3, v[0:1], s[4:7], 0 addr64 glc
 ; SI-NEXT:    s_waitcnt vmcnt(0)
 ; SI-NEXT:    s_mov_b64 s[2:3], s[6:7]
-; SI-NEXT:    v_bfe_u32 v2, v2, 0, v3
+; SI-NEXT:    v_sub_i32_e32 v3, vcc, 32, v3
+; SI-NEXT:    v_lshlrev_b32_e32 v2, v3, v2
+; SI-NEXT:    v_lshrrev_b32_e32 v2, v3, v2
 ; SI-NEXT:    buffer_store_dword v2, v[0:1], s[0:3], 0 addr64
 ; SI-NEXT:    s_endpgm
 ;
@@ -36,7 +38,9 @@ define amdgpu_kernel void @v_ubfe_sub_i32(ptr addrspace(1) %out, ptr addrspace(1
 ; VI-NEXT:    v_mov_b32_e32 v1, s1
 ; VI-NEXT:    v_add_u32_e32 v0, vcc, s0, v2
 ; VI-NEXT:    v_addc_u32_e32 v1, vcc, 0, v1, vcc
-; VI-NEXT:    v_bfe_u32 v2, v3, 0, v4
+; VI-NEXT:    v_sub_u32_e32 v2, vcc, 32, v4
+; VI-NEXT:    v_lshlrev_b32_e32 v3, v2, v3
+; VI-NEXT:    v_lshrrev_b32_e32 v2, v2, v3
 ; VI-NEXT:    flat_store_dword v[0:1], v2
 ; VI-NEXT:    s_endpgm
   %id.x = tail call i32 @llvm.amdgcn.workitem.id.x()
@@ -215,7 +219,9 @@ define amdgpu_kernel void @v_sbfe_sub_i32(ptr addrspace(1) %out, ptr addrspace(1
 ; SI-NEXT:    buffer_load_dword v3, v[0:1], s[4:7], 0 addr64 glc
 ; SI-NEXT:    s_waitcnt vmcnt(0)
 ; SI-NEXT:    s_mov_b64 s[2:3], s[6:7]
-; SI-NEXT:    v_bfe_i32 v2, v2, 0, v3
+; SI-NEXT:    v_sub_i32_e32 v3, vcc, 32, v3
+; SI-NEXT:    v_lshlrev_b32_e32 v2, v3, v2
+; SI-NEXT:    v_ashrrev_i32_e32 v2, v3, v2
 ; SI-NEXT:    buffer_store_dword v2, v[0:1], s[0:3], 0 addr64
 ; SI-NEXT:    s_endpgm
 ;
@@ -234,7 +240,9 @@ define amdgpu_kernel void @v_sbfe_sub_i32(ptr addrspace(1) %out, ptr addrspace(1
 ; VI-NEXT:    v_mov_b32_e32 v1, s1
 ; VI-NEXT:    v_add_u32_e32 v0, vcc, s0, v2
 ; VI-NEXT:    v_addc_u32_e32 v1, vcc, 0, v1, vcc
-; VI-NEXT:    v_bfe_i32 v2, v3, 0, v4
+; VI-NEXT:    v_sub_u32_e32 v2, vcc, 32, v4
+; VI-NEXT:    v_lshlrev_b32_e32 v3, v2, v3
+; VI-NEXT:    v_ashrrev_i32_e32 v2, v2, v3
 ; VI-NEXT:    flat_store_dword v[0:1], v2
 ; VI-NEXT:    s_endpgm
   %id.x = tail call i32 @llvm.amdgcn.workitem.id.x()
diff --git a/llvm/test/CodeGen/AMDGPU/extract-lowbits.ll b/llvm/test/CodeGen/AMDGPU/extract-lowbits.ll
index 9677ec41ce268a..3d9616f02d52d1 100644
--- a/llvm/test/CodeGen/AMDGPU/extract-lowbits.ll
+++ b/llvm/test/CodeGen/AMDGPU/extract-lowbits.ll
@@ -150,11 +150,21 @@ define i32 @bzhi32_c4_commutative(i32 %val, i32 %numlowbits) nounwind {
 ; ---------------------------------------------------------------------------- ;
 
 define i32 @bzhi32_d0(i32 %val, i32 %numlowbits) nounwind {
-; GCN-LABEL: bzhi32_d0:
-; GCN:       ; %bb.0:
-; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GCN-NEXT:    v_bfe_u32 v0, v0, 0, v1
-; GCN-NEXT:    s_setpc_b64 s[30:31]
+; SI-LABEL: bzhi32_d0:
+; SI:       ; %bb.0:
+; SI-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; SI-NEXT:    v_sub_i32_e32 v1, vcc, 32, v1
+; SI-NEXT:    v_lshlrev_b32_e32 v0, v1, v0
+; SI-NEXT:    v_lshrrev_b32_e32 v0, v1, v0
+; SI-NEXT:    s_setpc_b64 s[30:31]
+;
+; VI-LABEL: bzhi32_d0:
+; VI:       ; %bb.0:
+; VI-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; VI-NEXT:    v_sub_u32_e32 v1, vcc, 32, v1
+; VI-NEXT:    v_lshlrev_b32_e32 v0, v1, v0
+; VI-NEXT:    v_lshrrev_b32_e32 v0, v1, v0
+; VI-NEXT:    s_setpc_b64 s[30:31]
   %numhighbits = sub i32 32, %numlowbits
   %highbitscleared = shl i32 %val, %numhighbits
   %masked = lshr i32 %highbitscleared, %numhighbits

@changpeng changpeng requested a review from rampitec October 30, 2024 17:38
@arsenm
Copy link
Contributor

arsenm commented Oct 30, 2024

We may be able to keep the pattern for immediate y while y is within [0, 31].

Playing with alive I don't see how this is ever correct

@changpeng
Copy link
Contributor Author

We may be able to keep the pattern for immediate y while y is within [0, 31].

Playing with alive I don't see how this is ever correct

"bfe x, 0, y" is to extract the lower y bits from x, and thus it is equivalent to x left shift (32 - y), then right shift (32 - y) --- to clear the upper 32 - y bits of x.

@arsenm
Copy link
Contributor

arsenm commented Oct 30, 2024

We may be able to keep the pattern for immediate y while y is within [0, 31].

Playing with alive I don't see how this is ever correct

"bfe x, 0, y" is to extract the lower y bits from x, and thus it is equivalent to x left shift (32 - y), then right shift (32 - y) --- to clear the upper 32 - y bits of x.

I made a mistake with the pre-shift in the original testcase.

https://alive2.llvm.org/ce/z/nyk6tV

We could do a computeKnownBits based range check

@arsenm
Copy link
Contributor

arsenm commented Oct 30, 2024

But yes, this form doesn't appear canonically. We are missing a fold here though:


define i32 @tgt(i32 %unscaled, i32 %size) {
  %src2.clamp.i = and i32 %size, 31
  %shl.src2.i = shl i32 1, %src2.clamp.i
  %rhs.i = sub i32 %shl.src2.i, 1
  %and.i = and i32 %unscaled, %rhs.i
  ret i32 %and.i
}

Produces:

; %bb.0:
	s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	v_and_b32_e32 v1, 31, v1
	v_bfe_u32 v0, v0, 0, v1
	s_setpc_b64 s[30:31]

We should be able to fold out the and, we know the instruction only reads the low 5 bits

@arsenm
Copy link
Contributor

arsenm commented Oct 30, 2024

Filed #114282 for the missed optimization

@changpeng
Copy link
Contributor Author

Right.

Filed #114282 for the missed optimization

Thanks. I am going to work on it.

@changpeng changpeng merged commit ca1154d into llvm:main Oct 30, 2024
6 of 7 checks passed
smallp-o-p pushed a commit to smallp-o-p/llvm-project that referenced this pull request Nov 3, 2024
…vm#114279)

It is not correct to lower "x<<32-y>>32-y" to "bfe x, 0, y". When y
equals to 32, the left-hand side is still x (unchanged), however, the
right-hand side will be evaluated to 0. So it is not always correct to
do such transformation.

We may be able to keep the pattern for immediate y while y is within [0,
31]. However, the immediate operands of the sub (32 - y) are easily
folded, and "(x << imm) >> imm" will be lowered to "and x,
(2^(32-imm))-1" anyway. So no bfe matching is needed.
NoumanAmir657 pushed a commit to NoumanAmir657/llvm-project that referenced this pull request Nov 4, 2024
…vm#114279)

It is not correct to lower "x<<32-y>>32-y" to "bfe x, 0, y". When y
equals to 32, the left-hand side is still x (unchanged), however, the
right-hand side will be evaluated to 0. So it is not always correct to
do such transformation.

We may be able to keep the pattern for immediate y while y is within [0,
31]. However, the immediate operands of the sub (32 - y) are easily
folded, and "(x << imm) >> imm" will be lowered to "and x,
(2^(32-imm))-1" anyway. So no bfe matching is needed.
changpeng added a commit to changpeng/llvm-project that referenced this pull request Nov 7, 2024
  Enable pattern matching from "x<<32-y>>32-y" to "bfe x, 0, y"
when we know y is in [0,31].
  This is the follow-up for the PR: llvm#114279
to fix the issue: llvm#114282
changpeng added a commit that referenced this pull request Nov 7, 2024
Enable pattern matching from "x<<32-y>>32-y" to "bfe x, 0, y" when we
know y is in [0,31].
This is the follow-up for the PR:
#114279 to fix the issue:
#114282
Groverkss pushed a commit to iree-org/llvm-project that referenced this pull request Nov 15, 2024
Enable pattern matching from "x<<32-y>>32-y" to "bfe x, 0, y" when we
know y is in [0,31].
This is the follow-up for the PR:
llvm#114279 to fix the issue:
llvm#114282
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants