[AMDGPU] prevent shrinking udiv/urem if either operand is in (SignedMax,UnsignedMax] #116733

Merged
merged 1 commit into llvm:main from choikwa:udiv-neg on Nov 20, 2024

Conversation

@choikwa (Contributor) commented Nov 19, 2024

Do this by using computeKnownBits and checking !isNonNegative on each operand when the division is unsigned. This rejects shrinking an unsigned div/rem when either operand may exceed the signed max for the bit width: such a value may have its top bit set, so it has no known leading zero bits that shrinking could strip.
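A concrete illustration (the numbers here are ours, not from the patch): for udiv i64 -10, 10 the numerator is 0xFFFFFFFFFFFFFFF6 = 2^64 - 10, which lies in (SignedMax, UnsignedMax]. ComputeNumSignBits reports 60 sign bits for it (copies of the set top bit), so the old path would conclude the division fits in a much narrower type, yet the unsigned quotient (2^64 - 10) / 10 still needs 61 bits. Redundant sign-bit copies are only strippable for signed division; for udiv/urem only known leading zeros are.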

@llvmbot (Member) commented Nov 19, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: choikwa (choikwa)

Changes

Do this by using computeKnownBits and checking !isNonNegative on each operand when the division is unsigned. This rejects shrinking an unsigned div/rem when either operand may exceed the signed max for the bit width: such a value may have its top bit set, so it has no known leading zero bits that shrinking could strip.


Full diff: https://github.com/llvm/llvm-project/pull/116733.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp (+12)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll (+57)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp b/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
index c49aab823b44a4..bb9dbda200a524 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
@@ -1193,6 +1193,18 @@ int AMDGPUCodeGenPrepareImpl::getDivNumBits(BinaryOperator &I, Value *Num,
                                             Value *Den, unsigned AtLeast,
                                             bool IsSigned) const {
   const DataLayout &DL = Mod->getDataLayout();
+  if (!IsSigned) {
+    KnownBits Known = computeKnownBits(Num, DL, 0, AC, &I);
+    // We know all bits are used for division for Operand > smax_bitwidth
+    if (!Known.isNonNegative())
+      return -1;
+
+    Known = computeKnownBits(Den, DL, 0, AC, &I);
+    if (!Known.isNonNegative())
+      return -1;
+  }
+
+
   unsigned LHSSignBits = ComputeNumSignBits(Num, DL, 0, AC, &I);
   if (LHSSignBits < AtLeast)
     return -1;
diff --git a/llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll b/llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll
index b7436aeb1d5302..283a3524c2ff93 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll
@@ -9999,3 +9999,60 @@ define <2 x i64> @v_udiv_i64_exact(<2 x i64> %num) {
    %result = udiv exact <2 x i64> %num, <i64 4096, i64 1024>
    ret <2 x i64> %result
 }
+
+define i64 @udiv_i64_gt_smax(i8 %size) {
+; GFX6-LABEL: udiv_i64_gt_smax:
+; GFX6:       ; %bb.0:
+; GFX6-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX6-NEXT:    v_bfe_i32 v0, v0, 0, 8
+; GFX6-NEXT:    v_ashrrev_i32_e32 v1, 31, v0
+; GFX6-NEXT:    v_not_b32_e32 v1, v1
+; GFX6-NEXT:    v_not_b32_e32 v0, v0
+; GFX6-NEXT:    s_mov_b32 s4, 0xcccccccd
+; GFX6-NEXT:    v_mul_lo_u32 v3, v1, s4
+; GFX6-NEXT:    v_mul_hi_u32 v4, v0, s4
+; GFX6-NEXT:    s_mov_b32 s6, 0xcccccccc
+; GFX6-NEXT:    v_mul_hi_u32 v5, v1, s4
+; GFX6-NEXT:    v_mul_hi_u32 v2, v0, s6
+; GFX6-NEXT:    v_mul_lo_u32 v0, v0, s6
+; GFX6-NEXT:    v_add_i32_e32 v3, vcc, v3, v4
+; GFX6-NEXT:    v_addc_u32_e32 v4, vcc, 0, v5, vcc
+; GFX6-NEXT:    v_add_i32_e32 v0, vcc, v0, v3
+; GFX6-NEXT:    v_addc_u32_e32 v0, vcc, 0, v2, vcc
+; GFX6-NEXT:    v_mul_lo_u32 v2, v1, s6
+; GFX6-NEXT:    v_mul_hi_u32 v1, v1, s6
+; GFX6-NEXT:    v_add_i32_e32 v0, vcc, v4, v0
+; GFX6-NEXT:    v_addc_u32_e64 v3, s[4:5], 0, 0, vcc
+; GFX6-NEXT:    v_add_i32_e32 v0, vcc, v2, v0
+; GFX6-NEXT:    v_addc_u32_e32 v1, vcc, v1, v3, vcc
+; GFX6-NEXT:    v_alignbit_b32 v0, v1, v0, 3
+; GFX6-NEXT:    v_lshrrev_b32_e32 v1, 3, v1
+; GFX6-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX9-LABEL: udiv_i64_gt_smax:
+; GFX9:       ; %bb.0:
+; GFX9-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-NEXT:    v_mov_b32_e32 v1, 31
+; GFX9-NEXT:    v_not_b32_sdwa v4, sext(v0) dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0
+; GFX9-NEXT:    s_mov_b32 s4, 0xcccccccd
+; GFX9-NEXT:    v_ashrrev_i32_sdwa v1, v1, sext(v0) dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_mul_hi_u32 v0, v4, s4
+; GFX9-NEXT:    v_not_b32_e32 v5, v1
+; GFX9-NEXT:    v_mov_b32_e32 v1, 0
+; GFX9-NEXT:    s_mov_b32 s6, 0xcccccccc
+; GFX9-NEXT:    v_mad_u64_u32 v[2:3], s[4:5], v5, s4, v[0:1]
+; GFX9-NEXT:    v_mov_b32_e32 v6, v3
+; GFX9-NEXT:    v_mov_b32_e32 v3, v1
+; GFX9-NEXT:    v_mad_u64_u32 v[0:1], s[4:5], v4, s6, v[2:3]
+; GFX9-NEXT:    v_mov_b32_e32 v0, v1
+; GFX9-NEXT:    v_add_co_u32_e32 v0, vcc, v6, v0
+; GFX9-NEXT:    v_addc_co_u32_e64 v1, s[4:5], 0, 0, vcc
+; GFX9-NEXT:    v_mad_u64_u32 v[0:1], s[4:5], v5, s6, v[0:1]
+; GFX9-NEXT:    v_alignbit_b32 v0, v1, v0, 3
+; GFX9-NEXT:    v_lshrrev_b32_e32 v1, 3, v1
+; GFX9-NEXT:    s_setpc_b64 s[30:31]
+  %esize = sext i8 %size to i64
+  %minus = sub nuw nsw i64 -1, %esize
+  %div = udiv i64 %minus, 10
+  ret i64 %div
+}

github-actions bot commented Nov 19, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@changpeng (Contributor) left a comment


LGTM

@jayfoad (Contributor) left a comment


I think it would make more sense for this function to have separate paths for signed (based on computeNumSignBits) and unsigned (based on leading zeros, from computeKnownBits). The way it is written now, the unsigned case will potentially call both computeKnownBits and computeNumSignBits on both arguments, which seems like overkill.
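For illustration, a minimal sketch of that split, reusing the helper names already visible in the diff (ComputeNumSignBits, computeKnownBits, KnownBits::countMinLeadingZeros). This shows the suggested direction only, not the committed patch:

if (IsSigned) {
  // Signed path: redundant copies of the sign bit are shrinkable.
  unsigned RHSSignBits = ComputeNumSignBits(Den, DL, 0, AC, &I);
  if (RHSSignBits < AtLeast)
    return -1;
  unsigned LHSSignBits = ComputeNumSignBits(Num, DL, 0, AC, &I);
  if (LHSSignBits < AtLeast)
    return -1;
  unsigned SignBits = std::min(LHSSignBits, RHSSignBits);
  return Num->getType()->getScalarSizeInBits() - SignBits + 1;
}
// Unsigned path: known leading zeros play the role of sign bits. An operand
// in (SignedMax, UnsignedMax] may have its top bit set, so it has no known
// leading zeros and fails the AtLeast test naturally; one computeKnownBits
// query per operand suffices.
KnownBits Known = computeKnownBits(Den, DL, 0, AC, &I);
unsigned RHSZeros = Known.countMinLeadingZeros();
if (RHSZeros < AtLeast)
  return -1;
Known = computeKnownBits(Num, DL, 0, AC, &I);
unsigned LHSZeros = Known.countMinLeadingZeros();
if (LHSZeros < AtLeast)
  return -1;
unsigned ZeroBits = std::min(LHSZeros, RHSZeros);
return Num->getType()->getScalarSizeInBits() - ZeroBits + 1;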

@choikwa (Contributor, Author) commented Nov 19, 2024

Separated into unsigned/signed paths in getDivNumBits and added/verified a positive test case for shrinking.

@choikwa force-pushed the udiv-neg branch 2 times, most recently from 3565d13 to 5ab2888 on November 19, 2024 at 16:44
… max

Handle signed and unsigned path differently in getDivNumBits.
Using computeKnownBits, this rejects shrinking unsigned div/rem if
operands exceed signed max since we know NumSignBits will be always 0.
@choikwa changed the title from "[AMDGPU] prevent shrinking udiv/urem if operands exceed smax_bitwidth" to "[AMDGPU] prevent shrinking udiv/urem if either operand is in (SignedMax,UnsignedMax]" on Nov 19, 2024
@choikwa merged commit b8e1d4d into llvm:main on Nov 20, 2024 (6 of 8 checks passed)
@jplehr (Contributor) commented Nov 20, 2024

I believe this broke one of our bots: https://lab.llvm.org/buildbot/#/builders/73/builds/8871

Sorry for the delay in reporting; I'm currently traveling.
@jhuber6

@jhuber6 (Contributor) commented Nov 20, 2024

Definitely seems like it. The LLVM libc tests tend to trigger a lot of different edge cases because it compiles a lot of non-standard GPU code.

@shiltian (Contributor) commented

> Definitely seems like it. The LLVM libc tests tend to trigger a lot of different edge cases because it compiles a lot of non-standard GPU code.

👍 which is really nice to have :-)

@jhuber6 (Contributor) commented Nov 20, 2024

I'll try to do a local test and confirm if it's this patch and request a revert if so. If anyone else wants to try you should just need to add the options documented in https://libc.llvm.org/gpu/building.html and do ninja check-libc-amdgcn-amd-amdhsa.

@changpeng (Contributor) commented

> I'll try to do a local test and confirm if it's this patch and request a revert if so. If anyone else wants to try you should just need to add the options documented in https://libc.llvm.org/gpu/building.html and do ninja check-libc-amdgcn-amd-amdhsa.

Could it possibly be a smoke-tester error? I think this change just results in less optimized udiv/urem.

@jhuber6 (Contributor) commented Nov 20, 2024

> Could it possibly be a smoke-tester error? I think this change just results in less optimized udiv/urem.

I'd guess not, because the failures are consistent and started just after this PR. I'll confirm later once I finish something up.

@jhuber6 (Contributor) commented Nov 21, 2024

Failures go away when I revert this patch locally, so I'm going to revert.

jhuber6 added a commit that referenced this pull request Nov 21, 2024
…(SignedMax,UnsignedMax] (#116733)"

This reverts commit b8e1d4d.

Causes failures on the `libc` test suite https://lab.llvm.org/buildbot/#/builders/73/builds/8871
Comment on lines +1201 to +1203
unsigned RHSSignBits = ComputeNumSignBits(Den, DL, 0, AC, &I);
if (RHSSignBits < AtLeast)
return -1;
Contributor:

Should try RHS first as it's canonically simpler

return -1;
unsigned LHSSignBits = Known.countMinLeadingZeros();

Known = computeKnownBits(Den, DL, 0, AC, &I);
Contributor:

Try RHS first

unsigned SignBits = std::min(LHSSignBits, RHSSignBits);
unsigned DivBits = Num->getType()->getScalarSizeInBits() - SignBits;
return DivBits + 1;
} else {
Contributor:

No else after return

Comment on lines +1210 to +1213
// We know all bits are used for division for Num or Den in range
// (SignedMax, UnsignedMax]
if (Known.isNegative() || !Known.isNonNegative())
return -1;
Contributor:

Is this reproducing the logic of computeKnownBits for division? Can you just do KnownLHS.udiv/urem(KnownRHS)?

Contributor Author:

Unfortunately we need the KnownBits of the operands for shrinking; KnownBits::udiv/urem would only give us the known bits of the final output.
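For reference, a sketch of the distinction (KnownBits::udiv and KnownBits::urem are real static helpers on llvm::KnownBits; the variable names here are ours):

KnownBits KnownNum = computeKnownBits(Num, DL, 0, AC, &I);
KnownBits KnownDen = computeKnownBits(Den, DL, 0, AC, &I);
// Folds operand facts into facts about the quotient itself...
KnownBits KnownQuot = KnownBits::udiv(KnownNum, KnownDen);
// ...whereas shrinking needs the operands' own leading zeros to choose a
// narrower width that both inputs still fit into.
unsigned ZeroBits = std::min(KnownNum.countMinLeadingZeros(),
                             KnownDen.countMinLeadingZeros());
unsigned DivBits = Num->getType()->getScalarSizeInBits() - ZeroBits;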

@choikwa (Contributor, Author) commented Nov 21, 2024

Going to try using computeKnownBits for both the signed and unsigned cases, for consistency and wider available information. Will also look at the libc failure.

choikwa added a commit to choikwa/llvm-project that referenced this pull request Dec 6, 2024
… (SignedMax,UnsignedMax] (llvm#116733)"

This reverts commit 905e831.

Handle signed and unsigned path differently in getDivNumBits.
Using computeKnownBits, this rejects shrinking unsigned div/rem if
operands exceed signed max since we know NumSignBits will be always 0.
choikwa added a commit that referenced this pull request Dec 7, 2024
… (SignedMax,UnsignedMax] (#116733)" (#118928)

This reverts commit 905e831.

Handle signed and unsigned path differently in getDivNumBits. Using
computeKnownBits, this rejects shrinking unsigned div/rem if operands
exceed signed max since we know NumSignBits will be always 0.

Rebased re-attempt after the first one was reverted due to an unrelated failure in libc (should be fixed by now, I'm told).