[AMDGPU] prevent shrinking udiv/urem if either operand is in (SignedMax,UnsignedMax] #116733
Conversation
@llvm/pr-subscribers-backend-amdgpu
Author: choikwa (choikwa)
Changes: Do this by using ComputeKnownBits and checking for !isNonNegative and isUnsigned. This rejects shrinking unsigned div/rem if operands exceed smax_bitwidth, since we know NumSignBits will always be 0.
Full diff: https://github.com/llvm/llvm-project/pull/116733.diff
2 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp b/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
index c49aab823b44a4..bb9dbda200a524 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
@@ -1193,6 +1193,18 @@ int AMDGPUCodeGenPrepareImpl::getDivNumBits(BinaryOperator &I, Value *Num,
                                             Value *Den, unsigned AtLeast,
                                             bool IsSigned) const {
   const DataLayout &DL = Mod->getDataLayout();
+  if (!IsSigned) {
+    KnownBits Known = computeKnownBits(Num, DL, 0, AC, &I);
+    // We know all bits are used for division for Operand > smax_bitwidth
+    if (!Known.isNonNegative())
+      return -1;
+
+    Known = computeKnownBits(Den, DL, 0, AC, &I);
+    if (!Known.isNonNegative())
+      return -1;
+  }
+
+
   unsigned LHSSignBits = ComputeNumSignBits(Num, DL, 0, AC, &I);
   if (LHSSignBits < AtLeast)
     return -1;
diff --git a/llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll b/llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll
index b7436aeb1d5302..283a3524c2ff93 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll
@@ -9999,3 +9999,60 @@ define <2 x i64> @v_udiv_i64_exact(<2 x i64> %num) {
%result = udiv exact <2 x i64> %num, <i64 4096, i64 1024>
ret <2 x i64> %result
}
+
+define i64 @udiv_i64_gt_smax(i8 %size) {
+; GFX6-LABEL: udiv_i64_gt_smax:
+; GFX6: ; %bb.0:
+; GFX6-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX6-NEXT: v_bfe_i32 v0, v0, 0, 8
+; GFX6-NEXT: v_ashrrev_i32_e32 v1, 31, v0
+; GFX6-NEXT: v_not_b32_e32 v1, v1
+; GFX6-NEXT: v_not_b32_e32 v0, v0
+; GFX6-NEXT: s_mov_b32 s4, 0xcccccccd
+; GFX6-NEXT: v_mul_lo_u32 v3, v1, s4
+; GFX6-NEXT: v_mul_hi_u32 v4, v0, s4
+; GFX6-NEXT: s_mov_b32 s6, 0xcccccccc
+; GFX6-NEXT: v_mul_hi_u32 v5, v1, s4
+; GFX6-NEXT: v_mul_hi_u32 v2, v0, s6
+; GFX6-NEXT: v_mul_lo_u32 v0, v0, s6
+; GFX6-NEXT: v_add_i32_e32 v3, vcc, v3, v4
+; GFX6-NEXT: v_addc_u32_e32 v4, vcc, 0, v5, vcc
+; GFX6-NEXT: v_add_i32_e32 v0, vcc, v0, v3
+; GFX6-NEXT: v_addc_u32_e32 v0, vcc, 0, v2, vcc
+; GFX6-NEXT: v_mul_lo_u32 v2, v1, s6
+; GFX6-NEXT: v_mul_hi_u32 v1, v1, s6
+; GFX6-NEXT: v_add_i32_e32 v0, vcc, v4, v0
+; GFX6-NEXT: v_addc_u32_e64 v3, s[4:5], 0, 0, vcc
+; GFX6-NEXT: v_add_i32_e32 v0, vcc, v2, v0
+; GFX6-NEXT: v_addc_u32_e32 v1, vcc, v1, v3, vcc
+; GFX6-NEXT: v_alignbit_b32 v0, v1, v0, 3
+; GFX6-NEXT: v_lshrrev_b32_e32 v1, 3, v1
+; GFX6-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX9-LABEL: udiv_i64_gt_smax:
+; GFX9: ; %bb.0:
+; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-NEXT: v_mov_b32_e32 v1, 31
+; GFX9-NEXT: v_not_b32_sdwa v4, sext(v0) dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0
+; GFX9-NEXT: s_mov_b32 s4, 0xcccccccd
+; GFX9-NEXT: v_ashrrev_i32_sdwa v1, v1, sext(v0) dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT: v_mul_hi_u32 v0, v4, s4
+; GFX9-NEXT: v_not_b32_e32 v5, v1
+; GFX9-NEXT: v_mov_b32_e32 v1, 0
+; GFX9-NEXT: s_mov_b32 s6, 0xcccccccc
+; GFX9-NEXT: v_mad_u64_u32 v[2:3], s[4:5], v5, s4, v[0:1]
+; GFX9-NEXT: v_mov_b32_e32 v6, v3
+; GFX9-NEXT: v_mov_b32_e32 v3, v1
+; GFX9-NEXT: v_mad_u64_u32 v[0:1], s[4:5], v4, s6, v[2:3]
+; GFX9-NEXT: v_mov_b32_e32 v0, v1
+; GFX9-NEXT: v_add_co_u32_e32 v0, vcc, v6, v0
+; GFX9-NEXT: v_addc_co_u32_e64 v1, s[4:5], 0, 0, vcc
+; GFX9-NEXT: v_mad_u64_u32 v[0:1], s[4:5], v5, s6, v[0:1]
+; GFX9-NEXT: v_alignbit_b32 v0, v1, v0, 3
+; GFX9-NEXT: v_lshrrev_b32_e32 v1, 3, v1
+; GFX9-NEXT: s_setpc_b64 s[30:31]
+ %esize = sext i8 %size to i64
+ %minus = sub nuw nsw i64 -1, %esize
+ %div = udiv i64 %minus, 10
+ ret i64 %div
+}
✅ With the latest revision this PR passed the C/C++ code formatter.
LGTM
I think it would make more sense for this function to have separate paths for signed (based on computeNumSignBits) and unsigned (based on leading zeros, from computeKnownBits). The way it is written now, the unsigned case will potentially call both computeKnownBits and computeNumSignBits on both arguments, which seems like overkill.
Separated into unsigned/signed paths in getDivNumBits and added/verified a positive test case for shrinking.
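For reference, a minimal sketch of what the separated paths could look like, assembled from the code fragments quoted later in this thread with the review suggestions applied (RHS checked first, no else after return). It is an illustration, not necessarily the exact committed diff; in particular, skipping the `AtLeast` check on the unsigned path is an assumption:

```cpp
// Sketch for AMDGPUCodeGenPrepare.cpp; relies on llvm/Analysis/ValueTracking.h
// (computeKnownBits, ComputeNumSignBits) and llvm/Support/KnownBits.h.
int AMDGPUCodeGenPrepareImpl::getDivNumBits(BinaryOperator &I, Value *Num,
                                            Value *Den, unsigned AtLeast,
                                            bool IsSigned) const {
  const DataLayout &DL = Mod->getDataLayout();
  if (IsSigned) {
    // Signed path: redundant sign bits bound how far the divide can shrink.
    unsigned RHSSignBits = ComputeNumSignBits(Den, DL, 0, AC, &I);
    if (RHSSignBits < AtLeast)
      return -1;

    unsigned LHSSignBits = ComputeNumSignBits(Num, DL, 0, AC, &I);
    if (LHSSignBits < AtLeast)
      return -1;

    unsigned SignBits = std::min(LHSSignBits, RHSSignBits);
    unsigned DivBits = Num->getType()->getScalarSizeInBits() - SignBits;
    return DivBits + 1; // Keep one bit for the sign in the narrow type.
  }

  // Unsigned path: known leading zeros play the role of sign bits. If an
  // operand may lie in (SignedMax, UnsignedMax], its top bit is not known
  // zero and nothing can be shrunk away, so reject outright.
  KnownBits Known = computeKnownBits(Den, DL, 0, AC, &I);
  if (Known.isNegative() || !Known.isNonNegative())
    return -1;
  unsigned RHSSignBits = Known.countMinLeadingZeros();

  Known = computeKnownBits(Num, DL, 0, AC, &I);
  if (Known.isNegative() || !Known.isNonNegative())
    return -1;
  unsigned LHSSignBits = Known.countMinLeadingZeros();

  unsigned SignBits = std::min(LHSSignBits, RHSSignBits);
  unsigned DivBits = Num->getType()->getScalarSizeInBits() - SignBits;
  return DivBits;
}
```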
Force-pushed from 3565d13 to 5ab2888.
… max Handle the signed and unsigned paths differently in getDivNumBits. Using computeKnownBits, this rejects shrinking unsigned div/rem if operands exceed the signed max, since we know NumSignBits will always be 0.
I believe this broke one of our bots: https://lab.llvm.org/buildbot/#/builders/73/builds/8871 Sorry for the delay in reporting; I'm currently traveling.
Definitely seems like it. The LLVM libc tests tend to trigger a lot of different edge cases because they compile a lot of non-standard GPU code.
👍 which is really nice to have :-)
I'll try to do a local test to confirm whether it's this patch, and request a revert if so. If anyone else wants to try, you should just need to add the options documented in https://libc.llvm.org/gpu/building.html and do
Could it possibly be a smoke-tester error? I think this change just results in less optimized udiv/urem.
I'd guess not, because the failures are consistent and started just after this PR. I'll confirm later once I finish up something.
Failures go away when I revert this patch locally, so I'm going to revert.
…(SignedMax,UnsignedMax] (#116733)" This reverts commit b8e1d4d. Causes failures on the `libc` test suite https://lab.llvm.org/buildbot/#/builders/73/builds/8871
    unsigned RHSSignBits = ComputeNumSignBits(Den, DL, 0, AC, &I);
    if (RHSSignBits < AtLeast)
      return -1;
Should try RHS first as it's canonically simpler
      return -1;
    unsigned LHSSignBits = Known.countMinLeadingZeros();

    Known = computeKnownBits(Den, DL, 0, AC, &I);
Try RHS first
    unsigned SignBits = std::min(LHSSignBits, RHSSignBits);
    unsigned DivBits = Num->getType()->getScalarSizeInBits() - SignBits;
    return DivBits + 1;
  } else {
No else after return
    // We know all bits are used for division for Num or Den in range
    // (SignedMax, UnsignedMax]
    if (Known.isNegative() || !Known.isNonNegative())
      return -1;
Is this reproducing the logic of computeKnownBits for division? Can you just do KnownLHS.udiv/urem(KnownRHS)?
Unfortunately we need the KnownBits of the operands for shrinking, as div/rem would only give us the known bits of the final output.
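To make the distinction concrete, a small sketch using the same helpers (`KnownBits::udiv` is the static helper in llvm/Support/KnownBits.h the question refers to; the local names here are hypothetical):

```cpp
// Result-side query: KnownBits::udiv bounds the bits of the *quotient*.
// That says nothing about whether Num and Den themselves fit in the
// narrow (e.g. 24-bit) type, which is what shrinking requires.
KnownBits KnownNum = computeKnownBits(Num, DL, 0, AC, &I);
KnownBits KnownDen = computeKnownBits(Den, DL, 0, AC, &I);
KnownBits KnownQuot = KnownBits::udiv(KnownNum, KnownDen); // quotient only

// Operand-side query: both operands must be representable in the narrow
// type, so their minimum known leading zeros are what gate the shrink.
unsigned Width = Num->getType()->getScalarSizeInBits();
unsigned NeededBits =
    Width - std::min(KnownNum.countMinLeadingZeros(),
                     KnownDen.countMinLeadingZeros());
```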
Going to try to use computeKnownBits for both the signed and unsigned cases, for consistency and wider available information. Will also look at the libc failure.
… (SignedMax,UnsignedMax] (llvm#116733)" This reverts commit 905e831. Handle the signed and unsigned paths differently in getDivNumBits. Using computeKnownBits, this rejects shrinking unsigned div/rem if operands exceed the signed max, since we know NumSignBits will always be 0.
#118928) … (SignedMax,UnsignedMax] (#116733)" This reverts commit 905e831. Handle the signed and unsigned paths differently in getDivNumBits. Using computeKnownBits, this rejects shrinking unsigned div/rem if operands exceed the signed max, since we know NumSignBits will always be 0. Rebased and re-attempted after the first attempt was reverted due to an unrelated failure in libc (which should be fixed by now, I'm told).