-
Notifications
You must be signed in to change notification settings - Fork 14.3k
[AMDGPU] Add InstCombine rule for ballot.i64 intrinsic in wave32 mode. #71556
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@llvm/pr-subscribers-clang @llvm/pr-subscribers-backend-amdgpu Author: Valery Pykhtin (vpykhtin) ChangesCurrently ballot intrinsic is lowered to AMDGPUISD::SETCC using bit width of the lowered intrinsic mnemonic. Instead use the bit width of the current wave mode converted to the required bit width. This is similar to ICMP/FCMP lowering. I haven't added tests for ballot.i32 in wave64 mode because I'm not sure this should be a compilation error? Full diff: https://github.com/llvm/llvm-project/pull/71556.diff 2 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index cff7e4bc66218ca..0b1a8bf0861720b 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -5586,10 +5586,14 @@ static SDValue lowerBALLOTIntrinsic(const SITargetLowering &TLI, SDNode *N,
SDValue Src = N->getOperand(1);
SDLoc SL(N);
+ unsigned WavefrontSize = TLI.getSubtarget()->getWavefrontSize();
+ EVT CCVT = EVT::getIntegerVT(*DAG.getContext(), WavefrontSize);
+
if (Src.getOpcode() == ISD::SETCC) {
// (ballot (ISD::SETCC ...)) -> (AMDGPUISD::SETCC ...)
- return DAG.getNode(AMDGPUISD::SETCC, SL, VT, Src.getOperand(0),
- Src.getOperand(1), Src.getOperand(2));
+ SDValue SetCC = DAG.getNode(AMDGPUISD::SETCC, SL, CCVT, Src.getOperand(0),
+ Src.getOperand(1), Src.getOperand(2));
+ return VT.bitsEq(CCVT) ? SetCC : DAG.getZExtOrTrunc(SetCC, SL, VT);
}
if (const ConstantSDNode *Arg = dyn_cast<ConstantSDNode>(Src)) {
// (ballot 0) -> 0
@@ -5612,9 +5616,10 @@ static SDValue lowerBALLOTIntrinsic(const SITargetLowering &TLI, SDNode *N,
// (ballot (i1 $src)) -> (AMDGPUISD::SETCC (i32 (zext $src)) (i32 0)
// ISD::SETNE)
- return DAG.getNode(
- AMDGPUISD::SETCC, SL, VT, DAG.getZExtOrTrunc(Src, SL, MVT::i32),
+ SDValue SetCC = DAG.getNode(
+ AMDGPUISD::SETCC, SL, CCVT, DAG.getZExtOrTrunc(Src, SL, MVT::i32),
DAG.getConstant(0, SL, MVT::i32), DAG.getCondCode(ISD::SETNE));
+ return VT.bitsEq(CCVT) ? SetCC : DAG.getZExtOrTrunc(SetCC, SL, VT);
}
void SITargetLowering::ReplaceNodeResults(SDNode *N,
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ballot.i32.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ballot.i32.ll
index 3337d053eb930b9..563899b6bcce117 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ballot.i32.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ballot.i32.ll
@@ -202,10 +202,8 @@ false:
define amdgpu_cs i32 @branch_divergent_ballot64_ne_zero_compare(i32 %v) {
; CHECK-LABEL: branch_divergent_ballot64_ne_zero_compare:
; CHECK: ; %bb.0:
-; CHECK-NEXT: v_cmp_gt_u32_e64 s0, 12, v0
-; CHECK-NEXT: s_mov_b32 s1, 0
-; CHECK-NEXT: s_cmp_eq_u64 s[0:1], 0
-; CHECK-NEXT: s_cbranch_scc1 .LBB12_2
+; CHECK-NEXT: v_cmp_gt_u32_e32 vcc_lo, 12, v0
+; CHECK-NEXT: s_cbranch_vccz .LBB12_2
; CHECK-NEXT: ; %bb.1: ; %true
; CHECK-NEXT: s_mov_b32 s0, 42
; CHECK-NEXT: s_branch .LBB12_3
@@ -320,12 +318,8 @@ define amdgpu_cs i32 @branch_divergent_ballot64_ne_zero_and(i32 %v1, i32 %v2) {
; CHECK: ; %bb.0:
; CHECK-NEXT: v_cmp_gt_u32_e32 vcc_lo, 12, v0
; CHECK-NEXT: v_cmp_lt_u32_e64 s0, 34, v1
-; CHECK-NEXT: s_mov_b32 s1, 0
-; CHECK-NEXT: s_and_b32 s0, vcc_lo, s0
-; CHECK-NEXT: v_cndmask_b32_e64 v0, 0, 1, s0
-; CHECK-NEXT: v_cmp_ne_u32_e64 s0, 0, v0
-; CHECK-NEXT: s_cmp_eq_u64 s[0:1], 0
-; CHECK-NEXT: s_cbranch_scc1 .LBB17_2
+; CHECK-NEXT: s_and_b32 vcc_lo, vcc_lo, s0
+; CHECK-NEXT: s_cbranch_vccz .LBB17_2
; CHECK-NEXT: ; %bb.1: ; %true
; CHECK-NEXT: s_mov_b32 s0, 42
; CHECK-NEXT: s_branch .LBB17_3
|
I would expect it to fail to select if the type does not match the wave size. Is there a real use case for it? |
I would like to use ballot.i64 in both wave modes because I don't want to select bit width in AddressSanitizer code, it's relatively high in the pass chain and wave mode isn't readily available there. |
It's a synthetic operation to begin with and has a pretty clear meaning, just 0 the high bits. It's still valid to read exec/exec_hi in wave32 mode, it's just the high bits are always 0. This mirrors the same behavior (such that we are now using this to implement the read_exec builtins) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this just about code quality? This probably isn't necessary if you consider the pipeline as a whole. I believe InstCombine is already reducing i64 ballot to i32 based on the subtarget's wave size
Right, but InstCombine is a bit late. This is needed after #68714, otherwise I need special processing of i64 mask in wave32 mode there. I've implemented the TODO with 51578ea. |
Ping. |
I'm confused, InstCombine can't be too late, it runs many times after all IR producers. There's no practical path between the sanitizer passes and the backend where it won't run |
I'm sorry, I thought it's machine IR pass. I don't see wavesize type correction for ballot in GCNTTIImpl::instCombineIntrinsic, do you beleive it's better to add it there? |
Oh yes, it is missing. I recently added it for amdgcn_mbcnt_hi above. We should do the same for ballot |
51578ea
to
4cf6ac7
Compare
Reimplemented using instcombine rule. I had to move i64/wave32 tests to the existing dedicated test. Please see the differences for individual commits in this branch to follow the changes made. |
5543d14
to
e97e128
Compare
Removed unrelated changes. |
// TODO: make condition below an assert after fixing ballot bitwidth. | ||
VCMP.getValueType().getSizeInBits() == ST->getWavefrontSize()) { | ||
if ((CC == ISD::SETEQ || CC == ISD::SETNE) && CRHS && CRHS->isZero()) { | ||
assert(VCMP.getValueType().getSizeInBits() == ST->getWavefrontSize()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are asserting that instcombine has been run? That seems wrong. What about -O0 compiles?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. It looks like I need to perform the transformation in both places - instcombiner and SelectionDAG combiner if I want to keep ballot folding on -O0 but I'm not sure this is really needed.
Also I need to keep original test running without opt and just add runs with opt.
e97e128
to
526c635
Compare
|
CallInst *NewCall = IC.Builder.CreateCall(NewF, {II.getArgOperand(0)}); | ||
Value *CastedCall = IC.Builder.CreateZExtOrBitCast(NewCall, II.getType()); | ||
CastedCall->takeName(&II); | ||
return IC.replaceInstUsesWith(II, CastedCall); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Can just reuse the same value, e.g.
Value *Call = IC.Builder.CreateCall(NewF, {II.getArgOperand(0)});
Call = IC.Builder.CreateZExtOrBitCast(NewCall, II.getType());
Call->takeName(&II);
return IC.replaceInstUsesWith(II, Call);
I also think you should be able to use something like IC.Builder.CreateIntrinsic
? There should be a function to create an intrinsic call directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, changed to IC.Builder.CreateIntrinsic
. I decided to use it as a temp for CreateZExtOrBitCast
.
I'm going to rebase it on top of #73779 to show test changes. |
This is ready to be submitted if there're no objections. |
0ea1052
to
b6204d3
Compare
Gentle ping. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Description should be adjusted, this isn't really changing the selection anymore
b6204d3
to
871f5ec
Compare
@vpykhtin the test you modified builtins-amdgcn-wave32.cl seems to be failing on a few bots. Can you take a look? https://lab.llvm.org/buildbot/#/builders/188/builds/40541 |
Sure, I'm going to revert and resubmit with the tests updated. |
…ve32 mode." (llvm#80303) Reapply llvm#71556 with added lit test constraint: `REQUIRES: amdgpu-registered-target`. This reverts commit 9791e54.
…ve32 mode." (llvm#80303) Reapply llvm#71556 with added lit test constraint: `REQUIRES: amdgpu-registered-target`. This reverts commit 9791e54. (cherry picked from commit b8025d1) Change-Id: I03aafda08ca433456f6e82accd6b702a307bfd0b
…ve32 mode." (llvm#80303) Reapply llvm#71556 with added lit test constraint: `REQUIRES: amdgpu-registered-target`. This reverts commit 9791e54. (cherry picked from commit b8025d1) Change-Id: I03aafda08ca433456f6e82accd6b702a307bfd0b
Substitute with zero-extended to i64 ballot.i32 intrinsic.