Skip to content

[AMDGPU] Add InstCombine rule for ballot.i64 intrinsic in wave32 mode. #71556

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jan 17, 2024

Conversation

vpykhtin
Copy link
Contributor

@vpykhtin vpykhtin commented Nov 7, 2023

Substitute with zero-extended to i64 ballot.i32 intrinsic.

@llvmbot
Copy link
Member

llvmbot commented Nov 7, 2023

@llvm/pr-subscribers-clang
@llvm/pr-subscribers-llvm-transforms
@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-amdgpu

Author: Valery Pykhtin (vpykhtin)

Changes

Currently ballot intrinsic is lowered to AMDGPUISD::SETCC using bit width of the lowered intrinsic mnemonic. Instead use the bit width of the current wave mode converted to the required bit width. This is similar to ICMP/FCMP lowering.

I haven't added tests for ballot.i32 in wave64 mode because I'm not sure this should be a compilation error?


Full diff: https://github.com/llvm/llvm-project/pull/71556.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+9-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ballot.i32.ll (+4-10)
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index cff7e4bc66218ca..0b1a8bf0861720b 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -5586,10 +5586,14 @@ static SDValue lowerBALLOTIntrinsic(const SITargetLowering &TLI, SDNode *N,
   SDValue Src = N->getOperand(1);
   SDLoc SL(N);
 
+  unsigned WavefrontSize = TLI.getSubtarget()->getWavefrontSize();
+  EVT CCVT = EVT::getIntegerVT(*DAG.getContext(), WavefrontSize);
+
   if (Src.getOpcode() == ISD::SETCC) {
     // (ballot (ISD::SETCC ...)) -> (AMDGPUISD::SETCC ...)
-    return DAG.getNode(AMDGPUISD::SETCC, SL, VT, Src.getOperand(0),
-                       Src.getOperand(1), Src.getOperand(2));
+    SDValue SetCC = DAG.getNode(AMDGPUISD::SETCC, SL, CCVT, Src.getOperand(0),
+                                Src.getOperand(1), Src.getOperand(2));
+    return VT.bitsEq(CCVT) ? SetCC : DAG.getZExtOrTrunc(SetCC, SL, VT);
   }
   if (const ConstantSDNode *Arg = dyn_cast<ConstantSDNode>(Src)) {
     // (ballot 0) -> 0
@@ -5612,9 +5616,10 @@ static SDValue lowerBALLOTIntrinsic(const SITargetLowering &TLI, SDNode *N,
 
   // (ballot (i1 $src)) -> (AMDGPUISD::SETCC (i32 (zext $src)) (i32 0)
   // ISD::SETNE)
-  return DAG.getNode(
-      AMDGPUISD::SETCC, SL, VT, DAG.getZExtOrTrunc(Src, SL, MVT::i32),
+  SDValue SetCC = DAG.getNode(
+      AMDGPUISD::SETCC, SL, CCVT, DAG.getZExtOrTrunc(Src, SL, MVT::i32),
       DAG.getConstant(0, SL, MVT::i32), DAG.getCondCode(ISD::SETNE));
+  return VT.bitsEq(CCVT) ? SetCC : DAG.getZExtOrTrunc(SetCC, SL, VT);
 }
 
 void SITargetLowering::ReplaceNodeResults(SDNode *N,
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ballot.i32.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ballot.i32.ll
index 3337d053eb930b9..563899b6bcce117 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ballot.i32.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ballot.i32.ll
@@ -202,10 +202,8 @@ false:
 define amdgpu_cs i32 @branch_divergent_ballot64_ne_zero_compare(i32 %v) {
 ; CHECK-LABEL: branch_divergent_ballot64_ne_zero_compare:
 ; CHECK:       ; %bb.0:
-; CHECK-NEXT:    v_cmp_gt_u32_e64 s0, 12, v0
-; CHECK-NEXT:    s_mov_b32 s1, 0
-; CHECK-NEXT:    s_cmp_eq_u64 s[0:1], 0
-; CHECK-NEXT:    s_cbranch_scc1 .LBB12_2
+; CHECK-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 12, v0
+; CHECK-NEXT:    s_cbranch_vccz .LBB12_2
 ; CHECK-NEXT:  ; %bb.1: ; %true
 ; CHECK-NEXT:    s_mov_b32 s0, 42
 ; CHECK-NEXT:    s_branch .LBB12_3
@@ -320,12 +318,8 @@ define amdgpu_cs i32 @branch_divergent_ballot64_ne_zero_and(i32 %v1, i32 %v2) {
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 12, v0
 ; CHECK-NEXT:    v_cmp_lt_u32_e64 s0, 34, v1
-; CHECK-NEXT:    s_mov_b32 s1, 0
-; CHECK-NEXT:    s_and_b32 s0, vcc_lo, s0
-; CHECK-NEXT:    v_cndmask_b32_e64 v0, 0, 1, s0
-; CHECK-NEXT:    v_cmp_ne_u32_e64 s0, 0, v0
-; CHECK-NEXT:    s_cmp_eq_u64 s[0:1], 0
-; CHECK-NEXT:    s_cbranch_scc1 .LBB17_2
+; CHECK-NEXT:    s_and_b32 vcc_lo, vcc_lo, s0
+; CHECK-NEXT:    s_cbranch_vccz .LBB17_2
 ; CHECK-NEXT:  ; %bb.1: ; %true
 ; CHECK-NEXT:    s_mov_b32 s0, 42
 ; CHECK-NEXT:    s_branch .LBB17_3

@jayfoad
Copy link
Contributor

jayfoad commented Nov 7, 2023

I would expect it to fail to select if the type does not match the wave size. Is there a real use case for it?

@vpykhtin
Copy link
Contributor Author

vpykhtin commented Nov 7, 2023

I would expect it to fail to select if the type does not match the wave size. Is there a real use case for it?

I would like to use ballot.i64 in both wave modes because I don't want to select bit width in AddressSanitizer code, it's relatively high in the pass chain and wave mode isn't readily available there.

@arsenm
Copy link
Contributor

arsenm commented Nov 8, 2023

I would expect it to fail to select if the type does not match the wave size. Is there a real use case for it?

It's a synthetic operation to begin with and has a pretty clear meaning, just 0 the high bits. It's still valid to read exec/exec_hi in wave32 mode, it's just the high bits are always 0. This mirrors the same behavior (such that we are now using this to implement the read_exec builtins)

Copy link
Contributor

@arsenm arsenm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this just about code quality? This probably isn't necessary if you consider the pipeline as a whole. I believe InstCombine is already reducing i64 ballot to i32 based on the subtarget's wave size

@vpykhtin
Copy link
Contributor Author

vpykhtin commented Nov 8, 2023

Is this just about code quality? This probably isn't necessary if you consider the pipeline as a whole. I believe InstCombine is already reducing i64 ballot to i32 based on the subtarget's wave size

Right, but InstCombine is a bit late. This is needed after #68714, otherwise I need special processing of i64 mask in wave32 mode there. I've implemented the TODO with 51578ea.

@vpykhtin
Copy link
Contributor Author

Ping.

@arsenm
Copy link
Contributor

arsenm commented Nov 15, 2023

Right, but InstCombine is a bit late.

I'm confused, InstCombine can't be too late, it runs many times after all IR producers. There's no practical path between the sanitizer passes and the backend where it won't run

@vpykhtin
Copy link
Contributor Author

I'm confused, InstCombine can't be too late, it runs many times after all IR producers. There's no practical path between the sanitizer passes and the backend where it won't run

I'm sorry, I thought it's machine IR pass. I don't see wavesize type correction for ballot in GCNTTIImpl::instCombineIntrinsic, do you beleive it's better to add it there?

@arsenm
Copy link
Contributor

arsenm commented Nov 15, 2023

I'm confused, InstCombine can't be too late, it runs many times after all IR producers. There's no practical path between the sanitizer passes and the backend where it won't run

I'm sorry, I thought it's machine IR pass. I don't see wavesize type correction for ballot in GCNTTIImpl::instCombineIntrinsic, do you beleive it's better to add it there?

Oh yes, it is missing. I recently added it for amdgcn_mbcnt_hi above. We should do the same for ballot

@vpykhtin
Copy link
Contributor Author

Reimplemented using instcombine rule. I had to move i64/wave32 tests to the existing dedicated test.

Please see the differences for individual commits in this branch to follow the changes made.

@vpykhtin vpykhtin requested a review from Pierre-vh November 20, 2023 15:12
@vpykhtin vpykhtin changed the title [AMDGPU] Improve selection of ballot.i64 intrinsic in wave32 mode in SelectionDAG. [AMDGPU] Improve selection of ballot.i64 intrinsic in wave32 mode. Nov 20, 2023
@llvmbot llvmbot added the clang Clang issues not falling into any other category label Nov 20, 2023
@vpykhtin
Copy link
Contributor Author

Removed unrelated changes.

// TODO: make condition below an assert after fixing ballot bitwidth.
VCMP.getValueType().getSizeInBits() == ST->getWavefrontSize()) {
if ((CC == ISD::SETEQ || CC == ISD::SETNE) && CRHS && CRHS->isZero()) {
assert(VCMP.getValueType().getSizeInBits() == ST->getWavefrontSize());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are asserting that instcombine has been run? That seems wrong. What about -O0 compiles?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. It looks like I need to perform the transformation in both places - instcombiner and SelectionDAG combiner if I want to keep ballot folding on -O0 but I'm not sure this is really needed.

Also I need to keep original test running without opt and just add runs with opt.

@vpykhtin
Copy link
Contributor Author

  • Undo the assert per Jay's comment for -O0 mode.
  • Restored the original test and added runs with opt to it.

Comment on lines 971 to 974
CallInst *NewCall = IC.Builder.CreateCall(NewF, {II.getArgOperand(0)});
Value *CastedCall = IC.Builder.CreateZExtOrBitCast(NewCall, II.getType());
CastedCall->takeName(&II);
return IC.replaceInstUsesWith(II, CastedCall);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Can just reuse the same value, e.g.

      Value *Call = IC.Builder.CreateCall(NewF, {II.getArgOperand(0)});
      Call = IC.Builder.CreateZExtOrBitCast(NewCall, II.getType());
      Call->takeName(&II);
      return IC.replaceInstUsesWith(II, Call);

I also think you should be able to use something like IC.Builder.CreateIntrinsic ? There should be a function to create an intrinsic call directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, changed to IC.Builder.CreateIntrinsic. I decided to use it as a temp for CreateZExtOrBitCast.

@vpykhtin
Copy link
Contributor Author

I'm going to rebase it on top of #73779 to show test changes.

@vpykhtin
Copy link
Contributor Author

This is ready to be submitted if there're no objections.

@vpykhtin
Copy link
Contributor Author

Gentle ping.

Copy link
Contributor

@arsenm arsenm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Description should be adjusted, this isn't really changing the selection anymore

@vpykhtin vpykhtin changed the title [AMDGPU] Improve selection of ballot.i64 intrinsic in wave32 mode. [AMDGPU] Add InstCombine rule for ballot.i64 intrinsic in wave32 mode. Dec 18, 2023
@arsenm arsenm merged commit 57b50ef into llvm:main Jan 17, 2024
@dyung
Copy link
Collaborator

dyung commented Jan 17, 2024

@vpykhtin
Copy link
Contributor Author

@vpykhtin the test you modified builtins-amdgcn-wave32.cl seems to be failing on a few bots. Can you take a look?

Sure, I'm going to revert and resubmit with the tests updated.

vpykhtin added a commit that referenced this pull request Feb 2, 2024
…ve32 mode." (#80303)

Reapply #71556 with added lit test constraint: `REQUIRES: amdgpu-registered-target`.

This reverts commit 9791e54.
agozillon pushed a commit to agozillon/llvm-project that referenced this pull request Feb 5, 2024
…ve32 mode." (llvm#80303)

Reapply llvm#71556 with added lit test constraint: `REQUIRES: amdgpu-registered-target`.

This reverts commit 9791e54.
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request May 1, 2024
…ve32 mode." (llvm#80303)

Reapply llvm#71556 with added lit test constraint: `REQUIRES: amdgpu-registered-target`.

This reverts commit 9791e54.

(cherry picked from commit b8025d1)
Change-Id: I03aafda08ca433456f6e82accd6b702a307bfd0b
rocm-ci pushed a commit to ROCm/llvm-project that referenced this pull request Jun 4, 2024
…ve32 mode." (llvm#80303)

Reapply llvm#71556 with added lit test constraint: `REQUIRES: amdgpu-registered-target`.

This reverts commit 9791e54.

(cherry picked from commit b8025d1)
Change-Id: I03aafda08ca433456f6e82accd6b702a307bfd0b
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend:AMDGPU clang Clang issues not falling into any other category llvm:globalisel llvm:transforms
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants