
AMDGPU/GlobalISel: Fix inst-selection of ballot #109986

Merged: 1 commit into llvm:main on Oct 11, 2024

Conversation

petar-avramovic
Collaborator

Both the input and the output of ballot are lane-masks:
the result is a lane-mask with S32/S64 LLT on the SGPR bank,
the input is a lane-mask with S1 LLT on the VCC register bank.
Ballot copies the bits from the input lane-mask for
all active lanes and puts 0 for inactive lanes.
GlobalISel did not set 0 in the result for inactive lanes
when the input was non-constant.
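
In scalar terms, the required behavior reduces to a mask AND. A minimal standalone model of the semantics described above (a hypothetical helper, not the selector code itself):

#include <cstdint>

// Each result bit mirrors the input lane-mask for active lanes and is
// forced to 0 for inactive lanes; 'Exec' is the active-lane mask.
uint64_t ballotModel(uint64_t InputLaneMask, uint64_t Exec) {
  return InputLaneMask & Exec;
}

This is exactly why the non-constant path in the patch below selects an S_AND of the source with exec.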

@llvmbot
Member

llvmbot commented Sep 25, 2024

@llvm/pr-subscribers-llvm-ir
@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-amdgpu

Author: Petar Avramovic (petar-avramovic)

Changes

Both the input and the output of ballot are lane-masks:
the result is a lane-mask with S32/S64 LLT on the SGPR bank,
the input is a lane-mask with S1 LLT on the VCC register bank.
Ballot copies the bits from the input lane-mask for
all active lanes and puts 0 for inactive lanes.
GlobalISel did not set 0 in the result for inactive lanes
when the input was non-constant.


Patch is 321.07 KiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/109986.diff

64 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp (+41-18)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/image-waterfall-loop-O0.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i32.ll (+27-9)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i64.ll (+27-9)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll (+16)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.make.buffer.rsrc.ll (+2-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.atomic.add.ll (+8-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.atomic.cmpswap.ll (+16-8)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.atomic.fadd.ll (+8-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.load.format.f16.ll (+6-3)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.load.format.ll (+4-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.load.ll (+24-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.store.format.f16.ll (+12-6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.store.format.f32.ll (+8-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.store.ll (+26-16)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.atomic.add.ll (+4-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.atomic.cmpswap.ll (+8-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.atomic.fadd.ll (+8-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.load.format.f16.ll (+4-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.load.format.ll (+2-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.load.ll (+12-6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.store.format.f16.ll (+8-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.store.format.f32.ll (+4-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.store.ll (+13-8)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.tbuffer.load.f16.ll (+4-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.tbuffer.load.ll (+2-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.tbuffer.store.f16.ll (+12-6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.tbuffer.store.i8.ll (+12-6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.tbuffer.store.ll (+10-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.tbuffer.load.f16.ll (+6-3)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.tbuffer.load.ll (+4-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.tbuffer.store.f16.ll (+18-9)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.tbuffer.store.i8.ll (+18-9)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.tbuffer.store.ll (+20-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.s.buffer.load.ll (+96-48)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.atomic.add.ll (+8-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.atomic.cmpswap.ll (+16-8)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.atomic.fadd.ll (+8-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.load.format.f16.ll (+6-3)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.load.format.ll (+4-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.load.ll (+4-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.store.format.f16.ll (+6-3)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.store.format.f32.ll (+4-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.store.ll (+4-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.atomic.add.ll (+4-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.atomic.cmpswap.ll (+8-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.atomic.fadd.ll (+8-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.load.format.f16.ll (+4-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.load.format.ll (+2-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.load.ll (+2-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.store.format.f16.ll (+4-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.store.format.f32.ll (+2-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.store.ll (+2-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.tbuffer.load.f16.ll (+4-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.tbuffer.load.ll (+2-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.tbuffer.load.f16.ll (+6-3)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.tbuffer.load.ll (+4-2)
  • (modified) llvm/test/CodeGen/AMDGPU/indirect-call.ll (+22-14)
  • (modified) llvm/test/CodeGen/AMDGPU/issue92561-restore-undef-scc-verifier-error.ll (+2-1)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ballot.i32.ll (+75-2)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ballot.i64.ll (+47)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ballot.i64.wave32.ll (+83-36)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.init.whole.wave-w32.ll (+24-15)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.init.whole.wave-w64.ll (+7-5)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
index febf0711c7d574..af6081f34303a3 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
@@ -1429,34 +1429,57 @@ bool AMDGPUInstructionSelector::selectBallot(MachineInstr &I) const {
   std::optional<ValueAndVReg> Arg =
       getIConstantVRegValWithLookThrough(I.getOperand(2).getReg(), *MRI);
 
-  const auto BuildCopy = [&](Register SrcReg) {
-    if (Size == STI.getWavefrontSize()) {
-      BuildMI(*BB, &I, DL, TII.get(AMDGPU::COPY), DstReg)
-          .addReg(SrcReg);
-      return;
-    }
+  const auto BuildAnd = [&](unsigned Opcode, Register Dst, Register Src,
+                            Register Exec) {
+    auto And = BuildMI(*BB, &I, DL, TII.get(Opcode), Dst)
+                   .addReg(Src)
+                   .addReg(Exec)
+                   .setOperandDead(3); // Dead scc
+    constrainSelectedInstRegOperands(*And, TII, TRI, RBI);
+  };
 
-    // If emitting a i64 ballot in wave32, fill the upper bits with zeroes.
-    Register HiReg = MRI->createVirtualRegister(&AMDGPU::SReg_32RegClass);
-    BuildMI(*BB, &I, DL, TII.get(AMDGPU::S_MOV_B32), HiReg).addImm(0);
-    BuildMI(*BB, &I, DL, TII.get(AMDGPU::REG_SEQUENCE), DstReg)
-        .addReg(SrcReg)
+  const auto BuildREG_SEQUENCE = [&](Register Dst, Register Lo, Register Hi) {
+    BuildMI(*BB, &I, DL, TII.get(AMDGPU::REG_SEQUENCE), Dst)
+        .addReg(Lo)
         .addImm(AMDGPU::sub0)
-        .addReg(HiReg)
+        .addReg(Hi)
         .addImm(AMDGPU::sub1);
   };
 
   if (Arg) {
-    const int64_t Value = Arg->Value.getSExtValue();
+    const int64_t Value = Arg->Value.getZExtValue();
     if (Value == 0) {
+      // DstReg(32or64) = S_MOV 0
       unsigned Opcode = Is64 ? AMDGPU::S_MOV_B64 : AMDGPU::S_MOV_B32;
       BuildMI(*BB, &I, DL, TII.get(Opcode), DstReg).addImm(0);
-    } else if (Value == -1) // all ones
-      BuildCopy(IsWave32 ? AMDGPU::EXEC_LO : AMDGPU::EXEC);
-    else
+    } else if (Value == 1) {
+      if (Size == STI.getWavefrontSize()) {
+        // DstReg(32or64) = COPY EXEC
+        BuildMI(*BB, &I, DL, TII.get(AMDGPU::COPY), DstReg)
+            .addReg(TRI.getExec());
+      } else {
+        // DstReg(64) = REG_SEQUENCE EXEC_LO, 0
+        Register HiReg = MRI->createVirtualRegister(&AMDGPU::SReg_32RegClass);
+        BuildMI(*BB, &I, DL, TII.get(AMDGPU::S_MOV_B32), HiReg).addImm(0);
+        BuildREG_SEQUENCE(DstReg, TRI.getExec(), HiReg);
+      }
+    } else
       return false;
-  } else
-    BuildCopy(I.getOperand(2).getReg());
+  } else {
+    Register SrcReg = I.getOperand(2).getReg();
+    if (Size == STI.getWavefrontSize()) {
+      // DstReg(32or64) = AND SrcReg, EXEC
+      unsigned AndOpc = IsWave32 ? AMDGPU::S_AND_B32 : AMDGPU::S_AND_B64;
+      BuildAnd(AndOpc, DstReg, SrcReg, TRI.getExec());
+    } else {
+      // DstReg(64) = REG_SEQUENCE (AND SrcReg(32), EXEC_LO), 0
+      Register LoReg = MRI->createVirtualRegister(&AMDGPU::SReg_32RegClass);
+      BuildAnd(AMDGPU::S_AND_B32, LoReg, SrcReg, AMDGPU::EXEC_LO);
+      Register HiReg = MRI->createVirtualRegister(&AMDGPU::SReg_32RegClass);
+      BuildMI(*BB, &I, DL, TII.get(AMDGPU::S_MOV_B32), HiReg).addImm(0);
+      BuildREG_SEQUENCE(DstReg, LoReg, HiReg);
+    }
+  }
 
   I.eraseFromParent();
   return true;
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/image-waterfall-loop-O0.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/image-waterfall-loop-O0.ll
index de973481f82308..3d7b715a6dbc19 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/image-waterfall-loop-O0.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/image-waterfall-loop-O0.ll
@@ -138,6 +138,7 @@ define <4 x float> @waterfall_loop(<8 x i32> %vgpr_srd) {
 ; CHECK-NEXT:    s_and_b32 s4, s4, s5
 ; CHECK-NEXT:    v_cmp_eq_u64_e64 s5, s[6:7], v[1:2]
 ; CHECK-NEXT:    s_and_b32 s4, s4, s5
+; CHECK-NEXT:    s_and_b32 s4, s4, exec_lo
 ; CHECK-NEXT:    s_and_saveexec_b32 s4, s4
 ; CHECK-NEXT:    v_writelane_b32 v0, s4, 13
 ; CHECK-NEXT:    s_or_saveexec_b32 s21, -1
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i32.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i32.ll
index 96cab200b61cdb..f893ac8177b4a3 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i32.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i32.ll
@@ -33,7 +33,8 @@ define amdgpu_cs i32 @non_compare(i32 %x) {
 ; CHECK-LABEL: non_compare:
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    v_and_b32_e32 v0, 1, v0
-; CHECK-NEXT:    v_cmp_ne_u32_e64 s0, 0, v0
+; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc_lo, 0, v0
+; CHECK-NEXT:    s_and_b32 s0, vcc_lo, exec_lo
 ; CHECK-NEXT:    ; return to shader part epilog
   %trunc = trunc i32 %x to i1
   %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %trunc)
@@ -45,7 +46,8 @@ define amdgpu_cs i32 @non_compare(i32 %x) {
 define amdgpu_cs i32 @compare_ints(i32 %x, i32 %y) {
 ; CHECK-LABEL: compare_ints:
 ; CHECK:       ; %bb.0:
-; CHECK-NEXT:    v_cmp_eq_u32_e64 s0, v0, v1
+; CHECK-NEXT:    v_cmp_eq_u32_e32 vcc_lo, v0, v1
+; CHECK-NEXT:    s_and_b32 s0, vcc_lo, exec_lo
 ; CHECK-NEXT:    ; return to shader part epilog
   %cmp = icmp eq i32 %x, %y
   %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %cmp)
@@ -55,7 +57,8 @@ define amdgpu_cs i32 @compare_ints(i32 %x, i32 %y) {
 define amdgpu_cs i32 @compare_int_with_constant(i32 %x) {
 ; CHECK-LABEL: compare_int_with_constant:
 ; CHECK:       ; %bb.0:
-; CHECK-NEXT:    v_cmp_le_i32_e64 s0, 0x63, v0
+; CHECK-NEXT:    v_cmp_le_i32_e32 vcc_lo, 0x63, v0
+; CHECK-NEXT:    s_and_b32 s0, vcc_lo, exec_lo
 ; CHECK-NEXT:    ; return to shader part epilog
   %cmp = icmp sge i32 %x, 99
   %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %cmp)
@@ -65,7 +68,8 @@ define amdgpu_cs i32 @compare_int_with_constant(i32 %x) {
 define amdgpu_cs i32 @compare_floats(float %x, float %y) {
 ; CHECK-LABEL: compare_floats:
 ; CHECK:       ; %bb.0:
-; CHECK-NEXT:    v_cmp_gt_f32_e64 s0, v0, v1
+; CHECK-NEXT:    v_cmp_gt_f32_e32 vcc_lo, v0, v1
+; CHECK-NEXT:    s_and_b32 s0, vcc_lo, exec_lo
 ; CHECK-NEXT:    ; return to shader part epilog
   %cmp = fcmp ogt float %x, %y
   %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %cmp)
@@ -76,7 +80,8 @@ define amdgpu_cs i32 @ctpop_of_ballot(float %x, float %y) {
 ; CHECK-LABEL: ctpop_of_ballot:
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    v_cmp_gt_f32_e32 vcc_lo, v0, v1
-; CHECK-NEXT:    s_bcnt1_i32_b32 s0, vcc_lo
+; CHECK-NEXT:    s_and_b32 s0, vcc_lo, exec_lo
+; CHECK-NEXT:    s_bcnt1_i32_b32 s0, s0
 ; CHECK-NEXT:    ; return to shader part epilog
   %cmp = fcmp ogt float %x, %y
   %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %cmp)
@@ -89,7 +94,8 @@ define amdgpu_cs i32 @branch_divergent_ballot_ne_zero_non_compare(i32 %v) {
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    v_and_b32_e32 v0, 1, v0
 ; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc_lo, 0, v0
-; CHECK-NEXT:    s_cmp_eq_u32 vcc_lo, 0
+; CHECK-NEXT:    s_and_b32 s0, vcc_lo, exec_lo
+; CHECK-NEXT:    s_cmp_eq_u32 s0, 0
 ; CHECK-NEXT:    s_cbranch_scc1 .LBB7_2
 ; CHECK-NEXT:  ; %bb.1: ; %true
 ; CHECK-NEXT:    s_mov_b32 s0, 42
@@ -113,6 +119,7 @@ define amdgpu_cs i32 @branch_uniform_ballot_ne_zero_non_compare(i32 inreg %v) {
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    s_and_b32 s0, 1, s0
 ; CHECK-NEXT:    v_cmp_ne_u32_e64 s0, 0, s0
+; CHECK-NEXT:    s_and_b32 s0, s0, exec_lo
 ; CHECK-NEXT:    s_cmp_eq_u32 s0, 0
 ; CHECK-NEXT:    s_cbranch_scc1 .LBB8_2
 ; CHECK-NEXT:  ; %bb.1: ; %true
@@ -137,7 +144,8 @@ define amdgpu_cs i32 @branch_divergent_ballot_eq_zero_non_compare(i32 %v) {
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    v_and_b32_e32 v0, 1, v0
 ; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc_lo, 0, v0
-; CHECK-NEXT:    s_cmp_lg_u32 vcc_lo, 0
+; CHECK-NEXT:    s_and_b32 s0, vcc_lo, exec_lo
+; CHECK-NEXT:    s_cmp_lg_u32 s0, 0
 ; CHECK-NEXT:    s_cbranch_scc0 .LBB9_2
 ; CHECK-NEXT:  ; %bb.1: ; %false
 ; CHECK-NEXT:    s_mov_b32 s0, 33
@@ -161,6 +169,7 @@ define amdgpu_cs i32 @branch_uniform_ballot_eq_zero_non_compare(i32 inreg %v) {
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    s_and_b32 s0, 1, s0
 ; CHECK-NEXT:    v_cmp_ne_u32_e64 s0, 0, s0
+; CHECK-NEXT:    s_and_b32 s0, s0, exec_lo
 ; CHECK-NEXT:    s_cmp_lg_u32 s0, 0
 ; CHECK-NEXT:    s_cbranch_scc0 .LBB10_2
 ; CHECK-NEXT:  ; %bb.1: ; %false
@@ -184,7 +193,8 @@ define amdgpu_cs i32 @branch_divergent_ballot_ne_zero_compare(i32 %v) {
 ; CHECK-LABEL: branch_divergent_ballot_ne_zero_compare:
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 12, v0
-; CHECK-NEXT:    s_cmp_eq_u32 vcc_lo, 0
+; CHECK-NEXT:    s_and_b32 s0, vcc_lo, exec_lo
+; CHECK-NEXT:    s_cmp_eq_u32 s0, 0
 ; CHECK-NEXT:    s_cbranch_scc1 .LBB11_2
 ; CHECK-NEXT:  ; %bb.1: ; %true
 ; CHECK-NEXT:    s_mov_b32 s0, 42
@@ -210,6 +220,7 @@ define amdgpu_cs i32 @branch_uniform_ballot_ne_zero_compare(i32 inreg %v) {
 ; CHECK-NEXT:    s_cselect_b32 s0, 1, 0
 ; CHECK-NEXT:    s_and_b32 s0, 1, s0
 ; CHECK-NEXT:    v_cmp_ne_u32_e64 s0, 0, s0
+; CHECK-NEXT:    s_and_b32 s0, s0, exec_lo
 ; CHECK-NEXT:    s_cmp_eq_u32 s0, 0
 ; CHECK-NEXT:    s_cbranch_scc1 .LBB12_2
 ; CHECK-NEXT:  ; %bb.1: ; %true
@@ -233,7 +244,8 @@ define amdgpu_cs i32 @branch_divergent_ballot_eq_zero_compare(i32 %v) {
 ; CHECK-LABEL: branch_divergent_ballot_eq_zero_compare:
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 12, v0
-; CHECK-NEXT:    s_cmp_lg_u32 vcc_lo, 0
+; CHECK-NEXT:    s_and_b32 s0, vcc_lo, exec_lo
+; CHECK-NEXT:    s_cmp_lg_u32 s0, 0
 ; CHECK-NEXT:    s_cbranch_scc0 .LBB13_2
 ; CHECK-NEXT:  ; %bb.1: ; %false
 ; CHECK-NEXT:    s_mov_b32 s0, 33
@@ -259,6 +271,7 @@ define amdgpu_cs i32 @branch_uniform_ballot_eq_zero_compare(i32 inreg %v) {
 ; CHECK-NEXT:    s_cselect_b32 s0, 1, 0
 ; CHECK-NEXT:    s_and_b32 s0, 1, s0
 ; CHECK-NEXT:    v_cmp_ne_u32_e64 s0, 0, s0
+; CHECK-NEXT:    s_and_b32 s0, s0, exec_lo
 ; CHECK-NEXT:    s_cmp_lg_u32 s0, 0
 ; CHECK-NEXT:    s_cbranch_scc0 .LBB14_2
 ; CHECK-NEXT:  ; %bb.1: ; %false
@@ -284,6 +297,7 @@ define amdgpu_cs i32 @branch_divergent_ballot_ne_zero_and(i32 %v1, i32 %v2) {
 ; CHECK-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 12, v0
 ; CHECK-NEXT:    v_cmp_lt_u32_e64 s0, 34, v1
 ; CHECK-NEXT:    s_and_b32 s0, vcc_lo, s0
+; CHECK-NEXT:    s_and_b32 s0, s0, exec_lo
 ; CHECK-NEXT:    s_cmp_eq_u32 s0, 0
 ; CHECK-NEXT:    s_cbranch_scc1 .LBB15_2
 ; CHECK-NEXT:  ; %bb.1: ; %true
@@ -315,6 +329,7 @@ define amdgpu_cs i32 @branch_uniform_ballot_ne_zero_and(i32 inreg %v1, i32 inreg
 ; CHECK-NEXT:    s_and_b32 s0, s0, s1
 ; CHECK-NEXT:    s_and_b32 s0, 1, s0
 ; CHECK-NEXT:    v_cmp_ne_u32_e64 s0, 0, s0
+; CHECK-NEXT:    s_and_b32 s0, s0, exec_lo
 ; CHECK-NEXT:    s_cmp_eq_u32 s0, 0
 ; CHECK-NEXT:    s_cbranch_scc1 .LBB16_2
 ; CHECK-NEXT:  ; %bb.1: ; %true
@@ -342,6 +357,7 @@ define amdgpu_cs i32 @branch_divergent_ballot_eq_zero_and(i32 %v1, i32 %v2) {
 ; CHECK-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 12, v0
 ; CHECK-NEXT:    v_cmp_lt_u32_e64 s0, 34, v1
 ; CHECK-NEXT:    s_and_b32 s0, vcc_lo, s0
+; CHECK-NEXT:    s_and_b32 s0, s0, exec_lo
 ; CHECK-NEXT:    s_cmp_lg_u32 s0, 0
 ; CHECK-NEXT:    s_cbranch_scc0 .LBB17_2
 ; CHECK-NEXT:  ; %bb.1: ; %false
@@ -373,6 +389,7 @@ define amdgpu_cs i32 @branch_uniform_ballot_eq_zero_and(i32 inreg %v1, i32 inreg
 ; CHECK-NEXT:    s_and_b32 s0, s0, s1
 ; CHECK-NEXT:    s_and_b32 s0, 1, s0
 ; CHECK-NEXT:    v_cmp_ne_u32_e64 s0, 0, s0
+; CHECK-NEXT:    s_and_b32 s0, s0, exec_lo
 ; CHECK-NEXT:    s_cmp_lg_u32 s0, 0
 ; CHECK-NEXT:    s_cbranch_scc0 .LBB18_2
 ; CHECK-NEXT:  ; %bb.1: ; %false
@@ -401,6 +418,7 @@ define amdgpu_cs i32 @branch_uniform_ballot_sgt_N_compare(i32 inreg %v) {
 ; CHECK-NEXT:    s_cselect_b32 s0, 1, 0
 ; CHECK-NEXT:    s_and_b32 s0, 1, s0
 ; CHECK-NEXT:    v_cmp_ne_u32_e64 s0, 0, s0
+; CHECK-NEXT:    s_and_b32 s0, s0, exec_lo
 ; CHECK-NEXT:    s_cmp_le_i32 s0, 22
 ; CHECK-NEXT:    s_cbranch_scc1 .LBB19_2
 ; CHECK-NEXT:  ; %bb.1: ; %true
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i64.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i64.ll
index a18f843440445c..72b65f087bdad5 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i64.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i64.ll
@@ -34,7 +34,8 @@ define amdgpu_cs i64 @non_compare(i32 %x) {
 ; CHECK-LABEL: non_compare:
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    v_and_b32_e32 v0, 1, v0
-; CHECK-NEXT:    v_cmp_ne_u32_e64 s[0:1], 0, v0
+; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v0
+; CHECK-NEXT:    s_and_b64 s[0:1], vcc, exec
 ; CHECK-NEXT:    ; return to shader part epilog
   %trunc = trunc i32 %x to i1
   %ballot = call i64 @llvm.amdgcn.ballot.i64(i1 %trunc)
@@ -46,7 +47,8 @@ define amdgpu_cs i64 @non_compare(i32 %x) {
 define amdgpu_cs i64 @compare_ints(i32 %x, i32 %y) {
 ; CHECK-LABEL: compare_ints:
 ; CHECK:       ; %bb.0:
-; CHECK-NEXT:    v_cmp_eq_u32_e64 s[0:1], v0, v1
+; CHECK-NEXT:    v_cmp_eq_u32_e32 vcc, v0, v1
+; CHECK-NEXT:    s_and_b64 s[0:1], vcc, exec
 ; CHECK-NEXT:    ; return to shader part epilog
   %cmp = icmp eq i32 %x, %y
   %ballot = call i64 @llvm.amdgcn.ballot.i64(i1 %cmp)
@@ -57,7 +59,8 @@ define amdgpu_cs i64 @compare_int_with_constant(i32 %x) {
 ; CHECK-LABEL: compare_int_with_constant:
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    v_mov_b32_e32 v1, 0x63
-; CHECK-NEXT:    v_cmp_ge_i32_e64 s[0:1], v0, v1
+; CHECK-NEXT:    v_cmp_ge_i32_e32 vcc, v0, v1
+; CHECK-NEXT:    s_and_b64 s[0:1], vcc, exec
 ; CHECK-NEXT:    ; return to shader part epilog
   %cmp = icmp sge i32 %x, 99
   %ballot = call i64 @llvm.amdgcn.ballot.i64(i1 %cmp)
@@ -67,7 +70,8 @@ define amdgpu_cs i64 @compare_int_with_constant(i32 %x) {
 define amdgpu_cs i64 @compare_floats(float %x, float %y) {
 ; CHECK-LABEL: compare_floats:
 ; CHECK:       ; %bb.0:
-; CHECK-NEXT:    v_cmp_gt_f32_e64 s[0:1], v0, v1
+; CHECK-NEXT:    v_cmp_gt_f32_e32 vcc, v0, v1
+; CHECK-NEXT:    s_and_b64 s[0:1], vcc, exec
 ; CHECK-NEXT:    ; return to shader part epilog
   %cmp = fcmp ogt float %x, %y
   %ballot = call i64 @llvm.amdgcn.ballot.i64(i1 %cmp)
@@ -78,7 +82,8 @@ define amdgpu_cs i64 @ctpop_of_ballot(float %x, float %y) {
 ; CHECK-LABEL: ctpop_of_ballot:
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    v_cmp_gt_f32_e32 vcc, v0, v1
-; CHECK-NEXT:    s_bcnt1_i32_b64 s0, vcc
+; CHECK-NEXT:    s_and_b64 s[0:1], vcc, exec
+; CHECK-NEXT:    s_bcnt1_i32_b64 s0, s[0:1]
 ; CHECK-NEXT:    s_mov_b32 s1, 0
 ; CHECK-NEXT:    ; return to shader part epilog
   %cmp = fcmp ogt float %x, %y
@@ -92,7 +97,8 @@ define amdgpu_cs i32 @branch_divergent_ballot_ne_zero_non_compare(i32 %v) {
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    v_and_b32_e32 v0, 1, v0
 ; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v0
-; CHECK-NEXT:    s_cmp_eq_u64 vcc, 0
+; CHECK-NEXT:    s_and_b64 s[0:1], vcc, exec
+; CHECK-NEXT:    s_cmp_eq_u64 s[0:1], 0
 ; CHECK-NEXT:    s_cbranch_scc1 .LBB7_2
 ; CHECK-NEXT:  ; %bb.1: ; %true
 ; CHECK-NEXT:    s_mov_b32 s0, 42
@@ -116,6 +122,7 @@ define amdgpu_cs i32 @branch_uniform_ballot_ne_zero_non_compare(i32 inreg %v) {
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    s_and_b32 s0, 1, s0
 ; CHECK-NEXT:    v_cmp_ne_u32_e64 s[0:1], 0, s0
+; CHECK-NEXT:    s_and_b64 s[0:1], s[0:1], exec
 ; CHECK-NEXT:    s_cmp_eq_u64 s[0:1], 0
 ; CHECK-NEXT:    s_cbranch_scc1 .LBB8_2
 ; CHECK-NEXT:  ; %bb.1: ; %true
@@ -140,7 +147,8 @@ define amdgpu_cs i32 @branch_divergent_ballot_eq_zero_non_compare(i32 %v) {
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    v_and_b32_e32 v0, 1, v0
 ; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v0
-; CHECK-NEXT:    s_cmp_lg_u64 vcc, 0
+; CHECK-NEXT:    s_and_b64 s[0:1], vcc, exec
+; CHECK-NEXT:    s_cmp_lg_u64 s[0:1], 0
 ; CHECK-NEXT:    s_cbranch_scc0 .LBB9_2
 ; CHECK-NEXT:  ; %bb.1: ; %false
 ; CHECK-NEXT:    s_mov_b32 s0, 33
@@ -164,6 +172,7 @@ define amdgpu_cs i32 @branch_uniform_ballot_eq_zero_non_compare(i32 inreg %v) {
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    s_and_b32 s0, 1, s0
 ; CHECK-NEXT:    v_cmp_ne_u32_e64 s[0:1], 0, s0
+; CHECK-NEXT:    s_and_b64 s[0:1], s[0:1], exec
 ; CHECK-NEXT:    s_cmp_lg_u64 s[0:1], 0
 ; CHECK-NEXT:    s_cbranch_scc0 .LBB10_2
 ; CHECK-NEXT:  ; %bb.1: ; %false
@@ -187,7 +196,8 @@ define amdgpu_cs i32 @branch_divergent_ballot_ne_zero_compare(i32 %v) {
 ; CHECK-LABEL: branch_divergent_ballot_ne_zero_compare:
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    v_cmp_gt_u32_e32 vcc, 12, v0
-; CHECK-NEXT:    s_cmp_eq_u64 vcc, 0
+; CHECK-NEXT:    s_and_b64 s[0:1], vcc, exec
+; CHECK-NEXT:    s_cmp_eq_u64 s[0:1], 0
 ; CHECK-NEXT:    s_cbranch_scc1 .LBB11_2
 ; CHECK-NEXT:  ; %bb.1: ; %true
 ; CHECK-NEXT:    s_mov_b32 s0, 42
@@ -213,6 +223,7 @@ define amdgpu_cs i32 @branch_uniform_ballot_ne_zero_compare(i32 inreg %v) {
 ; CHECK-NEXT:    s_cselect_b32 s0, 1, 0
 ; CHECK-NEXT:    s_and_b32 s0, 1, s0
 ; CHECK-NEXT:    v_cmp_ne_u32_e64 s[0:1], 0, s0
+; CHECK-NEXT:    s_and_b64 s[0:1], s[0:1], exec
 ; CHECK-NEXT:    s_cmp_eq_u64 s[0:1], 0
 ; CHECK-NEXT:    s_cbranch_scc1 .LBB12_2
 ; CHECK-NEXT:  ; %bb.1: ; %true
@@ -236,7 +247,8 @@ define amdgpu_cs i32 @branch_divergent_ballot_eq_zero_compare(i32 %v) {
 ; CHECK-LABEL: branch_divergent_ballot_eq_zero_compare:
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    v_cmp_gt_u32_e32 vcc, 12, v0
-; CHECK-NEXT:    s_cmp_lg_u64 vcc, 0
+; CHECK-NEXT:    s_and_b64 s[0:1], vcc, exec
+; CHECK-NEXT:    s_cmp_lg_u64 s[0:1], 0
 ; CHECK-NEXT:    s_cbranch_scc0 .LBB13_2
 ; CHECK-NEXT:  ; %bb.1: ; %false
 ; CHECK-NEXT:    s_mov_b32 s0, 33
@@ -262,6 +274,7 @@ define amdgpu_cs i32 @branch_uniform_ballot_eq_zero_compare(i32 inreg %v) {
 ; CHECK-NEXT:    s_cselect_b32 s0, 1, 0
 ; CHECK-NEXT:    s_and_b32 s0, 1, s0
 ; CHECK-NEXT:    v_cmp_ne_u32_e64 s[0:1], 0, s0
+; CHECK-NEXT:    s_and_b64 s[0:1], s[0:1], exec
 ; CHECK-NEXT:    s_cmp_lg_u64 s[0:1], 0
 ; CHECK-NEXT:    s_cbranch_scc0 .LBB14_2
 ; CHECK-NEXT:  ; %bb.1: ; %false
@@ -287,6 +300,7 @@ define amdgpu_cs i32 @branch_divergent_ballot_ne_zero_and(i32 %v1, i32 %v2) {
 ; CHECK-NEXT:    v_cmp_gt_u32_e32 vcc, 12, v0
 ; CHECK-NEXT:    v_cmp_lt_u32_e64 s[0:1], 34, v1
 ; CHECK-NEXT:    s_and_b64 s[0:1], vcc, s[0:1]
+; CHECK-NEXT:    s_and_b64 s[0:1], s[0:1], exec
 ; CHECK-NEXT:    s_cmp_eq_u64 s[0:1], 0
 ; CHECK-NEXT:    s_cbranch_scc1 .LBB15_2
 ; CHECK-NEXT:  ; %bb.1: ; %true
@@ -318,6 +332,7 @@ define amdgpu_cs i32 @branch_uniform_ballot_ne_zero_and(i32 inreg %v1, i32 inreg
 ; CHECK-NEXT:    s_and_b32 s0, s0, s1
 ; CHECK-NEXT:    s_and_b32 s0, 1, s0
 ; CHECK-NEXT:    v_cmp_ne_u32_e64 s[0:1], 0, s0
+; CHECK-NEXT:    s_and_b64 s[0:1], s[0:1], exec
 ; CHECK-NEXT:    s_cmp_eq_u64 s[0:1], 0
 ; CHECK-NEXT:    s_cbranch_scc1 .LBB16_2
 ; CHECK-NEXT:  ; %bb.1: ; %true
@@ -345,6 +360,7 @@ define amdgpu_cs i32 @branch_divergent_ballot_eq_zero_and(i32 %v1, i32 %v2) {
 ; CHECK-NEXT:    v_cmp_gt_u32_e32 vcc, 12, v0
 ; CHECK-NEXT:    v_cmp_lt_u32_e64 s[0:1], 34, v1
 ; CHECK-NEXT:    s_and_b64 s[0:1], vcc, s[0:1]
+; CHECK-NEXT:    s_and_b64 s[0:1], s[0:1], exec
 ; CHECK-NEXT:    s_cmp_lg_u64 s[0:1], 0
 ; CHECK-NEXT:    s_cbranch_scc0 .LBB17_2
 ; CHECK-NEXT:  ; %bb.1: ; %false
@@ -376,6 +392,7 @@ define amdgpu_cs i32 @branch_uniform_ballot_eq_zero_and(i32 inreg %v1, i32 inreg
 ; CHECK-NEXT:    s_and_b32 s0, s0, s1
 ; CHECK-NEXT:    s_and_b32 s0, 1, s0
 ; CHECK-NEXT:    v_cmp_ne_u32_e64 s[0:1], 0, s0
+; CHECK-NEXT:    s_and_b64 s[0:1], s[0:1], exec
 ; CHECK-NEXT:    s_cmp_lg_u64 s[0:1], 0
 ; CHECK-NEXT:    s_cbranch_scc0 .LBB18_2
 ; CHECK-NEXT:  ; %bb.1: ; %false
@@ -404,6 +421,7 @@ define amdgpu_cs i32 @branch_uniform_ballot_sgt_N_compare(i32 inreg %v) {
 ; CHECK-NEXT:    s_cselect_b32 s0, 1, 0
 ; CHECK-NEXT:    s_and_b32 s0, 1, s0
 ; CHECK-NEXT:    v_cmp_ne_u32_e64 s[0:1], 0, s0
+; CHECK-NEXT:    s_and_b64 s[0:1], s[0:1], exec
 ; CHECK-NEXT:    v_cmp_le_i64_e64 vcc, s[0:1], 22
 ; CHECK-NEXT:    s_cbranch_vccnz .LBB19_2
 ; CHECK-NEXT:  ; %bb.1: ; %true
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll
index 614f59c564df64..4593607112ba45 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll
@@ -166,6 +166,7 @@ define amdgpu_ps <4 x float> @image_bvh_intersect_ray_vgpr_descr(i32 %node_ptr,
 ; GFX1030-NEXT:    v_cmp_eq_u64_e32 vcc_lo, s[4:5], v[19:20]
 ; GFX1030-NEXT:    v_cmp_eq_u64_e64 s0, s[6:7], v[13:14]
 ; GFX1030-NEXT:    s_and_b32 s0, vcc_lo, s0
+; GFX1030-NEXT:    s_and...
[truncated]

.addReg(Src)
.addReg(Exec)
.setOperandDead(3); // Dead scc
constrainSelectedInstRegOperands(*And, TII, TRI, RBI);
Contributor

Should check return value
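
One way to address this, as a sketch (the bool return type on the lambda and callers checking it are assumptions, not the committed code):

const auto BuildAnd = [&](unsigned Opcode, Register Dst, Register Src,
                          Register Exec) -> bool {
  auto And = BuildMI(*BB, &I, DL, TII.get(Opcode), Dst)
                 .addReg(Src)
                 .addReg(Exec)
                 .setOperandDead(3); // Dead scc
  // Propagate the result instead of silently discarding it.
  return constrainSelectedInstRegOperands(*And, TII, TRI, RBI);
};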

Comment on lines 71 to 72
; CHECK-NEXT: v_cmp_gt_f32_e32 vcc_lo, v0, v1
; CHECK-NEXT: s_and_b32 s0, vcc_lo, exec_lo
Contributor

v_cmp already treats inactive lanes as 0. SelectionDAG does not introduce the extra instruction

Collaborator Author

The code is worse at the moment, but correct. I don't see a simple way to fix it.
Maybe we could lower ballot earlier, but most probably we need the new regbankselect first and can then remove the extra AND in the regbankcombiner.

; GFX10-NEXT: s_or_b32 s0, s0, s2
; GFX10-NEXT: ; %bb.4: ; %exit
; GFX10-NEXT: s_or_b32 exec_lo, exec_lo, s1
; GFX10-NEXT: v_cndmask_b32_e64 v2, 0, 1, s0
Collaborator Author

For context, this is what SDAG does:

non-constant ballot
tsrc: i1 = ...
tballot: i32 = llvm.amdgcn.ballot tsrc

is selected as (by default, when the combine does not kick in):
tzext: i32 = zero_extend tsrc
tballot: i32 = SETCC tzext, Constant:i32<0>, setne:ch

Note the uppercase SETCC and the i32 type; a regular compare is "i1 = setcc".

The most common case is a regular compare input, which gets folded (*):
tsrc: i1 = setcc t_op0, t_op1
tballot: i32 = llvm.amdgcn.ballot tsrc

tballot: i32 = SETCC t_op0, t_op1

The case that was missing in GlobalISel is when inactive lanes need to be zeroed by the ballot.
SDAG handles this by selecting "tzext: i32 = zero_extend tsrc" into a select
and then comparing it with zero; the compare result in vcc is set to 0 for inactive lanes.

(*) In theory this might be wrong if the folded compare came from another block with a different exec mask, but that may not be possible since SDAG works block by block?

GlobalISel could avoid adding the 'AND with exec' when the input is a compare result that was computed under the same exec mask as the one the ballot uses.
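
A hypothetical sketch of that check (the helper name and exact conditions are assumptions, not code from this patch):

// If the ballot input is a compare defined in the ballot's own block, it
// was computed under the same exec mask, so its inactive-lane bits are
// already 0 and the AND with exec is redundant.
static bool ballotNeedsAndWithExec(Register Src, MachineRegisterInfo &MRI,
                                   const MachineBasicBlock *BallotMBB) {
  const MachineInstr *Def = MRI.getVRegDef(Src);
  if (!Def || Def->getParent() != BallotMBB)
    return true; // conservatively keep the AND
  return !isa<GAnyCmp>(*Def); // V_CMP already zeroes inactive lanes
}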

@petar-avramovic
Collaborator Author

Maybe it is possible to avoid the AND in some cases. Will look into it.

@jayfoad
Contributor

jayfoad commented Sep 26, 2024

For completeness, llvm.amdgcn.ballot should be documented at https://llvm.org/docs/AMDGPUUsage.html#llvm-ir-intrinsics, and I think the documentation should say that result bits corresponding to inactive lanes are guaranteed to be zero.

@petar-avramovic
Collaborator Author

Ping. Fixed most of the cases where the compare was not folded; there is a comment that justifies why I think it is correct to fold.

Register Op0 = CmpMI->getOperand(2).getReg();
Register Op1 = CmpMI->getOperand(3).getReg();
unsigned OpSize = MRI->getType(Op0).getSizeInBits();
const TargetRegisterClass *VgprRC = TRI.getVGPRClassForBitWidth(OpSize);
Contributor

Sink this to where it's needed? You also don't need it if you use constrainSelectedInstRegOperands

// is in a block that merged control flow. Using the compare directly in the
// ballot implies that active lanes for the ballot are a subset of active
// lanes for the compare.
auto Pred = (CmpInst::Predicate)CmpMI->getOperand(1).getPredicate();
Contributor

cast<GAnyCmp> and use getCond
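
That is, something along these lines (sketch):

// cast<GAnyCmp> asserts the instruction is a generic compare, and getCond()
// returns the CmpInst::Predicate without raw operand indexing.
auto Pred = cast<GAnyCmp>(CmpMI)->getCond();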

Op0Idx = 1;
Op1Idx = 2;
} else {
// fcmp compares have modifiers
Contributor

So try to match them?

Comment on lines 92 to +93
; CHECK-NEXT: v_cmp_ne_u32_e32 vcc_lo, 0, v0
; CHECK-NEXT: s_cmp_eq_u32 vcc_lo, 0
; CHECK-NEXT: s_and_b32 s0, vcc_lo, exec_lo
Contributor

The output was a real compare, so this shouldn't be needed. The DAG doesn't emit this AND.

Collaborator Author

This is really not obvious: the input is a "VGPR to VCC trunc" that is selected as a compare. It will become better after the new reg bank select.

Comment on lines 1482 to 1486
return constrainOperandRegClass(*MF, TRI, *MRI, TII, RBI, *Cmp, *VgprRC,
Cmp->getOperand(Op0Idx)) &&
constrainOperandRegClass(*MF, TRI, *MRI, TII, RBI, *Cmp, *VgprRC,
Cmp->getOperand(Op1Idx)) &&
constrainSelectedInstRegOperands(*Cmp, TII, TRI, RBI);
Contributor

The constrainSelectedInstRegOperands is redundant with the two constrainOperandRegClass calls. Either directly constrain the result register, or just use the one constrainSelectedInstRegOperands
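
The second option would reduce the return to a single call, roughly (sketch):

// constrainSelectedInstRegOperands walks every operand of *Cmp, so the two
// explicit constrainOperandRegClass calls become unnecessary.
return constrainSelectedInstRegOperands(*Cmp, TII, TRI, RBI);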

};

const auto FoldCmp = [&](Register Dst, MachineInstr *CmpMI) -> bool {
// Fold ballot of a compare. Active lanes when the ballot is executed need
Contributor

Can you check that the block parents are the same? It's only broken by the mid-block exec mask change, which we should move away from having.

Collaborator Author

We can do better. There is no need to check anything if the compare is used directly.

For example:

%bb.entry
%cmp = ...
s_and_saveexec_b32 ...

%bb.divergent_block
 %res = ballot(%cmp)

This works because the active lanes in %bb.divergent_block are a subset of the active lanes in %bb.entry, so "sinking" the compare to the other block works according to the ballot description (put 0 in inactive lanes).

The only potential problem is when:

%bb.entry
s_and_saveexec_b32 ...

%bb.divergent_block
 %cmp = ...

; %bb.merge.control.flow
 s_or_b32 %exec.from.bb.divergent_block, ...
 %res = ballot(%cmp)

A lane could be inactive in %bb.divergent_block
but active in %bb.merge.control.flow, and the ballot result could be 1 instead of 0.
However, that case is not written correctly: the ballot can't use such a compare directly, it has to go through a phi.

%bb.entry
s_and_saveexec_b32 ...

%bb.divergent_block
 **%cmp** = ...
...
; %bb.merge.control.flow
 s_or_b32 %exec.from.bb.divergent_block, ...
 %phi = phi i1 [ **%cmp**, %bb.divergent_block ], [ ..., ...]
 %res = ballot(%phi)

So in summary, we do the same thing as SDAG if the input is a compare (sink it); otherwise we select an AND with exec.
This is better than what SelectionDAG does when the input is not a compare (SDAG does a select of 0/1 (lane mask to VGPR) plus a compare of the VGPR with 0).

Comment on lines 135 to 136
; CHECK-NEXT: [[S_AND_B64_2:%[0-9]+]]:sreg_64_xexec = S_AND_B64 [[S_AND_B64_1]], $exec, implicit-def dead $scc
; CHECK-NEXT: [[S_AND_SAVEEXEC_B64_:%[0-9]+]]:sreg_64_xexec = S_AND_SAVEEXEC_B64 killed [[S_AND_B64_2]], implicit-def $exec, implicit-def $scc, implicit $exec
Contributor

Nearly all of the test changes are regressing the waterfall loop case, for the ballot from regbankselect. Can we do anything about that? Intuitively we should be able to use the raw value.

Collaborator Author

Will rewrite the patch next week; I need to look through AND and OR and see whether some of the values were made with the same exec (going back to your suggestion to just check for the same block). I was thinking of simply doing a 'copy' instead of sinking the compare or doing an 'AND with exec'.

Comment on lines 1481 to 1482
auto [Src0, Src0Mods] = selectVOP3ModsImpl(Src0Reg);
auto [Src1, Src1Mods] = selectVOP3ModsImpl(Src1Reg);
Contributor

Need tests with source modifiers?

Comment on lines 1437 to 1439
// Try to fold sgpr compare.
if (SrcMI->getOpcode() == AMDGPU::G_TRUNC)
SrcMI = MRI->getVRegDef(SrcMI->getOperand(1).getReg());
Contributor

This pattern probably shouldn't reach selection

Comment on lines +478 to +480
; GFX11-NEXT: ; %bb.4: ; %exit
; GFX11-NEXT: s_or_b32 exec_lo, exec_lo, s1
; GFX11-NEXT: s_and_b32 s0, s0, exec_lo
; GFX11-NEXT: v_mov_b32_e32 v2, s0
; GFX11-NEXT: global_store_b32 v[0:1], v2, off
@petar-avramovic (Collaborator Author) commented Oct 10, 2024

Reg = MI->getOperand(1).getReg();
if (!Reg.isVirtual())
return false;
MI = MRI.getVRegDef(Reg);
Contributor

Maybe should break if the block changed

// Ballot has to zero bits in input lane-mask that are zero in current exec,
// Done as AND with exec. For inputs that are results of instruction that
// implicitly use same exec, for example compares in same basic block, use copy.
bool isBallotCopy(Register Reg, MachineRegisterInfo &MRI,
Contributor

static

bool isBallotCopy(Register Reg, MachineRegisterInfo &MRI,
MachineBasicBlock *MBB) {
MachineInstr *MI = MRI.getVRegDef(Reg);
// Look through copies, truncs and anyext. TODO: just copies
Contributor

In what situations are there redundant copies that need to be looked through? I would assume any trivial copies would be optimized out.

Collaborator Author

Non-trivial cross-bank copies, scc to vcc for example.

Contributor

That would be at most one copy?

Also thinking about whether such a trivially uniform ballot should have been folded out already.

// Ballot has to zero bits in input lane-mask that are zero in current exec,
// Done as AND with exec. For inputs that are results of instruction that
// implicitly use same exec, for example compares in same basic block, use copy.
bool isBallotCopy(Register Reg, MachineRegisterInfo &MRI,
Contributor

The name is misleading; this is looking for a vector compare.

@petar-avramovic force-pushed the ballot branch 2 times, most recently from 5185602 to 1420d35, on October 10, 2024 at 14:30.
Both the input and the output of ballot are lane-masks:
the result is a lane-mask with S32/S64 LLT on the SGPR bank,
the input is a lane-mask with S1 LLT on the VCC register bank.
Ballot copies the bits from the input lane-mask for
all active lanes and puts 0 for inactive lanes.
GlobalISel did not set 0 in the result for inactive lanes
when the input was non-constant.
@petar-avramovic merged commit 7b0d56b into llvm:main on Oct 11, 2024. 9 checks passed.
}

// Lane mask generated using compare with same exec.
if (isa<GAnyCmp>(MI))
Collaborator

This is now missing a check that MI is in MBB.

Comment on lines +1423 to +1424
if (MI->getParent() != MBB)
return false;
Collaborator Author

The same-MBB check is here. Deleted the loop.

DanielCChen pushed a commit to DanielCChen/llvm-project that referenced this pull request Oct 16, 2024
Both the input and the output of ballot are lane-masks:
the result is a lane-mask with S32/S64 LLT on the SGPR bank,
the input is a lane-mask with S1 LLT on the VCC register bank.
Ballot copies the bits from the input lane-mask for
all active lanes and puts 0 for inactive lanes.
GlobalISel did not set 0 in the result for inactive lanes
when the input was non-constant.