[AMDGPU][IGLP]: Add SchedGroupMask::TRANS #75416


Merged 1 commit into llvm:main on Dec 20, 2023

Conversation

jrbyrnes
Contributor

Makes constructing SchedGroups of this type easier, and provides the ability to create them with __builtin_amdgcn_sched_group_barrier
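For illustration, here is a small Python model of where the new bit sits in the SchedGroupMask layout from AMDGPUIGroupLP.cpp (the enum names and bit positions follow the patch below; the Python class itself is only a sketch, not part of the patch). The 0x400 value is what you would pass as the mask argument of the builtin to form a TRANS-only group:

```python
from enum import IntFlag

# Mirrors the C++ SchedGroupMask enum in AMDGPUIGroupLP.cpp after this patch.
class SchedGroupMask(IntFlag):
    NONE = 0
    ALU = 1 << 0
    VALU = 1 << 1
    SALU = 1 << 2
    MFMA = 1 << 3
    VMEM = 1 << 4
    VMEM_READ = 1 << 5
    VMEM_WRITE = 1 << 6
    DS = 1 << 7
    DS_READ = 1 << 8
    DS_WRITE = 1 << 9
    TRANS = 1 << 10  # new in this patch
    ALL = (ALU | VALU | SALU | MFMA | VMEM | VMEM_READ | VMEM_WRITE
           | DS | DS_READ | DS_WRITE | TRANS)

# A TRANS-only sched group therefore uses mask 0x400, which is what the
# "; sched_group_barrier mask(0x00000400) size(1) SyncID(0)" lines in the
# test output correspond to.
print(hex(SchedGroupMask.TRANS))  # 0x400
```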

@llvmbot
Member

llvmbot commented Dec 14, 2023

@llvm/pr-subscribers-backend-amdgpu

Author: Jeffrey Byrnes (jrbyrnes)

Changes

Makes constructing SchedGroups of this type easier, and provides the ability to create them with __builtin_amdgcn_sched_group_barrier


Patch is 27.25 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/75416.diff

3 Files Affected:

  • (modified) llvm/docs/AMDGPUUsage.rst (+1)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp (+19-7)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.sched.group.barrier.ll (+428-1)
diff --git a/llvm/docs/AMDGPUUsage.rst b/llvm/docs/AMDGPUUsage.rst
index 5fcf651046943f..1def5b7c2cbd77 100644
--- a/llvm/docs/AMDGPUUsage.rst
+++ b/llvm/docs/AMDGPUUsage.rst
@@ -1130,6 +1130,7 @@ The AMDGPU backend implements the following LLVM IR intrinsics.
                                                    - 0x0080: All DS instructions may be scheduled across sched_barrier.
                                                   - 0x0100: All DS read instructions may be scheduled across sched_barrier.
                                                    - 0x0200: All DS write instructions may be scheduled across sched_barrier.
+                                                   - 0x0400: All Transcendental (e.g. V_EXP) instructions may be scheduled across sched_barrier.
 
   llvm.amdgcn.sched_group_barrier                  Creates schedule groups with specific properties to create custom scheduling
                                                    pipelines. The ordering between groups is enforced by the instruction scheduler.
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp b/llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp
index 0a17b1536040dc..092f2990269b53 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp
@@ -75,8 +75,9 @@ enum class SchedGroupMask {
   DS = 1u << 7,
   DS_READ = 1u << 8,
   DS_WRITE = 1u << 9,
+  TRANS = 1u << 10,
   ALL = ALU | VALU | SALU | MFMA | VMEM | VMEM_READ | VMEM_WRITE | DS |
-        DS_READ | DS_WRITE,
+        DS_READ | DS_WRITE | TRANS,
   LLVM_MARK_AS_BITMASK_ENUM(/* LargestFlag = */ ALL)
 };
 
@@ -1476,6 +1477,10 @@ bool SchedGroup::canAddMI(const MachineInstr &MI) const {
            MI.mayStore() && TII->isDS(MI))
     Result = true;
 
+  else if (((SGMask & SchedGroupMask::TRANS) != SchedGroupMask::NONE) &&
+           TII->isTRANS(MI))
+    Result = true;
+
   LLVM_DEBUG(
       dbgs() << "For SchedGroup with mask " << format_hex((int)SGMask, 10, true)
              << (Result ? " could classify " : " unable to classify ") << MI);
@@ -1652,16 +1657,23 @@ IGroupLPDAGMutation::invertSchedBarrierMask(SchedGroupMask Mask) const {
   // allowed past the SCHED_BARRIER.
   SchedGroupMask InvertedMask = ~Mask;
 
-  // ALU implies VALU, SALU, MFMA.
+  // ALU implies VALU, SALU, MFMA, TRANS.
   if ((InvertedMask & SchedGroupMask::ALU) == SchedGroupMask::NONE)
-    InvertedMask &=
-        ~SchedGroupMask::VALU & ~SchedGroupMask::SALU & ~SchedGroupMask::MFMA;
-  // VALU, SALU, MFMA implies ALU.
-  else if ((InvertedMask & SchedGroupMask::VALU) == SchedGroupMask::NONE ||
-           (InvertedMask & SchedGroupMask::SALU) == SchedGroupMask::NONE ||
+    InvertedMask &= ~SchedGroupMask::VALU & ~SchedGroupMask::SALU &
+                    ~SchedGroupMask::MFMA & ~SchedGroupMask::TRANS;
+  // VALU, SALU, MFMA, TRANS implies ALU.
+  else if ((InvertedMask & SchedGroupMask::SALU) == SchedGroupMask::NONE ||
            (InvertedMask & SchedGroupMask::MFMA) == SchedGroupMask::NONE)
     InvertedMask &= ~SchedGroupMask::ALU;
 
+  // TRANS implies VALU, ALU
+  if ((InvertedMask & SchedGroupMask::TRANS) == SchedGroupMask::NONE)
+    InvertedMask &= ~SchedGroupMask::VALU & ~SchedGroupMask::ALU;
+
+  // VALU implies TRANS, ALU
+  if ((InvertedMask & SchedGroupMask::VALU) == SchedGroupMask::NONE)
+    InvertedMask &= ~SchedGroupMask::TRANS & ~SchedGroupMask::ALU;
+
   // VMEM implies VMEM_READ, VMEM_WRITE.
   if ((InvertedMask & SchedGroupMask::VMEM) == SchedGroupMask::NONE)
     InvertedMask &= ~SchedGroupMask::VMEM_READ & ~SchedGroupMask::VMEM_WRITE;
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.sched.group.barrier.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.sched.group.barrier.ll
index 288616086eb8e5..4b7f1d2e2e1915 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.sched.group.barrier.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.sched.group.barrier.ll
@@ -1185,11 +1185,438 @@ entry:
   ret void
 }
 
-
+define amdgpu_kernel void @test_sched_group_barrier_pipeline_interleave_EXP_MFMA(ptr addrspace(3) noalias %in, ptr addrspace(3) noalias %out, <5 x float> %in1) #0 {
+; GCN-LABEL: test_sched_group_barrier_pipeline_interleave_EXP_MFMA:
+; GCN:       ; %bb.0: ; %entry
+; GCN-NEXT:    s_load_dwordx4 s[4:7], s[0:1], 0x44
+; GCN-NEXT:    v_mov_b32_e32 v3, 0x3fb8aa3b
+; GCN-NEXT:    s_load_dwordx2 s[2:3], s[0:1], 0x24
+; GCN-NEXT:    v_mov_b32_e32 v7, 0x32a5705f
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 7, v0
+; GCN-NEXT:    s_waitcnt lgkmcnt(0)
+; GCN-NEXT:    v_mul_f32_e32 v4, s4, v3
+; GCN-NEXT:    v_rndne_f32_e32 v5, v4
+; GCN-NEXT:    v_sub_f32_e32 v6, v4, v5
+; GCN-NEXT:    v_fma_f32 v4, s4, v3, -v4
+; GCN-NEXT:    v_fmac_f32_e32 v4, s4, v7
+; GCN-NEXT:    v_add_f32_e32 v4, v6, v4
+; GCN-NEXT:    v_exp_f32_e32 v4, v4
+; GCN-NEXT:    v_cvt_i32_f32_e32 v5, v5
+; GCN-NEXT:    v_add_u32_e32 v1, s2, v0
+; GCN-NEXT:    ds_read_b128 a[28:31], v1 offset:112
+; GCN-NEXT:    ds_read_b128 a[24:27], v1 offset:96
+; GCN-NEXT:    ds_read_b128 a[20:23], v1 offset:80
+; GCN-NEXT:    ds_read_b128 a[16:19], v1 offset:64
+; GCN-NEXT:    ds_read_b128 a[0:3], v1
+; GCN-NEXT:    ds_read_b128 a[4:7], v1 offset:16
+; GCN-NEXT:    ds_read_b128 a[8:11], v1 offset:32
+; GCN-NEXT:    ds_read_b128 a[12:15], v1 offset:48
+; GCN-NEXT:    v_mov_b32_e32 v9, 1.0
+; GCN-NEXT:    v_ldexp_f32 v4, v4, v5
+; GCN-NEXT:    v_mov_b32_e32 v5, 0xc2ce8ed0
+; GCN-NEXT:    v_mul_f32_e32 v10, s5, v3
+; GCN-NEXT:    v_cmp_nlt_f32_e32 vcc, s4, v5
+; GCN-NEXT:    v_mov_b32_e32 v6, 0x42b17218
+; GCN-NEXT:    v_rndne_f32_e32 v11, v10
+; GCN-NEXT:    v_cndmask_b32_e32 v4, 0, v4, vcc
+; GCN-NEXT:    v_cmp_ngt_f32_e32 vcc, s4, v6
+; GCN-NEXT:    v_mov_b32_e32 v8, 0x7f800000
+; GCN-NEXT:    v_sub_f32_e32 v12, v10, v11
+; GCN-NEXT:    v_fma_f32 v10, s5, v3, -v10
+; GCN-NEXT:    v_cndmask_b32_e32 v4, v8, v4, vcc
+; GCN-NEXT:    v_fmac_f32_e32 v10, s5, v7
+; GCN-NEXT:    ds_read_b128 a[60:63], v1 offset:8304
+; GCN-NEXT:    s_waitcnt lgkmcnt(1)
+; GCN-NEXT:    v_mfma_f32_32x32x1f32 a[0:31], v9, v4, a[0:31]
+; GCN-NEXT:    v_add_f32_e32 v4, v12, v10
+; GCN-NEXT:    v_exp_f32_e32 v4, v4
+; GCN-NEXT:    v_cvt_i32_f32_e32 v10, v11
+; GCN-NEXT:    ds_read_b128 a[56:59], v1 offset:8288
+; GCN-NEXT:    ds_read_b128 a[52:55], v1 offset:8272
+; GCN-NEXT:    ds_read_b128 a[48:51], v1 offset:8256
+; GCN-NEXT:    ds_read_b128 a[44:47], v1 offset:8240
+; GCN-NEXT:    ds_read_b128 a[40:43], v1 offset:8224
+; GCN-NEXT:    ds_read_b128 a[36:39], v1 offset:8208
+; GCN-NEXT:    ds_read_b128 a[32:35], v1 offset:8192
+; GCN-NEXT:    v_ldexp_f32 v4, v4, v10
+; GCN-NEXT:    v_cmp_nlt_f32_e32 vcc, s5, v5
+; GCN-NEXT:    v_cndmask_b32_e32 v4, 0, v4, vcc
+; GCN-NEXT:    v_cmp_ngt_f32_e32 vcc, s5, v6
+; GCN-NEXT:    v_cndmask_b32_e32 v4, v8, v4, vcc
+; GCN-NEXT:    v_mul_f32_e32 v10, s6, v3
+; GCN-NEXT:    v_rndne_f32_e32 v11, v10
+; GCN-NEXT:    s_waitcnt lgkmcnt(0)
+; GCN-NEXT:    v_mfma_f32_32x32x1f32 a[32:63], v9, v4, a[32:63]
+; GCN-NEXT:    v_fma_f32 v4, s6, v3, -v10
+; GCN-NEXT:    v_sub_f32_e32 v12, v10, v11
+; GCN-NEXT:    v_fmac_f32_e32 v4, s6, v7
+; GCN-NEXT:    v_add_f32_e32 v4, v12, v4
+; GCN-NEXT:    v_exp_f32_e32 v4, v4
+; GCN-NEXT:    v_cvt_i32_f32_e32 v10, v11
+; GCN-NEXT:    ds_read_b128 a[92:95], v1 offset:24688
+; GCN-NEXT:    ds_read_b128 a[88:91], v1 offset:24672
+; GCN-NEXT:    ds_read_b128 a[84:87], v1 offset:24656
+; GCN-NEXT:    ds_read_b128 a[80:83], v1 offset:24640
+; GCN-NEXT:    ds_read_b128 a[76:79], v1 offset:24624
+; GCN-NEXT:    ds_read_b128 a[72:75], v1 offset:24608
+; GCN-NEXT:    ds_read_b128 a[68:71], v1 offset:24592
+; GCN-NEXT:    ds_read_b128 a[64:67], v1 offset:24576
+; GCN-NEXT:    v_add_u32_e32 v2, 0x6000, v1
+; GCN-NEXT:    ds_read_b128 a[124:127], v1 offset:49264
+; GCN-NEXT:    ds_read_b128 a[120:123], v1 offset:49248
+; GCN-NEXT:    ds_read_b128 a[116:119], v1 offset:49232
+; GCN-NEXT:    ds_read_b128 a[112:115], v1 offset:49216
+; GCN-NEXT:    ds_read_b128 a[108:111], v1 offset:49200
+; GCN-NEXT:    ds_read_b128 a[104:107], v1 offset:49184
+; GCN-NEXT:    ds_read_b128 a[100:103], v1 offset:49168
+; GCN-NEXT:    ds_read_b128 a[96:99], v1 offset:49152
+; GCN-NEXT:    v_ldexp_f32 v1, v4, v10
+; GCN-NEXT:    v_cmp_nlt_f32_e32 vcc, s6, v5
+; GCN-NEXT:    v_cndmask_b32_e32 v1, 0, v1, vcc
+; GCN-NEXT:    v_cmp_ngt_f32_e32 vcc, s6, v6
+; GCN-NEXT:    v_mul_f32_e32 v4, s7, v3
+; GCN-NEXT:    v_cndmask_b32_e32 v1, v8, v1, vcc
+; GCN-NEXT:    v_rndne_f32_e32 v10, v4
+; GCN-NEXT:    s_load_dword s8, s[0:1], 0x54
+; GCN-NEXT:    s_waitcnt lgkmcnt(0)
+; GCN-NEXT:    v_mfma_f32_32x32x1f32 a[64:95], v9, v1, a[64:95]
+; GCN-NEXT:    v_sub_f32_e32 v1, v4, v10
+; GCN-NEXT:    v_fma_f32 v4, s7, v3, -v4
+; GCN-NEXT:    v_fmac_f32_e32 v4, s7, v7
+; GCN-NEXT:    v_add_f32_e32 v1, v1, v4
+; GCN-NEXT:    v_exp_f32_e32 v1, v1
+; GCN-NEXT:    v_cvt_i32_f32_e32 v4, v10
+; GCN-NEXT:    v_cmp_nlt_f32_e32 vcc, s7, v5
+; GCN-NEXT:    ds_read_b128 a[156:159], v2 offset:57456
+; GCN-NEXT:    ds_read_b128 a[152:155], v2 offset:57440
+; GCN-NEXT:    v_ldexp_f32 v1, v1, v4
+; GCN-NEXT:    v_cndmask_b32_e32 v1, 0, v1, vcc
+; GCN-NEXT:    v_cmp_ngt_f32_e32 vcc, s7, v6
+; GCN-NEXT:    v_cndmask_b32_e32 v1, v8, v1, vcc
+; GCN-NEXT:    v_mul_f32_e32 v4, s8, v3
+; GCN-NEXT:    v_fma_f32 v3, s8, v3, -v4
+; GCN-NEXT:    v_mfma_f32_32x32x1f32 a[96:127], v9, v1, a[96:127]
+; GCN-NEXT:    v_rndne_f32_e32 v1, v4
+; GCN-NEXT:    v_sub_f32_e32 v10, v4, v1
+; GCN-NEXT:    v_fmac_f32_e32 v3, s8, v7
+; GCN-NEXT:    v_add_f32_e32 v3, v10, v3
+; GCN-NEXT:    v_exp_f32_e32 v3, v3
+; GCN-NEXT:    v_cvt_i32_f32_e32 v1, v1
+; GCN-NEXT:    ds_read_b128 a[148:151], v2 offset:57424
+; GCN-NEXT:    ds_read_b128 a[144:147], v2 offset:57408
+; GCN-NEXT:    ds_read_b128 a[128:131], v2 offset:57344
+; GCN-NEXT:    ds_read_b128 a[132:135], v2 offset:57360
+; GCN-NEXT:    ds_read_b128 a[136:139], v2 offset:57376
+; GCN-NEXT:    ds_read_b128 a[140:143], v2 offset:57392
+; GCN-NEXT:    v_ldexp_f32 v1, v3, v1
+; GCN-NEXT:    v_cmp_nlt_f32_e32 vcc, s8, v5
+; GCN-NEXT:    v_cndmask_b32_e32 v1, 0, v1, vcc
+; GCN-NEXT:    v_cmp_ngt_f32_e32 vcc, s8, v6
+; GCN-NEXT:    v_cndmask_b32_e32 v1, v8, v1, vcc
+; GCN-NEXT:    v_add_u32_e32 v0, s3, v0
+; GCN-NEXT:    ds_write_b128 v0, a[28:31] offset:112
+; GCN-NEXT:    s_waitcnt lgkmcnt(1)
+; GCN-NEXT:    v_mfma_f32_32x32x1f32 a[128:159], v9, v1, a[128:159]
+; GCN-NEXT:    ds_write_b128 v0, a[24:27] offset:96
+; GCN-NEXT:    ds_write_b128 v0, a[20:23] offset:80
+; GCN-NEXT:    ds_write_b128 v0, a[16:19] offset:64
+; GCN-NEXT:    ds_write_b128 v0, a[12:15] offset:48
+; GCN-NEXT:    ds_write_b128 v0, a[8:11] offset:32
+; GCN-NEXT:    ds_write_b128 v0, a[4:7] offset:16
+; GCN-NEXT:    ds_write_b128 v0, a[0:3]
+; GCN-NEXT:    v_mov_b32_e32 v0, s3
+; GCN-NEXT:    ; kill: killed $sgpr0_sgpr1
+; GCN-NEXT:    ; sched_group_barrier mask(0x00000400) size(1) SyncID(0)
+; GCN-NEXT:    ; sched_group_barrier mask(0x00000008) size(1) SyncID(0)
+; GCN-NEXT:    ; sched_group_barrier mask(0x00000400) size(1) SyncID(0)
+; GCN-NEXT:    ; sched_group_barrier mask(0x00000008) size(1) SyncID(0)
+; GCN-NEXT:    ; sched_group_barrier mask(0x00000400) size(1) SyncID(0)
+; GCN-NEXT:    ; sched_group_barrier mask(0x00000008) size(1) SyncID(0)
+; GCN-NEXT:    ; sched_group_barrier mask(0x00000400) size(1) SyncID(0)
+; GCN-NEXT:    ; sched_group_barrier mask(0x00000008) size(1) SyncID(0)
+; GCN-NEXT:    ; sched_group_barrier mask(0x00000400) size(1) SyncID(0)
+; GCN-NEXT:    ; sched_group_barrier mask(0x00000008) size(1) SyncID(0)
+; GCN-NEXT:    ds_write_b128 v0, a[56:59] offset:8288
+; GCN-NEXT:    ds_write_b128 v0, a[60:63] offset:8304
+; GCN-NEXT:    ds_write_b128 v0, a[48:51] offset:8256
+; GCN-NEXT:    ds_write_b128 v0, a[52:55] offset:8272
+; GCN-NEXT:    ds_write_b128 v0, a[40:43] offset:8224
+; GCN-NEXT:    ds_write_b128 v0, a[44:47] offset:8240
+; GCN-NEXT:    ds_write_b128 v0, a[32:35] offset:8192
+; GCN-NEXT:    ds_write_b128 v0, a[36:39] offset:8208
+; GCN-NEXT:    ds_write_b128 v0, a[88:91] offset:16480
+; GCN-NEXT:    ds_write_b128 v0, a[92:95] offset:16496
+; GCN-NEXT:    ds_write_b128 v0, a[80:83] offset:16448
+; GCN-NEXT:    ds_write_b128 v0, a[84:87] offset:16464
+; GCN-NEXT:    ds_write_b128 v0, a[72:75] offset:16416
+; GCN-NEXT:    ds_write_b128 v0, a[76:79] offset:16432
+; GCN-NEXT:    ds_write_b128 v0, a[64:67] offset:16384
+; GCN-NEXT:    ds_write_b128 v0, a[68:71] offset:16400
+; GCN-NEXT:    ds_write_b128 v0, a[120:123] offset:24672
+; GCN-NEXT:    ds_write_b128 v0, a[124:127] offset:24688
+; GCN-NEXT:    ds_write_b128 v0, a[112:115] offset:24640
+; GCN-NEXT:    ds_write_b128 v0, a[116:119] offset:24656
+; GCN-NEXT:    ds_write_b128 v0, a[104:107] offset:24608
+; GCN-NEXT:    ds_write_b128 v0, a[108:111] offset:24624
+; GCN-NEXT:    ds_write_b128 v0, a[96:99] offset:24576
+; GCN-NEXT:    ds_write_b128 v0, a[100:103] offset:24592
+; GCN-NEXT:    ds_write_b128 v0, a[152:155] offset:32864
+; GCN-NEXT:    ds_write_b128 v0, a[156:159] offset:32880
+; GCN-NEXT:    ds_write_b128 v0, a[144:147] offset:32832
+; GCN-NEXT:    ds_write_b128 v0, a[148:151] offset:32848
+; GCN-NEXT:    ds_write_b128 v0, a[136:139] offset:32800
+; GCN-NEXT:    ds_write_b128 v0, a[140:143] offset:32816
+; GCN-NEXT:    ds_write_b128 v0, a[128:131] offset:32768
+; GCN-NEXT:    ds_write_b128 v0, a[132:135] offset:32784
+; GCN-NEXT:    s_endpgm
+;
+; EXACTCUTOFF-LABEL: test_sched_group_barrier_pipeline_interleave_EXP_MFMA:
+; EXACTCUTOFF:       ; %bb.0: ; %entry
+; EXACTCUTOFF-NEXT:    s_load_dwordx4 s[4:7], s[0:1], 0x44
+; EXACTCUTOFF-NEXT:    v_mov_b32_e32 v3, 0x3fb8aa3b
+; EXACTCUTOFF-NEXT:    s_load_dwordx2 s[2:3], s[0:1], 0x24
+; EXACTCUTOFF-NEXT:    v_mov_b32_e32 v7, 0x32a5705f
+; EXACTCUTOFF-NEXT:    v_lshlrev_b32_e32 v0, 7, v0
+; EXACTCUTOFF-NEXT:    s_waitcnt lgkmcnt(0)
+; EXACTCUTOFF-NEXT:    v_mul_f32_e32 v4, s4, v3
+; EXACTCUTOFF-NEXT:    v_rndne_f32_e32 v5, v4
+; EXACTCUTOFF-NEXT:    v_sub_f32_e32 v6, v4, v5
+; EXACTCUTOFF-NEXT:    v_fma_f32 v4, s4, v3, -v4
+; EXACTCUTOFF-NEXT:    v_fmac_f32_e32 v4, s4, v7
+; EXACTCUTOFF-NEXT:    v_add_f32_e32 v4, v6, v4
+; EXACTCUTOFF-NEXT:    v_exp_f32_e32 v4, v4
+; EXACTCUTOFF-NEXT:    v_cvt_i32_f32_e32 v5, v5
+; EXACTCUTOFF-NEXT:    v_add_u32_e32 v1, s2, v0
+; EXACTCUTOFF-NEXT:    ds_read_b128 a[28:31], v1 offset:112
+; EXACTCUTOFF-NEXT:    ds_read_b128 a[24:27], v1 offset:96
+; EXACTCUTOFF-NEXT:    ds_read_b128 a[20:23], v1 offset:80
+; EXACTCUTOFF-NEXT:    ds_read_b128 a[16:19], v1 offset:64
+; EXACTCUTOFF-NEXT:    ds_read_b128 a[0:3], v1
+; EXACTCUTOFF-NEXT:    ds_read_b128 a[4:7], v1 offset:16
+; EXACTCUTOFF-NEXT:    ds_read_b128 a[8:11], v1 offset:32
+; EXACTCUTOFF-NEXT:    ds_read_b128 a[12:15], v1 offset:48
+; EXACTCUTOFF-NEXT:    v_mov_b32_e32 v9, 1.0
+; EXACTCUTOFF-NEXT:    v_ldexp_f32 v4, v4, v5
+; EXACTCUTOFF-NEXT:    v_mov_b32_e32 v5, 0xc2ce8ed0
+; EXACTCUTOFF-NEXT:    v_mul_f32_e32 v10, s5, v3
+; EXACTCUTOFF-NEXT:    v_cmp_nlt_f32_e32 vcc, s4, v5
+; EXACTCUTOFF-NEXT:    v_mov_b32_e32 v6, 0x42b17218
+; EXACTCUTOFF-NEXT:    v_rndne_f32_e32 v11, v10
+; EXACTCUTOFF-NEXT:    v_cndmask_b32_e32 v4, 0, v4, vcc
+; EXACTCUTOFF-NEXT:    v_cmp_ngt_f32_e32 vcc, s4, v6
+; EXACTCUTOFF-NEXT:    v_mov_b32_e32 v8, 0x7f800000
+; EXACTCUTOFF-NEXT:    v_sub_f32_e32 v12, v10, v11
+; EXACTCUTOFF-NEXT:    v_fma_f32 v10, s5, v3, -v10
+; EXACTCUTOFF-NEXT:    v_cndmask_b32_e32 v4, v8, v4, vcc
+; EXACTCUTOFF-NEXT:    v_fmac_f32_e32 v10, s5, v7
+; EXACTCUTOFF-NEXT:    ds_read_b128 a[60:63], v1 offset:8304
+; EXACTCUTOFF-NEXT:    s_waitcnt lgkmcnt(1)
+; EXACTCUTOFF-NEXT:    v_mfma_f32_32x32x1f32 a[0:31], v9, v4, a[0:31]
+; EXACTCUTOFF-NEXT:    v_add_f32_e32 v4, v12, v10
+; EXACTCUTOFF-NEXT:    v_exp_f32_e32 v4, v4
+; EXACTCUTOFF-NEXT:    v_cvt_i32_f32_e32 v10, v11
+; EXACTCUTOFF-NEXT:    ds_read_b128 a[56:59], v1 offset:8288
+; EXACTCUTOFF-NEXT:    ds_read_b128 a[52:55], v1 offset:8272
+; EXACTCUTOFF-NEXT:    ds_read_b128 a[48:51], v1 offset:8256
+; EXACTCUTOFF-NEXT:    ds_read_b128 a[44:47], v1 offset:8240
+; EXACTCUTOFF-NEXT:    ds_read_b128 a[40:43], v1 offset:8224
+; EXACTCUTOFF-NEXT:    ds_read_b128 a[36:39], v1 offset:8208
+; EXACTCUTOFF-NEXT:    ds_read_b128 a[32:35], v1 offset:8192
+; EXACTCUTOFF-NEXT:    v_ldexp_f32 v4, v4, v10
+; EXACTCUTOFF-NEXT:    v_cmp_nlt_f32_e32 vcc, s5, v5
+; EXACTCUTOFF-NEXT:    v_cndmask_b32_e32 v4, 0, v4, vcc
+; EXACTCUTOFF-NEXT:    v_cmp_ngt_f32_e32 vcc, s5, v6
+; EXACTCUTOFF-NEXT:    v_cndmask_b32_e32 v4, v8, v4, vcc
+; EXACTCUTOFF-NEXT:    v_mul_f32_e32 v10, s6, v3
+; EXACTCUTOFF-NEXT:    v_rndne_f32_e32 v11, v10
+; EXACTCUTOFF-NEXT:    s_waitcnt lgkmcnt(0)
+; EXACTCUTOFF-NEXT:    v_mfma_f32_32x32x1f32 a[32:63], v9, v4, a[32:63]
+; EXACTCUTOFF-NEXT:    v_fma_f32 v4, s6, v3, -v10
+; EXACTCUTOFF-NEXT:    v_sub_f32_e32 v12, v10, v11
+; EXACTCUTOFF-NEXT:    v_fmac_f32_e32 v4, s6, v7
+; EXACTCUTOFF-NEXT:    v_add_f32_e32 v4, v12, v4
+; EXACTCUTOFF-NEXT:    v_exp_f32_e32 v4, v4
+; EXACTCUTOFF-NEXT:    v_cvt_i32_f32_e32 v10, v11
+; EXACTCUTOFF-NEXT:    ds_read_b128 a[92:95], v1 offset:24688
+; EXACTCUTOFF-NEXT:    ds_read_b128 a[88:91], v1 offset:24672
+; EXACTCUTOFF-NEXT:    ds_read_b128 a[84:87], v1 offset:24656
+; EXACTCUTOFF-NEXT:    ds_read_b128 a[80:83], v1 offset:24640
+; EXACTCUTOFF-NEXT:    ds_read_b128 a[76:79], v1 offset:24624
+; EXACTCUTOFF-NEXT:    ds_read_b128 a[72:75], v1 offset:24608
+; EXACTCUTOFF-NEXT:    ds_read_b128 a[68:71], v1 offset:24592
+; EXACTCUTOFF-NEXT:    ds_read_b128 a[64:67], v1 offset:24576
+; EXACTCUTOFF-NEXT:    v_add_u32_e32 v2, 0x6000, v1
+; EXACTCUTOFF-NEXT:    ds_read_b128 a[124:127], v1 offset:49264
+; EXACTCUTOFF-NEXT:    ds_read_b128 a[120:123], v1 offset:49248
+; EXACTCUTOFF-NEXT:    ds_read_b128 a[116:119], v1 offset:49232
+; EXACTCUTOFF-NEXT:    ds_read_b128 a[112:115], v1 offset:49216
+; EXACTCUTOFF-NEXT:    ds_read_b128 a[108:111], v1 offset:49200
+; EXACTCUTOFF-NEXT:    ds_read_b128 a[104:107], v1 offset:49184
+; EXACTCUTOFF-NEXT:    ds_read_b128 a[100:103], v1 offset:49168
+; EXACTCUTOFF-NEXT:    ds_read_b128 a[96:99], v1 offset:49152
+; EXACTCUTOFF-NEXT:    v_ldexp_f32 v1, v4, v10
+; EXACTCUTOFF-NEXT:    v_cmp_nlt_f32_e32 vcc, s6, v5
+; EXACTCUTOFF-NEXT:    v_cndmask_b32_e32 v1, 0, v1, vcc
+; EXACTCUTOFF-NEXT:    v_cmp_ngt_f32_e32 vcc, s6, v6
+; EXACTCUTOFF-NEXT:    v_mul_f32_e32 v4, s7, v3
+; EXACTCUTOFF-NEXT:    v_cndmask_b32_e32 v1, v8, v1, vcc
+; EXACTCUTOFF-NEXT:    v_rndne_f32_e32 v10, v4
+; EXACTCUTOFF-NEXT:    s_load_dword s8, s[0:1], 0x54
+; EXACTCUTOFF-NEXT:    s_waitcnt lgkmcnt(0)
+; EXACTCUTOFF-NEXT:    v_mfma_f32_32x32x1f32 a[64:95], v9, v1, a[64:95]
+; EXACTCUTOFF-NEXT:    v_sub_f32_e32 v1, v4, v10
+; EXACTCUTOFF-NEXT:    v_fma_f32 v4, s7, v3, -v4
+; EXACTCUTOFF-NEXT:    v_fmac_f32_e32 v4, s7, v7
+; EXACTCUTOFF-NEXT:    v_add_f32_e32 v1, v1, v4
+; EXACTCUTOFF-NEXT:    v_exp_f32_e32 v1, v1
+; EXACTCUTOFF-NEXT:    v_cvt_i32_f32_e32 v4, v10
+; EXACTCUTOFF-NEXT:    v_cmp_nlt_f32_e32 vcc, s7, v5
+; EXACTCUTOFF-NEXT:    ds_read_b128 a[156:159], v2 offset:57456
+; EXACTCUTOFF-NEXT:    ds_read_b128 a[152:155], v2 offset:57440
+; EXACTCUTOFF-NEXT:    v_ldexp_f32 v1, v1, v4
+; EXACTCUTOFF-NEXT:    v_cndmask_b32_e32 v1, 0, v1, vcc
+; EXACTCUTOFF-NEXT:    v_cmp_ngt_f32_e32 vcc, s7, v6
+; EXACTCUTOFF-NEXT:    v_cndmask_b32_e32 v1, v8, v1, vcc
+; EXACTCUTOFF-NEXT:    v_mul_f32_e32 v4, s8, v3
+; EXACTCUTOFF-NEXT:    v_fma_f32 v3, s8, v3, -v4
+; EXACTCUTOFF-NEXT:    v_mfma_f32_32x32x1f32 a[96:127], v9, v1, a[96:127]
+; EXACTCUTOFF-NEXT:    v_rndne_f32_e32 v1, v4
+; EXACTCUTOFF-NEXT:    v_sub_f32_e32 v10, v4, v1
+; EXACTCUTOFF-NEXT:    v_fmac_f32_e32 v3, s8, v7
+; EXACTCUTOFF-NEXT:    v_add_f32_e32 v3, v10, v3
+; EXACTCUTOFF-NEXT:    v_exp_f32_e32 v3, v3
+; EXACTCUTOFF-NEXT:    v_cvt_i32_f32_e32 v1, v1
+; EXACTCUTOFF-NEXT:    ds_read_b128 a[148:151], v2 offset:57424
+; EXACTCUTOFF-NEXT:    ds_read_b128 a[144:147], v2 offset:57408
+; EXACTCUTOFF-NEXT:    ds_read_b128 a[128:131], v2 offset:57344
+; EXACTCUTOFF-NEXT:    ds_read_b128 a[132:135], v2 offset:57360
+; EXACTCUTOFF-NEXT:    ds_read_b128 a[136:139], v2 offset:57376
+; EXACTCUTOFF-NEXT:    ds_read_b128 a[140:143], v2 offset:57392
+; EXACTCUTOFF-NEXT:    v_ldexp_f32 v1, v3, v1
+; EXACTCUTOFF-NEXT:    v_cmp_nlt_f32_e32 vcc, s8, v5
+; EXACTCUTOFF-NEXT:    v_cndmask_b32_e32 v1, 0, v1, vcc
+; EXACTCUTOFF-NEXT:    v_cmp_ngt_f32_e32 vcc, s8, v6
+; EXACTCUTOFF-NEXT:    v_cndmask_b32_e32 v1, v8, v1, vcc
+; EXACTCUTOFF-NEXT:    v_add_u32_e32 v0, s3, v0
+; EXACTCUTOFF-NEX...
[truncated]

Change-Id: I51d8e21ed54ee2adedea299aba1ad423db69a473
@jrbyrnes
Contributor Author

The SchedBarrier mask was not behaving as expected -- reworked the mask inversion and added a mask inversion test.
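As a sanity check on the reworked inversion, here is a Python model of the implication rules visible in the invertSchedBarrierMask hunk above (the function name and the masking against ALL are assumptions of this sketch; only the rules shown in the diff are modeled -- the remaining VMEM_READ/VMEM_WRITE and DS rules follow the same pattern):

```python
# Bit values mirror the SchedGroupMask enum in AMDGPUIGroupLP.cpp.
ALU, VALU, SALU, MFMA = 1 << 0, 1 << 1, 1 << 2, 1 << 3
VMEM, VMEM_READ, VMEM_WRITE = 1 << 4, 1 << 5, 1 << 6
DS, DS_READ, DS_WRITE, TRANS = 1 << 7, 1 << 8, 1 << 9, 1 << 10
ALL = 0x7FF

def invert_sched_barrier_mask(mask: int) -> int:
    """Model (not the real pass) of the implication rules after this patch."""
    inv = ~mask & ALL
    # ALU implies VALU, SALU, MFMA, TRANS.
    if inv & ALU == 0:
        inv &= ~(VALU | SALU | MFMA | TRANS)
    # SALU, MFMA implies ALU.
    elif inv & SALU == 0 or inv & MFMA == 0:
        inv &= ~ALU
    # TRANS implies VALU, ALU.
    if inv & TRANS == 0:
        inv &= ~(VALU | ALU)
    # VALU implies TRANS, ALU.
    if inv & VALU == 0:
        inv &= ~(TRANS | ALU)
    # VMEM implies VMEM_READ, VMEM_WRITE.
    if inv & VMEM == 0:
        inv &= ~(VMEM_READ | VMEM_WRITE)
    return inv

# E.g. a sched_barrier mask of TRANS: the inverted mask must drop TRANS and,
# by the new implications, VALU and ALU as well.
inverted = invert_sched_barrier_mask(TRANS)
```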

@jrbyrnes jrbyrnes merged commit f1156fb into llvm:main Dec 20, 2023
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Feb 29, 2024
Makes constructing SchedGroups of this type easier, and provides ability
to create them with __builtin_amdgcn_sched_group_barrier

Change-Id: I9e10e5dfb87f8717bdb8298d71c05bf82b17249b
jrbyrnes added a commit to jrbyrnes/llvm-project that referenced this pull request Apr 1, 2024
Makes constructing SchedGroups of this type easier, and provides ability
to create them with __builtin_amdgcn_sched_group_barrier

Change-Id: Iac3f0d471d81420b7b89d0b26d17cd54f72bb406