Commit 8dce41d

[AMDGPU][IGLP]: Add SchedGroupMask::TRANS (llvm#75416)
Makes constructing SchedGroups of this type easier, and provides the ability to create them with __builtin_amdgcn_sched_group_barrier.

Change-Id: Iac3f0d471d81420b7b89d0b26d17cd54f72bb406
1 parent e187d2f commit 8dce41d


4 files changed: +614 -9 lines changed


llvm/docs/AMDGPUUsage.rst

Lines changed: 54 additions & 0 deletions
@@ -1098,6 +1098,60 @@ The AMDGPU backend implements the following LLVM IR intrinsics.
 with the fifth i32 operand. The i1 sixth operand is used to clamp
 the output. The i1s preceding the vector operands decide the signedness.

+ llvm.amdgcn.sched_barrier Controls the types of instructions that may be allowed to cross the intrinsic
+     during instruction scheduling. The parameter is a mask for the instruction types
+     that can cross the intrinsic.
+
+     - 0x0000: No instructions may be scheduled across sched_barrier.
+     - 0x0001: All, non-memory, non-side-effect producing instructions may be
+       scheduled across sched_barrier, *i.e.* allow ALU instructions to pass.
+     - 0x0002: VALU instructions may be scheduled across sched_barrier.
+     - 0x0004: SALU instructions may be scheduled across sched_barrier.
+     - 0x0008: MFMA/WMMA instructions may be scheduled across sched_barrier.
+     - 0x0010: All VMEM instructions may be scheduled across sched_barrier.
+     - 0x0020: VMEM read instructions may be scheduled across sched_barrier.
+     - 0x0040: VMEM write instructions may be scheduled across sched_barrier.
+     - 0x0080: All DS instructions may be scheduled across sched_barrier.
+     - 0x0100: All DS read instructions may be scheduled across sched_barrier.
+     - 0x0200: All DS write instructions may be scheduled across sched_barrier.
+     - 0x0400: All Transcendental (e.g. V_EXP) instructions may be scheduled across sched_barrier.
+
+ llvm.amdgcn.sched_group_barrier Creates schedule groups with specific properties to create custom scheduling
+     pipelines. The ordering between groups is enforced by the instruction scheduler.
+     The intrinsic applies to the code that precedes the intrinsic. The intrinsic
+     takes three values that control the behavior of the schedule groups.
+
+     - Mask : Classify instruction groups using the llvm.amdgcn.sched_barrier mask values.
+     - Size : The number of instructions that are in the group.
+     - SyncID : Order is enforced between groups with matching values.
+
+     The mask can include multiple instruction types. It is undefined behavior to set
+     values beyond the range of valid masks.
+
+     Combining multiple sched_group_barrier intrinsics enables an ordering of specific
+     instruction types during instruction scheduling. For example, the following enforces
+     a sequence of 1 VMEM read, followed by 1 VALU instruction, followed by 5 MFMA
+     instructions.
+
+     | ``// 1 VMEM read``
+     | ``__builtin_amdgcn_sched_group_barrier(32, 1, 0)``
+     | ``// 1 VALU``
+     | ``__builtin_amdgcn_sched_group_barrier(2, 1, 0)``
+     | ``// 5 MFMA``
+     | ``__builtin_amdgcn_sched_group_barrier(8, 5, 0)``
+
+ llvm.amdgcn.iglp_opt An **experimental** intrinsic for instruction group level parallelism. The intrinsic
+     implements predefined instruction scheduling orderings. The intrinsic applies to the
+     surrounding scheduling region. The intrinsic takes a value that specifies the
+     strategy. The compiler implements two strategies.
+
+     0. Interleave DS and MFMA instructions for small GEMM kernels.
+     1. Interleave DS and MFMA instructions for single wave small GEMM kernels.
+
+     Only one iglp_opt intrinsic may be used in a scheduling region. The iglp_opt intrinsic
+     cannot be combined with sched_barrier or sched_group_barrier.
+
+     The iglp_opt strategy implementations are subject to change.

 ============================================== ==========================================================

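A minimal usage sketch of the new TRANS category driven from source with the builtins documented above. The 0x0400 mask value and the (Mask, Size, SyncID) signature come from the table in this commit; the placement and group sizes below are illustrative assumptions, not part of the commit.

// Allow only transcendental (TRANS) instructions to be scheduled across this point.
__builtin_amdgcn_sched_barrier(0x0400);

// Enforce a pipeline of 2 TRANS instructions followed by 1 MFMA; both groups
// share SyncID 0 so the scheduler enforces their relative order.
__builtin_amdgcn_sched_group_barrier(0x0400, 2, 0); // 2 TRANS
__builtin_amdgcn_sched_group_barrier(0x0008, 1, 0); // 1 MFMA
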
llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp

Lines changed: 21 additions & 8 deletions
@@ -75,8 +75,9 @@ enum class SchedGroupMask {
   DS = 1u << 7,
   DS_READ = 1u << 8,
   DS_WRITE = 1u << 9,
+  TRANS = 1u << 10,
   ALL = ALU | VALU | SALU | MFMA | VMEM | VMEM_READ | VMEM_WRITE | DS |
-        DS_READ | DS_WRITE,
+        DS_READ | DS_WRITE | TRANS,
   LLVM_MARK_AS_BITMASK_ENUM(/* LargestFlag = */ ALL)
 };

@@ -1437,11 +1438,12 @@ bool SchedGroup::canAddMI(const MachineInstr &MI) const {
     Result = false;

   else if (((SGMask & SchedGroupMask::ALU) != SchedGroupMask::NONE) &&
-           (TII->isVALU(MI) || TII->isMFMAorWMMA(MI) || TII->isSALU(MI)))
+           (TII->isVALU(MI) || TII->isMFMAorWMMA(MI) || TII->isSALU(MI) ||
+            TII->isTRANS(MI)))
     Result = true;

   else if (((SGMask & SchedGroupMask::VALU) != SchedGroupMask::NONE) &&
-           TII->isVALU(MI) && !TII->isMFMAorWMMA(MI))
+           TII->isVALU(MI) && !TII->isMFMAorWMMA(MI) && !TII->isTRANS(MI))
     Result = true;

   else if (((SGMask & SchedGroupMask::SALU) != SchedGroupMask::NONE) &&
@@ -1478,6 +1480,10 @@ bool SchedGroup::canAddMI(const MachineInstr &MI) const {
            MI.mayStore() && TII->isDS(MI))
     Result = true;

+  else if (((SGMask & SchedGroupMask::TRANS) != SchedGroupMask::NONE) &&
+           TII->isTRANS(MI))
+    Result = true;
+
   LLVM_DEBUG(
       dbgs() << "For SchedGroup with mask " << format_hex((int)SGMask, 10, true)
              << (Result ? " could classify " : " unable to classify ") << MI);
@@ -1637,10 +1643,13 @@ void IGroupLPDAGMutation::addSchedBarrierEdges(SUnit &SchedBarrier) {
   // Remove all existing edges from the SCHED_BARRIER that were added due to the
   // instruction having side effects.
   resetEdges(SchedBarrier, DAG);
+  LLVM_DEBUG(dbgs() << "Building SchedGroup for SchedBarrier with Mask: "
+                    << MI.getOperand(0).getImm() << "\n");
   auto InvertedMask =
       invertSchedBarrierMask((SchedGroupMask)MI.getOperand(0).getImm());
   SchedGroup SG(InvertedMask, std::nullopt, DAG, TII);
   SG.initSchedGroup();
+
   // Preserve original instruction ordering relative to the SCHED_BARRIER.
   SG.link(
       SchedBarrier,
@@ -1654,14 +1663,15 @@ IGroupLPDAGMutation::invertSchedBarrierMask(SchedGroupMask Mask) const {
   // allowed past the SCHED_BARRIER.
   SchedGroupMask InvertedMask = ~Mask;

-  // ALU implies VALU, SALU, MFMA.
+  // ALU implies VALU, SALU, MFMA, TRANS.
   if ((InvertedMask & SchedGroupMask::ALU) == SchedGroupMask::NONE)
-    InvertedMask &=
-        ~SchedGroupMask::VALU & ~SchedGroupMask::SALU & ~SchedGroupMask::MFMA;
-  // VALU, SALU, MFMA implies ALU.
+    InvertedMask &= ~SchedGroupMask::VALU & ~SchedGroupMask::SALU &
+                    ~SchedGroupMask::MFMA & ~SchedGroupMask::TRANS;
+  // VALU, SALU, MFMA, TRANS implies ALU.
   else if ((InvertedMask & SchedGroupMask::VALU) == SchedGroupMask::NONE ||
            (InvertedMask & SchedGroupMask::SALU) == SchedGroupMask::NONE ||
-           (InvertedMask & SchedGroupMask::MFMA) == SchedGroupMask::NONE)
+           (InvertedMask & SchedGroupMask::MFMA) == SchedGroupMask::NONE ||
+           (InvertedMask & SchedGroupMask::TRANS) == SchedGroupMask::NONE)
     InvertedMask &= ~SchedGroupMask::ALU;

   // VMEM implies VMEM_READ, VMEM_WRITE.
// VMEM implies VMEM_READ, VMEM_WRITE.
@@ -1680,6 +1690,9 @@ IGroupLPDAGMutation::invertSchedBarrierMask(SchedGroupMask Mask) const {
        (InvertedMask & SchedGroupMask::DS_WRITE) == SchedGroupMask::NONE)
     InvertedMask &= ~SchedGroupMask::DS;

+  LLVM_DEBUG(dbgs() << "After Inverting, SchedGroup Mask: " << (int)InvertedMask
+                    << "\n");
+
   return InvertedMask;
 }