
Commit 800bb84

jrbyrnes authored and bcahoon committed
[AMDGPU][IGLP]: Add SchedGroupMask::TRANS (llvm#75416)
Makes constructing SchedGroups of this type easier, and provides the ability to create them with __builtin_amdgcn_sched_group_barrier.

Change-Id: I9e10e5dfb87f8717bdb8298d71c05bf82b17249b
1 parent df9a03c commit 800bb84

File tree

4 files changed (+673 -9 lines changed)


llvm/docs/AMDGPUUsage.rst

Lines changed: 113 additions & 0 deletions
@@ -1025,6 +1025,119 @@ The AMDGPU backend implements the following LLVM IR intrinsics.
                                      reduction will be performed using default iterative strategy.
                                      Intrinsic is currently only implemented for i32.

+  llvm.amdgcn.udot2                  Provides direct access to v_dot2_u32_u16 across targets which
+                                     support such instructions. This performs an unsigned dot product
+                                     with two v2i16 operands, summed with the third i32 operand. The
+                                     fourth i1 operand is used to clamp the output.
+
+  llvm.amdgcn.udot4                  Provides direct access to v_dot4_u32_u8 across targets which
+                                     support such instructions. This performs an unsigned dot product
+                                     with two i32 operands (holding a vector of 4 8-bit values), summed
+                                     with the third i32 operand. The fourth i1 operand is used to clamp
+                                     the output.
+
+  llvm.amdgcn.udot8                  Provides direct access to v_dot8_u32_u4 across targets which
+                                     support such instructions. This performs an unsigned dot product
+                                     with two i32 operands (holding a vector of 8 4-bit values), summed
+                                     with the third i32 operand. The fourth i1 operand is used to clamp
+                                     the output.
+
+  llvm.amdgcn.sdot2                  Provides direct access to v_dot2_i32_i16 across targets which
+                                     support such instructions. This performs a signed dot product
+                                     with two v2i16 operands, summed with the third i32 operand. The
+                                     fourth i1 operand is used to clamp the output.
+                                     When applicable (e.g. no clamping), this is lowered into
+                                     v_dot2c_i32_i16 for targets which support it.
+
+  llvm.amdgcn.sdot4                  Provides direct access to v_dot4_i32_i8 across targets which
+                                     support such instructions. This performs a signed dot product
+                                     with two i32 operands (holding a vector of 4 8-bit values), summed
+                                     with the third i32 operand. The fourth i1 operand is used to clamp
+                                     the output.
+                                     When applicable (i.e. no clamping / operand modifiers), this is lowered
+                                     into v_dot4c_i32_i8 for targets which support it.
+                                     RDNA3 does not offer v_dot4_i32_i8, and rather offers
+                                     v_dot4_i32_iu8, which has operands to hold the signedness of the
+                                     vector operands. Thus, this intrinsic lowers to the signed version
+                                     of this instruction for gfx11 targets.
+
+  llvm.amdgcn.sdot8                  Provides direct access to v_dot8_i32_i4 across targets which
+                                     support such instructions. This performs a signed dot product
+                                     with two i32 operands (holding a vector of 8 4-bit values), summed
+                                     with the third i32 operand. The fourth i1 operand is used to clamp
+                                     the output.
+                                     When applicable (i.e. no clamping / operand modifiers), this is lowered
+                                     into v_dot8c_i32_i4 for targets which support it.
+                                     RDNA3 does not offer v_dot8_i32_i4, and rather offers
+                                     v_dot8_i32_iu4, which has operands to hold the signedness of the
+                                     vector operands. Thus, this intrinsic lowers to the signed version
+                                     of this instruction for gfx11 targets.
+
+  llvm.amdgcn.sudot4                 Provides direct access to v_dot4_i32_iu8 on gfx11 targets. This performs
+                                     a dot product with two i32 operands (holding a vector of 4 8-bit values),
+                                     summed with the fifth i32 operand. The sixth i1 operand is used to clamp
+                                     the output. The i1s preceding the vector operands decide the signedness.
+
+  llvm.amdgcn.sudot8                 Provides direct access to v_dot8_i32_iu4 on gfx11 targets. This performs
+                                     a dot product with two i32 operands (holding a vector of 8 4-bit values),
+                                     summed with the fifth i32 operand. The sixth i1 operand is used to clamp
+                                     the output. The i1s preceding the vector operands decide the signedness.
+
+  llvm.amdgcn.sched_barrier          Controls the types of instructions that may be allowed to cross the intrinsic
+                                     during instruction scheduling. The parameter is a mask for the instruction types
+                                     that can cross the intrinsic.
+
+                                     - 0x0000: No instructions may be scheduled across sched_barrier.
+                                     - 0x0001: All non-memory, non-side-effect-producing instructions may be
+                                       scheduled across sched_barrier, *i.e.* allow ALU instructions to pass.
+                                     - 0x0002: VALU instructions may be scheduled across sched_barrier.
+                                     - 0x0004: SALU instructions may be scheduled across sched_barrier.
+                                     - 0x0008: MFMA/WMMA instructions may be scheduled across sched_barrier.
+                                     - 0x0010: All VMEM instructions may be scheduled across sched_barrier.
+                                     - 0x0020: VMEM read instructions may be scheduled across sched_barrier.
+                                     - 0x0040: VMEM write instructions may be scheduled across sched_barrier.
+                                     - 0x0080: All DS instructions may be scheduled across sched_barrier.
+                                     - 0x0100: All DS read instructions may be scheduled across sched_barrier.
+                                     - 0x0200: All DS write instructions may be scheduled across sched_barrier.
+                                     - 0x0400: All transcendental (e.g. V_EXP) instructions may be scheduled
+                                       across sched_barrier.
+
+  llvm.amdgcn.sched_group_barrier    Creates schedule groups with specific properties to create custom scheduling
+                                     pipelines. The ordering between groups is enforced by the instruction scheduler.
+                                     The intrinsic applies to the code that precedes the intrinsic. The intrinsic
+                                     takes three values that control the behavior of the schedule groups.
+
+                                     - Mask : Classify instruction groups using the llvm.amdgcn.sched_barrier mask values.
+                                     - Size : The number of instructions that are in the group.
+                                     - SyncID : Order is enforced between groups with matching values.
+
+                                     The mask can include multiple instruction types. It is undefined behavior to set
+                                     values beyond the range of valid masks.
+
+                                     Combining multiple sched_group_barrier intrinsics enables an ordering of specific
+                                     instruction types during instruction scheduling. For example, the following enforces
+                                     a sequence of 1 VMEM read, followed by 1 VALU instruction, followed by 5 MFMA
+                                     instructions.
+
+                                     | ``// 1 VMEM read``
+                                     | ``__builtin_amdgcn_sched_group_barrier(32, 1, 0)``
+                                     | ``// 1 VALU``
+                                     | ``__builtin_amdgcn_sched_group_barrier(2, 1, 0)``
+                                     | ``// 5 MFMA``
+                                     | ``__builtin_amdgcn_sched_group_barrier(8, 5, 0)``
+
+  llvm.amdgcn.iglp_opt               An **experimental** intrinsic for instruction group level parallelism. The intrinsic
+                                     implements predefined instruction scheduling orderings. The intrinsic applies to the
+                                     surrounding scheduling region. The intrinsic takes a value that specifies the
+                                     strategy. The compiler implements two strategies.
+
+                                     0. Interleave DS and MFMA instructions for small GEMM kernels.
+                                     1. Interleave DS and MFMA instructions for single wave small GEMM kernels.
+
+                                     Only one iglp_opt intrinsic may be used in a scheduling region. The iglp_opt intrinsic
+                                     cannot be combined with sched_barrier or sched_group_barrier.
+
+                                     The iglp_opt strategy implementations are subject to change.
+
   ============================================== ==========================================================

 .. TODO::

llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp

Lines changed: 21 additions & 8 deletions
@@ -75,8 +75,9 @@ enum class SchedGroupMask {
   DS = 1u << 7,
   DS_READ = 1u << 8,
   DS_WRITE = 1u << 9,
+  TRANS = 1u << 10,
   ALL = ALU | VALU | SALU | MFMA | VMEM | VMEM_READ | VMEM_WRITE | DS |
-        DS_READ | DS_WRITE,
+        DS_READ | DS_WRITE | TRANS,
   LLVM_MARK_AS_BITMASK_ENUM(/* LargestFlag = */ ALL)
 };

@@ -1441,11 +1442,12 @@ bool SchedGroup::canAddMI(const MachineInstr &MI) const {
     Result = false;

   else if (((SGMask & SchedGroupMask::ALU) != SchedGroupMask::NONE) &&
-           (TII->isVALU(MI) || TII->isMFMAorWMMA(MI) || TII->isSALU(MI)))
+           (TII->isVALU(MI) || TII->isMFMAorWMMA(MI) || TII->isSALU(MI) ||
+            TII->isTRANS(MI)))
     Result = true;

   else if (((SGMask & SchedGroupMask::VALU) != SchedGroupMask::NONE) &&
-           TII->isVALU(MI) && !TII->isMFMAorWMMA(MI))
+           TII->isVALU(MI) && !TII->isMFMAorWMMA(MI) && !TII->isTRANS(MI))
     Result = true;

   else if (((SGMask & SchedGroupMask::SALU) != SchedGroupMask::NONE) &&

@@ -1482,6 +1484,10 @@ bool SchedGroup::canAddMI(const MachineInstr &MI) const {
            MI.mayStore() && TII->isDS(MI))
     Result = true;

+  else if (((SGMask & SchedGroupMask::TRANS) != SchedGroupMask::NONE) &&
+           TII->isTRANS(MI))
+    Result = true;
+
   LLVM_DEBUG(
       dbgs() << "For SchedGroup with mask " << format_hex((int)SGMask, 10, true)
              << (Result ? " could classify " : " unable to classify ") << MI);
@@ -1641,10 +1647,13 @@ void IGroupLPDAGMutation::addSchedBarrierEdges(SUnit &SchedBarrier) {
   // Remove all existing edges from the SCHED_BARRIER that were added due to the
   // instruction having side effects.
   resetEdges(SchedBarrier, DAG);
+  LLVM_DEBUG(dbgs() << "Building SchedGroup for SchedBarrier with Mask: "
+                    << MI.getOperand(0).getImm() << "\n");
   auto InvertedMask =
       invertSchedBarrierMask((SchedGroupMask)MI.getOperand(0).getImm());
   SchedGroup SG(InvertedMask, std::nullopt, DAG, TII);
   SG.initSchedGroup();
+
   // Preserve original instruction ordering relative to the SCHED_BARRIER.
   SG.link(
       SchedBarrier,

@@ -1658,14 +1667,15 @@ IGroupLPDAGMutation::invertSchedBarrierMask(SchedGroupMask Mask) const {
   // allowed past the SCHED_BARRIER.
   SchedGroupMask InvertedMask = ~Mask;

-  // ALU implies VALU, SALU, MFMA.
+  // ALU implies VALU, SALU, MFMA, TRANS.
   if ((InvertedMask & SchedGroupMask::ALU) == SchedGroupMask::NONE)
-    InvertedMask &=
-        ~SchedGroupMask::VALU & ~SchedGroupMask::SALU & ~SchedGroupMask::MFMA;
-  // VALU, SALU, MFMA implies ALU.
+    InvertedMask &= ~SchedGroupMask::VALU & ~SchedGroupMask::SALU &
+                    ~SchedGroupMask::MFMA & ~SchedGroupMask::TRANS;
+  // VALU, SALU, MFMA, TRANS implies ALU.
   else if ((InvertedMask & SchedGroupMask::VALU) == SchedGroupMask::NONE ||
            (InvertedMask & SchedGroupMask::SALU) == SchedGroupMask::NONE ||
-           (InvertedMask & SchedGroupMask::MFMA) == SchedGroupMask::NONE)
+           (InvertedMask & SchedGroupMask::MFMA) == SchedGroupMask::NONE ||
+           (InvertedMask & SchedGroupMask::TRANS) == SchedGroupMask::NONE)
     InvertedMask &= ~SchedGroupMask::ALU;

   // VMEM implies VMEM_READ, VMEM_WRITE.

@@ -1684,6 +1694,9 @@ IGroupLPDAGMutation::invertSchedBarrierMask(SchedGroupMask Mask) const {
            (InvertedMask & SchedGroupMask::DS_WRITE) == SchedGroupMask::NONE)
     InvertedMask &= ~SchedGroupMask::DS;

+  LLVM_DEBUG(dbgs() << "After Inverting, SchedGroup Mask: " << (int)InvertedMask
+                    << "\n");
+
   return InvertedMask;
 }
