@@ -1025,6 +1025,119 @@ The AMDGPU backend implements the following LLVM IR intrinsics.
                                                  reduction will be performed using default iterative strategy.
                                                  Intrinsic is currently only implemented for i32.
+  llvm.amdgcn.udot2                              Provides direct access to v_dot2_u32_u16 across targets which
+                                                 support such instructions. This performs unsigned dot product
+                                                 with two v2i16 operands, summed with the third i32 operand. The
+                                                 i1 fourth operand is used to clamp the output.
+
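+                                                 A minimal sketch through the corresponding clang builtin
+                                                 (assumed to mirror the intrinsic operand order; ``a`` and ``b``
+                                                 are hypothetical two-element 16-bit vectors, ``acc`` an unsigned
+                                                 int, and the clamp flag must be a constant):
+
+                                                 | ``// r = a[0]*b[0] + a[1]*b[1] + acc``
+                                                 | ``unsigned r = __builtin_amdgcn_udot2(a, b, acc, false);``
+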
+  llvm.amdgcn.udot4                              Provides direct access to v_dot4_u32_u8 across targets which
+                                                 support such instructions. This performs unsigned dot product
+                                                 with two i32 operands (each holding a vector of four 8-bit
+                                                 values), summed with the third i32 operand. The i1 fourth
+                                                 operand is used to clamp the output.
+
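+                                                 A sketch via the corresponding clang builtin (assumed; each i32
+                                                 operand packs four unsigned 8-bit lanes):
+
+                                                 | ``// r = a.u8[0]*b.u8[0] + ... + a.u8[3]*b.u8[3] + acc``
+                                                 | ``unsigned r = __builtin_amdgcn_udot4(a, b, acc, false);``
+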
+  llvm.amdgcn.udot8                              Provides direct access to v_dot8_u32_u4 across targets which
+                                                 support such instructions. This performs unsigned dot product
+                                                 with two i32 operands (each holding a vector of eight 4-bit
+                                                 values), summed with the third i32 operand. The i1 fourth
+                                                 operand is used to clamp the output.
+
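+                                                 A sketch via the corresponding clang builtin (assumed; each i32
+                                                 operand packs eight unsigned 4-bit lanes):
+
+                                                 | ``// r = a.u4[0]*b.u4[0] + ... + a.u4[7]*b.u4[7] + acc``
+                                                 | ``unsigned r = __builtin_amdgcn_udot8(a, b, acc, false);``
+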
+  llvm.amdgcn.sdot2                              Provides direct access to v_dot2_i32_i16 across targets which
+                                                 support such instructions. This performs signed dot product
+                                                 with two v2i16 operands, summed with the third i32 operand. The
+                                                 i1 fourth operand is used to clamp the output.
+                                                 When applicable (e.g. no clamping), this is lowered into
+                                                 v_dot2c_i32_i16 for targets which support it.
+
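+                                                 A sketch via the corresponding clang builtin (assumed; ``a`` and
+                                                 ``b`` are hypothetical two-element signed 16-bit vectors):
+
+                                                 | ``// r = a[0]*b[0] + a[1]*b[1] + acc (signed)``
+                                                 | ``int r = __builtin_amdgcn_sdot2(a, b, acc, false);``
+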
+  llvm.amdgcn.sdot4                              Provides direct access to v_dot4_i32_i8 across targets which
+                                                 support such instructions. This performs signed dot product
+                                                 with two i32 operands (each holding a vector of four 8-bit
+                                                 values), summed with the third i32 operand. The i1 fourth
+                                                 operand is used to clamp the output.
+                                                 When applicable (i.e. no clamping / operand modifiers), this is
+                                                 lowered into v_dot4c_i32_i8 for targets which support it.
+                                                 RDNA3 does not offer v_dot4_i32_i8, instead offering
+                                                 v_dot4_i32_iu8, which has operands to hold the signedness of
+                                                 the vector operands. Thus, this intrinsic lowers to the signed
+                                                 version of this instruction for gfx11 targets.
+
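+                                                 A sketch via the corresponding clang builtin (assumed; each i32
+                                                 operand packs four signed 8-bit lanes):
+
+                                                 | ``// r = a.i8[0]*b.i8[0] + ... + a.i8[3]*b.i8[3] + acc``
+                                                 | ``int r = __builtin_amdgcn_sdot4(a, b, acc, false);``
+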
+  llvm.amdgcn.sdot8                              Provides direct access to v_dot8_i32_i4 across targets which
+                                                 support such instructions. This performs signed dot product
+                                                 with two i32 operands (each holding a vector of eight 4-bit
+                                                 values), summed with the third i32 operand. The i1 fourth
+                                                 operand is used to clamp the output.
+                                                 When applicable (i.e. no clamping / operand modifiers), this is
+                                                 lowered into v_dot8c_i32_i4 for targets which support it.
+                                                 RDNA3 does not offer v_dot8_i32_i4, instead offering
+                                                 v_dot8_i32_iu4, which has operands to hold the signedness of
+                                                 the vector operands. Thus, this intrinsic lowers to the signed
+                                                 version of this instruction for gfx11 targets.
+
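+                                                 A sketch via the corresponding clang builtin (assumed; each i32
+                                                 operand packs eight signed 4-bit lanes):
+
+                                                 | ``// r = a.i4[0]*b.i4[0] + ... + a.i4[7]*b.i4[7] + acc``
+                                                 | ``int r = __builtin_amdgcn_sdot8(a, b, acc, false);``
+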
+  llvm.amdgcn.sudot4                             Provides direct access to v_dot4_i32_iu8 on gfx11 targets. This
+                                                 performs dot product with two i32 operands (each holding a
+                                                 vector of four 8-bit values), summed with the fifth i32 operand.
+                                                 The i1 sixth operand is used to clamp the output. The i1s
+                                                 preceding the vector operands decide the signedness.
+
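+                                                 A sketch via the corresponding clang builtin (assumed to take
+                                                 the operands in the order described above; the signedness and
+                                                 clamp flags must be constants):
+
+                                                 | ``// a signed, b unsigned, no clamp``
+                                                 | ``int r = __builtin_amdgcn_sudot4(true, a, false, b, acc, false);``
+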
+  llvm.amdgcn.sudot8                             Provides direct access to v_dot8_i32_iu4 on gfx11 targets. This
+                                                 performs dot product with two i32 operands (each holding a
+                                                 vector of eight 4-bit values), summed with the fifth i32
+                                                 operand. The i1 sixth operand is used to clamp the output. The
+                                                 i1s preceding the vector operands decide the signedness.
+
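+                                                 A sketch via the corresponding clang builtin (assumed, same
+                                                 operand order as described above):
+
+                                                 | ``// a signed, b unsigned, no clamp``
+                                                 | ``int r = __builtin_amdgcn_sudot8(true, a, false, b, acc, false);``
+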
+  llvm.amdgcn.sched_barrier                      Controls the types of instructions that may be allowed to cross
+                                                 the intrinsic during instruction scheduling. The parameter is a
+                                                 mask for the instruction types that can cross the intrinsic;
+                                                 see the example following the list of mask values.
+
+                                                 - 0x0000: No instructions may be scheduled across sched_barrier.
+                                                 - 0x0001: All non-memory, non-side-effect-producing instructions
+                                                   may be scheduled across sched_barrier, *i.e.* allow ALU
+                                                   instructions to pass.
+                                                 - 0x0002: VALU instructions may be scheduled across sched_barrier.
+                                                 - 0x0004: SALU instructions may be scheduled across sched_barrier.
+                                                 - 0x0008: MFMA/WMMA instructions may be scheduled across sched_barrier.
+                                                 - 0x0010: All VMEM instructions may be scheduled across sched_barrier.
+                                                 - 0x0020: VMEM read instructions may be scheduled across sched_barrier.
+                                                 - 0x0040: VMEM write instructions may be scheduled across sched_barrier.
+                                                 - 0x0080: All DS instructions may be scheduled across sched_barrier.
+                                                 - 0x0100: All DS read instructions may be scheduled across sched_barrier.
+                                                 - 0x0200: All DS write instructions may be scheduled across sched_barrier.
+                                                 - 0x0400: All Transcendental (e.g. V_EXP) instructions may be
+                                                   scheduled across sched_barrier.
+
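+                                                 For example, to let only VALU and SALU instructions move across
+                                                 the barrier, the mask values can be combined:
+
+                                                 | ``// allow ALU work across, hold everything else``
+                                                 | ``__builtin_amdgcn_sched_barrier(0x0002 | 0x0004)``
+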
+  llvm.amdgcn.sched_group_barrier                Creates schedule groups with specific properties to create
+                                                 custom scheduling pipelines. The ordering between groups is
+                                                 enforced by the instruction scheduler. The intrinsic applies to
+                                                 the code that precedes the intrinsic. The intrinsic takes three
+                                                 values that control the behavior of the schedule groups.
+
+                                                 - Mask : Classify instruction groups using the
+                                                   llvm.amdgcn.sched_barrier mask values.
+                                                 - Size : The number of instructions that are in the group.
+                                                 - SyncID : Order is enforced between groups with matching values.
+
+                                                 The mask can include multiple instruction types. It is undefined
+                                                 behavior to set values beyond the range of valid masks.
+
+                                                 Combining multiple sched_group_barrier intrinsics enables an
+                                                 ordering of specific instruction types during instruction
+                                                 scheduling. For example, the following enforces a sequence of 1
+                                                 VMEM read, followed by 1 VALU instruction, followed by 5 MFMA
+                                                 instructions.
+
+                                                 | ``// 1 VMEM read``
+                                                 | ``__builtin_amdgcn_sched_group_barrier(32, 1, 0)``
+                                                 | ``// 1 VALU``
+                                                 | ``__builtin_amdgcn_sched_group_barrier(2, 1, 0)``
+                                                 | ``// 5 MFMA``
+                                                 | ``__builtin_amdgcn_sched_group_barrier(8, 5, 0)``
+
+  llvm.amdgcn.iglp_opt                           An **experimental** intrinsic for instruction group level
+                                                 parallelism. The intrinsic implements predefined instruction
+                                                 scheduling orderings. The intrinsic applies to the surrounding
+                                                 scheduling region. The intrinsic takes a value that specifies
+                                                 the strategy. The compiler implements two strategies.
+
+                                                 0. Interleave DS and MFMA instructions for small GEMM kernels.
+                                                 1. Interleave DS and MFMA instructions for single wave small
+                                                    GEMM kernels.
+
+                                                 Only one iglp_opt intrinsic may be used in a scheduling region.
+                                                 The iglp_opt intrinsic cannot be combined with sched_barrier or
+                                                 sched_group_barrier.
+
+                                                 The iglp_opt strategy implementations are subject to change.
+
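+                                                 For example, selecting strategy 0 through the matching clang
+                                                 builtin:
+
+                                                 | ``// interleave DS and MFMA for a small GEMM kernel``
+                                                 | ``__builtin_amdgcn_iglp_opt(0)``
+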
 ============================================== ==========================================================

.. TODO::