@@ -153,6 +153,68 @@ Example: 64-bit PTX for CUDA Driver API: ``nvptx64-nvidia-cuda``
153
153
NVPTX Intrinsics
154
154
================
155
155
156
+ Guidelines on NVPTX Intrinsic design
157
+ ------------------------------------
158
+
159
+ The NVPTX intrinsics are used to model instructions in the PTX ISA.
160
+ While simpler intrinsics can represent certain features effectively,
161
+ more complex instructions like TMA and MMA are not as straightforward
162
+ to model. A single variant of these complex instructions can expand
163
+ into hundreds of intrinsics. Additionally, any expansion in the
164
+ corresponding ISA can exponentially increase these numbers, making it
165
+ difficult to manage them in the IR and backend passes. Therefore,
166
+ a careful design of intrinsic interfaces can ease maintenance and
167
+ contribute to a sustainable, long-term solution.
168
+
169
+ Below are a set of guidelines that may help in choosing
170
+ an appropriate design for the complex intrinsics:
171
+
172
+ 1. If there are only a few intrinsics, prefer a flat design
173
+ where the intrinsic name encodes all relevant details,
174
+ and includes only the arguments used by the actual instruction.
175
+ 2. As the number of intrinsics grows, it is desirable to consolidate
176
+ them. NVPTX uses a 'flags'-based design where each flag argument
177
+ represents one set of instruction modifiers. These flags are
178
+ compile-time integer constants.
179
+
180
+ 3. When an intrinsic uses flags, document it with details of the
181
+ flag usage in the ``NVPTXUsage.rst `` file.
182
+ 4. Use i1 for boolean flags and i8 for other flag types.
183
+ 5. Annotate all flag arguments with ImmArg<ArgIdx<>>.
184
+ 6. Place the flag arguments at the end of the (actual)argument list.
185
+
186
+ 7. Identify the key features of an intrinsic and distinguish between
187
+ first-order and supplementary information. Typically, encoding the
188
+ first-order information in the intrinsic name while using flags
189
+ for supplementary details improves readability.
190
+ For example:
191
+
192
+ i. For MMA intrinsics, 'dense' vs. 'sparse' is a fundamental feature,
193
+ whereas an optional scaling applied to matrices is relatively secondary.
194
+
195
+ ii. For TMAs, the mode of copy (e.g., 'Tile' or 'Im2col') is a first-order
196
+ information, while features like an optional cache hint tend to be
197
+ secondary.
198
+
199
+ 8. If there are invalid combinations within a set of modifiers, avoid
200
+ encoding them as flags, as much as possible. This helps reduce the
201
+ need for error handling of unsupported cases in the backend.
202
+ For example, some 'cvt' intrinsics support only a subset of the
203
+ possible rounding modes; so it is preferable not to encode the
204
+ rounding modes as flags.
205
+ 9. Similarly, when there are invalid combinations across a set of
206
+ modifiers, avoid encoding them as flags to prevent additional
207
+ complexity in error handling.
208
+
209
+ 10. Maintain a consistent design within an intrinsic family, including
210
+ argument ordering as well as the usage and ordering of flags.
211
+ 11. When designing an intrinsic corresponding to an instruction or its variant,
212
+ consider the entire instruction family. This may reveal common features
213
+ that can be modelled consistently across the family.
214
+
215
+ In summary, strive to balance the aspects mentioned above, to achieve
216
+ a scalable design with maximum readability.
217
+
156
218
Reading PTX Special Registers
157
219
-----------------------------
158
220
0 commit comments