[SYCL][FUSION][DOC] Document fusion of kernels with different nd-ranges (#11466)
Add relevant information to sycl/doc/design/KernelFusionJIT.md,
including: information on the pass performing the required
transformations, restrictions on the input nd-ranges and brief
description of the process.
---------
Signed-off-by: Victor Perez <[email protected]>
sycl/doc/design/KernelFusionJIT.md (109 additions & 1 deletion)
@@ -130,7 +130,7 @@ Information about the fusion is registered within the module by attaching metada
 
 The pipeline currently consists of the following passes (in order):
 
-- `SYCLKernelFusion` performs the actual fusion process by inlining kernels to fuse inside the fused kernel
+- `SYCLKernelFusion` performs the actual fusion process by inlining kernels to fuse inside the fused kernel. If the kernels being fused do not all share the same nd-range, it also handles work-item remapping ([see](#fusing-kernels-with-different-nd-ranges))
 - Generic optimization passes: `IndVarSimplifyPass`, `LoopUnrollPass`, `SROAPass`, `InferAddressSpacesPass` to remove pointers to the generic address-space
   - These optimizations are important to help the internalizer, see note below.
 - `SYCLInternalizer` promotes buffer to local or private memory
@@ -158,7 +158,115 @@ The metadata is attached to a function that will become the fused kernel:
 - `sycl.kernel.promote`: declare identical parameters to be promoted. Contains a list of index (of the fused kernel, after identical arguments elision) and `private` if the argument is to be promoted to private memory or `local` if it is to local.
 - `sycl.kernel.promote.size`: declare the address space size for the promoted memory. Contains a list of indexes (of the fused kernel, after identical arguments elision) and the number of elements.
 - `sycl.kernel.constants`: declare the value of a scalar or aggregate to be used as constant values. Contains a list of indexes (of the fused kernel, after identical arguments elision) and the value as a string. Note: the string is used to store the value, the string is read as a buffer of char and reinterpreted into the value of the argument's type.
+- `sycl.kernel.nd-range`: declare the nd-range to be used by the fused kernel in case work-item remapping was needed. It is a tuple with 4 elements:
+  - `num_dims`: scalar integer representing the number of dimensions of the nd-range;
+  - `global_size`: triple representing nd-range global size, an element for each dimension, using `0` for unused dimensions;
+  - `local_size`: triple representing nd-range local size, an element for each dimension, using `0` for unused dimensions. If the local size is not specified, all elements will be 0;
+  - `offset`: triple representing nd-range offset, an element for each dimension, using `0` for unused dimensions.
+- `sycl.kernel.nd-ranges`: declare the nd-range of each original kernel. This information is used by the `SYCLKernelFusion` pass to perform work-item remapping. It is a list with references to tuples like the one contained in `sycl.kernel.nd-range`. Constraints on the legal combinations of nd-ranges are described in [the corresponding section](#fusing-kernels-with-different-nd-ranges).
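To make the shape of the `sycl.kernel.nd-range` tuple concrete, here is a minimal C++ sketch of an in-memory mirror of it. The type `NDRangeMD` and both helper functions are hypothetical names chosen for illustration; they are not taken from the JIT sources, and the sketch assumes that a specified local size has a non-zero first component.

```cpp
#include <array>
#include <cstddef>

// Hypothetical mirror of one `sycl.kernel.nd-range` tuple. Type and member
// layout are illustrative only.
struct NDRangeMD {
  int num_dims;                           // number of used dimensions
  std::array<std::size_t, 3> global_size; // 0 for unused dimensions
  std::array<std::size_t, 3> local_size;  // all 0 when not specified
  std::array<std::size_t, 3> offset;      // 0 for unused dimensions
};

// True when the nd-range carries an explicit local size.
// Assumption: a specified local size has a non-zero first component.
constexpr bool hasSpecificLocalSize(const NDRangeMD &R) {
  return R.local_size[0] != 0;
}

// Number of work-items described by the nd-range: the product of the
// global size components of the used dimensions.
constexpr std::size_t globalLinearSize(const NDRangeMD &R) {
  std::size_t N = 1;
  for (int I = 0; I < R.num_dims; ++I)
    N *= R.global_size[I];
  return N;
}
```

Under these assumptions, a 2-dimensional nd-range with global size 64x32 and local size 8x8 would be written as `NDRangeMD{2, {64, 32, 0}, {8, 8, 0}, {0, 0, 0}}`.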
 
+### Fusing kernels with different nd-ranges
+
+This section explains the actions performed by the kernel fusion JIT compiler when fusing kernels with different nd-ranges. Throughout this section, we refer to "work-item components". The components mentioned in this document are:
+
+- `global_size`
+- `local_size`
+- `num_work_groups`
+- `global_id`
+- `local_id`
+- `group_id`
+- `global_offset`
+
+The meaning of each of these is self-explanatory for the SYCL user.
+
+#### Restrictions
+
+Following kernel fusion principles, SYCL constraints and technical decisions, some basic constraints are set for valid combinations of nd-ranges:
+
+1. The fused kernel should perform no more visible work than the union of the unfused kernels;
+2. The fused kernel should perform no less visible work than the union of the unfused kernels;
+3. If two work-items belong to the same work-group in one of the unfused grids, they must also belong to the same work-group in the fused grid;
+4. Either none or all of the work-items of a work-group must execute barriers inserted by the kernel fusion process;
+5. The fused kernel must not launch more work-items than the maximum number of work-items launched by the original kernels;
+6. All work-groups will be the same size, [as per SYCL 2020 rev. 7, section 3.9.4](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#_work_group_data_parallel_kernels);
+8. A work-item will have the same global linear id in the fused grid as in the unfused grid;
+9. All the fused nd-ranges must have the same offset.
+
+These restrictions can be simplified to:
+
+- No two local sizes specified by the nd-ranges will be different;
+- No global id remapping is needed ([see](#work-item-remapping)) or all input offsets are 0;
+- All the fused nd-ranges must have the same offset.
+
+As we can see, there are no restrictions on the number of dimensions or global sizes of the input nd-ranges.
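The three simplified restrictions can be sketched as a single check over the list of input nd-ranges. This is an illustration only: `NDRange`, `isValidCombination` and the `AnyGlobalIdRemapping` flag are hypothetical names, and whether global id remapping is needed is assumed to be decided by the caller. The sketch also assumes local sizes are compared as raw triples.

```cpp
#include <array>
#include <cstddef>
#include <vector>

using Vec3 = std::array<std::size_t, 3>;

// Hypothetical description of an input nd-range (names are illustrative).
struct NDRange {
  int num_dims;
  Vec3 global_size;
  Vec3 local_size; // all 0 when no local size was specified
  Vec3 offset;
};

// Checks the three simplified restrictions on a list of input nd-ranges.
// `AnyGlobalIdRemapping` states whether any input kernel needs its global
// id remapped; deciding that is left to the caller in this sketch.
bool isValidCombination(const std::vector<NDRange> &Ranges,
                        bool AnyGlobalIdRemapping) {
  const Vec3 Zero{0, 0, 0};
  const Vec3 *Local = nullptr; // last explicitly specified local size seen
  bool AllOffsetsZero = true;
  for (const NDRange &R : Ranges) {
    // No two specified local sizes may differ.
    if (R.local_size != Zero) {
      if (Local && *Local != R.local_size)
        return false;
      Local = &R.local_size;
    }
    // All nd-ranges must share the same offset.
    if (R.offset != Ranges.front().offset)
      return false;
    AllOffsetsZero = AllOffsetsZero && R.offset == Zero;
  }
  // Either no global id remapping is needed or all offsets are 0.
  return !AnyGlobalIdRemapping || AllOffsetsZero;
}
```

Note that global sizes and `num_dims` never take part in the check, which matches the observation that they are unrestricted.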
+
+#### Work-item remapping
+
+Work-item remapping is performed at the input kernel level, i.e., a different remapping is performed for each input kernel, as different input nd-ranges will result in different remappings.
+
+This remapping consists of an inter-procedural pass that replaces each built-in querying a component of a work-item, e.g., the global id or the local size, with a JIT-generated value.
+
+First of all, work-item remapping will always be performed when the list of input nd-ranges is heterogeneous. Additional remapping conditions are present for the following work-item components. For each input kernel:
+
+- `num_work_groups` and `local_size`: Only performed if the input nd-range has an explicit local size. This may result in better performance, as it replaces built-in calls with constants;
+- `global_id`, `local_id` and `group_id`: Only needed if the number of dimensions differs w.r.t. that of the fused kernel or any component of the global size in the range [2, `num_dims`] differs.
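The two per-kernel conditions above could be evaluated along the following lines. This is a hedged sketch: `NDRange`, `RemapPlan` and `planRemapping` are hypothetical names, and the range [2, `num_dims`] is assumed to be 1-based in SYCL dimension order, i.e., 0-based indexes 1 to `num_dims - 1`.

```cpp
#include <array>
#include <cstddef>

using Vec3 = std::array<std::size_t, 3>;

// Hypothetical description of an nd-range (names are illustrative).
struct NDRange {
  int num_dims;
  Vec3 global_size;
  Vec3 local_size; // all 0 when no local size was specified
  Vec3 offset;
};

// Which groups of work-item components get remapped for one input kernel.
struct RemapPlan {
  bool sizes; // num_work_groups and local_size
  bool ids;   // global_id, local_id and group_id
};

RemapPlan planRemapping(const NDRange &Input, const NDRange &Fused) {
  RemapPlan Plan{false, false};
  // Sizes are only remapped when the input has an explicit local size.
  Plan.sizes = Input.local_size[0] != 0;
  // Ids are remapped when the number of dimensions differs...
  Plan.ids = Input.num_dims != Fused.num_dims;
  // ...or when any global size component other than the first differs.
  // Assumption: "[2, num_dims]" is 1-based, so 0-based indexes 1..num_dims-1.
  for (int I = 1; I < Input.num_dims && !Plan.ids; ++I)
    Plan.ids = Input.global_size[I] != Fused.global_size[I];
  return Plan;
}
```

For example, under these assumptions an input nd-range with global size 4x16 fused into an 8x16 nd-range needs no id remapping, because only the first component differs.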
+
+Once these rules are set, also taking into account remapping constraints, the remapping is performed as follows for each input kernel:
+On the RHS of the expressions, component names refer to the remapped values and upper case `GS`, `LS` and `GO` values refer to each of the components of the original nd-range (global size, local size and global offset), whereas `GLID` refers to the global linear id, which is an invariant during the fusion process.
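Because `GLID` is invariant, a per-kernel global id can be recovered from it using the original global size `GS`. The following is a minimal sketch of that idea, assuming SYCL's row-major linearization (the last dimension varies fastest). The function names `delinearize` and `linearize` are hypothetical, and the sketch is not a reproduction of the expressions used by the pass.

```cpp
#include <array>
#include <cstddef>

using Vec3 = std::array<std::size_t, 3>;

// Recovers a multi-dimensional id from the global linear id GLID, assuming
// row-major linearization over the original global size GS:
//   GLID = (id[0] * GS[1] + id[1]) * GS[2] + id[2]   (3 dimensions)
// Unused dimensions of the result are left at 0.
Vec3 delinearize(std::size_t GLID, const Vec3 &GS, int NumDims) {
  Vec3 Id{0, 0, 0};
  for (int I = NumDims - 1; I > 0; --I) {
    Id[I] = GLID % GS[I]; // position within dimension I
    GLID /= GS[I];        // carry the remainder to the slower dimensions
  }
  Id[0] = GLID;
  return Id;
}

// Inverse operation, shown to make the invariant explicit.
std::size_t linearize(const Vec3 &Id, const Vec3 &GS, int NumDims) {
  std::size_t GLID = 0;
  for (int I = 0; I < NumDims; ++I)
    GLID = GLID * GS[I] + Id[I];
  return GLID;
}
```

Note that `GS[0]` is never used as a divisor in `delinearize`, which is consistent with the first global size component not triggering id remapping.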
+
+Special care needs to be taken when handling elements from the original nd-range, as the input index needs to be remapped to take into account different array subscript ordering of the underlying API w.r.t. SYCL. See [SYCL 2020 rev. 7 C.7.7](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:opencl:kernel-conventions-sycl) for more information on this index remapping.
+
+**Note**: As there is no `global_id` counterpart for PTX, global id is specified as `global_id(i) = group_id(i) * local_size(i) + local_id(i) + global_offset(i)`. This way, when targeting PTX, `local_size`, `local_id` and `group_id` will need special treatment **when no explicit local size is provided**. In this particular case, remapping will take place as follows (also respecting original constraints):
+
+- `num_work_groups`:
+  - `num_work_groups(x) = 1`
+- `group_id`:
+  - `group_id(x) = 0`
+- `local_size`:
+  - `local_size(x) = GS(x)`
+- `local_id`:
+  - `local_id(x) = global_id(x)`
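The special case above can be sketched as follows: with no explicit local size, the whole nd-range is presented as a single work-group. `Remapped`, `remapWithoutLocalSize` and `globalIdFrom` are hypothetical names used only for this illustration.

```cpp
#include <array>
#include <cstddef>

using Vec3 = std::array<std::size_t, 3>;

// Hypothetical container for the remapped components of one work-item.
struct Remapped {
  Vec3 num_work_groups, group_id, local_size, local_id;
};

// Remapping applied when no explicit local size is provided: a single
// work-group spanning the original global size GS.
Remapped remapWithoutLocalSize(const Vec3 &GlobalId, const Vec3 &GS,
                               int NumDims) {
  Remapped R{}; // unused dimensions stay at 0
  for (int X = 0; X < NumDims; ++X) {
    R.num_work_groups[X] = 1;
    R.group_id[X] = 0;
    R.local_size[X] = GS[X];
    R.local_id[X] = GlobalId[X];
  }
  return R;
}

// global_id(i) = group_id(i) * local_size(i) + local_id(i) + global_offset(i)
std::size_t globalIdFrom(const Remapped &R, const Vec3 &Offset, int I) {
  return R.group_id[I] * R.local_size[I] + R.local_id[I] + Offset[I];
}
```

With a zero offset, `globalIdFrom` returns the original global id for every dimension, so the PTX formulation of `global_id` stays consistent with the remapped values.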
+
+##### Remapped SPIR-V built-ins
+
+Following [OpenCL SPIR-V Environment Specification 3.0 2.9](https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_Env.html#_built_in_variables):
+
+- `global_size`: `GlobalSize`
+- `local_size`: `WorkgroupSize`
+- `num_work_groups`: `NumWorkgroups`
+- `global_id`: `GlobalInvocationId`
+- `local_id`: `LocalInvocationId`
+- `group_id`: `WorkgroupId`
+- `global_offset`: `GlobalOffset`
+
+##### Remapped PTX intrinsics
+
+Following [User Guide for NVPTX](https://llvm.org/docs/NVPTXUsage.html#llvm-nvvm-read-ptx-sreg) and [Compiler and runtime design #global-offset-support](https://github.com/intel/llvm/blob/sycl/sycl/doc/design/CompilerAndRuntimeDesign.md#global-offset-support).