Commit c75a169

[SYCL][FUSION][DOC] Document fusion of kernels with different nd-ranges (#11466)
Add relevant information to sycl/doc/design/KernelFusionJIT.md, including: information on the pass performing the required transformations, restrictions on the input nd-ranges and brief description of the process. --------- Signed-off-by: Victor Perez <[email protected]>
1 parent 4805490 commit c75a169

1 file changed

+109
-1
lines changed

sycl/doc/design/KernelFusionJIT.md

Lines changed: 109 additions & 1 deletion
@@ -130,7 +130,7 @@ Information about the fusion is registered within the module by attaching metada
130130

131131
The pipeline currently consists of the following passes (in order):
132132

133-
- `SYCLKernelFusion` performs the actual fusion process by inlining kernels to fuse inside the fused kernel
133+
- `SYCLKernelFusion` performs the actual fusion process by inlining kernels to fuse inside the fused kernel. If the kernels being fused do not all share the same nd-range, it also handles work-item remapping (see [Fusing kernels with different nd-ranges](#fusing-kernels-with-different-nd-ranges)).
134134
- Generic optimization passes: `IndVarSimplifyPass`, `LoopUnrollPass`, `SROAPass`, `InferAddressSpacesPass` to remove pointers to the generic address-space
135135
- These optimizations are important to help the internalizer, see note below.
136136
- `SYCLInternalizer` promotes buffer to local or private memory
@@ -158,7 +158,115 @@ The metadata is attached to a function that will become the fused kernel:
158158
- `sycl.kernel.promote`: declare identical parameters to be promoted. Contains a list of index (of the fused kernel, after identical arguments elision) and `private` if the argument is to be promoted to private memory or `local` if it is to local.
159159
- `sycl.kernel.promote.size`: declare the address space size for the promoted memory. Contains a list of indexes (of the fused kernel, after identical arguments elision) and the number of elements.
160160
- `sycl.kernel.constants`: declare the value of a scalar or aggregate to be used as constant values. Contains a list of indexes (of the fused kernel, after identical arguments elision) and the value as a string. Note: the string is used to store the value, the string is read as a buffer of char and reinterpreted into the value of the argument's type.
161+
- `sycl.kernel.nd-range`: declare the nd-range to be used by the fused kernel in case work-item remapping was needed. It is a tuple with 4 elements:
162+
- `num_dims`: scalar integer representing the number of dimensions of the nd-range;
163+
- `global_size`: triple representing nd-range global size, an element for each dimension, using `0` for unused dimensions;
164+
- `local_size`: triple representing nd-range local size, an element for each dimension, using `0` for unused dimensions. If the local size is not specified, all elements will be 0;
165+
- `offset`: triple representing nd-range offset, an element for each dimension, using `0` for unused dimensions.
166+
- `sycl.kernel.nd-ranges`: declare the nd-range of each original kernel. This information is used by the `SYCLKernelFusion` pass to perform work-item remapping. It is a list of references to tuples like the one contained in `sycl.kernel.nd-range`. Constraints on the legal combinations of nd-ranges are described in [the corresponding section](#fusing-kernels-with-different-nd-ranges).
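
As an illustration only, the tuple stored in `sycl.kernel.nd-range` could be mirrored in memory by a plain struct such as the one below. The names `NDRangeInfo` and `hasSpecificLocalSize` are hypothetical and are not taken from the implementation; the sketch only reflects the four-element layout and the all-zeros convention for an unspecified local size described above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical in-memory mirror of one `sycl.kernel.nd-range` tuple.
struct NDRangeInfo {
  int num_dims;                           // number of dimensions (1 to 3)
  std::array<std::size_t, 3> global_size; // 0 for unused dimensions
  std::array<std::size_t, 3> local_size;  // all 0 if no local size was specified
  std::array<std::size_t, 3> offset;      // 0 for unused dimensions
};

// A local size counts as specified if at least one element is non-zero.
bool hasSpecificLocalSize(const NDRangeInfo &R) {
  assert(R.num_dims >= 1 && R.num_dims <= 3 && "unexpected dimension count");
  for (std::size_t S : R.local_size)
    if (S != 0)
      return true;
  return false;
}
```

Under this assumed layout, `sycl.kernel.nd-ranges` would correspond to a list of such structs, one per input kernel.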
161167

168+
### Fusing kernels with different nd-ranges
169+
170+
This section explains the actions performed by the kernel fusion JIT compiler when fusing kernels with different nd-ranges. Throughout this section, we refer to "work-item components". The components mentioned in this document are:
171+
172+
- `global_size`
173+
- `local_size`
174+
- `num_work_groups`
175+
- `global_id`
176+
- `local_id`
177+
- `group_id`
178+
- `global_offset`
179+
180+
The meaning of each of these is self-explanatory for the SYCL user.
181+
182+
#### Restrictions
183+
184+
Following kernel fusion principles, SYCL constraints and technical decisions, some basic constraints are set for valid combinations of nd-ranges:
185+
186+
1. The fused kernel should perform no more visible work than the union of the unfused kernels;
187+
2. The fused kernel should perform no less visible work than the union of the unfused kernels;
188+
3. If two work items belong to the same work-group in one of the unfused grids, they must also belong to the same work-group in the fused grid;
189+
4. Either none or all of the work-items of a work-group must execute barriers inserted by the kernel fusion process;
190+
5. The fused kernel must not launch more work-items than the maximum number of work-items launched by the original kernels;
191+
6. All work-groups will be the same size, as per [SYCL 2020 rev. 7, section 3.9.4](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#_work_group_data_parallel_kernels);
192+
7. `global_id(i) = group_id(i) * local_size(i) + local_id(i)` [as per OpenCL 3.0 3.2.1](https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_API.html#_mapping_work_items_onto_an_ndrange).
193+
8. A work-item will have the same global linear id in the fused grid as in the unfused grid;
194+
9. All the fused nd-ranges must have the same offset.
195+
196+
These restrictions can be simplified to:
197+
198+
- All local sizes specified by the nd-ranges must be equal;
199+
- No global id remapping is needed (see [Work-item remapping](#work-item-remapping)) or all input offsets are 0;
200+
- All the fused nd-ranges must have the same offset.
201+
202+
As we can see, there are no restrictions on the number of dimensions or global sizes of the input nd-ranges.
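
The simplified restrictions can be sketched as a small validity check. This is a minimal sketch, not the pass's actual code: `NDRangeInfo`, `Triple` and `isValidCombination` are hypothetical names, and whether global id remapping is needed is passed in as a flag (`NeedsGlobalIdRemapping`) rather than derived here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Triple = std::array<std::size_t, 3>;

// Hypothetical description of one input nd-range.
struct NDRangeInfo {
  Triple global_size;
  Triple local_size; // all zeros when no local size was specified
  Triple offset;
};

// Returns true if the input nd-ranges satisfy the simplified restrictions.
bool isValidCombination(const std::vector<NDRangeInfo> &Ranges,
                        bool NeedsGlobalIdRemapping) {
  assert(!Ranges.empty() && "expected at least one input nd-range");
  const Triple Zero{0, 0, 0};
  const Triple *FirstLocal = nullptr;
  bool AllOffsetsZero = true;
  for (const NDRangeInfo &R : Ranges) {
    // All the fused nd-ranges must have the same offset.
    if (R.offset != Ranges.front().offset)
      return false;
    AllOffsetsZero = AllOffsetsZero && R.offset == Zero;
    // Local sizes that are specified must all agree.
    if (R.local_size != Zero) {
      if (FirstLocal && *FirstLocal != R.local_size)
        return false;
      FirstLocal = &R.local_size;
    }
  }
  // Either no global id remapping is needed or all offsets are 0.
  return !NeedsGlobalIdRemapping || AllOffsetsZero;
}
```

Note that the check never compares global sizes or dimension counts, which matches the observation that those are unrestricted.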
203+
204+
#### Work-item remapping
205+
206+
Work-item remapping is performed at the input kernel level, i.e., a different remapping is performed for each input kernel, as different input nd-ranges will result in different remappings.
207+
208+
This remapping consists of an inter-procedural pass that replaces each built-in querying a component of a work-item, e.g., the global id or the local size, with a JIT-generated value.
209+
210+
First of all, work-item remapping will always be performed when the list of input nd-ranges is heterogeneous. Additional remapping conditions are present for the following work-item components. For each input kernel:
211+
212+
- `num_work_groups` and `local_size`: Only performed if the input nd-range has an explicit local size. This may result in better performance, as it replaces built-in calls with constants;
213+
- `global_id`, `local_id` and `group_id`: Only needed if the number of dimensions differs w.r.t. that of the fused kernel or any component of the global size in the range [2, `num_dims`] differs.
214+
215+
Once these rules are set, and also taking into account the remapping constraints, the remapping is performed as follows for each input kernel:
216+
217+
- `global_id`:
218+
- `global_id(0) = GLID / (global_size(1) * global_size(2))`
219+
- `global_id(1) = (GLID / global_size(2)) % global_size(1)`
220+
- `global_id(2) = GLID % global_size(2)`
221+
- `local_id`:
222+
- `local_id(x) = global_id(x) % local_size(x)`
223+
- `group_id`:
224+
- `group_id(x) = global_id(x) / local_size(x)`
225+
- `num_work_groups`:
226+
- `num_work_groups(x) = global_size(x) / local_size(x)`
227+
- `global_size`:
228+
- `global_size(x) = GS(x)`
229+
- `local_size`:
230+
- `local_size(x) = LS(x)`
231+
- `global_offset`:
232+
- `global_offset(x) = GO(x)`
233+
234+
On the RHS of the expressions, component names refer to the remapped values and upper case `GS`, `LS` and `GO` values refer to each of the components of the original nd-range (global size, local size and global offset), whereas `GLID` refers to the global linear id, which is an invariant during the fusion process.
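
The expressions above can be sketched in plain C++ as follows. This is only an illustration under stated assumptions: `Remapped`, `remap` and `linearize` are hypothetical names, the sizes are assumed to be padded with 1 (not 0) in unused dimensions so that the divisions are well defined, and the global offset is left out for brevity.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Triple = std::array<std::size_t, 3>;

// Hypothetical container for the remapped components of one work-item.
struct Remapped {
  Triple global_id, local_id, group_id, num_work_groups;
};

// Recompute work-item components from the global linear id (GLID) and the
// original global size (GS) and local size (LS) of one input kernel.
Remapped remap(std::size_t GLID, const Triple &GS, const Triple &LS) {
  for (std::size_t X = 0; X < 3; ++X)
    assert(GS[X] != 0 && LS[X] != 0 && "sketch assumes sizes padded with 1");
  Remapped R{};
  R.global_id = {GLID / (GS[1] * GS[2]), (GLID / GS[2]) % GS[1], GLID % GS[2]};
  for (std::size_t X = 0; X < 3; ++X) {
    R.local_id[X] = R.global_id[X] % LS[X];
    R.group_id[X] = R.global_id[X] / LS[X];
    R.num_work_groups[X] = GS[X] / LS[X];
  }
  return R;
}

// Inverse of the global_id formulas: dimension 2 varies fastest.
std::size_t linearize(const Triple &Id, const Triple &GS) {
  return (Id[0] * GS[1] + Id[1]) * GS[2] + Id[2];
}
```

For instance, with `GS = {2, 3, 4}`, `LS = {1, 3, 2}` and `GLID = 17`, the formulas give `global_id = {1, 1, 1}`, and `linearize` maps that id back to 17, which illustrates why the global linear id can serve as the invariant.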
235+
236+
Special care needs to be taken when handling elements from the original nd-range, as the input index needs to be remapped to take into account different array subscript ordering of the underlying API w.r.t. SYCL. See [SYCL 2020 rev. 7 C.7.7](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:opencl:kernel-conventions-sycl) for more information on this index remapping.
237+
238+
**Note**: As there is no `global_id` counterpart for PTX, the global id is computed as `global_id(i) = group_id(i) * local_size(i) + local_id(i) + global_offset(i)`. As a result, when targeting PTX, `local_size`, `local_id` and `group_id` need special treatment **when no explicit local size is provided**. In this particular case, remapping takes place as follows (also respecting the original constraints):
239+
240+
- `num_work_groups`:
241+
- `num_work_groups(x) = 1`
242+
- `group_id`:
243+
- `group_id(x) = 0`
244+
- `local_size`:
245+
- `local_size(x) = GS(x)`
246+
- `local_id`:
247+
- `local_id(x) = global_id(x)`
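
A minimal sketch of this special case is shown below, assuming a zero global offset. The names `PTXRemapped` and `remapWithoutLocalSize` are hypothetical. The point of the example is that the chosen values keep the PTX identity `global_id(i) = group_id(i) * local_size(i) + local_id(i)` valid.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Triple = std::array<std::size_t, 3>;

// Hypothetical container for the components remapped in the PTX special case.
struct PTXRemapped {
  Triple num_work_groups, group_id, local_size, local_id;
};

// Remapping used when the input nd-range has no explicit local size:
// the whole original range is treated as a single work-group.
PTXRemapped remapWithoutLocalSize(const Triple &GlobalId, const Triple &GS) {
  PTXRemapped R{};
  for (std::size_t X = 0; X < 3; ++X) {
    assert(GlobalId[X] < GS[X] && "global id out of range");
    R.num_work_groups[X] = 1;
    R.group_id[X] = 0;
    R.local_size[X] = GS[X];
    R.local_id[X] = GlobalId[X];
  }
  return R;
}
```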
248+
249+
##### Remapped SPIR-V built-ins
250+
251+
Following [OpenCL SPIR-V Environment Specification 3.0 2.9](https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_Env.html#_built_in_variables):
252+
253+
- `global_size`: `GlobalSize`
254+
- `local_size`: `WorkgroupSize`
255+
- `num_work_groups`: `NumWorkgroups`
256+
- `global_id`: `GlobalInvocationId`
257+
- `local_id`: `LocalInvocationId`
258+
- `group_id`: `WorkgroupId`
259+
- `global_offset`: `GlobalOffset`
260+
261+
##### Remapped PTX intrinsics
262+
263+
Following [User Guide for NVPTX](https://llvm.org/docs/NVPTXUsage.html#llvm-nvvm-read-ptx-sreg) and [Compiler and runtime design #global-offset-support](https://github.com/intel/llvm/blob/sycl/sycl/doc/design/CompilerAndRuntimeDesign.md#global-offset-support):
264+
265+
- `local_id`: `llvm.nvvm.read.ptx.sreg.tid.*`
266+
- `group_id`: `llvm.nvvm.read.ptx.sreg.ctaid.*`
267+
- `local_size`: `llvm.nvvm.read.ptx.sreg.ntid.*`
268+
- `num_work_groups`: `llvm.nvvm.read.ptx.sreg.nctaid.*`
269+
- `global_offset`: `llvm.nvvm.implicit.offset`
162270

163271
### Support for non SPIR-V targets
164272
