|
| 1 | +# Parallel For Range Rounding |
| 2 | + |
| 3 | +Kernels to be executed using a `sycl::range`, and not a `sycl::nd_range`, |
| 4 | +may have their execution space reconfigured by the SYCL runtime. This is done |
| 5 | +since oddly shaped execution dimensions can hinder performance, especially when |
| 6 | +executing kernels on GPUs. It is worth noting that although the |
| 7 | +`sycl::parallel_for` using a `sycl::range` does not expose the concept of a |
| 8 | +`work_group` to the user, behind the scenes all GPU APIs require a work group |
| 9 | +configuration when dispatching kernels. In this case the work group |
| 10 | +configuration is provided by the implementation and not the user. |
| 11 | + |
| 12 | +As an example, imagine a SYCL kernel is dispatched with 1d range `{7727}`. Since |
| 13 | +7727 is a prime number, there is no way to divide this kernel up into workgroups |
| 14 | +of any size other than 1. Therefore 7727 workgroups are dispatched, each with |
| 15 | +size 1. Because of the parallel nature of execution on modern GPUs, this |
| 16 | +results in low occupancy, since we are not using all of the available work items |
| 17 | +that execute in lockstep in each (implicit) subgroup. This can hinder |
| 18 | +performance. |
| 19 | + |
| 20 | +To mitigate the performance hit of choosing an awkward implicit workgroup size, |
| 21 | +for each kernel using a `sycl::range`, the SYCL runtime will generate two |
| 22 | +kernels: |
| 23 | + |
| 24 | +1. The original kernel without any modifications. |
| 25 | +2. The "Range rounded" kernel, which checks the global index of each work item |
| 26 | + at the beginning of execution, exiting early for a work item if the global |
| 27 | + index exceeds the user provided execution range. If the original kernel has |
| 28 | + the signature `foo`, then this kernel will have a signature akin to |
| 29 | + `_ZTSN4sycl3_V16detail19__pf_kernel_wrapperI3fooEE`. |
| 30 | + |
| 31 | +In this way, if a range rounded kernel is executed at runtime, a kernel |
| 32 | +dispatched with the range `{7727}` may be executed by `{7808}` work items, |
| 33 | +where work items `{7727..7807}` all exit early before doing any work. This would |
| 34 | +give much better performance on a GPU platform since the implementation can use |
| 35 | +the implicit `nd_range` `{7808, 32}`, which corresponds to a workgroup size of |
| 36 | +32, instead of `{7727, 1}`, which corresponds to a workgroup size of 1. |
| 37 | + |
| 38 | +The parallel for range rounding will only be used in the X (outermost) |
| 39 | +dimension of a `sycl::range`, since if the inner dimensions are changed by the |
| 40 | +SYCL runtime this can change the stride offset of different dimensions. Range |
| 41 | +rounding will only be used if the SYCL runtime X dimension exceeds some minimum |
| 42 | +value, which can be configured using the |
| 43 | +`SYCL_PARALLEL_FOR_RANGE_ROUNDING_PARAMS` environment variable. |
| 44 | + |
| 45 | +Generation of range rounded kernels can be disabled by using the compiler flag |
| 46 | +`-fsycl-disable-range-rounding`. |
0 commit comments