[SYCL] Add documentation and change default val for min range x to enable range rounding (#11823)

hdelan · web-flow · commit b334e1a30469 · 2023-11-16T09:30:48.000-08:00
Some users found that 1024 was too conservative a min X dim to enable
range rounding. This lowers that val.

It also adds some documentation for SYCL range rounding, which had been
missing. Perhaps this should be in its own file. Let me know where to
put it and I will do so.
diff --git a/sycl/doc/EnvironmentVariables.md b/sycl/doc/EnvironmentVariables.md
@@ -161,6 +161,20 @@ If this environment variable is not set, the preferred work-group size for reduc
 
 Note that conflicting configuration tuples in the same list will favor the last entry. For example, a list `cpu:32,gpu:32,cpu:16` will set the preferred work-group size of reductions to 32 for GPUs and 16 for CPUs. This also applies to `*`, for example `cpu:32,*:16` sets the preferred work-group size of reductions on all devices to 16, while `*:16,cpu:32` sets the preferred work-group size of reductions to 32 on CPUs and to 16 on all other devices.
 
+## Range Rounding Environment Variables
+
+For a description of parallel for range rounding in DPC++ see
+[Parallel For Range Rounding][range-rounding].
+
+| Environment variable | Values | Description |
+| -------------------- | ------ | ----------- |
+| `SYCL_DISABLE_PARALLEL_FOR_RANGE_ROUNDING` | Any(\*) | Disables automatic rounding-up of `parallel_for` invocation ranges. |
+| `SYCL_PARALLEL_FOR_RANGE_ROUNDING_TRACE`   | Any(\*) | Enables tracing of `parallel_for` invocations with rounded-up ranges. |
+| `SYCL_PARALLEL_FOR_RANGE_ROUNDING_PARAMS`  | `MinFactorX:GoodFactor:MinRangeX` | `MinFactorX`: The minimum range that the rounded range should be a multiple of (Default 16)  |
+|  |  | `GoodFactor`: The preferred range that the rounded range be a multiple of (Default 32)  |
+|  |  | `MinRangeX`: The minimum X dimension of the range such that range rounding is activated (Default 1024) |
+
+
 ## Controlling DPC++ Level Zero Plugin
 
 | Environment variable | Values | Description |
@@ -290,3 +304,4 @@ variables in production code.</span>
 `(*) Note: Any means this environment variable is effective when set to any non-null value.`
 
 [xpti]: https://github.com/intel/llvm/blob/sycl/xptifw/doc/XPTI_Framework.md
+[range-rounding]: https://github.com/intel/llvm/blob/sycl/doc/design/ParallelForRangeRounding.md
diff --git a/sycl/doc/design/ParallelForRangeRounding.md b/sycl/doc/design/ParallelForRangeRounding.md
@@ -0,0 +1,46 @@
+# Parallel For Range Rounding
+
+Kernels to be executed using a `sycl::range`, and not a `sycl::nd_range`,
+may have their execution space reconfigured by the SYCL runtime. This is done
+since oddly shaped execution dimensions can hinder performance, especially when
+executing kernels on GPUs. It is worth noting that although the
+`sycl::parallel_for` using a `sycl::range` does not expose the concept of a
+`work_group` to the user, behind the scenes all GPU APIs require a work group
+configuration when dispatching kernels. In this case the work group
+configuration is provided by the implementation and not the user.
+
+As an example, imagine a SYCL kernel is dispatched with 1d range `{7727}`. Since
+7727 is a prime number, there is no way to divide this kernel up into workgroups
+of any size other than 1. Therefore 7727 workgroups are dispatched, each with
+size 1. Because of the parallel nature of execution on modern GPUs, this
+results in low occupancy, since we are not using all of the available work items
+that execute in lockstep in each (implicit) subgroup. This can hinder
+performance.
+
+To mitigate the performance hit of choosing an awkward implicit workgroup size,
+for each kernel using a `sycl::range`, the SYCL runtime will generate two
+kernels:
+
+1. The original kernel without any modifications.
+2. The "Range rounded" kernel, which checks the global index of each work item
+   at the beginning of execution, exiting early for a work item if the global
+   index exceeds the user provided execution range. If the original kernel has
+   the signature `foo`, then this kernel will have a signature akin to
+   `_ZTSN4sycl3_V16detail19__pf_kernel_wrapperI3fooEE`.
+
+In this way, if a range rounded kernel is executed at runtime, a kernel
+dispatched with the range `{7727}` may be executed by `{7808}` work items,
+where work items `{7727..7807}` all exit early before doing any work. This would
+give much better performance on a GPU platform since the implementation can use
+the implicit `nd_range` `{7808, 32}`, which corresponds to a workgroup size of
+32, instead of `{7727, 1}`, which corresponds to a workgroup size of 1.
+
+The parallel for range rounding will only be used in the X (outermost)
+dimension of a `sycl::range`, since if the inner dimensions are changed by the
+SYCL runtime this can change the stride offset of different dimensions. Range
+rounding will only be used if the SYCL runtime X dimension exceeds some minimum
+value, which can be configured using the
+`SYCL_PARALLEL_FOR_RANGE_ROUNDING_PARAMS` environment variable.
+
+Generation of range rounded kernels can be disabled by using the compiler flag
+`-fsycl-disable-range-rounding`.
diff --git a/sycl/doc/index.rst b/sycl/doc/index.rst
@@ -38,6 +38,7 @@ Design Documents for the oneAPI DPC++ Compiler
    design/Assert
    design/SharedLibraries
    design/OptionalDeviceFeatures
+   design/ParallelForRangeRounding
    design/SYCLInstrumentationUsingXPTI
    design/ITTAnnotations
    design/DeviceGlobal