Skip to content

Commit b334e1a

Browse files
authored
[SYCL] Add documentation and change default val for min range x to enable range rounding (#11823)
Some users found that 1024 was too conservative a min X dim to enable range rounding. This lowers that val. It also adds some documentation for SYCL range rounding, which had been missing. Perhaps this should be in its own file. Let me know where to put it and I will do so.
1 parent 6666762 commit b334e1a

File tree

3 files changed

+62
-0
lines changed

3 files changed

+62
-0
lines changed

sycl/doc/EnvironmentVariables.md

Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -161,6 +161,20 @@ If this environment variable is not set, the preferred work-group size for reduc
161161

162162
Note that conflicting configuration tuples in the same list will favor the last entry. For example, a list `cpu:32,gpu:32,cpu:16` will set the preferred work-group size of reductions to 32 for GPUs and 16 for CPUs. This also applies to `*`, for example `cpu:32,*:16` sets the preferred work-group size of reductions on all devices to 16, while `*:16,cpu:32` sets the preferred work-group size of reductions to 32 on CPUs and to 16 on all other devices.
163163

164+
## Range Rounding Environment Variables
165+
166+
For a description of parallel for range rounding in DPC++ see
167+
[Parallel For Range Rounding][range-rounding].
168+
169+
| Environment variable | Values | Description |
170+
| -------------------- | ------ | ----------- |
171+
| `SYCL_DISABLE_PARALLEL_FOR_RANGE_ROUNDING` | Any(\*) | Disables automatic rounding-up of `parallel_for` invocation ranges. |
172+
| `SYCL_PARALLEL_FOR_RANGE_ROUNDING_TRACE` | Any(\*) | Enables tracing of `parallel_for` invocations with rounded-up ranges. |
173+
| `SYCL_PARALLEL_FOR_RANGE_ROUNDING_PARAMS` | `MinFactorX:GoodFactor:MinRangeX` | `MinFactorX`: The minimum range that the rounded range should be a multiple of (Default 16) |
174+
| | | `GoodFactor`: The preferred range that the rounded range be a multiple of (Default 32) |
175+
| | | `MinRangeX`: The minimum X dimension of the range such that range rounding is activated (Default 1024) |
176+
177+
164178
## Controlling DPC++ Level Zero Plugin
165179

166180
| Environment variable | Values | Description |
@@ -290,3 +304,4 @@ variables in production code.</span>
290304
`(*) Note: Any means this environment variable is effective when set to any non-null value.`
291305

292306
[xpti]: https://github.com/intel/llvm/blob/sycl/xptifw/doc/XPTI_Framework.md
307+
[range-rounding]: https://github.com/intel/llvm/blob/sycl/doc/design/ParallelForRangeRounding.md
Lines changed: 46 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,46 @@
1+
# Parallel For Range Rounding
2+
3+
Kernels to be executed using a `sycl::range`, and not a `sycl::nd_range`,
4+
may have their execution space reconfigured by the SYCL runtime. This is done
5+
since oddly shaped execution dimensions can hinder performance, especially when
6+
executing kernels on GPUs. It is worth noting that although the
7+
`sycl::parallel_for` using a `sycl::range` does not expose the concept of a
8+
`work_group` to the user, behind the scenes all GPU APIs require a work group
9+
configuration when dispatching kernels. In this case the work group
10+
configuration is provided by the implementation and not the user.
11+
12+
As an example, imagine a SYCL kernel is dispatched with 1d range `{7727}`. Since
13+
7727 is a prime number, there is no way to divide this kernel up into workgroups
14+
of any size other than 1. Therefore 7727 workgroups are dispatched, each with
15+
size 1. Because of the parallel nature of execution on modern GPUs, this
16+
results in low occupancy, since we are not using all of the available work items
17+
that execute in lockstep in each (implicit) subgroup. This can hinder
18+
performance.
19+
20+
To mitigate the performance hit of choosing an awkward implicit workgroup size,
21+
for each kernel using a `sycl::range`, the SYCL runtime will generate two
22+
kernels:
23+
24+
1. The original kernel without any modifications.
25+
2. The "Range rounded" kernel, which checks the global index of each work item
26+
at the beginning of execution, exiting early for a work item if the global
27+
index exceeds the user provided execution range. If the original kernel has
28+
the signature `foo`, then this kernel will have a signature akin to
29+
`_ZTSN4sycl3_V16detail19__pf_kernel_wrapperI3fooEE`.
30+
31+
In this way, if a range rounded kernel is executed at runtime, a kernel
32+
dispatched with the range `{7727}` may be executed by `{7808}` work items,
33+
where work items `{7727..7807}` all exit early before doing any work. This would
34+
give much better performance on a GPU platform since the implementation can use
35+
the implicit `nd_range` `{7808, 32}`, which corresponds to a workgroup size of
36+
32, instead of `{7727, 1}`, which corresponds to a workgroup size of 1.
37+
38+
The parallel for range rounding will only be used in the X (outermost)
39+
dimension of a `sycl::range`, since if the inner dimensions are changed by the
40+
SYCL runtime this can change the stride offset of different dimensions. Range
41+
rounding will only be used if the SYCL runtime X dimension exceeds some minimum
42+
value, which can be configured using the
43+
`SYCL_PARALLEL_FOR_RANGE_ROUNDING_PARAMS` environment variable.
44+
45+
Generation of range rounded kernels can be disabled by using the compiler flag
46+
`-fsycl-disable-range-rounding`.

sycl/doc/index.rst

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -38,6 +38,7 @@ Design Documents for the oneAPI DPC++ Compiler
3838
design/Assert
3939
design/SharedLibraries
4040
design/OptionalDeviceFeatures
41+
design/ParallelForRangeRounding
4142
design/SYCLInstrumentationUsingXPTI
4243
design/ITTAnnotations
4344
design/DeviceGlobal

0 commit comments

Comments
 (0)