Skip to content

Commit be3874a

Browse files
author
Hugh Delaney
committed
Adding new file in doc/design for parallel for range rounding
1 parent 60bc34b commit be3874a

File tree

2 files changed

+51
-48
lines changed

2 files changed

+51
-48
lines changed

sycl/doc/EnvironmentVariables.md

Lines changed: 5 additions & 48 deletions
Original file line numberDiff line numberDiff line change
@@ -161,54 +161,10 @@ If this environment variable is not set, the preferred work-group size for reduc
161161

162162
Note that conflicting configuration tuples in the same list will favor the last entry. For example, a list `cpu:32,gpu:32,cpu:16` will set the preferred work-group size of reductions to 32 for GPUs and 16 for CPUs. This also applies to `*`, for example `cpu:32,*:16` sets the preferred work-group size of reductions on all devices to 16, while `*:16,cpu:32` sets the preferred work-group size of reductions to 32 on CPUs and to 16 on all other devices.
163163

164-
## Range Rounded Parallel For
165-
166-
Kernels to be executed using a `sycl::range`, and not a `sycl::nd_range`,
167-
may have their execution space reconfigured by the SYCL runtime. This is done
168-
since oddly shaped execution dimensions can hinder performance, especially when
169-
executing kernels on GPUs. It is worth noting that although the
170-
`sycl::parallel_for` using a `sycl::range` does not expose the concept of a
171-
`work_group` to the user, behind the scenes all GPU APIs require a work group
172-
configuration when dispatching kernels. In this case the work group
173-
configuration is provided by the implementation and not the user.
174-
175-
As an example, imagine a SYCL kernel is dispatched with 1d range `{7727}`. Since
176-
7727 is a prime number, there is no way to divide this kernel up into workgroups
177-
of any size other than 1. Therefore 7727 workgroups are dispatched, each with
178-
size 1. Because of the parallel nature of execution on modern GPUs, this
179-
results in low occupancy, since we are not using all of the available work items
180-
that execute in lockstep in each (implicit) subgroup. This can hinder
181-
performance.
182-
183-
To mitigate the performance hit of choosing an awkward implicit workgroup size,
184-
for each kernel using a `sycl::range`, the SYCL runtime will generate two
185-
kernels:
186-
187-
1. The original kernel without any modifications.
188-
2. The "Range rounded" kernel, which checks the global index of each work item
189-
at the beginning of execution, exiting early for a work item if the global
190-
index exceeds the user provided execution range. If the original kernel has
191-
the signature `foo`, then this kernel will have a signature akin to
192-
`_ZTSN4sycl3_V16detail19__pf_kernel_wrapperI3fooEE`.
193-
194-
In this way, if a range rounded kernel is executed at runtime, a kernel
195-
dispatched with the range `{7727}` may be executed by `{7808}` work items,
196-
where work items `{7727..7807}` all exit early before doing any work. This would
197-
give much better performance on a GPU platform since the implementation can use
198-
the implicit `nd_range` `{7808, 32}`, which corresponds to a workgroup size of
199-
32, instead of `{7727, 1}`, which corresponds to a workgroup size of 1.
200-
201-
The parallel for range rounding will only be used in the X (outermost)
202-
dimension of a `sycl::range`, since if the inner dimensions are changed by the
203-
SYCL runtime this can change the stride offset of different dimensions. Range
204-
rounding will only be used if the SYCL runtime X dimension exceeds some minimum
205-
value, which can be configured using the
206-
`SYCL_PARALLEL_FOR_RANGE_ROUNDING_PARAMS` environment variable.
207-
208-
Generation of range rounded kernels can be disabled by using the compiler flag
209-
`-fsycl-disable-range-rounding`.
210-
211-
### Range Rounding Environment Variables
164+
## Range Rounding Environment Variables
165+
166+
For a description of parallel for range rounding in DPC++ see
167+
[Parallel For Range Rounding][range-rounding].
212168

213169
| Environment variable | Values | Description |
214170
| -------------------- | ------ | ----------- |
@@ -348,3 +304,4 @@ variables in production code.</span>
348304
`(*) Note: Any means this environment variable is effective when set to any non-null value.`
349305

350306
[xpti]: https://github.com/intel/llvm/blob/sycl/xptifw/doc/XPTI_Framework.md
307+
[range-rounding]: https://github.com/intel/llvm/blob/sycl/doc/design/ParallelForRangeRounding.md
Lines changed: 46 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,46 @@
1+
# Parallel For Range Rounding
2+
3+
Kernels to be executed using a `sycl::range`, and not a `sycl::nd_range`,
4+
may have their execution space reconfigured by the SYCL runtime. This is done
5+
since oddly shaped execution dimensions can hinder performance, especially when
6+
executing kernels on GPUs. It is worth noting that although the
7+
`sycl::parallel_for` using a `sycl::range` does not expose the concept of a
8+
`work_group` to the user, behind the scenes all GPU APIs require a work group
9+
configuration when dispatching kernels. In this case the work group
10+
configuration is provided by the implementation and not the user.
11+
12+
As an example, imagine a SYCL kernel is dispatched with 1d range `{7727}`. Since
13+
7727 is a prime number, there is no way to divide this kernel up into workgroups
14+
of any size other than 1. Therefore 7727 workgroups are dispatched, each with
15+
size 1. Because of the parallel nature of execution on modern GPUs, this
16+
results in low occupancy, since we are not using all of the available work items
17+
that execute in lockstep in each (implicit) subgroup. This can hinder
18+
performance.
19+
20+
To mitigate the performance hit of choosing an awkward implicit workgroup size,
21+
for each kernel using a `sycl::range`, the SYCL runtime will generate two
22+
kernels:
23+
24+
1. The original kernel without any modifications.
25+
2. The "Range rounded" kernel, which checks the global index of each work item
26+
at the beginning of execution, exiting early for a work item if the global
27+
index exceeds the user provided execution range. If the original kernel has
28+
the signature `foo`, then this kernel will have a signature akin to
29+
`_ZTSN4sycl3_V16detail19__pf_kernel_wrapperI3fooEE`.
30+
31+
In this way, if a range rounded kernel is executed at runtime, a kernel
32+
dispatched with the range `{7727}` may be executed by `{7808}` work items,
33+
where work items `{7727..7807}` all exit early before doing any work. This would
34+
give much better performance on a GPU platform since the implementation can use
35+
the implicit `nd_range` `{7808, 32}`, which corresponds to a workgroup size of
36+
32, instead of `{7727, 1}`, which corresponds to a workgroup size of 1.
37+
38+
The parallel for range rounding will only be used in the X (outermost)
39+
dimension of a `sycl::range`, since if the inner dimensions are changed by the
40+
SYCL runtime this can change the stride offset of different dimensions. Range
41+
rounding will only be used if the SYCL runtime X dimension exceeds some minimum
42+
value, which can be configured using the
43+
`SYCL_PARALLEL_FOR_RANGE_ROUNDING_PARAMS` environment variable.
44+
45+
Generation of range rounded kernels can be disabled by using the compiler flag
46+
`-fsycl-disable-range-rounding`.

0 commit comments

Comments
 (0)