Skip to content

[SYCL] Add documentation and change default val for min range x to enable range rounding #11823

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Nov 16, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 15 additions & 0 deletions sycl/doc/EnvironmentVariables.md
Original file line number Diff line number Diff line change
Expand Up @@ -161,6 +161,20 @@ If this environment variable is not set, the preferred work-group size for reduc

Note that conflicting configuration tuples in the same list will favor the last entry. For example, a list `cpu:32,gpu:32,cpu:16` will set the preferred work-group size of reductions to 32 for GPUs and 16 for CPUs. This also applies to `*`, for example `cpu:32,*:16` sets the preferred work-group size of reductions on all devices to 16, while `*:16,cpu:32` sets the preferred work-group size of reductions to 32 on CPUs and to 16 on all other devices.

## Range Rounding Environment Variables

For a description of parallel for range rounding in DPC++ see
[Parallel For Range Rounding][range-rounding].

| Environment variable | Values | Description |
| -------------------- | ------ | ----------- |
| `SYCL_DISABLE_PARALLEL_FOR_RANGE_ROUNDING` | Any(\*) | Disables automatic rounding-up of `parallel_for` invocation ranges. |
| `SYCL_PARALLEL_FOR_RANGE_ROUNDING_TRACE` | Any(\*) | Enables tracing of `parallel_for` invocations with rounded-up ranges. |
| `SYCL_PARALLEL_FOR_RANGE_ROUNDING_PARAMS` | `MinFactorX:GoodFactor:MinRangeX` | `MinFactorX`: The minimum range that the rounded range should be a multiple of (Default 16) |
| | | `GoodFactor`: The preferred range that the rounded range be a multiple of (Default 32) |
| | | `MinRangeX`: The minimum X dimension of the range such that range rounding is activated (Default 1024) |


## Controlling DPC++ Level Zero Plugin

| Environment variable | Values | Description |
Expand Down Expand Up @@ -290,3 +304,4 @@ variables in production code.</span>
`(*) Note: Any means this environment variable is effective when set to any non-null value.`

[xpti]: https://github.com/intel/llvm/blob/sycl/xptifw/doc/XPTI_Framework.md
[range-rounding]: https://github.com/intel/llvm/blob/sycl/doc/design/ParallelForRangeRounding.md
46 changes: 46 additions & 0 deletions sycl/doc/design/ParallelForRangeRounding.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,46 @@
# Parallel For Range Rounding

Kernels to be executed using a `sycl::range`, and not a `sycl::nd_range`,
may have their execution space reconfigured by the SYCL runtime. This is done
since oddly shaped execution dimensions can hinder performance, especially when
executing kernels on GPUs. It is worth noting that although the
`sycl::parallel_for` using a `sycl::range` does not expose the concept of a
`work_group` to the user, behind the scenes all GPU APIs require a work group
configuration when dispatching kernels. In this case the work group
configuration is provided by the implementation and not the user.

As an example, imagine a SYCL kernel is dispatched with 1d range `{7727}`. Since
7727 is a prime number, there is no way to divide this kernel up into workgroups
of any size other than 1. Therefore 7727 workgroups are dispatched, each with
size 1. Because of the parallel nature of execution on modern GPUs, this
results in low occupancy, since we are not using all of the available work items
that execute in lockstep in each (implicit) subgroup. This can hinder
performance.

To mitigate the performance hit of choosing an awkward implicit workgroup size,
for each kernel using a `sycl::range`, the SYCL runtime will generate two
kernels:

1. The original kernel without any modifications.
2. The "Range rounded" kernel, which checks the global index of each work item
at the beginning of execution, exiting early for a work item if the global
index exceeds the user provided execution range. If the original kernel has
the signature `foo`, then this kernel will have a signature akin to
`_ZTSN4sycl3_V16detail19__pf_kernel_wrapperI3fooEE`.

In this way, if a range rounded kernel is executed at runtime, a kernel
dispatched with the range `{7727}` may be executed by `{7808}` work items,
where work items `{7727..7807}` all exit early before doing any work. This would
give much better performance on a GPU platform since the implementation can use
the implicit `nd_range` `{7808, 32}`, which corresponds to a workgroup size of
32, instead of `{7727, 1}`, which corresponds to a workgroup size of 1.

The parallel for range rounding will only be used in the X (outermost)
dimension of a `sycl::range`, since if the inner dimensions are changed by the
SYCL runtime this can change the stride offset of different dimensions. Range
rounding will only be used if the SYCL runtime X dimension exceeds some minimum
value, which can be configured using the
`SYCL_PARALLEL_FOR_RANGE_ROUNDING_PARAMS` environment variable.

Generation of range rounded kernels can be disabled by using the compiler flag
`-fsycl-disable-range-rounding`.
1 change: 1 addition & 0 deletions sycl/doc/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,7 @@ Design Documents for the oneAPI DPC++ Compiler
design/Assert
design/SharedLibraries
design/OptionalDeviceFeatures
design/ParallelForRangeRounding
design/SYCLInstrumentationUsingXPTI
design/ITTAnnotations
design/DeviceGlobal
Expand Down